* Exercise description
  The task in this exercise is to implement a 5-stage pipelined processor for
  the [[./instructions.org][RISCV32I instruction set]].
  
  For exercise 1 you will build a 5-stage processor which handles one instruction
  at a time, whereas in exercise 2 your design will handle multiple instructions
  at a time.
  In exercise 1 this is achieved by inserting 4 NOP instructions between consecutive
  source instructions, so that only one real instruction is in the pipeline at a time.
  This enables us to use the same tests and harness for both exercise 1 and 2.
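
  To make the padding concrete, here is a minimal sketch in plain Scala of what it
  amounts to. The object ~NopPadding~, the helper ~padWithNops~ and the use of strings
  for instructions are assumptions made for illustration, the test harness may well do
  this differently. As a simplification the sketch also pads after the last instruction.

  #+BEGIN_SRC scala
  // Hypothetical helper, not part of the skeleton.
  object NopPadding {
    // Follows every source instruction with `n` NOPs.
    def padWithNops(program: List[String], n: Int = 4): List[String] =
      program.flatMap(instruction => instruction :: List.fill(n)("nop"))
  }

  // Two source instructions become two groups of (instruction + 4 NOPs).
  val padded = NopPadding.padWithNops(List("addi x1, x0, 1", "addi x2, x1, 2"))
  #+END_SRC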
  
  Once you are done with exercise 1, you can up the difficulty by setting nopPad
  to false and start reading the [[exercise2.org][ex2 guide]].

  In the project skeleton files ([[./src/main/scala/][Found here]]) you can see that a lot of code has
  already been provided, which can make it difficult to get started.
  Hopefully this document can help clear up at least some of the confusion.
  First an overview of what you are designing is presented, followed by a walk-through
  for getting the most basic instructions to work.
  
  In order to orient yourself you first need a map, so a high-level overview of the
  processor you're going to design is shown in the figure below.
  Keep in mind that this is just a high-level sketch, omitting many details as well
  as entire features (for instance branch logic).

  *Important*
  When you are done, use the provided ~./deliver.sh~ script to pack up the archive.
  If you're unable to run bash scripts, please ensure that you deliver a *zip* archive
  named after your username. Nothing more, nothing less, just your username.
  Not .rar or anything else: just use zip, because my grading script knows how to handle
  that in addition to the format used by deliver.sh.
  This archive should be runnable as is, so you need to include all the necessary files.
  (I may or may not diff the tests to check if you're screwing with them.)

  #+CAPTION: A very high level processor schematic. Registers, Instruction and data memory are already implemented.
  [[./Images/FiveStage.png]]
  
  Now that you have an idea of what you're building it is time to take inventory of
  the files included in the skeleton, and what, if anything should be added.

  + [[./src/main/scala/Tile.scala]]
    This is the top level module for the system as a whole. This is where the test
    harness accesses your design, providing the necessary IO.
    *You should not modify this module for other purposes than debugging.*

  + [[./src/main/scala/CPU.scala]]
    This is the top level module for your processor.
    In this module the various stages and barriers that make up your processor
    should be declared and wired together.
    Some of these modules have already been declared in order to wire up the
    debugging logic for your test harness.
    This file corresponds to the high-level overview in its entirety.
    *This module is intended to be further fleshed out by you.*
    As you work with this module, try keeping logic to a minimum to help readability.
    If you end up with a lot of signal select logic, consider moving that to a separate
    module.
    
  + [[./src/main/scala/IF.scala]]
    This is the instruction fetch stage.
    In this stage instruction fetching should happen, meaning you will have to
    add logic for handling branches, jumps, and for exercise 2, stalls.
    The reason this module is already included is that it contains the instruction
    memory, described next, which is heavily coupled to the testing harness.
    *This module is intended to be further fleshed out by you.*
    
  + [[./src/main/scala/IMem.scala]]
    This module contains the instruction memory for your processor.
    When testing, the test harness loads your program into the instruction memory,
    freeing you from the hassle.
    *You should not modify this module for other purposes than maaaaybe debugging.*

  + [[./src/main/scala/ID.scala]]
    The instruction decode stage.
    The reason this module is included is that the registers reside here, thus
    for the test harness to work it must be wired up to the register unit to
    record its state updates.
    *This module is intended to be further fleshed out by you.*
    
  + [[./src/main/scala/Registers.scala]]
    Contains the registers for your processor. Note that the zero register is already
    disabled, you do not need to do this yourself.
    The test harness ensures that all register updates are recorded.
    *You should not modify this module for other purposes than maaaaybe debugging.*
    
  + [[./src/main/scala/MEM.scala]]
    Like ID and IF, the MEM skeleton module is included so that the test harness
    can set up and monitor the data memory.
    *This module is intended to be further fleshed out by you.*

  + [[./src/main/scala/DMem.scala]]
    Like the registers and Imem, the DMem is already implemented.
    *You should not modify this module for other purposes than maaaaybe debugging.*
    
  + [[./src/main/scala/Const.scala]]
    Contains helpful constants for decoding, used by the decoder which is provided.
    *This module may be fleshed out further by you if you so choose.*

  + [[./src/main/scala/Decoder.scala]]
    The decoder shows how to conveniently demux the instruction.
    In the provided ID.scala file a decoder module has already been instantiated.
    You should flesh it out further.
    You may find it useful to alter this module, especially in exercise 2.
    *This module should be further fleshed out by you.*

  + [[./src/main/scala/ToplevelSignals.scala]]
    Contains helpful constants. 
    You should add your own constants here when you find the need for them.
    You are not required to use it at all, but it is very helpful.
    *This module can be further fleshed out by you.*
    
  + [[./src/main/scala/SetupSignals.scala]]
    You should obviously not modify this file.
    You may choose to create a similar file for debug signals, modeled on how
    the test harness is built.
    *You should not modify this module at all.*
  

** Tests
   In addition to the skeleton files it's useful to take a look at how the tests work.
   You will not need to alter anything here other than the [[./src/test/scala/Manifest.scala][test manifest]], but some
   of these settings can be quite useful to alter.
   The main attraction is the test options. By altering the verbosity settings you
   may change what is output.
   The settings are:

   + printIfSuccessful
     Enables logging on tests that succeed.
     You typically want this turned off, at least for the full test runner.

   + printErrors
     Enables logging of errors. You obviously want this one on, at least on the single
     test.

   + printParsedProgram
     Prints the desugared program. Useful when the test asm contains instructions that
     need to be expanded or altered.
     Unsure what "bnez" means? Turn this setting on and see!
     
   + printVMtrace
     Enables printing of the VM trace, showing how the ideal machine executes a test.

   + printVMfinal
     Enables printing of the final VM state, showing how the registers look after
     completion. Useful if you want to see what a program returns.

   + printMergedTrace
     Enables printing of a merged trace. With this option enabled you get to see how
     the VM and your processor executed the program side by side.
     This setting is extremely helpful to track down where your program goes wrong!
     This option attempts to synchronize the execution traces as best as it can, however
     once your processor design derails this becomes impossible, leading to rather
     nonsensical output.
     Instructions that were executed only by the VM or only by your design are colored
     red or blue.
     
     *IF YOU ARE COLOR BLIND YOU SHOULD ALTER THE DISPLAY COLORS!*
     
   + nopPadded
     Set this to false when you're ready to enter the big-boy league.

   + breakPoints
     Not implemented. It's there as a teaser, urging you to implement it so I don't have to.


** Getting started
   In order to make a correct design in a somewhat expedient fashion you need to be
   *methodical!* 
   
   This means you should have a good idea of how your processor should work *before*
   you start writing code. While Chisel is more pleasant to work with than other HDLs,
   the [[https://i.imgur.com/6IpVNA7.jpg][bricoleur]] approach is not recommended.
   
   My recommended approach is therefore to create an RTL sketch of your processor design.
   Start with an overall sketch showing all the components, then drill down.
   In your sketch you will eventually add a box for registers, IMEM and DMEM, which
   should make it clear how the already finished modules fit into the grander design,
   making the skeleton-code less mysterious.
   
   To give you an idea of what a drill-down looks like, here is my sketch of the ID stage:
   #+CAPTION: Instruction decode stage, showing the various signals.
   [[./Images/IDstage.png]]
   
   I would generally advise doing these on paper, but don't half-ass them.


** Adding numbers
   In order to get started designing your processor the following steps guide you to
   implementing the necessary functionality for adding two integers.

   Info is progressively being omitted in the later steps in order to not bog you down
   in repeated details. After all, brevity is +the soul of+ wit.
   
*** Step 0
    In order to verify that the project is set up properly, open sbt in your project root
    by typing ~./sbt.sh~ (or simply ~sbt~ if you already use Scala).
    sbt, which stands for scala build tool, will provide you with a REPL where you can
    compile and test your code.
   
    The initial run will take quite a while to boot as all the necessary stuff is downloaded.

**** Step ¼:
     In your console, type ~compile~ to verify that everything compiles correctly.

**** Step ½:
     In your console, type ~test~ to verify that the tests run, and that chisel can correctly
     build your design.
     This command will unleash the full battery of tests on you.

**** Step ¾:
     In your console, type ~testOnly FiveStage.SingleTest~ to run only the tests that you
     have defined in the [[./src/test/scala/Manifest.scala][test manifest]] (currently set to ~forward2.s~).

     As you will first implement addition you should change this to the [[./src/test/resources/tests/basic/immediate/addi.s][add immediate test]].
     Luckily you do not have to deal with file paths, simply changing ~forward2.s~ to
     ~addi.s~ suffices.

     Ensure that the addi test is run by repeating the ~testOnly FiveStage.SingleTest~
     command.
   
*** Step 1:
    In order to execute instructions your processor must be able to fetch them.
    In [[./src/main/scala/IF.scala]] you can see that the IMEM module is already set to fetch
    the current program counter address (line 41). However, since the current PC is stuck
    at 0 it will fetch the same instruction over and over. Rectify this by uncommenting
    ~// PC := PC + 4.U~ at line 48.
    You can now verify that your design fetches new instructions each cycle by running
    the test as in the previous step.
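
    If you are unsure what that line does in hardware terms, the following stand-alone
    sketch shows the idea in isolation. The module name ~ProgramCounterSketch~ is made up
    and the skeleton's IF module contains more than this, so treat it as an illustration
    rather than a drop-in replacement.

    #+BEGIN_SRC scala
    import chisel3._

    // Stand-alone illustration, not the skeleton's IF module.
    class ProgramCounterSketch extends Module {
      val io = IO(new Bundle {
        val PC = Output(UInt(32.W))
      })

      // The register holds 0 after reset.
      val PC = RegInit(0.U(32.W))

      // Each RV32I instruction is 4 bytes wide, so the next address is PC + 4.
      // Without this assignment the register keeps its reset value forever.
      PC := PC + 4.U

      io.PC := PC
    }
    #+END_SRC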

*** Step 2:
    Next, the instruction must be forwarded to the ID stage, so you will need to add the
    instruction to the io interface of the IF module as an output signal.
    In [[./src/main/scala/IF.scala]] at line 21 you can see how the program counter is already
    defined as an output. 
    You should do the same with the instruction signal.


*** Step 3:
    As you defined the instruction as an output for your IF module, declare it as an input
    in your ID module ([[./src/main/scala/ID.scala]] line 21).

    Next you need to ensure that the registers and decoder get the relevant data from the
    instruction.

    This is made more convenient by the fact that ~Instruction~ is a class, allowing you
    to access methods defined on it.
    Keep in mind that it is only a class during compile and build time, it will be 
    indistinguishable from a regular ~UInt(32.W)~ in your finished circuit.
    The methods can be accessed like this:
    #+BEGIN_SRC scala
    // Drive funct6 of myModule with the 26th to 31st bit of instruction
    myModule.io.funct6 := io.instruction.funct6
    #+END_SRC
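
    To demystify how a class can carry such methods, here is a minimal sketch of what a
    bundle like ~Instruction~ could look like, together with a decode-like module taking
    it as an input. The names ~InstructionSketch~, ~DecodeSketch~ and the port names are
    made up for illustration, so consult the skeleton for the real definitions. The bit
    ranges are the standard RISC-V register fields.

    #+BEGIN_SRC scala
    import chisel3._

    // Hypothetical stand-in for the skeleton's Instruction class.
    class InstructionSketch extends Bundle {
      val instruction = UInt(32.W)

      // Plain Scala methods: each one simply names a bit range of the 32-bit word.
      def registerRs1: UInt = instruction(19, 15)
      def registerRs2: UInt = instruction(24, 20)
      def registerRd: UInt  = instruction(11, 7)
    }

    // Hypothetical decode-like module, only showing how the methods are used.
    class DecodeSketch extends Module {
      val io = IO(new Bundle {
        val instruction  = Input(new InstructionSketch)
        val readAddress1 = Output(UInt(5.W))
        val readAddress2 = Output(UInt(5.W))
        val writeAddress = Output(UInt(5.W))
      })

      // In the generated circuit these are just wires tapping into the instruction.
      io.readAddress1 := io.instruction.registerRs1
      io.readAddress2 := io.instruction.registerRs2
      io.writeAddress := io.instruction.registerRd
    }
    #+END_SRC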

*** Step 4:
    Your IF should now have an instruction as an OUTPUT, and your ID as an INPUT, however
    they are not connected. This must be done in the CPU class where both the ID and IF are
    instantiated.
    In the overview sketch you probably noticed the barriers between IF and ID.
    In accordance with the overview, it is incorrect to directly connect the two modules,
    instead you must connect them using a *barrier*.
    A barrier is responsible for keeping a value between cycles, facilitating pipelining.
    There is however one complicating matter: It takes a cycle to get the instruction from the
    instruction memory, thus we don't want to delay it in the barrier!
    
    In order to make code readable I suggest adding a new file for your barriers, containing
    four different modules for the barriers your design will need.

    Start with implementing your IF barrier module, which should contain the following:
    + An input and output for PC where the output is delayed by a single cycle.
    + An input and output for instruction where the output is wired directly to the input with
      no delay.
      
    The sketch for your barrier looks like this
    #+CAPTION: The barrier between IF and ID. Note the passthrough for the instruction
    [[./Images/IFID.png]]
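
    Translated to Chisel, the two bullet points above could be sketched roughly like
    this. The module and port names are assumptions, and plain ~UInt(32.W)~ is used for
    the instruction to keep the sketch self-contained.

    #+BEGIN_SRC scala
    import chisel3._

    // Hypothetical IF/ID barrier, names chosen for this sketch only.
    class IFBarrierSketch extends Module {
      val io = IO(new Bundle {
        val PCIn           = Input(UInt(32.W))
        val PCOut          = Output(UInt(32.W))
        val instructionIn  = Input(UInt(32.W))
        val instructionOut = Output(UInt(32.W))
      })

      // The PC goes through a register, so the output lags the input by one cycle.
      io.PCOut := RegNext(io.PCIn)

      // The instruction memory already takes a cycle to respond,
      // so the instruction is wired straight through without a register.
      io.instructionOut := io.instructionIn
    }
    #+END_SRC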

**** Step 4½:
     You can now verify that the correct control signals are produced. Using printf, ensure
     that:
     + The program counter is increasing in increments of 4
     + The instruction in ID is as expected
     + The decoder output is as expected
     + The correct operands are fetched from the registers

     Keep in mind that printf might not always be cycle accurate, the point is to ensure that
     your processor design at least does something! In general it is better to use debug signals
     and println, but for quick and dirty debugging printf is passable.
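
     A minimal sketch of such a printf, assuming a module with ~PC~ and ~instruction~
     inputs (the module name and ports are made up, put the printf wherever the signals
     you care about are visible):

     #+BEGIN_SRC scala
     import chisel3._

     // Hypothetical module that only exists to show the printf syntax.
     class PrintfSketch extends Module {
       val io = IO(new Bundle {
         val PC          = Input(UInt(32.W))
         val instruction = Input(UInt(32.W))
       })

       // Chisel's printf is evaluated on every simulated clock cycle.
       // %d prints a decimal value, %x a hexadecimal one.
       printf("PC: %d, instruction: 0x%x\n", io.PC, io.instruction)
     }
     #+END_SRC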

*** Step 5:
    You will now have to create the EX stage. Use the structure of the IF and ID modules to
    guide you here.
    In your EX stage you should have an ALU, preferably in its own module a la registers in ID.
    While the ALU is hugely complex, it's very easy to describe in hardware description languages!
    Using the same approach as in the decoder should be sufficient:

    #+BEGIN_SRC scala
    val ALUopMap = Array(
      ADD    -> (io.op1 + io.op2),
      SUB    -> (io.op1 - io.op2),
      ...
      )

    // MuxLookup API: https://github.com/freechipsproject/chisel3/wiki/Muxes-and-Input-Selection#muxlookup
    io.aluResult := MuxLookup(io.aluOp, 0.U(32.W), ALUopMap)
    #+END_SRC
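
    For reference, a self-contained module built around the same idea could look roughly
    as follows. The object ~ALUOpsSketch~ and its encoding are assumptions made for this
    sketch, your design should use the constants from the skeleton instead. The
    three-argument ~MuxLookup~ form matches the snippet above, newer Chisel releases use
    a curried form.

    #+BEGIN_SRC scala
    import chisel3._
    import chisel3.util.MuxLookup

    // Hypothetical operation encoding, only valid for this sketch.
    object ALUOpsSketch {
      val ADD = 0.U(4.W)
      val SUB = 1.U(4.W)
    }

    class ALUSketch extends Module {
      val io = IO(new Bundle {
        val op1       = Input(UInt(32.W))
        val op2       = Input(UInt(32.W))
        val aluOp     = Input(UInt(4.W))
        val aluResult = Output(UInt(32.W))
      })

      import ALUOpsSketch._

      // Every result is computed in parallel, the lookup merely selects one of them.
      val ALUopMap = Seq(
        ADD -> (io.op1 + io.op2),
        SUB -> (io.op1 - io.op2)
      )

      // Unknown operations fall back to 0.
      io.aluResult := MuxLookup(io.aluOp, 0.U(32.W), ALUopMap)
    }
    #+END_SRC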
    
    As with the ID stage, you will need a barrier between ID and EX stage.
    In this case, as the overview sketch indicates, all values should be delayed one cycle.
    
    When you have finished the barrier, instantiate it and wire ID and EX together with the barrier in the 
    same fashion as IF and ID.
    You don't need to add every single signal for your barrier, rather you should add them as they
    become needed.

*** Step 6:
    Your MEM stage does very little when an ADDI instruction is executed, so implementing it should 
    be easy. All you have to do is forward signals.
    
    From the overview sketch you can see that the same trick used in the IF/ID barrier is utilized
    here, bypassing the data memory read value since it is already delayed by a cycle.

*** Step 7:
    You now need to actually write the result back to your register bank. 
    This should be handled at the CPU level.
    If you sketched your processor already you probably made sure to keep track of the control 
    signals for the instruction currently in WB, so writing to the correct register address should
    be easy for you ;)
    
    If you ended up driving the register write address with the instruction from IF you should take
    a moment to reflect on why that was the wrong choice.
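
    One way to picture this, assuming each of the three barriers after ID delays its
    control signals by one cycle: the write address is extracted in ID and then travels
    alongside its instruction until it reaches WB. The module and signal names below are
    hypothetical, in your design the registers live inside your barrier modules.

    #+BEGIN_SRC scala
    import chisel3._

    // Hypothetical illustration of a control signal following its instruction.
    class WriteAddressPipelineSketch extends Module {
      val io = IO(new Bundle {
        val rdInID = Input(UInt(5.W))  // write address decoded in ID
        val rdInWB = Output(UInt(5.W)) // the same address, three cycles later
      })

      val rdInEX  = RegNext(io.rdInID) // ID/EX barrier
      val rdInMEM = RegNext(rdInEX)    // EX/MEM barrier
      io.rdInWB  := RegNext(rdInMEM)   // MEM/WB barrier
    }
    #+END_SRC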

*** Step 8:
    Ensure that the simplest addi test works, and give yourself a pat on the back!
    You've just found the corner pieces of the puzzle, so filling in the rest is "simply" being methodical.

* Delivery
  Once you are done, simply run the ~./deliver.sh~ script to get an archive.