TDT4255/exercise2.org

* Exercise 2
  Safety wheels are now officially off.
  To verify this, set nopPadded to false in Manifest.scala and observe as all hell
  breaks loose.

  Let's break down what's going wrong and what we can do about it.

** RAW Hazards
   Consider the following program:
   #+begin_src asm
   main:
     add x1, x1, x2
     add x1, x1, x1
     add x1, x1, x2
   #+end_src

   In your implementation this will give you wrong results since the results
   of the first add operation will not be available in the registers before
   x1 is fetched for the second and third operation.

   Your first task should therefore be to implement a forwarding unit

   The forwarding unit is located in the EX stage and is responsible for selecting
   the ALU input from three possible sources.
   These sources are:
   + The value received from the register bank
   + The ALU result currently in MEM
   + The writeback result

   The forwarder prioritizes as follows:
   + If the input register address is not the destination in either MEM or WB, select the
     register.
   + If the input register address is the destination register in WB, but not in MEM, select
     the writeback signal.
   + If the input register address is the destination register for the operation currently
     in MEM, select that operation.

   There is a special case you need take into account, namely load operations.
   Considering the following program:
   #+begin_src asm
   main:
     lw x1, 0(x2)
     add x1, x1, x1
     add x1, x1, x1
   #+end_src

   When the second operation (~add x1, x1, x1~) is in EX, the third clause in the forwarder
   is triggered, however the result is not yet ready since fetching memory costs a cycle.
   In order to fix this the forwarder must issue a signal that freezes the pipeline.
   This is done by issuing a signal to the barrier registers, telling them to _not_ update
   their contents, essentially repeating the last instruction.

   There are many subtleties to consider here.
   For instance: What should happen to the instruction currently
   in MEM? If it too is repeated the hazard detector will trigger next cycle, effectively
   deadlocking your processor.
   What about when the write address and read address are similar in the ID stage?

   Designing a forwarder can take very long, or it can be done very quickly, all depending
   on how *methodical* you are. My advice is to design the algorithm first, then when you're
   satisfied implement it on hardware.


** Control hazards

   Consider the following code

   #+begin_src asm
   main:
     beq zero, zero, target
     add x1, x1, x1
     add x1, x1, x1
     j main
   target:
     sub x1, x2, x2
     sub x2, x2, x2
   #+end_src

   Depending on your design the two add instructions will be fetched before the beq jump happens.
   Whenever a branch happens it is necessary to flush the spurious instructions that were fetched
   before the branch was noticed.
   However, simply waiting until the branch has been decided is not acceptable since that guarantees
   lost cycles.
   In your first iteration, simply assume branches will not be taken, and if they are taken issue
   a warning to the barriers that hold spurious instructions telling them to render them impotent
   by setting the control signals to do nothing.

* Putting the pedal to the metal
  Once you have a design that RAW and control hazards you're ready to up the ante and add some
  improvements to your design.
  Some suggestions:

** Branch predictor
   Instead of assuming branch not taken, use a branch predictor. There are many different schemes,
   but I advice you to stick to a simple one, such as 1-bit or 2-bit.

** Fast branch handling
   Certaing branches like BEQ and BNE can be calculated very quickly, wheras size comparison branches
   (BGE, BLE and friends) take longer. It is therefore feasible to do these checks in the ID stage.

** Adding a data cache
   Unless you have already done the two suggested improvements, do not attempt to create a cache.
   The first thing you need to do is to add a latency for memory fetching, if not you will have
   nothing to improve upon.

   If you still insist, start with the BTreeManyO3.s program and analyze the memory access pattern.
   What sort of eviction policy and cache size would you choose for this pattern?
   You should try writing additional benchmarking programs, it is important to have something measurable,
   and the current programs are not made to test cache performance!!