TDT4255/theory2.org

* Question 0 - Testing hazards
  This question is mandatory, but rewards no points (not directly at least).

  The tests found in the testing framework are useful for testing a fully working processor, however it
  leaves much to be desired for when you actually want to design one from whole cloth.

  To rectify this, you should write some tests of your own that should serve as a minimal case for various
  hazards that you will encounter. You do not need to deliver anything here, but I expect you to have
  these tests if you ask me for help debugging your design during lab hours.
  (You can of course come to lab hours if you're having trouble writing these tests)


** Forwarding
   The tests in forward1.s and forward2.s are automatically generated, long, and non-specific,
   thus not very suited for debugging.

   You should write one (or more) test(s) that systematically expose your processor to dependency
   hazards, including instructions that:
   + Needs forwarding from MEM and WB (i.e dependencies with NOPs between them).
   + Exposes results that should *not* be forwarded due to regWrite being false.
   + Writes and reads to/from the zero register.


** Load freezes
   Loads freezes are tricky since they have an interaction with the forwarding unit, often causing
   bugs that appear with low frequency in the supplied test programs.

   You should write tests (I suggest one test per case) that systematically expose your processor to
   dependency hazards where one or more of the dependencies are memory accesses, including instructions that:
   + Needs forwarding from MEM and WB where MEM, WB or both are load instructions.
   + Exposes false dependencies from MEM and WB where one or more are loads.
     For instance, consider ~addi x1, x1, 0x10~ in machine code with the rs2 field highlighted:
     0x00a08093 = 0b00000000 | 10100 | 0001000000010010011
     In this case there is a false dependency on x20 since x20 is only an artefact of the immediate
     value which could cause an unecessary freeze.
   + Writes and reads to/from the zero register, which could trigger an unecessary freeze
   + Instructions that causes multiple freezes in a row.
   + Instructions that causes multiple freezes in a row followed by an instruction with multiple
     dependencies.


** Control hazards
   There are a lot of possible interactions when jumping and branching, you need to write tests
   that ensures that instructions are properly bubbled if they shouldn't have been fetched.
   You should also test for interactions between forwarding and freezing here, i.e what happens
   when the address calculation relies on forwarded values? What happens if the forwarded value
   comes from a load instruction necessitating a freeze?


* TODO Question 1 - Hazards
  Write programs here that are less of a crapshoot. Clarify dependency vs hazards etc etc and
  *enforce* a format that is easy to grade.


* Question 2 - Handling hazards
  For this question, keep in mind that the forwarder does not care if the values it forwards are being used or not!
  Even for a JAL instructions which has neither an rs1 or rs2 field, the forwarder must still forward its values.

** Data hazards 1
   At some cycle the following instructions can be found in a 5 stage design:

   #+begin_src text
   EX:                  ||     MEM:                ||      WB:
   ---------------------||-------------------------||--------------------------
   rs1: 4               ||     rs1: 4              ||      rs1: 1
   rs2: 5               ||     rs2: 6              ||      rs2: 2
   rd:  6               ||     rd:  4              ||      rd:  5
   regWrite = true      ||     regWrite = false    ||      regWrite = true
   memWrite = false     ||     memWrite = false    ||      memWrite = false
   branch   = false     ||     branch   = true     ||      branch   = false
   jump     = false     ||     jump     = false    ||      jump     = false
   #+end_src

   For the operation currently in EX, from where (ID, MEM or WB) should the forwarder get data from for rs1 and rs2?
   Answer should be on the form:

   rs1: Narnia
   rs2: Wikipedia

** Data hazards 2

   At some cycle the following instructions can be found in a 5 stage design:

   #+begin_src text
   EX:                  ||     MEM:                ||      WB:
   ---------------------||-------------------------||--------------------------
   rs1: 1               ||     rs1: 4              ||      rs1: 1
   rs2: 5               ||     rs2: 6              ||      rs2: 0
   rd:  0               ||     rd:  1              ||      rd:  0
   memToReg = false     ||     memToReg = false    ||      memToReg = false
   regWrite = true      ||     regWrite = true     ||      regWrite = true
   memWrite = false     ||     memWrite = false    ||      memWrite = false
   branch   = false     ||     branch   = true     ||      branch   = false
   jump     = true      ||     jump     = true     ||      jump     = false
   #+end_src

   For the operation currently in EX, from where (ID, MEM or WB) should the forwarder get data from for rs1 and rs2?
   Answer should be on the form:

   rs1: Random noise
   rs2: WB (MEM if it's a tuesday)

** Data hazards 3

   At some cycle the following instructions can be found in a 5 stage design:

   #+begin_src text
   EX:                  ||     MEM:                ||      WB:
   ---------------------||-------------------------||--------------------------
   rs1: 2               ||     rs1: 4              ||      rs1: 3
   rs2: 5               ||     rs2: 6              ||      rs2: 4
   rd:  1               ||     rd:  1              ||      rd:  5
   memToReg = false     ||     memToReg = true     ||      memToReg = false
   regWrite = false     ||     regWrite = true     ||      regWrite = true
   memWrite = true      ||     memWrite = false    ||      memWrite = false
   branch   = false     ||     branch   = false    ||      branch   = false
   jump     = false     ||     jump     = false    ||      jump     = false
   #+end_src

   Should the forwarding unit issue a load hazard signal? *This is a yes/no question*
   (Hint: what are the semantics of the instruction currently in EX stage?)

* Question 3 - Branch prediction
  Consider a 2 bit branch predictor with only 4 slots for a 32 bit architecture (without BTB), where the decision to
  take a branch or not is decided in accordance to the following table:
  #+begin_src text
  state  ||  predict taken  ||  next state if taken  ||  next state if not taken ||
  =======||=================||=======================||==========================||
  00     ||  NO             ||  01                   ||  00                      ||
  01     ||  NO             ||  11                   ||  00                      ||
  10     ||  YES            ||  11                   ||  00                      ||
  11     ||  YES            ||  11                   ||  10                      ||
  #+end_src

  Which corresponds to this figure:
  #+CAPTION: FSM of a 2 bit branch predictor. Note that it is not a 2bit saturating counter.
  [[./Images/BranchPredictor.png]]

  At some point during execution the program counter is ~0xc~ and the branch predictor table looks like this:
  #+begin_src text
  slot  ||  value
  ======||========
  00    ||  01
  01    ||  00
  10    ||  01
  11    ||  10
  #+end_src

  For the following program:
  #+begin_src asm
  .L1:
  0x0C addi x1, x1, 1
  0x10 add  x2, x2, x1
  0x14 bge  x2, x3, .L1
  0x18 j    .L2
  .L3:
  0x1C addi x2, x2, 0x10
  0x20 slli x2, 0x4
  0x24 jr   ra
  #+end_src

  At cycle 0 the state of the machine is as following:
  #+begin_src text
  PC = 0x0C
  x1 = 0x0
  x2 = 0x0
  x3 = 0x7
  #+end_src

  At which cycle will the PC be 0x24 given a 2 cycle delay for mispredicts?

* Question 4 - Benchmarking a branch profiler
  In order to gauge the performance increase from adding branch predictors it is necessary to do some testing.
  Rather than writing a test from scratch it is better to use the tester already in use in the test harness.
  When running a program the VM outputs a log of all events, including which branches have been taken and which
  haven't, which as it turns out is the only information we actually need to gauge the effectiveness of a branch
  predictor!

  For this exercise you will write a program that parses a log of branch events.

  #+BEGIN_SRC scala
  sealed trait BranchEvent
  case class Taken(from: Int, to: Int) extends BranchEvent
  case class NotTaken(at: Int) extends BranchEvent


  def profile(events: List[BranchEvent]): Int = ???
  #+END_SRC

  To help you get started, I have provided you with much of the necessary code.
  In order to get an idea for how you should profile branch misses, consider the following profiler which calculates
  misses for a processor with a branch predictor with a 1 bit predictor with infinite slots:

  #+BEGIN_SRC scala
  def OneBitInfiniteSlots(events: List[BranchEvent]): Int = {

    // Helper inspects the next element of the event list. If the event is a mispredict the prediction table is updated
    // to reflect this.
    // As long as there are remaining events the helper calls itself recursively on the remainder
    def helper(events: List[BranchEvent], predictionTable: Map[Int, Boolean]): Int = {
      events match {

	// Scala syntax for matching a list with a head element of some type and a tail
	// `case h :: t =>`
	// means we want to match a list with at least a head and a tail (tail can be Nil, so we
	// essentially want to match a list with at least one element)
	// h is the first element of the list, t is the remainder (which can be Nil, aka empty)

	// `case Constructor(arg1, arg2) :: t => `
	// means we want to match a list whose first element is of type Constructor, giving us access to its internal
	// values.

	// `case Constructor(arg1, arg2) :: t => if(p(arg1, arg2))`
	// means we want to match a list whose first element is of type Constructor while satisfying some predicate p,
	// called an if guard.
	case Taken(from, to) :: t if( predictionTable(from)) => helper(t, predictionTable)
	case Taken(from, to) :: t if(!predictionTable(from)) => 1 + helper(t, predictionTable.updated(from, true))
	case NotTaken(addr)  :: t if( predictionTable(addr)) => 1 + helper(t, predictionTable.updated(addr, false))
	case NotTaken(addr)  :: t if(!predictionTable(addr)) => helper(t, predictionTable)
	case _ => 0
      }
    }

    // Initially every possible branch is set to false since the initial state of the predictor is to assume branch not taken
    def initState = events.map{
      case Taken(addr)    => (addr, false)
      case NotTaken(addr) => (addr, false)
    }.toMap

    helper(events, initState)
  }
  #+END_SRC

** Your task
   Your job is to implement a test that checks how many misses occur for a 2 bit branch predictor with 8 slots.
   The rule table is the same as in question 3.
   The predictor does not use a branch target buffer (BTB), which means that the address will always be decoded in
   the ID stage.
   For you this means you do not need to keep track of branch targets, simplifying your simulation quite a bit.
   (If not you would need to add logic for when BTB value does not match actual value)

   For simplicity's sake, assume that every value in the table is initialized to 00.

   For this task it is necessary to use something more sophisticated than ~Map[(Int, Boolean)]~ to represent
   your branch predictor model.

   The skeleton code is located in ~testRunner.scala~ and can be run using testOnly FiveStage.ProfileBranching.

   With a 2 bit 8 slot scheme, how many mispredicts will happen?
   Answer with a number.

   Hint: Use the getTag method defined on int (in DataTypes.scala) to get the tag for an address.
   #+BEGIN_SRC scala
   val slots = 8
   say(0x1C40.getTag(slots)) // prints 0
   say(0x1C44.getTag(slots)) // prints 1
   say(0x1C48.getTag(slots)) // prints 2
   say(0x1C4C.getTag(slots)) // prints 3
   say(0x1C50.getTag(slots)) // prints 4
   say(0x1C54.getTag(slots)) // prints 5
   say(0x1C58.getTag(slots)) // prints 6
   say(0x1C5C.getTag(slots)) // prints 7
   say(0x1C60.getTag(slots)) // prints 0 (thus conflicts with 0x1C40)
   #+END_SRC


* Question 5 - Cache profiling
  Unlike our design which has a very limited memory pool, real designs have access to vast amounts of memory, offset
  by a steep cost in access latency.
  To amend this a modern processor features several caches where even the smallest fastest cache has more memory than
  your entire design.
  In order to investigate how caches can alter performance it is therefore necessary to make some rather
  unrealistic assumptions to see how different cache schemes impacts performance.

  For this exercise you will write a program that parses a log of memory events, similar to previous task
  #+BEGIN_SRC scala
  sealed trait MemoryEvent
  case class Write(addr: Int) extends MemoryEvent
  case class Read(addr: Int) extends MemoryEvent


  def profile(events: List[MemoryEvent]): Int = ???
  #+END_SRC

** TODO Your task
   Your job is to implement a *parameterised* model that tests how many delay cycles will occur for a cache with
   the following configuration:
   + Follows an n-way associative scheme (parameter)
   + Is write-through write allocate.
   + Eviction policy is LRU (least recently used)

   To make this task easier a data structure with stub methods has been implemented for you.

   Answer by pasting the output from running the branchProfiler test.