2024 update

2024-08-15 19:43:21 +03:00 · 2024-08-15 19:43:21 +03:00 · bbec50480e
commit bbec50480e
parent 0125e50b02
3 changed files with 9 additions and 481 deletions
--- a/README.org
+++ b/README.org
@ -6,13 +6,19 @@ The project itself is broken into four seperate milestones where each one adds a
 **** Milestone 1: Creating a Basic Datapath (30 Points)
 The goal of the first milestone is to establish basic CPU functionality and correctness for simple programs. To facilite this four NOP instructions will be inserted in between each regular instruction so that the CPU does not have to deal with any kind of hazards or forwarding. To get full credit on milestone 1 all "basic" tests must run with NOPs inserted.

-**** Milestone 2: Completeing the Datapath (30 Points)
+**** Milestone 2: Completing the Datapath (30 Points)
 Milestone 2 is a simple evolution of milestone 1. To get full credit on milestone 2 all of the RISCV32I instructions should be added to the design, and the full battery of tests in "basic" and "programs" should successfully execute when NOPs are turned on.

-**** Milestone 3: Pipelining the CPU (30 Points)
+**** Milestone 3: Pipelining the CPU (Basic Datapath) (30 Points) 
 For the 3rd milestone NOPs between every instruction are disabled, and the support for pipelining is added so as to increase CPU performance. Getting this working requires adding support to handle RAW Hazards, Control Hazards, and delay after load. To get full credit for milestone 3 NOPs must be turned off, and all tests in "basic" and "programs" must run successfully.

-**** Milestone 4: Further Performance Improvments: (100 Points)
+To get full credit on milestone 3 all "basic" tests must run with NOPs inserted.
+
+**** Milestone 4: Pipelining the CPU (Completing the Datapath) (30 Points)
+
+To get full credit on milestone 3 all "programs" tests must run with NOPs inserted.
+
+**** CPU project final delivery: (70 Points)
 The forth and final milstone represents the culmination of all previous work into a single deisgn while aiming to further increase performance. For the final part of this project students are required to add additional hardware of their own choosing to the CPU to try and make it even faster than what was done in milestone 3. Student have the freedom to choose their own component to work on, though we have several optional suggestions as well. To get full credit on milestone 4 a simple version of at least one of the following hardware upgrades must be implemented, this will need to be explained verbally to a TA during a lab session. Additionally all tests in "basic" and "programs" must run successfully. Finally, your code must be uploaded to the course website where it will be evaluated.

 - Branch Prediction
--- a/theory1.org
+++ b/theory1.org
@ -1,157 +0,0 @@
-* Theory questions EX1
-  
-  Keep in mind that your design and your implementation are separate entities,
-  thus you should answer these questions based on your ideal design, not your
-  finished implementation. Consequently I will not consider your implementation
-  when grading these questions, thus even with no implementation at all you
-  should still be able to score 100% on the theory questions.
-  
-  All questions can be answered in a few sentences. Remember that brevity is wit,
-  and also the key to getting a good score.
-  You should easily be able to fit your entire answer on a single screen.
-
-** Question 1
-   *2 points.*
-*** Part 1
-**** Part 1½
-    *½ points.*
-    When decoding the BNE branch instruction in the above assembly program
-    #+begin_src asm
-    bne x6, x2, "loop",
-    #+end_src
-    
-    In your design, what is the value of each of the control signals below?
-     
-    + regWrite
-    + memRead
-    + memWrite
-    + branch
-    + jump
-
-**** Part 1¼
-    *½ points.*
-    When decoding the LW instruction in the above assembly program
-    #+begin_src asm
-    jal x1, 0x10(x1)
-    #+end_src
-    
-    In your design, what is the value of each of the control signals below?
-     
-    + regWrite
-    + memRead
-    + memWrite
-    + branch
-    + jump
-      
-    Keep in mind that your design and your implementation are separate entities, thus
-    you should answer this question based on your ideal design, not your finished 
-    implementation.
-   
-*** Part 2
-   During execution, at some arbitrary cycle the control signals are:
-
-   + regWrite = 1
-   + memRead  = 0
-   + memWrite = 0
-   + branch   = 0
-   + jump     = 1
-   
-   In your design, which intruction(s) could be executing?
-   
-
-   Keep in mind that your design and your implementation are separate entities, thus
-   you should answer this question based on your ideal design, not your finished 
-   implementation.
-   
-** Question 2
-   *4 points.*
-   Partial credit awarded on a case by case basis.
-
-   Reading the binary of a RISC-V program you get the following:
-
-   #+begin_src text
-   0x0:  0x00a00293   --   0000 0000 1010 0000 0000 0010 1001 0011
-   0x4:  0x01400313   --   0000 0001 0100 0000 0000 0011 0001 0011
-   0x8:  0xfff30313   --   1111 1111 1111 0011 0000 0011 0001 0011
-   0xc:  0x00628463   --   0000 0000 0110 0010 1000 0100 0110 0011
-   0x10: 0xff9ff06f   --   1111 1111 1001 1111 1111 0000 0110 1111
-   #+end_src
-
-   For each instruction describe the format and name and corresponding RISC-V source
-   To give you an idea on how decoding would work, here is the decoding of 0x40635293:
-
-   #+begin_src text
-   0x40635293   --   0100 0000 0110 0011 0101 0010 1001 0011
-
-   Opcode: 0010011 => Format is IType, thus the funct3 field is used to decode further
-   funct3: 101     => Instruction is of type SRAI, the instruction looks like ~srai rd, rx, imm~
-
-   rs1:    00110   => x6
-   rd:     00101   => x5
-   shamt:  000110  => 6
-   
-   Resulting in ~srai x5, x6, 6~
-   #+end_src
-   
-   *Your answer should be in the form of a simple asm program.*
-   + hint 1: 
-     the original asm program had a label, you need to infer where that label was
-
-   + hint 2: 
-     Verify your conclusion by assembling your answer.
-     To do this, make an asm program, place it with the rest of the tests and set
-     ~printBinary~ to ~true~ in ~singleTestOptions~ in ~Manifest.scala~ which will
-     print the full binary of your program.
-     As long as your program generates the same binary as the supplied your program
-     is correct.
-     
-
-** Question 3
-   *4 points.*
-   Partial credit awarded on a case by case basis.
-
-   In order to load a large number LUI and ADDI are used.
-   consider the following program
-   #+begin_src asm
-   li x0, 0xFF
-   li x1, 0x600
-   li x2, 0x8EE
-   li x3, 0xBABEFACE
-   li x4, 0xBABE07CE
-   #+end_src
-   
-   a) Which of these instructions will be split into ADDI LUI pairs?
-   b) Explain in 3 sentences or less *how* the two last ops are handled differently and *why*.
-   
-   + hint 1: 
-     The parser and assembler in the test suite can help you answer the first part of
-     this question (a).
-     Create an asm file, put it with the rest of the tests and run it, setting the correct
-     test options in ~singleTestOptions~ defined in ~Manifest.scala~ and observe the output.
-     
-   + hint 2:
-     While it's probably easier to solve this problem using the internet, however you 
-     can also figure out what is happening by browsing the assembler source code which
-     will hopefully give you a deeper insight into what is going on here.
-     
-     Look at ~Parser.scala~, specifically what happens when an ~li~ instruction is parsed.
-     When parsing an instruction the parser first attempts to apply the 
-     ~singleInstruction~ rule, however this only succeeds if the immediate value
-     obeys certain restrictions (~nBits <= 12~), if not it fails.
-     
-     If the ~singleInstruction~ rule fails the parser then attempts to apply the
-     ~multipleInstructions~ rule instead which expands operations into a list of real ops.
-     When this happens the resulting operations are defined as the following:
-     #+begin_src scala
-     stringWs("li") ~> (reg <~ sep, (hex | int).map(_.splitHiLo(20))).mapN{ case(rd, (hi, lo)) => {
-       List(
-       ArithImm.add(rd, rd, lo),
-       LUI(rd, if(lo > 0) hi else hi+1),
-     )}}.map(_.widen[Op]),
-     #+end_src
-     This is quite a lot to unpack, but you can focus on the line where the ~LUI~ is constructed.
-     ~hi~ and ~lo~ are the results of ~splitHiLo~ which splits a 32 bit word into a 12 bit and a
-     20 bit.
-     Try this for yourself on paper; what happens when ~lo~ ends up being a negative number?
-     What is the interplay between incrementing ~hi~ with 1 and adding a ~lo~ that is represented
-     as a negative value?
--- a/theory2.org
+++ b/theory2.org
@ -1,321 +0,0 @@
-* Question 0 - Testing hazards
-  This question is mandatory, but rewards no points (not directly at least).
-
-  The tests found in the testing framework are useful for testing a fully working processor, however it
-  leaves much to be desired for when you actually want to design one from whole cloth.
-
-  To rectify this, you should write some tests of your own that should serve as a minimal case for various
-  hazards that you will encounter. You do not need to deliver anything here, but I expect you to have
-  these tests if you ask me for help debugging your design during lab hours.
-  (You can of course come to lab hours if you're having trouble writing these tests)
-
-
-** Forwarding
-   The tests in forward1.s and forward2.s are automatically generated, long, and non-specific,
-   thus not very suited for debugging.
-
-   You should write one (or more) test(s) that systematically expose your processor to dependency
-   hazards, including instructions that:
-   + Needs forwarding from MEM and WB (i.e dependencies with NOPs between them).
-   + Exposes results that should *not* be forwarded due to regWrite being false.
-   + Writes and reads to/from the zero register.
-
-
-** Load freezes
-   Loads freezes are tricky since they have an interaction with the forwarding unit, often causing
-   bugs that appear with low frequency in the supplied test programs.
-
-   You should write tests (I suggest one test per case) that systematically expose your processor to
-   dependency hazards where one or more of the dependencies are memory accesses, including instructions that:
-   + Needs forwarding from MEM and WB where MEM, WB or both are load instructions.
-   + Exposes false dependencies from MEM and WB where one or more are loads.
-     For instance, consider ~addi x1, x1, 0x10~ in machine code with the rs2 field highlighted:
-     0x00a08093 = 0b00000000 | 10100 | 0001000000010010011
-     In this case there is a false dependency on x20 since x20 is only an artefact of the immediate
-     value which could cause an unecessary freeze.
-   + Writes and reads to/from the zero register, which could trigger an unecessary freeze
-   + Instructions that causes multiple freezes in a row.
-   + Instructions that causes multiple freezes in a row followed by an instruction with multiple
-     dependencies.
-
-
-** Control hazards
-   There are a lot of possible interactions when jumping and branching, you need to write tests
-   that ensures that instructions are properly bubbled if they shouldn't have been fetched.
-   You should also test for interactions between forwarding and freezing here, i.e what happens
-   when the address calculation relies on forwarded values? What happens if the forwarded value
-   comes from a load instruction necessitating a freeze?
-
-
-* TODO Question 1 - Hazards
-  For each of the following programs, describe each hazard with type (data or control), line number and a
-  small (max one sentence) description.
-  
-** program 1
-  #+begin_src asm
-1    addi t0,   zero,  10
-2    addi t1,   zero,  20
-  L2:
-3    sub  t1,   t1,    t0
-4    beq  t1,   zero, .L2
-5    jr   ra
-  #+end_src
-
-** program 2
-  #+begin_src asm
-1    addi t0,   zero,  10
-2    lw   t0,   10(t0)
-3    beq  t0,   zero,  .L3
-4    jr   ra
-  #+end_src
-  
-** program 3
-  #+begin_src asm
-1  lw   t0,   0(t0)
-2  lw   t1,   4(t0)
-3  sw   t0,   8(t1)
-4  lw   t1,   12(t0)
-5  beq  t0,   t1,  .L3
-6  jr   ra
-  #+end_src
-
-* Question 2 - Handling hazards
-  For this question, keep in mind that the forwarder does not care if the values it forwards are being used or not!
-  Even for a JAL instructions which has neither an rs1 or rs2 field, the forwarder must still forward its values.
-
-** Data hazards 1
-   At some cycle the following instructions can be found in a 5 stage design:
-
-   #+begin_src text
-   EX:                  ||     MEM:                ||      WB:
-   ---------------------||-------------------------||--------------------------
-   rs1: 4               ||     rs1: 4              ||      rs1: 1
-   rs2: 5               ||     rs2: 6              ||      rs2: 2
-   rd:  6               ||     rd:  4              ||      rd:  5
-   regWrite = true      ||     regWrite = false    ||      regWrite = true
-   memWrite = false     ||     memWrite = false    ||      memWrite = false
-   branch   = false     ||     branch   = true     ||      branch   = false
-   jump     = false     ||     jump     = false    ||      jump     = false
-   #+end_src
-
-   For the operation currently in EX, from where (ID, MEM or WB) should the forwarder get data from for rs1 and rs2?
-   Answer should be on the form:
-
-   rs1: Narnia
-   rs2: Wikipedia
-
-** Data hazards 2
-
-   At some cycle the following instructions can be found in a 5 stage design:
-
-   #+begin_src text
-   EX:                  ||     MEM:                ||      WB:
-   ---------------------||-------------------------||--------------------------
-   rs1: 1               ||     rs1: 4              ||      rs1: 1
-   rs2: 5               ||     rs2: 6              ||      rs2: 0
-   rd:  0               ||     rd:  1              ||      rd:  0
-   memToReg = false     ||     memToReg = false    ||      memToReg = false
-   regWrite = true      ||     regWrite = true     ||      regWrite = true
-   memWrite = false     ||     memWrite = false    ||      memWrite = false
-   branch   = false     ||     branch   = true     ||      branch   = false
-   jump     = true      ||     jump     = true     ||      jump     = false
-   #+end_src
-
-   For the operation currently in EX, from where (ID, MEM or WB) should the forwarder get data from for rs1 and rs2?
-   Answer should be on the form:
-
-   rs1: Random noise
-   rs2: WB (MEM if it's a tuesday)
-
-** Data hazards 3
-
-   At some cycle the following instructions can be found in a 5 stage design:
-
-   #+begin_src text
-   EX:                  ||     MEM:                ||      WB:
-   ---------------------||-------------------------||--------------------------
-   rs1: 2               ||     rs1: 4              ||      rs1: 3
-   rs2: 5               ||     rs2: 6              ||      rs2: 4
-   rd:  1               ||     rd:  1              ||      rd:  5
-   memToReg = false     ||     memToReg = true     ||      memToReg = false
-   regWrite = false     ||     regWrite = true     ||      regWrite = true
-   memWrite = true      ||     memWrite = false    ||      memWrite = false
-   branch   = false     ||     branch   = false    ||      branch   = false
-   jump     = false     ||     jump     = false    ||      jump     = false
-   #+end_src
-
-   Should the forwarding unit issue a load hazard signal? *This is a yes/no question*
-   (Hint: what are the semantics of the instruction currently in EX stage?)
-
-* Question 3 - Branch prediction
-  Consider a 2 bit branch predictor with only 4 slots for a 32 bit architecture (without BTB), where the decision to
-  take a branch or not is decided in accordance to the following table:
-  #+begin_src text
-  state  ||  predict taken  ||  next state if taken  ||  next state if not taken ||
-  =======||=================||=======================||==========================||
-  00     ||  NO             ||  01                   ||  00                      ||
-  01     ||  NO             ||  11                   ||  00                      ||
-  10     ||  YES            ||  11                   ||  00                      ||
-  11     ||  YES            ||  11                   ||  10                      ||
-  #+end_src
-
-  Which corresponds to this figure:
-  #+CAPTION: FSM of a 2 bit branch predictor. Note that it is not a 2bit saturating counter.
-  [[./Images/BranchPredictor.png]]
-
-  At some point during execution the program counter is ~0xc~ and the branch predictor table looks like this:
-  #+begin_src text
-  slot  ||  value
-  ======||========
-  00    ||  01
-  01    ||  00
-  10    ||  01
-  11    ||  10
-  #+end_src
-
-  For the following program:
-  #+begin_src asm
-  .L1:
-  0x0C addi x1, x1, 1
-  0x10 add  x2, x2, x1
-  0x14 bge  x2, x3, .L1
-  0x18 j    .L2
-  .L2:
-  0x1C addi x2, x2, 0x10
-  0x20 slli x2, 0x4
-  0x24 jr   ra
-  #+end_src
-
-  At cycle 0 the state of the machine is as following:
-  #+begin_src text
-  PC = 0x0C
-  x1 = 0x0
-  x2 = 0x0
-  x3 = 0x7
-  #+end_src
-
-  At which cycle will the PC be 0x24 given a 2 cycle delay for mispredicts?
-
-* Question 4 - Benchmarking a branch profiler
-  In order to gauge the performance increase from adding branch predictors it is necessary to do some testing.
-  Rather than writing a test from scratch it is better to use the tester already in use in the test harness.
-  When running a program the VM outputs a log of all events, including which branches have been taken and which
-  haven't, which as it turns out is the only information we actually need to gauge the effectiveness of a branch
-  predictor!
-
-  For this exercise you will write a program that parses a log of branch events.
-
-  #+BEGIN_SRC scala
-  sealed trait BranchEvent
-  case class Taken(from: Int, to: Int) extends BranchEvent
-  case class NotTaken(at: Int) extends BranchEvent
-
-
-  def profile(events: List[BranchEvent]): Int = ???
-  #+END_SRC
-
-  To help you get started, I have provided you with much of the necessary code.
-  In order to get an idea for how you should profile branch misses, consider the following profiler which calculates
-  misses for a processor with a branch predictor with a 1 bit predictor with infinite slots:
-
-  #+BEGIN_SRC scala
-  def OneBitInfiniteSlots(events: List[BranchEvent]): Int = {
-
-    // Helper inspects the next element of the event list. If the event is a mispredict the prediction table is updated
-    // to reflect this.
-    // As long as there are remaining events the helper calls itself recursively on the remainder
-    def helper(events: List[BranchEvent], predictionTable: Map[Int, Boolean]): Int = {
-      events match {
-
-	// Scala syntax for matching a list with a head element of some type and a tail
-	// `case h :: t =>`
-	// means we want to match a list with at least a head and a tail (tail can be Nil, so we
-	// essentially want to match a list with at least one element)
-	// h is the first element of the list, t is the remainder (which can be Nil, aka empty)
-
-	// `case Constructor(arg1, arg2) :: t => `
-	// means we want to match a list whose first element is of type Constructor, giving us access to its internal
-	// values.
-
-	// `case Constructor(arg1, arg2) :: t => if(p(arg1, arg2))`
-	// means we want to match a list whose first element is of type Constructor while satisfying some predicate p,
-	// called an if guard.
-	case Taken(from, to) :: t if( predictionTable(from)) => helper(t, predictionTable)
-	case Taken(from, to) :: t if(!predictionTable(from)) => 1 + helper(t, predictionTable.updated(from, true))
-	case NotTaken(addr)  :: t if( predictionTable(addr)) => 1 + helper(t, predictionTable.updated(addr, false))
-	case NotTaken(addr)  :: t if(!predictionTable(addr)) => helper(t, predictionTable)
-	case _ => 0
-      }
-    }
-
-    // Initially every possible branch is set to false since the initial state of the predictor is to assume branch not taken
-    def initState = events.map{
-      case Taken(addr)    => (addr, false)
-      case NotTaken(addr) => (addr, false)
-    }.toMap
-
-    helper(events, initState)
-  }
-  #+END_SRC
-
-** Your task
-   Your job is to implement a test that checks how many misses occur for a 2 bit branch predictor with 8 slots.
-   The rule table is the same as in question 3.
-   The predictor does not use a branch target buffer (BTB), which means that the address will always be decoded in
-   the ID stage.
-   For you this means you do not need to keep track of branch targets, simplifying your simulation quite a bit.
-   (If not you would need to add logic for when BTB value does not match actual value)
-
-   For simplicity's sake, assume that every value in the table is initialized to 00.
-
-   For this task it is necessary to use something more sophisticated than ~Map[(Int, Boolean)]~ to represent
-   your branch predictor model.
-
-   The skeleton code is located in ~testRunner.scala~ and can be run using testOnly FiveStage.ProfileBranching.
-
-   With a 2 bit 8 slot scheme, how many mispredicts will happen?
-   Answer with a number.
-
-   Hint: Use the getTag method defined on int (in DataTypes.scala) to get the tag for an address.
-   #+BEGIN_SRC scala
-   val slots = 8
-   say(0x1C40.getTag(slots)) // prints 0
-   say(0x1C44.getTag(slots)) // prints 1
-   say(0x1C48.getTag(slots)) // prints 2
-   say(0x1C4C.getTag(slots)) // prints 3
-   say(0x1C50.getTag(slots)) // prints 4
-   say(0x1C54.getTag(slots)) // prints 5
-   say(0x1C58.getTag(slots)) // prints 6
-   say(0x1C5C.getTag(slots)) // prints 7
-   say(0x1C60.getTag(slots)) // prints 0 (thus conflicts with 0x1C40)
-   #+END_SRC
-
-
-* Question 5 - Cache profiling
-  Unlike our design which has a very limited memory pool, real designs have access to vast amounts of memory, offset
-  by a steep cost in access latency.
-  To amend this a modern processor features several caches where even the smallest fastest cache has more memory than
-  your entire design.
-  In order to investigate how caches can alter performance it is therefore necessary to make some rather
-  unrealistic assumptions to see how different cache schemes impacts performance.
-
-  For this exercise you will write a program that parses a log of memory events, similar to previous task
-  #+BEGIN_SRC scala
-  sealed trait MemoryEvent
-  case class Write(addr: Int) extends MemoryEvent
-  case class Read(addr: Int) extends MemoryEvent
-
-
-  def profile(events: List[MemoryEvent]): Int = ???
-  #+END_SRC
-
-** TODO Your task
-   Your job is to implement a *parameterised* model that tests how many delay cycles will occur for a cache with
-   the following configuration:
-   + Follows an n-way associative scheme (parameter)
-   + Is write-through write allocate.
-   + Eviction policy is LRU (least recently used)
-
-   To make this task easier a data structure with stub methods has been implemented for you.
-   
-   Answer by pasting the output from running the branchProfiler test.