* Question 1 - Hazards
For the following programs, describe each hazard with its type (data or control), the line number, and a short (max one sentence) description.

** Program 1
#+begin_src asm
addi t0, zero, 10
addi t1, zero, 20
.L2:
sub t1, t1, t0
beq t1, zero, .L2
jr ra
#+end_src

** Program 2
#+begin_src asm
addi t0, zero, 10
lw t0, 10(t0)
beq t0, zero, .L3
jr ra
#+end_src

** Program 3
#+begin_src asm
lw t0, 0(t0)
lw t1, 4(t0)
sw t0, 8(t1)
lw t1, 12(t0)
beq t0, t1, .L3
jr ra
#+end_src

* Question 2 - Handling hazards
For this question, keep in mind that the forwarder does not care whether the values it forwards are actually used!
Even for a JAL instruction, which has neither an rs1 nor an rs2 field, the forwarder must still forward its values.

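As a refresher, the textbook forwarding rule can be sketched as follows. The field names (rs1, rs2, rd, regWrite) match the tables below; the priority of MEM over WB and the x0 check are the standard rule, while all other names are illustrative, not taken from the skeleton code:

#+BEGIN_SRC scala
// A sketch of the standard forwarding priority, not necessarily your design's exact code.
sealed trait ForwardSource
case object FromID  extends ForwardSource // no hazard: use the value read from the register file
case object FromMEM extends ForwardSource // forward the result currently in the MEM stage
case object FromWB  extends ForwardSource // forward the result currently in the WB stage

case class StageCtrl(rd: Int, regWrite: Boolean)

def forward(rs: Int, mem: StageCtrl, wb: StageCtrl): ForwardSource =
  if      (mem.regWrite && mem.rd != 0 && mem.rd == rs) FromMEM // youngest result wins
  else if (wb.regWrite  && wb.rd  != 0 && wb.rd  == rs) FromWB
  else FromID
#+END_SRC
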
** Data hazards 1
At some cycle the following instructions can be found in a 5-stage design:

#+begin_src text
EX:                  || MEM:                    || WB:
---------------------||-------------------------||--------------------------
rs1: 4               || rs1: 4                  || rs1: 1
rs2: 5               || rs2: 6                  || rs2: 2
rd:  6               || rd:  4                  || rd:  5
memToReg = false     || memToReg = false        || memToReg = false
regWrite = true      || regWrite = false        || regWrite = true
memWrite = false     || memWrite = false        || memWrite = false
branch = false       || branch = true           || branch = false
jump = false         || jump = false            || jump = false
#+end_src

For the operation currently in EX, from where (ID, MEM, or WB) should the forwarder get the data for rs1 and rs2?

** Data hazards 2
At some cycle the following instructions can be found in a 5-stage design:

#+begin_src text
EX:                  || MEM:                    || WB:
---------------------||-------------------------||--------------------------
rs1: 1               || rs1: 4                  || rs1: 1
rs2: 5               || rs2: 6                  || rs2: 0
rd:  0               || rd:  1                  || rd:  0
memToReg = false     || memToReg = false        || memToReg = false
regWrite = true      || regWrite = true         || regWrite = true
memWrite = false     || memWrite = false        || memWrite = false
branch = false       || branch = true           || branch = false
jump = true          || jump = true             || jump = false
#+end_src

For the operation currently in EX, from where (ID, MEM, or WB) should the forwarder get the data for rs1 and rs2?

** Data hazards 3
At some cycle the following instructions can be found in a 5-stage design:

#+begin_src text
EX:                  || MEM:                    || WB:
---------------------||-------------------------||--------------------------
rs1: 2               || rs1: 4                  || rs1: 3
rs2: 5               || rs2: 6                  || rs2: 4
rd:  1               || rd:  1                  || rd:  5
memToReg = false     || memToReg = true         || memToReg = false
regWrite = false     || regWrite = true         || regWrite = true
memWrite = true      || memWrite = false        || memWrite = false
branch = false       || branch = false          || branch = false
jump = false         || jump = false            || jump = false
#+end_src

Should the forwarding unit issue a load hazard signal?
(Hint: what are the semantics of the instruction currently in the EX stage?)

* Question 3 - Branch prediction
Consider a 2-bit branch predictor with only 4 slots for a 32-bit architecture (without a BTB), where the decision to
take a branch or not is made according to the following table:

#+begin_src text
state || predict taken || next state if taken || next state if not taken ||
======||===============||=====================||=========================||
00    || NO            || 01                  || 00                      ||
01    || NO            || 10                  || 00                      ||
10    || YES           || 11                  || 01                      ||
11    || YES           || 11                  || 10                      ||
#+end_src

(This is known as a saturating 2-bit counter; it is *not* the same scheme as in the lecture slides.)

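Expressed as code, the table above is just a saturating counter (a minimal sketch; encoding the states 00-11 as the integers 0-3 is my own choice, not something from the skeleton):

#+BEGIN_SRC scala
// The saturating 2-bit counter from the table above, with states 00..11
// encoded as the integers 0 to 3 (the encoding is an assumption of this sketch).
def predictTaken(state: Int): Boolean = state >= 2 // states 10 and 11 predict taken

def nextState(state: Int, taken: Boolean): Int =
  if (taken) (state + 1) min 3 // saturate at 11
  else       (state - 1) max 0 // saturate at 00
#+END_SRC
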
At some point during execution the program counter is ~0xc~ and the branch predictor table looks like this:

#+begin_src text
slot || value
=====||=======
00   || 01
01   || 00
10   || 11
11   || 01
#+end_src

For the following program:

#+begin_src asm
0xc  addi x1, x3, 10
0x10 add  x2, x1, x1
0x14 beq  x1, x2, .L1
0x18 j    .L2
#+end_src

Will the predictor predict taken or not taken for the ~beq~ instruction?

* Question 4 - Benchmarking
In order to gauge the performance increase from adding branch predictors it is necessary to do some testing.
Rather than writing a test from scratch, it is better to use the tester already present in the test harness.
When running a program the VM outputs a log of all events, including which branches have been taken and which
haven't, which, as it turns out, is the only information we actually need to gauge the effectiveness of a
branch predictor!

For this exercise you will write a program that parses a log of branch events.

#+BEGIN_SRC scala
sealed trait BranchEvent
case class Taken(from: Int, to: Int) extends BranchEvent
case class NotTaken(at: Int) extends BranchEvent

def profile(events: List[BranchEvent]): Int = ???
#+END_SRC

To help you get started, I have provided you with much of the necessary code.
To get an idea of how you should profile branch misses, consider the following profiler, which counts the
misses for a 1-bit branch predictor with an infinite number of slots:

#+BEGIN_SRC scala
def OneBitInfiniteSlots(events: List[BranchEvent]): Int = {

  // The helper inspects the next element of the event list. If the event is a mispredict,
  // the prediction table is updated to reflect this.
  // As long as there are remaining events the helper calls itself recursively on the remainder.
  def helper(events: List[BranchEvent], predictionTable: Map[Int, Boolean]): Int = {
    events match {

      // Scala syntax for matching a list with a head element of some type and a tail:
      // `case h :: t =>`
      // means we want to match a list with at least a head and a tail (the tail can be Nil, so we
      // essentially match any list with at least one element).
      // h is the first element of the list, t is the remainder (which can be Nil, aka empty).

      // `case Constructor(arg1, arg2) :: t =>`
      // means we want to match a list whose first element is of type Constructor, giving us access
      // to its internal values.

      // `case Constructor(arg1, arg2) :: t if (p(arg1, arg2)) =>`
      // means we want to match a list whose first element is of type Constructor while also
      // satisfying some predicate p, called an if guard.
      case Taken(from, to) :: t if( predictionTable(from)) => helper(t, predictionTable)
      case Taken(from, to) :: t if(!predictionTable(from)) => 1 + helper(t, predictionTable.updated(from, true))
      case NotTaken(addr)  :: t if( predictionTable(addr)) => 1 + helper(t, predictionTable.updated(addr, false))
      case NotTaken(addr)  :: t if(!predictionTable(addr)) => helper(t, predictionTable)
      case _ => 0
    }
  }

  // Initially every possible branch is set to false, since the initial state of the predictor
  // is to assume branch not taken.
  def initState = events.map{
    case Taken(from, _) => (from, false)
    case NotTaken(addr) => (addr, false)
  }.toMap

  helper(events, initState)
}
#+END_SRC

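For instance (a usage sketch; the addresses are made up):

#+BEGIN_SRC scala
val demo = List(Taken(0x14, 0x40), NotTaken(0x14))
OneBitInfiniteSlots(demo) // == 2: the first event flips the 0x14 entry to taken, the second flips it back
#+END_SRC
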
** Your task
Your job is to implement a test that checks how many misses occur for a 2-bit branch predictor with 8 slots.
The rule table is the same as in question 3.
The predictor does not use a branch target buffer (BTB), which means that the branch address will always be
decoded in the ID stage.
For you this means that you do not need to keep track of branch targets, simplifying your simulation quite a bit.
(Otherwise you would need to add logic for the case where the BTB value does not match the actual target.)

For simplicity's sake, assume that every value in the table is initialized to 00.

For this task it is necessary to use something more sophisticated than a ~Map[Int, Boolean]~ to represent
your branch predictor model.

The skeleton code is located in ~testRunner.scala~ and can be run using ~testOnly FiveStage.ProfileBranching~.

With a 2-bit 8-slot scheme, how many mispredicts will happen?
Answer with a number.

Hint: Use the ~getTag~ method defined on ~Int~ (in ~DataTypes.scala~) to get the tag for an address.
#+BEGIN_SRC scala
val slots = 8
say(0x1C40.getTag(slots)) // prints 0
say(0x1C44.getTag(slots)) // prints 1
say(0x1C48.getTag(slots)) // prints 2
say(0x1C4C.getTag(slots)) // prints 3
say(0x1C50.getTag(slots)) // prints 4
say(0x1C54.getTag(slots)) // prints 5
say(0x1C58.getTag(slots)) // prints 6
say(0x1C5C.getTag(slots)) // prints 7
say(0x1C60.getTag(slots)) // prints 0 (thus conflicts with 0x1C40)
#+END_SRC

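To connect the pieces, one possible shape for the predictor model is sketched below. It reuses the ~predictTaken~/~nextState~ transitions sketched under question 3 and assumes ~getTag~ behaves as in the example above; ~step~ and the other names are mine, not the skeleton's:

#+BEGIN_SRC scala
// One possible shape for the 8-slot predictor state (a sketch, not the solution).
// Each slot holds a 2-bit counter encoded as 0..3, all initialized to 00.
val initialTable: Map[Int, Int] = Map.empty[Int, Int].withDefaultValue(0)

// One simulation step (illustrative): look up the slot for the branch address,
// compare the prediction against what actually happened, and store the updated
// counter back. Returns the new table and whether this event was a mispredict.
def step(table: Map[Int, Int], addr: Int, taken: Boolean): (Map[Int, Int], Boolean) = {
  val slot       = addr.getTag(slots)           // assumes getTag as shown above
  val state      = table(slot)
  val mispredict = predictTaken(state) != taken // transitions from question 3
  (table.updated(slot, nextState(state, taken)), mispredict)
}
#+END_SRC
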
* Question 5 - Cache profiling
Unlike our design, which has a very limited memory pool, real designs have access to vast amounts of memory,
offset by a steep cost in access latency.
To mitigate this, a modern processor features several caches, where even the smallest and fastest cache has
more memory than your entire design.
In order to investigate how different cache schemes impact performance it is therefore necessary to make some
rather unrealistic assumptions.

We will therefore assume the following:
+ Reads from main memory take 5 cycles
+ The cache has a total storage of 8 words (256 bits)
+ Cache reads work as they do now (i.e. no additional latency)

For this exercise you will write a program that parses a log of memory events, similar to the previous task:
#+BEGIN_SRC scala
sealed trait MemoryEvent
case class Write(addr: Int) extends MemoryEvent
case class Read(addr: Int) extends MemoryEvent

def profile(events: List[MemoryEvent]): Int = ???
#+END_SRC

** Your task
Your job is to implement a model that tests how many delay cycles will occur for a cache which:
+ Follows a 2-way associative scheme
+ Has a set size of 4 words (128 bits) (total cache size: a whopping 256 bits)
+ Has a block size of 1 word (32 bits), meaning that we *do not need a block offset*
+ Is write-through, write no-allocate (this means that you can ignore stores; only loads will affect the cache)
+ Uses LRU (least recently used) as its eviction policy

In the typical cache each block holds more than 32 bits, requiring a block offset; the simulated cache, however,
does not.
This means that the simulated cache has two sets of 4 words, greatly reducing the complexity of your
implementation.

Additionally, assume that writes do not change the LRU counter.
This means that your cache will only consider which value was most recently loaded, not written.
It's not realistic, but it allows you to completely disregard write events (you can just filter them out if
you want), as in the sketch below.

Your answer should be the number of cache miss latency cycles when using this cache.

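Here is one way the LRU bookkeeping could look. This is a sketch under assumptions: I am guessing that an address selects one of the two sets by bit 2 (the lowest word-address bit), and ~profileSketch~, ~setIndex~ and the other names are mine, not the skeleton's:

#+BEGIN_SRC scala
// A sketch only: two sets of four one-word blocks with LRU eviction.
// Only Read events touch the cache; Write events are filtered out, as allowed above.
val missLatency = 5 // cycles per read from main memory (from the assumptions)
val setSize     = 4 // words per set

// Assumption: bit 2 of the byte address (the lowest word-address bit) picks the set.
def setIndex(addr: Int): Int = (addr >> 2) & 1

def profileSketch(events: List[MemoryEvent]): Int = {
  // Each set is a list of addresses ordered from most to least recently used.
  def helper(reads: List[Int], sets: Map[Int, List[Int]], misses: Int): Int = reads match {
    case addr :: rest =>
      val idx = setIndex(addr)
      val set = sets(idx)
      if (set.contains(addr)) // hit: move the word to the front (most recently used)
        helper(rest, sets.updated(idx, addr :: set.filterNot(_ == addr)), misses)
      else                    // miss: insert at the front, dropping the least recently used word
        helper(rest, sets.updated(idx, (addr :: set).take(setSize)), misses + 1)
    case Nil => misses * missLatency
  }
  val reads = events.collect { case Read(addr) => addr }
  helper(reads, Map[Int, List[Int]](0 -> Nil, 1 -> Nil), 0)
}
#+END_SRC
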
*** Further study
If you have the time, I strongly encourage you to experiment with a larger cache with bigger block sizes,
forcing you to implement the additional complexity of block offsets.
Likewise, by trying a different scheme than write-through no-allocate you will get a much better grasp of how
exactly the cache works.
This is *not* a deliverable, just something I encourage you to tinker with to get a better understanding.