Question 1 - Hazards

For each of the following programs, describe every hazard with its type (data or control), the line number, and a short (max one sentence) description.

program 1

  addi t0,   zero,  10
  addi t1,   zero,  20
.L2:
  sub  t1,   t1,    t0
  beq  t1,   zero, .L2
  jr   ra

program 2

  addi t0,   zero,  10
  lw   t0,   10(t0)
  beq  t0,   zero,  .L3
  jr   ra

program 3

  lw   t0,   0(t0)
  lw   t1,   4(t0)
  sw   t0,   8(t1)
  lw   t1,   12(t0)
  beq  t0,   t1,  .L3
  jr   ra

Question 2 - Handling hazards

For this question, keep in mind that the forwarder does not care whether the values it forwards are actually used! Even for a JAL instruction, which has neither an rs1 nor an rs2 field, the forwarder must still forward values.
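
The following sketch is not part of the exercise code; it restates the standard forwarding priority in Scala, with all names (ForwardSource, forwardSelect, memRd, and so on) invented for illustration:

// Where the EX stage should take the value of one source register from.
sealed trait ForwardSource
case object FromID  extends ForwardSource // the value read from the register file
case object FromMEM extends ForwardSource
case object FromWB  extends ForwardSource

// MEM wins over WB because it holds the younger instruction's result;
// x0 is excluded since it is hardwired to zero.
def forwardSelect(
    rs: Int,
    memRd: Int, memRegWrite: Boolean,
    wbRd: Int,  wbRegWrite: Boolean
): ForwardSource =
  if      (memRegWrite && memRd != 0 && memRd == rs) FromMEM
  else if (wbRegWrite  && wbRd  != 0 && wbRd  == rs) FromWB
  else                                               FromID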

Data hazards 1

At some cycle, the following instructions can be found in a 5-stage design:

EX:                  ||     MEM:                ||      WB:
---------------------||-------------------------||--------------------------
rs1: 4               ||     rs1: 4              ||      rs1: 1
rs2: 5               ||     rs2: 6              ||      rs2: 2
rd:  6               ||     rd:  4              ||      rd:  5
regWrite = true      ||     regWrite = false    ||      regWrite = true
memWrite = false     ||     memWrite = false    ||      memWrite = false
branch   = false     ||     branch   = true     ||      branch   = false
jump     = false     ||     jump     = false    ||      jump     = false

For the operation currently in EX, where (ID, MEM, or WB) should the forwarder get the data for rs1 and rs2 from?

Data hazards 2

At some cycle, the following instructions can be found in a 5-stage design:

EX:                  ||     MEM:                ||      WB:
---------------------||-------------------------||--------------------------
rs1: 1               ||     rs1: 4              ||      rs1: 1
rs2: 5               ||     rs2: 6              ||      rs2: 0
rd:  0               ||     rd:  1              ||      rd:  0
memToReg = false     ||     memToReg = false    ||      memToReg = false
regWrite = true      ||     regWrite = true     ||      regWrite = true
memWrite = false     ||     memWrite = false    ||      memWrite = false
branch   = false     ||     branch   = true     ||      branch   = false
jump     = true      ||     jump     = true     ||      jump     = false

For the operation currently in EX, where (ID, MEM, or WB) should the forwarder get the data for rs1 and rs2 from?

Data hazards 3

At some cycle, the following instructions can be found in a 5-stage design:

EX:                  ||     MEM:                ||      WB:
---------------------||-------------------------||--------------------------
rs1: 2               ||     rs1: 4              ||      rs1: 3
rs2: 5               ||     rs2: 6              ||      rs2: 4
rd:  1               ||     rd:  1              ||      rd:  5
memToReg = false     ||     memToReg = true     ||      memToReg = false
regWrite = false     ||     regWrite = true     ||      regWrite = true
memWrite = true      ||     memWrite = false    ||      memWrite = false
branch   = false     ||     branch   = false    ||      branch   = false
jump     = false     ||     jump     = false    ||      jump     = false

Should the forwarding unit issue a load hazard signal?
(Hint: what are the semantics of the instruction currently in EX stage?)

Question 3 - Branch prediction

Consider a 2-bit branch predictor with only 4 slots for a 32-bit architecture (without a BTB), where the decision to take a branch or not is made in accordance with the following table:

state  ||  predict taken  ||  next state if taken  ||  next state if not taken ||
=======||=================||=======================||==========================||
00     ||  NO             ||  01                   ||  00                      ||
01     ||  NO             ||  10                   ||  00                      ||
10     ||  YES            ||  11                   ||  01                      ||
11     ||  YES            ||  11                   ||  10                      ||

(This is known as a saturating 2-bit counter; note that it is not the same scheme as in the lecture slides.)
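
If the tabular form is hard to read, the same transition rules can be restated in code. The following is a direct translation of the table above, with the states 00 through 11 encoded as the integers 0 through 3 (the helper names are illustrative):

// Saturating 2-bit counter: 0 = strongly not taken ... 3 = strongly taken.
// Predict taken in states 2 and 3; move one step toward the observed outcome.
def predictTaken(state: Int): Boolean = state >= 2

def nextState(state: Int, taken: Boolean): Int =
  if (taken) (state + 1) min 3
  else       (state - 1) max 0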

At some point during execution the program counter is 0xc and the branch predictor table looks like this:

slot  ||  value
======||========
00    ||  01
01    ||  00
10    ||  11
11    ||  01

For the following program:

0xc  addi x1, x3, 10
0x10 add  x2, x1, x1
0x14 beq  x1, x2, .L1 
0x18 j    .L2

Will the predictor predict taken or not taken for the beq instruction?

Question 4 - Benchmarking

In order to gauge the performance increase from adding a branch predictor it is necessary to do some testing. Rather than writing a test from scratch it is better to use the tester already in use in the test harness. When running a program the VM outputs a log of all events, including which branches were taken and which were not, which, as it turns out, is the only information we actually need to gauge the effectiveness of a branch predictor!

For this exercise you will write a program that parses a log of branch events.

sealed trait BranchEvent
case class Taken(from: Int, to: Int) extends BranchEvent
case class NotTaken(at: Int) extends BranchEvent


def profile(events: List[BranchEvent]): Int = ???

To help you get started, I have provided you with much of the necessary code. To get an idea of how to profile branch misses, consider the following profiler, which counts mispredicts for a processor with a 1-bit branch predictor with an infinite number of slots:

def OneBitInfiniteSlots(events: List[BranchEvent]): Int = {

  // Helper inspects the next element of the event list. If the event is a mispredict the prediction table is updated
  // to reflect this.
  // As long as there are remaining events the helper calls itself recursively on the remainder
  def helper(events: List[BranchEvent], predictionTable: Map[Int, Boolean]): Int = {
    events match {

      // Scala syntax for matching a list with a head element of some type and a tail:
      //
      // `case h :: t =>`
      // matches a list with at least one element; h is the head and t is the
      // remainder (which can be Nil, i.e. empty).
      //
      // `case Constructor(arg1, arg2) :: t =>`
      // matches a list whose first element is of type Constructor, giving us
      // access to its internal values.
      //
      // `case Constructor(arg1, arg2) :: t if (p(arg1, arg2)) =>`
      // matches a list whose first element is of type Constructor while also
      // satisfying some predicate p, called an if guard.
      case Taken(from, to) :: t if( predictionTable(from)) => helper(t, predictionTable)
      case Taken(from, to) :: t if(!predictionTable(from)) => 1 + helper(t, predictionTable.updated(from, true))
      case NotTaken(addr)  :: t if( predictionTable(addr)) => 1 + helper(t, predictionTable.updated(addr, false))
      case NotTaken(addr)  :: t if(!predictionTable(addr)) => helper(t, predictionTable)
      case _ => 0
    }
  }

  // Initially every branch address is mapped to false since the initial state of the
  // predictor is to assume branch not taken.
  // (Note that Taken has two fields, so we only use the branch address `from`.)
  def initState = events.map{
    case Taken(from, _) => (from, false)
    case NotTaken(addr) => (addr, false)
  }.toMap

  helper(events, initState)
}
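
To sanity-check a profiler like this it helps to run it on a short, hand-made event list first. The addresses below are made up for illustration:

val sanityEvents = List(
  Taken(0x1C40, 0x1C00), // predicted not taken -> mispredict, slot flips to taken
  Taken(0x1C40, 0x1C00), // predicted taken     -> correct
  NotTaken(0x1C40),      // predicted taken     -> mispredict, slot flips back
  NotTaken(0x1C44)       // predicted not taken -> correct
)
// OneBitInfiniteSlots(sanityEvents) should return 2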

Your task

Your job is to implement a test that checks how many misses occur for a 2-bit branch predictor with 8 slots. The rule table is the same as in question 3. The predictor does not use a branch target buffer (BTB), which means that the branch address will always be decoded in the ID stage. For you this means that you do not need to keep track of branch targets, simplifying your simulation quite a bit. (Otherwise you would need logic for the case where the BTB entry does not match the actual branch target.)

For simplicity's sake, assume that every value in the table is initialized to 00.

For this task it is necessary to use something more sophisticated than a Map[Int, Boolean] to represent your branch predictor model.

The skeleton code is located in testRunner.scala and can be run using testOnly FiveStage.ProfileBranching.

With a 2-bit, 8-slot scheme, how many mispredicts will occur? Answer with a number.

Hint: Use the getTag method defined on Int (in DataTypes.scala) to get the tag for an address.

val slots = 8
say(0x1C40.getTag(slots)) // prints 0
say(0x1C44.getTag(slots)) // prints 1
say(0x1C48.getTag(slots)) // prints 2
say(0x1C4C.getTag(slots)) // prints 3
say(0x1C50.getTag(slots)) // prints 4
say(0x1C54.getTag(slots)) // prints 5
say(0x1C58.getTag(slots)) // prints 6
say(0x1C5C.getTag(slots)) // prints 7
say(0x1C60.getTag(slots)) // prints 0 (thus conflicts with 0x1C40)
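
Given getTag, one possible shape for the predictor state is a vector of eight counters indexed by tag. The sketch below is only a hedged starting point, not a complete solution; it assumes the nextState helper sketched under question 3 and the getTag extension are in scope:

// Eight saturating 2-bit counters, one per slot, all initialized to 00.
val initialTable: Vector[Int] = Vector.fill(slots)(0)

// After observing the outcome of a branch at `addr`, update only its slot.
def observe(table: Vector[Int], addr: Int, taken: Boolean): Vector[Int] = {
  val slot = addr.getTag(slots)
  table.updated(slot, nextState(table(slot), taken))
}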

Question 5 - Cache profiling

Unlike our design, which has a very limited memory pool, real designs have access to vast amounts of memory, offset by a steep cost in access latency. To remedy this a modern processor features several caches, where even the smallest, fastest cache has more memory than your entire design. In order to investigate how caches can alter performance it is therefore necessary to make some rather unrealistic assumptions to see how different cache schemes impact performance.

We will therefore assume the following:

  • Reads from main memory take 5 cycles
  • The cache has a total storage of 8 words (256 bits)
  • Cache reads work as they do now (i.e. no additional latency)

For this exercise you will write a program that parses a log of memory events, similar to the previous task:

sealed trait MemoryEvent
case class Write(addr: Int) extends MemoryEvent
case class Read(addr: Int) extends MemoryEvent


def profile(events: List[MemoryEvent]): Int = ???

Your task

Your job is to implement a model that tests how many delay cycles will occur for a cache which:

  • Follows a 2-way associative scheme
  • Set size is 4 words (128 bits), for a total cache size of a whopping 256 bits
  • Block size is 1 word (32 bits), meaning that we do not need a block offset
  • Is write-through, write no-allocate (this means that you can ignore stores; only loads will affect the cache)
  • Eviction policy is LRU (least recently used)

In a typical cache each block holds more than one word, requiring a block offset; the simulated cache, however, does not need one. This means that the simulated cache has two sets of 4 words each, greatly reducing the complexity of your implementation.

Additionally, assume that writes do not change the LRU counter. This means that your cache will only consider which value was most recently loaded, not written. It's not realistic, but it allows you to completely disregard write events (you can just filter them out if you want).
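
As a hedged starting point for the LRU bookkeeping, a single set can be modeled as a list of addresses ordered from most to least recently used; the set-selection logic and the 5-cycle miss accounting are deliberately left out, and all names are illustrative:

// One cache set holding up to 4 one-word blocks, most recently used first.
case class CacheSet(entries: List[Int]) {
  // Returns the updated set and whether the access was a hit.
  def lookup(addr: Int): (CacheSet, Boolean) =
    if (entries.contains(addr))
      // Hit: move the address to the front; it is now the most recently used.
      (CacheSet(addr :: entries.filterNot(_ == addr)), true)
    else
      // Miss: insert at the front; take(4) drops the least recently used if full.
      (CacheSet((addr :: entries).take(4)), false)
}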

Your answer should be the number of cache miss latency cycles when using this cache.

Further study

If you have the time I strongly encourage you to experiment with a larger cache with bigger block sizes, forcing you to implement the additional complexity of block offsets. Likewise, by trying a different scheme than write-through no-allocate you will get a much better grasp of how exactly the cache works. This is not a deliverable, just something I encourage you to tinker with to get a better understanding.