109 lines
4.4 KiB
Org Mode
109 lines
4.4 KiB
Org Mode
* Exercise 2
|
|
Safety wheels are now officially off.
|
|
To verify this, set nopPadded to false in Manifest.scala and observe as all hell
|
|
breaks loose.
|
|
|
|
Let's break down what's going wrong and what we can do about it.
|
|
|
|
** RAW Hazards
|
|
Consider the following program:
|
|
#+begin_src asm
|
|
main:
|
|
add x1, x1, x2
|
|
add x1, x1, x1
|
|
add x1, x1, x2
|
|
#+end_src
|
|
|
|
In your implementation this will give you wrong results since the results
|
|
of the first add operation will not be available in the registers before
|
|
x1 is fetched for the second and third operation.
|
|
|
|
Your first task should therefore be to implement a forwarding unit
|
|
|
|
The forwarding unit is located in the EX stage and is responsible for selecting
|
|
the ALU input from three possible sources.
|
|
These sources are:
|
|
+ The value received from the register bank
|
|
+ The ALU result currently in MEM
|
|
+ The writeback result
|
|
|
|
The forwarder prioritizes as follows:
|
|
+ If the input register address is not the destination in either MEM or WB, select the
|
|
register.
|
|
+ If the input register address is the destination register in WB, but not in MEM, select
|
|
the writeback signal.
|
|
+ If the input register address is the destination register for the operation currently
|
|
in MEM, select that operation.
|
|
|
|
There is a special case you need take into account, namely load operations.
|
|
Considering the following program:
|
|
#+begin_src asm
|
|
main:
|
|
lw x1, 0(x2)
|
|
add x1, x1, x1
|
|
add x1, x1, x1
|
|
#+end_src
|
|
|
|
When the second operation (~add x1, x1, x1~) is in EX, the third clause in the forwarder
|
|
is triggered, however the result is not yet ready since fetching memory costs a cycle.
|
|
In order to fix this the forwarder must issue a signal that freezes the pipeline.
|
|
This is done by issuing a signal to the barrier registers, telling them to _not_ update
|
|
their contents, essentially repeating the last instruction.
|
|
|
|
There are many subtleties to consider here.
|
|
For instance: What should happen to the instruction currently
|
|
in MEM? If it too is repeated the hazard detector will trigger next cycle, effectively
|
|
deadlocking your processor.
|
|
What about when the write address and read address are similar in the ID stage?
|
|
|
|
Designing a forwarder can take very long, or it can be done very quickly, all depending
|
|
on how *methodical* you are. My advice is to design the algorithm first, then when you're
|
|
satisfied implement it on hardware.
|
|
|
|
|
|
** Control hazards
|
|
|
|
Consider the following code
|
|
|
|
#+begin_src asm
|
|
main:
|
|
beq zero, zero, target
|
|
add x1, x1, x1
|
|
add x1, x1, x1
|
|
j main
|
|
target:
|
|
sub x1, x2, x2
|
|
sub x2, x2, x2
|
|
#+end_src
|
|
|
|
Depending on your design the two add instructions will be fetched before the beq jump happens.
|
|
Whenever a branch happens it is necessary to flush the spurious instructions that were fetched
|
|
before the branch was noticed.
|
|
However, simply waiting until the branch has been decided is not acceptable since that guarantees
|
|
lost cycles.
|
|
In your first iteration, simply assume branches will not be taken, and if they are taken issue
|
|
a warning to the barriers that hold spurious instructions telling them to render them impotent
|
|
by setting the control signals to do nothing.
|
|
|
|
* Putting the pedal to the metal
|
|
Once you have a design that RAW and control hazards you're ready to up the ante and add some
|
|
improvements to your design.
|
|
Some suggestions:
|
|
|
|
** Branch predictor
|
|
Instead of assuming branch not taken, use a branch predictor. There are many different schemes,
|
|
but I advice you to stick to a simple one, such as 1-bit or 2-bit.
|
|
|
|
** Fast branch handling
|
|
Certaing branches like BEQ and BNE can be calculated very quickly, wheras size comparison branches
|
|
(BGE, BLE and friends) take longer. It is therefore feasible to do these checks in the ID stage.
|
|
|
|
** Adding a data cache
|
|
Unless you have already done the two suggested improvements, do not attempt to create a cache.
|
|
The first thing you need to do is to add a latency for memory fetching, if not you will have
|
|
nothing to improve upon.
|
|
|
|
If you still insist, start with the BTreeManyO3.s program and analyze the memory access pattern.
|
|
What sort of eviction policy and cache size would you choose for this pattern?
|
|
You should try writing additional benchmarking programs, it is important to have something measurable,
|
|
and the current programs are not made to test cache performance!!
|