# CS450

Computer Architecture.

Andrew Morton
Section 001
Email: andrew.morton@uwaterloo.ca
Website: https://www.student.cs.uwaterloo.ca/~cs450/w17/outline.shtml
Office Hours: Tuesdays/Thursdays 11:30am-12:30pm in EIT-4015
Tuesdays/Thursdays 1:00pm-2:20pm

# 3/1/16

In this course, the main project (worth 30%) is to design a MIPS processor in Verilog. No required textbooks. The main topic is about how modern CPUs manage to achieve the performance that they do, building on top of the processor basics covered in earlier processor design courses.

There are also two written assignments (worth 20% total) and the final exam (worth 50%). Assignments have 4 grace days allowed.

Hardware description languages are used for modelling hardware, generally for the purpose of simulation or synthesis. Most complex digital hardware is designed using a hardware description language, which is then compiled and converted into specifications for FPGAs, ASICs, programmable logic, and so on.

The main hardware description languages in use today are System Verilog and VHDL. VHDL is more verbose and Ada-like, while Verilog is more minimalist and C-like. Most projects in industry use Verilog for this reason. System Verilog is a superset of plain Verilog, and all Verilog code works in System Verilog. There's also System C, which looks a lot more like C/C++, but it's not commonly used. In this course, we will be using Icarus Verilog (an implementation of System Verilog), version 10 or later (in order to get support for SystemVerilog).

The simplest/lowest-level of hardware modelling in System Verilog is gate-level modelling. At this level, we have logic gate primitives like AND/OR, and we can connect them together into circuits via wires:

# a1 is the name of the gate, which is optional but useful for labelling gates
# y is the gate output, and a/b/c are the gate inputs (and supports one or more of these inputs, and one output)
# the inputs and outputs are wires, identified by name - everywhere that same name is used can be thought of as being all connected together by a physical wire
and a1(y, a, b, c)

# every argument except the last are the NOT gate's outputs, and the last one is the input
not (y1, y2, a)

## Gate-level modelling

In gate-level modelling, signal/wire values can be 0 (low), 1 (high), x/X (unknown/ignored value), or z/Z (high impedance).

Gate-level modelling for a full adder (addend bits as a/b, carry-in as c_in, carry-out as c_out, sum as sum):

// full_adder_gate_level_modelling.sv
// for a full adder, sum = a XOR b XOR c_in, and c_out = ((a XOR b) AND c_in) OR a AND b

module full_adder(input a, b, c_in, output c_out, sum);
// a logic is basically a connection/wire - here, we're declaring our wires
// a wire is equivalent to a wire logic, which is a logic that supports multiple drivers
// generally, we just use logic unless we need multiple drivers - the compiler will choose the best structure depending on how they're used
// there are also bit and byte, which are basically logics that can only be 0 or 1, not X or Z
logic w1, w2, w3; // these are wires that are used for connecting gates in the full adder together
xor x1(w1, a, b);
xor x2(sum, w1, c_in);
and a1(w2, a, b);
and a2(w3, w1, c_in);
or o1(c_out, w2, w3);
endmodule

This is a design module. It fully specifies the full adder as a self-contained unit.

While this looks like normal procedural code, with various function calls, it is more intuitive to think of this as a series of logical declarations - as if each of the xor/and/or lines were all continuously running at the same time.

A test bench is a set of definitions that instantiates the design module, and simulates the circuit under different input conditions to test for functional correctness. For our full adder example:

// full_adder_test_bench.sv
logic carry_in, x, y;
logic carry_out, z;

// "dut" stands for "device under test"
// below, we're basically instantiating a full adder (defined in the code above) that's connected to our logics, for testing purposes
// the .name(other_name) syntax is a "named port map", which lets us connect logics to devices by signal name
// alternatively, we can just map the signals in order, like full_adder dut(x, y, carry_in, carry_out, z), but this can cause issues if we change the inputs/outputs of the full adder
full_adder dut(.c_out(carry_out), .sum(z), .c_in(carry_in), .a(x), .b(y));

// this is a procedural block - inside, the statements are evaluated sequentially from top to bottom
initial begin // run once at time 0
carry_in = 0; x = 0; y = 0;
// delays must be used to ensure statements actually run sequentially - statements in between delays all execute in parallel
#10 x = 1; // delay 10 time units, then continue and set x to 1
#10 y = 1; // delay 10 time units, then continue and set y to 1
#10 carry_in = 1; // delay 10 time units, then continue and set carry_in to 1
#10 $stop; // delay 10 time units, then stop the simulation (this will not compile if trying to generate hardware from this) end // now we want to dump the signals out into waveforms so we can see them initial$dumpvars(0, full_adder_test_bench); // the first parameter is the dump level (0 means "dump all variables in the module and in the modules that instantiate it"), and the rest of the parameters are the module or variables to dump
endmodule

To compile, run iverilog -g2005-sv -s full_adder_test_bench.sv -o full_adder.vvp full_adder.sv full_adder_test_bench.sv:

• iverilog is the Icarus Verilog compiler.
• -g2005-sv specifies that the compiler should use the 2005 System Verilog standard.
• -s full_adder_test_bench.sv specifies the top-level/root module.
• -o full_adder.vvp specifies the output for the compiled Verilog.
• full_adder.sv full_adder_test_bench.sv specifies the modules to include in the compilation.

To simulate, run vvp -n full_adder.vvp -lx2:

• vvp is the Icarus Verilog simulation tool.

# 12/1/17

Complex number multiplier:

// recall that given two complex numbers $a$ and $b$, $p = a \times b = (\Re(a)\Re(b) - \Im(a)\Im(b)) + \imag (\Re(a)\Im(b) + \Im(a)\Re(b))$
// this needs 4 multiplications, 1 subtract, and 1 add
// we will use a synchronous circuit with one multiplier and one adder/subtractor to do this, in order to demonstrate synchronous design, and also to save space (at the expense of it being slower)
// therefore, we break the operation into 5 steps, each one only using one multiplication or one addition: pp1 <= a_real * b_real, pp2 <= a_imag * b_imag, p_real <= pp1 - pp2, pp1 <= a_real * b_imag, pp2 <= a_imag * b_real, and p_imag = pp1 + pp2, which we can implement with an FSM
module complex_multiplier(
output logic signed [7:-24] p_r, p_i, // this defines pr_ and p_i as fixed-point numbers with 24 bits to the right of the decimal point, and 7 bits to the left (actual value is the value of p_r/p_i as a two's complement integer, divided by $2^{24}$)
input signed [3:-12] a_r, a_i, b_r, b_i,
input clk, reset, go // the clock input, reset signal, and the trigger that starts the multiplication
);
logic a_sel, b_sel, pp1_en, pp2_en, sub, p_r_en, p_i_en;
logic signed [3:-12] a_operand, b_operand;
logic signed [7:-24] pp, sum; // partial product, sum
logic signed [7:-24] pp1, pp2;

// multiplier with two multiplexers for input, to select inputs between real/imaginary components of a and b
assign a_operand = ~a_sel ? a_r : a_i;
assign b_operand = ~b_sel ? b_r : b_i;
assign pp = a_operand * b_operand;

// update pp1 and pp2 registers on clock edge
always @ (posedge clk)
if (pp1_en) pp1 <= pp;
always @ (posedge clk)
if (pp2_en) pp2 <= pp;

assign sum = ~sub ? pp1 + pp2 : pp1 - -pp2;

// update output registers on clock edge
always @ (posedge clk)
if (p_r_en) p_r <= sum;
always @ (posedge clk)
if (p_i_en) p_i <= sum;

// finite state machine to drive the above hardware
enum bit [2:0] { step0, step1, step2, step3, step4 } current_state, next_state;
always @ (posedge clk or posedge reset)
if (reset) current_state <= step0;
else current_state <= next_state;
always @ * begin // update whenever go or current_state change
case (current_state)
step0: next_state = go ? step1 : step0;
step1: next_state = step2;
step2: next_state = step3;
step3: next_state = step4;
step4: next_state = step0;
endcase
end
always @ (current_state, reset) begin
a_sel; b_sel = 0; pp1_en = 0; pp2_en = 0;
sub = 0; p_r_en = 0; p_i_en = 0;
case (current_state)
step0: begin pp1_en = 0; end
step1: begin a_sel = 1; b_sel = 1; pp2_en = 1; end
step2: begin b_sel = 1; pp1_en = 1; sub = 1; sub = 1; p_r_en = 1; end
step3: begin a_sel = 1; pp2_en = 1; end
step4: begin p_i_en = 1; end
endcase
end
endmodule

Verilog has sized literals for numbers. For example, 16'h0001 means a 16-bit hexadecimal value 1, 1'b1 means the 1-bit binary value 1, and 4'd8 means the 4-bit decimal value 8.

Overview of assignment 1. Assignment involves filling out Verilog module stubs and writing test benches.

# 17/1/17

Room has been moved to HH 1102, which should be a lot closer to campus than OPT.

## Processor Design

There are multiple levels of abstraction when talking about a processor design. From most abstract to least abstract, the main ones are:

• Instruction-set architecture (ISA) - defines functional behaviour like instruction format, addressing modes, hazards, registers, privileges, exceptions, etc. For example, IA32 (also known as x86), IA64/AMD64 (also known as x86-64), PowerPC, MIPS, ARM, RISC-V.
• Microarchitecture - defines the hardware structure like pipeline structure, cache organization, branch prediction schemes. For example, Intel i386, Intel i486, Intel Skylake, MIPS R16000.
• Implementation - defines physical realization like gate design, transistor technology, fabrication technology.

In the 1980s the main focus in processor design was improving single core performance and clock speed. Today, focus is on more cores at a lower clock speed, and doing more with each cycle (modern Intel CPUs can do 4-8 instructions per cycle!). New microarchitecture-level techniques like instruction-level parallelism (parallelism within 1 thread) and hyperthreading (parallelism between multiple threads on one core), speculative execution (running both sides of a branch and then throwing away the wrong one once the branch is done), out of order execution (rearranging instructions to improve parallelism) have also resulted in significant performance gains. The main issue preventing more single core performance and clock speed is the power wall.

Dynamic power consumption is the power consumed that depends on the processor workload. When a transistor switches, it must use energy proportional to the capacitive load times the voltage squared. So the power used (switching frequency times switching energy) quicky increases linearly with the switching frequency, resulting in increased cooling requirements and worse battery life. One technique modern processors use to try to get around this is to reduce the switching voltage to quadratically reduce the switching energy required - for example, a 15% reduction in voltage results in a 28% reduction in switching energy.

There's also static energy consumption, the power consumed regardless of the processor workload. As transistors get smaller, the leakage current increases (due to quantum tunneling in the transistor junctions), which means more power used per transistor.

Modern techniques involve thread-level parallelism (doing multiple tasks at once), and data-level parallelism (doing something to multiple pieces of data at once).

Fynn's taxonomy characterizes four classes of parallelism:

• Single-instruction single-data - single processing unit acting on single pieces of data. For example, a core calculating one scalar value. For these, we can take advantage of instruction-level parallelism.
• Single-instruction multiple-data - single processing unit acting on multiple pieces of data. For example, DSP units, GPUs, vector instructions. For these, we can take advantage of data-level and instruction-level parallelism.
• Multiple-instruction single-data - multiple processing units acting on single pieces of data. Not really used in practice.
• Multiple-instruction multiple-data - multiple processing units acting on multiple pieces of data. For example, multithreading and multiprocessing. For these, we can take advantage of thread-level, data-level, and instruction-level parallelism.

Recall MIPS:

• 32-bit instructions, in one of two formats. First format: 6 bits for opcode, 5 bits for register operand 1, 5 bits for register operand 2, upper 16 bits for immediate value. Second format: 6-bit opcode, 5 bits each for register operand 1, 2, and 3, 5-bit shift amount, and 6-bit shift type specifier. See CS241 notes for more details about MIPS instructions.
• 32 registers R0 to R31, R0 always has value 0.
• In this course, we'll use a 5-stage pipeline for fetch, decode, execute, memory, and writeback. We can overlap up to 5 instructions at once.
• Instruction flow is linear (statically scheduled), and one instruction is completed per cycle (ideally). In other words, instructions are run in whatever order they're stored as, and we don't dynamically rearrange to improve parallelism.

# 19/1/17

Sometimes instructions in a pipeline depend on others. There are three main types of these dependencies:

• A read-after-write dependency is when an instruction needs to read out the value that some other instruction wrote - if the CPU ran both at once, we might read the wrong value. This is the only real dependency in scalar, statically-scheduled pipelines.
• A write-after-read dependency is an anti-dependency - if the CPU reorders these instructions so that they're in the opposite order, we might read the wrong value.
• A write-after-write dependency is an output dependency - if the CPU reorders these instructions so that they're in the opposite order, the final value of the register may be wrong.

When instructions aren't dependent on each other, we can more-or-less freely rearrange them. This allows efficient pipelining and parallelism.

Total sequential execution (TSE) is an assumption in our programming model that one instruction finishes before the next starts. TSE is sufficient but not necessary for semantic correctness. In fact, semantic correctness can be satisfied just by satisfying the inter-instruction dependencies of the original program - we can otherwise rearrange and overlap instructions however we want.

Since we have five stages (fetch, decode, execute, memory, writeback), an instruction that writes registers takes 4 cycles after it's been fetched before the write actually takes effect, assuming 1 cycle per pipeline stage. Likewise, an instruction that reads registers takes 1 cycle after fetching before actually performing the read.

A control dependency is basically a special case of a register dependency, where the register is the program counter. An instruction that branches (writes to the program counter) has its write take effect 1 cycle after being fetched (the MIPS datapath reads PC in fetch stage and writes in decode stage).

In the same way, an instruction that writes to memory has a memory dependency, and has its write take effect 3 cycles after being fetched.

Therefore, if we have two instructions $$A, B$$ that run at clock cycle $$i$$ and $$j$$, respectively, such that $$i > j$$:

• If there is a read-after-write register dependency from $$B$$ on $$A$$, we must satisfy $$j + 1 > i + 4$$. Hazards occur if $$j - i \le 3$$.
• If there is a write-after-read register dependency from $$B$$ on $$A$$, we must satisfy $$j + 4 > i + 1$$. Hazards are impossible without reordering instructions.
• If there is a write-after-write register dependency from $$B$$ on $$A$$, we must satisfy $$j + 4 > i + 4$$. Hazards are impossible without reordering instructions.
• If there is a read-after-write control dependency from $$B$$ on $$A$$, we must satisfy $$j + 0 > i + 1$$. Hazards occur if $$j - i \le 1$$.
• If there is a write-after-read control dependency from $$B$$ on $$A$$, we must satisfy $$j + 1 > i + 0$$. Hazards are impossible without reordering instructions.
• If there is a write-after-write control dependency from $$B$$ on $$A$$, we must satisfy $$j + 1 > i + 1$$. Hazards are impossible without reordering instructions.
• If there is a read-after-write memory dependency from $$B$$ on $$A$$, we must satisfy $$j + 3 > i + 3$$. Hazards are impossible without reordering instructions.
• If there is a write-after-read memory dependency from $$B$$ on $$A$$, we must satisfy $$j + 3 > i + 3$$. Hazards are impossible without reordering instructions.
• If there is a write-after-write memory dependency from $$B$$ on $$A$$, we must satisfy $$j + 3 > i + 3$$. Hazards are impossible without reordering instructions.

A hazard is a potential for a violated dependency (and therefore, violation of semantic correctness). In our pipelined, statically scheduled CPU, the hazards are a 3-cycle register read-after-write, and a 1-cycle control read-after-write.

To avoid hazards, we can insert no-op instructions - 3 NOP instructions after each register read-after-write, or 1 after every control read-after-write. These can be inserted either by the compiler, or the CPU can stall for the necessary number of cycles when the relevant dependencies are found. Compilers

However, a 3-cycle stall for a 3-cycle register read-after-write is quite long. We can shorten register/memory dependency stalls by using data forwarding - routing the output of later stages directly to earlier stages to avoid the need for a write/read - earlier stages can read directly from the later stages.

For our purposes, we can route the output of the memory stage to the execute stage (for $$j = i + 1$$), the output of the writeback stage to the execute stage (for $$j = i + 2$$), and the output of the writeback stage to the decode stage (for $$j = i + 3$$). With that in place, each stage always has the data it needs, so we never have to stall.

To implement data forwarding in each stage, we can use multiplexers that choose between either reading/writing registers, or taking input directly from later stages.

Turns out, this can't eliminate all our stalls, though it does make a lot of them much shorter. Consider lw R1, (R2) followed by add R4, R1, R1 - the load-word instruction doesn't have its result until the memory stage is done, so we must stall the add for 1 cycle so we can forward it from the write-back stage to the execute stage.

To solve the control dependency read-after-write stall (when we have a conditional branch), we might assume that the branch won't happen, keep executing the next instruction, and then squash/suppress it in the pipeline if we do end up taking the branch.

MIPS uses a different approach - the CPU will execute the instruction after the branch (the instruction after the branch is in a position called the branch delay slot), regardless of whether the branch is taken or not. The compiler will generally try to fill the branch delay slot with an instruction that doesn't depend on the branch, or NOP if it can't find one.

# 24/1/17

Recall that in static scheduling, instructions are executed in the order that they appear in, and the compiler is responsible for arranging instructions in a way that avoids hazards.

Code example with instruction reordering by the compiler for hazard avoidance.

The cycles per instruction (CPI) is a measure of CPU performance, and is ideally somewhere around 1 for our scalar processor. However, penalty cycles caused by hazards will increase this. For example, if we have 6 instructions that incur 3 penalty cycles, we get $$\frac{6 + 3}{6} = 1.5$$ as the CPI.

The program execution time is $$\frac{n \times \text{CPI}}{f}$$, where $$n$$ is the number of instructions in the basic block, and $$f$$ is the clock frequency.

Local scheduling is when the compiler rearranges instructions within a basic block. A basic block is a sequence of consecutive instructions such that if one instruction is executed, all of them are - that means only the first instruction can be the target of a branch instruction, and only the last instruction can be a branch instruction (or the second to last for MIPS, due to the branch delay slot).

To perform local scheduling, we first identify the hazards in the code, leaving slots to put other instructions in according to how many penalty cycles are present. For example, if we have a load followed by a dependent add instruction, we might have 1 penalty cycle in our architecture, so we would leave one slot for an instruction. Then, we try to fill in the slots with other instructions while keeping dependencies satisfied.

Global scheduling is when the compiler rearranges instructions across multiple basic blocks.

A common example of this is loop unrolling, where the body of a loop known to execute $$n$$ times is simply duplicated $$n$$ times - this reduces branching overhead, and makes local scheduling better since it may result in larger basic blocks. While loop unrolling can make our code faster, it results in larger programs, which might cause more instruction cache misses. The compiler must make the tradeoff depending on what would be faster.

# 26/1/17

Project part 1 marks come out on Monday. Part 2 is out on the course website, building a 5-stage MIPS pipeline. In-class overview of assignment.

In the scalar pipelines we are looking at, exceptions are events that alter program flow. Exceptions are harder to handle than normal program flow because they can occur at any time, and we must make sure registers and memory are properly stored/restored to ensure the TPE assumption is upheld.

Exceptions can be generated by things like interrupts from peripherals, unusual cases for some instructions (invalid opcode, page fault), and trap instructions (instructions that invoke the OS).

Precise exceptions are those that are recoverable - we can return to thread after the exception is handled.

What if we have multiple instructions that generate exceptions simultaneously in the pipeline? For example, we might have a load instruction that tries to dereference a null pointer at the memory stage, while we're trying to decode an invalid instruction in the decode stage - the load instruction generates an exception from the memory stage, and the invalid instruction generates one at the decode stage. To make sure we handle all instructions in the pipeline, we only mark the instruction as having caused an error, then actually handle them in the write-back stage.

There are three types of exceptions:

• Interrupts are asynchronous exceptions, usually caused by peripherals like keyboards and timers.
• When an interrupt occurs: the pipeline stops fetching, runs until the pipeline is empty (the pipeline is drained), saves state, invokes the interrupt handler routine, restores state, and then jumps to the next instruction in the interrupted thread..
• Faults are synchronous exceptions, handled in write-back stage.
• When a fault occurs: the pipeline stops fetching, cancels all instructions in the pipeline (the pipeline is flushed), saves state, invokes the fault handler routine, restores state, and then jumps back to the instruction that caused the fault.
• Traps are synchronous exceptions, detected in write-back stage.
• When a fault occurs: the pipeline stops fetching, finishes the trap, cancels all instructions in the pipeline (the pipeline is flushed), saves state, invokes the trap handler routine, restores state, and then jumps to the instruction after the one that caused the trap.

Very Long Instruction Word (WLIV) architectures are architectures that have very large instructions, generally able to issue multiple operations at once. For example, Intel Itanium has 128-bit instructions, able to contain up to 3 operations each.

If and only if an architecture can handle multiple instructions per cycle, it is known as a superscalar architecture.

VLIW architectures are generally statically scheduled, where each instruction contains multiple scalar operations. They generally don't do hazard detection or even full data forwarding, instead relying on the compiler to schedule operations in a good way.

The advantage of VLIW architectures is that the compiler gets more control over the scheduling and how different functional units (FPUs, ALUs, etc.) are used, so we can devote more transistors to the actual functional units themselves, rather than hazard detection and so on.

Consider a VLIW architecture where each instruction contains two memory operations, two floating point operations, and one ALU/branch operation. Each operation can have its own pipeline, so we'll give memory operations the usual fetch/decode/execute/memory/writeback, floating point operations a fetch/decode/execute 1/execute 2/execute 3/writeback pipeline, and ALU/branch operations a fetch/decode/execute/writeback pipeline. We'll also say branches resolve in the decode stage for all types of operation, so we need 1 branch delay slot (since decode is 1 pipeline stage after fetch).

Fetches occur at the same time for every pipeline - one instruction is fetched for every cycle.

Assuming full data forwarding is implemented, the latency of memory operations could be up to 1 if the execute stage has a dependency on the memory stage, the latency of ALU/branch operations could be up to 1 if the decode stage has a dependency on the execute stage, and the latency of floating point operations could be up to 2 if the execute 1 stage has a dependency on the execute 3 stage.

Slot utilization is how well the available operation slots for each instructions are filled, defined as $$\frac{\text{slots filled}}{\text{number of instructions} \times \text{slots per instruction}}$$.

# 31/1/17

;wip: global scheduling, software pipelining, catch up on the rest, gotta run for interviews

# 2/1/17

VLIW processors have many techniques for improving throughput and parallelism.

Speculative execution is basically just when we keep executing down the predicted path when we reach a branch, regardless of whether that path is actually taken or not, and then undoing the effects of that if we don't end up taking the predicted path. Implemented correctly, it eliminates branch delays when the branch predictor is right.

Speculative loading is a technique in which the compiler tries to move memory load instructions upward to avoid memory latency, even if they possibly won't be needed, like if a branch isn't taken.

In IA-64, we denote speculative instructions with the .s suffix - ld8.s is the speculative version of ld8. We denote an advanced load version of the load instruction with the lw8.a suffix, and the ALAT check version of loads with lw8.c. Each lw8.a instruction is paired up with a lw8.c instruction later on - the lw8.a instruction starts the actual data transfer into the ALAT, and then when the corresponding lw8.c instruction is encountered, we can just load it from the ALAT rather than making a trip all the way to memory. Cache entries are invalidated if their addresses are stored to in between those instructions.

Deferred exceptions are a technique for making speculative execution easier. If we want to hoist an instruction above a conditional branch, and the exception causes an exception at runtime, we need a way to mark the instruction in a way that the exceptions are only handled if the conditional branch isn't taken, in order to make sure the hoisted code works the same as the original.

Consider a LW R1, (R5) instruction, which gets hoisted above a BEQZ R4, L1 instrucion. To implement deferred exceptions, we can add a poison bit to each register, and if and only if the speculatively executed instruction raises an exception, we can set the poison bit instead. Then, when we reach the original instruction, we can check for the poison bit being set, and if it is, actually raise the exception there.

Predicated execution is a technique in which instructoins only commit their results if some predicate is true. This is useful because we can often avoid branches and use predicates instead, therefore avoiding branch delays. IA-64, for example, has 64 1-bit predicate registers, which can be set by various operations like comparison, and used to enable/disable different instructions.

Predicated execution essentially lets the compiler turn control dependencies into data dependencies, by letting it change branches into predicated instructions. All of the instructions run, but only the ones with true predicates will actually have an effect. It's very useful when we have lots of small basic blocks.

For example, in IA-64, we can denote p1 = r1 == 0; p2 = r1 !== 0 as cmp.eq p1, p2 = 0, r1;;, where p1 to p64 are the predicate registers. Then, we can do (p1) add r4 = r4, 1 to increment r4 if and only if predicate register p1 is 1.

The Itanium architecture, IA-64, was also called an EPIC architecture (explicitly parallel instruction computing - VLIW with data forwarding and dynamic events like cache misses). It has 128 65-bit (64 bits, plus a poison bit, called the Not-A-Thing bit) integer registers, 128 82-bit floating point registers, and 64 1-bit predicate registers.

# 7/2/17

## Out of order pipelines

Out of order pipelines dyanmically schedule instructions. The benefit is that they can take advantage of instruction-level parallelism not available at compile time, and allow us to make compilers a lot simpler, at the expense of more complicated CPUs.

When we're designing these pipelines, we need to design around hazards in register data, memory data, and control flow.

Out of order execution is inherently superscalar - there's no point in doing dynamic scheduling unless we have multiple functional units.

Scheduling is usually done using Tomasulo's algorithm, an improved version of previous techniques like scoreboarding. It was invented around 1967 and used for an IBM FPU.

Tomasulo's FPU has two functional units - a 2-stage pipelined, 2-cycle floating point adder, and a non-pipelined, 3-cycle floating point multiplier. The adder has 3 reservation stations, and the multiplier has 2.

The FPU also has 3 sets of registers:

• Float Registers (FLR) F0, F2, F4, F6 are the floating point registers used for FPU computation.
• Store Data Buffer (SDB) stores 3 results to be written to memory.

The common data bus allows parts of the FPU to broadcast a float along with a 4-bit source tag to specify where it came from (like a reservation station or the FLB).

Units like reservation stations, the FLR, and the SDB hold data and the data's tag.

The key concept here is the reservation station, and the use of tags to

Here's the FPU's workflow:

1. Up to two instructions are dispatched (in their original order) to the reservation station.
2. Operands are copied to the reservation station.
3. The destination register number is compared to the reservation station number.
4. Instructions are sent to the functional unit (adder or multiplier) when all of its operands are ready and the functional unit is free.
5. After computing the result, the functional broadcasts the result on the CDB, and waiting units like the FLR, reservation stations, and SDB update their contents.

;wip: what even is this, read https://en.wikipedia.org/wiki/Tomasulo_algorithm#Implementation_concepts

Tomasulo's algorithm has imprecise exceptions. For example, if register F4 is renamed by instruction X in cycle 1 and renamed by instruction Y in cycle 2, then W never writes its result to register F4, so if W faults, then recovery isn't possible since the semantics require the result be written. There's no real way to recover from this except to restart the computation from an earlier point.

An out of order pipeline looks something like the following:

• Fetch - retrieve multiple instructions per cycle (in order).
• Decode - identify instruction boundaries, opcodes, operands (in order).
• For Intel x86 processors, CISC instructions are translated into internal RISC instructions.
• Dispatch - send the decoded instructions off to the reservation stations (in order).
• Issue - send instructions from reservation stations to the execution units (out of order).
• Finish - broadcast result in the forwarding bus (out of order).
• Commit - update destination registers in order (in order).
• Retire - update memory-store instructions (in order).
• This is where SW instructions actually take effect.

The re-order buffer (ROB) remembers program order, so we can put things back in order after the Issue stage. At the dispatch stage, each instruction gets an entry in the ROB. At the finish stage, the ROB entry is updated to ;wip. At the Commit stage, the ROB entry is removed.

The ROB also keeps track of which instructions raise exceptions, so we can handle them at the Commit stage of the pipeline (this is analogous to the Itanium case, where we used a poison bit to only raise exceptions when we reach the Writeback stage). This ensures that the processor state is correct/precise at the instruction that raised the exception, which allows us to recover from the exception.

The width of an out of order pipeline is the number of instructions dispatched/committed per cycle.

# 9/2/17

Assignment 1 (not project 1) is due after reading week. It's about scheduling instructions in a pipelined processor.

### Register renaming/assignment

Data dependencies within a block of code form a data-flow graph (DFG) in the processor, where each edge $$a \to b$$ represents the fact that $$b$$ depends on the result of $$a$$, so $$b$$ shouldn't run before $$a$$. To help us run Tomasulo's algorithm, we can draw out the DFG and arrange it to minimize height.

The longest path (also known as the critical path) within the dataflow graph is the minimum number of cycles we can execute those instructions in - this is called the dataflow limit. In other words, with one thread, it's impossible to execute faster than the length of the critical path of true dependencies.

Operations in the CPU produce values, which are then assigned to registers. The live range of a value is the range within the code starting from when the value is first written to a register, up to and including the last use (when the value is last read from the register).

DFGs can also have virtual dependencies between pairs of instructions, called false dependencies, which give us write-after-read or write-after-write hazards if we change the order of those instructions.

Since there are a limited number of registers, the compiler needs to recycle registers - the compiler can use a register for a new value after the current value has been read for the last time. However, reusing registers imposes write-after-read dependencies.

Register renaming is a technique that tries to get rid of false data dependencies, which helps us get better parallelism. The

To implement this, each operation's output value is assigned to a temporary rename register within a rename register file. The actual destination register is associated with the rename register ID, and on the commit step, we can actually copy the value from the rename register to the destination register. If we have enough rename registers, we can essentially have single assignment for each register, eliminating false dependencies and allowing us to improve parallelism.

Register renaming is used by Tomasulo's algorithm.

In hardware, we have an ARF (architected register file), which maps architecture registers (R0-R31 in MIPS) to rename registers. The RRF (rename register file) has one entry for each ROB entry, and the

With ;wip, we replace the ARF with an RAT (register alias table), which maps registers R0-R31 to architected and rename register entries in the PRF. The RAT has two columns - the retirement RAT (which has indices of architected register in the PRF), and the front-end RAT (which points to renamed register indices in the PRF). In case of an exception, the retirement RAT is copied to the front-end RAT, which undoes any renames done by the instructions after the fault, allowing us to raise precise exceptions.

;wip: read textbook section 3.4.6 (or 5.3? dunno what the numbers mean) and https://en.wikipedia.org/wiki/Register_renaming

The PRF (physical register file) stores the actual contents of the architected and rename registers. The

Here's what this looks like:

• During dispatch:
• Allocate an ROB entry
• Allocate a reservation station entry
• Rename the destination register of the operation in the PRF and RAT.
• Copy operands from PRF (using front-end RAT) to reservation station (the values of the operands if the data is available, or the register number if not)
• Rename the front-end RAT with the rename register number.
• On commit:
• Update the retirement RAT.
• Add the old PRF number to the PRF free list.

Delayed read is a technique in which the reservation station only stored the operand tag and status of the data, rather than the data itself, reducing data duplication. It can also time instruction issuing to correspond with the broadcast of the result of the operation it depends on. ;wip: what

# 14/2/17

Register renaming is a technique for undoing the hazards caused by register recycling.

Speculative execution is a technique in which we predict the direction of a branch and start executing that path while the branch is being resolved. However, this presents issues for register renaming, because speculatively executed instructions need to rename registers too - we must make sure the effects of the renaming are undone if the branch turns out to be wrong. To do this, we can do either rollbacks or checkpointing.

To do rollbacks, we save frontend RAT entries into the ROB (reorder buffer) as soon as the speculative instruction is dispatched. To recover upon a mispredicted branch, we can copy back the rename register number from the ROB to the frontend RAT, one at a time, until the branch instruction is actually completed. This works, but it's quite slow because of all the copying we have to do over multiple cycles.

To do checkpointing, we save a full copy of the entire frontend RAT as soon as the speculative instruction is dispatched. The saved copy of the frontend RAT is copied back to the frontend RAT is copied back if we mispredict the branch. This can be a lot faster than rollbacks, but it requires more storage and hardware to support this, depending on how many branches we are speculating past.

For Intel Skylake chips, we have 224 ROB entries, 180-entry integer PRF, and 168-entry floating point PRF. Haswell chips support up to 48 checkpoints - the ability to speculatively execute 48 branches at once!

## Memory Data Flow

Data in memory includes the data segments, stack, and heap. Loads usually start a chain of instructions, because they have to wait on RAM, which is slow compared to processors. How do these work with pipelined processors?

For store instructions:

1. Upon dispatch, calculate the effective address and store it with the data in the store buffer (SB).
2. After commit, write the store buffer entry out to the processor cache (so it gets written to RAM).

1. Upon dispatch, calculate the effective address, and read the data from either the store buffer (if available) or the processor cache (so it gets read from RAM).
2. After commit, broadcast the destination read on the forwarding bus, so the other units know that the destination register has been updated.

We want to be able to issue store instructions out of order, to improve performance.

Speculative loads are loads performed under the assumption that they are unaliased to recent stores - loads that are done from the cache, not the store buffer. When we do speculative loads, we need to detect if the load gave a valid result, and if not, retry it.

1. Upon dispatch, calculate the effective address. If a committed store is in the store buffer, forward the data along the forwarding bus. Otherwise, get the data from the processor cache. Add the speculative load's location and effective address to the FLB (finished load buffer).
2. After commit, remove the created entry from the FLB.

The store process, when using speculative loads, becomes:

1. Upon dispatch, calculate the effective address and store it with the data in the store buffer.
2. Upon commit, if aliased loads are present in the FLB, suppress those speculative loads. Write the store buffer entry out to the processor cache (so it gets written to RAM).

This ensures that if a speculative load finishes before a store completes, the effect of the load gets undone.

Another technique for speeding up loads is data prefetching, where data gets loaded from RAM into the processor data cache before it's actually loaded, so when we actually load it, we can do so directly from the cache. For example, in AMD64 the the prefetch` instruction does this.

Data prefetching can also be done automatically - hardware prefetchers might detect patterns of sequential access, strided access (access with a linear offset each time, like accessing the diagonals of a matrix or accessing the columns of a row-major ordered matrix). One downside of this is that it might bring in data we don't actually want, reducing cache hit rate. Some systems like SPARC have a separate cache for prefetched data, called the prefetch buffer, to avoid this issue.

# 16/2/17

Review of caching and memory (see ECE222 notes for more details):

• Cache lines store the tag (the other bits of the address, enough to reconstruct the address later), the flags (dirty, valid, etc.), and the data itself.
• Direct-mapped caching: blocks of memory are mapped to cache lines according to their block index modulo the number of blocks.
• Fully-associative caching: blocks of memory are mapped to cache lines in an LRU fashion, and associative memory allows lookup by block index.
• Set-associative: the first few bits of the block index are used to determine which set of cache lines to map to, and each set of cache lines forms a fully associative cache (fully-associative caching can be thought of as 1-way set-associative caching, and direct-mapped caching can be thought of as all-way set-associative caching). Most instruction/data caches work best in practice at around 8-way set associative caching.
• Three types of cache misses: compulsory (data has never been in the cache before), capacity (cache is too small to hold the data, so we kicked the data we needed earlier), conflict (the associativity is too low).
• Suppose we have 32-bit addresses, and 1024 direct-mapped 64-byte cache lines. Then the first 16 bits of the address are stored as tags, the next 10 bits are used as the block index (because taking the bits before the block offset modulo 1024 is the same as just taking the last 10 bits), and the last 6 bits are for the offset within the block (addresses point to individual bytes).
• Virtual memory provides protection against other processes modifying a given process' memory, and allows processes to work with a larger, virtual address space.
• The MMU translates virtual to physical addresses. It translates the virtual page number to the physical segment number, and keeps the offset bits.
• The MMU can use multiple different indexing modes:
• PI/PT indexing (physical indexing, physical tagging) - MMU entries are indexed and tagged by physical addresses, which means translation from virtual to physical address completes before cache lookup starts - this is easy to implement but slow.
• VI/PT indexing (virtual indexing, physical tagging) - MMU entries are indexed by virtual addresses but tagged by physical addresses, which means translation and cache lookup can occur in parallel. However, VI/PT results in the synonym problem, where physical blocks are cached in the MMU multiple times for shared memory, since each physical address can have a different virtual address for different processes. To solve this, the OS can do page colouring to make sure the virtual page number is always the same in all virtual address spaces. This is widely used in practice.
• VI/VT (virtual indexing, virtual tagging) - MMU entries are indexed and tagged by virtual addresses, which means we only need to translate virtual page numbers to physical page numbers upon cache misses. This also has the synonym problem, but it also has the homonym problem, where the same virtual tag can refer to different possible physical blocks. OSs solve this by storing the process ID in the tag. This is rarely used in practice.

In MIPS, control flow happens either as a branch instruction (which does PC-relative addressing), or as a jump (which does indirect addressing).

Speculative execution has three main concerns:

• Target speculation:
• The branch target buffer maps branch instruction addresses to branch target addresses. It stores predictions for branches.
• In the fetch stage for a branch instruction, if the branch target buffer has an entry for the address of this branch instruction, we load the branch target address into PC, giving us a 0-latency branch.
• In the execute stage, the branch unit actually resolves the branch to validate whether the prediction is correct or not. If it correct, we can add/update the branch target buffer entry, or remove the branch target buffer entry if it wasn't correct (to delete the prediction).
• This works well for branches, but not really for jumps, which use indirect addressing. For example, returning from a subroutine called from many places means that the branch target address will often change, causing frequent branch mispredictions.