# CS450

Computer Architecture.

Andrew Morton
Section 001
Email: andrew.morton@uwaterloo.ca
Website: https://www.student.cs.uwaterloo.ca/~cs450/w17/outline.shtml
Office Hours: Tuesdays/Thursdays 11:30am-12:30pm in EIT-4015
Tuesdays/Thursdays 1:00pm-2:20pm

# 3/1/16

In this course, the main project (worth 30%) is to design a MIPS processor in Verilog. No required textbooks. The main topic is about how modern CPUs manage to achieve the performance that they do, building on top of the processor basics covered in earlier processor design courses.

There are also two written assignments (worth 20% total) and the final exam (worth 50%). Assignments have 4 grace days allowed.

Hardware description languages are used for modelling hardware, generally for the purpose of simulation or synthesis. Most complex digital hardware is designed using a hardware description language, which is then compiled and converted into specifications for FPGAs, ASICs, programmable logic, and so on.

The main hardware description languages in use today are System Verilog and VHDL. VHDL is more verbose and Ada-like, while Verilog is more minimalist and C-like. Most projects in industry use Verilog for this reason. System Verilog is a superset of plain Verilog, and all Verilog code works in System Verilog. There's also System C, which looks a lot more like C/C++, but it's not commonly used. In this course, we will be using Icarus Verilog (an implementation of System Verilog), version 10 or later (in order to get support for SystemVerilog).

The simplest/lowest-level of hardware modelling in System Verilog is gate-level modelling. At this level, we have logic gate primitives like AND/OR, and we can connect them together into circuits via wires:

# a1 is the name of the gate, which is optional but useful for labelling gates
# y is the gate output, and a/b/c are the gate inputs (and supports one or more of these inputs, and one output)
# the inputs and outputs are wires, identified by name - everywhere that same name is used can be thought of as being all connected together by a physical wire
and a1(y, a, b, c)

# every argument except the last are the NOT gate's outputs, and the last one is the input
not (y1, y2, a)

## Gate-level modelling

In gate-level modelling, signal/wire values can be 0 (low), 1 (high), x/X (unknown/ignored value), or z/Z (high impedance).

Gate-level modelling for a full adder (addend bits as a/b, carry-in as c_in, carry-out as c_out, sum as sum):

// full_adder_gate_level_modelling.sv
// for a full adder, sum = a XOR b XOR c_in, and c_out = ((a XOR b) AND c_in) OR a AND b

module full_adder(input a, b, c_in, output c_out, sum);
// a logic is basically a connection/wire - here, we're declaring our wires
// a wire is equivalent to a wire logic, which is a logic that supports multiple drivers
// generally, we just use logic unless we need multiple drivers - the compiler will choose the best structure depending on how they're used
// there are also bit and byte, which are basically logics that can only be 0 or 1, not X or Z
logic w1, w2, w3; // these are wires that are used for connecting gates in the full adder together
xor x1(w1, a, b);
xor x2(sum, w1, c_in);
and a1(w2, a, b);
and a2(w3, w1, c_in);
or o1(c_out, w2, w3);
endmodule

This is a design module. It fully specifies the full adder as a self-contained unit.

While this looks like normal procedural code, with various function calls, it is more intuitive to think of this as a series of logical declarations - as if each of the xor/and/or lines were all continuously running at the same time.

A test bench is a set of definitions that instantiates the design module, and simulates the circuit under different input conditions to test for functional correctness. For our full adder example:

// full_adder_test_bench.sv
logic carry_in, x, y;
logic carry_out, z;

// "dut" stands for "device under test"
// below, we're basically instantiating a full adder (defined in the code above) that's connected to our logics, for testing purposes
// the .name(other_name) syntax is a "named port map", which lets us connect logics to devices by signal name
// alternatively, we can just map the signals in order, like full_adder dut(x, y, carry_in, carry_out, z), but this can cause issues if we change the inputs/outputs of the full adder
full_adder dut(.c_out(carry_out), .sum(z), .c_in(carry_in), .a(x), .b(y));

// this is a procedural block - inside, the statements are evaluated sequentially from top to bottom
initial begin // run once at time 0
carry_in = 0; x = 0; y = 0;
// delays must be used to ensure statements actually run sequentially - statements in between delays all execute in parallel
#10 x = 1; // delay 10 time units, then continue and set x to 1
#10 y = 1; // delay 10 time units, then continue and set y to 1
#10 carry_in = 1; // delay 10 time units, then continue and set carry_in to 1
#10 $stop; // delay 10 time units, then stop the simulation (this will not compile if trying to generate hardware from this) end // now we want to dump the signals out into waveforms so we can see them initial$dumpvars(0, full_adder_test_bench); // the first parameter is the dump level (0 means "dump all variables in the module and in the modules that instantiate it"), and the rest of the parameters are the module or variables to dump
endmodule

To compile, run iverilog -g2005-sv -s full_adder_test_bench.sv -o full_adder.vvp full_adder.sv full_adder_test_bench.sv:

• iverilog is the Icarus Verilog compiler.
• -g2005-sv specifies that the compiler should use the 2005 System Verilog standard.
• -s full_adder_test_bench.sv specifies the top-level/root module.
• -o full_adder.vvp specifies the output for the compiled Verilog.
• full_adder.sv full_adder_test_bench.sv specifies the modules to include in the compilation.

To simulate, run vvp -n full_adder.vvp -lx2:

• vvp is the Icarus Verilog simulation tool.

# 12/1/17

Complex number multiplier:

// recall that given two complex numbers $a$ and $b$, $p = a \times b = (\Re(a)\Re(b) - \Im(a)\Im(b)) + \imag (\Re(a)\Im(b) + \Im(a)\Re(b))$
// this needs 4 multiplications, 1 subtract, and 1 add
// we will use a synchronous circuit with one multiplier and one adder/subtractor to do this, in order to demonstrate synchronous design, and also to save space (at the expense of it being slower)
// therefore, we break the operation into 5 steps, each one only using one multiplication or one addition: pp1 <= a_real * b_real, pp2 <= a_imag * b_imag, p_real <= pp1 - pp2, pp1 <= a_real * b_imag, pp2 <= a_imag * b_real, and p_imag = pp1 + pp2, which we can implement with an FSM
module complex_multiplier(
output logic signed [7:-24] p_r, p_i, // this defines pr_ and p_i as fixed-point numbers with 24 bits to the right of the decimal point, and 7 bits to the left (actual value is the value of p_r/p_i as a two's complement integer, divided by $2^{24}$)
input signed [3:-12] a_r, a_i, b_r, b_i,
input clk, reset, go // the clock input, reset signal, and the trigger that starts the multiplication
);
logic a_sel, b_sel, pp1_en, pp2_en, sub, p_r_en, p_i_en;
logic signed [3:-12] a_operand, b_operand;
logic signed [7:-24] pp, sum; // partial product, sum
logic signed [7:-24] pp1, pp2;

// multiplier with two multiplexers for input, to select inputs between real/imaginary components of a and b
assign a_operand = ~a_sel ? a_r : a_i;
assign b_operand = ~b_sel ? b_r : b_i;
assign pp = a_operand * b_operand;

// update pp1 and pp2 registers on clock edge
always @ (posedge clk)
if (pp1_en) pp1 <= pp;
always @ (posedge clk)
if (pp2_en) pp2 <= pp;

assign sum = ~sub ? pp1 + pp2 : pp1 - -pp2;

// update output registers on clock edge
always @ (posedge clk)
if (p_r_en) p_r <= sum;
always @ (posedge clk)
if (p_i_en) p_i <= sum;

// finite state machine to drive the above hardware
enum bit [2:0] { step0, step1, step2, step3, step4 } current_state, next_state;
always @ (posedge clk or posedge reset)
if (reset) current_state <= step0;
else current_state <= next_state;
always @ * begin // update whenever go or current_state change
case (current_state)
step0: next_state = go ? step1 : step0;
step1: next_state = step2;
step2: next_state = step3;
step3: next_state = step4;
step4: next_state = step0;
endcase
end
always @ (current_state, reset) begin
a_sel; b_sel = 0; pp1_en = 0; pp2_en = 0;
sub = 0; p_r_en = 0; p_i_en = 0;
case (current_state)
step0: begin pp1_en = 0; end
step1: begin a_sel = 1; b_sel = 1; pp2_en = 1; end
step2: begin b_sel = 1; pp1_en = 1; sub = 1; sub = 1; p_r_en = 1; end
step3: begin a_sel = 1; pp2_en = 1; end
step4: begin p_i_en = 1; end
endcase
end
endmodule

Verilog has sized literals for numbers. For example, 16'h0001 means a 16-bit hexadecimal value 1, 1'b1 means the 1-bit binary value 1, and 4'd8 means the 4-bit decimal value 8.

Overview of assignment 1. Assignment involves filling out Verilog module stubs and writing test benches.

# 17/1/17

Room has been moved to HH 1102, which should be a lot closer to campus than OPT.

## Processor Design

There are multiple levels of abstraction when talking about a processor design. From most abstract to least abstract, the main ones are:

• Instruction-set architecture (ISA) - defines functional behaviour like instruction format, addressing modes, hazards, registers, privileges, exceptions, etc. For example, IA32 (also known as x86), IA64/AMD64 (also known as x86-64), PowerPC, MIPS, ARM, RISC-V.
• Microarchitecture - defines the hardware structure like pipeline structure, cache organization, branch prediction schemes. For example, Intel i386, Intel i486, Intel Skylake, MIPS R16000.
• Implementation - defines physical realization like gate design, transistor technology, fabrication technology.

In the 1980s the main focus in processor design was improving single core performance and clock speed. Today, focus is on more cores at a lower clock speed, and doing more with each cycle (modern Intel CPUs can do 4-8 instructions per cycle!). New microarchitecture-level techniques like instruction-level parallelism (parallelism within 1 thread) and hyperthreading (parallelism between multiple threads on one core), speculative execution (running both sides of a branch and then throwing away the wrong one once the branch is done), out of order execution (rearranging instructions to improve parallelism) have also resulted in significant performance gains. The main issue preventing more single core performance and clock speed is the power wall.

Dynamic power consumption is the power consumed that depends on the processor workload. When a transistor switches, it must use energy proportional to the capacitive load times the voltage squared. So the power used (switching frequency times switching energy) quicky increases linearly with the switching frequency, resulting in increased cooling requirements and worse battery life. One technique modern processors use to try to get around this is to reduce the switching voltage to quadratically reduce the switching energy required - for example, a 15% reduction in voltage results in a 28% reduction in switching energy.

There's also static energy consumption, the power consumed regardless of the processor workload. As transistors get smaller, the leakage current increases (due to quantum tunneling in the transistor junctions), which means more power used per transistor.

Modern techniques involve thread-level parallelism (doing multiple tasks at once), and data-level parallelism (doing something to multiple pieces of data at once).

Fynn's taxonomy characterizes four classes of parallelism:

• Single-instruction single-data - single processing unit acting on single pieces of data. For example, a core calculating one scalar value. For these, we can take advantage of instruction-level parallelism.
• Single-instruction multiple-data - single processing unit acting on multiple pieces of data. For example, DSP units, GPUs, vector instructions. For these, we can take advantage of data-level and instruction-level parallelism.
• Multiple-instruction single-data - multiple processing units acting on single pieces of data. Not really used in practice.
• Multiple-instruction multiple-data - multiple processing units acting on multiple pieces of data. For example, multithreading and multiprocessing. For these, we can take advantage of thread-level, data-level, and instruction-level parallelism.

Recall MIPS:

• 32-bit instructions, in one of two formats. First format: 6 bits for opcode, 5 bits for register operand 1, 5 bits for register operand 2, upper 16 bits for immediate value. Second format: 6-bit opcode, 5 bits each for register operand 1, 2, and 3, 5-bit shift amount, and 6-bit shift type specifier. See CS241 notes for more details about MIPS instructions.
• 32 registers R0 to R31, R0 always has value 0.
• In this course, we'll use a 5-stage pipeline for fetch, decode, execute, memory, and writeback. We can overlap up to 5 instructions at once.
• Instruction flow is linear (statically scheduled), and one instruction is completed per cycle (ideally). In other words, instructions are run in whatever order they're stored as, and we don't dynamically rearrange to improve parallelism.

# 19/1/17

Sometimes instructions in a pipeline depend on others. There are three main types of these dependencies:

• A read-after-write dependency is when an instruction needs to read out the value that some other instruction wrote - if the CPU ran both at once, we might read the wrong value. This is the only real dependency in scalar, statically-scheduled pipelines.
• A write-after-read dependency is an anti-dependency - if the CPU reorders these instructions so that they're in the opposite order, we might read the wrong value.
• A write-after-write dependency is an output dependency - if the CPU reorders these instructions so that they're in the opposite order, the final value of the register may be wrong.

When instructions aren't dependent on each other, we can more-or-less freely rearrange them. This allows efficient pipelining and parallelism.

Total sequential execution (TSE) is an assumption in our programming model that one instruction finishes before the next starts. TSE is sufficient but not necessary for semantic correctness. In fact, semantic correctness can be satisfied just by satisfying the inter-instruction dependencies of the original program - we can otherwise rearrange and overlap instructions however we want.

Since we have five stages (fetch, decode, execute, memory, writeback), an instruction that writes registers takes 4 cycles after it's been fetched before the write actually takes effect, assuming 1 cycle per pipeline stage. Likewise, an instruction that reads registers takes 1 cycle after fetching before actually performing the read.

A control dependency is basically a special case of a register dependency, where the register is the program counter. An instruction that branches (writes to the program counter) has its write take effect 1 cycle after being fetched (the MIPS datapath reads PC in fetch stage and writes in decode stage).

In the same way, an instruction that writes to memory has a memory dependency, and has its write take effect 3 cycles after being fetched.

Therefore, if we have two instructions $$A, B$$ that run at clock cycle $$i$$ and $$j$$, respectively, such that $$i > j$$:

• If there is a read-after-write register dependency from $$B$$ on $$A$$, we must satisfy $$j + 1 > i + 4$$. Hazards occur if $$j - i \le 3$$.
• If there is a write-after-read register dependency from $$B$$ on $$A$$, we must satisfy $$j + 4 > i + 1$$. Hazards are impossible without reordering instructions.
• If there is a write-after-write register dependency from $$B$$ on $$A$$, we must satisfy $$j + 4 > i + 4$$. Hazards are impossible without reordering instructions.
• If there is a read-after-write control dependency from $$B$$ on $$A$$, we must satisfy $$j + 0 > i + 1$$. Hazards occur if $$j - i \le 1$$.
• If there is a write-after-read control dependency from $$B$$ on $$A$$, we must satisfy $$j + 1 > i + 0$$. Hazards are impossible without reordering instructions.
• If there is a write-after-write control dependency from $$B$$ on $$A$$, we must satisfy $$j + 1 > i + 1$$. Hazards are impossible without reordering instructions.
• If there is a read-after-write memory dependency from $$B$$ on $$A$$, we must satisfy $$j + 3 > i + 3$$. Hazards are impossible without reordering instructions.
• If there is a write-after-read memory dependency from $$B$$ on $$A$$, we must satisfy $$j + 3 > i + 3$$. Hazards are impossible without reordering instructions.
• If there is a write-after-write memory dependency from $$B$$ on $$A$$, we must satisfy $$j + 3 > i + 3$$. Hazards are impossible without reordering instructions.

A hazard is a potential for a violated dependency (and therefore, violation of semantic correctness). In our pipelined, statically scheduled CPU, the hazards are a 3-cycle register read-after-write, and a 1-cycle control read-after-write.

To avoid hazards, we can insert no-op instructions - 3 NOP instructions after each register read-after-write, or 1 after every control read-after-write. These can be inserted either by the compiler, or the CPU can stall for the necessary number of cycles when the relevant dependencies are found. Compilers

However, a 3-cycle stall for a 3-cycle register read-after-write is quite long. We can shorten register/memory dependency stalls by using data forwarding - routing the output of later stages directly to earlier stages to avoid the need for a write/read - earlier stages can read directly from the later stages.

For our purposes, we can route the output of the memory stage to the execute stage (for $$j = i + 1$$), the output of the writeback stage to the execute stage (for $$j = i + 2$$), and the output of the writeback stage to the decode stage (for $$j = i + 3$$). With that in place, each stage always has the data it needs, so we never have to stall.

To implement data forwarding in each stage, we can use multiplexers that choose between either reading/writing registers, or taking input directly from later stages.

Turns out, this can't eliminate all our stalls, though it does make a lot of them much shorter. Consider lw R1, (R2) followed by add R4, R1, R1 - the load-word instruction doesn't have its result until the memory stage is done, so we must stall the add for 1 cycle so we can forward it from the write-back stage to the execute stage.

To solve the control dependency read-after-write stall (when we have a conditional branch), we might assume that the branch won't happen, keep executing the next instruction, and then squash/suppress it in the pipeline if we do end up taking the branch.

MIPS uses a different approach - the CPU will execute the instruction after the branch (the instruction after the branch is in a position called the branch delay slot), regardless of whether the branch is taken or not. The compiler will generally try to fill the branch delay slot with an instruction that doesn't depend on the branch, or NOP if it can't find one.

# 24/1/17

Recall that in static scheduling, instructions are executed in the order that they appear in, and the compiler is responsible for arranging instructions in a way that avoids hazards.

Code example with instruction reordering by the compiler for hazard avoidance.

The cycles per instruction (CPI) is a measure of CPU performance, and is ideally somewhere around 1 for our scalar processor. However, penalty cycles caused by hazards will increase this. For example, if we have 6 instructions that incur 3 penalty cycles, we get $$\frac{6 + 3}{6} = 1.5$$ as the CPI.

The program execution time is $$\frac{n \times \text{CPI}}{f}$$, where $$n$$ is the number of instructions in the basic block, and $$f$$ is the clock frequency.

Local scheduling is when the compiler rearranges instructions within a basic block. A basic block is a sequence of consecutive instructions such that if one instruction is executed, all of them are - that means only the first instruction can be the target of a branch instruction, and only the last instruction can be a branch instruction (or the second to last for MIPS, due to the branch delay slot).

To perform local scheduling, we first identify the hazards in the code, leaving slots to put other instructions in according to how many penalty cycles are present. For example, if we have a load followed by a dependent add instruction, we might have 1 penalty cycle in our architecture, so we would leave one slot for an instruction. Then, we try to fill in the slots with other instructions while keeping dependencies satisfied.

Global scheduling is when the compiler rearranges instructions across multiple basic blocks.

A common example of this is loop unrolling, where the body of a loop known to execute $$n$$ times is simply duplicated $$n$$ times - this reduces branching overhead, and makes local scheduling better since it may result in larger basic blocks. While loop unrolling can make our code faster, it results in larger programs, which might cause more instruction cache misses. The compiler must make the tradeoff depending on what would be faster.

# 26/1/17

Project part 1 marks come out on Monday. Part 2 is out on the course website, building a 5-stage MIPS pipeline. In-class overview of assignment.

In the scalar pipelines we are looking at, exceptions are events that alter program flow. Exceptions are harder to handle than normal program flow because they can occur at any time, and we must make sure registers and memory are properly stored/restored to ensure the TPE assumption is upheld.

Exceptions can be generated by things like interrupts from peripherals, unusual cases for some instructions (invalid opcode, page fault), and trap instructions (instructions that invoke the OS).

Precise exceptions are those that are recoverable - we can return to thread after the exception is handled.

What if we have multiple instructions that generate exceptions simultaneously in the pipeline? For example, we might have a load instruction that tries to dereference a null pointer at the memory stage, while we're trying to decode an invalid instruction in the decode stage - the load instruction generates an exception from the memory stage, and the invalid instruction generates one at the decode stage. To make sure we handle all instructions in the pipeline, we only mark the instruction as having caused an error, then actually handle them in the write-back stage.

There are three types of exceptions:

• Interrupts are asynchronous exceptions, usually caused by peripherals like keyboards and timers.
• When an interrupt occurs: the pipeline stops fetching, runs until the pipeline is empty (the pipeline is drained), saves state, invokes the interrupt handler routine, restores state, and then jumps to the next instruction in the interrupted thread..
• Faults are synchronous exceptions, handled in write-back stage.
• When a fault occurs: the pipeline stops fetching, cancels all instructions in the pipeline (the pipeline is flushed), saves state, invokes the fault handler routine, restores state, and then jumps back to the instruction that caused the fault.
• Traps are synchronous exceptions, detected in write-back stage.
• When a fault occurs: the pipeline stops fetching, finishes the trap, cancels all instructions in the pipeline (the pipeline is flushed), saves state, invokes the trap handler routine, restores state, and then jumps to the instruction after the one that caused the trap.

Very Long Instruction Word (WLIV) architectures are architectures that have very large instructions, generally able to issue multiple operations at once. For example, Intel Itanium has 128-bit instructions, able to contain up to 3 operations each.

If and only if an architecture can handle multiple instructions per cycle, it is known as a superscalar architecture.

VLIW architectures are generally statically scheduled, where each instruction contains multiple scalar operations. They generally don't do hazard detection or even full data forwarding, instead relying on the compiler to schedule operations in a good way.

The advantage of VLIW architectures is that the compiler gets more control over the scheduling and how different functional units (FPUs, ALUs, etc.) are used, so we can devote more transistors to the actual functional units themselves, rather than hazard detection and so on.

Consider a VLIW architecture where each instruction contains two memory operations, two floating point operations, and one ALU/branch operation. Each operation can have its own pipeline, so we'll give memory operations the usual fetch/decode/execute/memory/writeback, floating point operations a fetch/decode/execute 1/execute 2/execute 3/writeback pipeline, and ALU/branch operations a fetch/decode/execute/writeback pipeline. We'll also say branches resolve in the decode stage for all types of operation, so we need 1 branch delay slot (since decode is 1 pipeline stage after fetch).

Fetches occur at the same time for every pipeline - one instruction is fetched for every cycle.

Assuming full data forwarding is implemented, the latency of memory operations could be up to 1 if the execute stage has a dependency on the memory stage, the latency of ALU/branch operations could be up to 1 if the decode stage has a dependency on the execute stage, and the latency of floating point operations could be up to 2 if the execute 1 stage has a dependency on the execute 3 stage.

Slot utilization is how well the available operation slots for each instructions are filled, defined as $$\frac{\text{slots filled}}{\text{number of instructions} \times \text{slots per instruction}}$$.

# 31/1/17

;wip: global scheduling, software pipelining, catch up on the rest, gotta run for interviews

# 2/1/17

VLIW processors have many techniques for improving throughput and parallelism.

Speculative execution is basically just when we keep executing down the predicted path when we reach a branch, regardless of whether that path is actually taken or not, and then undoing the effects of that if we don't end up taking the predicted path. Implemented correctly, it eliminates branch delays when the branch predictor is right.

Speculative loading is a technique in which the compiler tries to move memory load instructions upward to avoid memory latency, even if they possibly won't be needed, like if a branch isn't taken.

In IA-64, we denote speculative instructions with the .s suffix - ld8.s is the speculative version of ld8. We denote an advanced load version of the load instruction with the lw8.a suffix, and the ALAT check version of loads with lw8.c. Each lw8.a instruction is paired up with a lw8.c instruction later on - the lw8.a instruction starts the actual data transfer into the ALAT, and then when the corresponding lw8.c instruction is encountered, we can just load it from the ALAT rather than making a trip all the way to memory. Cache entries are invalidated if their addresses are stored to in between those instructions.

Deferred exceptions are a technique for making speculative execution easier. If we want to hoist an instruction above a conditional branch, and the exception causes an exception at runtime, we need a way to mark the instruction in a way that the exceptions are only handled if the conditional branch isn't taken, in order to make sure the hoisted code works the same as the original.

Consider a LW R1, (R5) instruction, which gets hoisted above a BEQZ R4, L1 instrucion. To implement deferred exceptions, we can add a poison bit to each register, and if and only if the speculatively executed instruction raises an exception, we can set the poison bit instead. Then, when we reach the original instruction, we can check for the poison bit being set, and if it is, actually raise the exception there.

Predicated execution is a technique in which instructoins only commit their results if some predicate is true. This is useful because we can often avoid branches and use predicates instead, therefore avoiding branch delays. IA-64, for example, has 64 1-bit predicate registers, which can be set by various operations like comparison, and used to enable/disable different instructions.

Predicated execution essentially lets the compiler turn control dependencies into data dependencies, by letting it change branches into predicated instructions. All of the instructions run, but only the ones with true predicates will actually have an effect. It's very useful when we have lots of small basic blocks.

For example, in IA-64, we can denote p1 = r1 == 0; p2 = r1 !== 0 as cmp.eq p1, p2 = 0, r1;;, where p1 to p64 are the predicate registers. Then, we can do (p1) add r4 = r4, 1 to increment r4 if and only if predicate register p1 is 1.

The Itanium architecture, IA-64, was also called an EPIC architecture (explicitly parallel instruction computing - VLIW with data forwarding and dynamic events like cache misses). It has 128 65-bit (64 bits, plus a poison bit, called the Not-A-Thing bit) integer registers, 128 82-bit floating point registers, and 64 1-bit predicate registers.

# 7/2/17

## Out of order pipelines

Out of order pipelines dyanmically schedule instructions. The benefit is that they can take advantage of instruction-level parallelism not available at compile time, and allow us to make compilers a lot simpler, at the expense of more complicated CPUs.

When we're designing these pipelines, we need to design around hazards in register data, memory data, and control flow.

Out of order execution is inherently superscalar - there's no point in doing dynamic scheduling unless we have multiple functional units.

Scheduling is usually done using Tomasulo's algorithm, an improved version of previous techniques like scoreboarding. It was invented around 1967 and used for an IBM FPU.

Tomasulo's FPU has two functional units - a 2-stage pipelined, 2-cycle floating point adder, and a non-pipelined, 3-cycle floating point multiplier. The adder has 3 reservation stations, and the multiplier has 2.

The FPU also has 3 sets of registers:

• Float Registers (FLR) F0, F2, F4, F6 are the floating point registers used for FPU computation.
• Store Data Buffer (SDB) stores 3 results to be written to memory.

The common data bus allows parts of the FPU to broadcast a float along with a 4-bit source tag to specify where it came from (like a reservation station or the FLB).

Units like reservation stations, the FLR, and the SDB hold data and the data's tag.

The key concept here is the reservation station, and the use of tags to

Here's the FPU's workflow:

1. Up to two instructions are dispatched (in their original order) to the reservation station.
2. Operands are copied to the reservation station.
3. The destination register number is compared to the reservation station number.
4. Instructions are sent to the functional unit (adder or multiplier) when all of its operands are ready and the functional unit is free.
5. After computing the result, the functional broadcasts the result on the CDB, and waiting units like the FLR, reservation stations, and SDB update their contents.

;wip: what even is this, read https://en.wikipedia.org/wiki/Tomasulo_algorithm#Implementation_concepts

Tomasulo's algorithm has imprecise exceptions. For example, if register F4 is renamed by instruction X in cycle 1 and renamed by instruction Y in cycle 2, then W never writes its result to register F4, so if W faults, then recovery isn't possible since the semantics require the result be written. There's no real way to recover from this except to restart the computation from an earlier point.

An out of order pipeline looks something like the following:

• Fetch - retrieve multiple instructions per cycle (in order).
• Decode - identify instruction boundaries, opcodes, operands (in order).
• For Intel x86 processors, CISC instructions are translated into internal RISC instructions.
• Dispatch - send the decoded instructions off to the reservation stations (in order).
• Issue - send instructions from reservation stations to the execution units (out of order).
• Finish - broadcast result in the forwarding bus (out of order).
• Commit - update destination registers in order (in order).
• Retire - update memory-store instructions (in order).
• This is where SW instructions actually take effect.

The re-order buffer (ROB) remembers program order, so we can put things back in order after the Issue stage. At the dispatch stage, each instruction gets an entry in the ROB. At the finish stage, the ROB entry is updated to ;wip. At the Commit stage, the ROB entry is removed.

The ROB also keeps track of which instructions raise exceptions, so we can handle them at the Commit stage of the pipeline (this is analogous to the Itanium case, where we used a poison bit to only raise exceptions when we reach the Writeback stage). This ensures that the processor state is correct/precise at the instruction that raised the exception, which allows us to recover from the exception.

The width of an out of order pipeline is the number of instructions dispatched/committed per cycle.

# 9/2/17

Assignment 1 (not project 1) is due after reading week. It's about scheduling instructions in a pipelined processor.

### Register renaming/assignment

Data dependencies within a block of code form a data-flow graph (DFG) in the processor, where each edge $$a \to b$$ represents the fact that $$b$$ depends on the result of $$a$$, so $$b$$ shouldn't run before $$a$$. To help us run Tomasulo's algorithm, we can draw out the DFG and arrange it to minimize height.

The longest path (also known as the critical path) within the dataflow graph is the minimum number of cycles we can execute those instructions in - this is called the dataflow limit. In other words, with one thread, it's impossible to execute faster than the length of the critical path of true dependencies.

Operations in the CPU produce values, which are then assigned to registers. The live range of a value is the range within the code starting from when the value is first written to a register, up to and including the last use (when the value is last read from the register).

DFGs can also have virtual dependencies between pairs of instructions, called false dependencies, which give us write-after-read or write-after-write hazards if we change the order of those instructions.

Since there are a limited number of registers, the compiler needs to recycle registers - the compiler can use a register for a new value after the current value has been read for the last time. However, reusing registers imposes write-after-read dependencies.

Register renaming is a technique that tries to get rid of false data dependencies, which helps us get better parallelism. The

To implement this, each operation's output value is assigned to a temporary rename register within a rename register file. The actual destination register is associated with the rename register ID, and on the commit step, we can actually copy the value from the rename register to the destination register. If we have enough rename registers, we can essentially have single assignment for each register, eliminating false dependencies and allowing us to improve parallelism.

Register renaming is used by Tomasulo's algorithm.

In hardware, we have an ARF (architected register file), which maps architecture registers (R0-R31 in MIPS) to rename registers. The RRF (rename register file) has one entry for each ROB entry, and the

With ;wip, we replace the ARF with an RAT (register alias table), which maps registers R0-R31 to architected and rename register entries in the PRF. The RAT has two columns - the retirement RAT (which has indices of architected register in the PRF), and the front-end RAT (which points to renamed register indices in the PRF). In case of an exception, the retirement RAT is copied to the front-end RAT, which undoes any renames done by the instructions after the fault, allowing us to raise precise exceptions.

;wip: read textbook section 3.4.6 (or 5.3? dunno what the numbers mean) and https://en.wikipedia.org/wiki/Register_renaming

The PRF (physical register file) stores the actual contents of the architected and rename registers. The

Here's what this looks like:

• During dispatch:
• Allocate an ROB entry
• Allocate a reservation station entry
• Rename the destination register of the operation in the PRF and RAT.
• Copy operands from PRF (using front-end RAT) to reservation station (the values of the operands if the data is available, or the register number if not)
• Rename the front-end RAT with the rename register number.
• On commit:
• Update the retirement RAT.
• Add the old PRF number to the PRF free list.

Delayed read is a technique in which the reservation station only stored the operand tag and status of the data, rather than the data itself, reducing data duplication. It can also time instruction issuing to correspond with the broadcast of the result of the operation it depends on. ;wip: what

# 14/2/17

Register renaming is a technique for undoing the hazards caused by register recycling.

Speculative execution is a technique in which we predict the direction of a branch and start executing that path while the branch is being resolved. However, this presents issues for register renaming, because speculatively executed instructions need to rename registers too - we must make sure the effects of the renaming are undone if the branch turns out to be wrong. To do this, we can do either rollbacks or checkpointing.

To do rollbacks, we save frontend RAT entries into the ROB (reorder buffer) as soon as the speculative instruction is dispatched. To recover upon a mispredicted branch, we can copy back the rename register number from the ROB to the frontend RAT, one at a time, until the branch instruction is actually completed. This works, but it's quite slow because of all the copying we have to do over multiple cycles.

To do checkpointing, we save a full copy of the entire frontend RAT as soon as the speculative instruction is dispatched. The saved copy of the frontend RAT is copied back to the frontend RAT is copied back if we mispredict the branch. This can be a lot faster than rollbacks, but it requires more storage and hardware to support this, depending on how many branches we are speculating past.

For Intel Skylake chips, we have 224 ROB entries, 180-entry integer PRF, and 168-entry floating point PRF. Haswell chips support up to 48 checkpoints - the ability to speculatively execute 48 branches at once!

## Memory Data Flow

Data in memory includes the data segments, stack, and heap. Loads usually start a chain of instructions, because they have to wait on RAM, which is slow compared to processors. How do these work with pipelined processors?

For store instructions:

1. Upon dispatch, calculate the effective address and store it with the data in the store buffer (SB).
2. After commit, write the store buffer entry out to the processor cache (so it gets written to RAM).

1. Upon dispatch, calculate the effective address, and read the data from either the store buffer (if available) or the processor cache (so it gets read from RAM).
2. After commit, broadcast the destination read on the forwarding bus, so the other units know that the destination register has been updated.

We want to be able to issue store instructions out of order, to improve performance.

Speculative loads are loads performed under the assumption that they are unaliased to recent stores - loads that are done from the cache, not the store buffer. When we do speculative loads, we need to detect if the load gave a valid result, and if not, retry it.

1. Upon dispatch, calculate the effective address. If a committed store is in the store buffer, forward the data along the forwarding bus. Otherwise, get the data from the processor cache. Add the speculative load's location and effective address to the FLB (finished load buffer).
2. After commit, remove the created entry from the FLB.

The store process, when using speculative loads, becomes:

1. Upon dispatch, calculate the effective address and store it with the data in the store buffer.
2. Upon commit, if aliased loads are present in the FLB, suppress those speculative loads. Write the store buffer entry out to the processor cache (so it gets written to RAM).

This ensures that if a speculative load finishes before a store completes, the effect of the load gets undone.

Another technique for speeding up loads is data prefetching, where data gets loaded from RAM into the processor data cache before it's actually loaded, so when we actually load it, we can do so directly from the cache. For example, in AMD64 the the prefetch instruction does this.

Data prefetching can also be done automatically - hardware prefetchers might detect patterns of sequential access, strided access (access with a linear offset each time, like accessing the diagonals of a matrix or accessing the columns of a row-major ordered matrix). One downside of this is that it might bring in data we don't actually want, reducing cache hit rate. Some systems like SPARC have a separate cache for prefetched data, called the prefetch buffer, to avoid this issue.

# 16/2/17

Review of caching and memory (see ECE222 notes for more details):

• Cache lines store the tag (the other bits of the address, enough to reconstruct the address later), the flags (dirty, valid, etc.), and the data itself.
• Direct-mapped caching: blocks of memory are mapped to cache lines according to their block index modulo the number of blocks.
• Fully-associative caching: blocks of memory are mapped to cache lines in an LRU fashion, and associative memory allows lookup by block index.
• Set-associative: the first few bits of the block index are used to determine which set of cache lines to map to, and each set of cache lines forms a fully associative cache (fully-associative caching can be thought of as 1-way set-associative caching, and direct-mapped caching can be thought of as all-way set-associative caching). Most instruction/data caches work best in practice at around 8-way set associative caching.
• Three types of cache misses: compulsory (data has never been in the cache before), capacity (cache is too small to hold the data, so we kicked the data we needed earlier), conflict (the associativity is too low).
• Suppose we have 32-bit addresses, and 1024 direct-mapped 64-byte cache lines. Then the first 16 bits of the address are stored as tags, the next 10 bits are used as the block index (because taking the bits before the block offset modulo 1024 is the same as just taking the last 10 bits), and the last 6 bits are for the offset within the block (addresses point to individual bytes).
• Virtual memory provides protection against other processes modifying a given process' memory, and allows processes to work with a larger, virtual address space.
• The MMU translates virtual to physical addresses. It translates the virtual page number to the physical segment number, and keeps the offset bits.
• The MMU can use multiple different indexing modes:
• PI/PT indexing (physical indexing, physical tagging) - MMU entries are indexed and tagged by physical addresses, which means translation from virtual to physical address completes before cache lookup starts - this is easy to implement but slow.
• VI/PT indexing (virtual indexing, physical tagging) - MMU entries are indexed by virtual addresses but tagged by physical addresses, which means translation and cache lookup can occur in parallel. However, VI/PT results in the synonym problem, where physical blocks are cached in the MMU multiple times for shared memory, since each physical address can have a different virtual address for different processes. To solve this, the OS can do page colouring to make sure the virtual page number is always the same in all virtual address spaces. This is widely used in practice.
• VI/VT (virtual indexing, virtual tagging) - MMU entries are indexed and tagged by virtual addresses, which means we only need to translate virtual page numbers to physical page numbers upon cache misses. This also has the synonym problem, but it also has the homonym problem, where the same virtual tag can refer to different possible physical blocks. OSs solve this by storing the process ID in the tag. This is rarely used in practice.

In MIPS, control flow happens either as a branch instruction (which does PC-relative addressing), or as a jump (which does indirect addressing).

Speculative execution has three main concerns:

• Target speculation:
• The branch target buffer maps branch instruction addresses to branch target addresses. It can also store branch predictors by the address of the branch instruction that it predicts the target of, but we often use a separate pattern history table (PHT) that stores branch predictors, directly indexed by the lower-order bits of the address of the branch instruction (aliasing is possible, but that's fine for our use case).
• In the fetch stage for a branch instruction, if the branch target buffer has an entry for the address of this branch instruction, we load the branch target address into PC, giving us a 0-latency branch.
• In the execute stage, the branch unit actually resolves the branch to validate whether the prediction is correct or not. If it correct, we can add/update the branch target buffer entry, or remove the branch target buffer entry if it wasn't correct (to delete the prediction).
• This works well for branches, but not really for jumps, which use indirect addressing. For example, returning from a subroutine called from many places means that the branch target address will often change, causing frequent branch mispredictions.

# 28/2/17

;wip: missed due to interviews

;wip: out of order pipelines: condition spec (section 8), dynamic methods based on branch history, adding a pattern history table to the BTB (branch target buffer)

# 2/3/17

The branch predictors we'll look at are 2-bit saturating counters, like one of the questions in Project Part 1. Whenever we take a branch, we increment the counter, and whenever we don't, we decrement. The counter saturating means that incrementing 11 stays at 11, and decrementing 00 stays at 00. The branch is predicted to be taken whenever the counter is 10 or 11. Branch predictors are added to the PHT whenever we encounter a branch instruction. They're initialized to 10 ("weakly taken").

Nair conducted a study in 1995 to check on how accurate branch prediction is in the real world, and found that prediction accuracy was 85%+ on many commonly used programs like SPICE and GCC. This is actually within 1% of an optimal 4-state branch predictor FSM, so saturating 2-bit counters are actually really good at predicting branches in real-world situations (they tried every one of the 5248 possible 4-state branch predicting FSMs!). They also looked at saturating counters with a different number of bits, and aliasing vs. the number of low-order bits we index by in the PHT, and characterized performance as those variables were changed.

;wip: correlating predictors, section 8.2.3, bimodal vs local vs global predictors

A bimodal branch predictor is the kind we've been looking at so far - the PHT is indexed by some subset of the bits of PC. This works decently well, but it doesn't capture behaviour like alternating branches (the branch switches between taken and not taken every time we execute), or similar patterns in branching.

A local branch predictor tries to capture this. To predict a branch, we index some bits of the branch instruction address (the value of PC) into a history table, which stores the corresponding last $$k$$ branch results (for example, 10110 means the branch instruction branched 3 times out of the last 5 executions). Then, those last $$k$$ results are XORed with a subset of bits of the branch instruction address, and the result is used as indices in the PHT, which stores the usual 2-bit saturating counters. In other words, $$\text{prediction} = \text{saturating_counter_prediction}(\text{subset of bits of PC} \oplus \text{PHT}[\text{history}[\text{another subset of bits of PC}]])$$. This allows us to predict based on the history of branch results for individual branch instructions, and it works really well in practice.

A global branch predictor detects correlated behaviour (like two different branch instructions that only branch when the other doesn't), saves the last $$n$$ branch decisions of any branch instructions in a global history register (GR), a plain register that stores historical data for all branches. The global history register is used as indices in the PHT - we can look up our 2-bit saturating counters in the PHT by the history of all branches. Turns out that this doesn't work as well as local branch prediction, mostly because of aliasing - global history often looks the same for a lot of different branch instructions that behave very differently. In other words, $$\text{prediction} = \text{saturating_counter_prediction}(\text{PHT}[\text{history of last } k \text{ branches}])$$.

Real-world systems often combine a global and bimodal branch predictor, indexing into the PHT based on the global history XORed with some subset of the bits of PC. This scheme is what gshare uses, and performs almost as accurately as local branch prediction in real-world uses. However, it has one less level of indirection compared to local predictors (local predictors have to look up a value in history and then PHT, while we only have to look up a value in the PHT), so it can be faster.

Some processors implement meta branch predictors - running different types of branch predictors in parallel, and changing the one used based on their previous performance for each particular branch instruction.

# 7/3/17

Branch prediction examples, step by step. In the diagram, branch results are shifted into the right side of the history entries.

A branch predictor needs to both predict branches, check the results of those branches after they're resolved/executed, and then update the predictors based on that information. When the predictor predicts wrong and we're speculatively executing, we must also rollback the effects of the execution and restart it at the correct address.

To implement this:

• Each branch instruction is assigned a tag bit index (e.g., bit 4 or bit 2).
• Each instruction has a set of tag bits that represent whether it's being speculatively executed for a given tag (e.g., a tag of 0101 means the instruction is speculatively executing for the branch with tag bit index 0 and tag bit index 2). The tag bits are set when the instruction is being dispatched to the reservation stations.
• When a branch is resolved, the tag bits and the result of the branch are broadcast:
• If the prediction was correct, instructions with the branch instruction's tag bit index set in their tag bits have that bit cleared (e.g., if a branch with tag bit index 2 resolves correctly, we clear bit 2 of te tag bits of all instructions).
• If the prediction was not correct, instructions with the bit in the tag bits at the tag bit index set are cancelled, the ROB tail pointer is moved back to the branch instruction, and the frontend RAT checkpoint is restored.

Parallel programming allows us to execute multiple things at the same time on multiple processors. Generally this is done by using shared memory (pthreads, etc.) or message passing between processes (MPI, Scala, etc.). In this course we'll look mainly at the shared memory model.

For our purposes, processes are collections of threads with an address space.

A critical section is code that accesses shared data. We want only one thread to be executing a critical section at a time, and we usually enforce this using locks. In CS350 we looked at the software side of implementing these syncrhonization primitives, but now we'll look at the hardware primitives that allow us to implement these operations.

Common schemes for hardware synchronization from CS350:

• Compare-and-swap - atomically load some memory, compare it to a given value, and conditionally store it back (e.g., CMPXCHG in x86).
• Fetch-and-add - atomically load some memory, add a value to it, and store it back (e.g., fetchadd4 in IA64).
• Load-linked/store-conditional - use load-linked to load some memory, then store-conditional tries to store, failing if the linked location was changed since we load-linked it or if there was an interrupt (e.g., LL/SC in MIPS).

With any of these, we can implement a spinlock, the basic primitive needed to implement all of the other synchronization primitive. For example, with the compare-and-swap primitive cmpswp in MIPS:

spin: cmpswp LOCK_VARIABLE, 0, 1 ; if LOCK_VARIABLE is 0, set it to 1
bne spin ; loop until we successfully set LOCK_VARIABLE to 1 from 0
; we now hold the spinlock, and can do our critical section stuff
sw R0, LOCK_VARIABLE ; release the lock

Another example, with the ll/sc primitives in MIPS:

spin: ll R1, ADDRESS
;wip: where's the critical section?
sc R1, A
bne spin 

# 9/3/17

POSIX threads are implemented in the pthreads library which expose functionality for thread creation/destruction and primitives like mutexes.

#include <pthread.h>
#include stdio.h

#define ARRAYSIZE 1000000

double sum = 0.0;
double a[ARRAYSIZE];

int main(void) {
// initialize a here

pthread_mutex_init(&sum_mutex, NULL); // initialize mutex, with the default (NULL) attributes
for (int i = 0; i < NTHREAD; i ++) {
tid[i] = i;
pthread_create(&threads[i], NULL, add_array, &thread_ids[i]); // create thread from the add_array function, called with argument &thread_ids[i], and store the thread object in &threads[i]
}
for (int i = 0; i < NTHREAD; i ++) {
}
printf("array sum: %f\n", sum);
}

double my_sum = 0.0;
for (int i = start; i < end; i ++) {
my_sum += a[i];
sum += my_sum;
}
}

With the OpenMP library, we can use compiler directives to automatically create threads and assign work to them. This library is widely used in real-world programs. For this example we'll use C++:

#include <iostream>
#include <omp.h>

using namespace std;

const int ARRAYSIZE = 1000000;
double sum = 0.0;
double a[ARRAYSIZE];

int main() {
// initialize a here

double my_sum;

// start the parallel computation using OpenMP, with shared memory, a thread-local variable my_sum, and 4 threads (this defaults to the number of logical cores)
#pragma omp parallel default(shared) private(my_sum) num_threads(4)
{
my_sum = 0.0;
#pragma omp for
for(int i = 0; i < ARRAYSIZE; i ++) {
my_sum += a[i];
}
#pragma omp critical
sum += my_sum;
}

// an alternative way to the above block
#pragma omp parallel for default(shared) reduction(+ : sum) num_threads(4)
for (int i = 0; i < ARRAYSIZE; i ++) {
sum += a[i];
}
}

Symmetric multiprocessing (SMP) describes a shared memory multiprocessor architecture with uniform memory access (every processor can access every memory location with the same access time).

To implement uniform memory access times, we might put processors and memory modules on a shared bus (memory accesses are broadcasted on the bus), or use a crossbar switch (access to memory banks routed by processor).

SMP systems tend to have limited scalability, topping out at a few tens of processors, because of the need for uniform memory access.

Non-uniform memory access (NUMA) is the opposite of UMA architectures - they allow processors to have different access times to different memory locations. This allows us to scale them up a lot more (e.g., IBM Blue Gene has over 65000 nodes!).

To implement non-uniform memory access times, we might use a scalable interconnect system like a tree network (heirarchy of nodes), mesh networks (nodes connect to adjacent nodes, like on a grid), torus networks (nodes connected in a grid that wraps around), hypercube networks (nodes connected with every node with a binary ID that differs by 1 bit), and so on.

In a NUMA architecture, we must ensure that the memory stays coherent. Coherent shared memory means that all processors will see the most recent write to any memory location (i.e., cache coherency for memory). This can be implemented by broadcasting writes on the memory bus (write-through caching), or by broadcasting invalidation messages on the bus, so future reads to that memory location will request its updated value (write-back caching).

One way to do ensure cache coherency is the MESI (Modified Exclusive Shared Invalid) protocol. In this protocol, caches are divided into blocks, and each cache block is either:

• MODIFIED: local value is different from value in main memory (dirty), and this is the only cached copy.
• To read/write from this block, we can just directly use the cached value.
• When the processor needs to evict the value from the cache, it performs the writeback to main memory and sets the block to INVALID.
• Upon a READ message for this value from another processor, the processor responds with a DIRTY message, performs the writeback to main memory, then sets the block to SHARED. ;wip: what's a DIRTY message?
• Upon a WRITE message for this value from another processor, the processor responds with a DIRTY message, performs the writeback to main memory, then sets the block to INVALID. ;wip: what's a DIRTY message?
• EXCLUSIVE: local value is same as value in main memory (clean), and this is the only cached copy.
• To read from this block, we can just directly use the cached value.
• To write to this block, we can change the value and make the block MODIFIED.
• When the processor needs to evict the value from the cache, it sets the block to INVALID.
• Upon a ;wip
• SHARED: local value is same as value in main memory (clean), and there may be other cached copies. ;wip
• INVALID: local value is different from value in main memory (dirty), and there may be other cached copies. ;wip

Since all the memory messages are on the bus, the MESI protocol lets us store all the cache block states in a cache tag array, which replaces all the dirty bits and status bits. The cache controller is responsible for monitoring the bus and keeping the cache tag array up to date.

# 14/3/17

Tips about project part 3: make sure to copy the .x files into the project folder.

MESI is a snooping protocol, which means that processor caches are both listening to requests from the processor, and snooping on traffic generated by other processors on the bus.

Suppose we have processors P1, P2, P3, all sharing one block B1 of memory. P1 writes to B1, then P2 reads from B1. In a MESI system, this shows up as:

1. Initially, B1's corresponding cache entries for each processor are all in the SHARED state.
2. P1 writes to B1, which sends out a P1 upgrade message on the bus. This sets P1's cache entry for B1 to MODIFIED, and P2 and P3's cache entry for B1 to INVALID.
3. P2 reads B1, and since P2's cache entry is INVALID, P2 doesn't do anything. However, P1 gives a snoop response of DIRTY and performs a writeback. This sets all cache entries for B1 to SHARED again.

In addition to MESI, there's also MSI, which implements cache coherency without the EXCLUSIVE state, folding it into the SHARED state instead. MESIF and MOESI are other variants, based on the idea that when there's a cache miss, it's faster for another cache to supply the value than for the main memory to.

MOESI is used by ARM and ARM, and is a probing protocol rather than a snooping protocol - processors can directly ask other processors for data. Basically, this just adds an OWNED state, which means there's only one copy of the cache that serves this particular block. If a processor's cache holds a MODIFIED block and another processor does a bus read, the first one's block becomes an OWNED block, and the first processor forwards the block to the requesting processor. Likewise, bus writes modify the owner's and the writer's copy. The owner of the block simply does the writeback before evicting the block from its cache. MOESI is nice because it reduces writebacks to main memory.

MESIF is used by Intel, and is another probing protocol. Basically, this just adds a FORARDING state, . For a SHARED block, only one of those holders of that SHARED block should forward the block when there's a READ/WRITE on the bus. The most recent reader of a block holds it in the FORWARDING state and does the forwarding. After responding to a bus READ/WRITE, it transitions to the SHARED state (the requesting processor now has the FORWARDING block). This is nice because it makes caches take turns forwarding the data. Also, the FORWARDING block has the freshest copy of the block's data, so it's not likely to evict that block.

# 16/3/17

To implement spinlocks with the atomic exchange instruction:

     daddui R2, R0, 1 ; set R2 to 1
spin exch R2, (R1) ; atomically exhange the value of the lock with 1
bnez R2, spin ; if the lock was originally held (originally had a value of 1), keep spinning

The exch is atomic, so it always generates a write miss (which then causes other lock holders to do writebacks) - this is a lot of bus traffic. This is because the last holder of the lock has the lock value as MODIFIED in memory, and the rest has it as an INVALID value, and this switches every time there's an EXCH instruction.

What we really want is for the lock to be a SHARED value between all the nodes trying to get the lock, at least most of the time. We can implement this by non-atomically reading the value first before trying to atomically exchange it:

spin ld R2, (R1)
bnez R2, spin
exch R2, (R1)
bnez R2, spin

The non-atomic read causes a read miss rather than a write miss, and makes the node with the MODIFIED lock value and the node that caused the read miss to change their lock value to SHARED. After the read miss is resolved (so the lock value becomes SHARED), nodes can now spin on their local cache values. When the lock holder releases the lock, the lock value is invalidated in the other nodes, allowing them to race to try to acquire that lock.

A directory implementation of NUMA has a directory that stores the state of each block of memory, stored with each memory block. Each directory is a buffer with one entry per block, each entry storing:

• The block number corresponding to the entry.
• A bit representing whether the block is modified.
• A sharing vector - a sequence of $$n$$ bits (where $$n$$ is the number of nodes) representing which nodes have an up-to-date copy of the data.

Note that this doesn't explicitly represent the EXCLUSIVE state - this is because EXCLUSIVE is simply a SHARED state where only 1 of the sharing vector bits is set.

The directory adds overhead for each block, specifically $$\frac{\text{directory row or entry size}}{\text{block size}}$$. The general architecture is that each processor is connected to a memory and a directory, and these three are all connected to a scalable interconnect.

A directory implementation is comparable to a snooping implementation - it's faster and has less bus traffic, but it adds a significant amount of overhead to actually track the memory states.

Suppose P1 writes a piece of data, originally SHARED between P1 and P2. P1 must then send an UPGRADE to directory 1 to make P1 the owner of that data, then directory 1 sends INVALIDATE to the rest of the processors that have a copy of the data. P1 is now

When P2 reads after the above happens, P2 sends a READ MISS to directory 1, then directory 1 sends a FETCH to P1. P1 sends a WRITEBACK to directory 1 to make sure it has the most recent data. Then, directory 1 sends a DATA REPLY to P2, completing the read.

## Multi-level caching

There are two schemes for multi-level caching:

• Inclusive: data in faster/higher levels (like L1) is a subset of data in slower/lower levels (like L2, L3). Fetches in higher levels are also fetched into lower levels. In other words, data is always replicated.
• Exclusive: cache levels don't replicate data.
• Non-inclusive: data might be replicated, or it might not be.

In inclusive cache schemes, only the lowest level of cache maintains cache coherency. Remote UPGRADE/WRITE-MISS messages are passed up to higher levels, while local UPGRADE/WRITE-MISS messages are passed to lower levels, sort of like a write-through scheme. This has the advantage of higher levels dealing with most processor requests.

In exclusive cache schemes, all levels need to snoop on UPGRADE/WRITE-MISS messages, and for a directory, messages go to all levels.

In current Intel architectures, L1 and L2 are exclusive caches, and L3 is inclusive.

# 21/3/17

## Hardware Transactional Memory

Transactional memory allows memory operations to be organized into transactions, much like database transactions. Compared to locks, it's a simpler paradigm and is easier to reason about.

There are some research implementations of software transactional memory, but they generally result in poor performance. Hardware implementations can be a lot faster

For our purposes, a transaction is a set of instructions that either all succeed (runs normally), or none do (all those instructions have no effect). In other words, an atomic block of instructions. Our goal is to allow transactions to execute in parallel.

Reads in transactions are tracked in the read set (RS) buffer, and writes are tracked in the write set (WS) buffer (reads/writes aren't actually executed, just tracked). A conflict is when a thread writes to a memory location that is in another transaction's RS or WS.

Locks are pessimistic parallism - they optimize for the case in which a conflict occurs, by preventing conflicts from happening in the first place. Transactions are optimistic parallelism - they optimize for the case in which no conflicts occur, by recovering from conflicts when we encounter them.

Transactional memory example in C++:

#include <random>

using namespace std;

#define NUM_ACCOUNTS 1000

int balance[NUM_ACCOUNTS];

void func() {
random_device rnd;
for (int i = 0; i < 1000000; i ++) {
int a1 = rnd() % NUM_ACCOUNTS;
int a1 = rnd() % NUM_ACCOUNTS;
int amount = rnd() % 100;

// this block executes atomically, as a single transaction
__transaction_atomic {
if (amount > balance[a1]) { amount = balance[a1]; }
balance[a1] -= amount;
balance[a2] += amount;
}
}
}

int main() {
// initialize every account to have balance 100
for (int i = 0; i < NUM_ACCOUNTS; i ++) { balance[i] = 100; }

for (auto &t : thr) { t = thread(func); }
}

This compiles with experimental features of GCC, which implements software transactional memory. However, without hardware support, this turns out to have half the performance of the mutex-based version. The main issue is that resolving conflicts is really expensive in hardware.

Herlihy and Moss proposed the first hardware transactional memory architecture in 1993. This architecture added a transactional cache in parallel to the L1 data cache, and improved performance over software transactional memory by an order of magnitude.

The first commercially available hardware transactional memory implementation was the IBM POWER8 in 2013, followed by the Intel Haswell chips later that year. The POWER8 processor's HTM implementation was complicated by its cache structure, and the fact that it had 8 hardware threads per core.

A typical implementation of HTM looks like this:

• A tbegin FALLBACK_ADDRESS instruction to start a transaction.
• FALLBACK_ADDRESS is the address to resume execution at if the transaction ends up failing.
• Nested tbegin instructions are flattened into a single transaction (since the inner transaction failing would also cause the outer one to fail anyways).
• All reads and writes that happen in the transaction are tracked, but not actually executed until the transaction is committed.
• A tend instruction to atomically commit the current transaction.
• Basically, this atomically runs all of the writes in the transaction's write set.
• For nested transactions, a counter keeps track of the current nesting depth, and ignores all non-top-level commits.
• A tabort` instruction to abort the current transaction.
• For nested transactions, this causes all transactions up the stack, all the way to the top-level transaction, to be aborted.
• Version management for memory.
• Register renaming hardware can be used to checkpoint register contents, just like we did for speculative branches.
• The first-level data cache (L1 for current Intel processors) can be used for storing the WS and RS.
• The last-level data cache (L3 for current Intel processors) can be used to save pre-transaction memory block state.
• Cache-coherence traffic can be used to detect conflicts.
• Atomic writes for the WS need to be implemented on the memory bus.
• For example, suppose we have a 2-level cache where L2 is inclusive of L1. First, we might add a Transaction tag to the L1 cache (set by loads/stores in a transaction). A conflict occurs when a bus upgrade/write-miss occurs on a value in RS or WS, or a read-miss occurs on a value in WS. Upon a conflict occuring, we abort, which clears the Transaction tag from conflicted blocks in RS, invalidates conflicted blocks in WS, and jumps to the fallback address. Upon transaction commit, we copy blocks from the transaction's WS to L2, and clear all Transaction tags matching it.