Lecture Notes by Anthony Zhang.


Computer Architecture.

Andrew Morton
Section 001
Email: andrew.morton@uwaterloo.ca
Website: https://www.student.cs.uwaterloo.ca/~cs450/w17/outline.shtml
Office Hours: Tuesdays/Thursdays 11:30am-12:30pm in EIT-4015
Tuesdays/Thursdays 1:00pm-2:20pm


In this course, the main project (worth 30%) is to design a MIPS processor in Verilog. No required textbooks. The main topic is about how modern CPUs manage to achieve the performance that they do, building on top of the processor basics covered in earlier processor design courses.

There are also two written assignments (worth 20% total) and the final exam (worth 50%). Assignments have 4 grace days allowed.

Hardware description languages are used for modelling hardware, generally for the purpose of simulation or synthesis. Most complex digital hardware is designed using a hardware description language, which is then compiled and converted into specifications for FPGAs, ASICs, programmable logic, and so on.

The main hardware description languages in use today are System Verilog and VHDL. VHDL is more verbose and Ada-like, while Verilog is more minimalist and C-like. Most projects in industry use Verilog for this reason. System Verilog is a superset of plain Verilog, and all Verilog code works in System Verilog. There's also System C, which looks a lot more like C/C++, but it's not commonly used. In this course, we will be using Icarus Verilog (an implementation of System Verilog), version 10 or later (in order to get support for SystemVerilog).

The simplest/lowest-level of hardware modelling in System Verilog is gate-level modelling. At this level, we have logic gate primitives like AND/OR, and we can connect them together into circuits via wires:

# `a1` is the name of the gate, which is optional but useful for labelling gates
# `y` is the gate output, and `a`/`b`/`c` are the gate inputs (`and` supports one or more of these inputs, and one output)
# the inputs and outputs are wires, identified by name - everywhere that same name is used can be thought of as being all connected together by a physical wire
and a1(y, a, b, c)

# every argument except the last are the NOT gate's outputs, and the last one is the input
not (y1, y2, a)

Gate-level modelling

In gate-level modelling, signal/wire values can be 0 (low), 1 (high), x/X (unknown/ignored value), or z/Z (high impedance).

Gate-level modelling for a full adder (addend bits as a/b, carry-in as c_in, carry-out as c_out, sum as sum):

// full_adder_gate_level_modelling.sv
// for a full adder, `sum = a XOR b XOR c_in`, and `c_out = ((a XOR b) AND c_in) OR a AND b`

module full_adder(input a, b, c_in, output c_out, sum);
    // a `logic` is basically a connection/wire - here, we're declaring our wires
    // a `wire` is equivalent to a `wire logic`, which is a logic that supports multiple drivers
    // generally, we just use logic unless we need multiple drivers - the compiler will choose the best structure depending on how they're used
    // there are also `bit` and `byte`, which are basically logics that can only be 0 or 1, not X or Z
    logic w1, w2, w3; // these are wires that are used for connecting gates in the full adder together
    xor x1(w1, a, b);
    xor x2(sum, w1, c_in);
    and a1(w2, a, b);
    and a2(w3, w1, c_in);
    or o1(c_out, w2, w3);

This is a design module. It fully specifies the full adder as a self-contained unit.

While this looks like normal procedural code, with various function calls, it is more intuitive to think of this as a series of logical declarations - as if each of the xor/and/or lines were all continuously running at the same time.

A test bench is a set of definitions that instantiates the design module, and simulates the circuit under different input conditions to test for functional correctness. For our full adder example:

// full_adder_test_bench.sv
module full_adder_test_bench;
    logic carry_in, x, y;
    logic carry_out, z;

    // "dut" stands for "device under test"
    // below, we're basically instantiating a full adder (defined in the code above) that's connected to our logics, for testing purposes
    // the `.name(other_name)` syntax is a "named port map", which lets us connect logics to devices by signal name
    // alternatively, we can just map the signals in order, like `full_adder dut(x, y, carry_in, carry_out, z)`, but this can cause issues if we change the inputs/outputs of the full adder
    full_adder dut(.c_out(carry_out), .sum(z), .c_in(carry_in), .a(x), .b(y));

    // this is a procedural block - inside, the statements are evaluated sequentially from top to bottom
    initial begin // run once at time 0
        carry_in = 0; x = 0; y = 0;
        // delays must be used to ensure statements actually run sequentially - statements in between delays all execute in parallel
        #10 x = 1; // delay 10 time units, then continue and set x to 1
        #10 y = 1; // delay 10 time units, then continue and set y to 1
        #10 carry_in = 1; // delay 10 time units, then continue and set carry_in to 1
        #10 $stop; // delay 10 time units, then stop the simulation (this will not compile if trying to generate hardware from this)

    // now we want to dump the signals out into waveforms so we can see them
    initial $dumpvars(0, full_adder_test_bench); // the first parameter is the dump level (`0` means "dump all variables in the module and in the modules that instantiate it"), and the rest of the parameters are the module or variables to dump

To compile, run iverilog -g2005-sv -s full_adder_test_bench.sv -o full_adder.vvp full_adder.sv full_adder_test_bench.sv:

To simulate, run vvp -n full_adder.vvp -lx2:

To view the waveforms: gtkwave dump.lx2:


Live demonstration of the above Verilog workflow, including an overview of GtkWave's features and a shell script that builds and simulates the module.

Heirarchical design is the idea of using lower-level modules in higher-level modules. For example, a carry-lookahead adder module might instantiate a few ripple-carry adders, which in turn might instantiate a bunch of full-adders:

// adder_4_bit.sv
module adder_4_bit(output c_out, output [3:0] sum, input c_in, input [3:0] a, b);
    logic [3:1] carry;
    full_adder fa0(.c_out(carry[1]), .sum(sum[0]), .c_in(c_in), .a(a[0]), .b(b[0]));
    full_adder fa0(.c_out(carry[2]), .sum(sum[1]), .c_in(carry[1]), .a(a[1]), .b(b[1]));
    full_adder fa0(.c_out(carry[3]), .sum(sum[2]), .c_in(carry[2]), .a(a[2]), .b(b[2]));
    full_adder fa0(.c_out(c_out), .sum(sum[3]), .c_in(carry[3]), .a(a[3]), .b(b[3]));

Note that there's a lot of similar statements for the full adder definitions. In these cases, we can use generate statements to remove duplicated code. It looks somewhat like a loop, but it's more of a macro that gets expanded into multiple statements at compile time:

// adder_4_bit_generate.sv
module adder_4_bit(output c_out, output [3:0] sum, input c_in, input [3:0] a, b);
    logic [4:0] carry;
        genvar i;
        for(i=0; i<4; i=i+1) begin: adder
            full_adder fa(.c_out(carry[i+1]), .sum(sum[i]), .c_in(carry[i]) ,.a(a[i]), .b(b[i]));

    // since it would be really inconvenient to specify our c_in and c_out in the generate statement, we just used extra indices in the carry wire, and then assign the inputs/outputs accordingly
    assign carry[0] = c_in; // this is a continuous assignment statement - it's sort of like a one-way wire for the c_in signal in this case, but it can also do things like combinational logic with different gates and stuff
    assign c_out = carry[4]; // since the assignment is unidirectional, the inputs are on the right side of the equal sign, and the outputs are on the left side

Usually for loops are discouraged, but they're fine for generate statements - in other contexts, they might not compile to actual hardware.

Dataflow modelling

Dataflow modelling is the next level of abstraction above gate-level modelling. Dataflow modelling deals only with combinational circuits, by representing them with continuous assignment statements.

Basically, we represent what logical operations the circuit needs to do, and the compiler will figure out what kind of gate layout we need. For example, for a full adder it might automatically decide to reuse the xor(a, b) gate.

For example, we might model a full adder using dataflow modelling as follows:

// full_adder_dataflow.sv
module full_adder(output c_out, sum, input c_in, a, b);
    // the value of `sum` is continuously updated based on the values of `a`, `b`, and `c_in` - whenever any of those change, `sum` reflects the new value
    // this updating is not true of all continuous assignment statements, however
    assign sum = a ^ b ^ c_in;
    assign c_out = ((a ^ b) & c_in) | (a & b);

    // actually, Verilog has a `+` operator, so it's really as simple as `assign {c_out, sum} = a + b + c_in`
    // `assign` will truncate the result of evaluating the right hand side, if the right hand side has more bits than the left
    // `assign` will zero-extend the result of evaluating the right hand side, if the right hand side has fewer bits than the left

Some of the available operators are ! (logical NOT), && (logical AND), || (logical OR), + (two's complement addition), - (two's complement subtraction), * (multiplication), / (division), % (remainder), ** (exponentiation), <, >, <=, >=, ==, !=, ~ (NOT), & (AND), | (OR), ^ (XOR), ^~ (XNOR), << (bit shift left), >> (bit shift right), >>> (signed shift right - extends sign bit), {a_1, ..., a_n} (concatenate a_1 through to a_n), {n{a}} (replicate a by n times - repeatedly concatenate a), c?a:b (MUX - select a if c is high, b otherwise). They generally have precedence similar to their equivalents in C and Python.

For example, the nibbles of a byte can be swapped with assign {b[3:0], b[7:4]} = a[7:0], and an 8-bit signed integer can be sign-extended to 16 bits with assign b[15:0] = {{8{a[7]}}, a[7:0]}.

With all these useful operators, it's pretty straightforward to make a 4-bit adder:

// full_adder_dataflow.sv
module full_adder(output c_out, output [3:0] sum, input c_in, input [3:0] a, b);
    assign {c_out, sum} = a + b + c_in`

Behaviour modelling

Behaviour modelling is the next level of abstraction above dataflow-level modelling. It uses procedural blocks that are initial.

An initial block runs the statements inside of it once starting at time 0. An always block runs the statements inside of it whenever certain things change.

initial // run once at time 0

Here's a full adder implemented with behaviour modelling:

module full_adder(input c_in, a, b, output logic c_out, sum);
    // the `@ (c_in, a, b)` specifies the "sensitivity list", which means the procedural block should be reevaluated whenever any of the logics in the list change
    // in contrast, the `initial` statement doesn't have a sensitivity list because it simply runs once at the beginning
    // the left hand sides of each assignment must be a `logic` or `reg` - that's why the outputs of this module are declared as `logic`
    // we could also have written `always @ *` to make it re-evaluate the block whenever any of the signals on the right hand sides of any assignment statement changes
    always @ (c_in, a, b) begin
        sum = a ^ b ^ c_in;
        c_out = ((a ^ b) & c_in) | (a & b);


module full_adder(input c_in, a, b, output logic c_out, sum);
    always @ (c_in, a, b) begin
        {c_out, sum}  = a + b + c_in;


First assignment is posted, it shouldn't take more than 5 minutes to complete.

For behaviour modelling, we have a lot of useful SystemVerilog constructs:

Note that with if statements and other conditionals, if we don't explicitly set the value of a logic, then it retains its current value.

Let's implement a mux:

module mux_2_input(output logic d, input i0, i1, select);
    always @ *
        case (select)
            0: d = i0;
            1: d = i1;
            default: d = z; // the default case handles unknown and high impedance - we'll just turn the output off (high impedance)

Synchronous Circuits

Recall that a synchronous circuit is one that only updates on a clock edge. In this course our synchronous circuits are updated on the rising edge of the clock (CMOS is actually clocked on the falling edge).

A test bench for a synchronous circuit needs to generate a clock signal. This might look like the following:

module synchronous_test_bench;
    logic clk;
    initial begin
        clk = 1;
        forever #5 clk = ~clk;

Blocks can be triggered on edge events, like always @ (posedge clk) (trigger on rising edge) or always @ (negedge clk).

SystemVerilog actually has two assignment operators. = is a blocking assignment, where all updates happen in order (we use this for combinational logic). <= is non-blocking assignment, where all the right hand sides are computed before the updates happen (this is used for register-like or latched output). For example, a <= b; b <= a swaps a and b, while a = b; b = a just sets a to b. Generally, we prefer <= over = when possible, so everywhere outside of combinational logic (we generally want outputs to all update at once, at the end of an update).

System Verilog uses discrete event simulation. Simulations consist of concurrent processes, primitives, modules, procedural blocks, and continuous assignments. Changes to signals are update events, and processes respond to events they are sensitive to and add them to a process-specific event queue, which are then executed.

System Verilog lets us define constants with parameter some_constant = 8. This is often used to define how many bits a particular element is. There's also enumerations, like enum bit[1:0] {IDLE, REQUEST, WAIT, RECEIVE} state. Logics can also be defined as arrays, like logic [7:0] some_array [0:3]; (array of 4 8-bit registers), and are accessed with y = a[2] and so on.

System Verilog has various "system tasks", which trigger actions during simulations. For example, $display(...) outputs to console, $monitor(...) outputs to console whenever the given signals are updated, and $stop stops the simulation.


Complex number multiplier:

// recall that given two complex numbers $a$ and $b$, $p = a \times b = (\Re(a)\Re(b) - \Im(a)\Im(b)) + \imag (\Re(a)\Im(b) + \Im(a)\Re(b))$
// this needs 4 multiplications, 1 subtract, and 1 add
// we will use a synchronous circuit with one multiplier and one adder/subtractor to do this, in order to demonstrate synchronous design, and also to save space (at the expense of it being slower)
// therefore, we break the operation into 5 steps, each one only using one multiplication or one addition: `pp1 <= a_real * b_real`, `pp2 <= a_imag * b_imag`, `p_real <= pp1 - pp2`, `pp1 <= a_real * b_imag`, `pp2 <= a_imag * b_real`, and `p_imag = pp1 + pp2`, which we can implement with an FSM
module complex_multiplier(
    output logic signed [7:-24] p_r, p_i, // this defines `pr_` and `p_i` as fixed-point numbers with 24 bits to the right of the decimal point, and 7 bits to the left (actual value is the value of `p_r`/`p_i` as a two's complement integer, divided by $2^{24}$)
    input signed [3:-12] a_r, a_i, b_r, b_i,
    input clk, reset, go // the clock input, reset signal, and the trigger that starts the multiplication
    logic a_sel, b_sel, pp1_en, pp2_en, sub, p_r_en, p_i_en;
    logic signed [3:-12] a_operand, b_operand;
    logic signed [7:-24] pp, sum; // partial product, sum
    logic signed [7:-24] pp1, pp2;
    // multiplier with two multiplexers for input, to select inputs between real/imaginary components of `a` and `b`
    assign a_operand = ~a_sel ? a_r : a_i;
    assign b_operand = ~b_sel ? b_r : b_i;
    assign pp = a_operand * b_operand;

    // update pp1 and pp2 registers on clock edge
    always @ (posedge clk)
        if (pp1_en) pp1 <= pp;
    always @ (posedge clk)
        if (pp2_en) pp2 <= pp;
    assign sum = ~sub ? pp1 + pp2 : pp1 - -pp2;
    // update output registers on clock edge
    always @ (posedge clk)
        if (p_r_en) p_r <= sum;
    always @ (posedge clk)
        if (p_i_en) p_i <= sum;
    // finite state machine to drive the above hardware
    enum bit [2:0] { step0, step1, step2, step3, step4 } current_state, next_state;
    always @ (posedge clk or posedge reset)
        if (reset) current_state <= step0;
        else current_state <= next_state;
    always @ * begin // update whenever go or current_state change
        case (current_state)
            step0: next_state = go ? step1 : step0;
            step1: next_state = step2;
            step2: next_state = step3;
            step3: next_state = step4;
            step4: next_state = step0;
    always @ (current_state, reset) begin
        a_sel; b_sel = 0; pp1_en = 0; pp2_en = 0;
        sub = 0; p_r_en = 0; p_i_en = 0;
        case (current_state)
            step0: begin pp1_en = 0; end
            step1: begin a_sel = 1; b_sel = 1; pp2_en = 1; end
            step2: begin b_sel = 1; pp1_en = 1; sub = 1; sub = 1; p_r_en = 1; end
            step3: begin a_sel = 1; pp2_en = 1; end
            step4: begin p_i_en = 1; end

Verilog has sized literals for numbers. For example, 16'h0001 means a 16-bit hexadecimal value 1, 1'b1 means the 1-bit binary value 1, and 4'd8 means the 4-bit decimal value 8.

Overview of assignment 1. Assignment involves filling out Verilog module stubs and writing test benches.


Room has been moved to HH 1102, which should be a lot closer to campus than OPT.

Processor Design

There are multiple levels of abstraction when talking about a processor design. From most abstract to least abstract, the main ones are:

In the 1980s the main focus in processor design was improving single core performance and clock speed. Today, focus is on more cores at a lower clock speed, and doing more with each cycle (modern Intel CPUs can do 4-8 instructions per cycle!). New microarchitecture-level techniques like instruction-level parallelism (parallelism within 1 thread) and hyperthreading (parallelism between multiple threads on one core), speculative execution (running both sides of a branch and then throwing away the wrong one once the branch is done), out of order execution (rearranging instructions to improve parallelism) have also resulted in significant performance gains. The main issue preventing more single core performance and clock speed is the power wall.

Dynamic power consumption is the power consumed that depends on the processor workload. When a transistor switches, it must use energy proportional to the capacitive load times the voltage squared. So the power used (switching frequency times switching energy) quicky increases linearly with the switching frequency, resulting in increased cooling requirements and worse battery life. One technique modern processors use to try to get around this is to reduce the switching voltage to quadratically reduce the switching energy required - for example, a 15% reduction in voltage results in a 28% reduction in switching energy.

There's also static energy consumption, the power consumed regardless of the processor workload. As transistors get smaller, the leakage current increases (due to quantum tunneling in the transistor junctions), which means more power used per transistor.

Modern techniques involve thread-level parallelism (doing multiple tasks at once), and data-level parallelism (doing something to multiple pieces of data at once).

Fynn's taxonomy characterizes four classes of parallelism:

Recall MIPS:


Sometimes instructions in a pipeline depend on others. There are three main types of these dependencies:

When instructions aren't dependent on each other, we can more-or-less freely rearrange them. This allows efficient pipelining and parallelism.

Total sequential execution (TSE) is an assumption in our programming model that one instruction finishes before the next starts. TSE is sufficient but not necessary for semantic correctness. In fact, semantic correctness can be satisfied just by satisfying the inter-instruction dependencies of the original program - we can otherwise rearrange and overlap instructions however we want.

Since we have five stages (fetch, decode, execute, memory, writeback), an instruction that writes registers takes 4 cycles after it's been fetched before the write actually takes effect, assuming 1 cycle per pipeline stage. Likewise, an instruction that reads registers takes 1 cycle after fetching before actually performing the read.

A control dependency is basically a special case of a register dependency, where the register is the program counter. An instruction that branches (writes to the program counter) has its write take effect 1 cycle after being fetched (the MIPS datapath reads PC in fetch stage and writes in decode stage).

In the same way, an instruction that writes to memory has a memory dependency, and has its write take effect 3 cycles after being fetched.

Therefore, if we have two instructions A, B that run at clock cycle i and j, respectively, such that i > j:

A hazard is a potential for a violated dependency (and therefore, violation of semantic correctness). In our pipelined, statically scheduled CPU, the hazards are a 3-cycle register read-after-write, and a 1-cycle control read-after-write.

To avoid hazards, we can insert no-op instructions - 3 NOP instructions after each register read-after-write, or 1 after every control read-after-write. These can be inserted either by the compiler, or the CPU can stall for the necessary number of cycles when the relevant dependencies are found. Compilers are responsible for this in MIPS.

However, a 3-cycle stall for a 3-cycle register read-after-write is quite long. We can shorten register/memory dependency stalls by using data forwarding - routing the output of later stages directly to earlier stages to avoid the need for a write/read - earlier stages can read directly from the later stages.

For our purposes, we can route the output of the memory stage to the execute stage (for j = i + 1), the output of the writeback stage to the execute stage (for j = i + 2), and the output of the writeback stage to the decode stage (for j = i + 3). With that in place, each stage always has the data it needs, so we never have to stall.

To implement data forwarding in each stage, we can use multiplexers that choose between either reading/writing registers, or taking input directly from later stages.

Turns out, this can't eliminate all our stalls, though it does make a lot of them much shorter. Consider lw R1, (R2) followed by add R4, R1, R1 - the load-word instruction doesn't have its result until the memory stage is done, so we must stall the add for 1 cycle so we can forward it from the write-back stage to the execute stage.

To solve the control dependency read-after-write stall (when we have a conditional branch), we might assume that the branch won't happen, keep executing the next instruction, and then squash/suppress it in the pipeline if we do end up taking the branch.

MIPS uses a different approach - the CPU will execute the instruction after the branch (the instruction after the branch is in a position called the branch delay slot), regardless of whether the branch is taken or not. The compiler will generally try to fill the branch delay slot with an instruction that doesn't depend on the branch, or NOP if it can't find one.


Recall that in static scheduling, instructions are executed in the order that they appear in, and the compiler is responsible for arranging instructions in a way that avoids hazards.

Code example with instruction reordering by the compiler for hazard avoidance.

The cycles per instruction (CPI) is a measure of CPU performance, and is ideally somewhere around 1 for our scalar processor. However, penalty cycles caused by hazards will increase this. For example, if we have 6 instructions that incur 3 penalty cycles, we get \frac{6 + 3}{6} = 1.5 as the CPI.

The program execution time is \frac{n \times \text{CPI}}{f}, where n is the number of instructions in the basic block, and f is the clock frequency. The speedup of any given piece of code is the new CPI over the old CPI.

Local scheduling is when the compiler rearranges instructions within a basic block. A basic block is a sequence of consecutive instructions such that if one instruction is executed, all of them are - that means only the first instruction can be the target of a branch instruction, and only the last instruction can be a branch instruction (or the second to last for MIPS, due to the branch delay slot).

To perform local scheduling, we first identify the hazards in the code, leaving slots to put other instructions in according to how many penalty cycles are present. For example, if we have a load followed by a dependent add instruction, we might have 1 penalty cycle in our architecture, so we would leave one slot for an instruction. Then, we try to fill in the slots with other instructions while keeping dependencies satisfied.

Global scheduling is when the compiler rearranges instructions across multiple basic blocks.

A common example of this is loop unrolling, where the body of a loop known to execute n times is simply duplicated n times - this reduces branching overhead, and makes local scheduling better since it may result in larger basic blocks. While loop unrolling can make our code faster, it results in larger programs, which might cause more instruction cache misses. The compiler must make the tradeoff depending on what would be faster.


Project part 1 marks come out on Monday. Part 2 is out on the course website, building a 5-stage MIPS pipeline. In-class overview of assignment.

In the scalar pipelines we are looking at, exceptions are events that alter program flow. Exceptions are harder to handle than normal program flow because they can occur at any time, and we must make sure registers and memory are properly stored/restored to ensure the TPE assumption is upheld.

Exceptions can be generated by things like interrupts from peripherals, unusual cases for some instructions (invalid opcode, page fault), and trap instructions (instructions that invoke the OS).

Precise exceptions are those that are recoverable - we can return to thread after the exception is handled.

What if we have multiple instructions that generate exceptions simultaneously in the pipeline? For example, we might have a load instruction that tries to dereference a null pointer at the memory stage, while we're trying to decode an invalid instruction in the decode stage - the load instruction generates an exception from the memory stage, and the invalid instruction generates one at the decode stage. To make sure we handle all instructions in the pipeline, we only mark the instruction as having caused an error, then actually handle them in the write-back stage.

There are three types of exceptions:

Very Long Instruction Word (WLIV) architectures are architectures that have very large instructions, generally able to issue multiple operations at once. For example, Intel Itanium has 128-bit instructions, able to contain up to 3 operations each.

If and only if an architecture can handle multiple instructions per cycle, it is known as a superscalar architecture.

VLIW architectures are generally statically scheduled, where each instruction contains multiple scalar operations. They generally don't do hazard detection or even full data forwarding, instead relying on the compiler to schedule operations in a good way.

The advantage of VLIW architectures is that the compiler gets more control over the scheduling and how different functional units (FPUs, ALUs, etc.) are used, so we can devote more transistors to the actual functional units themselves, rather than hazard detection and so on.

Consider a VLIW architecture where each instruction contains two memory operations, two floating point operations, and one ALU/branch operation. Each operation can have its own pipeline, so we'll give memory operations the usual fetch/decode/execute/memory/writeback, floating point operations a fetch/decode/execute 1/execute 2/execute 3/writeback pipeline, and ALU/branch operations a fetch/decode/execute/writeback pipeline. We'll also say branches resolve in the decode stage for all types of operation, so we need 1 branch delay slot (since decode is 1 pipeline stage after fetch).

Fetches occur at the same time for every pipeline - one instruction is fetched for every cycle.

Assuming full data forwarding is implemented, the latency of memory operations could be up to 1 if the execute stage has a dependency on the memory stage, the latency of ALU/branch operations could be up to 1 if the decode stage has a dependency on the execute stage, and the latency of floating point operations could be up to 2 if the execute 1 stage has a dependency on the execute 3 stage.

Slot utilization is how well the available operation slots for each instructions are filled, defined as \frac{\text{slots filled}}{\text{number of instructions} \times \text{slots per instruction}}.


;wip: global scheduling, software pipelining, catch up on the rest, gotta run for interviews


VLIW processors have many techniques for improving throughput and parallelism.

Speculative execution is basically just when we keep executing down the predicted path when we reach a branch, regardless of whether that path is actually taken or not, and then undoing the effects of that if we don't end up taking the predicted path. Implemented correctly, it eliminates branch delays when the branch predictor is right.

Speculative loading is a technique in which the compiler tries to move memory load instructions upward to avoid memory latency, even if they possibly won't be needed, like if a branch isn't taken.

In the IA-64 architecture, the Advanced Load Address Table (ALAT) is a unit of bidirectional associative memory used to implement speculative loading.

In IA-64, we denote speculative instructions with the .s suffix - ld8.s is the speculative version of ld8. We denote an advanced load version of the load instruction with the lw8.a suffix, and the ALAT check version of loads with lw8.c. Each lw8.a instruction is paired up with a lw8.c instruction later on - the lw8.a instruction starts the actual data transfer into the ALAT, and then when the corresponding lw8.c instruction is encountered, we can just load it from the ALAT rather than making a trip all the way to memory. Cache entries are invalidated if their addresses are stored to in between those instructions.

Deferred exceptions are a technique for making speculative execution easier. If we want to hoist an instruction above a conditional branch, and the exception causes an exception at runtime, we need a way to mark the instruction in a way that the exceptions are only handled if the conditional branch isn't taken, in order to make sure the hoisted code works the same as the original.

Consider a LW R1, (R5) instruction, which gets hoisted above a BEQZ R4, L1 instrucion. To implement deferred exceptions, we can add a poison bit to each register, and if and only if the speculatively executed instruction raises an exception, we can set the poison bit instead. Then, when we reach the original instruction, we can check for the poison bit being set, and if it is, actually raise the exception there.

;wip: read textbook section 3.5.6

Predicated execution is a technique in which instructoins only commit their results if some predicate is true. This is useful because we can often avoid branches and use predicates instead, therefore avoiding branch delays. IA-64, for example, has 64 1-bit predicate registers, which can be set by various operations like comparison, and used to enable/disable different instructions.

Predicated execution essentially lets the compiler turn control dependencies into data dependencies, by letting it change branches into predicated instructions. All of the instructions run, but only the ones with true predicates will actually have an effect. It's very useful when we have lots of small basic blocks.

For example, in IA-64, we can denote p1 = r1 == 0; p2 = r1 !== 0 as cmp.eq p1, p2 = 0, r1;;, where p1 to p64 are the predicate registers. Then, we can do (p1) add r4 = r4, 1 to increment r4 if and only if predicate register p1 is 1.

The Itanium architecture, IA-64, was also called an EPIC architecture (explicitly parallel instruction computing - VLIW with data forwarding and dynamic events like cache misses). It has 128 65-bit (64 bits, plus a poison bit, called the Not-A-Thing bit) integer registers, 128 82-bit floating point registers, and 64 1-bit predicate registers.


Out of order pipelines

Out of order pipelines dyanmically schedule instructions. The benefit is that they can take advantage of instruction-level parallelism not available at compile time, and allow us to make compilers a lot simpler, at the expense of more complicated CPUs.

When we're designing these pipelines, we need to design around hazards in register data, memory data, and control flow.

Out of order execution is inherently superscalar - there's no point in doing dynamic scheduling unless we have multiple functional units.

Scheduling is usually done using Tomasulo's algorithm, an improved version of previous techniques like scoreboarding. It was invented around 1967 and used for an IBM FPU.

Tomasulo's FPU has two functional units - a 2-stage pipelined, 2-cycle floating point adder, and a non-pipelined, 3-cycle floating point multiplier. The adder has 3 reservation stations, and the multiplier has 2.

The FPU also has 3 sets of registers:

The common data bus allows parts of the FPU to broadcast a float along with a 4-bit source tag to specify where it came from (like a reservation station or the FLB).

Units like reservation stations, the FLR, and the SDB hold data and the data's tag.

The key concept here is the reservation station, and the use of tags to

Here's the FPU's workflow:

  1. Up to two instructions are dispatched (in their original order) to the reservation station.
  2. Operands are copied to the reservation station.
  3. The destination register number is compared to the reservation station number.
  4. Instructions are sent to the functional unit (adder or multiplier) when all of its operands are ready and the functional unit is free.
  5. After computing the result, the functional broadcasts the result on the CDB, and waiting units like the FLR, reservation stations, and SDB update their contents.

;wip: what even is this, read https://en.wikipedia.org/wiki/Tomasulo_algorithm#Implementation_concepts

Tomasulo's algorithm has imprecise exceptions. For example, if register F4 is renamed by instruction X in cycle 1 and renamed by instruction Y in cycle 2, then W never writes its result to register F4, so if W faults, then recovery isn't possible since the semantics require the result be written. There's no real way to recover from this except to restart the computation from an earlier point.

An out of order pipeline looks something like the following:

The re-order buffer (ROB) remembers program order, so we can put things back in order after the Issue stage. At the dispatch stage, each instruction gets an entry in the ROB. At the finish stage, the ROB entry is updated to ;wip. At the Commit stage, the ROB entry is removed.

The ROB also keeps track of which instructions raise exceptions, so we can handle them at the Commit stage of the pipeline (this is analogous to the Itanium case, where we used a poison bit to only raise exceptions when we reach the Writeback stage). This ensures that the processor state is correct/precise at the instruction that raised the exception, which allows us to recover from the exception.

The width of an out of order pipeline is the number of instructions dispatched/committed per cycle.


Assignment 1 (not project 1) is due after reading week. It's about scheduling instructions in a pipelined processor.

Register renaming/assignment

Data dependencies within a block of code form a data-flow graph (DFG) in the processor, where each edge a \to b represents the fact that b depends on the result of a, so b shouldn't run before a. To help us run Tomasulo's algorithm, we can draw out the DFG and arrange it to minimize height.

The longest path (also known as the critical path) within the dataflow graph is the minimum number of cycles we can execute those instructions in - this is called the dataflow limit. In other words, with one thread, it's impossible to execute faster than the length of the critical path of true dependencies.

Operations in the CPU produce values, which are then assigned to registers. The live range of a value is the range within the code starting from when the value is first written to a register, up to and including the last use (when the value is last read from the register).

DFGs can also have virtual dependencies between pairs of instructions, called false dependencies, which give us write-after-read or write-after-write hazards if we change the order of those instructions.

Since there are a limited number of registers, the compiler needs to recycle registers - the compiler can use a register for a new value after the current value has been read for the last time. However, reusing registers imposes write-after-read dependencies.

Register renaming is a technique that tries to get rid of false data dependencies (i.e., write-after-read hazards), which helps us get better parallelism. Essentially, we're renaming the register that we're reading so that the write becomes independent of the read, allowing us to execute both in parallel.

To implement this, each operation's output value is assigned to a temporary rename register within a rename register file. The actual destination register is associated with the rename register ID, and on the commit step, we can actually copy the value from the rename register to the destination register. If we have enough rename registers, we can essentially have single assignment for each register, eliminating false dependencies and allowing us to improve parallelism.

Register renaming is used by Tomasulo's algorithm.

In hardware, we have an ARF (architected register file), which maps architecture registers (R0-R31 in MIPS) to rename registers. The RRF (rename register file) has one entry for each ROB entry, and the

With ;wip, we replace the ARF with an RAT (register alias table), which maps registers R0-R31 to architected and rename register entries in the PRF. The RAT has two columns - the retirement RAT (which has indices of architected register in the PRF), and the front-end RAT (which points to renamed register indices in the PRF). In case of an exception, the retirement RAT is copied to the front-end RAT, which undoes any renames done by the instructions after the fault, allowing us to raise precise exceptions.

;wip: read textbook section 3.4.6 (or 5.3? dunno what the numbers mean) and https://en.wikipedia.org/wiki/Register_renaming

The PRF (physical register file) stores the actual contents of the architected and rename registers. The

Here's what this looks like:

Delayed read is a technique in which the reservation station only stored the operand tag and status of the data, rather than the data itself, reducing data duplication. It can also time instruction issuing to correspond with the broadcast of the result of the operation it depends on. ;wip: what


Register renaming is a technique for undoing the hazards caused by register recycling.

Speculative execution is a technique in which we predict the direction of a branch and start executing that path while the branch is being resolved. However, this presents issues for register renaming, because speculatively executed instructions need to rename registers too - we must make sure the effects of the renaming are undone if the branch turns out to be wrong. To do this, we can do either rollbacks or checkpointing.

To do rollbacks, we save frontend RAT entries into the ROB (reorder buffer) as soon as the speculative instruction is dispatched. To recover upon a mispredicted branch, we can copy back the rename register number from the ROB to the frontend RAT, one at a time, until the branch instruction is actually completed. This works, but it's quite slow because of all the copying we have to do over multiple cycles.

To do checkpointing, we save a full copy of the entire frontend RAT as soon as the speculative instruction is dispatched. The saved copy of the frontend RAT is copied back to the frontend RAT is copied back if we mispredict the branch. This can be a lot faster than rollbacks, but it requires more storage and hardware to support this, depending on how many branches we are speculating past.

For Intel Skylake chips, we have 224 ROB entries, 180-entry integer PRF, and 168-entry floating point PRF. Haswell chips support up to 48 checkpoints - the ability to speculatively execute 48 branches at once!

Memory Data Flow

Data in memory includes the data segments, stack, and heap. Loads usually start a chain of instructions, because they have to wait on RAM, which is slow compared to processors. How do these work with pipelined processors?

For store instructions:

  1. Upon dispatch, calculate the effective address and store it with the data in the store buffer (SB).
  2. After commit, write the store buffer entry out to the processor cache (so it gets written to RAM).

For load instructions:

  1. Upon dispatch, calculate the effective address, and read the data from either the store buffer (if available) or the processor cache (so it gets read from RAM).
  2. After commit, broadcast the destination read on the forwarding bus, so the other units know that the destination register has been updated.

We want to be able to issue store instructions out of order, to improve performance.

Speculative loads are loads performed under the assumption that they are unaliased to recent stores - loads that are done from the cache, not the store buffer. When we do speculative loads, we need to detect if the load gave a valid result, and if not, retry it.

The speculative load process:

  1. Upon dispatch, calculate the effective address. If a committed store is in the store buffer, forward the data along the forwarding bus. Otherwise, get the data from the processor cache. Add the speculative load's location and effective address to the FLB (finished load buffer).
  2. After commit, remove the created entry from the FLB.

The store process, when using speculative loads, becomes:

  1. Upon dispatch, calculate the effective address and store it with the data in the store buffer.
  2. Upon commit, if aliased loads are present in the FLB, suppress those speculative loads. Write the store buffer entry out to the processor cache (so it gets written to RAM).

This ensures that if a speculative load finishes before a store completes, the effect of the load gets undone.

Another technique for speeding up loads is data prefetching, where data gets loaded from RAM into the processor data cache before it's actually loaded, so when we actually load it, we can do so directly from the cache. For example, in AMD64 the the prefetch instruction does this.

Data prefetching can also be done automatically - hardware prefetchers might detect patterns of sequential access, strided access (access with a linear offset each time, like accessing the diagonals of a matrix or accessing the columns of a row-major ordered matrix). One downside of this is that it might bring in data we don't actually want, reducing cache hit rate. Some systems like SPARC have a separate cache for prefetched data, called the prefetch buffer, to avoid this issue.


Review of caching and memory (see ECE222 notes for more details):

Intel's main competitive advantage nowadays is its superior branch prediction abilities.

In MIPS, control flow happens either as a branch instruction (which does PC-relative addressing), or as a jump (which does indirect addressing).

Speculative execution has three main concerns:


;wip: missed due to interviews

;wip: out of order pipelines: condition spec (section 8), dynamic methods based on branch history, adding a pattern history table to the BTB (branch target buffer)


The branch predictors we'll look at are 2-bit saturating counters, like one of the questions in Project Part 1. Whenever we take a branch, we increment the counter, and whenever we don't, we decrement. The counter saturating means that incrementing 11 stays at 11, and decrementing 00 stays at 00. The branch is predicted to be taken whenever the counter is 10 or 11. Branch predictors are added to the PHT whenever we encounter a branch instruction. They're initialized to 10 ("weakly taken").

Nair conducted a study in 1995 to check on how accurate branch prediction is in the real world, and found that prediction accuracy was 85%+ on many commonly used programs like SPICE and GCC. This is actually within 1% of an optimal 4-state branch predictor FSM, so saturating 2-bit counters are actually really good at predicting branches in real-world situations (they tried every one of the 5248 possible 4-state branch predicting FSMs!). They also looked at saturating counters with a different number of bits, and aliasing vs. the number of low-order bits we index by in the PHT, and characterized performance as those variables were changed.

;wip: correlating predictors, section 8.2.3, bimodal vs local vs global predictors

A bimodal branch predictor is the kind we've been looking at so far - the PHT is indexed by some subset of the bits of PC. This works decently well, but it doesn't capture behaviour like alternating branches (the branch switches between taken and not taken every time we execute), or similar patterns in branching.

A local branch predictor tries to capture this. To predict a branch, we index some bits of the branch instruction address (the value of PC) into a history table, which stores the corresponding last k branch results (for example, 10110 means the branch instruction branched 3 times out of the last 5 executions). Then, those last k results are XORed with a subset of bits of the branch instruction address, and the result is used as indices in the PHT, which stores the usual 2-bit saturating counters. In other words, \text{prediction} = \text{saturating\_counter\_prediction}(\text{subset of bits of PC} \oplus \text{PHT}[\text{history}[\text{another subset of bits of PC}]]). This allows us to predict based on the history of branch results for individual branch instructions, and it works really well in practice.

A global branch predictor detects correlated behaviour (like two different branch instructions that only branch when the other doesn't), saves the last n branch decisions of any branch instructions in a global history register (GR), a plain register that stores historical data for all branches. The global history register is used as indices in the PHT - we can look up our 2-bit saturating counters in the PHT by the history of all branches. Turns out that this doesn't work as well as local branch prediction, mostly because of aliasing - global history often looks the same for a lot of different branch instructions that behave very differently. In other words, \text{prediction} = \text{saturating\_counter\_prediction}(\text{PHT}[\text{history of last } k \text{ branches}]).

Real-world systems often combine a global and bimodal branch predictor, indexing into the PHT based on the global history XORed with some subset of the bits of PC. This scheme is what gshare uses, and performs almost as accurately as local branch prediction in real-world uses. However, it has one less level of indirection compared to local predictors (local predictors have to look up a value in history and then PHT, while we only have to look up a value in the PHT), so it can be faster.

Some processors implement meta branch predictors - running different types of branch predictors in parallel, and changing the one used based on their previous performance for each particular branch instruction.


Branch prediction examples, step by step. In the diagram, branch results are shifted into the right side of the history entries.

A branch predictor needs to both predict branches, check the results of those branches after they're resolved/executed, and then update the predictors based on that information. When the predictor predicts wrong and we're speculatively executing, we must also rollback the effects of the execution and restart it at the correct address.

To implement this:

Thread Level Parallelism

Parallel programming allows us to execute multiple things at the same time on multiple processors. Generally this is done by using shared memory (pthreads, etc.) or message passing between processes (MPI, Scala, etc.). In this course we'll look mainly at the shared memory model.

For our purposes, processes are collections of threads with an address space.

A critical section is code that accesses shared data. We want only one thread to be executing a critical section at a time, and we usually enforce this using locks. In CS350 we looked at the software side of implementing these syncrhonization primitives, but now we'll look at the hardware primitives that allow us to implement these operations.

Common schemes for hardware synchronization from CS350:

With any of these, we can implement a spinlock, the basic primitive needed to implement all of the other synchronization primitive. For example, with the compare-and-swap primitive cmpswp in MIPS:

spin: cmpswp LOCK_VARIABLE, 0, 1 ; if LOCK_VARIABLE is 0, set it to 1
      bne spin ; loop until we successfully set LOCK_VARIABLE to 1 from 0
      ; we now hold the spinlock, and can do our critical section stuff
      sw R0, LOCK_VARIABLE ; release the lock

Another example, with the ll/sc primitives in MIPS:

spin: ll R1, ADDRESS
      addi R1, R1, 1
      ;wip: where's the critical section?
      sc R1, A
      bne spin 


POSIX threads are implemented in the pthreads library which expose functionality for thread creation/destruction and primitives like mutexes.

Pthreads example:

#include <pthread.h>
#include stdio.h

#define NTHREAD 4
#define ARRAYSIZE 1000000

double sum = 0.0;
double a[ARRAYSIZE];
pthread_mutex_t sum_mutex;

void *add_array(void *thread_id);

int main(void) {
    // initialize `a` here
    pthread_mutex_init(&sum_mutex, NULL); // initialize mutex, with the default (NULL) attributes
    int thread_ids[NTHREAD];
    pthread_t threads[NTHREAD];
    for (int i = 0; i < NTHREAD; i ++) {
        tid[i] = i;
        pthread_create(&threads[i], NULL, add_array, &thread_ids[i]); // create thread from the add_array function, called with argument `&thread_ids[i]`, and store the thread object in `&threads[i]`
    for (int i = 0; i < NTHREAD; i ++) {
        pthread_join(thread[i], NULL); // wait for the thread to terminate
    printf("array sum: %f\n", sum);

void *add_array(void *thread_id) {
    int start = (*(int *)thread_id) * ((double)ARRAYSIZE/NTHREAD);
    int end = (*(int *)thread_id + 1) * ((double)ARRAYSIZE/NTHREAD);
    double my_sum = 0.0;
    for (int i = start; i < end; i ++) {
        my_sum += a[i];
        sum += my_sum;

With the OpenMP library, we can use compiler directives to automatically create threads and assign work to them. This library is widely used in real-world programs. For this example we'll use C++:

#include <iostream>
#include <omp.h>

using namespace std;

const int ARRAYSIZE = 1000000;
double sum = 0.0;
double a[ARRAYSIZE];

int main() {
    // initialize `a` here
    double my_sum;
    // start the parallel computation using OpenMP, with shared memory, a thread-local variable `my_sum`, and 4 threads (this defaults to the number of logical cores)
    #pragma omp parallel default(shared) private(my_sum) num_threads(4)
        my_sum = 0.0;
        #pragma omp for
        for(int i = 0; i < ARRAYSIZE; i ++) {
            my_sum += a[i];
        #pragma omp critical
        sum += my_sum;
    // an alternative way to the above block
    #pragma omp parallel for default(shared) reduction(+ : sum) num_threads(4)
    for (int i = 0; i < ARRAYSIZE; i ++) {
        sum += a[i];

Symmetric multiprocessing (SMP) describes a shared memory multiprocessor architecture with uniform memory access (every processor can access every memory location with the same access time).

To implement uniform memory access times, we might put processors and memory modules on a shared bus (memory accesses are broadcasted on the bus), or use a crossbar switch (access to memory banks routed by processor).

SMP systems tend to have limited scalability, topping out at a few tens of processors, because of the need for uniform memory access.

Non-uniform memory access (NUMA) is the opposite of UMA architectures - they allow processors to have different access times to different memory locations. This allows us to scale them up a lot more (e.g., IBM Blue Gene has over 65000 nodes!).

To implement non-uniform memory access times, we might use a scalable interconnect system like a tree network (heirarchy of nodes), mesh networks (nodes connect to adjacent nodes, like on a grid), torus networks (nodes connected in a grid that wraps around), hypercube networks (nodes connected with every node with a binary ID that differs by 1 bit), and so on.

In a NUMA architecture, we must ensure that the memory stays coherent. Coherent shared memory means that all processors will see the most recent write to any memory location (i.e., cache coherency for memory). This can be implemented by broadcasting writes on the memory bus (write-through caching), or by broadcasting invalidation messages on the bus, so future reads to that memory location will request its updated value (write-back caching).

One way to do ensure cache coherency in UMA architectures is the MESI (Modified Exclusive Shared Invalid) protocol. In this protocol, caches are divided into blocks, and each cache block is either:

Since all the memory messages are on the bus, the MESI protocol lets us store all the cache block states in a cache tag array, which replaces all the dirty bits and status bits. The cache controller is responsible for monitoring the bus and keeping the cache tag array up to date.

MESI transactions occur on local requests (requests from a processor to its own cache to read/write data) and on remote requests/bus requests (requests from other processors to their own caches, that we're just listening in on).

A snooping implementation means that every cache is listening to bus traffic from every other cache - they're snooping on each other's bus traffic. The snooping implementation allows caches to update their own states depending on changes in other caches, or tell other caches whether they've got a fresh copy of the data they're reading/writing.


Tips about project part 3: make sure to copy the .x files into the project folder.

MESI is a snooping protocol, which means that processor caches are both listening to requests from the processor, and snooping on traffic generated by other processors on the bus.

Suppose we have processors P1, P2, P3, all sharing one block B1 of memory. P1 writes to B1, then P2 reads from B1. In a MESI system, this shows up as:

  1. Initially, B1's corresponding cache entries for each processor are all in the SHARED state.
  2. P1 writes to B1, which sends out a P1 upgrade message on the bus. This sets P1's cache entry for B1 to MODIFIED, and P2 and P3's cache entry for B1 to INVALID.
  3. P2 reads B1, and since P2's cache entry is INVALID, P2 doesn't do anything. However, P1 gives a snoop response of DIRTY and performs a writeback. This sets all cache entries for B1 to SHARED again.

In addition to MESI, there's also MSI, which implements cache coherency without the EXCLUSIVE state, folding it into the SHARED state instead (it's a bit less efficient because local writes have to send out the UPGRADE message). MESIF and MOESI are other variants, based on the idea that when there's a cache miss, it's faster for another cache to supply the value than for the main memory to.

MOESI is used by ARM and ARM, and is a probing protocol rather than a snooping protocol - processors can directly ask other processors for data. Basically, this just adds an OWNED state, which means there's only one copy of the cache that serves this particular block. If a processor's cache holds a MODIFIED block and another processor does a bus read, the first one's block becomes an OWNED block, and the first processor forwards the block to the requesting processor. Likewise, bus writes modify the owner's and the writer's copy. The owner of the block simply does the writeback before evicting the block from its cache. MOESI is nice because it reduces writebacks to main memory.

MESIF is used by Intel, and is another probing protocol. Basically, this just adds a FORWARDING state, . For a SHARED block, only one of those holders of that SHARED block should forward the block when there's a READ/WRITE on the bus. The most recent reader of a block holds it in the FORWARDING state and does the forwarding. After responding to a bus READ/WRITE, it transitions to the SHARED state (the requesting processor now has the FORWARDING block). This is nice because it makes caches take turns forwarding the data. Also, the FORWARDING block has the freshest copy of the block's data, so it's not likely to evict that block (which would be slow due to needing to write back to main memory).


To implement spinlocks with the atomic exchange instruction:

     daddui R2, R0, 1 ; set R2 to 1
spin exch R2, (R1) ; atomically exhange the value of the lock with 1
     bnez R2, spin ; if the lock was originally held (originally had a value of 1), keep spinning

The exch is atomic, so it always generates a write miss (which then causes other lock holders to do writebacks) - this is a lot of bus traffic. This is because the last holder of the lock has the lock value as MODIFIED in memory, and the rest has it as an INVALID value, and this switches every time there's an EXCH instruction.

What we really want is for the lock to be a SHARED value between all the nodes trying to get the lock, at least most of the time. We can implement this by non-atomically reading the value first before trying to atomically exchange it:

spin ld R2, (R1)
     bnez R2, spin
     daddui R2, R0, 1
     exch R2, (R1)
     bnez R2, spin

The non-atomic read causes a read miss rather than a write miss, and makes the node with the MODIFIED lock value and the node that caused the read miss to change their lock value to SHARED. After the read miss is resolved (so the lock value becomes SHARED), nodes can now spin on their local cache values. When the lock holder releases the lock, the lock value is invalidated in the other nodes, allowing them to race to try to acquire that lock.

A directory implementation of MESI has a directory that stores the state of each block of memory, stored with each memory block. Each directory is a buffer with one entry per block, each entry storing:

Note that this doesn't explicitly represent the EXCLUSIVE state - this is because EXCLUSIVE is simply a SHARED state where only 1 of the sharing vector bits is set.

The directory adds overhead for each block, specifically \frac{\text{directory row or entry size}}{\text{block size}}. The general architecture is that each processor is connected to a memory and a directory, and these three are all connected to a scalable interconnect.

A directory implementation is comparable to a snooping implementation - it's faster and has less bus traffic, but it adds a significant amount of overhead to actually track the memory states.

Suppose P1 writes a piece of data, originally SHARED between P1 and P2. P1 must then send an UPGRADE to directory 1 to make P1 the owner of that data, then directory 1 sends INVALIDATE to the rest of the processors that have a copy of the data. P1 is now

When P2 reads after the above happens, P2 sends a READ MISS to directory 1, then directory 1 sends a FETCH to P1. P1 sends a WRITEBACK to directory 1 to make sure it has the most recent data. Then, directory 1 sends a DATA REPLY to P2, completing the read.

Multi-level caching

There are two schemes for multi-level caching:

In inclusive cache schemes, only the lowest level of cache maintains cache coherency. Remote UPGRADE/WRITE-MISS messages are passed up to higher levels, while local UPGRADE/WRITE-MISS messages are passed to lower levels, sort of like a write-through scheme. This has the advantage of higher levels dealing with most processor requests.

In exclusive cache schemes, all levels need to snoop on UPGRADE/WRITE-MISS messages, and for a directory, messages go to all levels.

In current Intel architectures, L1 and L2 are exclusive caches, and L3 is inclusive.


Hardware Transactional Memory

Transactional memory allows memory operations to be organized into transactions, much like database transactions. Compared to locks, it's a simpler paradigm and is easier to reason about.

There are some research implementations of software transactional memory, but they generally result in poor performance. Hardware implementations can be a lot faster

For our purposes, a transaction is a set of instructions that either all succeed (runs normally), or none do (all those instructions have no effect). In other words, an atomic block of instructions. Our goal is to allow transactions to execute in parallel.

Reads in transactions are tracked in the read set (RS) buffer, and writes are tracked in the write set (WS) buffer (reads/writes aren't actually executed, just tracked). A conflict is when a thread writes to a memory location that is in another transaction's RS or WS.

Locks are pessimistic parallism - they optimize for the case in which a conflict occurs, by preventing conflicts from happening in the first place. Transactions are optimistic parallelism - they optimize for the case in which no conflicts occur, by recovering from conflicts when we encounter them.

Transactional memory example in C++:

#include <random>
#include <thread>

using namespace std;

#define NUM_THREADS 10
#define NUM_ACCOUNTS 1000

int balance[NUM_ACCOUNTS];

void func() {
    random_device rnd;
    for (int i = 0; i < 1000000; i ++) {
        int a1 = rnd() % NUM_ACCOUNTS;
        int a1 = rnd() % NUM_ACCOUNTS;
        int amount = rnd() % 100;

        // this block executes atomically, as a single transaction
        __transaction_atomic {
            if (amount > balance[a1]) { amount = balance[a1]; }
            balance[a1] -= amount;
            balance[a2] += amount;

int main() {
    // initialize every account to have balance 100
    for (int i = 0; i < NUM_ACCOUNTS; i ++) { balance[i] = 100; }

    thread thr[NUM_THREADS];
    for (auto &t : thr) { t = thread(func); }

This compiles with experimental features of GCC, which implements software transactional memory. However, without hardware support, this turns out to have half the performance of the mutex-based version. The main issue is that resolving conflicts is really expensive in hardware.

Herlihy and Moss proposed the first hardware transactional memory architecture in 1993. This architecture added a transactional cache in parallel to the L1 data cache, and improved performance over software transactional memory by an order of magnitude.

The first commercially available hardware transactional memory implementation was the IBM POWER8 in 2013, followed by the Intel Haswell chips later that year. The POWER8 processor's HTM implementation was complicated by its cache structure, and the fact that it had 8 hardware threads per core.

A typical implementation of HTM looks like this:


On the IBM's POWER8 architecture, the caches are write-through, and the L2 cache has a special region of content-addressible memory called transactional memory with 64 entries, each one storing a block address, memory transaction state (load/store/invalid), and ref bits. The L2 cache is inclusive, while the L3 cache is non-inclusive. IBM publishes reference guides with more details and code examples online.

The tbegin and tend instructions must implicitly have a memory barrier - they always need to wait for all memory operations in progress to finish before they can run.

Transactional Lock Elision (TLE) is a technique where we replace lock acquire/release operations with transaction begin/end operations. If there's not going to be a lot of lock contention (lock acquisition doesn't block that often), this can improve parallelism.

Memory Consistency

Memory consistency deals with the ordering of memory references between processors.

Suppose we have two threads, one running A = 5; if (B == 0) f();, and the other running B = 5; if (A == 0) f();. Clearly, within a thread we can freely re-order instructions, because there are no intra-thread data dependencies. However, between threads, if we reorder the instructions in each thread, f can get called twice.

Sequential consistency is the property that the result of any execution of some parallel code is the same as if the operations on all processors were executed in a sequential order, and the operations run by each processor are in their original order specified in the program within this sequence of operations. Essentially, this means that loads and stores must execute in order.

However, running loads/stores in order can significantly reduce performance. Instead, we might use relaxed consistency models - like sequential consistency, but we're allowed to reorder certain sequences of reads/writes (e.g., we might allow writes in different threads to be re-ordered). Note that within a thread, they still appear in order. These are generally used over sequential consistency models in the real world.

Total Store Ordering (TSO) is one of these relaxed consistency models, where a read after a write can actually happen as a write after a read, if the two operations occur on different threads. In this scheme, the order of writes to memory by multiple processors is seen in the same order by all processors. This is used by SPARC and x86.

Weak Ordering (TSO) is another, where all sequences of writes/reads are re-orderable, but two loads are never re-ordered if the result of the first load is used to calculate the address of the second load, and two writes to the same location by the same processor are never re-ordered. This is used by POWER and ARM.

Multithreaded Processors

Suppose we have a chip capable of doing 6 dispatches/commits per cycle, and two threads to run on the chip. How do we divide thread operations between chip resources?

In all of these cases, the hardware maintains the context for all running hardware threads (e.g., registers, RATs, etc.).

The IBM POWER8 architecture can operate in 1-way (single-threaded), 2-way, 4-way, or 8-way SMT modes.


Data-level Parallelism

So far we've seen instruction-level parallelism, thread-level parallelism, and data-level parallelism. Now we'll look at data-level parallelism in more detail.

One of the most common data-level parallelism implementations are GPUs - they're designed to do operations on many pieces of data at once. GPUs are good at large-scale parallel data, but they aren't that good at branching. The raw compute power of GPUs make them a several orders of magnitude faster than even the best CPUs, for certain highly parallel tasks.

For real implementations of the POWER8 architecture, about 25% of the die space is integer/float compute units, around 25% is for the load/store unit (including the L1 data cache), and the rest is mostly fetch/scheduling/out-of-order execution hardware.

For a GTX 1080, about 80% of the die space is for functional units and the L1 cache, and the rest is for the L2 cache and scheduling hardware.

GPUs are usually programmed with CUDA or OpenCL. CUDA is proprietary and designed by and supported by Nvidia alone, while OpenCL is an open standard and is supported by all of the major GPU vendors. CUDA came first, and had a significant performance advantage for a while, though OpenCL is catching up.

In the CUDA programming model, the GPU has compute grids, containing blocks, which in turn contain threads:

For Nvidia GPUs, on the die there are hundreds and hundreds of cores, each core containing a float unit, integer, unit, operand collector, and result queue.


Many-core architectures combine many simple cores onto a single core. They're easier to program than GPUs, since each core is a lot more CPU-like.

For example, the Intel Phi architecture has a VPU (Vector Processing Unit), with 512 bit operands, 32 vector registers (which can hold floats/ints), and 8 mask registers (which determine whether a particular instruction will execute) per thread. Individual cores could run operations on each of these registers independently, allowing us to do things like masked vector add in parallel:

for (int i = 0; i <n; i ++) {
    if (residual[i] > epsilon) {
        x[i] += correction[i];

In the VPU architecture, this entire loop might be able to execute in parallel.

The Intel Phi 7120p is clocked like a GPU, with current offerings with gets 2416 GFLOPS and 1.2 GHz clock.

The final should be half problem solving based on content, and half questions about concepts from the course. Sample final will be posted today. Questions will include:

Overview of assignment 2 questions 1 and 3.

Creative Commons License This work by Anthony Zhang is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Copyright 2013-2017 Anthony Zhang.