Lecture Notes by Anthony Zhang.

\[ \newcommand{\set}[1]{\left\{ #1 \right\}} \newcommand{\tup}[1]{\left\langle #1 \right\rangle} \newcommand{\abs}[1]{\left\lvert #1 \right\rvert} \newcommand{\floor}[1]{\left\lfloor #1 \right\rfloor} \newcommand{\ceil}[1]{\left\lceil#1 \right\rceil} \newcommand{\mb}[1]{\mathbb{#1}} \newcommand{\rem}{\operatorname{rem}} \newcommand{\sign}{\operatorname{sign}} \newcommand{\imag}{\boldsymbol{i}} \newcommand{\dee}{\mathop{}\!\mathrm{d}} \newcommand{\lH}{\overset{\text{l'H}}{=}} \newcommand{\evalat}[1]{\left.\left(#1\right)\right|} \newcommand{\sech}{\operatorname{sech}} \newcommand{\spn}{\operatorname{Span}} \newcommand{\proj}{\operatorname{proj}} \newcommand{\prp}{\operatorname{perp}} \newcommand{\refl}{\operatorname{refl}} \newcommand{\magn}[1]{\left\lVert #1 \right\rVert} \newcommand{\rank}{\operatorname{rank}} \newcommand{\sys}[2]{\left[ #1 \mid #2\hskip2pt \right]} \newcommand{\range}{\operatorname{Range}} \newcommand{\adj}{\operatorname{adj}} \newcommand{\cof}{\operatorname{cof}} \newcommand{\diag}{\operatorname{diag}} \newcommand{\formlp}{\operatorname{Form}(\mathcal{L}^P)} \]

CS450

Computer Architecture.

Andrew Morton
Section 001
Email: andrew.morton@uwaterloo.ca
Website: https://www.student.cs.uwaterloo.ca/~cs450/w17/outline.shtml
Office Hours: Tuesdays/Thursdays 11:30am-12:30pm in EIT-4015
Tuesdays/Thursdays 1:00pm-2:20pm

3/1/17

In this course, the main project (worth 30%) is to design a MIPS processor in Verilog. There are no required textbooks. The main topic is how modern CPUs manage to achieve the performance that they do, building on top of the processor basics covered in earlier processor design courses.

There are also two written assignments (worth 20% total) and the final exam (worth 50%). Assignments have 4 grace days allowed.

Hardware description languages are used for modelling hardware, generally for the purpose of simulation or synthesis. Most complex digital hardware is designed using a hardware description language, which is then compiled and converted into specifications for FPGAs, ASICs, programmable logic, and so on.

The main hardware description languages in use today are System Verilog and VHDL. VHDL is more verbose and Ada-like, while Verilog is more minimalist and C-like. Most projects in industry use Verilog for this reason. System Verilog is a superset of plain Verilog, and all Verilog code works in System Verilog. There's also SystemC, which looks a lot more like C/C++, but it's not commonly used. In this course, we will be using Icarus Verilog version 10 or later (in order to get SystemVerilog support).

The simplest/lowest-level of hardware modelling in System Verilog is gate-level modelling. At this level, we have logic gate primitives like AND/OR, and we can connect them together into circuits via wires:

// `a1` is the name of the gate, which is optional but useful for labelling gates
// `y` is the gate output, and `a`/`b`/`c` are the gate inputs (`and` supports one or more inputs, and exactly one output)
// the inputs and outputs are wires, identified by name - everywhere that same name is used can be thought of as being all connected together by a physical wire
and a1(y, a, b, c);

// every argument except the last is one of the NOT gate's outputs, and the last one is the input
not (y1, y2, a);

Gate-level modelling

In gate-level modelling, signal/wire values can be 0 (low), 1 (high), x/X (unknown/ignored value), or z/Z (high impedance).

Gate-level modelling for a full adder (addend bits as a/b, carry-in as c_in, carry-out as c_out, sum as sum):

// full_adder_gate_level_modelling.sv
// for a full adder, `sum = a XOR b XOR c_in`, and `c_out = ((a XOR b) AND c_in) OR (a AND b)`

module full_adder(input a, b, c_in, output c_out, sum);
    // a `logic` is basically a connection/wire - here, we're declaring our wires
    // a `wire` is equivalent to a `wire logic`, which is a logic that supports multiple drivers
    // generally, we just use logic unless we need multiple drivers - the compiler will choose the best structure depending on how they're used
    // there are also 2-state types like `bit` (a single bit) and `byte` (8 bits), whose bits can only be 0 or 1, not X or Z
    logic w1, w2, w3; // these are wires that are used for connecting gates in the full adder together
    xor x1(w1, a, b);
    xor x2(sum, w1, c_in);
    and a1(w2, a, b);
    and a2(w3, w1, c_in);
    or o1(c_out, w2, w3);
endmodule

This is a design module. It fully specifies the full adder as a self-contained unit.

While this looks like normal procedural code, with various function calls, it is more intuitive to think of this as a series of logical declarations - as if each of the xor/and/or lines were all continuously running at the same time.

A test bench is a set of definitions that instantiates the design module, and simulates the circuit under different input conditions to test for functional correctness. For our full adder example:

// full_adder_test_bench.sv
module full_adder_test_bench;
    logic carry_in, x, y;
    logic carry_out, z;

    // "dut" stands for "device under test"
    // below, we're basically instantiating a full adder (defined in the code above) that's connected to our logics, for testing purposes
    // the `.name(other_name)` syntax is a "named port map", which lets us connect logics to devices by signal name
    // alternatively, we can just map the signals in order, like `full_adder dut(x, y, carry_in, carry_out, z)`, but this can cause issues if we change the inputs/outputs of the full adder
    full_adder dut(.c_out(carry_out), .sum(z), .c_in(carry_in), .a(x), .b(y));

    // this is a procedural block - inside, the statements are evaluated sequentially from top to bottom
    initial begin // run once at time 0
        carry_in = 0; x = 0; y = 0;
        // delays must be used to ensure statements actually run sequentially - statements in between delays all execute in parallel
        #10 x = 1; // delay 10 time units, then continue and set x to 1
        #10 y = 1; // delay 10 time units, then continue and set y to 1
        #10 carry_in = 1; // delay 10 time units, then continue and set carry_in to 1
        #10 $stop; // delay 10 time units, then stop the simulation (this will not compile if trying to generate hardware from this)
    end

    // now we want to dump the signals out into waveforms so we can see them
    initial $dumpvars(0, full_adder_test_bench); // the first parameter is the dump level (`0` means "dump all variables in the module and in the modules that instantiate it"), and the rest of the parameters are the module or variables to dump
endmodule

To compile, run iverilog -g2005-sv -s full_adder_test_bench -o full_adder.vvp full_adder.sv full_adder_test_bench.sv (the -s flag names the top-level module, not a file).

To simulate, run vvp -n full_adder.vvp -lx2.

To view the waveforms, run gtkwave dump.lx2.

5/1/17

Live demonstration of the above Verilog workflow, including an overview of GtkWave's features and a shell script that builds and simulates the module.
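The script itself wasn't included in the notes; a minimal version, assuming the file names from the full adder example above, might just chain the three commands from earlier:

#!/bin/sh
# build_and_simulate.sh - hypothetical helper script chaining the compile/simulate/view commands above
set -e
iverilog -g2005-sv -s full_adder_test_bench -o full_adder.vvp full_adder.sv full_adder_test_bench.sv
vvp -n full_adder.vvp -lx2
gtkwave dump.lx2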

Hierarchical design is the idea of using lower-level modules in higher-level modules. For example, a carry-lookahead adder module might instantiate a few ripple-carry adders, which in turn might instantiate a bunch of full adders:

// adder_4_bit.sv
module adder_4_bit(output c_out, output [3:0] sum, input c_in, input [3:0] a, b);
    logic [3:1] carry;
    full_adder fa0(.c_out(carry[1]), .sum(sum[0]), .c_in(c_in), .a(a[0]), .b(b[0]));
    full_adder fa1(.c_out(carry[2]), .sum(sum[1]), .c_in(carry[1]), .a(a[1]), .b(b[1]));
    full_adder fa2(.c_out(carry[3]), .sum(sum[2]), .c_in(carry[2]), .a(a[2]), .b(b[2]));
    full_adder fa3(.c_out(c_out), .sum(sum[3]), .c_in(carry[3]), .a(a[3]), .b(b[3]));
endmodule

Note that there are a lot of similar statements for the full adder instantiations. In these cases, we can use generate statements to remove duplicated code. A generate statement looks somewhat like a loop, but it's more of a macro that gets expanded into multiple statements at compile time:

// adder_4_bit_generate.sv
module adder_4_bit(output c_out, output [3:0] sum, input c_in, input [3:0] a, b);
    logic [4:0] carry;
    
    generate
        genvar i;
        for(i=0; i<4; i=i+1) begin: adder
            full_adder fa(.c_out(carry[i+1]), .sum(sum[i]), .c_in(carry[i]), .a(a[i]), .b(b[i]));
        end
    endgenerate

    // since it would be really inconvenient to specify our c_in and c_out in the generate statement, we just used extra indices in the carry wire, and then assign the inputs/outputs accordingly
    assign carry[0] = c_in; // this is a continuous assignment statement - it's sort of like a one-way wire for the c_in signal in this case, but it can also do things like combinational logic with different gates and stuff
    assign c_out = carry[4]; // since the assignment is unidirectional, the inputs are on the right side of the equal sign, and the outputs are on the left side
endmodule

For loops are usually discouraged, but they're fine inside generate statements - in other contexts, they might not compile to actual hardware. Generate statements also pair nicely with parameters, as sketched below.
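As a sketch (parameters are covered in more detail later, and the module name here is made up), a width-generic version of the ripple-carry adder above might look like:

// adder_n_bit.sv - hypothetical width-generic version of the adder above
module adder_n_bit #(parameter N = 8) (output c_out, output [N-1:0] sum, input c_in, input [N-1:0] a, b);
    logic [N:0] carry;

    generate
        genvar i;
        for (i = 0; i < N; i = i + 1) begin: adder
            full_adder fa(.c_out(carry[i+1]), .sum(sum[i]), .c_in(carry[i]), .a(a[i]), .b(b[i]));
        end
    endgenerate

    assign carry[0] = c_in;
    assign c_out = carry[N];
endmodule

A 16-bit instance would then be written as `adder_n_bit #(.N(16)) add0(c_out, sum, c_in, a, b);`.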

Dataflow modelling

Dataflow modelling is the next level of abstraction above gate-level modelling. Dataflow modelling deals only with combinational circuits, by representing them with continuous assignment statements.

Basically, we represent what logical operations the circuit needs to do, and the compiler will figure out what kind of gate layout we need. For example, for a full adder it might automatically decide to reuse the xor(a, b) gate.

For example, we might model a full adder using dataflow modelling as follows:

// full_adder_dataflow.sv
module full_adder(output c_out, sum, input c_in, a, b);
    // the value of `sum` is continuously updated based on the values of `a`, `b`, and `c_in` - whenever any of those change, `sum` reflects the new value
    // this continuous updating is exactly what makes it a continuous assignment - procedural assignments (covered later) only run when their block is triggered
    assign sum = a ^ b ^ c_in;
    assign c_out = ((a ^ b) & c_in) | (a & b);

    // actually, Verilog has a `+` operator, so it's really as simple as `assign {c_out, sum} = a + b + c_in`
    // `assign` will truncate the result of evaluating the right hand side, if the right hand side has more bits than the left
    // `assign` will zero-extend the result of evaluating the right hand side, if the right hand side has fewer bits than the left
endmodule

Some of the available operators are ! (logical NOT), && (logical AND), || (logical OR), + (two's complement addition), - (two's complement subtraction), * (multiplication), / (division), % (remainder), ** (exponentiation), <, >, <=, >=, ==, !=, ~ (NOT), & (AND), | (OR), ^ (XOR), ^~ (XNOR), << (bit shift left), >> (bit shift right), >>> (signed shift right - extends sign bit), {a_1, ..., a_n} (concatenate a_1 through to a_n), {n{a}} (replicate a by n times - repeatedly concatenate a), c?a:b (MUX - select a if c is high, b otherwise). They generally have precedence similar to their equivalents in C and Python.

For example, the nibbles of a byte can be swapped with assign {b[3:0], b[7:4]} = a[7:0], and an 8-bit signed integer can be sign-extended to 16 bits with assign b[15:0] = {{8{a[7]}}, a[7:0]}.

With all these useful operators, it's pretty straightforward to make a 4-bit adder:

// adder_4_bit_dataflow.sv
module adder_4_bit(output c_out, output [3:0] sum, input c_in, input [3:0] a, b);
    assign {c_out, sum} = a + b + c_in;
endmodule

Behaviour modelling

Behaviour modelling is the next level of abstraction above dataflow-level modelling. It uses procedural blocks, which are either initial or always blocks.

An initial block runs the statements inside of it once starting at time 0. An always block runs the statements inside of it whenever certain things change.

initial // run once at time 0
always @ (sensitivity list) // run whenever a signal in the sensitivity list changes

Here's a full adder implemented with behaviour modelling:

module full_adder(input c_in, a, b, output logic c_out, sum);
    // the `@ (c_in, a, b)` specifies the "sensitivity list", which means the procedural block should be reevaluated whenever any of the logics in the list change
    // in contrast, the `initial` statement doesn't have a sensitivity list because it simply runs once at the beginning
    // the left hand sides of each assignment must be a `logic` or `reg` - that's why the outputs of this module are declared as `logic`
    // we could also have written `always @ *` to make it re-evaluate the block whenever any of the signals on the right hand sides of any assignment statement changes
    always @ (c_in, a, b) begin
        sum = a ^ b ^ c_in;
        c_out = ((a ^ b) & c_in) | (a & b);
    end
endmodule

Alternatively:

module full_adder(input c_in, a, b, output logic c_out, sum);
    always @ (c_in, a, b) begin
        {c_out, sum} = a + b + c_in;
    end
endmodule

10/1/17

First assignment is posted, it shouldn't take more than 5 minutes to complete.

For behaviour modelling, we have a lot of useful SystemVerilog constructs, like if/else statements, case statements, and loops.

Note that with if statements and other conditionals, if we don't explicitly set the value of a logic in some branch, then it retains its current value.

Let's implement a mux:

module mux_2_input(output logic d, input i0, i1, select);
    always @ *
        case (select)
            0: d = i0;
            1: d = i1;
            default: d = 1'bz; // the default case handles unknown and high impedance - we'll just turn the output off (high impedance)
        endcase
endmodule

Synchronous Circuits

Recall that a synchronous circuit is one that only updates on a clock edge. In this course our synchronous circuits are updated on the rising edge of the clock (CMOS is actually clocked on the falling edge).

A test bench for a synchronous circuit needs to generate a clock signal. This might look like the following:

module synchronous_test_bench;
    logic clk;
    initial begin
        clk = 1;
        forever #5 clk = ~clk;
    end
endmodule

Blocks can be triggered on edge events, like always @ (posedge clk) (trigger on rising edge) or always @ (negedge clk).
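For example, a D flip-flop with synchronous reset might be modelled as follows (a sketch, not from the lecture):

module d_flip_flop(output logic q, input d, clk, reset);
    always @ (posedge clk) // only re-evaluated on the rising edge of the clock
        if (reset) q <= 0;
        else q <= d; // capture `d` into `q` on each rising edge
endmodule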

SystemVerilog actually has two assignment operators. = is a blocking assignment, where all updates happen in order (we use this for combinational logic). <= is non-blocking assignment, where all the right hand sides are computed before the updates happen (this is used for register-like or latched output). For example, a <= b; b <= a swaps a and b, while a = b; b = a just sets a to b. Generally, we prefer <= everywhere outside of combinational logic, since we usually want register outputs to all update at once, at the end of a clock cycle.
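A sketch illustrating the difference (the module and signal names are made up for illustration):

module assignment_demo(input clk);
    logic a = 0, b = 1; // these swap every cycle: both right hand sides are sampled before either update happens
    logic c = 0, d = 1; // these end up equal: `c = d` finishes before `d = c` reads `c`
    always @ (posedge clk) begin
        a <= b;
        b <= a;
    end
    always @ (posedge clk) begin
        c = d;
        d = c;
    end
endmodule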

System Verilog uses discrete event simulation. Simulations consist of concurrent processes: primitives, modules, procedural blocks, and continuous assignments. Changes to signals are update events; processes that are sensitive to an update event are scheduled on the event queue and evaluated, possibly producing further update events.

System Verilog lets us define constants with parameter some_constant = 8. This is often used to define how many bits a particular element is. There's also enumerations, like enum bit[1:0] {IDLE, REQUEST, WAIT, RECEIVE} state. Logics can also be defined as arrays, like logic [7:0] some_array [0:3]; (an array of 4 8-bit elements), which are accessed with y = some_array[2] and so on.
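A sketch showing these constructs together (the names here are invented for illustration):

module constructs_demo;
    parameter WIDTH = 8; // constant, usable in the width declarations below
    enum bit [1:0] {IDLE, REQUEST, WAIT, RECEIVE} state; // 2-bit enumerated state variable
    logic [WIDTH-1:0] some_array [0:3]; // array of 4 WIDTH-bit elements
    logic [WIDTH-1:0] y;
    initial begin
        state = IDLE;
        some_array[2] = 8'h2A;
        y = some_array[2]; // y is now 8'h2A
    end
endmodule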

System Verilog has various "system tasks", which trigger actions during simulations. For example, $display(...) outputs to console, $monitor(...) outputs to console whenever the given signals are updated, and $stop stops the simulation.
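A sketch of these system tasks in a test-bench context (assumed usage, not from lecture):

module system_tasks_demo;
    logic a;
    initial $monitor("time=%0t a=%b", $time, a); // re-printed automatically whenever `a` changes
    initial begin
        a = 0;
        #10 a = 1;
        #10 $display("done at time %0t", $time); // one-off console output
        $stop; // stop the simulation
    end
endmodule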

12/1/17

Complex number multiplier:

// recall that given two complex numbers $a$ and $b$, $p = a \times b = (\Re(a)\Re(b) - \Im(a)\Im(b)) + \imag (\Re(a)\Im(b) + \Im(a)\Re(b))$
// this needs 4 multiplications, 1 subtract, and 1 add
// we will use a synchronous circuit with one multiplier and one adder/subtractor to do this, in order to demonstrate synchronous design, and also to save space (at the expense of it being slower)
// therefore, we break the operation into 5 steps, each one only using one multiplication or one addition: `pp1 <= a_real * b_real`, `pp2 <= a_imag * b_imag`, `p_real <= pp1 - pp2`, `pp1 <= a_real * b_imag`, `pp2 <= a_imag * b_real`, and `p_imag <= pp1 + pp2`, which we can implement with an FSM
module complex_multiplier(
    output logic signed [7:-24] p_r, p_i, // this defines `p_r` and `p_i` as fixed-point numbers with 24 bits to the right of the decimal point, and 7 bits to the left (actual value is the value of `p_r`/`p_i` as a two's complement integer, divided by $2^{24}$)
    input signed [3:-12] a_r, a_i, b_r, b_i,
    input clk, reset, go // the clock input, reset signal, and the trigger that starts the multiplication
);
    logic a_sel, b_sel, pp1_en, pp2_en, sub, p_r_en, p_i_en;
    logic signed [3:-12] a_operand, b_operand;
    logic signed [7:-24] pp, sum; // partial product, sum
    logic signed [7:-24] pp1, pp2;
    
    // multiplier with two multiplexers for input, to select inputs between real/imaginary components of `a` and `b`
    assign a_operand = ~a_sel ? a_r : a_i;
    assign b_operand = ~b_sel ? b_r : b_i;
    assign pp = a_operand * b_operand;

    // update pp1 and pp2 registers on clock edge
    always @ (posedge clk)
        if (pp1_en) pp1 <= pp;
    always @ (posedge clk)
        if (pp2_en) pp2 <= pp;
    
    assign sum = ~sub ? pp1 + pp2 : pp1 - pp2;
    
    // update output registers on clock edge
    always @ (posedge clk)
        if (p_r_en) p_r <= sum;
    always @ (posedge clk)
        if (p_i_en) p_i <= sum;
    
    // finite state machine to drive the above hardware
    enum bit [2:0] { step0, step1, step2, step3, step4 } current_state, next_state;
    always @ (posedge clk or posedge reset)
        if (reset) current_state <= step0;
        else current_state <= next_state;
    always @ * begin // update whenever go or current_state change
        case (current_state)
            step0: next_state = go ? step1 : step0;
            step1: next_state = step2;
            step2: next_state = step3;
            step3: next_state = step4;
            step4: next_state = step0;
        endcase
    end
    always @ (current_state, reset) begin
        a_sel = 0; b_sel = 0; pp1_en = 0; pp2_en = 0;
        sub = 0; p_r_en = 0; p_i_en = 0;
        case (current_state)
            step0: begin pp1_en = 1; end // latch pp1 <= a_r * b_r while idle (a_sel and b_sel default to 0)
            step1: begin a_sel = 1; b_sel = 1; pp2_en = 1; end
            step2: begin b_sel = 1; pp1_en = 1; sub = 1; p_r_en = 1; end
            step3: begin a_sel = 1; pp2_en = 1; end
            step4: begin p_i_en = 1; end
        endcase
    end
endmodule
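A test bench for this module, along the same lines as the full adder test bench earlier (the input values here are arbitrary), might look like:

// complex_multiplier_test_bench.sv
module complex_multiplier_test_bench;
    logic clk, reset, go;
    logic signed [3:-12] a_r, a_i, b_r, b_i;
    logic signed [7:-24] p_r, p_i;

    complex_multiplier dut(.p_r(p_r), .p_i(p_i), .a_r(a_r), .a_i(a_i), .b_r(b_r), .b_i(b_i), .clk(clk), .reset(reset), .go(go));

    initial begin
        clk = 0;
        forever #5 clk = ~clk; // 10 time unit clock period, rising edges at 5, 15, 25, ...
    end

    initial begin
        reset = 1; go = 0;
        a_r = 16'h1000; a_i = 16'h1000; // a = 1 + 1i (1.0 is 2^12 in [3:-12] fixed point)
        b_r = 16'h2000; b_i = 16'h1000; // b = 2 + 1i
        #10 reset = 0;
        #10 go = 1;
        #10 go = 0;
        #60 $stop; // by now p should be (1*2 - 1*1) + (1*1 + 1*2)i = 1 + 3i
    end

    initial $dumpvars(0, complex_multiplier_test_bench);
endmodule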

Verilog has sized literals for numbers. For example, 16'h0001 means a 16-bit hexadecimal value 1, 1'b1 means the 1-bit binary value 1, and 4'd8 means the 4-bit decimal value 8.

Overview of assignment 1. Assignment involves filling out Verilog module stubs and writing test benches.

17/1/17

Room has been moved to HH 1102, which should be a lot closer to campus than OPT.

Processor Design

There are multiple levels of abstraction when talking about a processor design. From most abstract to least abstract, the main ones are:

In the 1980s, the main focus in processor design was improving single-core performance and clock speed. Today, the focus is on more cores at a lower clock speed, and on doing more with each cycle (modern Intel CPUs can do 4-8 instructions per cycle!). New microarchitecture-level techniques like instruction-level parallelism (parallelism within 1 thread), hyperthreading (parallelism between multiple threads on one core), speculative execution (executing down the predicted side of a branch, and throwing the work away if the branch goes the other way), and out of order execution (rearranging instructions to improve parallelism) have also resulted in significant performance gains. The main issue preventing more single-core performance and clock speed is the power wall.

Dynamic power consumption is the power consumed that depends on the processor workload. When a transistor switches, it uses energy proportional to the capacitive load times the voltage squared. So the power used (switching frequency times switching energy) increases linearly with the switching frequency, resulting in increased cooling requirements and worse battery life. One technique modern processors use to get around this is to reduce the switching voltage, which quadratically reduces the switching energy required - for example, a 15% reduction in voltage results in a roughly 28% reduction in switching energy.
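Concretely, the usual model (stated here as a sanity check, not verbatim from lecture) is

\[ P_{dynamic} \propto C V^2 f \]

so scaling the voltage by \(0.85\) scales the switching energy by \(0.85^2 = 0.7225\), a roughly 28% reduction, matching the figure above.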

There's also static power consumption, the power consumed regardless of the processor workload. As transistors get smaller, the leakage current increases (due to quantum tunneling in the transistor junctions), which means more power used per transistor.

Modern techniques involve thread-level parallelism (doing multiple tasks at once), and data-level parallelism (doing something to multiple pieces of data at once).

Flynn's taxonomy characterizes four classes of parallelism:

  1. SISD - single instruction stream, single data stream (a traditional uniprocessor).
  2. SIMD - single instruction stream, multiple data streams (vector units; data-level parallelism).
  3. MISD - multiple instruction streams, single data stream (rarely used in practice).
  4. MIMD - multiple instruction streams, multiple data streams (multicore/multiprocessor systems; thread-level parallelism).

Recall MIPS:

19/1/17

Sometimes instructions in a pipeline depend on others. There are three main types of these dependencies: register dependencies (an instruction reads a register that an earlier instruction writes), memory dependencies (an instruction accesses a memory location that an earlier instruction writes), and control dependencies (whether an instruction runs at all depends on an earlier branch).

When instructions aren't dependent on each other, we can more-or-less freely rearrange them. This allows efficient pipelining and parallelism.

Total sequential execution (TSE) is an assumption in our programming model that one instruction finishes before the next starts. TSE is sufficient but not necessary for semantic correctness. In fact, semantic correctness can be satisfied just by satisfying the inter-instruction dependencies of the original program - we can otherwise rearrange and overlap instructions however we want.

Since we have five stages (fetch, decode, execute, memory, writeback), an instruction that writes registers takes 4 cycles after it's been fetched before the write actually takes effect, assuming 1 cycle per pipeline stage. Likewise, an instruction that reads registers takes 1 cycle after fetching before actually performing the read.

A control dependency is basically a special case of a register dependency, where the register is the program counter. An instruction that branches (writes to the program counter) has its write take effect 1 cycle after being fetched (the MIPS datapath reads PC in fetch stage and writes in decode stage).

In the same way, an instruction that writes to memory has a memory dependency, and has its write take effect 3 cycles after being fetched.

Therefore, if we have two instructions \(A, B\) fetched at clock cycles \(i\) and \(j\), respectively, such that \(j > i\):

  1. If \(B\) reads a register that \(A\) writes, we need \(j \ge i + 4\), since the write only takes effect 4 cycles after \(A\) is fetched, while the read happens 1 cycle after \(B\) is fetched.
  2. If \(A\) is a branch (writes the program counter), we need \(j \ge i + 2\), since the write takes effect 1 cycle after fetch and the PC is read at fetch.
  3. If \(B\) reads a memory location that \(A\) writes, both accesses happen in the memory stage (3 cycles after fetch), so any \(j > i\) already satisfies the dependency.

A hazard is a potential for a violated dependency (and therefore, violation of semantic correctness). In our pipelined, statically scheduled CPU, the hazards are a 3-cycle register read-after-write, and a 1-cycle control read-after-write.

To avoid hazards, we can insert no-op instructions - 3 NOP instructions after each register read-after-write, or 1 after every control read-after-write. These can be inserted either by the compiler, or the CPU can stall for the necessary number of cycles when the relevant dependencies are detected. Compilers can often do better than inserting NOPs, by filling those slots with useful, independent instructions instead (instruction scheduling, covered later).

However, a 3-cycle stall for a 3-cycle register read-after-write is quite long. We can shorten register/memory dependency stalls by using data forwarding - routing the output of later stages directly to earlier stages to avoid the need for a write/read - earlier stages can read directly from the later stages.

For our purposes, we can route the output of the memory stage to the execute stage (for \(j = i + 1\)), the output of the writeback stage to the execute stage (for \(j = i + 2\)), and the output of the writeback stage to the decode stage (for \(j = i + 3\)). With that in place, each stage always has the data it needs, so we never have to stall.

To implement data forwarding in each stage, we can use multiplexers that choose between either reading/writing registers, or taking input directly from later stages.
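A sketch of such a bypass multiplexer in SystemVerilog (the signal names and select encoding here are invented, not the project's actual datapath):

// chooses the execute stage's operand: normally from the register file, but forwarded
// from a later pipeline stage when hazard detection sees a pending write to that register
module forward_mux(
    output logic [31:0] operand,
    input [31:0] reg_file_value, mem_stage_value, wb_stage_value,
    input [1:0] forward_select // driven by hazard-detection logic comparing register numbers
);
    always @ *
        case (forward_select)
            2'b01: operand = mem_stage_value;  // producing instruction is 1 ahead in the pipeline
            2'b10: operand = wb_stage_value;   // producing instruction is 2-3 ahead
            default: operand = reg_file_value; // no hazard - read the register file normally
        endcase
endmodule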

Turns out, this can't eliminate all our stalls, though it does make a lot of them much shorter. Consider lw R1, (R2) followed by add R4, R1, R1 - the load-word instruction doesn't have its result until the memory stage is done, so we must stall the add for 1 cycle so we can forward it from the write-back stage to the execute stage.

To solve the control dependency read-after-write stall (when we have a conditional branch), we might assume that the branch won't happen, keep executing the next instruction, and then squash/suppress it in the pipeline if we do end up taking the branch.

MIPS uses a different approach - the CPU will execute the instruction after the branch (the instruction after the branch is in a position called the branch delay slot), regardless of whether the branch is taken or not. The compiler will generally try to fill the branch delay slot with an instruction that doesn't depend on the branch, or NOP if it can't find one.

24/1/17

Recall that in static scheduling, instructions are executed in the order that they appear in, and the compiler is responsible for arranging instructions in a way that avoids hazards.

Code example with instruction reordering by the compiler for hazard avoidance.

The cycles per instruction (CPI) is a measure of CPU performance, and is ideally somewhere around 1 for our scalar processor. However, penalty cycles caused by hazards will increase this. For example, if we have 6 instructions that incur 3 penalty cycles, we get \(\frac{6 + 3}{6} = 1.5\) as the CPI.

The program execution time is \(\frac{n \times \text{CPI}}{f}\), where \(n\) is the number of instructions executed, and \(f\) is the clock frequency.
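For example, the 6-instruction sequence above with CPI 1.5, on a hypothetical 1 GHz clock, takes

\[ \frac{6 \times 1.5}{10^9 \text{ Hz}} = 9 \text{ ns}. \]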

Local scheduling is when the compiler rearranges instructions within a basic block. A basic block is a sequence of consecutive instructions such that if one instruction is executed, all of them are - that means only the first instruction can be the target of a branch instruction, and only the last instruction can be a branch instruction (or the second to last for MIPS, due to the branch delay slot).

To perform local scheduling, we first identify the hazards in the code, leaving slots to put other instructions in according to how many penalty cycles are present. For example, if we have a load followed by a dependent add instruction, we might have 1 penalty cycle in our architecture, so we would leave one slot for an instruction. Then, we try to fill in the slots with other instructions while keeping dependencies satisfied.

Global scheduling is when the compiler rearranges instructions across multiple basic blocks.

A common example of this is loop unrolling, where the body of a loop known to execute \(n\) times is simply duplicated \(n\) times - this reduces branching overhead, and makes local scheduling better since it may result in larger basic blocks. While loop unrolling can make our code faster, it results in larger programs, which might cause more instruction cache misses. The compiler must make the tradeoff depending on what would be faster.

26/1/17

Project part 1 marks come out on Monday. Part 2 is out on the course website, building a 5-stage MIPS pipeline. In-class overview of assignment.

In the scalar pipelines we are looking at, exceptions are events that alter program flow. Exceptions are harder to handle than normal program flow because they can occur at any time, and we must make sure registers and memory are properly stored/restored to ensure the TSE assumption is upheld.

Exceptions can be generated by things like interrupts from peripherals, unusual cases for some instructions (invalid opcode, page fault), and trap instructions (instructions that invoke the OS).

Precise exceptions are those that are recoverable - we can return to the thread after the exception is handled.

What if we have multiple instructions that generate exceptions simultaneously in the pipeline? For example, we might have a load instruction that tries to dereference a null pointer at the memory stage, while we're trying to decode an invalid instruction in the decode stage - the load instruction generates an exception from the memory stage, and the invalid instruction generates one at the decode stage. To make sure we handle all instructions in the pipeline, we only mark an instruction as having caused an exception, then actually handle it in the write-back stage.

There are three types of exceptions:

  1. Interrupts - asynchronous events from outside the processor, like peripherals signalling for attention.
  2. Faults - unusual cases triggered by executing an instruction, like invalid opcodes or page faults.
  3. Traps - deliberate invocations of the OS via a trap instruction.

Very Long Instruction Word (VLIW) architectures are architectures that have very large instructions, generally able to issue multiple operations at once. For example, Intel Itanium has 128-bit instructions, able to contain up to 3 operations each.

An architecture that can issue multiple instructions per cycle is known as a superscalar architecture.

VLIW architectures are generally statically scheduled, where each instruction contains multiple scalar operations. They generally don't do hazard detection or even full data forwarding, instead relying on the compiler to schedule operations in a good way.

The advantage of VLIW architectures is that the compiler gets more control over the scheduling and how different functional units (FPUs, ALUs, etc.) are used, so we can devote more transistors to the actual functional units themselves, rather than hazard detection and so on.

Consider a VLIW architecture where each instruction contains two memory operations, two floating point operations, and one ALU/branch operation. Each operation can have its own pipeline, so we'll give memory operations the usual fetch/decode/execute/memory/writeback, floating point operations a fetch/decode/execute 1/execute 2/execute 3/writeback pipeline, and ALU/branch operations a fetch/decode/execute/writeback pipeline. We'll also say branches resolve in the decode stage for all types of operation, so we need 1 branch delay slot (since decode is 1 pipeline stage after fetch).

Fetches occur at the same time for every pipeline - one instruction is fetched for every cycle.

Assuming full data forwarding is implemented, the latency of memory operations could be up to 1 if the execute stage has a dependency on the memory stage, the latency of ALU/branch operations could be up to 1 if the decode stage has a dependency on the execute stage, and the latency of floating point operations could be up to 2 if the execute 1 stage has a dependency on the execute 3 stage.

Slot utilization is how well the available operation slots for each instruction are filled, defined as \(\frac{\text{slots filled}}{\text{number of instructions} \times \text{slots per instruction}}\).

31/1/17

;wip: global scheduling, software pipelining, catch up on the rest, gotta run for interviews

2/2/17

VLIW processors have many techniques for improving throughput and parallelism.

Speculative execution is when we reach a branch and keep executing down the predicted path before the branch resolves, then undo the effects if we don't end up taking the predicted path. Implemented correctly, it eliminates branch delays whenever the branch predictor is right.

Speculative loading is a technique in which the compiler tries to move memory load instructions upward to avoid memory latency, even if they possibly won't be needed, like if a branch isn't taken.

In the IA-64 architecture, the Advanced Load Address Table (ALAT) is a unit of bidirectional associative memory used to implement speculative loading.

In IA-64, we denote speculative instructions with the .s suffix - ld8.s is the speculative version of ld8. We denote the advanced load version of the load instruction with the .a suffix (ld8.a), and the ALAT check version of loads with .c (ld8.c). Each ld8.a instruction is paired up with an ld8.c instruction later on - the ld8.a instruction starts the actual data transfer and records the load in the ALAT, and then when the corresponding ld8.c instruction is encountered, we can just use that result rather than making a trip all the way to memory. ALAT entries are invalidated if their addresses are stored to in between those instructions.

Deferred exceptions are a technique for making speculative execution easier. If we want to hoist an instruction above a conditional branch, and the instruction causes an exception at runtime, we need a way to mark the exception so that it is only handled if the conditional branch isn't taken, in order to make sure the hoisted code behaves the same as the original.

Consider a LW R1, (R5) instruction, which gets hoisted above a BEQZ R4, L1 instruction. To implement deferred exceptions, we can add a poison bit to each register, and if and only if the speculatively executed instruction raises an exception, we set the poison bit instead. Then, when we reach the instruction's original position, we check whether the poison bit is set, and if it is, actually raise the exception there.

;wip: read textbook section 3.5.6

Predicated execution is a technique in which instructions only commit their results if some predicate is true. This is useful because we can often avoid branches and use predicates instead, therefore avoiding branch delays. IA-64, for example, has 64 1-bit predicate registers, which can be set by various operations like comparison, and used to enable/disable different instructions.

Predicated execution essentially lets the compiler turn control dependencies into data dependencies, by letting it change branches into predicated instructions. All of the instructions run, but only the ones with true predicates will actually have an effect. It's very useful when we have lots of small basic blocks.

For example, in IA-64, we can denote p1 = r1 == 0; p2 = r1 != 0 as cmp.eq p1, p2 = 0, r1;;, where p1 to p64 are the predicate registers. Then, we can do (p1) add r4 = r4, 1 to increment r4 if and only if predicate register p1 is 1.

The Itanium architecture, IA-64, was also called an EPIC architecture (explicitly parallel instruction computing - VLIW with data forwarding and dynamic events like cache misses). It has 128 65-bit (64 bits, plus a poison bit, called the Not-A-Thing bit) integer registers, 128 82-bit floating point registers, and 64 1-bit predicate registers.

7/2/17

Out of order pipelines

Out of order pipelines dynamically schedule instructions. The benefit is that they can take advantage of instruction-level parallelism not available at compile time, and allow us to make compilers a lot simpler, at the expense of more complicated CPUs.

When we're designing these pipelines, we need to design around hazards in register data, memory data, and control flow.

Out of order execution is inherently superscalar - there's no point in doing dynamic scheduling unless we have multiple functional units.

Scheduling is usually done using Tomasulo's algorithm, an improved version of previous techniques like scoreboarding. It was invented around 1967 for the floating point unit of the IBM System/360 Model 91.

Tomasulo's FPU has two functional units - a 2-stage pipelined, 2-cycle floating point adder, and a non-pipelined, 3-cycle floating point multiplier. The adder has 3 reservation stations, and the multiplier has 2.

The FPU also has 3 sets of registers: the floating point registers (FLR), the floating point buffers (FLB), which hold data loaded from memory, and the store data buffers (SDB), which hold data waiting to be stored to memory.

The common data bus allows parts of the FPU to broadcast a float along with a 4-bit source tag to specify where it came from (like a reservation station or the FLB).

Units like reservation stations, the FLR, and the SDB hold data and the data's tag.

The key concept here is the reservation station, and the use of tags to track where each pending operand will come from - an instruction waits on the tag of the unit that will produce its value, rather than on a register. This is the original form of register renaming.

Here's the FPU's workflow:

  1. Up to two instructions are dispatched (in their original order) to the reservation station.
  2. Operands are copied to the reservation station.
  3. The destination register is tagged with the reservation station's number, so later instructions that need the result know where it will come from.
  4. Instructions are sent to the functional unit (adder or multiplier) when all of its operands are ready and the functional unit is free.
  5. After computing the result, the functional unit broadcasts the result on the CDB, and waiting units like the FLR, reservation stations, and SDB update their contents.

;wip: what even is this, read https://en.wikipedia.org/wiki/Tomasulo_algorithm#Implementation_concepts

Tomasulo's algorithm has imprecise exceptions. For example, if register F4 is renamed by instruction X in cycle 1 and renamed again by instruction Y in cycle 2, then X never writes its result to register F4, so if X faults, recovery isn't possible, since the semantics require the result be written. There's no real way to recover from this except to restart the computation from an earlier point.

An out of order pipeline looks something like the following:

The re-order buffer (ROB) remembers program order, so we can put things back in order after the Issue stage. At the dispatch stage, each instruction gets an entry in the ROB. At the finish stage, the ROB entry is updated to mark the instruction as finished and to record its result status. At the Commit stage, the ROB entry is removed.

The ROB also keeps track of which instructions raise exceptions, so we can handle them at the Commit stage of the pipeline (this is analogous to the Itanium case, where we used a poison bit to only raise exceptions when we reach the Writeback stage). This ensures that the processor state is correct/precise at the instruction that raised the exception, which allows us to recover from the exception.

The width of an out of order pipeline is the number of instructions dispatched/committed per cycle.

9/2/17

Assignment 1 (not project 1) is due after reading week. It's about scheduling instructions in a pipelined processor.

Register renaming/assignment

Data dependencies within a block of code form a data-flow graph (DFG) in the processor, where each edge \(a \to b\) represents the fact that \(b\) depends on the result of \(a\), so \(b\) shouldn't run before \(a\). To help us run Tomasulo's algorithm, we can draw out the DFG and arrange it to minimize height.

The longest path (also known as the critical path) within the dataflow graph is the minimum number of cycles we can execute those instructions in - this is called the dataflow limit. In other words, with one thread, it's impossible to execute faster than the length of the critical path of true dependencies.

Operations in the CPU produce values, which are then assigned to registers. The live range of a value is the range within the code starting from when the value is first written to a register, up to and including the last use (when the value is last read from the register).

DFGs can also have virtual dependencies between pairs of instructions, called false dependencies, which give us write-after-read or write-after-write hazards if we change the order of those instructions.

Since there are a limited number of registers, the compiler needs to recycle registers - the compiler can use a register for a new value after the current value has been read for the last time. However, reusing registers imposes write-after-read dependencies.

Register renaming is a technique that tries to get rid of false data dependencies, which helps us get better parallelism. The idea is to give each newly produced value its own register, so that only true (read-after-write) dependencies remain.

To implement this, each operation's output value is assigned to a temporary rename register within a rename register file. The actual destination register is associated with the rename register ID, and on the commit step, we can actually copy the value from the rename register to the destination register. If we have enough rename registers, we can essentially have single assignment for each register, eliminating false dependencies and allowing us to improve parallelism.

Register renaming is used by Tomasulo's algorithm.

In hardware, we have an ARF (architected register file), which maps architecture registers (R0-R31 in MIPS) to rename registers. The RRF (rename register file) has one entry for each ROB entry, and each entry holds an instruction's result until it is committed to the ARF.

With a unified physical register file design, we replace the ARF with a RAT (register alias table), which maps registers R0-R31 to architected and rename register entries in the PRF. The RAT has two columns - the retirement RAT (which points to the indices of architected registers in the PRF), and the front-end RAT (which points to the indices of renamed registers in the PRF). In case of an exception, the retirement RAT is copied to the front-end RAT, which undoes any renames done by the instructions after the fault, allowing us to raise precise exceptions.

;wip: read textbook section 3.4.6 (or 5.3? dunno what the numbers mean) and https://en.wikipedia.org/wiki/Register_renaming

The PRF (physical register file) stores the actual contents of the architected and rename registers.

Here's what this looks like:

Delayed read is a technique in which the reservation station only stores the operand tag and the status of the data, rather than the data itself, reducing data duplication. Instruction issue can then be timed to coincide with the broadcast of the result of the operation the instruction depends on.

14/2/17

Register renaming is a technique for undoing the hazards caused by register recycling.

Speculative execution is a technique in which we predict the direction of a branch and start executing that path while the branch is being resolved. However, this presents issues for register renaming, because speculatively executed instructions need to rename registers too - we must make sure the effects of the renaming are undone if the branch turns out to be wrong. To do this, we can do either rollbacks or checkpointing.

To do rollbacks, we save frontend RAT entries into the ROB (reorder buffer) as soon as the speculative instruction is dispatched. To recover upon a mispredicted branch, we can copy back the rename register number from the ROB to the frontend RAT, one at a time, until the branch instruction is actually completed. This works, but it's quite slow because of all the copying we have to do over multiple cycles.

To do checkpointing, we save a full copy of the entire frontend RAT as soon as the speculative instruction is dispatched. The saved copy is restored to the frontend RAT if we mispredict the branch. This can be a lot faster than rollbacks, but it requires more storage and hardware, depending on how many branches we are speculating past.

For Intel Skylake chips, we have 224 ROB entries, 180-entry integer PRF, and 168-entry floating point PRF. Haswell chips support up to 48 checkpoints - the ability to speculatively execute 48 branches at once!

Memory Data Flow

Data in memory includes the data segments, stack, and heap. Loads usually start a chain of instructions, because they have to wait on RAM, which is slow compared to processors. How do these work with pipelined processors?

For store instructions:

  1. Upon dispatch, calculate the effective address and store it with the data in the store buffer (SB).
  2. After commit, write the store buffer entry out to the processor cache (so it gets written to RAM).

For load instructions:

  1. Upon dispatch, calculate the effective address, and read the data from either the store buffer (if available) or the processor cache (so it gets read from RAM).
  2. After commit, broadcast the loaded data on the forwarding bus, so the other units know that the destination register has been updated.

We want to be able to issue load instructions out of order (in particular, ahead of older stores), to improve performance.

Speculative loads are loads performed under the assumption that they are unaliased to recent stores - loads that are done from the cache, not the store buffer. When we do speculative loads, we need to detect if the load gave a valid result, and if not, retry it.

The speculative load process:

  1. Upon dispatch, calculate the effective address. If a committed store is in the store buffer, forward the data along the forwarding bus. Otherwise, get the data from the processor cache. Add the speculative load's location and effective address to the FLB (finished load buffer).
  2. After commit, remove the created entry from the FLB.

The store process, when using speculative loads, becomes:

  1. Upon dispatch, calculate the effective address and store it with the data in the store buffer.
  2. Upon commit, if aliased loads are present in the FLB, suppress those speculative loads. Write the store buffer entry out to the processor cache (so it gets written to RAM).

This ensures that if a speculative load finishes before a store completes, the effect of the load gets undone.

Another technique for speeding up loads is data prefetching, where data gets loaded from RAM into the processor data cache before it's actually needed, so when we actually load it, we can do so directly from the cache. For example, in AMD64 the prefetch instruction does this.

Data prefetching can also be done automatically - hardware prefetchers might detect patterns of sequential access or strided access (access with a linear offset each time, like accessing the diagonals of a matrix, or the columns of a row-major matrix). One downside is that prefetching might bring in data we don't actually want, reducing the cache hit rate. Some systems like SPARC have a separate cache for prefetched data, called the prefetch buffer, to avoid this issue.

16/2/17

Review of caching and memory (see ECE222 notes for more details):

Intel's main competitive advantage nowadays is its superior branch prediction abilities.

In MIPS, control flow happens either as a branch instruction (which does PC-relative addressing), or as a jump (which does indirect addressing).

Speculative execution has three main concerns:

Creative Commons License This work by Anthony Zhang is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Copyright 2013-2017 Anthony Zhang.