# ECE222

Section 001 (mixed engineering)

Andrew Morton
arrmorto@uwaterloo.ca

# 14/9/15

Computer architecture and assembly language programming.

There's a single Saturday lecture to make up for Thanksgiving Monday, sometime on the 1:30-2:20 slots. There are 6 make-up lecture slots on Fridays at 1:30 PM-2:20 PM.

Midterm on October 28 at 8:30 AM, 75 minutes. Textbook is optional but recommended.

No tutorials for the first week. Tentative office hours for this section are Wednesdays at 12:30 PM.

Finding the magnitude of a negative two's complement number is done by flipping all of the bits and adding one, but a better way is to flip all the bits left of the rightmost 1, and interpreting the resulting number as the unsigned binary magnitude.

# 16/9/15

Continuation of number systems and binary - see CS241 and ECE214e notes for reference.

Overflow occurs when the result cannot be represented using the given number of bits, such as when two excessively large unsigned values are added together to get a value larger than $$2^n - 1$$. It is always important to check for overflow when doing arithmetic. In computers, there is usually an overflow flag in a status register.

For subtraction, the overflow/borrow bit represents whether the sign of the result differs from the sign of either operand - if there are both positive and negative values as the operands and result.

Sign extension is the replication of the leftmost bit to make a binary number of a particular width larger. For example, sign extending 1010 to 8 bits results in 11111010.

Computers can be general purpose or application-specific. Types of computers include personal computers, workstations, mainframes, supercomputers, embedded computers.

# 18/9/15

Hack the North starts today at 6PM! Lab 0 is done individually but you should still have a partner.

A programmable computer is a device that can store a sequence of instruction, execute a stored sequence of instructions, and conditionally select a path of execution (control transfer). One of the first computers programmable computers designed was mechanical - the analytical engine by Charles Babbage.

Some of the tehnologies used for programmable omputers are mechanical, vacuum tube, transistors, integrated circuits, VLSI, and nanotechnological integration.

The Harvard Mark I (1944) was the first computer to implement the Harvard architecture (separate code and data). The Manchester Baby (1948) was the first to implement the von Neumann architecture (code in same memory as data).

A digital computer uses digital electronic circuits to implement a programmable computer. Generally, these consist of input/output (keyboards, monitors, and network cards), memory, and the CPU (the ALU and control unit), as well as the interconnection network that connects these all together.

For this course, memory will be represented as an array of bytes. The word length of the computer depends on the processor, and these days is generally 16, 32, or 64 bits. A computer with word length $$n$$ has words of $$n$$ bits - a word would be an $$n$$ bit binary number. A long word is a $$2n$$ bit binary number.

A memory unit has a clock, read enable, write enable, and address inputs, and has data output. Most computers have multiple levels of memory, from closer to the processor (fastest) to farthest from the processor (slower):

• Registers are the fastest memory, being built into the processor, but are very small, storing only a few bytes.
• These generally are just implemented with flip-flops.
• Caches are also generally built into the processor further away, and can store several kilobytes or megabytes.
• There are often separate instruction and data caches.
• Caches implement temporal locality (recently accessed data can be retrieved again faster) and spatial locality (fetch nearby memory faster, such as fetching the next few instructions to execute).
• There can also be multiple levels of caches, from L1 (fastest/smallest) to L3 (slowest/largest).
• Caches use SRAM for memory, which is basically just a series of flip-flops.
• Primary memory (RAM) is a lot slower, but can store gigabytes.
• This is generally implemented with DRAM, which is basically just a transistor switching the high voltage to a capacitor that is also connected to ground.
• DRAM is much smaller, but needs to be refreshed and works slower.
• Secondary memory (hard drives/SSDs) are extremely slow, and can store gigabytes and terabytes of data.

In each level, we can store exponentially more data, but it takes an exponentially longer time to read or write.

Modern CISC CPUs translate complex machine code into microcode/microops first, then execute those simpler instructions. CPUs have to fetch the operation, fetch the operands, perform the operation, then store the result.

The current instruction is stored in the instruction register (IR), and the address of the next instruction is always the value in the program counter (PC). The memory address register (MAR) contains the current memory address to read or write to if applicable. The memory data register (MDR) contains the current data to read or write to memory if applicable. The general purpose registers hold data and addresses.

The control unit drives the ALU to execute instructions.

A bus is a set of wires that can have different writers and readers at different times. One unit may write to the bus at a time, but many can read from it at any time. Writing to a bus is called bus driving. Since short circuits can occur if two units drive the bus at the same time, the control unit in CPUs manages which unit is allowed to drive at any time. Buses are often implemented using tri-state buffers.

A CPU is connected to memory and I/O via a bus. Nowadays, however, the limitation that only one unit can write at a time is too restrictive, so we use multiple buses. Modern bus standards include PCI, PCIe, and USB.

Since the CPU's connection to memory and I/O is both over a bus, I/O is generally treated a lot like memory. This allows techniques such as memory-mapped I/O, where I/O listens on the bus for reads and writes to certain addresses, and dispatches actions based on that. This allows simpler I/O access from the CPU by simply using the standard memory manipulation instructions.

An instruction is a command tha tells the CPU to do something like move data, transfer control flow, or do math. A program is a sequence of instructions.

For every instruction, the CPU must fetch the instruction, decode which one it is, fetch the operands if applicable, perform the operation, then store the results if applicable.

We access memory by address - each address represents a unit of memory. For most computers, each byte is assigned an address - this is byte addressable memory. Some systems assign one address per word instead - this is word addressable memory. For example, MIPS has an address for each byte, but memory can only be accessed if it's aligned to a word boundary.

In SI, a kilobyte is $$10^3$$ bytes, or 1 kB, and a kibibyte is $$2^{10}$$, or 1 KiB. The same goes for megabyte ($$10^6$$ bytes, or 1 MiB) vs. mebibyte ($$2^{20} bytes$$), gigabyte vs. gibibyte, terabytes vs. tibibytes, petabytes vs. pebibytes, and exabytes vs. exbibytes. This is because the standard SI prefixes should always refer to powers of 10, while the powers of 2 get different prefixes. KB is an incorrect unit, since uppercase K represents Kelvin.

# 21/9/15

For most computers, and especially for the computers in our course, word sizes are 32 bits, and characters are 1 byte. The default text encoding we will use is ASCII. We will also assume our systems are byte addressable.

The byte ordering/endianness of the computer determines how the bytes of numbers are ordered physically in memory. The little endian byte ordering is the most common, and assigns the less significant digits in the number to the bytes at smaller addresses, and the more significant digits to the bytes at larger addresses. For example, 0x1234 is represented as consecutive bytes 0x04, 0x03, 0x02, and 0x01. The big endian byte ordering is the opposite order, and 0x1234 is represented wth the consecutive bytes 0x01, 0x02, 0x03, and 0x04. While big endian looks more like the human representation of the number, little endian is more logical for things like arrays of numbers. Some architectures like PowerPC support switching between either convention.

The endianness of numbers is very important if transferring data between computers. The same bytes read on a big endian machine will result in a different number than one read on a little endian machine. Most networking stacks will handle this automatically, but sometimes it is necessary to read/write the bytes in an explicit, fixed byte order to ensure all computers can read it the same.

Lowercase and uppercase letters in ASCII differ only by one bit. In order to make this work, though, the lowercase and uppercase characters are not adjacent in the ASCII table.

What the Control key does is it hides the top nibble of the character - for example, Ctrl + C hides the top bits of C (0x43), to get 0x03, the end of text character.

An address on a byte-addressible computer is aligned if it is a multiple of the word size. For example, on a 32-bit computer an address is aligned if it is a multiple of 4. Unaligned addresses may take longer to access, or even be forbidden by the architecture - ARM, for example, will raise an exception/interrupt upon unaligned accesses.

Every processor will have instructions for data transfer between memory and registers, arithmetic/logic, braching/jumping/control flow, and I/O (generally memory mapped, so these can just be the same as for data transfer between memory and registers).

The instruction set architecture (ISA) specifies how a processor works - the instructions of a processor (the operations and their format), the registers, control registers, memory layout (like memory banks), exceptions, and so on. Examples of ISAs are ARM, x86, MIPS, and SPARC. There are often variations and flavors of each ISA, like the 64-bit variant of x86, x86-64.

A reduced instruction set computer (RISC) ISA has a fixed size for each instruction (each instruction takes the same number of bits), and have a load/store architecture (memory is accessed only via load/store instructions, and all other operations work on registers). RISC ISAs are simpler to construct and understand - for example, the fixed sized instructions mean decoding is a lot easier.

In contrast, a complex instruction set computer (CISC) ISA can have variable length instructions, and many instructions can also do things like accessing memory. These are harder to construct, but we can do more with each instruction - each instruction is harder to decode, because we have to read some of it first to know how much to keep reading.

# 23/9/15

Assemly language is a symbolic representation of machine code. It uses mnemonics to represent operations/instructions. See CS241 notes for basic overview of language format.

Assembly language uses register transfer level notation:

• R0 through to R15 are names for general purpose registers.
• Control registers/signals are reserved bare words, like T1 and T2.
• Other bare identifiers are memory locations (like labels) - some_label.
• Square brackets around a memory location or register represent the value at that memory location or memory address - [R0] means "value of R0" while [LOCATION] means *LOCATION in C.
• Square brackets can be nested - [[LOCATION]] means **LOCATION.

RTL notation is not standardized and may differ between texts.

RTL notation can represent instructions in terms of how they manipulate data, like $$R1 \leftarrow [\text{LOCATION}]$$, or $$\text{if } [R0] == 1 \text{ then } R0 = 50$$. Notation like $$R1 \leftarrow [R1] + [LOCATION]$$ makes it immediately obvious what the instruction does.

RTL notation can also be used to describe instructions in assembly language in a more comprehensible way. For example, ADD R0, R1, R2 can be written as $$R0 \leftarrow [R1] + [R2]$$, LOAD R0, LOCATION can be written as $$R0 \leftarrow [LOCATION]$$, and STORE R0, LOCATION can be written as $$LOCATION \leftarrow [R0]$$.

Since registers don't have addresses, they will never appear on the right side of the $$\leftarrow$$ without being inside square brackets.

Some instructions can have literal values as well, called immediate values. For example, SUBTRACT R2, R2, 5, which can be written as $$R2 \leftarrow [R2] - 5$$, has the immediate value 5.

Branching in RTL notation simply sets the control register PC. For example, BRANCH_IF_ZERO R0, LOOP can be written as $$\text{if } [R0] == 1 \text{ then } PC \leftarrow LOOP$$.

Addressing modes are different ways to specify where to find the operand of an instruction. The location/type of operand results in an effective address, when applicable. The effective address is where we would write to if that operand is an output, and is similar to the address of an lvalue in C.

# 25/9/15

A good resource for learning ARM is the Davespace ARM reference.

The immediate addressing mode means that the operand is a constant - we simply put the constant directly in the instruction. For example, the #2 ($$2$$) in MOVE R0, #2 $$R0 \leftarrow 2$$. Since the constant is inside the instruction, and the instruction itself needs to take up some bits, the constants must be smaller than the word size.

The register addressing mode means that the operand is a register. For example, the R0 ($$[R1]$$) in MOVE R0, R1 ($$R0 \leftarrow [R1]$$). The effective address here is $$R1$$.

The absolute/direct addressing mode means that the operand is a known, fixed address in memory. For example, the LABEL ($$LABEL$$) in JUMP LABEL ($$PC \leftarrow LABEL$$). The effective address here is $$LABEL$$. Note that this addressing mode is generally only supported for control flow instructions like JUMP - for MOVE, we would use something like MOVE R0, #LABEL to store the address of the label itself.

The register indirect addressing mode means that the operand is the contents of the memory address specified by a register, like dereferencing a pointer. For example, the [R1] ($$[[R1]]$$) in LOAD R0, [R1] ($$R0 \leftarrow [[R1]]$$). The effective address here is $$R0$$. This is also sometimes denoted(R1) (parentheses rather than square brackets).

Here is an assembly program to sum an array of numbers and store it in ARRAY_SUM, in a pseudo-assembly language:

LOAD R2, ARRAY_LENGTH
CLEAR R3
MOVE R4, #ARRAY_START
SUBTRACT R2, R2, #1
BRANCH_IF_NONZERO R2, LOOP
STORE R3, ARRAY_SUM

The index addressing mode is a useful extension of indirect addressing, where the operand is the contents of the memory address specified by a register plus a constant address. For example, the 5(R1) ($$[[R0] + 5]$$) in MOVE R0, 5(R1) ($$R0 \leftarrow [[R1] + 5]$$). The effective address here is $$[R1] + 5$$. Note that in the code, all constants are generally decimal. This addressing mode is very useful for working with structs.

The base with index addressing mode is an even more flexible extension of indirect addressing, where the operand is the contents of the memory address specified by a register (specifying the base address) plus another (specifying an offset). For example, the [R1, R2] ($$[[R1] + [R2]]$$) in MOVE R0, [R1, R2] ($$R0 \leftarrow [[R1] + [R2]]$$). The effective address here is $$[R1] + [R2]$$. This addressing mode is very useful for working with arrays.

# 28/9/15

The base with index and offset addressing mode combines the above two, where there is a base address and an offset. For example, the 5(R1, R2) ($$[R1] + [R2] + 5$$) in MOVE R0, 5(R1, R2) ($$R0 \leftarrow [R1] + [R2] + 5$$). The effective address here is $$[R1] + [R2] + 5$$.

The PC-relative addressing mode is similar to the index addressing mode, but the offset is always added to the program counter. For example, the 5(PC) ($$[PC] + 5$$) in MOVE R0, 5(PC) ($$R0 \leftarrow [PC] + 5$$). The effective address here is $$[PC] + 5$$. When a PC-relative operand is a label, the assembler will generally convert that into the OFFSET(PC) form automatically - SOME_LABEL BRANCH_IF_ZERO R0, SOME_LABEL gets translated to SOME_LABEL BRANCH_IF_ZERP R0, -4(PC). This is very useful for branch instructions.

The auto-increment addressing mode is similar to register indirect addressing, but the register is incremented by the number of bytes to read after the read occurs. For example, the [R1]+ ($$[[R1]]$$ then $$R1 \leftarrow [R1] + 4$$) in LOAD R0, [R1]+ ($$R0 \leftarrow [R1] + 4$$ then $$R1 \leftarrow [R1] + 4$$). This is very useful for working with stacks and queues, especially in combination with auto-decrement addressing.

There is also the auto-decrement addressing mode, which is simply -[R1] rather than [R1]+. The register is decremented by the amount to be read before the read occurs. There are also the [R1]- and +[R1] modes, which do the writing in the opposite order as -[R1] and [R1]+.

The call stack is used for many things, such as subroutine parameters, return addresses, and local variables. The stack starts at the highest memory address, and grows downward. The top element of the stack is always pointed to by the SP register. To push onto the stack, we decrement SP and write in the data we want to push onto the stack. To pop off the stack, we read out the top element, and then increment SP.

A subroutine is a block of instructions that can be executed repeatedly - this is good for deduplication and modularity. Subroutines can be called using the CALL instruction, and return using the RETURN instruction.

When we do a call, we need to save the return address somewhere. We may store it on the stack, but starting with MIPS, RISC processors started putting the return address in the link register (LR). x86, on the other hand, stores the return address on the call stack.

# 30/9/15

On our ARM machines, R13 is the stack pointer, R14 is the link register, and R15 is the program counter.

To make a call, we use the CALL SOME_SUBROUTINE instruction, which stores the value of PC in the link register, and branch to the target of the call. To return, we use the RETURN instruction to jump back to the address saved in that link register.

If we call a subroutine and within that subroutine, call another subroutine, the link register's value (the return address) gets overwritten by the inner call. To retain the return addresses, we push the link register contents to the stack before calling the inner subroutine, and pop it off ino the link register again after calling it. This can fail if there is a stack overflow or the stack is managed incorrectly.

Parameters can be passed using registers, memory locations, and on the stack.

Parameters are passed using registers if and only if there are few parameters, the subroutine is non-recursive, and the subroutine does not call other subroutines that use those same registers. Passing by registers is the fastest option, since it avoids main memory accesses.

Passing parameters using memory locations is rarely used - it's analogous to passing values using global variables.

In all other situations, parameters are passed on the stack, by pushing the last parameter to the first parameter, calling the subroutine, and then popping off all the parameters again - the caller manages the parameter pushing and popping.

Using registers:

; N is the array length, NUM_1 to NUM_N are the elements of the array
MOVE R4, #NUM1
STORE R3, SUM

; R2 is the array length, R4 is the address of the first element - we pass parameters in registers
; push R5 on the stack so we can restore it later
STORE R5, -(SP)

CLEAR R3 ; R3 stores the result
LOOP:
; we can freely modify the parameters as we want to
SUBTRACT R2, R2, #1
BRANCH_IF_NONZERO R2, LOOP

; restore R5 and return - R3 contains the result by this point
RETURN

Using the stack:

; N is the array length, NUM_1 to NUM_N are the elements of the array
MOVE R2, #NUM1
STORE R2, -(SP) ; second parameter
STORE R2, -(SP) ; first parameter
ADD SP, SP, #4 ; pop the first parameter off
LOAD R2, (SP)+ ; pop the second parameter off - the result is stored in this space
STORE R2, SUM ; save the result to SUM

; push R2, R3, R4, R5 on the stack so we can restore it later
STORE R2, -(SP)
STORE R3, -(SP)
STORE R4, -(SP)
STORE R5, -(SP)

CLEAR R3 ; R3 stores the temporary result
LOOP:
SUBTRACT R2, R2, #1
BRANCH_IF_NONZERO R2, LOOP
STORE R3, 20(SP) ; store the result in the space originally used for the first parameter passed on the stack

; restore R2, R3, R4, R5 and return - top of stack contains the result by this point
RETURN

# 2/10/15

When we enter a subroutine, we push the frame pointer (FP), followed by spaces for the local variables, followed by saved registers - together, these form a stack frame.

The frame pointer is used as a convenient way to refer to local variables and saved registers - even as the stack pointer changes as we're using it, the frame pointer doesn't change. The parameters are at a positive offset from the frame pointer, while local variables and saved registers are at a negative offset.

LOAD R2, PARAM2
STORE R2, -(SP)
STORE R2, -(SP)
CALL SUB1
LOAD R2, (SP) # obtain return value
ADD SP, SP, #8 # pop all parameters off the stack

SUB1:
# save registers
STORE FP, -(SP)
MOVE FP, SP # initialize frame pointer to between parameters and local variables
STORE R2, -(SP) # local variable for parameter
STORE R3, -(SP) # local variable for parameter
STORE R4, -(SP) # local variable
STORE R5, -(SP) # local variable
LOAD R2, 8(FP) # get first parameter
LOAD R3, 12(FP) # get second parameter

; do stuff
; we can use -4(FP) to -16(FP) to refer to the local variables

; call another subroutine
STORE R4, -(SP)
CALL SUB2
LOAD R4, (SP)+ # obtain return value

; do other stuff

# save the result onto the stack
STORE R5, 8(FP) # the space used by the first parameter

# restore registers
RETURN

SUB2:
# save registers
# we don't have to store LINK_REG because we aren't making any other subroutine calls
STORE FP, -(SP)
MOVE FP, SP # initialize the frame pointer to between parameters and local variables
STORE R2, -(SP) # local variable for parameter
STORE R3, -(SP) # local variable
LOAD R2, 4(FP) # get the parameter

; do stuff
; we can use -4(FP) and -8(FP) to refer to the local variables

# save the result onto the stack
STORE R3, 4(FP) # the space used by the first parameter

# restore registers
RETURN

In addition to arithmetic instructions like ADD and SUB, we also have logical operations like AND DEST, A, B, OR DEST, A, B, SHIFTL DEST, VALUE, AMOUNT, SHIFTR DEST, VALUE, AMOUNT, ROTATEL DEST, VALUE, AMOUNT, ROTATER DEST, VALUE, AMOUNT. Rotating without carry rotates ignoring the carry bit, but the bit that corresponds to it is still stored in the carry bit.

## ARM ISA

The assembler is responsible for generating the machine language program from the assembly code. The three operand instructions need to fit into a single word, while also supporting immediate values. Instructions in ARM take the following form:

ARM instructions have a fixed length (32 bits) and only access memory using dedicated load/store instructions, which puts them in the RISC category. However, there are also a variety of CISC-like features, such as the many addressing modes, conditionally executing instructions, and store/load of multiple registers.

ARM memory is byte addressable with 32-bit addresses, and supports little and big endian representations, configurable to either. It supports aligned 32-bit words, 16-bit half words, and bytes.

ARM has 16 data registers, R0 to R15. R15 is the program counter, R13 is conventionally used as the stack pointer, and R14 is conventionally used as the link register. The CPSR/PSR register contains various status bits, such as negative (N), zero (Z), overflow (V), carry (C), and interrupt masks. These status codes are used by instructions that explicitly specify that they access the status bits, and are also used by the condition bits on conditional instructions.

In ARM, all the addressing modes except for immediate/register direct are derived from indexed addressing. The offset can be specified as an immediate value, or another register.

The instructions we have been using so far are similar to ARM, but not the same. For example, MOVE R0, R1 is actually LDR R0, R1 (load register) - LDR supports a register or an immediate value for the value to load. This can be used to do register direct (LDR R0, R1), immediate value (LDR R0, #123), and PC-relative addressing (LDR R0, SOME_LABEL).

Arithmetic/logic instructions are largely the same, such as ADD DEST, ADDEND, ADDEND (addition), SUB DEST, MINUEND, SUBTRAHEND (subtraction), and MUL DEST, MULTIPLICAND, MULTIPLICAND. For these, the second source operand (a source operand is an operand that is not a destination - one that is not written to) can either be a register or an immediate value (the immediate values have some restrictions, which are managed by the assembler).

# 5/10/15

• PC-relative addressing: LDR R0, LABEL translates to $$R0 \leftarrow [PC] + offset(LABEL)$$, where $$offset(label)$$ is the relative offset of the label.
• Pre-index addressing (analogous to index addressing): LDR R0, [R1, #25] translates to $$R0 \leftarrow [[R1] + 25]$$.
• Pre-index addressing with writeback (analogous to auto-increment/auto-decrement addressing): LDR R0, [R1, #4]! translates to $$R0 \leftarrow [[R1] + 4]; R1 \leftarrow [R1] + 4$$.
• Pre-index addressing with offset: LDR R0, [R1, R2] translates to $$R0 \leftarrow [[R1] + [R2]]$$, and LDR R0, [R1, -R2] translates to $$R0 \leftarrow [[R1] - [R2]]$$. Also works with writeback mode.
• Pre-index addressing with offset and index: LDR R0, [R1, R2, #-5] translates to $$R0 \leftarrow [[R1] + [R2] - 5]$$, and LDR R0, [R1, -R2, 5] translates to $$R0 \leftarrow [[R1] - [R2] + 5]$$. Also works with writeback mode.
• Post-index addressing with write-back (analogous to auto-increment/auto-decrement): LDR R0, [R1], #4 translates to $$R0 \leftarrow [[R1]]; R1 \leftarrow [R1] + 4$$.
• Post-index addressing with write-back and offset: LDR R0, [R1], R2 translates to $$R0 \leftarrow [[R1] + [R2]]; R1 \leftarrow [R1] + [R2]$$, and LDR R0, [R1, -R2] translates to $$R0 \leftarrow [[R1] - [R2]]; R1 \leftarrow [R1] - [R2]$$.
• Post-index addressing with write-back and offset and index: LDR R0, [R1], R2, #-5 translates to $$R0 \leftarrow [[R1]]; R2 \leftarrow [R1] + [R2] - 5$$, and LDR R0, [R1, -R2, 5] translates to $$R0 \leftarrow [[R1]]; R1 \leftarrow [R1] - [R2] + 5$$.

All of these addressing modes can be used each of the memory instructions.

Additional memory operations include LDR/STR (load/store register), LDRH/STRH (zero-extended load/store half-word), LDRB/STRB (zero-extended load/store byte), and LDRSH/LDRSB (sign-extended load half-word/byte).

There are also instructions to access multiple consecutive memory locations, LDM R0!, {R1, R3, R5}/STM R0!, {R1, R3, R5} (load/store multiple) means $$R1 \leftarrow [R0] + 0; R3 \leftarrow [R0] + 4; R5 \leftarrow [R0] + 8$$. The ! is optional in each case, and means that the address of the last value loaded should be written back into R0.

For certain arithmetic/logic/move instructions like ADD and MOV, we can also apply some operations to the second source operand before performing the operation. For example, ADD, R0, R1, R2, OPERATION where OPERATION is a value like LSL #5/LSR #5 (left-shift/right-shift R2 by 5), ASR #5 (arithmetic shift R2 right by 5 bits; basically right-shift with sign extension), ROR #5 (rotate R2 right by 5 bits). Internally, the LSL DEST, SRC, AMOUNT, LSR DEST, SRC, AMOUNT, and other shifting instructions are translated into moves using these operations.

The MOV R0, REGISTER_OR_VALUE instruction is like a more limited but smaller version of LDR R0, REGISTER_OR_VALUE - it can only set a register to another register's value, or an immediate value (which has some restrictions, managed by the assembler). MVN R0, REGISTER_OR_VALUE is the same as MVN R0, REGISTER_OR_VALUE, but R0 is set to the bitwise NOT of REGISTER_OR_VALUE rather than the plain value. Also, MOV supports the additional operations on the second source operand (like shifting and rotating), while LDR does not.

Logical operations in ARM include AND (bitwise AND), ORR (bitwise OR), and EOR (bitwise XOR). These also support the operations on the second operand.

The test instructions are used to do comparisons, and are often used with the conditional branching instructions. TST R0, #1 does a logical AND between R0 and 1, then sets the Z status bit according to whether the result is 0. TEQ R0, 5 does a logical XOR between R0 and 1, then also sets Z according to the result. CMP R0, R1 subtracts R1 from R0, then updates the Z, N, and possibly other status bits according to the result. Note that these write only to the status bits, not to the registers themselves.

The test instructions always write to the status registers. Arithmetic/logic/move instructions can do this too, if we use the S variant of them. For example ADDS does the same thing as ADD, but also sets status bits like carry, zero, and negative.

The branch instructions are used to modify control flow, often conditionally based on the status bits. There are several of these instructions, all beginning with B and having the form BCONDITION LABEL (the operand uses PC-relative addressing):

• Unconditional/always: B LABEL/BAL LABEL branches unconditionally to LABEL.
• Equality: BEQ LABEL and BNE LABEL branches to LABEL if Z (zero) is 1 or 0, respectively.
• Carry set/unset: BCS LABEL/BHS LABEL and BCC LABEL/BLO LABEL branches to LABEL if C (carry) is 1 or 0, respectively.
• Minus/positive: BMI LABEL and BPL LABEL branches to LABEL if N (negative) is 1 or 0, respectively.
• Overflow set/unset: BVS LABEL and BVC LABEL branches to LABEL if V (overflow) is 1 or 0, respectively.
• Unsigned higher/lower-or-same (greater/less-or-equal-to): BHI LABEL and BLS LABEL branches to LABEL if Z and not C ($$Z \wedge \overline C$$) is 1 or 0, respectively.
• Signed greater-or-equal/less: BGE LABEL and BLT LABEL branches to LABEL if N xor V ($$N \oplus V$$) is 1 or 0, respectively.
• Signed greater/less-or-equal: BGT LABEL and BLE LABEL branches to LABEL if Z, or N xor V ($$Z \vee (N \oplus V)$$), is 1 or 0, respectively.

# 7/10/15

Branch statements can be used to implement any kind of control flow.

An assembler is a program that accept assembly language and outputs an object program in machine code - one instruction in assembly language is one machine instruction. Each line of assembly language contains labels, the operation, operands, and comments, all of which are optional.

Assembler directives are commands to the assembler itself, rather than being translated to machine code. This allows us to do things like speifying the entry point of the program (ENTRY), define constants/data, and set code/data section locations (AREA CODE/AREA DATA). Directives like DCW (declare word) can be used to declare literal values, and often takes the form SOME_LABEL DCW 0x12345678, 0x87654321 (two constant words that can be referenced by SOME_LABEL):

AREA CODE ; code section
ENTRY ; entry point of the program

LDR R0, X ; load thevalue of X into R0

AREA DATA ; data section
X DCW 0x25 ; declare a word

Other directives include NAME EQU VALUE, which can be used to define symbolic names fo numerical values (like X EQU 5) and RN, which can be used to give names to registers (COUNTER RN 3).

There are also pseudo-instructions, which aren't real instructions, but get translated into real, possibly multiple instructions. For example, LDR R0, =LABEL loads the address of LABEL into R0 (as opposed to LDR R0, LABEL, which loads the value at LABEL instead). Also, LDR R0, CONSTANT_VALUE is translated to MOV R0, CONSTANT_VALUE if CONSTANT_VALUE is of the right form, and otherwise is translated into LDR R0, CONSTANT_DEFINITION and CONSTANT_DEFINITION DCW CONSTANT_VALUE - the constant is stored at a memory location, and is loaded from that memory location when required.

Summing an array:

LDR R1, N
LDR R2, =ARRAY
MOV R0, #0
LOOP:
LDR R3, [R2], #4
SUBS R1, R1, #1
BGT LOOP
STR R0, SUM

N DCW 5 ; number of array elements
ARRAY DCW 2, 4, 6, 8, 10 ; elements of array
SUM DCW 0 ; sum of the array is output in here

# 9/10/15

STM/LDM, as mentioned earlier, can do multiple register stores/loads all in a single instruction:

• The FD/IA (increment after) suffix adds the final offset to the register after performing the store/load: STMFD R0!, {R1, R2, R3} (store R1-R3 into memory address at R0, write final address into R0)
• The EA/DB (decrement before) suffix subtracts the final offset from the register before performing the store/load: STMEA R0!, {R1, R2, R3} (store R1-R3 into memory address at R0, write final address into R0)
• This also supports conditional prefixes to run the instruction conditionally based on status bits: STMFDGT R0, {R1} (store R1 into memory address at R0)

This is very useful for pushing multiple things onto (STMFD R13!, {R1, R2, R3}) or off of (LDMFD R13!, {R1, R2, R3}) the stack. The ! ensures that the final address of R13 (stack pointer) is updated to represent the top element of the stack.

Also, BL LABEL is the branch and link instruction, which branches to LABEL and sets the address of the next line as the value of the link register - the return address. This is useful for calling subroutines.

This allows for a very elegant calling convention:

LDR R0, #123 # set parameter 1
BL SUBROUTINE # call the subroutine

# accepts 1 parameter in R0, and puts the return value in R0
SUBROUTINE:
STMFD R13!, {R14, R1, R2} # push the link register and any other desired register values onto the stack
; we conventionally don't need to save the registers for the parameters (R0) since they are only used for this subroutine anyways - plus, we're reusing it for the return value

; do things with R1 and R2, since they're saved now

LDMFD, R13, {R15, R1, R2} # pop the link register's value into the program counter (to do the return) and restore the other register values

The above saves/restores registers using only two instructions, and also does the return at the end. The return value is conventionally put in R0, though we could also push it onto the stack (if the result can't fit into just a register).

Without the multiple store/load instructions, we would have to manually push each register value, then at the end manually restore the register values and return using something like BX R14.

# 14/10/15

A program needs to be assembled in two passes if and only if it includes instructions that reference labels below it in the program (forward references).

Assemblers will assemble modules as if each one will be loaded at location 0, and then the linker relocates it as appropriate. There is position dependent and position independent machine code generated by instructions - position independent code is the same as position dependent code, but it also includes information that allows the linker to relocate it.

The interconnection network is the circuitry that connects the processor, memory, and I/O, like the system bus. Communication takes place over a shared address space. I/O devices are often memory mapped - they share the same address space as the memory, and in general simply appear as memory locations. As a result, we can access these I/O devices simply using the standard load/store instructions.

I/O devices typically have three registers exposed as memory locations - data, status, and control. Often, the control and status registers have the same address, where loading from it reads the status register and writing to it sets the control register.

Programs will generally interact with I/O by polling (program-controlled I/O) or with interrupts (device-controlled I/O).

For polling, there will generally be a bit in the status register that tells us whether the I/O device has data ready to read, and then the program repeatedly checks this and reads if necessary. This is really inefficient if the reads occur infrequently, such as reading keypresses from a keyboard - most of the time, the CPU will simply be checking status registers. We want to allow other things to run while we're waiting for I/O to get reading - to let I/O devices alert the processor when it's ready. This is the role of interrupts.

# 16/10/15

I/O devices will often have a dedicated signal on a special processor line, called IRQ (interrupt request), on which the device can send signals. When the processor receives one of these interrupts signals, it switches to an interrupt service routine/interrupt handler (ISR). Interrupts are used to allow the processor to respond to high priority events.

An interrupt service routine is a small program stored in memory. ISRs should be as short and fast as possible, and do the minimum needed to satisfy the I/O device, since it is blocking the main program from running - they usually copy data into a buffer for the main program to process later. ISRs generally have to acknowledge the interrupt in their body to clear the interrupt from processing.

The steps before the ISR is called can take a significant amount of time - this delay is called interrupt latency.

Some pieces of code are critical and should never be interrupted. For this purpose, there is a CSPIE (interrupt enable) bit in the processor status (PS) register that controls whether interrupts are handled, or just ignored. This is useful for things like disabling interrupts in realtime code sections, or disabling interrupts within an ISR to avoid recursive interrupts.

Interrupts that are generated internally are called exceptions. Otherwise, they come from external sources like I/O.

The interrupt process occurs as follows:

• The I/O device asserts an interrupt.
• The processor finishes executing the current instruction and everything in its pipeline (or just stops executing it if there's an exception like invalid memory access).
• The processor state is saved (program counter PC, status register PS).
• Interrupts disabled by setting interrupt enable to 0 in PS.
• The correct ISR corresponding to the device is looked up.
• The ISR is called, servicing the device (the ISR should deactivate the interrupt to clear the interrupt request, by setting the INTA line).
• The processor state is restored.
• Interrupts enabled by setting interrupt enable to 1 in PS.
• The next instruction is executed (or not, if there was an exception).

The IRQ could just be a single signal, where upon receiving a signal, we just poll every device to see if it requested an interrupt, but this adds a lot of interrupt latency. What we can do instead is have the IRQ line accept a vector, and then map that to ISR addresses using an interrupt vector table (an array, generally at a fixed address, of ISR addresses where indices are the vector values) - these are vectored interrupts. However, vectored interrupts require IRQ to have a lot more lines to represent a vector.

One way to implement the IRQ line is by having a weakly pulled-up line pulled down by any of the I/O devices (each one either pulls down the line or is disconnected). When the line is pulled down, the processor detects this as an interrupt, and the I/O device stops pulling the line down (clearing the interrupt request) when the ISR services the I/O device.

Servicing the I/O device is done by asserting the interrupt acknowledge signal INTA, which is sent from the processor to the first I/O device and passed down the line until a device needing service receives it. This is a single line, and when set to 1 means that the processor acknowledges that the interrupt was raised. The I/O device needing service always passes on 0, while other devices pass on the original value, since we want to only acknowledge one interrupt at a time (if there are multiple active, setting INTA only acknowledges the first one).

The processor can determine which I/O device asserted the interrupt by either polling each of them in turn, or having another set of lines to transfer the vector if using vectored interrupts. Alternatively, IRQ can be weakly pulled-down and I/O devices can pull it up.

Another way to implement IRQ is to simply have one IRQ line per I/O device. This is much simpler, but requires a lot more lines - one IRQ and one INTA line per I/O device. Plus, the number of I/O devices is fixed by what is built into the processor.

If multiple interrupts occur at the same time, arbitration is needed. Most systems will have an interrupt priority mechanism, where higher priority interrupts get to execute first. In single-signal IRQ implementations, the priority is defined by the order that INTA signals are passed from one device to the next - the earlier the device in the chain, the higher its priority. For multiple dedicated IRQ lines, the processor must choose just one of them to process at a time.

Interrupt nesting allows higher priority ISRs to interrupt lower priority ones - this helps avoid higher priority ISRs being ignored while low-priority ISRs are running (and interrupts are disabled). This is done by having separate interrupt enable flags for high and low priority interrupts - low priority ISRs only disable low priority interrupts while running, while high priority ones disable both low and high priority interrupts while running.

# 19/10/15

An example of where interrupt nesting is important is for keyboard and display interrupts - while we are in the display interrupt handlers, we still want to keep receiving keyboard input.

The upon interrupt requests, the status register PS is saved to the interrupt status register IPS. The IENABLE register has one bit per device representing whether interrupts are enabled for it, while the IPENDING register has one bit per device representing whether the current interrupt request exists and has not been serviced. Registers such as these are accessed using special control register instructions, which we will call move_control SOURCE, DESTINATION for simplicity.

Simple interrupts example:

; THIS SHOULD BE IN THE MAIN PROGRAM
; clear the line buffer
move R2, #LINE_BUFFER
store R2, BUFFER_PTR

; enable keyboard interface interrupts
move R2, #2
store R2, KEYBOARD_STATUS ; keyboard status register is a special location in memory

; enable CPU keyboard interrupts
move_control IENABLE, R2
or R2, R2, #2
move_control R2, IENABLE

; enable CPU interrupts in general
move_control PS, R2
or R2, R2, #1
move_control R2, PS

; THIS SHOULD BE IN THE ISR
; save used registers
push R2
push R3

; save keyboard character to the line buffer
load_byte R3, KEYBOARD_DATA ; keyboard data register is special location in memory
store R3, (R2)
store R2, BUFFER_PTR

; restore resgisters
pop R3
pop R2

Non-maskable interrupts (NMI) are those that trigger regardless of the interrupt enable bits.

# 21/10/15

Multiple interrupts example:

; THIS SHOULD BE IN THE MAIN PROGRAM
; clear the line buffer
move R2, #LINE_BUFFER
store R2, BUFFER_PTR

; enable keyboard interface interrupts
move R2, #2
store R2, KEYBOARD_STATUS ; keyboard status register is a special location in memory

; enable display interrupts
move R2, #4
store R2, DISPLAY_STATUS ; display status register is a special location in memory

; enable CPU keyboard/display interrupts
move_control IENABLE, R2
or R2, R2, #2
or R2, R2, #4
move_control R2, IENABLE

; enable CPU interrupts in general
move_control PS, R2
or R2, R2, #1
move_control R2, PS

; THIS SHOULD BE IN THE ISR
; save used registers
push FP
push R2
push R3

; test interrupt pending to see if display generated an interrupt
move_control R2, IPENDING
and R3, R2, #4
branch_if_zero R3, NOT_DISPLAY
call DISPLAY_ISR
jump DONE

; test interrupt pending to see if keyboard generated an interrupt
NOT_DISPLAY:
and R3, R2, #2
branch_if_zero R3, DONE
call KEYBOARD_ISR

; restore resgisters
DONE:
pop R3
pop R2
pop FP

KEYBOARD_ISR:
; DO KEYBOARD HANDLER STUFF HERE

DISPLAY_ISR
; DO DISPLAY HANDLER STUFF HERE

Note that the priority of the interrupts were defined in software here - we're checking the display first, so the display always has higher priority.

Exceptions are simply any type of execution interruption, and include interrupts as well as interruptions from inside the processor.

While interrupts are for I/O devices, we also have internal faults like error recovery (zero division, bad instruction), debugging, and various utility faults for operating system multitasking primitives. For example, breakpoints on the CPU are generated by inserting a trap instruction that allows a special routine to handle it. On most OSs, system calls are made using exceptions.

A watchdog timer is a mechanism used for embedded systems that need to be very reliable. A timer in the background, when it runs out, triggers a non-maskable interrupt that resets the processor (say, every second or so) - the software has to reset the timer (kicking the watchdog) every so often, or else the timer will run out and reset the program. This ensures that if the program ever freezes or locks up, the wachdog timer will reset the program to try again.

## Hardware

When state changes occur on device inputs, it takes time before the output state updates to match. While the output is updating, it is possible that the state will be transient and invalid. This is the purpose of the clock - to ensure we can wait until the signals are fully propagated before continuing.

A negative edge-triggered D flip-flop is simply an MS flip-flop, two D-latches one after the other with the same clock, the second one's clock signal being inverted - all the abstract hardware must, in the end, boil down to real hardware. A multiplexer can be made from AND gates for each input and NOT gates for select lines.

Buses have address signals, data signals, and control signals. On the processor bus, there is an address decoder that reads the address lines and sets the enable line on the corresponding device, such as RAM or each memory mapped I/O device. Enabling a device connects it to the bus, and only one device may be connected at a time.

# 23/10/15

To use the bus, the processor generally signals on the address lines, the devices watch for their relevant addresses, and then respond on the data lines when the addresses match. The specifics of these are defined by the bus protocol.

The bus control lines control the specifics of the bus operation, like whether it's a read/write, the size of the operation, or timing information (a clock line if the bus is synchronous, or some other signals if the bus is asynchronous).

Buses are driven by a single bus master, which initiates data transfer by issuing bus commands/operations to slave devices.

For a synchronous bus, the address and control lines are set on the rising edge of the clock, at which point the slaves decode the address/commands, and on the falling edge of the clock, the slave responds on the data lines. The master can then save the result on the next rising edge. Therefore, the clock speed is limited by the bus propagation delay, the decoding time for each slave for the high phase, and the setup/hold time for the master's storage.

As a result, the clock speed (for the low phase) is limited by the slowest device, which is suboptimal since some devices will be much slower than others. To avoid this, we could have slave devices signal the master on a control line when they are done their operation:

1. On the first clock cycle, the master sets the address/control lines, and the slave decodes them.
2. On the second clock cycle, the slave starts performing the operation.
3. By the end of the second clock cycle or later, the slave responds on the data lines and sets the slave-ready signal, so the master reads the data and resets the address/control lines.
4. The slave decodes this and since the address/control lines are reset, it resets the slave-ready signal.

Electrical signals take time to travel over the bus, and different parts will see differently time shifted signals (this is bus skew).

For longer buses, an asynchronous bus with a handshake protocol is better than a synchronous one, since it automatically accounts for delay. A handshake protocol would work something like the following:

1. Master sets the address/control lines, as well as the master-ready signal.
2. All slaves receive these signals. However, only the slave corresponding to the selected address will do anything with it. That slave performs its operation and sets the slave-ready signal.
3. Master receives the slave-ready signal, waits a little bit for the data to settle and saves it, and resets the address/control lines as well as the master-ready signal.
4. Slave resets the slave-ready signal since the master-ready signal was reset.

An example of a control line is read/write - 0 means the operation is a write, and 1 means the operation is a read.

# 26/10/15

MIDTERM WEEK - NO CLASSES

I'll be writing the midterm on 8:30 AM on Wednesday in EIT 1015. Midterm covers everything up to and including chapter 5.

# 28/10/15

It actually wasn't too bad, I'd say - the midterm is definitely doable in under an hour.

# 2/11/15

By this point, we have seen that buses can transfer data over one clock cycle, multiple clock cycles, or asynchronously. Generally, asynchronous buses are preferred since they don't care about timing very much.

Buses can only have one active master at a time. When two masters on a bus both want to write to the bus at the same time, bus arbitration is necessary - the bus arbiter circuit pre. Basically, each device must send a request to use the bus, and the arbiter circuit grants access to one request at a time based on their priority, by controlling write-enable lines. Bus arbiters can be centralized or distributed.

An I/O interface circuit connects an I/O device to a bus. For an asynchronous bus, this interface needs a register for temporary data storage, a status register, a control register, an address decoder, any necessary clocks/timing signals, and converters like parallel to serial shift registers.

In a parallel interface, there are multiple bits transferred at once, while a serial interface will only move one bit at a time. Serial interfaces are theoretically slower, but are cheaper and work better for long distance transfers (no crosstalk between parallel wires). Most computers will use a combination of these.

For example, USB is a serial protocol with two data lines, used for differential signalling. Normal signalling (single-ended signalling)has a data line's voltage go up and down relative to a reference voltage to represent 0 and 1. Differential signalling has two data lines, and their relative voltages go up and down to represent 0 and 1 - this allows us to achieve significantly better practical speeds.

When we're doing serial transmission, we need to know when there are valid bits on the data line. For this, we have to have either a clock, use asynchronous master-ready/slave-ready signals like an asynchronous bus, or encode the clock in the data itself. A clock is better for high speed transfers, while USART (universal synchronous/asynchronous receiver/transmitter) uses either a clock or encodes the clock in the data.

In start-stop transmission, the transmitter and receiver agree on a baud rate. In the idle state, the data line is left high. We transfer a group of 8 bits by pulling down the line (the start bit), transmitting the data (the data bits), and then pulling up the line for 1 or 2 bits (the stop bits). Afterward, the data line is left high.

# 4/11/15

The USB standard is governed by a special USB consortium. The standard is a simple serial interconnection system supporting a wide variety of transfer characteristics and lots of features, such as plug-and-play.

A USB connector has 4 pins - ground, power, data 1, and data 2. A high bit is sent by bringing data 1 up and data 2 down, while a low bit is sent by bringing data 1 down and data 2 up - differential signalling. In USB differential signalling, the difference between the two wires is about 0.2V, which means that the voltage change is smaller and can change faster. Also, since there are 2 wires any interference on the wires will affect both of them equally, so the signal is unaffected (since the difference stays the same), which makes USB more robust.

For USB connections, the architecture is arranged like a tree, where interior nodes are USB hubs, leaf nodes are devices, and the root is the USB host. Messages from a hub are broadcast to all direct children, and are propagated down to I/O devices. Only the host can send messages, and devices can only respond to requests from the host - this means bus collisions are impossible, and no arbitration is needed.

The PCI bus is a standard interface for connecting devices to different processor/memory buses. It's a plug-and-play standard that that exposes a standard interface for things like RAM and other peripherals. PCI supports 32-bit and 64-bit data widths, transferring data in parallel. The protocol needs 3.3V or 5V, and can achieve up to around 800 MiB/s

A bridge connects two different bus interfaces. A PCI bridge connects the processor/memory bus with the PCI bus, where all the PCI devices communicate, as if they were connected directly to the bus - PCI devices become part of the processor's address space and devices are memory mapped. PCI devices can act as masters (initiators), or slaves (targets).

The SCSI bus is designed for smaller computers or devices. This is an 8-bit or 16-bit parallel bus at 3.3V or 5V, and supports single-ended and differential signalling. The SCSI bus is not memory mapped, and the processor needs to talk to the SCSI controller to interact with SCSI devices, via direct memory access (DMA). Devices can also be bus masters or slaves, so arbitration is necessary.

SCSI is used for things like CD drives and hard disks, since it is relatively slow (320 MiB/s maximum bandwidth).

# 6/11/15

## Memory

Memory can be volatile (data is lost on power off) or non-volatile (data retained on power off).

PROMs (programmable read-only memory, non-volatile, non-writeable) is the main read-only memory technology. Each PROM memory unit is an array of cells. For each memory bit, there is an NMOS transistor where the sink is connected through a fuse to ground, the base is connected to the cell's address select line, and the drain is connected to the data line. ROMs initially have all their fuses unblown, and are programmed by selectively blowing their fuses.

The common data line for a PROM is weakly pulled up. When a cell that doesn't have its fuse blown is selected (has its address select line high), the transistor connects the data line to ground, so the data line is driven low. When a cell that has its fuse blown is selected, the transistor's sink isn't connected to anything, so the data line is pulled up.

EPROM (erasable real-only memory, non-volatile, writeable) is the same as PROM, but is writeable by replacing the fuse with a special type of transistor that can hold a charge for decades at a time. EPROMs can be erased as a whole by putting them under an ultraviolet light, which allows the charges to escape.

EEPROM (electrically erasable read-only memory, also known as NVRAM, non-volatile, writeable) is the same as EPROM, but the special charge-holding transistor can be reset through higher voltages rather than ultraviolet light. Basically, we can read using a low voltage, but then write using a higher voltage.

Flash (non-volatile, writeable) memory is basically the same as EEPROMs, but the way they are organized mean that they are cheaper, and can only be addressed in blocks (e.g., 64 kB at a time). Flash is generally used in SSDs and USB thumb drives.

DRAM (dynamic random-access memory, volatile, writeable) is the same as PROM, but with a capacitor instead of a fuse. The capacitors are instantly leaking current into ground, so need to be refreshed every so often (every bit needs to be rewritten at about 100 Hz). When writing to a selected cell, we simply drive the data line high or low to charge or discharge the capacitor. When reading, we have to read the voltage on the capacitor, amplify it, and then write it back. This is about 4 times slower than SRAM, but the circuits are smaller and cheaper per bit. DRAM is generally used as the main memory of computers.

SRAM (static random-access memory, volatile, writeable) is implemented by having, for each bit, a flip-flop that is read/written selectively depending on the address line and read/write lines. This takes 6 transistors per bit, so is relatively expensive and large compared to technologies like DRAM. However, it is faster than DRAM since the flip-flop values are driven directly by the signals.

# 9/11/15

For a memory chip, the number of address pins needed depends on the size - $$\ceil{\log_2 n}$$. A memory chip must have a chip select and a write enable pin, as well as address pins. Memory chips use decoders to convert the address line signal into an enable signal for the appropriate cell.

Larger memory chips need additional structures to be practical. While we could simply have a 1:N decoder for each cell, this is inefficient when the decoder gets large enough. Instead, we can have a grid of memory cells and enable/disable entire rows at a time based on the upper bits of the address, then have a multiplexer to select the right columns in each row based on the lower bits of the address. Each of these rows is a page.

The reason we need to care about pages is that the row address bitsand the column address bits are usually specified separately, using the RAS and CAS pins (row address strobe/column address strobe). That means we usually have to send the row address to select a row, then send the column address to actually perform the operation.

The DDR (double data rate) DRAM standard transfers data on both the rising and falling edge of the clock cycle, in 2 word bursts. DDR1 to DDR4 are the same standard, just progressively faster.

In fast page mode, we keep the row cached in the read buffer (by keeping its value latched between reads) and if we need to do a read in the same row, we can skip sending the new row address and just send the column address (we don't need to go through the row address decoder). This makes reading consecutive memory locations significantly faster.

DRAM is controlled in this way asynchronously, by using the RAS and CAS pins to control when to send addresses or perform operations. SDRAM also has RAS and CAS pins, but is synchronous and has a separate clock line instead.

RAM latency is the time between the start of an operation and when the first word is successfully read/written. RAM bandwidth is the volume of the data per unit time. These are related but not always the same, due to things like burst reads (reading multiple words at once) and fast page mode.

# 11/11/15

RAM often comes in the form of DIMMs (dual inline memory modules), with multiple channels on the same PCB, each of which can be read in parallel.

We now have the memory heirarchy. Whenever we need some data, we want to read it from the fastest possible one of the following:

• Registers can be accessed within 1 processor cycle, but can only store about 64 bytes total. There is generally one per core.
• L1 cache can be accessed within around 2 cycles, and can store about 64 kB. There is generally one per core.
• L2 cache can be accessed within around 10 cycles, and can store about 256 kB. There is generally one per core.
• L3 cache can be accessed within around 40 cycles, and can store about 8 MB. This is generally shared between cores.
• RAM (main memory) can be accessed within around 100 cycles, and can store several gigabytes.
• Solid state drives can be accessed within around 1,000,000 cycles, and can store several terabytes and more.
• Hard drives can be accessed within around 10,000,000 cycles, and can store several terabytes and more.

Most memory operations tend to have high temporal locality (recent things are more likely to be read/written next) and high spatial locality (nearby things in memory are more likely to be read/written next).

The goal of the processor cache is to speed up these operations, so they transfer recently used data with RAM in blocks/lines - recently used data is temporally local, while blocks ensure spatially local data is often loaded along with the requested data. For example, the ARM Cortex A9 transfers data with RAM in blocks/lines of 8 words at a time, and uses a 4-way set associative scheme for mapping blocks to available cache slots.

When we request a piece of data from memory, we first check the L1 to L3 caches. If the data is present, we have a cache hit and can immediately retrieve the data. Otherwise, we have a cache miss, so we copy the block containing the requested data from main memory into the cache, and read the desired data.

# 13/11/15

The average access time is therefore $$t_{avg} = hC + (1 - h)M$$, where $$h$$ is the cache hit rate, $$C$$ is the cache hit time, and $$M$$ is the cache miss penalty (time to copy the missing block in from main memory). So to improve this, we need to improve hit rate (by making caches bigger or prefetching blocks before they are needed), reduce cache hit time (by making caches smaller)m ir reduce cache miss penalty time (by reading the requested data at the same time as we are copying the block from main memory into the cache).

When writing data to memory, caches tend to make things a bit more difficult, because we need to ensure the cache and RAM have the same data. One way to keep the cache and main memory in sync when writing data is to write to both cache and main memory at the same time - the write-through scheme. However, this means writes are just as slow as main memory - only reads are sped up by the cache.

What we can do instead is to write to the cache and mark that block as dirty, and whenever we evict a dirty block, we write it back to main memory. However, this gets very messy in multi-core and multi-processor environments - we could potentially have two caches for different cores have different data, which could have unpredictable results when that data gets written to main memory. This is the cache coherency problem, and we generally have to implement additional infrastructure to handle this, such as cache invalidation schemes and special memory controllers.

Each cache line has a "valid" flag that represents whether that line is mapped to a block of RAM, a "dirty" flag that represents whether the block as been modified, the index of the block of RAM that it's mapped to (if the "valid" flag is set), and the data in that block (if the "valid" flag is set). Within a cache, there are several different ways to map blocks of memory to slots in the cache. One is fully associative mapping, in which any block of memory can be stored in any slot in the cache. This has the highest hit rate and fills the most slots, but every time we want to access a block, we need to search through every single slot to see if it is mapped to that block, which makes this scheme larger and more expensive.

The direct mapping scheme, on the other hand, maps blocks of memory to to slots according to their index modulo the number of slots. This means that each block of memory only has one possible slot they can be cached in. This has the lowest hit rate and fills the least slots, but it requires only one test to see if a block is in the cache, and is therefore smaller and cheaper.

A comprimise between this is the set associative scheme, in which the cache is partitioned into sets and each block is mapped into a particular set according to their index modulo the number of sets, like in the direct mapping scheme. Within each set, the block is mapped to the first free slot, like in the fully associative mapping scheme. This ensures that we only need to search every slot in a set to see if it contains the desired block rather than searching through all slots, while still getting a decent hit rate. The associativity of a set associative scheme is the size of each set - the number of cache lines in each fully associative set.

Suppose we're trying to insert a block into the cache and there's no spaces to insert into. Clearly, if we're using the direct mapping scheme, we can just kick it out the block in the spot that we're supposed to insert into.

If we're using the fully associative/set associative caching scheme, which block do we evict to make room for the new one?

One way to do this is to just kick out a random block (random eviction), or the block that was added to the cache/set first (first-in-first-out/FIFO eviction). However, a better strategy is to use the LRU (least recently used) strategy - the block in the cache/set that was least recently accessed (rather than least recently added like in FIFO eviction). This ensures that commonly accessed data doesn't get kicked out, even if we access a lot of other stuff as well.

In direct mapping, every cache line's corresponding memory block can be uniquely identified by $$n - l$$ bits, where $$n$$ is $$\log_2(\text{number of blocks})$$ (the number of blocks is the number of addresses divided by size of blocks in bytes), and $$l$$ is $$\log_2(\text{number of cache lines})$$.

In fully associative, every cache line's corresponding memory block needs ;wip

### Virtual Memory

In modern operating systems, we want each of our processes to think it has full and exclusive access to the address space, from 0 to $$2^n - 1$$. There are many benefits to this:

• Even if we don't have $$2^n$$ bytes of physical memory, the process can still use high addresses.
• If other processes are using some particular addresses in physical memory, the process can still use those addresses.
• We don't need to relocate code since any code can just start wherever we need it to.
• Processes can't arbitrarily access other processes' memory - virtual memory isolates process address spaces from each other.

To implement this, we split up the virtual memory space into pages, each of a fixed size (like 4 kB - it's large because we want to avoid having to keep track of too many pages). Each page can be stored in either the main memory, or swapped to the hard disk. This is sort of like caching, in that we have to choose between two different storage locations, but there's only one copy of every page stored at any time.

Basically, We need to map pages to physical memory and the hard disk. Each page has a virtual page number (VPN), and is mapped to a physical frame, which has a physical frame number (PFN). Translating a VPN to a PFNs is done by looking up the PFN in a page table, which is just an array of PFNs (plus their control bits) indexed by VPN. This translation is generally handled by the memory management unit (MMU).

There is actually one page-table per process. When using the MMU to translate virtual addresses to physical addresses, we send in a virtual address and the base address for the process' page table in RAM, and get back a physical address.

# 16/11/15

For performance reasons, the page table also needs to have a cache of recently used virtual address and their associated physical address translations to avoid unnecessary table lookups, which are expensive memory operations. This is generally located in the MMU and is called the translation lookaside buffer (TLB).

TLBs are generally fully associative and relatively small. They must be cleared whenever we switch processes.

A page fault is an exception raised when the MMU can't find the page for a virtual address in main memory (the page is on the disk rather than RAM). In the page fault handler, we swap the desired page back from the disk into main memory (the page is swapped with the least recently used page - an LRU scheme), update the page tables, and return to the calling instruction.

## Processor Design

Stage of Execution Data Processing Instructions Load Instructions Store Instructions Branch Instructions
Fetch ADD R0, R1, R2 LDR R0, [R1, #4] STR R0, [R1, #4] BEQ [PC, #16]
Read Registers $$[R1], [R2]$$ $$[R1]$$ $$[R0], [R1]$$ $$[PC]$$
Use ALU $$[R1] + [R2]$$ $$[R1] + 4$$ $$[R1] + 4$$ $$[PC] + 16$$
Use Memory (not applicable) Read from $$[R1] + 4$$ Write to $$[R1] + 4$$ (not applicable)
Write Registers $$R0 \leftarrow [R1] + [R2]$$ $$R0 \leftarrow \text{result}$$ (not applicable) $$PC \leftarrow [PC] + 16$$ if test succeeds

This is the fetch/decode/compute/memory/store model CPU design.

# 18/11/15

Processor data paths with and without pipelines. How each stage of execution is implemented in hardware. See slides for chapter 8 for details.

Registers are always clocked and always have their output enabled. We control data flow by asserting control signals based on the value of registers and memory.

# 20/11/15

Continuing the previous processor datapath descriptions, for each type of instruction.

# 23/11/15

There are several processor control signals in the block diagram. We use combinatorial logic to control these - logic that should accept the current stage of execution, and output all the relevant control signals as necessary, based on the instruction and condition flags.

A pipelines procesor has a very similar design to a non-pipelined one, but data moves along like an assembly line. At each step in the execution of instructions, each instruction moves ahead by one stage. Basically, each step buffers all of its signals to the next stage, including control signals needed to finish executing the instruction.

A pipelined processor stlll has each execution step take one clock cycle, but loads the next instruction into stage 1 while the current one is in stage 2, loads the next instruction into stage 2 while the current one is in stage 3, and so on - the instructions move past the stages in a pipeline, and different stages of different instructions can be executed in parallel, greatly improving maximum speed. Ideally, the pipelined processor should ececute one instruction per cycle.

# 25/11/15

Class is cancelled today - we have an extra lecture on Friday instead.

# 27/11/15

A hazard is a condition in which the processor temporarily cannot pipeline instructions. In these cases, the pipline is stalled at the stage the hazard is in - instructions after the stall continue moving to the next stage at each cycle, but not at or before the hazard, until the hazard is solved. For example, when there is a cache miss in the instruction cache, which needs two cycles to complete. In the worst case, the hazard is at the first instruction, and exists until the entire rest of the pipeline is executed - the number of stages in the pipeline is the delay here in clock cycles. When there is a pipeline stall, stages that aren't actually executing instructions are running what is called a pipeline bubble.

A data hazard is when different instructions access the same data and one or more of them modifies the data. For example, add R0, R1, R2 followed by add R3, R4, R0 stalls the pipeline because the value of R2 isn't saved until the write stage, while the second ADD instruction needs it in the decoding stage. To solve these, the pipeline needs to stall so that each of the conflicting instructions completely execute before the next one starts decoding.

Another way to get around this is to use data forwarding, in which we save the results of the first instruction right after they're available. in the above example, we write R2 during the compute stage of the first ADD, which means the new value of R2 will then be available when the second ADD is in the compute stage in the next cycle.

A control hazard is when the result of a branch is unknown when we need to fetch the next instruction. For example, if we have a JUMP instruction to a computed address, we won't know what the new vlaue of PC will be until the write stage. To solve this, the pipeline needs to stall until the branch has fully executed. Modern processors often also have branch prediction to help get around this.

Branch prediction/speculative execution is when we predict what the next instruction to fetch is going to be (even though it might not be a correct prediction), and continue pipelining based on that prediction - if we turn out to be wrong, we simply clear out those stages of the pipeline that turned out to contain the incorrect prediction. Note that it is important that these predicted instructions can't have side effects until the branch completes, at which point the prediction is validated - that generally means we just do fetch/decode/execute, but not write.

For example, if a conditional jump goes to a fixed address, it must either jump to that address, or not jump and continue executing, so the branch predictor might assume, say, that it didn't jump and continue pipelining with that assumption.

Modern CPUs are very good at making correct branch predictions, often with better than 90% accuracy. Usually, they keep a branch history table (BHT) that tracks addresses of recent fully executed branches and the probability that the branches are taken (generally 2 bits), and then uses this information to make better predictions. Modern ISAs can even let us specify whether the branch is likely to be taken or not, which the CPU can use as a prediction hint. A simple state machine can then update the history table based on actual branching information.

Suppose we have ldr R0, [R1, #4], then add R2, R3, R0 - a load followed by an instruction that depends on the result. Even with data forwarding, there still needs to be a pipeline stall of at least 1 cycle, because the ADD depends on R0 in the decoding stage, and R0 is not available until at least the memory stage (assuming we are using data forwarding).

Compilers can often avoid these stalls by rearranging instructions so that, for example, the load is followed by an instruction that doesn't depend on its result. For example, ldr R0, [R1, #4] followed by add R2, R3, R0 followed by mov R4, #1 can be rearranged into ldr R0, [R1, #4] followed by mov R4, #1 followed by add R2, R3, R0, which doesn't have a load hazard since mov R4, #1 doesn't depend on the result of the load, R0. Modern CPUs can even do this automatically on the fly.

Data hazards are detected during decoding, at which point the processor can do data forwarding.

# 30/11/15

A structural hazard is when a piece of hardware is needed by two or more stages of execution. For example, if the pipeline has an instruction in the fetch stage and a load/store instruction in the memory stage, both of these need to access the MMU, which can only perform one access at a time. To solve this, the pipeline needs to stall so that the instruction earlier in the pipeline continues executing after the other instruction that needs the hardware has finished executing.

Another type of structural hazard is a CPU cache miss, because we need to ask the MMU for memory values, which can take multiple cycles.

The ideal cycles per instruction (CPI) is 1, but hazards cause this number to be higher. For example, if a load hazard has a 1 cycle penalty and occurs 40% of the time in load instructions, which are 20% of all instructions, then it adds $$1 \times 0.4 \times 0.2$$ = 0.08\$ to the CPI, so if there are no other hazards the CPI is 1.08. The throughput is simply the inverse - $$P = \frac 1 T$$, or instructions per cycle.

## Floating Point

Floating point numbers have a sign $$s$$, a magnitude $$m$$, and exponent $$E$$, and represent the value $$(-1)^s m 2^e$$. The IEEE754 standard defines a representation for these floating point numbers for use in computers.

This standard generally represents the magnitude using a fraction, which is the fractional part of $$m$$ when $$1 \le m < 2$$ (we can make $$1 \le m < 2$$ by changing the exponent) - we ignore the 1 before the decimal point. For example, if we have a magnitude of 1.125, we have 1.001 in binary, so a 23-bit fraction value would be 00100000000000000000000.

In IEEE754, the exponent's value is represented by an unsigned binary number with a bias $$b$$ added to it (the bias lets us represent negative exponents). The biased exponent $$E'$$ is $$E + b$$ - this is the value we actually store in the exponent bits.

A biased exponent with all 1 bits (the very largest exponent) is reserved for special values, while a biased exponent with all 0 bits (the very smallest exponent) is reserved as well for very small values, which need to be represented differently. The sign bit is 0 for a positive/zero number and 1 for a negative one.

In IEEE654 4-byte single floating points, the sign is represented by the first bit, followed by 8 bits for the exponent (with a bias of 127), and 23 bits for the fraction.

The 23-bit fraction can represent about 16 decimal significant digits, since it has a precision of $$2^{-23}$$ for each step. That means the fraction can represent all numbers from 1.000000 to 1.999999 with 6 significant fractional digits.

The highest exponent we can have is 127, because the largest biased exponent is binary 11111110/decimal 254 (since 11111111 is reserved) and the bias is 127, so we subtract the bias from the biased exponent to get the actual exponent, 127. The lowest exponent we can have is -126, because the smallest biased exponent is binary 00000001/decimal 1, so we subtract the bias to get -126.

Therefore, singles can represent numbers within $$\pm (-1)^0 1.9999999 \times 2^{127} \approxeq 3.4028235 \times 2^{38}$$.

In IEEE 8-byte double floating points, the sign is represented by the first bit, followed by 11 bits for the exponent (with a bias of 1023), and 52 bits for the fraction.

The 52-bit fraction can represent about 15 decimal significant digits, since it has a precision of $$2^{-52}$$ for each step. That means the fraction can represent all numbers from 1.000000000000000 to 1.999999999999999 with 15 significant fractional digits

The highest exponent we can have is 1023, because the largest biased exponent is binary 11111111110/decimal 2046 (since binary 11111111111 is reserved) and the bias is 1023, so we subtract the bias from the biased exponent to get the actual exponent, 1023. The lowest exponent we can have is -1022, because the smallest biased exponent is binary 00000000001/decimal 1, se we subtract the bias to get -1022.

Therefore, doubles can represent numbers within $$\pm (-1)^0 1.999999999999999 \times 2^{1023}$$.

Recall that the easiest-to-remember way to convert a fractional decimal number to binary is to decompose it into negative powers of 2. For example, $$0.4375 = 0.25 + 0.125 + 0.0625$$, so we have binary 0.0111. Likewise, we can convert the binayr back to decimal by decomposing it back into $$0.25 + 0.125 + 0.0625$$ and adding them together.

For example, we will convert 0.75 to a single floating point. Clearly, 0.75 multiplied by $$2^1$$ is 1.5, which is between 1 and 2, and 1.5 in binary is 1.1. Therefore, the sign bit is 0 (0.75 is positive), the exponent is -1 (since we had to multiply by $$2^1$$ in order to get the fraction between 1 and 2), the biased exponent is binary 01111110/decimal 126 (since we add the bias to the exponent), and the fraction is 10000000000000000000000 (which represents 1.5). All together, we have 10111111010000000000000000000000.

Previously, we said that the exponent value with all 0 bits is reserved. This is reserved for the denormalized representation - when the biased exponent is all 0, the fraction actually represents the fractional part of $$m$$ when $$0 \le m < 1$$ rather than $$1 \le m < 2$$, and the exponent is assumed to be $$1 - b$$. For example, in single floating points, denormalized representation represents $$(-1)^s m \times 2^{-126}$$ where $$0 \le m < 1$$.

The highest exponent value, when the biased exponent is all 1 bits, is reserved for the special values. When the exponent is at this highest value and the fraction is 0, it represents infinity or negative infinity (depending on the sign bit), In other words, infinity is 01111111100000000000000000000000 and negative infinity is 11111111100000000000000000000000 for single floating points. When the exponent is at this highest value and the fraction is not 0, then it represents NaN (not a number).

# 2/12/15

Note that floating point numbers represented using IEEE754 can only have magnitudes between 1 and 2 (or between 0 and 1 for denormalized representation). Normalization is a process that converts any floating point number into one that can be represented using IEEE754. To do this, we can repeatedly multiply by 2 and subtract 1 from the exponent while the magnitude is greater than 2, and repeatedly divide by 2 and add 1 to the exponent while the magnitude is less than 1. If the resulting value is less than $$2^{1 - b}$$, then we write it in denormalized form.

Rounding is the process of representing numbers with a certain number of significant digits using fewer significant digits. In the process, we lose information and introduce error. There are 3 common rounding schemes:

• Truncation at digit $$k$$ simply cuts off any digits past $$k$$.
• This gives an error within $$[0, 0.1 \ldots 1]$$ with respect to the $$k$$th digit (where the number of 1's is the number of significant digits we're cutting off), but introduces a downward bias in the data.
• For example, 000.000 rounds to 000, 000.100 rounds to 000, 001.000 rounds to 001, and 001.100 rounds to 001.
• Von Neumann rounding at digit $$k$$ sets digit $$k$$ to 1 whenever the digits after it are nonzero, and cuts off all digits after it.
• This gives an error within $$[-0.1 \ldots 1, 0.1 \ldots 1]$$ with respect to the $$k$$th digit, which is twice the error of truncation but it doesn't bias the data and the errors tend to cancel out over multiple operations.
• For example, 000.000 rounds to 000, 000.100 rounds to 001, 001.000 rounds to 001, and 001.100 rounds to 001.
• Round to nearest even at digit $$k$$ takes the digits up to $$k$$, and adds 1 to the $$k$$th digit if the digits after it are above half of it ($$\ldots, d_k, 1, 0, 0, 0, \ldots$$). If the digits after the $$k$$th are exactly half, then it only adds 1 to the $$k$$th digit if it is already 1.
• This gives an error within $$[-0.1, 0.1]$$ with respect to the $$k$$th digit, which is better than Von Neumann, while still not biasing the data and allowing errors to cancel out.
• However, it requires adding (which needs a carry signal), and operations will need additional steps to make rounding work properly.
• For example, 000.000 rounds to 000, 000.100 rounds to 000, 001.000 rounds to 001, and 001.100 rounds to 010.

Adding/subtracting two floating point numbers is done by aligning both to the largest exponent (so they both have the largest exponent, but possibly different magnitudes), adding/subtracting their magnitudes, and then normalizing the result. Note that this means that if we subtract two numbers that have the same exponent and have the same first $$k$$ bits, then we lose $$k$$ significant bits in the result.

Multiplication/division is done by adding/subtracting the exponents (to add biased exponents, $$E'_r = E'_x + E'_y - b$$; to subtract biased exponents, $$E'_r = E'_x - E'_y + b$$), multiply/divide the signs/magnitudes, normalize, round, then normalize again (this step is only really necessary for round-to-even, since it adds bits rather than just setting least-significant bits). Rounding is possible because the multiplication/division is always done with 3 extra bits of precision.

For example, to add $$1.111 \times 2^{-5}$$ and $$1.001 \times 2^{-7}$$, we can match the exponents to get $$1.111 \times 2^{-5} + 0.01001 \times 2^{-5} = 1.111 \times 2^{-5} + 0.01001 \times 2^{-5}$$

# 4/12/15

Unsigned multiplication is done by shifting and adding. We can either have dedicated adders for each shifted value (takes only one cycle to execute, but big circuit), or shift by 1 bit and add for every bit (takes many cycles to execute, but small circuit).

If we have an $$n$$ hardware multiplier, we need $$n - 1$$ adders. Each adder is enabled based on a particular bit of one multiplicand, and each adder adds the other multiplicand to the partial result from the previous adder. The final output is the product.

Recall that a very fast way to invert a binary number is to invert all the digits to the left of the rightmost 1 - this works regardless of the number's sign and requires no arithmetic. For example, 00000011 becomes 11111101.

One simple way of doing signed multiplication is to take the absolute value of both multiplicands, multiply them, then inverting the result if the multiplicand signs don't match. We could also use Booth's algorithm, which is somewhat simpler to do by hand (see course notes for procedure).

The final exam is cumulative - 25% of the content is pre-midterm lecture content, 75% is post-midterm lecture content. Questions are based on the assignments. There will be at least one ARM programming question.