Foundations of Sequential programs.

```
Instructor: Bradley Lushman
ISA: Sean Harrap
Email: CS241@uwaterloo.ca
Web: http://www.student.cs.uwaterloo.ca/~cs241
```

**Note:** I made a GDB-style MIPS debugger for the specific instruction set used by CS241, current as of Winter 2015. You might find this useful if you'd like to debug your code or disassemble MIPS binaries.

Weekly assignments, 11 total.

A sequential program is an "ordinary" program - one that is not concurrent, parallel, or so on. Sequential programs only do one thing at a time - they are single threaded.

As this course focuses on the foundation, we will be starting from the hardware, and figure out how sequential program works from there.

This course uses a simulated MIPS machine for the assignment. At the end we will be able to run a relatively complex C-like language that can run on the MIPS machine.

Refer to earlier notes for background.

A bit is a binary digit - a 0 or a 1. A nibble is a collection of 4 bits. A byte is a collection of 8 bits, although historically it could be any number needed to hold a single character - sometimes 7. A word is the a machine-specific grouping of bits - the number of bits in an address on the computer.

In this course we will be using a 32-bit architecture.

Data in a computer's memory could mean anything - it is simply data.

There are a number of conventions for representing negative numbers in binary. The **sign magnitude** convention has the leftmost bit of a binary number represent whether it is negative. Although this system is simple, it's difficult to do math with since we have to treat the first bit separately for a lot of operations. Plus, there's two representations for the number 0 - `00...00`

and `10...00`

.

The **one's complement** convention has positive numbers written normally, and negative numbers are their magnitude with their bits flipped. This works better for certain operations, but there is still two 0 representations, which complicates things unnecessarily.

The **two's complement** convention has positive numbers written normally, and the negative numbers are simply their magnitude with their bits flipped and 1 added. The advantage of this system is that the first bit still denotes the sign. Plus, there is only one representation for 0, and operations like addition work the same way as for unsigned numbers, modulo \(2^n\).

To find the numerical value of a number represented using \(n\)-bit two's complement, we first check the sign bit. If 0, then we simply convert binary to a non-negative number as usual. Otherwise, we convert the binary to a non-negative number as usual, and subtract \(2^n\) from that result to get a negative number as a result.

To quickly negate a binary number in two's complement quickly, flip all the bits and add 1, convert that to a positive number, then negate that number. This allows us to quickly convert negative numbers to binary and hex.

For 3-bit numbers, each binary string, in ascneding order, represents 0, 1, 2, 3, -4, -3, -2, -1.

A byte could also potentially be a character. Characters have multiple possibble encodings, such as unicode and ASCII. ASCII is a 7-bit code (for historical reasons, in order to avoid sending expensive bits) that has representations for most America-centric characters, although there are 8-bit extensions like those by IBM that add multilingual characters. In an 8-bit byte world, the eighth bit is 0 in standard ASCII. We will be using ASCII in this course to represent characters.

A byte could also be part of a data structure, or even part of an instruction. In this course, we have 32-bit instructions. A byte could also just be garbage - memory that isn't currently being used for anything and has no meaning.

The idea is that the meaning of a byte is whatever meaning we assign to it, and we have to remember what each little chunk of memory we're using actually means.

Computer programs operate on data, and are theselves data. Historically, the program code was stored in special memory such as punch cards or switches, and it operated on separate data memory. This was known as the Harvard architecture.

John Von Neumann proposed that we could just put the program code in the same memory as the rest of the data. This is known as the Von Neumann architecture. The result of program code being treated as just a program was that we can now write programs that operate on programs, even themselves. For example, operating systems, compilers, and viruses that modify existing code to do what it wants.

Machine language is specific to a particular processor type, like MIPS, ARM, and x86. In this course we will be using a simplified version of MIPS.

The MIPS machine consists of a CPU and main memory (RAM), connected via a bus. The CPU contains the ALU, the registers, and the control unit.

The ALU (arithmetic logic unit) is responsible for doing arithmetic and logic, for doing things like addition and subtraction.

Registers are simply small pieces of memory that are really fast to work with since they are so close to the CPU - they are the fastest pieces of memory we have, and we have relatively little of it. Most instructions can actually only work directly on registers, so we usually need to bring data from the main memory into registers before we work on them.

There are the 32-bit **general purpose registers** R0 to R31, which can be used for general purpose data storage. In our assembly code, we refer to them with $0, $1, $2, and so on, until $31. However, R0 is always 0, and R30 (stack pointer/frame pointer) and R31 (return pointer) have other purposes in control flow by convention.

The **program counter register** is responsible for keeping track of where we are in the main memory for executing code. It holds an address to the next instruction to run while the current instruction is running. Some commands manipulate this register, so we can actually do control flow by modifying this in a certain way.

By convention, we guarantee that a certain address in memory like 0 always starts off with valid code, and then we initialize PC to that location. So when we start the machine, it will start executing instructions from that location and go onwards.

The **instruction register** IR is closely related to PC. Instead of holding the address of the next instruction, it holds the current instruction itself (not its address).

The HI/LO registers are a pair of registers that are useful for storing the output of operations like division, which results in the quotient and the remainder, stored in this pair of registers. They cannot be accessed directly, but are implicitly used by certain instructions.

The **memory address register/memory data register** MAR and MDR are used for working with memory. MAR is used to store the desired address for us to load, and MDR is used to store the loaded value.

MAR/MDR, PC, and IR are all hidden registers - they are not visible to our programs.

The control unit decodes instructions, and dispatches them to the rest of the computer to actually execute them.

There are many types of memory, such as registers, CPU cache, main memory, disk memory, and network memory, in order of increasing distance from the CPU. The closer the memory to the CPU, the faster it is to work with.

The main memory (RAM) is a large amount of memory stored away from the CPU, and communicates with the CPU over the bus (physically, a bunch of wires). It can be thought of as a huge array of \(n\) bytes. Each byte in the RAM has an address, one of \(0, \ldots, n - 1\).

Since a word in MIPS is 32 bits (4 bytes), and we can only access memory on word boundaries, we can only get or set addresses that are multiples of 4.

RAM access is much slower than register access - as much as three orders of magnitude. Usually, data in RAM must be transferred to registers before it can be used.

It takes 5 bits to encode the 32 general purpose registers (we can't directly use the other registers as memory). Therefore, we need 15 bits to encode two source registers and a destination register. That means we have 17 bits left to encode the rest of our instruction.

There are actually only two instructions for working with RAM - load word and store word.

If we do something invalid like dereferencing an invalid address, our machine will simply crash. Real-world machines have machine exceptions to deal with this.

The CPU essentially works as follows:

```
PC <- 0
loop:
IR <- MEM[PC]
PC <- PC + 4
decode and execute the instruction in IR
```

A program gets executed by getting called by a **loader**, which puts the program into memory (from somewhere like the hard drive) and sets PC to the first instruction in the program. This is the role of the operating system when opening a program.

When our program wants to end, we want to return control flow back to the loader. This is done by setting PC to the next instruction of the loader after the one that started our program.

Conventionally, R31 is set to the address of the next instruction in the loader, by the loader - the address we need to jump to in order to return to the loader. Therefore, in our program code, we simply set PC to R31 in order to jump back to the loader. This looks something like `jr $31`

in MIPS assembly.

Example code to add value in register 3 to register 5, store the result in register 7, then return:

```
RAM LOCATION BINARY HEX MEANING
------------ --------------------------------------- ---------- --------------
00000000 0000 0000 1010 0111 0001 1000 0010 0000 0x00A71820 add $3, $5, $7
00000004 0000 0011 1110 0000 0000 0000 0000 1000 0x03e00008 jr $31
```

The resulting code is therefore the binary representation of the hex value `00A7182003e00008`

.

To store a literal value into a register, we can use the **load immediate and skip** (`lis`

) instruction. Essentially, it reads the next word in memory, adjacent to the `lis`

instruction, and stores it in the specified register.

Note that ASCII is a mapping between characters and binary, and ASCII values are not binary. For example, `0010000100110101`

is simply an ASCII string of 1 and 0, while `!5`

would be the ASCII encoding the previous string would result in if interpreted as binary.

To read the ASCII representation of a given binary file, use `xxd <filename>`

or `hexdump -Cv <filename>`

, where `<filename>`

is the path to the file to represent. `xxd -bits`

shows binary, and `xxd -cols n`

shows `n`

bytes per line.

Instead of writing in hex or binary, which is very tedious to convert, we can simply use the mnemonics like `add`

and `lis`

, and have a program (the **assembler**) translate it to binary for us. This reduces our chances of error and makes programming much simpler.

Essentially, one line of assembly is one word, generally one machine instruction like `jr`

. There are also lines that are **directives**, which are instructions to the assembler to do something, rather than the machine. For example, the `.word X`

directive tells the assembler that at the current point, we want the resulting binary output to have the specified literal value `X`

.

The above code example could therefore be written as:

```
add $3, $5, $7
jr $31
```

Some instructions unconditionally modify PC, like `jr`

- these are known as jumps. Some instructions conditionally modify PC, like `bne`

- these are known as branches. Generally, the modification is given in terms of the number of words to move by. Note that it is possible to give negative values.

A quick way to zero a register is `add X, $0, $0`

, where `X`

is a register. `lw A, i(B)`

loads the value in memory at address specified by register `B`

with offset specified by literal number `i`

, storing the result in register `A`

. `sw A, i(B)`

does the opposite - writing the value of `A`

in memory instead of reading from it.

The `mult A, B`

instruction has a somewhat unusual form - it only has the multiplicands, and no output parameter. Instead, since multiplication can result in a 64-bit number, the answer is is stored in the HI and LO registers, the LO bits having the lower 32 bits and the rest in HI. To get values out of HI and LO, we have `mflo A`

, which gets the value of LO and stores it in the register `A`

, and `mfhi A`

, which does the same thing, but for HI.

Division is similar. After dividing, the quotient is in LO, and the remainder is in HI.

Counting the number of instructions to jump by is tedious and error-prone. We can mitigate this using **labels**. For example:

```
some_label: add $1, $2, $3
add $4, $5, $6
some_other_label: .word 0x12345678
```

Labels are assembler features that allow us to specify a specific location and use it later on. In any jump or branch instruction, we can simply write the label name rather than the offset, like `bne, $2, $0, some_label`

, which will jump to the location specified by `some_label`

if and only if `$2`

is non-zero.

We want to be able to write functions in MIPS assembly - blocks of reusable code that we can call with parameters and have control returned to us afterward.

Our procedures need to guarantee that the registers will remain unchanged - after calling the function, the values of registers will be guaranteed to be as they were before calling. Otherwise, subfunctions will clobber their caller's register values.

The MIPS loader initializes R30 to the address of the word just past the last word in memory - this is the **stack pointer**, and we are using the end of the memory as a stack that grows downward. This convention allows us to use RAM as a stack for storing our calling contexts - by decreasing the stack pointer, we reveal more memory we can use, and by increasing the stack pointer, we can deallocate that memory.

We will allocate blocks of memory on the stack that we use as stack frames. Stack frames store register values and the return address.

When we call a function, we allocate a frame and store all our register values in it for safekeeping, after which we can use the registers as we want in the function. When the function returns, we restore the values of the registers, and then jump back to the return address.

When we want to return from a function, we need to set the value of PC to after the jump that called the function - the very next instruction after the function call. Since functions can be called from multiple places, the caller must give the return address.

There is a "jump and link register" instruction `jalr $s`

that sets R31 to the next instruction (PC), and jumps to the specified address (setting PC to `$s`

). This allows us to store the return address in R31, and jump to the function, all in one operation.

However, we also need to save R31 since `jalr`

will overwrite it. In other words, we will need to store R31 on the stack.

To pass parameters to functions, we can just use registers (documenting what each parameter register is for is very important). If there are more parameters than available registers, it is also possible to store parameters on the stack.

We can also pass back the return value via an agreed-upon register. We might specify that the return value is in R2, for example.

A template for functions can now be defined:

```
some_func:
; store the registers we will be using on the next free spaces on the stack
sw $2, -4($30)
sw $3, -8($30)
; decrement the stack pointer by the number of bytes used by our stack frame
lis $2 ; $2 is free to use now that we have saved its former value
.word 8
sub $30, $30, $2
; body of the function
; increment the stack pointer again to restore it to its previous value, assuming we didn't change $3
add $30, $30, $3
; make sure to pop the registers off the stack
lw $3, -8($30)
lw $2, -4($30)
jr $31 ; return to the instruction after the one we called the function from
```

And we can call it as follows:

```
main:
; body before the function call
; save the value of $31 on the stack to keep our return address
sw $31, -4($30)
lis $31 ; $31 is free to use now that we have saved its former value
.word 4
sub $30, $30, $31
; jump to the function with $31 as the return address
lis $31
.word some_func
jalr $31 ; store the value of PC into $31 as the next instruction, and set PC to $31 - we are storing the return address in $31
; restore the value of $31 back to its original value
lis $31
.word 4
add $30, $30, $31
lw $31, -4($30) ; we do this after we're done using $31 as a temporary variable
; body after the function call
jr $31 ; return back to the loader
```

Example procedure:

```
; sum_1_to_N: adds natural numbers from 1 to N
; Registers:
; $1 - Value of N
; $2 - Output sum
sum_1_to_N:
; save the register values on the stack
sw $1, -4($30)
sw $3, -8($30)
lis $1
.word 8
sub $30, $30, $1
; function body
add $2, $0, $0 ; zero out the output register
lis $3 ; set R3 to 1
.word 1
sum_1_to_N_loop:
add $2, $2, $1 ; add the current value to the output
sub $2, $2, $3 ; decrement the value
bne $2, $0, sum_1_to_N_loop ; break out of the loop if the value is 0
; restore the values of the registers
lis $1
.word 8
add $30, $30, $1
lw $3, -8($30)
lw $1, -4($30)
jr $31
```

To do output, write a word to 0xFFFF000C. The least significant byte will be printed as an ASCII character:

```
; prints a newline
lis $1
.word 0xFFFF000C
lis $2
.word 65
sw $2, 0($1)
```

The assembler is a translater, from assembly code to machine code. Any translation process will involve two phases, analysis (understanding the input) and synthesis (producing the output from the understanding of the input).

Our approach is to first group the characters into meaningful tokens, like a label, a hex number, a register, a directive, or so on. Tokenization is done by something known as a **lexer**, which accepts source code and outputs a stream of tokens.

We already have a tokenizer available to us in `asm.ss`

and `asm.cc`

. Our goal is to group tokens into actual instructions if possible - the analysis phase. If there is an error, we must output the string "ERROR" to stderr. The assembled MIPS code is printed to stdout.

To parse labels, occasionally we will have the label definition occur after their usages. To properly assemble this, we need to do things like make an extra pass beforehand to create a symbol table, or put in a placeholder value and fix it after parsing all the labels.

Also note that it is also possible to label the address just past the end of the program, and a line may have more than one label associated with it.

Instructions in MIPS start with a 6 bit opcode, followed by their parameters that specify things like registers and offsets.

An OS can be as simple as the following:

```
repeat:
p <= the next program to run
$2 <= loader(p)
jalr $2
beq $0, $0, repeat
```

Note that the OS itself is just a program, which means it needs to be in memory as well. Since execution starts at 0x00000000, we need to put the OS at the very beginning of the memory. That means our program cannot occupy the first location in memory.

That also means that the addresses of the labels will not be correct if we load them directly, since the assembler created the machine code assuming addresses start from the beginning of the program (a label at the very beginning of the assembly should result in a value of 0x00000000). If we load our program at offset \(d\), then we must add \(d\) to the value of each label or reference to a label, like `.word some_label`

. This process is called **relocation** and allows us to load the code anywhere in memory and have it run properly.

Since the machine code is just a stream of bytes, we don't know which words are labels and which are not - the assembler has to give us additional information about what type of thing each word is.

As a result, what most assemblers output is not actually just machine code, but actually **object code** - machine code with additional metadata that lets us load and run the program in a clean way. These assemblers are called **relocatable assemblers**.

Basically, the metadata needed in object files includes the addresses of words that represent label addresses in the code so we can relocate it, plus things like exports and imports.

The object file format we will be using is called **MIPS executable relocatable linkable** (MERL), and often has a filename of the form `*.merl`

. It looks like this:

- The literal word 0x10000002, which is the MIPS instruction for
`beq $0, $0, 2`

, and skips two words ahead, past the rest of the header.- this means that the object file can also be run as a normal MIPS program, since this value causes it to jump past the header and straight to the code if loaded at address 0x00000000.
- It also acts as a sanity check to make sure the file is really a MERL file.

- The length of the entire module in bytes as a word - this is the length of the header (12), the machine code, and the relocation table.
- The address of the end of the code section in bytes as a word, which is the length of the header (12) and the machine code.
- The actual MIPS machine code, which includes the 12 byte header offset - a label at the beginning of the machine code is at address 12.
- The
**relocation table**, which appears after the code returns and has a sequence of entries for different types of relocatables. An entry in the relocation table consists of two words:- The format code - a word representing the type of entry this is. For example, a value of 1 represents that this entry is a relocation entry.
- The actual address, which, if this entry is a relocation entry, is the address of a word holding the address of a label. This is relative to the beginning of the header.

The MERL format could also be written as follows:

```
; MERL header
beq $0, $0, 2
.word end_of_module
.word end_of_code
;code goes here
add $1, $2, $0
; MERL footer - relocation table
end_of_code:
.word 0x1 ; relocatable address
.word SOME_ADDRESS
; ...
.word 0x1
.word SOME_OTHER_ADDRESS
end_of_module:
```

The `cs241.merl`

tool accepts a MERL file and a relocation address and outputs MIPS machine code relocated at the desired address, ready to be loaded starting at that offset. The `mips.twoints`

and `mips.array`

tools also support an optional second argument that specifies the address to load the code at.

Relocation is typically done by the loader, automatically. The loader usually has the following structure:

```
verify(read_word()) # check cookie
end_of_module = read_word()
code_length = read_word() - 12
# load code
p = allocate_RAM(code_length + stack_space)
for i in range(0, code_length, 4):
RAM[p + i] = read_word()
relocation_table = code_length + 12 # start of relocation table
for i in range(relocation_table, end_of_module, 8):
format = read_word()
if format == 1: # we only support format 0x1 for now
offset = read_word() - 12
RAM[p + offset] += p - 12 # we subtract 12 since in MERL addresses are relative to the start of the header, and the header wasn't loaded into RAM
```

When we have multiple files, we might have a function that calls a function in another file. However, the assembler can't assemble this directly since the label of the function isn't defined in the file it's being called from.

We could just append all the code together and then assemble them, which works but is not very good, since that requires the assembly code to be available (in practice, people just want to distribute MERL), and we would need to assemble it possibly multiple times in larger projects.

A better solution is a program that accepts multiple MERL files and properly puts them together into one - the **linker**. We want to put MERL files together without ever forcing it to know about ASM.

First, what we could do is have the assembler ignore unknown labels, leaving them to be resolved to actual values later. This is done by simply using a placeholder value like 0, and then storing a list of these addresses in the relocation table along with the labels we want to assign to them.

To make this easier for the assembler, we add the `.import IDENTIFIER`

directive, which does not result in any output, but tells the assembler that `IDENTIFIER`

is an identifier that is external to the current file and will be linked in later. This allows us to catch typos in our label names more easily - we can give an error if a label isn't defined and isn't imported.

In the MERL files, we define a new format code for the relocation table - 0x11, the **external symbol reference** (ESR) format. The table entry then contains the address of the label reference, and the name of the label:

```
; ESR relocation table entry
.word 0x11
.word ADDRESS # location where the label was used
.word LENGTH # number of words/chars that make up the label name
.word C[0] # first character of the label as ASCII value
; ...
.word C[LENGTH - 1] # last character of the label as ASCII value
```

However, we also need a way for a module to say what labels it has, so other modules can actually import it. We don't want to just export all the labels in a MERL module - labels are often duplicated, expecially when we start having compilers generating them automatically.

Instead, we have a `.export LABEL`

directive, which makes a new entry in the relocation table, in the **external symbol definition** (ESD) format. This contains the address where the label is defined, and the name of the label:

```
; ESD relocation table entry
.word 0x5
.word ADDRESS # location where the label was defined
.word LENGTH # number of words/chars that make up the label name
.word C[0] # first character of the label as ASCII value
; ...
.word C[LENGTH - 1] # last character of the label as ASCII value
```

In programming languages like C, each function gets its own `.export`

entry, which is how the linker can link C object files together. A `static`

function doesn't get an ESD entry, which is why it can't be imported from other files.

Now we can write a linker:

```
# Given MERl files m1 and m2 as binary blobs, and we want to make a single MERL file with m2 linked in after m1
# `get_imports` and `get_exports` get the ESRs and ESDs, respectively, as a dictionary mapping label names to addresses
a <- len(m1) - 12
relocate m2 higher by `a`
add `a` to every address in m2's symbol table
assert set_conjunction(get_exports(m1).keys(), get_exports(m2).keys()) == empty_set
# link imports and exports together
for address1, label in get_imports(m1):
if label in get_exports(m2):
m1[address1] = get_exports(m2)[label]
delete the ESR entry `label` from m1's relocation table
add relocation entry `address1` to m1's relocation table
for address2, label in get_imports(m2):
if label in get_exports(m1):
m1[address1] = get_exports(m1)[label]
delete the ESR entry `label` from m2's relocation table
add relocation entry `address1` to m2's relocation table
# we don't have an else to check for any references still unresolved because this linker only links two MERL files at a time, so it's possible that we run this multiple times for three or more files
# make the final relocation table
relocations = set_disjunction(get_relocations(m1), get_relocations(m2))
imports = set_disjunction(get_imports(m1), get_imports(m2))
exports = set_disjunction(get_exports(m1), get_exports(m2))
# output MERL file
output MERL cookie
output total code length plus relocation table size plus 12 (header size)
output total code length plus 12 (header size)
output m1
output m2
output relocations, exports, and imports
```

The compiler is responsible for converting a high level language to assembly. Unlike an assembler, the compiled result is not unique - there are multiple possible ways to convert high level languages to assembly.

We want a formal theory of string recognition - general principles that work in the context of any programming language.

An **alphabet** is a finite set of symbols, generally denoted \(\Sigma\) (\(\Sigma = \set{a, b, c}\), for example). Symbols can be anything we want, and might not even be strings.

A **string/word** is a finite sequence of symbols, each symbol belonging to \(\Sigma\) (\(abaabbcca\), for example).

The length of a string \(s\) is \(\abs{s}\). The empty string is usually denoted \(\epsilon\). This should not be confused with any symbols in the alphabet - \(\abs{\epsilon} = 0\) and \(\epsilon\) is not in any alphabets.

Given a string \(s\), \(s^n\) is equivalent to \(s\) repeated \(n\) times - \(s^n = s \ldots s\).

A **language** is a set of strings (\(\set{a^{2n}b \mid n \ge 0}\), for example, which is an even number of \(a\) followed by a \(b\)). The empty language is just an empty set \(\emptyset\) - a language with no words. This is different from \(\set{\epsilon}\), which is a language with one word, the empty word.

We want to determine if a given string belongs to a given language. How difficult this is depends on the language. For infinite sets of strings, like the set of all valid MIPS or C programs, things get a little trickier. In fact, there exist languages where this is impossible.

Therefore, we want to characterize languages by how hard they are to recognize. Noam Chomsky came up the Chomsky heirarchy to deal with this. The following is based on that, and lists language classes in ascending order of difficulty recognizing:

- The
**finite languages**are the easiest to recognize - there are a finite number of strings, so we can just check each one. - The
**regular languages**. - The
**context-free languages**. - The
**context-sensitive languages**. - The
**recursive languages**. - There are other types that are more difficult to recognize, and some that are even impossible. For example, a language that contains one word, the solution for an unsolvable problem in mathematics.

Our goal is to use as easy a level as possible on the heirarchy, since then the compiler will be simpler. Higher levels are supersets of lower levels.

We can see if a word is in a finite language by comparing it is in each element of the finite set. However, we can do this more efficiently by going char by char and filtering out the ones that are not matches.

For example, if we have a language with words "cat", "car", "cow", we can follow the following procedure:

- Scan the input left to right.
- If the first character is "c", then move on, and otherwise produce false.
- If the next character is "a", and the next character is "t" or "r", and there is no next character, then produce true, and otherwise produce false if "t", "r", or end failed to match.
- If the next character is "o", and the next character is "w", and there is no next character, then produce true, and otherwise produce false if "w" or end didn't match.
- Otherwise, produce false.

This only scans the word once - we read only in one pass. Although the above procedure uses no memory, there is still state - the value of PC is different whenever we read a character.

We can abstract this sort of decision-making program into something like a graph. Every state in the program is represented as a circle, and transitions between states can be represented by drawing unidirectional arrows between them, labelled with what the transition is. In the above example, states might be values of the PC register, and state transitions might be labelled by the character that is read next.

A circle within a circle (or a circle with a double border) accepts - it represents a state that, if the input ends there, we can simply accept it as valid, and we can use it as an end state. A circle with an arrow pointing into it from the left represents the start state.

This sort of program is known as a **state machine**. The circles we draw are **states** - configurations of the program based on the inputs seen. The diagrams we draw are called **deterministic finite automata** (DFAs).

All state machines that check the validity of an arbitrary string in a finite language can be written as a tree. In other words, for any finite language there exists a DFA without an cycles.

We match DFAs as follows:

- Start at the start state.
- For each char in the input, follow the correponding arrow to the next state, and produce false if there are no possible transitions from any of the states.
- We are now at the end of the input. Produce true if the current state is an ending state, and false otherwise.

To make the math more elegant, we define an implicit **error state**, and every single state goes to this state for unlabeled transitions. So if we get an unexpected input for a node (so we don't have a defined transition for the input), we implicitly define a transition to the error state. The error state is not special, but it is always a state that cannot transition to any other states, and it never matches.

A DFA represents a finite language if and only if it does not contain any cycles.

The regular languages are built from finite languages, plus set operations like union (set union of the words of two languages - \(L_1 \cup L_2 = \set{x \mid x \in L_1 \vee x \in L_2}\)), concatenation (every possible combination of elements from one set concatenated to another - \(L_1 L_2 = \set{xy \mid x \in L_1 \wedge y \in L_2}\)), and repetition (a language repeated an arbitrary number of times - \(L^* = \set{\epsilon} \cup \set{xy \mid x \in L^* \wedge y \in L} = \set{\epsilon} \cup L \cup LL \cup LLL \cup \ldots\)).

Note that repetition generally results in infinite sets - this means that the regular languages contain infinite languages.

We show that a language is regular by building it up from finite languages and the set uperations (union, concatenation, and repetition). For example, \(\set{a^{2n}b \mid n \ge 0}\) is regular because \(\set{a^{2n}b \mid n \ge 0} = \set{aa}*\set{b}\).

A **regular expression** is a shorthand for all the cumbersome set notation we are using for describing regular languages, and is itself a language we will denote \(E\). It works as follows:

- \(E = \emptyset\) represents \(L = \emptyset\).
- \(E = \epsilon\) represents \(L = \set{\epsilon}\).
- \(E = aaa\) and other sequences of non-special characters represent \(L = \set{aaa}\) and themselves in a set.
- \(E_1|E_2\) represents \(L_1 \cup L_2\).
- \(E_1 E_2\) represents \(L_1 L_2\).
- \(E_1 = E_2^*\) represents \(L_1 = L_2^*\).

For example, \(\set{a^{2n}b \mid n \ge 0}\) can be represented with \((aa)*b\). Regular languages are, in practice, relatively simple. For example, C is not a regular language, but C programs are made up of tokens, each of which is part of a regular language - tokens, string literals, number literals, types, and so on.

For regular languages, DFAs can have cycles - this occurs as a result of repetition.

Consider the language defined as the set containing all strings that are composed of an even number of "a" and an odd number of "b". Clearly, we can write a DFA for this using four states (even "a" and even "b" - start, odd "a" and even "b", odd "a" and odd "b", and even "a" and odd "b" - end/accept). The regular expression for this is rather difficult to write. ;wip: how?

Formally, a DFA is a 5-tuple \((Q, \Sigma, q_0, A, \delta)\), where the variables represent the following:

- \(Q\) is the set of states of the DFA - all possible states. This is a finite, non-empty set.
- \(\Sigma\) is the alphabet of the DFA - all possible input values. This is a finite, non-empty set.
- \(q_0\) is the start state - the state at which the DFA is initialized. This is an element of \(Q\).
- \(A\) is the set of accepting states of the DFA - the states that the input can terminate at. This is a subset of \(Q\)
- \(\delta\) is the transition function - a mapping from a state and input character to a new state. This is a function \(q, x \to r, q \in Q, x \in x \in \Sigma, r \in Q\).

DFAs don't always have to just accept characters. It is also possible to consume entire words, using an extension of the original \(\delta\) function, which only consumes one character at a time.

This **word consuming function** can be defined as \(\delta^*(q, \epsilon) = q, \delta^*(q, cw) = \delta^*(\delta(q, c), w), q \in Q, c \in \Sigma, w \in \Sigma^*\). A DFA accepts a word \(w\) if \(\delta^*(q_0, w) \in A\) - the DFA accepts the input if read from its start state. \(\delta^*\) represents a function that gives the end state of the DFA given a start state and input.

Given a DFA \(M\), \(L(M)\) represents the set of all strings accepted by \(M\). This is the set of all elements in of \(\Sigma^*\) such that \(M\) accepts them.

**Kleene's theorem** states that \(L\) is regular if and only if \(L = L(M)\) for some DFA \(M\). In other words, the regular languages are exactly the languages for which there exists a DFA that accepts them.

A DFA is still just a representation of a computer program. In practive, these are often implemented using switch statements:

```
int state = $q_0$
char current;
while (read_char(current)) {
switch (state) {
case $q_0$:
switch (current) {
case 'a': state = // the state to transition to
// cases should exist for all possible transitions from this state
case 'z': state = // the state to transition to
default: error("No transitions available");
}
case $q_1$:
// another switch statement like the one for $q_0$, for $q_1$'s transitions
}
}
```

This works but is also long and error-prone. Instead, we can use a state table, where rows are chars and columns are states, and each cell points to the next state. See the assignment 3 starter code for an example.

It is also possible to attach actions to the transitions in the DFA. Potentially, we could use this to do things like get the numerical value of a number literal while we are parsing it.

Every regular expression can be converted into a DFA, and every DFA can be converted into a regular expression. However, the translation is not always straightforward. For example, consider `(a|b)*abb`

, the set of all words in \(\set{a, b}^*\) that end with "abb". When we generate the DFA, we need to add extra transitions for the states that make up "abb" in order to backtrack, in order to handle inputs like "aba":

```
1: START
"a": 2
"b": 1
2:
"a": 2
"b": 3
3:
"a": 2
"b": 4
4: ACCEPT
"a": 2
"b": 1
```

The easiest way to convert between regular expressions and DFAs is to convert to an \(\epsilon\)-NFA first, then converting to the desired form.

What if we had a DFA that allowed more than one transition from the same state that have the same label? For example, a state with transitions to two other states, each with the same label.

This would mean that when we have an input with multiple possible transitions, the machine would get to choose one of the transitions - the state machine is non-deterministic. This is a **non-deterministic finite automata** (NFA). The difference from a DFA is that the machine gets choices in which transition to use.

We also assume that the machine always makes the correct choice - the state machine accepts if there exists a sequence of choices such that the input is accepted. In practice, we do this with something like backtracking - when we get to the end and we haven't accepted the input, we undo and choose a different set of choices. By using a stack storing when we made choices, we can easily use backtracking search over all possible choices.

With NFAs, the above example becomes a lot simpler, which is the main benefit of this type of machine:

```
1: START
"a": 1, 2
"b": 1
2:
"b": 3
3:
"b": 4
4: ACCEPT
```

Formally, an NFA is a 5-tuple like a DFA - \((Q, \Sigma, q_0, A, \delta)\), where the variables represent the same things as with a DFA, except the transition function. With an NFA, \(\delta\) is a mapping from a state and input character to a set of new states. This is a function \(q, x \to R, q \in Q, x \in x \in \Sigma, R \subseteq Q\).

An NFA accepts the input if a path exists such that the NFA results in acceptance, and rejects if none do.

We now want to define a transition function for NFAs that accepts words. What we will do is define a function that accepts a set of possible current states and a word, resulting in a new set of possible states. When we are moving through the NFA, we are basically stepping through all the possible transitions at once when we have a choice. In other words, we step through all the possible choice paths in parallel.

We define this as \(\delta^*(S, \epsilon) = S, \delta^*(S, cw) = \delta^*(\set{\delta(q, c) \mid q \in S}, w), S \subseteq Q, c \in \Sigma, w \in \Sigma^*\). This is a function that results in the set of all possible end states of the machine. The NFA accepts a word \(w\) if and only if \(\delta^*(\set{q_0}, w) \cap A \ne \emptyset\) - if any of the possible end states are accepting.

To simulate a DFA:

```
state = $q_0$
while read_char(current):
state = $\delta$(state, current)
return state in $A$
```

To simulate an NFA:

```
states = set($q_0$)
while read_char(current):
states = set($\delta$(q, current) for q in states)
return states & $A$ != set()
```

If we have an NFA and give each possible set of possible states (which we get from stepping through every possible option) a name, and use these sets as the states instead, we get a DFA. In this way it is possible to convert any NFA into a DFA. This is called **subset construction**.

Note that this means that an NFA with \(n\) states can be converted into a DFA with at most \(2^n\) states - one for every subset of states. We can convert an NFA to a DFA by stepping through all the possibilities for input. The accepting states of our DFA are the **sets containing accepting states in the original NFA**.

Whenever we step an NFA, we are actually simulating a DFA in a simpler way - we are constructing a DFA on the fly, with only the states that we actually need for a certain input.

Every NFA must have a corresponding DFA, and every DFA is also an NFA - the set of NFAs is a superset of the set of DFAs, and every NFA has a corresponding DFA. Therefore, NFAs accept the same class of languages that DFAs do, the regular languages.

Consider the language \(L = \set{cab} \cup \set{w \in \set{a, b, c}^* \mid w \text{ has an even number of } a}\).

```
1, 2, 3, 4, 5, 6
1 > c > 2
2 > a > 3
3 > b > 4
1 > a > 5
1 > b,c > 6
5 > b,c > 5
5 > a > 6
6 > a > 5
6 > b,c > 6
1 > accept
6 > accept
```

To convert this to a DFA, we start by stepping the NFA with \(\set{1}\) (a set containing only the start state), while considering every possible next input for the currently stepped states to get the next candidate states. Repeating until no new subset states are found, we get the equivalent DFA:

```
{1} > a > {5}
{1} > b > {6}
{1} > c > {2, 6}
{5} > b,c > {5}
{5} > a > {6}
{6} > a > {5}
{6} > b,c > {6}
{2, 6} > b,c > {6}
{2, 6} > a > {3, 5}
{3, 5} > a > {6}
{3, 5} > b > {4, 5}
{3, 5} > c > {5}
{4, 5} > a > {6}
{4, 5} > b, c > {5}
{1} > accept
{6} > accept
{2, 6} > accept
```

We will now consider an \(\epsilon\)-NFA. This is like a normal NFA, but we can change states without reading any input at all. This is represented with arrows labelled with \(\epsilon\). This makes many alternation cases easier, because it gives a much easier way of gluing multiple small NFAs together into a bigger NFA. Basically, the \(\epsilon\) transition is an optional transition we can take from one state to another.

Using the same argument as before with states being sets of states, the \(\epsilon\)-NFAs also accept regular languages.

Also, \(\epsilon\)-NFAs can be easily converted into a regular expression using a simple algorithm. Additionally, if we can prove that all regular expressions have an \(\epsilon\)-NFA, then we have one direction in the proof for Kleene's theorem - that every regular expression has a DFA.

The \(\epsilon\)-NFA for any regular expression can be found recursively:

- Empty language is a non-accepting start state.
- The regular expression \(\epsilon\) is an accepting start state.
- A regular expression with only one letter is a start state that transitions to an accepting state via that letter.
- A regular expression of the form \(E_1|E_2\) is a start state, with two \(\epsilon\) transitions, one to the \(\epsilon\)-NFAs for \(E_1\) and one to the \(\epsilon\)-NFA for \(E_2\).
- A regular expression of the form \(E_1 E_2\) is the \(\epsilon\)-NFA for \(E_1\), where the accepting states are not accepting and instead have an \(\epsilon\) transition to the start state for the \(\epsilon\)-NFA for \(E_2\).
- A regular expression of the form \(E_1^*\) is the \(\epsilon\)-NFA for \(E_1\), where the start state is a new accepting state with an \(\epsilon\)-transition to the original start state of the \(\epsilon\)-NFA for \(E_1\), and all accepting states of also have an \(\epsilon\) transition back to the start state of the \(\epsilon\)-NFA for \(E_1\).

As a result, all regular expressions can be converted into an \(\epsilon\)-NFA, which can be converted into a DFA.

Regular expressions and DFAs are useful in writing compilers because we need to do tokenization/scanning/lexing. For example, the lexer needs to be able to scan for keywords, identifiers, literals, operators, comments, and other C tokens, and these are all regular languages. As a result, sequences of C tokens are also regular languages, and we can use finite automata to do tokenization/scanning/lexing.

To do tokenization, we are given a string \(w\), break it into words \(w_1, \ldots, w_n\) such that each \(w_i\) is in the language of the tokens (or give an error otherwise), and then output each \(w_i\) as the tokens. If \(E_1, \ldots, E_k\) are the regular expressions that match each type of token, then \((E_1 | \ldots | E_k)^*\) is a regular expression that recognizes sequences of valid tokens. If we add an action to output the token after a match of \(E_1 | \ldots | E_k\), we now have a tokenizer.

Consider an \(\epsilon\)-NFA for parsing a C-style identifier: \([a-zA-Z][a-zA-Z0-9]^*\). Clearly, if we parse \(abcd\), this could be interpreted as either 1, 2, 3, or 4 tokens, since the \(\epsilon\) move outputs a token. This ambiguity is problematic because a program written in a programming language should only have one interpretation.

We can resolve this by adding add a rule to the NFA that says that we should only take an \(\epsilon\) move if we have no other choice - to always look for the longest possible next token, rather than trying all the options. In effect, we are removing the nondeterminism from the \(\epsilon\)-NFA.

Clearly, this doesn't always work. Consider \(L = \set{aa, aaa}\), and the word \(w = aaaa\). Clearly, \(w \in L\), but an \(\epsilon\)-NFA that always took the longest possible move would not accept it. However, in practice this sort of redundancy doesn't show up, and we can implement this technique.

This can be formalized as the **maximal munch algorithm** for greedy matching using DFAs:

- Run the \(\epsilon\)-NFA without any \(\epsilon\)-moves, until no more non-error moves are possible. This is the greedy part of the algorithm, where we try to consume as much as possible.
- If currently in an accepting state, we have found a token.
- Otherwise, back up to the most recent accepting state (or error if none), and the input to that point is a token. Scanning resumes from there.
- Output the token.
- Resume scanning, for the next token, by using an \(\epsilon\)-move back to the starting state.

We often just use a simplified maximal munch, which works on a more limited set of languages:

- Run the \(\epsilon\)-NFA without any \(\epsilon\)-moves, until no more non-error moves are possible. This is the greedy part of the algorithm, where we try to consume as much as possible.
- If currently in an accepting state, we have found a token.
- Output the token.
- Resume scanning, for the next token, by using an \(\epsilon\)-move back to the starting state.

Basically, the same thing, but without the backtracking step. This means we can't tokenize things like `#ifdefg`

into `#ifdef`

and `g`

- whitespace or another separator is required. However, this is usually good enough because languages are often designed intentionally in such a way that this works.

This sort of parser sometimes causes issues in C++, such as `vector<vector<int>> v;`

not being valid because the `>>`

is interpreted as a single token instead of two `>`

tokens.

Tokenization is good, but it can't parse everything. For example, consider a language containing all strings consisting entirely of balanced parentheses. Clearly, we need at least one state for each level of nesting in order to keep track of how many parentheses are currently open in the input, but no finite number of states can be used to recognize all levels of nesting, and DFAs must have a finite number of states. They also can't parse palindromes for the same reason.

That means that DFAs can't parse balanced parenthesis, like those in Scheme, and that this is not a regular language. Therefore, we need a higher class of languages - the **context-free languages**.

The context-free languages are those that can be described by a **context-free grammar** (a set of rewrite rules, which can be recursive). For example, a context-free language for strings of balanced parentheses can be written as \(S \to \epsilon; S \to (S); S \to SS\). As a shorthand, we can write this as \(S \to \epsilon \mid (S) \mid SS\).

A word is in a context-free language if and only if there exists a way to construct it from the rewrite rules in the grammar. For example, \((())()\) can be generated by applying rule 3 to rule 2 applied to rule 2 applied to rule 1 and rule 2 applied to rule 1.

A context-free grammar consists of:

- An alphabet \(\Sigma\), called
**terminal symbols**. This never includes \(\epsilon\). - A finite, non-empty set \(N\) of
**nonterminal symbols**, such that \(N \cap \Sigma = \emptyset\). The**vocabulary**is defined as \(V = N \cup \Sigma\). - A finite set \(P\) of
**productions**, which are rules of the form \(A \to B\) where \(A \in N, B \in V^*\). - A starting element \(S\) of \(N\).

To make notation easier, we use these conventions:

- \(a, b, c, \ldots \in \Sigma\) - letters from the beginning of the alphabet represent characters.
- \(z, y, x, \ldots \in \Sigma^*\) - letters from the end of the alphabet represent strings.
- \(A, B, C, \ldots, Z \in N\) - capital letters represent nonterminals.
- \(S\) is the starting symbol/nonterminal.
- \(\alpha, \beta, \ldots \in V^*\) - Greek letters represent sequences of symbols.

\(\alpha A \beta \implies \alpha \gamma \beta\) means that there is a production \(A \to \beta\) in \(P\) - the right side is derivable from the left side in one step.

\(\alpha A \beta \implies^* \alpha \gamma \beta\) means that there exists a sequence of 0 or more derivation steps that convert the left side to the right side - the right side is somehow derivable from the left.

A context-free language is therefore defined as \(L(G) = \set{w \in \Sigma^* \mid S \implies^* w}\) - the set of all strings derivable from \(S\), the starting state. Here, \(G\) is a context-free grammar and gives the values for \(\Sigma, N, V, P, S\).

A language \(L\) is context-free if and only if there exists a context-free grammar \(G\) such that \(L = L(G)\).

For example, a language that consists of palindromes over \(\set{a, b, c}\) can be represented using the grammar \(S \to aSa \mid bSb \mid cSc \mid M; M \to a \mid b \mid c \mid \epsilon\).

A language that consists of arithmetic expressions over \(\set{a, b, c, +, -, *, /}\) can be written as \(S \to a \mid b \mid c \mid S O S \mid (S); O \to + \mid - \mid * \mid /\). We can show that \(S \implies^* a + b\) by using \(S \implies S O S \implies a O S \implies a + S \implies a + b\). This is a **derivation** - we apply rules one by one until we have constructed the expression.

A derivation is a **proof that a word is in the language**.

At our \(S O S\) step, we had a choice as to which rule to expand next - any of the \(S\), \(O\), or \(S\) could have been expanded next. We can do a **leftmost derivation**, where we always expand the leftmost symbol (in this case, the first \(S\)), and **rightmost derivation**, where we always expand the rightmost symbol (in this case, the second \(S\)).

We can express the derivation more naturally as a **parse tree**:

```
S
/|\
S O S
| | |
a + b
```

For every leftmost and rightmost derivation, there exists a corresponding **unique parse tree**. A grammar for which some word has more than one distinct leftmost derivation is **ambiguous** - there are multiple possible parse trees for some strings.

A string might have two leftmost derivations, and so would have two unique parse trees. For example, we can derive \(a + b * c\) by starting with \(S \implies S O S \implies S O S O S \implies \implies a O S O S \implies \ldots\), or with \(S \implies S O S \implies a O S \implies a O S O S \implies \ldots\). This is equivalent to parsing it as \((a + b) * c\) or \(a + (b * c)\).

If we just care about whether a word is in a language, then it doesn't matter if we use an ambiguous grammar. However, for compilers, the shape of the parse tree determines the meaning of the program, so words that have two different parse trees would have two different meanings.

To resolve ambiguity, we could use heuristics (often called **precedence**) to guide which rule we expand next. However, most of the time we just want to use an unambiguous grammar instead.

The unambiguous version of the language can be written as \(E \to E O T; T \to a \mid b \mid c \mid (E); O \to + \mid - \mid * \mid /\)

To get the multiplicative operators to have precedence over the additive operators, we can use the following grammar: \(E \to E \text{Plus} T \mid T; T \to T \text{Mult} F \mid F; F \to a \mid b \mid c \mid (E); \text{Plus} \to + \mid -; \text{Mult} \to * \mid /\). Note that we used a rule for each level of precedence - \(E\) is the precedence level for additive operators, and \(T\) is for multiplicative operators.

If \(L\) is a context-free language, there may not necessarily be an unambiguous context-free grammar for it. The languages that only have ambiguous grammars are **inherently ambiguous grammars**. For example, \(A \to \epsilon, B \to \epsilon, S \to A \mid B\).

It is not possible to tell if an arbitrary context-free grammar is ambiguous - this problem is undecidable. However, we can write tools that can ensure that a grammar, constrained in certain ways (like no left recursion), is always unambiguous. Tools like ANTLR can do this.

Testing whether two grammars describe the same language is also an undecidable problem.

A finite language can be recognized with a loop-less program. A regular language can be recognized with a program that uses finite memory (this is because regular languages are recognized with DFAs, and the states of a DFA represent the program's memory). A context-free language can be recognized with an NFA and a stack (this is a program that can use infinite memory, since the stack can grow arbitrarily).

A real compiler needs a bit more than just recognition. If the program is valid, we need a proof in the form of a parse tree. If the program is invalid, we also need a proof in the form of an error message. The process of finding the derivation, and the parse tree, is known as **parsing**.

Parsing is the following problem: given a grammar \(G\), start symbol \(S\), and a word \(w\), determine the derivation \(S \implies \ldots \implies w\), or prove that there is no possible derivation.

We can do parsing in two ways - finding the derivation forwards (starting from the left, expand until we have \(w\), this is **top-down parsing**), or finding it backwards (starting from the right, unexpand until we have \(S\), this is **bottom-up parsing**). Both are used in practice, and they each have advantages for different types of grammars.

The basic idea is to start with \(S\), apply grammar rules, and eventually reach \(w\).

We start with \(S\), and then repeatedly attempt to resolve the leftmost non-terminal until we have obtained the input desired, or an error otherwise. In this way we obtain the leftmost derivation. Potentially, we could also obtain the rightmost derivation by always attempting to resolve the rightmost nonterminal.

We will use a stack to store all the \(\alpha_i\), in reverse, in the derivation \(S \implies \alpha_1 \implies \ldots \implies \alpha_n \implies w\). We will then match against characters in \(w\) from left to right. We will also enforce the property that the consumed input concatenated with the reversed stack contents are equal to some \(\alpha_i\).

We will use **augmented grammars** to make parsing simpler. Basically, we take a normal grammar and add the symbols \(\vdash\) and \(\dashv\) (the beginning and ending symbols), and a new start symbol \(S'\), defined as \(S' \to \vdash S \dashv\). Our word is now \(\vdash w \dashv\).

The language of WLP4 is defined using regular expressions to determine the tokenizations, a context-free grammar to define the parsing, context-sensitive code to do things like check types and declarations, and a mapping from parse tree structures to executable output to determine the semantics of the language.

The basic structure of a top-down parser is as follows:

```
# input is given as an array of tokens, which are also terminals
current_input = [START_STATE]
while current_input != full_input:
let `A` be a nonterminal in `current_input` # usually this would always be the leftmost nonterminal in order to get the leftmost derivation
find production rules such that $A \to \ldots$, and select one such rule `A \to value`, or give error if none # it is very important to choose the right rule
replace the instance of `A` chosen earlier with `value` in `current_input`
```

Basically, we are repeatedly reducing the nonterminals in the input until we have all terminals, or have determined that the word is not in the language.

The step where we find which production rule to use is an important part of top-down parsing. Generally, this is based on ;wip

We can implement top-down parsing using a stack \(S = X_n \ldots X_1, X_1, \ldots, X_n \in V\) (stack contains symbols in the vocabulary). The parsing is made possible by reading the input left-to-right into `read_input`

and enforcing the invariant `current_input = read_input + S`

as we read through.

Consider the following grammar: \(S \to A y B; A \to ab; A \to cd; B \to z; B \to wx\). Consider the word \(w = \vdash abywx \dashv\). Then to parse it, we do the following:

Stack | Read input | Unread input | Next Action | Current Input |
---|---|---|---|---|

\(S'\) | \(\epsilon\) | \(\vdash abywx \dashv\) | top of stack is nonterminal, pop and pick rule \(S' \to \vdash S \dashv\) | \(S'\) |

\(\vdash S \dashv\) | \(\epsilon\) | \(\vdash abywx \dashv\) | read \(\vdash\), compare to top of stack, pop | \(\vdash S \dashv\) |

\(S \dashv\) | \(\vdash\) | \(abywx \dashv\) | top of stack is nonterminal, pop and pick rule \(S \to AyB\) | \(\vdash S \dashv\) |

\(AyB \dashv\) | \(\vdash\) | \(abywx \dashv\) | top of stack is nonterminal, pop and pick rule \(A \to ab\) | \(\vdash AyB \dashv\) |

\(abyB \dashv\) | \(\vdash\) | \(abywx \dashv\) | read \(a\), compare to top of stack, pop | \(\vdash aayB \dashv\) |

\(byB \dashv\) | \(\vdash a\) | \(bywx \dashv\) | read \(b\), compare to top of stack, pop | \(\vdash aayB \dashv\) |

\(yB \dashv\) | \(\vdash ab\) | \(ywx \dashv\) | read \(y\), compare to top of stack, pop | \(\vdash aayB \dashv\) |

\(B \dashv\) | \(\vdash aby\) | \(wx \dashv\) | top of stack is nonterminal, pop and pick rule \(B \to wx\) | \(\vdash aayB \dashv\) |

\(wx \dashv\) | \(\vdash aby\) | \(wx \dashv\) | read \(w\), compare to top of stack, pop | \(\vdash abywx \dashv\) |

\(x \dashv\) | \(\vdash abyw\) | \(x \dashv\) | read \(x\), compare to top of stack, pop | \(\vdash abywx \dashv\) |

\(\dashv\) | \(\vdash abywx\) | \(\dashv\) | read \(\dashv\), compare to top of stack, pop | \(\vdash abywx \dashv\) |

\(\epsilon\) | \(\vdash abywx \dashv\) | \(\epsilon\) | end of input, accept since stack empty | \(\vdash abywx \dashv\) |

Note that we only really need to care about the stack and the read input. That means that we can implement a top-down parser using just a stack as storage.

;wip: missed the rest due to sleep

The **first set** of a nonterminal is the set of all terminals that might start that rule in the parser's input. For example, if we have \(A \to B, B \to x\), then the first set of \(A\) includes \(x\). The **follow set** of a nonterminal is the set of all terminals that might follow the rule in the parser's input For example, if we have \(A \to B C, B \to x, C \to y\), the follow set of \(B\) includes \(y\).

A nonterminal is **nullable** if and only if it can derive \(\epsilon\).

Calculating first sets:

```
# grammar is a list of pairs of nonterminals and their expansions (lists of vocabulary elements)
First = {A: set() for A, expansion in grammar}
do:
for A, expansion in grammar:
for element in expansion:
if element is a terminal:
First[A].add(element)
break
else:
First[A].extend(First[element])
if not Nullable[element]: break
until First stops changing
```

A grammar that has matched \(b\)-\(d\) and \(p\)-\(q\) pairs with \(l\) in between can be written as \(S' \to \vdash S \dashv; S \to b S d; S \to p S q; S \to C; C \to lC; C \to \epsilon\).

Running this algorithm on the grammar results in the first set \(\set{\vdash}\) for \(S'\), \(\set{b, p, l}\) for \(S\), and \(\set{l}\) for \(C\).

The follow sets for this algorithm are \(\set{\dashv, d, q}\) for \(S\) and \(\set{\dashv, d, q}\) for \(C\).

With the first and follow sets, we can create a prediction function \(\text{Predict}(A, a) = \set{A \to B \middle| a \in \text{First}(B)} \cup \set{A \to B \middle| \text{Nullable}(B) \wedge a \in \text{Follow}(A)}\), where \(A\) is a nonterminal and \(a\) is a terminal. Basically, this describes the union of the set of rules for which the right hand side begins with the terminal \(a\), and the set of rules that can expand to 0 tokens and are followed by \(a\).

We can also represent the prediction table as a table where the rows are different nonterminals \(A\) and the columns are different terminals \(a\), with the cells filled with the resulting rules.

This works for \(LL(1)\) grammars. A grammar is \(LL(1)\) if and only if:

- No two production rules with the same left-hand side can generate the same first symbol - the set \(\set{A \to B \middle| a \in First^*(B)}\) cannot have more than 1 element for any symbol \(a\).
- As a result, the grammar \(E \to E + T \mid T; T \to T * F \mid F' F \to a \mid b \mid c\) is not \(LL(1)\) because it is
**left recursive**- given an input \(a\), it is possible for either \(E\) or \(T\) to generate it. - Left recursion will immediately make a grammar not \(LL(1)\). Right recursion could also do so, but not always.
- To make it \(LL(1)\), we could do something like \(E \to T E'; E' \to \epsilon \mid + E; T \to F T'; T' \to \epsilon \mid * T; F \to a \mid b \mid c\). This is called
**left factoring**. - There is only one way to nullify a nullable symbol - the set \(\set{A \to B \middle| Nullable(B) \wedge a \in \text{Follow}(A)}\) cannot have more than 1 element.
- In fact, if there is more than 1 way to nullify a symbol, then the grammar is ambiguous.

To build a **parse tree** with a top-down parser, we simply build a node when we expand a production rule - we start with the parse tree being the start rule \(S\), then whenever we pick a rule \(A \to \ldots\), we add \(\ldots\) as the children of \(A\).

For example, if the tree is \(S\) and we expand it using the rule \(S \to A y B\), we get the tree \(S(A, y, B)\). When we then expand \(A\) using \(A \to wx\), we get the tree \(S(A(wx), y, B)\), and so on.

The basic idea is to start with \(w\), and then go back using rules to get \(S'\) from \(S' \implies \alpha_1 \implies \ldots \implies \alpha_k \implies w\).

Parsing will always require a stack. In our bottom-up parsing, we are going to store partially reduced input read so far.

In our case, we will enforce the invariant \(stack + unread_input \in \set{S', \alpha_1, \ldots, \alpha_k, w}\) after reading each terminal from the input.

Consider the following grammar: \(S \to A y B; A \to ab; A \to cd; B \to z; B \to wx\). Consider the word \(w = abywx\). Then to parse it, we do the following:

Stack | Read input | Unread input | Action |
---|---|---|---|

\(\epsilon\) | \(\epsilon\) | \(\vdash abywx \dashv\) | (initial state) |

\(\vdash\) | \(\vdash\) | \(abywx \dashv\) | shift \(\vdash\) |

\(\vdash a\) | \(\vdash a\) | \(bywx \dashv\) | shift \(a\) |

\(\vdash ab\) | \(\vdash ab\) | \(ywx \dashv\) | shift \(b\) |

\(\vdash A\) | \(\vdash ab\) | \(ywx \dashv\) | reduce \(A \to ab\) |

\(\vdash Ay\) | \(\vdash aby\) | \(wx \dashv\) | shift \(y\) |

\(\vdash Ayw\) | \(\vdash abyw\) | \(x \dashv\) | shift \(w\) |

\(\vdash Ayw\) | \(\vdash abyw\) | \(\vdash\) | shift \(x\) |

\(\vdash Aywx\) | \(\vdash abywx\) | \(\vdash\) | reduce \(B \to wx\) |

\(\vdash AyB\) | \(\vdash abywx\) | \(\vdash\) | reduce \(AyB \to S\) |

\(\vdash S\) | \(\vdash abywx \dashv\) | \(\vdash\) | shift \(\dashv\) |

\(\vdash S \dashv\) | \(\vdash abywx \dashv\) | \(\epsilon\) | reduce \(S' \to \vdash S \dashv\) |

\(S'\) | \(\vdash abywx \dashv\) | \(\epsilon\) | accept |

Notice that at each step we have two possible actions:

- Shift - moving one character from the input to the stack.
- Reduce - the top of the stack is the right-hand side of some grammar rule, so replace it with the left hand side.

We accept if the stack is \(S'\) and the unread input is \(\epsilon\) (or we have \(\vdash S \dashv\) on an empty input, or when we shift \(\dashv\)).

How do we know whether to shift or reduce? We can decide based on the next token, but in practice this is very difficult.

Donald Knuth is known for proposing big-O notation as an algorithm analysis tool, writing the series of books "The Art Of Computer Programming", and the TeX typesetting system, among others.

He also invented a theorem: the set \(\set{wa \middle| \exists x, S' \implies^* wax}\) is a regular language. In other words, if \(w\) is our stack and \(a\) is our next input char, then we can use a DFA to describe and decide when to shift or reduce. This technique is called **LR parsing** (L for left-right scanning, R for rightmost derivations).

An **item** is a production with a dot (\(\cdot\)) somewhere on the right-hand side, and represents the partially completed rule.

Consider the grammar \(S' \to \vdash E \dashv, E \to E + T, E \to T, T \to id\).

;wip: define LR(0) and LR parsing

When we are LR parsing, we have a DFA. The states of this DFA are sets of rules, each with a dot inserted in them to represent where we are in parsing the rule.

When we create a state from another state using a transition symbol \(A\), we take the rules of that state that have their dot followed by \(A\) and move the dot to the right by 1 symbol for every rule. Then, for each rule, if the dot precedes a nonterminal \(A\), then we include all rules \(A \to \ldots\) as \(A \to \cdot \ldots\). We do this repeatedly until there are no more dots preceding nonterminals. For example, if we create a state from \(S'\), we move the dot to the right, include \(E\) rules, and then include \(T\) rules, to get \(S' \to \vdash \cdot E \dashv, E \to \cdot E + T, E \to \cdot T, T \to \cdot id\).

The start state has only one rule, for the starting rule for the grammar \(S' \to \cdot \vdash S \dashv\). For the above grammar, we have \(S' \to \cdot \vdash S \dashv\). We want to buil the entire DFA starting from this start state.

For every state, we now create new states corresponding each symbol following the dots in the rules, with those following symbols as the transitions to those states. For example, a state \(S' \to \vdash \cdot E \dashv, E \to \cdot E + T, E \to \cdot T, T \to \cdot id\) creates 3 states, with transitions \(E, T, id\). For example, the state \(S' \to \vdash \cdot E \dashv, E \to \cdot E + T, E \to \cdot T, T \to \cdot id\) creates the state \(E \to T \cdot\) using transition \(T\). If there are no symbols following dots, then we do not create any new states. Duplicate states are combined into one state.

A state where the dot is at the end of a rule is a reduce state. All other states are shift states.

We repeat this for the new states, and their new states, until we no longer have new states. For the above grammar, we get the following DFA:

```
$S' \to \vdash E \dashv$ > $\vdash$ (1) > $S' \to \vdash \cdot E \dashv, E \to \cdot E + T, E \to \cdot T, T \to \cdot id$ (1)
$S' \to \vdash \cdot E \dashv, E \to \cdot E + T, E \to \cdot T, T \to \cdot id$ > $id$ > $T \to id \cdot$
> $T$ > $E \to T \cdot$
> $E$ > $S' \to \vdash E \cdot \dashv, E \to E \cdot + T$
$S' \to \vdash E \cdot \dashv, E \to E \cdot + T$ > $\dashv$ > $S' \to \vdash E \dashv \cdot$
> $+$ > $E \to E + \cdot T, T \to \cdot id$
$E \to E + \cdot T, T \to \cdot id$ > $id$ > $E \to T \cdot$
> $T$ > $E \to E + T \cdot$
```

To use the machine, we start in the start state with a stack that contains only the start state, and step through the DFA either shifting or reducing at each state.

To **shift**, we read a char from the input to stack, and then follow the transition for that char if there is one, and otherwise reduce if the state is a reduce state, and give an error otherwise. A reduce state is a state that contain a single production rule, with the dot at the very right.

To **reduce**, we pop the right-hand side off the stack, backtrack by the size of the right-hand side, then follow the transition for the left-hand side and push the left-hand side onto the stack.

We also first push the DFA state we are transitioning to into the stack before transitioning, in order to be able to backtrack later. We could also use a separate stack for this, but it is more elegant to use just one in theory.

We accept when \(S'\) is on the stack by the end of the DFA.

Parsing \(\vdash id + id + id \dashv\):

Stack | Read | Unread | Action |
---|---|---|---|

\(1\) | \(\epsilon\) | \(\vdash id + id + id \dashv\) | Shift, go to state 2 |

\(1 \vdash 2\) | \(\vdash\) | \(id + id + id \dashv\) | Shift, go to state 6 |

\(1 \vdash 2 id 6\) | \(\vdash id\) | \(+ id + id \dashv\) | Reduce \(T \to id\): pop 1 symbol and 1 state, so now in state 2, then push \(T\) and go to state 5. |

\(1 \vdash 2 T 5\) | \(\vdash id\) | \(+ id + id \dashv\) | Reduce \(E \to T\): pop 1 symbol and 1 state, so now in state 2, then push \(E\) and go to state 3. |

\(1 \vdash 2 E 3\) | \(\vdash id\) | \(+ id + id \dashv\) | Shift, go to state 7 |

\(1 \vdash 2 E 3 + 7\) | \(\vdash id +\) | \(id + id \dashv\) | Shift, go to state 6 |

\(1 \vdash 2 E 3 + 7 id 6\) | \(\vdash id + id\) | \(+ id \dashv\) | Reduce \(T \to id\): pop 1 symbol and 1 state, so now in state 2, then push \(T\) and go to state 5. |

\(1 \vdash 2 E 3 + 7 T 8\) | \(\vdash id + id\) | \(+ id \dashv\) | Reduce \(E \to E + T\): pop 3 symbols and 3 states, so now in state 2, then push \(T\) and go to state 3. |

\(1 \vdash 2 E 3\) | \(\vdash id + id\) | \(+ id \dashv\) | Shift, go to state 7. |

\(1 \vdash 2 E 3 + 7\) | \(\vdash id + id +\) | \(id \dashv\) | Shift, go to state 6. |

\(1 \vdash 2 E 3 + 7 id 6\) | \(\vdash id + id + id\) | \(\dashv\) | Reduce \(T \to id\): pop 1 symbol and 1 state, so now in state 7, then push \(T\) and go to state 8. |

\(1 \vdash 2 E 3 + 7 T 8\) | \(\vdash id + id + id\) | \(\dashv\) | Reduce \(E \to E + T\): pop 3 symbols and 3 states, so now in state 2, then push \(E\) and go to state 3. |

\(1 \vdash 2 E 3\) | \(\vdash id + id + id\) | \(\dashv\) | Shift, go to state 4. |

\(1 \vdash 2 E 3 \dashv 4\) | \(\vdash id + id + id \dashv\) | \(\epsilon\) | Reduce \(S' \to \vdash E \dashv\). |

\(1 S'\) | \(\vdash id + id + id \dashv\) | \(\epsilon\) | Accept. |

If a state looks like \(A \to \cdot x, B \to y \cdot\), we have a **shift-reduce** conflict and don't know whether to shift or reduce at this state.

If a state looks like \(A \to x \cdot, B \to y \cdot\), we have a **reduce-reduce** conflict, and don't know which rule to reduce by.

If any complete item \(A \to x \cdot\) that isn't the only rule in the state, we will have a shift-reduce conflict or a reduce-reduce conflict. This makes the grammar not \(LR(0)\).

If we have a right-associative grammar like \(S' \to \vdash E \dashv, E \to T + E, E \to T, T \to id\), when we build the DFA we notice that there is a shift-reduce conflict when we try to make the state \(E \to T \cdot + E, E \to T \cdot\).

Basically, if the next input is \(id \dashv\) then we need to reduce, and if the input is \(id + \ldots\) then we should not reduce. Since we can't know this, we can't parse this grammar using LR parsing.

However, we can get around this by using lookahead. For each rule in the DFA \(A \to x \cdot\), we attach \(\text{Follow}(A)\) - the follow set (\(\set{\dashv}\) for \(A\), \(\set{+, \dashv}\) for \(T\)). Basically, when we are at that state, we look ahead at the next character \(c\), and only consider a rule in that state \(A\) if \(c \in \text{Follow}(A)\) - each rule can only be used if its follow set contains the next character.

With lookahead, we can figure out which rule we want to use in the conflicting cases, solving the shift-reduce conflict. This is called \(SLR(1)\) parsing - **simple LR parsing with 1 character lookahead**. This is pretty good, since it resolves many but not all conflicts.

To build parse trees using bottom-up parsing, we just attach the old nodes as children of the new node when we are reducing, and at the end we have a tree.

We want to do full \(LR(1)\) parsing, which is more sophisticated than \(SLR(1)\) and parses a superset of grammars. \(LR(1)\) is pretty powerful, but we don't use it much in practice, because it tends to produce very large parser DFAs, with a lot of states. There's also \(LALR(1)\) parsing, which is something of a comprimise between \(SLR(1)\) and \(LR(1)\), parsing more than the former but less than the latter. \(LALR(1)\) parsing is used quite often in the real world.

\(LR(1), SLR(1), LALR(1)\) are all identical parsing algorithms, with the only difference being the state machine that each one constructs. Given a state machine, the steps afterwards are identical.

Some properties of valid programs in languages like C cannot be enforced by CFGs. For example, type checking, scoping, and error checking like using a variable before declaring it.

Therefore, languages like C are actually context-sensitive languages (a superset of the context-free languages), and can be fully represented using **context-sensitive grammars**. However, in practice these grammars tend to be very difficult to work with. Instead, we usually parse the syntax using a context-free grammar, and then do context-sensitive analysis (semantic analysis) as necessary in order to enforce context-sensitive invariants.

The rest of the compiler is largely tree walking:

```
class Tree {
public:
vector<string> tokens; // production rule, left hand side as first element, right hand side after
vector<Tree> children;
}
void print(Tree &t) {
// print `t.rule
for (vector<Tree>::iterator i = t.children.begin(); i != t.children.end(); i ++) {
// print `*t` as children
}
}
```

Semantic analysis needs to catch errors and figure out variable scope and type. In WLP4, the only error checking we need to do is to find type errors, and declaration errors (multiple declarations in the same scope, or using variables without declaring them first).

We can detect declaration errors by doing the same thing as we did in the assembler, by **constructing a symbol table** (a map from variable name to its type). We will construct one for each scope, and since we don't have nested scopes, we can just have one symbol table per function.

To detect duplicate variable declarations, we traverse the parse tree and, for each function, collect all declarations (\(\text{dcl} \to \text{TYPE} \text{ID}\)) into a symbol table, with name and type information. If the name of a variable declaration was already in the symbol table for that function, then we have an error.

To detect variable use without declaration, we traverse the parse tree again and, for each scope check all the used variables (\(\text{factor} \to \text{ID}\) or \(\text{lvalue} \to \text{ID}\)), making sure they are already in the symbol table for that function - if they are not, then we have an error. We can do this in one pass because the language requires declarations to come before types.

To detect duplicate function declarations, we traverse the parse tree again and, for each function, collect all the function declarations into a global symbol table by name. If the name of a function was already in the function symbol table, then we have an error.

We can implement all of the above using a single `map<string, map<string, string> >`

as a symbol table, where the `string`

values are the function name, variable name, and variable type, respectively. This maps function names to symbol tables for that function, each one mapping variable names to their types. It is also possible to do all of the above operations in a single tree traversal.

However, we also need to check type errors, and so need to keep track of the type signature of procedures, to ensure we are passing the right types and number of parameters to each function. We will simply traverse the tree, keeping track of function signatures as a list of parameter types (we don't need to store the return type since it's always `int`

in WLP4). Then, we traverse again, and for every function call we verify that its arguments are correct.

We could also potentially do this in one pass, but then we need forward declarations in the language.

Types are how we assign interpretations to the values at some memory address. A good type system will prevent us from reinterpreting the memory values as -something else, like using a string as an int. For example, `int *a; a = 7;`

should be made invalid.

To check types, we determine the type for every expression, and ensure that the operators and functions are applied to arguments of the correct type - including variable initialization. We can determine the type of expressions recursively using the type rules for the language, where the variable's type is stored in the symbol table, and expressions are constructed from operators, variables, and literals. We give an error whenever there is no applicable type rule or a mismatched type.

Type rules are simply formal deductions, using the Fitch bar notation. For example, the type rule for variables \(\frac{\tup{\text{ID.name}, \tau} \in \text{declarations}}{\text{ID}: \tau}\) means that if the name of an ID and a type is in the declarations, then that ID is of that type.

```
def type_of(tree):
for c in tree.children:
compute `type_of(c)`
use `tree.rule` to decide which type rules are relevant
# the above type rule example could be written as: `if tree.rule == "ID name": return symbol.lookup(name)`
combine the types of children using a type rule to get the type of `tree`, or error if not possible
```

For WLP4, we might also have rules like \(\frac{}{\text{NUM}: \text{int}}\), \(\frac{}{\text{NULL}: \text{int*}}\), \(\frac{\text{Expr}: \tau}{(Expr): \tau}\), \(\frac{\text{Expr}: \text{int}}{\text{&Expr}: \text{int*}}\), \(\frac{\text{Expr}: \text{int*}}{\text{*Expr}: \text{int}}\), \(\frac{\text{Expr}: \text{int}}{\text{new int[Expr]}: \text{int*}}\), \(\frac{\text{Expr1}: \text{int}, \text{Expr2}: \text{int}}{\text{Expr1 * Expr2}: \text{int}}\).

Note that the `+`

operator is operloaded: we can add two integers, or an integer pointer and an integer: \(\frac{\text{Expr1}: \text{int}, \text{Expr2}: \text{int}}{\text{Expr1 + Expr2}: \text{int}}\), \(\frac{\text{Expr1}: \text{int*}, \text{Expr2}: \text{int}}{\text{Expr1 + Expr2}: \text{int*}}\), \(\frac{\text{Expr1}: \text{int}, \text{Expr2}: \text{int}}{\text{Expr1 * Expr2}: \text{int*}}\). Similarly, we can subtract two integers, an integer pointer and an integer (which is an integer pointer), or two integer pointers (which is the number of integer slots between them as an integer).

For if statements and while loops we would need a Boolean-like type, although the grammar ensures already that the conditions are comparison expressions, so if the operands of the comparison operator are well-typed, then so is the if statement or while loop. These statements do not have a type, but still need to be considered in type checking. **Well-typed** means that an expression has a type: \(\frac{E: \tau}{E \text{ is well-typed}}\). For example, \(\frac{\text{Expr}: \text{int*}}{\text{delete []Expr} \text{ is well-typed}}\) and \(\frac{\text{Expr1}: \tau, \text{Expr2}: \tau}{\text{Expr1 == Expr2} \text{ is well-typed}}\).

Also, for statements we have \(\frac{\text{Statement1} \text{is well-typed}, \text{Statement2} \text{ is well-typed}}{\text{Statement1; Statement2} \text{ is well-typed}}\), \(\frac{\text{ID} \in \text{Expr}: \tau}{\tau \text{ ID = Expr} \text{ is well-typed}}\). A function is well-typed if its declarations and statements are well-typed and its return value is an integer. A function call is of type integer as long as all its parameters have a type, and match the signature of the function: \(\frac{E_1: \tau_1, \ldots, E_n: \tau_n, \tup{f, \tup{T_1, \ldots, T_n} \in \text{declarations}}}{f(E_1, \ldots, E_n): \text{int}}\).

So far we've done lexical analyis on code to get tokens, parsing on tokens to get a parse tree, and semantic analysis to get another parse tree and symbol table. The only remaining step is to generate the assembly code from the parse tree and symbol table.

There are infinitely many possible MIPS programs we can generate from the a given WLP4 program. We want to choose the ones that are correct, relatively simple to generate, efficient to compile, efficient to run, and possibly also the shortest possible code.

Consider the code `int wain(int a, int b) { return a; }`

. We will introduce the convention that the parameters of `wain`

are stored in registers 1 and 2, and the return value is stored in register 3. With that, this code becomes `add $3, $1, $0`

and then `jr $31`

in MIPS.

Whenever we use a variable in the program, we need to actually use a register in the MIPS machine. In other words, we need to assign variables to registers for each procedure, and if there aren't enough registers available, we need to assign them to memory locations and load or store them as needed. We might add a field to the symbol table that stores the location of each variable, as a register or memory addresses, so the table would store names, types, and locations.

In this course, it is simplest to just store all variables (local declarations and parameters) on the stack, even though this is rather inefficient. The above program would translate to:

```
sw $1, -4($30) ; parameter a
sw $2, -8($30) ; parameter b
lis $4
.word 8
sub $30, $30, $4 ; push
lw $3, 4($30)
add $30, $30, $4 ; pop
jr $31
```

Note that we will eventually store the temporary results of evaluating expressions onto the stack, and possibly other values. If we do that, the correct offset for each variable access (like the `lw $3, 4($30)`

above) would change whenever we pushed or popped, which means we would need to do a lot of calculations to get the offsets right at every line. What we could do instead is just use a register to store the original stack pointer value, so we can use negative offsets.

We now introduce the convention that register 4 will always contain 4, and register 29 will always contain the a pointer to the bottom of the stack frame - the **frame pointer**.

```
lis $4
.word 4
add $29, $30, $0 ; store the stack pointer value
sw $1, -4($30) ; parameter `a`
sub $30, $30, $4 ; push
sw $2, -4($30) ; parameter `b`
sub $30, $30, $4 ; push
; body starts here
lw $3, -4($29) ; within the body here, `a` is always at offset 0, `b` is always at offset 4, relative to $29
; restore the stack to its original state
add $30, $30, $4 ; pop
add $30, $30, $4 ; pop
; if this was a non-main procedure, we would restore the register values here, including $29
jr $31
```

We can evaluate expressions like `1 + (2 * 3)`

in the same way we evaluate RPN expressions - for each token, if it is an operand then we push it onto the stack, and if it is an operator we take a number of operands off the stack, evaluate, then push the result back onto the stack. By convention, we will use register 5 and 6 for the operands, and 3 for the results (at the end, the return statement translates into code that writes the return value to register 3, so we can use it until then).

The previous expression therefore results in the following code on the top of the stack:

```
lis $5
.word 1
sw $5, -4($30)
sub $30, $30, $4 ; push 1
lis $5
.word 2
sw $5, -4($30)
sub $30, $30, $4 ; push 2
lis $5
.word 3
sw $5, -4($30)
sub $30, $30, $4 ; push 3
add $30, $30, $4 ; pop left multiplicand
lw $5, -4($30)
add $30, $30, $4 ; pop right multiplicand
lw $6, -4($30)
mult $5, $6
mflo $3
sw $3, -4($30)
sub $30, $30, $4 ; push multiplication result
add $30, $30, $4 ; pop left addend
lw $5, 0($30)
add $30, $30, $4 ; pop right addend
lw $6, 0($30)
add $3, $5, $6
sw $3, -4($30)
sub $30, $30, $4 ; push addition result
; the expression result is now at the top of the stack, so load it again into $3
add $30, $30, $4 ; pop final result
lw $3, -4($30)
```

Note that register 3 always contains the latest result. Therefore, we can avoid some unnecessary pushes and pops, and use only one working register:

```
lis $5
.word 1
sw $5, -4($30)
sub $30, $30, $4 ; push 1
; still need push here, since next stack entry has arity 0
lis $5
.word 2
sw $5, -4($30)
sub $30, $30, $4 ; push 2
; still need push here, since next stack entry has arity 0
lis $5
.word 3
; no push
; no pop
lw $5, -4($30)
add $30, $30, $4 ; pop left multiplicand
mult $5, $3
mflo $3
; no push
; no pop
lw $5, -4($30)
add $30, $30, $5 ; pop left addend
add $3, $5, $3
; the expression result is now in $3
```

The `println`

statement can be implemented using the `print`

procedure done in assignment 2 - by generating code that calls this procedure, and making sure the procedure is included in the output somewhere or is linked into the final result.

The `print.merl`

module will print the value in register 1. To generate code for `println(expr);`

, we evaluate `expr`

, set the result as the parameter for `print`

, and call it. If we are using register 1 before that, then we need to save it somewhere first in addition to the following:

```
; code for `expr`
add $1, $3, $0
; push $31
lis $5
.word print
jalr $5
; pop $31
```

Every expression has at most 2 values - its **lvalue** (the value when it appears on the left side of an assignment) and its **rvalue** (the value when it appears on the right side of an assignment). For example, in `x = x`

, on the right `x`

represents the value stored in the variable, while on the left `x`

represents its address/location, so we can assign to it.

Most expressions don't have lvalues, though all of them have rvalues. In WLP4, all lvalues must be able to represent addresses. As a result, we can't do something like `x + x = x`

, because `x + x`

doesn't represent an address, and therefore doesn't have an lvalue.

To generate code for `lvalue = expr`

, we evaluate `expr`

, look up the address of `lvalue`

(using the symbol table) and store the result at that memory location.

In the code, it is often helpful to use a convention where we store 1 in some fixed register and the address of `print`

in another, for convenience. At this point we have the following code:

```
; prologue
.import print
lis $4
.word 4
lis $10
.word print
lis $11
.word 1
add $29, $30, $0 ; store frame pointer
; allocate space on the stack for all the variables
; body of code goes here
add $30, $29, $0 ; restore the stack directly, since for the main procedure we don't need to restore registers
jr $31
```

If and while statements have boolean tests. For example, the code for `expr1 < expr2`

might appear as:

```
; code for `expr1`
add $6, $3, $0 ; we can use $6 because evaluating expressions only uses registers 3 and 5
; code for `expr2`
slt $3, $6, $3
```

To do `expr1 > expr2`

, we simply swap the inputs to `slt`

. We can't swap `expr1`

and `expr2`

since they might have side effects, though in C the order of evaluation within expressions is unspecified. For `expr1 != expr2`

, we do `slt $7, $3, $6`

and `slt $8, $6, $3`

(which can't both be 0), then `add $3, $7, $8`

(which will be 1 if any are true and 0 otherwise). Equality is the same, but we invert the result using `sub $3, $11, $3`

.

If statements are now simple:

```
; code for test
beq $3, $0, generated_label_for_else
; cody for body of if statement
beq $0, $0, generated_label_for_end
generated_label_for_else: ; this is generated by the compiler to be unique
; code for else clause of if statement
generated_label_for_end: ; this is generated by the compiler to be unique
```

We could also generate shorter code by using comparisons directly from the tests in the if statement - if we have `if (expr1 == expr2)`

, we could simply evaluate `expr1`

, store the result in register 6, evaluate `expr2`

, and then `bne, $3, $6, generated_label_for_else`

.

While loops are very similar:

```
generated_label_for_while: ; this is generated by the compiler to be unique
; code for test
beq $3, $0, generated_label_for_end
; cody for body of if statement
beq $0, $0, generated_label_for_while
generated_label_for_end: ; this is generated by the compiler to be unique
```

It is a good idea to generate comments with the MIPS output, in order to aid debugging.

;wip: MIPPITS disasm

To support pointers, we need to support null, dereferencing, address operator, pointer comparison, pointer arithmetic, allocation/deallocation, assignment through pointer dereferencing.

For null, we represent it in the source code using 0, like in `int *p = 0`

, but the compiler can actually use whatever value it wants - it is possible that `int *p = (int *)(1 - 1)`

is not a null pointer. We will use address 1 as our null value, because if we try to dereference it, it will crash (since it is not aligned on a word boundary). Therefore, the code for `NULL`

is `add $3, $11, $0`

.

For dereferencing, we simply evaluate the expression being dereferenced as an lvalue to get the address, then load the word at that address using `lw $3, 0(3)`

.

Pointer comparisons are similar to integer comparison, but we have to make sure we use unsigned comparisons (using `sltu`

instead of `slt`

), since all pointers are unsigned. This requires type information - the compiler should annotate the tree nodes with type information while doing type checking.

Pointer arithmetic is almost the same - we have to make sure we treat all pointers as unsigned, and use unsigned operations accordingly.

For adding an `int*`

and an `int`

, we evaluate the `int*`

into R3, push R3, evaluate the `int`

into R3, multiply R3 by 4 in place, pop into R5, then add R5 to R3.

The process for subtracting an `int*`

and an `int`

is almost exacty the same, but we subtract at the end. For subtracting two `int*`

's, we evaluate the two operands, subtract the second from the first, divide the difference by 4, and this is the result.

Assignment is possible through a pointer dereference: `*x = y`

and `*(x + z) = y`

, or even `*(new int[10]) = y`

- the left side can be anything that is resolvable to an address. To do this in code, we calculate the value of the factor (the expression inside the dereference operator) and interpret this as an address, evaluate the right hand side, then store the result into the address calculated before.

To implement the address of operator, like `&x`

, `&(x + z)`

, and `&(new int[10])`

(`&LVALUE`

), we can do something like the following:

- If the lvalue is an ID, just load the ID's offset (looked up in the symbol table) using
`lis $3`

, then get the address using`add $3, $29, $3`

. - If the lvalue is of the form
`*FACTOR`

, then we can just give the code for`FACTOR`

.

New/delete, the last parts of pointer support for WLP4, are rather complicated. For the assignments, we will be provided the new and delete functions already in `alloc.merl`

, as part of the runtime environment. Note that this file **needs to be linked last**, because it needs to know where the end of the program is to put the heap. This library can be used as follows:

```
.import init # sets up the allocator's data structures, must be called once in the prologue, takes 1 parameter in $2 specifying the offset from end of program
.import new # allocates a block of memory, takes 1 parameter in $1 specifying the number of words to allocate and returning the pointer to the block in $3, or 0 when allocation fails
.import delete # deallocate a block of memory, takes 1 parameter in $1 specifying the address of the block to deallocate
# if the first parameter of wain is an `int*`, then the program is intended to be run with `mips.array`, and we set $2 to the length of the array (which it already should be, so we just need to not modify $2 elsewhere)
# otherwise, the first parameter is an `int`, and the program is intended to be run with `mips.twoints`, and we set $2 to 0
# this is necessary because the array is allocated right after the program, which is also where we want to store our heap, so we need the size of the array to avoid overwriting it with our heap
# here, call the `init` function using the value in $2 set according to the above rules
# new can be called just like a regular MIPS function, so call it with requested memory in words in $1, and check the result in $3 - if the result is 0, change it to 1 (the null pointer is 1, but the library returns 0 for some reason).
# delete is also a regular MIPS function, so check for null (and skip the actual call if it is), then call it with the memory address to free in $1
```

So to implement `new int[expr]`

, we evaluate `expr`

into R3, move R3 to R1, call new, and then if R3 is 0, change it to 1.

So to implement `delete [] expr`

, we evaluate `expr`

, and check if `expr`

is null. If it isn't, then we set R1 to the result of `expr`

, then call delete.

Note that we check for null because C/C++ supports attempting to delete null pointers.

The main function had a prologue and epilogue. However, other procedures also have their own prologue and epilogue. In the main prologue/epilogue we had to save R1 and R2, import print, init, new, and delete, set R4, R11, and other constants, set R29 to the frame pointer, call init, execute the main body, then pop everything we pushed off the stack at the end and return using `jr $31`

.

In procedures we need to save the used registers (except R3 for the result) on the stack, set R29, execute the procedure body, then pop everything we pushed off the stack at the end and return.

Note that the main function needs to come first in the assembly code, before all the other procedures. This is because the main function needs to execute first.

Every procedure should save and restore all the registers it will modify. However, it is difficult to know which registers we will be modifying. We could just save and restore them all (except for R3, the result), but this is inefficient. In our code we are only using registers R1 to R8 and R29 to R31, so R1 to R8 and R29 are the only ones we need to save and restore.

There are two approaches to saving registers - caller-save and calee-save. Callee-save tends to generate smaller code, but caller-save allows us to do things like calling the same procedure with a different amount of parameters. So far, our approach has been caller-save for R31, and callee-save for everything else.

If we are doing callee=save, we always need to set R29 before we start saving variables, because the value of R29 depends on the stack pointer, which is changed by saving variables. However, that means we can't directly save R29 like a normal variable - we need to store R29 into the stack, set the new value of R29, and then finally decrement the stack pointer, in that order. Alternatively, we could push normally, then afterwards set R29 to 4 plus R30.

We also have to make sure to use prefixes on labels for procedure names, since they might conflict with the names of the labels we use internally in the compiler, like `init`

and `print`

. Simply prepend `F`

to the procedure name to get the label name.

Note that parameters need special handling for function calls. When we have a function call `ID(expr, ..., expr)`

:

```
# push $29
# push $31
# code for expr1
# push $3
# code for expr2
# push $3
...
lis $5
.word ID
# pop all the arguments off the stack
# pop $31
# pop $29
```

We can implement the function as follows:

```
add $29, $30, $0
# push declarations - we do this before saving registers in order to fix symbol table offsets
# push registers
# code for statements
# code for return expression
# pop registers
add $30, $29, $0
jr $31
```

However, note that the saved registers are between the arguments and the local variables, which doesn't match the offsets in our symbol table. We need to fix our symbol table so that it matches what we actually have on the stack - parameters should actually have positive offsets, and local variables should have negative offsets. To fix the argument offsets, we add 4 times the number of arguments to all the offsets in the symbol table.

We could also just save variables in the caller, which greatly simplifies matters, but tends to generate longer code overall, since functions are generally called more than once, so we are duplicating the saving code.

The goal of optimization is to minimize the runtime or code size of compiler output. In general, this problem is unsolvable, but we can apply heristics to get us closer to the best answer.

One simple optimization is **constant folding** - evaluating constant expressions like `1 + 2`

in the code and substituting that with `3`

.

We can also do **constant propagation**, in which we propagate constant values like `int x = 1`

ahead into the code, so if we do a `1 + x`

later without changing `x`

first (or modifying it indirectly via pointers), we can substitute it for `1 + 1`

(and now we could potentially do constant folding here). For example, `int y = 8; return y + y;`

could become `return 8 + 8;`

, which we then apply constant folding to get `return 16;`

.

The expression evaluation approach shown earlier where we simply reuse the value of R3 rather than pushing and popping it off the stack.

**Common subexpression elimination** is an optimization in which we store common subexpressions to avoid needing to recompute them. For example, if we have `(a + b) * (a + b)`

, then we could have a register store `a + b`

, then multiply it with itself.

**Dead code elimination** is the removal of code that is determined to never run, like code in a procedure that is never called, or a branch of an if statement whose condition will never let it run.

**Register allocation** is an optimization in which we just keep local variables as registers rather than constantly loading and storing them in RAM, since registers are much faster to access. Generally, we want to do this for the most frequently used variables. One potential issue with this is that we can't take the address of a register (**what is the address of register 4?**), and so the address-of operator will not work (we could potentially fix this with fake addresses).

**Strength reduction** is the replacement of expensive operations with less expensive ones. For example, addition is much faster than multiplication in most modern architectures, so we should prefer additions over multiplications. If we multiply something by 2, we could instead simply add it to itself. The same applies for larger multiplicands, until a certain point where multiplication is just faster.

Some optimizations are specific to procedures. For example, **inlining** is where we replace a function call with its body to avoid having to make a long and expensive call. For example, if we want to call `int f(int x) { return x + x }`

, we could just replace any call to `f`

with `x + x`

instead. Note that this is not always advantageous - the resulting code could be larger than before since we are duplicating the body of the function as many times as we are inlining calls to it (this generally works best for small functions). Also, recursion makes things a little messy.

**Tail call optimization** is a powerful optimization in which we can simply jump to a function rather than making a full function call if we have a tail call - a function call that is also the last operation in the procedure it is in (basically, we are reusing stack frames to save memory). In WLP4, checking for tail calls is as simple as verifying that the return expression for a procedure is a plain function call, but since returns must be at the end, this is limited to always calling another function (which is an infinite loop if that call is a recursion).

To get around this, we could mutate the parse tree to make things tail recursive - whenever we have a return immediately following an if statement, we move the return into the if body and else body and try to simplify the return value, the goal being to convert `int ret = 0; if (...) { ...; ret = 5 } else { ...; ret = 6 } return ret;`

into `int ret = 0; if (...) { ...; return 5; } else { ...; return 6; }`

so we might be able to apply tail call optimization.

A subset of tail calls are tail recursions, and tail call optimizations on these are called **tail recursion optimization**.

In languages like WLP4 and C++ we have to do manual memory management, using new/delete. In languages like Java/Racket/Python we have implicit memory management, via a **garbage collector**, which manages our memory for us. We want to look at implicit memory management in a bit more detail.

We always know when we are creating objects - this is explicitly specified in the code. The problem is determining the earliest possible time we can delete them - when we can prove that it is impossible to use it anymore.

The **scope** of a variable is the part of the program in which the variable's name is visible. In WLP4, the scope of a variable is the procedure it is defined in - the scope is static and can be determined at compile time.

The **extent** of a variable is the part of the program's execution in which the variable's value is live - when it can still be accessed. This depends on the data itself - for stack-allocated data, the extent ends at the end of the scope, but for heap-allocated data, the extent goes on until there are no references to it anymore.

The new and delete operators can be implemented in many different ways. A good implementation can actually be \(O(1)\).

One way to implement these is with **free lists**. In this scheme, we maintain linked lists of blocks (a pointer and a block length, which together represent a block of memory) into the heap of free memory. Initially, the free list has one entry, the entire available heap.

We can implement this linked list on the heap itself. A block at address \(a\) is a word \(w\) (the length), followed by \(w\) bytes (the data). The first block would always be at some fixed location, so we can always access it. When a block is not allocated, the first word of the data can be used to store a pointer to the next block in the free list.

When we allocate memory, we actually want to allocate 4 more in order to make a block. We then walk through the free list and find the first block that will fit the required size, chop off the desired size from the block, and return a pointer to one word after the new block starts - the data itself.

When we deallocate memory, we are given a pointer that we know is 4 bytes after the start of the block. We then add this block to the free list to mark that memory as available for allocation. When we are inserting the block into the free list, we also need to make sure to store the next pointer in the block data, which we can use since the memory is already considered unused at this point.

Basically, each block is a node in our linked list, and the next pointer is stored in the first 4 bytes of the data. When we allocate memory from a block, we no longer need the next pointer, and can use that data freely.

We notice that both operations will insert elements into the list, so the free list length will grow without bound. Therefore, we need to **occasionally clean up the list by merging adjacent blocks in memory**. An efficient algorithm for this is beyond the scope of the course, but the naive solution is \(O(n^2)\) - comparing every pair of blocks to check if they are adjacent.

This type of allocator suffers from **fragmentation** issues if we allocate and deallocate a lot - we might end up leaving small blocks of free memory in between our allocated blocks, and if we try to allocate something large afterwards, we might not have a big enough block, even though there is enough free space in the heap for it between all the free blocks.

We can mitigate this by always choosing the smallest suitable block during allocation (it helps if we keep the blocks sorted by size in this case), which will still fragment it, but the fragments are smaller. This is called the **best fit algorithm**. Simply inserting into the beginning is known as the **first-fit algorithm**. Although the best-fit algorithm results in better memory layout than first-fit, it takes longer since we have to perform sorted insertion each time.

The garbage collector is a component of the language that cleans up our memory for us once in a while, to avoid having to clean it up ourselves. When memory is no longer accessible, the runtime environment will reclaim it for us. The idea behind garbage collection is to give the impression that the machine has an infinite heap.

The simplest garbage collection technique is **mark and sweep**. In this scheme, we have a flag for every object that marks whether it is being referenced. At the start of a garbage collection cycle, we scan the active part of our program, and find every possible live pointer in the program (even those stored in arrays and structs) and set the flag on the objects they reference (mark). Afterward, we scan every object on the heap and reclaim any object that doesn't have its flag set, and also reset flags (sweep).

The biggest problem is that this requires us to stop the program in order to perform the garbage collection. This makes for very bad real-world performance as the program will occasionally pause without warning.

Another technique is **reference counting**, where each object is associated with a count of how many times it is being referenced. Every time we make a new reference to it, we increment that counter, and each time that reference goes out of scope, we decrement the counter. If the counter ever goes to 0, we reclaim that object.

This works really fast because we only change a counter when we create or delete a reference, but it doesn't always work - cyclic references will cause memory leaks. For example, if A is an object that references B, and B is an object that references A, then they will keep each other's reference count above 0, even though they might not be available otherwise. To solve this, we can use **weak references**, which are references that don't increment the reference count of the object they reference, but this needs to be specified explicitly by the programmer.

Another is **copying garbage collection**. In this scheme, the heap is split into halves, "from" and "to". We only allocate from the "from" half, and when this half is full, we copy all reachable objects into the "to" half, and then switch the labels so "from" is now "to" and "to" is now "from". The memory in the "to" half is now considered freed.

The big advantage of this is that when we are doing copying, we can also copy them contiguously, so there is no fragmentation. Every time we run a garbage collection cycle, we can always ensure we don't have any fragmentation. The disadvantage is the same as with mark and sweep - we need to stop the entire program in order to perform the reference checking and copying. Also, we can only use half the heap at any point in time.

This work by Anthony Zhang is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Copyright 2013-2017 Anthony Zhang.