Lecture Notes by Anthony Zhang.


Data Structures and Data Management.

Therese Beidl
Section 002
Email: beidl@uwaterloo.ca
Web: http://www.student.cs.uwaterloo.ca/~cs240
ISA: Joseph (Danny) Sitter
ISA Email: cs240@studennt.cs.uwaterloo.ca

Section 1 and 2 are the regular classes, and section 3 is the enhanced section and has very different content.

Midterm at Thurs. Feb. 26 at 4:30 PM, worth 25%. Final exam worth 50%. Must pass weighted average of exam to pass the course. 5 assignments, each worth 5%.


Assignments is due on Tuesdays. Assignment 0 is due on Jan. 13.

Suppose we have a lot of data to keep track of. We could store it in an array/list, but depending on the type of the data, this might not be the best choice. Data structures should make it easy and efficient to perform the operations we need. For example, an English dictionary probably needs to be efficiently searched, but we don't really need to care about insertion and deletion since they're so rare.

The best data structure for something depends on the type of data we want to store. Our goal is to have a short running time and little memory.

In this course we will be performing theoretical analysis, developing ideas and pseudocode (and sometimes, implementations), and analyzing them using tools like big-O notation.

An Abstract Data Type is the idea of describing something by what it can do, not how it does it.

Required background includes arrays, linked lists, strings, stacks, queues, ADTs, recursive algorithms, trees, sorting algorithms (insertion, selection, quick, merge), binary search/BSTs, arithmetic series, geometric series, harmonic series (\frac 1 1 + \ldots + \frac 1 n = \ln n).

In this course, \log n is implicitly \log_2 n - all logarithms are base 2 unless otherwise stated.

A problem is a description of a general situation and the desired outcome. For example, the sorting problem is "given comparable values, put them in sorted order".

A problem instance is a particular input to a problem, like a particular array of numbers to sort for the sorting problem.

A problem solution is a change/process that, given the situation of a problem instance, results in the desired outcome. For example, the sorted array of numbers for the sorting problem.

We solve a problem by giving the correct algorithm for it. A solution is correct if it finds a solution for every possible input that can be given.

An algorithm is a finite description of a process that gives an answer (a solution that is not necessarily correct) for all possible instances of a problem.

Efficiency usually refers to an algorithm's runtime, and sometimes its memory usage. It may also refer to things specific to the problem domain, like the number of comparisons done.

To solve a problem:

  1. Design an algorithm.
  2. Write down the main idea of the algorithm in plain prose. For example, "for increasing i, make A[0 \ldots i] sorted by inserting A[i] into a sorted A[0 \ldots i - 1]" for the sorting problem.
  3. Optionally, write pseudocode - code that might not be a real language, and is something like a mix between prose and code. This is a more precise way of specifying an algorithm. Any consistent, precise, and clearly understandable language will be accepted as pseudocode.

    Preconditions: an array `A[0 ... n - 1]`$.
    Postconditions: `A` is sorted
    for i in 1 ... n - 1:
      value = A[i]
      j = i - 1
      while j >= 0 and A[j] > key:
        A[j + 1] = A[j]
        j = j - 1
      A[j + 1] = value
  4. Argue/prove that it is correct.
  5. We can use formal correctness proofs, but this is often excessive to convince someone an algorithm is correct. Instead, we can simply give lots of invariants and prove that the algorithm terminates.
  6. For example, for the above, we would say that for the inner loop, "A[j + 1 ... i] contains items that are bigger than value, and is in sorted order" is an invariant.
  7. It is very important to prove that it terminates, especially recursive algorithms. We simply use the standard methods specified in CS245, but we can usually just write that each call or iteration gets some smaller input, and that small enough inputs cause the algorithm to terminate.
  8. Analyze how good the algorithm is, in terms of efficiency and sometimes lower bounds.
  9. For this we use time complexity/running-time analysis, memory usage analysis, and so on.
  10. For example, the above code is known as Insertion Sort, which we already know to have a worst case time complexity of O(n^2) and a best case time complexity of O(n).
  11. Recall that formally, an algorithm being in O(f(n)) means that there exists a k > 0 and m \ge 0 such that for all n > m, k \cdot f(n) > T(n) where T(n) represents the number of constant time steps or amount of time that the algorithm needs to give an answer for an input of size n - the running time function. This is written as T(n) \in O(f(n)).
  12. Repeat above steps until satisfactory algorithm is found.

Only after this point would we implement/experimentally test the algorithm.


A timing function is a function S_p: \mathrm{Input} \to \mb{R}^+ where p is the program we are running. For example, we can compare quicksort and mergesort by comparing S_{\text{mergesort}}(\tup{5, 6, 2, 3}) and S_{\text{quicksort}}(\tup{5, 6, 2, 3}).

A timing function measures the amount of time a program takes for a given input.

Since computers are constantly changing and improving, comparing timing functions is not very meaningful since they will not stay consistent over differnt machines and over time as computers change. We want a machine-agnostic way to measure the amount of time a given program takes.

For example, Intel's core instruction set is RISC, with tons of extensions such as MMX, SSE, and FPU and GPU stuff. These add instructions such as fsqrt and similar. We assume that every instruction takes a finite, constant amount of time (interesting note).

We can instead measure the number of elementary operations, but it is difficult to define what an elementary operation actually is, and determining the exact number of elementary operations is a lot of work. For example, we might measure the number of clock cycles needed to run a program, but this would not make sense across things like RISC vs CISC machines. For example, if an architecture doesn't have a constant time division operation, we would probably need to implement one in software that would not have constant time.

In other words, we write our algorithms in pseudo-code, then count the number of primitive operations.

We now want to plot our timing function with respect to the size of the input \abs{\mathrm{Input}} \subseteq \mb{N}. This allows us to figure out the behaviour of the function as the size of the input gets bigger. This is difficult since finding the timing function for arbitrary inputs is often difficult. Additionally, we don't necessarily care about the constants in the function output, just its behaviour on large inputs.

Usually we care about the worst case behaviour, to determine the worst possible behaviour in all cases. We may also be interested in the average case to see how things will perform in practice. Occasionally, we may also be interested in the best case - for example, in cryptography, we want to make sure the problem cannot be solved in less than some given amount of time.

The worst case behaviour is T_p(n) = \max \set{S_p(I) \middle| \abs{I} = n}. The best case behaviour is T_p(n) = \max \set{S_p(I) \middle| \abs{I} = n}. The average case behaviour is T_p(n) = \frac{\sum_{e \in R} i}{\abs{R}} where R = \set{S_p(I) \middle| \abs{I} = n}. We can now plot these functions to get the time bounds on our program.

The big-O notation is used to make life easier for computer scientists trying to calculate timing functions and to make comparing functions much easier. Refer to the CS136 notes for background. The O in O(f(x)) is actually a capital omicron, bu nowadays the O is more common.

Comparing Big O is analogous to ordering in \mb{R}. Just like how x \le y, we might write f(x) \in O(g(x)). What we care about is the trends of the function, not the actual numbers.

Big-O gives an upper bound behaviour on a function (analogous to x \le y). Big-\Omega gives a lower bound (analogous to x \ge y). Big-\Theta us the exact bounds - when the Big-O is the same as the Big-\Omega (analogous to x = y).

There is also Little-o, which is a non-inclusive upper bound (analogous to x < y), and Little-\omega, which is a non-inclusive lower bound (analogous to x > y). For little-o, instead of \exists c > 0, \exists n_0 > 0, n > n_0 \implies f(n) \le cg(x), we have \forall c > 0, \exists n_0 > 0, n > n_0 \implies f(n) < cg(x).

There are also incomparable functions, like one that goes up and down arbitrarily. As a result, functions can only be partially ordered.


Big-O does not care about the values of the functions - only about how it grow as input values get large.

Prove that 2n^2 + 3n + 11 \in O(n^2):

We want to prove that 2n^2 + 3n + 11 \le cn^2 for all n > n_0.
Let n_0 = 10 and n \ge n_0. Clearly 2n^2 + 3n + 11 \le 2n^2 + 3n^2 + 11n^2 when n \ge 1 and 2n^2 + 3n^2 + 11n^2 = 16n^2.
Then 16 is a possible value for c and 2n^2 + 3n + 11 \in O(n^2).

O(n^2 + \log n) is not correct since it is not fully simplified.

Common time complexities include \Theta(1) (constant), \Theta(\log n) (logarithmic), \Theta(n) (linear), \Theta(n \log n) (pseudo-linear), \Theta(n^2) (quadratic), \Theta(n^3) (cubic), \Theta(n^k) (polynomial), \Theta(2^n) (exponential). In the real world, everything above O(n \log n) tends to be rather bad performance.

It is not always the case that an algorithm with better time complexity than another is always better in all circumstances. For example, despite insertion sort being O(n^2) and merge sort being O(n \log n), we often use insertion sort when sorting smaller lists since it is faster in practice when the input is small. Many practical sorting algorithms therefore drop down into insertion sort when the input is small. This is also why most people use insertion sort when sorting things in real life - it is very fast for real-world inputs, which are usually small.

Let L = \lim_{n \to \infty} \frac{f(n)}{g(n)}. If L = 0, f(n) \in o(g(n)). If 0 < L < \infty, f(n) \in \Theta(g(n)). Otherwise, L = \infty and f(n) \in \omega(g(n)). This is a useful way to prove orders for functions that would otherwise be difficult to prove by first principles. This is the limit order rule.

For example, if we want to prove that n^2 \log n \in o(n^3), we can do \lim_{n \to \infty} \frac{n^2 \log n}{n^3} = \lim_{n \to \infty} \frac{\log n}{n} \lH \lim_{n \to \infty} \frac{\frac 1 n}{1} = 0, and use the fact that L = 0 to conclude that n^2 \log n \in o(n^3).

We can aso use this to prove that (\log n)^a \in o(n^b) for any a, b > 0.

Also, f(n) \in \Omega(g(n)) \iff g(n) \in \Omega(f(n)), f(n) \in O(g(n)) \iff g(n) \in \Omega(f(n)), f(n) \in o(g(n)) \iff g(n) \in \omega(f(n)), f(n) \in \Theta(g(n)) \iff g(n) \in O(f(n)) \wedge g(n) \in \Omega(f(n)), f(n) \in o(g(n)) \implies f(n) \in O(g(n)), f(n) \in o(g(n)) \implies f(n) \notin \Omega(g(n)), f(n) \in \omega(g(n)) \implies f(n) \in \Omega(g(n)), and f(n) \in \omega(g(n)) \implies f(n) \notin O(g(n)). We can use these to simplify order notation expressions.

Also, O(f(n) + g(n)) = O(\max(f(n), g(n))) = \max(O(f(n)), O(g(n))). Also, orders are transitive - f(n \in O(g(n))) \wedge g(n) \in O(h(n)) \implies f(n) \in O(h(n)).

In our pseudocode, we often use x \leftarrow 5 or x := 5 to denote assignment, since = is ambiguous - it could potentially mean equality or assignment.

Math with orders works a lot like normal math: \Theta(1) \cdot (1 + \ldots + n) = \Theta(1) \cdot \frac{n^2 + n}{2} = \Theta(n^2 + n) = \Theta(n^2).


A priority queue is a data structure that stores items with keys that represent the priorities that the values have. It has two operations - insertion with a given key and deletion of the item with the maximum priority.

There are a number of ways we could implement a priority queue - an unsorted dynamic array (amortized O(1) insert, O(n) deletion), a sorted dynamic array, and something known as a heap. We don't want to use sorted order, since that makes insertion slow. We also dont't want to use unsorted order, since that makes deletion slow.

A heap is a data structure that allows us to implement a priority queue in a very efficient way. The main idea is that we would like to know very easily what the maximum element is, so we make all the other elements follow from it.

The basic idea behind the heap is that it is a tree where each element keeps track of candidates that can replace them when they're deleted, as children. The maximum item would therefore be the root.

Most commonly, we use binary heaps, which is a binary tree. Recall that this is a data structure consisting of nodes and links between node, such that each node has 2 or fewer children, and 1 or fewer parents.

Formally, a binary max-heap is a binary tree that satisfies the structural property and the max-heap property. There are no rules about the order of children, unlike a binary search tree.

The structural property is that all levels except the lowest level must be completely filled. Each node must have two children, except the lowest level, where nodes can have fewer than two children but must fill up from the left - in the lowest level, the left sibling of a node must have 2 children before we can insert a child into it.

The max-heap property is that for any node x in the heap, the key of x is less than or equal to the key of the parent of x. As a result, the maximum node of a heap is always the root node. There is also a min-heap property where the key of x is greater than or equal to the key of the parent of x, used for min-heaps. In other words, the parent of any node in a heap must be bigger or equal to that node.

Insertion into a heap is easy because we can simply do a standard tree insertion, with certain constraints to ensure it satisfies the structural property and the heap property. This operation is \Theta(\log n) because the structural property ensures that the height of the tree is \Theta(\log n).

Also, the height of a binary heap is \ceil{\log_2 n}. The height of a tree is 1 less than the number of levels it has (1 less than the maximum depth). Levels start at 1 and increase with depth.


Let x be a heap with n nodes and height h.
Clearly, any binary tree of height h must have n \le 2^{h + 1} - 1, since each level i has less than or equal to 2^{i - 1} nodes, and the total is n \le \sum_{i = 1}^{h + 1} 2^{i - 1} \le \frac{2^{h + 1} - 1}{2 - 1}.
Since n + 1 \le 2^{h + 1}, h \ge \log(n + 1) - 1.
Clearly, n \ge 2^h, by the structural property, since the lowest level is always h + 1 so each level i before it has exactly 2^{i - 1} nodes, and the total for all levels except the lowest is 2^h - 1. Since the lowest level has 1 or more nodes, the total is n \ge (2^h - 1) + 1, or 2^h.


Binary heaps are trees, so we can store them as a tree structure with references and nodes. However, a more efficient way to store binary trees is as an array:

Essentially, we first have the first level, then the second level, and so on. Formally, the left and right children of a node stored in the array at index i are lc(i) = 2i + 1 and rc(i) = 2i + 2 (in a 0-indexed array). Likewise, the parent of any node is p(i) = \floor{\frac{i - 1} 2}. This works because the leftmost node in level l has index \sum_{k = 1}^{l - 1} 2^{k - 1}.

In fact, this left and right parent system works for any binary tree, and even for n-ary trees with a simple extension. However, it is not space efficient unless the tree obeys something like the structural property. The structural property ensures that the array is mostly filled, possibly except for part that stores the last level.

Heaps as Priority Queues

We will represent our heap as an array A containing n values, automatically expanded or shrunk as needed.

To insert into a heap, we first find the only location that the structural property allows us to insert a new node i into. This could possibly violate the heap property since it is possible that i > p(i), so if i > p(i), we swap i and p(i) to get p(i)' and i', where p(p(i)') = i'.

Now the heap property is satisfied for i' and p(i)', and it isn't possible for p(i)''s sibling to be greater than i' since it was less than the original parent p(i). However, it is possible that i' is now greater than it's new parent p(i'), so we swap again if i' > p(i)'.

We repeat swapping the originally inserted node with its parents until either it is less than its parent, or it is the new root node. This is called "bubbling up until it holds".

In pseudocode:

def insert(key):
    # wip: we resize the array here if necessary
    A[n] = key # $n$ is the index of the next free space in the heap - we keep track of this in the heap structure
    j = n
    n += 1 # set $n$ to the next empty space in the array
    while j > 0 and A[p(j)] < A[j]:
        swap A[j] and A[p(j)]
        j = p(j)

Clearly, this takes O(\log n) time worst case since every run of the loop runs on a lower height on the tree. The bubbling up process would potentially need to visit every level in the tree if inserting an element greater than the root element of the heap.

To delete the maximum of a heap, our goal is to remove the root element of the heap. When we do this, we now have a blank space at the root node of the heap. We fix this by moving the last node in the heap (the rightmost node on the lowest level) into the position the root originally occupied. This allows us to satisfy the structural property again.

However, now the heap property is not satisfied. We fix this now by swapping the root node (originally the last node) i with the bigger of its children lc(i) and rc(i), which we will represent as c(i), to get c(i)' and i' such that p(i') = c(i)'. Now we have satisfied the heap property between i' and c(i)' and between i''s sibling and c(i)', but it is possible that i''s children could be greater than it.

Therefore, we repeatedly swap the originally last node with its larger child until it is greater (or equal to) than both of its children, or it is a leaf node.

In pseudocode:

def delete_max():
    max = A[0]
    n -- # 1 less than the next free space is the last node in the heap, which we will consider as a free space after we move it to the root
    # if memory management is needed, free `A[0]` here
    A[0] = A[n]
    # the following is the bubble_down operation, which bubbles values down until they satisfy the heap property
    j = 0
    while j <= n - 1:
        max_index = j
        if lc(j) exists and A[lc(j)] > A[j]: # existance means that lc(j) <= n - 1
            max_index = lc(j)
        if rc(j) exists and A[rc(j)] > A[max_index]: # existance means that rc(j) <= n - 1
            max_index = rc(j)
        if max_index == k: break # the heap property is already satisfied, so stop bubbling
        swap A[j] and A[max_index] # swap the current node with its child, which is larger than it
        j = max_index
    # if necessary, shrink the array here
    return max

Clearly, this takes O(h) time worst case where h is the height of the heap underneath the starting element, the root, which is O(\log n) since h is roughly \log n. The worst case is when the last node of the heap is also the smallest one.

Priority queue sort (PQSort) is a sorting algorithm where we insert all our elements into a priority queue and then take them out again in order. PQSort needs n insertions and n deletions of maximums. Since each operation is O(\log n), PQSort is O(n \log n), like mergesort. When the priority queue is implemented with a heap, we also call it heapsort.

Heapsort is pretty nice because we can sort in-place (by building a heap directly) and without a lot of recursion, but it can have poor real-world behaviour because the worst case heap insertion occurs when we are inserting nodes one by one that are already in ascending order, and already-sorted data occurs very often in real-world applications.

Also, heapsort is not stable. If a sorting algorithm is stable, then if different nodes have equal keys, they appear in the same order in the output as they did in the input.

Also, we often measure the runtime of sorting algorithms in terms of the number of key comparisons made (A[i] < A[j]). This is because we often allow users to specify their own key computing functions, which could potentially be aribtrarily exxpensive. Everything else is simply proportional or a constant on top of the number of key comparisons, in terms of runtime.

Bubbling downward is basically the operation of repeatedly swapping an element with its largest child until it is greater or equal to both. Bubbling up is basically repeatedly swapping an element with its parent until it is less than or equal to its parent.

We may also want to build heaps - given an array D with n values, we want to build a head from that. This is relatively trivial in theory - simply go through the array and insert it into a heap. Clearly, this is O(n \log n) since insertion is O(\log n) and we need n of them.

Clearly, this takes roughly \sum_{k = 0}^{n - 1} \log(k + 1) time since k starts off at 0 and grows by 1 upon every insertion. Clearly, there are at least \frac n 2 terms in that sum where k + 1 \ge \frac n 2, so \sum_{k = 0}^{c - 1} \log(k + 1) \ge \frac n 2 \log \frac n 2, so the operation is \Omega(n \log n), and by the above, \Theta(n \log n).

A better option is to build the heap in place, to save memory. Clearly, an array always satisfies the structural property since there's no way not to, so all we have to do is enforce the heap property in the array.

We will enforce the heap property by bubbling downward on every level of the node starting from the last node that has children (the last node on the second-to-last level). Our invariant will be that for all k > i, the subheap rooted at k satisfies the heap property (we can say this because we bubble down from right to left). We will show that this takes less than or equal to 2c key comparisons:

for i in floor(n / 2) - 1 ... 0:

We could bubble down for n - 1 down to 0, since that visits every node, but since the ones at the end are all leaf nodes and bubbling down is a no-op, we just need to bubble down for all nodes from p(n - 1) = \floor{\frac{n - 2} 2} = \floor{\frac n 2} - 1 down to 0.


Without loss of generality, assume n = 2^t - 1 where t is the number of levels - all levels are full.
Clearly, any node on a level l needs at most 2(t - l) comparisons for its bubble down, since each level until the bottom needs 2 comparisons each and there t - l levels between the current level and the bottom.
Clearly, there are 2^{l - 1} nodes on each level, which means that there are \sum_{l = 1}^{t + 1} 2^{l - 1} 2(t - l) comparisons in total.
Clearly, \sum_{l = 1}^p 2^{l - 1} 2(p - l) = t\sum_{l = 1}^p 2^l - \sum_{l = 1}^p l2^l) \le 2^t - 2t - 2 \le 2^t - 1 = 2n (note that \sum_{x = 1}^n x2^x = 2^{n + 1} n - 2^{n + 1} + 2). ;wip: wat So there are 2n or fewer comparisons in total.

In other words, we can build a heap in-place from an array in O(n) time.

Also, if we prove that the running time of a function is of a certain order at infinity, then it is proven for all n.


Another proof that in-place heapify is linear in time would be to use induction to show that the number of comparisons done for each node that is i levels above the lowest is 2i, and that there are \frac{N + 1}{2^{i + 1}} nodes in that level. Therefore, the total number of comparisons is \sum_{i = 1}^{h + 1} \frac{N + 1}{2^{i + 1}} 2i = (N + 1)\sum_{i = 1}^{h + 1} \frac{i}{2^i} \le (N + 1)\sum_{i = 1}^\infty \frac{i}{2^i} = 2(N + 1), by the series rules.

Selection Problem

Given an array A of N integers and an integer k \in \set{0, \ldots, N - 1}, find the kth smallest item in A. If k = 0, the result is the minimum. If k \approxeq \frac N 2, the result is the median. If k = N - 1, the result is the maximum.

We could solve this by sorting the array, which would be O(n \log n), or we could build a min-heap or max-heap and remove the appropriate number of items, both O(n \log n).

There is also a better solution called quickselect, based on the divide and conquer approach - we divide the problem into smaller subproblems, and then solve these subproblems. The basic idea is to recursively choose a pivot, partition the array into an array where every element is less than or equal to that pivot and an array greater than that pivot, then recurse on one of the smaller arrays to find the solution:

def quick_select(A, k):
    if len(A) == 1: return A[0] # only one element, so must be the answer
    x = the pivot element # an element of the array that we choose to split on - how we choose it is up to us
    partition `A` into two sections (possibly in-place) `A[0:i - 1]` and `A[i + 1, n - 1]` such that `A[0:i - 1]` contains all elements that are less or equal to the pivot (excluding the pivot itself), and `A[i + 1, n - 1]` contains all elements that are greater, and `A[i]` is the pivot `x`
    if k == i: return x # the pivot element is the `k`th smallest element, so we found the answer
    if k < i: return quick_select(A[0:i - 1], k) # the answer is in the left partition, so recurse on the smaller problem with the same index
    if k > i: return quick_select(A[i + 1, n - 1], k - i - 1)

Partitioning can be done in-place in O(n) time worst-case:

def partition(A, i):
    swap A[0] and A[i] # we want to get `x` out of the way first, so we put it in `A[0]`
    i, j = 1, len(A) - 1
    while True: # this is linear since `i` can only increase and `j` can only decrease, so we have at most one comparison with `A[0]` per element>
        # find an element that belongs in the left partition, and an element that belongs in the right partition
        while i < len(A) and A[i] <= A[0]: i += 1 # go right until `A[i]` is greater than `x`
        while A[j] > A[0]: j -= 1 # go left until `A[j]` is less than or equal to `x`
        if j < i: # all of the left side is less or equal to `x` and all the right side is greater, so the partitioning is done
            # the place where the elements that are less or equal and the elements that are greater meet is the boundary between partitions
            swap A[0] and A[j] # put the pivot back into its place between the two partitions, since A[j] is less or equal to `x`
        # at this point, `i` is greater than or equal to `j`, where `i` is the index of an element that is on the right but belongs on the left, and `j` represents the index of an element that is on the left but belongs on the right
        swap A[i] and A[j] # by swapping, we are moving each element into their correct partition

The worst case runtime for quickselect is T(N) = \begin{cases} c_1 &\text{if } N = 1 \\ T(N - 1) + c_2 N &\text{if } N > 1 \end{cases} = c_2 N + c_2 (N - 1) + \ldots + c_2 2 + c_1 = O(N^2) where c_1, c_2 are constants - in the worst case, quickselect takes quadratic time. However, in the best case it takes just O(n), where the pivot is the kth smallest element.

The average case runtime is T_{avg}(N) = \frac{\sum_{\text{each input } i \text{ of size } N} T(i)}{\text{number of inputs with size } N}. This is not feasible to calculate directly, but we can actually describe every input instance as a permutation P, since we only care about the order of numbers, not the values themselves. We will analyze the worst case of k, where we need the maximum amount of recursion, in order to not have to consider it in our calculations.

With these assumptions in place, we can now write it as T_{avg}(N) = \frac{\sum_{\text{each permutation } i \text{ of } N \text{ elements}} T(i)}{N!}. We now split the sum into those where the pivot ended up being roughly in the middle of the array after partitioning, and those where the pivot did not: \frac{\sum_{\text{each permutation } i \text{ of } N \text{ elements where } i \in [\frac 1 4 N, \frac 3 4 N)} T(i)}{N!} + \frac{\sum_{\text{each permutation } i \text{ of } N \text{ elements where } i \notin [\frac 1 4 N, \frac 3 4 N)} T(i)}{N!}.

Note that when the pivot ends up being roughly in the middle, we are recursing on an input that is roughly half the size of the original input, which is the best case scenario for splitting input. When the pivot is not near the middle, in the worst case of k the input we are recursing on could be almost as large as the original input.

Clearly, if i \in [\frac 1 4 N, \frac 3 4 N), which is \frac 1 2 of the time, then when we recurse in quickselect, our new input size is proportionally smaller, at most \frac 3 4 of the original size, so T(N) = T(\frac 3 4 N). If i \notin [\frac 1 4 N, \frac 3 4 N), which is also \frac 1 2 of the time, then the largest array we might recurse on could be roughly the same size as the original array, so T(N) = T(N - 1) + c_2 N.

Therefore, on average T(N) = \begin{cases} c_1 &\text{if } N = 1 \\ \frac 1 2 T(\frac 3 4 N) + \frac 1 2 T(N - 1) + c_2 N &\text{if } N > 1 \end{cases} = O(n) where c_1, c_2 are constants. This means that quickselect is, on average, a linear time operation, with a O(n^2) worst case and a O(n) best case.

Also, exponential runtime includes all bases, but each base is distinct: O(4^n) is not the same as O(2^n).



Quicksort sorts by partitioning the input into two based on a pivot, then sorting each partition recursively. This can all be done in place:

def quicksort(A, left, right):
    # postcondition: `A[left]`, ..., `A[right]` is sorted
    if right >= left: return # the array is already sorted since it has 1 or fewer elements
    pivot = find_pivot(A, left, right)
    i = partition(A, left, right, pivot) # `i` is the new position of the original `pivot` element
    quicksort(A, left, i - 1)
    quicksort(A, i + 1, right)

Quicksort is similar to quickselect, but we sort both sides when we recurse. In practice, quicksort is very fast, though in theory heapsort and mergesort are better.

Let T(n) be the runtime for sorting n elements, which is right - left + 1. Clearly, T(n) = \begin{cases} O(1) &\text{if } n \le 1 \\ T(\text{size of left partition}) + T(\text{size of right partition}) + O(n) &\text{if } n > 1 \end{cases}.

We can also write this more formally:

Let c_1, n_0 exist such that everything in the function except the recursive calls takes c_1 n_0 time or less for all n \ge n_0. Let c_2 = \max(T(0), \ldots, T(n_0)). Let c = \max(c_1, c_2).
Then T(n) \le c_2 \le c \le cn \le T(\text{size of left partition}) + T(\text{size of right partition}) + cn if 2 \le n \le n_0. So T(n) \le \begin{cases} c &\text{if } n \le 1 \\ T(\text{size of left partition}) + T(\text{size of right partition}) + cn &\text{if } n > 1 \end{cases}.

In the worst case the size of one of the partitions will be n - 1, when the pivot is the largest or smallest element each time - sorted input. So T(n) = T(1) + T(n - 1) + O(1) = O(n^2).

In the best case the size of each partition will be roughly equal, when the pivot is one of the median elements. So T(n) = 2T\left(\frac n 2\right) + O(n) = O(n \log n).

The average case proof is a bit more difficult:

Clearly, the average is simply take the average of all possible pivot combinations: T_{avg}(n) = \frac{\sum_{i = 0}^{n - 1} \left(T_{avg}(i) + T_{avg}(n - i + 1) + cn\right)}{n} = \frac 1 n \sum_{i = 0}^{n - 1} T_{avg}(i) + \frac 1 n \sum_{i = 0}^{n - 1} T_{avg}(n - i + 1) + cn = \frac 2 n \sum_{i = 0}^{n - 1} T_{avg}(i) + cn.
We claim that given D = \max\left(T(0), T(1), \frac{T(2){2 \ln 2}}, 18c\right), T_{avg}(n) \le D \max(1, n \ln n) for all n. This can be proven uing induction over n, and T_{avg}(n) \le D\frac 2 n \sum_{i = 2}^{n - 1} i \ln i + \frac 2 n D + cn.
Clearly, D\frac 2 n \sum_{i = 2}^{n - 1} i \ln i + \frac 2 n (T(0) + T(1)) + cn \le D\frac 2 n \sum_{i = 2}^{n - 1} i \ln i + \frac 2 3 (D + D) + cn because n \ge 3.
So D\frac 2 n \sum_{i = 2}^{n - 1} i \ln i + \frac 2 3 (D + D) + cn \le \frac 1 2 n^2 \ln n - \frac 1 4 n^2 \le \frac{2D} n \frac 1 2 n^2 \ln n - \frac{2D} n \frac 1 4 n^2 + cn \le D n \ln n - \frac D {18} n + cn = O(n \log n).
Basically, \frac 2 n \sum_{i = 0}^{n - 1} T(i) + cn \in O(n \log n).

Quicksort and quickselect are bad in the worst case but good in the average case. In real life, how good it is will depend on what kind of input we run it on. The average case assumes that the every input is equally likely to occur, but it is not necessarily true. For example, sorting almost sorted arrays is a very common use case in real life, which, given our pivot selection technique, would often give the worst case behaviour.

Instead, we can force the instances to become equally likely, to get something like average case behaviour. We do this by using randomization - eliminating the bias caused by real world input by adding a random factor to the algorithm.

For quicksort, we can do this by selecting a random pivot instead of just always taking the first element in the array as the pivot. This makes all elements equally likely, so the probability of choosing a bad pivot is always \frac 1 n, regardless of what the input is. The average case is now O(n \log n) regardless of what kind of input we have.


We will prove that all comparison sorting algorithms are \Omega(n \log n). However, there are other sorting algorithms that can also run in O(n) time, which we will also look at.

A comparison sorting algorithm is one that is based purely on key comparisons - we can only compare two keys to see if one is greater, less, or equal to another, and no other operations are available.

We don't really have tools for analyzing arbitrary algorithms. However, for comparison-based algorithms we have decision trees:

Consider an example, sorting three elements A[0..2]:

                                     A[0] \le A[1]
                                    / false       \ true
                       A[1] \le A[2]               same as left side, but with A[0] and A[1] exchanged
                      / false       \ true
A[0] \le A[1] \le A[2]               A[0] \le A[2]
                                    / false       \ true
              A[0] \le A[2] \le A[1]               A[0] \le A[2] \le A[1]

Decision trees are binary trees that represent every possible decision path in the input, where each leaf is an output and each nde represents a decision. This decision tree represents the absolute minimum number of comparisons that we have to use in order to correctly sort three values

The worst case number of comparisons is the length of the longest path from the root to the leaf - the height of the tree. The length of a path is the number of edges in it.

Using these, we can now construct a proof:

No matter what comparison sorting algorithm we use to sort, it is based on comparisons and is therefore representable using decision trees.
When we sort n items, there are n! possible outputs - every permutation of the input is a possible output.
Clearly, any binary tree of height h must have at most 2^h leaves, so any binary tree with l leaves must have height at least \log_2 l.
So the height h of the decision tree is at least \log_2 n!.
Clearly, h \ge \log_2 n! = \log_2 (n(n - 1) \cdots 1) \ge \log_2 n + \log_2 (n - 1) + \ldots + \log_2 \frac n 2 \ge \frac n 2 \log_2 \frac n 2 = \frac n 2 \log_2 n + \frac n 2 \log_2 \frac 1 n = \frac n 2 \log_2 n - \frac n 2.
So h \in \Omega(n \log n).

However, non-comparison sorting algorithms can be faster. Bucket sort is how most people will sort things by keys like names or grades. Basically, we have one pile for each possible value, make a pile for each value, and put them together:

let $0 \le A[i] \le R - 1$ be an integer ($R$ is the radix)
create an array $R$ of empty lists $L[0], \ldots, L[R - 1]$
for element in A:
    append `element` t `L[element]`
concatenate all the lists $L[0], \ldots, L[R - 1]$ together into another list `L`
the sorted result is now `L`

This is a stable sort. Since we iterate through the array, and the lists take a constant time to create, bucket sort is O(n + R) in the worst case.

This is not a comparison sorting algorithm because it uses more information than just comparisons - it also relies on assumptions about the keys all being in the finite set of possible values.

Key-index/count sort is similar to bucket sort (and is also stable), but doesn't need to use lists, which are something we try to avoid. The idea is that we know where the elements of L[1] will be in the final result based on the number of elements in L[0], where the elements of L[2] based on where L[1] is and the number of elements in it, and so on:

# compute length of each list
list_lengths = [0] * R
for element in A:
    list_lengths[element] += 1

# compute position of each array in the result
list_positions = [0] * R
for i in range(1, R):
    list_positions[i] = list_positions[i - 1] + list_lengths[i - 1]

# put each value into the correct position
initialize `result` as a list with the same length as `A`
for element in A:
    result[list_positions[element]] = element
    list_positions[element] += 1

This can also be done in place, and we can combine list_length into list_positions. This is still O(n + R) since we have to initalize the array of size R to 0.

In practice, we wouldn't want to use these sorts for any real-world inputs. This is because although it is O(n), it is often still slower than a good comparison sort for real-world number values.

These sorts also work for occasions when we have too many buckets to be practical, like names. We could first sort by the first letter/digit of the name/grade, then recursively sort each of the buckets based on the second letter/digit, and so on. This technique is also called MSD-radix-sort. When R = 2, this is very similar to quicksort.

LSD-radix-sort is a bit better, since we don't have to keep track of all the indices by which we're sorting recursively:

for d from the length of the largest value down to 1:
    sort `A` by the `d`th significant digit with a stable sorting algorithm

The time complexity for this is simply O(m(n + R)) where m is the largest possible number of digits and R is the radix in which we are sorting (number of possible characters). Typically, the number of digits or letters is bounded, so the depth of recursion is bounded by a constant and the worst case time complexity is O(n + R).


An empty tree has height -1. A tree with one node has height 0. The height of a tree is 1 more than the larger of the heights of the two subtrees.

A dictionary is an abtract datatype that associates keys to values. Essentially, it stores key-value pairs, and supports insertion of a key value pair, searching for a key-value pair by key, and deleting a key-value pair by key.

We assume that all the keys are unique and can be compared in constant time, and each pair takes up constant space.

A naive way to implement dictionaries is to use unsorted arrays/lists (O(1) insert, O(n) search, O(n) delete) or sorted arrays/lists (O(n) insert, O(\log n) search, O(n) delete).

A binary search tree is a binary tree where for every node, every key in the left subtree is smaller than it and the every key in the right subtree are larger. We don't have to care about the duplicate keys since we assume keys are unique.

Implementing dictionaries using binary search trees is possible by constructing the BST where the value of each pair is its key (O(h) insertion, O(h) search, O(h) deletion). It is always true that h \in \Omega(\log n) and h \in O(n).

We want to make sure h is O(\log n), by rebalancing the binary search tree as we perform operations on it. We do this by imposing additional conditions on the tree, and then showing that we can do all operations while maintaining these conditions and taking O(h) time or less, and that these conditions ensure that h \in O(\log n).

There are many possible sets of conditions that we can use to make sure the binary tree is always balanced. One of these are those used in AVL-trees, which imposes just one condition: for any node, the height of the left and right subtree differ by at most 1 - the structural condition.

AVL trees were one of the first ever self-balancing binary trees, and although they work, they are rather slow in practice.

When we store a node in an AVL tree, we will also store the height of the tree formed by that node and its children in that node. This allows us to efficiently do the calculations we need later for insertions and deletions.

Searching in an AVL tree is exactly the same as in a normal BST. Since we don't modify the tree, we also can't violate the strutural condition.

For insertion, we need to insert the key-value pair at its proper spot, then check if the tree is still balanced and rebalance if not. Inserting into the correct subtree is the same as in a normal BST, but we also need to update the tree heights we are storing with the nodes - the newly inserted node gets height 0, and we recalculate the height of its parent. If the height changed, we recurse and recalculate the height of its parent, and so on.

While we are recalculating heights, we also check if the heights of the children of the ancestor differ by at most 1. If they differ by more, then we need to rebalance on that node - to make sure it is both a BST and tree heights don't differ by more than 1. We basically have 4 possible rebalancing operations, and it has been proven that at least one of them will correctly rebalance the tree.

The first is right rotation on node Z. Given a BST with nodes in the form Z(Y(X(A, B), C), D), we can rearrange it as Y(X(A, B), Z(C, D)) while still maintaining the BST properties, since A < X < B < Y < C < Z < D. Basically, we rotate Y and Z clockwise/right, changing only those elements so the entire operation is constant time.

The second is left rotation on node Z, which is similar but we are rotating in the opposite direction. Given a BST with nodes in the form Z(A, Y(B, X(C, D))), we an rearrange it as Y(Z(A, B), X(C, D)) while still maintaining the BST properties.

The third is double right rotation on node Z. Given a BST with nodes in the form Z(Y(A, X(B, C)), D), we can rearrange it as X(Y(A, B), Z(C, D)). Basically, we are bringing X up to replace Z. This is called a double right rotation because it can be implemented as left rotate on Y and right rotate on Z.

The fourth is the double left rotation on node Z. Given a BST with nodes in the form Z(A, Y(X(B, C), D)), we can rearrange it as X(Z(A, B), Y(C, D)).

Basically, we try each operation in turn, and check if each one is balanced. If so, then we can stop trying the others.


Operations on AVL trees can now be defined:

def insert(k, v):
    insert `k` and `v` into the tree as a new leaf, without rebalancing
    let `node` be the new node that was inserted
    while `node` is not the root node:
        p, s = parent(node), sibling(node)
        if abs(s.height - node.height) >= 2: # node unbalanced, rebalance it
            node = p
        else: # node balanced, update the height
            new_height = max(node.height, s.height) + 1
            if (new_height == p.height): break # the height has stopped changing and we can stop rebalancing
            p.height = new_height

def delete(k):
    delete the node `k`, without rebalancing
    let `node` be the node that was deleted
    while `node` is not the root node:
        p, s = parent(node), sibling(node)
        if abs(s.height - node.height) >= 2: # node unbalanced, rebalance it
            node = p
        else: # node balanced, update the height
            new_height = max(node.height, s.height) + 1
            if new_height == p.height: break # the height has stopped changing and we can stop rebalancing
            p.height = new_height

def rebalance(node):
    let `y` be the larger child of `node`
    let `x` be the larger child of `y` # this child must exist since the height differs by more than 1, so `node` must have grandchildren
    perform the rebalancing operation according to the four rebalancing cases, based on whether `y` and `x` are left and right children
    update the heights of `node`, `y`, and `x`

The height h of an AVL-tree is always at most O(\log n). Proof:

We want to prove that that h \le \log_c n = O(\log n) for a constant c.
We want to find c and n for n \ge c^h - given a fixed height, we want to find the smalles number of nodes we can have.
Clearly, for height 0, there must be at least 1 node. Clearly, for height 1 there must be at least 2 nodes. Clearly, for height 2 there must be at least 4 nodes. Clearly, for height 3 there must be at least 7 nodes.
By induction, we can show that the minimum number of n, represented n(h) must be n(0) = 1, n(1) = 2, n(h) = 1 + n(h - 1) + n(h - 2), since each minimal tree must have the smaller minimal trees as its children, and we want to use one as large as possible while the other is as small as possible.
Interesingly, the values of n(h) is simply 1 less than each Fibonacci number. The Fibonacci numbers grow exponentially, so n(h) does too.
Clearly, n(h) \ge \sqrt{2}^h, since n(h) = n(h - 1) n(h - 2) + 1 \ge n(h - 2) + n(h - 2) = 2n(h - 2) \ge 2^2 n - 4 \ge \ldots \ge 2^i n(h - 2i), i > 1, so n(h) \ge 2^{\ceil{\frac h 2}}.
So any AVL-tree of height h has \sqrt{s}^h or more nodes, so h \le \log_{\sqrt 2} n = O(\log n).

Clearly, the worst case time for insertion/deletion/search is O(h), and since h \in O(\log n) and each rebalancing operation takes constant time, insertion/deletion/search is always O(\log n).


A 2-3-tree is similar to AVL-trees in concept, but we no longer have to be a binary tree - a node may have 2 or 3 children. The 2-3-tree could be said to be a generalization of the AVL-tree.

Each node in a 2-3 tree is either a node with one key-value pair and two children, or a node with two key-value pairs and three children (the left child is less than key 1, the middle child is between key 1 and 2, and the right child is greater than key 2, and key 2 must be greater than key 1). Additionally, all empty subtrees must be on the same level.

Searching is done by a generalized version of the standard BST search - when we are comparing, we simply need to check how many key-value pairs the node has and check the cases accordingly.

Insertion into a 2-3 tree is done by searching for the node on the lowest level that would have had the key as a child if it was inserted directly. If it has only one key-value pair, then add our key-value pair to it to make it a node with two key-value pairs. Otherwise, we add the key-value pair to it to make a node with three key-value pairs (an overflow), then split it into three nodes with one key-value pair each and move the middle one (the one that is neither largest or smallest) up to the parent, so the node becomes two nodes and the parent gets an extra key-value pair. If the parent now has three nodes when we do this, we recursively perform this splitting and moving up operation on the parent, all the way up until no nodes have three key-value pairs.

Deletion in a 2-3 tree is done by searching for the key-value pair that we want to delete. We then swap the key-value pair with its successor/predecessor (the descendant key-value pairs that are closest to the current node's value), like with deletion in a normal BST. If the successor/predecessor's node has 2 keys, then we just delete the successor/predecessor to get a node with a single key-value pair. Otherwise, we have an underflow. If possible, we transfer by moving a key-value pair from the parent down into the leaf, and move a key-value pair from a sibling up into the parent to complete the deletion. Otherwise, we fuse by moving a key-value pair from the parent down into a sibling. If we do this, it might be necessary to recurse and fuse on the parent, and so on.


Insertion, deletion, and searching in a 2-3 tree are all O(h) where h is the height of the tree. The height of a 2-3 tree is always \Theta(\log n). Proof:

Let n be the number of key-value pairs in the tree. We want to find the smallest possible n for any given h, in order to get an upper bound on h given n.
Clearly, the first level has at least 1 key-value pair and at least 2 children. Clearly, the second level has at least 2 key-value pairs and at least 4 children. By induction, it can be proved that we have at least n \ge 2^{i - 1} key-value pairs on each level i, with at least 2^i children.
So the total number of key-value pairs is n \ge 2^{h + 1} - 1, and h \le \log_2(n + 1) - 1. So h \in O(\log n).
Clearly, the first level has at most 1 key-value pair and at most 3 children. Clearly, the second level has at most 6 key-value pairs and at most 9 children. By induction, it can be proved that we have at most n \le 2 \cdot 3^{i - 1} key-value pairs on each level i (2 key-value pairs per node, 3^{i - 1} nodes per level), with at most 3^i children.
So the total number of key-value pairs is n \le 3^{h + 1} - 1, and h \ge \log_3(n + 1) - 1. So h \in \Omega(\log n).
So h \in \Theta(\log n).

We can also generalize 2-3 trees into a-b trees. These are trees where the nodes can have between a and b children, except the root node, which can have between 2 and b children. The number of children for each node is 1 more than the number of key-value pairs it has. Children are subtrees that are between two neighboring keys. All empty subtrees must be on the same level.

The height of an a-b tree is h \in \Omega(\log_b(n)), h \in O(\log_a(n)). Searching takes O(\log_2 b \log_a n).

External Memory

It is quite common that we have more data than we can fit into the main memory, and we store it in external memory. Accessing external memory is generally far slower.

We now use a new computation model in which we only count how many times we access external memory - the number of disk transfers. We will also often consider sequential access to be faster than random access access.

A B-tree of order M is a \ceil{\frac M 2}-M tree - a tree with between \ceil{\frac M 2} and M children per node (except the root). We choose an M so that one node of the B-tree will fit into main memory. Clearly, h \in O(\log_{\frac M 2} n). Storing our data as a B-tree is an excellent way to minimise disk transfers, since searching takes h or fewer disk transfers, and insertion/deletion takes 2h or fewer.


As a result, we can use balanced binary trees to implement a dictionary with \Theta(\log n) runtime for all supported operations.

Any comparison-based implementation of searching for a key k is \Omega(\log n). We can prove this as follows:

Clearly, we can only use comparison between k and existing keys a_0, \ldots, a_n.
Clearly, in the decision tree for the algorithm there must be at least one leaf per key, since any key might be looked up. Clearly, there must also be a leaf for the case where k is not in the dictionary.
So the number of leaves is at least n + 1. So the height of the decision tree must be h \ge \log(n + 1).
Clearly, the height of the decision tree determines the worst-case number of comparisons that need to be made.
So the worst case comparisons for the algorithm is \Omega(\log n).

So logarithmic time for searching is the best we can do if we just have comparisons. However, if we exploit other properties of keys, we can do better.

If we assume keys of our dictionary are integers, and that they have a small lower and upper bound l \le k < h, then we can use key-indexed search. Basically, we create an array of size h - l and map every key k to the index k - l in the array. This means we have \Theta(1) time for all operations. However, it can be very wasteful of space for sparse dictionaries, and the assumption doesn't occur very often in practice.

Hash Tables

What we could potentially do is to map keys to the indices of a much smaller table - hashing the keys. We keep an unordered array T (a hash table), where each entry is known as a bucket or slot. T has size M, which we can choose ourselves.

We now want to map our keys to the indices 0, \ldots, M - 1 - a hash function h(k): \text{keys} \to \set{0, \ldots, M - 1}, that can be computed in O(1). So now any key k is stored at T[h(k)]. An example of a hashing function over integer keys is h(k) = k \pmod{M}.

However, it is possible for multiple keys to hash to the same value (a collision). There are multiple ways to resolve this. One way is to use a linked list or something similar at each slot, to store all the colliding key-value pairs. Unfortunately, this means that in the worst case, when all the keys collide, we have O(n) search and deletion, or O(\log n) if we used balanced binary trees at each slot instead of linked lists.

Our goal is therefore to minimise collisions on average. An ideal hash function must therefore be uniform (all slots are equally likely to be chosen for any key). We will assume that we have such a function, although in practice they are very hard to do correctly.

The load factor is \alpha = \frac{n}{M} - the ratio of keys to slots. We control \alpha because we control M. Essentially, we can resize the array when we need to to reduce the load factor. We generally want to keep the load factor small to avoid collisions, but not so small that we waste too much memory.

Alternatively, we can find an different slot for k in the same array rather than using a linked list for each slot (this is known as open addressing). This is more efficient because we avoid the overhead of linked lists, but is still has the poor O(n) worst case behaviour for all operations.

Basically, we now have a sequence of slots h(k, 0), h(k, 1), \ldots, h(k, M - 1) using our slightly modified hash function (this is a probe sequence), and we search for an empty slot in the sequence T[h(k, 0)], \ldots, T[h(k, n)] when we want to insert or delete.

When we delete we need to make sure we fill in the gap in the sequence that opened up when we deleted a key-value pair, or else we will break the sequence. What we could do is after deletion, insert a sentinel value that represents a deleted element into that slot. This sentinel value is interpreted as an element by search and delete (but not insert), and is not considered when deciding whether to shrink the array (that means we might need to keep track of two load factors).

A common way to define the probe sequence is linear probing, when h(k, i) = h(k) + i \pmod{M}. Ideally, a probe sequence is a permutation of 0, \ldots, M - 1 - we want to ensure that we try all slots and don't try each one more than once.

For open addressing, we must ensure that the load factor is less than 1 (it can't be 1 since otherwise it will break search). It is generally a good idea to increase M when the load factor gets up to 0.5 or so.

Linear hashing is inefficient because it ends up making big runs of non-empty slots - it clusters a lot of values together, and any key that hashes innto the cluster increases its size, which means that searching will often encounter long runs that it will need to go through. A good probe sequence should hop around the array more to avoid further collisions. Quadratic probing does this a little better using h(k, i) = h(k) + c_1 i + c_2 i^2, but it is hard to choose the constants such that it makes a permutation. Another technique is to use double hashing, where h(k, i) = h_1(k) + i h_2(k) and h_2(k) \ne 0, but getting a hash function that is a permutation is difficult.


Cuckoo hashing is more or less the best hashing strategy around. It uses a very different insertion strategy: we definitely insert a key at slot h(k). If that slot was occupied prior to insertion, we kick the old one out and insert the new one.

This strategy has two independent hash functions, h_1(k) and h_2(k):

def insert(key, value):
    original_index = h_1(key) # the index of the new key-value pair
    index = original_index
    while T[index] is not None: # slot is not already empty, kick out old value
        old_key, old_value = T[index]
        T[index] = (key, value) # store the current key-value pair, displacing the original key-value pair in there
        index = h_1(old_key) if index == h_2(old_key) else h_2(old_key) # switch the index to the alternate slot for the newly displaced pair
        key, value = old_key, old_value # repeat the whole kicking out and inserting process for the newly displaced key-value pair
    T[original_index] = (key, value) # insert the new key-value pair in case it didn't kick any other pairs out earlier

It is possible for insertion to go into a cycle, such as if the hash functions were equal. To avoid an infinite loop, we check if the length of the displacement sequence (the number of times the while loop runs) and check if it equals the table size. If this is the case, we rebuild the table with a larger size, and new hashing functions.

Insertion into a hash table using cuckoo hashing has an expected runtime of O\left(\frac \alpha {(1 - 2\alpha)^2}\right), where \alpha is the load factor and assuming the hashing functions are uniformly distributed. So when the load factor starts going above 0.5, the expected runtime goes to infinity. We should always rebuild the table before this.

For all the hashing strategies we've seen so far, insertion, deletion, and search has an expected runtime of O(1), as long as we keep the load factor below a constant (1 for open addressing, 0.5 for cuckoo hashing).

Midterm covers everything up to here, includes identities sheet, questions like run algorithm on some input, compare algorithms by things like memory or time, develop new algorithms, proving runtimes.

Rebuilding tables/Hashing

We want to occasionally rebuild the table when the load factor gets too big or too small, to avoid collisions and save memory. Basically, we monitor \alpha during insertions and deletions. If \alpha gets too big, we make a new, larger array (usually about double the size of the previous one, or less), then insert all the elements of the old one into the new one, and creating new hash functions. If \alpha gets too small, we do the same thing, but with a smaller array (usually about half the size of the previous table).

If we do both the growing and shrinking operations, we can ensure that the hash table also uses \Theta(n) space.

Obviously, this is very slow, but it only happens very rarely. These rebuilding operations are amortized so that insertion and deletion are O(1) on average.

In practice, open addressing is the most commonly used collision resolution strategy. We also want to use cheaply computable functions, unlike some theoretical ones that are very expensive to compute. A hash function should not ignore any part of the key, and should mix up patterns in data.

A common hash function for integer keys is h(k) = k \pmod M, M \in \mb{N}. This function is the division method. However, this works very badly if the data is human-readable and M is a power of 10 (since many real world numbers end with 0), and if M is a power of 2, since powers of two are very commonly used in computer arithmetic. Ideally, M would be a prime number. A common value is 2^i - 1, but this also is bad for certain patterns of numbers.

Another common hash function for integer keys is h(k) = \floor{(Ak \mod 1) \times M}, A \in \mb{R}. This function is the multiplication method. For this, we want to choose an A that destroys patterns in the input, though good values are hard (\phi = \frac{\sqrt{5} - 1} 2 is a good value in practice). This technique is good because the table size doesn't affect the quality of hashing.

Keys are not always integers. To hash strings, we often use the ascii values, and then use each of them in a certain base. For example, "APPLE" becomes 65, 80, 80, 76, 69, and we use this as the digits in a certain radix R, to get h(k) = (65R^4 + 80R^3 + 80R^2 + 76R^1 + 69R^0) \mod M. Since the value before the modulo is so large, it is hard to compute. We can do the equivalent function h(k) = ((((((((65 \mod M)R + 80) \mod M)R + 80) \mod M)R + 76) \mod M)R + 69) \mod M instead.


In external memory, a B-tree uses O(n \log_M n) disk transfers. Ideally, we want to use hashing in external memory much like in internal memory to speed up our accesses.

External memory is organized in pages, which are chunks that each fit in internal memory. We can easily read pages of external memory at a time.

We could use open addressing, but this would result in a lot of disk transfers. We could use hashing with chaining/closed addressing, but potentially if there are long chains one chunk would not fit in internal memory and we would need to store it in external memory again, which wastes space.


There is a good way to do hashing with external memory, a data structure known as a directory, using a technique called extendible hashing.

Let h(k) \in \set{0, \ldots, 2^L - 1} - all hash values are binary strings of size L. The directory contains an array, stored in internal memory, of size 2^d where d \le L (d is called the order of the directory).

Each slot in the array points to the a block in external memory, a chunk of fixed size. Each slot of the array has an index that can be represented as a binary string of length d. The idea is that every key-value pair with a hash value that begins with binary string x_1 \ldots x_d is stored at the index represented by x_1 \ldots x_d. For example, a key-value pair with hash 110000 in a directory of order 2 is stored at index 11, or 3 in decimal.

Each block stores an integer k_d \le d, the local depth, which represents how many bits of the hashes of the keys will be the same, starting from the beginning. A block also stores key-value pairs such that the first k_d bits of the hashes of each key are the same. The local depth also shows how many references there are to the block, as 2^{d - k_d}. For example, a block with local depth 2 stores key-value pairs with hashes that all begin with the same first two bits.

Blocks can store their key-value pairs however they want, like hash tables or BSTs, since they are loaded into internal memory before we work with them. We will assume that searching within the block is O(\log S) where S is the maximum number of key-value pairs a block can store.

Now, to lookup a key-value pair, we take the first d bits of the key hash as an array index, then read the block at that array index into internal memory and search for the key-value pair within that block. For example, a key-value pair with hash 010001 stored in a directory of order 3 is in the array at index 010, or 2 in decimal.

To insert a key-value pair, we take the first d bits of the key hash as an array index, then read the block at that array index into internal memory. If there is room in the block, we insert the key-value pair and are finished. Otherwise, if the block contains the entries for multiple slots (k_d < d), we split the block into two blocks with local depth k_d + 1, and make sure the array points to the correct blocks, then try to insert again. Otherwise, we have to rebuild the directory's entire array by doubling the array and populating it again, so d is incremented.

To delete a key-value pair, we do something similar to insertion by reading the block and deleting the key, but merge two blocks together if possible. Merging involves decrementing k_d on one block, copying key-value pairs over, then deleting the other block and updating references in the directory's array.

Clearly, lookup is O(1) and only ever needs 1 disk transfer. Insertion could potentially take O(2^d) time (to rebuild to array), but this happens very rarely, and only needs 2-3 disk transfers.

The advantage of extendible hashing is that we never have to re-update all the blocks in external memory, or rehash them. This makes it very well suited for working with external memory.

An empty directory starts off with all slots pointing to the same block, and the k_d of that block is 0. Basically, all of the slots are combined into one. It is possible for multiple slots to point to the same block, when k_d < d. Insertion and deletion can cause blocks to become combined and split combined blocks.

So far, we have seen various ways of implementing dictionaries: unsorted arrays, sorted arrays, hash tables, AVL-trees, in increasing order of difficulty to implement.


Heuristics and Self-rearranging Arrays

In the real world, sometimes algorithms with poor worst case time complexity will actually perform better.

For dictionaries implemented using unordered arrays or linked lists, search will always need O(n) worst case. However, a successful search takes O(\text{index of } k \text{ within the array/linked list}).

The idea is to place keys that are more frequently searched for should be at the beginning of the array/linked list. In the real world, the 80-20 rule of thumb says that about 80% of searches will be for 20% of the keys.

If we know the frequencies of search for the keys beforehand, we can just put them in the array/linked list sorted by decreasing frequency. This allows to to minimize \sum_{k \in \text{ keys of the dictionary}} \text{frequency of search} \times \text{index of } k. This is called static ordering.

We often don't know the frequencies. Instead, we could change the order of the keys with every search. Every time we search for a key and find it, we move it toward the beginning.

One such movement strategy is always moving the found item to the front, and shifting the array to the right by 1, the move-to-front heuristic. There is also the transpose heuristic, which always moves the found item 1 item toward the front of the array if not already at the front (by swapping it with its left neighbor). These are all O(1), except the move-to-front heuristic on arrays, which is O(n). The move-to-front heuristic tends to be better in practice.

For ordered arrays, search will always need O(\log n) worst case. However, in real life we often have a general idea where the key is. For example, if we are looking for a name beginning with Z in a phone book, we would start near the end.

The idea is to guide the search by where we expect the key to be. If the current bounds in our search are l and r in the sorted array A (assumed to contain cardinal values), then we would expect the a key k to be at index l + (r - l)\frac{k - A[l]}{A[r] - A[l]}, assuming the values are all uniformly spaced. This is called interpolation search:

def interpolation_search(key, A, left = 0, right = len(A)):
    if left >= right: return A[l] if A[l] == k else None
    middle = left + (right - left) * ((key - A[left]) / (A[right] - A[left]))
    if A[middle] >= key: return interpolation_search(key, A, left, middle)
    return interpolation_search(key, A, middle + 1, right)

This is now greater than O(\log n) in the worst case, but it works pretty well in real life.

We also have galloping, which is a search technique for sorted arrays in which we start from the beginning, and skip ahead by increasing increments. For example, we might skip ahead by 1 at first, then 2, then 4, then 8, and so on, until we are at or past the desired key, in which case we would do a binary search or similar on the bounds. This is good for huge arrays (and even works on infinite-size arrays, as long as they are sorted) because we can start searching before the entire array is finished loading, and is actually O(\log n) if we increase the search index exponentially.

A skip list is a data structure used for implementing dictionaries using ordered linked lists. Basically, the idea is that we store additional references to nodes within the list, so we can skip to it directly when we need to, like with an ordered array. We would essentially store lists L_1, \ldots, L_k, where L_k is a subset of L_{k - 1} of proportional size, L_1 contains every node, and L_k includes only 1. Each node has multiple fields for the next pointers of each level of list, so at each node we can go to the next element of any of the lists that include it. The goal is to have each list be a uniformly spaced sample of the elements of the lower list. This is a lot like a tree with variable arity and fixed height.

To search, we would first search left to right through the top list. Right before we pass the key's value or reach the end, we drop down a level and try searching left to right again, repeating the search and drop down until we have found the key or are at the lowest level and try to drop down (key not found).

To insert, we search for where k should be, insert k into the lowest list l_1, and then randomly choose how many lists we choose to insert it into with decreasing probability (probability of inserting into i lists above should be \frac 1 {2^i}). So at each level, we would have a 50% chance of also inserting it into the parent level.

This takes O(n) expected space and O(\log n) expected runtime for search, insert, and delete.


Our abstract datatype dictionaries we have insertion, deletion, and search. However, there are often other ueful operations in many situations.

Range Queries

A range query is an operation \text{Range}(x_1, x_2) that returns all the key-value pairs k, v such that x_1 \le k \le x_2 - all the key-value pairs in a range.

For implementations of dictionaries using unordered arrays, hash tables, heaps, and other data structures that use arrays that are not sorted, we can't do better than just searching through the entire array. For implementations that are sorted, like sorted arrays or BSTs, we can do binary search for the range endpoints, and build the result along the way, in O((x_2 - x_1) + \log n) time.

We don't really care about the x_2 - x_1 term since we can't do anything about it - it is the cost of producing the output. We will only consider the terms that depend on n.

To do range queries on a BST, we can do a modified tree traversal to flatten the desired subtrees:

def range_query(node, lower, upper):
    if node == None: return []
    if node.value < lower or node.value > upper: return []
    return range_query(node.left) + [(node.key, node.value)] + range_query(node.right)

Range queries are extremely useful for building databases and working with multidimensional data. Suppose the items stored in a dictionary all each have d attributes, and that the attributes are numbers - the values of the items are d-tuples of numbers.

Now we want to do a d-dimensional range query to search by multiple attributes.

One way to do 2D range queries is to use a quadtree. This is a data structure that stores points as a tree-like structure, where each node has 4 children that split a 2D rectangle into 4 quadrants. To build one, we find a bounding box of all the points (preferably, a power of 2), and then recursively split boxes into 4 smaller boxes as quandrants as necessary until every non-split box contains 0 or 1 points. To draw a quadtree, we label the edges as one of NE, NW, SW, SE (and draw them in this order from left to right), label leaves as empty or the item it stores, and label non-leaf nodes with the bounding boxes they represent.

Searching a quadtree is similar to searching a binary tree, just with more cases for which child to recurse into. Insertion is similar to binary tree insertion as well, but when we find the leaf where the new point should go, we place it down if the leaf is empty, replace the leaf if the points are equal, or split the node into quadrants if it already has a point (putting the old point into the correct quadrant) and attempt to insert the new key into the correct child.

As a result, insertion can split a node many times. In fact, the height of a quadtree is unbounded with respect to n. Potentially, we could have an arbitrary number of quadrant splits depending on how close together points are (closeness of new point to any existing points determines the max number of splits upon insertion). The height is actually O\left(\log \frac{\text{max distance between points}}{\text{min distance between points}}\right).


Range search in quadtrees is similar to range search in binary trees, but with bounds on both the X and Y coordinates of the points:

def box_query(node, left, top, right, bottom, node_width, node_height):
    if node is None: return []
    if node.is_leaf: return [(node.key, node.value)]
    if node_width < left or right < 0: return [] # outside of the box bounds
    if node_height < top or bottom < 0: return [] # outside of the box bounds
    return range_query(node.top_left, left, top, right, bottom, node_width / 2, node_height / 2) +
           range_query(node.top_right, left + width / 2, top, right, bottom, node_width / 2, node_height / 2) +
           range_query(node.bottom_left, left, top + height / 2, right, bottom, node_width / 2, node_height / 2) +
           range_query(node.bottom_right, left + width / 2, top + height / 2, right, bottom, node_width / 2, node_height / 2)

We can also use a coloring algorithm that colors each node white (none of the children are in the box), black (all of the children are in the box), or gray (not sure, or neither). We recurse on gray nodes, ignore white nodes, and output all descendants of black nodes.

The runtime is therefore O(g + \frac w {O(g)} + \frac b {O(g)}), where g, w, b represents gray, white, and black, repectively. The number of gray nodes is unbounded, and therefore the runtime is as well.

A k-d tree is a data structure similar to quadtrees. Where in quadtrees we could split nodes into 4 equal parts when inserting, in k-d trees, we split the region into 2 parts (alternatingly horizontally and vertically, or through each dimension in higher dimensional k-d trees) that each have roughly the same number of points, and the size of both regions can be different.

The partitions/splits are placed at the median of all the points inside the node. Calculating the median can be done in O(n) time, but is rather difficult to do in practice. Leaf nodes are those that have no split, and store points.

The height of a k-d tree is always O(\log n) since each split cuts the search space roughly in half, so search is possible in O(\log n) (compare coordinate of search to coordinate of split, then recurse if necessary). The tree uses O(n) space.

Insertion is done by finding the leaf in which we'd insert the node, insert it so the node has 2 elements, then split the node and put the 2 elements into the 2 leaves, all O(\log n) overall. However, this means the tree is no longer balanced, and the medians are no longer at the medians. What we could do instead is to keep track of how unbalanced the tree is, and if it gets too bad, we just rebuild the tree entirely. There are theoretical implementations that can rebalance a k-d tree after insertion, but they are largely impractical.

Deletion is done by finding the leaf containing the node, removing it, and then combining the node with its parent, so it is O(\log n). There are no ways to rebalance a k-d trees after a deletion.

Building a k-d tree from a set of elements is possible in \Theta(n \log n), by finding the median of all the elements (in O(n)), partitioning the elements into two sets, then recursively rebuilding those sets into k-d trees.

Range query in a k-d tree is very similar to range queries in quadtrees, but we need to ensure we keep the query bounds and the current node's bounds up to date. Overall, this is a O(n) + O(\sqrt n) operation, and there is a theorem to prove it.


We color the tree with our range query values - white for nodes that are entirely outside the query bounds, black for nodes entirely inside, and gray for all others, then ignore white nodes, return black nodes, and recurse on gray nodes.
Clearly, the time complexity of range queries is O(n) for returning all the black nodes, plus the number of grey nodes. We want to prove that there are O(\sqrt n) grey nodes.
Gray nodes are possible when parallel edges of the query rectangle intersect one edge of the node, when parallel edges of the query rectangle intersect parallel edges of the node, when the query rectangle is entirely inside the node, and when perpendicular edges of the query rectangle intersect perpendicular edges of the node.
Clearly, every gray node must have at least one intersection - there are more intersections than gray nodes.
Clearly, the largest number of intersection points between a line and a rectangle is proportional to the largest number of intersection points between a rectangle and a rectangle, so they have the same order.
Let Q(n) be the largest number of intersection points between a horizontal/vertical line and the node's rectangle or its children's rectangles. Clearly, the number of gray nodes is O(Q(n)).
Clearly, Q(n) \le 2 + Q(\frac n 4) + Q(\frac n 4) = O(\sqrt n), since there must be at least 2 points intersected in the node's rectangle, and the line can intersect at most Q(\frac n 4) points in the descendents.
Since there are O(\sqrt n) intersections, there are O(\sqrt n) gray nodes.< So there are at most O(\sqrt n) recursions, and the time complexity is O(n) + O(\log n) as required.

We can use a strategy where we don't always alternate which dimension to split along to improve the rectangle shapes (ideally they would all be squares), which can sometimes reduce tree height.

We can also reduce height by storing points in interior nodes, which means that the points are on the split lines. If not, then we can also split at non-point coordinates - all split coordinates are valid.


Range trees are another way to implement dictionaries that use points as keys. This tree has the advantage of not having the bad time complexities of quadtrees and kd-trees. Also, it allows us to do range queries in a much simpler way.

Basically, a range tree stores points in a balanced BST by the X coordinate of each point. Every node in the tree also contains a balanced BST, which contains the node's points and the points of all of the node's descendents, by the Y coordinate of each point. When we compare points, we compare the X coordinates, then the Y coordinates if the X coordinates are equal.

This means that points will be stored multiple times in the same tree. Each point will appear in the tree at most once per level of the tree, and is proportional to the height. Therefore, range trees need O(n n) memory, since each node needs O(\log n) memory.

Searching a range tree is as easy as using binary search to find a point in the main tree with the same X coordinate, then search in the node's tree for the one with the same Y coordinate, all possible in O(\log n).

Inserting is simply inserting the new point into the main tree, then going upwards and inserting the point into the tree for each ancestor node of the new node and rebalancing those trees. However, we can't rebalance the main tree because that would ruin the node trees, so we basically have to rebuild it every so often. Deletion is a similar operation.

To do a range query within (x_1, x_2) to (x_2, y_2), we do a range query in the main tree for x_1 to x_2. Now we color the tree using this query: the nodes that are on the paths of the boundary nodes in the range query are gray, the nodes that are entirely inside the range query are black, and the ones that are entirely outside are white. We ignore white nodes, do an explicit test for whether the point is in the range for gray nodes, and know that the point is in the range for black nodes. For points that we know are in the X range, we can now do a range query for y_1 to y_2 for each point, and combined, this is the result of the original range query. Since there are O(\log n) of each type of node, and the range query in the node tree is O(n), range queries are O(\log^2 n) + O(n) (and can actually be a bit better with certain tricks).


Dictionaries for words are relatively common. There are dedicated data structures that make it very simple to associate data with words. Recall that words are sequences of characters, and characters are elements of a finite set, an alphabet (commonly \Sigma = \set{0, 0} or ASCII values). For this purpose we can use a trie.

A trie is an n-ary tree with edges labelled with characters from \Sigma, or the end of word symbol (by convention, we usually use $). Basically, words are stored as the sequence of edges in paths from the root node to leaf nodes character by character, where the ith level corresponds to the ith character (or the end of word symbol) of a word. Tries do not contain any unnecessary edges. At the leaf nodes we could store the whole word, or any data we want:

     / word: 1
  / \1
 /   \
/1    word: 11
 \   word: 0
  \ /$

One variation of this is to store words at internal nodes instead of using the end of word symbol. This might be beneficial if using an end of word symbol makes implementation difficult. We could also not store the words at all, and instead keep track of all the edge labels as we recurse downward to reconstruct the word.

Searching for a word is as easy as iterating through each character and taking the edge labelled by that character as the new tree to search in. If a leaf node is reached, we found the word. Otherwise, one of the labelled edges we want don't exist and the word is not in the trie. The runtime of this operation is O(\abs{w}\abs{\Sigma}), where w is the word and \Sigma is the alphabet.

To insert a node into a trie, we follow edges corresponding to characters in the word (creating them as necessary) until we arrive at some node, then create an edge labelled with the end of word symbol to a leaf node and store the data that leaf node. To delete a node from a trie, we find where it should be in the trie, then delete that node and its ancestors up to but not including the first one with 2 or more children, to remove unnecessary edges. Insertion is O(\abs{w}\abs{\Sigma}), and deletion is O(\abs{w}).

We can also compress tries by allowing edges to store more than one character. This allows us to write them in a cleaner way, but still uses the same asymptotic memory.

A Patricia trie is a special type of trie in which nodes are also labelled with the index of the character to test, in addition to edges labelling characters. In a normal trie, the level of the node determines the index of the haracter we test; a node at level 5 means we take edges corresponding to the 5th character of the word. In a Patricia trie, we can test any character at any node. If the index is greater than the index of the end of word symbol in the word, then we take that to mean that the word is not in the Patricia trie.

Patricia tries can actually compress tries - we can simply make tries that avoid checking characters that are the same in all the descendents (for example, if all the descendents have the second character be 0, then we can just skip checking the second character in that subtree), and since we store the entire word at each leaf node, we can't get to a leaf node by following a word that isn't actually in the trie.

Patricia tries use less space in practice, though don't really improve time complexities.


Pattern Matching

Pattern matching is the problem of finding strings inside other strings, like what grep does.

Formally, the problem is checking for the existance and finding the position of the string P[0 \ldots m - 1] \in \Sigma^* within the string T[0 \ldots n - 1] \in \Sigma^*, where \Sigma is a set of characters. Here, P is the pattern/needle, T is the text/haystack, and \Sigma is the alphabet.

Generally speaking, T will be much larger than P. Although we care about time complexity with respect to the length of P, we care much more about time complexity with respect to T.

A substring of T is a string S such that S = T[i \ldots j] for some i, j. A prefix is a string S such that S = T[0 \ldots i] for some i. A suffix is a string S such that S = T[i \ldots n - 1] for some i.

An occurrence of P in T is an index s such that T[s \ldots s + (m - 1)] = P[0 \ldots m - 1]. If s is an occurrence of P in T, then we say P occurs in T at shift s. Note that the equality operator over strings doesn't take constant time - checking if an index s is an occurrence takes \Theta(m) time in the worst case.

The brute force algorithm for pattern matching is just to try every possible shifts and return the first one that works:

def find(needle, haystack):
    for s in range(len(haystack) - len(needle)):
        for i, c in enumerate(needle):
            if c != haystack[s + i]: break
        else: return True
    return False

Clearly, this is \Theta(m(n - m + 1)) in the worst case. For small, constant m this is actually relatively fast, especially in practice.

However, we are clearly doing a lot of unnecessary comparisons. For example, if we have a pattern "gattaca" and text "gattaccgattact", we would fail the match at the second "c", but since we know we didn't match, we could just skip 7 characters ahead and start at the second "g", avoiding the need to check all the shifts in between. For longer patterns, this gives huge performance gains.

The idea is to improve runtime by using preprocessing - doing a lot of work upfront, but then making later searches very fast. We could preprocess the text (suffix trees do this), or preprocess the pattern (DFAs, KMP, Boyer-Moore, Rabin-Karp all do this).

String matching using DFAs is always O(n). An NFA is easy to construct, but the DFA is not - converting an NFA to a DFA results in exponentially increasing numbers of states. Instead, we have to design the DFA directly. For example, if we are matching P = abac in \Sigma = \set{a, b, c}, then we have the following DFA:

# one transition per each character of the pattern
start > a > a
a > b > ab
ab > a > aba
aba > c > abac
abac > accept

# transitions from each state to the state with the longest value that is a suffix of the current state
a > c > start
a > a > a
ab > b, c > start
aba > a > a
aba > b > ab

The basic algorithm is to add a state for each prefix of P, and for each combination of prefix w and next character x, add a transition from the state w to the state with the longest prefix v that is a suffix of wx.

Finding the DFA is possible in quadratic time, or cubic using a naive solution. However, this is far too slow in practice.

Instead, we can use \epsilon-NFAs, that we make deterministic. We enforce that we only take the \epsilon transition if there are no other possible transitions. Basically, we have the standard matcher, but at each state we have an \epsilon transition that tells us where to go if the ith character doesn't match. Since there is only one epsilon transition, with our rule the \epsilon-NFA is deterministic.

For example, the \epsilon-NFA for P = abac appears as follows:

# one transition per each character of the pattern
start > a > a
a > b > ab
ab > a > aba
aba > c > abac
abac > accept

# epsilon transitions to the failure states
a > > start
ab > > start
aba > > a

# transitions from start to start for all non-matching first characters
start > b > start
start > c > start

This is the Knuth-Morris-Pratt algorithm (KMP) - using \epsilon-NFAs to match strings.

The idea is to describe the \epsilon-NFA using a failure function \pi(k) - the state to go to given that we have seen P[0 \ldots k - 1] in the text and the next character of T is not P[k]. We can represent the failure function using a table where the columns are k values and the rows are the prefixes with lengths equal to those k values and the corresponding values of \pi(k).

In code, we can represent each state by the length of the prefix it represents, where the start state is 0 and the state representing the prefix "aba" represents 3. This looks like the following:

def kmp(T, P):
    compute the failure function $\pi(k): 1 \ldots m - 1 \to 0 \ldots m - 1$ as `failure_transition`
    state = 0
    i = 0
    while i < len(T):
        if T[i] == P[state]: # match/forward transition
            state += 1 # move forward in pattern
            i += 1 # move forward in text
            if state == len(P): return True
        else: # mismatch/backward transition
            if state > 0: state = failure_transition[state] # follow epsilon transition
            else: i += 1 # at start state, stay at start state and move forward in text

Clearly, each run of the loop either does a forward transition, which consumes a character and therefore can happen at most n times, or we do a backward transition, which is always followed by a forward transition (since the state we transition to must be able to consume the next character), and therefore can happen at most n times. As a result, the loop can run at most O(n + n) = O(n) times, and is therefore the algorithm's worst case time complexity.


The \epsilon-NFA for the failure function has states for each character of the pattern, from 1 to the length of the pattern, and the start state is 0. Each state is therefore associated with an index, and the character at that index in the pattern. Each state is also associated with a prefix, which is all the characters of the pattern from the start of the pattern to the state's index. Each state has a transition to its next state by its character.

If we are in state j, then we have already matched P[0 \ldots j - 1]. Formally, we are always in the state \max\left(\set{j \middle| P[0 \ldots j - 1] \text{ is a suffix of the text input read so far}}\right). Basically, the longest prefix of the pattern that is a suffix of the currently read input.

The failure function, \pi(k) = j, describes the epsilon transition at a state - the state we should hop back to if matching the next character failed to match. Since we hop backward, we have j < k.

Clearly, this is the same as \max\left(\set{j \middle| j < k, P[0 \ldots j - 1] \text{ is a suffix of } P[0 \ldots k - 1]}\right) - the epsilon transition doesn't consume any characters. Since j < k, we can just say the failure function's value is \max\left(\set{j \middle| P[0 \ldots j - 1] \text{ is a suffix of } P[1 \ldots k - 1]}\right).

In other words, the failure function goes to the rightmost state whose prefix is a proper suffix of the current state's prefix (a proper suffix is a suffix that is not equal to the string).

When we step the \epsilon-NFA on P[1 \ldots k - 1], the state it is in at the end is the rightmost state whose text value is a proper suffix of the current state's text value. This is a suffix because the \epsilon-NFA would go back to a previous state if it wasn't, is a proper suffix because P[1 \ldots k - 1] is shorter than P[0 \ldots k - 1] (the current node's prefix), and is the longest prefix because the \epsilon-NFA always matches characters whenever possible, avoiding \epsilon-transitions unless there is no other option. Therefore, it satisfies all the conditions to be the value of the failure function at k.

Note that we don't yet know what the \epsilon-transitions for the \epsilon-NFA are yet. However, we do know that \pi(0) is undefined, and \pi(1) = 0. Given this, we can use strong induction to find the rest.

That means that if we run the \epsilon-NFA on P[1 \ldots m - 1], then right before we read the nth character to step the NFA, the state of \epsilon-NFA is the value of \pi(n + 1). Now we have everything we need to calculate a lookup table for \pi(k):

def calc_kmp_table(P):
    table = []
    table[0] = -1 # state 0 has no epsilon transition, so this is undefined
    table[1] = 0 # state 1 always fails to the start state
    input_position = 1 # start at the second character
    current_state = 0
    while input_position + 1 < len(P): # `input_position + 1` is the state for which we are calculating the failure function at
        if P[input_position] == P[current_state]: # continue prefix matching
            table[input_position + 1] = current_state # store the rightmost state whose prefix is a proper suffix of the target state's prefix
            input_position += 1 # move ahead in the input
            current_state += 1 # move to the next state
        elif current_state != 0: # only non-start states have epsilon transitions
            current_state = table[current_state] # go back to current state's epsilon transition, which exists by inductive hypothesis
        else: # at start state and prefix doesn't match
            table[input_position + 1] = 0 # no prefixes, go back to start state
            input_position += 1 # move ahead in the input, since input is an unnacceptable first character
    return table

Clearly, this takes O(m) time, by the same argument as the one for the matching routine. So KMP uses O(m) time to compute the failure function, and O(n) time to match the string, so takes O(m + n) time overall. Note that it does not depend on the size of the alphabet. Theoretically speaking, KMP has the best time complexity, but in practice it is often slower than certain other algorithms with worse time complexities.

The Boyer-Moore algorithm is basically the brute force algorithm plus a few tricks. The first is the looking glass heuristic: when we are comparing the pattern to the current part of the string by testing whether P[0 \ldots m - 1] = T[s \ldots s + m - 1], we compare from m - 1 to 0 (the characters are compared backward).

Another trick is the character-jump heuristic: if we are checking a pattern and T[s + i] \ne P[i], then we find the character T[s + i] in P and shift the pattern over to that character. For example, if we want to match "bacaca" in "bacabaaacaacaba", we try to compare using the looking-glass heuristic and notice that the second to last character "b" doesn't match the "c" in the pattern. Clearly, the next match cannot occur before we find another "b" for "bacaca" to match, so we shift over by 4 to start matching the pattern on "baaacaacaba".

Finally, there is the good-suffix heuristic: if we are checking a pattern and T[s + i] != P[i], then we shift by a value that matches P[i + 1 \ldots m - 1]. The shift is the largest k such that P[k + 1 \ldots m - 1 + (k - j)] = P[j + 1 \ldots m - 1] and P[k] \ne P[j].

The shifts in the above two heuristics only depend on the pattern itself. As a result, we can precompute these before we match any text input at all. The code looks something like the following:

def boyer_moore(T, P):
    last = [-1] * len(P);  for i, c in enumerate(P): last[c] = i # calculate shifts for character-jump heuristic
    suffix = compute_suffix_function(P) # good suffix heuristic (`suffix[j]` is the amount to shift by if we see $P[j + 1 \ldots m - 1]$, can be computed in $O(n^3)$)
    i = m - 1;
    while i < n: # compare $T[i - (m - 1) \ldots i]$ with $P$
        j = m - 1
        while j >= 0 and T[i] == P[j]: i, j = i - 1, j - 1
        if j == -1: return True
        i = i + m - 1 - min(suffix(j), last[T[i]])

Without the good suffix heuristic, the function is very simple to implement. However, it makes the runtime O(m + n + \abs{\Sigma}). The basic idea behind Boyer-Moore is that it skips as much text as possible - some characters of the string aren't even processed at all. In practice, the algorithm skips around a quarter of characters in English text.

The Rabin-Karp algorithm preprocesses the text rather than the pattern, though the preprocessing depends on the pattern. Basically, we need a hash function that takes O(1) time given T[i \ldots i + (m - 1)], P, and h(T[i - 1 \ldots i - 1 + (m - 1)]) - a hash function that can be computed in constant time using the previous result.

The idea is to eliminate many substring comparisons using the hash function. The hash function can be computed at the same time as we are matching in the string.

For example, assume we are using binary as the alphabet. For each substring of T with length m, we count the number of 1's, which can be done in O(n) time overall and constant memory. Then, we only need to compare P to those substrings that contain the same number of 1's as P does.


Suffix trees are another approach to pattern matching, and preprocesses the text rather than the pattern, like Rabin-Karp. This means that it is well suited for when we are searching for many different patterns in the same text.

A suffix trie is an uncompressed version of a suffix tree.

The main idea is that if P is a substring of T, then P is a prefix of a suffix of T. In other words, if we store a trie containing all the possible suffixes of T, then P is a substring of T if and only if P is within the trie starting at the beginning - the trie lets us check for prefixes of all the suffixes at once. So for T = \text{test}, we would store the empty string, t, st, est, and test in the trie.

However, this trie would waste a lot of memory. A suffix tree is simply a compressed version of this trie. Basically, when we have a node with only one child that also only has one child, we can merge the three nodes into 2. For example, if we have a structure like t(e(s(s, t))) (the trie with "tess" and "test"), we can change this to tes(s, t).

This is much shorter to draw, but as we observed earlier for tries this doesn't save any space because we have to store the edge labels and leaves anyways. However, note that all labels and leaves are simply substrings of the text, and therefore can be represented using pairs of numbers representing the substring's start and end positions within the text.

The suffix tree is therefore a compressed trie of suffixes of text, where labels and words are represented using indices representing substrings of the text. For example, "bananaban" becomes a(ban\$(aban\$, n(a(ban\$(anaban\$)), \$(an\$)))), ban(anaban\$(bananaban\$), \$(ban\$)), n(a(ban\$(naban\$), naban\$(nanaban\$)), \$(n\$)), \$(\$), where $ is the end-of-word char and x(\ldots) means that there is an edge labelled x to a node \ldots and plain words are leaves.

Clearly, T has \abs{T} suffixes and therefore \abs{T} suffix tree leaves. Therefore, there are n - 1 or fewer internal nodes since each node has 2 or more children (if there was only 1 child then it would be merged in). Since nodes and edges take O(1) space, suffix trees need O(n) space.

Clearly, a naive solution for constructing the suffix tree is O(n^2) - take each suffix, insert it into a trie, then compress. However, it is possible to do this in O(n) time, though this is very complicated (fully covered in CS482).

Pattern matching using suffix trees is very easy:

def suffix_tree_search(T, P):
    build the suffix tree with root node as `S`
    pattern_position = 0
    while True:
        if pattern_position >= len(pattern): return True
        if S.is_leaf: return True
        for edge in S:
            if edge.startswith(P[pattern_position:pattern_position + len(edge)]):
                pattern_position += len(edge)
                S = S[edge]
            return False

In summary:

Property Brute Force KMP Boyer-Moore Rabin-Karp Suffix Trees
Preprocessing time O(1) O(m) O(m + \abs{\Sigma}) O(n) O(n)
Search time O(mn) O(n) O(n) O(mn) O(m)
Space O(1) O(m) O(m + \abs{\Sigma}) O(n) O(n)

KMP has the best asymptotic complexity overall. Boyer-Moore is often sublinear and best in practice. Rabin-Karp is also very fast in practice, especially with a good hash function. Suffix trees are very good when we are searching for different patterns in the same text, while all the others are good for same patterns in different texts.

Text Compression/Encoding

Suppose we want to transmit text between two parties. Most communication channels work with bit strings, so we would have to encode the text into a bitstring when we send it, and decode the bitstring when we receive it.

We want an encoding/compression algorithm that converts a word into a bitstring, and a decoding/decompression algorithm that converts a bitstring back into the original word.

The main objective here is to make the resulting bitstrings to be as short as possible, in order to save space.

There is character-by-character/fixed-length encoding, which converts each character into a fixed-length bit strings, like ASCII. However, this can't represent every character. The UTF-32 standard, for example, uses 32 bits per character in order to represent every language properly.

There is also variable-length encoding, where characters could be mapped to bit strings of varying lengths. For example, Morse code (Morse code actually uses 4 symbols - dot, dash, end of character, end of word), UTF-8, and UTF-16. With this, the length of the encoding is simply the sum of the frequency of each character times the length of the encoding for that character.

The problem with variable-length encoding is that it is easier to accidentally design one where there exists a bitstring that has two possible decoded strings. This happens whenever the code for one character is a prefix of the code for another (the trie with all the character encodings has no words at interior nodes).

A variable-length encoding where no code for one character is a prefix of the code for another character is a prefix code.

We want to use the smallest possible bitstring encoding for encoded text given frequencies for each of the characters. There is a simple algorithm to do this if we are given the frequency function f(c).

First, we find the two characters c_1, c_2 with the smallest frequencies - the ones that occur least often. We then add these two as children to an internal node of a trie (with edges labelled 0 and 1), and treat that new internal node as if it were a character with frequency f(c_1) + f(c_2) (and no longer consider c_1, c_2 as characters anymore).

We then repeatedly select the new two characters/internal nodes with the smallest frequencies and repeat this process. When we only have 1 character/internal node, we have created a trie with the best possible prefix encoding. This is the basic idea behind Huffman encoding.


We can use the following process to get the smallest encoding trie:

def smallest_encoding(frequency_function, character_list):
    let `heap` be a min-heap of tries # this is to allow us to efficiently select the lowest frequencies later
    for character in character_list:
        let `new_node` be a trie leaf node with `character` as its value and `frequency_function(character)` as its tag
        insert `new_node` into the heap indexed by its tag
    # we could also make the above more efficient by building heap in place
    while len(heap) >= 2:
        smallest1, smallest2 = heap.pop_min(), heap.pop_min()
        let `new_node` be a trie internal node with `smallest1` and `smallest2` as children (order doesn't matter), and `smallest1.tag + smallest2.tag` as its tag
        insert `new_node` into the heap indexed by its tag
    # at this point, the heap contains only one trie, which represents the best prefix code

This is called Huffman encoding. Clearly, the heap starts off size d = \abs{\Sigma}, and in every iteration of the while loop we decrease the size of the heap by 2 and then increase it by 1, so we have d runs of the while loop, which itself takes O(\log n) time. Therefore, Huffman encoding takes O(d \log d) time.

Suppose we have f(s) = 2, f(e) = 2, f(v) = 1, f(n) = 3, f(_) = 1, f(b) = 1, f(a) = 3, from the string "seven bananas". Using the above process, we might get something like b \to 000, s \to 001, e \to 010, v \to 0110, _ \to 0111, n \to 10, a \to 11.

There is a theorem that states that for any given set of frequencies, the above algorithm finds a prefix code that yields the shortest possible encoding in all bitstring-based prefix codes.

The problem with Huffman encoding is that we also need to send the dictionary (mapping bitstrings to characters) along with the encoded value, or the frequencies and tie-breaking rules. This adds a lot of overhead for smaller inputs, which occurs very often in practice.

Additionally, we have to go through the input twice - once to get the frequencies, and once to actually encode it. This is bad if we are working with huge amounts of data that can't be randomly accessed. We could potentially use the standard English text frequencies, but that would work poorly for non-English inputs. One approach is to use a fixed-size sample of the input to get the frequencies, but this also could perform poorly if the sample is non-representative.

There are two approaches to text compression. One is to encode multiple characters at once, and use the text to guide which characters to group together for encoding.

Run-length Encoding

Suppose we want to compress a bitstring (so \Sigma = \set{0, 1}). If we have a string like "1111100000001100000000", run-length encoding can compress this into 15728.

In real-world data, we often have long runs of 0 and long runs of 1. We can encode these runs with the length of each run - "1111100000001100000000" would become 5728, because there are 4 runs, with length 5, 7, 2, and 8, respectively. However, this also represents the string "0000011111110011111111", so we need to prepend the first bit to disambiguate.

However, we still need to actually represent the run lengths as a bitstring in order for this actually to be an encoding. Note that the run lengths can get arbitrarily large, so we can't just directly implement them using a fixed number of bits. We could solve this by breaking overly long runs of 1 into two runs of 1 separated by a zero-length run of 0.

If we know the length of the input, we can just encode that value into the output, by putting that many 0 bits before every run length:

def run_length_encode(value):
    write out the first bit of the string
    while not at the end of the string:
        let `k` be the length of the next run
        write a 0 bit `floor(log(k) / log(2))` times
        write `k` as binary and let `bits` be the number of bits written

To decode this, we can do the following:

def run_length_decode(value):
    current_bit = value[0]
    i = 1
    while i < len(value) - 1:
        l = 0; while value[i] == 0: l += 1; i += 1
        k = int(value[i:i + l + 1], 2)
        write out `current_bit` `k` times
        flip `current_bit`

Note that run-length encoding can actually result in a longer bitstring than the original. There is a threshold for average run-length such that if the average run length is below it, the output will actually be larger than the input. As a result, run-length encoding is really suited only for applications where we expect long runs of the same valule, like compressing some types of images.

Lempel-Ziv-Welch Compression

This is used extensively in the real world, since it is so effective. The basic idea is to create the dictionary while scanning and encoding the input.

We will assume that the input is an ASCII string, and the output is a sequence of 12-bit bitstrings.

We have a dictionary d(k) \to v where k is an ASCII string and v is a 12-bit bitstring. It is initialized with all the ASCII characters mapped to their values - NUL to DEL mapped to 0 to 127.

Suppose our text is "YO!_YOU!_YOUR_YOYO!". First, we read in a character to get "Y", and find that it is in the dictionary, so we output the value, 0x089. Then, we read in the next character to get "YO", which is not in the dictionary. We then add "YO" to the dictionary, mapped to the next available code, which is 128, and clear all read input except the last character read, so when we read the next character, we get "O!" rather than "YO!" or "!".

We then repeat this process of reading inputs, checking if they're in the dictionary, and outputting the code if it is and resetting if it isn't, until the input is exhausted. At the end we might print out the code for the one character in the read input if there is one.

We can implement this as follows:

def lzw(value):
    populate a dictionary `dictionary` populated with ASCII values
    while True:
        let `prefix` be the longest prefix of the remaining input that is in `dictionary`
        output the 12-bit number `dictionary[prefix]`
        let `c` be the result of peeking at the next character in the input
        dictionary[prefix + c] = len(dictionary) # `len(dictionary)` is always the next free bitstring code

We can actually store the dictionary very efficiently using a trie, since our alphabet is fixed to ASCII.


LZW decoding does not require a dictionary - we can figure it out from the data itself, as long as we know the starting dictionary (for ASCII text, this is mapping from characters to their corresponding ASCII codes). The idea is to build the dictionary again as we are decoding the string.

We can decode by looking up the next code in the encoded input, and adding the previous string plus the first character of the next string to the dictionary, at the next code number.

For example, if we have 67, 65, 78, 32, 66, 129, 133, 83, then we know which strings all values below 128 correspond to. As a result, we know that the string is "CAN B129, 133S. As we are doing this, we are building the dictionary. By the time we get to 129, we have, in addition to the initial mappings, "CA" to 128, "AN" to 129, "N " to 130, " B" to 131, and "BA" to 132 - all the pairs of characters. However, 133 is special because we still don't have an entry in the dictionary for it.

This can be implemented as follows:

def lzw_decode(encoded): # encoded is an array of 12-bit numbers
    let `D` be a trie populated with the codes for all the ASCII characters
    code = encoded[0]
    result = ""
    value = D[code]
    result += value
    for code in encoded[1:]: # go through the rest of the characters
        previous_value = value
        if code != len(D): # the code is still in the dictionary
            value = D[code]
        else: # unseen code, not in dictionary so we construct it from previous value
            value = previous_value + previous_value[0] # the value is the previous value appended with the first character of the previous value
        result += value
        D[len(D)] = previous_value + value[0] # the new substring is the previous value appended with the first character of the current value
    return result

Note that the dictionary is a trie, where all nodes store a code (including the interior nodes, but not the root). If a string is in the trie, then due to the algorithm so are all the non-empty prefixes are also in the trie.

The basic idea behind LZW encoding is to find the longest prefix in the dictionary, output its code, and then add this prefix plus the next char to the dictionary. The runtime of finding the longest prefix is \Theta(n) where n is the number encoded characters so far, and since our search ended at the correct place in the trie to add a child, adding the new entry to the dictionary is \Theta(1). Therefore, the overall time complexity of LZW encoding is \Theta(n) where n is the length of the input string.

In practice, a trie takes a huge amount of memory - we often use a hashing solution instead. Also, since we have a 12-bit code we can't have more than 2^{12} codes, so we need to make sure we don't use that many in the encoding. If we do, we need to make sure to tell the decoder that there is more text to encode after that code.

For decoding, all our code numbers are consecutive, so we can just store our dictionary as an array for \Theta(1) access (and if we initialize it to 2^{12} elements, we won't even need to expand it).

Looking codes up in the dictionary takes \Theta(1) time, but adding the looked up string to the result, takes O(n) time where n is the length of that string. Therefore, The overall time complexity of LZW decoding is \Theta(n) where n is the length of the decoded string.

In fact, we don't even need to store entire prefix strings in the dictionary while decoding - we can simply store the code of its prefix (or nothing, for the initial entries) and the last character (since the string without its last character is guaranteed to be in the dictionary). Then, when we need to look up the string by code, we just decode the stored prefix, and append the last character. This keeps our time complexity the same, but lets us store our dictionary entries in O(1) time.

Huffman encoding is the best 01-prefix code, but we need to send the dictionary and it needs 2 passes - it's used in PDFs. Run-length encoding is very simple, but bad on text (though good for bitmap images), and is rarely used in practice. LZW is the best compression for medium-length English text, and is used in things like JPEG and MP3.

Huffman encoding is always the same length or shorter than the original (without the dictionary), but all the others can possibly be longer.

The bzip2 compression scheme uses multiple algorithms together. The main idea is to transform the text before trying to compress it, in order to get a better compression ratio. The text is first put through a Burrows-Wheeler transform, then the move-to-front heuristic, and then Huffman encoded, before being transmitted. Upon receipt, the encoded value is put through reverse Huffman, reverse move-to-front, and then reverse Barrows-Wheeler transform.

The move-to-front heuristic is a way of encoding text using a dictionary. Starting with an ASCII dictionary, we apply the move-to-front method with the character to the dictionary - we move the character represented by the current code to the front of the dictionary for each character, the goal being to use the smallest possible code for each character to improve compressability later.


The Burrows-Wheeler transform is a transform where we compute all the cyclic shifts of a string, sort them lexicographically, write them out one below the other in a square matrix, and then output the last column of the matrix, as well as the row index within the grid that corresponds to the original text that is unshifted. If the text ends with a sentinel value, then we don't need to output the shift as well.

The advantage of doing this transform is that equal characters will generally get more grouped together, which makes for better compressibility. Any pattern that occurs frequently means that the first character appears consecutively in the output of the transform. The goal of this is to get a lot of repeating characters. This works really well with the move-to-front heuristic because with repeating characters, it gets a smaller encoding.

To do a reverse Burrows-Wheeler transform, we partially restore the matrix. First, we fill in the last column of a matrix with the given encoded value.. Note that the first column of the matrix is just a sorted version of the last column, since it is the most significant character we were sorting by while encoding. Therefore, we can fill in the first column as well.

Now we know the first character of the original text. Note that the first column is a cyclic shift of the word rightward by 1 character. Therefore, given any character, we know what the next character is (for duplicate characters, make sure we label each one with unique tags to avoid confusing them) - we look up a character in the last row, and the character after it is the corresponding one in the first row. So starting from the first character, we can follow the next character until we have the correct number of characters.

Basically, we sort the value to get all the length 2 substrings, and then use those substrings to calculate the original value.

The runtime for this is O(n + \abs{\Sigma}) - we can sort using count sort/bucket sort (since we have a fixed number of characters to sort) in O(\abs{Sigma} + n) time, and recover the length 2 substrings and the original text in O(n) time.

Decoding is pretty fast, but the encoding still seems rather inefficient. We can do better. Let T be the original string, and let S = TT. Clearly, all substrings of length \abs{T} in S are cyclic shifts of T. Clearly, there is a bijection between substrings of length \abs{T} and suffixes of length \abs{T} or greater, and the suffixes will sort in the same order as the substrings. Therefore, we can just sort the suffixes instead of the substrings, which will make things a lot simpler.

Now we want to sort these suffixes. This can easily be done by building a suffix tree/trie, and deleting the suffixes that are too short (or just don't add them in the first place while building the tree). We can build this in O(n) time using some clever algorithms. Now, every prefix in the trie of length \abs{T} is a length \abs{T} substring of the word.

Now we can traverse the trie in order, and for each leaf, we output the character at index \abs{T} - 1. Since the suffix tree traversal is equivalent to going through each length \abs{T} substring in sorted order, we basically calculate the substring the suffix corresponds to and output the relevant character. Alternatively, we could just store the character at index \abs{T} - 1 in the leaf while we are building the suffix tree, and directly output the leaves while traversing. With this new, faster scheme, we can perform BWT in O(n) time.

To summarize, bzip2 compression applies BWT to improve character grouping, applies MTF to get smaller numbers, and then applies Huffman encoding to get a variable length bitstring encoding for the text. Decoding is simply the opposite process. Bzip2 compresses really well, but is rather slow in practice. Overall, it is slightly over linear time since almost all the parts have linear runtime.

Course Overview

There isn't always a best solution for problems - in many cases, there will be different solutions that are better in different cases.

Final is in same format as midterm, questions ask to run algorithm on some inputs, apply knowledge to solve problems, or create algorithms.

Creative Commons License This work by Anthony Zhang is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Copyright 2013-2017 Anthony Zhang.