Algorithms.

```
Doug Stinson
Section 001
https://www.student.cs.uwaterloo.ca/~cs341/
```

Most materials will be on LEARN. Questions and answers can be posted on Piazza.

Study of the design and the analsis of algorithms - in particular, the correctness (proved via formal proofs) and efficiency (proved using time complexity analysis).

Useful metrics for algorithms are asymptotic complexity (best case/average case/worst case time/space complexity), number of specific computations used (like comparisons in sorting algorithms), whether an algorithm is the most efficient for a particular problem, etc.

Interesting things to know are lower bounds on possible algorithms to solve a particular problems, problems that cannot be solved by any algorithm (undecidability), and problems that cannot be solved efficiently by any algorithm, but can be solved (NP-hardness).

Useful design strategies we will go over are divide and conquer, greedy algorithms, dynamic programming, breadth-first and depth-first search, local search, and linear programming.

An example of an algorithm design is the maximum problem: given an array of integers \(A\), find the maximum integer in \(A\). A simple solution to this is as follows:

```
def find_max(array):
current = array[0]
for elem in array[1:]:
if elem > current: current = elem
return elem
```

This seems to be obviously correct, but we can also use program verification to prove it correct. In this case, we will use induction to prove the loop invariant - at the end of each iteration of the loop, `current`

is equal to the largest element encountered so far in the array. We'd check the base case for arrays of length 1, then prove the inductive hypothesis to formally verify the program.

Clearly, this algorithm is \(\Theta(n)\), since it loops over the entire array once. Formal time complexity analysis can also be done by noting that the loop body takes \(\Theta(1)\) time, and the loop runs \(\Theta(n)\) times.

We also know that this algorithm is asymptotically optimal, at least in terms of the number of comparisons done. Since a correct solution to the Maximum problem must look at every element in the array, it must therefore do at least \(n - 1\) comparisons. Since each comparison takes \(\Theta(1)\) time, the best possible algorithm must take \(\Omega(n)\) time. Let's formally prove this:

Suppose there was an algorithm that could determine the maximum element of an array using fewer than \(n - 1\) comparisons.

Let \(G\) be a graph such that each vertex in \(G\) corresponds to an element in \(A\), and each comparison done by a run of the algorithm between any two elements results in an edge between those two elements.

Clearly, there are \(n - 2\) edges or less, and \(n\) vertices.

Therefore, there are at least 2 components in \(G\), since the graph cannot be connected.

Clearly, the solution could be in any of the components, and can only be found by comparing them.

Therefore, the algorithm cannot exist.

An algorithm to find the minimum and the maximum element of an array can also trivially be designed, using \(2n - 2\) comparisons and \(\Theta(n)\) time. However, it is possible to design another algorithm that does fewer than \(2n - 2\) comparisons (though the time complexity might be worse).

One way to do this is to consider elements two at a time - we compare elements one pair at a time, the larger of the pair with the maximum, and the smaller of which with the minimum. That means we need 3 comparisons per pair, or \(\ceil{\frac 3 2 n} - 2\) in total, a ~25% improvement:

```
def find_min_max(array):
if array[0] < array[1]: min, max = array[0], array[1]
else: max, min = array[0], array[1]
for i in range(2, len(array), 2):
if array[i] < array[i + 1]: small, big = array[i], array[i + 1]
else: big, small = array[i], array[i + 1]
if big > max: max = big
if small < min: min = small
if len(array) % 2:
if array[-1] > max: max = array[-1]
if array[-1] < min: min = array[-1]
```

It can actually be proven, using graph theory, that the number of comparisons for this algorithm is optimal. However, it's too lengthy and complicated to cover here.

Suppose we want to determine whether and which three elements of an array of integers sum to 0. This is pretty easy to do in \(O(n^3)\) (specifically, \(n \choose 3\) runs of the inner loop) simply by checking all possible triples in the array. Basically, we pick two elements \(A[i]\) and \(A[j]\), then try to search for a \(A[k] = -(A[i] + A[j])\) using linear search. However, it is actually possible to do so in \(O(n^2 \log n)\) by sorting the array first, and searching for \(k\) using binary search instead.

In fact, it's possible to do even better by sorting the array, taking each \(A[i]\), and searching from both ends of the array inward for \(j\) and \(k\) such that \(A[j] + A[k] = -A[i]\). After each step in the search, we either move inward from the left if \(A[j] + A[k] < -A[i]\), or inward from the right if \(A[j] + A[k] > -A[i]\)

The pseudocode for the above algorithm looks like the following:

```
def better_3sum(array):
array = sorted(array)
result = []
for i in range(n - 2):
j, k = i + 1, n
while j < k:
triple_sum = array[i] + array[j] + array[k]
if triple_sum < 0: j += 1
elif triple_sum > 0: k -= 1
else:
result.append((i, j, k))
j += 1
k -= 1
```

Basically, this goes through each value of \(i\), and does an \(O(n)\) search for two values that would make it possible to have all three sum to 0. Proving this correct is left as an exercise to the reader - prove that \(k - j\) is monotonically decreasing, and the pairs cover all possible pairs that can possibly sum up to \(A[i]\). It's somewhat reminiscent of the array merge in merge sort.

This algorithm is \(O(n^2)\), since the search is \(O(n^2)\) and the sort is \(O(n \log n)\). There are actually even better algorithms, and the best currently known ones are \(O\left(n^2 \left(\frac{\log \log n}{\log n}\right)^2\right)\).

A **problem** is a computational task. A **problem instance** is the input for the computational task. The **problem solution** is the output. The **size of a problem instance** is a positive integer that is a measure of the size of the instance, and depends on the problem itself. The size is usually fairly intuitive, such as the size of an array for array sum, but sometimes it gets a bit more tricky.

An **algorithm** is a high level description of a computation. An algorithm **solves** a problem if it finds a valid solution for any instance in finite time. A **program** is an implementation of an algorithm using a computer language.

The running time of a program \(M\) for a problem instance \(I\) is \(T_M(I)\). The worst case running time for problem instances of size \(n\) for the program is \(T_M(n)\). The average case running time is \(T_M^{avg}(n)\).

The **worst case time complexity** of an algorithm \(A\) is \(f(n)\) such that a program \(M\) exists implementing \(A\) such that \(T_M(n) \in \Theta(f(n))\).

Running time can only be found by running the program on a computer. Complexity is independent, but we lose information such as the constant factors.

To get the time complexity of an algorithm, we can either use \(\Theta(f(n))\) throughout our analysis, or prove \(O(f(n))\) and then \(\Omega(f(n))\), which is often easier due to being able to make more assumptions for each separate proof.

The complexity of a loop in code is sum of the complexity of its body over each iteration of the loop. The complexity of consecutive operations is the sum of the complexities of the operations they are composed of.

Prove that \((\ln n)^a \in o(n^b)\) for any \(a\) and \(b\):

Let \(L = \lim_{n \to \infty} \frac{(\ln n)^a}{n^b} \lH \lim_{n \to \infty} \frac{a (\ln n)^{a - 1}}{b n^b} \lH \ldots \lH \lim_{n \to \infty} \frac{a!}{b^a n^b} = 0\).

By the limit order rule, \((\ln n)^a \in o(n^b)\).

Useful to know:

- \(\sum_{i = 0}^{n - 1} (a + di) = na + \frac{dn(n - 1)} 2\) - arithmetic sequence
- \(\sum_{i = 0}^{n - 1} ar^i = \begin{cases} a\frac{r^n - 1}{r - 1} &\text{if } r \ne 1 \\ na &\text{if } r = 1 \\ \end{cases}\) - geometric sequence
- \(\sum_{i = 0}^{n - 1} (a + di)r^i = \frac a {1 - r} - \frac{(a + (n - 1)d)r^n}{1 - r} + \frac{dr(1 - r^{n - 1})}{(1 - r)^2}\) -arithmetic-geometric sequence
- \(\sum_{i = 1}^n \frac 1 {i^2} = \frac{\pi^2} 6 = \Theta(1)\)
- \(_n = \sum_{i = 1}^n \frac 1 i \in \Theta(\log n)\) - harmonic sequence

The order of \(n!\) is \(\Theta(n^2 \sqrt n e^{-n})\). This is similar to Stirling's approximation, \(\sqrt{2 \pi n} n^n e^{-n}\) (this gets more and more accurate for larger values).

For order notation, the base of logarithmic terms don't matter because \(\log_{10} n = \frac 1 {\log_b 10} \log_b n\), and \(\frac 1 {\log_b 10}\) is a constant factor.

**Polynomial growth rates** are those in \(O(n^k), k \in \mb{R}\). For example, the best known algorithm for graph isomorphism is \(n^{\sqrt n \log_2 n}\), so it is not polynomial.

Consider the following program:

```
def useless(n):
for i in range(n):
j = i
while j >= 1:
j /= 2
```

Clearly, each iteration of the inner loop will run \(\Theta(\log n)\) times, so the overall time complexity is \(\sum_{i = 1}^n \Theta(\log i) = \Theta(\sum_{i = 1}^n \log i) = \Theta(\log(\prod_{i = 1}^n i)) = \Theta(\log(n!)) = \Theta(n \log n)\).

A **recurrence relation** specifies \(a_n\) in the infinite sequence of real numbers \(a_1, a_2, a_3, \ldots\) in terms of the previous terms \(a_1, \ldots, a_{n - 1}\) (\(a_1\) must therefore be a constant - the **initial value**). A **solution** to a recurrence relation is a closed form formula for \(a_n\) - one that does not depend on the previous values of \(a_1, \ldots, a_{n - 1}\). These are generally solved using techniques like guess and check or recursion trees, but there aren't any methods that are guaranteed to solve all recurrences, and in fact, reurrences may not necessarily even have solutions.

Recurrence relations can also be written using function notation, like \(T(1) = 2, T(n) = T(n - 1) + 1\).

The **guess and check** method involves computing enough \(a_n\) instances to guess the form of the solution (without any constants). Then, we solve for the constants using the computed values, and check if our solution is correct - if it is, we must prove it is correct using induction, and if not, we start over with a different guess.

Solve \(a_0 = 4, a_n = a_{n - 1} + 6n - 5\):

Guess and check: the first few elements of the sequence are \(a_0 = 4, a_1 = 5, a_2 = 12, a_3 = 25, a_4 = 44\).

This seems to be quadratic growth, so we guess \(an^2 + bn + c, a, b, c \in \mb{R}\) as the solution.

Solving for \(a, b, c\), \(c = 4, a + b + c = 5, 4a + 2b + c = 12\), we get \(a = 3, b = -2, c = 4\).

So we guess that \(a_n = 3n^2 - 2n + 4\). We can verify this for \(0 \le n \le 4\) as our inductive base case.

Assume for some \(k \in \mb{N}\) that \(a_k = 3k^2 - 2k + 4\).

Clearly, \(a_{k + 1} = 3k^2 - 2k + 4 + 6(k + 1) - 5 = 3(k + 1)^2 - 2(k + 1) + 4\).

So by induction, \(a_n = 3n^2 - 2n + 4\).

The **recurrence tree** method is often used for divide-and-conquer algorithms. This technique involves building a call tree for the recurrence. then summing up the results of the function on each level of the tree.

To construct a recurrence tree, we start with the timing function \(T(n)\), then add children for each recursive call. The recursive calls are then removed from the parent. This is repeated for the children, all the way until we reach the base case. Note that after each step of growing the tree, the sum of all the nodes are always equivalent to \(T(n)\).

Then, we sum the nodes at each level in the tree. The sum of all these sums is equal to \(T(n)\).

Use the recurrence tree method to find the running time of merge sort:

Clearly, \(T(n) = \begin{cases} 2T\left(\frac n 2\right) + cn &\text{if } n > 1 \wedge n \text{ is a power of 2} \\ d &\text{if } n = 1 \\ \end{cases}\) where \(c, d\) are constants.

Construct a recursion tree: start with the root node \(N\) with value \(T(n)\), then add 2 children, \(N_1, N_2\), both \(T\left(\frac n 2\right)\), and replace \(N\)'s value with \(cn\).

What we get is a binary tree with every internal node having 2 children, and value \(c\frac n {2^i}\) (where \(i\) is the level in the tree, where 0 is the root level), and each leaf node has value \(d\).

Let \(n = 2^j\). Clearly, the height of the tree must then be \(j + 1\), with \(j\) interior levels. Clearly, at the bottom level there are \(2^j\) leaf nodes, since each level \(i\) has \(2^{j - i}\) nodes. The sum of the bottom level is therefore \(2^j d = dn\).

Clearly, at each internal level \(i\) each node has value \(c\frac n {2^i}\). Since there are \(2^i\) nodes, the sum of each interior level is \(c2^i\frac n {2^i} = cn\).

So \(T(n) = dn + jcn = dn + cn \log_2 n = \Theta(n \log n)\).

The **master theorem** is a generalized formula for solving certain forms of recurrences. One thing it says is that given \(T(n) = aT(\frac n b) + \Theta(n^y)\) where \(a \ge 1, b > 1\), and \(n\) is a power of \(b\), then \(T(n) \in \begin{cases} \Theta(n^x) &\text{if } y < x \\ \Theta(n^x \log n) &\text{if } y = x \\ \Theta(n^y) &\text{if } y > x \\ \end{cases}\) where \(x = \log_b a\).

Also, the exact value is \(T(n) = da^j + cn^y \sum_{i = 0}^{j - 1} \left(\frac a {b^y}\right)^i\), or more simply \(T(n) = dn^x + cn^y \sum_{i = 0}^{j - 1} r^i\) where \(x = \log_b a\) and \(r = \frac a {b^y} = b^{x - y}\).

This can be proved by constructing the recursion tree, figuring out the geometric sequence for the tree sums, and then solving for its value for each of the three cases.

A more general version, which will not be proved here, is that given \(T(n) = aT(\frac n b) + f(n)\) where \(a \ge 1, b > 1\), and \(n\) is a power of \(b\), then \(T(n) \in \begin{cases} \Theta(n^x) &\text{if } f(n) \in O(n^{x - \epsilon}) \text{ for some } \epsilon > 0 \\ \Theta(n^x \log n) &\text{if } f(n) \in \Theta(n^x) \\ \Theta(f(n)) &\text{if } \frac{f(n)}{n^{x + \epsilon}} \text{ is an increasing function of } n \text{ for some } \epsilon > 0 \text{ after } n \text{ is greater than some } n_0 \\ \end{cases}\) where \(x = \log_b a\).

Solve \(T(1) = 1, T(n) = 3T(\frac n 4) + n \log n\):

Let \(a = 3, b = 4, f(n) = n \log n\), so \(x = \log_4 3\) and \(f(n) / n^{x + \epsilon}\) is an increasing function of \(n\) for \(\epsilon = 1 - \log_4 3\) since \(f(n) / n^{x + \epsilon} = \log n\).

So \(T(n) \in \Theta(n \log n)\), by the master theorem.

The main task when using the master theorem to prove time complexities is choosing the right \(\epsilon\) to make one of the cases valid. For example, the master theorem cannot be used for something like \(T(1) = 1, T(n) = 2T(\frac n 2) + n \log n\), since none of the cases apply. Instead, we can use the recurrence tree method to solve the recurrence directly:

Each level \(i\) (from the bottom) of the tree has \(2^i\) nodes, each with value \(2^{j - i}(j - i)\) where \(n = 2^j\).

So at each level, the sum is \(2^i 2^{j - i} (j - i) = (j - i) 2^j\).

So the sum of the tree is \(\sum_{i = 0}^j (j - i) 2^j = j^2 2^j - 2^j \sum_{i = 0}^j i = 2^j\left(j^2 - \frac{j(j + 1)}{2}\right) = \Theta(n \log^2 n)\).

Divide and conquer is a design strategy for algorithms where we divide a problem into smaller subproblems, solve those individually, then combine them back together to get a good result. Formally:

- Divide the problem instance \(I\) into smaller subproblems \(I_1, \ldots, I_n\).
- Solve \(I_1, \ldots, I_n\) recursively to get sollutions \(S_1, \ldots, S_n\).
- Use \(S_1, \ldots, S_n\) to compute \(S\) - combine the results. This is usually the hardest step.

For example, mergesort splits an array into two subarrays, and then recursively sorts those. Then, the merge is performed to combine the two sorted subarrays.

The time complexity of this is \(T(n) = \begin{cases} T(\ceil{\frac n 2}) + T(\floor{\frac n 2}) + cn &\text{if } n > 1 \\ d &\text{if } n = 1 \\ \end{cases}\) for some constants \(c, d\). This is called the **exact recurrence**. If we remove the ceilings and floors, we get a **sloppy recurrence** \(T(n) = \begin{cases} 2T(\frac n 2) + cn &\text{if } n > 1 \\ d &\text{if } n = 1 \\ \end{cases}\), which only works the same when \(n\) is a power of 2, but is a lot simpler to work with.

The master theorem can prove that mergesort is \(\Theta(n \log n)\) for the sloppy recurrence, and therefore only for when \(n\) is a pwoer of 2. However, if we want to get the time complexity for all \(n\), we need to use induction on the exact recurrence.

Suppose we have a set of 2D coordinates/points. One point **dominates** another if it has a greater or equal X and Y coordinate than the other. We want to find those points that are non-dominated - the points that are not dominated by any other point. The non-dominated points form a sort of boundary below which all the dominated points lie.

Clearly, this can easily be solved simply by checking every pair of points to see which dominate which, but this would take \(\Theta(n^2)\) time.

A better solution is to sort the points by their X coordinate, and then apply divide and conquer.

Dividing larger sets of points \(S\) is also easy - we can just split the sorted points into two sets \(S_1, S_2\), and then recursively solve the problem. It is the combining step that is the difficult step.

Given a point and the highest point on its right, it is easy to determine whether it is dominated. This is our base case

Clearly, no point in \(S_1\) can dominate a point in \(S_2\), since the X coordinates of points in \(S_1\) is strictly less than points in \(S_2\). So we find the point in \(S_2\) with the highest Y coordinate, and all points in \(S_1\) with Y coordinate at or below this value are guaranteed to be dominated by at least one point in \(S_2\). In this way, we can eliminate all the points in \(S_1\) that are dominated by points in \(S_2\) in \(O(n)\) time - this is the combining step.

In code, this looks like:

```
def non_dominated(points):
points = sorted(points, key=lambda p: p.x)
if len(points) == 1: return {points[0]}
pivot = floor(len(points) / 2)
left = non_dominated(points[:pivot])
right = non_dominated(points[pivot + 1:])
i = 0
while i < len(left) and left[i].y > right[1].y:
i += 1
return left[:i] + right
```

Clearly, the time complexity of this algorithm is \(T(n) = T\left(\floor{\frac n 2}\right) + T\left(\ceil{\frac n 2}\right) = O(n \log n)\) (the sloppy version of this can be proven using the master theorem).

Without divide and conquer, we could also just sort the list, and then iterate though it from right to left, keeping track of the largest Y axis coordinate seen so far and eliminating points at or below this Y coordinate:

```
def non_dominated(points):
largest = float("-inf")
for point in reversed(sorted(points, key=lambda p: p.x)):
if point.y <= largest:
yield point
else:
largest = point.y
```

Suppose we have a set \(Q\) of \(n\) distinct 2D points in Euclidean space. We want to find the two points that are closest together - the points \(u, v\) for which \(\sqrt{(u_x - v_x)^2 + (u_y - v_y)^2}\) is minimised.

Clearly, minimising \(\sqrt{(u_x - v_x)^2 + (u_y - v_y)^2}\) is the same as minimising \((u_x - v_x)^2 + (u_y - v_y)^2\). We will again sort the points in \(Q\) by X coordinate, taking \(\Theta(n \log n)\) time.

The dividing step is again trivial to get two partitions \(Q_1, Q_2\), but the combining step takes some work.

If we are given the distance \(\delta\) between the closest pair of points seen so far in the left and right partitions, and a point \(p\), we know that if this point is in the closest pair of points, the other point must have X coordinate within \(\delta\) of this point's X coordinate. Therefore, the other point must be in the **critical strip** - a section of the plane from \(p_x - \delta\) to \(p_x + \delta\) and extending infinitely both up and down.

We now want to find the closest pair between the left and right partitions, if it is less than the .

There is also a lemma that says that if a critical strip \(R\) of \(Q\) is sorted by Y coordinate, and \(R[j]\) and \(R[k]\) have distance less than \(\delta\) where \(j < k\), then \(k \le j + 7\). Here, \(\delta\) is the smaller of the minimum distance between points in \(Q_1\), and the minimum distance between points in \(Q_2\). This can be proven by dividing the strip around the point as squares of side length \(\frac \delta 2\), and then showing that no two points can be in the same square. In other words, within a critical strip and given \(\delta\), the point closest to a given point can be determined in 7 comparisons or less.

```
def closest_pair(points, start, end):
if start == end: return float("inf")
pivot = floor((left + right) / 2)
closest_left = closest_pair(points, left, pivot)
closest_right = closest_pair(points, pivot + 1, right)
closest = min(closest_left, closest_right)
candidates = select_candidates(left, right, closest, points[pivot].x) # find points in the critical strip
candidates = sorted(candidates, key=lambda p: p.y)
closest = check_candidates(candidates, closest)
return closest
```

The time complexity of this is \(O(n \log n \log n)\). We can improve this by eliminating the Y coordinate sort of the candidates. One of them is to sort all the points \(Q\) by Y coordinate to begin with, and then use that list when checking candidates, and then adding the necessary checks that make it work. Alternatively, we could replace the Y coordinate sort with a merge, and maintaining the candidates in sorted order.

```
def closest_pair(points, start, end):
if start == end: return float("inf")
pivot = floor((left + right) / 2)
closest_left = closest_pair(points, left, pivot)
closest_right = closest_pair(points, pivot + 1, right)
closest = min(closest_left, closest_right)
merge_in_place(points, start, pivot, end)
candidates = select_candidates(left, right, closest, points[pivot].x) # find points in the critical strip
closest = check_candidates(candidates, closest)
return closest
```

The **size** of an integer is the number of bits that we use to represent it, rather than the value of the integer itself - this is basically just \(\ceil{\log_2 n}\). The **bit complexity** of algorithms operating on integers is therefore the complexity as a function of the sizes of those integers.

Suppose we have 2 positive binary integers, \(X = [X_{k - 1}, \ldots, X_0]\) and \(Y = [Y_{k - 1}, \ldots, Y_0]\). We want to compute the \(2k\)-bit product \(Z = [Z_{2k - 1}, \ldots, Z_0]\). Naively, we can use the simple binary multiplication algorithm, to shift \(Y\) by each position \(0 \le i < k\) and add those shifted values for which \(X_i = 1\), in \(O(k^2)\). Note that the fastest way to add two integers is \(\Theta(k)\).

We can do better than this by using divide and conquer, by using \(XY = (2^{\frac k 2} X_L + X_R)(2^{\frac k 2} Y_L + Y_R) = 2^k X_L Y_L + 2^{\frac k 2}(X_L Y_R + X_R Y_L) + X_R Y_R\), then solving for those four multiplications between \(X_L, X_R, Y_L, Y_R\) (multiplying the powers of 2 are simply shifts, so they're basically free):

```
def multiply(x, y):
if len(x) == 1: return x[0] * y[0]
x_left, x_right = x[:len(x) // 2], x[len(x) // 2:]
y_left, y_right = y[:len(y) // 2], y[len(y) // 2:]
z_top = multiply(x_left, y_left)
z_middle = add(multiply(x_left, x_right), multiply(x_right, y_left))
z_bottom = multiply(x_right, y_right)
z = add(
shift_left(z_top, len(x)),
shift_left(z_middle, len(x) // 2),
z_bottom
)
return z
```

Using the master theorem, we can show that this is takes \(\Theta(n^2)\) time - no improvement over the naive algorithm. However, this is a good base for an improved version.

Consider \((X_L + X_R)(Y_L + Y_R) = X_L Y_L + X_L Y_R + X_R Y_L + X_R Y_R\). So \(X_L Y_R + X_R Y_L = (X_L + X_R)(Y_L + Y_R) - X_L Y_L - X_R Y_R\). Since we need to calculate \(X_L Y_L\) and \(X_R Y_R\) anyways, we can compute \(X_L Y_R + X_R Y_L\) with just one multiplication by computing \((X_L + X_R)(Y_L + Y_R)\).

So using this:

```
def multiply(x, y):
if len(x) == 1: return x[0] * y[0]
x_left, x_right = x[:len(x) // 2], x[len(x) // 2:]
y_left, y_right = y[:len(y) // 2], y[len(y) // 2:]
z_top = multiply(x_left, y_left)
z_middle = multiply(add(x_left, x_right), add(y_left, y_right))
z_bottom = multiply(x_right, y_right)
z = add(
shift_left(z_top, len(x)),
shift_left(z_middle, len(x) // 2) - z_top - z_bottom,
z_bottom
)
return z
```

Now, we only solve 3 subproblems instead of 4. This gives us a time complexity of around \(\Theta(k^{\log_2 3})\), or about \(\Theta(k^{1.59})\).

Even better algorithms exist. One is Toom-Cook, which splits \(X\) and \(Y\) into 3 pieces each, and solves 5 subproblems, all in \(O(k^{\log_3 5})\). The Stronhage-Strassen method takes \(O(k \log \log \log k)\) time, using Fourier transforms. The Furer method is the most advanced as of the present, using \(O(k \log k 2^{3 \log^* k})\) time where \(\log^* k\) is the inverse Ackermann function.

We will now assume that integer multiplication can be done in constant time, to avoid having to consider bit complexity.

Suppose we have two \(n\) by \(n\) matrices \(A, B\). Clearly, the naive algorithm for multiplying them is \(\Theta(n^3)\), since we just take the dot product of the corresponding row of \(A\) and the corresponding column of \(B\) - the dot product takes \(\Theta(n)\), and there are \(\Theta(n^2)\) cells in the result.

We can use the same technique as for integer multiplcation - we break each \(n\) by \(n\) matrix into 4 \(\frac n 2\) by \(\frac n 2\) matrices, multiply those, then combine those together, then again try to remove subproblem calculations where possible.

If we split \(A, B\) into 4 blocks \(A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}, B = \begin{bmatrix} e & f \\ g & h \end{bmatrix}\), then \(AB = \begin{bmatrix} ae + bg & af + bh \\ ce + dg & cf + dh \end{bmatrix}\). If we use divide and conquer to solve these 8 multiplications in the block version, it still takes \(\Theta(n^3)\). However, the problem is now in a form that is useful for optimizing, by reducing the number of sub-problems.

Strassen's method notes that if \(P_1 = a(f - h), P_2 = (a + b)h, P_3 = (c + d)e, P_4 = d(g - e), P_5 = (a + d)(e + h), P_6 = (b - d)(g + h), P_7 = (a - c)(e - f)\), then \(ae + bg = P_5 + P_4 - P_2 + P_6, af + bh = P_1 + P_2, ce + dg = P_3 + P_4, cf + dh = P_5 + P_1 - P_3 - P_7\).

So \(AB = \begin{bmatrix} P_5 + P_4 - P_2 + P_6 & P_1 + P_2 \\ P_3 + P_4 & P_5 + P_1 - P_3 - P_7 \end{bmatrix}\), and since there are only 7 multiplications needed to calculate \(P_1, \ldots, P_7\), we only have 7 sub-problems to solve. As a result, the algorithm is \(\Theta(n^{\log_2 7})\) or around \(\Theta(n^{2.81})\), since the recurrence is \(T(n) = 7T\left(\frac n 2\right) + \Theta(n^2)\) (solve this using the master theorem).

Research into this problem has resulted into better algorithms. The current state of the art is the Coppersmith-Winograd algorithm, taking around \(O(n^{2.373})\) time.

Suppose we have an array \(A\) of size \(n\), and we want to find the \(k\)th smallest integer, where the smallest element is found when \(k = 1\). The median algorithm is a special case of this where \(k = \ceil{\frac n 2}\).

Naively, we could just sort the array and pick the \(k\)th element, all in \(\Theta(n \log n)\) time. Another naive approach is to use a modified selection sort (repeatedly finding the smallest element in the remaining part of the array, and putting it at the end of the sorted array), where we stop after \(k\) moves, all in \(\Theta(kn)\) time - this is linear if \(k\) is constant, but quadratic if it is not. Combining the two, we can use a modified heapsort, which can solve the problem in \(\Theta(n + k \log n)\) time.

There's a make-up lecture on November 21st or so.

Using divide and conquer, we might do something similar to quicksort, called quickselect - we pick a pivot, partition the elements, and then recursively quickselect within the partition that contains the desired element:

```
def quickselect(array, k):
pivot = array[0]
partition1, partition2 = [x for x in array if x < pivot], [x for x in array if x > pivot]
if k - 1 == len(partition1):
return pivot
elif k - 1 < len(partition1):
return quickselect(partition1, k)
else:
return quickselect(partition2, k - len(partition1) - 1)
```

Note that unlike quicksort, there's only one recursive call in each invocation. On average, the pivot will be good about 50% of the time, where a good pivot is one that is within the 25th and 75th percentile inclusive. So on average, we would expect that every two calls, we would get a good pivot. So on average, every other call would have partitions that are of size at most 75% of the array's size. Therefore, on average the size of the subproblem would be reduced by 25% on every other call, which makes this function \(\Theta(n)\) average case.

Note that in the worst case, the pivot would be at one of the extremes of the percentiles, and so the subproblem would only be reduced by a constant amount, and there are \(n\) calls. Therefore, the function is \(\Theta(n^2)\) worst case.

However, if we could always choose a good pivot, then we can ensure that the function is \(\Theta(n)\) worst case as well. Doing so is a bit more clever, but it is possible, using the **median of medians** algorithm. First, we assume that there are at least 15 elements, so \(n = 10r + 5 + \theta\), where \(r \ge 1, 0 \le \theta \le 9\).

We now split \(A\) into subarrays of exactly 5 elements: \(B_1, \ldots, B_{2r + 1}\) (we ignore the last \(\theta\) elements entirely here), and then find the median of each \(B_i\), as \(m_i\). Then, we use a selection algorithm to find the median of this array of medians.

We claim that \(y\) is always a good pivot - that is is guaranteed to be between the 25th and 75th percentile inclusive. We will prove this later, but first, we will integrate it into quickselect:

```
def median_of_medians_quickselect(array, k):
# generate a good pivot
if len(array) < 15: return sorted(array)[k]
write $n = 10r + 5 + \theta$
construct $B_1, \ldots, B_{2r + 1}$
find medians $M = [m_1, \ldots, m_{2r + 1}]$
pivot = median_of_medians_quickselect(M, r + 1)
# do the rest of the steps from quickselect
partition1, partition2 = [x for x in array if x < pivot], [x for x in array if x > pivot]
if k - 1 == len(partition1):
return pivot
elif k - 1 < len(partition1):
return quickselect(partition1, k)
else:
return quickselect(partition2, k - len(partition1) - 1)
```

We have two recursive calls now, since the median of medians algorithm also required us to do a selection to find `pivot`

(\(y\)).

There are \(r + 1\) medians at or below \(y\) in \(m_1, \ldots, m_{2r + 1}\) - let these be represented \(m_{i_1}, \ldots, m_{i_{r + 1}}\). Clearly, in each \(B_{i_1}, \ldots, B_{i_{r + 1}}\), there are at least 3 out of the 5 elements that are less than or equal to \(y\) (since they are less than or equal to the median). Therefore, there must be at least \(3(r + 1)\) elements in total that are less than or equal to \(y\). Since there are \(10r + 5\) elements (since we ignored \(\theta\)), the ratio of elements below \(y\) to the total number of elements is \(\frac{3(r + 1)}{10r + 5}\), or basically \(\frac 3 {10}\). Therefore, 30% or more of the elements must be less than or equal to \(y\).

Using a similar argument, but with the elements at or above \(y\) rather than at or below, we can prove that 30% or more of the elements must be greater or equal to \(y\). Therefore, \(y\) is within the 30th to 70th percentile inclusive.

Therefore, the sloppy recurrence is \(T(n) \le T(\frac n 5) + T(\frac{7n}{10}) + \Theta(n)\), which we can solve using the recurrence tree method - each node has two children, which are the node's value times \(\frac 1 5\) and \(\frac 7 {10}\), respectively, and the root node's value is \(n\). When we draw the tree, we see that each level \(i\) from the root (level 0) has sum \(\left(\frac 9 {10}\right)^in\), and the sum of all levels to infinity converges to \(10n\), so \(T(n) \in \Theta(n)\).

The exact recurrence for this is \(T(n) \le T(\floor{\frac n 5}) + T(\floor{\frac{7n + 12}{10}}) + \Theta(n)\). This accounts for the rounding down in the partitioning, and accounts for the last \(\theta\) values we were ignoring earlier.

Greedy algorithms are often used for **optimization problems**, which are problems where the goal is to find a solution that both satisfies the **problem constraints**, and maximizes/minimizes an **objective/profit/cost function**.

**Feasible solutions** are those that satisfy the problem constraints, and given a problem instance \(I\), are represented as \(\mathrm{feasible}(I)\). The objective function is in \(\mathrm{feasible}(I) \to \mb{R}\).

For greedy algorithms, it is generally the proof of correctness that is the hard part. Greedy algorithms build up partial solutions, making locally optimal decisions at each step, with the goal of getting a globally optimal result.

A feasible solution should be writeable as an \(n\)-tuple \(\tup{x_1, \ldots, x_n}\). A **partial solution** is an \(i\)-tuple \(\tup{x_1, \ldots, x_i}\) where \(i < n\) and none of the problem constraints are violated (note that this is not necessarily a feasible solution). A partial solution isn't always extendible to a feasible solution - some partial solutions are dead ends, and greedy algorithms can't get an optimal solution here.

A **choice set** for a partial solution \(\tup{x_1, \ldots, x_i}\) is the set \(\mathcal{X}\) of all elements we can append to \(\tup{x_1, \ldots, x_i}\) to make another, longer partial or feasible solution \(\tup{x_1, \ldots, x_i, y}\) - the set of all possible \(y\).

For any \(y\) in the choice set \(\mathcal{X}\), we have a **local evaluation criterion** \(g(y)\), that measures the profit/cost of including \(y\) in the current partial solution. **Extension** of a partial solution is the process of adding the \(y\) with the highest \(g(y)\) value to the partial solution. A greedy algorithm, in general, is repeated extension until we have a feasible solution \(X\).

Greedy algorithms do no backtracking or lookahead - they only consider the elements of the choice set at each step. That means they are often faster than algorithms that do backtracking, but the solution is possibly not as good. When implementing greedy algorithms, it is often efficient to do a preprocessing step to compute the local evaluation criterion, like sorting the elements of an array.

For some greedy algorithms, it is possible to always obtain an optimal solution. However, these proofs are often rather difficult.

Suppose we have a set \(A\) of pairs representing intervals. What is the largest subset \(B\) of these pairs such that none of the intervals intersect?

One greedy strategy might be to repeatedly choose the interval with the earliest start time that doesn't intersect with any intervals in the current partial solution. Another might be to choose the interval with the smallest duration that doesn't intersect. A third is to choose the interval with the earliest end time that doesn't intersect.

Clearly, the first one is not valid because we can make a counterexample: an interval that starts early but is really long is chosen over multiple non-intersecting short intervals that start a bit later, which means this strategy doesn't always give the optimal solution.

It is also possible to find a counterexample for the second strategy - a short interval that intersects with two non-intersecting longer intervals is chosen over those two, which means this strategy also doesn't always give the optimal solution.

The last strategy is correct. We can implement it as follows:

```
def select_interval(intervals):
intervals = sorted(intervals, key=lambda interval: interval[1])
solution = [intervals[0]]
latest_finish = intervals[0][1]
for interval in intervals[1:]:
if interval[0] >= latest_finish:
solution.append(interval)
latest_finish = interval[1]
return solution
```

Most correctness proofs for greedy algorithms use strong induction. The proof for the greedy interval selection algorithm is relatively straightforward:

Let \(A\) be an array of intervals, sorted by their end values.

Let \(\mathcal{B} = \tup{A_{i_1}, \ldots, A_{i_k}}\) where \(i_1 < \ldots < i_k\) be the greedy solution.

Let \(\mathcal{O} = \tup{A_{j_1}, \ldots, A_{j_k}}\) where \(j_1 < \ldots < j_l\) be the optimal solution.

Clearly, \(k \le l\) since \(O\) is the optimal solution, and it is impossible for the greedy solution to be better than optimal.

We want to first prove that \(i_m \le j_m\) for all \(1 \le m \le k\), which means that the ends of the intervals in the greedy solution are always before the ends of the corresponding intervals in the optimal solution.

For \(m = 1\), \(i_1 \le j_1\) since \(i_1 = 1\) - the greedy algorithm always initially chooses the interval that ends earliest.

Assume \(i_{m - 1} \le j_{m - 1}\). So in the greedy solution, the \(m - 1\)th interval ends no later than the \(m - 1\)th interval in the optimal solution.

Suppose \(i_m > j_m\). Clearly, the optimal solution's \(m\)th interval ends earlier than the greedy algorithm's \(m\)th interval, since \(A\) is sorted by end values.

However, this is not possible, since that means the greedy algorithm didn't choose the earliest end time. So \(i_m \le j_m\).

Suppose \(k < l\). Clearly, the optimal solution must have at least \(k + 1\) intervals.

Clearly, \(i_k \le j_k\). So the optimal solution's \(k + 1\)th interval must be in the choice set for the greedy algorithm at the \(k\)th step.

However, this isn't possible since the greedy algorithm didn't choose that interval, so \(k \ge l\).

Since \(k \le l\) and \(k \ge l\), \(k = l\) and the greedy solution is the same length as the optimal solution, and so the greedy algorithm must be optimal.

The time complexity of this algorithm is easy to prove - it's \(\Theta(n \log n)\) for the sort and \(\Theta(n)\) for the traversal, so \(\Theta(n \log n)\) overall.

Suppose we want to partition \(A\) into disjoint sets of non-intersecting intervals, minimizing the number of these sets. This is called the interval coloring problem - we want to assign a color to each interval such that all intervals with a given color do not intersect each other.

An optimal greedy strategy for this is to sort the intervals by start values ascending, and for each interval in order, attempt to insert each interval into existing sets (assign it an existing color), and create a new set if not possible (assign it a new color):

```
def partition_intervals(intervals):
intervals = sorted(intervals, key=lambda interval: interval[0])
partitions = [{intervals[0]}]
partition_latest_finishes = [intervals[0][1]]
for i in range(1, len(intervals)):
interval = intervals[i]
for partition_index, latest_finish in enumerate(partition_latest_finishes): # try to assign interval to existing partitions
if latest_finish <= interval[0]:
partitions[partition_index].add(interval)
partition_latest_finishes[partition_index] = interval[1]
break
else: # no existing colors can accept the interval
partition_index = len(partition_latest_finishes)
partitions[partition_index] = {interval}
partition_latest_finishes.append(interval[1])
return partitions
```

The correctness proof for this algorithm is a bit unusual:

Suppose the greedy algorithm uses \(D\) colors. Since we only create new colors if existing colors can't accept the interval, for every color \(c < D\), there exists an interval \(A_c\) that overlaps \(A_i\).

That means that there are at least \(D\) intervals that all overlap each other, since they all overlap \(A_i\).

Therefore there must be at least \(D\) colors, which means that having \(D\) colors is optimal.

Clearly, this algorithm is \(O(n^2)\), or more specifically \(O(nD)\) since \(D\) is \(O(n)\). We can actually do better by using a different data structure for the inner loop - the inner loop simply is trying to find the lowest end value in a partition, so if we use a heap for the partitions instead, we can have the algorithm run in \(O(n \log D)\), or more simply, \(O(n \log n)\):

```
import heapq
def partition_intervals(intervals):
intervals = sorted(intervals, key=lambda interval: interval[0])
partitions = [{intervals[0]}]
partition_latest_finishes_heap = [(intervals[0][1], 0)]
for i in range(1, len(intervals)):
interval = intervals[i]
latest_finish, partition_index = heapq.heappop(partition_latest_finishes_heap)
if latest_finish <= interval[0]: # try to assign interval to the existing partition that ends earliest
partitions[partition_index].add(interval)
heapq.heappush(partition_latest_finishes_heap, (interval[1], partition_index))
else: # no existing colors can accept the interval
partition_index = len(partition_latest_finishes_heap)
partitions[partition_index] = {interval}
heapq.heappush(partition_latest_finishes_heap, (interval[1], partition_index))
return partitions
```

Consider the **knapsack problem**: given profits \(P = [P_1, \ldots, P_n]\) and weights \(W = [W_1, \ldots, W_n]\), and a capacity \(M\), all positive integers, find the \(x = \tup{x_1, \ldots, x_n}\) that maximises \(\sum P_i x_i\) given that \(\sum W_i x_i \le M\).

There are several variations of this problem: for the 0-1 knapsack problem, \(x_i\) is either 0 or 1 (we either take the item or don't); in the rational knapsack problem, \(0 \le x_i \le 1\) (we can take a fraction of the item from none to all).

As it turns out, the rational knapsack problem can be solved greedily (but not other knapsack problems; 0-1 knapsack problems are NP-complete, for example) - we simply take as much as possible of each item, considering them in decreasing order of profit divided by weight:

```
def greedy_rational_knapsack(profits, weights, capacity):
items = sorted(zip(profits, weights), lambda x: x[0] / x[1])
current_weight = 0
for item in items: # consider items in decreasing order of profit to weight ratio
profit, weight = item
if current_weight + weight <= capacity: # we have enough space to take all of the item, so take it
result.append(1)
current_weight += weight
else: # we can't take all of the item, but we'll take enough to fill the rest of the knapsack
result.append((capacity - current_weight) / weight)
break
return result
```

Proof of correctness:

Suppose the greedy algorithm is not optimal. Then for certain inputs the output must differ from optimal algorithms.

Let the result of the greedy algorithm be \(G = [G_1, \ldots, G_n]\) and a result \(B = [B_1, \ldots, B_n]\) of all optimal algorithms for this problem such that \(B\) and \(G\) agree for as many items as possible, disagreeing at index \(d\), so \(B = [G_1, \ldots, G_{d - 1}, B_d, \ldots, B_n]\).

The greedy algorithm always chooses the item with the highest profit per weight, so the actual profit from \(G_1, \ldots, G_d\) is at least as much as the actual profit of \(B_1, \ldots, B_d\), because we will pick the most.

Construct \(B' = [B_1, \ldots, B_{d - 1}, G_d, B_{d + 1}, \ldots, B_n]\). Clearly, \(B'\) is at least as profitable as \(B\), but it agrees with \(G\) for 1 more element than \(B\), a contradiction.

Therefore, the greedy algorithm is optimal.

Consider the **coin changing problem**: given a list of coin denominations \(d_1, \ldots, d_n\) and a positive integer \(T\) (target sum), find \(A = [A_1, \ldots, A_n]\) such that \(T = \sum a_i d_i\) and \(N = \sum A_i\) is minimised. In other words, we are trying to find the smallest set of coins that add up to a certain value.

We can do this simply by considering the coins from largest denomination to smallest, filling up the sum as close to the target sum as possible. Although this works for a coin system like \(1, 5, 10, 25, 100, 200\), this isn't always optimal: the coin system \(12, 5, 1\) doesn't give the optimal solution for something like a target sum of 15. The study of where these algorithms are optimal is an area of active research.

Consider the **stable marriage problem**: suppose we have sets of \(n\) men and \(n\) women, \(M = [M_1, \ldots, M_n], W = [W_1, \ldots, W_n]\). Define \(\mathrm{pref}(M_i, k)\) is the \(k\)th favourite \(w \in W\) for \(M_i\), and \(\mathrm{pref}(W_i, k)\) is the \(k\)th favourite \(m \in M\) for \(W_i\). Find the matching \(\set{(M_{i_1}, W_{j_1}), \ldots, (M_{i_n}, W_{j_n})}\) such that there does not exist any \((M_u, W_v)\) that are not in the set, but prefer each other to their matches in the set - a **stable matching**.

Formally, we want to find a matching such that

The Gale-Shapley algorithm is an algorithm for this problem, for which Gale and Shapely won a Nobel prize for economics. In this algorithm, men propose to women from their most preferable to least preferable, and women accept proposals if not engaged or if a proposal is better than their currently accepted proposal (the new proposal replaces the old one), and women reject proposals otherwise.

Basically, the men propose starting from their most preferable to least preferable, and the women are proposed to starting from the least preferable to most preferable - the men always move down in their preferences, and the women always move up in their preferences:

```
def woman_prefers(woman, preference_function, man1, man2):
i = 1
while True:
man = preference_function(woman, i)
if man == man1: return True
if man == man2: return False
i += 1
def gale_shapley(men, women, preference_function):
engagements = {}
men_ranks = {man: 0 for man in men}
unengaged_men = set(men)
while unengaged_men:
man = next(unengaged_men.keys())
men_ranks[man] += 1 # next rank in preferences
woman = preference_function(man, men_ranks[man])
if woman not in engagements or woman_prefers(woman, preference_function, man, engagements[woman]):
if woman in engagements: unengaged_men.add(engagements[man])
unengaged_men.remove(man)
engagements[woman] = man
return set(engagements.items())
```

This algorithm is good for the men, but not the women - suiters can ask any of the reviewers, but reviewers can only pick from the suiters that proposed to them. For example, given 3 men and 3 women with preferences \(M_1: W_2, W_1, W_3; M_2: W_3, W_2, W_1; M_3: W_1, W_3, W_2; W_1: M_2, M_1, M_3; W_2: M_3, M_2, M_1; W_3: M_1, M_3, M_2\), all men get their first choices, while all women get their last choice. The men get the best possible match that will accept them, while the women get the worst possible match they will accept.

This algorithm always terminates because women cannot reject offers if unengaged, there is a man for every woman, and at some point a man will ask that woman. In the worst case, the men will all have their last choise, and have to propose to all \(n\) of their preferences. Since there are \(n\) men, the algorithm takes \(O(n^2)\) proposals, specifically \(n^2 - n + 1\) or less.

Proof of correctness:

We want to prove that there is no man that prefers a woman that also prefers that man. Suppose Alice prefers Bob over her current partner, and Bob also prefers Alice over his current partner.

If Bob prefers Alice, then he must have proposed to her before his current partner.

Alice must have rejected the proposal, since Bob is engaged to a less preferable choice.

However, Alice is with a less preferable partner, which means Alice should have accepted Bob and rejected her current partner when he proposed.

This is a contradiction, so Alice and Bob cannot exist, and the matching is stable.

To implement the Gale-Shapely algorithm more efficiently, we can represent the set of unengaged men as a linked list, represent the men's preferences using arrays/linked lists, and represent engagements using an array. We could also precompute a lookup table for \(M_j = \mathrm{pref}(W_i, k)\) to look up \(j\) given \(W_i\) and \(M_j\).

Consider a naive function to compute the \(n\)th Fibonacci number, based on its recurrence definition:

```
def fibinacci(n):
if n <= 1: return n
return fibinacci(n - 1) + fibinacci(n - 2)
```

However, this is extremely inefficient. Using the recurrence tree method, we find that it takes \(2f_{n + 1}\) total calls of the function where \(f_n\) is the \(n\)th Fibonacci number. Using the closed form for the Fibonacci number, \(f_n = \frac{\phi^n - (-\phi)^{-n}}{\sqrt{5}}\) where \(\phi\) is the golden ratio, or around 1.62. Therefore, this function takes \(\Omega(\phi^n)\) time.

This function basically is inefficient because it computes the same thing over and over again - for example, for `fibonacci(100)`

we will call `fibonacci(10)`

millions of times, when it gives the same result each time.

The idea behind dynamic programming is to avoid recomputing the same thing over and over again, by keeping those results around and simply looking them up when necessary. As a result dynamic programming trades off memory for time.

For example, if we were to write the above function iteratively, we might start from the base case and build upward toward the result:

```
def fibinacci(n):
saved = [0, 1]
for i in range(2, n + 1):
saved[i] = saved[i - 1] + saved[i - 2]
return saved[n]
```

However, note that at any given time, we only really use the last two elements of `saved`

. Therefore, we can just store the last two elements:

```
def fibinacci(n):
if n <= 1: return n
last1, last2 = 0, 1
for i in range(2, n + 1):
current = last1 + last2
last1, last2 = current, last1
return current
```

This is much better - we now have \(\Theta(n)\) iterations and constant memory. Since the numbers grow exponentially, however, and addition takes \(\Theta(\log n)\) (using multiprecision arithmetic), addition will actually take \(\Theta(n)\), so the algorithm overall takes \(\Theta(n^2)\).

Dyanmic programming is used to solve optimization problems, like greedy algorithms are often used to do.

To solve a problem using dynamic programming, we examine the structure of an optimal solution - the **optimal structure**. Given a problem instance \(I\):

- Define all the relevant subproblems \(S(I)\) that allows us to compute \(I\).
- This is just the direct subproblems, not the subproblems of the subproblems. For example, for
`fibonacci(n)`

we would have \(S(I)\) be`fibonacci(n - 1)`

and`fibonacci(n - 2)`

.

- This is just the direct subproblems, not the subproblems of the subproblems. For example, for
- Write a recurrence relation for \(I\) in terms of \(S(I)\) and the base cases.
- The references to \(S(I)\) allow us to look up the answers later in the table.

- Compute the optimal solution to \(I\) using the recurrence relation from the bottom-up (starting from the base cases of the recurrence relation) - filling in a table of values for the solutions to each subproblem.
- Whenever we have a subproblem we've previously encountered, we can simply look up the answer in the table to avoid recomputing it.
- When all subproblems have been solved, we can often compute the solution to \(I\) very simply using values from the table.

This is somewhat similar to divide and conquer, where we have subproblems and solve them recursively, up until we actually need to combine the solutions - instead of combining recursively, we solve them from the bottom up.

While the 0-1 knapsack problem we saw previously is NP-complete, we can solve it relatively well using dynamic programming.

Suppose \([x_1, \ldots, x_n]\) is an optimal solution for a given problem instance where \(P = [P_1, \ldots, P_n]\) and weights \(W = [W_1, \ldots, W_n]\), and capacity \(M\). Clearly, either \(x_n = 0\) or \(x_n = 1\).

If \(x_n = 0\), then \([x_1, \ldots, x_{n - 1}]\) is an optimal solution to a smaller problem instance where \(P = [P_1, \ldots, P_{n - 1}]\) and weights \(W = [W_1, \ldots, W_{n - 1}]\), and capacity \(M\), and the profit of \([x_1, \ldots, x_n]\) is the same as the profit of \([x_1, \ldots, x_{n - 1}]\).

If \(x_n = 1\), then \([x_1, \ldots, x_{n - 1}]\) is an optimal solution to a smaller problem instance where \(P = [P_1, \ldots, P_{n - 1}]\) and weights \(W = [W_1, \ldots, W_{n - 1}]\), and capacity \(M - W_n\), and the profit of \([x_1, \ldots, x_n]\) is the same as the profit of \([x_1, \ldots, x_{n - 1}]\) plus \(P_n\).

Clearly, the profit of \([x_1, \ldots, x_n]\) is the larger of the two cases, where either \(x_n = 0\) or \(x_n = 1\).

These are our relevant subproblems - the knapsack problem instances with the same weights and profits, but only including the first \(0 \le i \le n\) items and for different capacities \(0 \le m \le M\), so \(P = [P_1, \ldots, P_i]\) and weights \(W = [W_1, \ldots, W_i]\).

Now we can define a 2D array (a table) \(P[i, m]\) that defines the optimal profit of such a subproblem, considering the first \(i\) items and for a knapsack with capacity \(m\).

The basic idea will be to compute \(P[i, m]\) for all \(i\) and \(m\), until we get to \(P(n, M)\).

Now we'll write the recurrence relation: \(P[i, m] = \max\left(P[i - i, m], P[i - 1, m - W_i]\right)\). Here, \(P[i - i, m]\) represents the case where \(x_i = 0\), and \(P[i - 1, m - W_i]\) represents the case where \(x_i = 1\).

We also need some base cases: \(P[i, m] = \begin{cases} 0 &\text{if } i = 1 \wedge m < W_i \\ P_1 &\text{if } i = 1 \wedge m \ge W_i \\ P[i - 1, m] &\text{if } i > 1 \wedge m < W_i \\ \max\left(P[i - i, m - W_i] + P_i, P[i - 1, m]\right) &\text{otherwise} \\ \end{cases}\).

Now we fill in an \(n\) by \(M + 1\) table of all \(P[i, m]\) values, For each row \(i\), we fill in all the \(m\) values from smallest to largest. The final answer is therefore at the last cell of the bottommost row.

In code, it looks like this:

```
from collections import defaultdict
def knapsack(profits, weights, max_weight):
max_profits = defaultdict(dict)
for weight in range(max_weight):
if weight >= weights[0]:
max_profits[0][m] = profits[0]
else:
max_profits[0][m] = 0
for i in range(1, len(profits)):
for weight in range(max_weight):
if weight < weights[i]:
max_profits[i][weight] = P[i - 1][weight]
else:
max_profits[i][weight] = max(max_profits[i - 1][weight - weights[i]] + profits[i], max_profits[i - 1][weight])
return max_profits[len(profits) - 1][max_weight - 1]
```

One improvement we could make is that in the last row we only need to compute the single entry \(P[n, M]\), since it only relies on \(P[n - 1, M - w_i]\) and \(P[n - 1, M]\).

The above algorithm finds the maximum profit, but not the actual items chosen to get that profit. We can get the items by repeatedly finding which \(\max\left(P[i - i, m - W_i] + P_i, P[i - 1, m]\right)\) we chose at each step, and if \(P[i, m] = P[i - i, m - W_i] + P_i\), then we include the item in the solution. then, we repeatedly do the same check for \(P[i - i, m - W_i]\) if \(P[i, m] = P[i - i, m - W_i] + P_i\) or do the same check for \(P[i - 1, m]\) if \(P[i, m] = P[i - 1, m]\), and so on:

```
from collections import defaultdict
def knapsack(profits, weights, max_weight):
max_profits = defaultdict(dict)
for weight in range(max_weight):
if weight >= weights[0]:
max_profits[0][m] = profits[0]
else:
max_profits[0][m] = 0
for i in range(1, len(profits)):
for weight in range(max_weight):
if weight < weights[i]:
max_profits[i][weight] = P[i - 1][weight]
else:
max_profits[i][weight] = max(max_profits[i - 1][weight - weights[i]] + profits[i], max_profits[i - 1][weight])
# figure out which items are in this max profit selection
weight = max_weight
items = [None] * len(profits)
profit = max_profits[len(profits) - 1, capacity - 1]
for i in range[len(profits) - 1, 0, -1]: # go through indices from the last row's index to the second row's index
if profit == max_profits[i - 1, weight]:
items[i] = False
else:
items[i] = True
profit -= profits[i]
weight -= weights[i]
items[0] = profit == 0
return items
```

Clearly, this is an \(\Theta(nM)\) algorithm, since the table has \(nM\) cells and each cell value takes constant time to compute, and finding the actual items can be done in \(\Theta(n)\) since we trace one row per iteration.

Although this seems to be a polynomial time algorithm, this is actually not the case. The size of a given problem instance is the **sum of the sizes of its inputs**.

At first, we might assume that this is \(n + M\). However, the size of the \(M\) input is **actually the number of bits it has**, which is \(k = \ceil{\log_2 M}\) - the size of the input is \(n + k\). Therefore, the algorithm is \(O(n2^k)\). This is consistent with the fact that the knapsack problem is NP-complete.

There is also a recursive backtracking algorithm that tries all the combinations of items and finds the largest profit, which runs in \(\Theta(2^n)\). This is useful in different in different situations - even though both are exponential, the dynamic programming one is better for small \(M\) and large \(n\), and the backtracking one is better for small \(n\) and large \(M\).

We can solve the coin changing problem optimally as well, in almost the same way:

Note that for \(d_n\), \(0 \le a_n \le \frac T {d_n}\).

Our dynamic programming table will be \(n\) by \(\ceil{\frac{T}{d_1}}\) - each row is a number of coin denominations used, and each column is a different target sum.

Let \(N[i, t]\) be a list of coins used in the optimal solution to the coin changing problem instance with only the coin denominations \(d_1, \ldots, d_i\) and target sum \(t\).

Then we have the recurrence relation \(N[i, t] = \begin{cases} j \text{ where } 0 \le j \le \floor{\frac t d_i} \text{ is the index for which } N[i - 1, t - jd_i] + j \text{ is minimized} &\text{if } i > 1 \\ t &\text{if } i = 1 \\ \end{cases}\).

The final answer is in \(N[n, T]\). We could also save memory by, instead of having \(N[i, t]\) for solutions, having a table \(A[i, t]\) store the number of coins used in the optimal solutions, and then at the end tracing backwards like we did for the knapsack problem.

Suppose we have two sequences \(X = [X_1 \ldots X_m]\) and \(Y = [Y_1 \ldots Y_n]\) where all elements are in some finite alphabet \(A\). What is the longest subsequence that is in both \(X\) and \(Y\)?

A subsequence of a sequence \(Z\) is a sequence that contains any number of the elements of \(Z\) in order, but not necessarily consecutively. For example, \(1, 4, 8, 9\) is a subsequence of \(1, 2, 3, 4, 5, 6, 7, 8, 9\).

We will solve this using dynamic programming. Let \(\mathrm{LCS}(X, Y)\) represent the length of the longest common subsequence.

If \(X_m = Y_n\), then \(\mathrm{LCS}(X_1 \ldots X_m, Y_1 \ldots Y_n) = \mathrm{LCS}(X_1 \ldots X_{m - 1}, Y_1 \ldots Y_{n - 1}) + 1\).

If \(X_m \ne Y_n\), then \(\mathrm{LCS}(X_1 \ldots X_m, Y_1 \ldots Y_n) = \max\left(\mathrm{LCS}(X_1 \ldots X_{m - 1}, Y_1 \ldots Y_n), \mathrm{LCS}(X_1 \ldots X_m, Y_1 \ldots Y_{n - 1})\right)\). This is because there are three cases: if the LCS ends on \(Y_n\), then it doesn't end on \(X_m\) and we have \(\mathrm{LCS}(X_1 \ldots X_m, Y_1 \ldots Y_n) = \mathrm{LCS}(X_1 \ldots X_{m - 1}, Y_1 \ldots Y_n)\); if the LCS ends on \(X_m\), then it doesn't end on \(Y_n\) and we have \(\mathrm{LCS}(X_1 \ldots X_m, Y_1 \ldots Y_n) = \mathrm{LCS}(X_1 \ldots X_m, Y_1 \ldots Y_{n - 1})\). The third case is if the LCS doesn't end on \(X_m\) or \(Y_n\), but we don't need to consider it in the formula because \(\mathrm{LCS}(X_1 \ldots X_{m - 1}, Y_1 \ldots Y_{n - 1}) \le \mathrm{LCS}(X_1 \ldots X_{m - 1}, Y_1 \ldots Y_n)\) - it is already covered by the other cases.

The basic idea behind the recurrence is that we keep chopping off elements from the ends of \(X\) and \(Y\) - we are computing the LCS for all prefixes of the string. We now define our dynamic programming table, \(c[i, j] = \mathrm{LCS}(X_1 \ldots X_i, Y_1 \ldots Y_j)\) (the rows are values of \(i\) and the columns are values of \(j\)).

Clearly, we now have the recurrence \(c[i, j] = \begin{cases} 0 &\text{if } i = 0 \vee j = 0 \\ c[i - 1, j - 1] + 1 &\text{if } X_i = Y_j \wedge i > 0 \wedge j > 0 \\ \max(c[i - 1, j], c[i, j - 1]) &\text{if } X_i \ne Y_j \wedge i > 0 \wedge j > 0 \\ \end{cases}\). We want to compute \(c[m, n]\).

Clearly, to compute any \(c[i, j]\), we need to ensure that \(c[i - 1, j - 1], c[i - 1, j], c[i, j - 1]\) are already computed. We can do this by filling in the table from left to right, top to bottom. In code, it looks like the following:

```
from collections import defaultdict
def LCS(X, Y):
c = defaultdict(dict)
for i in range(len(X) + 1): c[i][0] = 0
for j in range(len(Y) + 1): c[0][j] = 0
for i in range(1, len(X) + 1):
for j in rnage(1, len(Y) + 1):
if X[i] = Y[j]:
c[i, j] = c[i - 1][j - 1]
else:
c[i, j] = max(c[i][j - 1], c[i - 1][j - 1])
return c[len(X), len(Y)]
```

We can also get the actual sequence by doing a traceback like for the knapsack problem. There are three cases: if we moved up and to the left in the table, then we include \(X_i\) or \(Y_j\) in the LCS and set \(i = i - 1, j = j - 1\); if we moved left in th etable, we set \(j = j - 1\); if we moved right in the table, we set \(i = i - 1\). By repeatedly following this until we get a base case, we get the LCS:

```
from collections import defaultdict
def LCS(X, Y):
c = defaultdict(dict)
direction = defaultdict(dict)
for i in range(len(X) + 1): c[i][0] = 0
for j in range(len(Y) + 1): c[0][j] = 0
for i in range(1, len(X) + 1):
for j in rnage(1, len(Y) + 1):
if X[i] = Y[j]:
c[i, j] = c[i - 1][j - 1]
direction[i][j] = "up_left"
else:
c[i, j] = max(c[i][j - 1], c[i - 1][j - 1])
direction[i][j] = "left" if c[i, j] == c[i][j - 1] else "up"
# retrieve the LCS
i, j = len(X), len(Y)
sequence = []
while i > 0 and j > 0:
if direction[i][j] == "up_left":
sequence = X[i] + sequence # we could also use `Y[j]` rather than `X[i]` since they are identical here
i -= 1; j -= 1
elif direction[i, j] == "up":
i -= 1
else:
j -= 1
return sequence
```

Clearly, this algorithm takes \(\Theta(mn)\) time and memory, since computing each table entry takes constant time and there are \((m + 1)(n + 1)\) table entries.

Suppose we have a convex polygon denoted by point \(q_1, \ldots, q_n\). How can we triangulate it such that the sum of the perimeters of the \(n - 2\) triangles are minimized?

In other words, what is the trangulation such that the sum of the lengths of all the chords (lines within the polygon that divide it into triangles) is minimised? These two formulations are the same because the sum of lengths of all the chords times 2 plus the perimeter of the polygon is the sum of the perimeters of all the triangles.

As it turns out, a polygon with \(n\) vertices has \(C_n\) possible triangulations, where \(C_n\) is the \(n\)th Catalan number (\(C_n = \frac 1 {n + 1} {{2n} \choose n} = \Theta\left(\frac{4^n}{n^{\frac 3 2}}\right)\)). This is exponential, so we want to avoid computing all possible triangulations.

Clearly, \(q_n q_1\) is in a triangle \(q_n q_k q_1\), where \(2 \le k \le n - 1\). Clearly, for each \(k\), we have the triangle \(q_n q_k q_1\), which divides the polygon into a polygon with vertices \(q_1, \ldots, q_k\), and a polygon with vertices \(q_k, \ldots, q_n\).

Clearly, the smallest triangulation of the polygon that includes the triangle \(q_n q_k q_n\) is this triangle along with the triangles in the smallest triangulation of \(q_1, \ldots, q_k\) and the smallest triangulation of \(q_k, \ldots, q_n\). If we check all possible \(k\), then the smallest of all of these is the smallest triangulation of the polygon \(q_1, \ldots, q_n\). These smaller polygons are our subproblems. Since all of them are polygons of the form \(q_1, \ldots, q_i, \ldots, q_j, \ldots, q_n, 1 \le i < j \le n\), we have \(\Theta(n^2)\) subproblems in total.

Let \(S[i, j]\) denote the smallest triangulation of the polygon \(q_i, \ldots, q_j\). Let \(P(q_i, q_k, q_j)\) be the perimeter of the triangle formed by \(q_i, q_k, q_j\). Then \(S[i, j] = \min\set{P(q_i, q_k, q_j) + S[i, k] + S[k, j] \middle| k \in \mb{Z}, i < k < j}\).

A graph is a pair \(\tup{V, E}\), where \(V\) is the set of vertices \(\set{v_1, \ldots, v_n}\) and \(E\) is the set of edges \(\set{\set{u_1, v_1}, \ldots, \set{u_m, v_m}}\), where \(u, v\) are edges. A digraph (directed graph) is also a pair \(\tup{V, E}\), but \(E\) is a set of edges \(\set{\tup{u_1, v_1}, \ldots, \tup{u_m, v_m}}\) - they have arcs (directed edges) that represent direction. In directed edges, \(u\) is the **tail** and \(v\) is the **head**. We can also write \(\tup{u, v}\) and \(\set{u, v}\) as \(uv\), which is easier to write and covers both directed and undirected edges.

In this course, we will not allow duplicate edges, so \(m \le n^2\). We will also not allow loops, so vertices will never have edges connecting them to themselves.

On the computer, we often represent graphs with adjecency matrices and adjacency lists.

An **adjacency matrix** is an \(n\) by \(n\) matrix where each row represents a vertex and each column represents a vertex. The elements of this matrix are 1 if there is an edge from the vertex for its row to the vertex for its column (\(A_{u, v} = \begin{bmatrix} a_{1, 1} & \ldots & a_{1, u} \\ \vdots & \vdots & \vdots \\ a_{v, 1} & \ldots & a_{v, u} \end{bmatrix}\)). Formally, the adjacency matrix is \(A = \begin{bmatrix} a_{1, 1} & \ldots & a_{1, n} \\ \vdots & \vdots & a_\vdots \\ a_{n, 1} & \ldots & a_{n, n} \\ \end{bmatrix}\).

Since loops (edges that have the same vertex on both ends) aren't allowed, the diagonals \(a_{1, 1}, \ldots, a_{n, n}\) are all 0. Note that for an undirected graph, the adjacency matrix has \(2\abs{E(G)}\) 1 entries, while for a directed graph, the adjacency matrix has \(\abs{E(G)}\) 1 entries.

An **adjacency list** is an array of \(n\) linked lists, represented as \(\adj(_1), \ldots, \adj(v_n)\), where \(\adj(v_i)\) is the linked list of the vertices that \(v_i\) is directly connected to. This is basically the same as the adjacency mtrix, but as an array of linked lists rather than a matrix. This is more memory efficient for sparse graphs, but make some operations more difficult since we can't test for edges in \(O(1)\) time.

We usually use breath-first search for undirected graphs, and depth-first search for directed graphs.

A **clique** is a subset of vertices that is fully connected. A **vertex cover** is a subset of vertices such that all edges have at least one end in that set.

To do a **breadth-first search** (BFS), we pick a vertex \(s\), then spread out from \(s\) by keeping track of the vertices we've explored so far, and exploring the unexplored neighbors of the explored vertices. Basically, the search spreads out from \(s\) in layers. Starting from \(s\), we explore its neighbors, then the neighbors of those, then the neighbors of those, and so on. The full form of breadth-first search appears as follows:

```
from collections import deque
def breadth_first_search(vertices, adjacency_list, start_vertex):
colors, predecessors = {}, {}
for vertex in vertices:
colors[vertex] = "white"
predecessors[v] = set()
colors[start_vertex] = "gray"
q = deque([start_vertex])
while q:
vertex = q.popleft()
for neighbor in adjacency_list[vertex]:
if colors[neighbor] == "white":
colors[neighbors] = "gray"
predecessors[neighbor] = vertex
q.append(neighbor)
colors[vertex] = "black"
```

The colors represent the visit state of vertices - white means a vertex is undiscovered, gray means it's discovered but its neighbors are still being processed, and black means it's discovered and fully processed. The `predecessors`

dict maps all vertices (except the start vertex) to the the vertices they were visited from (their **predecessors**), and is formally represented \(\pi(v_i)\).

In breadth-first search, an edge \(\set{\pi(v), v}\) is a **tree edge**. All other edges are **cross edges**. The set of all tree edges forms a **BFS forest** (there is a BFS tree for each graph component), and if the graph is connected, the BFS tree is a spanning tree.

A simpler form of the algorithm might appear as follows:

```
from collections import deque
def breadth_first_search(vertices, adjacency_list):
start_vertex = vertices[0]
visited = set()
q = deque([start_vertex])
while q:
vertex = q.popleft()
visited.add(vertex)
for neighbor in adjacency_list[vertex]:
if neighbor not in visited:
q.append(neighbor)
```

Note that breadth first search, since it visits every vertex and could potentially visit every edge, takes \(\Theta(\abs{V} + \abs{E})\) worst case time.

Shortest path from a starting vertex:

```
from collections import deque
def shortest_path(vertices, adjacency_list, start_vertex):
start_vertex = vertices[0]
visited = set()
q = deque([start_vertex])
distance_to_start = {start_vertex: 0}
while q:
vertex = q.popleft()
visited.add(vertex)
for neighbor in adjacency_list[vertex]:
if neighbor not in visited:
q.append(neighbor)
distance_to_start[neighbor] = distance_to_start[vertex] + 1
```

Proof of correctness:

Let \(\mathrm{dist}(v)\) represent

`distance_to_start[v]`

where \(v\) is a vertex, and \(s\) represent`start_vertex`

.

We want to prove that for any vertex \(v\), \(\mathrm{dist}(v)\) is the length of the shortest path from \(s\) to \(v\).

Define layers as \(L_i = \set{v \middle| \mathrm{dist}(v) = i}\).

Clearly, \(v \in L_i\) if and only if \(uv\) is an edge such that \(u \in L_{i - 1}\) and there is no \(u' \in L_j, j < i - 1\) - it is adjacent to a vertex in a previous layer and is not adjacent to any vertices in lower layers.

Clearly, if \(u\) is discovered before \(v\), then \(\mathrm{dist}(u) \le \mathrm{dist}(v)\).

Clearly, if \(uv\) is an edge, then \(\abs{\mathrm{dist}(u) - \mathrm{dist}(v)} \le 1\).

Clearly, if \(uv\) is a tree edge, then \(\mathrm{dist}(v) = \mathrm{dist}(u) + 1\).

Let \(\delta(v)\) be the length of the shortest path from \(s\) to \(v\).

Clearly, \(\delta(v) \le \mathrm{dist}(v)\) because \(\mathrm{dist}(v)\) is the length of the sequence \(v, \pi(v), \pi(\pi(v)), \ldots, s\).

We can easily use induction over the shortest path \(s, v_{i_1}, \ldots, v_{i_k}, v\) to prove that \(\delta(v) \le \mathrm{dist}(v)\).

Therefore, \(\delta(v) = \mathrm{dist}(v)\), and \(\mathrm{dist}(v)\) is the length of the shortest path.

For most graph algorithms, the steps will be quite intuitive, but the proofs of correctness will be rather difficult.

Determine whether a graph is bipartite:

Clearly, a graph is bipartite if and only if it contains an odd cycle (see MATH239 notes for proof of this.

We can check for odd cycles using breadth first search.

At each step, if we encounter an edge \(\set{u, v}\) where \(\mathrm{dist}(u) = \mathrm{dist}(v)\), then we have found an odd cycle containing \(u\) and \(v\) and the first common ancestor of \(u\) and \(v\) in the BFS tree.

This is because the distance from \(u\) and \(v\) to the common ancester must be equal, so if the distance is \(k\) then the cycle must have \(2k + 1\) edges (\(k\) for \(u\) to the ancestor, \(k\) for \(v\) to the ancestor, and 1 for \(u\) to \(v\)), which forms an odd cycle.

Otherwise, if we go through the entire graph without finding an odd cycle, there are no odd cycles. If this is the case, then we can partition the vertices graph into two partitions \(\set{u \in V(G) \middle| \mathrm{dist}(u) \text{ is even}}\) and \(\set{u \in V(G) \middle| \mathrm{dist}(u) \text{ is odd}}\).

**Depth-first search** (DFS) is done in a way very similar to breadth-first search, but with a different visiting mechanism and using a stack rather than a queue.

```
def depth_first_search(adjacency_list):
colors, predecessors = {}, {}
discovery_time, finish_time = {}, {}
for vertex in adjacency_list:
colors[vertex] = "white"
predecessors[vertex] = None
current_time = [0]
for vertex in adjacency_list:
if colors[vertex] == "white":
visit(adjacency_list, vertex, colors, predecessors, current_time, discovery_time, finish_time)
def visit(adjacency_list, vertex, colors, predecessors, current_time, discovery_time, finish_time):
colors[vertex] = "gray"
current_time[0] += 1
discovery_time[vertex] = time
for neighbor in adjacency_list[vertex]:
if colors[neighbor] = "white":
predecessors[neighbor] = vertex
visit(adjacency_list, neighbor, colors, predecessors, current_time, discovery_time, finish_time)
colors[vertex] = "black"
current_time[0] += 1
finish_time[vertex] = time
```

Note that we keep track of the discovery and finish times for each vertex. This information can be useful in some situations. A simplified version of DFS can be written as follows:

```
def depth_first_search(vertices, adjacency_list):
visited = set()
for vertex in vertices:
if vertex not in visited:
depth_first_search_visit(vertices, adjacency_list, vertex, visited)
def depth_first_search_visit(vertices, adjacency_list, vertex, visited):
visited.add(vertex)
for neighbor in adjacency_list[vertex]:
if neighbor not in visited:
depth_first_search_visit(vertices, adjacency_list, neighbor, visited)
```

In depth-first search, an edge \(\tup{\pi(v), v}\) is a **tree edge**. The set of all tree edges forms a **DFS forest** (there is a DFS tree for each graph component). If the graph is connected, the DFS forest is a spanning tree. For a tree edge, we discover and finish with \(v\) during \(u\)'s processing - \(v\)'s discovery and finish time interval is nested within \(u\)'s discovery and finish time interval. Additionally, \(v\) will be white when the edge is dicovered.

An edge \(\tup{u, v}\) is a **forward edge** if and only if it is not a tree edge, and \(v\) is a descendant of \(u\) in the depth-first forest - an edge that connects a vertex with a descendant that is not a direct child. Like for tree edges, \(v\)'s time interval is nested in \(v\)'s, but \(v\) will be black when we the edge is discovered.

A **back edge** is the same thing as a forward edge but in the other direction. However, \(\tup{u, v}\) can't possibly be a tree edge since \(u\) is a descendant of \(v\) in the DFS tree, so we can simply say a back edge is one that connects a vertex to an ancestor. For a back edge, we discover and finish with \(u\) during \(v\)'s processing - \(u\)'s discovery and finish time interval is nested within \(v\)'s. Also, \(v\) will always be gray.

All other edges are **cross edges** - generally this happens when edges connect siblings in the tree. For example, in the graph \(\adj(u) = \set{v, w}, \adj(v) = \set{w}, \adj(w) = \set{v}\), either \(vw\) is a cross edge, or \(wv\) is, depending on the order we visit the vertices in. For a cross edge, the discovery and finish time intervals for \(u\) and \(v\) will be disjoint - they will never overlap. Also, \(v\) will always be black.

The **parenthesis theorem** says that the discovery and finish time intervals are either disjoint or nested.

BFS and DFS might not visit all the vertices of a graph if it is not connected - if it has multiple components. To have it visit all nodes, we can keep track of unvisited vertices, and while there are still unvisited nodes, we can repeatedly run the search algorithm, stopping when all vertices have been visited.

A **directed acyclic graph** (DAG) is a directed graph with no cycles. A **topological ordering/topological sort** of a directed graph \(G\) is an ordering \(<\) of all vertices in \(G\) such that \(\set{u, v} \in E(G) \implies u < v\). A directed graph has a topological ordering if and only if it is a DAG - if it has no cycles.

A directed graph \(G\) is a DAG if and only if it has no back edges when we do a depth-first search on it. This is because if there is a back edge, that back edge and the tree edges connecting its ends form a directed cycle, and if it has a cycle, then one of its edges must be a back edge.

As it turns out, the negative finishing time forms a topological ordering - for all \(\set{u, v} \in E(G)\), `finish_time[u] > finish_time[v]`

, so `-finish_time[u] < -finish_time[v]`

. In other words, **ordering vertices by finishing time descending results in a topological ordering**.

The **indegree** of a vertex in a directed graph is the number of edges going into it. The **outdegree** of a vertex in a directed graph is the number of edges leaving it.

Prove that a DAG must have a vertex of indegree 0:

We will prove the contrapositive - that a graph with no vertices of indegree 0 must have a cycle.

If there are any edges starting and ending at the same vertex, then there is a cycle. Assume that there aren't any.

Clearly, there is an arc from \(v_1\) to \(v_2\), and \(v_2\)t o \(v_3\), and so on - every vertex must have at least one edge going into it from another vertex.

At some point, we must encounter a \(v_j = v_i\) where \(j > i\) - we run out of vertices to come from.

Therefore, there must also be a cycle with \(v_i\) and \(v_j\). So if there are no cycles, there must be a vertex of indegree 0.

Prove that a directed graph \(G\) has a topological ordering if and only if it is a DAG:

Clearly, if \(G\) has a directed cycle \(v_1, \ldots, v_k, v_1, \ldots\), then that cycle cannot have a topological ordering since \(v_k > 1\) and \(v_1 > v_k\).

So if \(G\) has a topological ordering, then it must be a DAG.

Suppose \(G\) is a DAG. Then \(G\) has a vertex \(v_1\) of indegree 0, and it can only have outward edges.

Construct a topological ordering where \(v_1\) is the lowest value, and then delete \(v_1\) and all incident edges to make a new graph \(G'\).

Clearly, \(G'\) is still a DAG, since it is just \(G\) with a vertex deleted. So it also has a vertex \(v_2\) of indegree 0.

Let this be the second lowest value in the topological ordering, then delete \(v_2\) and all incident edges and repeat the steps for the new graph until we have processed all vertices.

Clearly, once we have processed all vertices we have a topological ordering. Therefore, if \(G\) is a DAG then we have a topological ordering.

Therefore, \(G\) is a DAG if and only if it has a topological ordering.

Note that this proof gives an algorithm for constructing a topological ordering, but this is rather inefficient. We can use depth-first search to do this more efficiently.

Making a topological ordering is possible in \(\Theta(\abs{V} + \abs{E})\):

```
def topological_ordering(vertices, adjacency_list):
visited = {}
topological_ordering = []
for vertex in vertices: visited[vertex] = False
for vertex in vertices:
if not visited[vertex]:
topological_ordering_visit(vertices, adjacency_list, vertex, colors, predecessors, current_time, discovery_time, finish_time)
return reversed(topological_ordering)
def topological_ordering_visit(vertices, adjacency_list, vertex, colors, visited, topological_ordering):
visited[vertex] = True
for neighbor in adjacency_list[vertex]:
if not visited[neighbor]:
topological_ordering_visit(vertices, adjacency_list, neighbor, visited, topological_ordering)
topological_ordering.append(vertex)
```

For two vertices \(x, y\) in a directed graph \(G\), let \(x \sim y\) if and only if \(x = y\) or \(x \ne y\) and there exist directed paths from \(x\) to \(y\) and from \(y\) to \(x\). Note that \(\sim\) as defined here is an equivalence relation - it is a relation that is reflexive (\(a \sim a\)), symmetric (\(a \sim b \implies b \sim a\)), and transitive (\(a \sim b \wedge b \sim c \implies a \sim c\))- An equivalence relation partitions \(V(G)\) into sets of vertices such that \(u \sim v\) between any two \(u, v\) in the same partition.

A directed graph is **strongly connected** if and only if for any two vertices \(u, v\) there is a directed path from \(u\) to \(v\) and \(v\) to \(u\) - \(u \sim v\). A **strongly connected component** is a maximal strongly connected subgraph.

The **component graph** of a directed graph is a directed graph that has vertices corresponding to each component in the graph, and vertices corresponding to whether any two components have an edge between them. Define the discovery time and finishing time of a component to be the earliest discovery and latest finish time of any vertex in that component.

The component graph must be a DAG. This is because if there is a cycle between strongly connected components, then there must exist a path from each component to each other component, which means each of those components must not be maximal. Instead, the whole thing is all one component.

For two components \(C_1, C_2\) in a component graph, if there is an edge from \(C_1\) to \(C_2\), then \(C_1\) has a later finishing time than \(C_2\) (note that there can't be any cycles, so if this is the case there can't be an edges from \(C_2\) to \(C_1\) as well). This is because a depth-first search must go through all of \(C_2\) before going to \(C_1\) if it encounters \(C_2\) first, and must go through all of \(C_2\) before being able to finish going through \(C_1\) if it encounters \(C_1\) first.

A graph is strongly connected if and only if for any fixed vertex \(s\), there is a directed path from \(s\) to any vertex \(t\) and a directed path from \(t\) to \(s\). Knowing this, it is simple to create an algorithm to test if a graph \(G\) is strongly connected:

- We need to test if we can reach all vertices from \(s\).
- Pick any vertex \(s\) and call
`visit(s)`

(from DFS) once on \(G\). - If any of the vertices are not black, then there are vertices that we can't reach from \(s\) at all, so \(G\) is not a strongly connected component.

- Pick any vertex \(s\) and call
- We need to test if we can reach \(s\) from all vertices.
- Construct a new graph \(H\) that is the same as \(G\) but where all the edges are reversed.
- Pick any vertex \(s\) and call
`visit(s)`

(from DFS) once on \(H\). - If any of the vertices are not black, then there are vertices that can't reach \(s\) at all, so \(G\) is not a strongly connected component.

- If both tests pass, then \(G\) is a strongly connected component.

To find the strongly connected components in \(G\):

- Do a depth-first search on \(G\), storing the finishing times.
- Construct a new graph \(H\) with all the edges reversed.
- Do a depth-first search on \(H\), each time calling
`visit`

starting with the unvisited vertex with the latest finishing time in the DFS from step 1. - The strongly connected components of \(G\) are the trees in the DFS forest from the DFS in step 3.

Proof of correctness:

Note that \(G\) and \(H\) have the same strongly connected components.

Let \(u\) be the first vertex visited in step 3 - the one with the highest finishing time from the DFS in step 1.

Let \(C\) be the strongly connected component that contains \(u\). Let \(C'\) be any other strongly connected component.

Clearly, the finishing time of \(C\) is later than that of \(C'\).

So there cannot be an arc from \(C'\) to \(C\) in \(G\) (it is \(G\) rather than \(h\) because the finishing times were for a DFS on \(G\)).

So we can only reach the vertices in \(C\) from \(u\), not from \(C'\).

Clearly, this has time complexity \(\Theta(\abs{V} + \abs{E})\), since both tests from step 1 and 2 take \(\Theta(\abs{V} + \abs{E})\).

The spanning tree of a connected undirected graph \(G\) is a subgraph \(T\) that is a tree containing every vertex in \(G\). \(T\) is a spanning tree if and only if \(T\) is an acyclic subgraph that has \(\abs{V(G)} - 1\) edges.

Given a weight function \(w: V(G) \to \mb{R}\) on a graph \(G\), we want to find a spanning tree of \(G\) such that \(\sum_{e \in E(T)} w(e)\) is minimized - the **minimum spanning tree problem**. In other words, we want a connected subgraph of \(G\) such that the sum of the weights is minimised.

Kruskal's algorithm is one way to solve this. Basically, we sort the edges by weight ascending and considering each edge in order, we add the edge to the result if it doesn't form a cycle in the result.

At any point, we have a forest of disjoint trees (no trees share any vertices). We start off with \(n\) trees, and whenever we add an edge to the forest, we merge two trees, reducing the number of trees in the forest by 1.

We can efficiently test if adding an edge would result in a cycle by keeping track of the parents of each vertex in its tree as \(L[v]\). The root of the tree is therefore the vertex \(v\) such that \(L[v] = v\). By repeatedly getting the parent/leader of the vertex, checking \(v, L[v], L[L[v]], \ldots\), we eventually get the root of the tree. When we add an edge to the resulting tree, we simply add an entry \(L[v] = u\) or \(L[u] = v\).

We now define \(\operatorname{find}(v)\) to be the root of the tree containing the vertex \(v\), and implement it by repeatedly checking \(L\). Testing if an edge causes a cycle is the same as checking if the edge \(\set{u, v}\) has \(u, v\) in two different trees - if \(\operatorname{find}(u) = \operatorname{find}(v)\). So we add an edge \(\set{u, v}\) if and only if \(\operatorname{find}(u) \ne \operatorname{find}(v)\).

As it turns out, if we ensure that the depth of the tree containing \(u\) is less than the depth of the tree containing \(v\) when adding the entry \(L[v] = u\), we can ensure that \(\operatorname{find}\) is always \(\Theta(\log(\abs{E}))\).

\(L\) is a type of data structure called a **union-find/merge-find** data structure, because it has a find operation and a union/merge (setting the leader/parent of a vertex) operation.

```
def spanning_tree(vertices, edges, weights):
leaders = list(range(len(vertices)))
tree_edges = set()
for edge in sorted(edges, key=lambda edge: weights[edge]):
vertex1_root, vertex1_depth = find(edge[0], leaders)
vertex2_root, vertex2_depth = find(edge[1], leaders)
if vertex1_root != vertex2_root:
tree_edges.add(edge)
if vertex1_depth < vertex2_depth:
leaders[vertex2_root] = vertex1_root
else:
leaders[vertex1_root] = vertex2_root
return (vertices, tree_edges)
def find(vertex, leaders):
depth = 0
while leaders[vertex] != vertex:
vertex = leaders[vertex]
depth += 1
return vertex, depth
```

Since sorting the edges by weight takes \(\Theta(\abs{E} \log \abs{E})\) and we run the \(\Theta(\log \abs{E})\) union and find operations.

Prim's algorithm is another way to find the minimum spanning tree of a graph.

Basically, we define a tree \(A\), and starting from an arbitrary vertex \(u_0\), we repeatedly add the lowest-weight edge that connects a vertex in \(A\) to a vertex not in \(A\) to \(A\).

Let \(V_A\) be the set of vertices that are incident to edges in \(A\). Let \(N[v]\) be a vertex \(u \in V_A\) where \(\set{u, v}\) is a minimum weight edge. Let \(W[v] = w(N[v], v)\) - the best possible choice we can make at any vertex.

This is a pretty standard greedy algorithm.

```
def spanning_tree(vertices, edges, weights):
tree_edges = set()
start_vertex = vertices[0]
tree_vertices = {start_vertex}
for vertex in vertices - {start_vertex}:
W[vertex] = weights[{start_vertex, vertex}]
N[vertex] = start_vertex
while len(tree_edges) < len(vertices) - 1:
min_weight_vertex = min(vertices - tree_vertices, key=lambda vertex: W[vertex])
tree_vertices.add(min_weight_vertex)
edge_vertex = N[min_weight_vertex]
tree_edges.add({min_weight_vertex, edge_vertex})
for vertex in vertices - tree_vertices: # look at the vertices in the rest of the graph
if weights[min_weight_vertex, vertex] < W[vertex]: # new edge has a lower weight than the current one
W[vertex] = weights[min_weight_vertex, vertex]
N[vertex] = min_weight_vertex
return (vertices, tree_edges)
```

Since we need to check edges \(\Theta(\abs{V})\) times, and each time takes \(O(\abs{V})\) time, this implementation of Prim's algorithm takes \(O(\abs{V}^2)\) time. Alternatively, we could use the adjacency list representation and build a heap of all the vertices in \(W\) to compute `min_weight_vertex`

in \(O(\log(\abs{V}))\) time, giving us \(O(\abs{E} \log \abs{V})\), which is better for sparse graphs (graphs with relatively few edges compared to vertices).

To prove the correctness of Kruskal's and Prim's algorithm, we will construct a general algorithm of which Kriskal's and Prim's algorithm are special cases of. Proving this algorithm correct will prove both Kruskal's and Prim's algorthms correct.

A **cut** is a paritition of \(V(G)\) into sets \(S, V \setminus S\) such that both are non-empty. A **crossing edge** with respect to a cut \(S, V \setminus S\) is an edge that has one end in \(S\) and another in \(V \setminus S\). A cut \(S, V \setminus S\) **respects** a set of edges \(A \subseteq E(G)\) if and only if none of the edges in \(A\) are crossing edges - all edges in \(A\) have vertices that are either both in \(S\) or both in \(V \setminus S\).

The general algorithm is as follows:

```
def greedy_minimum_spanning_tree(vertices, edges, weights):
spanning_tree = set()
while len(A) < len(vertices) - 1:
let `S` and `vertices - S` be a cut that respects `spanning_tree`
let `e` be the minimum weight crossing edge
spanning_tree.add(edge)
return spanning_tree
```

Kruskal's algorithm simply has the cut be between two trees being grown in the graph. Prim's algoorithm has the cut be between the tree being grown and the rest of the graph.

Proof of correctness:

Let \(j = \abs{V} - 1\). Let \(A = \set{e_1, \ldots, e_j}\) be the set of edges in the spanning tree where the edges are chosen in that order.

We will prove that \(A\) is a minimum spanning tree, via induction. For simplicity, we will assume that all of the edge weights are distinct.

Clearly, if \(j = 0\) then there is only 1 vertex and the spanning tree has no edges, so \(A\) is a minimum spanning tree.

Assume for some \(j\) that \(\set{e_1, \ldots, e_{j - 1}}\) is contained in a minimum spanning tree.

So there is a cut \(S, V(G) \setminus S\) such that \(\set{e_1, \ldots, e_{j - 1}}\) respects the cut - \(S\) and \(V \setminus S\) form two trees that span all vertices.

In the algorithm, \(e_j\) is chosen to be the minimum weight crossing edge with respect to this cut.

If \(e_j\) is in a minimum spanning tree, then \(A\) is a minimum spanning tree. Assume \(e_j\) is not in the minimal spanning tree.

Clearly, There is some other crossing edge joining the two trees \(e_j'\) that has lower weight than \(e_j\). Then the minimum spanning tree has a lower weight than \(A\), so \(w(e_j') < w(e_j)\), a contradiction since \(e_j\) is chosen to be the one with the lowest weight. So \(A\) is a minimum spanning tree.

So by induction, the spanning tree is minimal.

Given a directed graph \(G\), a weight function \(w: E(G) \to \mb{R}^+\) (non-negative), and a source/start vertex \(u_0\), find the shortest directed path \(P\) from \(u_0\) to any vertex \(v\) such that the total weights are minimised - the smallest possible value of \(\sum_{e \in P} w(e)\). This is called the **single source shortest path problem**.

**Dijkstra's algorithm** (pronounced "diig-stra") is one algorithm for this problem, structurally very similar to Prim's algorithm. ;wip: explain this better

Let \(S\) be the subset of vertices for which the shortest path from \(u_0\) to all vertices in \(S\) are known. Let \(D[v]\) where \(v \notin S\) be the weight of the shortest path from \(u_0\) to \(v\) where all the interior vertices of the path (vertices excluding \(u_0\) and \(v\)) are in \(S\).

Let \(\pi[v]\) be the predecessor of \(v\) on a path (vertices in this algorithm will only be part of one path at a time).

Initially \(S = \set{u_0}\) and \(D[v] = w(u_0, v)\) for all \(v\) (edges that don't exist have weight \(\infty\)). At each step, we choose a \(v \in V(G) \setminus S\) such that \(D[v]\) is minimised, add \(v\) to \(S\), then update \(D\) and \(\pi\) to match:

```
def compute_paths(vertices, edges, weights, start_vertex):
visited = {start_vertex}
min_path_weights = {start_vertex: 0}
predecessors = {}
for vertex in vertices - {start_vertex}:
min_path_weights[vertex] = weights[{start_vertex, vertex}]
predecessors[vertex] = start_vertex
while len(visited) < len(vertices):
min_weight_vertex = min(vertices - visited, key=lambda vertex: min_path_weights[vertex]) # vertex in the rest of the graph with lowest current path weight
visited.add(min_weight_vertex)
for vertex in vertices - visited: # look at the vertices in the rest of the graph
if min_path_weights[min_weight_vertex] + weights[{min_weight_vertex, vertex}] < min_weight_vertex[vertex]: # new vertex has lower path weight than the current one
min_path_weights[vertex] = min_path_weights[min_weight_vertex] + weights[{min_weight_vertex, vertex}]
predecessors[vertex] = min_weight_vertex
return predecessors
def shortest_path(start_vertex, end_vertex, predecessors):
path = [start_vertex]
vertex = end_vertex
while vertex != start_vertex:
vertex = predecessors[vertex]
path.append(vertex)
return reversed(path)
```

Since this is structurally so similar to Prim's algorithm, it has the same time complexity, \(O(\abs{V(G)}^2)\) for the above implementation or \(O(\abs{E(G)} \log \abs{V(G)})\) for a heap-based version.

This works because \(D\) has the weight of the shortest path from \(u_0\) to any vertex \(v_k \in S\), and then we compute, step by step, the best paths from \(v_k\) to \(v\). Basically, \(D\) keeps track of all the possible candidate paths until we get to \(v\).

The Bellman-Ford algorithm can also solve the single-source path problem, but **also supports negative weights**, as long as there are no negative weight cycles (cycles whose edge weight sum is negative). This runs in \(O(\abs{E(G)} \abs{V(G)})\).

We might think that adding a constant offset to all the negative weights (to make them positive) would allow us to find the shortest paths using the algorithms we've seen so far. However, there exist graphs with negative weights such that if we add a constant offset, the shortest path changes. Consider a graph with 2 paths \(P_1, P_2\) from \(u\) to \(v\), where \(P_1\) has 5 edges each with weight -1, and \(P_2\) has 2 edges each with weight -2, so \(P_1\) has a lower weight. If we add 2 to the weights of each edge to make all the weights non-negative, \(P_2\) has a lower weight.

If there are **no cycles in the graph** (a DAG), there is an algorithm that can find the shortest path even if there are negative weight edges, and running more efficiently than the Bellman-Ford algorithm (see slide 160 for code).

The **all-pairs shortest path problem** is to find the shortest path between all pairs of vertices in the graph, allowing negative weight edges but not negative weight cycles. We can solve this by running the Bellman-Ford algorithm on each vertex, but the Floyd-Warshall algorithm can do it in \(O(n^3)\), using a dynamic-programming-like technique (see slide 164 for code). Like Bellman-ford, Floyd-Warshall also supports negative edge weights as long as there are no negative weight cycles.

\(D_m[i, j]\) denotes the weight of the shortest path from \(i\) to \(j\) such that the interior vertices of the path (vertices excluding \(i\) and \(j\)) are in \(\set{1, \ldots, m}\). At each step, we check our candidate path \(D_m[i, m] + D[m, j]\) against the old best path \(D_m[i, j]\):

```
def shortest_path(vertices, edges, weights):
min_weight_path = [[[[float("inf")] * len(vertices)] for _ in vertices] for _ in vertices] # 3D array of size $n$ by $n$ by $n$:
for m in range(len(vertices)):
for i in range(len(vertices)):
for j in range(len(vertices)):
D[m][i][j] = min(D[m - 1][i][j], D[m - 1][i][m] + D[m - 1][m][j])
```

The complexity classes we will look at are \(P\) (polynomial), \(NP\) (non-deterministic polynomial), and \(NP\)-complete. These are sets of problems. \(NP\) is a superset of \(P\), and \(NP\)-complete is a subset of \(NP\).

A **decision problem** is a problem whose solutions are either "yes" or "no". A yes-instance/no-instance is a problem instance of a decision problem for which the answer is yes/no. An algorithm solving a decision problem is **polynomial time** if and only if it has a worst case time complexity of \(O(n^k)\) for some positive integer \(k\).

The complexity class \(P\) is the **set of all decision problems that have polynomial time algorithms solving them**. Given a decision problem \(\Pi\), \(\Pi \in P\) if and only if there is a polynomial time algorithm solving \(\Pi\). For example, sorting is not in \(P\) because it is not a decision problem. For example, checking if a graph contains a cycle is in \(P\) (\(O(n^2)\)), while checking if a graph contains a Hamiltonian cycle has no known polynomial-time algorithms that solve it (and there probably isn't one).

The rational knapsack problem (where we answer yes if and only if the optimal profit is \(T\) or more) is in \(P\), since we have a polynomial time greedy algorithm that can solve it, but the 0-1 knapsack problem (where we answer yes if and noly if the optimal profit is \(T\) or more) is NP-complete.

A **certificate** for a yes-instance \(I\) is extra information that allows us to verify whether \(I\) is in fact a yes-instance - enough information to generate a formal proof that the answer is correct. A **certificate verification algorithm** is an algorithm that correctly determines whether a problem instance \(I\) is actually a yes-instance given a certificate \(C\), written as \(\operatorname{Ver}(I, C)\).

A valid certificate is one that does in fact verify that \(I\) is a yes-instance. Certificates can also be invalid, which means their information doesn't prove that \(I\) is a yes-instance. Note that if \(I\) is a no-instance, then no valid certificate for \(I\) can exist. \(\operatorname{Ver}(I, C)\) outputs yes if \(I\) is a yes-instance and \(C\) is a valid certificate for \(I\), and no if \(C\) is not a valid certificate for \(I\).

A certificate verification algorithm \(\operatorname{Ver}(I, C)\) **solves** a decision problem if and only if for every yes-instance \(I\), there exists a certificate \(C\) such that \(\operatorname{Ver}(I, C)\) outputs yes, and for every no-instance \(I\), all certificates \(X\) make \(\operatorname{Ver}(I, C)\) output no. In other words, given a certificate for a problem instance, the certificate verification algorithm outputs the correct answer to that problem instance, assuming yes-instances come with a valid certificate.

The complexity class \(NP\) is the set of all decision problems that can be solved by a polynomial-time certificate verification algorithm - the **decision problems for which it is possible to verify that the result is correct in polynomial time**. Note that this includes all problems in \(P\) because the certificate verification algorithm for polynomial time algorithms can just be that algorithm itself and the certificates can just be the problem instances.

For example, the problem "does a graph contain a Hamiltonian cycle?" is in \(NP\), because the certificate can just be the Hamiltonian cycle itself (a list of vertices in the cycle), and we can verify whether that particular cycle is a Hamiltonian cycle in \(O(n)\) time (check that the cycle includes all vertices once, and neighboring vertices are connected by an edge). So \(P \subseteq NP\).

Whether \(P = NP\) or not is currently a big open question in computer science. It is commonly believed that \(P \ne NP\) - that there are problems that can be verified in polynomial time, but not computed in polynomial time.

All problem instances for problems in \(NP\) can be solved in exponential time. This is because we could potentially generate all possible certificates (this is of exponential size because the certificate's size must be \(O(n)\), or else the certificate verification algorithm would not be polynomial time) and verify them. If any certificate verifies to a yes, then the problem instance is a yes-instance. Otherwise, it is a no-instance.

Let \(\mathcal{I}(\Pi)\) be the set of all instances of a decision problem \(\Pi\). Let \(\mathcal{I}_{\text{yes}}(\Pi), \mathcal{I}_{\text{no}}(\Pi)\) be the set of all yes-instances and no-instances, respectively, of a decision problem \(\Pi\).

A **polynomial-time reduction/polynomial transformation** from decision problem \(\Pi_1\) to decision problem \(\Pi_2\) (written as \(\Pi_1 \le_P \Pi_2\)) exists if and only if there exists \(f: \mathcal{I}(\Pi_1) \to \mathcal{I}(\Pi_2)\) such that \(f\) can be computed in polynomial time, \(I \in \mathcal{I}_{\text{yes}}(\Pi_1) \implies f(I) \in \mathcal{I}_{\text{yes}}(\Pi_2)\), and \(I \in \mathcal{I}_{\text{no}}(\Pi_1) \implies f(I) \in \mathcal{I}_{\text{no}}(\Pi_2)\).

In other words, if \(\Pi_1 \le_P \Pi_2\), then \(\Pi_1\) is no more difficult to solve than \(\Pi_2\) - problem instances can be transformed into a different problem with the same answer in polynomial time, even without knowing what the answer is.

So if \(\Pi_1 \le_P \Pi_2\), and \(\Pi_2\) is in \(P\), then \(\Pi_1\) must also be in \(P\).

Some NP-complete problems are checking clique sizes (determine whether a graph contains a clique of size \(k\) or greater), and checking vertex cover sizes (determine whether a graph contains a vertex cover of size \(k\) or smaller). Note that there is a polynomial transformation from instances of the clique problem to instances of the vertex cover problem. We can define \(f(\tup{G, k}) = \tup{H, \abs{V(G)} - k}\) where \(H\) is the complement of \(G\) (a graph with edges where \(G\) doesn't have edges, and no edges where \(G\) does). Since \(f\) can be computed in polynomial time, and transformed problem instances give the same result, the polynomial transformation exists.

Proof that the clique problem has a polynomial transformation to the vertex cover problem:

Let \(I\) be a yes-instance of the clique problem for size \(k\).

So there exists a \(W \subseteq V(G)\) such that \(\abs{W} = k\) and \(W\) is a clique in \(G\) - \(W\) is a subset of size \(k\) of the largest clique. We want to prove that \(f(I)\) is also a yes-instance of the vertex cover problem.

Clearly, \(V(H) \setminus W\) is a vertex cover in \(H\), since every edge in \(V(H) \setminus W\) has one end in \(V(H) \setminus W\) and another in \(V(H)\) (this is a theorem in graph theory).

So if \(f(I)\) is a yes-instance, \(I\) must also be a yes-instance.

Let \(f(I)\) be a yes-instance of the vertex cover problem for size \(n - k\).

By a similar argument as above, show that there must be a clique of size \(n - k\) in \(G\).

So if \(I\) is a yes-instance, \(f(I)\) must also be a yes-instance.

Therefore, \(f(I)\) is a yes-instance if and only if \(I\) is a yes-instance.

Note that if a problem instance \(I\) has size \(n\), then the size of any polynomial transformation of \(I\) must have size \(O(n^k)\) for positive integer \(k\). This is because the computation creating a problem instance of size larger than \(O(n^k)\) would take longer than \(O(n^k)\) time, so the transformation would not be able to run in polynomial time. In other words, \(\abs{f(I)} \in O(\abs{I}^k)\) for some positive integer \(k\).

If \(\Pi_1 \le_P \pi_2\) and \(\Pi_2 \in P\), then \(\Pi_1 \in P\) - if a problem has a polynomial transformation to another, and the other can be solved in polynomial time, then so can the first problem. This is because we can solve an instance of \(\Pi_1\) by transforming it into an instance of \(\Pi_2\) in polynomial time, solve the \(\Pi_2\) instance in polynomial time, and give that as the answer.

If \(\Pi_1 \le_P \Pi_2\) and \(\Pi_2 \le_P \Pi_3\), then \(\Pi_1 \le_P \Pi_3\) - polynomial transformations can be composed. This is because any instance of the first problem can be polynomially transformed into an instance of the third by transforming it into an instance of the second, and then transforming that instance into an instance of the third problem.

The complexity class \(NP\)-complete (also known as \(NPC\)) are the set of all decision problems such that \(\Pi \in NP\) and for all \(\Pi' \in NP\), \(\Pi' \le_P \Pi\) - the **decision problems in NP such that there are polynomial transformations from all other problems in NP to them**.

That means that if \(P \cap NPC \ne \emptyset\), then \(P = NP\) - if any NP-complete problem can be solved in polynomial time, then that algorithm can be used to solve any NP-complete problem in polynomial time.

Given a Boolean formula \(F\) over Boolean variables \(x_1, \ldots, x_n\) in conjunctive normal form (the formula is a product of sums), is there a truth assignment of the \(n\) variables such that the formula evaluates to true? In other words, do there exist values of \(x_1, \ldots, x_n\) such that \(F\) is true? This is the **CNF-satisfiability problem** (also known as SAT).

A **literal** is either a variable or a negation of a variable, \(a\) or \(\neg a\). A **clause** is a logical OR over literals, usually represented as a set of literals. A **formula** is a set of clauses.

According to the Cook-Levin theorem, the CNF-satisfiability problem is NP-complete. This is the first ever problem to be proved NP-complete.

There is a theorem that says that if \(\Pi_1\) is NP-complete, and \(\Pi_1 \le_P \Pi_2\), then \(\Pi_2\) is also NP-complete, since \(\Pi_2 \in NP\) and any \(\Pi'\) must satisfy \(\Pi' \le_P \Pi_1 \le_P \Pi_2\). So once we know one NP-complete problem, we can simply show that an NP-complete problem can be transformed in polynomial time to the problem we want to prove is NP-complete, which is a much simpler problem than proving a problem NP-complete from scratch.

For example, we can prove the 3SAT/3-CNF-satisfiability problem (same as SAT, but \(F\) is a product of sums where every sum has exactly 3 literals/variables, all different) is NP-complete in this way. To convert any SAT problem instance into 3SAT:

- Initialize \(F'\) as an empty set.
- For each clause \(C = \set{x_{i_1}, \ldots, x_{i_k}}\) in \(F\) (each sum in the product of sums):
- If \(\abs{C} = 1\), add the clauses \(x_{i_1} \vee a_1 \vee a_2, x_{i_1} \vee a_1 \vee \neg a_2, x_{i_1} \vee \neg a_1 \vee a_2, x_{i_1} \vee \neg a_1 \vee \neg a_2\) to \(F'\), where \(a_1, a_2\) are new, unused variables (these should not used in any other clauses).
- This is clearly true since the clauses evaluate to \(x_{i_1}\) regardless of the values of \(a_1, a_2\).

- If \(\abs{C} = 2\), add the clauses \(x_{i_1} \vee x_{i_2} \vee c, x_{i_1} \vee x_{i_2} \vee \neg a\) to \(F'\), where \(a\) is a new, unused variable (this should not be used in any other clauses).
- In the same way as above, regardless of the value of the variable, the clauses are always satisfiable if and only if \(C\) is.

- If \(\abs{C} = 3\), add this clause to \(F'\).
- This clause is already a valid 3SAT clause.

- If \(\abs{C} > 3\), add \(\abs{C} - 2\) clauses \(a_1 \vee a_2 \vee x{i_1}, a_3 \vee \neg x_{i_1} \vee x_{i_2}, a_4 \vee \neg x_{i_2} \vee x_{i_3}, a_5 \vee \neg x_{i_3} \vee x_{i_4}, \ldots, a_{k - 2} \vee \neg x_{i_{k - 4}} \vee x_{i_{k - 3}}, a_{k - 1} \vee a_k \vee \neg x_{i_{k - 3}}\) to \(F'\) where \(a_1, \ldots, a_k\) are new, unused variables (these should not be used in any other clauses).
- We can prove this is correct by proving that \(C\) being satisfiable implies the added clauses are satisfiable (via direct proof), and if the added clauses are satifiable, \(C\) is also satisfiable (via contradiction).

- If \(\abs{C} = 1\), add the clauses \(x_{i_1} \vee a_1 \vee a_2, x_{i_1} \vee a_1 \vee \neg a_2, x_{i_1} \vee \neg a_1 \vee a_2, x_{i_1} \vee \neg a_1 \vee \neg a_2\) to \(F'\), where \(a_1, a_2\) are new, unused variables (these should not used in any other clauses).
- After this is done, \(F'\) is an instance of 3SAT that is a yes-instance if and only if \(F\) is a yes-instance of \(F\).

Since the transformation runs in polynomial time, 3SAT is in NP and \(\text{SAT} \le_P \text{3SAT}\), so 3SAT is NP-complete as well.

We can also prove that 2SAT/2-CNF-satisfiability problem (same as SAT, but \(F\) is a problem of sums where every sum has exactly 2 variables, all different) is in P.

On computers, we can represent SAT problem instances with an \(m\) by \(n\) matrix \(M\) (where \(n\) is the number of variables and \(m\) is the number of clauses), where rows represent clauses \(C_1, \ldots, C_m\), and columns represent variables \(x_1, \ldots, x_n\). Each cell \(M_{i, j}\), then represents whether \(x_j\) is in the clause \(C_i\) as a literal (represented by 1), in the clause \(C_i\) negated as a literal (represented by -1), or not in the clause at all (represented by 0).

Clearly, the size of the problem instance \(M\) is \(\Theta(mn)\), and \(m\) can grow exponentially with respect to \(n\). The algorithm that converts instances of SAT to 3SAT runs in polynomial time with respect to the size of the instance, \(\Theta(mn)\), and so is a polynomial transformation.

We've been talking about **decision problems** all this time, but now we want to also look at problems in general.

Given a problem \(\Pi\) (not necessarily a decision problem), an **oracle** for \(\Pi\) is an (possibly hypothetical) algorithm \(A\) that solves it.

Given another problem \(\Pi'\), an algorithm \(A'\) is a **Turing reduction** from \(\Pi'\) to \(\Pi\) if and only if it solves \(\Pi'\) using \(A\) as a subroutine (\(A'\) depends on \(A\) to work, so \(A\) is the solver). A Turing reduction is sort of like a generalization of the concept of polynomial transformations to all problems. We can say that a Turing reduction exists by writing \(\Pi' \le^T \Pi\).

A Turing reduction \(A'\) using \(A\) as the solver is a **polynomial-time Turing reduction** if and only if \(A'\) has polynomial running time with respect to the running time of \(A\). If this is the case, we can write \(\Pi' \le_P^T \Pi\).

Alternatively, we can use a more intuitive definition: if we have a magical box called an **oracle** that can solve any instance of \(\Pi'\) in polynomial time, and we have an algorithm that transforms instances of \(\Pi\) to instances \(\Pi'\) and vice versa in polynomial time, we could use the oracle to solve \(\Pi\) in polynomial time.

Basically, if there is a polynomial-time Turing reduction \(\Pi' \le_P^T \Pi\) and \(\Pi\) can be solved in polynomial time, then \(\Pi'\) can also be solved in polynomial time.

Turing reductions are interesting because it seems like we can reduce harder versions of problems into simpler versions.

A travelling salesperson problem (TSP) involves finding paths in a graph with weighted edges such that the total weight is minimized.

Given a graph \(G\) and edge weights \(w: E(G) \to \mb{Z}^+\), what is a Hamiltonian cycle \(H\) in \(G\) that minimises \(\sum_{e \in H} w(e)\)? This is the **TSP-optimisation** problem.

Given a graph \(G\) and edge weights \(w: E(G) \to \mb{Z}^+\), what is the smallest \(\sum_{e \in H} w(e)\) where \(H\) is a Hamiltonian cycle in \(G\)? This is the **TSP-optimal-value** problem.

Given a graph \(G\) and edge weights \(w: E(G) \to \mb{Z}^+\) and a positive integer \(T\), is there a Hamiltonian cycle \(H\) in \(G\) such that \(\sum_{e \in H} w(e) \le T\)? This is the **TSP-decision** problem.

Clearly, \(\text{TSP-decision} \le_P^T \text{TSP-optimal-value} \le_P^T \text{TSP-optimisation}\):

```
def tsp_optimisation(vertices, edges, weights):
pass # suppose we are given this algorithm, and it returns a list of edges (or `None` if there are no Hamiltonian cycles)
def tsp_optimal_value(vertices, edges, weights):
hamiltonian_cycle = tsp_optimisation(vertices, edges, weights)
if hamiltonian_cycle is None: return None
return sum(weights[edge] for edge in hamiltonian_cycle)
def tsp_decision(vertices, edges, weights, threshold):
return tsp_optimal_value(vertices, edges, weights) <= threshold
```

However, it is also true that \(\text{TSP-optimisation} \le_P^T \text{TSP-optimal-value} \le_P^T \text{TSP-decision}\)! In other words, if we have a polynomial time algorithm for TSP-decision, we can solve TSP-optimal-value in polynomial time, and if we have a polynomial time algorithm for TSP-optimal-value, we can solve TSP-optimisation in polynomial time:

```
def tsp_decision(vertices, edges, weights, threshold):
pass # suppose we are given this algorithm, and it returns `True` or `False`
def tsp_optimal_value(vertices, edges, weights):
# basically, we do a binary search for the exact total weight of the Hamiltonian cycle
low, high = 0, sum(weights[edge] for edge in edges)
if tsp_decision(vertices, edges, weights, high): return float("inf")
while low < high:
middle = floor((low + high) / 2)
if tsp_decision(vertices, edges, weights, middle):
high = middle
else:
low = middle + 1
return high
def tsp_optimisation(vertices, edges, weights):
# repeatedly remove edges whose removal doesn't increase the optimal total weight (removing edges can never lower the optimal total weight)
lowest_total_weight = tsp_optimal_value(vertices, edges, weights)
if lowest_total_weight == float("inf"): return None
lowest_weight = lowest_total_weight
hamiltonian_cycle = set()
new_weights = dict(weights)
for edge in edges:
new_weights[edge] = float("inf") # remove edge by making it have infinite weight
if not tsp_decision(vertices, edges, new_weights, lowest_total_weight):
new_weights[edge] = weights[edge] # restore the edge, since it's part of the optimal edge
hamiltonian_cycle.add(edge)
```

In summary, \(\text{subset-sum} \le_P^T \text{vertex-cover} \le_P^T \text{clique} \le_P^T \text{3SAT} \le_P^T \text{SAT}\) - these are well-known NP-complete problems.

We can transform 3SAT to the clique problem as follows:

Let \(I\) be a 3SAT problem instance over variables \(x_1, \ldots, x_n\) and clauses \(C_1, \ldots, C_m\). Let \(C_{i, j}\) represent the \(j\)th literal of the \(i\)th clause.

Let \(f(I) = \tup{G, m}\) where vertices of \(G\) are the literals in \(C_1, \ldots, C_m\) (there are exactly \(3m\) vertices, duplicates are allowed) and the edges of \(G\) join any two literals that are not negations of each other and not in the same clause - where \(C_{i, j}, C_{i', j'}\) such that \(i \ne i'\) and \(C_{i, j} \ne \neg C_{i', j'}\). Basically, we transform \(I\) into checking if there is a clique (fully connected subgraph) in \(G\) of size \(m\) or greater. Clearly, \(f(I)\) can be computed in polynomial time. Assume \(I\) is a yes-instance. Then there is some assignment such that all clauses are satisfied. Then every clause must have at least one true literal. Since these literals can't contradict each other if the clauses are satisfied, there are edges between all of them, so these literals form a clique of size \(m\) or more.

Assume \(f(I)\) is a yes-instance, so there is a clique with \(m\) vertices. Clearly, if there are \(m\) vertices then there is one vertex from each clause, since literals within the same clause aren't joined by edges. Since these literals all don't contradict each other as they're connected by edges, it is possible to satisfy all of these literals - all of the clauses are satisfiable and \(I\) is a yes-instance.

Therefore, \(f(I)\) is a yes-instance if and only if \(I\) is a yes-instance.

Given a list of positive integers \(s_1, \ldots, s_n\) and a target sum \(T\), is there a subset of \(s_1, \ldots, s_n\) such that the sum of the elements is \(T\)? This is the **subset-sum** problem. The subset-sum problem is sort of like the knapsack problem.

We can also transform the clique problem to the subset-sum problem:

Let \(I\) be a Clique problem instance over graph \(G\) and clique size threshold \(k\).

Let \(V(G) = \set{v_1, \ldots, v_n}\) and \(E(G) = \set{e_1, \ldots, e_n}\). Let \(c_{i, j} = \begin{cases} 1 &\text{if } v_i \in e_j \\ 0 &\text{if } v_i \notin e_j \\ \end{cases}\).

Let \(a_i = 10^{\abs{E(G)}} + \sum_{j = 0}^{\abs{E(G)} - 1} c_{i, j} 10^j\) and \(b_j = 10^j\). Let \(T = k 10^{\abs{E(G)}} + \sum_{j = 0}^{\abs{E(G)} - 1} 2 \times 10^j\).

We now have a subset-sum problem instance with positive integers \(a_1, \ldots, a_{\abs{V(G)}}, b_1, \ldots, b_{\abs{E}}\) and target sum \(T\).

Each sum value \(a_i\) has lower digits corresponding to every edge, that are 1 if and only if the edge is incident to vertex \(i\). The digits above these act as a counter - adding \(k\) of \(a_i\) together makes the top tigits represent \(k\). Each sum value \(b_j\) has digits that are 0 except for the one corresponding to edge \(j\). The target sum requires these to be added together to make every digit corresponding to edges equal to 2, and also add exactly \(k\) of \(a_i\) to make sure the clique is of size \(k\).

Therefore, this problem instance gives the same answer as \(I\).

A decision problem \(\Pi\) is **undecidable** if and only if there do not exist algorithms that solve it. For an undecidable problem and any algorithm, there is at least one problem instance such that the algorithm won't find the correct answer in finite time.

The **Halting problem** is the most famous example of an undecidable problem - given a computer program \(A\) and the input for the program \(A\), does the program halt in finite time?

This is easy to prove undecidable. Suppose an algorithm solves the Halting problem. Then we can implement it in a program as a function `halts(program, input)`

, returning `True`

if `program`

halts when given input `input`

, and `False`

otherwise.

Consider the following program:

```
def contrarian(program):
if halts(program, program):
while True: pass # never halt
else:
return # immediately halt
contrarian(contrarian)
```

If `contrarian`

halts, then `halts(program, program)`

returns true, but then `contrarian`

doesn't halt if this is the case, which is impossible. If `contrarian`

doesn't halt, then `halts(program, program)`

returns false, but then `contrarian`

halts immediately if that is the case, which is impossible. Therefore, `contrarian`

neither halts or doesn't halt, a contradiction. So no algorithm can exist that solves the halting problem.

We can actually use Turing reductions to prove that some other problems are undecidable - if we can Turing reduce the halting problem to another problem \(\Pi\), then it must also be undecidable, since if it wasn't we could solve the halting problem using it.

This work by Anthony Zhang is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Copyright 2013-2017 Anthony Zhang.