Special Topics in Computer Science - Introduction to Machine Learning.

```
Yaoliang Yu
Section 001
Email: yaoliang.yu@uwaterloo.ca
Website: http://cs.uwaterloo.ca/~y328yu/mycourses/489
Office Hours: Tuesdays/Thursdays 2:40pm-3:40pm in DC-3617
Tuesdays/Thursdays 4:00pm-5:20pm
```

Course is potentially going to be collaboration between professor and Focal Systems - guest lectures by deep learning engineers from Focal and assignments from real-world systems. Contact agastya@focal.systems for questions and comments.

Course questions on Piazza, content on LEARN. Course will use MATLAB, Python, or Julia, and requires CS341 concepts.

No required textbooks. 5 semi-weekly assignments worth 50% total, submittable via LEARN, remaining 50% from an open book final exam. There's also a 5% bonus project, consisting of a 1 page proposal and 8 page report on a machine-learning-related project.

Machine learning is about giving computers the ability to learn things that they aren't explicitly programmed to do. More specifically, for a machine to **learn** is to, as experience E increases, improve a performance measure P for some class of tasks T. Essentially, a program learns if it **gets better at solving a problem as it gains more experience**.

Machine learning falls into three categories:

- Supervised learning: classification/regression/ranking - there's a source of truth that the machine can use to determine the true answer for at least some of the problem instances.
- Example: the Not Hotdog app - images labelled hotdogs/not-hotdogs are used to train the model, which is then used to make predictions about new images.
- Given a training set of pairs \tup{x, y}, find a function f: X \to Y such that f(x) has good performance on values of x that haven't been seen in the training set.
- We don't actually care that much about performance in the training set - too-high performance in the training set is overfitting

- Reinforcement learning: control/pricing/gaming - there's no explicit source of truth, but doing something gives feedback, like how good the previous output was.
- Example: AlphaGo uses a reinforcement learning model to guide monte-carlo tree search - wins give positive feedback, losses give negative feedback.

- Unsupervised learning: clustering - there's no explicit source of truth.
- Example: Google Youtube clustering 9-layer network from 2012 was trained to cluster objects, and managed to learn to detect faces by itself.

Modern ML research focuses on representation of data (e.g., feature engineering), interpretation of results, generalizing models to different domains (e.g., applying image classifiers to video), time/space complexity, learning efficiency (how many samples do we need? how big does the training set need to be?), and real-world applications.

New notation: A_i is the i-th 1-indexed row of the matrix A, and A_{:j} is the j-th 1-indexed column of the matrix A.

New notation: \sign x = \begin{cases} 1 &\text{if } x > 0 \\ -1 &\text{if } x < 0 \\ \text{undefined} &\text{if } x = 0 \end{cases}.

New notation: derivative of a function is Df(x) = \lim_{\delta \to 0} \frac{f(x + \delta) - f(x)}{\delta}.

New notation: \min_{a: f(a), b: g(b), \ldots} f(a, b, c, \ldots) is the minimum value of f(a, b, c, \ldots) such that f(a), g(b), \ldots are all true. The a: f(a) part might also be written as just a if there's no constraints.

New notation: \argmin_{a: f(a), b: g(b), \ldots} f(a, b, c, \ldots) is the values of a, b, c, \ldots such that f(a, b, c, \ldots) is minimised and f(a), g(b), \ldots are all true. The a: f(a) part might also be written as just a if there's no constraints.

Consider the problem of filtering out spam emails. The training set would be a set X of emails (e.g., a vector where each dimension represents a feature, like whether word i appears in the email) and a set Y representing the spamminess of those emails (e.g., real number between -1 and 1). One of the most important parts of this task is making sure we have a good representation for features in our emails. In a bag of words model, for example, we might make X a 10000-dimensional vector where each element represents whether one of 10000 words appears in the email's subject.

In **batch learning**, we care about performance on the testing set X', and the training set is just the means by which we get there, by performing statistics on X and assuming things about X'. In **online learning**, data is received in a streaming fashion - we need to product the value of y without knowing its true value.

In this course, we'll use <a, b> to represent the inner product a \cdot b = a^T b. Also, \sign(x) is 1 when x > 0, -1 when x < 0, and undefined when x = 0 (some other courses will instead define it to be 0).

The perceptron is a machine learning model based on a highly simplified model of a neuron. It takes in activation from neighboring neurons, takes their weighted sum, and then applies the activation function to them, the \sign function, which is the neuron's output. We'll study Rosenblatt's original design from 1958, along with several additions and improvements made since then.

Perceptrons are used for **binary classification problems**. We are given a training set \set{\tup{\vec x_1, y_1}, \tup{\vec x_2, y_2}, \ldots} and a testing set \set{\vec t_1, \vec t_2, \ldots} where \vec x_i, \vec t_i are feature vectors, and y_i is the binary **category**, either -1 or 1. Using the training set, we want to train the perceptron to determine the category y_i for each \vec t_i.

A perceptron is simply y = \sign(\vec w \cdot \vec x + b), where \vec w is the perceptron's **weights vector**, b is the perceptron's **bias**, \vec x is the **input**, and y \in \set{-1, 1} is the **prediction**. Note that \vec w + b should be a **hyperplane** separating the positive values of y_i from the negative values of y_i, and the sign of \vec w \cdot \vec x + b determines which side of the hyperplace the point \vec x is on (the positive predictions side, or the negative predictions side). For now, let's assume that for the training set, there exists a hyperplane that separates all of the positives from all of the negatives - that the data is **separable**.

Now we'll try to simplify the perceptron formula to make it easier to work with. First, let's get rid of the \sign by multiplying both sides of the perceptron formula by y: y^2 = y \sign(\vec w \cdot \vec x + b), and since y is either -1 or 1, y^2 = 1, so y \sign(\vec w \cdot \vec x + b) = 1, or in other words, y (\vec w \cdot \vec x + b) > 0. Expand to get \vec w \cdot (y\vec x) + by > 0

Let \vec w' = \begin{bmatrix} \vec w \\ b \end{bmatrix} and a = \begin{bmatrix} y \vec x \\ y \end{bmatrix} - we've chosen these definitions specifically so that \vec w \cdot (y\vec x) + by > 0 is equivalent to a \cdot w' > 0, and so that the value of \vec w' represents the perceptron parameters exactly.

When training the perceptron, our goal is to fit the hyperplane to our training set. That means we'll want to make perceptron predictions in bulk, so it would be nice to be able to represent that in a compact way. To do this, we'll let A = \begin{bmatrix} \vec a_1 & \vec a_2 & \ldots \end{bmatrix}, where \vec a_i = \begin{bmatrix} y_i \vec x_i \\ y_i \end{bmatrix} - columns of A are values of \vec a corresponding to each value of \vec x_i. Written out fully, that's A = \begin{bmatrix} y_1 \vec x_1 & y_2 \vec x_2 & \ldots \\ y_1 & y_2 & \ldots \end{bmatrix}.

Clearly, A^T \vec w' > 0 is equivalent to \forall \vec x_i, \sign(\vec w' \cdot \vec x_i + b) = y. We've now simplified the perceptron problem down to a single matrix multiplication and a comparison! Now, \vec w' contains all the perceptron parameters, and the columns of A are the data points (each with a trailing 1 element), premultiplied by the label.

Now the problem becomes: given premultiplied data in a matrix A, find \vec w' such that A^T \vec w' > 0. The **perceptron training algorithm** does this, and works as follows: repeatedly choose a column \vec a_i of A, and if \vec a_i \cdot \vec w' \le 0, change \vec w' by adding \vec a_i to it. Stop when \vec a_i \cdot \vec w' > 0 for all \vec a_i in A, or when we reach an iteration/passes limit.

Why do we correct the weights when \vec a_i \cdot \vec w' \le 0 by adding \vec a_i to \vec w'? Well, the next time we choose the \vec a_i column, we'll get \vec a_i \cdot (\vec w' + \vec a_i) = \vec a_i \cdot \vec w' + \magn{\vec a_i}^2. Since \magn{\vec a_i}^2 > 0, \vec a_i \cdot (\vec w' + \vec a_i) > \vec a_i \cdot \vec w', so \vec a_i \cdot (\vec w' + \vec a_i) is closer to being positive.

(Python implementation not included, since this is a question in assignment 1)

After training, we can make predictions for any given input \vec x with the usual formula, y = \sign\left(\vec w' \cdot \begin{bmatrix} \vec x \\ 1 \end{bmatrix}\right).

This algorithm is very simple to implement, yet works quite well in practice. Also, the fact that its formula is a linear combination is interesting. If we look at the weights, we notice that large positive weights mean that the corresponding feature strongly suggests that the prediction should be positive, whereas large negative weights strongly suggest that the prediction should be negative.

How well does a perceptron converge when running the training algorithm described above? **Block's perceptron convergence theorem** gives us an idea. If A is separable (i.e., a hyperplane exists that separates positive cateogry points from negative category points), then \vec w' will converge to some \vec w^*. If every column of A is selected indefinitely often, then A^T \vec w^* > 0. Furthermore, if \vec w' = \vec 0 initially, then the perceptron converges after at most (R / \gamma)^2 iterations, where R = \max\set{\magn{a_1}, \magn{a_2}, \ldots} and \gamma = \max\set{\min\set{\vec w \cdot \vec a_1, \vec w \cdot \vec a_2, \ldots} : \magn{\vec w} \le 1} (the margin - the minimum distance between the convex hull of the positive points and the negative points). Essentially, the margin represents the distance between the "hardest" two datapoints to classify.

Note that these values of R and \gamma are purely functions of the dataset, and that they don't directly depend on the size of A and the number of dimensions d. In other words, the number of mistakes the perceptron makes would be independent of the dataset size and number of dimensions! The larger the margin is, the faster the perceptron converges. Block's perceptron convergence theorem gives us a worst case bound, but in many practical situations the perceptron will perform a lot better.

Also, the perceptron stops at an arbitrary linear separator that correctly separates the points, not necessarily the one that most cleanly separates the positive and negative points (with the largest possible minimum distance from the hyperplane to positive/negative predictions). In fact, the resulting hyperplane will even depend on the order we feed in the data. We can use support vector machines instead to find that hyperplane (which also happens to be unique for each dataset!). This is the main disadvantage of perceptrons - they might only barely separate the training data, so they're less robust to unseen data than those that find a linear separator with a larger margin.

If the data is **not separable**, Block's perceptron convergence theorem doesn't apply anymore. The **perceptron boundedness theorem** says that convergence is only guaranteed if such a hyperplane exists, but if it doesn't, then the iterations are still bounded, because the perceptron's state will start cycling after a certain number of iterations. In practice, this means we would specify a time or iteration limit when doing training, or when the training/validation error stops changing, or even if weights stop changing much when using diminishing step sizes.

If we end up with non-separable data, we might want to find a better feature representation, use a deeper model, or use a **soft margin** - instead of a hyperplane that perfectly separates positive/negative values of y, we can allow a few mistakes.

There are many ways to extend perceptrons to classify things into more than two categories (positive/negative). One way is **one vs. all**, where we have one perceptron per category, perceptron with highest activation level wins - \max_c(w_c \cdot x). The issue with this is that it's imbalanced - each perceptron has to give negative predictions far more often than positive ones, since it only gives positive prediction for its own category and otherwise must give a negative prediction. Another is **one vs. one**, where we have one perceptron for every pair of categories, where a positive prediction means the datapoint is in the first category and negative means the other category, and then take a vote to find the most commonly predicted category as the final answer.

An example of applying perceptrons online is pricing - selling a product to the user at a price y, and updating weights if the price is too high and the user doesn't buy the product.

Assignment 1 now available, due in two weeks.

A **pass** is a run through all of the training data - 100 passes means we go through the training data 100 times. An **iteration** is a run through a single data point in our training data.

Consider a scatter plot of house market value vs. square footage. We'd expect that these two are pretty well correlated. A linear regression over these two variables can be used to give us a line of best fit.

Regression problems are about fitting models to match datasets as closely as possible. Linear regression problems try to fit linear models to datasets. When we're doing regression problems, we have to consider whether to use linear/nonlinear models, and whether we'll be using it to interpolate or extrapolate (choosing the perfect model is much more important for extrapolation)

Formally, a regression problem is: find f(\vec x) \approxeq \vec y given \vec x (the **feature vector** a real vector) and y (the **response value**, a real number). The hard part of this is that \vec x and y are drawn from unknown distributions, which makes it hard to interpolate/extrapolate. Additionally, we need a way to express how much error there is in our model predictions - a **loss function**.

One family of regression algorithms is **risk minimizers** (expected loss minimizers): algorithms that try to find f such that \min_{f: \vec x \to y} E[\ell(f(\vec x), y)].

A common loss function is **least squares**: \min_{f: \vec x \to y} E[\magn{f(\vec x) - y}^2]. The correct loss function for a given situation is often hard to determine, so we use one that's simple and efficient to compute - least squares works well enough for most situations. Additionally, of all the minimizers of $_W _F, W = A^+ CB^+$ is the one with the smallest F-norm, where A^+ is the pseudo-inverse of A (Sondermann '86, Yu & Shuurmans '11) - this is mostly a theoretical result, but gives us another good reason to use least squares loss.

Clearly, E[\magn{f(\vec x) - y}^2] = E[\magn{f(\vec x) - E(y \mid \vec x)}^2] + E[\magn{E(y \mid \vec x) - y}^2]. Note that the second term doesn't really depend on f - it's the **inherent noise variance**, the noise that we can't get rid of no matter how good our regression function is. Also, the first term gives us the problem in a much simpler form: we want to find an f(\vec x) that approximates E(y \mid \vec x) well, to make this term smaller.

One way to make this optimization process easier is to assume that f(\vec x) is linear, so f(\vec x) = E(\vec y \mid \vec x) = A \vec x + \vec b for some matrix A. If we make this assumption, then with risk minimization we're trying to find \min_{f: \vec x \to \vec y} E[A \vec x + \vec b - \vec y]. We can't minimize this directly because we don't know the true distribution of the variables, but using the law of large numbers, \frac 1 n \sum Z_i = E(Z) for any Z = \set{Z_1, Z_2, \ldots}. So if we assume the model is linear, and the sample is large, then the risk minimization can be approximated by \min_{\vec a, \vec b} \frac 1 n \sum \magn{A \vec x + \vec b - \vec y}^2 (this approximation is called the **empirical risk**).

Let's simplify the \min_{\vec a, \vec b} \frac 1 n \sum \magn{A \vec x + \vec b - \vec y}^2 approximation, using something very similar to what we did for perceptrons. First, let's define W = \begin{bmatrix} A^T \\ {\vec b}^T \end{bmatrix} and \vec x' = \begin{bmatrix} \vec x \\ 1 \end{bmatrix}. Now we have \min_W \frac 1 n \sum \magn{W^T \vec x' - \vec y}^2, which is slightly shorter/cleaner.

Let \vec x_i be the ith value of \vec x in our training set. Just like for the perceptrons simplifications above, we also want to include all of the training set data points in a single expression, to make our minimization problem simpler. To do this, let X = \begin{bmatrix} {\vec x_1'}^T \\ {\vec x_2'}^T \\ \vdots \end{bmatrix}, Y = \begin{bmatrix} {\vec y_1}^T \\ {\vec y_2}^T \\ \vdots \end{bmatrix}. Now, we can write this as \min_W \magn{XW - Y}_F^2 where \magn{A}_F = \sum_{i, j} A_ij is the **Frobenius norm** - each element simply gets squared and the squares are all summed together to get the result, like the Euclidean norm, but extended for any matrix.

The **least squares problem** is now writeable as \min_W \magn{XW - Y}_F^2, and we're minimizing the **sum of square residuals** XW - Y (sum of square distances between the predicted values and true values). Here, Y is a matrix with columns as the true responses, and the residuals are the distances between each true response in Y and the point that the hyperplane would predict given X.

Note that the Frobenius norm can be defined as: \magn{A}_F^2 = \trace{A^T A}. Additionally, the following are identities: \trace(A + B) = \trace(A) + \trace(B), \trace(AB) = \trace(BA), \trace(A) = \trace(A^T), and \trace(cA) = c \trace(A).

Therefore, \magn{XW - Y}_F^2 = \trace((XW - Y)^T (XW - Y)) = \trace((W^T X^T - Y^T) (XW - Y)) = \trace(W^T X^T X W - Y^T X W - W^T X^T Y + Y^T Y) = \trace(W^T X^T X W) - \trace((Y^T X W)^T) - \trace(W^T X^T Y) + \trace(Y^T Y) = \trace(W^T X^T X W) - \trace(W^T X^T Y) - \trace(W^T X^T Y) + \trace(Y^T Y) = \trace(W^T X^T X W - 2 W^T X^T Y + Y^T Y). Clearly, this is a quadratic equation with respect to W, and we want to find its minimum.

Consider \min_x f(x). Fermat's theorem says that at the minimum x, the derivative of f(x) must be 0. Consider a general quadratic function f(x) = \vec x^T A \vec x + \vec x^T \vec b + c. The derivative is then \frac{\dee f(x)}{\dee x} = (A + A^T)\vec x + \vec b.

Note that \magn{XW - Y}_F^2 = W^T(X^T X) W - 2W^T X^T Y + Y^T Y (a quadratic equation), and if set the derivative of this to 0 and solve we get X^T X W = X^T Y as a solution, which is just a linear system - we have X and Y, so we can solve for W. Note that X^T X might be invertible, but we should still never solve for W by using W = (X^T X)^{-1} X^T Y, since this involves solving n linear systems, whereas we can solve it by solving only 1 linear system (in practice, we should almost never actually compute matrix inverses).

Once we have W, we can make predictions for any given X using \hat Y = XW, or evaluate those predictions with (Y - \hat Y)^2. We can also evaluate using a different loss function, a technique often used in calibration theory.

Linear regression is disproportionally affected by large outliers. To mitigate this, we sometimes use Huber loss, which is linear for large differences and quadratic for smaller ones, where "larger" and "smaller" are defined by a threshold \delta. This ensures overly large outliers don't impact the result too much. Huber's loss function is defined as H(\hat y, y) = \begin{cases} \frac 1 2 (\hat y - y)^2 &\text{if } \abs{\hat y - y} \le \delta \\ \delta(\abs{\hat y - y} - \frac{\delta}{2}) &\text{otherwise} \end{cases}.

When we're performing training, the end result is a model. Often, we want the simplest possible model that still makes good predictions, because these tend to generalize the best (Occam's Razor). Regularization techniques are used to make training prefer simpler models over slightly more complex but better models, where "simpler" is defined based on the technique.

Regularization also allows us to solve **ill-posed problems** - those that don't have exactly one solution (zero or more than one solution) or don't have their solutions change continuously with respect to the problem initial conditions (i.e., derivative of solution with respect to initial condition doesn't always exist).

Suppose we have an ill-posed linear regression problem \min_W \magn{XW - Y}_F^2, so we might have many solutions to W. One thing we could say is that linear regressions that use "smaller" values of W are "simpler" than those that use "larger" values. To represent this, we just add a term to the least squares problem: \min_W \magn{XW - Y}_F^2 + \lambda \magn{W}_F^2, or equivlaently, (X^T X + \lambda I)W = X^T Y. A small positive lambda ensures that instead of a small change in the input resulting in a huge difference in the output, it would result in a difference proportional to \frac{1}{\lambda} instead. Adding the \lambda I value is a technique known as **Tikhonov regularization**, and the resulting modified least squares problem is known as the **ridge regression problem**.

The hyperparameter \lambda affects how much we penalize higher weights in the resulting model - higher values force the resulting model to have a smaller Frobenius norm, and smaller values allow us it to "trade off" more weight in return for closer predictions. How do we choose hyperparameters like \lambda? We have a training set (for model training), testing set (which we don't see until the end), and sometimes a small validation set (for tuning parameters), and on the training set, we can apply n-fold **cross-validation**:

- Suppose we have k different values of \lambda we want to consider.
- Split the training set into n roughly-equal sized chunks.
- For each value of \lambda we want to cross-validate:
- For each chunk i:
- Train the model on the dataset formed by combining the n - 1 chunks that are not chunk i.
- Evaluate the model against the training data in chunk i.

- Average the evaluation scores from each chunk to get the average cross-validation score for the value of \lambda.

- For each chunk i:
- Pick the value of \lambda that has the best average cross-validation score.

Guest lecture by Francois from Focal Systems (francois@focal.systems).

Almost any ML problem falls into regression or classification.

For linear regression, we're assuming that the response variable y is approximated by \vec \theta \cdot \vec x + N(0, \sigma), where N(0, \sigma) is a normal distribution centered around 0 with standard deviation \sigma. Further overview of some real-world details in implementing linear regression.

Though linear regression is simplistic, it turns out that it works very well in practice, since more complex models require more advanced ways to do regularization and get decent weight vectors. Tools like SVM are used a lot in the real world to model real phenomena, even when they aren't necessarily linear, because it works well enough for most purposes.

Most modern ML problems use SGD - stochastic gradient descent.

A Bernoulli model predicts y = P(Y_1 = y_1, \ldots, Y_n = y_n \mid X_1 = x_1, \ldots, X_n = x_n). If we assume that the distribution of the variables are a Bernoulli distribution, so they're independent, we can then write this as y = \prod P(Y_i = y_i \mid X_i = x_i) = \prod p(x_i; w)^{y_i} (1 - p(x_i; w)) ;wip: get the formula for this

Logistic regression tries to predict the value of 0 \le y \le 1 given \vec x by fitting the formula \frac 1 {1 + \exp(\vec w \cdot x)}, where \vec w is the thing that we're trying to fit.

We use a sigmoid rather than, say, a step function, because the gradient doesn't have any signal - if we differentiate it, the derivative is just 0 everywhere, so gradient descent wouldn't be able to get closer to the solution at every step. Instead, the sigmoid formula has a gentle curve, so its derivative is more suitable for performing gradient descent on.

Our loss function is then f(y', y) = \ln(y') * y + \ln(1 - y') (1 - y), where y' is the model's prediction and y is the true value. ;wip: why??? look at slides

Tensorflow example, implementing logistic regression using the built in gradient descent optimizer to minimize the loss function. When doing gradient descent, we want the largest learning rate that still converges.

;wip: logistic regression

Guest lecture by Aghastya from Focal Systems (aghastya@focal.systems).

A perceptron tells you which side of a hyperplane a point is, and also includes an algorithm to separate two classes with a hyperplane - a binary classifier. Logistic regression finds a line of best fit, as well as a measure of confidence that our prediction is correct, because the prediction's value is between -1 and 1 rather than exactly -1 or 1 - a binary classifier as well. The logistic regression gives a higher-magnitude prediction the farther it is from the logistic regression's hyperplane.

Perceptrons aren't very good binary classifiers overall for reasons weve previously discussed, but they're computationally cheap to run and easy to reason about. One example of the this is the XOR problem - a hyperplane cannot separate an XOR function: two classes \set{\tup{0, 1}, \tup{1, 0}} and \set{\tup{0, 0}, \tup{1, 1}}.

The goal of deep learning classification is to learn a representation of the input into something that's linearly separable, so we could then use classifier techniques like logistic regression.

Consider a **two-payer perceptron**. We have parameters U, \vec c, \vec w, b. Let \vec z = U\vec x + \vec b and h = f(\vec z), where f is a nonlinear function like x^2 or \arctan(x). Then the output of the two-layer perceptron is then \hat y = \vec h \cdot \vec w + b. There are three layers here:

- The
**linear layer**takes the model input \vec x and transforms it into another linear space via weights U and the bias term \vec c, to get the hidden layer inputs \vec z.- This is the part of the perceptron before the \sgn function.

- The
**hidden layer**takes the hidden input \vec z and applies a non-linear function to it, so we can represent non-linearity in the input, to get the linear layer inputs \vec h.- The function used in these hidden layers is known as an
**activation function**. - Common activation functions are sigmoid \frac{1}{1 + \exp(x)}, inverse tangent \arctan(x), and rectified linear unit (ReLU) \max(0, x).

- The function used in these hidden layers is known as an
- The
**linear layer**takes the hidden layer output \vec h and transforms it into another linear space via weights \vec w.

Essentially, we're weighting the input \vec x by the weight matrix U and the bias term \vec c (a value unaffected of input that influences the output in a particular direction) in a **linear layer** to get the **hidden layer inputs** \vec z. Then, we apply the non-linearity \vec h = f(\vec z) to those to get the **hidden layer output** \vec h. The hidden layer outputs \vec h are then used as the inputs to the linear classifier \hat y = \vec h \cdot \vec w + b, which is the output of the two-layer perceptron.

The idea is to compose a bunch of non-linear functions to approximate the actual, unknown function of \vec x that generates y. The non-linear functions add complexity to the model, and by adding enough of those non-linear functions, we can represent some pretty complex things.

The **width** of a multi-layer perceptron is the number of dimensions of \vec x. Note that this doesn't include the bias term \vec c.

We can't just use linear functions because composing linear functions just makes another linear function - if we applied linear functions on a dataset that wasn't linearly separable, it would still not be linearly separable, but if we apply non-linear functions, there's a chance it can be transformed into something linearly separable.

Consider now an example: U = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \vec c = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \vec w = \begin{bmatrix} 2 \\ -4 \end{bmatrix}, \vec b = -1, f(x) = \max(t, 0). The activation function is ReLU (rectified linear unit). As it turns out, this can correctly classify the output of the XOR function, something that a plain perceptron can't do.

We can stack these two-layer perceptrons - feeding the output of one into the input of another - to build k-layer perceptrons! It's actually been proven that if the activation function is not a polynomial, and there exists at least 1 hidden layer, the resulting model can theoretically learn any function that has values between 0 and 1! This is called the **universal approximation theorem**.

The multi-layer perceptron learns a **hierarchical non-linear feature representation**. Generally, we then feed the outputs of the neural net into our existing classification/regression tools.

As we go from levels closer to the input to levels farther away, we start being able to classify higher-and-higher-level features. For example, an image classifier network might start with learning lines/circles in the lower levels, and then shapes like trees and cars and faces in the higher levels.

Choosing the right activation function and architecture is currently more of an art than a science - there's a lot of nuances that we'll go over in later lectures. Usually, we'll just add more layers than we need, and regularize afterward.

The multilayer perceptron's output is clearly a function of the input, \hat y = q(\vec x; \Theta), where \Theta is the parameters of the model. We usually don't know the true function q^* that we're trying to approximate, but deep learning samples a lot of values of q^* and then tries to approximate that by composing lots of nonlinear functions to get q.

To do machine learning, we need to be able to talk about the difference between the prediction \hat y and the truth y. We can represent this as a **loss function** \ell(\Theta; X, \vec y) a function of the model parameters (e.g., the \vec w values in a linear regressor), as well as some input features \vec x and truths y. Usually, we omit the X and Y values to write this as \ell(\Theta).

One really common loss function is **cross entropy/logistic loss**: \ell(\Theta) = -\sum_{i = 1}^n \left(y_i \ln \hat y_i - (1 - y_i) \ln(1 - \hat y_i)\right) where y_i \in \set{0, 1} is the true value and \hat y_i \in \set{0, 1} is the predicted value. For k classes, we can write this as \ell(\Theta) = -\sum_{i = 1}^n \sum_{c = 1}^k \vec y_{i, c} \ln \hat y_{i, c} where \vec y and \hat y are one-hot vectors. Another really common loss function is mean-squared-error, which is just \sum_{i = 1}^n (y_i - \hat y_i)^2.

Gradient descent is a method of minimizing any function by iteratively taking steps toward minima.

We want to minimize our loss function using gradient descent - minimizing \ell(\Theta) = \frac 1 n \sum \ell(q(x_i; \Theta), y_i) by iteratively computing \Theta_{t + 1} = \Theta_t - \eta_t \frac{\dee}{\dee \Theta_t} \ell(\Theta_t). Here, \Delta \ell(\Theta) is the **gradient**, and \eta_t is the **learning rate**.

The loss function should ideally get smaller as we get closer to a good answer, but really we just need a function \ell(x', x) (x' is the prediction, x is the true value) that's 0 when we have a correct answer, and positive otherwise.

If the function is continuous, the gradient is the derivative of the function. Note that since we're subtracting the gradient, the gradient actually points away from the direction we're optimizing toward.

The gradient at a particular node in the computation graph is known as a local gradient.

;wip: rewrite this section more clearly, talk about forward and backward pass

Backpropagation is a method of computing the gradient of \Theta with respect to \ell(\Theta). Suppose we have r = f(q(x)). By the chain rule, \frac{\dee r}{\dee x} = \frac{r}{\dee f} \frac{f}{\dee q} \frac{q}{\dee x}, so we can easily recursively differentiate to get the gradient.

Backpropagation is implemented by building a computation graph, differentiating at the output layer, and working our way back to the input layer. The innovation is that it's much faster than the naive way of computing the gradient, where we hold all but one variable constant and find the partial derivative.

Backpropagation is an expensive algorithm, and was the reason neural networks weren't practical for a long time, until we can cheap GPU-based computation.

Assignment 1 deadline extended to Thursday. In-class overview of the assignment.

For the winnow algorithm, normalizing is pretty important, since the step size depends on the maximum absolute value of the data, unlike the perceptron, where those initial parameters don't matter as much.

The winnow algorithm will always converge, regardless of the step size. This is because of the normalization step in the loop, where we divide by the sum of the weights and biases. However, certain step size values will minimize the number of passes before we get convergence.

Closed-form solution for alternating minimization on ridge regression:

We want to solve z^* = \argmin_z \frac 1 2 \magn{\vec X_{:j} z + \sum_{k \ne j} \vec X_{:k} w_k - \vec y}^2 + \lambda z^2.

Let \vec u = \vec X_{:j} and \vec v = \sum_{k \ne j} \vec X_{:k} w_k - \vec y, so z^* = \argmin_z \frac 1 2 \magn{\vec u z + \vec v}^2 + \lambda z^2 = \argmin_z \frac 1 2 \magn{\vec u}^2 z^2 + \vec u \cdot \vec v z + \frac 1 2 \magn{v}^2 + \lambda z^2 = \argmin_z (\frac 1 2 \magn{\vec u}^2 + \lambda) z^2 + \vec u \cdot \vec v z + \frac 1 2 \magn{v}^2.

Since \frac 1 2 \magn{v}^2 doesn't depend on z, it doesn't affect the value of \argmin_z. So z^* = \argmin_z (\frac 1 2 \magn{\vec u}^2 + \lambda) z^2 + \vec u \cdot \vec v z.

Clearly, \frac{\dee}{\dee z} \left(\left(\frac 1 2 \magn{\vec u}^2 + \lambda\right) z^2 + \vec u \cdot \vec v z\right) = \left(\magn{\vec u}^2 + 2\lambda\right)z + \vec u \cdot \vec v.

If we set the derivative to 0, then z = -\frac{\vec u \cdot \vec v}{\magn{\vec u}^2 + 2\lambda}. Since the equation was quadratic, this is the only local optima, so it is also the global minimum. Therefore, z is minimized at z = -\frac{\vec u \cdot \vec v}{\magn{\vec u}^2 + 2\lambda}.

Supose we have a perceptron classifier \hat y = \sgn(\vec x \cdot \vec w + b). The **decision boundary** \vec x \cdot \vec w + b = 0 is the decision boundary here, and we can do a lot with it. This is a **parametric** classifier, because it uses a finite-dimensional set of parameters \vec w. In contrast, a **non-parametric** classifier has an arbitrarily large number of parameters. The nearest-neighbor rule is non-parametric, because

The **nearest-neighbor rule** is a way to predict y given a feature vector \vec x and a training set. Given a feature vector \vec x (the **query point**), find the training set entry \tup{\vec x', y'} such that \vec x' is the nearest to \vec x by some distance metric, then predict that y = y'. Essentially, we take the nearest point to the given one, and simply take that point's value as the prediction.

The distance metric d(x_1, x_2) must be symmetic (d(x_1, x_2) = d(x_2, x_1)), definite (d(x, x) = 0), and satisfy the triangle equality (d(x_1, x_3) \le d(x_1, x_2) + d(x_2, x_3)). Some examples of distance metrics are the L_2 norm (Euclidean distance, square root of sum of squares), L_1 norm (Manhattan distance, sum of absolute values), and L_\infty norm (Chebyshev distance, the max value). In general, the L_p norm of a vector \vec x is (\sum_i \abs{x_i}^p)^{\frac 1 n}.

The nearest-neighbor rule doesn't need any training, but in general takes O(nd) space to store the entire training set (n is number of points, d is number of dimensions).

If we implement nearest-neighbor naively, it takes O(nd) time to get a prediction. We can do better by constructing a Voronoi diagram - a diagram where the space is partitioned into n areas, each of those n areas contains exactly one point, and the point is always the nearest neighbor within its associated area. In 2D, this takes O(n \log n) time and O(n) space. In general, this takes n^{O(d)} space and O(d \log n) query time, which is still rather bad. In the real world, we often use faster, approximate nearest-neighbor algorithms that take advantage of hashing and similar techniques.

Nearest-neighbor is affected by normalization if it distorts the distance metric, such as if we scale one dimension but not another. Usually, we normalize every feature vector by subtracting the mean of all the feature vectors and then dividing by the standard deviation of all the feature vectors.

Nearest-neighbor depends heavily on the chosen distance metric. There's actually a way to determine a good metric to use based on the training set, based on tweaking a parameterized distance function known as the Mahalanobias function.

Consider some linear transformation L of \vec x_1, \vec x_2: L\vec x_1, L\vec x_2 that transforms \vec x_1, \vec x_2 into a lower-dimensional space. Clearly, the Euclidean distance between them is d(\vec x_1, \vec x_2) = \sqrt{(L(\vec x_1 - \vec x_2))^T L(\vec x_1 - \vec x_2)} = \sqrt{(\vec x_1 - \vec x_2)^T L^T L (\vec x_1 - \vec x_2)}. Let M = L^T L, then d(\vec x_1, \vec x_2) = \sqrt{(\vec x_1 - \vec x_2)^T M (\vec x_1 - \vec x_2)}. This is called the **Mahalanobias function**, and is parameterized by a matrix M. It's essentially the Euclidean distance of the lower-dimensional transformations of the vectors.

We can then optimize M such that it forms a good distance metric (one that makes the loss function small). Specifically, that means making the distance d_M(\vec x_1, \vec x_2) small if \vec x_1 and \vec x_2 are in the same class, and large if they are not.

The k-nearest-neighbor is the neighbor rule, but instead of just finding the nearest neighbor, we find the k nearest neighbors in the training set, and then take their results and combine them, usually either taking the mode (majority vote) or mean of all those y values (usually k is odd, so for boolean y there is never a tie).

The k in k-nearest-neighbors sort of acts like a regularization parameter - the higher k is, the less complicated the resulting model is (the decision boundaries in the voronoi diagram are smoother). Intuitively, the higher k is, the more points we have to change to get the prediction for a query point to change.

We want to pick a k that avoids overfitting, but doesn't make the resulting predictions overly simple. Usually this is done by just trying a bunch of k values and evaluating each value using cross-validation.

k-nearest-neighbors works surprisingly well on problems with a low number of dimensions and a large training set, and is often used in real-world problems. It works less well with higher dimensionality or smaller training sets.

As the size of the training set tends to infinity, P(Y_1 \ne Y \mid X) \le 2P^* - \frac{c}{c - 1} (P^*)^2 (Cover & Hart, 1967). P(Y_1 \ne Y \mid X) is the probability of a misprediction, c is the number of categories/classes, and P^* is the Bayes error (the theoretical minimum error, for any machine learning model). Essentially, this is saying that 1-nearest-neighbor will have an error within twice the theoretical minimum error - a very useful result for bounding the error!

Review of the last class.

New notation: 1_{a = b} is an **indicator function** - a function that is 1 if a = b, and 0 otherwise. An indicator function can be used inside of expectations to replace probabilities: P(a = b) = E(1_{a = b}). We often use this trick to simplify probabilities.

The **Bayes error** is defined by P^* = \min_{f : \mathcal{X} \to \set{-1, 1}} P(f(X) \ne Y). Here, X is a feature vector, f(X) is the classifier's prediction, and Y is the true value. Essentially, it's the minimum error rate for any possible classifier f(X), regardless of how that classifier is implemented. For a binary classifier, we can write this as E(\eta(X) + 1_{f(X) = 1}(1 - 2 \eta(X))).

The **Bayes rule** is the best possible binary classifier, since it achieves a prediction error equivalent to the Bayes error. This can be written as f^*(X) = \begin{cases} 1 &\text{if } \eta(X) \ge \frac 1 2 \\ -1 &\text{otherwise} \end{cases} where \eta(X) = P(Y = 1 \mid X).

Proof of optimality:

Clearly, P(f(X) \ne Y) = 1 - P(f(X) = Y). We don't know Y, but we do know it's either -1 or 1, so P(f(X) \ne Y) = 1 - P(f(X) = 1 \land Y = 1) - P(f(X) = -1 \land Y = -1).

We now have unknown X and unknown Y. We're going to fix X and look at the probability of Y, a technique known asconditioning. This works because P(f(X) = 1, Y = 1) = \int_{f(x) = 1, y = 1} P(x \land y) \dee x \dee y = \int_{f(x) = 1, y = 1} P(y \mid x) P(x) \dee x \dee y = \int_{f(x) = 1} \left(\int_{y = 1} P(y \mid x) \dee y \right) \dee x = E(1_{f(X) = 1} P(Y = 1 \mid X)).

So P(f(X) \ne Y) = 1 - P(f(X) = 1, Y = 1 \mid X) = 1 - E(1_{f(X) = 1} P(Y = 1 \mid X)) - E(1_{f(X) = -1} P(Y = -1 \mid X)) - we conditioned on X, then took the expectation.

Let \eta(X) = P(Y = 1 \mid X), so P(Y = -1 \mid X) = 1 - \eta(X). Also, 1_{f(X) = -1} = 1 - 1_{f(X) = -1}.

So P(f(X) \ne Y) = 1 - E(1_{f(X) = 1} \eta(X)) - E((1 - 1_{f(X) = 1}) (1 - \eta(X))) = E(-1_{f(X) = 1} \eta(X) + \eta(X) + 1_{f(X) = 1} - 1_{f(X) = 1} \eta(X)).

So P(f(X) \ne Y) = E(\eta(X) + 1_{f(X) = 1}(1 - 2 \eta(X))).

Note that \eta(X) is already defined, but we can choose our own 1_{f(X) = 1}. Since we want to minimize P(f(X) \ne Y), we want 1_{f(X) = 1} to be 1 exactly when 1 - 2 \eta(X) \le 0. This tells us that the optimal classifier f^* is f^* = \begin{cases} 1 &\text{if } \eta(X) \ge \frac 1 2 \\ -1 &\text{otherwise} \end{cases}, which is intuitively obvious - say true if it's more than 50% likely to be true!

Since we won't have \eta(X) = P(Y = 1 \mid X) in most real-world problems, it's generally not possible to implement this in practice.

For more than two classes, we can do something similar to get f^*(X) = \argmax_{1 \le m \le c} P(Y = m \mid X), and the new Bayes rule would be P^* = E(1 - \max_{1 \le m \le c} P(Y = m \mid X)). This is the best we can possibly do assuming the data is sampled independently and identically from \tup{X, Y}.

Note that in the worst case, P^* = \frac{c - 1}{c}. So the more classes we have, the larger the maximum possible Bayes error is.

So looking back at the formula we saw last class, as the training set size goes to infinity for 1-nearest-neighbor, P(Y_1 \ne Y \mid X) \le 2P^* - \frac{c}{c - 1} (P^*)^2. So in the worst case, 1-nearest-neighbor will be less than twice as worse at the Bayes error. For this to be roughly true in practice though, n must grow exponentially with respect to d, which is generally not practical.

The error rate for k-nearest-neighbor for some k = 2t + 1 is \frac 1 {2^n} \sum _{0 \le i \le t} {n \choose i} (this simplifies to \frac 1 {2^n} for 1-NN). This tells us that a larger k is actually going to give us a higher error in the worst case. Consider how this worst case occurs, when all the points are in random classes.

We want to use a smaller k when the classes are easier to separate, and a larger k when the classes tend to be harder to separate - larger values for "harder" problems, smaller for "easier" problems. We usually choose k via cross-validation.

SSBD page 224: For any c > 1 and any learning algorithm L, there exists a distribution \set{0, 1}^d \times \set{0, 1} such that the Bayes error is 0 but for sample sizes n \le \frac{(c + 1)^d}{2}, the error probability for L is greater than \frac 1 4. Essentially, there always exists a distribution such that a given learning algorithm will be wrong 25% or more of the time, but an ideal classifier would get it perfectly, with a limited training set size.

This is especially applicable to k-NN, and we often try to avoid this by increasing training set size and doing dimensionality reduction (e.g., projecting a higher dimension into a lower dimension while preserving the most important elements of the features). One common way to perform dimensionality reduction is using principal component analysis (PCA).

For regression, we can also use something similar to k-nearest-neighbors. Instead of the training set points having discrete classes, they instead have real values. Essentially, given a feature vector \vec x, we take the k-nearest-neighbors in the training set, and average their y values to get the prediction y'. We might additionally do things like do a weighted average based on their distance to \vec x.

Recall that the naive perceptron algorithm assumes that the data is linearly separable, and requires that to be true to converge. Furthermore, the perceptron will find any separating hyperplane, depending on how we feed in the data, rather than what we might consider the "best" separating hyperplane.

The perceptron is solving the minimization problem "minimize 0 subject to y_i(\vec w^T \vec x_i + b) > 0 for all i" - there's no objective function, so we're just finding a feasible solution. The SVM also finds a feasible solution, but also tries to maximize the **margin** - the minimum distance from the hyperplane to any training set point.

Given a separating hyperplane H defined by \vec w \cdot \vec x + b = 0, we want to translate and rotate it until the margin is maximized. First we'll normalize the scale: \frac{\vec w}{\magn{w}} \cdot \vec x + \frac{b}{\magn{\vec w}} = 0. We translate the hyperplane by changing b, and rotate it by changing \vec w. Suppose we translate it upward by s until it hits a point, and downward by t until it hits a point.

Now we have the hyperplanes H_1 defined by \frac{\vec w}{\magn{w}} \cdot \vec x + \frac{b}{\magn{\vec w}} = t and H_{-1} defined by \frac{\vec w}{\magn{w}} \cdot \vec x + \frac{b}{\magn{\vec w}} = s, both touching points and having an empty space between them.

Overview of assignment 2. For question 1, don't print out the distance matrix, since it's too large. Also, the pseudocode isn't very efficient, and you're expected to optimize things, like turning loops into NumPy array operations.

Clearly, the distance between H and H_1, H_{-1} is \min \set{s, t}. Let H_0 be the hyperplane exactly in between H_1 and H_{-1}. Clearly, if d is the distance between the values, then we have H_1 = \vec w \cdot \vec x + b = d and \vec w \cdot \vec x + b = -d. If we scale \vec w and b by dividing them by d, we then get \vec w \cdot \vec x + b = 1 and \vec w \cdot \vec x + b = -1.

Now let's scale the formulas by dividing by \magn{w}, so that we have H_0: \frac{\vec w}{\magn{w}} \cdot \vec x + \frac{b}{\magn{\vec w}} = 0 and H_1: \frac{\vec w}{\magn{w}} \cdot \vec x + \frac{b}{\magn{\vec w}} = \frac 1 {\magn{w}}. Now we've ensured that \magn{\vec w} = 1.

Clearly, we can rotate the entire coodinate space without changing the relative distances between hyperplanes and points. Suppose we rotate the coordinate system such that \vec w = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} - the only weight component is the rotated first component of \vec x. Then we get the minimum distance between H_0 and either H_1 or H_{-1} as x_1 + \frac{b}{\magn{\vec w}} = \frac 1 {\magn{w}}.

Our goal with hard-margin SVM is to compute \vec w, b = \argmax_{\vec w, b: \forall i, y_i(\vec w \cdot \vec x_i + b) \ge 1)} \frac 1 {\magn{\vec w}}. Intuitively, we're maximizing the distance between the hyperplane H_0 and the closest point by changing H_0.

Now we're going to make it more like what textbooks show this as, using these identities: \max f(\vec w) = -\min -f(\vec w) for any function f, \max \frac 1 {f(\vec w)} = \frac 1 {\min f(\vec w)} for positive function f, and \min f(\vec w) is equivalent to \min g(f(\vec w)) for any function f and strictly monotone function g (i.e., the minimums occur at the same values of \vec w, even though the actual minimum values are different).

Clearly, \max_{\vec w,b: \forall i, y_i(\vec w \cdot \vec x_i + b) \ge 1)} \frac 1 {\magn{\vec w}} = \min_{\vec w,b: \forall i, y_i(\vec w \cdot \vec x_i + b) \ge 1)} \frac 1 2 \magn{w}^2, since g(x) = x^2 is strictly monotonic. Then, the margin is \frac{1}{\magn{\vec w}}.

Note that this formulation is essentially: "minimize \frac 1 2 \magn{vec w}^2 subject to y_i(\vec w^T \vec x_i + b) > 0 for all i". Additionally, the minimum \vec w and b always exist, and are always unique - there's exactly one value for each! \vec w is unique because \magn{\vec w}^2 is strongly convex, and b is unique because for the unique \vec w value, there's only one possible value of b that will make \vec w \cdot \vec x_i + b = 0 for all x_i.

The **support vectors** are a subset of the vectors we hit first when we move the hyperplane back and forth - a subset of the points that are on H_{-1} and H_1. Importantly, the support vectors entirely determine the SVM's solution, and there are usually a very small number of them, compared to how many points there are overall. For 2D SVM, for example, there will usually be only 3 or 4 of them in most problems.

We're going to now look at the Lagrangian dual of the problem - sort of like the dual of a linear program. Consider the primal problem: "minimize \frac 1 2 \magn{\vec w}^2 subject to y_i(\vec w^T \vec x_i + b) > 0 for all i". Clearly, this is equivalent to \min_{\vec w, b} \max_{\alpha \ge 0} \frac 1 2 \magn{w}^2 - \sum \alpha_i(y_i(\vec w \cdot \vec x_i + b) - 1). Here, \vec w and b are the primal variables, and \alpha_i are the Lagrangian multipliers/dual variables. This problem is equivalent to the primal problem - the resulting \vec w and b values are the same.

The Lagrangian variables can be thought of as "adversarial" - the \max_{\alpha \ge 0} is trying to make the \min_{\vec w, \vec b} result as bad as possible, and it does this by changing \alpha depending on whether y_i(\vec w \cdot \vec x_i + b) - 1 is positive or negative. Ideally we want y_i(\vec w \cdot \vec x_i + b) - 1 to be non-negative, since then \max_{\alpha \ge 0} has an upper bound.

**Fermat's theorem** says that at \min_x f(x), \frac{\dee}{\dee x} f(x) = 0.

The **minimax theorem** says that given compact convex sets X, Y\ subseteq \mb{R}^n and a function f: X \times Y \to \mb{R} such that f(\vec x, \vec y) is convex if we hold \vec y constant and concave if we hold \vec x constant, then \min_{\vec x \in X} \max_{\vec y \in Y} f(\vec x, \vec y) = \max_{\vec y \in Y} \min_{\vec x \in X} f(\vec x, \vec y). In other words, under certain conditions we can swap the min and max. Intuitively, this represents the idea that if the minimizer and maximizer are players in a zero-sum game using the minimax strategy, then they are in a Nash equilibrium, so swapping them doesn't change the final result.

By the minimax theorem, \min_{\vec w, b} \max_{\alpha \ge 0} \frac 1 2 \magn{w}^2 - \sum \alpha_i(y_i(\vec w \cdot \vec x_i + b) - 1) = \max_{\alpha \ge 0} \min_{\vec w, b} \frac 1 2 \magn{w}^2 - \sum \alpha_i(y_i(\vec w \cdot \vec x_i + b) - 1). Note that there's no constraints on \vec w and b like we had for the primal problem (where we had \forall i, y_i(\vec w \cdot \vec x_i + b) \ge 1 constraint). Therefore, we can easily find the minimum by taking the derivative with respect to \vec w and \vec b.

Clearly, \frac{\dee}{\dee b} \left(\frac 1 2 \magn{w}^2 - \sum \alpha_i(y_i(\vec w \cdot \vec x_i + b) - 1)\right) = \sum \alpha_i y_i. Clearly, \frac{\dee}{\dee \vec w} \left(\frac 1 2 \magn{w}^2 - \sum \alpha_i(y_i(\vec w \cdot \vec x_i + b) - 1)\right) = \vec w - \sum \alpha_i y_i \vec x_i.

By Fermat's theorem, we set the derivatives to 0 to find candidates for the minimum value. So let \sum \alpha_i y_i = 0 and \vec w - \sum \alpha_i y_i \vec x_i = 0. Now, we have \sum \alpha_i y_i \vec x_i = \vec w and \sum \alpha_i y_i = 0.

If we solve this, it gives us now the **Lagrangian dual problem**: \max_{\alpha \ge 0} \left(\sum \alpha_i - \frac 1 2 \magn{\sum \alpha_i y_i \vec x_i}^2\right) subject to \sum \alpha_i y_i = 0. We can expand this and rearrange to get \min_{\alpha \ge 0} \left(\frac 1 2 \sum_i \sum_j \alpha_i \alpha_j y_i y_j \vec x_i \cdot \vec x_j - \sum_k \alpha_k\right) such that \sum a_i y_i = 0. This is the **dual hard-margin SVM problem**, and we sometimes also write it as "\min_{\alpha \ge 0} \left(\frac 1 2 \magn{\sum \alpha_i y_i \vec x_i}^2 - \sum \alpha_i\right) subject to \sum \alpha_i y_i = 0".

Note that this dual problem depends on the number of data points n, rather than the number of dimensions d, like the primal problem. So for high-dimensional, low-data-points problems, we generally want to solve the dual problem rather than the primal.

The dual problem also gives us a new interpretation of support vectors - a support vector is any input feature vector \vec x_i such that \alpha_i > 0. Since \vec w = \sum \alpha_i y_i \vec x_i and all non-support vectors will have their corresponding \alpha_i = 0, the resulting weight only depends on the values of the support vectors.

A convex set is defined in the same way as for CO250 - a set where any weighted average of any two points in C is also in C (the set is closed under weighted averaging). The **convex hull** of a set of points C is denoted \operatorname{ch}(C) = \set{\sum \alpha_i \vec x_i : \vec x_i \in C, \alpha_i \ge 0, \sum_j \alpha_j = 1}. Essentially, it's the set of all linear combinations of the points within the bounds of the set, or the area enclosed if we connect every pair of points in C with a line segment.

While the primal hard-margin SVM problem can be thought of as finding the maximum-margin hyperplane, the dual hard-margin SVM problem can be thought of as finding the points on the convex hulls of the two classes that minimize the distance between the convex hulls of the two classes. Let's take a look at why this interpretation makes sense.

We usually regularize minimization problems of the form \min_W \ell(W) using something like \min_W \ell(W) + r(W), where r(W) is some function that evaluates how not "simple" W is. There always exists a C \in \mb{R} such that \min_W \ell(W) + r(W) is equivalent to the **constrained** version \min_{W: r(W) \le C} \ell(w). The converse isn't always true, so \min_{W: r(W) \le C} \ell(w) doesn't necessarily have an equivalent regularized form, but it is usually true for things we'll look at in this course. We'll usually use the regularized version in practice since it's easier to compute, but the constrainted form has statistical properties that are more useful for theoretical analysis.

Note that the dual hard-margin SVM problem "\min_{\alpha \ge 0} \left(\frac 1 2 \magn{\sum \alpha_i y_i \vec x_i}^2 - \sum \alpha_i\right) subject to \sum \alpha_i y_i = 0" is of the form \min_W \ell(W) + r(W) where r(\vec \alpha) = -\sum \alpha_i. We can therefore choose a C so that the problem becomes equivalent to "\min_{\alpha \ge 0} \frac 1 2 \magn{\sum \alpha_i y_i \vec x_i}^2 subject to \sum \alpha_i y_i = 0 and \sum \alpha_i = C".

If we let \alpha_i' = \frac 2 C \alpha, the SVM problem then becomes \min_{\vec \alpha' \ge 0} \frac 1 2 \magn{\sum \alpha_i' y_i \vec x_i}^2 subject to \sum \alpha_i' y_i = 0 and \sum \alpha_i' = 2.

Let P = \set{i : y_i = 1}, N = \set{i : y_i = -1}. Then \sum \alpha_i' y_i = \sum_{i \in P} \alpha_i' + \sum_{i \in N} \alpha_i'. Since \sum \alpha_i' y_i = 0 from the constraints, \sum_{i \in P} \alpha_i' + \sum_{i \in N} \alpha_i' = 0, so \sum_{i \in P} \alpha_i' = \sum_{i \in N} \alpha_i'. Since \sum \alpha_i' = 2 from the constraints, it must be that \sum_{i \in P} \alpha_i' = \sum_{i \in N} \alpha_i' = 1. We then get the SVM problem formulated as "minimize \frac 1 2 \magn{\sum_{i \in P} \alpha_i' \vec x_i - \sum_{i \in N} \alpha_i' \vec x_i} (note that we flipped the sign because y_i = -1 for all i \in N) subject to \sum_{i \in P} \alpha_i' = 1 and \sum_{i \in N} \alpha_i' = 1".

Let \vec u be the vector where u_i = \alpha_i' for all i \in P and u_i = 0 otherwise, and let \vec v be the vector where v_i = \alpha_i' for all i \in N and v_i = 0 otherwise. Clearly, \sum u_i = \sum_{i \in P} \alpha_i' and \sum v_i = \sum_{i \in N} \alpha_i'. We can then write the dual hard-margin SVM problem as "minimize \frac 1 2 \magn{\sum u_i \vec x_i - \sum v_i \vec x_i} subject to \sum u_i = 1 and \sum v_i = 1".

We can think of P as the set of all positive-classed points, and N as the set of all negative-classed points. We can then think of \sum_{i \in P} \alpha_i' \vec x_i where \sum_{i \in P} \alpha_i' = 1 as an arbitrary point within the convex hull of all positive-classed points, and \sum_{i \in N} \alpha_i' \vec x_i where \sum_{i \in N} \alpha_i' = 1 as an arbitrary point within the convex hull of all negative-classed points.

This also gives us a geometric interpretation of support vectors: the support vectors are a set of points on the surface of each convex hull such that, when we take average of them weighted by \vec u (if this is the positive-classed convex hull) or \vec v (if this is the negative-classed convex hull), we get the closest point to the other convex hull. That is why not all points on the hyperplanes H_1 and H_{-1} are necessarily support vectors - they might not be needed in the weighted average to form the closest point to the other convex hull.

The distance between those points is then \magn{\sum_{i \in P} \alpha_i' \vec x_i - \sum_{i \in N} \alpha_i' \vec x_i}^2 where \sum_{i \in P} \alpha_i' = 1 and \sum_{i \in N} \alpha_i' = 1. If we minimize this distance, we get two points, one in each convex hull, of minimum distance apart.

Recall that \sum \alpha_i y_i \vec x_i = \vec w, from the derivation of the dual hard-margin SVM problem. This is equivalent to \vec w = \frac 2 C \left(\sum_{i \in P} \alpha_i' \vec x_i - \sum_{i \in N} \alpha_i' \vec x_i\right). In other words, the normal of the hyperplane is the difference between the two points within the convex hull. If we do the same for b, we can determine that the resulting hyperplane H_0 must pass through the midpoint of the points in those two convex hulls, normal to their difference vector (this is a bit more complicated, though).

Note that \sum_{i \in P} \alpha_i' \vec x_i and \sum_{i \in N} \alpha_i' \vec x_i are not necessarily the extreme points of the convex hulls, so we can't just try all the pairs of extreme points on both convex hulls. This is a common misconception when learning about the dual SVM formulation.

More overview of assignment 2. The pseudocode in the assignments is only meant as a hint - we don't have to follow it if we can think of a better way to do the same thing. Question 3.1 is meant to show that we don't actually need to have a margin hyperparameter, and that it doesn't add any flexibility.

What if the data isn't linearly separable? If this is the case, any hyperplane will always misclassify at least one of the points. Additionally, some of the data points might end up too close to the hyperplane, so the maximum margin is too small to make a good classifier.

The hard-margin SVM actually came 30 years before soft-margin, which was invented in the 1990s. Originally, they were called soft-margin networks.

For these cases, we can use a **soft-margin SVM**.

To do this, we change the hard constraints into soft constraints: where the primal problem for hard-margin was "minimize \frac 1 2 \magn{vec w}^2 subject to y_i(\vec w^T \vec x_i + b) > 0 for all i", the soft-margin problem is now \min_{\vec w, b} \frac 1 2 \magn{w}^2 + C \sum (1 - y_i \hat y_i)_+. The lack of hard constraints make this easier to analyse.

New notation: (f(x))_+ = \max(f(x), 0), the subscript plus is called the **hinge loss function**, because the graph looks like a door hinge on its side.

Here, C \sum (1 - y_i \hat y_i)_+ are the **soft constraints**, \sum (1 - y_i \hat y_i)_+ is the **training error**, C is the cost parameter (which we tune as a hyperparameter), and \hat y_i = \vec w \cdot \vec x_i + b is the hyperplane's prediction.

Note that (1 - y_i \hat y_i)_+ has 3 main behaviours when we plot it with respect to y_i \hat y_i. When the prediction \hat y_i is on the same side of the hyperplane as the true value y_i and far enough from the hyperplane to be outside of the margin, the value is 0. When the prediction is on the same side as the true value, but within the margin (so 0 \le y_i \hat y_i < 1), we get a little bit of loss (even though the prediction is correct, we still lose a bit because we aren't confident enough in our prediction). Otherwise, we're wrong and get even more loss.

Other loss functions are also in common use besides (1 - y_i \hat y_i)_+, such as zero-one -\sgn(y_i \hat y_i), squared-hinge ((1 - y_i \hat y_i)_+)^2 (this is useful because it happens to be differentiable everywhere), exponential \exp(-y_i \hat y_i), and logistic loss \ln(1 + \exp(-y_i \hat y_i)). ;wip: advantages/disadvantages of both

Note that \frac 1 2 \magn{w}^2 is trying to maximize the hyperplane margin, and C \sum (1 - y_i \hat y_i)_+ is trying to maximize the correct predictions. Since these goals are not always compatible, the C parameter adds balance between these two goals. In other words, we can make wrong prediction in return for a larger margin, and we can reduce the margin in return for more wrong predictions.

A convex loss function \ell(x) is **classification-calibrated** if and only if it's differentiable at 0 and \frac{\dee}{\dee x} \ell(x) < 0. This doesn't tell us what loss function to use in practice, but it does tell us which ones we shouldn't use. Classification calibration means that \argmin_a E(\ell(Y a) \mid (X = x)) has the same sign as 2 \eta(x) - 1. If we minimize that, we get the Bayes rule \eta(x) \ell(a) + (1 - \eta(x))\ell(a).

Now let's look at the Lagrangian dual. First, notice that \min_x f(x) is the same as \min_{x, t} t such that f(x) \le t, because we always want to choose a t = f(x) - this is a useful trick from optimization. Consider now the primal: \min_{\vec w, b} \frac 1 2 \magn{w}^2 + C \sum (1 - y_i \hat y_i)_+. To do the minimization, we want to differentiate, but we can't differentiate the soft constraint since the hinge loss function isn't always differentiable.

Instead, we'll use that trick we just saw to write the primal problem as follows: \min_{\vec w, b, } \frac 1 2 \magn{w}^2 + C \sum \xi_i subject to (1 - y_i \hat y_i)_+ \le \xi_i for all i. We've now written the soft constraints as hard constraints, and \xi_i are called **slack variables**, which we're trying to minimize. The \le in these constraints is necessary in order to ensure that the problem is still convex (that's out of the score of this couse, though).

Each (1 - y_i \hat y_i)_+ \le \xi constraint can actually be written as two constraints, 1 - y_i \hat y_i \le \xi_i and 0 \le \xi_i. Note that each \xi_i depends on \vec w and b. Now we'll write this in the form f_i(\ldots) \le 0: 1 - y_i \hat y_i - \xi_i \le 0 and -\xi_i \le 0. This is because f_i(\ldots) must have a maximum value at f_i(\ldots) = 0.

Now we'll add the Lagrangian maximizer and move the constraints into the objective function: \min_{\vec w, b, \vec \xi} \max_{\alpha \ge 0, \beta \ge 0} \frac 1 2 \magn{\vec w}^2 + \sum \left(C \xi_i + \alpha_i (1 - y_i \hat y_i - \xi_i) + \beta_i (-\xi_i)\right). By the minimax theorem, we can swap the minimizer and maximizer, so \max_{\alpha \ge 0, \beta \ge 0} \min_{\vec w, b, \vec \xi} \frac 1 2 \magn{\vec w}^2 + \sum \left(C \xi_i + \alpha_i (1 - y_i \hat y_i - \xi_i) + \beta_i (-\xi_i)\right).

Now we take the derivatives of \frac 1 2 \magn{\vec w}^2 + \sum \left(C \xi_i + \alpha_i (1 - y_i \hat y_i - \xi_i) + \beta_i (-\xi_i)\right) with respect to \vec w, b, \vec \xi: \frac{\dee}{\dee b} (\ldots) = \sum \alpha_i y_i, \frac{\dee}{\dee \vec w} (\ldots) = \vec w - \sum \alpha_i y_i \vec x_i, and \frac{\dee}{\dee \xi_i} (\ldots) = C - \alpha_i - \beta_i. Just like with hard-margin SVM, we'll set the derivatives to 0 to get \sum \alpha_i y_i = 0, \vec w = \sum \alpha_i y_i \vec x_i, and C = \alpha_i + \beta_i (where \alpha_i \le C since \beta_i \ge 0).

Clearly, \min_{\vec w, b, \vec \xi} \frac 1 2 \magn{\vec w}^2 + \sum \left(C \xi_i + \alpha_i (1 - y_i \hat y_i - \xi_i) + \beta_i (-\xi_i)\right) = \frac 1 2 \magn{\sum \alpha_i y_i \vec x_i}^2 + \sum \left((\alpha_i + \beta_i) \xi_i + \alpha_i - \alpha_i y_i \hat y_i - \alpha_i \xi_i - \beta_i \xi_i\right) = \frac 1 2 \magn{\sum \alpha_i y_i \vec x_i}^2 + \sum \alpha_i (1 - y_i (\vec w \cdot \vec x_i + b)) = \frac 1 2 \magn{\sum \alpha_i y_i \vec x_i}^2 + \sum \alpha_i - \sum \alpha_i y_i (\sum \alpha_i y_i \vec x_i) \cdot \vec x_i - b \sum \alpha_i y_i = \frac 1 2 \sum_i \sum_j \alpha_i \alpha_j y_i y_j \vec x_i \cdot \vec x_j + \sum \alpha_i - \sum_i \sum_j \alpha_i \alpha_j y_i y_j \vec x_i \cdot \vec x_j - b \sum \alpha_i y_i = -\frac 1 2 \sum_i \sum_j \alpha_i \alpha_j y_i y_j \vec x_i \cdot \vec x_j + \sum \alpha_i - b \sum \alpha_i y_i where \alpha_i \le C.

Since \sum \alpha_i y_i = 0, that's equivalent to \sum \alpha_i - \frac 1 2 \sum_i \sum_j \alpha_i \alpha_j y_i y_j \vec x_i \cdot \vec x_j where \alpha_i \le C. We then get the **dual soft0margin SVM problem** "\max_{C \ge \alpha \ge 0, \beta \ge 0} \sum \alpha_i - \frac 1 2 \sum_i \sum_j \alpha_i \alpha_j y_i y_j \vec x_i \cdot \vec x_j subject to \sum \alpha_i y_i = 0".

What is the effect of C? If C is infinity, we simply get the hard-margin SVM, since the dual formulations become identical. If C is 0, \vec \alpha = 0 and we get \vec w = \vec 0 and an undefined b. Note that the only difference between the Lagrangian dual of hard-margin and soft-margin SVM is that the value of \alpha has the upper bound C - a very elegant change.

In general, to take the Lagrangian, we rewrite every constraint i in the form f_i(\ldots) \le 0 (for a minimization primal problem) or f_i(\ldots) \ge 0 (for a maximization primal problem), and then add \alpha_i f_i(\ldots) to the objective function. Then, we add a maximizer (for a minimization primal problem) or minimizer (for a maximization primal problem) before the original minimizer/maximizer, swap the new minimizer/maximizer with the original using the minimax theorem. We then take the derivative of the original minimizer/maximizer's value with respect to each of the original minimizer/maximizer variables, set the derivatives to 0, and solve to get a closed-form solution to the original minimizer/maximizer, in terms of \alpha. We then substitute that solution for the original minimizer/maximizer to get the Langrangian dual.

Clearly, P(\sgn(f(X)) \ne Y) = E[1_{-y f(X) \ge 0}] (X and Y are random variables from unknown distibution) when we're using zero-one hinge loss. Since we can't minimize E[1_{-Y f(X) \ge 0}] directly, we use \frac 1 n \sum_i 1_{-y_i \hat y_i \ge 0} instead, where y_i are the training set Y values and \hat y_i are the training set predictions. We're actually interested in the zero-one loss, but we have to use the hinge loss in practice due to needing a good gradient.

According to the Bayes rule, \argmin_a E[\ell(Ya) \mid X = x] has the same sign as 2 \eta(x) - 1. SVM tries to estimate the sign of 2 \eta(x) - 1, rather than the value of 2 \eta(x) - 1 instead, which is harder because \eta(x) is hard to find.

The **KKT conditions**: given primal constraints on \vec w, b, \xi, (1 - y_i \hat y_i)_+ \le \xi_i and dual constraints \alpha \ge 0, \beta \le 0, we want to combine the constraints from both problems. the KKT conditions are the primary constraints, dual constraints, complementary slackness constraints, and stationarity constraints.

The complementary slackness constraints are \alpha_i(1 - y_i \hat y_i - \xi_i) = 0 and \beta_i \xi_i = 0. Essentially, we added variables such that at the optimal values of the Lagrangian maximizer, the complementary slackness constraints are satisfied.

;wip: do partial derivatives: from this we get \vec w = \sum \alpha_i y_i \vec x_i. If \alpha_i < C, then \xi_i = 0, so y_i \hat y_i \ge 1, so \vec x_i is not within the margin of the hyperplane. If \alpha_i > 0, then 1 - y_i \hat y_i - \xi_i = 0 and y_i \hat y_i = 1 - \xi_i \le 1, so \vec x_i is within the margin of the hyperplane.

Note that the converses are not necessarily true, so it is possible for \vec x_i to be on the hyperplane while \alpha_i is 0 - **points can be on the hyperplane without actually being support vectors**. Most people will define support vectors to be those that have non-zero \alpha_i. For example, this might occur if we have more than enough

In summary: if \alpha_i = 0, point is not on hyperplane or within margin. If 0 < \alpha_i < C, point is on one of the margin's two hyperplanes. If \alpha_i = C, point is between margin's two hyperplanes.

To recover \vec w, b from \alpha_i: we can get \vec w using the \vec w = \sum \alpha_i \xi_i \vec x_i formula, Take any i such that C > \alpha_i > 0 (these points don't always exist, because for small values of C we might be able to shift the hyperplane without changing the objective value function), so \vec x_i is on the hyperplane and 1 - y_i \hat y_i = 0, then use this to solve for b. We can also recover \xi_i after this by setting (1 - y_i \hat y_i)_+ = \xi_i, since those values must be equal at the optimal value.

The primal problem finds the maximum margin, while the dual problem finds the minimum distance between convex hulls.

We can also just solve the primal problem using gradient descent: \vec w_{t + 1} = \vec w_t - \eta_t \Delta \ell(\vec w_t), where \eta_t is the learning rate. There's also **stachastic gradient descent**, where we only compute the gradient for a randomly chosen point,

Recall that the hinge loss is \ell(x) = \max(1 - x, 0). Perceptron is more like \ell(x) = \max(-x, 0), which is not classification calibrated - if we have non-linearly-separable data, perceptron is not good for this use case.

In the early days, we didn't know how to solve the soft-margin SVM dual. Initially, we just dropped the \vec \alpha \cdot \vec y = 0 constraint, which doesn't give us the optimal hyperplane but gives a deccently good result. Modern algorithms exist that can derive a closed-form update for pairs of \alpha_p, \alpha_q, sort of like how we optimized one element at a time for the lasso algorithm in assignment 1.

Gradient descent was first mentioned by Cauchy in 1847, and proven to converge by Haskell Curry in 1944. Stochastic Gradient Descent was proposed by Robbins and Monro in 1951,

For multiclass, we have new SVM formulations, such as this one: \min_W \frac 1 2 \magn{W}_F^2 such that \forall i, forall k \ne y_i, \vec i \cdot \vec w_{y_i} \ge 1 + \vec x_i \cdot \vec w_k. Essentially, \vec i \cdot \vec w_{y_i} \ge 1 + \vec x_i \cdot \vec w_k is saying that we want the prediction for the correct class to be stronger than 1 plus the prediction for each wrong class.

Unlike perceptron, we didn't have to reduce it to binary problems first. The soft-margin version is similar, and requires a lot more calibration theory to understand.

SVM can also be used for regression, my modifying the formula a bit: \min_{\vec w, b} \frac 1 2 \magn{\vec w}^2 + C \sum (\abs{y - \hat y} - \epsilon)_+, using the epsilon-insensitive loss function (\abs{y - \hat y} - \epsilon)_+, which doesn't penalize small mistakes.

We can also train SVM in parallel, by partitioning the data, training the SVM on each machine, compute the center of each training set partition, and for each test data point, find the nearest center, and use the corresponding SVM to make the prediction. This is sort of like k-nearest-neighbors but with the prediction augmented by SVM

Final exam on December 15, 12:30pm-3pm. Project proposals due tonight.

Recall the XOR problem - no linear classifier can learn to classify points the same way as the XOR function y = \prod \sgn(x_i) for any \vec x.

One way we might try to get around this is to use a quadratic classifier: \hat y = \sgn(\vec x^T Q \vec x + \sqrt 2 \vec x^T \vec p + \gamma). We only need to learn the weights Q \in \mb{R}^{d \times d}, \vec p \in \mb{R}^d, \gamma \in \mb{R} to train this.

To make training simpler, we're going to map \vec x into a simpler space. Let \phi(\vec x) = \begin{bmatrix} x_1 \vec x \\ \vdots \\ x_n \vec x \\ \sqrt 2 \vec x \\ 1 \end{bmatrix}, called the **feature transform**. Let \vec w = \begin{bmatrix} Q \\ \vec p \\ \gamma \end{bmatrix}. Both of these are square matrices of side length d^2 + d + 1. Now, note that \hat y = (\vec x^T Q \vec x + \sqrt 2 \vec x^T \vec p + \gamma \ge 0) is equivalent to \vec w \cdot \phi(\vec x) \ge 0 - we've converted the quadratic classification problem into a linear classifier in a different space!

In fact, most classifiers can be replaced by a linear one. It's a technique known as **lifting** - when the data isn't linearly separable, we can transform the feature vectors into a different space, then using linear classifiers on that other space, since linear classifiers are simple to train and are well-understood. This is very powerful, because linear classifiers are well-understood, and we don't need to study any nonlinear ones.

For example, we might train an ellipsis classifier \hat y = \sgn((w_1 x_1)^2 + (w_2 x_2)^2 - r^2) by lifting it into the space \phi(x) = \begin{bmatrix} x_1^2 \\ x_2^2 \end{bmatrix}.

The **curse of dimensionality** is the fact that lifting will often dramatically increase the number of dimensions, which makes algorithms like SVM and similar a lot harder. However, note the Lagrangian dual of hard-margin and soft-margin SVM only depends on the dot product of two points. Additionally, \phi(\vec x_1) \cdot \phi(\vec x_2) = (\vec x_1 \cdot \vec x_2)^2 + 2\vec x_1 \cdot \vec x_2 + 1 = (\vec x_1 \cdot \vec x_2 + 1)^2 - the d^2 + d + 1-dimensional dot product can actually be computed using only a d-dimensional dot product!

When doing lifting, we were finding \phi(\vec x) and then often computing \phi(x_1) \cdot \phi(x_2) in algorithms like SVM. Instead of defining a \phi and then computing the dot product, we can define a **reproducing kernel function** \kappa: \mb{R}^d \times \mb{R}^d \to \mb{R}, usually just called a **kernel**, such that \kappa(\vec x_1, \vec x_2) = \phi(\vec x_1) \cdot \phi(\vec x_2).

As it turns out, it's often a lot easier and efficient to compute \kappa than \phi - as long as \kappa can be computed efficiently, we don't need to care how many dimensions its corresponding \phi has, or even whether it can be computed at all. In fact, kernels don't even have to be over vectors - we can define kernels over anything, and the kernels just need to measure the similarity between two objects.

Some common kernels are:

- Polynomial: \kappa(\vec x_1, \vec x_2) = (\vec x_1 \cdot \vec x_2)^p or \kappa(\vec x_1, \vec x_2) = (\vec x_1 \cdot \vec x_2 + 1)^p - this is usually used for polynomial regression/classification.
- Gaussian: \kappa(\vec x_1, \vec x_2) = \exp(-\magn{\vec x_1 - \vec x_2}^2 / \sigma) - this uses the density function of a Gaussian distribution, and is radially symmetric. It doesn't work that well in practice, but people often use it because it's convenient.
- Laplace: \kappa(\vec x_1, \vec x_2) = \exp(-\magn{\vec x_1 - \vec x_2} / \sigma).
- Matern: \kappa(\vec x_1, \vec x_2) = \frac{1}{2^{v - 1} \Gamma(v)} (2 \sqrt{v} \magn{\vec x_1 - \vec x_2} / \theta)^v H_v(2 \sqrt{v} \magn{\vec x_1 - \vec x_2} / \theta) - this is sort of like an interpolation between Gaussian and Laplace, and is seeing a lot of use in modern ML research.

We can show that any function is a kernel simply by constructing a \phi, but this is often difficult. We can also use the kernel matrix:

Given a dataset \set{\vec x_1, \ldots, \vec x_n}, the **kernel matrix** of a kernel is a matrix K_{ij} = \kappa(\vec x_i, \vec x_j). As it turns out, \kappa is a kernel if and only if K is symmetric and positive semidefinite (K \in \mb{S}_+^d) for any dataset. In other words, we have the following properties:

- Symmetry: K_i = K_j. This is needed because if A is similar to B, then B should be equally similar to A.
- Positive semidefinite: for all \vec \alpha \in \mb{R}^n, \vec \alpha^T K \vec \alpha \ge 0 (i.e., \sum_i \sum_j \alpha_i \alpha_j K_{i, j} \ge 0).

In other words, the kernel matrix is symmetric and positive semidefinite if and only if there exists a \phi(\vec x) such that \kappa(\vec x_1, \vec x_2) = \phi(\vec x_1) \cdot \phi(\vec x_2).

Given reproducing kernel functions \kappa_1, \kappa_2 and a constant a \ge 0, a\kappa_1 and \kappa_1 + \kappa_2 and \kappa_1 \kappa_2 are all kernels. In other words, kernels are closed under linear combinations.

Consider now the Lagrangian dual of soft-margin SVM with a kernel: \min_{C \ge \alpha \ge 0} \frac 1 2 \sum_i \sum_j \alpha_i \alpha_j y_i y_j K_{i, j} - \sum \alpha_i subject to \sum \alpha_i y_i = 0. ;wip: soft-margin or hard-margin?

Now we'd managed to use the the kernel without having to compute \phi(\vec x), but to recover the weights we still have to do \vec w = \sum \alpha_i y_i \phi(\vec x_i). However, \phi(\vec x) might be infeasible to compute. ;wip

Note that for SVM, \vec w \cdot \phi(\vec x) = \sum \alpha_i y_i \phi(\vec x_i) \cdot \phi(\vec x) = \sum \alpha_i y_i \kappa(\vec x_i, \vec x). Since the classification result is \hat y = \sgn(\vec w \cdot \phi(\vec x)), we then have \hat y = \sum \alpha_i y_i \kappa(\vec x_i, \vec x), which is much more feasible to compute. Note that we're never explicitly computing \phi, only the dot product of two \phi values.

So previously training cost O(nd) and testing took O(n). With kernels, training takes O(n^2) and testing takes O(n), where n is the size of the dataset and d is the number of dimensions in \phi(\vec x).

For example, for the XOR problem we might end up with weights \vec w = \begin{bmatrix} 0 \\ -\frac 1 {\sqrt 2} \end{bmatrix}, and \vec w \cdot \phi(\vec x) = -x_1 x_2 and \phi(). ;wip: what

Overview of assignment 3. For the hyperparameters in this one, we don't have to use cross validation, since that would be really time consuming.

Since we don't know \phi, we can't really make the test predictions. However, \vec w \cdot \phi(\vec x) = \sum \alpha_i y_i \phi(\vec x)

We also don't need to pick a kernel at the beginning - the kernel itself can be learned as well. We can start with a non-negative combination of t pre-selected kernels, and learn the coefficients \zeta_s. The kernel K_{i, j} in the SVM dual formulation then becomes \sum_s{s = 1}^t \zeta_s K_{i, j}^{(s)}. We also often add a small regularization term to the expression to ensure we select as few kernels as possible ;wip: full SVM formulation with this, from slides

Recall logistic regression, \min_{\vec w \in \mb{R}^d} \sum_i \ln(1 + \exp(-y_i \vec w \cdot \vec x_i)) + \lambda \magn{\vec w}^2. Finding the Lagrangian dual of this is difficult, so we want to try something else. Consider \vec w \cdot \vec x_i in the logistic regression's expression. Let X = \operatorname{span} \set{\vec x_1, \ldots, \vec x_n} - every linear combination of the training set. Let \vec w = \vec w_1 + \vec w_1' where we decompose \vec w into two vectors, of which w_1 is entirely in the subspace X, while w_1' is entirely in the orthogonal complement of X. ;wip: w_1' is the orthogonal complement

Now, $W x_i = $. Also, since w_1 and w_1' are in subspaces that are orthogonal to each other, \magn{w}^2 = \magn{\vec w_1}^2 + \magn{\vec w_1'}. Also \vec w \cdot \vec x_i = \vec w_1 \cdot \vec x_1, because ;wip: what

Now we'll map values of \vec x_i into a new space. Let \phi(\vec x_i) be an arbitrary transformation of the linear space for the

Now we have \min_{\vec w \in \mb{R}^h} \sum \ln(1 + \exp(-y_i \vec w \cdot \phi(\vec x_i))) + \lambda \magn{\vec w}^2.

According to the **representer theorem**, the optimal \vec w is of the form \sum \alpha_i \phi(\vec x_i). If we substitude this into the logistic regression formula and simplifying a bit, we get the new primal problem \min_{\alpha \in \mb{R}^n} \sum \ln(1 + \exp(-y_i \vec \alpha \cdot K_{:i})) + \lambda \vec \alpha^T K \vec \alpha ;wip: do the full derication for this

In the general form (not just for the training set), we have \min_{\alpha \in \mb{R}^d} \sum \ln(1 + \exp(-y_i \vec \alpha \cdot K_{:i})) + \lambda \vec w. ;wip: what's the rest of this formula

We'll solve this using Newton's method - \vec \alpha_{t + 1} = \vec \alpha_t - \eta_n \frac 1 {\Delta^2 f(\vec \alpha_t)} \Delta f(\vec \alpha_t) where \Delta^2 is the second order gradient and \Delta is the first order gradient. Clearly, \Delta^2 f(*t) p_i (1 - p_i) K*{:i} K_{:i}^T$ where p_i = \frac{1}{1 + \exp(-\vec \alpha_t \cdot K_{:1})} (the probability of the ith training set point being in class 1). Note that p_i(1 - p_i) is sort of like a weight determining how "unsure" we are about the category for the point, since it's maximized when p_i = 0.5 (we have no idea which class it is in either way), so as we do more iterations, we seem to assign more weight to the points we are unsure about the categories for.

;wip: rest of newton's method, specifically defining the derivative values

The Hessian is the second order derivative, and gives uncertain points a higher weight ;wip: is this true? what is a hessian?

A kernel is **universal** if and only if for any compact set Z, continuous function f: Z \to \mb{R}, \epsilon > 0, there exists x_1, \ldots, x_n \in Z and \alpha_1, \ldots, \alpha_n \in \mb{R} such that \max_{x \in Z} \abs{f(x) - \sum \alpha_i \kappa(x, x_i)} \le \epsilon. Essentially, this says that if you choose a universal kernel, we can approximate any continuous function, which means kernel methods plus linear classifiers/regressions can learn any continuous function. This also tells us that neural networks can learn any function if they have at least 2 layers, which has sparked a lot of interest in neural networks lately.

One example of a universal kernel is the Gaussian kernel \kappa(x, x') = \exp(-\magn{\vec x - \vec x'}^2 / \sigma). Even though it performs poorly in practice, this is one reason we often use it, because it's theoretically guaranteed to work.

Recall the Gaussian distribution, with probability density function \frac{\dee}{\dee x} P(X \le x) = \frac{1}{\sqrt{2 \pi} \sigma} \exp(-\frac{(x - \mu)^2}{2 \sigma^2}). Now, if x was a vector instead, we get \frac{\dee}{\dee \vec x} P(\vec X \le \vec x) = \frac{1}{(2 \pi)^{d/2}} \abs{\Sigma}^{-1/2} \exp(-\frac 1 2 (\vec x - \vec \mu)^T \Sigma^{-1} (\vec x - \vec \mu)), where \vec X \sim N_d(\vec \mu, \Sigma). ;wip: is actually a covariance matrix? the kernel is the covariance matrix of the Gaussian random variables?

Let A \in \,b{R}^{p \times d}, then AX \sim N_d(A \vec \mu, A \Sigma A^T). Likewise, \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \sim N(\begin{bmatrix} \vec \mu_1 \\ \vec \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{1, 1} & \Sigma_{1, 2} \\ \Sigma_{2, 1} & \Sigma_{2, 2} \end{bmatrix}). In other words, for any vector random variable \vec X, we can partition it into two vectors \vec X_1, \vec X_2, and the joint distibution between X_1 and X_2 can be broken up into the marginal distributions and the conditional distributions of X_1 and X_2.

So X_1 \sim N(\vec \mu_1, \Sigma_{1, 1}), and X_2 \mid X_1 \sim N(\vec \mu_2, \Sigma_{2, 1} \Sigma_{1, 1}^{-1} (\vec X_1 - \vec \mu_1), \Sigma_{2, 2}^{-1} - \Sigma{2, 1} \Sigma_{1, 1}^{-1} \Sigma_{1, 2}). If we know X_1, then we get a bit of information about X_2, and that previous formula describes precisely what information.

A random variable is a function Z(\omega). If we roll a die, we see the result of Z(\omega), but not \omega itself.

A **Gaussian process** is just a collection of Gaussian random variables G = \set{Z_t : t \in T} such that any finite subset of G is jointly Gaussian (for infinite subsets, there simply can't be a joint Gaussian distribution over infinite things). A Gaussian process is a function of two variables Z(t, \omega). For any finite N \subseteq T, \set{Z_t = Z(t, \omega) \mid t \in N} is a Guassian random vector, and for any fixed \omega, Z(t) is a path. Each Gaussian random variable can have its own \mu and \sigma. We can refer to each Z_t's parameters using the mean function \mu(t) and sigma function \sigma(t), so Z_t \sim N(\mu(t), \sigma(t)).

Usually, we prove that something is a Gaussian random process by proving that each Z_t is Gaussian.

;wip: what do gaussian processes this have to do with kernels? seems like the kernel function is a gausian process

Assignment 3 due Tuesday, 1 week from now. Some useful sample code is available on the website.

Demonstration of Gaussian random processes. Sampling many t values for a fixed \omega value yields a really spiky curve for Laplace kernels (in fact, not differentiable anywhere), while a Guassian kernel gives a nice smooth curve. Demonstration of how changing Gaussian kernel parameters like \sigma and \mu affect the Gaussian random vectors drawn from the kernel.

If we sample Z(t, \omega) for a fixed \omega at many different t values, we get a Gaussian random vector. If we do this for many different \omega values and add them together, we get the mean of each Gaussian distribution.

The type of kernel we use depends on the application. For example, in financial models we might use a Laplace kernel to handle the spikiness, and in robotics we might use a gaussian kernel since it's smoother.

We initially introduced kernels as a feature transformation. We then looked at it as a bunch of random variables.

Imagine we have a problem where we want to learn a curve, and nature will pick a random curve from a fixed \omega on Z(t, \omega). Our task would then be to figure out the parameters to the Gaussian process, such as \mu(t), so we can predict future samples. For example, we can think of stock prices as a sample of many t values from Z(t, \omega) where \omega is fixed, and our task would be to predict future stock prices by learning the Gaussian process parameters.

How do we do this inference? Recall that \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \sim N(\begin{bmatrix} \vec \mu_1 \\ \vec \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{1, 1} & \Sigma_{1, 2} \\ \Sigma_{2, 1} & \Sigma_{2, 2} \end{bmatrix}). Recall that the covariance function is a kernel.

Recall that linear regression is Y = \vec x \cdot \vec w + \vec \epsilon, where \vec w is deterministic and \epsilon is Gaussian noise N(0, \sigma). Here, Y is a Gaussian process where \vec x is t - Y \sim GP(\vec w \cdot \vec x, \kappa(\vec x, \vec x')).

Let's verify that Y is a Gaussian process. To do this, we will show that it gives a Gaussian random vector for any t. Clearly, \begin{bmatrix} \vec w \\ \vec \epsilon \end{bmatrix} is a Gaussian random vector (even though \vec w is deterministic, it's a Gaussian random variable with 0 variance), so \begin{bmatrix} \vec x & I \end{bmatrix} \begin{bmatrix} \vec w \\ \vec \epsilon \end{bmatrix} is also a Gaussian random vector, since the matrix represents a linear transform, and any linear transform of a Gaussian random vector is also a Gaussian random vector.

Since Y(\vec x) is always Gaussian random vector, any finite number of Y(\vec x) values is also jointly Gaussian, and so must be a Gaussian process.

The generalized gaussian kernel \exp(-\magn{\vec x - \vec x'}^t / \sigma) where 1 \le t \le 2.

We find \vec w using the method of maximum likelihood. Clearly, \max_{\vec w} P(\tup{\vec x_1, y_0} \land \ldots \land \tup{\vec x_1, y_0}). As it turns out, if we write this out, we get \min_{\vec w} \magn{\vec x - \overline x}^2, which is the least squares linear regression. In other words, the maximum likelihood estimator for a Gaussian process is least-squares regression. ;wip: rest of formulas

Bayesian linear regression is a generalization of Gaussian linear regression, where Y = \vec x \cdot \vec W + \vec \epsilon where \vec W \sim N(\vec \mu, \Sigma) and \vec \epsilon \sim N(0, \sigma^2).

For this we want to use the maximum a posteriori method to find \vec W. Essentially, we're maximizing \max_{\vec W} P(\vec w \mid \tup{\vec x_1, y_0} \land \ldots \land \tup{\vec x_1, y_0}), the posterior probability, which by Bayes rule can be written as \frac{P(\tup{\vec x_1, y_0} \land \ldots \land \tup{\vec x_1, y_0} \mid \vec W) P(\vec W)}{P(\tup{\vec x_1, y_0} \land \ldots \land \tup{\vec x_1, y_0})}. Note that P(\vec W) is the prior - if we have \vec W \sim N(0, 1) for our prior we get ridge regression, while if we have \vec W \sim \mathrm{Laplace}(0, 1) we get Lasso regression.

;wip: everything from Abstract View slide

While we could think of a kernel as a dot product of a feature transformation, we can also think of a kernel as Z = \phi(\vec t) \cdot \vec W + \mu(t) where \vec W \sim N(\vec 0, I) and \mu(t) = E(\phi(t) \cdot \vec W + m(\vec t)). Also, \kappa(s, t) = E(\phi(t))

In other words, a kernel is a linear regression problem in the feature space with a Gaussian distributed weight vector.

;wip: rest of class missed due to interview

Some notes about assignment 3: suppose we have a sequence of kernels k_1, k_2, \ldots. What is the limit of those kernels \kappa(\vec s, \vec t) = \lim_{n \to \infty} k_n?

\kappa(\vec s, \vec t) is a function of two vectors. Usually we compute this pointwise - we take a given \vec s, \vec t and compute the limit on that.

Note that \kappa(\vec s, \vec t) = \kappa(\vec t, \vec s) since all kernels are symmetric.

Recall that the density function for the multivariate gaussian distribution is P(\vec X = x) = \frac 1 {(2 \pi)^d} \abs{\Sigma} ;wip: rest of slide

Consider a plot of people's heights in a typical population - we would have a bimodal distribution, one for men and one for women. To model this, we would need to use two Gaussian distributions.

To represent this, we can use a **mixture of Gaussians** here P(x \mid \theta) = \sum_{0 \le k \le K} P(z = k) P(x \mid z = k \land \theta). Here, z is the person's sex, x is the height, and \theta are the remaining parameters. So P(z = k) means the probability of a person having a particular sex, and once we know which class the person is in, we can say it's in that class's Gaussian distribution P(x \mid z = k \land \theta) - each class can have its own Gaussian distribution.

Let \pi_k = P(z = k), so P(x \mid \theta) = \sum_{0 \le k \le K} \pi_k P(x \mid z = k \land \theta). Note that \pi_k \ge 0 and \sum \pi_k = 1 since they're just probabilities.

We can think of \pi_k as the **mixing distribution**, and P(x \mid z = k \land \theta) as the **component distributions** (a kernel for a particular component).

A **Gaussian mixture model** is P(x \mid \theta) = \sum_{1 \le k \le K} \pi_k N(x \mid \mu_k \land \Sigma_k). This lets us mix a bunch of kernels together when different classes require different kernels - each class k has its own \mu_k and \Sigma_k to learn.

Gaussian mixture models with sufficiently many components (k is large enough) can approximate any probability density function on \mb{R}^d. This works because given any density function, we can simply keep adding Gaussian distributions with small \Sigma values until we've created the shape of the density function, and this actually works for many different distributions, like student-T or chi-squared.

Note that for a Gaussian mixture model, we only have x, not y, so it's sort of unsupervised learning. There's a variation of this called **mixture of experts**, which is P(y \mid x \land \theta) = \sum_{1 \le k \le K} P(y \land z = k \mid x \land \theta) = \sum_{1 \le k \le K} P(z = k \mid x \land \theta) P(y \mid x \land z = k \land \theta). We write it in the last form because it allows us to separate the mixing and component distributions. Also, P(y \mid x \land z = k \land \theta) can be N(y \mid \vec w_k \cdot \vec x, \sigma_k^2), which is just linear regression if we try to learn \vec w_k.

Mixture models allow us to essentially use different kernels whenever they work better. The mixing distribution can be trained to learn when to use each component distribution, and the component distribution can just be trained for where it'll perform best. This is sort of analogous to a neural network, where some neurons can turn entire parts of the network on and off.

Interestingly, for Gaussian mixture models, the factorization is always unique - for any given P(x \mid \theta), there is only one k such that P(x \mid \theta) = \sum_{1 \le k \le K} P((z = k \mid x) \land \theta) P(y \mid x \land z = k \land \theta).

Mixture models are heavily used in boosting, where we take a bunch of okay models and turn them into a single good model.

Crucially, we only know the datapoint x, not the component z (we call z the **latent variable**). The problem setup is as follows: we're given independent and identically distributed X_1, X_2, \ldots \sim P(x \mid \theta), and we want to estimate \theta or z. Finding the maximum likelihood estimate is NP-hard, but we can use the variational form of maximum likelihood instead to make this feasible.

If L(\theta) is the likelihood function, then \min L(\theta) = \sum -\ln P(x_i \mid \theta) = \sum -\ln \left(\sum P(x_i \land (z_i \mid \theta))\right). Computing this is NP-complete, but we can construct a sequence that converges to a good approximation, called the **expectation maximization algorithm** (also, it doesn't necessarily give a global optimum). As it turns out, \min L(\theta) = \min_{q_i(z_i), \theta} \sum_i (\sum_{j} q_i(z_j)) ;wip: rest of formula

If you fix q_i(z_i), we get \min_\theta -\sum_i \sum z_i q_i(z_i) \ln P() ;wip; rest of slide

If we do this for a Gaussian mixture model, we get \pi_k = \frac 1 n \sum_i r_{i, k} where r_{i, k} = P(z_i = k \mid (x_i \land \theta)) (the posterior likelihood, proportional to P(z_i = k) P(x_i \mid ;wip) \pi_k N(x_i \mid (\mu_k \land \Sigma_k))).

One thing Gaussian mixture models are often used for is k-clustering, where we automatically find clusters and the number of clusters.

If we know the latent variable, we can also significantly simplify our mixture models. For example, we can approximate the L1 norm using a mixture of L2 norms. Since the L2 norms have a closed form solution, we can approximate the L1 norm using only a combination of closed form solutions to the L2 norms.

New notation: P(a \mid c_1, c_2, c_3) = P(a \mid (c_1 \land c_2 \land c_3))

Guest lecture by Aghastya from Focal Systems. This and the next four lectures are about deep learning.

Loss/cost functions can be derived from the method of maximum likelihood - finding the input that maximizes the likelihood of getting the outcome that we did.

For supervised learning, we're trying to maximize P(y \mid x; \theta) = \prod P(y_i \mid x_i; \theta), but we're only allowed to change \theta. This is equivalent to minimizing J(\theta) = -E_{x, y \sim \hat P_d} \ln P_m(y \mid x), where P_d is the estimated distribution of the data, and P_m is the distribution of the model.

Recall the regression problem: \hat y = W^T h + b where h = f(x, \theta). Again, we're trying to maximize P(y \mid x) by changing only \theta.

;wip: deriving MSE from maximum likelihood

Once we have the MSE, we can minimize it using SGD.

The MSE has several interesting properties:

- The gradient of the MSE is linear with respect to x since MSE is quadratic - gradients can explode
- High gradient values can propagate throughout the neural network, while neural networks work best if the weights stay small
- We often work around this by normalizing all feature vectors so their component is between 0 and 1 so the gradient stays under 1 - this helps the training converge in many cases.
- MSE often has a lot of local minima, which means we will often get stuck in local minima
- For the above reasons, we often use MSE for classification but not regression.

For binary classification, we ;wip: read slide about binary classification, binary crossentropy is the loss function we usually use for binary classification, because P(y x) = y^n (1 - y)^{1 - n}

For multiclass classification, we ;wip: the next slide, softmax is the loss function we usually use for multiclass, derive that, also, the softmax uses exp to turn the log probabilities z_i, z_j back into normal probabilities

;wip: softmax can't saturate, which is really good for SGD, prove that softmax will be approximately the same as max(exp(x_i)), but unlike that it is also smooth everywhere, which is useful for SGD - the softmax function encodes info about how to move in the gradient ;wip: z is the output from the hidden layer

;wip: for softmax, the prediction output is roughly z_i - \max_j z_j - if the prediction is correct, that's z_i - z_i = 0, if it's incorrect then z_i - z_j > 0

In general, to get a cost function with this technique, we find the MLE, and try to minimize it as the loss function ;wip: do an example based on softmax

;wip: loss functions cheat sheet

Intuitively, model bias is the size of the family of functions the model can represent - the smaller the family, the more biased the model is. For example, linear models have high bias, because they can only represent linear problems.

Intuitively, model variance is the number of model configurations that can solve the problem. For example, linear models have low variance because many of them just have a closed form.

Intuitively, model capacity is the number of functions the model can represent. Linear models have low capacity because they can only model linear functions, but deep neural nets have huge capacity - they can model almost any function (though there's high variance).

Error mainly comes from training error (how wrong we were on the training set), and generalization error (how much worse we did on the testing set than the training set - subtract training set error rate from testing set error rate).

A large model capacity reduces training error (reduces underfitting). A large variance increases generalization error (increases overfitting). Unfortunately, increasing capacity often also increases variance - there's generally a balance between variance.

The goal of regularization is to increase bias to decrease variance without removing too much capacity. For example, we can fit a set of n points with a polynomial of degree n, but it's probably not going to generalize to the testing set very well. Instead, we might ;wip

Regularization is **any modification made to the learning algorithm intended to reduce generalization error, but not intended to reduce raining error**. We usually use this on high capacity models where our training error is already small.

One common regularization technique is the L2 penalty, where we add \magn{\vec w}_2^2 to our minimizer objective function. Local constancy means that a small change in x should result in a small change in y. Many ML models assume local constancy, and the L2 penalty makes us prefer models that are more locally constant - if the weights are small, then the model output will change less with any change in x.

Another common technique is the L1 penalty, where we add \magn{\vec w}_1. Since the gradient of the L1 norm is just the sign of the input, this always tries to push weights toward 0. Therefore, the L1 penalty tries to make the model as sparse as possible.

While the L2 penalty tries to make all the weights be as evenly distributed as possible, the L1 penalty tries to force the weights to be 0 as often as possible.

Data augmentation is a very important technique for regularization. Essentially, we add mutated versions of the dataset to itself so the model has more training data to work with. For example, for ImageNet we might duplicate the dataset with rotated, flipped, and scaled versions of itself, so our model can then handle rotated/flipped/scaled versions of itself. Another example is a spelling corrector - we might intentionally introduce spelling errors into duplicated versions of the dataset, so our model can learn about the possible spelling mistakes.

;wip: slides after data augmentation

Multi-task training is also an important regularization technique - train the model to do many different tasks, so it has to be able to generalize to do well. This is rarely used, because it requires multiple labelled datasets, which is more expensive.

A very simple regularization technique is early stopping - checkpoint the model regularly, then after training is done choose the iteration count that minimizes cross-validation error. This is sort of like a hyperparameter search over time!

;wip: ensembles, will often give ~2% improvement on real-world competitions, really powerful if errors are independent

;wip: dropout, we conceptually train the neural network normally, then randomly delete some neurons with some probability (the dropout probability) to get a random subnetwork, then we take an ensemble of many different sampled subnetworks (we usually approx this by multplying all weights by dropout probability, this isn't formally proven to be a good approx but works well in practice) ;wip: essentially, dropout works by saying that no one neuron can be too important, since it might just get deleted in some networks - this encourages redundancy and therefore the ability to generalize ;wip: batch normalization and the rest of the slides ;wip: initialization is important, for example for reLU, we often initialize values randomly to between say 0.01 and 0.1, there's a lot of math to find proper initialization algorithms

Overview of assignment 4. Assignment 3 due tonight.

Another guest lecture by Aghastya.

Usually Minibatch SGD is just called SGD. Minibatch SGD is ;wip: what is it?

A Taylor series expansion at a given point can approximate any continuous function around that point. A neural network formalizes this idea.

Some issues that can happen with SGD are exploding gradients, local minima, and that it's really slow. In deep networks, we usually find that the space is so large and the dimension so high that we usually just find something really close to the global minima anyways.

;wip: momentum in SGD - not only do we store the current position, we also store the previous gradient. this means that if the gradient keeps being small in the direction we care about, they'll accumulate and we'll converge faster. we also be sure to use a terminal velocity dependent on g and alpha (get this formula from slides, also alpha is the variables we're trying to optimize in that slide)

;wip: nesterov momentum, the slide has a mistake where the f call should be f(x; \theta + \alpha v), also v starts off at 0, and weights are randomly initialized (we can't initialize weights to 0 because that would make weights symmetric)

;wip: setting learning rate slide: with too high a learning rate, we tend to just converge to an easy solution and stay there, not getting to better solutions. a too low learning rate will work, but slower. a good learning rate pushes the error rate all the way toward 0. consider the analogy where we're trying to get to the bottom of a valley, but a too high learning rate makes us bounce back and forth between the sides of the valley, in this slide alpha is the learning rate

;wip: a commonly used learning decay function is step decay - run a large learning rate until we plateau, then decay (e.g., by 10x) the learning rate, then decay again when we plateau again, and so on - in fact this guarantees an optimal result. exponential decay is actually max(*{min}, *{start} exp(-kt)). exponential decay is strictly worse than step decay, because it decays learning rate before we absolutely have to (they both derive from the same idea though)

;wip: these are all rules of thumb, not proven rules

;wip: adagrad learning: this is mainly convex-only because the first gradient matters a lot - if we go the wrong direction, it's hard to recover. it suffers from premature decay because it's hard to get the learning rate back up if we get a single big gradient, RMSprop replaces the sum with an exponentially weighted moving average, which works really well in practice but doesn't have the theoretical basis adagrad learning does, just a lecture slide in hinton's ML course - it's pretty widely used.

;wip: adam momentum is sort of like adagrad with momentum, it's adding a per-parameter momentum term, we can usually leave parameters at their defaults, and is also almost an industry standard, adam doesn't really have many hyperparameters to tweak

;wip: how to choose slide, giving in-practice performance comparisons (not theoretical performance). anything before 2014 ish tends to use adagrad, everything after tends to use RMSprop and adam

Fully connected layer is a matrix multiplication. A convolutional layer is similar, and tries to preserve the spatial structure of the input - a pixel in one image is more likely to be important to nearby pixels than farther pixels.

For example, consider a 32x32x3 image - we could apply a 5x5x3 filter to each pixel, dot producting each possible 5x5 window on the ;wip: describe convolutional layer, also you can do this in one matrix mult, then reshape into the d+1 dimensional tensor

;wip: convolutional net

Assignment 4 should be a bit easier.

Guest lecture by Aghastya from focal systems.

;wip: N is the input side length, P is number of values to pad on each side

;wip: the stride S is how many rows/cols we skip for each pixel - a stride of 2 means we skip every other column and every other row when moving the filter, the stride should never be larger than the filter to avoid losing information

;wip: floor((N + 2P - F) / S) + 1 is the side length of the output image

;wip: define convolutional layer

If we have a 225 by 224 image with 3 channels, red/green/blue, we have a 224 by 224 by 3 matrix. When we apply a 5 by 5 filter, the depth of the filter must then be the same depth, so the filter is a 5 by 5 by 3 matrix. If we want our output image to have 64 channels, then we need 64 of these 5 by 5 by 3 filters.

So if F is the filter side length, D_i is the input image depth, and D_o is the output image depth, then we have F^2 D_i D_2 parameters for all of the filters.

When not specified, we assume that D_i = D_o = 1 and S = 1 and we pad the edges with zeroes so the output image is exactly the same size as the input image.

Applying a filter means that we elementwise multiply each element of the filter with the corresponding element in the input image, then sum them all up - it's the dot product of the filter and the segment of the image if we flattened each one out into two big vectors. Each dot product is a channel in a pixel in the output image, known as a **descriptor** (because it describes the area covered by its corresponding filter).

The **receptive field** of one descriptor (channel of a pixel in the output image) is the region of the original image that the descriptor depends on, and is the 2D size of the filter - if we applied a 5 by 5 by 3 filter to an image, the resulting descriptor's receptive field would be 5 by 5. Note that a descriptor depends only on the receptive field - we can change anything in the input image outside of the receptive field, without affecting that channel of that pixel in the output image. The receptive field is usually the same for all of the descriptors in an output image.

As we nest convolutional layers, the receptive field grows. For example, if we apply a 5 by 5 filter to an image, and then a 3 by 3 filter to the resulting image, all with a stride of 1, each descriptor in the final image has a 7 by 7 receptive field from the first layer. Basically, the 5 by 5 filter has 1 pixel in the middle, and 2 pixels on each side, and the 3 by 3 filter has 1 pixel in the middle, and 1 pixel on each side. We then add the pixels on each side, so it's as if we had 1 pixel with 3 pixels on each side, which is a 7 by 7 square.

A stride of 2 doubles our perceptive field, a stride of 3 triples it, and so on. This allows us to have wide receptive fields using fewer filters, which are easier to change. It's commonly used to reduce the number of parameters.

Formally, the side length of the receptive field at layer i is r_0 = 1, r_i = s_{i - 1} r_{i - 1} + F_i - s_{i - 1}, where s_i is the stride at layer i and F_i is the filter side length for layer i.

Convolution networks need to be translation invariant in many real-world applications - we care more about whether a feature is present than exactly where it is in the original image. Naively, we can do this by making the image really low res, so a translation wouldn't make much of a difference in such a low-resolution image. However, this causes us to lose information. We can get around this by only preserving the most important information.

**Pooling** is a technique where we downsample an image by taking a summary statistic for nearby descriptors. For example, max-pooling takes the maximum value within each k by k chunk of descriptors to get the value of that channel in the pooled output image. Likewise, mean-pooling and median-pooling takes the mean and median values. For example, if we have a 224 by 224 image and pool with a 2 by 2 max-filter and stride 2, we get a 112 by 112 image (that's max-pooling). k is usually chosen by a human and the stride is almost always k as well.

Pooling gives us slight translation invariance (proportional to k), doesn't have any hyperparameters that need to me learned (because k is chosen by a human), is easy to backpropagate (it's just the derivative of the max/mean), and significantly reduces the number of parameters. The idea with pooling is that we're making the image smaller while discarding less important features and keeping the important ones.

The purpose of the pooling layer is essentially to increase the receptive field of each descriptor without increasing the number of parameters too much.

In real-world architectures, we often alternate convolutional layers with pooling layers. Essentially, this is doing some transformations, summarizing the important features, doing more transformations, summarizing the important features, and so on until we're done. The outputs are then usually fed into a more typical neural network architecture, perhaps a couple of fully connected layers followed by a softmax. we often also use activation layers like ReLU after each conv layer to add nonlinearity.

;wip: AlexNet was first practical convolutional net, designed to be broken out over 2 GPUs, which makes the architecture pretty complicated, VGGNet has smaller filters and is a lot deeper, uses alternating conv, relu, pooling layers followed by a couple fully connected layers and then softmax

;wip: why not just keep adding more layers to make it deeper? we get y = f_1(... f_k(x) ...), and the gradient is the product of a bunch of chain rule results. note that the gradient starts to explode because we'll sometimes multiply a ton of small numbers or large numbers when those derivatives are large. we can clip huge gradients, but we can't clip small gradients because we don't know whether they should be positive/negative. to solve this, we can add an offset to each input to make the gradient closer to 1, like learning y = h(x) + x instead of y = f(x). this is what resnet did ;wip: more info about this technique

;wip: fundamental idea of deep learning: we can represent any function by mixing simple nonlinearities

So far our neural network paradigm has been: input feeds into hidden transformations, which feeds into output. The output is then a representation of some aspect of the input. This is a **feed-forward neural network**.

However, real-world problems often can't be reduced to one fixed-size input. To solve this, we have **recurrent neural networks** (RNN), which simply feed some of the network outputs back into the inputs. We do this by iterating over the sequence - at each iteration, we take some of the outputs of the network from the previous iteration, combine it with the current iteration's input, and then feed it into the network. We can choose the initial value for "some of the outputs of the network from the previous iteration", or choose to do iterations with fake data before or after the actual sequence - this is called the **process sequence**.

For example, for sentiment analysis, the input would be a sentence (maybe represented as a sequence of words), and the output would be how positive/negative that sentence is. At each word, the RNN would have both the current word as an input, and its previous outputs. The network would learn to create a good description of the input seen so far up to the current word, and to use that description along with the current word to output a new, better description.

Since we have a sequence of inputs and we're running the network on each input, an RNN takes in a sequence and outputs another sequence. RNNs can be many-to-many, many-to-one, one-to-many, and one-to-one (one-to-one RNNs are equivalent but not equal to feedforward networks, as there can still be multiple iterations), depending on the **process sequence**. This is a hyperparameter that is generally chosen by hand. For example, for sentiment analysis we would usually want to have a many-to-one process sequence.

RNNs are essentially y = f(x_t, f(x_{t - 1}, f(x_{t - 2}, \ldots f(x_0, v) \ldots))), where f is the non-recurrent part of the network, x_i is the sequence of inputs, and v is the initial value of the previous iteration's output. For example, ;wip: the vanilla RNN slide formulas

New notation: s = \tanh.

The **vanilla RNN**: given x \in \mb{R}^n, h \in \mb{R}^d, W \in \mb{R}^{d \times (m + d)}, \vec h_t = \tanh\left(W \begin{bmatrix} \vec h_{t - 1} \\ \vec x_t \end{bmatrix}\right). h_0 and W are learned through training, and h_0 is usually initialized to \vec 0.

To train the vanilla RNN, at each iteration we expanding the entire recurrence so far into a closed form, then do standard backpropagation. For example, at iteration 3 we have \vec h_3 = \tanh\left(W \begin{bmatrix} \tanh\left(W \begin{bmatrix} \tanh\left(W \begin{bmatrix} \vec h_0 \\ \vec x_1 \end{bmatrix}\right) \\ \vec x_2 \end{bmatrix}\right) \\ \vec x_3 \end{bmatrix}\right).

Clearly, finding the gradient of this with respect to W and h_0 would involve many factors of W and a lot of repeated \tanh, which results in rapidly exploding or vanishing gradients. This makes vanilla RNNs really hard to train. We usually limit the number of expansions so that the

A **long-short-term-memory RNN** (LSTM) is a variation in which we make the memory more explicit, to make the gradients more manageable. It's essentially a layer structure that has inputs for forgetting, saving, value to save, and whether to read its memory value (or output \vec 0).

With LSTMs, we now have \vec c_0 and \vec h_0 rather than just \vec h_0.

New notation: A \odot B is an elementwise (Hadamar) multiplication.

Consider W \begin{bmatrix} \vec h_{t - 1} \\ \vec x_t \end{bmatrix}. We can split this into four vectors (it doesn't matter too much how we split it, but usually into equally sized vectors), and apply certain activation functions to each one. The four vectors are \vec f (forget vector), \vec i (insert vector), \vec g (value to insert), and \vec o (output vector). We apply sigmoid to \vec i, \vec i, and \vec f (so they're between 0 and 1), and \tanh to \vec g (so it's between -1 and 1).

We then have \vec c_t = \vec c_{t - 1} \odot \vec f + \vec g \odot \vec i. Then, \vec h_t = \vec o \odot \tanh(\vec c_t). Note that c_t depends on \vec h_{t - 1} because the four vectors depend on \vec h_{t - 1}.

This is interesting because the derivative of \vec c_t with respect to \vec c_{t - 1} is just \vec f. This solves the vanishing/exploding gradient problem very neatly - the gradient with respect to c_0 is just a bunch of \vec f values multiplied together, and the gradient with respect to W is a lot simpler too because a lot of complexity came from \vec c_t, which is a lot simpler now. We can't backpropagate arbitrarily, but we can backpropagate a lot farther than with vanilla RNNs without the gradients vanishing/exploding.

A bi-directional RNN can incorporate future and past information, instead of just past information like vanilla RNNs - we can do this when you have the future data already available, just not fed into the network yet. We train one RNN over the normal sequence, another RNN over the reversed sequence, and then a feedforward network takes the outputs of both and uses that to make a final output.

Siri used to use a bidirectional RNN, but since it requires all the future data to be available, it needed you to finish talking before it started processing. Instead, Siri changed to use an RNN with a 3 second delay, assuming that when talking, words stop affecting words more than 3 seconds in the past.

;wip: recursive networks from slides

;wip: mixture of Gaussians is good for low-dimensional data ;wip: dropout loss isn't used that much anymore in state of the art models

Assignment 4 overview. Some notes on assignment:

- The pseudocode has some numerical computation issues if implemented naively. Make sure to write it in a way that avoids this.
- The MNIST dataset has all of the digits centered. That means the edges of every image are constant 0, which causes a division by 0 when trying to invert the determinant. To fix this, we can just add a very small number to the matrix diagonal (like
`eps`

in MATLAB). In general, whenever we divide by an arbitrary non-negative number, we'll want to add a small value to prevent zero division. - The covariance formula in the pseudocode is for a full covariance matrix. However, the actual matrix will end up being a diagonal matrix, so computing the full covariance matrix is really inefficient. There's a formula that is equivalent to the diagonal of the full matrix.
- \vec x^T A \vec x = \vec x \cdot A \vec x = \sum x_i (A_{\vec x})_i = \sum_i x_i \sum_j A_{i, j} x_j = \sum_i \sum_j x_i x_j A_{i, j}.
- r_{i, k} = \pi_k \abs{s_k}^{-1/2} \exp(-(x_i - \mu_k)^T s_k^{-1} (x_i - \mu_k)) has \abs{s_k}^{-1/2}, which depends on the diagonals of s_k. If the diagonals are tiny values, the determinant will be very tiny, and the inverse square will be huge. Avoid these numerical issues by taking the log of the factors and summing them instead of multuplying the original values, so \log r_{i, k} = \log \pi_k - \frac 1 2 \log s_k - (x_i - \mu_k)^T s_k^-1 (x_i - \mu_k).
- There might also be numerical issues with r_{i, k} = \frac{r_{i, k}}{\sum_k r_{i, k}}. We can use magnitude cancellation to improve this: in general, if we want to compute \sum e^{a_i} where a_i could potentially be pretty big, we can use the fact that \sum e^{a_i} = e^{\max_j a_j} \sum e^{a_i - \max_j a_j}. In many ML problems, the e^{\max_j a_j} will cancel out pretty well. In this case, since r_{i, k} is in the numerator and denominator, we can factor out the exponential value directly with r_{i, k} = \frac{\exp(r_{i, k})}{\exp(\sum_k r_{i, k})} = \frac{1}{\exp((\sum_k r_{i, k}) - r_{i, k})}.
- For p \epsilon, this is not a typo - the product of p and \epsilon is a Bernoulli distribution. It means that on average, we aren't going to change the value at all.
- Computing vector expected values: E((\vec w \cdot (\vec x \odot \epsilon))^2). Make sure to compute this correctly, by expanding it out into a sum, like E((\sum_j w_j x_j \epsilon_j)^2) = E(\sum_j \sum_k w_j w_k x_j x_k \epsilon_j \epsilon_k).
- To improve performance, you might apply PCA to the training set. The PCA can be applied on all of the classes at once, or per-class, depending on the dataset (usually we just try both and pick the better one), to reduce the feature dimensionality. When performing PCA, we usually apply the same transform we applied to the training set to the testing set, rather than performing PCA directly on the testing set.

GANs were invented a few years ago, and have been really useful for certain applications.

A **generative model** takes training data with a certain unknown distribution p_{data} and generates new samples from the model distribution p_{model} such that p_{model} is really similar to p_{data}. This is an example of unsupervised learning. For example, a generative model for images takes in a bunch of images, and tries to generate new images that are look similar - if we feed in pictures of cars, it might output pictures of other cars.

The naive approach to this is to set estimate the density function for p_{data} and use that estimate as p_{model}, then sample that to get the outputs - the Gaussian mixture model is essentially this approach, where the estimate is a mixture of Gaussians. Although this works, estimating the density function is actually a really hard problem, especially for high-dimensional data (which images almost always are). Additionally, sampling a high-dimensional distribution, in general, is very difficult as well (though easy for Gaussian and mixture of Gaussian distributions).

Instead of estimating the density functions and then sampling them, we could instead just directly generate the samples. For example, image super-resolution nets like Ledig et al 2017 takes a set of images, downsamples/pools them as the training set X, takes the original as the training set Y, and then trains a deep neural network to predict the original image given the downsampled/pooled image.

Essentially, it's sort of like an automated Turing test. We have two neural networks - a **generator**, trying to generate realistic images, and **discriminator**, trying to tell the difference between real and fake images. The generative network's objective is to fool the discriminator, and the classifier's objective is to not get fooled by the generator. Concretely, the generator network is fed by random noise, and the outputs are mixed with the training set images to get the inputs for the discriminator. The discriminator then tries to learn to correctly classify the images as either from the training set, or from the generator. We train these two networks in parallel, and get a GAN.

We can think of this as a minimax game \min_G \max_D E(\log D(x)) + E(\log(1 - D(G(z)))) where:

- x \sim p_{data} is the input distribution.
- z \sim p_z is random noise.
- D(s) is the predicted probability that some sample s comes from p_{data}.
- D is the maximum probability of training data and the minimum probability of generated sample.
- \max_G means maximize the probability that ;wip: slide about minimax GAN

This works because E(\log(1 - D(G(z)))) is equivalent to E(\log(1 - D(m))) where m \sim p_{model}. Essentially, G(z) is a function that transforms a uniform random sample into a p{model} random sample, so if we feed it with random noise we get a random sample from p_{model}.

;wip: Clearly, if we fix G's outputs, we can simplify the game to \min_G \max_D \int \left(\log D(x) P_{data}(x) \dee x + \int \log(1 - D(x)) P_{model}(x)\right) \dee x = \min_G \max_D \int \left(\log D(x) P_{data}(x) + \log(1 - D(x)) P_{model}(x)\right) \dee x.

;wip: Clearly, \log D(x) P_{data}(x) + \log(1 - D(x)) P_{model}(x) is equivalent to cross-entropy loss, so the discriminator is essentially trying to minimise its cross entropy loss.

So the optimal discriminator is one that lets \frac 1 {D(x)} P_{data}(x) + \frac{-1}{1 - D(x)} P_{model}(x) = 0. Solving, we get D(x) = \frac{P_{data}(x)}{P_{data}(x) + P_{model}(x)}. ;wip: KL divergence is non-negative, so 0 is the minimum value, and this occurs when p_data = p_model, so the discriminator's goal in the minimax game is to ???

More hints for A4:

- Don't confuse r_{i,.} with r_{i, :} or r_{., k} with r_{:, k}
- In the assignment hint for q1, the full formula for the diagonals of S_k is given in the s_j formula, but it's recommended that you try deriving it yourself.
- For 1.2, we're supposed to compute a prior probability. Each class will end up having its own mixture of Gaussians model, and we then combine them to get the final prediction.
- For q2.2, we're trying to integrate out Z_{t_{n + 2}}, \ldots, Z_{t_{n + p}} out of the left side of the conditional distribution for Z_{t_{n + 2}}, \ldots, Z_{t_{n + p}} \mid Z_{t_1}, \ldots, Z_{t_n}. We don't need to do anything to the right side of the conditional, since they're not a part of the distribution itself. There's a formula in the slides that does almost exactly this.

The competition between the generator and the discriminator is a zero-sum game. We're tuning G and D such that they get better at competing with each other.

If we write the expectations in terms of integrals, it ends up becoming logistic regression! \min_G \max_D E_{x \sim p_{data}(\ln D(x))} + E_{x \sim p_{model}}(\ln(1 - D(x))) becomes \min_G \max_D \int P_{data}(x) \ln D(x) + P_{model}(x) \ln(1 - D(x)) \dee x. Since it's analogous to minimizing the cross-entropy loss y \ln p + (1 - y) \ln(1 - p) where $P_{data}(x) = y, $, this becomes logistic regression.

For an optimal G, D is optimal when D(x) = \frac{P_{data}(x)}{P_{data}(x) + P_{model}(x)} - if we have a perfect generator, the discriminator would never be able to tell the difference better than random guessing, so it might as well randomly guess. For an optimal D, G is optimal when p_{model} = p_{data} - if we have a perfect discriminator, we have to generate indistinguishable images to fool it.

We can use the minimax theorem to swap the min/min (proof is out of the scope of this course). Suppose we do this, so we fix D and then try to optimize G. Since G is trying to minimize the \ln(1 - D(x)), it will focus on the ones where D(x) gives the highest value, which results in the lowest values for \ln(1 - D(x)). ;wip: what does collapsing on modes of D mean? slide 16 IIRC

;wip: fixed G, what's optimal D? fixed D, what's optimal G?

As a result, GANs can't be trained by alternately training the generator and discriminator, because if we fix either G or D, training the other might not get a strictly better result. In general, we can't alternately train players for minimax games, because we easily get into oscillations.

The model space is the set of possible distributions p_{model}. Usually, the model space isn't large enough to include the true p_{data}, but it can get really close.

In practice, D and G are rarely optimal, and are usually deep convnets that are optimized with SGD. When computing E_{x \sim p_{data}(\ln D(x))} or E_{x \sim p_{model}}(\ln(1 - D(x))), we approximate the expected value by taking the average of those formulas for each value in the training set.

;wip: pseudocode for GAN training algorithm

Note that we don't fix G or D and then train the other in this algorithm (as we discussed above, this wouldn't work very well). Instead, for each iteration we perform k SGD steps for D and one SGD step for G - we're repeatedly switching between training G and D to ensure they improve in a compatible way.

For small values for D(x), the gradient of \ln(1 - D(G(z))) tends to be really small. To make the gradient larger in these cases, one trick is to replace \ln(1 - D(G(z))) with -\ln(D(G(z))), which has a larger gradient near these value. When we do this we essentially have \min_G \max_D E_{x \sim p_{data}(\ln D(x))} + E_{x \sim p_{model}}(-\ln D(x)). There's no strong mathematical basis for this, but it's used pretty often in industry. However, if we look at what happens when we fix G or D and optimize the other, we can show that G and D are optimal with the same conditions.

After training the GAN, the usual output is to use the generator network to generate new images. Sometimes we also use the discriminator network to tell the difference between real and fake images. For example, we can feed the generator noise to generate one image, and then perturb the noise slightly and run it through the generator again, we get a slightly different image.

The Wasserstein GAN is a brand new objective function for GANs that is much better behaved. The GAN's current objective function is equivalent to \min_G \operatorname{JS}(p_{data} \| p_{model}), which can be discontinuous, but a GAN using the Wasserstein loss function like \min_G(W(p_{data}, p_{model})) will be a lot better behaved. Also, we can approximate it using kernel methods, which makes practical implementation very interesting. The Wasserstein function is out of the scope of this course, as it relies heavily on measure theory.

A tree is an acyclic connected undirected graph. A decision tree is a tree where each interior node contains a variable, and each edge is a possible value of that variable (or range of values). Each leaf node contains the output value.

We make decisions using decision trees by starting at the root node, and following all of the edges that match until there are no decisions left and we've reached a set of leaf nodes. We can represent any boolean function as a decision tree, though the resulting tree might end up being infeasibly large.

For classification purposes, leaf nodes contain the output class, and we can do something like take the majority vote of all the leaf nodes we reach. For regression purposes, leaf nodes contain the predicted value, and we can do something like take the average of all the leaf nodes we reach.

Decision trees are often also represented as binary trees, where each node contains a boolean test (like x_5 < 0.45), and we take the left node if the test fails and the right node if the test passes.

To train decision trees, we need to choose an variable for each node, choose which edges of a node to grow out, decide when to stop growing, and decide what to put in the leaf nodes. Choosing variables to test and edges to grow is already NP-hard, but we'll use a greedy approach by always choosing the lowest-cost variable and lowest-cost value ranges.

;wip: formula for decision tree training, t is a real number so there are infinite values, but we can restrict it to values in the training set, because that is enough to arbitrarily partition the training set points.

For deciding when to stop training a decision tree, we usually stop once the tree reaches a certain depth, after a certain amount of time spent, if the loss function stops shrinking, or if we manage to correctly classify all of the training set points. Stopping at a certain depth is most common, and the depth limit is sort of a regularization technique.

;wip: loss function for regression using decision trees - the variance of the y values like \sum_i (y_i - \overline y)^2 where \overline y is the average of all the leaf node values we've reached ;wip: it's NP complete to train decision trees optimally because it reduces to the sum problem

For classifiers we generally use mean-squared-error for the loss function. For decision trees we often use the classification cost \hat p_c = \frac 1 {\abs{D}} \sum_{i \in D} 1_{y_i = c} and the prediction \hat y = \argmax_c \hat p_c to construct loss functions, where D is the training set. For example, the misclassification error is \ell(D) = 1 - \hat p_c ;wip: entropy and etc.

An ideal decision node in a decision tree should lead to all of one class passing the test, and all of the other failing the test. In other words, we want to try choose decisions that partition the X values such that each partition has Y values as homogenous as possible.

In practice, we usually train a decision tree via pruning. We first grow a very deep, very accurate tree on the training set, which would definitely be very overfit. We then go through the tree bottom-up, and along the way, simplify nodes that increase our estimated generalization error the most (e.g., replacing the node with a node that always predicts 1, or always predicts 0, etc.).

Decision trees are nice because they are relatively fast to train and are often even interpretable by humans.

When solving a machine learning problem, we generally want to try a lot of options (e.g., SVM, deep neural nets, decision trees) and pick the best one. However, what if we could combine a bunch of algorithms together? If we had the resources to run all of those, this could potentially give better results than any single algorithm could give.

Consider the jellybean guessing game, where the person who has the best guess as to how many jellybeans are in a jar wins the game. As it turns out, we can consistently get closer results by averaging many guesses. This is the power of aggregation, and these types of models are known as **ensemble models**.

Supose we have many binary classifiers h_1, \ldots, h_t on separate datasets D_1, \ldots, D_t, where each classifier has accuracy better than 50%. Since the datasets are independent, the classifiers must be independent as well, so the random variables representing a given classifier making a mistake must be independent as well.

If we then predict the output as being the majority label among all t classifiers (assuming t is odd), we'll make a correct prediction whenever the majority of h_1, \ldots, h_t are correct. Since they're independent, we have many independent Bernoulli distributions, the sum of which is a Binomial distribution. So the probability we get it wrong is P(\text{majority is right}) \le \sum_{k = \ceil{\frac t 2}}^t {t \choose k} p^k(1 - p)^{t - k}, where p is the lowest-accuracy classifier.

This is hard to analyze, so we approximate this using a Gaussian distribution N(tp, tp(1 - p)). So as t goes to infinity, the probability of an error approaches 0. ;wip: better proof

In other words, as we use more and more weak classifiers, taking the majority vote from all those classifiers will eventually give us an arbitrarily small error, assuming the classifiers are independent. However, in practice we can't make this assumption, since we would need independent training sets for each classifier, which would often require an infeasible amount of training data.

To help with this, we do **bagging** (Bootstrap Aggregating) - for each classifier we randomly sample a bunch of training set entries with replacement (which gives a training set with possibly duplicated or missing entries), train the classifiers are those, then aggregate the classifier predictions by taking the mode (for classification) or the mean/median (for regression). In contrast, dividing the training set up for each classifier is like sampling without replacement. Sampling with replacement can't give us fully independent training sets, but in practice, using bagging over straightforward splitting of the dataset almost always yields an improvement.

Bagging can also be used with regression, where we replace the classifiers with regressors, and then take the mean of the regressor outputs rather than the majority vote of the classifiers. Averaging those t independent outputs reduces variance by a factor of t.

Another practical improvement to regression with bagging is to add a small noise to every Y value in the training set, which is different for each regressor. The equivalent of this for classification is to randomly flip some of the training set Y values (say, 5% to 10%). This works because it improves independence of the training sets, and works as a form of regularization. We can think of this technique as analogous to dropout in neural networks.

Bagging is basically averaging over classifiers or regressors - it smooths over instabilities/high-variance-regions in the component classifiers, so it's beneficial if performance of certain classifiers changes a lot with small perturbations. That means bagging works well for algorithms like decision trees (which can be relatively unstable), but not kNN (which is very stable already).

**Hoefting's inequality** says that given independent and identical Bernoulli random variables X_1, \ldots, X_n \in \set{0, 1} where P(X_i = 1) = p, then P(\sum X_i \le (p - \epsilon)n) \le \exp(-2 \epsilon^2 n) for any \epsilon > 0. We can think of the classifier outputs as independent and identically distributed Bernoulli random variables, where X_i = 1 means classifier i made a prediction error and X_i = 0 means it was correct. Therefore, the expected error rate of the ensemble model is under 100k% with c% confidence, where c = 1 - \exp(-2 \epsilon^2 n) and c = p - \epsilon.

This gives us an upper bound on the number of classifiers we need to get a certain accuracy (with a certain confidence). Namely, we choose the number of classifiers n by solving for n in c = 1 - \exp(-2 \epsilon^2 n) and c = p - \epsilon, to get n = -frac{\ln(1 - c)}{2 (p - c)^2}.

The **random forest** algorithm combines some of the techniques we've seen so far. We use bagging-style sampling without replacement to get many "independent" training sets, then train a bunch of decision trees on those, taking the majority vote (classification) or average (regression) of all of their predictions.

Given many classifiers h_1, \ldots, h_t, each one barely better than randomly guessing, is it always possible to construct a classifier with really good performance? Schapire showed that this is possible for an arbitrary level of accuracy using a specific algorithm in 1990 (specifically, by proving that majority voting on 3 classifiers is strictly better than the individual classifiers, and we can repeat this to get arbitrarily good accuracy), and this was improved by Freund in 1995 (a more practical algorithm that weights and re-weights training set points according to how often classifiers get them right).

In-class demo of hedge algorithm on a real-world example.

Freund and Schapire further improved this in 1997 with their combined algorithm, known as the Hedge Algorithm:

This algorithm is actually powerful enough to work well in practice for stock markets, sports betting, and many more difficult machine learning problems - it's competitive with our current state of the art algorithms.

```
# hyperparameters include: number of classifiers, learning rate, number of iterations or number of passes through training set
def hedge(classifiers, training_set, learning_rate = 0.5):
# the amount by which we penalize losses
beta = np.full(len(classifiers), learning_rate)
# the fraction of resources we'll spend on each classifier (usually starts off initialized to random values)
weights = np.random.uniform(0.1, 1.0, num_classifiers)
# we don't necessarily have to go through the training set in order, and we might make multiple passes too
for x, y in training_set:
# normalize so that all elements sum up to 1
# we also usually add a small epsilon value to the denominator to prevent underflow
weights /= weights.sum() + 1e-12
# receive performance data for each classifier for this training set element
# all loss function values must be between 0 and 1 inclusive, but besides that they can be arbitrary functions
loss = weights @ [classifier.loss(y, classifier.predict(x)) for classifier in classifiers]
# reweight classifiers based on how they performed this time
weights *= beta ** loss
# the resulting weights can then be used to combine the classifiers into the final model
```

The hedge algorithm can provably always do at least as well as the best individual classifier, because eventually it will allocate all the weight to the best one. Even if we have an opponent that knows our weight vector and controls our loss vector, trying to maximize our losses, the individual classifiers would all end up doing at least as bad. The loss vectors in most algorithm are assumed to be independent and identically distributed, but that isn't required for hedge, and the loss vectors can even be adversarial.

In fact, if N is the number of classifiers, \vec p^t is the weight vector at time t, and \vec l^t is the loss vector at time t, then \sum_t \vec p^t \cdot \vec l^t \le \frac{\ln N - \ln ;wip}{1 - \beta}. If we choose \beta = ;wip, then ;wip: the second version of the hedge algorithm performance bound formula.

Basically, if we were betting on horses, we can use the hedge algorithm to get arbitrarily close to doing just as well as someone who can see the future and choose a single horse to always bet on, by not seeing the future and being able to change our bet. If we were seeing a particular horse perform poorly, we might stop betting as much on it in subsequent rounds.

The algorithm was inspired by the weighting-by-majority-vote algorithm, which was invented by the person who invented Winnow for perceptrons - that's why there are some similarities between how they work.

Adaptive boost (AdaBoost) is the hedge algorithm converted into a boosting meta-algorithm - given some training data and some machine learning algorithms, it trains those algorithms and produces a better-performing classifier (in contrast, the hedge algorithm just takes already-trained classifiers). We treat each training set entry as a horse, and get a certain amount on each horse. The resulting loss values are used to reweight how much we bet on each horse. ;wip: intuitive explanation

;wip: pseudocode for adaboost and making final predictions h_f(x) from the output ;wip: it's called adaptive boost because we change \beta over time ;wip: the closer the prediction is to the true value, the larger the exponent - we lower the weight of training set points that are performing well, so the ones we're performing poorly on are more represented in the training set, which forces the classifiers to focus more on those bad-performing examples when they're being trained ;wip: the overall performance needs to be strictly more than 0.5, because when we set \beta = \epsilon / (1 - \epsilon), we need 0 < \beta < 1 ;wip; the larger is, the larger is, so the more its weight is preserved ;wip: when we have a multiplicative algorithm like this one (weights always multiplied by a value, like winnow), the weight will always be nonzero if they start off nonzero, and will stay 0 if they start off 0. therefore we must ensure that we never initialize any weight vector elements to 0 - always positive numbers ;wip: afterwards, we often throw away any classifiers that have a weight below the certain threshold, because they have very little effect on the end result ;wip: if we use h_t with p_{t + 1}, we will get \epsilon = 0.5, because the way we chose \beta and the exponent of beta such that the classifier sucks again. this means the classifier is no better than chance, so the hypothesis needs to improve to do better than chance - it prevents us from just returning the same hypothesis over and over again, which wouldn't improve anything

;wip: the training error is guaranteed to decay as per P(h_f(x_i) \ne y_i) \le 2^T \prod_{t = 1}^T \sqrt{\epsilon_t (1 - \epsilon_t)}. the training error goes down exponentially fast, since \epsilon_t \le \frac 1 2 - \gamma \le \exp(-2T \gamma^2)

;wip: we can think of AdaBoost as doing gradient descent to minimize the exponential loss \sum_i \exp(-y_i h(x_i)) where h is in the conic hull of all the h_t's ;wip: since the training error drops so fast (exponentially), is overfitting a concern? no, because we usually use super simple classifiers like decision stumps, which are not individually prone to overfitting at all. ;wip: another explanation compares bagging and boosting: as the number of classifiers increases, the training error drops to 0 almost immediately, while test error is continues going down even after the error is 0. the adaboost authors explained it as follows: the adaboost algorithm isn't just trying to minimize training error, but also maximizing the margin y h(x) (this is 1 if they agree, and -1 if they don't agree). so the training error drops to 0, but it continues trying to maximize the margin ;wip: grove and shuurmans showed that latter explanation isn't so good in 1998: they just ran it for more iterations, and eventually the test error started increasing again! in other words, run adaboost long enough and it will start overfitting. also, they invented LP-boosting, which explicitly tries to maximize the margin (using linear programming), and LP-boosting turned out to perform very poorly. this implies that maximizing the margin isn't a very good goal, because explicitly maximizing it didn't decrease test error ;wip: therefore, why adaboost works so well isn't that well understood even today, despite other explanations like soft-margins and so on

;wip: adaboost is a straightforward way to boost performance and doesn't have many requirements for acceptable classifiers, but the results are hard to interpret and is hard to parallelize

;wip: for bagging, we already have a bunch of classifiers and try to weight their outputs. for boosting, we train a weak classifier on a training set, modify the training set, and then repeatedly retrain until we get a bunch of trained weak classifiers that work really well. basically we train a weak classifier on the training set, train a weak classifier on the ones it did poorly on, train a weak classifier on the ones those two didn't do well on, and so on ;wip: bagging is easily parallelized, because the classifiers don't depend on each other, while boosting can't really be parallelized because each training set depends on the results of the previous classifiers

Adaboost was used in Viola-Jones face detection, where there were lots of simple pattern detectors adaboosted into a strong face signal. Viola-Jones used a cascading architecture - the first classifier is really fast and uses few features, the second is slower with more features, and so on (the presented architecture used 38 layers). This allows us to move this window across the image really fast and accurately, crucial for realtime face recognition, because at each level the classifier can say "that's definitely not a face".

We have to use an asymmetric loss function here because the number of detection window that has a face is actually really small, or else it would just always predict no-face for a really high accuracy. Viola-Jones used one where agreement give 0 loss, missing a true face gives cost \sqrt{k} and falsely finding a face gives cost \frac 1 {\sqrt{k}}, where k is a hyperparameter that is a large positive number.

Overview of the exam: open book, all notes allowed, minimal linear algebra, format will be similar to assignments but proportionately easier. One question about designing an algorithm, a few true/false, short answer, derivations, proofs, calculations. 6 questions total. Overview of assignment 5, including some highlighted piazza answers.

Course review:

- Perceptron:
- The perceptron classifies by learning a weight vector - when it classifies a feature wrong, it just adds the feature to the weight (possibly scaled by a learning rate).
- The perceptron convergence theorem says that the perceptron will always converge on linearly separable data, and faster the larger the margin is. However, it just gives us an arbitrary separating hyperplane, not necessarily the best one.
- The perceptron bounding theorem says that if the data isn't linearly separable, the perceptron will eventually cycle.
- We usually stop perceptron when the error stops decreasing or the validation set reaches a certain accuracy.
- To make it multi-class, we can use one vs. all (each perceptron demarcates class i vs. every other class, highest activation wins), and one vs. one (each perceptron demarcates class i vs. class j, plurality vote).
- The winnow algorithm trains perceptrons by multiplying by input features rather than adding. Winnow is a lot better at eliminating irrelevant features quickly, because its convergence time grows with respect to the number of dimensions rather than the number of training set points. Winnow also requires that the weight vector always be non-negative, which we can do by appending the negative version of each feature vector to itself.
- Linear regression:
- We can solve for the least squares equation X^T X W = X^T Y by taking its derivative and solving for when the derivative is 0.
- Huber loss is really nice because it has a good derivative near the origin but stays linear when far away from the origin.
- Training techniques: Ridge regression (L2 regularization, encourages minimal squared weights overall which makes the weights all consistently small), Lasso regression (L1 regularization, encourages individual values to go to 0 which makes the weights more sparse).
- Splitting the test/train/validation sets, using cross-validation (especially when we don't have validation sets). Why we should never test on the test set.
- Multilayer perceptrons:
- Useful when the data isn't linearly separable. Classic example: XOR dataset, with points \tup{0, 0} \to 0, \tup{0, 1} \to 1, \tup{1, 0} \to 1, \tup{1, 1} \to 0. Prove that it's not linearly separable.
- Multiple perceptrons stacked together, but instead of \sign we can use a bunch of other nice activation functions like ReLU, which have nicer gradients for gradient descent purposes.
- To perform gradient descent, we define a loss function, then take the gradient of the loss function with respect to each weight.
- Training takes place in two phases: a forward phase where the inputs are run through the network, and a backward pass where we compute derivative values, then update weights according to the gradients.
- Chain rule: if f(x) = g(h(x)), then f'(x) = g'(h(x)) * h'(x). The forward pass computes h(x), and the backwards pass computes the rest.
- A two-layer perceptron can approximate an arbitrary continuous function with arbitrary precision, if we have an activation function that is locally bounded, has a negligible closure of discontinuous points, and isn't a polynomial (almost all real-world activation functions). Note that kernel methods with linear classifiers can also approximate arbitrary continuous functions.
- Logistic regression:
- Treat each label y_i as Bernoulli random variables, with a hidden parameter p_i = P(y_i = 1). We then assume that each y_i is indpeendent. Then, the probability of seeing the training set X given w is \prod P(\vec x_i; \vec w)^{y_i} ;wip
- We define P(\vec x_i; \vec w) = \log \frac{P(\vec x_i; \vec w)}{1 - P(\vec x_i; \vec w)}. This is equivalent to P(\vec x_i; \vec w) = \frac{1}{1 + \exp(-\vec w \cdot \vec x_i)}.
- We are then optimizing the following: \sum \ln(1 + \exp(;wip)).
- Cross-entropy loss:
- \ell(\Theta) = -\sum_{i = 1}^n \sum_{c = 1}^k \vec y_{i, c} \ln \hat y_{i, c} where \vec y and \hat y are one-hot vectors.
- Suppose we have a bunch of values q_i, where q_i \ge 0 and \sum_i q_i = 1 (we call this a
**distribution**). The entropy of q is H(q) = \sum q_i \ln q_i. The regression problem wants us to minimize the square loss (y - \hat y)^2, which is minimized at \hat y = y. - Suppose we have y_i as labels in one-hot encoding. Clearly, each y_i is a distribution because each element of y_i is non-negative and they sum up to 1.
- The KL divergence KL(\vec y_i \| \vec p_i) = \sum_c y_{i, c} \ln \frac{y_{i, c}}{p_{i, c}} = \sum_c y_{i, c} \ln y_{i, c} - \sum_c y_{i, c} \ln p_{i, c}.
- Since the first term above is a constant, minimizing the KL divergence is equivalent to minimizing -\sum_c y_{i, c} \ln p_{i, c}.
- Therefore, minimizing the cross entropy is just the same thing as minimizing the KL divergence between \vec y_i and \vec p_i.
- Logistic regression does the same thing as SVM (determining a linear decision boundary), but minimizes logistic loss rather than hinge loss. SVM usually works a bit better though, since hinge loss is less sensitive to outliers and can classify with full confidence if a point is outside the margin area.

Exam is in MC 2034 for section 001 undergrads, duration 150 minutes. Format will be similar to the sample midterm, but harder.

Continued course review:

- The Bayes error is the absolute best classifier. We can derive it as follows:
- The error of a classifier is is P(f(X) \ne Y). The expected error is then E(1_{f(X) \ne Y}). This is a common pattern, replacing a probability with the expectation of an indicator variable.
- Assuming a binary classifier, E(1_{f(X) \ne Y}) = E(1_{f(X) = 1, Y = -1} + 1_{f(X) = -1, Y = 1}).
- Since f(X) and y are independent, E(1_{f(X) = 1, Y = -1} + 1_{f(X) = -1, Y = 1}) = E(1_{f(X) = 1} 1_{Y = -1} + 1_{f(X) = -1} 1_{Y = 1}).
- Condition on x to turn X into a constant x: E(1_{f(X) = 1} 1_{Y = -1} + 1_{f(X) = -1} 1_{Y = 1}) = E(E(1_{f(x) = 1} 1_{Y = -1} + 1_{f(x) = -1} 1_{Y = 1} \mid x)).
- That turns f(X) into f(x), a constant, which lets us move its indicators out of the expectation: E(E(1_{f(x) = 1} 1_{Y = -1} + 1_{f(x) = -1} 1_{Y = 1} \mid x)) = E(1_{f(x) = 1} E(1_{Y = -1} \mid x) + 1_{f(x) = -1} E(1_{Y = 1} \mid x)).
- Since exactly one of 1_{f(x) = 1} and 1_{f(x) = -1} are 1 and the other is 0. ;wip: which one is more than 50%, choose that one
- k-nearest-neighbors:
- k is chosen to balance overfitting vs. underfitting - larger k tends to underfit, smaller tends to overfit.
- 1-NN has worst-case error of about twice the Bayes error.
- SVM: like perceptron, we're finding separating hyperplane, but we're also trying to maximize the margin of the hyperplane: \min_{\vec w, b} \frac 1 2 \magn{\vec w}^2 subject to y_i(\vec w \cdot \vec x_i + b) \ge 1 for all i.
- Note that the 1 can be any positive number, and the exponent on \magn{w} can also be any positive number. The above is the most common formulation, however.
- Some of the closest points to the separating hyperplane are support vectors (formally, support vectors are those where \alpha \ne 0 in the Lagrangian dual). Essentially the support vectors are part of each convex hull that are necessary in order for the minimum-difference line segment between the two convex hulls to be what it is.
- For hard-margin SVM, the solution for \vec w, b always exists and is always unique.
- To get kernel SVM, we take the Lagrangian dual of the primal SVM problem, which contains a \vec x \cdot \vec x. We can then replace \vec x \cdot \vec x with \kappa(\vec x, \vec x) to get kernelized SVM from the dual problem.
- The Lagrangian dual of SVM can be thought of as minimizing the distance between the convex hull of the two classes (the separating hyperplane then passes through the midpoint of that line segment with the line segment's direction as the normal). The primal SVM problem can be thought of as maximizing the margin of a hyperplane.
- To get soft-margin SVM, we pay a cost C for misclassifying points rather than having a constraint saying we can't misclassify any of them. We can compute the Lagrangian dual for this too, and the only difference from hard-margin is that we have \max_{C \ge \alpha \ge 0} rather than \max_{\alpha \ge 0}.
- Soft-margin SVM usually uses hinge loss, but we also often use epsilon-insensitive loss (hinge loss with a bit of play \epsilon), L2 loss, L1 loss, and Huber loss.
- Lagrangian duals:
- We usually solve min/max problems such as \min_x f(x) by taking the derivative f'(x) and setting it to 0, then solving for x. However, if there are constraints, we can no longer do this.
- Lagrangian duals solve this. We move the constraints into the objective function by introducing a maximizer, use the minimax theorem to swap the min and max, then taking derivative and solving for the min to leave only the \max.
- The Lagrangian dual is often useful because it switches the problem from working on each row of the training set to each dimension. When we have high-dimensional problems, we often use the primal to improve performance, while we might use the dual when we have a lot of training set points. Additionally, the dual might lend itself better to kernelizing, like for SVM.
- Kernels: kernelized machine learning algorithms give us a way to efficiently transform each input \vec x using some function \phi(\vec x), even if computing \phi(\vec x) can't be computed efficiently, as long as we can efficiently compute \kappa(\vec x_1, \vec x_2) = \phi(\vec x_1) \cdot \phi(\vec x_2).
- To kernelize a machine learning algorithm, we write it in a way such that it contains \vec x_1 \cdot \vec x_2, and then replace that with \kappa(\vec x_1, \vec x_2).
- To prove that a function \kappa(\vec x_1, \vec x_2) is actually a kernel, we prove that its corresponding matrix is positive semidefinite, or construct \phi(\vec x) such that \kappa(\vec x_1, \vec x_2) = \phi(\vec x_1) \cdot \phi(\vec x_2).
- This is especially useful for linear classifiers, because a linear classifier plus a non-linear transform is equivalent to a non-linear classifier.
- Gaussian process: a function Z(t, \omega) such that Z(t, \omega) at any real fixed t is a Gaussian distribution.
- If we take Z(t, \omega) at a bunch of values of t, \begin{bmatrix} Z(t_1, \omega) \\ \vdots \\ Z(t_k, \omega) \end{bmatrix}, we get a multivariate Gaussian distribution.
- ;wip: what does it mean to marginalize a gaussian? if we integrate \vec X_2 out of \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} we just get X_1
- There's a 1:1 correspondence between kernels and Gaussian processes - they're actually roughly interchangeable. A Gaussian process is specified by a mean function m(t) and a covariance function \kappa. Note that the covariance function is a kernel - every Gaussian process is defined by a mean function and kernel.
- Mixture of Gaussians model: p(x \mid \theta) = \sum_k \pi_k G_k where G_k \sim N(\mu_k, S_k).
- We can train this using expectation-maximization, since it's too hard to directly minimize the log likelihood function, because it contains an absolute value.
- Useful trick: \min_x 2 \abs{x} is roughly equivalent to \min_{x, y > 0} \frac{x^2}{y} + y (introduce a new variable y), because the minimum usually occurs when \frac{x^2}{y} = y, so y^2 = x^2 and y = \abs{x}.
- Expectation maximization: If we fix x, we can minimize y. If we fix y, we can minimize x. Repeat until we get a pretty minimal value.
- ;wip: EM algorithm slide, where y is a distribution q_i and then we alternatingly estimate and minimize each one.
- CNNs:
- A 3 by 3 filter is actually 3 by 3 by (number of channels in the lower layer). A convolutional layer can contain many filters.
- Pooling layers operate on individual channels - they don't change the number of channels, only the image dimensions.

ML interview questions: What is logistic regression doing (minimizing logistic loss)? Can you prove perceptron error bound (nope)?

;wip: logistic loss is log_2, not ln ;wip: blog post explaining lagrangian duals simply and tricks like the one we use to derive soft-margin SVM

This work by Anthony Zhang is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Copyright 2013-2017 Anthony Zhang.