# CS489

Special Topics in Computer Science - Introduction to Machine Learning.

Yaoliang Yu
Section 001
Email: yaoliang.yu@uwaterloo.ca
Website: http://cs.uwaterloo.ca/~y328yu/mycourses/489
Office Hours: Tuesdays/Thursdays 2:40pm-3:40pm in DC-3617
Tuesdays/Thursdays 4:00pm-5:20pm

# 7/9/17

Course is potentially going to be collaboration between professor and Focal Systems - guest lectures by deep learning engineers from Focal and assignments from real-world systems. Contact agastya@focal.systems for questions and comments.

Course questions on Piazza, content on LEARN. Course will use MATLAB, Python, or Julia, and requires CS341 concepts.

No required textbooks. 5 semi-weekly assignments worth 50% total, submittable via LEARN, remaining 50% from an open book final exam. There's also a 5% bonus project, consisting of a 1 page proposal and 8 page report on a machine-learning-related project. ;wip: consider one for robotics control

Machine learning is about giving computers the ability to learn things that they aren't explicitly programmed to do. More specifically, for a machine to learn is to, as experience E increases, improve a performance measure P for some class of tasks T. Essentially, a program learns if it gets better at solving a problem as it gains more experience.

Machine learning falls into three categories:

• Supervised learning: classification/regression/ranking - there's a source of truth that the machine can use to determine the true answer for at least some of the problem instances.
    • Example: the Not Hotdog app - images labelled hotdogs/not-hotdogs are used to train the model, which is then used to make predictions about new images.
    • Given a training set of pairs \tup{x, y}, find a function f: X \to Y such that f(x) has good performance on values of x that haven't been seen in the training set.
    • We don't actually care that much about performance in the training set - too-high performance in the training set is overfitting.
• Reinforcement learning: control/pricing/gaming - there's no explicit source of truth, but doing something gives feedback, like how good the previous output was.
    • Example: AlphaGo uses a reinforcement learning model to guide Monte Carlo tree search - wins give positive feedback, losses give negative feedback.
• Unsupervised learning: clustering - there's no explicit source of truth.
    • Example: in 2012, Google trained a 9-layer network to cluster objects in YouTube videos, and it managed to learn to detect faces by itself.

Modern ML research focuses on representation of data (e.g., feature engineering), interpretation of results, generalizing models to different domains (e.g., applying image classifiers to video), time/space complexity, learning efficiency (how many samples do we need? how big does the training set need to be?), and real-world applications.

New notation: A_i is the i-th 1-indexed row of the matrix A, and A_{:j} is the j-th 1-indexed column of the matrix A.

New notation: \sign x = \begin{cases} 1 &\text{if } x > 0 \\ -1 &\text{if } x < 0 \\ \text{undefined} &\text{if } x = 0 \end{cases}.

New notation: derivative of a function is Df(x) = \lim_{\delta \to 0} \frac{f(x + \delta) - f(x)}{\delta}.

New notation: \min_{a: f(a), b: g(b), \ldots} f(a, b, c, \ldots) is the minimum value of f(a, b, c, \ldots) such that f(a), g(b), \ldots are all true. The a: f(a) part might also be written as just a if there's no constraints.

New notation: \argmin_{a: f(a), b: g(b), \ldots} f(a, b, c, \ldots) is the values of a, b, c, \ldots such that f(a, b, c, \ldots) is minimised and f(a), g(b), \ldots are all true. The a: f(a) part might also be written as just a if there's no constraints.

# 12/9/17

Consider the problem of filtering out spam emails. The training set would be a set X of emails (e.g., a vector where each dimension represents a feature, like whether word i appears in the email) and a set Y representing the spamminess of those emails (e.g., real number between -1 and 1). One of the most important parts of this task is making sure we have a good representation for features in our emails. In a bag of words model, for example, we might make X a 10000-dimensional vector where each element represents whether one of 10000 words appears in the email's subject.
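As a rough illustration of the bag of words representation, here is a minimal sketch. The 5-word `VOCABULARY` and the helper name `bag_of_words` are made up for this example (a real vocabulary would have something like the 10000 words mentioned above), and words are split naively on whitespace.

```python
# Hypothetical toy vocabulary; a real one would be much larger.
VOCABULARY = ["free", "money", "meeting", "winner", "report"]

def bag_of_words(text, vocabulary=VOCABULARY):
    """Return a 0/1 vector: element i is 1 if vocabulary[i] appears in text."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

features = bag_of_words("FREE money for every winner")
print(features)  # [1, 1, 0, 1, 0]
```

Each email subject becomes a fixed-length vector no matter how long the subject is, which is what lets us feed it to the models discussed later.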

In batch learning, we care about performance on the testing set X', and the training set is just the means by which we get there, by performing statistics on X and assuming things about X'. In online learning, data is received in a streaming fashion - we need to predict the value of y without knowing its true value.

In this course, we'll use <a, b> to represent the inner product a \cdot b = a^T b. Also, \sign(x) is 1 when x > 0, -1 when x < 0, and undefined when x = 0 (some other courses will instead define it to be 0).

## Perceptrons

The perceptron is a machine learning model based on a highly simplified model of a neuron. It takes in activation from neighboring neurons, computes their weighted sum, and then applies an activation function (here, the \sign function) to that sum to produce the neuron's output. We'll study Rosenblatt's original design from 1958, along with several additions and improvements made since then.

Perceptrons are used for binary classification problems. We are given a training set \set{\tup{\vec x_1, y_1}, \tup{\vec x_2, y_2}, \ldots} and a testing set \set{\vec t_1, \vec t_2, \ldots} where \vec x_i, \vec t_i are feature vectors, and y_i is the binary category, either -1 or 1. Using the training set, we want to train the perceptron to determine the category y_i for each \vec t_i.

A perceptron is simply y = \sign(\vec w \cdot \vec x + b), where \vec w is the perceptron's weights vector, b is the perceptron's bias, \vec x is the input, and y \in \set{-1, 1} is the prediction. Note that \vec w \cdot \vec x + b = 0 should define a hyperplane separating the points with positive y_i from the points with negative y_i, and the sign of \vec w \cdot \vec x + b determines which side of the hyperplane the point \vec x is on (the positive predictions side, or the negative predictions side). For now, let's assume that for the training set, there exists a hyperplane that separates all of the positives from all of the negatives - that the data is separable.

Now we'll try to simplify the perceptron formula to make it easier to work with. First, let's get rid of the \sign by multiplying both sides of the perceptron formula by y: y^2 = y \sign(\vec w \cdot \vec x + b), and since y is either -1 or 1, y^2 = 1, so y \sign(\vec w \cdot \vec x + b) = 1, or in other words, y (\vec w \cdot \vec x + b) > 0. Expand to get \vec w \cdot (y\vec x) + by > 0.

Let \vec w' = \begin{bmatrix} \vec w \\ b \end{bmatrix} and \vec a = \begin{bmatrix} y \vec x \\ y \end{bmatrix} - we've chosen these definitions specifically so that \vec w \cdot (y\vec x) + by > 0 is equivalent to \vec a \cdot \vec w' > 0, and so that the value of \vec w' represents the perceptron parameters exactly.

When training the perceptron, our goal is to fit the hyperplane to our training set. That means we'll want to make perceptron predictions in bulk, so it would be nice to be able to represent that in a compact way. To do this, we'll let A = \begin{bmatrix} \vec a_1 & \vec a_2 & \ldots \end{bmatrix}, where \vec a_i = \begin{bmatrix} y_i \vec x_i \\ y_i \end{bmatrix} - columns of A are values of \vec a corresponding to each value of \vec x_i. Written out fully, that's A = \begin{bmatrix} y_1 \vec x_1 & y_2 \vec x_2 & \ldots \\ y_1 & y_2 & \ldots \end{bmatrix}.

Clearly, A^T \vec w' > 0 (elementwise) is equivalent to \forall i, \sign(\vec w \cdot \vec x_i + b) = y_i. We've now simplified the perceptron problem down to a single matrix multiplication and a comparison! Now, \vec w' contains all the perceptron parameters, and the columns of A are the data points (each with a trailing 1 element), premultiplied by the label.
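To make the matrix form concrete, here is a small NumPy sketch that builds A from a toy dataset and checks whether a given \vec w' separates it. The data, the candidate weights, and the helper names `build_A` and `separates` are all made up for illustration; this only checks a candidate \vec w', it does not train one.

```python
import numpy as np

def build_A(X, y):
    """Columns of A are [y_i * x_i; y_i] for each training point (x_i, y_i)."""
    X = np.asarray(X, dtype=float)      # shape (n, d), one point per row
    y = np.asarray(y, dtype=float)      # shape (n,), entries are -1 or +1
    ones = np.ones((X.shape[0], 1))     # the trailing 1 that pairs with the bias
    return (np.hstack([X, ones]) * y[:, None]).T   # shape (d + 1, n)

def separates(A, w_prime):
    """True when A^T w' > 0 elementwise, i.e. every point is classified correctly."""
    return bool(np.all(A.T @ w_prime > 0))

# Toy data (made up): positives have first coordinate > 2, negatives < 2.
X = [[3.0, 1.0], [4.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
y = [1, 1, -1, -1]
A = build_A(X, y)

good = np.array([1.0, 0.0, -2.0])   # w = (1, 0), b = -2
bad = np.array([0.0, 1.0, 0.0])     # w = (0, 1), b = 0
print(separates(A, good), separates(A, bad))  # True False
```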

Now the problem becomes: given premultiplied data in a matrix A, find \vec w' such that A^T \vec w' > 0. The perceptron training algorithm does this, and works as follows: repeatedly choose a column \vec a_i of A, and if \vec a_i \cdot \vec w' \le 0, change \vec w' by adding \vec a_i to it. Stop when \vec a_i \cdot \vec w' > 0 for all \vec a_i in A, or when we reach an iteration/passes limit.

Why do we correct the weights when \vec a_i \cdot \vec w' \le 0 by adding \vec a_i to \vec w'? Well, the next time we choose the \vec a_i column, we'll get \vec a_i \cdot (\vec w' + \vec a_i) = \vec a_i \cdot \vec w' + \magn{\vec a_i}^2. Since \magn{\vec a_i}^2 > 0, \vec a_i \cdot (\vec w' + \vec a_i) > \vec a_i \cdot \vec w', so \vec a_i \cdot (\vec w' + \vec a_i) is closer to being positive.

(Python implementation not included, since this is a question in assignment 1)

After training, we can make predictions for any given input \vec x with the usual formula, y = \sign\left(\vec w' \cdot \begin{bmatrix} \vec x \\ 1 \end{bmatrix}\right).

This algorithm is very simple to implement, yet works quite well in practice. Also, the fact that its formula is a linear combination is interesting. If we look at the weights, we notice that large positive weights mean that the corresponding feature strongly suggests that the prediction should be positive, whereas large negative weights strongly suggest that the prediction should be negative.

How well does a perceptron converge when running the training algorithm described above? Block's perceptron convergence theorem gives us an idea. If A is separable (i.e., a hyperplane exists that separates positive category points from negative category points), then \vec w' will converge to some \vec w^*. If every column of A is selected infinitely often, then A^T \vec w^* > 0. Furthermore, if \vec w' = \vec 0 initially, then the perceptron converges after at most (R / \gamma)^2 iterations, where R = \max\set{\magn{\vec a_1}, \magn{\vec a_2}, \ldots} and \gamma = \max\set{\min\set{\vec w \cdot \vec a_1, \vec w \cdot \vec a_2, \ldots} : \magn{\vec w} \le 1} (the margin - the minimum distance between the convex hull of the positive points and the negative points). Essentially, the margin represents the distance between the "hardest" two datapoints to classify.

Note that these values of R and \gamma are purely functions of the dataset, and that they don't directly depend on the size of A and the number of dimensions d. In other words, the number of mistakes the perceptron makes would be independent of the dataset size and number of dimensions! The larger the margin is, the faster the perceptron converges. Block's perceptron convergence theorem gives us a worst case bound, but in many practical situations the perceptron will perform a lot better.

Also, the perceptron stops at an arbitrary linear separator that correctly separates the points, not necessarily the one that most cleanly separates the positive and negative points (with the largest possible minimum distance from the hyperplane to positive/negative predictions). In fact, the resulting hyperplane will even depend on the order we feed in the data. We can use support vector machines instead to find that hyperplane (which also happens to be unique for each dataset!). This is the main disadvantage of perceptrons - they might only barely separate the training data, so they're less robust to unseen data than those that find a linear separator with a larger margin.

If the data is not separable, Block's perceptron convergence theorem doesn't apply anymore. Convergence is only guaranteed if a separating hyperplane exists, but the perceptron boundedness theorem says that even if it doesn't, the iterates are still bounded, because the perceptron's state will start cycling after a certain number of iterations. In practice, this means we would stop training at a time or iteration limit, when the training/validation error stops changing, or, if we're using diminishing step sizes, when the weights stop changing much.

If we end up with non-separable data, we might want to find a better feature representation, use a deeper model, or use a soft margin - instead of a hyperplane that perfectly separates positive/negative values of y, we can allow a few mistakes.

There are many ways to extend perceptrons to classify things into more than two categories (positive/negative). One way is one vs. all, where we have one perceptron per category, and the perceptron with the highest activation level wins - \argmax_c(\vec w_c \cdot \vec x). The issue with this is that it's imbalanced - each perceptron has to give negative predictions far more often than positive ones, since it only gives a positive prediction for its own category and otherwise must give a negative prediction. Another is one vs. one, where we have one perceptron for every pair of categories, where a positive prediction means the datapoint is in the first category and negative means the other category, and then take a vote to find the most commonly predicted category as the final answer.
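The two prediction rules can be sketched as follows. This only covers prediction, not training, and everything here is an assumption for illustration: the weights are made up (not trained on anything), the function names are hypothetical, and an optional per-category bias is included alongside each weight vector.

```python
import numpy as np

def one_vs_all_predict(W, b, x):
    """Return the category c with the highest activation w_c . x + b_c."""
    return int(np.argmax(W @ x + b))

def one_vs_one_predict(pairwise, x, num_categories):
    """pairwise maps (first, second) -> (w, b); each perceptron casts one vote."""
    votes = np.zeros(num_categories, dtype=int)
    for (first, second), (w, b) in pairwise.items():
        votes[first if w @ x + b > 0 else second] += 1
    return int(np.argmax(votes))

# Made-up weights for 3 categories in 2 dimensions.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.zeros(3)
pairwise = {
    (0, 1): (W[0] - W[1], 0.0),
    (0, 2): (W[0] - W[2], 0.0),
    (1, 2): (W[1] - W[2], 0.0),
}

x = np.array([2.0, 0.5])
print(one_vs_all_predict(W, b, x), one_vs_one_predict(pairwise, x, 3))  # 0 0
```

Note that one vs. one needs a perceptron for every pair of categories, so the number of perceptrons grows quadratically with the number of categories.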

An example of applying perceptrons online is pricing - selling a product to the user at a price y, and updating weights if the price is too high and the user doesn't buy the product.

# 14/9/17

Assignment 1 now available, due in two weeks.

A pass is a run through all of the training data - 100 passes means we go through the training data 100 times. An iteration is a run through a single data point in our training data.

## Linear Regression

Consider a scatter plot of house market value vs. square footage. We'd expect that these two are pretty well correlated. A linear regression over these two variables can be used to give us a line of best fit.

Regression problems are about fitting models to match datasets as closely as possible. Linear regression problems try to fit linear models to datasets. When we're doing regression problems, we have to consider whether to use linear/nonlinear models, and whether we'll be using it to interpolate or extrapolate (choosing the perfect model is much more important for extrapolation)

Formally, a regression problem is: find f such that f(\vec x) \approxeq y, given samples of \vec x (the feature vector, a real vector) and y (the response value, a real number). The hard part of this is that \vec x and y are drawn from unknown distributions, which makes it hard to interpolate/extrapolate. Additionally, we need a way to express how much error there is in our model predictions - a loss function.

One family of regression algorithms is risk minimizers (expected loss minimizers): algorithms that try to find the f that attains \min_{f: \vec x \to y} E[L(f(\vec x), y)], where L is the loss function.

A common loss function is least squares: \min_{f: \vec x \to y} E[\magn{f(\vec x) - y}^2]. The correct loss function for a given situation is often hard to determine, so we use one that's simple and efficient to compute - least squares works well enough for most situations. Additionally, of all the minimizers of \min_W \magn{AWB - C}_F, W = A^+ C B^+ is the one with the smallest F-norm, where A^+ is the pseudo-inverse of A (Sondermann '86, Yu & Schuurmans '11) - this is mostly a theoretical result, but gives us another good reason to use least squares loss.

Clearly, E[\magn{f(\vec x) - y}^2] = E[\magn{f(\vec x) - E(y \mid \vec x)}^2] + E[\magn{E(y \mid \vec x) - y}^2]. Note that the second term doesn't really depend on f - it's the inherent noise variance, the noise that we can't get rid of no matter how good our regression function is. Also, the first term gives us the problem in a much simpler form: we want to find an f(\vec x) that approximates E(y \mid \vec x) well, to make this term smaller.
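The decomposition above takes one step that's worth spelling out. Writing m(\vec x) as shorthand for E(y \mid \vec x) (a name introduced here only for brevity), the cross term vanishes once we condition on \vec x:

```latex
\begin{aligned}
E[\magn{f(\vec x) - y}^2]
  &= E[\magn{(f(\vec x) - m(\vec x)) + (m(\vec x) - y)}^2] \\
  &= E[\magn{f(\vec x) - m(\vec x)}^2]
     + 2 E[(f(\vec x) - m(\vec x)) \cdot (m(\vec x) - y)]
     + E[\magn{m(\vec x) - y}^2] \\
E[(f(\vec x) - m(\vec x)) \cdot (m(\vec x) - y) \mid \vec x]
  &= (f(\vec x) - m(\vec x)) \cdot (m(\vec x) - E(y \mid \vec x)) = 0
\end{aligned}
```

Since the conditional expectation of the cross term is 0 for every \vec x, its overall expectation is 0 too, leaving just the two squared terms.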

One way to make this optimization process easier is to assume that f(\vec x) is linear, so f(\vec x) = E(\vec y \mid \vec x) = A \vec x + \vec b for some matrix A and vector \vec b. If we make this assumption, then with risk minimization we're trying to find \min_{A, \vec b} E[\magn{A \vec x + \vec b - \vec y}^2]. We can't minimize this directly because we don't know the true distribution of the variables, but by the law of large numbers, \frac 1 n \sum_{i = 1}^n Z_i \approx E(Z) for a large sample Z_1, \ldots, Z_n of independent draws of Z. So if we assume the model is linear, and the sample is large, then the risk minimization can be approximated by \min_{A, \vec b} \frac 1 n \sum_i \magn{A \vec x_i + \vec b - \vec y_i}^2 (this approximation is called the empirical risk).

Let's simplify the \min_{A, \vec b} \frac 1 n \sum_i \magn{A \vec x_i + \vec b - \vec y_i}^2 approximation, using something very similar to what we did for perceptrons. First, let's define W = \begin{bmatrix} A^T \\ {\vec b}^T \end{bmatrix} and \vec x' = \begin{bmatrix} \vec x \\ 1 \end{bmatrix}. Now we have \min_W \frac 1 n \sum_i \magn{W^T \vec x_i' - \vec y_i}^2, which is slightly shorter/cleaner.

Let \vec x_i be the ith value of \vec x in our training set. Just like for the perceptrons simplifications above, we also want to include all of the training set data points in a single expression, to make our minimization problem simpler. To do this, let X = \begin{bmatrix} {\vec x_1'}^T \\ {\vec x_2'}^T \\ \vdots \end{bmatrix}, Y = \begin{bmatrix} {\vec y_1}^T \\ {\vec y_2}^T \\ \vdots \end{bmatrix}. Now, we can write this as \min_W \magn{XW - Y}_F^2 (dropping the constant factor \frac 1 n, which doesn't change the minimizer), where \magn{A}_F = \sqrt{\sum_{i, j} A_{ij}^2} is the Frobenius norm - each element simply gets squared, the squares are all summed together, and we take the square root of the result, like the Euclidean norm, but extended for any matrix.

The least squares problem is now writeable as \min_W \magn{XW - Y}_F^2, and we're minimizing the sum of square residuals XW - Y (sum of square distances between the predicted values and true values). Here, Y is a matrix with rows as the true responses, and the residuals are the distances between each true response in Y and the point that the hyperplane would predict given X.

Note that the Frobenius norm can be defined as: \magn{A}_F^2 = \trace(A^T A). Additionally, the following are identities: \trace(A + B) = \trace(A) + \trace(B), \trace(AB) = \trace(BA), \trace(A) = \trace(A^T), and \trace(cA) = c \trace(A).

Therefore, \magn{XW - Y}_F^2 = \trace((XW - Y)^T (XW - Y)) = \trace((W^T X^T - Y^T) (XW - Y)) = \trace(W^T X^T X W - Y^T X W - W^T X^T Y + Y^T Y) = \trace(W^T X^T X W) - \trace((Y^T X W)^T) - \trace(W^T X^T Y) + \trace(Y^T Y) = \trace(W^T X^T X W) - \trace(W^T X^T Y) - \trace(W^T X^T Y) + \trace(Y^T Y) = \trace(W^T X^T X W - 2 W^T X^T Y + Y^T Y). Clearly, this is a quadratic equation with respect to W, and we want to find its minimum.

Consider \min_{\vec x} f(\vec x). Fermat's theorem says that at a minimum of a differentiable f, the derivative of f must be 0. Consider a general quadratic function f(\vec x) = \vec x^T A \vec x + \vec x^T \vec b + c. The derivative is then \frac{\dee f(\vec x)}{\dee \vec x} = (A + A^T)\vec x + \vec b.

Note that \magn{XW - Y}_F^2 = \trace(W^T(X^T X) W - 2W^T X^T Y + Y^T Y) is a quadratic function of W, with derivative 2 X^T X W - 2 X^T Y. If we set this derivative to 0, we get X^T X W = X^T Y, which is just a linear system - we have X and Y, so we can solve for W. Note that even when X^T X is invertible, we should still never solve for W by computing W = (X^T X)^{-1} X^T Y, since computing the inverse involves solving n linear systems, whereas we can solve it by solving only 1 linear system (in practice, we should almost never actually compute matrix inverses).
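As a sketch of what solving this system looks like in practice, assuming NumPy is available: the data below is synthetic and made up for illustration (y = 2 x_1 - 3 x_2 + 1 plus a little noise), and `np.linalg.lstsq` is shown only as a cross-check.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (made up): y = 2*x1 - 3*x2 + 1 + small noise.
n = 200
features = rng.normal(size=(n, 2))
Y = features @ np.array([[2.0], [-3.0]]) + 1.0 + 0.01 * rng.normal(size=(n, 1))

# Append the constant 1 to every feature vector, as in the definition of x'.
X = np.hstack([features, np.ones((n, 1))])

# Solve X^T X W = X^T Y as one linear system instead of forming the inverse.
W = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check: lstsq minimizes ||XW - Y|| directly, without forming X^T X.
W_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(W.ravel())  # approximately [2, -3, 1]
```

The last entry of W plays the role of the bias \vec b, because of the trailing 1 in each \vec x'.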

Once we have W, we can make predictions for any given X using \hat Y = XW, or evaluate those predictions with \magn{Y - \hat Y}_F^2. We can also evaluate using a different loss function, a technique often used in calibration theory.

Linear regression is disproportionately affected by large outliers. To mitigate this, we sometimes use Huber loss, which is linear for large differences and quadratic for smaller ones, where "larger" and "smaller" are defined by a threshold \delta. This ensures overly large outliers don't impact the result too much. Huber's loss function is defined as H(\hat y, y) = \begin{cases} \frac 1 2 (\hat y - y)^2 &\text{if } \abs{\hat y - y} \le \delta \\ \delta(\abs{\hat y - y} - \frac{\delta}{2}) &\text{otherwise} \end{cases}.
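A direct transcription of the Huber loss definition into Python might look like this. The function name and the default \delta = 1 are arbitrary choices for the sketch.

```python
def huber_loss(y_hat, y, delta=1.0):
    """Quadratic for small residuals, linear for residuals larger than delta."""
    residual = abs(y_hat - y)
    if residual <= delta:
        return 0.5 * residual ** 2
    return delta * (residual - 0.5 * delta)

print(huber_loss(1.5, 1.0))   # 0.125 (quadratic region)
print(huber_loss(11.0, 1.0))  # 9.5 (linear region; half the squared error would be 50.0)
```

The two pieces agree at \abs{\hat y - y} = \delta, so the loss is continuous there.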

Ill-posed problems are those that don't have exactly one solution (zero or more than one solution) or don't have their solutions change continuously with respect to the problem initial conditions (i.e., derivative of solution with respect to initial condition doesn't always exist). To handle this sort of regression task, we can use Tikhonov regularization. To do this, we just add a term to the formula: \min_W \magn{XW - Y}_F^2 + \lambda \magn{W}_F^2, whose minimizer is the solution of (X^T X + \lambda I)W = X^T Y. A small positive \lambda ensures that instead of a small change in the input resulting in a huge difference in the output, it would result in a difference proportional to \frac{1}{\lambda} instead. Another way to handle ill-posed problems is to use data augmentation - essentially, adding more data points to make the data appear more regular.
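Here is a small sketch of the regularized system on a deliberately ill-posed toy problem. The data is made up: both columns of X are identical, so X^T X is singular and the unregularized problem has infinitely many solutions. The helper name `ridge_fit` is hypothetical.

```python
import numpy as np

def ridge_fit(X, Y, lam):
    """Solve (X^T X + lam * I) W = X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Made-up ill-posed problem: the two columns are identical.
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
Y = np.array([[2.0], [4.0], [6.0]])

print(np.linalg.matrix_rank(X.T @ X))  # 1, so X^T X is not invertible

W = ridge_fit(X, Y, lam=1e-3)
print(W.ravel())  # approximately [1, 1]
```

Of all the weight vectors with w_1 + w_2 = 2 that fit this data exactly, the regularizer steers the solution toward the one with the smallest norm.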

How do we choose hyperparameters like \lambda for Tikhonov regularization? We have a training set (for model training), testing set (which we don't see until the end), and sometimes a small validation set (for tuning hyperparameters), and on the training set, we can apply n-fold cross-validation. Suppose we have k different values of \lambda we want to consider:

1. Split the training set into n roughly-equal sized chunks.
2. For each value of \lambda we want to cross-validate:
    1. For each chunk i:
        1. Train the model on the dataset formed by combining the n - 1 chunks that are not chunk i.
        2. Evaluate the model against the training data in chunk i.
    2. Average the evaluation scores from each chunk to get the average cross-validation score for the value of \lambda.
3. Pick the value of \lambda that has the best average cross-validation score.
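The steps above can be sketched as follows, assuming NumPy, a regularized least squares model, and mean squared error as the evaluation score (so "best" means lowest). The synthetic data, the candidate \lambda values, and the names `ridge_fit` and `cross_validate` are all assumptions made for this example.

```python
import numpy as np

def ridge_fit(X, Y, lam):
    """Solve (X^T X + lam * I) W = X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def cross_validate(X, Y, lambdas, n_folds=5, seed=0):
    """Return (best lambda, mean held-out squared error for each lambda)."""
    rng = np.random.default_rng(seed)
    # Step 1: split the (shuffled) training indices into roughly equal chunks.
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    scores = []
    for lam in lambdas:                       # step 2
        errors = []
        for i, held_out in enumerate(folds):  # step 2.1
            train = np.concatenate([f for j, f in enumerate(folds) if j != i])
            W = ridge_fit(X[train], Y[train], lam)
            errors.append(np.mean((X[held_out] @ W - Y[held_out]) ** 2))
        scores.append(float(np.mean(errors)))  # step 2.2
    return lambdas[int(np.argmin(scores))], scores  # step 3

# Synthetic data (made up): a noisy linear function of 3 features.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
Y = X @ np.array([[1.0], [-2.0], [0.5]]) + 0.1 * rng.normal(size=(60, 1))

lambdas = [1e-3, 1e-1, 10.0, 1e4]
best_lambda, scores = cross_validate(X, Y, lambdas)
print(best_lambda, scores)
```

On data like this, the very large \lambda shrinks the weights toward 0 and gets a much worse score, so it doesn't get picked.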

# 19/9/17

Guest lecture by Francois from Focal Systems (francois@focal.systems).

Almost any ML problem falls into regression or classification.

For linear regression, we're assuming that the response variable y is approximated by \vec \theta \cdot \vec x + N(0, \sigma), where N(0, \sigma) is a normal distribution centered around 0 with standard deviation \sigma. Further overview of some real-world details in implementing linear regression.

Though linear regression is simplistic, it turns out that it works very well in practice, since more complex models require more advanced ways to do regularization and get decent weight vectors. Tools like SVM are used a lot in the real world to model real phenomena, even when those phenomena aren't necessarily linear, because they work well enough for most purposes.

Most modern ML problems use SGD - stochastic gradient descent.

A Bernoulli model gives the likelihood of the labels as P(Y_1 = y_1, \ldots, Y_n = y_n \mid X_1 = x_1, \ldots, X_n = x_n). If we assume that each Y_i is an independent Bernoulli variable with y_i \in \set{0, 1} and success probability p(x_i; w), we can then write this as \prod_i P(Y_i = y_i \mid X_i = x_i) = \prod_i p(x_i; w)^{y_i} (1 - p(x_i; w))^{1 - y_i}.

Logistic regression tries to predict the value of 0 \le y \le 1 given \vec x by fitting the formula \frac 1 {1 + \exp(-\vec w \cdot \vec x)}, where \vec w is the thing that we're trying to fit.

We use a sigmoid rather than, say, a step function, because the step function's gradient doesn't carry any signal - if we differentiate it, the derivative is just 0 everywhere (except at the step itself, where it's undefined), so gradient descent wouldn't be able to get closer to the solution at every step. Instead, the sigmoid formula has a gentle curve, so its derivative is more suitable for performing gradient descent on.

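Our loss function is then f(y', y) = -(y \ln(y') + (1 - y) \ln(1 - y')), where y' is the model's prediction and y is the true value. This is the negative logarithm of the Bernoulli likelihood {y'}^y (1 - y')^{1 - y}, so minimizing the loss maximizes the likelihood of the observed labels.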

Tensorflow example, implementing logistic regression using the built in gradient descent optimizer to minimize the loss function. When doing gradient descent, we want the largest learning rate that still converges.
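The in-class example used TensorFlow's built-in optimizer. As a rough stand-in, here is a plain NumPy sketch of the same idea with the gradient written out by hand. It assumes the sigmoid \frac 1 {1 + \exp(-t)}, labels in \set{0, 1}, and the negated log-likelihood (cross-entropy) as the loss. The data, learning rate, and step count are made up.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cross_entropy(y_pred, y):
    """Mean of -(y ln y' + (1 - y) ln(1 - y')) over the dataset."""
    return float(-np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred)))

def fit_logistic(X, y, learning_rate=0.5, steps=2000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        y_pred = sigmoid(X @ w)
        gradient = X.T @ (y_pred - y) / len(y)  # gradient of the mean cross-entropy
        w -= learning_rate * gradient
    return w

# Made-up 1D data with overlapping classes; the column of ones acts as the bias.
x_values = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])
X = np.column_stack([x_values, np.ones_like(x_values)])

w = fit_logistic(X, y)
print(w, cross_entropy(sigmoid(X @ w), y))
```

With all-zero weights every prediction is 0.5 and the loss is \ln 2, so any useful fit should end up below that.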

;wip: logistic regression

# 21/9/17

Guest lecture by Aghastya from Focal Systems (aghastya@focal.systems).

A perceptron tells you which side of a hyperplane a point is on, and also includes an algorithm to separate two classes with a hyperplane - a binary classifier. Logistic regression finds a line of best fit, as well as a measure of confidence that our prediction is correct, because the prediction's value is between 0 and 1 rather than exactly 0 or 1 - a binary classifier as well. The logistic regression gives a more extreme prediction (closer to 0 or 1) the farther the point is from the logistic regression's hyperplane.

Perceptrons aren't very good binary classifiers overall for reasons we've previously discussed, but they're computationally cheap to run and easy to reason about. One example of their limitations is the XOR problem - a hyperplane cannot separate an XOR function: two classes \set{\tup{0, 1}, \tup{1, 0}} and \set{\tup{0, 0}, \tup{1, 1}}.

The goal of deep learning classification is to learn a representation of the input into something that's linearly separable, so we could then use classifier techniques like logistic regression.

Consider a two-layer perceptron. We have parameters U, \vec c, \vec w, b. Let \vec z = U\vec x + \vec c and \vec h = f(\vec z), where f is a nonlinear function like x^2 or \arctan(x), applied elementwise. Then the output of the two-layer perceptron is \hat y = \vec h \cdot \vec w + b. There are three layers here:

1. The linear layer takes the model input \vec x and transforms it into another linear space via weights U and the bias term \vec c, to get the hidden layer inputs \vec z.
    • This is the part of the perceptron before the \sgn function.
2. The hidden layer takes the hidden layer inputs \vec z and applies a non-linear function to them, so we can represent non-linearity in the input, to get the hidden layer outputs \vec h (the inputs to the next linear layer).
    • The function used in these hidden layers is known as an activation function.
    • Common activation functions are sigmoid \frac{1}{1 + \exp(-x)}, inverse tangent \arctan(x), and rectified linear unit (ReLU) \max(0, x).
3. The linear layer takes the hidden layer output \vec h and transforms it into another linear space via weights \vec w and the bias term b.

Essentially, we're weighting the input \vec x by the weight matrix U and adding the bias term \vec c (a value independent of the input that influences the output in a particular direction) in a linear layer to get the hidden layer inputs \vec z. Then, we apply the non-linearity to those to get the hidden layer output \vec h, which is also the input to the next linear layer, which is a linear classifier. ;wip: explain this better

The idea is to compose a bunch of non-linear functions to learn an arbitrary function, and then use that composition of non-linear functions so that the result is linearly separable. The non-linear functions add complexity to the model, and by adding enough of those non-linear functions, we can represent a lot of complex things.

The width of a multi-layer perceptron is the number of hidden units, i.e., the number of dimensions of \vec z (the number of rows of U). Note that this doesn't include the bias term \vec c.

We can't just use linear functions because composing linear functions just makes another linear function - if we applied linear functions on a dataset that wasn't linearly separable, it would still not be linearly separable, but if we apply non-linear functions, it may end up linearly separable.

Consider now an example: U = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \vec c = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \vec w = \begin{bmatrix} 2 \\ -4 \end{bmatrix}, b = -1, f(t) = \max(t, 0). The activation function is ReLU (rectified linear unit). As it turns out, this can correctly classify the output of the XOR function.
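We can check this example numerically. The sketch below plugs the parameters above into \hat y = \vec h \cdot \vec w + b with ReLU as the activation; the function names `relu` and `two_layer` are just for illustration.

```python
import numpy as np

# Parameters from the example above.
U = np.array([[1.0, 1.0], [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([2.0, -4.0])
b = -1.0

def relu(t):
    return np.maximum(t, 0.0)

def two_layer(x):
    z = U @ x + c            # first linear layer
    h = relu(z)              # hidden layer (activation function)
    return float(h @ w + b)  # second linear layer

outputs = [two_layer(np.array(x, dtype=float)) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)  # [-1.0, 1.0, 1.0, -1.0]
```

The output is positive exactly for \tup{0, 1} and \tup{1, 0}, so taking the sign of \hat y gives XOR.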

We can stack these two-layer perceptrons - feeding the output of one into the input of another. It's actually been proven that if the activation function is not a polynomial, and there are enough layers, the resulting model can theoretically learn any function that has values between 0 and 1!

The multi-layer perceptron learns a hierarchical non-linear feature representation, which we can then apply our existing classification/regression tools to.

As we go from levels closer to the input to levels farther away, we start being able to classify higher-and-higher-level features. For example, an image classifier network might start with learning lines/circles in the lower levels, and then shapes like trees and cars and faces in the higher levels.

Choosing the right activation function and architecture is currently more of an art than a science - there are a lot of nuances that we'll go over in later lectures. Usually, we'll just add more layers than we need, and regularize afterward.

## Backpropagation

The multilayer perceptron's output is clearly a function of the input, \hat y = q(\vec x; \Theta), where \Theta is the parameters of the model. We usually don't know the true function q^* that we're trying to approximate, but deep learning samples a lot of values of q^* and then tries to approximate that by composing lots of nonlinear functions to get q.

To do machine learning, we need to be able to talk about the difference between the prediction \hat y and the truth y - a loss function l(\Theta). We already looked at cross-entropy, but for now we'll just use the L2 norm (\hat y - y)^2 ;wip: are those the same things?

Gradient descent is a method of minimizing a differentiable function by iteratively taking steps toward a minimum.

We want to minimize our loss function using gradient descent - minimizing L(\Theta) = \frac 1 n \sum l(q(x_i; \Theta), y_i) by iteratively computing \Theta_{t + 1} = \Theta_t - \eta_t \nabla L(\Theta_t). Here, \nabla L(\Theta_t) = \frac{\dee}{\dee \Theta_t} L(\Theta_t) is the gradient, and \eta_t is the learning rate.

The loss function should ideally get smaller as we get closer to a good answer, but really we just need a function l(x', x) (x' is the prediction, x is the true value) that's 0 when we have a correct answer, and positive otherwise.

If the function is differentiable, the gradient is the vector of its partial derivatives. Note that since we're subtracting the gradient, the gradient actually points away from the direction we're optimizing toward.

The gradient at a particular node in the computation graph is known as a local gradient.

;wip: rewrite this section more clearly, talk about forward and backward pass

Backpropagation is a method of computing the gradient of l(\Theta) with respect to \Theta. Suppose we have r = f(q(x)). By the chain rule, \frac{\dee r}{\dee x} = \frac{\dee f}{\dee q} \frac{\dee q}{\dee x}, so we can easily recursively differentiate to get the gradient.

Backpropagation is implemented by building a computation graph, differentiating at the output layer, and working our way back to the input layer. The innovation is that it's much faster than the naive way of computing the gradient, where we hold all but one variable constant and find the partial derivative.
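As a rough sketch of the forward and backward passes, here is backpropagation written out by hand for a two-layer perceptron with \arctan as the activation and squared error as the loss. The parameter values are random and made up, and the final lines compare one entry of the gradient against the naive approach of nudging a single variable.

```python
import numpy as np

def loss_and_gradients(U, c, w, b, x, y):
    # Forward pass: compute and remember every intermediate value.
    z = U @ x + c
    h = np.arctan(z)
    y_hat = h @ w + b
    loss = (y_hat - y) ** 2

    # Backward pass: apply the chain rule from the loss back to the parameters.
    d_y_hat = 2.0 * (y_hat - y)
    d_w = d_y_hat * h
    d_b = d_y_hat
    d_h = d_y_hat * w
    d_z = d_h / (1.0 + z ** 2)   # arctan'(z) = 1 / (1 + z^2)
    d_U = np.outer(d_z, x)
    d_c = d_z
    return loss, {"U": d_U, "c": d_c, "w": d_w, "b": d_b}

rng = np.random.default_rng(0)
U, c, w, b = rng.normal(size=(3, 2)), rng.normal(size=3), rng.normal(size=3), 0.1
x, y = np.array([0.5, -1.0]), 1.0

loss, grads = loss_and_gradients(U, c, w, b, x, y)

# Naive check: nudge one entry of U and measure the change in the loss.
eps = 1e-6
U_plus, U_minus = U.copy(), U.copy()
U_plus[0, 1] += eps
U_minus[0, 1] -= eps
numeric = (loss_and_gradients(U_plus, c, w, b, x, y)[0]
           - loss_and_gradients(U_minus, c, w, b, x, y)[0]) / (2 * eps)
print(grads["U"][0, 1], numeric)
```

The naive check needs two extra forward passes per parameter, while the backward pass gets the gradient for every parameter at once.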

Backpropagation is an expensive algorithm, and was the reason neural networks weren't practical for a long time, until we got cheap GPU-based computation.

# 26/9/17

Assignment 1 deadline extended to Thursday. In-class overview of the assignment.

For the winnow algorithm, normalizing is pretty important, since the step size depends on the maximum absolute value of the data, unlike the perceptron, where those initial parameters don't matter as much.

The winnow algorithm will always converge, regardless of the step size. This is because of the normalization step in the loop, where we divide by the sum of the weights and biases. However, certain step size values will minimize the number of passes before we get convergence.

Closed-form solution for alternating minimization on ridge regression:

We want to solve z^* = \argmin_z \frac 1 2 \magn{\vec X_{:j} z + \sum_{k \ne j} \vec X_{:k} w_k - \vec y}^2 + \lambda z^2.
Let \vec u = \vec X_{:j} and \vec v = \sum_{k \ne j} \vec X_{:k} w_k - \vec y, so z^* = \argmin_z \frac 1 2 \magn{\vec u z + \vec v}^2 + \lambda z^2 = \argmin_z \frac 1 2 \magn{\vec u}^2 z^2 + \vec u \cdot \vec v z + \frac 1 2 \magn{\vec v}^2 + \lambda z^2 = \argmin_z (\frac 1 2 \magn{\vec u}^2 + \lambda) z^2 + \vec u \cdot \vec v z + \frac 1 2 \magn{\vec v}^2.
Since \frac 1 2 \magn{\vec v}^2 doesn't depend on z, it doesn't affect the value of \argmin_z. So z^* = \argmin_z (\frac 1 2 \magn{\vec u}^2 + \lambda) z^2 + \vec u \cdot \vec v z.
Clearly, \frac{\dee}{\dee z} \left(\left(\frac 1 2 \magn{\vec u}^2 + \lambda\right) z^2 + \vec u \cdot \vec v z\right) = \left(\magn{\vec u}^2 + 2\lambda\right)z + \vec u \cdot \vec v.
If we set the derivative to 0, then z = -\frac{\vec u \cdot \vec v}{\magn{\vec u}^2 + 2\lambda}. Since the objective is a quadratic in z with a positive leading coefficient (for \lambda > 0), this is the only local optimum, so it is also the global minimum. Therefore, the objective is minimized at z^* = -\frac{\vec u \cdot \vec v}{\magn{\vec u}^2 + 2\lambda}.
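
The closed-form update can be sketched in plain Python as below. This is only an illustration under stated assumptions: the full objective is taken to be \frac 1 2 \magn{\vec X \vec w - \vec y}^2 + \lambda \sum_k w_k^2, the toy data is made up, and the function names are hypothetical.

```python
def ridge_objective(X, y, w, lam):
    """Assumed full objective: 0.5 * ||X w - y||^2 + lam * sum_k w_k^2."""
    residuals = [sum(x_k * w_k for x_k, w_k in zip(row, w)) - y_i
                 for row, y_i in zip(X, y)]
    return 0.5 * sum(r * r for r in residuals) + lam * sum(w_k * w_k for w_k in w)


def coordinate_update(X, y, w, j, lam):
    """Closed-form minimizer z* over coordinate j, other weights held fixed."""
    u = [row[j] for row in X]  # u = X_{:j}
    v = [sum(row[k] * w[k] for k in range(len(w)) if k != j) - y_i
         for row, y_i in zip(X, y)]  # v = sum_{k != j} X_{:k} w_k - y
    u_dot_v = sum(a * b for a, b in zip(u, v))
    u_norm_sq = sum(a * a for a in u)
    return -u_dot_v / (u_norm_sq + 2.0 * lam)


def alternating_minimization(X, y, lam, sweeps=50):
    """Cycle through the coordinates, applying the closed-form update to each."""
    w = [0.0] * len(X[0])
    for _ in range(sweeps):
        for j in range(len(w)):
            w[j] = coordinate_update(X, y, w, j, lam)
    return w


# Toy data, made up for illustration only.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y = [1.0, 2.0, 3.0]
w = alternating_minimization(X, y, lam=0.5)
```

Each update can only lower the objective, since it exactly minimizes over one coordinate while the others stay fixed.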

## Nearest neighbor rule

Suppose we have a perceptron classifier \hat y = \sgn(\vec x \cdot \vec w + b). The hyperplane \vec x \cdot \vec w + b = 0 is the decision boundary here, and we can do a lot with it. This is a parametric classifier, because it uses a finite-dimensional set of parameters \vec w and b. In contrast, a non-parametric classifier has an arbitrarily large number of parameters. The nearest-neighbor rule is non-parametric, because it keeps the entire training set around, so the model grows with the number of training points.

The nearest-neighbor rule is a way to predict y given a feature vector \vec x and a training set. Given a feature vector \vec x (the query point), find the training set entry \tup{\vec x', y'} such that \vec x' is the nearest to \vec x by some distance metric, then predict that y = y'. Essentially, we take the nearest point to the given one, and simply take that point's value as the prediction.
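
A minimal sketch of the rule described above, assuming Euclidean distance and a made-up training set. The names `squared_euclidean` and `nearest_neighbor_predict` are hypothetical. Comparing squared distances gives the same nearest point as comparing distances, so the square root is skipped.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b))


def nearest_neighbor_predict(training_set, query):
    """Return y' of the training pair (x', y') whose x' is nearest to the query."""
    _, nearest_y = min(training_set,
                       key=lambda pair: squared_euclidean(pair[0], query))
    return nearest_y


# Made-up training set of (feature vector, label) pairs.
training_set = [((0.0, 0.0), -1), ((1.0, 1.0), -1),
                ((4.0, 4.0), 1), ((5.0, 5.0), 1)]
prediction = nearest_neighbor_predict(training_set, (3.5, 4.2))
```

Here the query is closest to (4, 4), so the prediction is that point's label. The scan looks at every training point once per query.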

The distance metric d(x_1, x_2) must be symmetric (d(x_1, x_2) = d(x_2, x_1)), definite (d(x_1, x_2) = 0 exactly when x_1 = x_2), and satisfy the triangle inequality (d(x_1, x_3) \le d(x_1, x_2) + d(x_2, x_3)). Some examples of distance metrics are those induced by the L_2 norm (Euclidean distance, square root of sum of squares), L_1 norm (Manhattan distance, sum of absolute values), and L_\infty norm (Chebyshev distance, the max absolute value), each applied to the difference \vec x_1 - \vec x_2. In general, the L_p norm of a vector \vec x is (\sum_i \abs{x_i}^p)^{\frac 1 p}.
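
To make the three example metrics concrete, here is a small sketch that computes the distance induced by an L_p norm on the difference of two vectors. The function name `lp_distance` and the sample points are assumptions for illustration, and `math.inf` is used to select the Chebyshev case.

```python
import math


def lp_distance(x1, x2, p):
    """Distance induced by the L_p norm; pass math.inf for Chebyshev distance."""
    diffs = [abs(a - b) for a, b in zip(x1, x2)]
    if p == math.inf:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


a, b = (0.0, 0.0), (3.0, 4.0)
manhattan = lp_distance(a, b, 1)         # sum of absolute differences
euclidean = lp_distance(a, b, 2)         # square root of sum of squares
chebyshev = lp_distance(a, b, math.inf)  # largest absolute difference
```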

The nearest-neighbor rule doesn't need any training, but in general takes O(nd) space to store the entire training set (n is number of points, d is number of dimensions).

If we implement nearest-neighbor naively, it takes O(nd) time to get a prediction. We can do better by constructing a Voronoi diagram - a diagram where the space is partitioned into n areas, each of those n areas contains exactly one point, and the point is always the nearest neighbor within its associated area. In 2D, this takes O(n \log n) time and O(n) space. In general, this takes n^{O(d)} space and O(d \log n) query time, which is still rather bad. In the real world, we often use faster, approximate nearest-neighbor algorithms that take advantage of hashing and similar techniques.

Nearest-neighbor is affected by normalization if it distorts the distance metric, such as if we scale one dimension but not another. Usually, we normalize each feature by subtracting its mean over all the feature vectors and then dividing by its standard deviation over all the feature vectors.
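
A rough sketch of that normalization, applied per feature. The population standard deviation (dividing by n) and the guard for constant features are assumptions made here, and `standardize` is a hypothetical name.

```python
import math


def standardize(points):
    """Shift each feature to mean 0 and scale it to standard deviation 1."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    stds = [math.sqrt(sum((p[j] - means[j]) ** 2 for p in points) / n)
            for j in range(d)]
    # Assumption: leave constant features unscaled to avoid dividing by zero.
    stds = [s if s > 0 else 1.0 for s in stds]
    return [[(p[j] - means[j]) / stds[j] for j in range(d)] for p in points]


# Made-up data where the second feature has a much larger scale.
points = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
scaled = standardize(points)
```

Before scaling, the second feature would dominate any Euclidean distance. Afterwards both features contribute on the same scale.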

Nearest-neighbor depends heavily on the distance metric. There's actually a way to learn a good metric from the training set, starting from the Mahalanobis distance d_M(x_1, x_2) = \sqrt{(\vec x_1 - \vec x_2) \cdot M (\vec x_1 - \vec x_2)} where M \in \mb{S}_+ (M is symmetric positive semidefinite). This is equivalent to M = L L^T for some d by h matrix L. In other words, we're projecting \vec x_1 - \vec x_2 into an h-dimensional space with \vec z = L^T(\vec x_1 - \vec x_2), then taking the Euclidean norm of that point within that space ;wip
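
The equivalence between the quadratic form with M = L L^T and the Euclidean norm after projecting by L^T can be checked numerically. The sketch below uses a made-up 3 by 2 matrix L stored as a list of rows, and all function names are hypothetical.

```python
import math


def project(L, diff):
    """Compute L^T diff for a d-by-h matrix L stored as a list of rows."""
    d, h = len(L), len(L[0])
    return [sum(L[i][j] * diff[i] for i in range(d)) for j in range(h)]


def mahalanobis_projected(x1, x2, L):
    """Euclidean norm of L^T (x1 - x2)."""
    diff = [a - b for a, b in zip(x1, x2)]
    return math.sqrt(sum(z * z for z in project(L, diff)))


def mahalanobis_direct(x1, x2, L):
    """sqrt((x1 - x2) . M (x1 - x2)) with M = L L^T built explicitly."""
    d, h = len(L), len(L[0])
    M = [[sum(L[i][k] * L[j][k] for k in range(h)) for j in range(d)]
         for i in range(d)]
    diff = [a - b for a, b in zip(x1, x2)]
    quadratic_form = sum(diff[i] * M[i][j] * diff[j]
                         for i in range(d) for j in range(d))
    return math.sqrt(quadratic_form)


# Made-up example with d = 3 and h = 2.
L = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
x1, x2 = [1.0, 2.0, 3.0], [0.0, 0.0, 1.0]
projected = mahalanobis_projected(x1, x2, L)
direct = mahalanobis_direct(x1, x2, L)
```

Both routes give the same number, but the projected form only needs the h-dimensional vector, which is cheaper when h is much smaller than d.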

The k-nearest-neighbor rule is the nearest-neighbor rule, but instead of just finding the nearest neighbor, we find the k nearest neighbors in the training set, and then combine their y values, usually by taking either the mode (majority vote) or the mean (usually k is odd, so for boolean y there is never a tie).
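
A minimal majority-vote sketch, assuming Euclidean distance. The training set is made up and includes one point labelled 1 sitting inside the -1 cluster, and `knn_classify` is a hypothetical name.

```python
from collections import Counter


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b))


def knn_classify(training_set, query, k):
    """Majority vote over the labels of the k nearest training points."""
    neighbors = sorted(training_set,
                       key=lambda pair: squared_euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Made-up data: (0.6, 0.6) is labelled 1 but sits among the -1 points.
training_set = [((0.0, 0.0), -1), ((1.0, 0.0), -1), ((0.0, 1.0), -1),
                ((0.6, 0.6), 1), ((3.0, 3.0), 1), ((3.0, 4.0), 1)]
query = (0.5, 0.5)
one_nn = knn_classify(training_set, query, 1)
three_nn = knn_classify(training_set, query, 3)
```

With k = 1 the stray point decides the prediction, while with k = 3 it is outvoted by its two nearest -1 neighbors.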

The k in k-nearest-neighbors sort of acts like a regularization parameter - the higher k is, the less complicated the resulting model is (the decision boundaries in the Voronoi diagram are smoother). Intuitively, the higher k is, the more points we have to change to get the prediction for a query point to change.

We want to pick a k that avoids overfitting, but doesn't make the resulting predictions overly simple. Usually this is done by just trying a bunch of k values and evaluating using cross-validation.
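
One way to sketch this selection is leave-one-out cross-validation, which is an assumption here since the notes do not say which cross-validation scheme is used. The data set and the names `leave_one_out_error` and `choose_k` are made up for illustration.

```python
from collections import Counter


def squared_euclidean(a, b):
    return sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b))


def knn_classify(training_set, query, k):
    """Majority vote over the labels of the k nearest training points."""
    neighbors = sorted(training_set,
                       key=lambda pair: squared_euclidean(pair[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


def leave_one_out_error(data, k):
    """Fraction of points misclassified when each is held out in turn."""
    mistakes = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_classify(rest, x, k) != y:
            mistakes += 1
    return mistakes / len(data)


def choose_k(data, candidates):
    """Return the candidate k with the lowest held-out error (first on ties)."""
    errors = {k: leave_one_out_error(data, k) for k in candidates}
    best_k = min(candidates, key=lambda k: errors[k])
    return best_k, errors


# Made-up data: two clusters plus one noisy label at (0.5, 0.5).
data = [((0.0, 0.0), -1), ((0.0, 1.0), -1), ((1.0, 0.0), -1), ((1.0, 1.0), -1),
        ((5.0, 5.0), 1), ((5.0, 6.0), 1), ((6.0, 5.0), 1), ((6.0, 6.0), 1),
        ((0.5, 0.5), 1)]
best_k, errors = choose_k(data, [1, 3, 5])
```

On this toy set the noisy point hurts k = 1 much more than the larger values, so the procedure prefers a larger k.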

k-nearest-neighbors works surprisingly well on problems with a low number of dimensions and a large training set, and is often used in real-world problems. It works less well with higher dimensionality or smaller training sets.

As the size of the training set tends to infinity, P(Y_1 \ne Y \mid X) \le 2P^* - \frac{c}{c - 1} (P^*)^2 (Cover & Hart, 1967). P(Y_1 \ne Y \mid X) is the probability of a misprediction, c is the number of categories/classes, and P^* is the Bayes error (the theoretical minimum error, for any machine learning model). Essentially, this is saying that 1-nearest-neighbor will have an error within twice the theoretical minimum error - a very useful result for bounding the error!

# 28/9/17

Review of the last class.

The Bayes error is defined by P^* = \min_{f : \mathcal{X} \to \set{-1, 1}} P(f(X) \ne Y). Here, X is a feature vector, f(X) is the classifier's prediction, and Y is the true value. Essentially, it's the minimum error rate for any possible classifier f(X), regardless of how that classifier is implemented.

The Bayes rule is the best possible classifier, since it achieves an error equivalent to the Bayes error. ;wip: write down the rule

Derivation:

Clearly, P(f(X) \ne Y) = 1 - P(f(X) = Y). We don't know Y, but we do know it's either -1 or 1, so P(f(X) \ne Y) = 1 - P(f(X) = 1 \land Y = 1) - P(f(X) = -1 \land Y = -1).
We now have unknown X and unknown Y. We're going to fix X and look at the probability of Y, a technique known as conditioning. This works because P(f(X) = 1, Y = 1) = \int_{f(x) = 1, y = 1} P(x \land y) \dee x \dee y = \int_{f(x) = 1, y = 1} P(y \mid x) P(x) \dee x \dee y = \int_{f(x) = 1} \left(\int_{y = 1} P(y \mid x) \dee y \right) \dee x = E(1_{f(X) = 1} P(Y = 1 \mid X)). Here, 1_{f(X) = 1} is an indicator function - a function that is 1 if f(X) = 1, and 0 otherwise.
So P(f(X) \ne Y) = 1 - E[P(f(X) = 1, Y = 1 \mid X)] - E[P(f(X) = -1, Y = -1 \mid X)] = 1 - E(1_{f(X) = 1} P(Y = 1 \mid X)) - E(1_{f(X) = -1} P(Y = -1 \mid X)) - we conditioned on X and took the expectation.
Let \eta(X) = P(Y = 1 \mid X), so 1 - \eta(X) = P(Y = -1 \mid X). Clearly, 1_{f(X) = -1} = 1 - 1_{f(X) = 1}. ;wip: copy the rest from the slides Substituting both, P(f(X) \ne Y) = 1 - E(1_{f(X) = 1} \eta(X)) - E((1 - 1_{f(X) = 1})(1 - \eta(X))) = E(\eta(X) + 1_{f(X) = 1}(1 - 2 \eta(X))).

Note that \eta(X) is already defined, but we can choose our own 1_{f(X) = 1}. Since we want to minimize P(f(X) \ne Y), we want 1_{f(X) = 1} to be 1 exactly when 1 - 2 \eta(X) \le 0. This tells us that the optimal classifier f^* is f^* = \begin{cases} 1 &\text{if } \eta(X) \ge \frac 1 2 \\ -1 &\text{otherwise} \end{cases}.
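
To see the rule in action, the sketch below uses a tiny made-up discrete distribution where \eta is known exactly, which is the unrealistic part. It evaluates the error formula E(\eta(X) + 1_{f(X) = 1}(1 - 2 \eta(X))) for the Bayes rule and for every other classifier on the same domain. All names and numbers are assumptions for illustration.

```python
from itertools import product

# Hypothetical distribution: P(X = x) and eta(x) = P(Y = 1 | X = x).
p_x = {"a": 0.5, "b": 0.3, "c": 0.2}
eta = {"a": 0.9, "b": 0.4, "c": 0.5}


def bayes_rule(x):
    """Predict 1 exactly when eta(x) >= 1/2."""
    return 1 if eta[x] >= 0.5 else -1


def error_rate(classifier):
    """P(f(X) != Y) = E[eta(X) + 1_{f(X)=1} * (1 - 2 * eta(X))]."""
    total = 0.0
    for x, prob in p_x.items():
        indicator = 1 if classifier(x) == 1 else 0
        total += prob * (eta[x] + indicator * (1 - 2 * eta[x]))
    return total


bayes_error = error_rate(bayes_rule)

# Enumerate all 2^3 classifiers on this domain and record their error rates.
all_errors = []
for labels in product([-1, 1], repeat=len(p_x)):
    table = dict(zip(p_x, labels))
    all_errors.append(error_rate(lambda x, table=table: table[x]))
```

No classifier in the enumeration beats `bayes_error`, which matches the claim that the rule attains the minimum.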

Since we won't have \eta(X) = P(Y = 1 \mid X) in most real-world problems, it's generally not possible to implement this in practice.

For more than two classes, we can do something similar to get f^*(X) = \argmax_{1 \le m \le c} P(Y = m \mid X), and the new Bayes error would be P^* = E(1 - \max_{1 \le m \le c} P(Y = m \mid X)). This is the best we can possibly do assuming the data is sampled independently and identically from the distribution of \tup{X, Y}.

Note that in the worst case, P^* = \frac{c - 1}{c}. So the more classes we have, the larger the maximum possible Bayes error is.

So looking back at the formula we saw last class, as the training set size goes to infinity for 1-nearest-neighbor, P(Y_1 \ne Y \mid X) \le 2P^* - \frac{c}{c - 1} (P^*)^2. So in the worst case, the 1-nearest-neighbor error will be at most twice the Bayes error. For this to be roughly true in practice though, n must grow exponentially with respect to d, which is generally not practical.
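
Plugging numbers into the bound can help build intuition. The sketch below is just arithmetic on the formula above, with hypothetical function names. It also checks that at the worst-case Bayes error \frac{c - 1}{c} the bound collapses to the Bayes error itself.

```python
def cover_hart_bound(bayes_error, num_classes):
    """Asymptotic 1-NN bound: 2 P* - c / (c - 1) * (P*)^2."""
    c = num_classes
    return 2 * bayes_error - (c / (c - 1)) * bayes_error ** 2


def worst_case_bayes_error(num_classes):
    """Largest possible Bayes error with c classes: (c - 1) / c."""
    return (num_classes - 1) / num_classes


# Two classes with a Bayes error of 10%.
two_class_bound = cover_hart_bound(0.1, 2)

# At the worst-case Bayes error, the bound equals the Bayes error.
worst_case_gaps = [
    abs(cover_hart_bound(worst_case_bayes_error(c), c) - worst_case_bayes_error(c))
    for c in (2, 3, 10)
]
```

For small P^* the squared term is negligible and the bound is close to 2P^*, which is where the "twice the Bayes error" summary comes from.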

The error rate for k-nearest-neighbor for some k = 2t + 1 is \frac 1 {2^n} \sum _{0 \le i \le t} {n \choose i} (this simplifies to \frac 1 {2^n} for 1-NN). This tells us that a larger k is actually going to give us a higher error in the worst case. Consider how this worst case occurs, when all the points are in random classes.

We want to use a smaller k when the classes are easier to separate, and a larger k when the classes tend to be harder to separate - larger values for "harder" problems, smaller for "easier" problems. We usually choose k via cross-validation.

SSBD page 224: For any c > 1 and any learning algorithm L, there exists a distribution over [0, 1]^d \times \set{0, 1} such that \eta is c-Lipschitz and the Bayes error is 0, but for sample sizes n \le \frac{(c + 1)^d}{2}, the error probability for L is greater than \frac 1 4 (here c is the Lipschitz constant, not the number of classes). Essentially, with a limited training set size, there always exists a distribution on which a given learning algorithm is wrong more than 25% of the time, even though an ideal classifier would get it perfectly.

This is especially applicable to k-NN, and we often try to avoid this by increasing training set size and doing dimensionality reduction (e.g., projecting a higher dimension into a lower dimension while preserving the most important elements of the features).

;wip: locally linear slide?

For regression, we can also use something similar to k-nearest-neighbors. Instead of the training set points having discrete classes, they instead have real values. Essentially, given a feature vector \vec x, we take the k-nearest-neighbors in the training set, and average their y values to get the prediction y'. We might additionally use a weighted average based on each neighbor's distance to \vec x.
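
A small sketch of both variants follows. Inverse-distance weighting is only one possible choice of weights and is an assumption here, as are the function name `knn_regress` and the toy data.

```python
import math


def knn_regress(training_set, query, k, weighted=False):
    """Average (optionally distance-weighted) y over the k nearest neighbors."""
    with_distance = sorted(
        ((math.sqrt(sum((a - b) ** 2 for a, b in zip(x, query))), y)
         for x, y in training_set),
        key=lambda pair: pair[0],
    )[:k]
    if not weighted:
        return sum(y for _, y in with_distance) / len(with_distance)
    # If the query coincides with training points, return their average y.
    exact = [y for dist, y in with_distance if dist == 0.0]
    if exact:
        return sum(exact) / len(exact)
    # Assumed weighting: each neighbor counts in proportion to 1 / distance.
    weights = [1.0 / dist for dist, _ in with_distance]
    return sum(w * y for w, (_, y) in zip(weights, with_distance)) / sum(weights)


# Made-up 1D data following y = x^2.
training_set = [((0.0,), 0.0), ((1.0,), 1.0), ((2.0,), 4.0), ((3.0,), 9.0)]
plain = knn_regress(training_set, (1.25,), k=2)
weighted = knn_regress(training_set, (1.25,), k=2, weighted=True)
```

The weighted prediction is pulled toward the closer neighbor at x = 1, so it lands nearer the true value of the underlying curve than the plain average does.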

## Hard-Margin Support Vector Machine (SVM)

Recall that the naive perceptron algorithm assumes that the data is linearly separable, and requires that to be true to converge. Furthermore, the perceptron will find any separating hyperplane, depending on how we feed in the data, rather than what we might consider the "best" separating hyperplane.

The perceptron is solving the minimization problem "minimize 0 subject to y_i(\vec w^T \vec x_i + b) > 0 for all i" - there's no objective function, so we're just finding a feasible solution. The SVM also finds a feasible solution, but also tries to maximize the margin - the minimum distance from the hyperplane to any training set point.
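
To make the margin concrete, the sketch below computes \min_i \frac{y_i(\vec w \cdot \vec x_i + b)}{\magn{\vec w}} for a given hyperplane. The function name `margin` and the four training points are made up for illustration. A positive value means the hyperplane separates the data, and a negative value means at least one point is on the wrong side.

```python
import math


def margin(w, b, points):
    """Minimum signed distance y_i * (w . x_i + b) / ||w|| over the points."""
    norm = math.sqrt(sum(w_j * w_j for w_j in w))
    return min(
        y * (sum(w_j * x_j for w_j, x_j in zip(w, x)) + b) / norm
        for x, y in points
    )


# Made-up linearly separable data.
points = [((2.0, 0.0), 1), ((3.0, 1.0), 1),
          ((-1.0, 0.0), -1), ((-2.0, 1.0), -1)]

feasible = margin((1.0, 0.0), 0.0, points)   # separates, but hugs one class
centered = margin((1.0, 0.0), -0.5, points)  # same direction, shifted over
flipped = margin((-1.0, 0.0), 0.0, points)   # wrong orientation
```

Both of the first two hyperplanes are feasible for the perceptron's problem, but the shifted one has the larger margin, which is the one an SVM would prefer among these two.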

Given a separating hyperplane H defined by \vec w \cdot \vec x + b = 0, we want to translate and rotate it until the margin is maximized. First we'll normalize the scale: \frac{\vec w}{\magn{\vec w}} \cdot \vec x + \frac{b}{\magn{\vec w}} = 0. We translate the hyperplane by changing b, and rotate it by changing \vec w. Suppose we translate it upward by s until it hits a point, and downward by t until it hits a point.

Now we have the hyperplanes H_1 defined by \frac{\vec w}{\magn{\vec w}} \cdot \vec x + \frac{b}{\magn{\vec w}} = s and H_{-1} defined by \frac{\vec w}{\magn{\vec w}} \cdot \vec x + \frac{b}{\magn{\vec w}} = -t, both touching points and having an empty space between them.

Clearly, the distance between H and H_1 is . Clearly, we maximize the distance by ;wip: put into formula slide