Lecture Notes by Anthony Zhang.


Special Topics in Computer Science - Introduction to Machine Learning.

Yaoliang Yu
Section 001
Email: yaoliang.yu@uwaterloo.ca
Website: http://cs.uwaterloo.ca/~y328yu/mycourses/489
Office Hours: Tuesdays/Thursdays 2:40pm-3:40pm in DC-3617
Tuesdays/Thursdays 4:00pm-5:20pm


Course is potentially going to be collaboration between professor and Focal Systems - guest lectures by deep learning engineers from Focal and assignments from real-world systems. Contact agastya@focal.systems for questions and comments.

Course questions on Piazza, content on LEARN. Course will use MATLAB, Python, or Julia, and requires CS341 concepts.

No required textbooks. 5 semi-weekly assignments worth 50% total, submittable via LEARN, remaining 50% from an open book final exam. There's also a 5% bonus project, consisting of a 1 page proposal and 8 page report on a machine-learning-related project. ;wip: conssider one for robotics control

Machine learning is about giving computers the ability to learn things that they aren't explicitly programmed to do. More specifically, for a machine to learn is to, as experience E increases, improve a performance measure P for some class of tasks T. Essentially, a program learns if it gets better at solving a problem as it gains more experience.

Machine learning falls into three categories:

Modern ML research focuses on representation of data (e.g., feature engineering), interpretation of results, generalizing models to different domains (e.g., applying image classifiers to video), time/space complexity, learning efficiency (how many samples do we need? how big does the training set need to be?), and real-world applications.

New notation: A_i is the i-th 1-indexed row of the matrix A, and A_{:j} is the j-th 1-indexed column of the matrix A.

New notation: \sign x = \begin{cases} 1 &\text{if } x > 0 \\ -1 &\text{if } x < 0 \\ \text{undefined} &\text{if } x = 0 \end{cases}.

New notation: derivative of a function is Df(x) = \lim_{\delta \to 0} \frac{f(x + \delta) - f(x)}{\delta}.

New notation: \min_{a: f(a), b: g(b), \ldots} f(a, b, c, \ldots) is the minimum value of f(a, b, c, \ldots) such that f(a), g(b), \ldots are all true. The a: f(a) part might also be written as just a if there's no constraints.

New notation: \argmin_{a: f(a), b: g(b), \ldots} f(a, b, c, \ldots) is the values of a, b, c, \ldots such that f(a, b, c, \ldots) is minimised and f(a), g(b), \ldots are all true. The a: f(a) part might also be written as just a if there's no constraints.


Consider the problem of filtering out spam emails. The training set would be a set X of emails (e.g., a vector where each dimension represents a feature, like whether word i appears in the email) and a set Y representing the spamminess of those emails (e.g., real number between -1 and 1). One of the most important parts of this task is making sure we have a good representation for features in our emails. In a bag of words model, for example, we might make X a 10000-dimensional vector where each element represents whether one of 10000 words appears in the email's subject.

In batch learning, we care about performance on the testing set X', and the training set is just the means by which we get there, by performing statistics on X and assuming things about X'. In online learning, data is received in a streaming fashion - we need to product the value of y without knowing its true value.

In this course, we'll use <a, b> to represent the inner product a \cdot b = a^T b. Also, \sign(x) is 1 when x > 0, -1 when x < 0, and undefined when x = 0 (some other courses will instead define it to be 0).


The perceptron is a machine learning model based on a highly simplified model of a neuron. It takes in activation from neighboring neurons, takes their weighted sum, and then applies the activation function to them, the \sign function, which is the neuron's output. We'll study Rosenblatt's original design from 1958, along with several additions and improvements made since then.

Perceptrons are used for binary classification problems. We are given a training set \set{\tup{\vec x_1, y_1}, \tup{\vec x_2, y_2}, \ldots} and a testing set \set{\vec t_1, \vec t_2, \ldots} where \vec x_i, \vec t_i are feature vectors, and y_i is the binary category, either -1 or 1. Using the training set, we want to train the perceptron to determine the category y_i for each \vec t_i.

A perceptron is simply y = \sign(\vec w \cdot \vec x + b), where \vec w is the perceptron's weights vector, b is the perceptron's bias, \vec x is the input, and y \in \set{-1, 1} is the prediction. Note that \vec w + b should be a hyperplane separating the positive values of y_i from the negative values of y_i, and the sign of \vec w \cdot \vec x + b determines which side of the hyperplace the point \vec x is on (the positive predictions side, or the negative predictions side). For now, let's assume that for the training set, there exists a hyperplane that separates all of the positives from all of the negatives - that the data is separable.

Now we'll try to simplify the perceptron formula to make it easier to work with. First, let's get rid of the \sign by multiplying both sides of the perceptron formula by y: y^2 = y \sign(\vec w \cdot \vec x + b), and since y is either -1 or 1, y^2 = 1, so y \sign(\vec w \cdot \vec x + b) = 1, or in other words, y (\vec w \cdot \vec x + b) > 0. Expand to get \vec w \cdot (y\vec x) + by > 0

Let \vec w' = \begin{bmatrix} \vec w \\ b \end{bmatrix} and a = \begin{bmatrix} y \vec x \\ y \end{bmatrix} - we've chosen these definitions specifically so that \vec w \cdot (y\vec x) + by > 0 is equivalent to a \cdot w' > 0, and so that the value of \vec w' represents the perceptron parameters exactly.

When training the perceptron, our goal is to fit the hyperplane to our training set. That means we'll want to make perceptron predictions in bulk, so it would be nice to be able to represent that in a compact way. To do this, we'll let A = \begin{bmatrix} \vec a_1 & \vec a_2 & \ldots \end{bmatrix}, where \vec a_i = \begin{bmatrix} y_i \vec x_i \\ y_i \end{bmatrix} - columns of A are values of \vec a corresponding to each value of \vec x_i. Written out fully, that's A = \begin{bmatrix} y_1 \vec x_1 & y_2 \vec x_2 & \ldots \\ y_1 & y_2 & \ldots \end{bmatrix}.

Clearly, A^T \vec w' > 0 is equivalent to \forall \vec x_i, \sign(\vec w' \cdot \vec x_i + b) = y. We've now simplified the perceptron problem down to a single matrix multiplication and a comparison! Now, \vec w' contains all the perceptron parameters, and the columns of A are the data points (each with a trailing 1 element), premultiplied by the label.

Now the problem becomes: given premultiplied data in a matrix A, find \vec w' such that A^T \vec w' > 0. The perceptron training algorithm does this, and works as follows: repeatedly choose a column \vec a_i of A, and if \vec a_i \cdot \vec w' \le 0, change \vec w' by adding \vec a_i to it. Stop when \vec a_i \cdot \vec w' > 0 for all \vec a_i in A, or when we reach an iteration/passes limit.

Why do we correct the weights when \vec a_i \cdot \vec w' \le 0 by adding \vec a_i to \vec w'? Well, the next time we choose the \vec a_i column, we'll get \vec a_i \cdot (\vec w' + \vec a_i) = \vec a_i \cdot \vec w' + \magn{\vec a_i}^2. Since \magn{\vec a_i}^2 > 0, \vec a_i \cdot (\vec w' + \vec a_i) > \vec a_i \cdot \vec w', so \vec a_i \cdot (\vec w' + \vec a_i) is closer to being positive.

(Python implementation not included, since this is a question in assignment 1)

After training, we can make predictions for any given input \vec x with the usual formula, y = \sign\left(\vec w' \cdot \begin{bmatrix} \vec x \\ 1 \end{bmatrix}\right).

This algorithm is very simple to implement, yet works quite well in practice. Also, the fact that its formula is a linear combination is interesting. If we look at the weights, we notice that large positive weights mean that the corresponding feature strongly suggests that the prediction should be positive, whereas large negative weights strongly suggest that the prediction should be negative.

How well does a perceptron converge when running the training algorithm described above? Block's perceptron convergence theorem gives us an idea. If A is separable (i.e., a hyperplane exists that separates positive cateogry points from negative category points), then \vec w' will converge to some \vec w^*. If every column of A is selected indefinitely often, then A^T \vec w^* > 0. Furthermore, if \vec w' = \vec 0 initially, then the perceptron converges after at most (R / \gamma)^2 iterations, where R = \max\set{\magn{a_1}, \magn{a_2}, \ldots} and \gamma = \max\set{\min\set{\vec w \cdot \vec a_1, \vec w \cdot \vec a_2, \ldots} : \magn{\vec w} \le 1} (the margin - the minimum distance between the convex hull of the positive points and the negative points). Essentially, the margin represents the distance between the "hardest" two datapoints to classify.

Note that these values of R and \gamma are purely functions of the dataset, and that they don't directly depend on the size of A and the number of dimensions d. In other words, the number of mistakes the perceptron makes would be independent of the dataset size and number of dimensions! The larger the margin is, the faster the perceptron converges. Block's perceptron convergence theorem gives us a worst case bound, but in many practical situations the perceptron will perform a lot better.

Also, the perceptron stops at an arbitrary linear separator that correctly separates the points, not necessarily the one that most cleanly separates the positive and negative points (with the largest possible minimum distance from the hyperplane to positive/negative predictions). In fact, the resulting hyperplane will even depend on the order we feed in the data. We can use support vector machines instead to find that hyperplane (which also happens to be unique for each dataset!). This is the main disadvantage of perceptrons - they might only barely separate the training data, so they're less robust to unseen data than those that find a linear separator with a larger margin.

If the data is not separable, Block's perceptron convergence theorem doesn't apply anymore. The perceptron boundedness theorem says that convergence is only guaranteed if such a hyperplane exists, but if it doesn't, then the iterations are still bounded, because the perceptron's state will start cycling after a certain number of iterations. In practice, this means we would specify a time or iteration limit when doing training, or when the training/validation error stops changing, or even if weights stop changing much when using diminishing step sizes.

If we end up with non-separable data, we might want to find a better feature representation, use a deeper model, or use a soft margin - instead of a hyperplane that perfectly separates positive/negative values of y, we can allow a few mistakes.

There are many ways to extend perceptrons to classify things into more than two categories (positive/negative). One way is one vs. all, where we have one perceptron per category, perceptron with highest activation level wins - \max_c(w_c \cdot x). The issue with this is that it's imbalanced - each perceptron has to give negative predictions far more often than positive ones, since it only gives positive prediction for its own category and otherwise must give a negative prediction. Another is one vs. one, where we have one perceptron for every pair of categories, where a positive prediction means the datapoint is in the first category and negative means the other category, and then take a vote to find the most commonly predicted category as the final answer.

An example of applying perceptrons online is pricing - selling a product to the user at a price y, and updating weights if the price is too high and the user doesn't buy the product.


Assignment 1 now available, due in two weeks.

A pass is a run through all of the training data - 100 passes means we go through the training data 100 times. An iteration is a run through a single data point in our training data.

Linear Regression

Consider a scatter plot of house market value vs. square footage. We'd expect that these two are pretty well correlated. A linear regression over these two variables can be used to give us a line of best fit.

Regression problems are about fitting models to match datasets as closely as possible. Linear regression problems try to fit linear models to datasets. When we're doing regression problems, we have to consider whether to use linear/nonlinear models, and whether we'll be using it to interpolate or extrapolate (choosing the perfect model is much more important for extrapolation)

Formally, a regression problem is: find f(\vec x) \approxeq \vec y given \vec x (the feature vector a real vector) and y (the response value, a real number). The hard part of this is that \vec x and y are drawn from unknown distributions, which makes it hard to interpolate/extrapolate. Additionally, we need a way to express how much error there is in our model predictions - a loss function.

One family of regression algorithms is risk minimizers (expected loss minimizers): algorithms that try to find f such that \min_{f: \vec x \to y} E[L(f(\vec x), y)], where L is the loss function.

A common loss function is least squares: \min_{f: \vec x \to y} E[\magn{f(\vec x) - y}^2]. The correct loss function for a given situation is often hard to determine, so we use one that's simple and efficient to compute - least squares works well enough for most situations. Additionally, of all the minimizers of $_W _F, W = A^+ CB^+$ is the one with the smallest F-norm, where A^+ is the pseudo-inverse of A (Sondermann '86, Yu & Shuurmans '11) - this is mostly a theoretical result, but gives us another good reason to use least squares loss.

Clearly, E[\magn{f(\vec x) - y}^2] = E[\magn{f(\vec x) - E(y \mid \vec x)}^2] + E[\magn{E(y \mid \vec x) - y}^2]. Note that the second term doesn't really depend on f - it's the inherent noise variance, the noise that we can't get rid of no matter how good our regression function is. Also, the first term gives us the problem in a much simpler form: we want to find an f(\vec x) that approximates E(y \mid \vec x) well, to make this term smaller.

One way to make this optimization process easier is to assume that f(\vec x) is linear, so f(\vec x) = E(\vec y \mid \vec x) = A \vec x + \vec b for some matrix A. If we make this assumption, then with risk minimization we're trying to find \min_{f: \vec x \to \vec y} E[A \vec x + \vec b - \vec y]. We can't minimize this directly because we don't know the true distribution of the variables, but using the law of large numbers, \frac 1 n \sum Z_i = E(Z) for any Z = \set{Z_1, Z_2, \ldots}. So if we assume the model is linear, and the sample is large, then the risk minimization can be approximated by \min_{\vec a, \vec b} \frac 1 n \sum \magn{A \vec x + \vec b - \vec y}^2 (this approximation is called the empirical risk).

Let's simplify the \min_{\vec a, \vec b} \frac 1 n \sum \magn{A \vec x + \vec b - \vec y}^2 approximation, using something very similar to what we did for perceptrons. First, let's define W = \begin{bmatrix} A^T \\ {\vec b}^T \end{bmatrix} and \vec x' = \begin{bmatrix} \vec x \\ 1 \end{bmatrix}. Now we have \min_W \frac 1 n \sum \magn{W^T \vec x' - \vec y}^2, which is slightly shorter/cleaner.

Let \vec x_i be the ith value of \vec x in our training set. Just like for the perceptrons simplifications above, we also want to include all of the training set data points in a single expression, to make our minimization problem simpler. To do this, let X = \begin{bmatrix} {\vec x_1'}^T \\ {\vec x_2'}^T \\ \vdots \end{bmatrix}, Y = \begin{bmatrix} {\vec y_1}^T \\ {\vec y_2}^T \\ \vdots \end{bmatrix}. Now, we can write this as \min_W \magn{XW - Y}_F^2 where \magn{A}_F = \sum_{i, j} A_ij is the Frobenius norm - each element simply gets squared and the squares are all summed together to get the result, like the Euclidean norm, but extended for any matrix.

The least squares problem is now writeable as \min_W \magn{XW - Y}_F^2, and we're minimizing the sum of square residuals XW - Y (sum of square distances between the predicted values and true values). Here, Y is a matrix with columns as the true responses, and the residuals are the distances between each true response in Y and the point that the hyperplane would predict given X.

Note that the Frobenius norm can be defined as: \magn{A}_F^2 = \trace{A^T A}. Additionally, the following are identities: \trace(A + B) = \trace(A) + \trace(B), \trace(AB) = \trace(BA), \trace(A) = \trace(A^T), and \trace(cA) = c \trace(A).

Therefore, \magn{XW - Y}_F^2 = \trace((XW - Y)^T (XW - Y)) = \trace((W^T X^T - Y^T) (XW - Y)) = \trace(W^T X^T X W - Y^T X W - W^T X^T Y + Y^T Y) = \trace(W^T X^T X W) - \trace((Y^T X W)^T) - \trace(W^T X^T Y) + \trace(Y^T Y) = \trace(W^T X^T X W) - \trace(W^T X^T Y) - \trace(W^T X^T Y) + \trace(Y^T Y) = \trace(W^T X^T X W - 2 W^T X^T Y + Y^T Y). Clearly, this is a quadratic equation with respect to W, and we want to find its minimum.

Consider \min_x f(x). Fermat's theorem says that at the minimum x, the derivative of f(x) must be 0. Consider a general quadratic function f(x) = \vec x^T A \vec x + \vec x^T \vec b + c. The derivative is then \frac{\dee f(x)}{\dee x} = (A + A^T)\vec x + \vec b.

Note that \magn{XW - Y}_F^2 = W^T(X^T X) W - 2W^T X^T Y + Y^T Y (a quadratic equation), and if set the derivative of this to 0 and solve we get X^T X W = X^T Y as a solution, which is just a linear system - we have X and Y, so we can solve for W. Note that X^T X might be invertible, but we should still never solve for W by using W = (X^T X)^{-1} X^T Y, since this involves solving n linear systems, whereas we can solve it by solving only 1 linear system (in practice, we should almost never actually compute matrix inverses).

Once we have W, we can make predictions for any given X using \hat Y = XW, or evaluate those predictions with (Y - \hat Y)^2. We can also evaluate using a different loss function, a technique often used in calibration theory.

Linear regression is disproportionally affected by large outliers. To mitigate this, we sometimes use Huber loss, which is linear for large differences and quadratic for smaller ones, where "larger" and "smaller" are defined by a threshold \delta. This ensures overly large outliers don't impact the result too much. Huber's loss function is defined as H(\hat y, y) = \begin{cases} \frac 1 2 (\hat y - y)^2 &\text{if } \abs{\hat y - y} \le \delta \\ \delta(\abs{\hat y - y} - \frac{\delta}{2}) &\text{otherwise} \end{cases}.

Ill-posed problems are those in which small changes in the input to linear models results in huge differences in the output. To handle this sort of regression task, we can use Tiknohov regularization. To do this, we just add a term to the formula: \min_W \magn{XW - Y}_F^2 + \lambda \magn{W}_F^2, or equivlaently, (X^T X + \lambda I)W = X^T Y. A small positive lambda ensures that instead of a small change in the input resulting in a huge difference in the output, it would result in a difference proportional to \frac{1}{\lambda} instead. Another way to handle ill-posed problems is to use data augmentation - essentially, adding more data points to make the data appear more regular.

How do we choose hyperparameters like \lambda for Tiknohov regularization? We have a training set (for model training), testing set (which we don't see until the end), and sometimes a small validation set (for tuning parameters), and on the training set, we can apply cross-validation. Suppose we have k different values of \lambda we want to consider. First, we split the training set into n chunks, and evaluate the model for each different value of \lambda for each of the chunks - a total of kn evaluations. The overall performance score for each value of \lambda is then the sum of the performance scores for that value of \lambda within each chunk. We can then choose the value of \lambda with the greatest overall performance score.


Guest lecture by Francois from Focal Systems (francois@focal.systems).

Almost any ML problem falls into regression or classification.

For linear regression, we're assuming that the response variable y is approximated by \vec \theta \cdot \vec x + N(0, \sigma), where N(0, \sigma) is a normal distribution centered around 0 with standard deviation \sigma. Further overview of some real-world details in implementing linear regression.

Though linear regression is simplistic, it turns out that it works very well in practice, since more complex models require more advanced ways to do regularization and get decent weight vectors. Tools like SVM are used a lot in the real world to model real phenomena, even when they aren't necessarily linear, because it works well enough for most purposes.

Most modern ML problems use SGD - stochastic gradient descent.

A Bernoulli model predicts y = P(Y_1 = y_1, \ldots, Y_n = y_n \mid X_1 = x_1, \ldots, X_n = x_n). If we assume that the distribution of the variables are a Bernoulli distribution, so they're independent, we can then write this as y = \prod P(Y_i = y_i \mid X_i = x_i) = \prod p(x_i; w)^{y_i} (1 - p(x_i; w)) ;wip: get the formula for this

Logistic regression tries to predict the value of 0 \le y \le 1 given \vec x by fitting the formula \frac 1 {1 + \exp(\vec w \cdot x)}, where \vec w is the thing that we're trying to fit.

We use a sigmoid rather than, say, a step function, because the gradient doesn't have any signal - if we differentiate it, the derivative is just 0 everywhere, so gradient descent wouldn't be able to get closer to the solution at every step. Instead, the sigmoid formula has a gentle curve, so its derivative is more suitable for performing gradient descent on.

Our loss function is then f(y', y) = \ln(y') * y + \ln(1 - y') (1 - y), where y' is the model's prediction and y is the true value. ;wip: why???

Tensorflow example, implementing logistic regression using the built in gradient descent optimizer to minimize the loss function. When doing gradient descent, we want the largest learning rate that still converges.

Creative Commons License This work by Anthony Zhang is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Copyright 2013-2017 Anthony Zhang.