Lecture Notes by Anthony Zhang.


Introduction to Artificial Intelligence.

Alan Tsang
Section 001
Email: akhtsang@uwaterloo.ca
Website: https://cs.uwaterloo.ca/~klarson/teaching/F17-486/, https://cs.uwaterloo.ca/~akhtsang/cs486.html
Office Hours: Tuesdays 1:15pm to 2:45pm in DC2306B, accessible via the AI Lab
Mondays/Wednesdays 4:00pm-5:20pm


Info on course website, assignments on LEARN, questions on Piazza. 5 assignments worth 40% total, 15% midterm on October 18, 2017, 45% final exam, optional project worth 5% bonus marks. Assignments can be submitted up to 48 hours after deadline.

There's no real agreed-upon definition for AI, but in this course we'll consider it to be the mathematical study of intelligent action in a complex environment. We usually think about 4 different kinds of AI: systems that think like humans, act like humans, think rationally, or act rationally. The definitions usually differ along the dimensions of thinking vs. acting and human behaviour vs. idealized rational behaviour. In other words, we're trying to duplicate what the human brain does (passing the turing test), and what the human brain should do (acting rationally).

Overview of AI in popular perception: threats of superintelligence, automation displacing jobs, biases encoded in AI models, self-driving cars. Overview of the turing test, including issues like not being amenable to mathematical analysis, whether it's more important to understand underlying principles than mimicking true intelligence, and whether it's even a good test (we don't make planes by trying to make something so bird-like that it fools a bird). Overview of history of AI, like Turing's work, the Dartmouth meeting, AI winter, and modern computational AI.

Classically, AI was seen to mean general problem solving, like solving chess. More recently, we focus more on specific problems, such as perception, robotics, and deliberative reasoning (thinking about what to do based on available data), with a greater focus on statistical methods, decision theories, and probability.

In this course we'll talk about classic AI problems like search, as well as knowledge representation/reasoning, making decisions under uncertainty, and learning.

An agent is an entity that perceives and acts. We usually evaluate agents based on things like how well it achieves goals and how many resources it consumes. Agents exist within an environment, with actuators, sensors, and environmental constraints. Environments can be fully/partially observable (can we see the entire environment at once?), deterministic/stochastic (does the same action always result in the same result?), episodic/dynamic (do previous steps affect later steps? in classification no, since previous predictions are independent, but in chess yes, since previous moves are very important), discrete/continuous (chess AI could be considered discrete, whereas controlling a robot arm might be considered continuous), and single/multi agent (is the agent the only one in its environment?).


Search is one of the first topics in AI, used in everything from theorem proving to robotic navigation. Search problems consist of a state space (the world we're searching within), a successor function (given a state, determines possible actions/next states and the cost of getting to those states), a start state (the state at the start of the search), and a goal test (determines whether we've reached the goal, and possibly also information about how close we are to the goal). A path is a sequence of states and operations. A solution is a path from the start state to a state that satisfies the goal test. An optimal solution is a solution of minimal cost.

Consider a pac-man AI. the state space would be the game map, the successor function would define actions like, moving left/right/up/down and how expensive those would be, the start state would be the initial game state, and the goal test would be whether we've gotten to the end of the level.

Overview of N-queens problem and why it's a search problem, as well as various other types of search problems. For our purposes, search problems won't have elements of chance, continuous states, adversaries, or partial observation - all of our problems will be deterministic, discrete, single-agent, and fully observable.

The world state includes every detail of the environment at a certain point in the search process. For something like pac-man, we might care about pac-man's position, the dots, and the ghosts, but pathfinding might only have current position in its state.

Search problems can be thought of as minimum-cost pathfinding on a graph of world states, so we can apply algorithms like breadth-first search and A*. The search tree is the tree made of possible paths within the graph of world states. The factors we care about in search algorithms are things like completeness (will it always find a solution?), optimality (does it always find the minimum cost solution?), time complexity, and space complexity. While DFS is light on memory and goes deep quickly (O(b^h) time and O(h) space, where b is arity of tree and h is tree height), cycles in the graph can result in infinite loops, so it's not complete. BFS is complete and can find shallower solutions quickly, but needs a lot of memory to store visited states (O(b^h) space and time where b is arity of tree and h is tree height).

Instead, we can use iterative deepening search (IDA*), which is essentially a mix of both - do DFS with a small depth limit, then repeatedly increase the limit (usually linearly) and rerun DFS until we reach the goal. This is both memory-efficient and complete, since the depth limit prevents infinite loops. This sounds wasteful, but it isn't too bad in practice since the number of nodes we visit in each iteration tends to grow exponentially, so most nodes we visit will be in the later iterations.

Note that all three of these algorithms we've looked at so far are the same besides the order we visit nodes in - they're generic and aren't specialized for specific problems. We can do better in specific problems by taking advantage of their inherent structure.


Assignment 1 will be out this week.

Uninformed search is search that doesn't depend on any knowledge about the problem, like BFS or DFS, or iterative deepening. It's widely suitable, but performs poorly and is computationally expensive. We would like to do informed search instead, where we can take advantage of knowledge we have about the problem to make our searches more efficient.

Usually when searching a graph, we often have knowledge about the merit of a node - some approximation of how good the node is. There are often different ways to defined merit for any given problem, such as how much it would cost to use the node in our solution, or how computationally expensive it would be to use it. For example, uninformed search expands nodes based on the distance from the start node, but we might augment this by using a heuristic like Euclidean distance to the goal, if we're searching in a Euclidean space.

A heuristic estimates the distance from any given node to the goal node, such as Euclidean or Manhattan distance. In our search, if node A has a lower heuristic than node B, we might choose node A over B (and also factor in things like how far from the start nodes A or B are).

Best-first search is the most naive informed search. At each step, we expand exactly the node with the lowest heuristic function value - the node that seems closest to the goal function. This is not a complete search, since it might get stuck in a minimim-heuristic cycle of nodes (we don't keep track of which nodes we've already visited). Additionally, this is not optimal, since the resulting path isn't always the shortest one - the length of the path taken so far at any given node isn't taken into account.

While best-first search looked at the estimated forward cost (expand node with lowest estimated cost from current node to goal), breadth-first search looked at the backward cost (expand node with lowest cost from start to current node). We want to combine these two to make breadth-first search more efficient, and this gives us A* search. Essentially, let g(n) be the backward cost for the node n and h(n) be an estimate of the forward cost for the node n. We then perform essentially breadth-first search, repeatedly queueing up each visited nodes' neighbors, but always choose the node in the priority queue with the lowest value of f(n) = g(n) + h(n), stopping when the priority queue is empty (the priority queue starts off only having the start node).

A* is complete - it will give us a valid path for any start/goal vertex in any graph. A* is also optimal if the heuristic is admissible and we're searching in a tree, or if the heuristic is consistent. Admissible means that the forward cost heuristic's value is always less or equal to the true forward cost. For example, h(n) = 0 is an admissable heuristic for any graph. Proof of optimality:

Let s be the start node and e be the goal node. Suppose A* gives us a non-optimal solution within a tree. Since it's a tree, we must have reached a different ending node, so there must be a non-optimal ending node e'.
Let n be some node along the path from s to e. Clearly, at some point both n and g' must be in the A* priority queue at the same time, and A* must have removed e' over n.
So f(e') \le f(n) and g(e') + h(e') \le g(n) + h(n). Let c(x, y) be the true cost from node x to y. Since the path is non-optimal, g(e') > g(e). Clearly, g(e) = c(s, n) + c(n, e) \ge g(n) + h(n) = f(n).
So f(e') > f(n), which isn't possible since f(e') \le f(n). So A* cannot have given us a non-optimal solution within a tree.

Consistent means that the heuristic is admissible regardless of the goal node - h(n) \le c(n, n') + h(n'), where c(x, y) is the true cost of getting from node x to node y - sort of analogous to the triangle inequality. Most admissible heuristics will also be consistent, but not all of them.

A* has good performance in practice, but in the worst case needs to visit every node, and keep track of all the backtracking information for every node. The real-world performance of A* depends heavily on the heuristic function, but it's generally a lot better than plain breadth-first search.

For example, if we were to use A* to solve a sliding puzzle game (a 3x3 grid of squares with one tile missing, squares must be slid into a particular pattern), one heuristic we might use is to count the number of moves to get each tile to its proper place, assuming the tiles can occupy squares regardless of whether there's already a tile there - the sum of the Manhattan distances between each node to its destination. A somewhat worse heuristic would be to count the number of tiles not in their correct places.

To design a heuristic, we generally start by removing some constraints of the problem to get a relaxed version of the problem, finding a computationally cheap solution to the relaxed version (or precompute a database of solutions for relaxed problem instances), and then use that as the heuristic. For example, for the sliding puzzle game, we started by relaxing the requirements that there only be one tile per square, and then trivially found a solution by just moving the tiles to their correct places. There's often a big tradeoff between the cost of computing the heuristic, versus how good the heuristic is - finding a good relaxation of the problem is a bit of an art.

We often want to use A* while also limiting how much memory we use. One way to do this is using Iterative Deepening A*, which is sort of like iterative deepening search, but we consider nodes by order of their f(n) value, limit the value of f(n), and progressively rerun the search with increasing f(n) limit until the goal is reached. Another way is simplified memory-bounded A*, which simply starts dropping nodes from the priority queue with the highest f(n) values (or oldest nodes, as a tiebreaker) when it's out of memory (this is still optimal/complete if distance to shallowest goal node is less or equal to memory size).


Assignment 1 due October 4.

Constraint Satisfaction

Informed search uses domain specific knowledge to make search faster. Constraint satisfaction problems (CSPs) use knowledge about the current state itself to help us find solutions. In a standard search problem the states are black boxes, with a goal test and successor function to deal with them. A constraint satisfaction problem has states defined by variables X_1, \ldots, X_n whose values are from a domain D, and the goal test is a set of constraints that tell us about what values a subset of the variables S \subseteq \set{X_1, \ldots, X_n} must take on. We can then exploit knowledge of these constraints to improve our performance.

Consider the map coloring problem: coloring a map such that no neighboring regions have the same color. Here, the constraints are that neighboring regions cannot have the same color, X_1, \ldots, X_n is the color of each region, and the domain might be D = \set{\text{red}, \text{green}, \text{blue}}. We want to find a solution, which is an assignment of values to X_1, \ldots, X_n satisfying all the constraints. Another famous constraint solving problem is 3-SAT, which tries to find boolean values that satisfy a set of constraints in a particular form.

We distinguish betwen finite constraint satisfaction problems like 3-SAT, and infinite ones like linear systems. We also care about the arity of the constraints - for example, binary constraints operate over two variables, ternary over three, and so on. There's also soft constraints, which don't have to be satisfied but are preferred if possible - this forms a constrained optimization problem.

We can use standard search algorithms to solve these binary constraint satisfaction problems. For example, for the map coloring problem, all the constraints are binary, so we can perform a search - states in the graph are partial assignments of colors, the search graph is the map's graph, the initial state is the uncolored map, the successor function colors one currently uncolored region, and the goal is that all colors are assigned and no two adjacent regions have the same color.

In general, we can solve a CSP using search: states are partial assignments of variables, initial state is that no variables are assigned, successor function assigns a value to a variable that isn't currently assigned, and the goal is that all variables are assigned and all constraints are satisfied.

CSPs are commutative - the order that we assign values to variables doesn't matter. This can be used to significantly reduce our search space - at each node in the search tree, we only have to consider the value of a single variable, not the values of variables we looked at before it.

Backtracking search is the main way to solve general CSPs: select an unassigned variable X_i, try every possible value of X_i to make sure they satisfy constraints. If there's a value v that does, set X_i = v and repeat the whole process, and if there is no such v, go back to the preceding variable X_j and keep trying more values of X_j.

Backtracking is basically the same as DFS, but there are a lot of ways to improve on it:

;wip: slides

Also, it turns out that if the digraph is a tree, then we can solve it in O(nd^2), where n is the number of nodes and d is the number of values in the domain. To do this, we use topological sort to get an ordering such that parents occur before their children.

;wip: slides

;wip: tree widths of 0 and 1 are still trees, but the larger the tree width, the less like a tree it is - the larger the width, the less tree-like it is, and worse the decomposition approach will work, also finding optimal tree decomposition is NP-hard


Local Search/Optimization

So far we've looked at uninformed and informed search. Today we'll look at local search and optimization - searching in very large problems, where we're willing to settle for a non-optimal but workable solution. Local search is used when we can't use global search/optimization techniques like A* or linear programming.

As with the other search techniques so far, we have a domain, constraints, and a cost function V(P). We're trying to find a point P in that domain that satisfies the constraints and has a low cost function value (not necessarily the lowest one, though).

Search algorithms so far have systemically explored the search graph, and the path is the solution, but in many cases we don't care about the path. For example, scheduling exams without conflicts (the thing we care about is the final exam schedule, not the path to the schedule), or chip design (routing wires around to minimize chip footprint). Local search techniques take advantage of this to make larger problems feasible.

Local search problems tend to have a wide search tree and a cost function that tells us how good a given solution is. Many of these problems are NP-hard to find optimal solutions, but we just want feasible solutions that are good enough, even if they're not optimal.

For example, for N-queens, we might try to find a solution that removes a lot of the queens from attacking positions, rather than a solution that gets all of the queens out of attacking positions. When there are few constraints vs. variables, it's easy to solve - if we just throw the queens on the board somewhere, it's probably a valid solution. When there are lots of constraints vs. variables, it's easy to see it's infeasible - if there are no non-attacking positions left to put queens in. However, when there's roughly the same number of constraints and variables, there are some solutions, but they're hard to find - searching for any solutions becomes very computationally intensive. Local search problems tend to have the "phase transitions", where most problems are really easy, but in certain cases are really difficult.

Hill Climbing

One way we commonly solve search problems is by iteratively improving a random initial state - a greedy hill climbing algorithm often also known as gradient descent. Essentially, we repeatedly move in the direction that would improve our current situation the most:

  1. Run the following many times:
    1. Choose a random initial state S with value V(S). This state isn't necessarily optimal or even feasible.
    2. Generate a moveset \delta(S) = \set{S_1, \ldots, S_k} - the set of all neighboring states.
    3. Let S' = \argmax_{S_i} V(S_i) - the best possible next state.
    4. If V(S') > V(S), set S = S' and go back to step 2 (we might also limit the number of iterations here). Otherwise, return S - we've found a maxima.
  2. Take the largest value of S from all the iterations run in step 1.

Hill climbing is easy to implement and light on memory, but is only complete or optimal when the problem space is convex - for non-convex problems, there can exist local maxima that are neither feasible nor optimal, and we might end up hitting those local maxima every iteration.

Also, there can be "plateaus" - regions where the cost function is constant, where we would get stuck. We can avoid this by allowing movement in directions that don't change our cost function (usually in the same direction as the previous step) - this lets us move through a plateau and hopefully find higher values on the other side.

Simulated Annealing

One way to improve hill climbing is to use simulated annealing. This is quite similar to hill climbing, but at each step, instead of always moving to the best neighboring state, we instead randomly choose a neighbor, and move to it with a probability based on how much better it is than the current solution, and the amount of steps we've taken:

  1. Choose a random initial state S with value V(S). This state isn't necessarily optimal or even feasible.
  2. Repeat n times:
    1. Generate a moveset \delta(S) = \set{S_1, \ldots, S_k} - the set of all neighboring states.
    2. Let S' be a random next state from \delta(S).
    3. If V(S') > V(S), let S = S'. Otherwise, let S = S' with probability \exp\left(\frac{V(S') - V(S)}{T}\right) (otherwise, do nothing). Here, T is the temperature - a function of the current iteration that starts high in the first iteration and gradually goes toward 0 as we get toward n iterations (usually exponentially by multiplying by a constant 0 < \alpha < 1 at each step).
  3. Take the value of S as a solution.

Note that in step 2.3, if the new step was better we would accept it every time, and if it was the same or worse there's a certain chance we accept it anyways. Note that the chance decreases when the next state is much worse (and increases when the next state is only a little worse), and decreases significantly as the temperature gets closer to 0.

So when T is large, we're often choosing next states that might be better or worse (exploration stage), and as T gets smaller, we're often choosing next states that are strictly better (exploitation phase). As it turns out, if we decrease T slowly enough (if we run enough iterations), we are guaranteed to find an optimal solution.

Genetic Algorithms

Genetic algorithms are sort of like an extension of simulated annealing, in the exploration stage. Genetic algorithms can be used when states have a "fitness" function f(S), it's possible to "combine" two states to form a new state c(S_1, S_2), and it's possible to mutate a state to get a slightly different state m(S). Essentially, we have a a population of initial states, and repeatedly combine high-fitness states and mutate them until we have an acceptable result:

  1. Let P = {S_1, \ldots, S_n} be a population of random initial states.
  2. Repeat until we have a state S_i with high-enough f(S_i) value:
    1. Randomly choose two distinct states S_i, S_j from P, weighted by f value.
    2. Crossover S_i and S_j to produce a child S' = c(S_i, S_j).
    3. With some small probability m, mutate S' to get m(S'). In some variations of genetic algorithms, we might mutate a randomly chosen state in P instead.
    4. Add S' to P. In some setups, we also kick out the lowest-fitness S_i if P has reached a certain size.
  3. The solution is the highest-fitness S_i in P.

States are often represented as binary strings. The crossover function c(S_1, S_2) might then be something like "generate a random bitmask m, then new_S = (S_1 & m) | (S_2 & ~m)", and the mutation function might then be something like flipping a random bit.

Softmax is nice because it handles outliers well.


We now introduce an adversary with conflicting goals. Usually this means a game. Games are well-studied in AI because they're easy to represent, aren't too easy, and are easy to evaluate performance on.

We classify games into categories like perfect vs. imperfect information, deterministic (fully controlled by players) vs. stochastic (element of change). As a search problem, a 2-player perfect information game has a state defined by the board state and whose turn it is, the successor function is the set of the results of every legal moves at a given state, the terminal/leaf states are states where players win/lose/draw, the heuristic is the player's estimated utility function. A solution to the game is a strategy that tells you how to move at every state in the game - a way to get to a goal node from every node in the tree.

This is a challenging problem because there's a malicious opponent, and we need to take that into account with our search. A max-player is trying to maximize its own utility, while a min-player, the opponent, is trying to minimize it. We're trying to find an optimal strategy - a strategy that guarantees that gives us an outcome that is as favourable as possible, given that the opponent is playing as good as possible (we won't necessarily always win). Note that this assumes an optimal opponent, so we might prefer different strategies when the other player isn't playing optimally.

In this course, we'll mostly focus on zero-sum games (games where one player winning means the other loses) with perfect information. In theory these are easy to solve given infinite time, and most of the challenge comes from making this practically fast.

Consider the centipede game: players take turns either incrementing a counter, or deciding to end the game, which lets them take all the value in the counter and win. Clearly, we want to take the counter to get the value, but we want to keep going for more turns to get more value when we decide to take it.

We can analyze this using minimax search. We represent the value of each node in a tree as \text{minimax}(s) = \begin{cases} \text{utility}(s) &\text{if } s \text{ is a terminal state} \\ \max\set{\text{minimax}(s') : s' \in \text{successors}(s)} &\text{if } s \text{ is a max-node (our turn)} \\ \min\set{\text{minimax}(s') : s' \in \text{successors}(s)} &\text{if } s \text{ is a min-node (their turn)} \end{cases} (this is also called the security value). Essentially, it represents the worst-case utility of the game starting from the current state, assuming both players are playing optimally (they're both minimaxing). Essentially, each player is minimizing the amount it can possibly lose by.

Minimax search is essentially DFS through this minimax tree. This isn't complete since it doesn't work if the game doesn't terminate, but it has pretty good space complexity.

In the centipede game, the course of action that minimizes the amount we lose by in the worst-case is to betray at the first possible turn. However, real people will generally play more than that. For chess, the branching factor was empirically determined to be around 35, and the tree depth is around 100, at least. This is impractical to search exhaustively.

For certain types of games, we can do alpha-beta pruning. Essentially, \alpha is the max-player's best (highest) security value guaranteed so far, and beta is the min-player's best (lowest) security value guaranteed so far. We then skip searching nodes where \beta \le \alpha - taking this move will make the current player worse off than before, because the min-player's guaranteed low score is no longer greater than the max-player's guaranteed high score, so a win or draw is guaranteed for the other player. In a minimax tree, this gives us the same result as an exhaustive DFS.

Alpha-beta pruning lets us eliminate branches from consideration. If we look at branches in a game tree left to right, we'd update \alpha and \beta for each node as we traverse the tree, and alpha-beta prune branches as we move from left to right. Alpha-beta pruning in this case will tend to remove branches from the right hand sides of nodes that would give poor results.

This is very helpful - minimax search on chess can look 5 moves ahead, but alpha-beta pruning gives us enough performance to look up to 10 moves ahead! This still isn't good enough - alpha-beta pruning requires us to go all the way down to terminal nodes in order to get the alpha/beta values. We can supplement this using heuristics to estimate alpha and beta. A fast evaluation function gives the true utility for a terminal state, and the estimated expected utility for any other state. There's a lot of AI research in making a good evaluation function using expert knowledge and experience - one that makes good estimates of how much payoff we'll get from a given game state.

We also don't have to go down all the way to the terminal states, and cut off the search at certain thresholds. We'll usually want to cut off the search when we reach a certain depth, or when the state becomes relatively stable (not really leading towards a win/loss), though we'll often want to give singular extensions - additional depth for moves that seem particularly promising (this helps avoid some cases where we'll miss promising options just beyond the cutoff). The parameters for the cutoff would also depend on the opponent - a novice chess player might only need 5 move lookeahead, while a grandmaster might need even 14, with exhaustive search near the endgame and a good evaluation function.

For stochastic games, we treat chance itself as a player, so there's the max-player (us), the min-player (the opponent), and the chance player (the whims of nature), where the chance player makes its move with probabilities defined by the path in the tree, in between every max-player and min-player turns. We then construct an expectiminimax tree and consider it a search problem. The expectiminimax tree is the same as the minimax tree, but with another case: \text{expectiminimax}(s) = \begin{cases} \text{utility}(s) &\text{if } s \text{ is a terminal state} \\ \max\set{\text{expectiminimax}(s') : s' \in \text{successors}(s)} &\text{if } s \text{ is a max-node (our turn)} \\ \min\set{\text{expectiminimax}(s') : s' \in \text{successors}(s)} &\text{if } s \text{ is a min-node (their turn)} \\ E(\set{\text{expectiminimax}(s') : s' \in \text{successors}(s)}) &\text{if } s \text{ is a min-node (their turn)} \end{cases}, where E(X) is the expected value.

For Go, the branching factor is about 250, and the terminal states are usually up to 150 moves deep. For this we can use techniques like Monte-Carlo tree search.


Utility Theory

A preference ordering for an agent over a set of states S is a weak ordering \succsim of all the states in S - a ranking S_1 \succsim S_2 that determines whether S_1 is at least as good as S_2. Likewise, S_1 > S_2 means that S_1 is strictly better than S_2, and S_1 \sim S_2 means that S_1 is equally preferable to S_2. By weak ordering, we mean that every state is ranked, but there may exist ties (unlike a total ordering, where there can exist ties).

For deterministic worlds, the consequences of an action is always the same, so we can know exactly what the state of the world will be after taking that action. For non-deterministic worlds, there's an element of chance for actions, so we can only know what the probability of each world state occurring will be after taking that action. We can represent a nondeterministic outcome as a lottery L = [p_1, S_1; \ldots; p_k, S_k], where p_i is the probability of S_i occurring. Essentially, this is a probability distribution over all outcomes.

Any preference ordering must obey certain properties. Given world states A, B, C, the following must all be true:

These conditions are required in order to maintain basic rationality requirements. For example, if we didn't have transitivity, then there could exist a A > B > C > A, so we could get arbitrarily-large amounts of utility by repeatedly taking actions to go from A to C to B and to A again.

A decision problem under certainty is a tuple \tup{D, S, f, \succsim}, where D is the set of possible decisions/actions to take, S is the set of possible states/outcomes, f: D \to S is a function mapping actions to their resulting world states (the outcome function), and \succsim is a preference ordering over S.

A solution to a decision problem under certainty is an action d^* \in D such that f(d^*) \succsim f(d) for any d \in D - an action that is equally or more preferable than any other possible action.

When actions don't have deterministic outcomes, we can't easily compare outcomes directly, since we're representing them as lotteries. Instead, we want to quantify how much we prefer each world-state to each other world-state, so we can determine whether one lottery is preferable to another.

We do this by introducing a utility function U: S \to R - a function that, given a state s \in S, produces a real value that measures how much we prefer s over some arbitrary reference state. Also, let's define \succsim_U to be a preference ordering where s_1 \succsim_U s_2 exactly when U(s) \ge U(t). Now for a lottery l = [p_1, s_1; \ldots; p_k, s_k], we can define U(l) = \sum p_i U(s_i) - the expected utility of a lottery is the expected value of the utility function over all the possible outcomes in the lottery.

When we have any nondeterministic outcomes (when we have lotteries as outcomes), we have a decision problem under uncertainty. We represent this as a tuple \tup{D, S, f, U}, where D and S have the same meanings as for decision problems under certainty, f: D \to \Delta(S) is a function mapping actions to lotteries over states (distributions over states), and U is a utility function.

According to expected utility theory, a solution to a decision problem under uncertainty is an action d^* \in D such that U(f(d^*)) \ge U(f(d)) for any d \in D - the action whose resulting lottery maximizes expected utility.

Note that U is invariant under positive affine transforms - d^* = \argmax_{d \in D} U(f(d)) = \argmax_{d \in D} aU(f(d)) + b for any a > 0, b. That means we can multiply U by a positive constant or add any constant to U, and the set of solutions to any decision problem using U would not change.

Decision problems are conceptually simple to define and solve. In practice, they are often hard to solve because S is so large and it is hard to compute f. Additionally, even if a single decision problem is feasible to solve, we often want to solve a sequence of decision problems to determine the best sequence of actions that we should take.

New notation: P_a(s_2 \mid s_1) is the probability of moving to state s_2 given that we are taking action a while at state s_1.

For sequences of actions, we can draw a tree containing all possible sequences of actions - the root node is the initial state connected to "chance" nodes (like in a game tree), via edges labelled by each possible action at that state. Chance nodes then connect to outcomes of those actions, via edges labelled by the probability of that outcome occuring given that action. The outcomes then connect to their own chance nodes, and those to more outcomes, and so on. Any particular sequence of actions can then be thought of as moves downward through this tree, much like playing a game moves downward through its game tree.

By thinking of all possible sequences of actions as a tree, it's clear that the expected utility of a sequence is the expected utility of the final outcome distribution - a weighted average over the utilities of all the leaf nodes (which are states) that can possibly occur by taking our sequence of actions. For example, if we're playing tic-tac-toe, a sequence of actions that leads to losing might have very little utility, while a sequence that leads to winning might have a lot.

To compute this final outcome distribution, we can recurse through the tree:

def expected_utility(state, action_sequence, probabilities_for_action_at_sequence):
    # concrete outcomes have a real utility value
    if is_real_state(state):
        return utility(state)

    # distributions of outcomes have an expected utility value
    current_action, new_action_sequence = action_sequence[0], action_sequence[1:]
    return sum(
        probability * expected_utility(new_state, new_action_sequence, probabilities_for_action_at_sequence)
        for probability, new_state in probabilities_for_action_at_sequence

For nondeterministic decision problems, we usually don't want to explicitly plan out sequences of actions in advance, because after we perform the first few actions in the sequence, we have more information than we did before, and could potentially make a better decision than we originally planned. For example, suppose we originally planned to flip a coin (action 1), and then guess that it would land heads face-down (action 2), and the utility is 1 if we get it right and 0 if we don't. Clearly, this is the best we can plan to do since we don't know the outcome of the coin in advance. However, after performing action 1, we know the new state of the world, and can look at the face-up side to guess what the face-down side is.

We can generalize this into the idea of policies - a sequence of states and actions to take if our previous actions lead us to those states. For example, here's a policy for the coin guessing decision problem above: "flip a coin (action 1); if the coin outcome is heads, then guess that tails is face-down (action 2a), if the coin outcome is tails, then guess that heads is face-down (action 2b)".

Let A be the set of actions, O be the set of outcomes. Then there are \abs{A}^k possible plans consisting of k actions, and (\abs{A}\abs{O})^k possible policies.

A decision tree for a decision problem under uncertainty is a tree consisting of alternating layers of state nodes (also known as choice nodes, because we have to make choices at each state) and chance nodes, where the root is the initial state. Each final outcome/state s has an associated utility value U(s) (the leaf nodes). Using something like Minimax search, we can then define our utility function at every node, rather than just the leaf nodes:

When implementing the above, we would usually compute expected utilities bottom-up, in a dynamic programming fashion. This allows us to compute the utility value of each node in the tree in O((\abs{A} \abs{O}))^d time where d is the depth of the tree, so it's in the same order as the number of possible policies. Note that the naive, top-down implementation that computes U(s) for each node would instead take O(n^d m^{2d}).

A policy essentially defines a decision for choice nodes in the tree. We can represent it as a mapping from states to actions. Note that because we took only the maximum-expected-utility action at each choice node as we were evaluating our utility function, the nodes under every other possible action will never be reached. Therefore, what action we assign there is more-or-less arbitrary. Two policies are implementationally indistinguishable if they assign the same actions to on all reachable choice nodes (though they may be different for unreachable ones).

Most decision problems aren't fully observable - we usually work around this by using heuristics. Additionally, representing outcome distributions often requires a huge number of states and probabilities - these are usually worked around by using decision networks or Bayes net representations, which we'll cover later on.

A decision problem is fully observable if and only if we know the initial state, and it's possible to know the outcome of every action (which state we ended up being in). For partially observable decision problems, we might only know the outcomes of some actions after taking them. If we have a partially observable decision problem, we can sometimes design the choices to move the unobservable outcomes into leaf nodes, where we won't need to observe their outcomes: for example, instead of a decision tree that decides whether to proceed as if a patient has malaria or the flu, and then decides whether to prescibe drug A or B in each case, we might have a decision tree that decides whether to prescribe drug A or B, and then decides whether to proceed as if the patient has malaria or the flu.


Markov Decision Processes


;wip: missed due to interviews


Midterm next week, includes everything up to and including today's content, Markov Decision Processes. Assignment 2 is due week after that.

We usually apply a future discount to utility values - utility gains in the future are discounted proportionally to how far in the future they are. For example, getting 10 dollars now is worth more than getting 10 dollars in a year.

;wip: missed due to interviews


;wip: missed due to interviews


Midterm tonight.

;wip: missed due to interviews


Midterm average is 68 in this section.

Chain rule: P(A_1 \land \ldots \land A_n) = P(A_1 \mid A_2 \land \ldots \land A_n) P(A_2 \land \ldots \land A_n).

Bayesian networks

Suppose we have a variable C representing whether you have a cough, F for whether you have a fever, and I for whether you have influenza. Clearly, C and F are conditionally independent of each other given that I occurs - P(C \mid F \wedge I) = P(C \mid I) and P(F \mid C \wedge I) = P(F \mid I). In other words, if we know that a person has influenza, we already can make a prediction about C and I - a cough depends on having the flu, not having a fever, and vice versa.

Since they're conditionally independent, P(C \land F \mid I) = P(C \mid I) P(F \mid I). Likewise, P(C \land F \land I) = P(C \mid F \wedge I) P(F \land I) = P(C \mid I) P(F \mid I) P(I). This is really useful, because each factor is easier to measure individually - P(C \mid I), for example, is just the fraction of flu patients that cough.

Note that the joint distribution for P(C \mid I) P(F \mid I) P(I) has 7 possible values for each possible assignment of values to C, F, I (minus one because you can compute the last value by using 1 minus the rest of the probabilities).

Note that the number of possibilities in the distribution grows exponentially with respect to the number of variables being joined. This makes it quickly infeasible to store the entire joint distribution in memory, or gather data for each of those possibilities.

Two random variables A, B are independent if knowledge of A doesn't change uncertainty in B. So P(A \mid B) = P(A) = P(B \mid A), P(A \wedge B) = P(A) P(B), or in general, P(X_1 \wedge \ldots \wedge X_n) = \prod P(X_i).

Two random variables A, B are conditionally independent given another random variable Z if and only if knowledge of A doesn't influence P(B \mid Z) and knowledge of B doesn't influence P(A \mid Z). So P(A \mid B \wedge Z) = P(A \mid Z) and P(B \mid A \wedge Z) = P(B \mid Z). In other words, conditional independence means that given Z, A and B are independent.

If we have n independent boolean variables, then we can specify their joint distribution with O(n) memory rather than O(2^n). A similar optimization can be done for conditionally independent variables, and many things in the real world are conditionally independent. Bayes networks use this to concisely represent joint distributions.

Notation concerns: P(X) is the distribution of the random variable X - a function of x that returns the probability that X = x, denoted P(x) or P(X = x). P(X \mid Y) is a family of distributions of the random variable X - a set of functions of x, one for each value of y.

If I wake up early (E), then I'll likely be unhappy (U), and if I'm unhappy, then I'll likely do poorly on an exam (X). If we have knowledge about waking up early or being unhappy, then that affects our uncertainty about doing poorly on the exam, so doing poorly is dependent on both. So the joint probability distribution is P(E \land U \land X) = P(X \mid E \land U) P(U \mid E) P(E). However, if we simply know I'm unhappy, then knowing whether I woke up early or not doesn't affect my uncertainty about doing poorly on the exam, because it doesn't add any new information. So P(E \land U \land X) = P(X \mid U) P(U \mid E) P(E). Basically, we managed to change the exponential number of table entries into linear!

This can be generalized to any number of chained dependent variables - knowing one part of the chain (such as being unhappy) means that the later parts (such as doing poorly on an exam) no longer depends on previous parts (waking up early). So P(x) = \sum P(g \mid u_i) P(u_i) = \sum_i P(g \mid u_i) \sum_j P(u_i \mid e_j) P(e_j).

A Bayesian/belief/causal/probabilistic network is a graphical representation of direct dependencies between random variables. A Bayesian network over random variables X_1, \ldots, X_n is a directed acyclic graph where nodes are variables and an arc X_i \to X_j exists whenever X_j directly depends on X_i (representing conditional probability distributions).

As usual we have the set of parent nodes \operatorname{Par}(X_i) (direct parents, not all ancestors), children, descendents, and ancestors for a node in a Bayesian network. We also have the family of a node, which is just the node plus its parents \set{X_i} \cup \operatorname{Par}(X_i).

The Bayesian network lets us easily find the joint distribution of any set of random variables. This works because given a node

;wip: all the way to constructing a BN

Bayesian networks are not unique for their variables - the order that we choose variables in while constructing it is very important, and some orders are better than others. Usually Bayesian networks are simpler when the variables are chosen by following causal intuitions - we want variables representing causes to precede variables representing effects.

;wip: all the way to but not including blocking








Multi-agent Systems

So far, there has only been a single agent in our environments. Not only are we reasoning about our own agents, we're reasoning about how other agents are reasoning. Our goal is to model how agents behave in this setting, and design systems so they'll behave in a certain way.

We assume that all agents are self-interested - acting according to their own goals and on their own world states.

Game theory problems deal with analyzing how groups of rational, strategic agents interact with each other (a special case is the decision problem, where there is only one agent). Rational means that agents choose the best possible action they know of, strategic means that they take their actions into account in the wider context of the game, and interaction means that the actions of one agent will affect other agents.

A strategic/matrix-form/normal-form game is defned as a set of agents I = \set{1, \ldots, N} with utility functions u_1, \ldots, u_n, a set of possible actions for each player i A_i = \set{a_i^1, \ldots, a_i^m}, an outcome profile a = (a_1^{j_1}, \ldots, a_N^{j_N}) (a tuple of actions that ended up being chosen by the agents, representing a game state).

We can often represent these sorts of games as a payoff matrix. Here's an example for playing Chicken:

                           Agent 2
                  | Turn Away | Continue  |
Agent 1 Turn Away | 10, 10    | 0, 20     |
         Continue | 20, 0     | 5, 5      |

This is a variation on the prisoner's dilemma, where two people are driving toward each other. If they both turn away, they don't die and built rapport with each other. If one turns away, they are embarassed while the other looks brave. If they both continue, they both die but look brave. Agent 1 is usually called the row player, while Agent 2 is usually called the column player.

When written in this form, a zero-sum game is simply a game where \sum_{o \in \text{outcomes}} U(o) = 0.

Let a_{-i} represent the actions of every player other than i. Let p_i(a_{-1}) represent what agent i thinks all the other players will do. Then an agent i is rational if it chooses actions according to \argmax_{a_i} u_i(a_i, p(a_{-i})) p(a_{-i}).

;wip: apply belief about others' actions to prisoner's dilemma.

An action strictly dominates another action if it gives more utility regardless of what other agents do - if u_i(a_i', a_{-i}) > u_i(a_i, a_{-i}) for all a_{-i}, then a_i' strictly dominates a_i. If an action is strictly dominated by another action, a rational agent will never play it. However, the converse isn't necessarily true, so this doesn't give us the full picture about what rational agents will do.

A Nash equilibrium is an action profile where no agent would want to change their action given that no other agents change their action. In other words, if we asked them after the fact whether they would have picked a different decision, every agent would say no. Formally, for every agent i and a current action profile a^*, for any possible action a_i', u_i(a_i^*, a_{-i}^*) \ge u_i(a_i', a_{-i}^*).

A Nash equilibrium is essentially an outcome where no one player can take advantage of any others, even in hindsight.

For randomized games, we can't use the pure Nash equilibria, because we don't know what the outcome might be. Instead, we can analyze the mixed Nash equilibria, which is the same thing but with expected utilities instead of the known utilities, and strategies (probability distributions over actions) instead of actions.

For the prisoner's dilemma, there's only one Nash equilibrium at defect-defect, and interestingly, it turns out not to be the globally optimal outcome. At any other outcome, at least one of the players would rather have defected than cooperated.

Nash's theorem: every game with finite-sized action sets, has at least one mixed-strategy Nash equilibrium. The proof is non-constructive, however, and finding Nash equilibria remains very hard. 2-player zero-sum games can be solved in polynomial time with linear programming, while for arbitrary problems it's even harder than NP-hard, in PPAD.

Prisoner's dilemma changes once agents play against each other multiple rounds, because each agent gives information to the other agent through their history of cooperating/defecting. With history, the strategy becomes a lot more complex, because agents can reward/punish past behaviour, and consider things like reputation and trust.

Some common repeated prisoner's dilemma strategies are "grim", where an agent cooperates until the first defect, and defects forever afterward, and "tit-for-tat", where an agent starts off coorperating, and copies the opponent's action in the previous round.

While normal-form games have agents choose actions simultaneously, an extensive form game allows the order of actions to affect the outcome, such as games where players take turns. Minimax is a strategy designed for extensive-form games.

Extensive form games still have the set of agents, action sets, and utility functions from normal form games, but we have terminal and choice nodes instead of action profiles. Just like we looked at with minimax, extensive form games can be represented with a tree.

Every extensive form game can be transformed into a normal form game by having each agent simultaneously commit to playing certain policies. Those commitments can then be thought of as the agent actions, and the utility of the final game state can be thought of as the actions' utility values.

An interesting case of Nash equilibria behaving unusally is mutually assured destruction. Consider two nuclear-capable countries A and B. Missiles are on the way from A to B, and B has to decide whether to retaliate.

Subgame perfect equilibria for extensive form games are ;wip: MAD example

;wip: centipede game, nobody actually plays like

;wip: ultimatum game: I must split $10 with you, you say yes/no, if no then nobody gets the money. optimal is to always say yes, but make them think you'll say no

;wip: cultural aspects- wall street bankers always defect, rural fisherman always cooperate


Mechanism Design

Mechanism design is the design of systems that make rational agents behave in a certain way. Instead of talking about agents given a game, we're talking about games given an agent - sort of like a reverse game theory.

We can represent mechanisms with something very similar to games - a set of agents I = \set{1, \ldots, N} with utility functions u_1, \ldots, u_n, a set of possible actions for each player i A_i = \set{a_i^1, \ldots, a_i^m}, an ;wip: update from slides? a = (a_1^{j_1}, \ldots, a_N^{j_N}) (a tuple of actions that ended up being chosen by the agents, representing a game state). Additionally, each agent has a type \theta_1, \ldots, \theta_N that represents all of the agent's private information that might affect their utility function, and there's a social choice function f: \Theta_1 \times \ldots \times \Theta_N \to O that reads the agents' minds and returns an outcome. ;wip: rather than action profile, we have outcomes

Social choice functions might include voting one candidate from a group of candidates, choosing which resources to assign to which agents, or whether to buy something as a group. This might depend on the types of all the agents, so the social choice function's value is unknown to any individual agent. We really want the social choice function, but it's not available within a game.

Since agents can't actually read each others' minds, we have to go through a trusted institution. A mechanism is a formal representation of this. A mechanism M is an assignment of strategies to each agent S_1, \ldots, S_N, and an outcome function g ;wip: outcome function

Each agent submits a report to the mechanism, and the mechanism aggregates all those reports into an outcome using the outcome function. For example, for voting we might have ;wip.

A mechanism implements a social choice function f(\theta) if there exists a Nash equilibrium s^* = \tup{s_1^*(\theta_1), \ldots, s_N^*(\theta_N)}. ;wip: rest of slide

Even better is the direct mechanism, where S_i = \Theta_i for every agent i. If this is the case, then g(\theta) = f(\theta) for all \theta \in \Theta_1 \times \ldots \times \Theta_N. In other words, if the strategy space is the same as the ;wip

A direct mechanism is incentive compatible if and only if there's an equilibrium s^* where s_i^*(\theta) ;wip. For the voting example, it's incentive compatible whenever it's in every agent's best interest to just submit their voting preference (their type information) directly than .

A direct mechanism is strategy proof if it is incentive compatible ;wip

;wip: revelation principle slide: if it's implementable in dominant strategies, then it's directly

Gibbard-Satterwaite theorem - if : ;wip

So we either we allow the voting system to not be strategy-proof, or we make the voting system dictatorial - there would exist an agent whose top-rated outcome becomes the outcome of the voting system. To get around this in practical situations, we can use a weaker definition of an equilibrium - we can design mechanisms where it can be NP-hard to figure out how to manipulate the voting system.

Alternatively, if the agent preferences follow certain patterns, we can still guarantee strategy-proofing. ;wip: single peaked and quasilinear preferences (quasilinear means that we introduce money, money is worth the same to everyone, and we can transfer utility between people just by sending money around).

;wip: groves mechanisms for quasilinear mechanisms, transferring money so everyone is happy but balancing that out with h_i(\theta_{-i}) such that the game still works. If we just set h_i to 0, then we just equalized everyone's utility, as if we split all the utility equally. ;wip: is that true? groves mechanisms are efficient and strategy proof under the quasilinear assumption

;wip: vickrey-clarke-groves are groves mechanisms that defines a good h_i function. marginal contribution - how much happier/sadder did you make everyone else by participating in the game vs. not participating in the game? that is how much you are paid/getting paid (e.g., I made everyone at an auction sadder by winning and walking away with the prize).

;wip: Vickrey auction: we pay the lowest amount we could've won with, by paying the second highest bidder's bid. this eliminates overbiddin regret - "I wish I bid a little lower, because I would have still won anyways" - because if they paid any less than the second highest bidder's bid, they would've lost ;wip: show that this is a vickrey-clarke-groves mechanism according to the formula



Machine Learning

So far we've looked at classical approaches for AI - top-down, mathematically-backed approaches to problems like playing games. Real-world problems, however, are not as appropriate for these approaches because they're ill-defined/messy, and even when we can define them, there are often so many details that designing things top-down becomes impractical for humans to do. For example, just fully describing the problem that self-driving cars are solving would be impractical for any human to do - there's so many possible things that can happen that we can't list them all.

Machine learning is a closely related field to AI that deals with more unstructured, bottom-up approaches to solving these problems. For example, supervised learning (induction from pre-labelled data, like classifier algorithms), unsupervised learning (learning patterns in input, like clustering algorithms), and reinforcement learning (learning from feedback that comes later, like TD-learning and Q-learning). Others include semi-supervised learning, active learning (figure out what questions to ask the user), transfer learning (transferring knowledge into other models, such as higher-performance ones or in other domains), apprenticeship/inverse-reinforcement learning (learning from an existing expert, such as learning how a pilot flies a helicopter).

One big problem in ML is representation - what are we trying to learn? How do we represent the model's knowledge? Some common forms are linear polynomials, propositional logic, first-order logic, bayes nets, etc.

Supervised machine learning

Given a training set of examples \set{\tup{x_1, f(x_1)}, \ldots}, a supervised learning algorithm will return a function h that approximates f well, even on unseen examples, known as a hypothesis. The hypothesis space is the set of all possible hypotheses that can be returned by the algorithm (e.g., all quadratic functions for a quadratic regression algorithm). This is a hard problem because good hypotheses are rare in most hypothesis spaces.

Two functions are consistent with a training set if they give the same value for points in the training set. There are often many possible hypotheses for a given training set, but we want one that will perform well on unseen samples. To approximate this, we use Ockham's razor - we want to

A realizable problem is a supervised learning problem where f is inside the hypothesis space. We generally want our problems to be realizable because that makes it possible to get the actual answer, so we would want to expand our hypothesis space. However, we don't want to expand it too much because it makes hypothesis more complex and the algorithms less effective - it's a delicate balance (e.g., if we made the hypothesis space all Turing machines, it would be very difficult to learn anything about any input).

A loss function L(y, h(\vec x)) is a function that represents how off a hypothesis is from a true value, where y is the true value and h(\vec x) is the predicted value.

Linear threshold classifiers find a hyperplane h(\vec x) = \vec w \cdot \vec x \ge 0 to approximate an unknown function f(\vec x) \in \set{0, 1}. Our loss function is usually the sum of squared errors in the training set. One way to learn a linear threshold classifier is to use gradient descent along the space of all hyperplanes.

Decision trees classify points by consecutively testing them with some fixed test at each node in the tree, and taking a corresponding branch in the tree. The leaf node that the point ends up at is the output class. ;wip: training decision trees

The entropy of a random variable V is the information content of that random variable, defined as I(V) = -\sum P(V = v_i) \log_2(P(V = v_i)) (the proof for this is very much out of the scope of this course). The unit of entropy is bits - a coin flip has 1 bit of entropy, while a random ASCII character has 7 bits of entropy.

Uncertainty matters a lot depending on the problem. For example, if there's a 50% chance that A is true, and a 90% chance that B is true, it's worth more to know the answer to A than B, because it eliminates more uncertainty. In this case, A has 1 bit of entropy, and B has about 0.47, by the formula above.


For a decision tree, at each node we want to test on an attribute that gives us the most possible information gain. Ideally, the attribute would split all of the values into two roughly equal-sized sets, one of which is all in the positive class, and the other all in the negative class - this is the outcome that maximizes the entropy of making the decision.

Aside: a decision stump is a decision tree with only one node.

;wip: information gain slide, about choosing an attribute by measuring how much information testing that attribute could give us (information gain) - this is determined by computing the remaining entropy at each step and subtracting it from the previous remaining entropy

The entropy at a given node is I(\frac{p}{p + n}, \frac{n}{p + n}) = -\frac{p}{p + n} \log_2 \frac{p}{p + n} - \frac{n}{p + n} \log_2 \frac{n}{p + n}. This has a maximum at n = p with value 1 - there is 1 bit of entropy if the attribute test splits the values perfectly into two classes that are all in the positive class and all in the negative class, respectively.

;wip: how to compute information gain?

Algorithm C4.5: when we introduce a node, we want to ask the question that will reduce the remaining entropy the most. We usually combine this with a greedy algorithm (repeatedly grow the tree wherever we can reduce remaining entropy the most, as long as we still have examples or attribute tests to try), which works pretty well in practice. The resulting decision tree will work for the data, but can often be different from the true, underlying decision tree that generated the data (generally, a simpler tree or a slightly more complex tree than the original). The output of this algorithm is a decision tree (the decision tree is a hypothesis).

We usually use the training-set/testing-set split to evaluate our hypothesis. The learning algorithm is good if the classifiers it generates will give good results on the unseen data in the testing set(data that the classifier has never had access to). Note that the fact that it needs to be unseen means that we must use a new testing set every time we're evaluating our hypotheses.

If a hypothesis performs well on the training set but poorly on the testing set, this usually means we were overfitting - making out hypothesis too dependent on extraneous details that only coincidentally happen to correlate (for classifiers, these are often spurious correlations). Formally, a hypothesis overfits a training set if there exists another hypothesis that does worse on the training points, but better on the entire distribution of points.

If we train decision trees naively, overfitting causes a 10%-25% loss in accuracy. How do we prevent decision trees from making these spurious correlations? One technique is to use pruning. We introduce the null hypothesis "there is no pattern in this data" (;wip: formulas from avoiding overfitting slide), and then compute the probability that a sample of size p + n would have the result. We can then do a chi-squared test to figure out whether the null hypothesis is plausible. If the null hypothesis is plausible, we can then prune the corresponding decision tree node.

When a testing set isn't available, or we can't afford enough testing sets to test hypotheses as much as we need, we can use techniques like cross-validation, where we train on parts of the training set, and test on the remaining parts. The most common cross-validation technique is K-fold cross-validation, where we split the training set into K sets, then run K experiments where each set is the validation set and the rest of the sets are combined to get the training set. Leave-one-out cross-validation is just n-fold cross-validation where n is the training set size.

So far we looked at algorithms that choose a single hypothesis from the hypothesis space. Ensemble methods choose a lot of hypotheses and then combine their predictions to make the final prediction. The idea is that individual hypotheses might be wrong, but they might have partial information that would make the majority less likely to be wrong - ensemble methods can help improve model performance. Additionally, ensembles can be used to expand the hypothesis space itself - for example, an ensemble of linear classifiers where positive predictions are the AND of all the linear classifiers can have arbitrary polyhedrons as their hypothesis, rather than a hyperplane.

One way to combine predictions in ensemble methods is bagging - all the hypotheses take a majority vote and the winner is the final prediction. If we pretend that each of n hypothesis is wrong with probability p independently (even though this doesn't really happen in real life), we get P(k \text{ hypotheses are wrong}) = {n \choose k} p^k (1 - p)^{n - k}. We then sum up the wrong cases to get P(\text{ensemble is wrong}) = \sum_{k = \ceil{n / 2}}^n P(k \text{ hypotheses are wrong}) = \sum_{k = \ceil{n / 2}}^n {n \choose k} p^k (1 - p)^{n - k}.

One way to improve this is to weight hypotheses differently, for example by reducing the weight of correlated hypotheses (because their outputs would be overrepresented in the output) and increase the weights of high-performing hypotheses. This is a technique called boosting - we weight each hypothesis, and weight each point in the training set. Whenever we test a single hypothesis on a single training set point, we increase the weight of that training set point and decrease the weight of that hypothesis if it was misclassified, and decrease the weight of the training set point and increase the weight of that hypothesis otherwise. The other hypotheses will then have to do better on those misclassified ones in order to perform well.

;wip: adaboost algorithm

AdaBoost is very powerful and often used in industry. It takes in a weak learning algorithms and returns a hypothesis that classifies with accuracy 100% as the size of the ensemble increases.

Boosting means we don't need to learn perfect hypotheses. It's also easy to implement, generalizes well to the testing set (reduces overfitting), and is widely applicable to any weak learning algorithm. This is used everywhere from sensor fusion to radio decoders.


Machine learning is essentially about approximating real-world functions, such as the temperature of a region, the next frame of a video, and so on. Real-world functions can be extremely complex, so we need to design our machine learning methods to handle arbitrary relationships, run efficiently, and avoid overfitting.

Neurons in the brain are cells that act like wires, memory, and processors, and the brain is a densely connected network of neurons. Neurons receive chemical signals from their soma, perform rudimentary functions on those incoming signals in the soma/body (such as "10 or more signals received in last second"), and then might fire a signal down its axon, where it is broadcast chemically via synapses. The function that occurs in the soma isn't very well understood, but the rest of the process is, more or less. Fundamentally neurons are really simple, and can approximate arbitrarily complex functions if we use enough of them.

Biological neurons are either firing or not firing - the information content is encoded in the frequency and timing of the firings. When the brain learns, the connection between synapses and other neurons' dendrites are chemically altered to be more or less efficient. Artificial neural networks are generally based on really early, simplified models, though biologists will often use more biologically accurate neural networks based on models from theoretical neuroscience.

A typical artificial neural network models each neural as follows: each neuron has a weight vector w, and outputs a value g(\vec w \cdot \vec x), where \vec x is the vector of values from some other neurons and g(v) is the activation function (e.g., \tanh(v) for hyperbolic tangent, \max(0, v) for ReLU, \frac{1}{1 + \exp(v)} for sigmoid, or \begin{cases} 1 &\text{if} v \ge b \\ 0 &\text{otherwise}\end{cases} for threshold where b is a constant).

Suppose we wanted the logical AND of two Boolean inputs x_1, x_2 using a single neuron with threshold activation. In other words, the neuron should fire if and only if x_1 = x_2 = 1. We can write our goal as a system of inequalities w_1 + w_2 + b \ge 0, w_1 + b < 0, w_2 + b < 0, b < 0, one equation per truth table entry.

A feedforward network is a directed acyclic graph of neurons. A recurrent network is a directed cyclic graph of neurons. Feedforward networks are fully described by their structure and weights, while recurrent networks additionally have their previous neuron outputs as state - this makes them harder to train but capable of remembering past inputs.

Input units are neurons that get their inputs directly from the network inputs. Hidden units are neurons that get their inputs entirely from other neurons, and give their output entirely to other neurons. Output units are neurons that give their output directly to the network outputs.

A perceptron is a single-layer feedforward network, where all neurons get their inputs directly from the network inputs and their outputs are given directly to the network outputs, and all neurons use the threshold activation function. The error/cost/loss of a perceptron is E = \frac 1 2 \sum (y_i - g(\vec w \cdot \vec x_i))^2, where \vec x is the network inputs and g(v) is the network outputs (the outputs of the activation function from each neuron). We use squared loss because we'd rather have lots of small errors than some large errors.

We train neural networks using local search techniques. The most commonly used one is gradient descent - we take our loss function E, take its derivative with respect to the weights \Delta E = \frac{\dee E}{\dee \vec w}, and then update those weights according to \vec w \leftarrow \vec w - \alpha \Delta E, where \alpha is the learning rate.

For example, for the perceptron we have \Delta E = \frac 1 2 \frac{\dee}{\dee \vec w} \sum (y_i - g(\vec x_i))^2 = \sum (y_i - g(\vec x_i))(-\frac{\dee g}{\dee v} (\vec w \cdot \vec x)) \vec x.


;wip: catch up on this class, the one before statistical learning


Learning probabilistic models

Recall that agents model uncertainty in the world and the utility gained from different action plans, and that Bayes nets are one way to represent agent knowledge.

Bayes nets grow exponentially. In the past, some Bayes nets were painstakingly constructed by hand, such as the Pathfinder network for diagnosing lymph node diseases. The main bottleneck here is acquiring knowledge from experts, which is very expensive.

In contrast, data is often a lot cheaper. What if we build models (such as Bayes nets) directly from data?

Suppose we have a bag of candy with an unknown ratio of two different flavours (the ratio is known to be either 25%, 50%, or 75%, however). After eating k candies, can we estimate the flavour ratio? Can be predict the next candy's flavour?

With Bayesian learning, we then have three hypotheses: H_1 "the flavour ratio is 25%", H_2 "the flavour ratio is 50%", and H_3 "the flavour ratio is 75%". When we take a piece of candy out, we get evidence d, telling us a particular piece of candy has a particular flavour.

We also have our prior probabilities P(H_1), P(H_2), P(H_3) for each of the hypotheses being true, any and all previous knowledge about the probability of each hypothesis being true. We also have the likelihood of seeing a particular piece of evidence d given any hypothesis, P(d \mid H_1), P(d \mid H_2), P(d \mid H_3).

Suppose we took 10 candies out, and they were all the first flavour. If we assume the candy distributions are independent and identically distributed (this is roughly accurate for a large bag of candy), we can say that the likelihoods are P(d \mid H_1) = 0.25^{10}, P(d \mid H_2) = 0.5^{10}, P(d \mid H_3) = 0.75^{10}. We then update our priors that each probability is correct using P(H_1 \mid d) = P(d \mid H_1) ;wip.

Predictions are essentially weighted averages over all of the hypotheses, where the weight is how likely we think the hypothesis is to be true. The predicted hypothesis at any point is the one that has the highest P(H \mid d) value so far. It's mathematically guaranteed to be the best guess we can make given the evidence available (it'll be correct more often than any other predictor). Not only that, but we can also perform regularization by adjusting priors - simply reduce the prior probability of each hypothesis by its complexity.

This is usually intractable if there are a lot of hypotheses. We can approximate this using MAP (maximum a posteriori estimation): instead of predicting based on the prediction of every hypothesis, we only predict based on the most probable hypothesis. In other words, h_{MAP} = \argmax_{H_i} P(H_i \mid d) (the hypothesis that maximizes the probability that the evidence occurs), and then only updates/predicts based on that most likely hypothesis. MAP prediction is less accurate than Bayesian, but it's guaranteed to converge to the same answer as Bayesian learning given enough evidence. Additionally, we can still do regularization by reducing prior probabilities according to their hypothesis complexity.

However, MAP may still be intractable if the \argmax can't be efficiently computed. ;wip: rest of MAP computation slide, and linearizing the values and then solving it using linear programming on the log likelihood function \ln P(H) + \sum_i \ln(P(d_i \mid H)).

If we don't have any priors, we still need to assume something to start with. In this case, we usually use a uniform prior, as if every hypothesis is equally probable. If we use a uniform prior, then the only thing that matters is the evidence. In this case, this is called maximum likelihood learning - doing \argmax_{H_i} P(d \mid H_i) rather than \argmax_{H_i} P(H_i) P(d \mid H_i), because every P(H_i) is equal. This tends to overfit, since there's no priors to penalize complex hypotheses. Maximum likelihood learning is more tractable than MAP learning, because we don't need to keep track of priors anymore.

Maximum likelihood learning, and to a lesser extent MAP learning, tend to jump to conclusions quickly, because they take an idea and tend to run with it until it fails enough. However, both will eventually converge to the same answer as Bayesian learning.

For the candy example but allowing arbitrary ratios (rather than just the 3 ratios presented), the maximum likelihood learning gives the maximum likelihood estimator that says "the true ratio of candies is most likely to be the actual ratio we've observed". We can compute this by constructing the likelihood function, then finding the value that maximizes this likelihood function.

These algorithms can be extended to arbitrary Bayes nets. One issue to be aware of is that if we don't see an instance of a particular class, we set that class's probability to 0, which causes a lot of problems for Bayes nets (zero probabilities eliminate entire trees in Bayes nets). We can get around this with Laplace smoothing - start counting from a small number instead of from 0. For example, if we're predicting English word frequencies in a text document, we're unlikely to see all English words, so we would predict 0 probability for any unseen words. With Laplace smoothing, we instead pretend as if we've seen every English word once, and then count starting from that.

A naive Bayes model is a special case of a Bayes net where we assume that every attribute is independently distributed. This is often not a valid assumption, but it works pretty well in practice. Naive Bayes nets require much fewer probabilities because all the conditional probabilities are equal to their corresponbding unconditional probabilities, of which there are much fewer.


Naive Bayes scales well (i.e., linearly) as the number of features increases, and performs really well in practice, often even in situations where the independence assumption doesn't hold very well. We often use it for things like text classification, where high performance is needed.

Common use of Naive Bayes include spam detection, plaigerism detection, and search engines. The information we're given is a set of documents, and the goal is to classify those documents into two classes (e.g., spam vs. not spam, plaigerized vs. not plaigerized, relevant vs. not relevant). In these situations it's called a bag-of-words model: for each document, we just stem words, remove stopwords, and perform a word count.

We then assume words are independent fo each other - order of words doesn't matter, co-occurrences of words, don't matter (bad assumption, but works decently in practice). Under this assumption, for a document d_i and a class y, the probability of d_i being in y, P(y \mid d_i), is the product of all the probabilities P(w_j \mid y) = \frac{\text{number of documents with class } y \text{ that contain the word } w_j}{\text{number of documents with class } y} for each word w_j in d_i.

To avoid zero probabilities, we apply Laplace smoothing by initializing word counts to 1 instead of 0.

Expectation Maximization

So far we've had access to all the attributes in our data. In real life this is rarely the case. Consider a Bayes net for diagnosing diseases. The Bayes net can see symptons and suggest treatments, but it wouldn't be able to access the underlying diseases.

If we ignore hidden variables, the Bayes net would have to directly associate symptons with different treatments, which massively increases the size of the Bayes net and the number of probabilities we need. If we ignore problem instances with missing values, we might end up without any values (and probably a lot of bias as well). In real life, we need to be able to consider hidden variables.

In direct ML on Bayes nets, we find the maximum likelihood estimate h_{ML} = \argmax_h P(E \mid h) = \argmax_h \prod_i \mathrm{CPT}(V_i) of the hidden variables given the evidence E. Usually, we would solve this by introducing a log, so h_{ML} = \argmax_h \sum_i \ln \mathrm{CPT}(V_i) - the linearized version can be maximized more easily.

When we have hidden variables Z, we instead have h_{ML} = \sum_Z \argmax_h P(E, Z \mid h). We can't use the log trick here anymore, because of the sum. Instead, we use expectation maximization (EM): guess the values of the hidden variables (using the expected value), and then compute the new maximum likelihood hypothesis based on those guesses. We then repeat the process until we converge to an approximation of the true maximum likelihood hypothesis:

  1. Make a guess for h_{ML}, the maximum likelihood hypothesis.
  2. Repeat until convergence:
    1. Based on h_{ML}, compute the expected value of any hidden/missing variables.
    2. Based on the expected value of those hidden/missing variables, compute a new h_{ML}.

Formally, we can write this as h_{i + 1} = \argmax_h \sum_{Z} P(Z \mid h_i, e) \ln P(e, Z \mid h_i). In practice, we would be computing the expectation and the maximization steps separately, even though this formula shows them occuring together. ;wip: check this formula

Expectation maximization is nice because we can linearize P(e, Z \mid h) (which is the product of bunch of variables) by moving the \ln inside the product and turning it into a summation - this makes it a lot easier to compute.

Expectation maximization guarantees that P(e \mid h_{i + 1}) \ge P(e \mid h_i). ;wip: examples from slides


So far we've looked at what we can do with AI. Today we look at what we should do with AI.

Tips for final exam:

Creative Commons License This work by Anthony Zhang is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Copyright 2013-2017 Anthony Zhang.