Introduction to Artificial Intelligence.

```
Alan Tsang
Section 001
Email: akhtsang@uwaterloo.ca
Website: https://cs.uwaterloo.ca/~klarson/teaching/F17-486/, https://cs.uwaterloo.ca/~akhtsang/cs486.html
Office Hours: Tuesdays 1:15pm to 2:45pm in DC2306B, accessible via the AI Lab
Mondays/Wednesdays 4:00pm-5:20pm
```

Info on course website, assignments on LEARN, questions on Piazza. 5 assignments worth 40% total, 15% midterm on October 18, 2017, 45% final exam, optional project worth 5% bonus marks. Assignments can be submitted up to 48 hours after deadline.

There's no real agreed-upon definition for AI, but in this course we'll consider it to be the mathematical study of intelligent action in a complex environment. We usually think about 4 different kinds of AI: systems that think like humans, act like humans, think rationally, or act rationally. The definitions usually differ along the dimensions of thinking vs. acting and human behaviour vs. idealized rational behaviour. In other words, we're trying to duplicate what the human brain does (passing the turing test), and what the human brain should do (acting rationally).

Overview of AI in popular perception: threats of superintelligence, automation displacing jobs, biases encoded in AI models, self-driving cars. Overview of the turing test, including issues like not being amenable to mathematical analysis, whether it's more important to understand underlying principles than mimicking true intelligence, and whether it's even a good test (we don't make planes by trying to make something so bird-like that it fools a bird). Overview of history of AI, like Turing's work, the Dartmouth meeting, AI winter, and modern computational AI.

Classically, AI was seen to mean general problem solving, like solving chess. More recently, we focus more on specific problems, such as perception, robotics, and deliberative reasoning (thinking about what to do based on available data), with a greater focus on statistical methods, decision theories, and probability.

In this course we'll talk about classic AI problems like search, as well as knowledge representation/reasoning, making decisions under uncertainty, and learning.

An **agent** is an entity that perceives and acts. We usually evaluate agents based on things like how well it achieves goals and how many resources it consumes. Agents exist within an environment, with actuators, sensors, and environmental constraints. Environments can be fully/partially observable (can we see the entire environment at once?), deterministic/stochastic (does the same action always result in the same result?), episodic/dynamic (do previous steps affect later steps? in classification no, since previous predictions are independent, but in chess yes, since previous moves are very important), discrete/continuous (chess AI could be considered discrete, whereas controlling a robot arm might be considered continuous), and single/multi agent (is the agent the only one in its environment?).

Search is one of the first topics in AI, used in everything from theorem proving to robotic navigation. Search problems consist of a **state space** (the world we're searching within), a **successor function** (given a state, determines possible actions/next states and the cost of getting to those states), a **start state** (the state at the start of the search), and a **goal test** (determines whether we've reached the goal, and possibly also information about how close we are to the goal). A **path** is a sequence of states and operations. A solution is a path from the start state to a state that satisfies the goal test. An optimal solution is a solution of minimal cost.

Consider a pac-man AI. the state space would be the game map, the successor function would define actions like, moving left/right/up/down and how expensive those would be, the start state would be the initial game state, and the goal test would be whether we've gotten to the end of the level.

Overview of N-queens problem and why it's a search problem, as well as various other types of search problems. For our purposes, search problems won't have elements of chance, continuous states, adversaries, or partial observation - all of our problems will be deterministic, discrete, single-agent, and fully observable.

The world state includes every detail of the environment at a certain point in the search process. For something like pac-man, we might care about pac-man's position, the dots, and the ghosts, but pathfinding might only have current position in its state.

Search problems can be thought of as minimum-cost pathfinding on a graph of world states, so we can apply algorithms like breadth-first search and A*. The search tree is the tree made of possible paths within the graph of world states. The factors we care about in search algorithms are things like completeness (will it always find a solution?), optimality (does it always find the minimum cost solution?), time complexity, and space complexity. While DFS is light on memory and goes deep quickly (O(b^h) time and O(h) space, where b is arity of tree and h is tree height), cycles in the graph can result in infinite loops, so it's not complete. BFS is complete and can find shallower solutions quickly, but needs a lot of memory to store visited states (O(b^h) space and time where b is arity of tree and h is tree height).

Instead, we can use **iterative deepening search** (IDA*), which is essentially a mix of both - do DFS with a small depth limit, then repeatedly increase the limit (usually linearly) and rerun DFS until we reach the goal. This is both memory-efficient and complete, since the depth limit prevents infinite loops. This sounds wasteful, but it isn't too bad in practice since the number of nodes we visit in each iteration tends to grow exponentially, so most nodes we visit will be in the later iterations.

Note that all three of these algorithms we've looked at so far are the same besides the order we visit nodes in - they're generic and aren't specialized for specific problems. We can do better in specific problems by taking advantage of their inherent structure.

Assignment 1 will be out this week.

Uninformed search is search that doesn't depend on any knowledge about the problem, like BFS or DFS, or iterative deepening. It's widely suitable, but performs poorly and is computationally expensive. We would like to do informed search instead, where we can take advantage of knowledge we have about the problem to make our searches more efficient.

Usually when searching a graph, we often have knowledge about the **merit** of a node - some approximation of how good the node is. There are often different ways to defined merit for any given problem, such as how much it would cost to use the node in our solution, or how computationally expensive it would be to use it. For example, uninformed search expands nodes based on the distance from the start node, but we might augment this by using a heuristic like Euclidean distance to the goal, if we're searching in a Euclidean space.

A **heuristic** estimates the distance from any given node to the goal node, such as Euclidean or Manhattan distance. In our search, if node A has a lower heuristic than node B, we might choose node A over B (and also factor in things like how far from the start nodes A or B are).

Best-first search is the most naive informed search. At each step, we expand exactly the node with the lowest heuristic function value - the node that seems closest to the goal function. This is not a complete search, since it might get stuck in a minimim-heuristic cycle of nodes (we don't keep track of which nodes we've already visited). Additionally, this is not optimal, since the resulting path isn't always the shortest one - the length of the path taken so far at any given node isn't taken into account.

While best-first search looked at the estimated forward cost (expand node with lowest estimated cost from current node to goal), breadth-first search looked at the backward cost (expand node with lowest cost from start to current node). We want to combine these two to make breadth-first search more efficient, and this gives us **A* search**. Essentially, let g(n) be the backward cost for the node n and h(n) be an estimate of the forward cost for the node n. We then perform essentially breadth-first search, repeatedly queueing up each visited nodes' neighbors, but always choose the node in the priority queue with the lowest value of f(n) = g(n) + h(n), stopping when the priority queue is empty (the priority queue starts off only having the start node).

A* is complete - it will give us a valid path for any start/goal vertex in any graph. A* is also optimal if the heuristic is **admissible and we're searching in a tree**, or if the heuristic is **consistent**. Admissible means that the forward cost heuristic's value is always less or equal to the true forward cost. For example, h(n) = 0 is an admissable heuristic for any graph. Proof of optimality:

Let s be the start node and e be the goal node. Suppose A* gives us a non-optimal solution within a tree. Since it's a tree, we must have reached a different ending node, so there must be a non-optimal ending node e'.

Let n be some node along the path from s to e. Clearly, at some point both n and g' must be in the A* priority queue at the same time, and A* must have removed e' over n.

So f(e') \le f(n) and g(e') + h(e') \le g(n) + h(n). Let c(x, y) be the true cost from node x to y. Since the path is non-optimal, g(e') > g(e). Clearly, g(e) = c(s, n) + c(n, e) \ge g(n) + h(n) = f(n).

So f(e') > f(n), which isn't possible since f(e') \le f(n). So A* cannot have given us a non-optimal solution within a tree.

Consistent means that the heuristic is admissible regardless of the goal node - h(n) \le c(n, n') + h(n'), where c(x, y) is the true cost of getting from node x to node y - sort of analogous to the triangle inequality. Most admissible heuristics will also be consistent, but not all of them.

A* has good performance in practice, but in the worst case needs to visit every node, and keep track of all the backtracking information for every node. The real-world performance of A* depends heavily on the heuristic function, but it's generally a lot better than plain breadth-first search.

For example, if we were to use A* to solve a sliding puzzle game (a 3x3 grid of squares with one tile missing, squares must be slid into a particular pattern), one heuristic we might use is to count the number of moves to get each tile to its proper place, assuming the tiles can occupy squares regardless of whether there's already a tile there - the sum of the Manhattan distances between each node to its destination. A somewhat worse heuristic would be to count the number of tiles not in their correct places.

To design a heuristic, we generally start by removing some constraints of the problem to get a relaxed version of the problem, finding a computationally cheap solution to the relaxed version (or precompute a database of solutions for relaxed problem instances), and then use that as the heuristic. For example, for the sliding puzzle game, we started by relaxing the requirements that there only be one tile per square, and then trivially found a solution by just moving the tiles to their correct places. There's often a big tradeoff between the cost of computing the heuristic, versus how good the heuristic is - finding a good relaxation of the problem is a bit of an art.

We often want to use A* while also limiting how much memory we use. One way to do this is using Iterative Deepening A*, which is sort of like iterative deepening search, but we consider nodes by order of their f(n) value, limit the value of f(n), and progressively rerun the search with increasing f(n) limit until the goal is reached. Another way is simplified memory-bounded A*, which simply starts dropping nodes from the priority queue with the highest f(n) values (or oldest nodes, as a tiebreaker) when it's out of memory (this is still optimal/complete if distance to shallowest goal node is less or equal to memory size).

Assignment 1 due October 4.

Informed search uses domain specific knowledge to make search faster. Constraint satisfaction problems (CSPs) use knowledge about the current state itself to help us find solutions. In a standard search problem the states are black boxes, with a goal test and successor function to deal with them. A constraint satisfaction problem has states defined by variables X_1, \ldots, X_n whose values are from a domain D, and the goal test is a set of constraints that tell us about what values a subset of the variables S \subseteq \set{X_1, \ldots, X_n} must take on. We can then exploit knowledge of these constraints to improve our performance.

Consider the map coloring problem: coloring a map such that no neighboring regions have the same color. Here, the constraints are that neighboring regions cannot have the same color, X_1, \ldots, X_n is the color of each region, and the domain might be D = \set{\text{red}, \text{green}, \text{blue}}. We want to find a solution, which is an assignment of values to X_1, \ldots, X_n satisfying all the constraints. Another famous constraint solving problem is 3-SAT, which tries to find boolean values that satisfy a set of constraints in a particular form.

We distinguish betwen finite-domain constraint satisfaction problems like 3-SAT (variables have finite possible values), and infinite-domain ones like linear systems (variables have infinite possible values). We also care about the arity of the constraints - for example, binary constraints operate over two variables, ternary over three, and so on. There's also soft constraints, which don't have to be satisfied but are preferred if possible - this forms a constrained optimization problem.

We can use standard search algorithms to solve these binary constraint satisfaction problems. For example, for the map coloring problem, all the constraints are binary, so we can perform a search - states in the graph are partial assignments of colors, the search graph is the map's graph, the initial state is the uncolored map, the successor function colors one currently uncolored region, and the goal is that all colors are assigned and no two adjacent regions have the same color.

In general, we can solve a CSP using search: states are partial assignments of variables, initial state is that no variables are assigned, successor function assigns a value to a variable that isn't currently assigned, and the goal is that all variables are assigned and all constraints are satisfied.

CSPs are commutative - the order that we assign values to variables doesn't matter. This can be used to significantly reduce our search space - at each node in the search tree, we only have to consider the value of a single variable, not the values of variables we looked at before it.

Backtracking search is the main way to solve general CSPs: select an unassigned variable X_i, try every possible value of X_i to make sure they satisfy constraints. If there's a value v that does, set X_i = v and repeat the whole process, and if there is no such v, go back to the preceding variable X_j and keep trying more values of X_j.

Backtracking is basically the same as DFS, but there are a lot of ways to improve on it:

- Ordering: choose unassigned variables and variable values in a specific order using heuristics. This can help us avoid a lot of bad branches, if we look at promising branches first.
- A common way to choose variables is to choose the one with the fewest remaining possible values (
**most constrained variable/minimum remaining values**heuristic), and as a tiebreaker, select the one referenced by the largest number of unsatisfied constraints (**most constraining variable**). This helps us eliminate bad previous choices as fast as possible, because anything that will eliminate all possible variable values gets detected more quickly. - A common way to choose values for a variable is to choose the one that rules out the fewest values in the remaining variables (
**least-constraining value heuristic**) - the value that allows variables that will be chosen next to have the largest number of values. In other words, we try different values of the variable, and pick the one that reduces the sum of the remaining variables' possible values the least.

- A common way to choose variables is to choose the one with the fewest remaining possible values (
- Filtering: detect when the state we're currently in definitely won't lead to a solution if we continue searching, so we can immediately backtrack.
- Forward checking: keep track of remaining possible values of unassigned variables, and when any variable runs out of possible values, backtrack, since we know we won't find an assignment if we proceed from the current state.
- Arc consistency: given two nodes in a search digraph A, B with domains D_A, D_B respectively, an arc A \to B is
**consistent**if and only if for every variable value x \in D_A, there exists a variable value y \in D_B such that x and y are consistent (don't violate any constraints when in a partial assignment together).

;wip: slides

Also, it turns out that if the digraph is a tree, then we can solve it in O(nd^2), where n is the number of nodes and d is the number of values in the domain. To do this, we use topological sort to get an ordering such that parents occur before their children.

;wip: slides

;wip: tree widths of 0 and 1 are still trees, but the larger the tree width, the less like a tree it is - the larger the width, the less tree-like it is, and worse the decomposition approach will work, also finding optimal tree decomposition is NP-hard

So far we've looked at uninformed and informed search. Today we'll look at local search and optimization - searching in very large problems, where we're willing to settle for a non-optimal but workable solution. Local search is used when we can't use global search/optimization techniques like A* or linear programming.

As with the other search techniques so far, we have a domain, constraints, and a cost function V(P). We're trying to find a point P in that domain that satisfies the constraints and has a low cost function value (not necessarily the lowest one, though).

Search algorithms so far have systemically explored the search graph, and the path is the solution, but in many cases we don't care about the path. For example, scheduling exams without conflicts (the thing we care about is the final exam schedule, not the path of partial schedules we used to arrive at a conflict-free schedule), or chip design (we care about routing wires around to minimize chip footprint, not which wires we should route first). Local search techniques take advantage of this to make larger problems feasible.

Local search problems tend to have a wide search tree and a cost function that tells us how good a given solution is. Many of these problems are NP-hard to find optimal solutions, but **we just want feasible solutions that are good enough, even if they're not optimal**.

For example, for N-queens, we might try to find a solution that removes a lot of the queens from attacking positions, rather than a solution that gets all of the queens out of attacking positions. When there are few constraints vs. variables, it's easy to solve - if we just throw the queens on the board somewhere, it's probably a valid solution. When there are lots of constraints vs. variables, it's easy to see it's infeasible - if there are no non-attacking positions left to put queens in. However, when there's roughly the same number of constraints and variables, there are some solutions, but they're hard to find - searching for any solutions becomes very computationally intensive. Local search problems tend to have these sorts of "phase transitions" - most problem instances tend to be really easy, but a certain smaller number of problem instances are really difficult.

One way we commonly solve search problems is by iteratively improving a random initial state - a greedy **hill climbing** algorithm often also known as gradient descent. Essentially, we repeatedly move in the direction that would improve our current situation the most:

- Run the following many times:
- Choose a random initial state S with value V(S). This state isn't necessarily optimal or even feasible.
- Generate a moveset \delta(S) = \set{S_1, \ldots, S_k} - the set of all neighboring states.
- Let S' = \argmax_{S_i} V(S_i) - the best possible next state.
- If V(S') > V(S), set S = S' and go back to step 2 (we might also limit the number of iterations here). Otherwise, return S - we've found a maxima.

- Take the largest value of S from all the iterations run in step 1.

Hill climbing is easy to implement and light on memory, but is only complete or optimal when the problem space is convex - for non-convex problems, there can exist local maxima that are neither feasible nor optimal, and we might end up hitting those local maxima every iteration.

Also, there can be "plateaus" - regions where the cost function is constant, where we would get stuck. We can avoid this by allowing movement in directions that don't change our cost function (usually in the same direction as the previous step) - this lets us move through a plateau and hopefully find higher values on the other side.

One way to improve hill climbing is to use **simulated annealing**. This is quite similar to hill climbing, but at each step, instead of always moving to the best neighboring state, we instead randomly choose a neighbor, and move to it with a probability based on how much better it is than the current solution, and the amount of steps we've taken:

- Choose a random initial state S with value V(S). This state isn't necessarily optimal or even feasible.
- Repeat n times:
- Generate a moveset \delta(S) = \set{S_1, \ldots, S_k} - the set of all neighboring states.
- Let S' be a random next state from \delta(S).
- If V(S') > V(S), let S = S'. Otherwise, let S = S' with probability \exp\left(\frac{V(S') - V(S)}{T}\right) (otherwise, do nothing). Here, T is the
**temperature**- a function of the current iteration that starts high in the first iteration and gradually goes toward 0 as we get toward n iterations (usually exponentially by multiplying by a constant 0 < \alpha < 1 at each step).

- Take the value of S as a solution.

Note that in step 2.3, if the new step was better we would accept it every time, and if it was the same or worse there's a certain chance we accept it anyways. Note that the chance decreases when the next state is much worse (and increases when the next state is only a little worse), and decreases significantly as the temperature gets closer to 0.

So when T is large, we're often choosing next states that might be better or worse (exploration stage), and as T gets smaller, we're often choosing next states that are strictly better (exploitation phase). As it turns out, if we decrease T slowly enough (if we run enough iterations), we are guaranteed to find an optimal solution.

Genetic algorithms are sort of like an extension of simulated annealing, in the exploration stage. Genetic algorithms can be used when states have a "fitness" function f(S), it's possible to "combine" two states to form a new state c(S_1, S_2), and it's possible to mutate a state to get a slightly different state m(S). Essentially, we have a a population of initial states, and repeatedly combine high-fitness states and mutate them until we have an acceptable result:

- Let P = {S_1, \ldots, S_n} be a population of random initial states.
- Repeat until we have a state S_i with high-enough f(S_i) value:
- Randomly choose two distinct states S_i, S_j from P, where the choice is weighted by each element's f value.
- Combine S_i and S_j to produce a child S' = c(S_i, S_j).
- With some small probability m, mutate S' to get m(S'). Also, in some variations of genetic algorithms, we instead mutate a randomly chosen state in P instead.
- Add S' to P. In most variations, we also kick out the lowest-fitness S_i once P has reached a certain size.

- The solution is the highest-fitness S_i in P.

States are often represented as binary strings. The crossover function c(S_1, S_2) might then be something like "generate a random bitmask m, then `new_S = (S_1 & m) | (S_2 & ~m)`

", and the mutation function might then be something like flipping a random bit.

Softmax is nice because it handles outliers well.

We now introduce an adversary with conflicting goals. Usually this means a game. Games are well-studied in AI because they're easy to represent, aren't too easy, and are easy to evaluate performance on.

We classify games into categories like perfect vs. imperfect information, deterministic (fully controlled by players) vs. stochastic (element of change). As a search problem, a 2-player perfect information game has a state defined by the board state and whose turn it is, the successor function is the set of the results of every legal moves at a given state, the terminal/leaf states are states where players win/lose/draw, the heuristic is the player's estimated utility function. A solution to the game is a strategy that tells you how to move at every state in the game - a way to get to a goal node from every node in the tree.

This is a challenging problem because there's a malicious opponent, and we need to take that into account with our search. A **max-player** is trying to maximize its own utility, while a **min-player**, the opponent, is trying to minimize it. We're trying to find an optimal strategy - a strategy that guarantees that gives us an outcome that is as favourable as possible, given that the opponent is playing as good as possible (we won't necessarily always win). Note that this assumes an optimal opponent, so we might prefer different strategies when the other player isn't playing optimally.

In this course, we'll mostly focus on zero-sum games (games where one player winning means the other loses) with perfect information. In theory these are easy to solve given infinite time, and most of the challenge comes from making this practically fast.

Consider the centipede game: players take turns either incrementing a counter, or deciding to end the game, which lets them take all the value in the counter and win. Clearly, we want to take the counter to get the value, but we want to keep going for more turns to get more value when we decide to take it.

We can analyze this using **minimax search**. We represent the value of each node in a tree as \text{minimax}(s) = \begin{cases} \text{utility}(s) &\text{if } s \text{ is a terminal state} \\ \max\set{\text{minimax}(s') : s' \in \text{successors}(s)} &\text{if } s \text{ is a max-node (our turn)} \\ \min\set{\text{minimax}(s') : s' \in \text{successors}(s)} &\text{if } s \text{ is a min-node (their turn)} \end{cases} (this is also called the **security value**). Essentially, it represents the worst-case utility of the game starting from the current state, assuming both players are playing optimally (they're both minimaxing). Essentially, each player is minimizing the amount it can possibly lose by.

Minimax search is essentially DFS through this minimax tree. This isn't complete since it doesn't work if the game doesn't terminate, but it has pretty good space complexity.

In the centipede game, the course of action that minimizes the amount we lose by in the worst-case is to betray at the first possible turn. However, real people will generally play more than that. For chess, the branching factor was empirically determined to be around 35, and the tree depth is around 100, at least. This is impractical to search exhaustively.

For certain types of games, we can do **alpha-beta pruning**. Essentially, at each node \alpha is the max-player's best (highest) utility guaranteed so far, and beta is the min-player's best (lowest) utility guaranteed so far (including sibling nodes we just visited). We then skip visiting nodes where \beta \le \alpha - taking this move will make the current player worse off than before, because the min-player's guaranteed low score is no longer greater than the max-player's guaranteed high score, so a win or draw is guaranteed for the other player. In a minimax tree, this gives us the same result as an exhaustive DFS. Usually, the effect of alpha-beta pruning is to stop looping through a node's children when we've found a child that guarantees us a win.

Alpha-beta pruning lets us eliminate branches from consideration. If we look at branches in a game tree left to right, we'd update \alpha and \beta for each node as we traverse the tree, and alpha-beta prune branches as we move from left to right. Alpha-beta pruning in this case will tend to remove branches from the right hand sides of nodes that would give poor results.

This is very helpful - minimax search on chess can look 5 moves ahead, but alpha-beta pruning gives us enough performance to look up to 10 moves ahead! This still isn't good enough - alpha-beta pruning requires us to go all the way down to terminal nodes in order to get the alpha/beta values. We can supplement this using heuristics to estimate alpha and beta. A fast evaluation function gives the true utility for a terminal state, and the estimated expected utility for any other state. There's a lot of AI research in making a good evaluation function using expert knowledge and experience - one that makes good estimates of how much payoff we'll get from a given game state.

We also don't have to go down all the way to the terminal states, and cut off the search at certain thresholds. We'll usually want to cut off the search when we reach a certain depth, or when the state becomes relatively stable (not really leading towards a win/loss), though we'll often want to give **singular extensions** - additional depth for moves that seem particularly promising (this helps avoid some cases where we'll miss promising options just beyond the cutoff). The parameters for the cutoff would also depend on the opponent - a novice chess player might only need 5 move lookeahead, while a grandmaster might need even 14, with exhaustive search near the endgame and a good evaluation function.

For stochastic games, we treat chance itself as a player, so there's the max-player (us), the min-player (the opponent), and the chance player (the whims of nature), where the chance player makes its move with probabilities defined by the path in the tree, in between every max-player and min-player turns. We then construct an **expectiminimax tree** and consider it a search problem. The expectiminimax tree is the same as the minimax tree, but with another case: \text{expectiminimax}(s) = \begin{cases} \text{utility}(s) &\text{if } s \text{ is a terminal state} \\ \max\set{\text{expectiminimax}(s') : s' \in \text{successors}(s)} &\text{if } s \text{ is a max-node (our turn)} \\ \min\set{\text{expectiminimax}(s') : s' \in \text{successors}(s)} &\text{if } s \text{ is a min-node (their turn)} \\ E(\set{\text{expectiminimax}(s') : s' \in \text{successors}(s)}) &\text{if } s \text{ is a chance-node (nature's turn)} \end{cases}, where E(X) is the expected value.

For Go, the branching factor is about 250, and the terminal states are usually up to 150 moves deep. For this we can use techniques like Monte-Carlo tree search.

A **preference ordering** for an agent over a set of states S is a weak ordering \succsim of all the states in S - a ranking S_1 \succsim S_2 that determines whether S_1 is **at least as good as** S_2. Likewise, S_1 > S_2 means that S_1 is **strictly better than** S_2, and S_1 \sim S_2 means that S_1 is **equally preferable** to S_2. By weak ordering, we mean that every state is ranked, but there may exist ties (unlike a total ordering, where there can exist ties).

For deterministic worlds, the consequences of an action is always the same, so we can know exactly what the state of the world will be after taking that action. For non-deterministic worlds, there's an element of chance for actions, so we can only know what the probability of each world state occurring will be after taking that action. We can represent a nondeterministic outcome as a **lottery** L = [p_1, S_1; \ldots; p_k, S_k], where p_i is the probability of S_i occurring. Essentially, this is a probability distribution over all outcomes.

Any preference ordering must obey certain properties. Given world states A, B, C, the following must all be true:

- Orderability - one or more of the following must be true: A \succsim B, B \succsim A, or A \sim B.
- Transitivity - if A \succsim B and B \succsim C, then A \succsim C.
- Continuity - if A \succsim B \succsim C, then there must exist a probability p such that [p, A; 1 - p, C] \sim B (there must exist a mixture of outcomes between A and C that is identically preferable to B).
- Substitutability - if A \sim B, then there must exist a probability p such that [p, A; 1 - p, C] \sim [p, B; 1 - p, C].
- Monotonicity - if A \succsim B, then for any probabilities p, q, p \ge q exactly when [p, A; 1 - p, B] \succsim [q, A; 1 - q, B].
- Decomposability - [p, A; 1 - p, [q, B, 1 - q, C]] \sim [p, A; (1 - p)q, B; (1 - p)(1 - q), C].

These conditions are required in order to maintain basic rationality requirements. For example, if we didn't have transitivity, then there could exist a A > B > C > A, so we could get arbitrarily-large amounts of utility by repeatedly taking actions to go from A to C to B and to A again.

A **decision problem under certainty** is a tuple \tup{D, S, f, \succsim}, where D is the set of possible decisions/actions to take, S is the set of possible states/outcomes, f: D \to S is a function mapping actions to their resulting world states (the outcome function), and \succsim is a preference ordering over S.

A **solution to a decision problem under certainty** is an action d^* \in D such that f(d^*) \succsim f(d) for any d \in D - an action that is equally or more preferable than any other possible action.

When actions don't have deterministic outcomes, we can't easily compare outcomes directly, since we're representing them as lotteries. Instead, we want to quantify how much we prefer each world-state to each other world-state, so we can determine whether one lottery is preferable to another.

We do this by introducing a **utility function** U: S \to R - a function that, given a state s \in S, produces a real value that measures how much we prefer s over some arbitrary reference state. Also, let's define \succsim_U to be a preference ordering where s_1 \succsim_U s_2 exactly when U(s) \ge U(t). Now for a lottery l = [p_1, s_1; \ldots; p_k, s_k], we can define U(l) = \sum p_i U(s_i) - the **expected utility** of a lottery is the expected value of the utility function over all the possible outcomes in the lottery.

When we have any nondeterministic outcomes (when we have lotteries as outcomes), we have a **decision problem under uncertainty**. We represent this as a tuple \tup{D, S, f, U}, where D and S have the same meanings as for decision problems under certainty, f: D \to \Delta(S) is a function mapping actions to lotteries over states (distributions over states), and U is a utility function.

According to expected utility theory, a **solution to a decision problem under uncertainty** is an action d^* \in D such that U(f(d^*)) \ge U(f(d)) for any d \in D - the action whose resulting lottery maximizes expected utility.

Note that U is invariant under positive affine transforms - d^* = \argmax_{d \in D} U(f(d)) = \argmax_{d \in D} aU(f(d)) + b for any a > 0, b. That means we can multiply U by a positive constant or add any constant to U, and the set of solutions to any decision problem using U would not change.

Decision problems are conceptually simple to define and solve. In practice, they are often hard to solve because S is so large and it is hard to compute f. Additionally, even if a single decision problem is feasible to solve, we often want to solve a sequence of decision problems to determine the best sequence of actions that we should take.

New notation: P_a(s_2 \mid s_1) is the probability of moving to state s_2 given that we are taking action a while at state s_1.

For sequences of actions, we can draw a tree containing all possible sequences of actions - the root node is the initial state connected to "chance" nodes (like in a game tree), via edges labelled by each possible action at that state. Chance nodes then connect to outcomes of those actions, via edges labelled by the probability of that outcome occuring given that action. The outcomes then connect to their own chance nodes, and those to more outcomes, and so on. Any particular sequence of actions can then be thought of as moves downward through this tree, much like playing a game moves downward through its game tree.

By thinking of all possible sequences of actions as a tree, it's clear that the expected utility of a sequence is the expected utility of the **final outcome distribution** - a weighted average over the utilities of all the leaf nodes (which are states) that can possibly occur by taking our sequence of actions. For example, if we're playing tic-tac-toe, a sequence of actions that leads to losing might have very little utility, while a sequence that leads to winning might have a lot.

To compute this final outcome distribution, we can recurse through the tree:

```
def expected_utility(state, action_sequence, probabilities_for_action_at_sequence):
# concrete outcomes have a real utility value
if is_real_state(state):
return utility(state)
# distributions of outcomes have an expected utility value
current_action, new_action_sequence = action_sequence[0], action_sequence[1:]
return sum(
probability * expected_utility(new_state, new_action_sequence, probabilities_for_action_at_sequence)
for probability, new_state in probabilities_for_action_at_sequence
)
```

For nondeterministic decision problems, we usually don't want to explicitly plan out sequences of actions in advance, because after we perform the first few actions in the sequence, we have more information than we did before, and could potentially make a better decision than we originally planned. For example, suppose we originally planned to flip a coin (action 1), and then guess that it would land heads face-down (action 2), and the utility is 1 if we get it right and 0 if we don't. Clearly, this is the best we can plan to do since we don't know the outcome of the coin in advance. However, after performing action 1, we know the new state of the world, and can look at the face-up side to guess what the face-down side is.

We can generalize this into the idea of **policies** - a sequence of states and actions to take if our previous actions lead us to those states. For example, here's a policy for the coin guessing decision problem above: "flip a coin (action 1); if the coin outcome is heads, then guess that tails is face-down (action 2a), if the coin outcome is tails, then guess that heads is face-down (action 2b)".

Let A be the set of actions, O be the set of outcomes. Then there are \abs{A}^k possible plans consisting of k actions, and (\abs{A}\abs{O})^k possible policies.

A decision tree for a decision problem under uncertainty is a tree consisting of alternating layers of state nodes (also known as choice nodes, because we have to make choices at each state) and chance nodes, where the root is the initial state. Each final outcome/state s has an associated utility value U(s) (the leaf nodes). Using something like Minimax search, we can then define our utility function at every node, rather than just the leaf nodes:

- For a chance node c = [p_1, s_1; \ldots; p_k, s_k] (possible outcomes and their probabilities of occurring), we take the weighted average to get U(c) = \sum p_i U(s_i).
- For a choice node s = \set{a_1, \ldots, a_k} (actions and their associated chance nodes), we take the expected utility of the action with the greatest expected utility to get U(s) = \max U(a_i).

When implementing the above, we would usually compute expected utilities bottom-up, in a dynamic programming fashion. This allows us to compute the utility value of each node in the tree in O((\abs{A} \abs{O}))^d time where d is the depth of the tree, so it's in the same order as the number of possible policies. Note that the naive, top-down implementation that computes U(s) for each node would instead take O(n^d m^{2d}).

A policy essentially defines a decision for choice nodes in the tree. We can represent it as a mapping from states to actions. Note that because we took only the maximum-expected-utility action at each choice node as we were evaluating our utility function, the nodes under every other possible action will never be reached. Therefore, what action we assign there is more-or-less arbitrary. Two policies are **implementationally indistinguishable** if they assign the same actions to all choice nodes that are reachable from the start node, when applying either policy (though they may be different for unreachable ones).

Most decision problems aren't fully observable - we usually work around this by using heuristics. Additionally, representing outcome distributions often requires a huge number of states and probabilities - these are usually worked around by using decision networks or Bayes net representations, which we'll cover later on.

A decision problem is **fully observable** if and only if we know the initial state, and it's possible to know the outcome of every action (which state we ended up being in). For partially observable decision problems, we might only know the outcomes of some actions after taking them. If we have a partially observable decision problem, we can sometimes design the choices to move the unobservable outcomes into leaf nodes, where we won't need to observe their outcomes: for example, instead of a decision tree that decides whether to proceed as if a patient has malaria or the flu, and then decides whether to prescibe drug A or B in each case, we might have a decision tree that decides whether to prescribe drug A or B, and then decides whether to proceed as if the patient has malaria or the flu.

;wip

;wip: missed due to interviews

Midterm next week, includes everything up to and including today's content, Markov Decision Processes. Assignment 2 is due week after that.

We usually apply a future discount to utility values - utility gains in the future are discounted proportionally to how far in the future they are. For example, getting 10 dollars now is worth more than getting 10 dollars in a year.

;wip: missed due to interviews

;wip: missed due to interviews

Midterm tonight.

;wip: missed due to interviews

Midterm average is 68 in this section.

Chain rule: P(A_1 \land \ldots \land A_n) = P(A_1 \mid A_2 \land \ldots \land A_n) P(A_2 \land \ldots \land A_n).

Suppose we have a variable C representing whether you have a cough, F for whether you have a fever, and I for whether you have influenza. Clearly, C and F are conditionally independent of each other given that I occurs - P(C \mid F \wedge I) = P(C \mid I) and P(F \mid C \wedge I) = P(F \mid I). In other words, if we know that a person has influenza, we already can make a prediction about C and I - a cough depends on having the flu, not having a fever, and vice versa.

Since they're conditionally independent, P(C \land F \mid I) = P(C \mid I) P(F \mid I). Likewise, P(C \land F \land I) = P(C \mid F \wedge I) P(F \land I) = P(C \mid I) P(F \mid I) P(I). This is really useful, because each factor is easier to measure individually - P(C \mid I), for example, is just the fraction of flu patients that cough.

Note that the joint distribution for P(C \mid I) P(F \mid I) P(I) has 7 possible values for each possible assignment of values to C, F, I (minus one because you can compute the last value by using 1 minus the rest of the probabilities).

Note that the number of possibilities in the distribution grows exponentially with respect to the number of variables being joined. This makes it quickly infeasible to store the entire joint distribution in memory, or gather data for each of those possibilities.

Two random variables A, B are **independent** if knowledge of A doesn't change uncertainty in B. So P(A \mid B) = P(A) = P(B \mid A), P(A \wedge B) = P(A) P(B), or in general, P(X_1 \wedge \ldots \wedge X_n) = \prod P(X_i).

Two random variables A, B are **conditionally independent given another random variable Z** if and only if knowledge of A doesn't influence P(B \mid Z) and knowledge of B doesn't influence P(A \mid Z). So P(A \mid B \wedge Z) = P(A \mid Z) and P(B \mid A \wedge Z) = P(B \mid Z). In other words, conditional independence means that given Z, A and B are independent.

If we have n independent boolean variables, then we can specify their joint distribution with O(n) memory rather than O(2^n). A similar optimization can be done for conditionally independent variables, and many things in the real world are conditionally independent. Bayes networks use this to concisely represent joint distributions.

Notation concerns: P(X) is the distribution of the random variable X - a function of x that returns the probability that X = x, denoted P(x) or P(X = x). P(X \mid Y) is a family of distributions of the random variable X - a set of functions of x, one for each value of y.

If I wake up early (E), then I'll likely be unhappy (U), and if I'm unhappy, then I'll likely do poorly on an exam (X). If we have knowledge about waking up early or being unhappy, then that affects our uncertainty about doing poorly on the exam, so doing poorly is dependent on both. So the joint probability distribution is P(E \land U \land X) = P(X \mid E \land U) P(U \mid E) P(E). However, if we simply know I'm unhappy, then knowing whether I woke up early or not doesn't affect my uncertainty about doing poorly on the exam, because it doesn't add any new information. So P(E \land U \land X) = P(X \mid U) P(U \mid E) P(E). Basically, we managed to change the exponential number of table entries into linear!

This can be generalized to any number of chained dependent variables - knowing one part of the chain (such as being unhappy) means that the later parts (such as doing poorly on an exam) no longer depends on previous parts (waking up early). So P(x) = \sum P(g \mid u_i) P(u_i) = \sum_i P(g \mid u_i) \sum_j P(u_i \mid e_j) P(e_j).

A **Bayesian/belief/causal/probabilistic network** is a graphical representation of direct dependencies between random variables. A Bayesian network over random variables X_1, \ldots, X_n is a directed acyclic graph where nodes are variables and an arc X_i \to X_j exists whenever X_j directly depends on X_i (representing conditional probability distributions).

As usual we have the set of parent nodes \operatorname{Par}(X_i) (direct parents, not all ancestors), children, descendents, and ancestors for a node in a Bayesian network. We also have the **family** of a node, which is just the node plus its parents \set{X_i} \cup \operatorname{Par}(X_i).

The Bayesian network lets us easily find the joint distribution of any set of random variables. This works because given a node

;wip: all the way to constructing a BN

Bayesian networks are not unique for their variables - the order that we choose variables in while constructing it is very important, and some orders are better than others. Usually Bayesian networks are simpler when the variables are chosen by following causal intuitions - we want variables representing causes to precede variables representing effects.

;wip: all the way to but not including blocking

;wip

;wip

;wip

So far, there has only been a single agent in our environments. Not only are we reasoning about our own agents, we're reasoning about how other agents are reasoning. Our goal is to model how agents behave in this setting, and design systems so they'll behave in a certain way.

We assume that all agents are self-interested - acting according to their own goals and on their own world states.

Game theory problems deal with analyzing how groups of **rational**, **strategic** agents **interact** with each other (a special case is the decision problem, where there is only one agent). Rational means that agents choose the best possible action they know of, strategic means that they take their actions into account in the wider context of the game, and interaction means that the actions of one agent will affect other agents.

A **strategic/matrix-form/normal-form game** is defned as a set of agents I = \set{1, \ldots, N} with utility functions u_1, \ldots, u_n, a set of possible actions for each player i A_i = \set{a_i^1, \ldots, a_i^m}, an outcome profile a = (a_1^{j_1}, \ldots, a_N^{j_N}) (a tuple of actions that ended up being chosen by the agents, representing a game state).

We can often represent these sorts of games as a payoff matrix. Here's an example for playing Chicken:

```
Agent 2
| Turn Away | Continue |
Agent 1 Turn Away | 10, 10 | 0, 20 |
Continue | 20, 0 | 5, 5 |
```

This is a variation on the prisoner's dilemma, where two people are driving toward each other. If they both turn away, they don't die and built rapport with each other. If one turns away, they are embarassed while the other looks brave. If they both continue, they both die but look brave. Agent 1 is usually called the **row player**, while Agent 2 is usually called the **column player**.

When written in this form, a zero-sum game is simply a game where \sum_{o \in \text{outcomes}} U(o) = 0.

Let a_{-i} represent the actions of every player other than i. Let p_i(a_{-1}) represent what agent i thinks all the other players will do. Then an agent i is **rational** if it chooses actions according to \argmax_{a_i} u_i(a_i, p(a_{-i})) p(a_{-i}).

;wip: apply belief about others' actions to prisoner's dilemma.

An action strictly dominates another action if it gives more utility regardless of what other agents do - if u_i(a_i', a_{-i}) > u_i(a_i, a_{-i}) for all a_{-i}, then a_i' strictly dominates a_i. If an action is strictly dominated by another action, a rational agent will never play it. However, the converse isn't necessarily true, so this doesn't give us the full picture about what rational agents will do.

A **Nash equilibrium** is an action profile where no agent would want to change their action given that no other agents change their action. In other words, if we asked them after the fact whether they would have picked a different decision, every agent would say no. Formally, for every agent i and a current action profile a^*, for any possible action a_i', u_i(a_i^*, a_{-i}^*) \ge u_i(a_i', a_{-i}^*).

A Nash equilibrium is essentially an outcome where no one player can take advantage of any others, even in hindsight.

For randomized games, we can't use the pure Nash equilibria, because we don't know what the outcome might be. Instead, we can analyze the **mixed Nash equilibria**, which is the same thing but with expected utilities instead of the known utilities, and strategies (probability distributions over actions) instead of actions.

For the prisoner's dilemma, there's only one Nash equilibrium at defect-defect, and interestingly, it turns out not to be the globally optimal outcome. At any other outcome, at least one of the players would rather have defected than cooperated.

Nash's theorem: every game with finite-sized action sets, has at least one mixed-strategy Nash equilibrium. The proof is non-constructive, however, and finding Nash equilibria remains very hard. 2-player zero-sum games can be solved in polynomial time with linear programming, while for arbitrary problems it's even harder than NP-hard, in PPAD.

Prisoner's dilemma changes once agents play against each other multiple rounds, because each agent gives information to the other agent through their history of cooperating/defecting. With history, the strategy becomes a lot more complex, because agents can reward/punish past behaviour, and consider things like reputation and trust.

Some common repeated prisoner's dilemma strategies are "grim", where an agent cooperates until the first defect, and defects forever afterward, and "tit-for-tat", where an agent starts off coorperating, and copies the opponent's action in the previous round.

While normal-form games have agents choose actions simultaneously, an **extensive form game** allows the order of actions to affect the outcome, such as games where players take turns. Minimax is a strategy designed for extensive-form games.

Extensive form games still have the set of agents, action sets, and utility functions from normal form games, but we have terminal and choice nodes instead of action profiles. Just like we looked at with minimax, extensive form games can be represented with a tree.

Every extensive form game can be transformed into a normal form game by having each agent simultaneously commit to playing certain policies. Those commitments can then be thought of as the agent actions, and the utility of the final game state can be thought of as the actions' utility values.

An interesting case of Nash equilibria behaving unusally is mutually assured destruction. Consider two nuclear-capable countries A and B. Missiles are on the way from A to B, and B has to decide whether to retaliate.

Subgame perfect equilibria for extensive form games are ;wip: MAD example

;wip: centipede game, nobody actually plays like

;wip: ultimatum game: I must split $10 with you, you say yes/no, if no then nobody gets the money. optimal is to always say yes, but make them think you'll say no

;wip: cultural aspects- wall street bankers always defect, rural fisherman always cooperate

Mechanism design is the design of systems that make rational agents behave in a certain way. Instead of talking about agents given a game, we're talking about games given an agent - sort of like a reverse game theory.

We can represent mechanisms with something very similar to games - a set of agents I = \set{1, \ldots, N} with utility functions u_1, \ldots, u_n, a set of possible actions for each player i A_i = \set{a_i^1, \ldots, a_i^m}, an ;wip: update from slides? a = (a_1^{j_1}, \ldots, a_N^{j_N}) (a tuple of actions that ended up being chosen by the agents, representing a game state). Additionally, each agent has a **type** \theta_1, \ldots, \theta_N that represents all of the agent's private information that might affect their utility function, and there's a **social choice** function f: \Theta_1 \times \ldots \times \Theta_N \to O that reads the agents' minds and returns an outcome. ;wip: rather than action profile, we have outcomes

Social choice functions might include voting one candidate from a group of candidates, choosing which resources to assign to which agents, or whether to buy something as a group. This might depend on the types of all the agents, so the social choice function's value is unknown to any individual agent. We really want the social choice function, but it's not available within a game.

Since agents can't actually read each others' minds, we have to go through a trusted institution. A **mechanism** is a formal representation of this. A mechanism M is an assignment of strategies to each agent S_1, \ldots, S_N, and an outcome function g ;wip: outcome function

Each agent submits a report to the mechanism, and the mechanism aggregates all those reports into an outcome using the outcome function. For example, for voting we might have ;wip.

A mechanism implements a social choice function f(\theta) if there exists a Nash equilibrium s^* = \tup{s_1^*(\theta_1), \ldots, s_N^*(\theta_N)}. ;wip: rest of slide

Even better is the direct mechanism, where S_i = \Theta_i for every agent i. If this is the case, then g(\theta) = f(\theta) for all \theta \in \Theta_1 \times \ldots \times \Theta_N. In other words, if the strategy space is the same as the ;wip

A direct mechanism is **incentive compatible** if and only if there's an equilibrium s^* where s_i^*(\theta) ;wip. For the voting example, it's incentive compatible whenever it's in every agent's best interest to just submit their voting preference (their type information) directly than .

A direct mechanism is **strategy proof** if it is incentive compatible ;wip

;wip: revelation principle slide: if it's implementable in dominant strategies, then it's directly

Gibbard-Satterwaite theorem - if : ;wip

- O is finite and has more than 2 options.
- Each outcome in O ;wip
- \theta includes all possible strict orderings over O.

So we either we allow the voting system to not be strategy-proof, or we make the voting system dictatorial - there would exist an agent whose top-rated outcome becomes the outcome of the voting system. To get around this in practical situations, we can use a weaker definition of an equilibrium - we can design mechanisms where it can be NP-hard to figure out how to manipulate the voting system.

Alternatively, if the agent preferences follow certain patterns, we can still guarantee strategy-proofing. ;wip: single peaked and quasilinear preferences (quasilinear means that we introduce money, money is worth the same to everyone, and we can transfer utility between people just by sending money around).

;wip: groves mechanisms for quasilinear mechanisms, transferring money so everyone is happy but balancing that out with h_i(\theta_{-i}) such that the game still works. If we just set h_i to 0, then we just equalized everyone's utility, as if we split all the utility equally. ;wip: is that true? groves mechanisms are efficient and strategy proof under the quasilinear assumption

;wip: vickrey-clarke-groves are groves mechanisms that defines a good h_i function. marginal contribution - how much happier/sadder did you make everyone else by participating in the game vs. not participating in the game? that is how much you are paid/getting paid (e.g., I made everyone at an auction sadder by winning and walking away with the prize).

;wip: Vickrey auction: we pay the lowest amount we could've won with, by paying the second highest bidder's bid. this eliminates overbiddin regret - "I wish I bid a little lower, because I would have still won anyways" - because if they paid any less than the second highest bidder's bid, they would've lost ;wip: show that this is a vickrey-clarke-groves mechanism according to the formula

;wip

So far we've looked at classical approaches for AI - top-down, mathematically-backed approaches to problems like playing games. Real-world problems, however, are not as appropriate for these approaches because they're ill-defined/messy, and even when we can define them, there are often so many details that designing things top-down becomes impractical for humans to do. For example, just fully describing the problem that self-driving cars are solving would be impractical for any human to do - there's so many possible things that can happen that we can't list them all.

Machine learning is a closely related field to AI that deals with more unstructured, bottom-up approaches to solving these problems. For example, supervised learning (induction from pre-labelled data, like classifier algorithms), unsupervised learning (learning patterns in input, like clustering algorithms), and reinforcement learning (learning from feedback that comes later, like TD-learning and Q-learning). Others include semi-supervised learning, active learning (figure out what questions to ask the user), transfer learning (transferring knowledge into other models, such as higher-performance ones or in other domains), apprenticeship/inverse-reinforcement learning (learning from an existing expert, such as learning how a pilot flies a helicopter).

One big problem in ML is representation - what are we trying to learn? How do we represent the model's knowledge? Some common forms are linear polynomials, propositional logic, first-order logic, bayes nets, etc.

Given a training set of examples \set{\tup{x_1, f(x_1)}, \ldots}, a supervised learning algorithm will return a function h that approximates f well, even on unseen examples, known as a **hypothesis**. The **hypothesis space** is the set of all possible hypotheses that can be returned by the algorithm (e.g., all quadratic functions for a quadratic regression algorithm). This is a hard problem because good hypotheses are rare in most hypothesis spaces.

Two functions are **consistent** with a training set if they give the same value for points in the training set. There are often many possible hypotheses for a given training set, but we want one that will perform well on unseen samples. To approximate this, we use Ockham's razor - we want to

A **realizable problem** is a supervised learning problem where f is inside the hypothesis space. We generally want our problems to be realizable because that makes it possible to get the actual answer, so we would want to expand our hypothesis space. However, we don't want to expand it too much because it makes hypothesis more complex and the algorithms less effective - it's a delicate balance (e.g., if we made the hypothesis space all Turing machines, it would be very difficult to learn anything about any input).

A **loss function** L(y, h(\vec x)) is a function that represents how off a hypothesis is from a true value, where y is the true value and h(\vec x) is the predicted value.

Linear threshold classifiers find a hyperplane h(\vec x) = \vec w \cdot \vec x \ge 0 to approximate an unknown function f(\vec x) \in \set{0, 1}. Our loss function is usually the sum of squared errors in the training set. One way to learn a linear threshold classifier is to use gradient descent along the space of all hyperplanes.

Decision trees classify points by consecutively testing them with some fixed test at each node in the tree, and taking a corresponding branch in the tree. The leaf node that the point ends up at is the output class. ;wip: training decision trees

The **entropy** of a random variable V is the information content of that random variable, defined as I(V) = -\sum P(V = v_i) \log_2(P(V = v_i)) (the proof for this is very much out of the scope of this course). The unit of entropy is bits - a coin flip has 1 bit of entropy, while a random ASCII character has 7 bits of entropy.

Uncertainty matters a lot depending on the problem. For example, if there's a 50% chance that A is true, and a 90% chance that B is true, it's worth more to know the answer to A than B, because it eliminates more uncertainty. In this case, A has 1 bit of entropy, and B has about 0.47, by the formula above.

For a decision tree, at each node we want to test on an attribute that gives us the most possible information gain. Ideally, the attribute would split all of the values into two roughly equal-sized sets, one of which is all in the positive class, and the other all in the negative class - this is the outcome that maximizes the entropy of making the decision.

Aside: a decision stump is a decision tree with only one node.

;wip: information gain slide, about choosing an attribute by measuring how much information testing that attribute could give us (information gain) - this is determined by computing the remaining entropy at each step and subtracting it from the previous remaining entropy

The entropy at a given node is I(\frac{p}{p + n}, \frac{n}{p + n}) = -\frac{p}{p + n} \log_2 \frac{p}{p + n} - \frac{n}{p + n} \log_2 \frac{n}{p + n}. This has a maximum at n = p with value 1 - there is 1 bit of entropy if the attribute test splits the values perfectly into two classes that are all in the positive class and all in the negative class, respectively.

;wip: how to compute information gain?

**Algorithm C4.5**: when we introduce a node, we want to ask the question that will reduce the remaining entropy the most. We usually combine this with a greedy algorithm (repeatedly grow the tree wherever we can reduce remaining entropy the most, as long as we still have examples or attribute tests to try), which works pretty well in practice. The resulting decision tree will work for the data, but can often be different from the true, underlying decision tree that generated the data (generally, a simpler tree or a slightly more complex tree than the original). The output of this algorithm is a decision tree (the decision tree is a hypothesis).

We usually use the training-set/testing-set split to evaluate our hypothesis. The learning algorithm is good if the classifiers it generates will give good results on the unseen data in the testing set(data that the classifier has never had access to). Note that the fact that it needs to be unseen means that we must use a new testing set every time we're evaluating our hypotheses.

If a hypothesis performs well on the training set but poorly on the testing set, this usually means we were **overfitting** - making out hypothesis too dependent on extraneous details that only coincidentally happen to correlate (for classifiers, these are often spurious correlations). Formally, a hypothesis overfits a training set if there exists another hypothesis that does worse on the training points, but better on the entire distribution of points.

If we train decision trees naively, overfitting causes a 10%-25% loss in accuracy. How do we prevent decision trees from making these spurious correlations? One technique is to use pruning. We introduce the null hypothesis "there is no pattern in this data" (;wip: formulas from avoiding overfitting slide), and then compute the probability that a sample of size p + n would have the result. We can then do a chi-squared test to figure out whether the null hypothesis is plausible. If the null hypothesis is plausible, we can then prune the corresponding decision tree node.

When a testing set isn't available, or we can't afford enough testing sets to test hypotheses as much as we need, we can use techniques like cross-validation, where we train on parts of the training set, and test on the remaining parts. The most common cross-validation technique is K-fold cross-validation, where we split the training set into K sets, then run K experiments where each set is the validation set and the rest of the sets are combined to get the training set. Leave-one-out cross-validation is just n-fold cross-validation where n is the training set size.

So far we looked at algorithms that choose a single hypothesis from the hypothesis space. **Ensemble methods** choose a lot of hypotheses and then combine their predictions to make the final prediction. The idea is that individual hypotheses might be wrong, but they might have partial information that would make the majority less likely to be wrong - ensemble methods can help improve model performance. Additionally, ensembles can be used to expand the hypothesis space itself - for example, an ensemble of linear classifiers where positive predictions are the AND of all the linear classifiers can have arbitrary polyhedrons as their hypothesis, rather than a hyperplane.

One way to combine predictions in ensemble methods is **bagging** - all the hypotheses take a majority vote and the winner is the final prediction. If we pretend that each of n hypothesis is wrong with probability p independently (even though this doesn't really happen in real life), we get P(k \text{ hypotheses are wrong}) = {n \choose k} p^k (1 - p)^{n - k}. We then sum up the wrong cases to get P(\text{ensemble is wrong}) = \sum_{k = \ceil{n / 2}}^n P(k \text{ hypotheses are wrong}) = \sum_{k = \ceil{n / 2}}^n {n \choose k} p^k (1 - p)^{n - k}.

One way to improve this is to weight hypotheses differently, for example by reducing the weight of correlated hypotheses (because their outputs would be overrepresented in the output) and increase the weights of high-performing hypotheses. This is a technique called **boosting** - we weight each hypothesis, and weight each point in the training set. Whenever we test a single hypothesis on a single training set point, we increase the weight of that training set point and decrease the weight of that hypothesis if it was misclassified, and decrease the weight of the training set point and increase the weight of that hypothesis otherwise. The other hypotheses will then have to do better on those misclassified ones in order to perform well.

;wip: adaboost algorithm

AdaBoost is very powerful and often used in industry. It takes in a weak learning algorithms and returns a hypothesis that classifies with accuracy 100% as the size of the ensemble increases.

Boosting means we don't need to learn perfect hypotheses. It's also easy to implement, generalizes well to the testing set (reduces overfitting), and is widely applicable to any weak learning algorithm. This is used everywhere from sensor fusion to radio decoders.

Machine learning is essentially about approximating real-world functions, such as the temperature of a region, the next frame of a video, and so on. Real-world functions can be extremely complex, so we need to design our machine learning methods to handle arbitrary relationships, run efficiently, and avoid overfitting.

Neurons in the brain are cells that act like wires, memory, and processors, and the brain is a densely connected network of neurons. Neurons receive chemical signals from their soma, perform rudimentary functions on those incoming signals in the soma/body (such as "10 or more signals received in last second"), and then might fire a signal down its axon, where it is broadcast chemically via synapses. The function that occurs in the soma isn't very well understood, but the rest of the process is, more or less. Fundamentally neurons are really simple, and can approximate arbitrarily complex functions if we use enough of them.

Biological neurons are either firing or not firing - the information content is encoded in the frequency and timing of the firings. When the brain learns, the connection between synapses and other neurons' dendrites are chemically altered to be more or less efficient. Artificial neural networks are generally based on really early, simplified models, though biologists will often use more biologically accurate neural networks based on models from theoretical neuroscience.

A typical artificial neural network models each neural as follows: each neuron has a weight vector w, and outputs a value g(\vec w \cdot \vec x), where \vec x is the vector of values from some other neurons and g(v) is the activation function (e.g., \tanh(v) for hyperbolic tangent, \max(0, v) for ReLU, \frac{1}{1 + \exp(v)} for sigmoid, or \begin{cases} 1 &\text{if} v \ge b \\ 0 &\text{otherwise}\end{cases} for threshold where b is a constant).

Suppose we wanted the logical AND of two Boolean inputs x_1, x_2 using a single neuron with threshold activation. In other words, the neuron should fire if and only if x_1 = x_2 = 1. We can write our goal as a system of inequalities w_1 + w_2 + b \ge 0, w_1 + b < 0, w_2 + b < 0, b < 0, one equation per truth table entry.

A feedforward network is a directed acyclic graph of neurons. A recurrent network is a directed cyclic graph of neurons. Feedforward networks are fully described by their structure and weights, while recurrent networks additionally have their previous neuron outputs as state - this makes them harder to train but capable of remembering past inputs.

Input units are neurons that get their inputs directly from the network inputs. Hidden units are neurons that get their inputs entirely from other neurons, and give their output entirely to other neurons. Output units are neurons that give their output directly to the network outputs.

A perceptron is a single-layer feedforward network, where all neurons get their inputs directly from the network inputs and their outputs are given directly to the network outputs, and all neurons use the threshold activation function. The error/cost/loss of a perceptron is E = \frac 1 2 \sum (y_i - g(\vec w \cdot \vec x_i))^2, where \vec x is the network inputs and g(v) is the network outputs (the outputs of the activation function from each neuron). We use squared loss because we'd rather have lots of small errors than some large errors.

We train neural networks using local search techniques. The most commonly used one is gradient descent - we take our loss function E, take its derivative with respect to the weights \Delta E = \frac{\dee E}{\dee \vec w}, and then update those weights according to \vec w \leftarrow \vec w - \alpha \Delta E, where \alpha is the learning rate.

For example, for the perceptron we have \Delta E = \frac 1 2 \frac{\dee}{\dee \vec w} \sum (y_i - g(\vec x_i))^2 = \sum (y_i - g(\vec x_i))(-\frac{\dee g}{\dee v} (\vec w \cdot \vec x)) \vec x.

;wip: catch up on this class, the one before statistical learning

Recall that agents model uncertainty in the world and the utility gained from different action plans, and that Bayes nets are one way to represent agent knowledge.

Bayes nets grow exponentially. In the past, some Bayes nets were painstakingly constructed by hand, such as the Pathfinder network for diagnosing lymph node diseases. The main bottleneck here is acquiring knowledge from experts, which is very expensive.

In contrast, data is often a lot cheaper. What if we build models (such as Bayes nets) directly from data?

Suppose we have a bag of candy with an unknown ratio of two different flavours (the ratio is known to be either 25%, 50%, or 75%, however). After eating k candies, can we estimate the flavour ratio? Can be predict the next candy's flavour?

With Bayesian learning, we then have three hypotheses: H_1 "the flavour ratio is 25%", H_2 "the flavour ratio is 50%", and H_3 "the flavour ratio is 75%". When we take a piece of candy out, we get evidence d, telling us a particular piece of candy has a particular flavour.

We also have our prior probabilities P(H_1), P(H_2), P(H_3) for each of the hypotheses being true, any and all previous knowledge about the probability of each hypothesis being true. We also have the likelihood of seeing a particular piece of evidence d given any hypothesis, P(d \mid H_1), P(d \mid H_2), P(d \mid H_3).

Suppose we took 10 candies out, and they were all the first flavour. If we assume the candy distributions are independent and identically distributed (this is roughly accurate for a large bag of candy), we can say that the likelihoods are P(d \mid H_1) = 0.25^{10}, P(d \mid H_2) = 0.5^{10}, P(d \mid H_3) = 0.75^{10}. We then update our priors that each probability is correct using P(H_1 \mid d) = P(d \mid H_1) ;wip.

Predictions are essentially weighted averages over all of the hypotheses, where the weight is how likely we think the hypothesis is to be true. The predicted hypothesis at any point is the one that has the highest P(H \mid d) value so far. It's mathematically guaranteed to be the best guess we can make given the evidence available (it'll be correct more often than any other predictor). Not only that, but we can also perform regularization by adjusting priors - simply reduce the prior probability of each hypothesis by its complexity.

This is usually intractable if there are a lot of hypotheses. We can approximate this using MAP (maximum a posteriori estimation): instead of predicting based on the prediction of every hypothesis, we only predict based on the most probable hypothesis. In other words, h_{MAP} = \argmax_{H_i} P(H_i \mid d) (the hypothesis that maximizes the probability that the evidence occurs), and then only updates/predicts based on that most likely hypothesis. MAP prediction is less accurate than Bayesian, but it's guaranteed to converge to the same answer as Bayesian learning given enough evidence. Additionally, we can still do regularization by reducing prior probabilities according to their hypothesis complexity.

However, MAP may still be intractable if the \argmax can't be efficiently computed. ;wip: rest of MAP computation slide, and linearizing the values and then solving it using linear programming on the log likelihood function \ln P(H) + \sum_i \ln(P(d_i \mid H)).

If we don't have any priors, we still need to assume something to start with. In this case, we usually use a **uniform prior**, as if every hypothesis is equally probable. If we use a uniform prior, then the only thing that matters is the evidence. In this case, this is called maximum likelihood learning - doing \argmax_{H_i} P(d \mid H_i) rather than \argmax_{H_i} P(H_i) P(d \mid H_i), because every P(H_i) is equal. This tends to overfit, since there's no priors to penalize complex hypotheses. Maximum likelihood learning is more tractable than MAP learning, because we don't need to keep track of priors anymore.

Maximum likelihood learning, and to a lesser extent MAP learning, tend to jump to conclusions quickly, because they take an idea and tend to run with it until it fails enough. However, both will eventually converge to the same answer as Bayesian learning.

For the candy example but allowing arbitrary ratios (rather than just the 3 ratios presented), the maximum likelihood learning gives the maximum likelihood estimator that says "the true ratio of candies is most likely to be the actual ratio we've observed". We can compute this by constructing the likelihood function, then finding the value that maximizes this likelihood function.

These algorithms can be extended to arbitrary Bayes nets. One issue to be aware of is that if we don't see an instance of a particular class, we set that class's probability to 0, which causes a lot of problems for Bayes nets (zero probabilities eliminate entire trees in Bayes nets). We can get around this with Laplace smoothing - start counting from a small number instead of from 0. For example, if we're predicting English word frequencies in a text document, we're unlikely to see all English words, so we would predict 0 probability for any unseen words. With Laplace smoothing, we instead pretend as if we've seen every English word once, and then count starting from that.

A naive Bayes model is a special case of a Bayes net where we assume that every attribute is independently distributed. This is often not a valid assumption, but it works pretty well in practice. Naive Bayes nets require much fewer probabilities because all the conditional probabilities are equal to their corresponbding unconditional probabilities, of which there are much fewer.

Naive Bayes scales well (i.e., linearly) as the number of features increases, and performs really well in practice, often even in situations where the independence assumption doesn't hold very well. We often use it for things like text classification, where high performance is needed.

Common use of Naive Bayes include spam detection, plaigerism detection, and search engines. The information we're given is a set of documents, and the goal is to classify those documents into two classes (e.g., spam vs. not spam, plaigerized vs. not plaigerized, relevant vs. not relevant). In these situations it's called a bag-of-words model: for each document, we just stem words, remove stopwords, and perform a word count.

We then assume words are independent fo each other - order of words doesn't matter, co-occurrences of words, don't matter (bad assumption, but works decently in practice). Under this assumption, for a document d_i and a class y, the probability of d_i being in y, P(y \mid d_i), is the product of all the probabilities P(w_j \mid y) = \frac{\text{number of documents with class } y \text{ that contain the word } w_j}{\text{number of documents with class } y} for each word w_j in d_i.

To avoid zero probabilities, we apply Laplace smoothing by initializing word counts to 1 instead of 0.

So far we've had access to all the attributes in our data. In real life this is rarely the case. Consider a Bayes net for diagnosing diseases. The Bayes net can see symptons and suggest treatments, but it wouldn't be able to access the underlying diseases.

If we ignore hidden variables, the Bayes net would have to directly associate symptons with different treatments, which massively increases the size of the Bayes net and the number of probabilities we need. If we ignore problem instances with missing values, we might end up without any values (and probably a lot of bias as well). In real life, we need to be able to consider hidden variables.

In direct ML on Bayes nets, we find the maximum likelihood estimate h_{ML} = \argmax_h P(E \mid h) = \argmax_h \prod_i \mathrm{CPT}(V_i) of the hidden variables given the evidence E. Usually, we would solve this by introducing a log, so h_{ML} = \argmax_h \sum_i \ln \mathrm{CPT}(V_i) - the linearized version can be maximized more easily.

When we have hidden variables Z, we instead have h_{ML} = \sum_Z \argmax_h P(E, Z \mid h). We can't use the log trick here anymore, because of the sum. Instead, we use **expectation maximization** (EM): guess the values of the hidden variables (using the expected value), and then compute the new maximum likelihood hypothesis based on those guesses. We then repeat the process until we converge to an approximation of the true maximum likelihood hypothesis:

- Make a guess for h_{ML}, the maximum likelihood hypothesis.
- Repeat until convergence:
- Based on h_{ML}, compute the expected value of any hidden/missing variables.
- Based on the expected value of those hidden/missing variables, compute a new h_{ML}.

Formally, we can write this as h_{i + 1} = \argmax_h \sum_{Z} P(Z \mid h_i, e) \ln P(e, Z \mid h_i). In practice, we would be computing the expectation and the maximization steps separately, even though this formula shows them occuring together. ;wip: check this formula

Expectation maximization is nice because we can linearize P(e, Z \mid h) (which is the product of bunch of variables) by moving the \ln inside the product and turning it into a summation - this makes it a lot easier to compute.

Expectation maximization guarantees that P(e \mid h_{i + 1}) \ge P(e \mid h_i). ;wip: examples from slides

So far we've looked at what we can do with AI. Today we look at what we should do with AI.

- Autonomous vehicles:
- Accessibility.
- Trolley problem.
- We have to solve the hardest problem first (mixed human/AI drivers on roads) before we can make it easier (AI drivers only).
- Accounting for local conventions, even if those conventions are illegal (e.g., the Pittsburgh left, driving slightly above the speed limit).
- Who's responsible for inevitable accidents?
- Automation of employment:
- Technical unemployment.
- Self-driving cars could eliminate 3.5 million trucking jobs.
- Militarization of AI:
- UN "lethal autonomous weapon systems".
- Human in the loop systems, and future autonomous weapons skipping that step.
- Social aspects of AI:
- Explainability, transparency, bias in AI.
- Training data encodes our current biases, for example predictive policing that's based on current policing practices, which would encode current policing biases into the model. Bias is introduced when biased people make different decisions when building models.
- AI safety:
- Avoid unintended behaviours.
- Asimov's three laws - many of Asimov's stories are about how these simple, seemingly reasonable laws go wrong.
- Humanity and AI.
- Threats to privacy and human dignity:
- AI can be used to perform mass surveillance, where it was previously infeasible for humans to read so many messages.
- AI rights, strong AI.
- Superintelligence.

Tips for final exam:

- Reinforcement learning, multi-agent systems, bayes nets, hidden markov models, machine learning. Focus on multi-agent systems (both profs work on them for their research).
- "How would you use AI techniques to solve problem X?".
- Same format as midterm, but focusing on post-midterm material.
- Ask lead TA if you have questions.

Notes from exam review session by yours truly.

- search:
- algorithms: DFS, BFS, A
*, IDA*, SMA* - completeness (always finds an answer)
- optimality (always finds best answer)

- algorithms: DFS, BFS, A
- constraint satisfaction:
- backtracking
- ordering: most constrained variable (least remaining values), most constraining variable (most unsatisfied constraints references), least constraining value (precludes least number of other variable values)
- filtering: forward checking, arc consistency

- local search: hill climbing, simulated annealing, genetic algorithms
- adversarial search:
- computing minimax utility bottom-up
- minimax search (dfs through minimax tree, possibly where minimax utility values are estimated after a certain depth)
- alpha-beta pruning (stop looping through the children of a node when we're guaranteed that one option is guaranteed win in minimax)
- estimating minimax utility when the tree is too large to expand, singular extensions (keep expanding past depth limit for especially promising options)
- for stochastic games: chance player and the expectiminimax tree (chance player takes average of all children weighted by their probabilities)

- utility theory:
- preference ordering, lotteries (mapping from outcomes to probabilities of those outcomes; a distribution over outcomes)
- decision problems under certainty/uncertainty: a set of actions D, a set of action outcomes S, a mapping from actions to outcomes/lotteries f, and a utility function U
- policies: mapping from outcomes to actions to take at those outcomes (these are better than sequences of actions because they allow us to incorporate new info available to us after taking each action)
- computing policies from expected utility tree (bottom up, from leaf nodes using dynamic programming)

- markov chains:
- markov property: distribution of next states depends only on current state, not how we got to current state
- markov transition matrix: matrix P where P_{ij} is probability of going into state j given that we're currently on state i, rows and columns represent outcomes/states
- markov problem: set of states S, state rewards R, time discount factor \gamma, transition probability function (probability of going to state j given that we're on state i and just took action k)
- expected utility: reward at current state plus (expected utility of each possible next state, weighted by probability of reaching that state) multiplied by time discount factor
- solving markov problems: goal is to find policy that maximizes expected utility
- value iteration: repeatedly compute maximum expected discounted future reward for each action for 1 iteration, 2 iterations, etc. until reward value converges (dynamic programming)
- bellman's equation: V^{t + 1}(s_i) = \max_k \set{r_i + \gamma \sum_j P_{ij}^k V^t(s_j)} where P_{ij}^k is probability of going to state j from i when taking action k (this equation is suitable for solving wih dynamic programming, or linear system solver due to being linear equation)
- partially-observable MDPs: state for MDP replaced by distribution over states called belief states, solutions are policies mapping belief states to actions

- reinforcement learning:
- learning actions that maximize expected discounted reward, difficult because of time-delayed reward
- problem formulation: states S, actions A, rewards R (usually delayed from actions by multiple steps), time is usually discretized
- goal is to find optimal policy, just like MDPs but we initially don't know the transition probability function or the rewards, can only execute actions and observe results later
- model-based (learns transition probability function and rewards to compute policy) vs. model-free (learns policy directly)
- passive (given a policy, evaluate it) vs. active (output actions, and given action results, output policy)
- on-policy reinforcement learning (behaviour while training converges to the final learned policy) vs off-policy reinforcement learning (behaviour while training can always differ from final learned policy)
- explore (try new unknown actions) vs. exploit (take current best action)
- boltzmann exploration: choose best action with probability \frac{\exp(Q(s, a)/T)}{\sum_a \exp(Q(s, a)/T)} and random action otherwise (sort of like lowering temperature in simulated annealing)
- RL techniques:
- direct estimation: run a policy many times, average utility values from each run to get expected utilities
- adaptive dynamic programming: solve as per MDPs using bellman's equation - we always have r(s_i), and we can estimate P_{ij}^k by the fraction of times performing action k at state i moved us to state j, and with these we have enough data to perform the update
- temporal difference: use observed transitions to update expected utilities as per Bellman's equation V^\pi(s_i) \leftarrow (1 - \alpha) V^\pi(s_i) + \alpha(r(s_i) + \gamma V^\pi(s_j)), decrease \alpha over time to ensure convergence, \pi is the current policy
- Q-learning: replace V^\pi(s_i) with Q(s, a) to get Q(s_i, a) = (1 - \alpha) Q(s_i, a) + \alpha(r(s_i) + \gamma \sum_{s_j} \max_{a'} Q(s_j, a')) - current reward plus discounted expected Q value - this is like temporal difference except more suitable for implementation
- SARSA is on-policy learning: policy maps each state to the action with highest Q value, initialize Q, s, a, take action a then record s' and choose a' from policy, update Q values per Q(s, a) \leftarrow Q(s, a) + \alpha(r + \gamma Q(s', a') - Q(s, a)), set a = a', s = s'
- Q-learning is off-policy learning: at each step do arbitrary action a, update Q with same formula as SARSA, go to new state s = s' (with boltzmann exploration, this converges as we visit states infinitely often and temperature and learning rate go to 0)

- bayes nets:
- bayes rule: P(A \mid B) = P(B \mid A) P(A) / P(B), aka posterior = likelihood * prior / normalization_constant (for Bayesian inference, posterior is P(hypothesis evidence))
- posterior joint distribution: table of P(Y \mid E_1, \ldots, E_n) values where each dimension i of the table is possible values of E_i
- conditional independence: P(A \mid B, C) = P(A \mid C) means A and B are conditionally independent under C - knowing B doesn't change our belief in A if we already know C
- independence: P(A \land B) = P(A) P(B) means A and B are independent (this is used to make joint probability tables smaller, since we only need to know P(A) and P(B), rather than every combination of values for P(A \land B))
- bayes net: digraph where nodes are events and edges A \to B are conditional dependence "B depends on the value of A" (each node is directly dependent only on the nodes that are directly connected into it!), also every node includes a conditional probability table P(the node first parent of this node, second parent of this node, ...)
- sibling and cousin nodes are conditionally independent of each other given their shared ancestor, non-connected nodes are independent
- nodes are conditionally independent of every non-decendent given its parents
- markov blanket: nodes are conditionally independent of all other nodes given its parents, children, and children's parents
- factor: function over a set of random variables, or a table for each combination of values of those variables, where last variable is conditional on rest (f(a, b, c, d) is table for P(D \mid A, B, C))
- a set of nodes E d-separates nodes X and Y if removing every node in E from the bayes net disconnects X and Y - X and Y are conditionally independent given E
- factor rule: f(A, B) \times g(B, C) is table where each row's value is the product of the corresponding rows in f and g - abc = f(a, b) g(b, c)
- relevant nodes: given evidence (nodes whose values are known) and a query (nodes whose values we're trying to compute), the nodes that can affect the query node's value if we knew their values

- stochastic processes:
- used to represent probabilistic inference over time, where distributions of variables can change over time
- one-variable stochastic process can be represented with a bayes net where each node represents the variable at a certain time-slice and nodes are pointed to by all older nodes
- stationary assumption: state probabilities don't change over time
- markov assumption: state probabilities depend on a fixed number k of past states (a k-order markov process is one where state probabilities depend on the last k states), this lets us specify the entire process using observations from a finite number of time slices
- we generally can't directly observe our state (hidden states), sensors reduce uncertainty about current state

- hidden markov model (of order k):
- set of states S, set of observations O, transition model P(s_t \mid s_{t - 1}, \ldots, s_{t - k}), observation model P(o_t \mid s_t), prior P(s_0)
- example of first-order hidden markov model: S is robot's coordinates, O is distances to nearby objects, transition model is movements of robot (including uncertainty about where we'll end up), observation model is distance measurements (including uncertainly from sensor noise), model is trying to compute P(s_t \mid o_1, \ldots, o_t) (position given all sensor measurements)
- common tasks: monitoriing P(s_t \mid o_1, \ldots, o_t) (e.g., robot localization), prediction P(s_t \mid o_1, \ldots, o_t) (stock trading), hindsignt P(s_{t - k} \mid o_1, \ldots, o_t) (crime scene investigation), explanation \argmax_{s_1, \ldots, s_t} P(s_1, \ldots, s_t \mid o_1, \ldots, o_t) (speech recognition, autocorrect)
- for all use cases, solved using variations of variable elimination over the corresponding bayes net, for explanation we can use viterbi algorithm
- dynamic bayes nets: represent states/observations using multiple random variables, rather than just one state variable and one observation variable, which allows us to exploit conditional dependence for better time complexity
- for non-markovian or non-stationary processes, just add state components until it's markovian and stationary again (e.g., if state depends on history length, add history length to the state tuples)

- decision networks:
- networks consisting of random variable nodes (like in bayes nets, includes conditional probability tables, oval shape), choice nodes (where choices are made, based entirely on information available from parents, rectangle shape), and value nodes (contains utility of every combination of parents' values, diamond shape, usually only one in a network)
- decision networks make decisions always in the same order and don't forget previously available info (parents of a decision are also parents of all subsequent decisions)
- policy: a set of mappings from parent assignments (values of parent nodes) to decision values for each decision node
- expected utility of policy: expected utility for all possible assignments of the random variables (average utility for those assignments weighted by probability of those assignments occurring)
- optimal policy can be computed using dynamic programming, just like for bayes nets - at each decision node just pick the highest-expected-utility choice

- game theory:
- a rational agent has beliefs about what other players will do, will perform the action that maximizes its own utility under those beliefs - we study how rational agents behave
- normal form game: set of agents I, set of actions A, action profile/outcomes \tup{a_1, \ldots, a_n} (decisions made by each player), outcome utility U
- dominant strategy: a way of choosing actions that has better expected utility than any other strategy, regardless of other players' actions (necessary but insufficient condition for rational strategy)
- nash equilibrium: an outcome profile \tup{a_1, \ldots, a_n} such that no single agent would change their decision if they knew every other agent was going to choose what they did
- mixed strategy: probability distribution over set of actions
- strategy profile: tuple of probability distributions over set of actions, one entry per player
- mixed nash equilibrium: a strategy profile such that no single agent would change their strategy profile if they knew every other agent was going to use the strategy profile they did
- every game with a finite action set has a mixed nash equilibrium
- we can find nash equilibria of two player zero sum games using linear programming, but is PPAD hard problem in general
- extensive form game: set of agents I, set of actions A, choice nodes H (with associated valid actions/resulting node, player), terminal nodes Z, and utility function (this can be represented as a game tree)
- extensive form game vs. normal form: normal form game is where players all play their moves at the same time, extensive form game is where they take turns, can be written as game tree
- any extensive form game can be written as a normal form game: each player precommits to playing a certain sequence of games
- subgame perfect equilibria: nash equilibria in extensive form game that are also nash equilibria in all subtrees (this exists in every finite extensive form game, find them by computing all equilibria bottom-up)

- mechanism design:
- designing games to account for rational agents to make certain outcomes occur
- mechanism: set of outcomes O, set of agents N (each with private information \theta_i and utility function u_i(o, \theta) and strategy space S_i), outcome function g: S_1 \times \ldots \times S_n \to O
- social choice function: f: \theta_1 \times \ldots \theta_n \to O (e.g., accepts agent votes and outputs winning vote, accepts bids and outputs auction winner)
- mechanism implements social choice function iff there exists a nash equilibrium such that g(s_1(\theta_1), \ldots) = f(\theta_1, \ldots) (equilibria at social choice function's outcome given rational agents)
- direct mechanism: each agent reports one of its possible private information values as its action
- incentive compatible direct mechanism: direct mechanism where there's a nash equilibrium for every agent to tell the truth about its private information
- strategy proof direct mechanism: incentive compatible direct mechanism where that nash equilibrium is equilibrium of the dominant strategy
- revelation principle: if mechanism implements social choice function in dominant strategies, we can transform it into a strategy proof direct mechanism by moving the agents' transformation of \theta_i into the mechanism itself
- gibbard-satterthwaite theorem: under pretty common conditions, social cost functions can only be implemented if they're dictatorial (one agent controls the rest)
- to get around having to be dictatorial, we design mechanisms to make it computationally hard to avoid social cost function outcomes rather than just relying on rationality, and restrict agent preferences
- quasilinear preferences: outcomes are vote result and transfers of money, where money linearly influences utility
- groves mechanisms: mechanisms for quasilinear preferences where money is used to pay agents who didn't get their top choice, and agents pay to get their top choice, results in efficient and strategy-proof mechanisms
- vickrey mechanisms: special case of groves mechanisms, where equilibrium utility of each agent is how much utility they add to the system by being there vs. not being there (e.g., vickrey auction: highest bidder gets item by paying the second-highest bid value to system)

- machine learning:
- inductive learning hypothesis: hypothesis that approximates training set well will also approximate testing set well
- realizable target function: target function (true function that is being approximated) is contained in hypothesis space
- binary decision trees can represent any boolean function
- implementing decision trees: output mode of examples if no attributes left or all examples have same class, otherwise choose best attribute, partition examples along that attribute, and recurse on the two partitions and the remaining attributes
- entropy: \sum_i -P(v_i) \log_2 P(v_i), for binary training set we have P(v_1) = \frac{p}{p + n} and P(v_2) = \frac{n}{p + n}
- choosing best attribute: pick the attribute that has higest information gain - lowers the sum of entropies in the two partitions the most
- overfitting: finding patterns that aren't there (avoid this using regularization and cross validation)
- bagging: train a bunch of models on training set (sampled with replacement) and take majority vote
- boosting: each training set point has a weight, repeatedly train models on training set, but after each one, weight the misclassified examples higher so future classifiers will focus on getting those right more (e.g., adaboost)

- neural nets:
- feedforward, recurrent
- perceptrons: single-layer feedforward network with threshold activation, only one weights matrix, can only learn linear boundary, train by adding or subtracting \alpha \vec x_i whenever \vec x_i is misclassified, as appropriate
- backpropagation: technique for training deeper networks, compute gradient of loss function with respect to weights, subtract gradient from weights to get weights that give lower loss function value
- deep learning: neural networks with many layers (tends to overfit due to huge hypothesis space, hard to interpret, but can learn very complex hypotheses given a lot of data)
- neural network with single hidden layer can approximate any function, but requires fewer neurons if we use more layers

- bayesian learning:
- P(x \mid d) = \sum_i P(x \mid h_i) P(h_i \mid d), as we get more values of d some hypotheses will get better predictions and P(h_i \mid d) will increase - prediction is prediction of every hypothesis, weighted by our belief in that hypothesis being true
- optimal given prior: no other prediction will be correct more often (and prior can be used to regularize - complex hypotheses get lower prior)
- true bayesian learning is usually intractable - instead we might use maximum a posteriori learning, where we only use prediction of the hypothesis with highest P(h_i \mid d) value (this is less accurate but converges to same predictions with enough data, but can still be intractable)
- maximum a posteriori learning: \argmax_h P(h \mid d) = \argmax_h P(h \land d) = \argmax_h P(h) P(d \mid h) = \argmax_h P(h) \prod P(d_i \mid h) = \argmax_h \ln P(h) + \sum \ln P(d_i \mid h), can then use linear programming to compute the argmax value
- maximum likelihood learning: maximum a posteriori learning, but with uniform priors (all P(h) are equal, so we just have \argmax_h \prod P(d_i \mid h)), still converges to same values as MAP and Bayesian but less accurate before converging, and faster to compute because no need for priors
- laplace smoothing: when we have observation \vec x from multinomial distribution, we get maximum likelihood estimate for multinomial distribution means as \hat \theta = \frac{x_i}{n}, but if we laplace smooth \vec x, we get the estimate \hat \theta_i = \frac{x_i + \alpha}{n + \alpha d}
- naive bayes model: assume all x_i are independent of each other, P(C \mid x_1, \ldots, x_n) is proportional to P(C) P(x_1 \mid C) \cdots P(x_n \mid C), then pick C that maximizes P(C \mid x_1, \ldots, x_n) (even though the assumption is rarely true, scales and performs well in practice)
- naive bayes example: document classification, assume each word is independent, P(class_i document_i) is proportional to \prod P(word_i in document \mid class_i) / P(class_i)

- expectation maximization:
- used when some attribute values missing or when learning directly is infeasible
- guess maximum likelihood hypothesis, then repeatedly predict missing attributes values using hypothesis, and use those predicted values to compute new maximum likelihood hypothesis, repeat until hypothesis stops changing
- formally, repeatedly compute: h_{i + 1} = \argmax_h \sum_Z P(Z \mid h_i, e) \ln P(Z \land e \mid h_i) where Z is missing attribute values and e is known attribute values
- monotonically increases P(e \mid h_i) as i increases

This work by Anthony Zhang is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Copyright 2013-2017 Anthony Zhang.