Lecture Notes by Anthony Zhang.

CS486

Introduction to Artificial Intelligence.

Alan Tsang
Section 001
Email: akhtsang@uwaterloo.ca
Website: https://cs.uwaterloo.ca/~klarson/teaching/F17-486/, https://cs.uwaterloo.ca/~akhtsang/cs486.html
Office Hours: Tuesdays 1:15pm to 2:45pm in DC2306B, accessible via the AI Lab
Mondays/Wednesdays 4:00pm-5:20pm

11/9/17

Info on course website, assignments on LEARN, questions on Piazza. 5 assignments worth 40% total, 15% midterm on October 18, 2017, 45% final exam, optional project worth 5% bonus marks. Assignments can be submitted up to 48 hours after deadline.

There's no real agreed-upon definition for AI, but in this course we'll consider it to be the mathematical study of intelligent action in a complex environment. We usually think about 4 different kinds of AI: systems that think like humans, act like humans, think rationally, or act rationally. The definitions usually differ along the dimensions of thinking vs. acting and human behaviour vs. idealized rational behaviour. In other words, we're trying to duplicate what the human brain does (passing the Turing test), and what the human brain should do (acting rationally).

Overview of AI in popular perception: threats of superintelligence, automation displacing jobs, biases encoded in AI models, self-driving cars. Overview of the Turing test, including issues like not being amenable to mathematical analysis, whether it's more important to understand underlying principles than to mimic true intelligence, and whether it's even a good test (we don't make planes by trying to make something so bird-like that it fools a bird). Overview of history of AI, like Turing's work, the Dartmouth meeting, AI winter, and modern computational AI.

Classically, AI was seen to mean general problem solving, like solving chess. More recently, we focus more on specific problems, such as perception, robotics, and deliberative reasoning (thinking about what to do based on available data), with a greater focus on statistical methods, decision theories, and probability.

In this course we'll talk about classic AI problems like search, as well as knowledge representation/reasoning, making decisions under uncertainty, and learning.

An agent is an entity that perceives and acts. We usually evaluate agents based on things like how well they achieve goals and how many resources they consume. Agents exist within an environment, with actuators, sensors, and environmental constraints. Environments can be fully/partially observable (can we see the entire environment at once?), deterministic/stochastic (does the same action always result in the same result?), episodic/sequential (do previous steps affect later steps? in classification no, since previous predictions are independent, but in chess yes, since previous moves are very important), discrete/continuous (chess AI could be considered discrete, whereas controlling a robot arm might be considered continuous), and single/multi agent (is the agent the only one in its environment?).

13/9/17

Search is one of the first topics in AI, used in everything from theorem proving to robotic navigation. Search problems consist of a state space (the world we're searching within), a successor function (given a state, determines possible actions/next states and the cost of getting to those states), a start state (the state at the start of the search), and a goal test (determines whether we've reached the goal, and possibly also information about how close we are to the goal). A path is a sequence of states and operations. A solution is a path from the start state to a state that satisfies the goal test. An optimal solution is a solution of minimal cost.
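As a concrete (hypothetical) sketch of these components, here is a tiny grid-world search problem in Python; the representation and names are assumptions for illustration, not anything from the course:

```python
# A tiny concrete search problem: states are (row, col) cells of a grid, the successor
# function yields (action, next_state, cost) triples, and the goal test checks for a target cell.
GRID_ROWS, GRID_COLS = 4, 4
WALLS = {(1, 1), (2, 1)}  # cells we can't enter

def start_state():
    return (0, 0)

def successors(state):
    row, col = state
    moves = {"up": (row - 1, col), "down": (row + 1, col),
             "left": (row, col - 1), "right": (row, col + 1)}
    for action, (r, c) in moves.items():
        if 0 <= r < GRID_ROWS and 0 <= c < GRID_COLS and (r, c) not in WALLS:
            yield action, (r, c), 1  # each move costs 1

def is_goal(state):
    return state == (3, 3)
```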

Consider a pac-man AI. The state space would be the game map, the successor function would define actions like moving left/right/up/down and how expensive those would be, the start state would be the initial game state, and the goal test would be getting to the end.

Overview of N-queens problem and why it's a search problem, as well as various other types of search problems. For our purposes, search problems won't have elements of chance, continuous states, adversaries, or partial observation - all of our problems will be deterministic, discrete, single-agent, and fully observable.

The world state includes every detail of the environment at a certain point in the search process. For something like pac-man, we might care about pac-man's position, the dots, and the ghosts, but pathfinding might only have current position in its state.

Search problems can be thought of as minimum-cost pathfinding on a graph of world states - we can use algorithms like breadth-first search and A*. The search tree is the tree made of possible paths within the graph of world states. The factors we care about in search algorithms are things like completeness (will it always find a solution?), optimality (does it always find the minimum cost solution?), time complexity, and space complexity. While DFS is light on memory and goes deep quickly (O(b^h) time and O(h) space, where b is arity of tree and h is tree height), cycles in the graph can result in infinite loops, so it's not complete. BFS is complete and can find shallower solutions quickly, but needs a lot of memory to store visited states (O(b^h) space and time where b is arity of tree and h is tree height).
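As a concrete sketch, breadth-first search on an explicit graph might look roughly like this (the adjacency-dict representation is just an assumption for illustration):

```python
from collections import deque

def breadth_first_search(graph, start, goal):
    """BFS on an adjacency-dict graph; returns a path with the fewest edges, or None."""
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                frontier.append(path + [neighbour])
    return None

# Example usage on a small graph.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(breadth_first_search(graph, "A", "D"))  # ['A', 'B', 'D']
```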

Instead, we can use iterative deepening search, which is essentially a mix of both - do DFS with a small depth limit, then repeatedly increase the limit and rerun DFS until we reach the goal. This is both memory-efficient and complete, since the depth limit prevents infinite loops. This sounds wasteful, but it isn't too bad in practice since the number of nodes grows exponentially, so most nodes we visit will be in the later iterations.
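A rough sketch of iterative deepening search under the same assumed adjacency-dict representation:

```python
def depth_limited_dfs(graph, node, goal, limit, path):
    """Depth-limited DFS; returns a path to `goal`, or None if none is found within `limit`."""
    if node == goal:
        return path
    if limit == 0:
        return None
    for neighbour in graph.get(node, []):
        result = depth_limited_dfs(graph, neighbour, goal, limit - 1, path + [neighbour])
        if result is not None:
            return result
    return None

def iterative_deepening_search(graph, start, goal, max_depth=50):
    """Rerun depth-limited DFS with increasing depth limits until the goal is found."""
    for limit in range(max_depth + 1):
        result = depth_limited_dfs(graph, start, goal, limit, [start])
        if result is not None:
            return result
    return None

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(iterative_deepening_search(graph, "A", "D"))  # ['A', 'B', 'D']
```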

Note that all three of these algorithms we've looked at so far are the same besides the order we visit nodes in - they're generic and aren't specialized for specific problems. We can do better in specific problems by taking advantage of their inherent structure.

18/9/17

Assignment 1 will be out this week.

Uninformed search is search that doesn't depend on any knowledge about the problem, like BFS, DFS, or iterative deepening. It's widely applicable, but performs poorly and is computationally expensive. We would like to do informed search instead, where we can take advantage of knowledge we have about the problem to make our searches more efficient.

When searching a graph, we often have knowledge about the merit of a node - some approximation of how good the node is. There are often different ways to define merit for any given problem, such as how much it would cost to use the node in our solution, or how computationally expensive it would be to use it. For example, uninformed search expands nodes based on the distance from the start node, but we might augment this by using a heuristic like Euclidean distance to the goal, if we're searching in a Euclidean space.

A heuristic estimates the distance from any given node to the goal node, such as Euclidean or Manhattan distance. In our search, if node A has a lower heuristic than node B, we might choose node A over B (and also factor in things like how far from the start nodes A or B are).

Best-first search is the most naive informed search. At each step, we expand exactly the node with the lowest heuristic function value - the node that seems closest to the goal. This is not a complete search, since it might get stuck in a minimum-heuristic cycle of nodes (we don't keep track of which nodes we've already visited). Additionally, this is not optimal, since the resulting path isn't always the shortest one - the length of the path taken so far at any given node isn't taken into account.

While best-first search looked at the estimated forward cost (expand the node with the lowest estimated cost from the current node to the goal), uniform-cost search (breadth-first search generalized to edge costs) looked at the backward cost (expand the node with the lowest cost from the start to the current node). We want to combine these two, and this gives us A* search. Essentially, let g(n) be the backward cost for the node n and h(n) be an estimate of the forward cost for the node n. We then repeatedly queue up each visited node's neighbors, always choosing the node in the priority queue with the lowest value of f(n) = g(n) + h(n) to expand next, stopping when we dequeue the goal (the priority queue starts off only having the start node; if it ever empties, there is no solution).
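A rough sketch of A* on a weighted graph, assuming the graph is an adjacency dict of (neighbour, cost) pairs and the heuristic h is supplied as a Python function (both representations are just illustrative assumptions):

```python
import heapq

def a_star(graph, start, goal, h):
    """A* search. `graph[u]` is a list of (v, edge_cost) pairs; `h(n)` estimates the
    forward cost from n to `goal`. Returns a path or None."""
    # Priority queue of (f, g, node, path); starts off with only the start node.
    frontier = [(h(start), 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path  # stop when the goal is dequeued
        if g > best_g.get(node, float("inf")):
            continue  # stale queue entry; a cheaper path to this node was already found
        for neighbour, cost in graph.get(node, []):
            new_g = g + cost
            if new_g < best_g.get(neighbour, float("inf")):
                best_g[neighbour] = new_g
                heapq.heappush(frontier, (new_g + h(neighbour), new_g, neighbour, path + [neighbour]))
    return None
```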

A* is complete - it will give us a valid path for any start/goal vertex in any graph. A* is also optimal if the heuristic is admissible and we're searching in a tree, or if the heuristic is consistent. Admissible means that the forward cost heuristic's value is always less than or equal to the true forward cost. For example, h(n) = 0 is an admissible heuristic for any graph. Proof of optimality:

Let s be the start node and e be the optimal goal node. Suppose A* gives us a non-optimal solution within a tree. Since it's a tree, we must have reached a different ending node, so there must be a non-optimal ending node e'.
Let n be some node along the optimal path from s to e. Clearly, at some point both n and e' must be in the A* priority queue at the same time, and A* must have removed e' over n.
So f(e') \le f(n), i.e., g(e') + h(e') \le g(n) + h(n). Let c(x, y) be the true cost from node x to y. Since e' is a goal node, h(e') = 0, so f(e') = g(e'). Since the solution is non-optimal, g(e') > g(e). Also, since the heuristic is admissible, h(n) \le c(n, e), so f(n) = g(n) + h(n) \le g(n) + c(n, e) = g(e).
So f(e') = g(e') > g(e) \ge f(n), which contradicts f(e') \le f(n). So A* cannot have given us a non-optimal solution within a tree.

Consistent (also called monotone) means that the heuristic never drops by more than the cost of an edge - h(n) \le c(n, n') + h(n') for every successor n' of n, where c(x, y) is the true cost of getting from node x to node y. This makes the heuristic admissible regardless of which node is the goal. Most admissible heuristics will also be consistent, but not all of them.

A* has good performance in practice, but in the worst case needs to visit every node, and keep track of all the backtracking information for every node. The real-world performance of A* depends heavily on the heuristic function.

For example, if we were to use A* to solve a sliding puzzle game (a 3x3 grid of squares with one tile missing, tiles must be slid into a particular pattern), one heuristic we might use is to count the number of squares each tile must move to get to its proper place, assuming tiles can occupy squares regardless of whether there's already a tile there - the sum of the Manhattan distances between each tile and its destination. A somewhat worse heuristic would be to count the number of tiles not in their correct places.
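For instance, that Manhattan-distance heuristic might be computed roughly like this (the state representation here is an assumption: a length-9 tuple read row by row, with 0 as the blank):

```python
def manhattan_heuristic(state, goal):
    """Sum of Manhattan distances from each tile to its goal square.
    `state` and `goal` are length-9 tuples, with 0 representing the blank."""
    total = 0
    for index, tile in enumerate(state):
        if tile == 0:
            continue  # the blank doesn't count
        goal_index = goal.index(tile)
        row, col = divmod(index, 3)
        goal_row, goal_col = divmod(goal_index, 3)
        total += abs(row - goal_row) + abs(col - goal_col)
    return total

goal = (1, 2, 3, 4, 5, 6, 7, 8, 0)
print(manhattan_heuristic((1, 2, 3, 4, 5, 6, 0, 7, 8), goal))  # 2
```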

To design a heuristic, we generally start by removing some constraints of the problem, finding a computationally cheap solution to the relaxed problem (or precompute a database of solutions for those simplified problem instances), and then use that as the heuristic. For example, for the sliding puzzle game, we started by relaxing the requirements that there only be one tile per square, and then trivially found a solution by simply moving the tiles to their correct places. There's often a big tradeoff between the cost of computing the heuristic, versus how good the heuristic is.

We often want to use A* while also limiting how much memory we use. One way to do this is using Iterative Deepening A*, which is sort of like iterative deepening search, but we consider nodes by order of their f(n) value, limit the value of f(n), and progressively rerun the search with increasing f(n) limit until the goal is reached. Another way is simplified memory-bounded A*, which simply starts dropping nodes from the priority queue with the highest f(n) values (or oldest nodes, as a tiebreaker) when it's out of memory (this is still optimal/complete if distance to shallowest goal node is less or equal to memory size).
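A sketch of iterative deepening A*, reusing the same assumed weighted-graph and heuristic representation as the earlier A* sketch:

```python
def ida_star(graph, start, goal, h):
    """Iterative deepening A*: depth-first search bounded by f = g + h, with the bound
    raised to the smallest exceeding f-value after each failed pass."""
    def search(node, g, bound, path):
        f = g + h(node)
        if f > bound:
            return f, None  # exceeded the bound; report this f-value as a candidate new bound
        if node == goal:
            return f, path
        minimum = float("inf")
        for neighbour, cost in graph.get(node, []):
            if neighbour in path:
                continue  # avoid cycles along the current path
            t, result = search(neighbour, g + cost, bound, path + [neighbour])
            if result is not None:
                return t, result
            minimum = min(minimum, t)
        return minimum, None

    bound = h(start)
    while True:
        bound, result = search(start, 0, bound, [start])
        if result is not None:
            return result
        if bound == float("inf"):
            return None  # no solution
```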

20/9/17

Assignment 1 due October 4.

Constraint Satisfaction

Informed search uses domain specific knowledge to make search faster. Constraint satisfaction problems (CSPs) use knowledge about the current state itself to help us find solutions. In a standard search problem the states are black boxes, with a goal test and successor function to deal with them. A constraint satisfaction problem has states defined by variables X_1, \ldots, X_n whose values are from a domain D, and the goal test is a set of constraints that tell us about what values a subset of the variables S \subseteq \set{X_1, \ldots, X_n} must take on. We can then exploit knowledge of these constraints to improve our performance.

Consider the map coloring problem: coloring a map such that no adjacent regions have the same color. Here, the constraints are that adjacent regions cannot have the same color, X_1, \ldots, X_n is the color of each region, and the domain might be D = \set{\text{red}, \text{green}, \text{blue}}. We want to find a solution, which is an assignment of values to X_1, \ldots, X_n satisfying all the constraints. Another famous constraint solving problem is 3-SAT, which tries to find boolean values that satisfy a set of constraints in a particular form.

We distinguish between finite constraint satisfaction problems like 3-SAT, and infinite ones like linear systems. We also care about the arity of the constraints - for example, binary constraints operate over two variables, ternary over three, and so on. There's also soft constraints, which don't have to be satisfied but are preferred if possible - this forms a constrained optimization problem.

Binary constraints can be represented by a constraint graph, where nodes are variables and edges are constraints. Therefore, we can use standard search algorithms to solve these binary constraint satisfaction problems. For example, for the map coloring problem, all the constraints are binary, so the constraint graph is just the map's adjacency graph, and we can perform a search - states are partial assignments of colors, the initial state is the uncolored map, the successor function colors one currently uncolored region, and the goal is that all colors are assigned and no two adjacent regions have the same color.

In general, we can solve a CSP using search: states are partial assignments of variables, initial state is that no variables are assigned, successor function assigns a value to a variable that isn't currently assigned, and the goal is that all variables are assigned and all constraints are satisfied.

CSPs are commutative - the order that we assign values to variables doesn't matter. This can be used to significantly reduce our search space - at each node in the search tree, we only have to consider the value of a single variable.

Backtracking search is the main way to solve general CSPs: select an unassigned variable X_i, try every possible value of X_i to make sure they satisfy constraints. If there's a value v that does, set X_i = v and repeat the whole process, and if there is no such v, go back to the preceding variable X_j and keep trying more values of X_j.
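A minimal sketch of backtracking search, using map colouring with binary not-equal constraints as the example (the representation is an assumption for illustration):

```python
def backtracking_search(variables, domain, neighbours, assignment=None):
    """Assign a value from `domain` to every variable so that no two neighbouring
    variables share a value. Returns a complete assignment dict, or None."""
    if assignment is None:
        assignment = {}
    if len(assignment) == len(variables):
        return assignment
    # Select an unassigned variable.
    var = next(v for v in variables if v not in assignment)
    for value in domain:
        # Check the binary constraints against already-assigned neighbours.
        if all(assignment.get(n) != value for n in neighbours[var]):
            assignment[var] = value
            result = backtracking_search(variables, domain, neighbours, assignment)
            if result is not None:
                return result
            del assignment[var]  # backtrack and try the next value
    return None

# Example: a tiny map with three mutually adjacent regions.
variables = ["A", "B", "C"]
neighbours = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}
print(backtracking_search(variables, ["red", "green", "blue"], neighbours))
```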

Backtracking is basically the same as DFS, but there are a lot of ways to improve on it:

;wip: slides

Also, it turns out that if the constraint graph is a tree, then we can solve the CSP in O(nd^2), where n is the number of nodes and d is the number of values in the domain. To do this, we use topological sort to get an ordering such that parents occur before their children.
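A possible sketch of that tree-structured CSP algorithm (the data representation, with an explicit parent map and a `consistent` predicate, is an assumption): do a backward pass making each parent arc-consistent with its child, then a backtrack-free forward pass of assignments.

```python
def solve_tree_csp(order, parent, domains, consistent):
    """Solve a tree-structured binary CSP in O(n d^2). `order` lists variables with parents
    before children, `parent[v]` is v's parent (None for the root), `domains[v]` is a list
    of candidate values, and `consistent(parent_value, child_value)` checks the constraint."""
    domains = {v: list(vals) for v, vals in domains.items()}
    # Backward pass (leaves first): prune parent values with no consistent child value.
    for child in reversed(order):
        p = parent[child]
        if p is None:
            continue
        domains[p] = [pv for pv in domains[p]
                      if any(consistent(pv, cv) for cv in domains[child])]
        if not domains[p]:
            return None  # no solution
    # Forward pass: assign each variable a value consistent with its parent's assignment.
    assignment = {}
    for v in order:
        p = parent[v]
        if p is None:
            candidates = domains[v]
        else:
            candidates = [cv for cv in domains[v] if consistent(assignment[p], cv)]
        if not candidates:
            return None
        assignment[v] = candidates[0]
    return assignment

# Example: colour a chain A-B-C so adjacent variables differ.
order = ["A", "B", "C"]
parent = {"A": None, "B": "A", "C": "B"}
domains = {v: ["red", "green"] for v in order}
print(solve_tree_csp(order, parent, domains, lambda a, b: a != b))
```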

;wip: slides

;wip: tree widths of 0 and 1 are still trees, but the larger the tree width, the less tree-like the graph is and the worse the decomposition approach will work; also, finding an optimal tree decomposition is NP-hard

25/9/17

So far we've looked at uninformed and informed search. Today we'll look at local search and optimization - searching in very large problems, where we're willing to settle for a non-optimal but workable solution.

Search algorithms so far have systematically explored the search graph, and the path is the solution, but in many cases we don't care about the path. For example, scheduling exams without conflicts (the thing we care about is the final exam schedule, not the path to the schedule), or chip design (routing wires around to minimize chip footprint).

Local search problems tend to have a wide search tree and a cost function that tells us how good a given solution is. Many of these problems are NP-hard, but we just want solutions that are good enough, even if their cost function is high.

For example, for N-queens, we might try to find a solution that removes a lot of the queens from attacking positions, rather than a solution that gets all of the queens out of attacking positions. It turns out that for a random N-queens problem, we can find a solution in almost constant time for arbitrary N with pretty high probability. When there are few constraints vs. variables, it's easy to solve - if we just throw the queens on the board somewhere, it's probably a valid solution. When there are lots of constraints vs. variables, it's easy to see it's infeasible - there are no non-attacking positions left to put queens in. However, when there's roughly the same number of constraints and variables, there are some solutions, but they're hard to find - searching for any solutions becomes very computationally intensive. Local search problems tend to have these "phase transitions", where some cases are really difficult, but most are really easy.

One way we commonly solve search problems is by iteratively improving a random initial state - a greedy hill climbing algorithm often also known as gradient descent. If we get stuck, we restart at another random initial state, and then after a lot of runs, take the best result:

  1. Choose an initial state S with value V(S).
  2. Generate a moveset \delta(S) = \set{S_1, \ldots, S_k}.
  3. If every V(S_i) \le V(S), we've reached a local maximum, so return S.
  4. Otherwise, let S = \argmax_{S_i} V(S_i) and go back to step 2.
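A rough sketch of this hill climbing procedure with random restarts, where the problem supplies a random-state generator, a moveset function \delta, and a value function V (all assumed interfaces for illustration):

```python
def hill_climbing(random_state, moveset, value, restarts=20):
    """Greedy hill climbing with random restarts; returns the best state found over all runs."""
    best = None
    for _ in range(restarts):
        state = random_state()
        while True:
            candidates = moveset(state)          # the moveset \delta(S)
            if not candidates:
                break
            best_neighbour = max(candidates, key=value)
            if value(best_neighbour) <= value(state):
                break                            # local maximum (or plateau): stop this run
            state = best_neighbour
        if best is None or value(state) > value(best):
            best = state                         # keep the best result across restarts
    return best
```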

Hill climbing is easy to implement and light on memory, but isn't complete (when the problem space isn't convex) or optimal.

;wip: slides

Softmax is nice because it handles outliers well.

27/9/17

We now introduce an adversary with conflicting goals. Usually this means a game. Games are well-studied in AI because they're easy to represent, aren't too easy, and are easy to evaluate performance on.

We classify games into categories like perfect vs. imperfect information, deterministic (fully controlled by players) vs. stochastic (element of chance). As a search problem, a 2-player perfect information game has a state defined by the board state and whose turn it is, the successor function is the set of results of every legal move at a given state, the terminal/leaf states are states where players win/lose/draw, and the heuristic is the player's estimated utility function. A solution to the game is a strategy that tells you how to move at every state in the game - a way to get to a goal node from every node in the tree.

This is a challenging problem because there's a malicious opponent, and we need to take that into account with our search. A max-player is trying to maximize its own utility, while a min-player, the opponent, is trying to minimize it. We're trying to find an optimal strategy - a strategy that guarantees an outcome that is as favourable as possible, given that the opponent plays as well as possible (we won't necessarily always win). Note that this assumes an optimal opponent, so we might prefer different strategies when the other player isn't playing optimally.

In this course, we'll mostly focus on zero-sum games (games where one player winning means the other loses) with perfect information. In theory these are easy to solve given infinite time, and most of the challenge comes from making this practically fast.

Consider the centipede game: players take turns incrementing a counter, and on each turn a player can instead decide to end the game, taking all the value in the counter and winning. Each player wants to take the counter to get its value, but also wants to keep going for more turns so the value grows.

We can analyze this using minimax search. We represent the value of each node in a tree as \text{minimax}(s) = \begin{cases} \text{utility}(s) &\text{if } s \text{ is a terminal state} \\ \max\set{\text{minimax}(s') : s' \in \text{successors}(s)} &\text{if } s \text{ is a max-node (our turn)} \\ \min\set{\text{minimax}(s') : s' \in \text{successors}(s)} &\text{if } s \text{ is a min-node (their turn)} \end{cases} (this is also called the security value). Essentially, it represents the worst-case utility of the game starting from the current state, assuming both players are playing optimally (they're both minimaxing).

Minimax search is essentially DFS through this minimax tree. This isn't complete since it doesn't work if the game doesn't terminate, but it has pretty good space complexity.
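A sketch of minimax as a depth-first recursion; the game interface (`is_terminal`, `utility`, `successors`) is an assumed one for illustration:

```python
def minimax(game, state, is_max_turn):
    """Return the minimax (security) value of `state` by exhaustive depth-first search."""
    if game.is_terminal(state):
        return game.utility(state)
    values = [minimax(game, s, not is_max_turn) for s in game.successors(state)]
    return max(values) if is_max_turn else min(values)
```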

If we apply minimax to the centipede game, the logical course of action is to end the game at the first possible turn, yet real people will generally play for longer than that - so the optimal-opponent assumption doesn't always match reality. For chess, the branching factor was empirically determined to be around 35, and the tree depth is around 100, at least. This is impractical to search exhaustively.

For certain types of games, we can do alpha-beta pruning. Essentially, \alpha is the max-player's best (highest) security value guaranteed so far, and \beta is the min-player's best (lowest) security value guaranteed so far. We then skip searching nodes where \beta \le \alpha - the result of that subtree can't change the final decision, because whichever player is choosing higher up in the tree already has a guaranteed option at least as good elsewhere, so the game would never be allowed to reach this node. In a minimax tree, this gives us the same result as an exhaustive DFS.
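A sketch of minimax with alpha-beta pruning, under the same assumed game interface as above:

```python
def alphabeta(game, state, is_max_turn, alpha=float("-inf"), beta=float("inf")):
    """Minimax value of `state`, skipping subtrees that can't affect the final decision."""
    if game.is_terminal(state):
        return game.utility(state)
    if is_max_turn:
        value = float("-inf")
        for s in game.successors(state):
            value = max(value, alphabeta(game, s, False, alpha, beta))
            alpha = max(alpha, value)
            if beta <= alpha:
                break  # prune: the min-player already has a better option elsewhere
        return value
    else:
        value = float("inf")
        for s in game.successors(state):
            value = min(value, alphabeta(game, s, True, alpha, beta))
            beta = min(beta, value)
            if beta <= alpha:
                break  # prune: the max-player already has a better option elsewhere
        return value
```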

This is very helpful - minimax search on chess can look about 5 moves ahead, but alpha-beta pruning gives us enough performance to look up to 10 moves ahead! This still isn't good enough on its own - alpha-beta pruning requires us to go all the way down to terminal nodes in order to get exact values. We can supplement this with a heuristic evaluation function: a fast evaluation function gives the true utility for a terminal state, and an estimate of the expected utility for any other state. There's a lot of AI research in making a good evaluation function using expert knowledge and experience - one that makes good estimates of how much payoff we'll get from a given game state.

We also don't have to go all the way down to the terminal states - we can cut off the search at certain thresholds. We'll usually want to cut off the search when we reach a certain depth, or when the state becomes relatively stable (not really leading towards a win/loss), though we'll often want to give singular extensions - additional depth for moves that seem particularly promising (this helps avoid cases where we'd miss promising options just beyond the cutoff). The parameters for the cutoff also depend on the opponent - a novice chess player might only need 5-move lookahead, while a grandmaster might need 14 or more, with exhaustive search near the endgame and a good evaluation function.

For stochastic games, we treat chance itself as a player, so there's the max-player (us), the min-player (the opponent), and the chance player (the whims of nature), where the chance player makes its moves according to the probabilities labelled on the tree, in between the max-player's and min-player's turns. We then construct an expectiminimax tree and consider it a search problem. The expectiminimax tree is the same as the minimax tree, but with another case: \text{expectiminimax}(s) = \begin{cases} \text{utility}(s) &\text{if } s \text{ is a terminal state} \\ \max\set{\text{expectiminimax}(s') : s' \in \text{successors}(s)} &\text{if } s \text{ is a max-node (our turn)} \\ \min\set{\text{expectiminimax}(s') : s' \in \text{successors}(s)} &\text{if } s \text{ is a min-node (their turn)} \\ E(\set{\text{expectiminimax}(s') : s' \in \text{successors}(s)}) &\text{if } s \text{ is a chance node (nature's turn)} \end{cases}, where E(X) is the expected value, weighted by the chance node's move probabilities.
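A sketch of expectiminimax, assuming the game additionally exposes which player moves at a state and the successor probabilities at chance nodes (again, the interface is an assumption for illustration):

```python
def expectiminimax(game, state):
    """Expected minimax value of `state` in a game with chance nodes."""
    if game.is_terminal(state):
        return game.utility(state)
    if game.is_max_node(state):
        return max(expectiminimax(game, s) for s in game.successors(state))
    if game.is_min_node(state):
        return min(expectiminimax(game, s) for s in game.successors(state))
    # Chance node: expectation over successors, weighted by their probabilities.
    return sum(p * expectiminimax(game, s) for s, p in game.chance_successors(state))
```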

For Go, the branching factor is about 250, and the terminal states are usually up to 150 moves deep. For this we can use techniques like Monte Carlo tree search.

Creative Commons License This work by Anthony Zhang is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Copyright 2013-2017 Anthony Zhang.