Back to Notes

CS 188 Midterm 1 Notes

Probability Review

Random Variables - P(X)

X is a random variable, so it doesn’t have a set value - Takes on a range of values
P(X) defines what probability X may take on a certain value
The distribution of the values of X
Properties:
Distribution may never have negative probabilities
Sums to 1
Notation:
Capital Letters = distribution over Random Variables
Lowercase Letters = actual values
Ex: P(X) vs P(X = +x) vs P(+x)
Joint Distributions
P(X,Y) = distribution of the cross product of X and Y
Defines the probability of ever single x happening with every single y
It is a distribution so sums to 1 and has no negative values
insert picture
Independent Probability Tables
Two random variables, C and N
C = result of flipping a coin where heads has a probability of .8
H = the number of heads that result from flipping a fair coin twice
P(C,N) = P(C)P(N)
The probability of every C with every N is the individual probabilities multiplied together
P(C N) means the probability of C given N
So P(C N) = P(C) since C and N are independent variables
Therefore, P(C,N) = P(C)P(N) since they aren’t correlated
Simple Chain Rule

Identity: P(A,B) = P(A|B)P(B) = P(B|A)P(A)
Conditional Probability, the solution comes from the definition of Joint Distributions
Identity from P(A|B) = P(AB)/P(B) and vise versa
Bayes Rule
P(A,B) = P(A B)P(B) = P(B A)P(A)
Two definitions are equivalent from this!
P(A|B) = P(AB)/P(B) = P(B|A)P(A)/(B) - Flip the conditional probability to get P(B|A)
Lecture 1: Intro to AI

Computational Rationality
Maximize your expected utility
What can AI Do
10 years ago, AI couldn’t play a decent game of jeopardy, win against any human at chess, win against the best humans at Go, play table tennis, grab a cup and put it on the shelf, and translate spoken language into a different language in real time!
In 10 years
Soon, AI may be able to drive perfectly safely on highways, discover and prove new math theorems, write funny stories, and many more! Lecture 2: Uninformed Search Planning Agents vs Reflex Agents
Reflex agents choose action based on current percept (and possibly memory)
Reflex agents do not plan ahead
Planning agents simulate the future based on its understanding of the world Search Problems
Necessary components
State space, successor function, start space, goal test
World state = includes all the details in the real world, including those irrelevant to the search problem
Search state = only includes the information in the real world that are relevant to the search problem
The size of the search state is usually a lot smaller than the size of the world state
Search Tree
State space graph
usually very big and is hard to store in the computer memory
Search tree
progressively covers the relevant part of full state space graph - good replacement for state space graphs
Nodes on a fringe
Partial plans rather than individual states
Ex. S->D, S->E, S->P, S->D->B
Example of a state space graph
Example of Search Tree
Depth First Search (DFS) with Search Tree
Properties
M = num of tiers
In the worst case, it could expand to the whole tree (O(b^m) nodes
The maximum space required on the fringe is O(bm)
It is complete if m is finite Breadth First Search (BFS) with Search Tree
In the worst case, it could expand to the whole tree (O(b^m) nodes
If the shallowest solutions is at level s:
It will expand to at least O(b^s) nodes
The maximum space required on the fringe is O(b^s)
It is complete if m is finite
It is optimal if all costs are 1
Uniform Cost Search (UHS) with Search Tree
It is complete if m is finite
It is optimal if costs are all 1
It is optimal in the general case
Lecture 3: A* Search and Heuristics

Heuristics
A function that estimates how close a state is to a goal
Takes in a state and returns a number
A* Search and Greedy Search
Both use some heuristic function h
Both use a priority queue
Admissible Heuristics
Underestimate the true cost to the goal
Admissible = optimistic!
Example
Manhattan distance between pacman’s location and the goal location
Euclidean distance between pacman and goal -0
Graph Search and Tree Search
Admissible heuristic
A* tree search is guaranteed to be optimal
Consistent heuristic
A* tree search is guaranteed to be optimal
A* GRAPH search is guaranteed to be optimal
Lecture 4: Game Trees I

Adversarial Games
AI can achieve human expert level of performance on checkers, chess, go, tic-tac-toe
Games include an action space, a state space, and a transition function
Similar to search
In zero sum games, there is only one utility value associated with each state
Minimax
Each player assumes that the other player is playing optimally
It is implemented recursively
Alpha Beta Pruning
The purpose of pruning is to ignore some nodes in the tree that are not helpful for decision making
ALPHA = max’s best option from path to root = INFINITY
BETA = min’s best option from path to root = NEGATIVE INFINITY
Depth Limited Search
Having a bad evaluation function causes pacman to “thrash”
Depth matters
How deeply the agent looks ahead to make a decision on the next action
There’s a tradeoff between the complexity of features and complexity of computation
Lecture 5: Game Trees II and Utilities

Expectimax vs Minimax
Minimax trees have two types of nodes: min-value and max-value
Expectimax has max-value and exp-value
Doesn’t have min!
Pruning cannot be performed on expectimax trees
If the node values are not bounded
Evaluation functions can be used in depth-limited expectimax trees to provide an estimate of the true expectimax value
Probability
Multiply the value with the probability that it will happen!
Example:
Suppose that the probability for no traffic (20 min), light traffic (30 min), and heavy traffic (60 min) is now [0.2, 0.2, 0.6]. How would you calculate the expected amount of time to get to the airport?
20∗0.2+30∗0.2+60∗0.6=46 Modeling Assumptions
When an AI agent models its opponent correctly, it is usually capable of achieving good results
When an AI agent models its opponent incorrectly, over-optimism or over-pessimism could negatively affect the AI agent’s decisions
Multi-Agent Utilities
Zero sum games
Assumes a single utility function shared by all players
Two player zero sum games
Can be generalized into a general sum game with two opposite player utility functions
General sum games
Provides the flexibility of giving each player a utility function
Each player maximizes its own component
General utility (one for each player)
Can give rise to cooperation and competition at different parts of the decision making
Intro to Utilities
For minimax reasoning
Only the order matters and the magnitude (scale) does not matter
For expectimax reasoning
The magnitude needs to have meaning since we are taking a weighted average
Utility functions
Gives us a succinct way to express our preferences, as reasonable behavior should arise from maximizing well-defined utility functions Preferences and Rationality
Rationality Examples
Alex prefer A over B, B over C. If Alex is a rational agent, Alex must prefer A over C.
If Bono prefer A over B, B over C, and C over A, then Bono’s preference is intransitive (and thus Bono is irrational).
Human Utilities
The utility of having money is a better utility function (than just the amount of money)
Allais (1953)
The Allais example is an interesting way of showing the paradox of human rationality. When presented with different scenarios of a lottery, many people prefer B>A and C>D. Using the same utility function for both, there is a contradiction that is reached because you can manipulate the utilities for the choices to contradict each other.
Shows that human rationality is less predictable because people generally like to avoid risk, however, if there is little difference in risk, people may choose to risk more to gain more.
Lecture 6: Markov’s Decision Process

Non-Deterministic Search
EXAMPLE: Grid World!
A maze-like problem
The agents live in a grid and there are walls, obstacles, and other agents
Noisy movement
Actions dont always go as planned (stochastic/non-deterministic search)
80% of the time, the action North takes the agent north (if there is no wall there)
10% of the time, north takes the action west, 10% east (not always predictable)
If there is a wall in the direction that the agent is trying to go, the agent will stay put
The agent receives rewards each time step (could be good or bad)
Living cost (-1 per step)
Goal is to maximize the sum of all rewards!
Deterministic search returns a plan (a sequence of actions)
Non-deterministic search returns a policy (a mapping from state to action)
Markov Decision Process
Defined by:
Set of states s s ∈ S (∈ = “is an element of”)
Set of actions a ∈ A
A transition function T(s, a, s’)
Probability that a (action) from s (state) leads to s’ (next state)
i.e., P(s’ s, a)
Also called the model or the dynamics
A reward function (R(s, a, s’)
Sometimes just R(s) or R(s’)
And discount γ if applied
A start state
A terminal state (possibly)
MDPs are non-deterministic search problems
One way to solve them is with expectimax search!
“Markov” = given the present state, the future and past are independent!
The action outcomes only depend on the current state
Just like search, where the successor function only depends on current state (not history states)
Policies
Deterministic single-agent search problems = outputs an optimal PLAN/sequence of actions from start to a goal
MDPs = outputs an optimal POLICY (π*: S -> A)
A policy π gives an action for each state
An optimal policy is one that maximizes an expected utility if followed
An explicit policy defines a reflex agent
Expectimax didn’t compute entire policies
It computed the action for a single state only
Each MDP state projects an expectimax-like search tree
Q-state = (s, a)
A state and an action that will be taken
A q-state is an expected node from taking action a from state s
Each branch is T(s,a,s’) = probability to that state
Utilities of Sequences
More or less?
Now or later?
Discounting
It’s reasonable to maximize the sum of rewards
Its also reasonable to prefer rewards now to later
One solution: values of rewards decay exponentially
How to discount
Each time we descend a level, we multiply in the discount one more time - Sooner rewards have higher utility than later rewards
Helps the algorithm converge
Ex:
U([1,2,3]) = 11 + .52 + .25*3
Therefore, U([1,2,3]) < U([3,2,1])
Stationary preferences
If we assume stationary preferences, there are only two ways to define utilities
Infinite utilities
What if the game lasts forever?
Finite horizon = terminate after fixed T steps, like depth-limited search
Discounting = use 0 < γ < 1
Smaller γ = smaller “horizon” aka shorter term focus
Absorbing state = guarantee that for every policy, there is a terminal state that will eventually be reached (like “overheating” for racing)
Solving MDPs
MDP quantities so far:
Policy = choice of action for each state
Utility = sum of (discounted) rewards
Optimal quantities
V*(s) = expected utility starting in s and acting optimally
Q*(s,a) = maximum expected utility starting out having taken action a from state s, and therefore acting optimally
π*(s) = optimal action from state s
The optimal V value for states further from the only terminal state with positive reward is lower because the terminal reward is discounted more times
V and Q values are connected
V(s) = maxaQ(s,a)
How to calculate the value of the states
Fundamental operation: compute the EXPECTIMAX value of the state
Expected utility under optimal action
Average sum of (discounted) rewards
This is what expectimax computes!
Recursive definition of the value
Start with V(s) = maxaQ(s,a)
This is to define the optimal value as the max of all the Q values below it
Basically defining the maximizer at the top
Then define the chance node, aka the Q value, to be the weighted sum of the reward + the discounted value of the next node
So combine the two and get:
If you were to write one equation for every state in the state space, what would it look like?
A system of equations!
N equations, where n is the side of the state space Value Iteration
Time-limited values!
Define Vk(s) to be the optimal value of s if the game ends in k more time steps
Kinda like expectimax!
Algorithm
Start with V0(s): no time steps left means an expected reward sum of zero
Given vector of Vk(s) values, do one ply of expectimax from each state
The value of the next state is the max over actions of the expected rewards - Repeat this value updating until convergence (Vk+1(s) = Vk(s) for all of the states
Complexity of each iteration: O($S^2A$)
S is the all the states in the state space
A is the amount of actions
Theorem: will converge to unique optimal values because approximations get refined towards those values
Policy may converge long before the actual values do tho
Basically this might find the optimal policy even if all the values haven’t converged yet (because the longer it runs, the smaller the individual value changes are)
How
Case 1
- If the tree has maximum depth M, then VM holds the actual untruncated values

Case 2

If the discount is less than 1
Sketch: for any state Vk and Vk + 1 can be viewed as depth k+1 expectimax results in nearly identical search trees
The difference is that on the bottom layer, Vk + 1 has actual rewards while Vk has zeros
The last layer is at best all RMAX

But everything is discounted by γk at that point so $V_k and $V_{k+1} are at most $γ_k max

$ different (as k increases, the values converge)

Policy Evaluation

Fixed policy
Usually we have to look at all actions and then find the policy
With a FIXED POLICY, we do what π says (the optimal policy) and see what the action is
USE CASE
Expectimax trees max over all actions to compute the optimal values
If we fixed some policy π(s), then the tree would be simpler because we would already know what action to take
Though the tree’s value would depend on which policy we fixed
Goal is to compute the utility of a state s under a fixed (generally non-optimal) policy! - Vπ(s) = expected total discounted rewards starting in s and following π

-
How to calculate the V’s:
- Turn recursive Bellman equations into updates (like value iteration)
- Initializes and then updates based on the new states
Idea 2
- Without the maxes, the Bellman equations are just a linear system
  Policy Extraction
If we have the optimal values V*(s), we need to figure out which action to take, which we can do by taking the mini-expectimax (one-step)
Argmax = outputs the x value at which the result would be the max
Kinda like the “domain” version of regular max, which gives you the max value in terms of the output of the function (aka the value)
So this particular equation would output the action the agent should take based on the most optimal potential value
Policy extraction = gives us the policy implied by the values, which means which action for us to take at every individual state
Computing Actions from Q-Values
Actions are easier to select from Q-values than regular v-values
Q-Values give ratings to each individual action from each square, v-values just give the overall square a value
Q-values:
Basically just choose the argmax of all the actions in the square
Take the best-valued action!
Policy Iteration
Value iteration repeats the bellman updates
This is slow, the “max” rarely changes, the policy often converges long before the values
Value iteration is slow, policy iteration is FAST!
Policy iteration = alternative, faster approach for optimal values
STEPS:
Step 1: Policy evaluation
Calculate utilities for some fixed policy (not optimal utilities!) until convergence
Step 2: Policy improvement
Update policy using one-step look-ahead with resulting converged (but not optimal) utilities as future values
Repeat until policy converges! :) - Evaluation
For fixed current policy π, find the values with policy evaluation and iterate until values converge
Improvement: for fixed values, get a better policy using policy extraction - Using the one-step look-ahead:
Summary
Value iteration ⇒ compute optimal values
Policy iteration ⇒ compute optimal values
Policy evaluation ⇒ compute values for a particular policy
Policy extraction (one-step lookahead) ⇒ turn your values into a policy
They all are kinda similar!
All different variations of bellman updates
All use one-step lookahead expectimax fragments
They differ only in whether we plug in a fixed policy or max over the actions
Lecture 7: Reinforcement Learning I

Motivation for RL
Reinforcement learning = the process of training agents to make sequential decisions
Offline Planning
Solving MDPs is offline planning because you need to know all the details of the MDP to make decisions
You have to determine all quantities through computation rather than playing the actual game
Online Learning
You have to play to figure out what to do ⇒ reinforcement learning!
Key Ideas
Exploration = you have to try unknown actions to get information
Exploitation = eventually you have to use what you know
Regret = even if you learn intelligently, you still make mistakes
Sampling = because of chance, you have to try things repeatedly
Difficulty = learning can be much harder than solving a known MDP
MDP vs RL
Basic idea:
Receive feedback in forms of rewards, which fuels agent’s utility
Must learn to act to maximize these rewards
Learn based on observed samples of outcomes!
Reinforcement Learning has parts similar to MDP:
Set of states s ∈ S
Set of actions per state a ∈ A
Model T(s,a,s’)
Reward function R(s,a,s’)
Looking for a policy π(s)
NEW TWIST!
We don’t know T or R, so we don’t know how good or bad each state is, or what each action does
Instead, we have to try actions and states to learn what works!
Offline vs Online
MDPs vs RL
MDP thinks about what would happen based on its knowledge, RL just dives right into the action and sees what happens
Model-Based Learning
Model-based learning = learns an approximate model based on experiences and solves for values as if the learned model were correct
Aka assumes the learned model is optimal and acts on it!
STEPS
1: Learn empirical MDP model
Count outcomes s’ (aka all the outcomes of the next state) for each state and action (s, a)
Normalize to give an estimate of T(s,a,s’) and discover each R(s,a,s’) as you experience (s,a,s’)
2: Solve the learned MDP
You could use value iteration
Example of Model-Based Learning
Lecture 8: Reinforcement Learning II

Model-Free Learning
Summary

MDP Equations

V*(s) = expected utility starting in s and acting optimally
Q*(s,a) = maximum expected utility starting out having taken action a from state s, and therefore acting optimally
π*(s) = optimal action from state s
The optimal V value for states further from the only terminal state with positive reward is lower because the terminal reward is discounted more times Value Iteration
Repeat until convergence
Vk+1(s) = Vk(s) for all states in States (∀ state ∈ States)
Complexity of each iteration: O(S2A)
Policy Evaluation (performed by Direct Evaluation)

Policy Extraction

Policy Iteration

ML Equations

Temporal Difference Learning (aka Sample-Based Policy Evaluation)

(performed by Direct Evaluation)

Q-Learning

Exploration vs Exploitation

Several schemes for forcing exploration
Like e-greedy: random actions with small e, deterministic on large e
Decrease e overtime to prevent thrashing
f(u, n) acts as a optimistic utility reward to unvisited states with estimated value u and visit count n
Regret
A measure of total mistake cost
To minimize regret = optimally learning to be optimal
The difference between your (expected) rewards and optimal (expected) rewards
Random expiration has higher regret
Feature-Based Representation
Generalize across similar states!
New Q function or value function can be expressed using weights
Approximate Q-Learning

Optimization with Least Squares

Take the derivative of the least squares error to justify why the q update works:

Minimizing Error

Policy Search

Starts with an okay solution (like Q-learning) =, then fine-tunes by hill climbing on feature weights

CS 188 Midterm 1 Notes

Probability Review

Random Variables - P(X)

Joint Distributions

Independent Probability Tables

Simple Chain Rule

Bayes Rule

Lecture 1: Intro to AI

Computational Rationality

What can AI Do

Search Tree

Depth First Search (DFS) with Search Tree

Uniform Cost Search (UHS) with Search Tree

Lecture 3: A* Search and Heuristics

Heuristics

A* Search and Greedy Search

Admissible Heuristics

Graph Search and Tree Search

Lecture 4: Game Trees I

Adversarial Games

Minimax

Alpha Beta Pruning

Depth Limited Search

Lecture 5: Game Trees II and Utilities

Expectimax vs Minimax

Probability

Multi-Agent Utilities

Intro to Utilities

Human Utilities

Lecture 6: Markov’s Decision Process

Non-Deterministic Search

Markov Decision Process

Utilities of Sequences

Discounting = use 0 < γ < 1

Solving MDPs

Then define the chance node, aka the Q value, to be the weighted sum of the reward + the discounted value of the next node

So combine the two and get:

Given vector of Vk(s) values, do one ply of expectimax from each state

Policy Evaluation

Policy Extraction

Computing Actions from Q-Values

Policy Iteration

Summary

Lecture 7: Reinforcement Learning I

Motivation for RL

MDP vs RL

Model-Based Learning

Lecture 8: Reinforcement Learning II

Model-Free Learning

MDP Equations

Policy Evaluation (performed by Direct Evaluation)

Policy Extraction

Policy Iteration

ML Equations

Temporal Difference Learning (aka Sample-Based Policy Evaluation)

Q-Learning

Exploration vs Exploitation

Regret

Feature-Based Representation

Approximate Q-Learning

Optimization with Least Squares

Take the derivative of the least squares error to justify why the q update works:

Minimizing Error

Policy Search