CS 188 Midterm 1 Notes
Probability Review
Random Variables - P(X)
- X is a random variable, so it doesn’t have a set value
- Takes on a range of values
- P(X) defines the probability that X takes on each of its possible values
- The distribution of the values of X
- Properties:
- Distribution may never have negative probabilities
- Sums to 1
- Notation:
- Capital Letters = distribution over Random Variables
- Lowercase Letters = actual values
- Ex: P(X) vs P(X = +x) vs P(+x)
Joint Distributions
- P(X,Y) = distribution of the cross product of X and Y
- Defines the probability of every single x happening with every single y
- It is a distribution so sums to 1 and has no negative values
Independent Probability Tables
- Two random variables, C and N
- C = result of flipping a coin where heads has a probability of .8
- N = the number of heads that result from flipping a fair coin twice
- P(C,N) = P(C)P(N)
- The probability of every C with every N is the individual probabilities multiplied together
- P(C|N) means the probability of C given N
- So P(C|N) = P(C) since C and N are independent variables
- Therefore, P(C,N) = P(C)P(N) since they aren’t correlated
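A quick sketch (not from lecture) of the coin example above in Python; the variable names and table layout are just illustrative:

```python
# Independent joint table: P(C, N) = P(C) * P(N), using the coin example above.
P_C = {"heads": 0.8, "tails": 0.2}   # biased coin flip
P_N = {0: 0.25, 1: 0.5, 2: 0.25}     # number of heads in two fair flips

# Because C and N are independent, every joint entry is just the product.
P_CN = {(c, n): P_C[c] * P_N[n] for c in P_C for n in P_N}

assert abs(sum(P_CN.values()) - 1.0) < 1e-9  # a joint distribution still sums to 1
print(P_CN[("heads", 2)])                    # 0.8 * 0.25 = 0.2
```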
Simple Chain Rule
Identity: P(A,B) = P(A|B)P(B) = P(B|A)P(A)
- This conditional probability identity comes from the definition of joint distributions
- It follows from P(A|B) = P(A,B)/P(B) and vice versa
Bayes Rule
- P(A,B) = P(A|B)P(B) = P(B|A)P(A)
- The two definitions are equivalent from this!
- P(A|B) = P(A,B)/P(B) = P(B|A)P(A)/P(B)
- Flip the conditional probability to get P(B|A)
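A minimal numeric sketch of flipping a conditional with Bayes’ rule; the probabilities here are made up for illustration:

```python
# Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B), with illustrative numbers.
P_A = 0.3
P_B_given_A = 0.9
P_B_given_notA = 0.2

# Denominator by total probability: P(B) = P(B|A)P(A) + P(B|~A)P(~A)
P_B = P_B_given_A * P_A + P_B_given_notA * (1 - P_A)

P_A_given_B = P_B_given_A * P_A / P_B
print(round(P_A_given_B, 3))  # 0.27 / 0.41 ≈ 0.659
```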
Lecture 1: Intro to AI
Computational Rationality
- Maximize your expected utility
What can AI Do
- 10 years ago, AI couldn’t play a decent game of Jeopardy, win against any human at chess, win against the best humans at Go, play table tennis, grab a cup and put it on the shelf, or translate spoken language into a different language in real time!
- In 10 years
- Soon, AI may be able to drive perfectly safely on highways, discover and prove new math theorems, write funny stories, and many more!
Lecture 2: Uninformed Search
Planning Agents vs Reflex Agents
- Reflex agents choose action based on current percept (and possibly memory)
- Reflex agents do not plan ahead
- Planning agents simulate the future based on their understanding of the world
Search Problems
- Necessary components
- State space, successor function, start state, goal test
- World state = includes all the details in the real world, including those irrelevant to the search problem
- Search state = only includes the information in the real world that is relevant to the search problem
- The size of the search state is usually a lot smaller than the size of the world state
Search Tree
- State space graph
- usually very big and hard to store in computer memory
- Search tree
- Progressively covers the relevant part of the full state space graph
- A good replacement for state space graphs
- Nodes on a fringe
- Partial plans rather than individual states
- Ex. S->D, S->E, S->P, S->D->B
- Example of a state space graph
- Example of Search Tree
Depth First Search (DFS) with Search Tree
- Properties
- m = maximum depth of the tree (number of tiers); b = branching factor
- In the worst case, it could expand the whole tree (O(b^m) nodes)
- The maximum space required on the fringe is O(bm)
- It is complete if m is finite
Breadth First Search (BFS) with Search Tree
- In the worst case, it could expand the whole tree (O(b^m) nodes)
- If the shallowest solution is at level s:
- It will expand roughly O(b^s) nodes
- The maximum space required on the fringe is O(b^s)
- It is complete if a solution exists (s must then be finite)
- It is optimal if all costs are 1
Uniform Cost Search (UCS) with Search Tree
- It is complete if the best solution has a finite cost and the minimum arc cost is positive
- It is optimal in the general case (not just when all costs are 1)
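A hedged sketch of graph search where only the fringe changes: a stack gives DFS, a FIFO queue gives BFS, and a priority queue on path cost gives UCS. The tiny graph and costs are made up:

```python
import heapq

def uniform_cost_search(start, goal_test, successors):
    """UCS sketch: the fringe is a priority queue ordered by path cost g(n).
    Swapping the fringe for a stack gives DFS; a FIFO queue gives BFS."""
    fringe = [(0, start, [start])]          # (path cost, state, partial plan)
    closed = set()                          # graph search: never re-expand a state
    while fringe:
        cost, state, plan = heapq.heappop(fringe)
        if goal_test(state):
            return plan, cost
        if state in closed:
            continue
        closed.add(state)
        for next_state, step_cost in successors(state):
            heapq.heappush(fringe, (cost + step_cost, next_state, plan + [next_state]))
    return None, float("inf")

# Made-up graph: successors(s) -> [(next_state, step_cost), ...]
graph = {"S": [("A", 1), ("B", 4)], "A": [("B", 1), ("G", 6)], "B": [("G", 2)], "G": []}
print(uniform_cost_search("S", lambda s: s == "G", lambda s: graph[s]))
# (['S', 'A', 'B', 'G'], 4)
```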
Lecture 3: A* Search and Heuristics
Heuristics
- A function that estimates how close a state is to a goal
- Takes in a state and returns a number
A* Search and Greedy Search
- Both use some heuristic function h
- Both use a priority queue
Admissible Heuristics
- Underestimate the true cost to the goal
- Admissible = optimistic!
- Example
- Manhattan distance between pacman’s location and the goal location
- Euclidean distance between pacman and the goal
Graph Search and Tree Search
- Admissible heuristic
- A* tree search is guaranteed to be optimal
- Consistent heuristic
- A* tree search is guaranteed to be optimal
- A* GRAPH search is guaranteed to be optimal
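The same fringe pattern gives A* if the priority is f(n) = g(n) + h(n); a sketch assuming an admissible (and, for graph search, consistent) heuristic h is supplied:

```python
import heapq

def a_star(start, goal_test, successors, h):
    """A* graph search sketch: priority = g (backward path cost) + h (heuristic estimate).
    With h = 0 everywhere this reduces to UCS."""
    fringe = [(h(start), 0, start, [start])]     # (f, g, state, plan)
    closed = set()
    while fringe:
        f, g, state, plan = heapq.heappop(fringe)
        if goal_test(state):
            return plan, g
        if state in closed:
            continue
        closed.add(state)
        for next_state, step_cost in successors(state):
            g2 = g + step_cost
            heapq.heappush(fringe, (g2 + h(next_state), g2, next_state, plan + [next_state]))
    return None, float("inf")
```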
Lecture 4: Game Trees I
Adversarial Games
- AI can achieve human expert level of performance on checkers, chess, go, tic-tac-toe
- Games include an action space, a state space, and a transition function
- Similar to search
- In zero sum games, there is only one utility value associated with each state
Minimax
- Each player assumes that the other player is playing optimally
- It is implemented recursively
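A minimal recursive minimax sketch; is_terminal, utility, and successors are hypothetical helpers describing the game tree:

```python
def minimax_value(state, is_terminal, utility, successors, maximizing):
    """Recursive minimax: MAX and MIN alternate, each assumed to play optimally."""
    if is_terminal(state):
        return utility(state)
    values = [minimax_value(s, is_terminal, utility, successors, not maximizing)
              for s in successors(state)]
    return max(values) if maximizing else min(values)

# Made-up two-ply tree: leaves are utilities, internal nodes are lists of children.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(minimax_value(tree, lambda s: not isinstance(s, list), lambda s: s,
                    lambda s: s, True))   # 3: MAX takes the best of MIN's replies
```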
Alpha Beta Pruning
- The purpose of pruning is to ignore some nodes in the tree that are not helpful for decision making
- ALPHA = MAX’s best option on the path to the root, initialized to NEGATIVE INFINITY
- BETA = MIN’s best option on the path to the root, initialized to POSITIVE INFINITY
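A sketch of the same recursion with alpha-beta pruning added (same assumed helpers as the minimax sketch above); it returns the same root value while skipping branches that cannot matter:

```python
def alpha_beta(state, is_terminal, utility, successors, maximizing,
               alpha=float("-inf"), beta=float("inf")):
    """Minimax with alpha-beta pruning: stop expanding a node's children once its
    value can no longer affect the choice made higher up in the tree."""
    if is_terminal(state):
        return utility(state)
    if maximizing:
        value = float("-inf")
        for s in successors(state):
            value = max(value, alpha_beta(s, is_terminal, utility, successors,
                                          False, alpha, beta))
            if value > beta:            # MIN above would never let us get here
                return value
            alpha = max(alpha, value)
        return value
    else:
        value = float("inf")
        for s in successors(state):
            value = min(value, alpha_beta(s, is_terminal, utility, successors,
                                          True, alpha, beta))
            if value < alpha:           # MAX above already has something better
                return value
            beta = min(beta, value)
        return value
```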
Depth Limited Search
- Having a bad evaluation function causes pacman to “thrash”
- Depth matters
- How deeply the agent looks ahead to make a decision on the next action
- There’s a tradeoff between the complexity of features and complexity of computation
Lecture 5: Game Trees II and Utilities
Expectimax vs Minimax
- Minimax trees have two types of nodes: min-value and max-value
- Expectimax has max-value and exp-value
- Doesn’t have min!
- Pruning cannot be performed on expectimax trees if the node values are not bounded
- Evaluation functions can be used in depth-limited expectimax trees to provide an estimate of the true expectimax value
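A tiny sketch of an exp-value (chance) node: instead of a min, it takes a probability-weighted average of the child values (numbers match the airport example in the next section):

```python
def exp_value(children):
    """Chance node in expectimax: weighted average of (value, probability) pairs."""
    return sum(p * v for v, p in children)

print(exp_value([(20, 0.2), (30, 0.2), (60, 0.6)]))  # 20*0.2 + 30*0.2 + 60*0.6 = 46.0
```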
Probability
- Multiply the value with the probability that it will happen!
- Example:
- Suppose that the probability for no traffic (20 min), light traffic (30 min), and heavy traffic (60 min) is now [0.2, 0.2, 0.6]. How would you calculate the expected amount of time to get to the airport?
- 20*0.2 + 30*0.2 + 60*0.6 = 46
Modeling Assumptions
- When an AI agent models its opponent correctly, it is usually capable of achieving good results
- When an AI agent models its opponent incorrectly, over-optimism or over-pessimism could negatively affect the AI agent’s decisions
Multi-Agent Utilities
- Zero sum games
- Assumes a single utility function shared by all players
- Two player zero sum games
- Can be generalized into a general sum game with two opposite player utility functions
- General sum games
- Provides the flexibility of giving each player a utility function
- Each player maximizes its own component
- General utility (one for each player)
- Can give rise to cooperation and competition at different parts of the decision making
Intro to Utilities
- For minimax reasoning
- Only the order matters and the magnitude (scale) does not matter
- For expectimax reasoning
- The magnitude needs to have meaning since we are taking a weighted average
- Utility functions
- Gives us a succinct way to express our preferences, as reasonable behavior should arise from maximizing well-defined utility functions
Preferences and Rationality
- Rationality Examples
- Alex prefers A over B and B over C. If Alex is a rational agent, Alex must prefer A over C.
- If Bono prefers A over B, B over C, and C over A, then Bono’s preferences are intransitive (and thus Bono is irrational).
Human Utilities
- The utility of having money is a better utility function than the raw amount of money itself
- Allais (1953)
- The Allais example is an interesting way of showing the paradox of human rationality. When presented with different lottery scenarios, many people prefer B over A and C over D, yet no single utility function can be consistent with both preferences, so the choices contradict each other.
- Shows that human rationality is less predictable: people generally like to avoid risk, but when there is little difference in risk, they may choose to risk more to gain more.
Lecture 6: Markov Decision Processes
Non-Deterministic Search
- EXAMPLE: Grid World!
- A maze-like problem
- The agents live in a grid and there are walls, obstacles, and other agents
- Noisy movement
- Actions don’t always go as planned (stochastic/non-deterministic search)
- 80% of the time, the action North takes the agent north (if there is no wall there)
- 10% of the time, North takes the agent west, and 10% of the time east (not always predictable)
- If there is a wall in the direction that the agent is trying to go, the agent will stay put
- The agent receives rewards each time step (could be good or bad)
- Living cost (-1 per step)
- Goal is to maximize the sum of all rewards!
- Deterministic search returns a plan (a sequence of actions)
- Non-deterministic search returns a policy (a mapping from state to action)
Markov Decision Process
- Defined by:
- Set of states s ∈ S (∈ = “is an element of”)
- Set of actions a ∈ A
- A transition function T(s, a, s’)
- Probability that a (action) from s (state) leads to s’ (next state)
- i.e., P(s’ | s, a)
- Also called the model or the dynamics
- A reward function R(s, a, s’)
- Sometimes just R(s) or R(s’)
- And a discount factor γ, if applied
- A start state
- A terminal state (possibly)
- MDPs are non-deterministic search problems
- One way to solve them is with expectimax search!
- “Markov” = given the present state, the future and past are independent!
- The action outcomes only depend on the current state
- Just like search, where the successor function only depends on current state (not history states)
- Policies
- Deterministic single-agent search problems = outputs an optimal PLAN/sequence of actions from start to a goal
- MDPs = outputs an optimal POLICY (π*: S -> A)
- A policy π gives an action for each state
- An optimal policy is one that maximizes an expected utility if followed
- An explicit policy defines a reflex agent
- Expectimax didn’t compute entire policies
- It computed the action for a single state only
- Each MDP state projects an expectimax-like search tree
- Q-state = (s, a)
- A state and an action that will be taken
- A q-state corresponds to a chance (expectation) node reached after committing to action a from state s
- Each branch has probability T(s,a,s’) of leading to the next state s’
Utilities of Sequences
- More or less?
- Now or later?
- Discounting
- It’s reasonable to maximize the sum of rewards
- It’s also reasonable to prefer rewards now to later
- One solution: values of rewards decay exponentially
- How to discount
- Each time we descend a level, we multiply in the discount one more time
- Sooner rewards have higher utility than later rewards
- Helps the algorithm converge
- Ex:
- U([1,2,3]) = 1*1 + 0.5*2 + 0.25*3 (with γ = 0.5; see the small sketch at the end of this section)
- Therefore, U([1,2,3]) < U([3,2,1])
- Stationary preferences
- If we assume stationary preferences, there are only two ways to define utilities: additive or discounted
- Infinite utilities
- What if the game lasts forever?
- Finite horizon = terminate after fixed T steps, like depth-limited search
- Discounting = use 0 < γ < 1
- Smaller γ = smaller “horizon” aka shorter term focus
- Absorbing state = guarantee that for every policy, a terminal state will eventually be reached (like “overheating” for racing)
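A tiny sketch of the discounted-utility calculation from the γ = 0.5 example above:

```python
def discounted_utility(rewards, gamma):
    """U([r0, r1, r2, ...]) = r0 + gamma*r1 + gamma^2*r2 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_utility([1, 2, 3], 0.5))  # 1 + 0.5*2 + 0.25*3 = 2.75
print(discounted_utility([3, 2, 1], 0.5))  # 3 + 0.5*2 + 0.25*1 = 4.25
```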
Solving MDPs
- MDP quantities so far:
- Policy = choice of action for each state
- Utility = sum of (discounted) rewards
- Optimal quantities
- V*(s) = expected utility starting in s and acting optimally
- Q*(s,a) = maximum expected utility starting out having taken action a from state s, and thereafter acting optimally
- π*(s) = optimal action from state s
- The optimal V value for states further from the only terminal state with positive reward is lower because the terminal reward is discounted more times
- V and Q values are connected
- V*(s) = max_a Q*(s,a)
- How to calculate the value of the states
- Fundamental operation: compute the EXPECTIMAX value of the state
- Expected utility under optimal action
- Average sum of (discounted) rewards
- This is what expectimax computes!
- Recursive definition of the value
- Start with V*(s) = max_a Q*(s,a)
- This is to define the optimal value as the max of all the Q values below it
- Basically defining the maximizer at the top
- Then define the chance node, aka the Q value, to be the weighted sum of the reward + the discounted value of the next node: Q*(s,a) = Σ_s' T(s,a,s') [ R(s,a,s') + γ V*(s') ]
- So combine the two and get: V*(s) = max_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V*(s') ]
- If you were to write one equation for every state in the state space, what would it look like?
- A system of equations!
- N equations, where N is the size of the state space
Value Iteration
- Time-limited values!
- Define Vk(s) to be the optimal value of s if the game ends in k more time steps
- Kinda like expectimax!
- Algorithm
- Start with V0(s): no time steps left means an expected reward sum of zero
- Given a vector of Vk(s) values, do one ply of expectimax from each state
- The new value of each state is the max over actions of the expected rewards
- Repeat this value update until convergence (Vk+1(s) = Vk(s) for all of the states); see the Python sketch at the end of this section
- Complexity of each iteration: O(S^2 A)
- S is the number of states in the state space
- A is the number of actions
- Theorem: will converge to unique optimal values because approximations get refined towards those values
- The policy may converge long before the actual values do, though
- Basically this might find the optimal policy even if all the values haven’t converged yet (because the longer it runs, the smaller the individual value changes are)
- How
- Case 1
- If the tree has maximum depth M, then VM holds the actual untruncated values
- Case 2
- If the discount is less than 1
- Sketch: for any state, Vk and Vk+1 can be viewed as depth k+1 expectimax results in nearly identical search trees
- The difference is that on the bottom layer, Vk+1 has actual rewards while Vk has zeros
- The last layer is at best all RMAX
- But everything is discounted by γ^k at that point, so Vk and Vk+1 are at most γ^k max|R| apart (as k increases, the values converge)
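A hedged value iteration sketch on a small made-up MDP (the states, transition probabilities, and rewards below are illustrative, with R simplified to R(s, a)):

```python
gamma = 0.9
states = ["cool", "warm", "done"]
actions = {"cool": ["slow", "fast"], "warm": ["slow"], "done": []}
T = {   # T[s][a] = list of (next_state, probability)
    "cool": {"slow": [("cool", 1.0)], "fast": [("cool", 0.5), ("warm", 0.5)]},
    "warm": {"slow": [("cool", 0.5), ("done", 0.5)]},
}
R = {("cool", "slow"): 1.0, ("cool", "fast"): 2.0, ("warm", "slow"): 1.0}

V = {s: 0.0 for s in states}                 # V_0(s) = 0 for every state
for k in range(1000):                        # repeat Bellman updates until convergence
    V_new = {}
    for s in states:
        if not actions[s]:                   # terminal state keeps value 0
            V_new[s] = 0.0
            continue
        # V_{k+1}(s) = max_a sum_{s'} T(s,a,s') * [ R(s,a) + gamma * V_k(s') ]
        V_new[s] = max(sum(p * (R[(s, a)] + gamma * V[s2]) for s2, p in T[s][a])
                       for a in actions[s])
    delta = max(abs(V_new[s] - V[s]) for s in states)
    V = V_new
    if delta < 1e-8:
        break

print(V)   # converged values; a greedy one-step lookahead then reads off the policy
```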
Policy Evaluation
- Fixed policy
- Usually we have to look at all actions and then find the policy
- With a FIXED POLICY, we simply do what π says in each state and evaluate the values that result (π need not be optimal)
- USE CASE
- Expectimax trees max over all actions to compute the optimal values
- If we fixed some policy π(s), then the tree would be simpler because we would already know what action to take
- Though the tree’s value would depend on which policy we fixed
- Goal is to compute the utility of a state s under a fixed (generally non-optimal) policy!
- Vπ(s) = expected total discounted rewards starting in s and following π
- Fixed-policy Bellman equation: Vπ(s) = Σ_s' T(s, π(s), s') [ R(s, π(s), s') + γ Vπ(s') ]
- How to calculate the V’s:
- Idea 1: turn the recursive Bellman equations into updates (like value iteration)
- Initialize the values, then repeatedly apply the update
- Idea 2
- Without the maxes, the Bellman equations are just a linear system
Policy Extraction
- If we have the optimal values V*(s), we need to figure out which action to take, which we can do by taking the mini-expectimax (one-step)
- Argmax = outputs the x value at which the result would be the max
- Kinda like the “domain” version of regular max, which gives you the max value in terms of the output of the function (aka the value)
- So this equation, π*(s) = argmax_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V*(s') ], outputs the action the agent should take based on the best available value
- Policy extraction = gives us the policy implied by the values, which means which action for us to take at every individual state
Computing Actions from Q-Values
- Actions are easier to select from Q-values than regular v-values
- Q-Values give ratings to each individual action from each square, v-values just give the overall square a value
- Q-values:
- Basically just choose the argmax of all the actions in the square
- Take the best-valued action!
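A one-liner sketch of picking the best action from a table of Q-values (names assumed):

```python
def best_action(Q, state, actions):
    """argmax_a Q(s, a): return the action with the highest Q-value at this state."""
    return max(actions, key=lambda a: Q[(state, a)])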
Policy Iteration
- Value iteration repeats the Bellman updates
- This is slow, the “max” rarely changes, and the policy often converges long before the values
- Value iteration is slow, policy iteration is FAST!
- Policy iteration = alternative, faster approach for optimal values
- STEPS:
- Step 1: Policy evaluation
- Calculate utilities for some fixed policy (not optimal utilities!) until convergence
- Step 2: Policy improvement
- Update policy using one-step look-ahead with resulting converged (but not optimal) utilities as future values
- Repeat until policy converges! :)
- Evaluation
- For fixed current policy π, find the values with policy evaluation and iterate until values converge
- Improvement: for fixed values, get a better policy using policy extraction
- Using the one-step look-ahead: π_new(s) = argmax_a Σ_s' T(s,a,s') [ R(s,a,s') + γ Vπ(s') ]
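A compact policy iteration sketch, assuming the same made-up MDP layout (T, R, actions dictionaries) as the value iteration sketch earlier; names are illustrative:

```python
def policy_iteration(states, actions, T, R, gamma, iters=50):
    """Alternate policy evaluation (fixed pi) and policy improvement (one-step lookahead)."""
    pi = {s: actions[s][0] for s in states if actions[s]}   # arbitrary initial policy

    def q_value(s, a, V):
        return sum(p * (R[(s, a)] + gamma * V[s2]) for s2, p in T[s][a])

    V = {s: 0.0 for s in states}
    for _ in range(iters):
        # Step 1: policy evaluation -- iterate the fixed-policy Bellman update
        for _ in range(100):
            V = {s: (q_value(s, pi[s], V) if s in pi else 0.0) for s in states}
        # Step 2: policy improvement -- greedy one-step lookahead on the evaluated values
        new_pi = {s: max(actions[s], key=lambda a: q_value(s, a, V)) for s in pi}
        if new_pi == pi:      # policy converged
            return pi, V
        pi = new_pi
    return pi, V
```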
Summary
- Value iteration ⇒ compute optimal values
- Policy iteration ⇒ compute optimal values
- Policy evaluation ⇒ compute values for a particular policy
- Policy extraction (one-step lookahead) ⇒ turn your values into a policy
- They all are kinda similar!
- All different variations of Bellman updates
- All use one-step lookahead expectimax fragments
- They differ only in whether we plug in a fixed policy or max over the actions
Lecture 7: Reinforcement Learning I
Motivation for RL
- Reinforcement learning = the process of training agents to make sequential decisions
- Offline Planning
- Solving MDPs is offline planning because you need to know all the details of the MDP to make decisions
- You have to determine all quantities through computation rather than playing the actual game
- Online Learning
- You have to play to figure out what to do ⇒ reinforcement learning!
- Key Ideas
- Exploration = you have to try unknown actions to get information
- Exploitation = eventually you have to use what you know
- Regret = even if you learn intelligently, you still make mistakes
- Sampling = because of chance, you have to try things repeatedly
- Difficulty = learning can be much harder than solving a known MDP
MDP vs RL
- Basic idea:
- Receive feedback in the form of rewards, which fuels the agent’s utility
- Must learn to act to maximize these rewards
- Learn based on observed samples of outcomes!
- Reinforcement Learning has parts similar to MDP:
- Set of states s ∈ S
- Set of actions per state a ∈ A
- Model T(s,a,s’)
- Reward function R(s,a,s’)
- Looking for a policy π(s)
- NEW TWIST!
- We don’t know T or R, so we don’t know how good or bad each state is, or what each action does
- Instead, we have to try actions and states to learn what works!
- Offline vs Online
- MDPs vs RL
- Solving an MDP means reasoning about what would happen based on its known model; RL just dives right into the action and sees what happens
Model-Based Learning
- Model-based learning = learns an approximate model based on experiences and solves for values as if the learned model were correct
- Aka it assumes the learned model is correct and acts on it!
- STEPS
- 1: Learn empirical MDP model
- Count outcomes s’ (aka all the outcomes of the next state) for each state and action (s, a)
- Normalize to give an estimate of T(s,a,s’) and discover each R(s,a,s’) as you experience (s,a,s’)
- 2: Solve the learned MDP
- You could use value iteration
- Example of Model-Based Learning
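A sketch of the counting/normalizing step, given a list of observed (s, a, s', r) samples (the episode data here is made up):

```python
from collections import Counter

# Observed transitions (s, a, s_next, r) -- made-up episode data for illustration.
samples = [("B", "east", "C", -1), ("B", "east", "C", -1),
           ("C", "east", "D", -1), ("C", "east", "A", -1)]

counts = Counter((s, a, s2) for s, a, s2, _ in samples)   # how often each (s, a, s') occurred
totals = Counter((s, a) for s, a, _, _ in samples)        # how often each (s, a) was tried
R_hat = {(s, a, s2): r for s, a, s2, r in samples}        # rewards are observed directly

# Normalize the counts to estimate the transition model T_hat(s, a, s').
T_hat = {key: counts[key] / totals[key[:2]] for key in counts}
print(T_hat[("C", "east", "D")])   # 0.5: half of the (C, east) samples went to D
# Step 2 would then run value iteration on (T_hat, R_hat) as if they were the true MDP.
```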
Lecture 8: Reinforcement Learning II
Model-Free Learning
- Summary
MDP Equations
- V*(s) = expected utility starting in s and acting optimally
- Q*(s,a) = maximum expected utility starting out having taken action a from state s, and thereafter acting optimally
- π*(s) = optimal action from state s
- The optimal V value for states further from the only terminal state with positive reward is lower because the terminal reward is discounted more times
Value Iteration
- Repeat until convergence
- Vk+1(s) = Vk(s) for all states in States (∀ state ∈ States)
- Complexity of each iteration: O(S^2 A)
Policy Evaluation (performed by Direct Evaluation)
Policy Extraction
Policy Iteration
ML Equations
Temporal Difference Learning (aka Sample-Based Policy Evaluation)
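A hedged sketch of the temporal-difference update under a fixed policy (α is the learning rate; names assumed):

```python
def td_update(V, s, r, s_next, alpha, gamma):
    """Running average toward the sample: V(s) <- (1-alpha)*V(s) + alpha*[r + gamma*V(s')]."""
    sample = r + gamma * V[s_next]
    V[s] = (1 - alpha) * V[s] + alpha * sample

V = {"A": 0.0, "B": 0.0}
td_update(V, "A", r=1.0, s_next="B", alpha=0.5, gamma=0.9)
print(V["A"])   # 0.5 * 0 + 0.5 * (1 + 0.9 * 0) = 0.5
```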
Q-Learning
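Similarly, a sketch of the Q-learning update (sample-based Q-value iteration; helper names assumed):

```python
def q_update(Q, s, a, r, s_next, next_actions, alpha, gamma):
    """Q(s,a) <- (1-alpha)*Q(s,a) + alpha*[r + gamma * max_a' Q(s',a')]."""
    best_next = max(Q[(s_next, a2)] for a2 in next_actions) if next_actions else 0.0
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
```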
Exploration vs Exploitation
- Several schemes for forcing exploration
- Like ε-greedy: with small probability ε act randomly, with large probability 1-ε act greedily on the current Q-values
- Decrease ε over time to prevent thrashing
- f(u, n) acts as an optimistic utility for under-visited states with estimated value u and visit count n
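A sketch of ε-greedy action selection as described above (names assumed):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """With small probability epsilon explore randomly; otherwise exploit the current Q-values."""
    if random.random() < epsilon:
        return random.choice(actions)                    # explore
    return max(actions, key=lambda a: Q[(state, a)])     # exploit
```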
Regret
- A measure of total mistake cost
- To minimize regret = optimally learning to be optimal
- The difference between your (expected) rewards and optimal (expected) rewards
- Random exploration has higher regret
Feature-Based Representation
- Generalize across similar states!
- The Q function or value function can then be expressed as a weighted sum of features
Approximate Q-Learning
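A hedged sketch of the approximate Q-learning weight update, where Q(s,a) = Σ_i w_i * f_i(s,a) and the feature values are assumed to come from some feature extractor:

```python
def approx_q_update(weights, features, r, gamma, max_q_next, q_sa, alpha):
    """difference = [r + gamma * max_a' Q(s',a')] - Q(s,a); each w_i moves by alpha * difference * f_i(s,a)."""
    difference = (r + gamma * max_q_next) - q_sa
    for name, f_value in features.items():   # features: {feature_name: f_i(s, a)}
        weights[name] += alpha * difference * f_value
```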
Optimization with Least Squares
Take the derivative of the least-squares error to justify why the Q update works:
Minimizing Error
Policy Search
- Starts with an okay solution (like Q-learning), then fine-tunes by hill climbing on feature weights