Reinforcement Learning, An Introduction
Tabular Solution Methods
In tabular methods, the state and action spaces are small enough for the approximate value functions to be represented as arrays, or tables.
2. Multi-armed Bandits
#### 2.1 The k-armed bandit problem
The k-armed bandit problem is about repeatedly choosing among k actions, where each action $a$ gives a reward with expected value $q_*(a) = \mathbb{E}[R_t \mid A_t = a]$. The challenge is balancing exploitation (choosing the best-known action) with exploration (trying other actions to improve estimates), since the true $q_*(a)$ values are not known in advance.
#### 2.2 Action-value methods
We estimate the value of an action $a$ at time $t$ by the average of past rewards:
$$Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbf{1}_{A_i=a}}{\sum_{i=1}^{t-1} \mathbf{1}_{A_i=a}}.$$
To choose actions, the greedy method picks $A_t = \arg\max_a Q_t(a)$, while a better strategy is $\varepsilon$-greedy, which usually exploits the best-known action but occasionally explores at random.
#### 2.4 Incremental implementation
Instead of storing all past rewards, you can update an action's value estimate with constant memory and per-step computation using a running average. The incremental rule is
$$Q_{n+1}=Q_n+\frac{1}{n}\big(R_n-Q_n\big),$$
where $R_n$ is the $n$-th reward for that action.
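As a concrete sketch, here is a minimal $\varepsilon$-greedy bandit loop using the incremental update (Python; the function name, Gaussian reward model, and parameter values are illustrative, not from the text):

```python
import random

def run_bandit(q_true, steps=1000, eps=0.1, seed=0):
    """eps-greedy k-armed bandit with incremental sample-average updates."""
    rng = random.Random(seed)
    k = len(q_true)
    Q = [0.0] * k   # value estimates Q_t(a)
    N = [0] * k     # pull counts N_t(a)
    for _ in range(steps):
        if rng.random() < eps:
            a = rng.randrange(k)                    # explore
        else:
            a = max(range(k), key=lambda i: Q[i])   # exploit
        r = q_true[a] + rng.gauss(0.0, 1.0)         # noisy reward around q_*(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                   # Q_{n+1} = Q_n + (R_n - Q_n)/n
    return Q, N
```

With `q_true = [0.2, 0.8, -0.5]`, the second arm should end up with the highest estimate and the most pulls.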
In nonstationary problems we usually don't want convergence to a fixed number; with constant step size $\alpha$, $Q_n$ tracks a drifting $q_*(a)$ instead of settling.
#### 2.5 Nonstationary problemsFor nonstationary bandits, use a constant step size to favor recent rewards:
$$Q_{n+1}=Q_n+\alpha (R_n-Q_n),\quad 0<\alpha\le1,$$
which makes $Q_{n+1}$ an exponential recency-weighted average, with the weight on past reward $R_i$ equal to $\alpha(1-\alpha)^{n-i}$. With varying step sizes $\{\alpha_n\}$, convergence to a fixed value requires $\sum_{n=1}^\infty \alpha_n=\infty$ and $\sum_{n=1}^\infty \alpha_n^2<\infty$; constant $\alpha$ violates the second condition, so it tracks changing values instead of converging, which is exactly what we want in nonstationary settings.
#### 2.6 Optimistic initial values
Set optimistic initial values (e.g., $Q_1(a)=c$ with large $c$) so that even a greedy agent ($\varepsilon=0$) tries many actions early: each action is selected until its estimate drops toward its true mean after "disappointing" rewards. This gives temporary exploration that works in stationary tasks but fades once estimates settle. It is brittle for nonstationary problems because the influence of the initial values disappears quickly, while nonstationary tasks need ongoing exploration, not a one-time push. Empirically, optimistic-greedy can catch up to $\varepsilon$-greedy, but only because the optimism forces early exploration.
#### 2.7 Upper-confidence-bound action selection
The Upper-Confidence-Bound (UCB) method chooses actions by balancing their estimated value and uncertainty:
$$A_t = \arg\max_a \Big[ Q_t(a) + c\sqrt{\tfrac{\ln t}{N_t(a)}} \Big],$$
where $Q_t(a)$ is the estimated value, $N_t(a)$ counts how often $a$ has been tried, and $c$ controls the degree of exploration. Rarely chosen actions get an extra boost from the uncertainty term, leading to more systematic exploration than $\varepsilon$-greedy, and often better long-term performance.
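A minimal sketch of the constant step-size update and the UCB selection rule (Python; function names and the handling of untried actions as maximally uncertain are my own choices):

```python
import math

def update_constant_alpha(Q, a, r, alpha=0.1):
    """Exponential recency-weighted update: tracks a moving target."""
    Q[a] += alpha * (r - Q[a])

def ucb_action(Q, N, t, c=2.0):
    """UCB selection; any action with N[a] == 0 is chosen first."""
    untried = [a for a, n in enumerate(N) if n == 0]
    if untried:
        return untried[0]
    return max(range(len(Q)),
               key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))
```

When all estimates are equal, the action with the smallest count gets the largest uncertainty bonus and is picked next.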
#### 2.8 Gradient bandit algorithms
Gradient bandit algorithms assign each action a preference $H_t(a)$, and select actions using a softmax distribution:
$$\pi_t(a) = \frac{e^{H_t(a)}}{\sum_b e^{H_t(b)}}.$$
After observing reward $R_t$, preferences are updated by gradient ascent: the chosen action's preference increases if $R_t$ is above the baseline $\bar{R}_t$ and decreases if below, while the non-chosen actions move in the opposite direction; the baseline reduces the variance of the updates and keeps learning stable. The gradient bandit algorithm can be derived as stochastic gradient ascent on the expected reward. The update
$$H_{t+1}(a) = H_t(a) + \alpha (R_t - \bar{R}_t)(\mathbf{1}_{a=A_t} - \pi_t(a))$$
comes from the performance gradient $\frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)}$, where $\pi_t(a)$ is the softmax policy. This derivation shows that the algorithm increases expected reward in expectation at each step, and it connects the intuitive update rule in (2.12) to formal gradient ascent theory.
Why this makes sense:
- Objective: In gradient bandits, the goal is to maximize the expected reward $\mathbb{E}[R_t] = \sum_x \pi_t(x)\, q_*(x),$ where $q_*(x)$ is the true expected reward of action $x$ and $\pi_t(x)$ is a function of the preferences $H_t(x)$, as introduced above.
- Gradient ascent idea: If we treat the action preferences $H_t(a)$ as parameters, then the natural way to improve them is to move in the direction of the gradient of the expected reward: $H \leftarrow H + \alpha \nabla_H \mathbb{E}[R].$
The derivative with respect to a preference parameter $H_t(a)$ is
$\frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)} = \sum_x q_*(x) \frac{\partial \pi_t(x)}{\partial H_t(a)}.$
We know $\frac{\partial \pi_t(x)}{\partial H_t(a)} = \pi_t(x)\big(\mathbf{1}_{x=a} - \pi_t(a)\big)$ (the standard softmax derivative).
So,
$$ \frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)} = \sum_x q_*(x)\, \pi_t(x)\big(\mathbf{1}_{x=a} - \pi_t(a)\big) = \sum_x \big(q_*(x) - B_t\big)\, \pi_t(x)\big(\mathbf{1}_{x=a} - \pi_t(a)\big) = \mathbb{E}\big[(R_t-\bar{R}_t)\big(\mathbf{1}_{A_t=a} - \pi_t(a)\big)\big].$$
Here $B_t$ can be any quantity that does not depend on $x$, because $\sum_x \pi_t(x)\big(\mathbf{1}_{x=a} - \pi_t(a)\big) = 0$; choosing $B_t = \bar{R}_t$ gives the baseline, which lowers the variance of the updates and makes learning more stable.
- Stochastic approximation: sampling $A_t \sim \pi_t$ and replacing the expectation with the single observed sample $(R_t - \bar{R}_t)(\mathbf{1}_{a=A_t} - \pi_t(a))$ yields exactly the incremental update rule above.
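The full update, with the softmax policy and an incremental reward-average baseline, can be sketched as follows (Python; the incremental-average baseline is one common choice, since the derivation allows any action-independent baseline, and all names are illustrative):

```python
import math

def softmax(H):
    """Softmax policy pi_t(a) over preferences H, stabilized by a max-shift."""
    m = max(H)
    e = [math.exp(h - m) for h in H]
    s = sum(e)
    return [x / s for x in e]

def gradient_bandit_step(H, rbar, a, r, t, alpha=0.1):
    """One gradient-bandit update after taking action a and seeing reward r.

    rbar is maintained as the incremental average of rewards up to step t.
    """
    pi = softmax(H)
    rbar += (r - rbar) / t
    for b in range(len(H)):
        indicator = 1.0 if b == a else 0.0
        H[b] += alpha * (r - rbar) * (indicator - pi[b])
    return H, rbar
```

A reward above the baseline raises the chosen action's preference and lowers the others', exactly as the update rule prescribes.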
3. Finite Markov Decision Processes
#### 3.1 Agent–environment interface
The agent–environment interface in reinforcement learning is a cycle: at each time step $t$, the agent observes a state $S_t$, takes an action $A_t$, and then receives a reward $R_{t+1}$ and a new state $S_{t+1}$. The environment's dynamics are defined by a probability function
$$p (s',r \mid s,a ) = \Pr \{S_{t+1}=s', R_{t+1}=r \mid S_t=s, A_t=a\},$$
which gives the joint distribution of next states and rewards, and from which we can also compute state-transition probabilities and expected rewards.
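The four-argument dynamics function and the quantities derived from it can be illustrated with a toy MDP (Python; the states, actions, and numbers are invented for illustration):

```python
# A toy MDP's dynamics stored as p[(s, a)] -> list of (s', r, prob) triples.
p = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 0.9), ("s0", 0.0, 0.1)],
    ("s1", "stay"): [("s1", 0.0, 1.0)],
    ("s1", "go"):   [("s0", 0.0, 1.0)],
}

def expected_reward(p, s, a):
    """r(s, a) = sum over s', r of p(s', r | s, a) * r."""
    return sum(prob * r for (_, r, prob) in p[(s, a)])

def transition_prob(p, s, a, s_next):
    """p(s' | s, a) = sum over r of p(s', r | s, a)."""
    return sum(prob for (sn, _, prob) in p[(s, a)] if sn == s_next)
```

Marginalizing the four-argument function over rewards gives the transition probabilities; weighting rewards by their probabilities gives $r(s,a)$.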
#### 3.2 Goals and Rewards
In reinforcement learning, the agent’s goal is to maximize the cumulative reward over time:
$$\max\; \mathbb{E}\!\left[\sum_t R_t\right].$$
Rewards define what the agent should achieve, not how to achieve it, so the design of the reward signal must reflect the true goals we want the agent to pursue.
#### 3.3 Returns and Episodes
The return is the sum of future rewards,
$$G_t = R_{t+1} + R_{t+2} + \dots + R_T,$$
for episodic tasks with a final time step $T$. For continuing tasks, we use the discounted return
$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \quad 0 \le \gamma \le 1,$$
which weights near-term rewards more when $\gamma$ is small and far-term rewards more when $\gamma$ is close to 1, making the agent more farsighted.
#### 3.4 Unified notation for episodic and continuing tasks
Episodic tasks end after a finite number of steps, while continuing tasks go on indefinitely, but both can be written with the same return formula:
$$G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1},$$
where $T$ may be finite (episodic) or $T=\infty$ (continuing, with $\gamma<1$). By treating episode termination as entering a special absorbing state with zero reward, we get a unified notation that covers both cases.
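The discounted return is conveniently computed backwards through an episode using the recursion $G_t = R_{t+1} + \gamma G_{t+1}$ (Python sketch; the function name is mine):

```python
def discounted_return(rewards, gamma):
    """Compute G_t for a reward sequence R_{t+1}, R_{t+2}, ... by working
    backwards with the recursion G = R + gamma * G_next."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, three rewards of 1 with $\gamma = 0.5$ give $1 + 0.5 + 0.25 = 1.75$; with $\gamma = 1$ the return is just the undiscounted sum.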
#### 3.5 Policies and value functions
A policy $\pi(a|s)$ is the agent's rule for choosing action $a$ in state $s$.
The state-value function under policy $\pi$ is
$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t=s],$$
the expected return starting from $s$ and following $\pi$. The action-value function is
$$q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t=s, A_t=a],$$
the expected return starting from $s$, taking action $a$, and thereafter following $\pi$.
These functions capture how good states or state–action pairs are under a policy, and they are the foundation for reinforcement learning methods.
Some Exercises
#### Exercise 3.15
In the gridworld example, rewards are positive for goals, negative for running into the edge of the world, and zero the rest of the time. Are the signs of these rewards important, or only the intervals between them? Prove, using (3.8), that adding a constant $c$ to all the rewards adds a constant, $v_c$, to the values of all states, and thus does not affect the relative values of any states under any policies. What is $v_c$ in terms of $c$ and $\gamma$?
Answer: Adding a constant $c$ to all rewards shifts all state values by the same amount:
$$v'_\pi(s)= v_\pi(s)+\frac{c}{1-\gamma},$$
so $v_c = \frac{c}{1-\gamma}$. Only differences between rewards matter; the relative ordering of states is unchanged.
#### Exercise 3.16
Now consider adding a constant $c$ to all the rewards in an episodic task, such as maze running. Would this have any effect, or would it leave the task unchanged as in the continuing task above? Why or why not? Give an example.
Answer: In episodic tasks, adding $c$ changes the return by $c \times$ (episode length). If policies differ in how long episodes last, this can change which policy is optimal. For example, in a maze task with a per-step penalty, adding a large positive $c$ rewards longer episodes and can make dawdling optimal.
#### Exercise 3.17
What is the Bellman equation for action values, that is, for $q_\pi$? It must give the action value $q_\pi(s,a)$ in terms of the action values $q_\pi(s',a')$ of possible successors to the state–action pair $(s,a)$. Show the sequence of equations analogous to (3.14), but for action values.
Answer:
$$q_\pi(s,a)=\mathbb{E}[R_{t+1}+\gamma v_\pi(S_{t+1}) \mid S_t=s,A_t=a] =\sum_{s',r} p(s',r \mid s,a)\left[r+\gamma \sum_{a'} \pi(a'|s')\,q_\pi(s',a')\right].$$
#### Exercise 3.18
The value of a state depends on the values of the actions possible in that state and on how likely each action is to be taken under the current policy. Give the equation corresponding to this intuition and diagram for the value at the root node, $v_\pi(s)$, in terms of the value at the expected leaf node, $q_\pi(s,a)$, given $S_t=s$.
Answer:
$$v_\pi(s) = \sum_a \pi(a|s)\, q_\pi(s,a).$$
Expanded with the dynamics:
$$v_\pi(s) = \sum_a \pi(a|s)\sum_{s',r} p(s',r\mid s,a)\left[r+\gamma v_\pi(s')\right].$$
#### Exercise 3.19
The value of an action, $q_\pi(s,a)$, depends on the expected next reward and the expected sum of the remaining rewards. Give the equation corresponding to this intuition and diagram for the action value, $q_\pi(s,a)$, in terms of the expected next reward, $R_{t+1}$, and the expected next state value, $v_\pi(S_{t+1})$, given $S_t=s, A_t=a$.
Answer:
$$q_\pi(s,a)=\mathbb{E}[R_{t+1}+\gamma v_\pi(S_{t+1})\mid S_t=s,A_t=a].$$
Expanded:
$$q_\pi(s,a) = \sum_{s',r} p(s',r\mid s,a)\left[r+\gamma v_\pi(s')\right] =\sum_{s',r} p(s',r\mid s,a)\left[r+\gamma \sum_{a'}\pi(a'|s')\,q_\pi(s',a')\right].$$
#### 3.6 Optimal value functions
This section introduces the Bellman optimality equations, which define the value functions for an optimal policy.
* For the state-value function:
$$ v_*(s) = \max_a q_*(s,a) = \max_a \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t=s, A_t=a] = \max_a \sum_{s',r} p(s',r\mid s,a)\big[r+\gamma v_*(s')\big].$$
* For the action-value function:
$$ q_*(s,a) = \mathbb{E}[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1},a') \mid S_t=s,A_t=a] = \sum_{s',r} p(s',r\mid s,a)\big[r+\gamma \max_{a'} q_*(s',a')\big].$$
#### 3.7 Optimality and approximation
In practice, agents cannot compute exact optimal policies because solving the Bellman optimality equations for large or complex tasks is too costly in time and memory. Instead, reinforcement learning uses approximations: tabular methods for small state spaces, or parameterized functions for large ones, focusing learning effort on frequently visited states so that even approximate policies can perform very well.
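One standard way to solve the Bellman optimality equation for $v_*$ in a small tabular MDP is value iteration (covered in Chapter 4 of the book), sketched here on an invented two-state MDP (Python; all names and numbers are illustrative):

```python
# Invented two-state MDP: "go" always earns reward 1 and switches state.
# Dynamics stored as p[(s, a)] -> list of (s', r, prob) triples.
p = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 1.0)],
    ("s1", "stay"): [("s1", 0.0, 1.0)],
    ("s1", "go"):   [("s0", 1.0, 1.0)],
}

def value_iteration(p, states, actions, gamma=0.9, tol=1e-8):
    """Sweep v(s) <- max_a sum_{s',r} p(s',r|s,a)[r + gamma v(s')] to a fixed point."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            old = v[s]
            v[s] = max(
                sum(prob * (r + gamma * v[sn]) for (sn, r, prob) in p[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(v[s] - old))
        if delta < tol:
            return v
```

Here the optimal policy is to always `go`, earning 1 per step, so $v_*(s) = 1/(1-\gamma) = 10$ for $\gamma = 0.9$.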
5. Monte Carlo Methods
#### 5.1 Monte Carlo prediction
Monte Carlo (MC) prediction estimates the value of a state, $v_\pi(s)$, by averaging the returns observed after visiting that state under policy $\pi$. There are two versions: first-visit MC, which averages returns only from the first time a state is visited in each episode, and every-visit MC, which averages over all visits; both converge to $v_\pi(s)$ as the number of visits grows.
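A minimal first-visit MC prediction sketch (Python; the episode representation, a list of (state, reward-received-on-leaving-that-state) pairs, is my own convention):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """First-visit MC prediction.

    Each episode is a list of (state, reward) pairs, where the reward is the
    one received on leaving that state. Returns the estimated v_pi per state.
    """
    returns = defaultdict(list)
    for ep in episodes:
        # Returns from each time step, computed backwards.
        gs = [0.0] * len(ep)
        g = 0.0
        for i in range(len(ep) - 1, -1, -1):
            g = ep[i][1] + gamma * g
            gs[i] = g
        # Record only the return following the FIRST visit to each state.
        seen = set()
        for i, (s, _) in enumerate(ep):
            if s not in seen:
                seen.add(s)
                returns[s].append(gs[i])
    return {s: sum(v) / len(v) for s, v in returns.items()}
```

Switching the `seen` check for unconditional appends would turn this into every-visit MC.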
#### 5.2 Monte Carlo estimation of action values
Without a model, state values alone are not enough to derive a policy, so we estimate action values $q_\pi(s,a)$ by averaging the returns that follow each visit to the state–action pair $(s,a)$. Under a deterministic policy many pairs are never visited, so some exploration is needed; the exploring-starts assumption, that every state–action pair has a nonzero probability of beginning an episode, guarantees that all pairs are evaluated.
#### 5.3 Monte Carlo control
Monte Carlo control improves a policy by alternating between evaluating $q_\pi(s,a)$ from sampled episodes and improving the policy with $\pi(s) = \arg\max_a q_\pi(s,a)$. Repeating this process (generalized policy iteration, GPI) drives both $\pi$ and $q$ toward the optimal policy and value function.
#### 5.4 Monte Carlo control without exploring starts
Monte Carlo control without exploring starts avoids the unrealistic assumption that every state–action pair can be used to start an episode. Instead, it uses $\varepsilon$-greedy (more generally, $\varepsilon$-soft) policies, which choose the best-known action with probability $1-\varepsilon$ and a random action otherwise. This guarantees that all actions keep being explored; GPI then converges to the best $\varepsilon$-soft policy, which is near-optimal for small $\varepsilon$.
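The $\varepsilon$-soft policy used in this scheme can be written down directly from the current $Q$ (Python sketch; the function name and the `Q[(s, a)]` dictionary layout are my own conventions):

```python
def eps_soft_probs(Q, s, actions, eps):
    """Action probabilities of the eps-greedy (eps-soft) policy derived from Q:
    eps/|A| for every action, plus an extra 1 - eps on the greedy action."""
    greedy = max(actions, key=lambda a: Q[(s, a)])
    base = eps / len(actions)
    return {a: base + (1.0 - eps if a == greedy else 0.0) for a in actions}
```

Every action keeps probability at least $\varepsilon/|\mathcal{A}|$, which is exactly what guarantees continued exploration.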
#### 5.5 Off-policy prediction via importance sampling
Importance sampling lets us use data from a behavior policy $b$ to estimate values for a target policy $\pi$. We reweight each return $G_t$ with the importance-sampling ratio:
$$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k|S_k)}{b(A_k|S_k)},$$
so the ordinary importance-sampling estimate is
$$V(s) \approx \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T-1} G_t}{|\mathcal{T}(s)|},$$
which is unbiased because $\mathbb{E}_b[\rho_{t:T-1} G_t \mid S_t=s] = v_\pi(s)$, but can have very high variance. The weighted importance-sampling estimate
$$V(s) \approx \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T-1}}$$
is biased (the bias vanishes asymptotically), but its variance is much lower, so it is usually preferred in practice.
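Both the ordinary estimator (divide by the number of samples) and the weighted estimator (divide by the sum of ratios) can be computed from (ratio, return) pairs (Python sketch; names are illustrative):

```python
def is_estimates(samples):
    """Ordinary and weighted importance-sampling estimates of v_pi(s).

    `samples` holds (rho, G) pairs: the importance ratio rho_{t:T-1} and the
    return G_t observed from s under the behavior policy b.
    """
    num = sum(rho * g for rho, g in samples)
    ordinary = num / len(samples)               # unbiased, higher variance
    wsum = sum(rho for rho, _ in samples)
    weighted = num / wsum if wsum > 0 else 0.0  # biased, lower variance
    return ordinary, weighted
```

With samples `[(2.0, 1.0), (0.5, 2.0)]` the two estimators disagree, which illustrates that they are genuinely different quantities.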