Last time we looked at dynamic programming (DP): methods that compute near-exact solutions to problems using a full-scale model of the environment dynamics. This time we are doing a complete 180. Monte Carlo (MC) methods are everything DP is not.

Monte Carlo methods work by interacting directly with the environment (real or simulated experience) and then averaging the returns observed from the states the agent passes through. The name comes from Monte Carlo estimation in general, which refers to methods that estimate quantities through repeated random sampling.

  • Complete knowledge of the environment is unnecessary.
  • Works based on averaging outcomes (more on that later)
  • Easily scalable to larger problems.
  • Estimates are independent, i.e. no bootstrapping
  • Only a single sequence of events (one episode) is looked at in each iteration (no full backups)

Don’t let this fool you: the methods we saw in DP are still very important in Monte Carlo. The iterative cycle of evaluation followed by improvement (known as Generalized Policy Iteration, or GPI) seen before is used in MC as well.

When we say that complete environment knowledge is unnecessary, or that MC methods do not require environment models, we mean that the agent does not have access to the environment’s transition dynamics; it can only interact with the environment and receive rewards.

Monte Carlo Evaluation

Monte Carlo methods work by averaging the returns (sums of rewards) seen in each episode over the states (or state-action pairs) the agent goes through. There are two ways to evaluate states under a policy with Monte Carlo:

  • First-visit MC: In each episode, for a state s we only record the return from the first time s is visited. We may end up in state s multiple times in a single episode, but only the first such instance is considered.
  • Every-visit MC: Here we record a return every time we end up in state s in an episode, keep a separate tally for each encounter, and average the tallied returns over the number of encounters.

Suppose we are trying to calculate the value of a state X. In the first visit method, after we reach X we start to sum the rewards until the end of the episode. If the state X appears again, ignore it and don’t start counting again. The value of the state X is the average sum for all episodes where X appears.

The every-visit method works in a similar way, but if the state X reappears in the episode, we start another sum (while continuing to add to the previous sums as well). Then all the sums are averaged to get the value of that state.

Episode 1            Episode 2
State    Reward      State    Reward
A         2          B         5
B         3          B         2
A        -6          A        -1
B         8          B         5
B        -6          B        -4

In first-visit MC, the sum of rewards starting from the first visit to A in episode 1 is 2+3-6+8-6 = 1; in episode 2 it is -1+5-4 = 0 (start summing from the first time the state is observed; gamma = 1 here). V(A) is therefore (1+0)/2 = 0.5. Similarly, V(B) = (-1+7)/2 = 3.

In the implementations of the Monte Carlo policy evaluation functions, we will use a run_episode method that will return lists of states and rewards.
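The run_episode helper is not shown in the post, but a minimal sketch of it could look like the following. The environment interface used here (reset() returning a state, step() returning the next state, the reward, and a done flag) is an assumption for illustration, not part of the original code.

import torch

def run_episode(env, policy):
    # Hypothetical helper: plays one episode and records states and rewards
    state = env.reset()  # assumed interface
    states, rewards = [], []
    done = False
    while not done:
        # Sample an action from the policy's distribution for this state
        action = torch.multinomial(policy[state], 1).item()
        next_state, reward, done = env.step(action)  # assumed signature
        states.append(state)
        rewards.append(reward)
        state = next_state
    return states, rewards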

# First Visit Monte Carlo Policy Evaluation
def mc_eval_first_visit(env, policy, gamma, num_episodes):
    num_states = policy.shape[0]
    # Value tensor
    V = torch.zeros(num_states)
    # Number of times a state is counted:
    # once for each episode the state appears in
    N = torch.zeros(num_states)
    for episode in range(num_episodes):
        states, rewards = run_episode(env, policy)
        # Return tensor: discounted return from the first visit to each state
        G = torch.zeros(num_states)
        # Flag variable array marking states visited in this episode
        first_visit = [False] * num_states
        disc_reward = 0  # Discounted reward
        for state, reward in zip(reversed(states), reversed(rewards)):
            disc_reward = gamma * disc_reward + reward
            # Iterating backwards, the last write is the first visit's return
            G[state] = disc_reward
            first_visit[state] = True
        # Update value and counter tensors
        for state in states:
            if first_visit[state]:
                V[state] += G[state]
                N[state] += 1
                # Reset the flag so each state is counted at most once per episode
                first_visit[state] = False
    # Averaging first-visit returns
    for state in range(num_states):
        if N[state] > 0:
            V[state] = V[state] / N[state]
    return V

Let’s break this down. First, we go through the states and rewards in reverse; this makes the discounting easy to compute. The running return is written into G for each state (since we iterate backwards, the value left in G[state] is the return from the first visit), and the first_visit flag is set to True.

for state, reward in zip(reversed(states), reversed(rewards)):
    disc_reward = gamma * disc_reward + reward            
    G[state] = disc_reward
    first_visit[state] = True

After going through the result of an episode, the value and counter tensors are updated. In first-visit MC, the count for a state is updated by at most 1 in any given episode.

for state in states:
    if first_visit[state]:
        V[state] += G[state]
        N[state] += 1
        first_visit[state] = False

After simulating a number of episodes, the value tensor is divided by the number of episodes in which each state appears to obtain the value function.

for state in range(num_states):
    if N[state] > 0:
        V[state] = V[state] / N[state] 

In every-visit MC, each time a state appears we start a new tally. A appears twice in episode 1 and once in episode 2. The returns are 1 for the first tally and -4 for the second tally in episode 1, and 0 in episode 2, so V(A) = (1-4+0)/3 = -1. State B appears 7 times in total across the two episodes, so V(B) = (-1+2-6+7+2+1-4)/7 = 1/7 (gamma = 1).
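As a quick sanity check of the hand calculations above, here is a small standalone snippet that recomputes both estimates directly from the two example episodes (gamma = 1). The episode encoding below is only for this check and is not part of the evaluation functions.

# The two example episodes as (state, reward) pairs
episodes = [
    [('A', 2), ('B', 3), ('A', -6), ('B', 8), ('B', -6)],
    [('B', 5), ('B', 2), ('A', -1), ('B', 5), ('B', -4)],
]

def returns_for(episode, state, first_only):
    # Collect the return recorded at each (or only the first) visit to `state`
    out = []
    for i, (s, _) in enumerate(episode):
        if s == state:
            out.append(sum(r for _, r in episode[i:]))
            if first_only:
                break
    return out

for s in ('A', 'B'):
    first = [g for ep in episodes for g in returns_for(ep, s, True)]
    every = [g for ep in episodes for g in returns_for(ep, s, False)]
    print(s, sum(first) / len(first), sum(every) / len(every))
# A: 0.5 (first-visit) and -1.0 (every-visit); B: 3.0 and roughly 0.14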

The differences in implementation are slight, yet significant.

# Every Visit Monte Carlo Policy Evaluation
def mc_eval_every_visit(env, policy, gamma, num_episodes):
    num_states = policy.shape[0]
    # Value tensor
    V = torch.zeros(num_states)
    # Number of times a state is counted:
    # once for every visit in every episode
    N = torch.zeros(num_states)
    # Return tensor, now accumulated across all episodes
    G = torch.zeros(num_states)
    for episode in range(num_episodes):
        states, rewards = run_episode(env, policy)
        disc_reward = 0 # Discounted reward
        for state, reward in zip(reversed(states), reversed(rewards)):
            # Every reward added
            disc_reward = gamma * disc_reward + reward            
            G[state] += disc_reward
            N[state] += 1
    # Averaging total returns
    for state in range(num_states):
        if N[state] > 0:
            V[state] = G[state] / N[state]     
    return V

The differences in implementation are:

  • The return tensor G is no longer reset after each episode. Instead, it keeps a running total across all episodes.
  • No flag variables are used. Instead, N is incremented at every state visit.
  • V[state] = V[state] / N[state] is replaced by V[state] = G[state] / N[state] when averaging, because G now keeps track of all the returns.

After every episode we only update the states we visited (in contrast to DP, where everything is updated). First-visit and every-visit MC define ways to score states and state-action pairs. These score estimates (akin to policy evaluation) are used to approximate an optimal policy (policy improvement). To find this policy, we need to explore the environment and evaluate all possible states the agent can end up in during an episode. Since, unlike DP, MC only updates the states it experiences in an episode, exploration carries far more weight here. Moreover, we don’t even know what all the possible states in the environment are (since we don’t have an environment model to consult).

The best way to decide between a first-visit and an every-visit strategy is to implement and compare. Oftentimes the nature of the task indicates which strategy is more appropriate; it is still good practice to try out both methods when unsure.

Monte Carlo Control: A Prologue

When an environment model is unavailable, state values alone are not enough to choose actions (we don’t know the transition probabilities needed to look one step ahead from a state). We can, however, estimate the value of actions directly by interacting with the environment.

This poses a problem: if we let the agent greedily take actions or always follow the same policy, the agent won’t explore the other options. Actions that may seem poor at first may be much better down the line. A fully greedy approach won’t do.

Exploring Starts

A possible solution here is what’s known as exploring starts: every episode begins with a randomly chosen first action (more generally, a randomly chosen state-action pair), so that every state-action pair has some chance of starting an episode. Of course, the assumption that we can start episodes anywhere doesn’t always hold in practice, but exploring starts is very easy to implement.

# For the very first action
action = torch.randint(low=0, high=num_actions, size=[1]).item()

The first action just has to be random. The rest of the actions are selected based on the matrix of action values [1].
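For the later steps, greedy selection from the action-value matrix might look like the one-liner below; here Q is assumed to be a tensor of shape (num_states, num_actions), which is not defined in the original snippet.

# Greedy selection from the action-value matrix for every step after the first
action = torch.argmax(Q[state]).item()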

Stochastic Policy

A more robust way to ensure exploration is by making exploration a part of the policy. This means that instead of greedily selecting the best action all the time, all actions are given a non-zero likelihood of taking place. While the best action has the highest probability of happening, the agent selects an action from the distribution randomly. This ensures some exploration always takes place.

Avoiding Exploring Starts: Monte Carlo Control

Exploring starts is based on an assumption that often does not hold in practice. Here we look at ways to build a good Monte Carlo policy search algorithm that doesn’t rely on exploring starts.

On-policy MC Control

On-policy Monte Carlo control is very similar to policy iteration; the optimal policy is learned by starting with a given policy and slowly improving it over many iterations. Here, instead of updating a value function over states, we evaluate state-action pair values. This shift from state values to action values is the key subtlety of Monte Carlo control. These action values are also known as the Q-function. In the improvement phase we select the best actions the same way as before, running torch.argmax() on the Q-function for every state seen. Repeating this many times brings us to the optimal policy.

The initial policy here has to adhere to a strict condition - stochasticity. That is, every action must have some likelihood of being selected. Policies that ensure this are called soft policies. A simple example would be a policy that selects a random action at each state with equal probability, as sketched below. A better policy would be what we call an ϵ-greedy policy.
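A uniformly random soft policy can be built in one line; num_states and num_actions are assumed to be known for this illustration.

# A uniformly random (soft) policy: every action is equally likely in every state
policy = torch.full((num_states, num_actions), 1.0 / num_actions)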

ϵ-Greedy: A bit less greed

The ϵ-greedy policy is an easy policy to understand, but requires a bit of mathematical insight to prove. Of course I will not be going over the proof for the sake of sanity, but feel free to look into it if you have the time.

For an ϵ-greedy policy π, the update rule for a state s and action a is as follows

if a is optimal:
    π[s][a] = 1 - eps + eps/num_actions
else:
    π[s][a] = eps/num_actions

where eps (ϵ) is a parameter determining the minimum likelihood of any action being taken. The higher the ϵ, the more exploration takes place [2]. Actions are selected on a probabilistic basis, i.e. the less likely options will still be selected sometimes.
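Putting the pieces together, here is a minimal sketch of on-policy MC control with an ϵ-greedy policy. It is not the post’s exact implementation: it assumes a variant of run_episode that also returns the actions taken (as the off-policy code further down does), and it evaluates Q with first-visit averaging of returns.

# epsilon-greedy improvement: apply the update rule above to every state
def epsilon_greedy_policy(Q, eps):
    num_states, num_actions = Q.shape
    policy = torch.full((num_states, num_actions), eps / num_actions)
    best_actions = torch.argmax(Q, dim=1)
    policy[torch.arange(num_states), best_actions] += 1 - eps
    return policy

def mc_control_on_policy(env, gamma, eps, num_states, num_actions, num_episodes):
    Q = torch.zeros(num_states, num_actions)
    N = torch.zeros(num_states, num_actions)
    policy = epsilon_greedy_policy(Q, eps)
    for episode in range(num_episodes):
        # run_episode is assumed to return the actions taken as well
        states, actions, rewards = run_episode(env, policy)
        disc_reward = 0
        first_return = {}
        for state, action, reward in zip(reversed(states), reversed(actions), reversed(rewards)):
            disc_reward = gamma * disc_reward + reward
            # Iterating backwards, the last write is the first visit's return
            first_return[(state, action)] = disc_reward
        # Evaluation: running average of first-visit returns
        for (state, action), g in first_return.items():
            N[state, action] += 1
            Q[state, action] += (g - Q[state, action]) / N[state, action]
        # Improvement: make the policy epsilon-greedy with respect to Q
        policy = epsilon_greedy_policy(Q, eps)
    return policy, Q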

Off-policy MC Control

Off-policy MC control uses a different technique to encourage exploration. Here, instead of building randomness into the agent’s policy, we use a separate policy for exploration. The policy that we update after each iteration is called the target policy, π, and the policy that is used for exploration is the behavior policy, b. But how do we use one policy to train another one?

Importance Sampling

Sampling is a statistical method of taking a number of items from a group to estimate the properties of the whole group. Importance sampling lets us estimate values under the target policy from episodes generated by the behaviour policy: each return is weighted by how likely its actions were under the target policy relative to the behaviour policy. Since our target policy is greedy, we only take into account the steps (working backwards from the end of the episode) where the behaviour policy’s action matches the target policy’s, i.e., the common steps. The importance sampling pseudocode is as follows.

states, actions, rewards = run_episode(env, behaviour_policy)
weight = 1.0
return_g = 0.0  # Discounted return, G
for state, action, reward in zip(reversed(states), reversed(actions), reversed(rewards)):
    return_g = gamma * return_g + reward  # Discounted return
    # N is a dictionary counting the number of times
    # a state-action pair has occurred;
    # G_sum is a dictionary accumulating the weighted returns for that pair
    G_sum[(state, action)] += weight * return_g
    N[(state, action)] += 1
    Q[state][action] = G_sum[(state, action)] / N[(state, action)]
    # Update weight by dividing it by the probability of the action
    # under the behaviour policy (the greedy target policy picks it with probability 1)
    weight = weight / behaviour_policy(state)[action]
    # Stop if the behaviour policy's action does not match that of the target policy
    if action != torch.argmax(Q[state]).item():
        break

There are two types of importance sampling, ordinary and weighted. The above algorithm describes ordinary importance sampling. We calculate the action values using the equation

Q(s, a) = (W_1*G_1 + W_2*G_2 + ... + W_n*G_n) / n

where G_i is the return observed at the i-th occurrence of the pair (s, a), W_i is the corresponding importance-sampling weight, and n is the total number of occurrences (the N counter in the code).

Ordinary importance sampling has a significant drawback: mathematical analysis has shown that its variance can be unbounded (in the worst case, infinite). This means the method can fail to converge, or take much longer to do so. The variance is unbounded because we are taking a simple average of weighted returns, and the weights themselves are unbounded. A better alternative is weighted importance sampling. Here, instead of taking a simple average, we take a weighted average of the returns. Since no single return can contribute a share larger than 1 to the weighted average, the variance of the estimate converges to zero.
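For comparison with the equation above, the weighted version replaces the simple average with a weighted one:

Q(s, a) = (W_1*G_1 + W_2*G_2 + ... + W_n*G_n) / (W_1 + W_2 + ... + W_n)

so each return’s effective weight in the estimate is at most 1.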

N[(state, action)] += weight

Changing this one line converts ordinary importance sampling to weighted importance sampling.

Conclusion

Monte Carlo methods are versatile, easy to implement, and quick to learn. The primary appeal of these methods is their ability to learn just from interacting with the environment. We estimate the optimal policy and value function by letting the agent gain experience in the form of sample episodes.

In many practical scenarios it is much easier to simulate sample episodes than it is to construct exact models that would be required for DP methods. Moreover, since MC only evaluates the states the agent encounters, it does not waste time evaluating the entire state space.

A drawback of Monte Carlo methods is that these algorithms only work on episodic tasks. Defining end points for continuing tasks to take averages from is far more challenging, and oftentimes unintuitive.

First-visit and every-visit Monte Carlo evaluation are two different ways to assess the relative value of a state or state-action pair. It is difficult to choose one method over the other, so apply both of them; most of the time the results will be very similar.

Monte Carlo methods have to ensure sufficient exploration when looking for the optimal policy. On-policy and off-policy Monte Carlo control are two good ways to perform this search. Again, it may sometimes be the case that the two approaches give different answers, in which case it is up to us to determine the better one. We will see one such example in the next post.

In this article we went through the theory and algorithms of Monte Carlo reinforcement learning. Next time we will go through an interesting real-world example: Roulette - To Play or Not to Play?

Footnotes

[1] The action-value matrix works identically to a state-value matrix. The way in which we work with action-value matrices is also similar.

[2] The ϵ parameter has a very strong influence on learning. If it is set too low, the resulting policy may not be the optimal one. On the other hand, if ϵ is too high, training slows down and may even fail to converge, as the agent is too busy trying out different routes all the time.