Implementing New Algorithms (Basic)

In this section, we will walk through the implementation of the classical REINFORCE [1] algorithm, also known as the “vanilla” policy gradient. We will start with an implementation that works with a fixed policy and environment. The next section Implementing New Algorithms (Advanced) will improve upon this implementation, utilizing the functionalities provided by the framework to make it more structured and command-line friendly.


First, let’s briefly review the algorithm along with some notations. We work with an MDP defined by the tuple \((\mathcal{S}, \mathcal{A}, P, r, \mu_0, \gamma, T)\), where \(\mathcal{S}\) is a set of states, \(\mathcal{A}\) is a set of actions, \(P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]\) is the transition probability, \(r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}\) is the reward function, \(\mu_0: \mathcal{S} \to [0, 1]\) is the initial state distribution, \(\gamma \in [0, 1]\) is the discount factor, and \(T \in \mathbb{N}\) is the horizon. REINFORCE directly optimizes a parameterized stochastic policy \(\pi_\theta: \mathcal{S} \times \mathcal{A} \to [0, 1]\) by performing gradient ascent on the expected return objective:

\[\eta(\theta) = \mathbb{E}\left[\sum_{t=0}^T \gamma^t r(s_t, a_t)\right]\]

where the expectation is implicitly taken over all possible trajectories, following the sampling procedure \(s_0 \sim \mu_0\), \(a_t \sim \pi_\theta(\cdot | s_t)\), and \(s_{t+1} \sim P(\cdot | s_t, a_t)\). By the likelihood ratio trick, the gradient of the objective with respect to \(\theta\) is given by

\[\nabla_\theta \eta(\theta) = \mathbb{E}\left[\left(\sum_{t=0}^T \gamma^t r(s_t, a_t)\right) \left(\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t | s_t) \right)\right]\]

We can reduce the variance of this estimator by noting that for \(t' < t\),

\[\mathbb{E}\left[ r(s_{t'}, a_{t'}) \nabla_\theta \log \pi_\theta(a_t | s_t) \right] = 0\]


\[\nabla_\theta \eta(\theta) = \mathbb{E}\left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t | s_t) \sum_{t'=t}^T \gamma^{t'} r(s_{t'}, a_{t'}) \right]\]

Often, we use the following estimator instead:

\[\nabla_\theta \eta(\theta) = \mathbb{E}\left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t | s_t) \sum_{t'=t}^T \gamma^{t'-t} r(s_{t'}, a_{t'}) \right]\]

where \(\gamma^{t'}\) is replaced by \(\gamma^{t'-t}\). When viewing the discount factor as a variance reduction factor for the undiscounted objective, this alternative gradient estimator has less bias, at the expense of having a larger variance. We define \(R_t := \sum_{t'=t}^T \gamma^{t'-t} r(s_{t'}, a_{t'})\) as the empirical discounted return.

The above formula will be the central object of our implementation. The pseudocode for the whole algorithm is as below:

  • Initialize policy \(\pi\) with parameter \(\theta_1\).

  • For iteration \(k = 1, 2, \ldots\):

    • Sample N trajectories \(\tau_1\), …, \(\tau_n\) under the current policy \(\theta_k\), where \(\tau_i = (s_t^i, a_t^i, R_t^i)_{t=0}^{T-1}\). Note that the last state is dropped since no action is taken after observing the last state.
    • Compute the empirical policy gradient:
    \[\widehat{\nabla_\theta \eta(\theta)} = \frac{1}{NT} \sum_{i=1}^N \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^i | s_t^i) R_t^i\]
    • Take a gradient step: \(\theta_{k+1} = \theta_k + \alpha \widehat{\nabla_\theta \eta(\theta)}\).


As a start, we will try to solve the cartpole balancing task using a neural network policy. We will later generalize our algorithm to accept configuration parameters. But let’s keep things simple for now.

import gym
import numpy as np

from garage.envs import normalize
from import CategoricalMLPPolicy

# normalize() makes sure that the actions for the environment lies
# within the range [-1, 1] (only works for environments with continuous actions)
env = normalize(gym.make('CartPole-v0'))
# Initialize a neural network policy with a single hidden layer of 8 hidden units
policy = CategoricalMLPPolicy(env.spec, hidden_sizes=(8,))

# We will collect 100 trajectories per iteration
N = 100
# Each trajectory will have at most 100 time steps
T = 100
# Number of iterations
n_itr = 100
# Set the discount factor for the problem
discount = 0.99
# Learning rate for the gradient update
learning_rate = 0.01

Collecting Samples

Now, let’s collect samples for the environment under our current policy within a single iteration.

paths = []

for _ in range(N):
    observations = []
    actions = []
    rewards = []

    observation = env.reset()

    for _ in range(T):
        # policy.get_action() returns a pair of values. The second one returns a dictionary, whose values contains
        # sufficient statistics for the action distribution. It should at least contain entries that would be
        # returned by calling policy.dist_info(), which is the non-symbolic analog of policy.dist_info_sym().
        # Storing these statistics is useful, e.g., when forming importance sampling ratios. In our case it is
        # not needed.
        action, _ = policy.get_action(observation)
        # Recall that the last entry of the tuple stores diagnostic information about the environment. In our
        # case it is not needed.
        next_observation, reward, terminal, _ = env.step(action)
        observation = next_observation
        if terminal:
            # Finish rollout if terminal state reached

    # We need to compute the empirical return for each time step along the
    # trajectory
    returns = []
    return_so_far = 0
    for t in range(len(rewards) - 1, -1, -1):
        return_so_far = rewards[t] + discount * return_so_far
    # The returns are stored backwards in time, so we need to revert it
    returns = returns[::-1]


Observe that according to the formula for the empirical policy gradient, we could concatenate all the collected data for different trajectories together, which helps us vectorize the implementation further.

observations = np.concatenate([p['observations'] for p in paths])
actions = np.concatenate([p['actions'] for p in paths])
returns = np.concatenate([p['returns'] for p in paths])

Constructing the Computation Graph

We will use TensorFlow for our implementation, and we assume that the reader has some familiarity with it.

First, we construct symbolic variables for the input data:

# Create a TensorFlow variable for storing the observations
# We could have simply written `observations_var = TT.matrix('observations')` instead for this example. However,
# doing it in a slightly more abstract way allows us to delegate to the environment for handling the correct data
# type for the variable. For instance, for an environment with discrete observations, we might want to use integer
# types if the observations are represented as one-hot vectors.
observations_var = env.observation_space.to_tf_placeholder(
    # It should have 1 extra dimension since we want to represent a list of observations
actions_var = env.action_space.to_tf_placeholder(
returns_var = tf.compat.v1.placeholder(name='returns')

Note that we can transform the policy gradient formula as

\[\widehat{\nabla_\theta \eta(\theta)} = \nabla_\theta \left( \frac{1}{NT} \sum_{i=1}^N \sum_{t=0}^{T-1} \log \pi_\theta(a_t^i | s_t^i) R_t^i \right) = \nabla_\theta L(\theta)\]

where \(L(\theta) = \frac{1}{NT} \sum_{i=1}^N \sum_{t=0}^{T-1} \log \pi_\theta(a_t^i | s_t^i) R_t^i\) is called the surrogate function. Hence, we can first construct the computation graph for \(L(\theta)\), and then take its gradient to get the empirical policy gradient.

# policy.dist_info_sym returns a dictionary, whose values are symbolic expressions for quantities related to the
# distribution of the actions. For a Gaussian policy, it contains the mean and (log) standard deviation.
dist_info_vars = policy.dist_info_sym(observations_var, actions_var)

# policy.distribution returns a distribution object under garage.distributions. It contains many utilities for computing
# distribution-related quantities, given the computed dist_info_vars. Below we use dist.log_likelihood_sym to compute
# the symbolic log-likelihood. For this example, the corresponding distribution is an instance of the class
# garage.distributions.DiagonalGaussian
dist = policy.distribution

# Note that we negate the objective, since most optimizers assume a
# minimization problem
surr = - tf.reduce_mean(dist.log_likelihood_sym(actions_var, dist_info_vars) * returns_var)

# Get the list of trainable parameters.
params = policy.get_params(trainable=True)
grads = tf.gradient(surr, params)

Gradient Update and Diagnostics

We are almost done! Now, you can use your favorite stochastic optimization algorithm for performing the parameter update. We choose ADAM [2] in our implementation:

f_train = tensor_utils.compile_function(
    inputs=[observations_var, actions_var, returns_var],
    outputs=adam(grads, params, learning_rate=learning_rate),
f_train(observations, actions, returns)

Since this algorithm is on-policy, we can evaluate its performance by inspecting the collected samples:

print('Average Return:', np.mean([sum(path['rewards']) for path in paths]))

The complete source code so far is available at examples/tf/

Additional Tricks

Adding a Baseline

The variance of the policy gradient can be further reduced by adding a baseline. The refined formula is given by

\[\widehat{\nabla_\theta \eta(\theta)} = \frac{1}{NT} \sum_{i=1}^N \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^i | s_t^i) (R_t^i - b(s_t^i))\]

We can do this since \(\mathbb{E} \left[\nabla_\theta \log \pi_\theta(a_t^i | s_t^i) b(s_t^i)\right] = 0\)

The baseline is typically implemented as an estimator of \(V^\pi(s)\). In this case, \(R_t^i - b(s_t^i)\) is an estimator of \(A^\pi(s_t^i, a_t^i)\). The framework implements a few options for the baseline. A good balance of computational efficiency and accuracy is achieved by a linear baseline using state features, available at garage/np/baselines/ To use it in our implementation, the relevant code looks like the following:

# ... initialization code ...

from import LinearFeatureBaseline
baseline = LinearFeatureBaseline(env.spec)

# ... inside the loop for each episode, after the samples are collected

path = dict(

path_baseline = baseline.predict(path)
advantages = []
returns = []
return_so_far = 0
for t in range(len(rewards) - 1, -1, -1):
    return_so_far = rewards[t] + discount * return_so_far
    advantage = return_so_far - path_baseline[t]
# The advantages are stored backwards in time, so we need to revert it
advantages = np.array(advantages[::-1])
# And we need to do the same thing for the list of returns
returns = np.array(returns[::-1])

Normalizing the returns

Currently, the learning rate we set for the algorithm is very susceptible to reward scaling. We can alleviate this dependency by whitening the advantages before computing the gradients. In terms of code, this would be:

advantages = (advantages - np.mean(advantages)) / (np.std(advantages) + 1e-8)

Training the baseline

After each iteration, we use the newly collected trajectories to train our baseline:

The reason that this is executed after computing the baselines along the given trajectories is that in the extreme case, if we only have one trajectory starting from each state, and if the baseline could fit the data perfectly, then all the advantages would be zero, giving us no gradient signals at all.

Now, we can train the policy much faster (we need to change the learning rate accordingly because of the rescaling). The complete source code so far is available at examples/tf/

[1]Williams, Ronald J. “Simple statistical gradient-following algorithms for connectionist reinforcement learning.” Machine learning 8.3-4 (1992): 229-256.
[2]Kingma, Diederik P., and Jimmy Ba Adam. “A method for stochastic optimization.” International Conference on Learning Representation. 2015.