Implement a New Algorithm¶
In this section, we will describe how to implement an RL algorithm using garage. Note that this section assumes some level of familiarity with reinforcement learning. For a more gentle introduction to the field of reinforcement learning as a whole, we recommend consulting OpenAI’s Spinning Up.
We will start by introducing the core RLAlgorithm API used in garage, then show how to implement the classical REINFORCE [1] algorithm, also known as the "vanilla" policy gradient (VPG).
Algorithm API¶
All RL algorithms used with garage implement a small interface that allows accessing important services such as snapshotting, "plotting" (visualization of the current policy in the environment), and resuming.
The interface requires a single method, train(trainer), which takes a garage.experiment.Trainer. The interface is defined in garage.np.algos.RLAlgorithm, but inheriting from this class isn't necessary.
Some additional functionality (such as sampling and plotting) requires additional fields to exist.
"""Interface of RLAlgorithm."""
import abc
class RLAlgorithm(abc.ABC):
"""Base class for all the algorithms.
Note:
        If the field sampler_cls exists, it will be used by Trainer.setup to
initialize a sampler.
"""
# pylint: disable=too-few-public-methods
@abc.abstractmethod
def train(self, trainer):
"""Obtain samplers and start actual training for each epoch.
Args:
trainer (Trainer): Trainer is passed to give algorithm
the access to trainer.step_epochs(), which provides services
such as snapshotting and sampler control.
"""
In order to implement snapshotting and resuming, instances of RLAlgorithm are also expected to support the Python standard library's pickle interface. Garage primitives such as environments, policies, Q functions, and value functions already implement this interface, so no work is typically required to implement it.
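For instance, the trivial algorithm below (a throwaway sketch, not part of garage) only stores picklable fields, so snapshotting it works out of the box:
import pickle


class ExampleAlgorithm:
    """A do-nothing algorithm whose state is just picklable fields."""

    def __init__(self):
        self.max_episode_length = 200

    def train(self, trainer):
        for epoch in trainer.step_epochs():
            pass


# Snapshotting and resuming rely on this round trip working.
restored = pickle.loads(pickle.dumps(ExampleAlgorithm()))
assert restored.max_episode_length == 200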
Basic Setup¶
Garage components are fairly weakly coupled, meaning that different parts can be used independently. However, for the purposes of this tutorial we'll use the parts together in the way that's generally recommended.
At the core of garage is the assumption that the algorithm runs a series of “epochs”, which are a unit of time small enough that most services, such as logging, will only have new results once per epoch.
The current epoch is controlled by the algorithm using Trainer.step_epochs().
class MyAlgorithm:
def train(self, trainer):
epoch_stepper = trainer.step_epochs()
print('It is epoch 0')
next(epoch_stepper)
print('It is epoch 1')
next(epoch_stepper)
print('It is epoch 2')
In practice, it’s used in a loop like this:
class MyAlgorithm:
def train(self, trainer):
for epoch in trainer.step_epochs():
print('It is epoch', epoch)
Each time the epoch is stepped, various “services” update. For example, logs are synchronized, snapshotting (for later resuming) may occur, the plotter will update, etc.
When an experiment is resumed, train() will be called again, but the first epoch yielded by step_epochs() will be the one after the snapshot.
In order to use the Trainer, we'll need to set up a log directory. This can be done manually, but for this tutorial we'll use the wrap_experiment function to do that for us.
We’ll also want an environment to test our algorithm with.
from garage import wrap_experiment
from garage.envs import PointEnv
from garage.experiment import Trainer
@wrap_experiment
def debug_my_algorithm(ctxt):
trainer = Trainer(ctxt)
env = PointEnv()
algo = MyAlgorithm()
trainer.setup(algo, env)
trainer.train(n_epochs=3)
debug_my_algorithm()
With the above file and the MyAlgorithm definition, it should be possible to run MyAlgorithm and get output like the following:
2020-07-22 23:32:34 | [debug_my_algorithm] Logging to /home/ruofu/garage/data/local/experiment/debug_my_algorithm
2020-07-22 23:32:34 | [debug_my_algorithm] Obtaining samples...
It is epoch 0
2020-07-22 23:32:34 | [debug_my_algorithm] epoch #0 | Saving snapshot...
2020-07-22 23:32:34 | [debug_my_algorithm] epoch #0 | Saved
2020-07-22 23:32:34 | [debug_my_algorithm] epoch #0 | Time 0.01 s
2020-07-22 23:32:34 | [debug_my_algorithm] epoch #0 | EpochTime 0.01 s
------------- -
TotalEnvSteps 0
------------- -
It is epoch 1
2020-07-22 23:32:34 | [debug_my_algorithm] epoch #1 | Saving snapshot...
2020-07-22 23:32:34 | [debug_my_algorithm] epoch #1 | Saved
2020-07-22 23:32:34 | [debug_my_algorithm] epoch #1 | Time 0.01 s
2020-07-22 23:32:34 | [debug_my_algorithm] epoch #1 | EpochTime 0.00 s
------------- -
TotalEnvSteps 0
------------- -
It is epoch 2
2020-07-22 23:32:34 | [debug_my_algorithm] epoch #2 | Saving snapshot...
2020-07-22 23:32:34 | [debug_my_algorithm] epoch #2 | Saved
2020-07-22 23:32:34 | [debug_my_algorithm] epoch #2 | Time 0.02 s
2020-07-22 23:32:34 | [debug_my_algorithm] epoch #2 | EpochTime 0.01 s
------------- -
TotalEnvSteps 0
------------- -
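Since a snapshot was saved at every epoch above, the experiment can be resumed later. Here is a minimal sketch of what that looks like; it assumes the Trainer.restore() and Trainer.resume() methods and the snapshot directory printed in the log above, and MyAlgorithm must still be importable so the snapshot can be unpickled:
from garage import wrap_experiment
from garage.experiment import Trainer


@wrap_experiment
def resume_my_algorithm(ctxt):
    trainer = Trainer(ctxt)
    # Load the latest snapshot from the earlier run (directory from the log above).
    trainer.restore('data/local/experiment/debug_my_algorithm')
    # train() is called again on the unpickled algorithm, and step_epochs()
    # starts at the epoch after the snapshot.
    trainer.resume()


resume_my_algorithm()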
Now that we have the basics out of the way, we can start actually doing some reinforcement learning.
Gathering Samples¶
In the above section, we set up an algorithm, but never actually explored the environment at all, as can be seen from TotalEnvSteps always being zero.
In order to collect samples from the environment, we need to construct a sampler and set it as a field in our algorithm. Then we can call trainer.obtain_samples() to get samples. We'll also need to seed the random number generators used for the experiment.
class SimpleVPG:
def __init__(self, env_spec, policy, sampler):
self.env_spec = env_spec
self.policy = policy
self._sampler = sampler
self.max_episode_length = 200
def train(self, trainer):
for epoch in trainer.step_epochs():
samples = trainer.obtain_samples(epoch)
from garage import wrap_experiment
from garage.envs import PointEnv
from garage.experiment import Trainer
from garage.experiment.deterministic import set_seed
from garage.sampler import LocalSampler
from garage.torch.policies import GaussianMLPPolicy
@wrap_experiment
def debug_my_algorithm(ctxt):
set_seed(100)
trainer = Trainer(ctxt)
env = PointEnv()
policy = GaussianMLPPolicy(env.spec)
sampler = LocalSampler(agents=policy, envs=env, max_episode_length=200)
algo = SimpleVPG(env.spec, policy, sampler)
trainer.setup(algo, env)
trainer.train(n_epochs=500, batch_size=4000)
debug_my_algorithm()
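Before writing any training code, it can help to inspect what trainer.obtain_samples() returns. In this tutorial it is treated as a list of per-episode dictionaries of NumPy arrays; the exact keys depend on the sampler and worker configuration, so the lines below are just a debugging sketch to drop into the step_epochs() loop of train():
# Inside the step_epochs() loop of train():
samples = trainer.obtain_samples(epoch)
# Each element is one episode, stored as a dict of arrays.
path = samples[0]
print('episodes collected:', len(samples))
print('keys:', sorted(path.keys()))
print('observations shape:', path['observations'].shape)
print('actions shape:', path['actions'].shape)
print('rewards shape:', path['rewards'].shape)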
Training the Policy with Samples¶
Of course, we’ll need to actually use the resulting samples to train our policy with PyTorch, TensorFlow or NumPy. In this tutorial, we’ll implement an extremely simple form of REINFORCE [1] (a.k.a. Vanilla Policy Gradient) using PyTorch and TensorFlow. We will also implement a simple Cross Entropy Method (CEM) [2] using NumPy.
PyTorch¶
import torch
import numpy as np
from garage.sampler import LocalSampler
from garage.np import discount_cumsum
class SimpleVPG:
def __init__(self, env_spec, policy, sampler):
self.env_spec = env_spec
self.policy = policy
self._sampler = sampler
self.max_episode_length = 200
self._discount = 0.99
self._policy_opt = torch.optim.Adam(self.policy.parameters(), lr=1e-3)
def train(self, trainer):
for epoch in trainer.step_epochs():
samples = trainer.obtain_samples(epoch)
self._train_once(samples)
def _train_once(self, samples):
losses = []
self._policy_opt.zero_grad()
for path in samples:
returns_numpy = discount_cumsum(path['rewards'], self._discount)
returns = torch.Tensor(returns_numpy.copy())
obs = torch.Tensor(path['observations'])
actions = torch.Tensor(path['actions'])
dist = self.policy(obs)[0]
log_likelihoods = dist.log_prob(actions)
loss = (-log_likelihoods * returns).mean()
loss.backward()
losses.append(loss.item())
self._policy_opt.step()
return np.mean(losses)
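The key quantity here is the discounted return-to-go computed by discount_cumsum: element i is the sum of rewards from step i onward, each weighted by the discount raised to its delay. A quick sanity check of that behaviour:
import numpy as np

from garage.np import discount_cumsum

rewards = np.array([1.0, 1.0, 1.0])
# index 2: 1.0
# index 1: 1.0 + 0.99 * 1.0                 = 1.99
# index 0: 1.0 + 0.99 * 1.0 + 0.99**2 * 1.0 = 2.9701
print(discount_cumsum(rewards, 0.99))  # [2.9701 1.99   1.    ]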
That lets us train a policy, but it doesn't let us confirm that it actually works. We can add a little logging to the train() method.
from garage import log_performance, EpisodeBatch
...
def train(self, trainer):
for epoch in trainer.step_epochs():
samples = trainer.obtain_samples(epoch)
log_performance(
epoch,
                EpisodeBatch.from_list(self.env_spec, samples),
self._discount)
self._train_once(samples)
For completeness, the full experiment file (examples/torch/tutorial_vpg.py) can be found in the garage repository.
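Assembled from the snippets above, a minimal version of that file might look like the following (a sketch for reference; the example in the repository may differ in details):
import numpy as np
import torch

from garage import EpisodeBatch, log_performance, wrap_experiment
from garage.envs import PointEnv
from garage.experiment import Trainer
from garage.experiment.deterministic import set_seed
from garage.np import discount_cumsum
from garage.sampler import LocalSampler
from garage.torch.policies import GaussianMLPPolicy


class SimpleVPG:
    """A very simple REINFORCE (VPG) implementation using PyTorch."""

    def __init__(self, env_spec, policy, sampler):
        self.env_spec = env_spec
        self.policy = policy
        self._sampler = sampler
        self.max_episode_length = 200
        self._discount = 0.99
        self._policy_opt = torch.optim.Adam(self.policy.parameters(), lr=1e-3)

    def train(self, trainer):
        for epoch in trainer.step_epochs():
            samples = trainer.obtain_samples(epoch)
            log_performance(epoch,
                            EpisodeBatch.from_list(self.env_spec, samples),
                            self._discount)
            self._train_once(samples)

    def _train_once(self, samples):
        losses = []
        self._policy_opt.zero_grad()
        for path in samples:
            # Discounted return-to-go for every timestep in the episode.
            returns_numpy = discount_cumsum(path['rewards'], self._discount)
            returns = torch.Tensor(returns_numpy.copy())
            obs = torch.Tensor(path['observations'])
            actions = torch.Tensor(path['actions'])
            dist = self.policy(obs)[0]
            log_likelihoods = dist.log_prob(actions)
            loss = (-log_likelihoods * returns).mean()
            loss.backward()
            losses.append(loss.item())
        self._policy_opt.step()
        return np.mean(losses)


@wrap_experiment
def tutorial_vpg(ctxt=None):
    set_seed(100)
    trainer = Trainer(ctxt)
    env = PointEnv()
    policy = GaussianMLPPolicy(env.spec)
    sampler = LocalSampler(agents=policy, envs=env, max_episode_length=200)
    algo = SimpleVPG(env.spec, policy, sampler)
    trainer.setup(algo, env)
    trainer.train(n_epochs=500, batch_size=4000)


tutorial_vpg()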
Running the experiment file should print outputs like the following. The policy should solve the PointEnv after 100 epochs (i.e. the Evaluation/SuccessRate reaches 1).
2020-07-24 15:30:32 | [tutorial_vpg] Logging to /home/ruofu/garage/data/local/experiment/tutorial_vpg
Sampling [####################################] 100%
2020-07-24 15:30:36 | [tutorial_vpg] epoch #0 | Saving snapshot...
2020-07-24 15:30:36 | [tutorial_vpg] epoch #0 | Saved
2020-07-24 15:30:36 | [tutorial_vpg] epoch #0 | Time 3.65 s
2020-07-24 15:30:36 | [tutorial_vpg] epoch #0 | EpochTime 3.65 s
---------------------------------- -----------
Evaluation/AverageDiscountedReturn -78.1057
Evaluation/AverageReturn -180.404
Evaluation/Iteration 0
Evaluation/MaxReturn -36.996
Evaluation/MinReturn -625.757
Evaluation/NumEpisodes 26
Evaluation/StdReturn 143.39
Evaluation/SuccessRate 0.384615
Evaluation/TerminationRate 0.384615
TotalEnvSteps 4085
---------------------------------- -----------
2020-07-24 15:30:37 | [tutorial_vpg] epoch #1 | Saving snapshot...
2020-07-24 15:30:37 | [tutorial_vpg] epoch #1 | Saved
2020-07-24 15:30:37 | [tutorial_vpg] epoch #1 | Time 4.21 s
2020-07-24 15:30:37 | [tutorial_vpg] epoch #1 | EpochTime 0.55 s
---------------------------------- -----------
Evaluation/AverageDiscountedReturn -77.1423
Evaluation/AverageReturn -186.052
Evaluation/Iteration 1
Evaluation/MaxReturn -19.9412
Evaluation/MinReturn -458.353
Evaluation/NumEpisodes 28
Evaluation/StdReturn 134.528
Evaluation/SuccessRate 0.428571
Evaluation/TerminationRate 0.428571
TotalEnvSteps 8202
---------------------------------- -----------
...
As PointEnv does not currently support visualization, if you want to visualize the policy during training you can instead solve a Gym environment, for example LunarLanderContinuous-v2, and set plot to True in trainer.train():
...
@wrap_experiment
def tutorial_vpg(ctxt=None):
set_seed(100)
trainer = Trainer(ctxt)
env = GymEnv('LunarLanderContinuous-v2')
policy = GaussianMLPPolicy(env.spec)
sampler = LocalSampler(agents=policy, envs=env, max_episode_length=200)
algo = SimpleVPG(env.spec, policy, sampler)
trainer.setup(algo, env)
trainer.train(n_epochs=500, batch_size=4000, plot=True)
...
TensorFlow¶
Up to the training step, the TensorFlow version is almost the same as the PyTorch version, except that Trainer is replaced with TFTrainer.
...
from garage import wrap_experiment
from garage.envs import PointEnv
from garage.experiment import TFTrainer
from garage.experiment.deterministic import set_seed
from garage.tf.policies import GaussianMLPPolicy
@wrap_experiment
def tutorial_vpg(ctxt=None):
set_seed(100)
with TFTrainer(ctxt) as trainer:
env = PointEnv()
policy = GaussianMLPPolicy(env.spec)
sampler = LocalSampler(agents=policy,
envs=env,
                               max_episode_length=200,
is_tf_worker=True)
algo = SimpleVPG(env.spec, policy, sampler)
trainer.setup(algo, env)
trainer.train(n_epochs=500, batch_size=4000)
...
Unlike the PyTorch version, in TensorFlow we need to build the computation graph before training the policy.
import tensorflow as tf
...
class SimpleVPG:
def __init__(self, env_spec, policy, sampler):
self.env_spec = env_spec
self.policy = policy
self._sampler = sampler
self.max_episode_length = 200
self._discount = 0.99
self.init_opt()
def init_opt(self):
observation_dim = self.policy.observation_space.flat_dim
action_dim = self.policy.action_space.flat_dim
with tf.name_scope('inputs'):
self._observation = tf.compat.v1.placeholder(
tf.float32, shape=[None, observation_dim], name='observation')
self._action = tf.compat.v1.placeholder(tf.float32,
shape=[None, action_dim],
name='action')
self._returns = tf.compat.v1.placeholder(tf.float32,
shape=[None],
name='return')
policy_dist = self.policy.build(self._observation, name='policy').dist
with tf.name_scope('loss'):
ll = policy_dist.log_prob(self._action, name='log_likelihood')
loss = -tf.reduce_mean(ll * self._returns)
with tf.name_scope('train'):
self._train_op = tf.compat.v1.train.AdamOptimizer(1e-3).minimize(
loss)
The train() method is the same, while in the _train_once() method we feed the placeholders with sample data.
def train(self, trainer):
for epoch in trainer.step_epochs():
samples = trainer.obtain_samples(epoch)
log_performance(
epoch,
EpisodeBatch.from_list(self.env_spec, samples),
self._discount)
self._train_once(samples)
def _train_once(self, samples):
obs = np.concatenate([path['observations'] for path in samples])
actions = np.concatenate([path['actions'] for path in samples])
returns = []
for path in samples:
returns.append(discount_cumsum(path['rewards'], self._discount))
returns = np.concatenate(returns)
sess = tf.compat.v1.get_default_session()
sess.run(self._train_op,
feed_dict={
self._observation: obs,
self._action: actions,
self._returns: returns,
})
return np.mean(returns)
As mentioned above, to support snapshotting and resuming, we need to make the algorithm picklable. However, we use instance variables (e.g. self._action) to store unpicklable tf.Tensor objects in the class, so we need to define __getstate__ and __setstate__ like:
def __getstate__(self):
data = self.__dict__.copy()
del data['_observation']
del data['_action']
del data['_returns']
del data['_train_op']
return data
def __setstate__(self, state):
self.__dict__ = state
self.init_opt()
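With these in place, the snapshot state no longer contains any TensorFlow objects, and __setstate__ rebuilds them by calling init_opt(). A quick check of that contract (a sketch, run inside the experiment after constructing algo):
# The placeholders and the train op are excluded from the snapshot state
# by __getstate__ and recreated on unpickling via __setstate__ -> init_opt().
state = algo.__getstate__()
assert '_observation' not in state and '_train_op' not in state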
For completeness, the full experiment file (examples/tf/tutorial_vpg.py) can be found in the garage repository.
Similar to the PyTorch version, running the experiment file should print outputs like the following. The policy should solve the PointEnv after 100 epochs (i.e. the Evaluation/SuccessRate reaches 1).
2020-07-24 15:31:44 | [tutorial_vpg] Logging to /home/ruofu/garage/data/local/experiment/tutorial_vpg_1
2020-07-24 15:31:45 | [tutorial_vpg] Obtaining samples...
Sampling [####################################] 100%
2020-07-24 15:31:50 | [tutorial_vpg] epoch #0 | Saving snapshot...
2020-07-24 15:31:51 | [tutorial_vpg] epoch #0 | Saved
2020-07-24 15:31:51 | [tutorial_vpg] epoch #0 | Time 5.25 s
2020-07-24 15:31:51 | [tutorial_vpg] epoch #0 | EpochTime 5.25 s
---------------------------------- ----------
Evaluation/AverageDiscountedReturn -376.475
Evaluation/AverageReturn -1035.36
Evaluation/Iteration 0
Evaluation/MaxReturn -969.42
Evaluation/MinReturn -1090.39
Evaluation/NumEpisodes 20
Evaluation/StdReturn 35.3741
Evaluation/SuccessRate 0
Evaluation/TerminationRate 0
TotalEnvSteps 4000
---------------------------------- ----------
Sampling [####################################] 100%
2020-07-24 15:31:53 | [tutorial_vpg] epoch #1 | Saving snapshot...
2020-07-24 15:31:53 | [tutorial_vpg] epoch #1 | Saved
2020-07-24 15:31:53 | [tutorial_vpg] epoch #1 | Time 7.42 s
2020-07-24 15:31:53 | [tutorial_vpg] epoch #1 | EpochTime 2.16 s
---------------------------------- ----------
Evaluation/AverageDiscountedReturn -376.199
Evaluation/AverageReturn -1044.4
Evaluation/Iteration 1
Evaluation/MaxReturn -865.945
Evaluation/MinReturn -1154.95
Evaluation/NumEpisodes 20
Evaluation/StdReturn 69.6729
Evaluation/SuccessRate 0
Evaluation/TerminationRate 0
TotalEnvSteps 8000
---------------------------------- ----------
...
NumPy¶
We will implement CEM with NumPy, and train the CategoricalMLPPolicy to solve CartPole-v1. The experiment function is similar to that of TensorFlow:
from garage import wrap_experiment
from garage.envs import GymEnv
from garage.experiment import TFTrainer
from garage.experiment.deterministic import set_seed
from garage.sampler import LocalSampler
from garage.tf.policies import CategoricalMLPPolicy
@wrap_experiment
def tutorial_cem(ctxt=None):
set_seed(100)
with TFTrainer(ctxt) as trainer:
env = GymEnv('CartPole-v1')
policy = CategoricalMLPPolicy(env.spec)
sampler = LocalSampler(agents=policy,
envs=env,
max_episode_length=200,
is_tf_worker=True)
algo = SimpleCEM(env.spec, policy, sampler)
trainer.setup(algo, env)
trainer.train(n_epochs=100, batch_size=1000)
When training the policy, we use the policy.get_param_values() method to get the initial parameters of the policy, and policy.set_param_values() to update them.
import numpy as np
from garage.np import discount_cumsum
from garage.sampler import LocalSampler
class SimpleCEM:
def __init__(self, env_spec, policy, sampler):
self.env_spec = env_spec
self.policy = policy
self._sampler = sampler
self.max_episode_length = 200
self._discount = 0.99
self._extra_std = 1
self._extra_decay_time = 100
self._n_samples = 20
self._n_best = 1
self._cur_std = 1
self._cur_mean = self.policy.get_param_values()
self._all_avg_returns = []
self._all_params = [self._cur_mean.copy()]
self._cur_params = None
def train(self, trainer):
for epoch in trainer.step_epochs():
samples = trainer.obtain_samples(epoch)
log_performance(
epoch,
EpisodeBatch.from_list(self.env_spec, samples),
self._discount)
self._train_once(epoch, samples)
def _train_once(self, epoch, paths):
returns = []
for path in paths:
returns.append(discount_cumsum(path['rewards'], self._discount))
avg_return = np.mean(np.concatenate(returns))
self._all_avg_returns.append(avg_return)
if (epoch + 1) % self._n_samples == 0:
avg_rtns = np.array(self._all_avg_returns)
best_inds = np.argsort(-avg_rtns)[:self._n_best]
best_params = np.array(self._all_params)[best_inds]
self._cur_mean = best_params.mean(axis=0)
self._cur_std = best_params.std(axis=0)
self.policy.set_param_values(self._cur_mean)
avg_return = max(self._all_avg_returns)
self._all_avg_returns.clear()
self._all_params.clear()
self._cur_params = self._sample_params(epoch)
self._all_params.append(self._cur_params.copy())
self.policy.set_param_values(self._cur_params)
return avg_return
def _sample_params(self, epoch):
extra_var_mult = max(1.0 - epoch / self._extra_decay_time, 0)
sample_std = np.sqrt(
np.square(self._cur_std) +
np.square(self._extra_std) * extra_var_mult)
return np.random.standard_normal(len(
self._cur_mean)) * sample_std + self._cur_mean
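Note the extra exploration noise in _sample_params: with the hyperparameters above, the sampling standard deviation starts above _cur_std and decays back to it over _extra_decay_time epochs. A small illustration of that schedule, using the initial values from the code above:
import numpy as np

cur_std, extra_std, extra_decay_time = 1.0, 1.0, 100

for epoch in (0, 50, 100):
    extra_var_mult = max(1.0 - epoch / extra_decay_time, 0)
    sample_std = np.sqrt(cur_std**2 + extra_std**2 * extra_var_mult)
    print(epoch, round(float(sample_std), 3))
# 0 -> 1.414, 50 -> 1.225, 100 -> 1.0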
You can see the full experiment file in the garage repository.
Running the experiment file should print outputs like the following. If you want to visualize the policy during training, you can set plot to True in trainer.train(), as mentioned in the PyTorch section above.
2020-07-24 15:33:49 | [tutorial_cem] Logging to /home/ruofu/garage/data/local/experiment/tutorial_cem
2020-07-24 15:33:50 | [tutorial_cem] Obtaining samples...
Sampling [####################################] 100%
2020-07-24 15:33:54 | [tutorial_cem] epoch #0 | Saving snapshot...
2020-07-24 15:33:54 | [tutorial_cem] epoch #0 | Saved
2020-07-24 15:33:54 | [tutorial_cem] epoch #0 | Time 3.52 s
2020-07-24 15:33:54 | [tutorial_cem] epoch #0 | EpochTime 3.52 s
---------------------------------- ---------
Evaluation/AverageDiscountedReturn 20.0163
Evaluation/AverageReturn 22.5333
Evaluation/Iteration 0
Evaluation/MaxReturn 52
Evaluation/MinReturn 10
Evaluation/NumEpisodes 45
Evaluation/StdReturn 7.9822
Evaluation/TerminationRate 1
TotalEnvSteps 1014
---------------------------------- ---------
2020-07-24 15:33:54 | [tutorial_cem] epoch #1 | Saving snapshot...
2020-07-24 15:33:54 | [tutorial_cem] epoch #1 | Saved
2020-07-24 15:33:54 | [tutorial_cem] epoch #1 | Time 4.03 s
2020-07-24 15:33:54 | [tutorial_cem] epoch #1 | EpochTime 0.50 s
---------------------------------- ----------
Evaluation/AverageDiscountedReturn 15.7595
Evaluation/AverageReturn 17.1017
Evaluation/Iteration 1
Evaluation/MaxReturn 24
Evaluation/MinReturn 13
Evaluation/NumEpisodes 59
Evaluation/StdReturn 2.75985
Evaluation/TerminationRate 1
TotalEnvSteps 2023
---------------------------------- ----------
...
References¶
- [1] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
- [2] Reuven Y. Rubinstein and Dirk P. Kroese. The Cross-Entropy Method: A Unified Approach to Monte Carlo Simulation, Randomized Optimization and Machine Learning. Information Science & Statistics, Springer Verlag, NY, 2004.
This page was authored by K.R. Zentner (@krzentner) and Ruofu Wang (@yeukfu).