# REINFORCE (VPG)

- Paper: Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. [1]
- Framework(s): PyTorch, TensorFlow
- API Reference: garage.torch.algos.VPG, garage.tf.algos.VPG
- Code: garage/torch/algos/vpg.py, garage/tf/algos/vpg.py
- Examples: examples

The REINFORCE algorithm, also sometimes known as Vanilla Policy Gradient (VPG), is the most basic policy gradient method, and was built upon to develop more complicated methods such as PPO and TRPO. The original paper on REINFORCE is listed in the references [1].

This doc provides a high-level overview of the algorithm and its implementation in garage. For a more thorough introduction to policy gradient methods, as well as reinforcement learning in general, we encourage you to read through Spinning Up.

Note that in the codebase, both the TensorFlow and PyTorch implementations refer to this algorithm as VPG.

## Overview

In REINFORCE as well as other policy gradient algorithms, the gradient steps taken aim to minimize a loss function by incrementally modifying the policy network’s parameters. The loss function in the original REINFORCE algorithm is given by:

$-\log(\pi(s,a)) \cdot G$

where the log term is the log probability of taking some action a at some state s, and G is the return, the sum of the discounted rewards from the current timestep up until the end of the episode.
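As a minimal sketch of the quantities just described (not garage's implementation), the return G can be computed for every timestep by walking the rewards backwards, and the loss term then follows directly. The function names `discounted_returns` and `reinforce_loss` are hypothetical, and averaging the per-timestep terms over the episode is an assumption made for illustration.

```python
import math


def discounted_returns(rewards, discount):
    """Return G_t for every timestep t of one episode.

    G_t = r_t + discount * r_{t+1} + discount^2 * r_{t+2} + ...
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount * running
        returns[t] = running
    return returns


def reinforce_loss(action_probs, returns):
    """Mean of -log(pi(s, a)) * G over the timesteps of one episode.

    action_probs[t] is the probability the policy assigned to the action
    actually taken at timestep t (assumed to be given).
    """
    terms = [-math.log(p) * g for p, g in zip(action_probs, returns)]
    return sum(terms) / len(terms)


# Illustrative numbers only: three steps with reward 1.0 each.
rewards = [1.0, 1.0, 1.0]
returns = discounted_returns(rewards, discount=0.5)
loss = reinforce_loss([0.5, 0.5, 0.5], returns)
```

Earlier timesteps receive larger returns here because more future reward is still ahead of them, so their log-probability terms are weighted more heavily in the loss.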

In practice, this loss function isn’t typically used as its performance is limited by the high variance in the return G over entire episodes. To combat this, an advantage estimate is introduced in place of the return G. The advantage is given by:

$A(s,a) = r + \gamma V(s') - V(s)$

where V(s) is a learned value function that estimates the value of a given state, r is the reward received from transitioning from state s into state s' by taking action a, and γ is the discount rate, a hyperparameter passed to the algorithm.

The augmented loss function then becomes:

$-\log(\pi(s,a)) \cdot A(s,a)$

Naturally, since the value function is learned over time as more updates are performed, its imperfect estimates introduce some bias, but they also decrease the overall variance.
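The one-step advantage and the resulting loss term can be sketched for a single transition as follows. This is an illustration under stated assumptions, not garage's code: the helper names are hypothetical, and the value estimates are plain numbers standing in for the output of a learned value function.

```python
import math


def one_step_advantage(reward, value_s, value_next, discount):
    """A(s, a) = r + discount * V(s') - V(s)."""
    return reward + discount * value_next - value_s


def advantage_loss(action_prob, advantage):
    """-log(pi(s, a)) * A(s, a) for a single transition."""
    return -math.log(action_prob) * advantage


# Hypothetical numbers: the value function predicts V(s) = 1.5 and
# V(s') = 2.0, and the transition yields a reward of 1.0.
adv = one_step_advantage(reward=1.0, value_s=1.5, value_next=2.0,
                         discount=0.5)
loss = advantage_loss(action_prob=0.5, advantage=adv)
```

With these numbers the advantage is positive, meaning the action turned out better than the value function expected, so minimizing the loss pushes the policy toward a higher probability for that action.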

In garage, a technique called Generalized Advantage Estimation (GAE) is used to compute the advantage in the loss term. This introduces a hyperparameter λ that can be used to tune the amount of variance vs bias in each update, where λ=1 results in the maximum variance and zero bias, and λ=0 results in the opposite. Best results are typically achieved with λ ∈ [0.9, 0.999]. This resource provides a more in-depth explanation of GAE and its utility.
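To make the role of λ concrete, here is a minimal sketch of GAE for a single finished episode. It is not taken from garage; the function name `gae_advantages` and its argument layout (a `values` list with one more entry than `rewards`, ending in the value of the final state) are assumptions for illustration.

```python
def gae_advantages(rewards, values, discount, gae_lambda):
    """Generalized Advantage Estimation for one finished episode.

    values holds V(s_0) ... V(s_T), one more entry than rewards; the last
    entry is the value of the final state (0.0 if the episode terminated).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual: r + discount * V(s') - V(s).
        delta = rewards[t] + discount * values[t + 1] - values[t]
        # Exponentially weighted sum of the residuals that follow.
        running = delta + discount * gae_lambda * running
        advantages[t] = running
    return advantages


# Illustrative numbers only.
rewards = [1.0, 1.0, 1.0]
values = [1.0, 1.0, 0.5, 0.0]

low_variance = gae_advantages(rewards, values, discount=0.5, gae_lambda=0.0)
low_bias = gae_advantages(rewards, values, discount=0.5, gae_lambda=1.0)
```

With λ=0 each estimate is just the one-step advantage r + γV(s') - V(s), which leans entirely on the value function. With λ=1 the estimate becomes the full discounted return minus V(s), which does not depend on the accuracy of later value estimates but inherits the variance of the return.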

## Examples

As with all algorithms in garage, you can take a look at the examples provided in garage/examples for an idea of hyperparameter values and types. For VPG, these are:

### TF

```python
#!/usr/bin/env python3
"""This is an example to train a task with VPG algorithm.

Here it runs CartPole-v1 environment with 100 iterations.

Results:
    AverageReturn: 100
    RiseTime: itr 16
"""
from garage import wrap_experiment
from garage.envs import GymEnv
from garage.experiment.deterministic import set_seed
from garage.np.baselines import LinearFeatureBaseline
from garage.tf.algos import VPG
from garage.tf.policies import CategoricalMLPPolicy
from garage.trainer import TFTrainer


@wrap_experiment
def vpg_cartpole(ctxt=None, seed=1):
    """Train VPG with CartPole-v1 environment.

    Args:
        ctxt (garage.experiment.ExperimentContext): The experiment
            configuration used by Trainer to create the snapshotter.
        seed (int): Used to seed the random number generator to produce
            determinism.

    """
    set_seed(seed)
    with TFTrainer(snapshot_config=ctxt) as trainer:
        env = GymEnv('CartPole-v1')

        policy = CategoricalMLPPolicy(name='policy',
                                      env_spec=env.spec,
                                      hidden_sizes=(32, 32))

        baseline = LinearFeatureBaseline(env_spec=env.spec)

        algo = VPG(env_spec=env.spec,
                   policy=policy,
                   baseline=baseline,
                   discount=0.99,
                   optimizer_args=dict(learning_rate=0.01, ))

        trainer.setup(algo, env)
        trainer.train(n_epochs=100, batch_size=10000)


vpg_cartpole()
```


### PyTorch

```python
#!/usr/bin/env python3
"""This is an example to train a task with VPG algorithm (PyTorch).

Here it runs InvertedDoublePendulum-v2 environment with 100 iterations.

Results:
    AverageReturn: 450 - 650
"""
import torch

from garage import wrap_experiment
from garage.envs import GymEnv
from garage.experiment.deterministic import set_seed
from garage.torch.algos import VPG
from garage.torch.policies import GaussianMLPPolicy
from garage.torch.value_functions import GaussianMLPValueFunction
from garage.trainer import Trainer


@wrap_experiment
def vpg_pendulum(ctxt=None, seed=1):
    """Train VPG with InvertedDoublePendulum-v2 environment.

    Args:
        ctxt (garage.experiment.ExperimentContext): The experiment
            configuration used by Trainer to create the snapshotter.
        seed (int): Used to seed the random number generator to produce
            determinism.

    """
    set_seed(seed)
    env = GymEnv('InvertedDoublePendulum-v2')

    trainer = Trainer(ctxt)

    policy = GaussianMLPPolicy(env.spec,
                               hidden_sizes=[64, 64],
                               hidden_nonlinearity=torch.tanh,
                               output_nonlinearity=None)

    value_function = GaussianMLPValueFunction(env_spec=env.spec,
                                              hidden_sizes=(32, 32),
                                              hidden_nonlinearity=torch.tanh,
                                              output_nonlinearity=None)

    algo = VPG(env_spec=env.spec,
               policy=policy,
               value_function=value_function,
               discount=0.99)

    trainer.setup(algo, env)
    trainer.train(n_epochs=100, batch_size=10000)


vpg_pendulum()
```


## References

[1] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.