REINFORCE¶
Paper 
Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. [1]

The REINFORCE algorithm, also sometimes known as Vanilla Policy Gradient (VPG), is the most basic policy gradient method, and was built upon to develop more complicated methods such as PPO and TRPO. The original paper on REINFORCE is cited as [1] in the references.
This doc provides a high-level overview of the algorithm and its implementation in garage. For a more thorough introduction to policy gradient methods, as well as reinforcement learning in general, we encourage you to read through Spinning Up.
Note that in the codebase, both the TensorFlow and PyTorch implementations refer to this algorithm as VPG.
Overview¶
In REINFORCE, as in other policy gradient algorithms, each gradient step aims to minimize a loss function by incrementally modifying the policy network's parameters. The loss function in the original REINFORCE algorithm is given by:

$$L(\theta) = -\sum_{t} \log \pi_\theta(a_t \mid s_t)\, G_t$$

where the $\log \pi_\theta(a_t \mid s_t)$ term is the log probability of taking action $a_t$ in state $s_t$, and $G_t$ is the return: the sum of the discounted rewards from the current timestep up until the end of the episode.
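As a rough illustration of the loss described above, the following pure-Python sketch computes discounted returns for one episode and combines them with log probabilities. The helper names `discounted_returns` and `reinforce_loss`, the three-step episode, and the discount of 0.5 are assumptions made for this example; this is not how garage implements the computation.

```python
import math


def discounted_returns(rewards, discount):
    """Return G_t = sum_{k >= t} discount**(k - t) * rewards[k] for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk the episode backwards so each return reuses the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount * running
        returns[t] = running
    return returns


def reinforce_loss(log_probs, returns):
    """Return -sum_t log pi(a_t | s_t) * G_t for a single episode."""
    return -sum(lp * g for lp, g in zip(log_probs, returns))


# Hypothetical three-step episode with a reward of 1 at every step.
rewards = [1.0, 1.0, 1.0]
returns = discounted_returns(rewards, discount=0.5)

# Pretend the policy chose each action with probability 0.5.
log_probs = [math.log(0.5)] * len(rewards)
loss = reinforce_loss(log_probs, returns)
```

In a real implementation the log probabilities would come from the policy network, so that differentiating the loss yields the policy gradient.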
In practice, this loss function isn't typically used, because its performance is limited by the high variance of the return over entire episodes. To combat this, an advantage estimate is used in place of the return. The advantage is given by:

$$A(s, a) = r + \gamma V(s') - V(s)$$

where $V(s)$ is a learned value function that estimates the value of a given state, $r$ is the reward received from transitioning from state $s$ into state $s'$ by taking action $a$, and $\gamma$ is the discount rate, a hyperparameter passed to the algorithm.
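The advantage described above can be sketched for a single transition as follows. The function name `one_step_advantage` and its `terminal` flag are hypothetical; treating the value of the next state as zero at the end of an episode is a common convention assumed here, not something stated in this doc.

```python
def one_step_advantage(reward, value, next_value, discount, terminal=False):
    """Return r + discount * V(s') - V(s) for one transition.

    Assumption: when the transition ends the episode, V(s') is taken
    to be zero, so only the reward and V(s) contribute.
    """
    bootstrap = 0.0 if terminal else discount * next_value
    return reward + bootstrap - value


# Hypothetical numbers: r = 1, V(s) = 2, V(s') = 3, discount = 0.5.
advantage = one_step_advantage(reward=1.0, value=2.0, next_value=3.0,
                               discount=0.5)
final_advantage = one_step_advantage(reward=1.0, value=2.0, next_value=3.0,
                                     discount=0.5, terminal=True)
```

A positive advantage means the action did better than the value function expected from that state, and a negative one means it did worse.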
The augmented loss function then becomes:

$$L(\theta) = -\sum_{t} \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)$$

Naturally, since the value function is learned over time as more updates are performed, its imperfect estimates introduce some bias, but they also decrease the overall variance.
In garage, a technique called Generalized Advantage Estimation (GAE) is used to compute the advantage in the loss term. This introduces a hyperparameter $\lambda$ that can be used to tune the trade-off between variance and bias in each update, where $\lambda = 1$ results in the maximum variance and zero bias, and $\lambda = 0$ results in the opposite. Best results are typically achieved with $\lambda \in [0.9, 0.999]$. This resource provides a more in-depth explanation of GAE and its utility.
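To make the role of λ concrete, here is a minimal pure-Python sketch of the usual GAE recursion. The function name `gae_advantages`, the convention that `values` carries one extra entry for the state after the last step, and the example numbers are all assumptions for illustration; garage's own implementation may be organised differently.

```python
def gae_advantages(rewards, values, discount, gae_lambda):
    """Compute GAE advantages for one episode.

    Assumption: `values` has len(rewards) + 1 entries, where the last
    entry is the value of the state reached after the final step
    (0.0 if the episode terminated there).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual: r + discount * V(s') - V(s).
        delta = rewards[t] + discount * values[t + 1] - values[t]
        # Exponentially weighted sum of this and all later residuals.
        running = delta + discount * gae_lambda * running
        advantages[t] = running
    return advantages


# Hypothetical episode: three steps of reward 1, constant value estimate.
rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]

low_variance = gae_advantages(rewards, values, discount=0.5, gae_lambda=0.0)
low_bias = gae_advantages(rewards, values, discount=0.5, gae_lambda=1.0)
```

With `gae_lambda=0.0` each advantage reduces to the one-step estimate, which leans entirely on the value function. With `gae_lambda=1.0` it becomes the discounted return minus the value of the current state, which does not depend on the accuracy of later value estimates.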
Examples¶
As with all algorithms in garage, you can take a look at the examples provided in garage/examples for an idea of hyperparameter values and types. For VPG, these are:
TF¶
#!/usr/bin/env python3
"""This is an example to train a task with VPG algorithm.

Here it runs CartPole-v1 environment with 100 iterations.

Results:
    AverageReturn: 100
    RiseTime: itr 16

"""
from garage import wrap_experiment
from garage.envs import GymEnv
from garage.experiment.deterministic import set_seed
from garage.np.baselines import LinearFeatureBaseline
from garage.tf.algos import VPG
from garage.tf.policies import CategoricalMLPPolicy
from garage.trainer import TFTrainer


@wrap_experiment
def vpg_cartpole(ctxt=None, seed=1):
    """Train VPG with CartPole-v1 environment.

    Args:
        ctxt (garage.experiment.ExperimentContext): The experiment
            configuration used by Trainer to create the snapshotter.
        seed (int): Used to seed the random number generator to produce
            determinism.

    """
    set_seed(seed)
    with TFTrainer(snapshot_config=ctxt) as trainer:
        env = GymEnv('CartPole-v1')

        policy = CategoricalMLPPolicy(name='policy',
                                      env_spec=env.spec,
                                      hidden_sizes=(32, 32))

        baseline = LinearFeatureBaseline(env_spec=env.spec)

        algo = VPG(env_spec=env.spec,
                   policy=policy,
                   baseline=baseline,
                   discount=0.99,
                   optimizer_args=dict(learning_rate=0.01, ))

        trainer.setup(algo, env)
        trainer.train(n_epochs=100, batch_size=10000)


vpg_cartpole()
PyTorch¶
#!/usr/bin/env python3
"""This is an example to train a task with VPG algorithm (PyTorch).

Here it runs InvertedDoublePendulum-v2 environment with 100 iterations.

Results:
    AverageReturn: 450 - 650

"""
import torch

from garage import wrap_experiment
from garage.envs import GymEnv
from garage.experiment.deterministic import set_seed
from garage.torch.algos import VPG
from garage.torch.policies import GaussianMLPPolicy
from garage.torch.value_functions import GaussianMLPValueFunction
from garage.trainer import Trainer


@wrap_experiment
def vpg_pendulum(ctxt=None, seed=1):
    """Train VPG with InvertedDoublePendulum-v2 environment.

    Args:
        ctxt (garage.experiment.ExperimentContext): The experiment
            configuration used by Trainer to create the snapshotter.
        seed (int): Used to seed the random number generator to produce
            determinism.

    """
    set_seed(seed)
    env = GymEnv('InvertedDoublePendulum-v2')

    trainer = Trainer(ctxt)

    policy = GaussianMLPPolicy(env.spec,
                               hidden_sizes=[64, 64],
                               hidden_nonlinearity=torch.tanh,
                               output_nonlinearity=None)

    value_function = GaussianMLPValueFunction(env_spec=env.spec,
                                              hidden_sizes=(32, 32),
                                              hidden_nonlinearity=torch.tanh,
                                              output_nonlinearity=None)

    algo = VPG(env_spec=env.spec,
               policy=policy,
               value_function=value_function,
               discount=0.99,
               center_adv=False)

    trainer.setup(algo, env)
    trainer.train(n_epochs=100, batch_size=10000)


vpg_pendulum()
References¶
[1] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
This page was authored by Mishari Aliesa (@maliesa96).