Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. [1]













The REINFORCE algorithm, sometimes also known as Vanilla Policy Gradient (VPG), is the most basic policy gradient method and was built upon to develop more complicated methods such as PPO and TRPO. The original paper on REINFORCE is cited as [1] below.

This doc provides a high-level overview of the algorithm and its implementation in garage. For a more thorough introduction to policy gradient methods, as well as reinforcement learning in general, we encourage you to read through Spinning Up.

Note that in the codebase, both the tensorflow and torch implementations refer to this algorithm as VPG.


In REINFORCE as well as other policy gradient algorithms, the gradient steps taken aim to minimize a loss function by incrementally modifying the policy network’s parameters. The loss function in the original REINFORCE algorithm is given by:

\[-\log(\pi(s,a)) \cdot G\]

where the log term is the log probability of taking some action a at some state s, and G is the return, the sum of the discounted rewards from the current timestep up until the end of the episode.
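
To make the definition of G concrete, here is a minimal sketch in plain Python of how the discounted return and the loss above could be computed for one episode. The function names `discounted_returns` and `reinforce_loss` are hypothetical and are not part of garage, and averaging the per-step loss over the episode is an assumption made for illustration.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for each t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one computed for the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def reinforce_loss(log_probs, returns):
    """Average of -log(pi(s, a)) * G over the timesteps of one episode.

    Averaging over timesteps is an assumption of this sketch.
    """
    if len(log_probs) != len(returns):
        raise ValueError("log_probs and returns must have the same length")
    total = sum(-lp * g for lp, g in zip(log_probs, returns))
    return total / len(returns)


# Three steps with reward 1.0 each and a discount rate of 0.5.
example_returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(example_returns)  # [1.75, 1.5, 1.0]
```

In a real implementation the log probabilities would come from the policy network, and the loss would be differentiated with respect to its parameters.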

In practice, this loss function isn’t typically used as its performance is limited by the high variance in the return G over entire episodes. To combat this, an advantage estimate is introduced in place of the return G. The advantage is given by:

\[A(s,a) = r + \gamma V(s') - V(s)\]

where V(s) is a learned value function that estimates the value of a given state, r is the reward received from transitioning from state s into state s' by taking action a, and γ is the discount rate, a hyperparameter passed to the algorithm.
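
As an illustration of the advantage formula, the following sketch computes the one-step advantage for every timestep of an episode. The name `one_step_advantages` is hypothetical, and the convention that `values` carries one extra entry for the state reached after the last step (0.0 if that state is terminal) is an assumption of this sketch, not something taken from garage.

```python
def one_step_advantages(rewards, values, gamma):
    """Compute A(s_t, a_t) = r_t + gamma * V(s_{t+1}) - V(s_t) for each t.

    `values` must contain V(s_0) ... V(s_T), one more entry than `rewards`.
    The last entry is the value of the state reached after the final step
    (assumed to be 0.0 when that state is terminal).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    return [
        r + gamma * values[t + 1] - values[t]
        for t, r in enumerate(rewards)
    ]


# Two steps with reward 1.0, value estimates 2.0 and 1.0, terminal value 0.0.
print(one_step_advantages([1.0, 1.0], [2.0, 1.0, 0.0], gamma=0.5))  # [-0.5, 0.0]
```

A negative advantage, as in the first step here, means the action did worse than the value function expected from that state.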

The augmented loss function then becomes:

\[-\log(\pi(s,a)) \cdot A(s,a)\]

Because the value function is itself learned over time as more updates are performed, its imperfect estimates introduce some bias into the updates, but they also decrease the overall variance.

In garage, a technique called Generalized Advantage Estimation (GAE) is used to compute the advantage in the loss term. This introduces a hyperparameter λ that tunes the trade-off between variance and bias in each update: λ=1 results in the maximum variance and zero bias, and λ=0 results in the opposite, the lowest variance and the most bias. Best results are typically achieved with λ ∈ [0.9, 0.999]. This resource provides a more in-depth explanation of GAE and its utility.
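
The role of λ can be seen in a small sketch of the standard GAE recursion, in which each advantage is the one-step advantage plus the next step's advantage scaled by γλ. This is a plain-Python illustration, not garage's implementation; the name `gae_advantages` is hypothetical, and `values` is assumed to hold one extra entry for the state after the last step (0.0 if terminal).

```python
def gae_advantages(rewards, values, gamma, lam):
    """Generalized Advantage Estimation for one episode.

    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t     = delta_t + gamma * lam * A_{t+1}

    `values` must contain one more entry than `rewards` (an assumption of
    this sketch); the last entry is the value of the final state.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Accumulate backwards so A_{t+1} is available when computing A_t.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0]
values = [0.5, 0.5, 0.0]
# lam=0 reduces to the one-step advantage (lowest variance, most bias).
print(gae_advantages(rewards, values, gamma=0.5, lam=0.0))  # [0.75, 0.5]
# lam=1 reduces to the discounted return minus V(s) (most variance, no bias).
print(gae_advantages(rewards, values, gamma=0.5, lam=1.0))  # [1.0, 0.5]
```

Values of λ between these two extremes blend the one-step estimate with longer multi-step estimates.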


As with all algorithms in garage, you can take a look at the examples provided in garage/examples for an idea of hyperparameter values and types. For VPG, these are:





[1] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

This page was authored by Mishari Aliesa (@maliesa96).