Proximal Policy Optimization (PPO)
Paper | Proximal Policy Optimization Algorithms [1]
Framework(s) |
API Reference |
Code |
Examples |
Proximal Policy Optimization (PPO) is a family of policy gradient methods that alternate between sampling data through interaction with the environment and optimizing a “surrogate” objective function using stochastic gradient ascent.
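The core of this objective is the probability ratio between the new and old policies, clipped to keep each update close to the data-collecting policy. Below is a minimal PyTorch sketch of the clipped surrogate loss described in [1]; it is illustrative only, not Garage's implementation, and the function name and arguments are assumptions:

```python
import torch


def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages,
                           clip_eps=0.2):
    """Negative PPO clipped surrogate objective (minimized by SGD)."""
    # Probability ratio pi_theta(a|s) / pi_theta_old(a|s).
    ratio = torch.exp(new_log_probs - old_log_probs)
    clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Pessimistic (lower) bound of the unclipped and clipped objectives.
    surrogate = torch.min(ratio * advantages, clipped_ratio * advantages)
    return -surrogate.mean()
```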
Garage's implementation also supports adding an entropy bonus to the objective. Two entropy approaches are supported: the maximum entropy approach adds the dense, per-step entropy to the reward at each time step, while entropy regularization adds the mean entropy to the surrogate objective. See [2] for more details. A sketch contrasting the two variants follows.
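The following sketch shows how the two options differ. It assumes torch tensors as inputs, the hypothetical `clipped_surrogate_loss` above, and an `ent_coef` weight; the helper name and signature are assumptions, not Garage's API:

```python
def apply_entropy_bonus(rewards, entropies, surrogate_loss,
                        approach='regularized', ent_coef=0.01):
    """Illustrative (hypothetical) helper showing the two entropy variants."""
    if approach == 'max_entropy':
        # Maximum entropy approach: add the dense, per-step entropy
        # to the reward signal itself.
        return rewards + ent_coef * entropies, surrogate_loss
    # Entropy regularization: add the mean entropy to the surrogate
    # objective, i.e. subtract it from the loss being minimized.
    return rewards, surrogate_loss - ent_coef * entropies.mean()
```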
References
[1] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[2] Sergey Levine. Reinforcement learning and control as probabilistic inference: tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
This page was authored by Ruofu Wang (@yeukfu).