Proximal Policy Optimization (PPO)
Paper | Proximal Policy Optimization Algorithms [1]
Framework(s) |
API Reference |
Code |
Examples |
Proximal Policy Optimization (PPO) is a family of policy gradient methods that alternate between sampling data through interaction with the environment and optimizing a “surrogate” objective function using stochastic gradient ascent.
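The core of this objective is the probability ratio between the new and old policies, clipped to keep each update close to the data-collecting policy. Below is a minimal PyTorch sketch of the clipped surrogate loss described in [1]; it is illustrative only, not Garage's implementation, and the function name and arguments are assumptions:

```python
import torch


def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages,
                           clip_eps=0.2):
    """Negative PPO clipped surrogate objective (minimized by SGD)."""
    # Probability ratio pi_theta(a|s) / pi_theta_old(a|s).
    ratio = torch.exp(new_log_probs - old_log_probs)
    clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Pessimistic (lower) bound of the unclipped and clipped objectives.
    surrogate = torch.min(ratio * advantages, clipped_ratio * advantages)
    return -surrogate.mean()
```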
Garage's implementation also supports adding an entropy bonus to the objective. Two entropy approaches are supported: the maximum entropy approach adds the dense, per-step entropy to the reward at each time step, while entropy regularization adds the mean entropy to the surrogate objective. See [2] for more details. A sketch contrasting the two variants follows.
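The following sketch shows how the two options differ. It assumes torch tensors as inputs, the hypothetical `clipped_surrogate_loss` above, and an `ent_coef` weight; the helper name and signature are assumptions, not Garage's API:

```python
def apply_entropy_bonus(rewards, entropies, surrogate_loss,
                        approach='regularized', ent_coef=0.01):
    """Illustrative (hypothetical) helper showing the two entropy variants."""
    if approach == 'max_entropy':
        # Maximum entropy approach: add the dense, per-step entropy
        # to the reward signal itself.
        return rewards + ent_coef * entropies, surrogate_loss
    # Entropy regularization: add the mean entropy to the surrogate
    # objective, i.e. subtract it from the loss being minimized.
    return rewards, surrogate_loss - ent_coef * entropies.mean()
```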
References
[1] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[2] Sergey Levine. Reinforcement learning and control as probabilistic inference: tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
This page was authored by Ruofu Wang (@yeukfu).