Trust Region Policy Optimization.

class TRPO(env_spec, policy, baseline, sampler, scope=None, discount=0.99, gae_lambda=0.98, center_adv=True, positive_adv=False, fixed_horizon=False, lr_clip_range=0.01, max_kl_step=0.01, optimizer=None, optimizer_args=None, policy_ent_coeff=0.0, use_softplus_entropy=False, use_neg_logli_entropy=False, stop_entropy_gradient=False, kl_constraint='hard', entropy_method='no_entropy', name='TRPO')


Inheritance diagram of

Trust Region Policy Optimization.


  • env_spec (EnvSpec) – Environment specification.

  • policy ( – Policy.

  • baseline ( – The baseline.

  • sampler (garage.sampler.Sampler) – Sampler.

  • scope (str) – Scope for identifying the algorithm. Must be specified if running multiple algorithms simultaneously, each using different environments and policies.

  • discount (float) – Discount.

  • gae_lambda (float) – Lambda used for generalized advantage estimation.

  • center_adv (bool) – Whether to rescale the advantages so that they have mean 0 and standard deviation 1.

  • positive_adv (bool) – Whether to shift the advantages so that they are always positive. When used in conjunction with center_adv the advantages will be standardized before shifting.

  • fixed_horizon (bool) – Whether to fix horizon.

  • lr_clip_range (float) – The limit on the likelihood ratio between policies, as in PPO.

  • max_kl_step (float) – The maximum KL divergence between old and new policies, as in TRPO.

  • optimizer (object) – The optimizer of the algorithm. Should be the optimizers in

  • optimizer_args (dict) – The arguments of the optimizer.

  • policy_ent_coeff (float) – The coefficient of the policy entropy. Setting it to zero would mean no entropy regularization.

  • use_softplus_entropy (bool) – Whether to estimate the softmax distribution of the entropy to prevent the entropy from being negative.

  • use_neg_logli_entropy (bool) – Whether to estimate the entropy as the negative log likelihood of the action.

  • stop_entropy_gradient (bool) – Whether to stop the entropy gradient.

  • kl_constraint (str) – KL constraint, either ‘hard’ or ‘soft’.

  • entropy_method (str) – A string from: ‘max’, ‘regularized’, ‘no_entropy’. The type of entropy method to use. ‘max’ adds the dense entropy to the reward for each time step. ‘regularized’ adds the mean entropy to the surrogate objective. See for more details.

  • name (str) – The name of the algorithm.


Obtain samplers and start actual training for each epoch.


trainer (Trainer) – Experiment trainer, which rovides services such as snapshotting and sampler control.


The average return in last epoch cycle.

Return type