Trust Region Policy Optimization.

class TRPO(env_spec, policy, value_function, sampler, policy_optimizer=None, vf_optimizer=None, num_train_per_epoch=1, discount=0.99, gae_lambda=0.98, center_adv=True, positive_adv=False, policy_ent_coeff=0.0, use_softplus_entropy=False, stop_entropy_gradient=False, entropy_method='no_entropy')

Bases: garage.torch.algos.VPG


Trust Region Policy Optimization (TRPO).

Parameters

  • env_spec (EnvSpec) – Environment specification.

  • policy (garage.torch.policies.Policy) – Policy.

  • value_function (garage.torch.value_functions.ValueFunction) – The value function.

  • sampler (garage.sampler.Sampler) – Sampler.

  • policy_optimizer (garage.torch.optimizer.OptimizerWrapper) – Optimizer for policy.

  • vf_optimizer (garage.torch.optimizer.OptimizerWrapper) – Optimizer for value function.

  • num_train_per_epoch (int) – Number of train_once calls per epoch.

  • discount (float) – Discount factor applied to future rewards.

  • gae_lambda (float) – Lambda used for generalized advantage estimation.

  • center_adv (bool) – Whether to rescale the advantages so that they have mean 0 and standard deviation 1.

  • positive_adv (bool) – Whether to shift the advantages so that they are always positive. When used together with center_adv, the advantages are standardized before shifting.

  • policy_ent_coeff (float) – The coefficient of the policy entropy. Setting it to zero disables entropy regularization.

  • use_softplus_entropy (bool) – Whether to pass the policy entropy through a softplus function to prevent it from being negative.

  • stop_entropy_gradient (bool) – Whether to stop the entropy gradient.

  • entropy_method (str) – A string from: ‘max’, ‘regularized’, ‘no_entropy’. The type of entropy method to use. ‘max’ adds the dense entropy to the reward for each time step. ‘regularized’ adds the mean entropy to the surrogate objective. See for more details.
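
The roles of discount, gae_lambda, center_adv and positive_adv can be illustrated with a minimal, library-free sketch. It is not garage's implementation: the helper names gae_advantages and process_advantages, the eps constant, and the choice to shift by the minimum (which makes the smallest advantage exactly zero) are assumptions made for illustration only.

```python
import math


def gae_advantages(rewards, values, discount=0.99, gae_lambda=0.98):
    """Generalized advantage estimates for a single episode.

    `values` must hold one more entry than `rewards`: the value of the
    state reached after the last step (0.0 if the episode terminated).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference residual at time t.
        delta = rewards[t] + discount * values[t + 1] - values[t]
        # Exponentially weighted sum of the residuals from t onward.
        running = delta + discount * gae_lambda * running
        advantages[t] = running
    return advantages


def process_advantages(advantages, center_adv=True, positive_adv=False,
                       eps=1e-8):
    """Optionally standardize, then optionally shift, the advantages."""
    adv = list(advantages)
    if center_adv:
        mean = sum(adv) / len(adv)
        std = math.sqrt(sum((a - mean) ** 2 for a in adv) / len(adv))
        adv = [(a - mean) / (std + eps) for a in adv]
    if positive_adv:
        # Assumed shift: subtract the minimum so no advantage is negative.
        lowest = min(adv)
        adv = [a - lowest for a in adv]
    return adv


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]  # last entry: value after the final step
raw = gae_advantages(rewards, values)
processed = process_advantages(raw, center_adv=True, positive_adv=True)
```

With gae_lambda=0 the estimate reduces to the one-step TD residual, and with gae_lambda=1 it becomes the discounted return minus the value baseline, so gae_lambda trades bias against variance between those two extremes.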

property discount

Discount factor used by the algorithm.

Returns

discount factor.

Return type

float


Obtain samplers and start actual training for each epoch.

Parameters

trainer (Trainer) – Gives the algorithm access to Trainer.step_epochs(), which provides services such as snapshotting and sampler control.

Returns

The average return in the last epoch cycle.

Return type

float
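
The training loop described above can be sketched roughly as follows. This is a self-contained illustration, not the library's code: StubTrainer and SketchAlgo are hypothetical stand-ins, and obtain_episodes, step_itr, and the fake episode data are assumptions. Only step_epochs() and the idea of calling train_once num_train_per_epoch times per epoch come from the documentation above.

```python
class StubTrainer:
    """Hypothetical stand-in for a Trainer; not the garage class."""

    def __init__(self, n_epochs):
        self.n_epochs = n_epochs
        self.step_itr = 0

    def step_epochs(self):
        # A real trainer would also handle snapshotting and sampler
        # control around each yielded epoch.
        for epoch in range(self.n_epochs):
            yield epoch

    def obtain_episodes(self, itr):
        # Fake "episodes": two undiscounted returns that grow with itr.
        return [float(itr), float(itr) + 1.0]


class SketchAlgo:
    """Hypothetical algorithm exposing the loop structure only."""

    def __init__(self, num_train_per_epoch=1):
        self._num_train_per_epoch = num_train_per_epoch

    def train_once(self, itr, episodes):
        # Placeholder for the policy and value-function updates;
        # returns the average return of the sampled episodes.
        return sum(episodes) / len(episodes)

    def train(self, trainer):
        last_return = None
        for _ in trainer.step_epochs():
            for _ in range(self._num_train_per_epoch):
                episodes = trainer.obtain_episodes(trainer.step_itr)
                last_return = self.train_once(trainer.step_itr, episodes)
                trainer.step_itr += 1
        return last_return


trainer = StubTrainer(n_epochs=3)
average_return = SketchAlgo(num_train_per_epoch=2).train(trainer)
```

Driving the loop from the trainer's generator keeps epoch bookkeeping out of the algorithm, which only has to sample, update, and report the most recent average return.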