garage.tf.algos.erwr

Episodic Reward Weighted Regression.

class ERWR(env_spec, policy, baseline, sampler, scope=None, discount=0.99, gae_lambda=1, center_adv=True, positive_adv=True, fixed_horizon=False, lr_clip_range=0.01, max_kl_step=0.01, optimizer=None, optimizer_args=None, policy_ent_coeff=0.0, use_softplus_entropy=False, use_neg_logli_entropy=False, stop_entropy_gradient=False, entropy_method='no_entropy', name='ERWR')

Bases: garage.tf.algos.vpg.VPG

Episodic Reward Weighted Regression [1].

Note

This does not implement the original RWR [2], which addresses “immediate reward problems”; that formulation does not find solutions that optimize for temporally delayed rewards.

[1] Kober, Jens, and Jan R. Peters. “Policy search for motor primitives in robotics.” Advances in Neural Information Processing Systems. 2009.

[2] Peters, Jan, and Stefan Schaal. “Using reward-weighted regression for reinforcement learning of task space control.” Approximate Dynamic Programming and Reinforcement Learning (ADPRL 2007), IEEE International Symposium on. IEEE, 2007.
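As a hedged sketch (notation ours, not from the garage docs), the episodic variant maximizes a reward-weighted log-likelihood over sampled episodes:

```latex
% Illustrative surrogate objective; w_t is a positive per-timestep
% weight derived from the (centered, shifted-positive) advantage estimate.
J(\theta) = \mathbb{E}\Big[\sum_t w_t \log \pi_\theta(a_t \mid s_t)\Big],
\qquad w_t = A(s_t, a_t) \ \text{shifted so that } w_t > 0
```

This is why positive_adv defaults to True below: the advantages act as regression weights and must be non-negative.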

Parameters
  • env_spec (EnvSpec) – Environment specification.

  • policy (garage.tf.policies.StochasticPolicy) – Policy.

  • baseline (garage.tf.baselines.Baseline) – The baseline.

  • sampler (garage.sampler.Sampler) – Sampler.

  • scope (str) – Scope for identifying the algorithm. Must be specified if running multiple algorithms simultaneously, each using different environments and policies.

  • discount (float) – Discount factor.

  • gae_lambda (float) – Lambda used for generalized advantage estimation.

  • center_adv (bool) – Whether to rescale the advantages so that they have mean 0 and standard deviation 1.

  • positive_adv (bool) – Whether to shift the advantages so that they are always positive. When used in conjunction with center_adv the advantages will be standardized before shifting.

  • fixed_horizon (bool) – Whether to fix horizon.

  • lr_clip_range (float) – The limit on the likelihood ratio between policies, as in PPO.

  • max_kl_step (float) – The maximum KL divergence between old and new policies, as in TRPO.

  • optimizer (object) – The optimizer of the algorithm. Should be one of the optimizers in garage.tf.optimizers.

  • optimizer_args (dict) – The arguments of the optimizer.

  • policy_ent_coeff (float) – The coefficient of the policy entropy. Setting it to zero would mean no entropy regularization.

  • use_softplus_entropy (bool) – Whether to estimate the softmax distribution of the entropy to prevent the entropy from being negative.

  • use_neg_logli_entropy (bool) – Whether to estimate the entropy as the negative log likelihood of the action.

  • stop_entropy_gradient (bool) – Whether to stop the entropy gradient.

  • entropy_method (str) – A string from: ‘max’, ‘regularized’, ‘no_entropy’. The type of entropy method to use. ‘max’ adds the dense entropy to the reward for each time step. ‘regularized’ adds the mean entropy to the surrogate objective. ‘no_entropy’ applies no entropy regularization. See https://arxiv.org/abs/1805.00909 for more details.

  • name (str) – The name of the algorithm.
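As an illustrative sketch of how center_adv and positive_adv could combine (this is not garage's internal code; names, the epsilon value, and the shift rule are assumptions): advantages are first standardized to zero mean and unit variance, then shifted so the minimum is strictly positive, keeping the regression weights non-negative.

```python
import statistics

def process_advantages(advantages, center_adv=True, positive_adv=True, eps=1e-8):
    """Sketch of the center_adv / positive_adv flags (illustrative only).

    center_adv:   standardize to mean 0, std 1.
    positive_adv: shift so the smallest advantage becomes ~eps > 0.
    """
    adv = list(advantages)
    if center_adv:
        mean = statistics.mean(adv)
        std = statistics.pstdev(adv) + eps
        adv = [(a - mean) / std for a in adv]
    if positive_adv:
        shift = min(adv)
        adv = [a - shift + eps for a in adv]  # minimum becomes ~eps
    return adv

# With both flags on (the ERWR defaults), the outputs are strictly
# positive and preserve the ordering of the raw advantages.
weights = process_advantages([1.0, 2.0, 4.0])
```

Note that with positive_adv=True and center_adv=True, the advantages are standardized before shifting, matching the parameter description above.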

train(self, trainer)

Obtain samples and start actual training for each epoch.

Parameters

trainer (Trainer) – Experiment trainer, which provides services such as snapshotting and sampler control.

Returns

The average return in the last epoch cycle.

Return type

float
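The returned value can be illustrated with a minimal stand-alone computation (a sketch of what “average return in the last epoch cycle” means, not the trainer's actual code):

```python
def average_return(episode_rewards):
    """Mean total (undiscounted) reward across the episodes of one
    epoch -- a sketch of the quantity train() reports for the last epoch."""
    returns = [sum(rewards) for rewards in episode_rewards]
    return sum(returns) / len(returns)

# Two episodes sampled in the final epoch (illustrative rewards).
last_epoch = [[1.0, 0.0, 1.0], [0.5, 0.5]]
avg = average_return(last_epoch)  # (2.0 + 1.0) / 2 = 1.5
```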