Episodic Reward Weighted Regression.

class ERWR(env_spec, policy, baseline, sampler, scope=None, discount=0.99, gae_lambda=1, center_adv=True, positive_adv=True, fixed_horizon=False, lr_clip_range=0.01, max_kl_step=0.01, optimizer=None, optimizer_args=None, policy_ent_coeff=0.0, use_softplus_entropy=False, use_neg_logli_entropy=False, stop_entropy_gradient=False, entropy_method='no_entropy', name='ERWR')



Episodic Reward Weighted Regression [1].


This does not implement the original RWR [2], which addresses “immediate reward problems” and therefore does not find solutions that optimize for temporally delayed rewards.
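ERWR assigns one weight to each whole trajectory based on a transform of its episodic return, and then fits the policy by weighted regression onto the sampled actions, whereas the original RWR weights individual steps by their immediate reward. The snippet below is an illustrative sketch of episodic weighting only, not garage's implementation; the exponential transform and the temperature `beta` are assumptions made for the example:

```python
import math

def episode_weights(episode_returns, beta=1.0):
    """Exponentiated-return weights for whole episodes (ERWR-style sketch).

    Each trajectory receives a single weight proportional to
    exp(beta * R), so high-return episodes dominate the subsequent
    weighted regression.  NOTE: beta and the exp transform are
    illustrative assumptions, not garage's exact scheme.
    """
    m = max(episode_returns)                      # subtract max for stability
    w = [math.exp(beta * (r - m)) for r in episode_returns]
    z = sum(w)
    return [x / z for x in w]                     # normalized weights

returns = [1.0, 2.0, 4.0]
weights = episode_weights(returns)
```

Because the weights are normalized, they can be used directly as per-trajectory coefficients in a weighted maximum-likelihood update.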


[1] Kober, Jens, and Jan R. Peters. “Policy search for motor primitives in robotics.” Advances in Neural Information Processing Systems. 2009.


[2] Peters, Jan, and Stefan Schaal. “Using reward-weighted regression for reinforcement learning of task space control.” Approximate Dynamic Programming and Reinforcement Learning (ADPRL 2007), IEEE International Symposium on. IEEE, 2007.

Parameters

  • env_spec (EnvSpec) – Environment specification.

  • policy (garage.tf.policies.StochasticPolicy) – Policy.

  • baseline (garage.np.baselines.Baseline) – The baseline.

  • sampler (garage.sampler.Sampler) – Sampler.

  • scope (str) – Scope for identifying the algorithm. Must be specified if running multiple algorithms simultaneously, each using different environments and policies.

  • discount (float) – Discount factor applied to future rewards.

  • gae_lambda (float) – Lambda used for generalized advantage estimation.

  • center_adv (bool) – Whether to rescale the advantages so that they have mean 0 and standard deviation 1.

  • positive_adv (bool) – Whether to shift the advantages so that they are always positive. When used in conjunction with center_adv the advantages will be standardized before shifting.

  • fixed_horizon (bool) – Whether to fix horizon.

  • lr_clip_range (float) – The limit on the likelihood ratio between policies, as in PPO.

  • max_kl_step (float) – The maximum KL divergence between old and new policies, as in TRPO.

  • optimizer (object) – The optimizer of the algorithm. Should be one of the optimizers in garage.tf.optimizers.

  • optimizer_args (dict) – The arguments of the optimizer.

  • policy_ent_coeff (float) – The coefficient of the policy entropy. Setting it to zero would mean no entropy regularization.

  • use_softplus_entropy (bool) – Whether to pass the estimated entropy through a softplus transform so that the entropy estimate cannot be negative.

  • use_neg_logli_entropy (bool) – Whether to estimate the entropy as the negative log likelihood of the action.

  • stop_entropy_gradient (bool) – Whether to stop the entropy gradient.

  • entropy_method (str) – One of ‘max’, ‘regularized’, or ‘no_entropy’. The type of entropy method to use. ‘max’ adds the dense entropy to the reward at each time step; ‘regularized’ adds the mean entropy to the surrogate objective; ‘no_entropy’ disables entropy regularization.

  • name (str) – The name of the algorithm.
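The advantage-related flags interact as follows: with gae_lambda=1 the advantage estimate reduces to the discounted return minus the baseline; center_adv then standardizes the advantages to mean 0 and standard deviation 1, and positive_adv shifts them so the minimum is zero, which keeps the regression weights non-negative. A minimal pure-Python sketch of those two flags, not the garage internals:

```python
def process_advantages(advs, center_adv=True, positive_adv=True, eps=1e-8):
    """Standardize and/or shift advantages as the flags describe.

    Illustrative sketch only: garage operates on arrays/tensors, but
    the order of operations shown here (center first, then shift)
    matches the documented behavior of center_adv and positive_adv.
    """
    if center_adv:
        n = len(advs)
        mean = sum(advs) / n
        var = sum((a - mean) ** 2 for a in advs) / n
        std = var ** 0.5
        advs = [(a - mean) / (std + eps) for a in advs]
    if positive_adv:
        lo = min(advs)
        advs = [a - lo for a in advs]   # shift so the smallest value is 0
    return advs

adv = process_advantages([-1.0, 0.0, 1.0])
```

Shifting after centering is what "standardized before shifting" in the positive_adv description refers to.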


Obtain samplers and start actual training for each epoch.


trainer (Trainer) – Experiment trainer, which provides services such as snapshotting and sampler control.


Returns
  The average return in the last epoch cycle.

Return type
  float
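Schematically, train alternates sampling and a policy update each epoch and reports the average return of the final epoch. The skeleton below is a hypothetical illustration of that loop; sample_episodes and update_policy are placeholder names, not garage APIs:

```python
def train_loop(sample_episodes, update_policy, n_epochs=10):
    """Schematic of the per-epoch cycle: sample, reweight, regress.

    sample_episodes() -> list of per-episode returns for this epoch
    update_policy(returns) -> performs the weighted regression step
    Both callables are hypothetical placeholders for this sketch.
    """
    last_avg = 0.0
    for _ in range(n_epochs):
        returns = sample_episodes()
        update_policy(returns)
        last_avg = sum(returns) / len(returns)   # average return this epoch
    return last_avg                              # last epoch's average

avg = train_loop(lambda: [1.0, 3.0], lambda r: None, n_epochs=3)
```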