Model-Agnostic Meta-Learning (MAML) algorithm applied to PPO.

class MAMLPPO(env, policy, value_function, sampler, task_sampler, inner_lr=_Default(0.1), outer_lr=0.001, lr_clip_range=0.5, discount=0.99, gae_lambda=1.0, center_adv=True, positive_adv=False, policy_ent_coeff=0.0, use_softplus_entropy=False, stop_entropy_gradient=False, entropy_method='no_entropy', meta_batch_size=20, num_grad_updates=1, meta_evaluator=None, evaluate_every_n_epochs=1)

Bases: garage.torch.algos.maml.MAML


Model-Agnostic Meta-Learning (MAML) applied to PPO.

  • env (Environment) – A multi-task environment.

  • policy (garage.torch.policies.Policy) – Policy.

  • value_function – The value function.

  • sampler (garage.sampler.Sampler) – Sampler.

  • task_sampler (garage.experiment.TaskSampler) – Task sampler.

  • inner_lr (float) – Adaptation learning rate.

  • outer_lr (float) – Meta policy learning rate.

  • lr_clip_range (float) – The limit on the likelihood ratio between policies.

  • discount (float) – Discount.

  • gae_lambda (float) – Lambda used for generalized advantage estimation.

  • center_adv (bool) – Whether to rescale the advantages so that they have mean 0 and standard deviation 1.

  • positive_adv (bool) – Whether to shift the advantages so that they are always positive. When used in conjunction with center_adv the advantages will be standardized before shifting.

  • policy_ent_coeff (float) – The coefficient of the policy entropy. Setting it to zero would mean no entropy regularization.

  • use_softplus_entropy (bool) – Whether to pass the entropy through a softplus so that it cannot become negative.

  • stop_entropy_gradient (bool) – Whether to stop the entropy gradient.

  • entropy_method (str) – One of ‘max’, ‘regularized’, or ‘no_entropy’; the type of entropy method to use. ‘max’ adds the dense entropy to the reward at each time step. ‘regularized’ adds the mean entropy to the surrogate objective.

  • meta_batch_size (int) – Number of tasks sampled per batch.

  • num_grad_updates (int) – Number of adaptation gradient steps.

  • meta_evaluator (garage.experiment.MetaEvaluator) – A meta evaluator for meta-testing. If None, don’t do meta-testing.

  • evaluate_every_n_epochs (int) – Run meta-testing once every this many epochs.
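To make the roles of discount, gae_lambda, center_adv, positive_adv and lr_clip_range more concrete, the following is a minimal NumPy sketch of the standard formulas these parameters refer to. It is not garage's implementation: the function names are hypothetical, the value after the final step is assumed to be 0, and the positive shift is assumed to subtract the minimum advantage.

```python
import numpy as np


def gae_advantages(rewards, values, discount=0.99, gae_lambda=1.0):
    """Generalized advantage estimation for a single episode.

    Assumes the episode terminated, so the value after the last step is 0.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    next_values = np.append(values[1:], 0.0)
    deltas = rewards + discount * next_values - values  # TD residuals
    advantages = np.zeros_like(deltas)
    running = 0.0
    # Accumulate discounted residuals backwards through time.
    for t in reversed(range(len(deltas))):
        running = deltas[t] + discount * gae_lambda * running
        advantages[t] = running
    return advantages


def process_advantages(adv, center_adv=True, positive_adv=False, eps=1e-8):
    """Standardize first, then shift, mirroring the documented order."""
    adv = np.asarray(adv, dtype=float)
    if center_adv:
        adv = (adv - adv.mean()) / (adv.std() + eps)
    if positive_adv:
        adv = adv - adv.min()  # assumption: shift so the smallest value is 0
    return adv


def clipped_surrogate(ratio, adv, lr_clip_range=0.5):
    """PPO clipped objective for likelihood ratios pi_new / pi_old."""
    ratio = np.asarray(ratio, dtype=float)
    adv = np.asarray(adv, dtype=float)
    clipped = np.clip(ratio, 1.0 - lr_clip_range, 1.0 + lr_clip_range)
    return np.minimum(ratio * adv, clipped * adv).mean()
```

With the default gae_lambda=1.0 the estimator reduces to the discounted return minus the value estimate, and with lr_clip_range=0.5 the likelihood ratio stops contributing extra gain once it leaves the interval [0.5, 1.5].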

property policy

Current policy of the inner algorithm.

Returns

Current policy of the inner algorithm.

Obtain samples and start training for each epoch.


  • trainer (Trainer) – Gives the algorithm access to Trainer.step_epochs(), which provides services such as snapshotting and sampler control.


Returns

The average return in the last epoch cycle.



get_exploration_policy()

Return a policy used before adaptation to a specific task.

Each time it is retrieved, this policy should only be evaluated in one task.


Returns

The policy used to obtain samples that are later used for meta-RL adaptation.


adapt_policy(exploration_policy, exploration_episodes)

Adapt the policy by one gradient step for a task.

  • exploration_policy (Policy) – A policy which was returned from get_exploration_policy(), and which generated exploration_episodes by interacting with an environment. The caller may not use this object after passing it into this method.

  • exploration_episodes (EpisodeBatch) – Episodes with which to adapt, generated by exploration_policy exploring the environment.


Returns

A policy adapted to the task.
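As an illustration of what an adaptation step means here, the sketch below applies the plain gradient update that MAML uses in its inner loop, with inner_lr as the step size. It is a toy under stated assumptions, not garage's code: the adapt function, the quadratic task loss and the target vector are hypothetical stand-ins for the policy loss computed from exploration_episodes.

```python
import numpy as np


def adapt(params, grad_fn, inner_lr=0.1, num_grad_updates=1):
    """Return adapted parameters after plain gradient steps.

    `grad_fn` is a hypothetical callable returning the gradient of the
    task loss at the given parameters.
    """
    adapted = np.array(params, dtype=float)  # copy; meta-parameters stay untouched
    for _ in range(num_grad_updates):
        adapted = adapted - inner_lr * grad_fn(adapted)
    return adapted


# Hypothetical task: minimise 0.5 * ||theta - target||^2, whose gradient
# is simply theta - target.
target = np.array([1.0, -2.0])
meta_params = np.zeros(2)
adapted = adapt(meta_params, lambda theta: theta - target, inner_lr=0.1)
```

The meta-parameters are left unchanged by the call, which mirrors the idea that adaptation produces a task-specific policy while the meta policy is only updated by the outer loop with outer_lr.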