`garage.torch.algos.maml_trpo`

Model-Agnostic Meta-Learning (MAML) algorithm applied to TRPO.

class MAMLTRPO(env, policy, value_function, sampler, task_sampler, inner_lr=_Default(0.01), outer_lr=0.001, max_kl_step=0.01, discount=0.99, gae_lambda=1, center_adv=True, positive_adv=False, policy_ent_coeff=0.0, use_softplus_entropy=False, stop_entropy_gradient=False, entropy_method='no_entropy', meta_batch_size=40, num_grad_updates=1, meta_evaluator=None, evaluate_every_n_epochs=1)

Bases: garage.torch.algos.maml.MAML


Model-Agnostic Meta-Learning (MAML) applied to TRPO.

Parameters

env (Environment) – A multi-task environment.
policy (garage.torch.policies.Policy) – Policy.
value_function (garage.np.baselines.Baseline) – The value function.
sampler (garage.sampler.Sampler) – Sampler.
task_sampler (garage.experiment.TaskSampler) – Task sampler.
inner_lr (float) – Adaptation learning rate.
outer_lr (float) – Meta policy learning rate.
max_kl_step (float) – The maximum KL divergence between old and new policies.
discount (float) – Discount.
gae_lambda (float) – Lambda used for generalized advantage estimation.
center_adv (bool) – Whether to rescale the advantages so that they have mean 0 and standard deviation 1.
positive_adv (bool) – Whether to shift the advantages so that they are always positive. When used in conjunction with center_adv the advantages will be standardized before shifting.
policy_ent_coeff (float) – The coefficient of the policy entropy. Setting it to zero would mean no entropy regularization.
use_softplus_entropy (bool) – Whether to pass the entropy through a softplus function to prevent it from being negative.
stop_entropy_gradient (bool) – Whether to stop the entropy gradient.
entropy_method (str) – A string from: ‘max’, ‘regularized’, ‘no_entropy’. The type of entropy method to use. ‘max’ adds the dense entropy to the reward for each time step. ‘regularized’ adds the mean entropy to the surrogate objective. See https://arxiv.org/abs/1805.00909 for more details.
meta_batch_size (int) – Number of tasks sampled per batch.
num_grad_updates (int) – Number of adaptation gradient steps.
meta_evaluator (garage.experiment.MetaEvaluator) – A meta evaluator for meta-testing. If None, don’t do meta-testing.
evaluate_every_n_epochs (int) – Number of epochs between meta-tests; meta-testing runs once every this many epochs.
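
How discount, gae_lambda, center_adv and positive_adv interact can be sketched in plain Python. This is a minimal illustration of generalized advantage estimation followed by the two optional post-processing steps, not garage's implementation. The helper names `gae_advantages` and `process_advantages` are hypothetical, the value after the final step is assumed to be 0, and subtracting the minimum is an assumed way of making the advantages non-negative.

```python
import math


def gae_advantages(rewards, values, discount=0.99, gae_lambda=1.0):
    """Generalized advantage estimation for a single episode.

    `values` holds one baseline prediction per step; the value after the
    final step is taken to be 0 (an assumption for this sketch).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # One-step TD residual: r_t + discount * V(s_{t+1}) - V(s_t).
        delta = rewards[t] + discount * next_value - values[t]
        # Exponentially weighted sum of residuals, weight (discount * lambda).
        running = delta + discount * gae_lambda * running
        advantages[t] = running
    return advantages


def process_advantages(advantages, center_adv=True, positive_adv=False,
                       eps=1e-8):
    """Standardize first, then shift, mirroring the documented order."""
    result = list(advantages)
    if center_adv:
        mean = sum(result) / len(result)
        std = math.sqrt(sum((a - mean) ** 2 for a in result) / len(result))
        result = [(a - mean) / (std + eps) for a in result]
    if positive_adv:
        # Assumed shift: subtract the minimum so the smallest value is 0.
        lowest = min(result)
        result = [a - lowest for a in result]
    return result


rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.5, 0.5]
raw = gae_advantages(rewards, values, discount=0.9, gae_lambda=1.0)
centered = process_advantages(raw, center_adv=True)
shifted = process_advantages(raw, center_adv=True, positive_adv=True)
```

With gae_lambda=1 each advantage reduces to the discounted return minus the baseline prediction, which is why the default setting behaves like a plain Monte Carlo estimate with a baseline.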

property policy

Current policy of the inner algorithm.

Returns

Current policy of the inner algorithm.

Return type

garage.torch.policies.Policy

train(trainer)

Obtain samples and start training for each epoch.

Parameters

trainer (Trainer) – Gives the algorithm access to Trainer.step_epochs(), which provides services such as snapshotting and sampler control.

Returns

The average return in the last epoch cycle.

Return type

float
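
The shape of the epoch loop inside train() can be pictured with stand-ins. In the sketch below, `StubTrainer` and `ToyAlgo` are hypothetical classes written only for illustration and are not garage's classes. The details taken from the description above are that the algorithm iterates over the trainer's step_epochs() and returns the average return of the last epoch; the `snapshots` attribute and the canned return values are assumptions.

```python
class StubTrainer:
    """Hypothetical stand-in for a trainer; not garage's Trainer."""

    def __init__(self, n_epochs):
        self.n_epochs = n_epochs
        self.snapshots = []

    def step_epochs(self):
        # Yield control to the algorithm once per epoch, then do the
        # bookkeeping (here: a fake snapshot) when the epoch finishes.
        for epoch in range(self.n_epochs):
            yield epoch
            self.snapshots.append(epoch)


class ToyAlgo:
    """Hypothetical algorithm whose 'returns' are canned numbers."""

    def __init__(self, returns_per_epoch):
        self._returns_per_epoch = returns_per_epoch

    def train(self, trainer):
        last_average = None
        for epoch in trainer.step_epochs():
            episode_returns = self._returns_per_epoch[epoch]
            last_average = sum(episode_returns) / len(episode_returns)
        return last_average


trainer = StubTrainer(n_epochs=2)
algo = ToyAlgo(returns_per_epoch=[[1.0, 3.0], [4.0, 6.0]])
last_average_return = algo.train(trainer)
```

Writing step_epochs() as a generator lets the trainer run its own bookkeeping between epochs without the algorithm calling it explicitly.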

get_exploration_policy()

Return a policy used before adaptation to a specific task.

Each time it is retrieved, this policy should only be evaluated in one task.

Returns

The policy used to obtain samples that are later used for meta-RL adaptation.

Return type

Policy

adapt_policy(exploration_policy, exploration_episodes)

Adapt the policy by one gradient step for a task.

Parameters

exploration_policy (Policy) – A policy which was returned from get_exploration_policy(), and which generated exploration_episodes by interacting with an environment. The caller may not use this object after passing it into this method.
exploration_episodes (EpisodeBatch) – Episodes with which to adapt, generated by exploration_policy exploring the environment.

Returns

A policy adapted to the task represented by the exploration_episodes.

Return type

Policy
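
get_exploration_policy() and adapt_policy() are meant to be called in sequence: fetch an exploration policy, roll it out in a single task, then hand both the policy and its episodes to adapt_policy(). The sketch below mimics that call order on a one-parameter toy problem. `ToyPolicy`, `ToyMetaLearner`, `collect_episodes` and the quadratic loss are all hypothetical; the loss merely stands in for whatever the library computes from an EpisodeBatch.

```python
import copy


class ToyPolicy:
    """Hypothetical one-parameter policy that always outputs `theta`."""

    def __init__(self, theta):
        self.theta = theta


class ToyMetaLearner:
    """Hypothetical object exposing the two meta-RL methods."""

    def __init__(self, meta_policy, inner_lr=0.1):
        self._meta_policy = meta_policy
        self._inner_lr = inner_lr

    def get_exploration_policy(self):
        # Hand out a copy so adaptation never touches the meta-parameters.
        return copy.deepcopy(self._meta_policy)

    def adapt_policy(self, exploration_policy, exploration_episodes):
        # Toy loss: mean of (theta - target)^2 over the observed targets.
        # Its gradient w.r.t. theta is the mean of 2 * (theta - target).
        theta = exploration_policy.theta
        grad = sum(2.0 * (theta - target)
                   for target in exploration_episodes)
        grad /= len(exploration_episodes)
        # A single gradient step with the adaptation learning rate.
        exploration_policy.theta = theta - self._inner_lr * grad
        return exploration_policy


def collect_episodes(policy, task_target, n_episodes=4):
    """Pretend rollout: each 'episode' just reveals the task's target.

    `policy` is accepted only to mirror the real call order; it is unused.
    """
    return [task_target] * n_episodes


learner = ToyMetaLearner(ToyPolicy(theta=0.0), inner_lr=0.1)
exploration_policy = learner.get_exploration_policy()
episodes = collect_episodes(exploration_policy, task_target=1.0)
adapted_policy = learner.adapt_policy(exploration_policy, episodes)
```

In this toy version the adapted policy is the same object that was passed in, modified in place, which is one reason a caller should treat the exploration policy as consumed once it has been handed to adapt_policy().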
