garage.torch.algos.maml_vpg
Model-Agnostic Meta-Learning (MAML) algorithm applied to VPG.
class MAMLVPG(env, policy, value_function, inner_lr=_Default(0.1), outer_lr=0.001, max_episode_length=100, discount=0.99, gae_lambda=1, center_adv=True, positive_adv=False, policy_ent_coeff=0.0, use_softplus_entropy=False, stop_entropy_gradient=False, entropy_method='no_entropy', meta_batch_size=20, num_grad_updates=1, meta_evaluator=None, evaluate_every_n_epochs=1)
Bases: garage.torch.algos.maml.MAML
Model-Agnostic Meta-Learning (MAML) applied to VPG.
Parameters: - env (Environment) – A multi-task environment.
- policy (garage.torch.policies.Policy) – Policy.
- value_function (garage.np.baselines.Baseline) – The value function.
- inner_lr (float) – Adaptation learning rate.
- outer_lr (float) – Meta policy learning rate.
- max_episode_length (int) – Maximum length of a single episode.
- discount (float) – Discount factor.
- gae_lambda (float) – Lambda used for generalized advantage estimation.
- center_adv (bool) – Whether to rescale the advantages so that they have mean 0 and standard deviation 1.
- positive_adv (bool) – Whether to shift the advantages so that they are always positive. When used in conjunction with center_adv the advantages will be standardized before shifting.
- policy_ent_coeff (float) – The coefficient of the policy entropy. Setting it to zero would mean no entropy regularization.
- use_softplus_entropy (bool) – Whether to pass the estimated entropy through a softplus to keep it from being negative.
- stop_entropy_gradient (bool) – Whether to stop the entropy gradient.
- entropy_method (str) – A string from: ‘max’, ‘regularized’, ‘no_entropy’. The type of entropy method to use. ‘max’ adds the dense entropy to the reward for each time step. ‘regularized’ adds the mean entropy to the surrogate objective. See https://arxiv.org/abs/1805.00909 for more details.
- meta_batch_size (int) – Number of tasks sampled per batch.
- num_grad_updates (int) – Number of adaptation gradient steps.
- meta_evaluator (garage.experiment.MetaEvaluator) – A meta evaluator for meta-testing. If None, don’t do meta-testing.
- evaluate_every_n_epochs (int) – Interval, in epochs, between meta-test evaluations.
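The meta-update these hyperparameters control can be pictured on a toy problem. Below is a minimal first-order MAML sketch in NumPy, assuming a family of quadratic tasks; `task_grad` and `maml_step` are hypothetical helpers for illustration, not part of the garage API:

```python
import numpy as np

def task_grad(theta, target):
    # Gradient of the toy per-task loss 0.5 * ||theta - target||^2.
    return theta - target

def maml_step(theta, targets, inner_lr=0.1, outer_lr=0.001, num_grad_updates=1):
    # One meta-update: adapt a copy of theta to each task with
    # num_grad_updates inner steps, then average the post-adaptation
    # gradients (first-order MAML approximation) for the outer step.
    meta_grad = np.zeros_like(theta)
    for target in targets:  # one entry per sampled task (meta_batch_size)
        adapted = theta.copy()
        for _ in range(num_grad_updates):
            adapted = adapted - inner_lr * task_grad(adapted, target)
        meta_grad += task_grad(adapted, target)
    return theta - outer_lr * meta_grad / len(targets)

rng = np.random.default_rng(0)
targets = rng.normal(size=(20, 2))  # meta_batch_size=20 toy tasks
theta = np.zeros(2)
for _ in range(5000):
    theta = maml_step(theta, targets)
# theta drifts toward an initialization from which every task is reachable
# in a few inner steps (here, near the mean of the task targets).
```

In MAMLVPG the inner loss is the VPG surrogate rather than a quadratic, and the outer step differentiates through the adaptation, but the inner_lr/outer_lr/num_grad_updates/meta_batch_size roles are the same.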
policy
Current policy of the inner algorithm.
Returns: Current policy of the inner algorithm.
Return type: garage.torch.policies.Policy
train(self, runner)
Obtain samples and start training for each epoch.
Parameters: runner (LocalRunner) – Gives the algorithm access to :method:`~LocalRunner.step_epochs()`, which provides services such as snapshotting and sampler control.
Returns: The average return in last epoch cycle.
Return type: float
train_once(self, runner, all_samples, all_params)
Train the algorithm once.
Parameters:
- runner (LocalRunner) – The experiment runner.
- all_samples (list[list[MAMLEpisodeBatch]]) – A two-dimensional list of MAMLEpisodeBatch of size [meta_batch_size * (num_grad_updates + 1)].
- all_params (list[dict]) – A list of named parameter dictionaries. Each dictionary contains key-value pairs of names (str) and parameters (torch.Tensor).
Returns: Average return.
Return type: float
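The nesting of all_samples can be pictured with plain lists; the stub type below is a hypothetical stand-in for MAMLEpisodeBatch, used only to show the shape:

```python
from collections import namedtuple

# Hypothetical stand-in for MAMLEpisodeBatch, just to show the nesting.
EpisodeBatchStub = namedtuple('EpisodeBatchStub', ['task', 'grad_step'])

meta_batch_size = 20
num_grad_updates = 1

# all_samples[i][j] holds the episodes collected for task i after
# j inner gradient steps, so each row has num_grad_updates + 1 entries:
# the pre-adaptation batch plus one batch per inner update.
all_samples = [
    [EpisodeBatchStub(task=i, grad_step=j) for j in range(num_grad_updates + 1)]
    for i in range(meta_batch_size)
]
```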
log_performance(self, itr, all_samples, loss_before, loss_after, kl_before, kl, policy_entropy)
Evaluate performance of this batch.
Parameters:
- itr (int) – Iteration number.
- all_samples (list[list[MAMLEpisodeBatch]]) – A two-dimensional list of MAMLEpisodeBatch of size [meta_batch_size * (num_grad_updates + 1)].
- loss_before (float) – Loss before optimization step.
- loss_after (float) – Loss after optimization step.
- kl_before (float) – KL divergence before optimization step.
- kl (float) – KL divergence after optimization step.
- policy_entropy (float) – Policy entropy.
Returns: The average return in last epoch cycle.
Return type: float
get_exploration_policy(self)
Return a policy used before adaptation to a specific task.
Each time it is retrieved, this policy should only be evaluated in one task.
Returns: The policy used to obtain samples that are later used for meta-RL adaptation.
Return type: Policy
adapt_policy(self, exploration_policy, exploration_episodes)
Adapt the policy by one gradient step for a task.
Parameters:
- exploration_policy (Policy) – A policy which was returned from get_exploration_policy(), and which generated exploration_episodes by interacting with an environment. The caller may not use this object after passing it into this method.
- exploration_episodes (EpisodeBatch) – Episodes with which to adapt, generated by exploration_policy exploring the environment.
Returns: A policy adapted to the task represented by the exploration_episodes.
Return type: Policy
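The get_exploration_policy / adapt_policy pair defines the meta-test protocol: explore a single task with the pre-adaptation policy, then adapt once using only those episodes. A toy sketch of that protocol, with a scalar "policy" and a hypothetical ToyMetaAlgo class (not the garage implementation):

```python
class ToyMetaAlgo:
    """Hypothetical stand-in illustrating the meta-test protocol;
    not the garage implementation."""

    def __init__(self, theta=0.0, inner_lr=0.1):
        self.theta = theta        # pre-adaptation (meta-learned) parameter
        self.inner_lr = inner_lr

    def get_exploration_policy(self):
        # A fresh copy of the pre-adaptation parameters; the caller
        # should evaluate it in exactly one task.
        return {'theta': self.theta}

    def adapt_policy(self, exploration_policy, exploration_episodes):
        # One gradient step on a toy loss 0.5 * (theta - mean reward)^2,
        # using only the episodes gathered by the exploration policy.
        mean_reward = sum(exploration_episodes) / len(exploration_episodes)
        grad = exploration_policy['theta'] - mean_reward
        return {'theta': exploration_policy['theta'] - self.inner_lr * grad}

algo = ToyMetaAlgo()
pre = algo.get_exploration_policy()           # explore one task with this
adapted = algo.adapt_policy(pre, [1.0, 1.0])  # then adapt to that task
```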