garage.torch.algos.maml module

Model-Agnostic Meta-Learning (MAML) algorithm implementation for RL.

class MAML(inner_algo, env, policy, meta_optimizer, meta_batch_size=40, inner_lr=0.1, outer_lr=0.001, num_grad_updates=1, meta_evaluator=None, evaluate_every_n_epochs=1)[source]

Bases: object

Model-Agnostic Meta-Learning (MAML).

Parameters:
  • inner_algo (garage.torch.algos.VPG) – The inner algorithm used for computing loss.
  • env (garage.envs.GarageEnv) – A gym environment.
  • policy (garage.torch.policies.Policy) – Policy.
  • meta_optimizer (Union[torch.optim.Optimizer, tuple]) – Type of optimizer. This can be an optimizer type such as torch.optim.Adam, or a tuple of an optimizer type and a dictionary of arguments used to initialize it, e.g. (torch.optim.Adam, {'lr': 1e-3}).
  • meta_batch_size (int) – Number of tasks sampled per batch.
  • inner_lr (float) – Adaptation learning rate.
  • outer_lr (float) – Meta policy learning rate.
  • num_grad_updates (int) – Number of adaptation gradient steps.
  • meta_evaluator (garage.experiment.MetaEvaluator) – A meta evaluator for meta-testing. If None, don’t do meta-testing.
  • evaluate_every_n_epochs (int) – Perform meta-testing every this many epochs.
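
The following is a minimal construction sketch based only on the parameters documented above. It assumes that an inner algorithm, environment, and policy (here the hypothetical placeholders inner_vpg, env, and policy) have already been built elsewhere.

    import torch

    from garage.torch.algos.maml import MAML

    # `inner_vpg`, `env`, and `policy` are hypothetical placeholders for a
    # garage.torch.algos.VPG instance, a garage.envs.GarageEnv, and a
    # garage.torch.policies.Policy built elsewhere.
    maml = MAML(inner_algo=inner_vpg,
                env=env,
                policy=policy,
                # The optimizer may be given as a (type, kwargs) tuple, as
                # described above, so MAML can construct it itself.
                meta_optimizer=(torch.optim.Adam, {'lr': 1e-3}),
                meta_batch_size=40,   # tasks sampled per meta-batch
                inner_lr=0.1,         # adaptation (inner-loop) learning rate
                outer_lr=1e-3,        # meta-policy (outer-loop) learning rate
                num_grad_updates=1)   # adaptation gradient steps per task
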
adapt_policy(exploration_policy, exploration_trajectories)[source]

Adapt the policy by one gradient step for a task.

Parameters:
  • exploration_policy (garage.Policy) – A policy which was returned from get_exploration_policy(), and which generated exploration_trajectories by interacting with an environment. The caller may not use this object after passing it into this method.
  • exploration_trajectories (garage.TrajectoryBatch) – Trajectories to adapt to, generated by exploration_policy exploring the environment.
Returns: A policy adapted to the task represented by the exploration_trajectories.

Return type: garage.Policy

get_exploration_policy()[source]

Return a policy used before adaptation to a specific task.

Each time it is retrieved, this policy should only be evaluated in one task.

Returns: The policy used to obtain samples that are later used for meta-RL adaptation.

Return type: garage.Policy
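
As a hedged illustration of how these two methods fit together, the sketch below walks through one adaptation. collect_exploration_trajectories is a hypothetical stand-in for whatever sampler produces a garage.TrajectoryBatch, and maml is the instance from the construction sketch above.

    # Obtain the pre-adaptation policy and roll it out in a single task.
    # `collect_exploration_trajectories` is a hypothetical helper that
    # returns a garage.TrajectoryBatch.
    exploration_policy = maml.get_exploration_policy()
    exploration_trajectories = collect_exploration_trajectories(
        env, exploration_policy)
    adapted_policy = maml.adapt_policy(exploration_policy,
                                       exploration_trajectories)
    # Per the contract above, `exploration_policy` must not be used after
    # this call; evaluate the task with `adapted_policy` instead.
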
log_performance(itr, all_samples, loss_before, loss_after, kl_before, kl, policy_entropy)[source]

Evaluate performance of this batch.

Parameters:
  • itr (int) – Iteration number.
  • all_samples (list[list[MAMLTrajectoryBatch]]) – Two-dimensional list of MAMLTrajectoryBatch of size [meta_batch_size * (num_grad_updates + 1)].
  • loss_before (float) – Loss before optimization step.
  • loss_after (float) – Loss after optimization step.
  • kl_before (float) – KL divergence before optimization step.
  • kl (float) – KL divergence after optimization step.
  • policy_entropy (float) – Policy entropy.
Returns: The average return in the last epoch cycle.

Return type: float

policy

Current policy of the inner algorithm.

Returns: Current policy of the inner algorithm.

Return type: garage.torch.policies.Policy
train(runner)[source]

Obtain samples and start training for each epoch.

Parameters: runner (LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.

Returns: The average return in the last epoch cycle.

Return type: float
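
A minimal launcher sketch, assuming the LocalRunner-based experiment API described above; snapshot_config, env, and maml are assumed to exist, and the epoch count and batch size are illustrative only.

    from garage.experiment import LocalRunner

    # `snapshot_config`, `env`, and `maml` are assumed to be defined
    # elsewhere (e.g. via the construction sketch above).
    runner = LocalRunner(snapshot_config)
    runner.setup(algo=maml, env=env)
    # The runner drives training; MAML.train(runner) is invoked by the
    # runner and uses runner.step_epochs() for snapshotting and sampler
    # control, as described above.
    runner.train(n_epochs=300, batch_size=4000)
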
train_once(runner, all_samples, all_params)[source]

Train the algorithm once.

Parameters:
  • runner (garage.experiment.LocalRunner) – The experiment runner.
  • all_samples (list[list[MAMLTrajectoryBatch]]) – A two-dimensional list of MAMLTrajectoryBatch of size [meta_batch_size * (num_grad_updates + 1)].
  • all_params (list[dict]) – A list of named parameter dictionaries. Each dictionary contains key value pair of names (str) and parameters (torch.Tensor).
Returns: Average return.

Return type: float

class MAMLTrajectoryBatch[source]

Bases: garage.torch.algos.maml.MAMLTrajectoryBatch

A tuple representing a batch of whole trajectories in MAML.

A MAMLTrajectoryBatch represents a batch of whole trajectories produced from one environment.

+-----------------------+--------------------------------------------+
| Symbol                | Description                                |
+=======================+============================================+
| \(N\)                 | Trajectory index dimension                 |
+-----------------------+--------------------------------------------+
| \(T\)                 | Maximum length of a trajectory             |
+-----------------------+--------------------------------------------+
| \(S^*\)               | Single-step shape of a time-series tensor  |
+-----------------------+--------------------------------------------+

paths

Non-flattened original paths from the sampler.

Type: list[dict[str, np.ndarray or dict[str, np.ndarray]]]

observations

A torch tensor of shape \((N \bullet T, O^*)\) containing the (possibly multi-dimensional) observations for all time steps in this batch. These must conform to env_spec.observation_space.

Type: torch.Tensor

actions

A torch tensor of shape \((N \bullet T, A^*)\) containing the (possibly multi-dimensional) actions for all time steps in this batch. These must conform to env_spec.action_space.

Type: torch.Tensor

rewards

A torch tensor of shape \((N \bullet T)\) containing the rewards for all time steps in this batch.

Type: torch.Tensor

valids

An integer numpy array of shape \((N,)\) containing the length of each trajectory in this batch. This may be used to reconstruct the individual trajectories.

Type: numpy.ndarray

baselines

A numpy array of shape \((N \bullet T)\) containing the value function estimates for all time steps in this batch.

Type: numpy.ndarray

Raises: ValueError – If any of the above attributes do not conform to their prescribed types and shapes.
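
Because the time-series tensors are flattened over \(N \bullet T\), valids is what allows the individual trajectories to be recovered. The sketch below assumes, as the documented shapes suggest, that each trajectory is padded to the maximum length \(T\); batch is a hypothetical populated MAMLTrajectoryBatch.

    # Recover per-trajectory views from the flattened (N * T, ...) tensors.
    n_trajs = len(batch.valids)
    max_len = batch.rewards.shape[0] // n_trajs        # T
    obs = batch.observations.reshape(n_trajs, max_len, -1)
    rewards = batch.rewards.reshape(n_trajs, max_len)

    for i, length in enumerate(batch.valids):
        valid_obs = obs[i, :length]      # shape (valids[i], O*)
        valid_rew = rewards[i, :length]  # shape (valids[i],)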