garage.torch.algos.maml module

Model-Agnostic Meta-Learning (MAML) algorithm implementation for RL.

class MAML(inner_algo, env, policy, meta_optimizer, meta_batch_size=40, inner_lr=0.1, outer_lr=0.001, num_grad_updates=1, meta_evaluator=None, evaluate_every_n_epochs=1)[source]

Bases: object

Model-Agnostic Meta-Learning (MAML).

Parameters:
  • inner_algo (garage.torch.algos.VPG) – The inner algorithm used for computing loss.
  • env (garage.envs.GarageEnv) – A gym environment.
  • policy (garage.torch.policies.Policy) – Policy.
  • meta_optimizer (Union[torch.optim.Optimizer, tuple]) – Type of optimizer. This can be an optimizer type such as torch.optim.Adam, or a tuple of an optimizer type and a dictionary of arguments used to initialize it, e.g. (torch.optim.Adam, {'lr': 1e-3}).
  • meta_batch_size (int) – Number of tasks sampled per batch.
  • inner_lr (float) – Adaptation learning rate.
  • outer_lr (float) – Meta policy learning rate.
  • num_grad_updates (int) – Number of adaptation gradient steps.
  • meta_evaluator (garage.experiment.MetaEvaluator) – A meta evaluator for meta-testing. If None, don’t do meta-testing.
  • evaluate_every_n_epochs (int) – Perform meta-testing every this many epochs.
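
The following is a minimal construction sketch based only on the parameters documented above. It assumes that an inner algorithm, environment, and policy (here the hypothetical placeholders inner_vpg, env, and policy) have already been built elsewhere.

    import torch

    from garage.torch.algos.maml import MAML

    # `inner_vpg`, `env`, and `policy` are hypothetical placeholders for a
    # garage.torch.algos.VPG instance, a garage.envs.GarageEnv, and a
    # garage.torch.policies.Policy built elsewhere.
    maml = MAML(inner_algo=inner_vpg,
                env=env,
                policy=policy,
                # The optimizer may be given as a (type, kwargs) tuple, as
                # described above, so MAML can construct it itself.
                meta_optimizer=(torch.optim.Adam, {'lr': 1e-3}),
                meta_batch_size=40,   # tasks sampled per meta-batch
                inner_lr=0.1,         # adaptation (inner-loop) learning rate
                outer_lr=1e-3,        # meta-policy (outer-loop) learning rate
                num_grad_updates=1)   # adaptation gradient steps per task
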
adapt_policy(exploration_policy, exploration_trajectories)[source]

Adapt the policy by one gradient step for a task.

Parameters:
  • exploration_policy (garage.Policy) – A policy which was returned from get_exploration_policy(), and which generated exploration_trajectories by interacting with an environment. The caller may not use this object after passing it into this method.
  • exploration_trajectories (garage.TrajectoryBatch) – Trajectories to adapt to, generated by exploration_policy exploring the environment.
Returns: A policy adapted to the task represented by the exploration_trajectories.

Return type: garage.Policy

get_exploration_policy()[source]

Return a policy used before adaptation to a specific task.

Each time it is retrieved, this policy should only be evaluated in one task.

Returns: The policy used to obtain samples that are later used for meta-RL adaptation.

Return type: garage.Policy
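
As a hedged illustration of how these two methods fit together, the sketch below walks through one adaptation. collect_exploration_trajectories is a hypothetical stand-in for whatever sampler produces a garage.TrajectoryBatch, and maml is the instance from the construction sketch above.

    # Obtain the pre-adaptation policy and roll it out in a single task.
    # `collect_exploration_trajectories` is a hypothetical helper that
    # returns a garage.TrajectoryBatch.
    exploration_policy = maml.get_exploration_policy()
    exploration_trajectories = collect_exploration_trajectories(
        env, exploration_policy)
    adapted_policy = maml.adapt_policy(exploration_policy,
                                       exploration_trajectories)
    # Per the contract above, `exploration_policy` must not be used after
    # this call; evaluate the task with `adapted_policy` instead.
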
log_performance(itr, all_samples, loss_before, loss_after, kl_before, kl, policy_entropy)[source]

Evaluate performance of this batch.

Parameters:
  • itr (int) – Iteration number.
  • all_samples (list[list[MAMLTrajectoryBatch]]) – Two-dimensional list of MAMLTrajectoryBatch of size [meta_batch_size * (num_grad_updates + 1)].
  • loss_before (float) – Loss before optimization step.
  • loss_after (float) – Loss after optimization step.
  • kl_before (float) – KL divergence before optimization step.
  • kl (float) – KL divergence after optimization step.
  • policy_entropy (float) – Policy entropy.
Returns: The average return in the last epoch cycle.

Return type: float

policy

Current policy of the inner algorithm.

Returns: Current policy of the inner algorithm.

Return type: garage.torch.policies.Policy
train(runner)[source]

Obtain samples and start training for each epoch.

Parameters: runner (LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.

Returns: The average return in the last epoch cycle.

Return type: float
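
A minimal launcher sketch, assuming the LocalRunner-based experiment API described above; snapshot_config, env, and maml are assumed to exist, and the epoch count and batch size are illustrative only.

    from garage.experiment import LocalRunner

    # `snapshot_config`, `env`, and `maml` are assumed to be defined
    # elsewhere (e.g. via the construction sketch above).
    runner = LocalRunner(snapshot_config)
    runner.setup(algo=maml, env=env)
    # The runner drives training; MAML.train(runner) is invoked by the
    # runner and uses runner.step_epochs() for snapshotting and sampler
    # control, as described above.
    runner.train(n_epochs=300, batch_size=4000)
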
train_once(runner, all_samples, all_params)[source]

Train the algorithm once.

Parameters:
  • runner (garage.experiment.LocalRunner) – The experiment runner.
  • all_samples (list[list[MAMLTrajectoryBatch]]) – A two-dimensional list of MAMLTrajectoryBatch of size [meta_batch_size * (num_grad_updates + 1)].
  • all_params (list[dict]) – A list of named parameter dictionaries. Each dictionary contains key value pair of names (str) and parameters (torch.Tensor).
Returns: Average return.

Return type: float

class MAMLTrajectoryBatch[source]

Bases: garage.torch.algos.maml.MAMLTrajectoryBatch

A tuple representing a batch of whole trajectories in MAML.

A MAMLTrajectoryBatch represents a batch of whole trajectories produced from one environment.

+-----------------------+--------------------------------------------+
| Symbol                | Description                                |
+=======================+============================================+
| \(N\)                 | Trajectory index dimension                 |
+-----------------------+--------------------------------------------+
| \(T\)                 | Maximum length of a trajectory             |
+-----------------------+--------------------------------------------+
| \(S^*\)               | Single-step shape of a time-series tensor  |
+-----------------------+--------------------------------------------+

paths

Non-flattened original paths from the sampler.

Type: list[dict[str, np.ndarray or dict[str, np.ndarray]]]

observations

A torch tensor of shape \((N \bullet T, O^*)\) containing the (possibly multi-dimensional) observations for all time steps in this batch. These must conform to env_spec.observation_space.

Type: torch.Tensor

actions

A torch tensor of shape \((N \bullet T, A^*)\) containing the (possibly multi-dimensional) actions for all time steps in this batch. These must conform to env_spec.action_space.

Type: torch.Tensor

rewards

A torch tensor of shape \((N \bullet T)\) containing the rewards for all time steps in this batch.

Type: torch.Tensor

valids

An integer numpy array of shape \((N,)\) containing the length of each trajectory in this batch. This may be used to reconstruct the individual trajectories.

Type: numpy.ndarray

baselines

A numpy array of shape \((N \bullet T)\) containing the value function estimates for all time steps in this batch.

Type: numpy.ndarray

Raises: ValueError – If any of the above attributes do not conform to their prescribed types and shapes.
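
Because the time-series tensors are flattened over \(N \bullet T\), valids is what allows the individual trajectories to be recovered. The sketch below assumes, as the documented shapes suggest, that each trajectory is padded to the maximum length \(T\); batch is a hypothetical populated MAMLTrajectoryBatch.

    # Recover per-trajectory views from the flattened (N * T, ...) tensors.
    n_trajs = len(batch.valids)
    max_len = batch.rewards.shape[0] // n_trajs        # T
    obs = batch.observations.reshape(n_trajs, max_len, -1)
    rewards = batch.rewards.reshape(n_trajs, max_len)

    for i, length in enumerate(batch.valids):
        valid_obs = obs[i, :length]      # shape (valids[i], O*)
        valid_rew = rewards[i, :length]  # shape (valids[i],)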