garage.torch.algos.maml module
Model-Agnostic Meta-Learning (MAML) algorithm implementation for RL.
class MAML(inner_algo, env, policy, meta_optimizer, meta_batch_size=40, inner_lr=0.1, outer_lr=0.001, num_grad_updates=1, meta_evaluator=None, evaluate_every_n_epochs=1)[source]

Bases: object

Model-Agnostic Meta-Learning (MAML).
Parameters:
- inner_algo (garage.torch.algos.VPG) – The inner algorithm used for computing the adaptation loss.
- env (garage.envs.GarageEnv) – A gym environment.
- policy (garage.torch.policies.Policy) – Policy.
- meta_optimizer (Union[torch.optim.Optimizer, tuple]) – Type of optimizer. This can be an optimizer type such as torch.optim.Adam, or a tuple of a type and a dictionary of arguments used to initialize the optimizer, e.g. (torch.optim.Adam, {'lr': 1e-3}).
- meta_batch_size (int) – Number of tasks sampled per batch.
- inner_lr (float) – Adaptation learning rate.
- outer_lr (float) – Meta policy learning rate.
- num_grad_updates (int) – Number of adaptation gradient steps.
- meta_evaluator (garage.experiment.MetaEvaluator) – A meta evaluator for meta-testing. If None, meta-testing is skipped.
- evaluate_every_n_epochs (int) – Interval, in epochs, at which meta-testing is performed.
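For orientation, a minimal construction sketch follows. It is not taken from the garage docs: the HalfCheetahDirEnv task, the network sizes, and the VPG constructor arguments are illustrative assumptions that may differ across garage versions:

    import torch

    from garage.envs import GarageEnv, normalize
    from garage.envs.mujoco import HalfCheetahDirEnv  # assumed meta-RL task env
    from garage.torch.algos import MAML, VPG
    from garage.torch.policies import GaussianMLPPolicy
    from garage.torch.value_functions import GaussianMLPValueFunction

    env = GarageEnv(normalize(HalfCheetahDirEnv()))
    policy = GaussianMLPPolicy(env.spec, hidden_sizes=(64, 64))
    value_function = GaussianMLPValueFunction(env_spec=env.spec)

    # The inner algorithm computes the per-task adaptation loss.
    inner_algo = VPG(env.spec, policy, value_function)

    algo = MAML(inner_algo=inner_algo,
                env=env,
                policy=policy,
                # A (type, kwargs) tuple, as described in the parameter list.
                meta_optimizer=(torch.optim.Adam, {'lr': 1e-3}),
                meta_batch_size=40,
                inner_lr=0.1,
                outer_lr=1e-3,
                num_grad_updates=1)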
adapt_policy(exploration_policy, exploration_trajectories)[source]

Adapt the policy by one gradient step for a task.

Parameters:
- exploration_policy (garage.Policy) – A policy which was returned from get_exploration_policy(), and which generated exploration_trajectories by interacting with an environment. The caller may not use this object after passing it into this method.
- exploration_trajectories (garage.TrajectoryBatch) – Trajectories to adapt to, generated by exploration_policy exploring the environment.

Returns: A policy adapted to the task represented by the exploration_trajectories.
Return type: garage.Policy
get_exploration_policy()[source]

Return a policy used before adaptation to a specific task.

Each time it is retrieved, this policy should only be evaluated in one task.

Returns: The policy used to obtain samples that are later used for meta-RL adaptation.
Return type: garage.Policy
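Taken together with adapt_policy(), this yields the meta-test protocol sketched below. This is only an outline of the calling convention: task_env is an assumed single-task environment, and rollout_batch is a hypothetical helper (not part of garage) that rolls a policy out in that task and returns a garage.TrajectoryBatch:

    # algo is a MAML instance, e.g. built as in the construction sketch above.
    exploration_policy = algo.get_exploration_policy()
    exploration_trajectories = rollout_batch(task_env, exploration_policy)

    # adapt_policy consumes exploration_policy; do not reuse it afterwards.
    adapted_policy = algo.adapt_policy(exploration_policy,
                                       exploration_trajectories)
    test_trajectories = rollout_batch(task_env, adapted_policy)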
log_performance(itr, all_samples, loss_before, loss_after, kl_before, kl, policy_entropy)[source]

Evaluate performance of this batch.

Parameters:
- itr (int) – Iteration number.
- all_samples (list[list[MAMLTrajectoryBatch]]) – A two-dimensional list of MAMLTrajectoryBatch of size [meta_batch_size * (num_grad_updates + 1)].
- loss_before (float) – Loss before the optimization step.
- loss_after (float) – Loss after the optimization step.
- kl_before (float) – KL divergence before the optimization step.
- kl (float) – KL divergence after the optimization step.
- policy_entropy (float) – Policy entropy.

Returns: The average return in the last epoch cycle.
Return type: float
policy

Current policy of the inner algorithm.

Returns: Current policy of the inner algorithm.
Return type: garage.torch.policies.Policy
train(runner)[source]

Obtain samples and start training for each epoch.

Parameters: runner (LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.
Returns: The average return in the last epoch cycle.
Return type: float
train_once(runner, all_samples, all_params)[source]

Train the algorithm once.

Parameters:
- runner (garage.experiment.LocalRunner) – The experiment runner.
- all_samples (list[list[MAMLTrajectoryBatch]]) – A two-dimensional list of MAMLTrajectoryBatch of size [meta_batch_size * (num_grad_updates + 1)].
- all_params (list[dict]) – A list of named parameter dictionaries. Each dictionary maps parameter names (str) to parameters (torch.Tensor).

Returns: Average return.
Return type: float
class MAMLTrajectoryBatch[source]

Bases: garage.torch.algos.maml.MAMLTrajectoryBatch

A tuple representing a batch of whole trajectories in MAML.

A MAMLTrajectoryBatch represents a batch of whole trajectories produced from one environment.

+--------------+-------------------------------------------+
| Symbol       | Description                               |
+==============+===========================================+
| \(N\)        | Trajectory index dimension                |
+--------------+-------------------------------------------+
| \(T\)        | Maximum length of a trajectory            |
+--------------+-------------------------------------------+
| \(S^*\)      | Single-step shape of a time-series tensor |
+--------------+-------------------------------------------+
paths

Non-flattened original paths from the sampler.

Type: list[dict[str, np.ndarray or dict[str, np.ndarray]]]
observations

A torch tensor of shape \((N \bullet T, O^*)\) containing the (possibly multi-dimensional) observations for all time steps in this batch. These must conform to env_spec.observation_space.

Type: torch.Tensor
actions

A torch tensor of shape \((N \bullet T, A^*)\) containing the (possibly multi-dimensional) actions for all time steps in this batch. These must conform to env_spec.action_space.

Type: torch.Tensor
rewards

A torch tensor of shape \((N \bullet T)\) containing the rewards for all time steps in this batch.

Type: torch.Tensor
valids

An integer numpy array of shape \((N,)\) containing the length of each trajectory in this batch. This may be used to reconstruct the individual trajectories.

Type: numpy.ndarray
baselines

A numpy array of shape \((N \bullet T,)\) containing the value function estimation at all time steps in this batch.

Type: numpy.ndarray

Raises: ValueError – If any of the above attributes do not conform to their prescribed types and shapes.
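As a usage note, the sketch below (not part of the garage API) uses valids to split the flattened time-step dimension of an assumed MAMLTrajectoryBatch instance, batch, back into per-trajectory tensors, on the assumption that \(N \bullet T\) denotes the sum of the trajectory lengths:

    import torch

    # One length per trajectory, taken from the valids array.
    lengths = [int(v) for v in batch.valids]

    # Split the (N . T, ...) tensors into N per-trajectory tensors.
    obs_per_traj = torch.split(batch.observations, lengths)
    rew_per_traj = torch.split(batch.rewards, lengths)

    assert len(obs_per_traj) == len(lengths)  # one tensor per trajectory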