garage.torch.algos.pearl module

PEARL and PEARLWorker in PyTorch.

Code is adapted from https://github.com/katerakelly/oyster.

class PEARL(env, inner_policy, qf, vf, num_train_tasks, num_test_tasks, latent_dim, encoder_hidden_sizes, test_env_sampler, policy_class=<class 'garage.torch.policies.context_conditioned_policy.ContextConditionedPolicy'>, encoder_class=<class 'garage.torch.embeddings.mlp_encoder.MLPEncoder'>, policy_lr=0.0003, qf_lr=0.0003, vf_lr=0.0003, context_lr=0.0003, policy_mean_reg_coeff=0.001, policy_std_reg_coeff=0.001, policy_pre_activation_coeff=0.0, soft_target_tau=0.005, kl_lambda=0.1, optimizer_class=torch.optim.Adam, use_information_bottleneck=True, use_next_obs_in_context=False, meta_batch_size=64, num_steps_per_epoch=1000, num_initial_steps=100, num_tasks_sample=100, num_steps_prior=100, num_steps_posterior=0, num_extra_rl_steps_posterior=100, batch_size=1024, embedding_batch_size=1024, embedding_mini_batch_size=1024, max_path_length=1000, discount=0.99, replay_buffer_size=1000000, reward_scale=1, update_post_train=1)[source]

Bases: garage.np.algos.meta_rl_algorithm.MetaRLAlgorithm

A PEARL model based on https://arxiv.org/abs/1903.08254.

PEARL, which stands for Probabilistic Embeddings for Actor-Critic Reinforcement Learning, is an off-policy meta-RL algorithm. It is built on top of SAC, using two Q-functions and a value function, with the addition of an inference network that estimates the posterior \(q(z \mid c)\). The policy is conditioned on the latent variable Z in order to adapt its behavior to specific tasks. A minimal construction sketch follows the parameter list below.

Parameters:
  • env (list[GarageEnv]) – Batch of sampled environment updates (EnvUpdate), which, when invoked on environments, will configure them with new tasks.
  • policy_class (garage.torch.policies.Policy) – Context-conditioned policy class.
  • encoder_class (garage.torch.embeddings.ContextEncoder) – Encoder class for the encoder in context-conditioned policy.
  • inner_policy (garage.torch.policies.Policy) – Policy.
  • qf (torch.nn.Module) – Q-function.
  • vf (torch.nn.Module) – Value function.
  • num_train_tasks (int) – Number of tasks for training.
  • num_test_tasks (int) – Number of tasks for testing.
  • latent_dim (int) – Size of latent context vector.
  • encoder_hidden_sizes (list[int]) – Output dimension of dense layer(s) of the context encoder.
  • test_env_sampler (garage.experiment.SetTaskSampler) – Sampler for test tasks.
  • policy_lr (float) – Policy learning rate.
  • qf_lr (float) – Q-function learning rate.
  • vf_lr (float) – Value function learning rate.
  • context_lr (float) – Inference network learning rate.
  • policy_mean_reg_coeff (float) – Policy mean regularization weight.
  • policy_std_reg_coeff (float) – Policy std regularization weight.
  • policy_pre_activation_coeff (float) – Policy pre-activation weight.
  • soft_target_tau (float) – Interpolation parameter for doing the soft target update.
  • kl_lambda (float) – KL lambda value.
  • optimizer_class (callable) – Type of optimizer for training networks.
  • use_information_bottleneck (bool) – False means latent context is deterministic.
  • use_next_obs_in_context (bool) – Whether or not to use next observation in distinguishing between tasks.
  • meta_batch_size (int) – Meta batch size.
  • num_steps_per_epoch (int) – Number of iterations per epoch.
  • num_initial_steps (int) – Number of transitions obtained per task before training.
  • num_tasks_sample (int) – Number of random tasks to obtain data for each iteration.
  • num_steps_prior (int) – Number of transitions to obtain per task with z ~ prior.
  • num_steps_posterior (int) – Number of transitions to obtain per task with z ~ posterior.
  • num_extra_rl_steps_posterior (int) – Number of additional transitions to obtain per task with z ~ posterior that are only used to train the policy and NOT the encoder.
  • batch_size (int) – Number of transitions in RL batch.
  • embedding_batch_size (int) – Number of transitions in context batch.
  • embedding_mini_batch_size (int) – Number of transitions in mini context batch; should be same as embedding_batch_size for non-recurrent encoder.
  • max_path_length (int) – Maximum path length.
  • discount (float) – RL discount factor.
  • replay_buffer_size (int) – Maximum samples in replay buffer.
  • reward_scale (int) – Reward scale.
  • update_post_train (int) – How often to resample context when obtaining data during training (in trajectories).
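
The following is a minimal construction sketch, not a tuned configuration. It assumes the HalfCheetahVelEnv benchmark and the network classes shipped with garage (TanhGaussianMLPPolicy, ContinuousMLPQFunction, MLPEncoder); the import paths follow the module paths shown in the signature above, and the sizes and task counts are illustrative.

    from garage.envs import GarageEnv, normalize
    from garage.envs.mujoco import HalfCheetahVelEnv  # assumed meta-RL benchmark
    from garage.experiment import SetTaskSampler
    from garage.torch.algos import PEARL
    from garage.torch.embeddings import MLPEncoder
    from garage.torch.policies import ContextConditionedPolicy, TanhGaussianMLPPolicy
    from garage.torch.q_functions import ContinuousMLPQFunction

    num_train_tasks, num_test_tasks, latent_size, net_size = 100, 30, 5, 300

    # Sample a batch of training tasks and a sampler that yields test tasks.
    env_sampler = SetTaskSampler(lambda: GarageEnv(normalize(HalfCheetahVelEnv())))
    env = env_sampler.sample(num_train_tasks)
    test_env_sampler = SetTaskSampler(lambda: GarageEnv(normalize(HalfCheetahVelEnv())))

    # The inner policy and Q-function see observations augmented with the latent z.
    augmented_spec = PEARL.augment_env_spec(env[0]().spec, latent_size)
    qf = ContinuousMLPQFunction(env_spec=augmented_spec,
                                hidden_sizes=[net_size, net_size, net_size])
    vf_spec = PEARL.get_env_spec(env[0]().spec, latent_size, 'vf')
    vf = ContinuousMLPQFunction(env_spec=vf_spec,
                                hidden_sizes=[net_size, net_size, net_size])
    inner_policy = TanhGaussianMLPPolicy(env_spec=augmented_spec,
                                         hidden_sizes=[net_size, net_size, net_size])

    pearl = PEARL(env=env,
                  policy_class=ContextConditionedPolicy,
                  encoder_class=MLPEncoder,
                  inner_policy=inner_policy,
                  qf=qf,
                  vf=vf,
                  num_train_tasks=num_train_tasks,
                  num_test_tasks=num_test_tasks,
                  latent_dim=latent_size,
                  encoder_hidden_sizes=[200, 200, 200],
                  test_env_sampler=test_env_sampler)
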
adapt_policy(exploration_policy, exploration_trajectories)[source]

Produce a policy adapted for a task.

Parameters:
  • exploration_policy (garage.Policy) – A policy which was returned from get_exploration_policy(), and which generated exploration_trajectories by interacting with an environment. The caller may not use this object after passing it into this method.
  • exploration_trajectories (garage.TrajectoryBatch) – Trajectories to adapt to, generated by exploration_policy exploring the environment.
Returns: A policy adapted to the task represented by the exploration_trajectories.
Return type: garage.Policy
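
A rough sketch of how this method and get_exploration_policy() are intended to be used at meta-test time; eval_sampler is assumed to be any garage sampler that returns a garage.TrajectoryBatch (as the MetaEvaluator does), and the sample size is illustrative.

    # Get a fresh exploration policy with the context reset to the prior.
    exploration_policy = pearl.get_exploration_policy()

    # Collect exploration trajectories on the (unknown) test task.
    exploration_traj = eval_sampler.obtain_samples(itr=0,
                                                   num_samples=1000,
                                                   agent_update=exploration_policy)

    # Condition the policy on the posterior q(z|c) inferred from that data.
    adapted_policy = pearl.adapt_policy(exploration_policy, exploration_traj)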

classmethod augment_env_spec(env_spec, latent_dim)[source]

Augment an environment spec with the latent dimension.

Parameters:
  • env_spec (garage.envs.EnvSpec) – Environment specs to be augmented.
  • latent_dim (int) – Latent dimension.
Returns: Augmented environment specs.
Return type: garage.envs.EnvSpec
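
For illustration, the augmented spec extends the flattened observation space by latent_dim while leaving the action space unchanged, which is why the Q-function and inner policy in the construction sketch above are built against it (base_spec is assumed to be a garage.envs.EnvSpec of one training task):

    aug_spec = PEARL.augment_env_spec(base_spec, latent_dim=5)
    assert (aug_spec.observation_space.flat_dim
            == base_spec.observation_space.flat_dim + 5)
    assert aug_spec.action_space.flat_dim == base_spec.action_space.flat_dim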

classmethod get_env_spec(env_spec, latent_dim, module)[source]

Get the environment spec for a given module, accounting for the latent dimension.

Parameters:
  • env_spec (garage.envs.EnvSpec) – Environment specs.
  • latent_dim (int) – Latent dimension.
  • module (str) – Module to get environment specs for.
Returns: Module environment specs with latent dimension.
Return type: garage.envs.InOutSpec
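
A short sketch of hypothetical usage; the accepted module names ('vf' and 'encoder') are assumed here from how the class sizes its own networks, and base_spec is the same assumed EnvSpec as above.

    from garage.torch.q_functions import ContinuousMLPQFunction

    # Spec used to size the value function (see the construction sketch above).
    vf_spec = PEARL.get_env_spec(base_spec, latent_dim=5, module='vf')
    vf = ContinuousMLPQFunction(env_spec=vf_spec, hidden_sizes=[300, 300, 300])

    # Spec describing the context encoder's input/output dimensions.
    encoder_spec = PEARL.get_env_spec(base_spec, latent_dim=5, module='encoder')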

get_exploration_policy()[source]

Return a policy used before adaptation to a specific task.

Each time it is retrieved, this policy should only be evaluated in one task.

Returns: The policy used to obtain samples that are later used for meta-RL adaptation.
Return type: garage.Policy
networks

Return all the networks within the model.

Returns: A list of networks.
Return type: list
policy

Return the policy within the model.

Returns: Policy within the model.
Return type: garage.torch.policies.Policy
to(device=None)[source]

Put all the networks within the model on device.

Parameters: device (str) – ID of GPU or CPU.
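
A sketch of moving the model to GPU; it assumes the set_gpu_mode/global_device helpers, which live in garage.torch.utils in the release this page documents (newer releases export them from garage.torch), and assumes that to() with device=None falls back to the global device.

    import torch

    import garage.torch.utils as tu  # location may differ by garage version

    # Select the global device, then move every PEARL network onto it.
    tu.set_gpu_mode(torch.cuda.is_available(), gpu_id=0)
    pearl.to()
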
train(runner)[source]

Obtain samples, train, and evaluate for each epoch.

Parameters: runner (LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.
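
train() is normally invoked through a LocalRunner set up with PEARLWorker (documented below); a rough sketch, where snapshot_config is supplied by garage's experiment wrapper and the epoch/batch sizes are illustrative:

    from garage.experiment import LocalRunner
    from garage.sampler import LocalSampler
    from garage.torch.algos.pearl import PEARLWorker

    runner = LocalRunner(snapshot_config)  # snapshot_config: from the experiment wrapper
    runner.setup(algo=pearl,
                 env=env[0](),
                 sampler_cls=LocalSampler,
                 sampler_args=dict(max_path_length=1000),
                 n_workers=1,
                 worker_class=PEARLWorker)
    runner.train(n_epochs=500, batch_size=256)
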
class PEARLWorker(*, seed, max_path_length, worker_number, deterministic=False, accum_context=False)[source]

Bases: garage.sampler.default_worker.DefaultWorker

A worker class used in sampling for PEARL.

It stores context and resamples the belief in the policy at every step.

Parameters:
  • seed (int) – The seed to use to initialize random number generators.
  • max_path_length (int or float) – The maximum length of paths which will be sampled. Can be (floating point) infinity.
  • worker_number (int) – The number of the worker where this update is occurring. This argument is used to set a different seed for each worker.
  • deterministic (bool) – If true, use the mean action returned by the stochastic policy instead of sampling from the returned action distribution.
  • accum_context (bool) – If true, update context of the agent.
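
PEARLWorker is not usually constructed by hand; it is passed to the sampler via the runner, as in the train() sketch above. The worker_args below are illustrative, e.g. for evaluation rollouts that act deterministically while still accumulating context:

    runner.setup(algo=pearl,
                 env=env[0](),
                 sampler_cls=LocalSampler,
                 sampler_args=dict(max_path_length=1000),
                 n_workers=1,
                 worker_class=PEARLWorker,
                 worker_args=dict(deterministic=True, accum_context=True))
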
agent

The worker’s agent.

Type: Policy or None
env

The worker’s environment.

Type: gym.Env or None
rollout()[source]

Sample a single rollout of the agent in the environment.

Returns: The collected trajectory.
Return type: garage.TrajectoryBatch
start_rollout()[source]

Begin a new rollout.

step_rollout()[source]

Take a single time-step in the current rollout.

Returns: True iff the path is done, either due to the environment indicating termination or due to reaching max_path_length.
Return type: bool