garage.torch.algos

PyTorch algorithms.

class BC(env_spec, learner, *, batch_size, source=None, sampler=None, policy_optimizer=torch.optim.Adam, policy_lr=_Default(0.001), loss='log_prob', minibatches_per_epoch=16, name='BC')

Bases: garage.np.algos.rl_algorithm.RLAlgorithm

Behavioral Cloning.

Based on Model-Free Imitation Learning with Policy Optimization:

https://arxiv.org/abs/1605.08478

Parameters
  • env_spec (EnvSpec) – Specification of environment.

  • learner (garage.torch.Policy) – Policy to train.

  • batch_size (int) – Size of optimization batch.

  • source (Policy or Generator[TimeStepBatch]) – Expert to clone. If a policy is passed, .policy is set to source and the trainer is used to sample from it.

  • sampler (garage.sampler.Sampler) – Sampler. If source is a policy, a sampler is required for sampling.

  • policy_optimizer (torch.optim.Optimizer) – Optimizer to be used to optimize the policy.

  • policy_lr (float) – Learning rate of the policy optimizer.

  • loss (str) – Which loss function to use. Must be either ‘log_prob’ or ‘mse’. If set to ‘log_prob’ (the default), learner must be a garage.torch.StochasticPolicy.

  • minibatches_per_epoch (int) – Number of minibatches per epoch.

  • name (str) – Name to use for logging.

Raises

ValueError – If learner is not a garage.torch.StochasticPolicy and loss is ‘log_prob’.
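The difference between the two loss options can be illustrated without garage. The following is a minimal sketch, assuming a learner that outputs either a point prediction (for ‘mse’) or the mean and standard deviation of a Gaussian over scalar actions (for ‘log_prob’). The function names are hypothetical and this is not garage's implementation.

```python
import math


def mse_loss(predicted_actions, expert_actions):
    """Mean squared error between learner and expert actions."""
    errors = [(p - e) ** 2 for p, e in zip(predicted_actions, expert_actions)]
    return sum(errors) / len(errors)


def gaussian_log_prob(action, mean, std):
    """Log-density of a scalar action under a Gaussian N(mean, std^2)."""
    var = std ** 2
    return -0.5 * math.log(2 * math.pi * var) - (action - mean) ** 2 / (2 * var)


def log_prob_loss(expert_actions, means, stds):
    """Negative mean log-likelihood of expert actions under the learner."""
    log_probs = [gaussian_log_prob(a, m, s)
                 for a, m, s in zip(expert_actions, means, stds)]
    return -sum(log_probs) / len(log_probs)
```

The log-likelihood loss needs a distribution over actions, which is why a stochastic learner is required when loss is ‘log_prob’.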

train(self, trainer)

Obtain samplers and start actual training for each epoch.

Parameters

trainer (Trainer) – Experiment trainer, for services such as snapshotting and sampler control.

class DDPG(env_spec, policy, qf, replay_buffer, sampler, *, steps_per_epoch=20, n_train_steps=50, max_episode_length_eval=None, buffer_batch_size=64, min_buffer_size=int(10000.0), exploration_policy=None, target_update_tau=0.01, discount=0.99, policy_weight_decay=0, qf_weight_decay=0, policy_optimizer=torch.optim.Adam, qf_optimizer=torch.optim.Adam, policy_lr=_Default(0.0001), qf_lr=_Default(0.001), clip_pos_returns=False, clip_return=np.inf, max_action=None, reward_scale=1.0)

Bases: garage.np.algos.RLAlgorithm

A DDPG model implemented with PyTorch.

DDPG, also known as Deep Deterministic Policy Gradient, uses an actor-critic method to optimize the policy and the Q-function prediction. It uses a supervised method to update the critic network and the policy gradient to update the actor network. An exploration strategy, a replay buffer, and target networks are used to stabilize the training process.

Parameters
  • env_spec (EnvSpec) – Environment specification.

  • policy (garage.torch.policies.Policy) – Policy.

  • qf (object) – Q-value network.

  • replay_buffer (ReplayBuffer) – Replay buffer.

  • sampler (garage.sampler.Sampler) – Sampler.

  • steps_per_epoch (int) – Number of train_once calls per epoch.

  • n_train_steps (int) – Training steps.

  • buffer_batch_size (int) – Batch size of replay buffer.

  • min_buffer_size (int) – The minimum buffer size for replay buffer.

  • exploration_policy (garage.np.exploration_policies.ExplorationPolicy) – Exploration strategy.

  • target_update_tau (float) – Interpolation parameter for doing the soft target update.

  • max_episode_length_eval (int or None) – Maximum length of episodes used for off-policy evaluation. If None, defaults to env_spec.max_episode_length.

  • discount (float) – Discount factor for the cumulative return.

  • policy_weight_decay (float) – L2 weight decay factor for parameters of the policy network.

  • qf_weight_decay (float) – L2 weight decay factor for parameters of the q value network.

  • policy_optimizer (Union[type, tuple[type, dict]]) – Type of optimizer for training policy network. This can be an optimizer type such as torch.optim.Adam or a tuple of type and dictionary, where dictionary contains arguments to initialize the optimizer e.g. (torch.optim.Adam, {‘lr’ : 1e-3}).

  • qf_optimizer (Union[type, tuple[type, dict]]) – Type of optimizer for training Q-value network. This can be an optimizer type such as torch.optim.Adam or a tuple of type and dictionary, where dictionary contains arguments to initialize the optimizer e.g. (torch.optim.Adam, {‘lr’ : 1e-3}).

  • policy_lr (float) – Learning rate for policy network parameters.

  • qf_lr (float) – Learning rate for Q-value network parameters.

  • clip_pos_returns (bool) – Whether to clip positive returns.

  • clip_return (float) – Clip return to be in [-clip_return, clip_return].

  • max_action (float) – Maximum action magnitude.

  • reward_scale (float) – Reward scale.
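As a rough illustration of how discount, reward_scale, clip_return and clip_pos_returns could interact when forming the critic's regression target, here is a minimal sketch for a single transition. The function name is hypothetical, and the choice to clip the target to [-clip_return, 0] when clip_pos_returns is set is an assumption for illustration, not a statement about garage's internals.

```python
def critic_target(reward, next_q, done, *, discount=0.99, reward_scale=1.0,
                  clip_return=float("inf"), clip_pos_returns=False):
    """One-step TD target y = scale * r + discount * (1 - done) * Q'(s', a')."""
    y = reward_scale * reward + discount * (1.0 - float(done)) * next_q
    # Assumption: positive targets are cut off at zero when clip_pos_returns
    # is enabled; otherwise the range is symmetric around zero.
    upper = 0.0 if clip_pos_returns else clip_return
    return max(-clip_return, min(upper, y))
```

With the default clip_return of infinity and clip_pos_returns=False, the target is left unchanged.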

train(self, trainer)

Obtain samplers and start actual training for each epoch.

Parameters

trainer (Trainer) – Experiment trainer.

Returns

The average return in the last epoch cycle.

Return type

float

train_once(self, itr, episodes)

Perform one iteration of training.

Parameters
  • itr (int) – Iteration number.

  • episodes (EpisodeBatch) – Batch of episodes.

optimize_policy(self, samples_data)

Perform algorithm optimizing.

Parameters

samples_data (dict) – Processed batch data.

Returns
  • action_loss – Loss of action predicted by the policy network.

  • qval_loss – Loss of Q-value predicted by the Q-network.

  • ys – y_s.

  • qval – Q-value predicted by the Q-network.

update_target(self)

Update parameters in the target policy and Q-value network.
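A soft target update controlled by target_update_tau is commonly implemented as Polyak averaging. The sketch below assumes that form and uses plain lists of floats in place of network parameters. soft_update is a hypothetical helper, not part of garage.

```python
def soft_update(target_params, source_params, tau=0.01):
    """Return tau * source + (1 - tau) * target for each parameter."""
    return [tau * s + (1.0 - tau) * t
            for t, s in zip(target_params, source_params)]
```

A small tau makes the target networks trail the learned networks slowly, while tau=1.0 copies the learned parameters outright.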

class DQN(env_spec, policy, qf, replay_buffer, sampler, exploration_policy=None, eval_env=None, double_q=True, qf_optimizer=torch.optim.Adam, *, steps_per_epoch=20, n_train_steps=50, max_episode_length_eval=None, deterministic_eval=False, buffer_batch_size=64, min_buffer_size=int(10000.0), num_eval_episodes=10, discount=0.99, qf_lr=_Default(0.001), clip_rewards=None, clip_gradient=10, target_update_freq=5, reward_scale=1.0)

Bases: garage.np.algos.RLAlgorithm

DQN algorithm. See https://arxiv.org/pdf/1312.5602.pdf.

DQN, also known as the Deep Q Network algorithm, is an off-policy algorithm that learns action-value estimates for each state-action pair. The policy then acts by taking the action that yields the highest Q(s,a) value for a given state s.
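The greedy action rule described above, together with the epsilon-greedy exploration that is typically paired with it, can be sketched in a few lines. Both function names are hypothetical, and q_values stands in for the Q-network's output for one state.

```python
import random


def greedy_action(q_values):
    """Index of the action with the highest Q(s, a)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniform random action, else act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)
```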

Parameters
  • env_spec (EnvSpec) – Environment specification.

  • policy (garage.torch.policies.Policy) – Policy. For DQN, this is a policy that performs the action that yields the highest Q value.

  • qf (nn.Module) – Q-value network.

  • replay_buffer (ReplayBuffer) – Replay buffer.

  • sampler (garage.sampler.Sampler) – Sampler.

  • steps_per_epoch (int) – Number of train_once calls per epoch.

  • n_train_steps (int) – Training steps.

  • eval_env (Environment) – Evaluation environment. If None, a copy of the main environment is used for evaluation.

  • double_q (bool) – Whether to use Double DQN. See https://arxiv.org/abs/1509.06461.

  • max_episode_length_eval (int or None) – Maximum length of episodes used for off-policy evaluation. If None, defaults to env_spec.max_episode_length.

  • buffer_batch_size (int) – Batch size of replay buffer.

  • min_buffer_size (int) – The minimum buffer size for replay buffer.

  • exploration_policy (ExplorationPolicy) – Exploration strategy, typically epsilon-greedy.

  • num_eval_episodes (int) – Number of evaluation episodes. Defaults to 10.

  • deterministic_eval (bool) – Whether to evaluate the policy deterministically (without exploration noise). False by default.

  • target_update_freq (int) – Number of optimization steps between each update to the target Q network.

  • discount (float) – Discount factor for the cumulative return.

  • qf_optimizer (Union[type, tuple[type, dict]]) – Type of optimizer for training Q-value network. This can be an optimizer type such as torch.optim.Adam or a tuple of type and dictionary, where dictionary contains arguments to initialize the optimizer e.g. (torch.optim.Adam, {‘lr’ : 1e-3}).

  • qf_lr (float) – Learning rate for Q-value network parameters.

  • clip_rewards (float) – Clip reward to be in [-clip_rewards, clip_rewards]. If None, rewards are not clipped.

  • clip_gradient (float) – Clip gradient norm to clip_gradient. If None, gradients are not clipped. Defaults to 10.

  • reward_scale (float) – Reward scale.
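To make the effect of double_q concrete, the following sketch computes the bootstrap target for one transition in both modes, following the Double DQN idea of selecting the next action with the online network and evaluating it with the target network. dqn_target is a hypothetical helper that operates on plain lists, not garage's code.

```python
def dqn_target(reward, done, next_q_online, next_q_target, *,
               discount=0.99, double_q=True):
    """TD target for one transition.

    next_q_online / next_q_target hold Q(s', a) for every action a, as
    estimated by the online and target networks respectively.
    """
    if double_q:
        # Online network selects the action; target network evaluates it.
        best = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
        bootstrap = next_q_target[best]
    else:
        # Vanilla DQN: the target network both selects and evaluates.
        bootstrap = max(next_q_target)
    return reward + discount * (1.0 - float(done)) * bootstrap
```

Decoupling selection from evaluation is what reduces the overestimation that comes from taking a max over noisy estimates.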

train(self, trainer)

Obtain samplers and start actual training for each epoch.

Parameters

trainer (Trainer) – Experiment trainer.

Returns

The average return in the last epoch cycle.

Return type

float

to(self, device=None)

Put all the networks within the model on device.

Parameters

device (str) – ID of GPU or CPU.

class VPG(env_spec, policy, value_function, sampler, policy_optimizer=None, vf_optimizer=None, num_train_per_epoch=1, discount=0.99, gae_lambda=1, center_adv=True, positive_adv=False, policy_ent_coeff=0.0, use_softplus_entropy=False, stop_entropy_gradient=False, entropy_method='no_entropy')

Bases: garage.np.algos.RLAlgorithm

Vanilla Policy Gradient (REINFORCE).

VPG, also known as REINFORCE, trains a stochastic policy in an on-policy way.

Parameters
  • env_spec (EnvSpec) – Environment specification.

  • policy (garage.torch.policies.Policy) – Policy.

  • value_function (garage.torch.value_functions.ValueFunction) – The value function.

  • sampler (garage.sampler.Sampler) – Sampler.

  • policy_optimizer (garage.torch.optimizer.OptimizerWrapper) – Optimizer for policy.

  • vf_optimizer (garage.torch.optimizer.OptimizerWrapper) – Optimizer for value function.

  • num_train_per_epoch (int) – Number of train_once calls per epoch.

  • discount (float) – Discount.

  • gae_lambda (float) – Lambda used for generalized advantage estimation.

  • center_adv (bool) – Whether to rescale the advantages so that they have mean 0 and standard deviation 1.

  • positive_adv (bool) – Whether to shift the advantages so that they are always positive. When used in conjunction with center_adv the advantages will be standardized before shifting.

  • policy_ent_coeff (float) – The coefficient of the policy entropy. Setting it to zero would mean no entropy regularization.

  • use_softplus_entropy (bool) – Whether to pass the entropy estimate through a softplus to prevent the entropy from being negative.

  • stop_entropy_gradient (bool) – Whether to stop the entropy gradient.

  • entropy_method (str) – A string from: ‘max’, ‘regularized’, ‘no_entropy’. The type of entropy method to use. ‘max’ adds the dense entropy to the reward for each time step. ‘regularized’ adds the mean entropy to the surrogate objective. See https://arxiv.org/abs/1805.00909 for more details.
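The roles of discount and gae_lambda can be seen in a small generalized advantage estimation routine. This is a minimal sketch for a single finished episode, assuming the value after the last step is zero. gae_advantages is a hypothetical name and the code is not taken from garage.

```python
def gae_advantages(rewards, values, *, discount=0.99, gae_lambda=1.0):
    """Generalized advantage estimates for one finished episode.

    values[t] is the baseline's estimate V(s_t); the value after the final
    step is taken to be zero.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # One-step TD residual at time t.
        delta = rewards[t] + discount * next_value - values[t]
        # Exponentially weighted sum of future residuals.
        running = delta + discount * gae_lambda * running
        advantages[t] = running
    return advantages
```

With gae_lambda=1 the estimate reduces to the discounted return minus the baseline, and with gae_lambda=0 it reduces to the one-step TD residual.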

property discount(self)

Discount factor used by the algorithm.

Returns

discount factor.

Return type

float

train(self, trainer)

Obtain samplers and start actual training for each epoch.

Parameters

trainer (Trainer) – Gives the algorithm access to Trainer.step_epochs(), which provides services such as snapshotting and sampler control.

Returns

The average return in the last epoch cycle.

Return type

float

class MAMLVPG(env, policy, value_function, sampler, task_sampler, inner_lr=_Default(0.1), outer_lr=0.001, discount=0.99, gae_lambda=1, center_adv=True, positive_adv=False, policy_ent_coeff=0.0, use_softplus_entropy=False, stop_entropy_gradient=False, entropy_method='no_entropy', meta_batch_size=20, num_grad_updates=1, meta_evaluator=None, evaluate_every_n_epochs=1)

Bases: garage.torch.algos.maml.MAML

Model-Agnostic Meta-Learning (MAML) applied to VPG.

Parameters
  • env (Environment) – A multi-task environment.

  • policy (garage.torch.policies.Policy) – Policy.

  • value_function (garage.np.baselines.Baseline) – The value function.

  • sampler (garage.sampler.Sampler) – Sampler.

  • task_sampler (garage.experiment.TaskSampler) – Task sampler.

  • inner_lr (float) – Adaptation learning rate.

  • outer_lr (float) – Meta policy learning rate.

  • discount (float) – Discount.

  • gae_lambda (float) – Lambda used for generalized advantage estimation.

  • center_adv (bool) – Whether to rescale the advantages so that they have mean 0 and standard deviation 1.

  • positive_adv (bool) – Whether to shift the advantages so that they are always positive. When used in conjunction with center_adv the advantages will be standardized before shifting.

  • policy_ent_coeff (float) – The coefficient of the policy entropy. Setting it to zero would mean no entropy regularization.

  • use_softplus_entropy (bool) – Whether to pass the entropy estimate through a softplus to prevent the entropy from being negative.

  • stop_entropy_gradient (bool) – Whether to stop the entropy gradient.

  • entropy_method (str) – A string from: ‘max’, ‘regularized’, ‘no_entropy’. The type of entropy method to use. ‘max’ adds the dense entropy to the reward for each time step. ‘regularized’ adds the mean entropy to the surrogate objective. See https://arxiv.org/abs/1805.00909 for more details.

  • meta_batch_size (int) – Number of tasks sampled per batch.

  • num_grad_updates (int) – Number of adaptation gradient steps.

  • meta_evaluator (garage.experiment.MetaEvaluator) – A meta evaluator for meta-testing. If None, don’t do meta-testing.

  • evaluate_every_n_epochs (int) – Run meta-testing once every this many epochs.

train(self, trainer)

Obtain samples and start training for each epoch.

Parameters

trainer (Trainer) – Gives the algorithm access to Trainer.step_epochs(), which provides services such as snapshotting and sampler control.

Returns

The average return in the last epoch cycle.

Return type

float

property policy(self)

Current policy of the inner algorithm.

Returns

Current policy of the inner algorithm.

Return type

garage.torch.policies.Policy

get_exploration_policy(self)

Return a policy used before adaptation to a specific task.

Each time it is retrieved, this policy should only be evaluated in one task.

Returns

The policy used to obtain samples that are later used for meta-RL adaptation.

Return type

Policy

adapt_policy(self, exploration_policy, exploration_episodes)

Adapt the policy by one gradient step for a task.

Parameters
  • exploration_policy (Policy) – A policy which was returned from get_exploration_policy(), and which generated exploration_episodes by interacting with an environment. The caller may not use this object after passing it into this method.

  • exploration_episodes (EpisodeBatch) – Episodes with which to adapt, generated by exploration_policy exploring the environment.

Returns

A policy adapted to the task represented by the exploration_episodes.

Return type

Policy
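The adaptation step can be pictured as ordinary gradient descent from the meta-learned parameters, scaled by inner_lr. The sketch below uses plain lists as parameters and a toy quadratic loss as a stand-in for the policy-gradient objective computed from exploration_episodes. adapt and quadratic_task_grad are hypothetical names, not garage functions.

```python
def adapt(params, grad_fn, inner_lr=0.1, num_grad_updates=1):
    """Return a task-adapted copy of params after a few gradient steps."""
    adapted = list(params)
    for _ in range(num_grad_updates):
        grads = grad_fn(adapted)
        adapted = [p - inner_lr * g for p, g in zip(adapted, grads)]
    return adapted


def quadratic_task_grad(target):
    """Gradient of the toy task loss sum((p - target)^2)."""
    return lambda params: [2.0 * (p - t) for p, t in zip(params, target)]


# The meta-parameters stay untouched; each task gets its own adapted copy.
meta_params = [0.0, 0.0]
task_params = adapt(meta_params, quadratic_task_grad([1.0, -1.0]), inner_lr=0.25)
```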

class PPO(env_spec, policy, value_function, sampler, policy_optimizer=None, vf_optimizer=None, lr_clip_range=0.2, num_train_per_epoch=1, discount=0.99, gae_lambda=0.97, center_adv=True, positive_adv=False, policy_ent_coeff=0.0, use_softplus_entropy=False, stop_entropy_gradient=False, entropy_method='no_entropy')

Bases: garage.torch.algos.VPG

Proximal Policy Optimization (PPO).

Parameters
  • env_spec (EnvSpec) – Environment specification.

  • policy (garage.torch.policies.Policy) – Policy.

  • value_function (garage.torch.value_functions.ValueFunction) – The value function.

  • sampler (garage.sampler.Sampler) – Sampler.

  • policy_optimizer (garage.torch.optimizer.OptimizerWrapper) – Optimizer for policy.

  • vf_optimizer (garage.torch.optimizer.OptimizerWrapper) – Optimizer for value function.

  • lr_clip_range (float) – The limit on the likelihood ratio between policies.

  • num_train_per_epoch (int) – Number of train_once calls per epoch.

  • discount (float) – Discount.

  • gae_lambda (float) – Lambda used for generalized advantage estimation.

  • center_adv (bool) – Whether to rescale the advantages so that they have mean 0 and standard deviation 1.

  • positive_adv (bool) – Whether to shift the advantages so that they are always positive. When used in conjunction with center_adv the advantages will be standardized before shifting.

  • policy_ent_coeff (float) – The coefficient of the policy entropy. Setting it to zero would mean no entropy regularization.

  • use_softplus_entropy (bool) – Whether to pass the entropy estimate through a softplus to prevent the entropy from being negative.

  • stop_entropy_gradient (bool) – Whether to stop the entropy gradient.

  • entropy_method (str) – A string from: ‘max’, ‘regularized’, ‘no_entropy’. The type of entropy method to use. ‘max’ adds the dense entropy to the reward for each time step. ‘regularized’ adds the mean entropy to the surrogate objective. See https://arxiv.org/abs/1805.00909 for more details.
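The role of lr_clip_range is easiest to see in the clipped surrogate objective from the PPO paper. The following is a minimal per-sample sketch under that assumption. clipped_surrogate is a hypothetical helper and omits the entropy terms controlled by the other parameters.

```python
import math


def clipped_surrogate(new_log_prob, old_log_prob, advantage, lr_clip_range=0.2):
    """PPO-style clipped objective for a single sample (to be maximized)."""
    # Likelihood ratio between the new and old policies.
    ratio = math.exp(new_log_prob - old_log_prob)
    clipped = max(1.0 - lr_clip_range, min(1.0 + lr_clip_range, ratio))
    # Taking the minimum removes the incentive to move the ratio outside
    # [1 - lr_clip_range, 1 + lr_clip_range].
    return min(ratio * advantage, clipped * advantage)
```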

property discount(self)

Discount factor used by the algorithm.

Returns

discount factor.

Return type

float

train(self, trainer)

Obtain samplers and start actual training for each epoch.

Parameters

trainer (Trainer) – Gives the algorithm access to Trainer.step_epochs(), which provides services such as snapshotting and sampler control.

Returns

The average return in the last epoch cycle.

Return type

float

class MAMLPPO(env, policy, value_function, sampler, task_sampler, inner_lr=_Default(0.1), outer_lr=0.001, lr_clip_range=0.5, discount=0.99, gae_lambda=1.0, center_adv=True, positive_adv=False, policy_ent_coeff=0.0, use_softplus_entropy=False, stop_entropy_gradient=False, entropy_method='no_entropy', meta_batch_size=20, num_grad_updates=1, meta_evaluator=None, evaluate_every_n_epochs=1)

Bases: garage.torch.algos.maml.MAML

Model-Agnostic Meta-Learning (MAML) applied to PPO.

Parameters
  • env (Environment) – A multi-task environment.

  • policy (garage.torch.policies.Policy) – Policy.

  • value_function (garage.np.baselines.Baseline) – The value function.

  • sampler (garage.sampler.Sampler) – Sampler.

  • task_sampler (garage.experiment.TaskSampler) – Task sampler.

  • inner_lr (float) – Adaptation learning rate.

  • outer_lr (float) – Meta policy learning rate.

  • lr_clip_range (float) – The limit on the likelihood ratio between policies.

  • discount (float) – Discount.

  • gae_lambda (float) – Lambda used for generalized advantage estimation.

  • center_adv (bool) – Whether to rescale the advantages so that they have mean 0 and standard deviation 1.

  • positive_adv (bool) – Whether to shift the advantages so that they are always positive. When used in conjunction with center_adv the advantages will be standardized before shifting.

  • policy_ent_coeff (float) – The coefficient of the policy entropy. Setting it to zero would mean no entropy regularization.

  • use_softplus_entropy (bool) – Whether to pass the entropy estimate through a softplus to prevent the entropy from being negative.

  • stop_entropy_gradient (bool) – Whether to stop the entropy gradient.

  • entropy_method (str) – A string from: ‘max’, ‘regularized’, ‘no_entropy’. The type of entropy method to use. ‘max’ adds the dense entropy to the reward for each time step. ‘regularized’ adds the mean entropy to the surrogate objective. See https://arxiv.org/abs/1805.00909 for more details.

  • meta_batch_size (int) – Number of tasks sampled per batch.

  • num_grad_updates (int) – Number of adaptation gradient steps.

  • meta_evaluator (garage.experiment.MetaEvaluator) – A meta evaluator for meta-testing. If None, don’t do meta-testing.

  • evaluate_every_n_epochs (int) – Run meta-testing once every this many epochs.

train(self, trainer)

Obtain samples and start training for each epoch.

Parameters

trainer (Trainer) – Gives the algorithm access to Trainer.step_epochs(), which provides services such as snapshotting and sampler control.

Returns

The average return in the last epoch cycle.

Return type

float

property policy(self)

Current policy of the inner algorithm.

Returns

Current policy of the inner algorithm.

Return type

garage.torch.policies.Policy

get_exploration_policy(self)

Return a policy used before adaptation to a specific task.

Each time it is retrieved, this policy should only be evaluated in one task.

Returns

The policy used to obtain samples that are later used for meta-RL adaptation.

Return type

Policy

adapt_policy(self, exploration_policy, exploration_episodes)

Adapt the policy by one gradient step for a task.

Parameters
  • exploration_policy (Policy) – A policy which was returned from get_exploration_policy(), and which generated exploration_episodes by interacting with an environment. The caller may not use this object after passing it into this method.

  • exploration_episodes (EpisodeBatch) – Episodes with which to adapt, generated by exploration_policy exploring the environment.

Returns

A policy adapted to the task represented by the exploration_episodes.

Return type

Policy

class TD3(env_spec, policy, qf1, qf2, replay_buffer, sampler, *, max_episode_length_eval=None, grad_steps_per_env_step, exploration_policy, uniform_random_policy=None, max_action=None, target_update_tau=0.005, discount=0.99, reward_scaling=1.0, update_actor_interval=2, buffer_batch_size=64, replay_buffer_size=1000000.0, min_buffer_size=10000.0, exploration_noise=0.1, policy_noise=0.2, policy_noise_clip=0.5, clip_return=np.inf, policy_lr=_Default(0.0001), qf_lr=_Default(0.001), policy_optimizer=torch.optim.Adam, qf_optimizer=torch.optim.Adam, num_evaluation_episodes=10, steps_per_epoch=20, start_steps=10000, update_after=1000, use_deterministic_evaluation=False)

Bases: garage.np.algos.RLAlgorithm

Implementation of TD3.

Based on https://arxiv.org/pdf/1802.09477.pdf.

Parameters
  • env_spec (EnvSpec) – Environment specification.

  • policy (garage.torch.policies.Policy) – Policy (actor network).

  • qf1 (garage.torch.q_functions.QFunction) – Q function (critic network).

  • qf2 (garage.torch.q_functions.QFunction) – Q function (critic network).

  • replay_buffer (ReplayBuffer) – Replay buffer.

  • sampler (garage.sampler.Sampler) – Sampler.

  • replay_buffer_size (int) – Size of the replay buffer.

  • exploration_policy (garage.np.exploration_policies.ExplorationPolicy) – Exploration strategy.

  • uniform_random_policy (garage.np.exploration_policies.ExplorationPolicy) – Uniform random exploration strategy.

  • target_update_tau (float) – Interpolation parameter for doing the soft target update.

  • discount (float) – Discount factor (gamma) for the cumulative return.

  • reward_scaling (float) – Reward scaling.

  • update_actor_interval (int) – Policy (Actor network) update interval.

  • max_action (float) – Maximum action magnitude.

  • buffer_batch_size (int) – Size of the batch sampled from the replay buffer.

  • min_buffer_size (int) – The minimum buffer size for replay buffer.

  • policy_noise (float) – Policy (actor) noise.

  • policy_noise_clip (float) – Noise clip.

  • exploration_noise (float) – Exploration noise.

  • clip_return (float) – Clip return to be in [-clip_return, clip_return].

  • policy_lr (float) – Learning rate for training policy network.

  • qf_lr (float) – Learning rate for training Q network.

  • policy_optimizer (Union[type, tuple[type, dict]]) – Type of optimizer for training policy network. This can be an optimizer type such as torch.optim.Adam or a tuple of type and dictionary, where dictionary contains arguments to initialize the optimizer e.g. (torch.optim.Adam, {‘lr’ : 1e-3}).

  • qf_optimizer (Union[type, tuple[type, dict]]) – Type of optimizer for training Q-value network. This can be an optimizer type such as torch.optim.Adam or a tuple of type and dictionary, where dictionary contains arguments to initialize the optimizer e.g. (torch.optim.Adam, {‘lr’ : 1e-3}).

  • steps_per_epoch (int) – Number of train_once calls per epoch.

  • grad_steps_per_env_step (int) – Number of gradient steps taken per environment step sampled.

  • max_episode_length_eval (int or None) – Maximum length of episodes used for off-policy evaluation. If None, defaults to env_spec.max_episode_length.

  • num_evaluation_episodes (int) – The number of evaluation trajectories used for computing eval stats at the end of every epoch.

  • start_steps (int) – The number of steps for warming up before selecting actions according to policy.

  • update_after (int) – The number of steps to perform before policy is updated.

  • use_deterministic_evaluation (bool) – True if the trained policy should be evaluated deterministically.
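Several of these parameters (policy_noise, policy_noise_clip, max_action, discount) meet in the critic target described in the TD3 paper: target policy smoothing followed by a minimum over the two critics. The sketch below follows that description for a scalar action. td3_target is a hypothetical helper, the critics are passed in as plain callables, and the default max_action=1.0 is an assumption for illustration.

```python
import random


def td3_target(reward, done, next_action, q1_next, q2_next, *,
               discount=0.99, policy_noise=0.2, policy_noise_clip=0.5,
               max_action=1.0, rng=random):
    """Clipped double-Q target with target policy smoothing.

    q1_next and q2_next are callables standing in for the two target critics,
    evaluated at the next state for a given (scalar) action.
    """
    # Smooth the target action with clipped Gaussian noise.
    noise = rng.gauss(0.0, policy_noise)
    noise = max(-policy_noise_clip, min(policy_noise_clip, noise))
    smoothed = max(-max_action, min(max_action, next_action + noise))
    # Use the more pessimistic of the two critics to curb overestimation.
    bootstrap = min(q1_next(smoothed), q2_next(smoothed))
    return reward + discount * (1.0 - float(done)) * bootstrap
```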

train(self, trainer)

Obtain samplers and start actual training for each epoch.

Parameters

trainer (Trainer) – Experiment trainer, which provides services such as snapshotting and sampler control.

property networks(self)

Return all the networks within the model.

Returns

A list of networks.

Return type

list

to(self, device=None)

Put all the networks within the model on device.

Parameters

device (str) – ID of GPU or CPU.

class TRPO(env_spec, policy, value_function, sampler, policy_optimizer=None, vf_optimizer=None, num_train_per_epoch=1, discount=0.99, gae_lambda=0.98, center_adv=True, positive_adv=False, policy_ent_coeff=0.0, use_softplus_entropy=False, stop_entropy_gradient=False, entropy_method='no_entropy')

Bases: garage.torch.algos.VPG

Trust Region Policy Optimization (TRPO).

Parameters
  • env_spec (EnvSpec) – Environment specification.

  • policy (garage.torch.policies.Policy) – Policy.

  • value_function (garage.torch.value_functions.ValueFunction) – The value function.

  • sampler (garage.sampler.Sampler) – Sampler.

  • policy_optimizer (garage.torch.optimizer.OptimizerWrapper) – Optimizer for policy.

  • vf_optimizer (garage.torch.optimizer.OptimizerWrapper) – Optimizer for value function.

  • num_train_per_epoch (int) – Number of train_once calls per epoch.

  • discount (float) – Discount.

  • gae_lambda (float) – Lambda used for generalized advantage estimation.

  • center_adv (bool) – Whether to rescale the advantages so that they have mean 0 and standard deviation 1.

  • positive_adv (bool) – Whether to shift the advantages so that they are always positive. When used in conjunction with center_adv the advantages will be standardized before shifting.

  • policy_ent_coeff (float) – The coefficient of the policy entropy. Setting it to zero would mean no entropy regularization.

  • use_softplus_entropy (bool) – Whether to pass the entropy estimate through a softplus to prevent the entropy from being negative.

  • stop_entropy_gradient (bool) – Whether to stop the entropy gradient.

  • entropy_method (str) – A string from: ‘max’, ‘regularized’, ‘no_entropy’. The type of entropy method to use. ‘max’ adds the dense entropy to the reward for each time step. ‘regularized’ adds the mean entropy to the surrogate objective. See https://arxiv.org/abs/1805.00909 for more details.
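The interaction between center_adv and positive_adv (standardize first, then shift) can be sketched as follows. process_advantages is a hypothetical helper. The small eps constant and the use of the population standard deviation are assumptions made for this illustration.

```python
import math


def process_advantages(advantages, *, center_adv=True, positive_adv=False,
                       eps=1e-8):
    """Standardize and/or shift advantages according to the two flags."""
    result = list(advantages)
    if center_adv:
        mean = sum(result) / len(result)
        var = sum((a - mean) ** 2 for a in result) / len(result)
        std = math.sqrt(var)
        # Rescale to mean 0 and (approximately) standard deviation 1.
        result = [(a - mean) / (std + eps) for a in result]
    if positive_adv:
        # Shift so that the smallest advantage is slightly above zero.
        lowest = min(result)
        result = [a - lowest + eps for a in result]
    return result
```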

property discount(self)

Discount factor used by the algorithm.

Returns

discount factor.

Return type

float

train(self, trainer)

Obtain samplers and start actual training for each epoch.

Parameters

trainer (Trainer) – Gives the algorithm access to Trainer.step_epochs(), which provides services such as snapshotting and sampler control.

Returns

The average return in the last epoch cycle.

Return type

float

class MAMLTRPO(env, policy, value_function, sampler, task_sampler, inner_lr=_Default(0.01), outer_lr=0.001, max_kl_step=0.01, discount=0.99, gae_lambda=1, center_adv=True, positive_adv=False, policy_ent_coeff=0.0, use_softplus_entropy=False, stop_entropy_gradient=False, entropy_method='no_entropy', meta_batch_size=40, num_grad_updates=1, meta_evaluator=None, evaluate_every_n_epochs=1)

Bases: garage.torch.algos.maml.MAML

Model-Agnostic Meta-Learning (MAML) applied to TRPO.

Parameters
  • env (Environment) – A multi-task environment.

  • policy (garage.torch.policies.Policy) – Policy.

  • value_function (garage.np.baselines.Baseline) – The value function.

  • sampler (garage.sampler.Sampler) – Sampler.

  • task_sampler (garage.experiment.TaskSampler) – Task sampler.

  • inner_lr (float) – Adaptation learning rate.

  • outer_lr (float) – Meta policy learning rate.

  • max_kl_step (float) – The maximum KL divergence between old and new policies.

  • discount (float) – Discount.

  • gae_lambda (float) – Lambda used for generalized advantage estimation.

  • center_adv (bool) – Whether to rescale the advantages so that they have mean 0 and standard deviation 1.

  • positive_adv (bool) – Whether to shift the advantages so that they are always positive. When used in conjunction with center_adv the advantages will be standardized before shifting.

  • policy_ent_coeff (float) – The coefficient of the policy entropy. Setting it to zero would mean no entropy regularization.

  • use_softplus_entropy (bool) – Whether to pass the entropy estimate through a softplus to prevent the entropy from being negative.

  • stop_entropy_gradient (bool) – Whether to stop the entropy gradient.

  • entropy_method (str) – A string from: ‘max’, ‘regularized’, ‘no_entropy’. The type of entropy method to use. ‘max’ adds the dense entropy to the reward for each time step. ‘regularized’ adds the mean entropy to the surrogate objective. See https://arxiv.org/abs/1805.00909 for more details.

  • meta_batch_size (int) – Number of tasks sampled per batch.

  • num_grad_updates (int) – Number of adaptation gradient steps.

  • meta_evaluator (garage.experiment.MetaEvaluator) – A meta evaluator for meta-testing. If None, don’t do meta-testing.

  • evaluate_every_n_epochs (int) – Run meta-testing once every this many epochs.
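To give a feel for what a KL bound such as max_kl_step constrains, the sketch below computes the KL divergence between two univariate Gaussian policies and shrinks a proposed update until the bound holds. This is a simplified stand-in using a backtracking search on a single mean parameter, not the optimizer used by the library. Both function names and the shrink and max_tries parameters are hypothetical.

```python
import math


def gaussian_kl(mean_old, std_old, mean_new, std_new):
    """KL(old || new) between two univariate Gaussian policies."""
    return (math.log(std_new / std_old)
            + (std_old ** 2 + (mean_old - mean_new) ** 2) / (2.0 * std_new ** 2)
            - 0.5)


def backtrack_step(mean_old, std, full_step, max_kl_step=0.01,
                   shrink=0.5, max_tries=10):
    """Shrink a proposed mean update until the KL constraint holds."""
    step = full_step
    for _ in range(max_tries):
        if gaussian_kl(mean_old, std, mean_old + step, std) <= max_kl_step:
            return step
        step *= shrink
    # No acceptable step found: leave the policy unchanged.
    return 0.0
```

With a fixed standard deviation of 1, the KL divergence equals half the squared change in the mean, so a full step of 1.0 is halved three times before it satisfies the default bound.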

train(self, trainer)

Obtain samples and start training for each epoch.

Parameters

trainer (Trainer) – Gives the algorithm access to Trainer.step_epochs(), which provides services such as snapshotting and sampler control.

Returns

The average return in the last epoch cycle.

Return type

float

property policy(self)

Current policy of the inner algorithm.

Returns

Current policy of the inner algorithm.

Return type

garage.torch.policies.Policy

get_exploration_policy(self)

Return a policy used before adaptation to a specific task.

Each time it is retrieved, this policy should only be evaluated in one task.

Returns

The policy used to obtain samples that are later used for meta-RL adaptation.

Return type

Policy

adapt_policy(self, exploration_policy, exploration_episodes)

Adapt the policy by one gradient step for a task.

Parameters
  • exploration_policy (Policy) – A policy which was returned from get_exploration_policy(), and which generated exploration_episodes by interacting with an environment. The caller may not use this object after passing it into this method.

  • exploration_episodes (EpisodeBatch) – Episodes with which to adapt, generated by exploration_policy exploring the environment.

Returns

A policy adapted to the task represented by the exploration_episodes.

Return type

Policy

class SAC(env_spec, policy, qf1, qf2, replay_buffer, sampler, *, max_episode_length_eval=None, gradient_steps_per_itr, fixed_alpha=None, target_entropy=None, initial_log_entropy=0.0, discount=0.99, buffer_batch_size=64, min_buffer_size=int(10000.0), target_update_tau=0.005, policy_lr=0.0003, qf_lr=0.0003, reward_scale=1.0, optimizer=torch.optim.Adam, steps_per_epoch=1, num_evaluation_episodes=10, eval_env=None, use_deterministic_evaluation=True, temporal_regularization_factor=0.0, spatial_regularization_factor=0.0, spatial_regularization_eps=1.0)

Bases: garage.np.algos.RLAlgorithm

A SAC Model in Torch.

Based on Soft Actor-Critic Algorithms and Applications:

https://arxiv.org/abs/1812.05905

Soft Actor-Critic (SAC) is an algorithm which optimizes a stochastic policy in an off-policy way, forming a bridge between stochastic policy optimization and DDPG-style approaches. A central feature of SAC is entropy regularization. The policy is trained to maximize a trade-off between expected return and entropy, a measure of randomness in the policy. This has a close connection to the exploration-exploitation trade-off: increasing entropy results in more exploration, which can accelerate learning later on. It can also prevent the policy from prematurely converging to a bad local optimum.
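The trade-off between return and entropy shows up directly in the critic target: the bootstrap value is reduced by alpha times the log-probability of the next action, so less likely (higher-entropy) actions receive a bonus. The sketch below follows the formulation in the SAC paper for one transition. soft_q_target is a hypothetical helper, and the default alpha=0.2 is an arbitrary illustrative temperature, not a garage default.

```python
def soft_q_target(reward, done, q1_next, q2_next, next_log_prob, *,
                  alpha=0.2, discount=0.99, reward_scale=1.0):
    """Entropy-regularized TD target for one transition.

    q1_next / q2_next are the two target critics' values at (s', a'), with a'
    sampled from the current policy and next_log_prob = log pi(a' | s').
    """
    # Soft state value: pessimistic Q estimate plus an entropy bonus.
    soft_value = min(q1_next, q2_next) - alpha * next_log_prob
    return reward_scale * reward + discount * (1.0 - float(done)) * soft_value
```

Setting alpha to zero recovers an ordinary clipped double-Q target with no entropy term.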

Parameters
  • policy (garage.torch.policies.Policy) – Policy/Actor/Agent that is being optimized by SAC.

  • qf1 (garage.torch.q_functions.ContinuousMLPQFunction) – QFunction/Critic used for actor/policy optimization. See Soft Actor-Critic Algorithms and Applications.

  • qf2 (garage.torch.q_functions.ContinuousMLPQFunction) – QFunction/Critic used for actor/policy optimization. See Soft Actor-Critic Algorithms and Applications.

  • replay_buffer (ReplayBuffer) – Stores transitions that are previously collected by the sampler.

  • sampler (garage.sampler.Sampler) – Sampler.

  • env_spec (EnvSpec) – The env_spec attribute of the environment that the agent is being trained in.

  • max_episode_length_eval (int or None) – Maximum length of episodes used for off-policy evaluation. If None, defaults to env_spec.max_episode_length.

  • gradient_steps_per_itr (int) – Number of optimization steps that should occur before the training step is over and a new batch of transitions is collected by the sampler.

  • fixed_alpha (float) – The entropy/temperature to be used if temperature is not supposed to be learned.

  • target_entropy (float) – target entropy to be used during entropy/temperature optimization. If None, the default heuristic from Soft Actor-Critic Algorithms and Applications is used.

  • initial_log_entropy (float) – Initial value of the log of the entropy/temperature coefficient, used when the coefficient is being learned rather than fixed (fixed_alpha=None).

  • discount (float) – Discount factor to be used during sampling and critic/q_function optimization.

  • buffer_batch_size (int) – The number of transitions sampled from the replay buffer that are used during a single optimization step.

  • min_buffer_size (int) – The minimum number of transitions that need to be in the replay buffer before training can begin.

  • target_update_tau (float) – coefficient that controls the rate at which the target q_functions update over optimization iterations.

  • policy_lr (float) – learning rate for policy optimizers.

  • qf_lr (float) – learning rate for q_function optimizers.

  • reward_scale (float) – reward scale. Changing this hyperparameter changes the effect that the reward from a transition will have during optimization.

  • optimizer (torch.optim.Optimizer) – optimizer to be used for policy/actor, q_functions/critics, and temperature/entropy optimizations.

  • steps_per_epoch (int) – Number of train_once calls per epoch.

  • num_evaluation_episodes (int) – The number of evaluation episodes used for computing eval stats at the end of every epoch.

  • eval_env (Environment) – environment used for collecting evaluation episodes. If None, a copy of the train env is used.

  • use_deterministic_evaluation (bool) – True if the trained policy should be evaluated deterministically.

  • temporal_regularization_factor (float) – coefficient that determines the temporal regularization penalty as defined in CAPS as lambda_t

  • spatial_regularization_factor (float) – coefficient that determines the spatial regularization penalty as defined in CAPS as lambda_s

  • spatial_regularization_eps (float) – sigma of the normal distribution from which spatial regularization observations are drawn; in CAPS this is defined as epsilon_s
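To make the role of target_update_tau concrete, here is a minimal sketch of a soft (Polyak) target update, the usual reading of such a coefficient. It operates on plain lists of floats rather than network parameters, and soft_update is a hypothetical helper, not a method of SAC.

```python
def soft_update(target_params, source_params, tau):
    """Move each target parameter a fraction tau toward its source parameter."""
    return [(1.0 - tau) * t + tau * s
            for t, s in zip(target_params, source_params)]


target = [0.0, 0.0]
online = [1.0, -1.0]

# Small tau means the target network trails the online network slowly.
for _ in range(3):
    target = soft_update(target, online, tau=0.005)
print(target)
```

Under this reading, tau=1.0 copies the online parameters outright and tau=0.0 freezes the target, so the default of 0.005 sits close to the frozen end.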

train(self, trainer)

Obtain samplers and start actual training for each epoch.

Parameters

trainer (Trainer) – Gives the algorithm access to Trainer.step_epochs(), which provides services such as snapshotting and sampler control.

Returns

The average return in the last epoch cycle.

Return type

float

train_once(self, itr=None, paths=None)

Complete 1 training iteration of SAC.

Parameters
  • itr (int) – Iteration number. This argument is deprecated.

  • paths (list[dict]) – A list of collected paths. This argument is deprecated.

Returns

The loss from the actor/policy network after optimization, the loss from the 1st Q-function after optimization, and the loss from the 2nd Q-function after optimization, each a torch.Tensor.

Return type

torch.Tensor

optimize_policy(self, samples_data)

Optimize the policy, q_functions, and temperature coefficient.

Parameters

samples_data (dict) – Transitions (S, A, R, S’) that are sampled from the replay buffer. It should have the keys ‘observation’, ‘action’, ‘reward’, ‘terminal’, and ‘next_observation’.

Note

samples_data’s entries should be torch.Tensor objects with the following shapes:

  • observation: \((N, O^*)\)

  • action: \((N, A^*)\)

  • reward: \((N, 1)\)

  • terminal: \((N, 1)\)

  • next_observation: \((N, O^*)\)

Returns

The loss from the actor/policy network after optimization, the loss from the 1st Q-function after optimization, and the loss from the 2nd Q-function after optimization, each a torch.Tensor.

Return type

torch.Tensor
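The way the reward and terminal entries, together with discount and reward_scale, typically enter the critic update can be sketched as a soft Bellman target. This is a scalar, plain-Python illustration of the standard SAC target and assumes the two Q-values and the log-probability for the next state are already available; soft_q_target and its argument names are hypothetical.

```python
def soft_q_target(reward, terminal, next_q1, next_q2, next_log_pi,
                  alpha, discount=0.99, reward_scale=1.0):
    """Soft Bellman target for one transition.

    Uses the smaller of the two target Q-values and subtracts the
    entropy term alpha * log pi(a'|s'). A terminal flag of 1.0 drops
    the bootstrapped part entirely.
    """
    soft_value = min(next_q1, next_q2) - alpha * next_log_pi
    return reward_scale * reward + (1.0 - terminal) * discount * soft_value


# Non-terminal transition: bootstraps from the next state.
print(soft_q_target(1.0, 0.0, next_q1=2.0, next_q2=3.0,
                    next_log_pi=-1.0, alpha=0.5))
# Terminal transition: only the (scaled) reward remains.
print(soft_q_target(1.0, 1.0, next_q1=2.0, next_q2=3.0,
                    next_log_pi=-1.0, alpha=0.5, reward_scale=2.0))
```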

property networks(self)

Return all the networks within the model.

Returns

A list of networks.

Return type

list

to(self, device=None)

Put all the networks within the model on device.

Parameters

device (str) – ID of GPU or CPU.

class PEARL(env, inner_policy, qf, vf, sampler, *, num_train_tasks, num_test_tasks=None, latent_dim, encoder_hidden_sizes, test_env_sampler, policy_class=ContextConditionedPolicy, encoder_class=MLPEncoder, policy_lr=0.0003, qf_lr=0.0003, vf_lr=0.0003, context_lr=0.0003, policy_mean_reg_coeff=0.001, policy_std_reg_coeff=0.001, policy_pre_activation_coeff=0.0, soft_target_tau=0.005, kl_lambda=0.1, optimizer_class=torch.optim.Adam, use_information_bottleneck=True, use_next_obs_in_context=False, meta_batch_size=64, num_steps_per_epoch=1000, num_initial_steps=100, num_tasks_sample=100, num_steps_prior=100, num_steps_posterior=0, num_extra_rl_steps_posterior=100, batch_size=1024, embedding_batch_size=1024, embedding_mini_batch_size=1024, discount=0.99, replay_buffer_size=1000000, reward_scale=1, update_post_train=1)

Bases: garage.np.algos.MetaRLAlgorithm

Inheritance diagram of garage.torch.algos.PEARL

A PEARL model based on https://arxiv.org/abs/1903.08254.

PEARL, which stands for Probabilistic Embeddings for Actor-Critic Reinforcement Learning, is an off-policy meta-RL algorithm. It is built on top of SAC using two Q-functions and a value function, with the addition of an inference network that estimates the posterior \(q(z \mid c)\). The policy is conditioned on the latent variable z in order to adapt its behavior to specific tasks.

Parameters
  • env (list[Environment]) – Batch of sampled environment updates (EnvUpdate), which, when invoked on environments, will configure them with new tasks.

  • policy_class (type) – Class implementing ContextConditionedPolicy.

  • encoder_class (garage.torch.embeddings.ContextEncoder) – Encoder class for the encoder in context-conditioned policy.

  • inner_policy (garage.torch.policies.Policy) – Policy.

  • qf (torch.nn.Module) – Q-function.

  • vf (torch.nn.Module) – Value function.

  • sampler (garage.sampler.Sampler) – Sampler.

  • num_train_tasks (int) – Number of tasks for training.

  • num_test_tasks (int or None) – Number of tasks for testing.

  • latent_dim (int) – Size of latent context vector.

  • encoder_hidden_sizes (list[int]) – Output dimension of dense layer(s) of the context encoder.

  • test_env_sampler (garage.experiment.SetTaskSampler) – Sampler for test tasks.

  • policy_lr (float) – Policy learning rate.

  • qf_lr (float) – Q-function learning rate.

  • vf_lr (float) – Value function learning rate.

  • context_lr (float) – Inference network learning rate.

  • policy_mean_reg_coeff (float) – Policy mean regularization weight.

  • policy_std_reg_coeff (float) – Policy std regularization weight.

  • policy_pre_activation_coeff (float) – Policy pre-activation weight.

  • soft_target_tau (float) – Interpolation parameter for doing the soft target update.

  • kl_lambda (float) – KL lambda value.

  • optimizer_class (type) – Type of optimizer for training networks.

  • use_information_bottleneck (bool) – Whether to use the information bottleneck. False means the latent context is deterministic.

  • use_next_obs_in_context (bool) – Whether or not to use next observation in distinguishing between tasks.

  • meta_batch_size (int) – Meta batch size.

  • num_steps_per_epoch (int) – Number of iterations per epoch.

  • num_initial_steps (int) – Number of transitions obtained per task before training.

  • num_tasks_sample (int) – Number of random tasks to obtain data for each iteration.

  • num_steps_prior (int) – Number of transitions to obtain per task with z ~ prior.

  • num_steps_posterior (int) – Number of transitions to obtain per task with z ~ posterior.

  • num_extra_rl_steps_posterior (int) – Number of additional transitions to obtain per task with z ~ posterior that are only used to train the policy and NOT the encoder.

  • batch_size (int) – Number of transitions in RL batch.

  • embedding_batch_size (int) – Number of transitions in context batch.

  • embedding_mini_batch_size (int) – Number of transitions in mini context batch; should be same as embedding_batch_size for non-recurrent encoder.

  • discount (float) – RL discount factor.

  • replay_buffer_size (int) – Maximum samples in replay buffer.

  • reward_scale (int) – Reward scale.

  • update_post_train (int) – How often to resample context when obtaining data during training (in episodes).
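One plausible reading of kl_lambda, consistent with the PEARL paper's information bottleneck, is as a weight on the KL divergence between the inferred Gaussian posterior over z and a unit Gaussian prior. The sketch below computes that penalty in closed form for a diagonal Gaussian using plain Python; kl_to_unit_gaussian and bottleneck_penalty are hypothetical helpers, and how the class combines this term with its other losses is not shown here.

```python
import math


def kl_to_unit_gaussian(means, variances):
    """KL( N(means, diag(variances)) || N(0, I) ) in closed form."""
    return 0.5 * sum(v + m * m - 1.0 - math.log(v)
                     for m, v in zip(means, variances))


def bottleneck_penalty(means, variances, kl_lambda=0.1):
    """KL term scaled by kl_lambda, as it might enter the encoder loss."""
    return kl_lambda * kl_to_unit_gaussian(means, variances)


# A posterior equal to the prior carries no penalty.
print(bottleneck_penalty([0.0, 0.0], [1.0, 1.0]))
# A confident posterior far from the prior is penalized.
print(bottleneck_penalty([1.5, -0.5], [0.1, 0.2]))
```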

train(self, trainer)

Obtain samples, train, and evaluate for each epoch.

Parameters

trainer (Trainer) – Gives the algorithm access to Trainer.step_epochs(), which provides services such as snapshotting and sampler control.

property policy(self)

Return the policy within the model.

Returns

Policy within the model.

Return type

garage.torch.policies.Policy

property networks(self)

Return all the networks within the model.

Returns

A list of networks.

Return type

list

get_exploration_policy(self)

Return a policy used before adaptation to a specific task.

Each time it is retrieved, this policy should only be evaluated in one task.

Returns

The policy used to obtain samples that are later used for meta-RL adaptation.

Return type

Policy

adapt_policy(self, exploration_policy, exploration_episodes)

Produce a policy adapted for a task.

Parameters
  • exploration_policy (Policy) – A policy which was returned from get_exploration_policy(), and which generated exploration_episodes by interacting with an environment. The caller may not use this object after passing it into this method.

  • exploration_episodes (EpisodeBatch) – Episodes to which to adapt, generated by exploration_policy exploring the environment.

Returns

A policy adapted to the task represented by the exploration_episodes.

Return type

Policy
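The calling order implied by get_exploration_policy() and adapt_policy() can be sketched with stand-in objects. Everything below is hypothetical scaffolding (ConstantPolicy, ToyMetaAlgo, meta_test, and the (action, reward) episode format); only the two method names and the rule that the exploration policy is not reused after adaptation come from the descriptions above.

```python
class ConstantPolicy:
    """Toy policy that always returns the same action."""

    def __init__(self, action):
        self.action = action

    def get_action(self, observation):
        return self.action


class ToyMetaAlgo:
    """Hypothetical stand-in exposing the two meta-test methods."""

    def get_exploration_policy(self):
        return ConstantPolicy(action=0.0)

    def adapt_policy(self, exploration_policy, exploration_episodes):
        # Toy adaptation: commit to the action with the highest reward.
        best_action, _ = max(exploration_episodes, key=lambda pair: pair[1])
        return ConstantPolicy(action=best_action)


def meta_test(algo, collect_episodes):
    """Fetch an exploration policy, explore one task, then adapt to it."""
    exploration_policy = algo.get_exploration_policy()
    episodes = collect_episodes(exploration_policy)
    adapted = algo.adapt_policy(exploration_policy, episodes)
    del exploration_policy  # the caller may not use it after adaptation
    return adapted


def collect(policy):
    # Fake "episodes": (action, reward) pairs for a single task.
    return [(policy.get_action(None), 0.1), (1.0, 0.9)]


adapted_policy = meta_test(ToyMetaAlgo(), collect)
print(adapted_policy.get_action(None))
```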

to(self, device=None)

Put all the networks within the model on device.

Parameters

device (str) – ID of GPU or CPU.

classmethod augment_env_spec(cls, env_spec, latent_dim)

Augment the environment spec by the size of the latent dimension.

Parameters
  • env_spec (EnvSpec) – Environment specs to be augmented.

  • latent_dim (int) – Latent dimension.

Returns

Augmented environment specs.

Return type

EnvSpec
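As a hedged illustration of what augmenting a spec by a latent dimension could look like, the sketch below assumes the latent vector is concatenated to the flattened observation, so only the observation size grows. ToySpec and augment_spec are hypothetical stand-ins, not the library's EnvSpec or this classmethod.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySpec:
    """Minimal stand-in for an environment spec with flat dimensions."""
    obs_dim: int
    action_dim: int


def augment_spec(spec, latent_dim):
    """Return a new spec whose observation size includes the latent vector."""
    return ToySpec(obs_dim=spec.obs_dim + latent_dim,
                   action_dim=spec.action_dim)


base = ToySpec(obs_dim=17, action_dim=6)
augmented = augment_spec(base, latent_dim=5)
print(augmented)
```

Returning a new object rather than mutating the input keeps the original spec usable for components that do not see the latent context.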

classmethod get_env_spec(cls, env_spec, latent_dim, module)

Get environment specs of encoder with latent dimension.

Parameters
  • env_spec (EnvSpec) – Environment specification.

  • latent_dim (int) – Latent dimension.

  • module (str) – Module to get environment specs for.

Returns

Module environment specs with latent dimension.

Return type

InOutSpec