garage.torch.algos.td3

TD3 model in PyTorch.

class TD3(env_spec, policy, qf1, qf2, replay_buffer, sampler, *, max_episode_length_eval=None, grad_steps_per_env_step, exploration_policy, uniform_random_policy=None, max_action=None, target_update_tau=0.005, discount=0.99, reward_scaling=1.0, update_actor_interval=2, buffer_batch_size=64, replay_buffer_size=1000000.0, min_buffer_size=10000.0, exploration_noise=0.1, policy_noise=0.2, policy_noise_clip=0.5, clip_return=np.inf, policy_lr=_Default(0.0001), qf_lr=_Default(0.001), policy_optimizer=torch.optim.Adam, qf_optimizer=torch.optim.Adam, num_evaluation_episodes=10, steps_per_epoch=20, start_steps=10000, update_after=1000, use_deterministic_evaluation=False)

Bases: garage.np.algos.RLAlgorithm

Implementation of TD3.

Based on https://arxiv.org/pdf/1802.09477.pdf.
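As a rough illustration of the ingredients the paper combines, the sketch below works through one transition in plain Python: clipped Gaussian noise is added to the target action (target policy smoothing), the smaller of two target critic values is bootstrapped from (clipped double Q-learning), and target weights are moved by Polyak averaging. The helpers `smoothed_target_action`, `td3_target` and `soft_update` and the toy critics are hypothetical and not part of garage. The keyword names mirror the constructor arguments, and the sketch assumes the reward is simply multiplied by reward_scaling.

```python
import random


def smoothed_target_action(action, policy_noise=0.2, policy_noise_clip=0.5,
                           max_action=1.0, rng=random):
    """Add clipped Gaussian noise to a target action, then clip to bounds."""
    noise = rng.gauss(0.0, policy_noise)
    noise = max(-policy_noise_clip, min(policy_noise_clip, noise))
    return max(-max_action, min(max_action, action + noise))


def td3_target(reward, done, next_q1, next_q2, discount=0.99,
               reward_scaling=1.0):
    """Bootstrap from the smaller of the two target critic estimates."""
    min_q = min(next_q1, next_q2)
    return reward_scaling * reward + (1.0 - done) * discount * min_q


def soft_update(target_params, params, target_update_tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [
        target_update_tau * p + (1.0 - target_update_tau) * t
        for t, p in zip(target_params, params)
    ]


# Toy target critics standing in for qf1 and qf2 (hypothetical).
def toy_q1(obs, act):
    return obs + act


def toy_q2(obs, act):
    return obs - act


next_obs, target_policy_action = 0.5, 0.9
a = smoothed_target_action(target_policy_action, rng=random.Random(0))
y = td3_target(reward=1.0, done=0.0,
               next_q1=toy_q1(next_obs, a), next_q2=toy_q2(next_obs, a))
new_target = soft_update([0.0, 1.0], [1.0, 1.0])
```

With the default target_update_tau of 0.005, each call to `soft_update` moves a target weight only 0.5% of the way toward its online counterpart, which is what keeps the bootstrapped targets slowly varying.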

Parameters
  • env_spec (EnvSpec) – Environment specification.

  • policy (garage.torch.policies.Policy) – Policy (actor network).

  • qf1 (garage.torch.q_functions.QFunction) – Q function (critic network).

  • qf2 (garage.torch.q_functions.QFunction) – Q function (critic network).

  • replay_buffer (ReplayBuffer) – Replay buffer.

  • sampler (garage.sampler.Sampler) – Sampler.

  • replay_buffer_size (int) – Size of the replay buffer.

  • exploration_policy (garage.np.exploration_policies.ExplorationPolicy) – Exploration strategy.

  • uniform_random_policy (garage.np.exploration_policies.ExplorationPolicy) – Uniform random exploration strategy.

  • target_update_tau (float) – Interpolation parameter for doing the soft target update.

  • discount (float) – Discount factor (gamma) for the cumulative return.

  • reward_scaling (float) – Reward scaling.

  • update_actor_interval (int) – Policy (Actor network) update interval.

  • max_action (float) – Maximum action magnitude.

  • buffer_batch_size (int) – Batch size for sampling from the replay buffer.

  • min_buffer_size (int) – The minimum buffer size for replay buffer.

  • policy_noise (float) – Policy (actor) noise.

  • policy_noise_clip (float) – Noise clip.

  • exploration_noise (float) – Exploration noise.

  • clip_return (float) – Clip return to be in [-clip_return, clip_return].

  • policy_lr (float) – Learning rate for training policy network.

  • qf_lr (float) – Learning rate for training Q network.

  • policy_optimizer (Union[type, tuple[type, dict]]) – Type of optimizer for training policy network. This can be an optimizer type such as torch.optim.Adam or a tuple of type and dictionary, where dictionary contains arguments to initialize the optimizer, e.g. (torch.optim.Adam, {'lr': 1e-3}).

  • qf_optimizer (Union[type, tuple[type, dict]]) – Type of optimizer for training Q-value network. This can be an optimizer type such as torch.optim.Adam or a tuple of type and dictionary, where dictionary contains arguments to initialize the optimizer, e.g. (torch.optim.Adam, {'lr': 1e-3}).

  • steps_per_epoch (int) – Number of train_once calls per epoch.

  • grad_steps_per_env_step (int) – Number of gradient steps taken per environment step sampled.

  • max_episode_length_eval (int or None) – Maximum length of episodes used for off-policy evaluation. If None, defaults to env_spec.max_episode_length.

  • num_evaluation_episodes (int) – The number of evaluation trajectories used for computing eval stats at the end of every epoch.

  • start_steps (int) – The number of steps for warming up before selecting actions according to policy.

  • update_after (int) – The number of steps to perform before policy is updated.

  • use_deterministic_evaluation (bool) – True if the trained policy should be evaluated deterministically.
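The two forms accepted by policy_optimizer and qf_optimizer (a bare optimizer type, or a tuple of type and keyword dictionary) can be handled along the following lines. This is a minimal sketch, not garage's own code: `build_optimizer` is a hypothetical helper, and `ToySGD` is a stand-in for a real optimizer class such as torch.optim.Adam so that the example runs without PyTorch. How garage reconciles a tuple's arguments with policy_lr or qf_lr is not covered here.

```python
class ToySGD:
    """Stand-in for an optimizer class such as torch.optim.Adam."""

    def __init__(self, params, lr=0.01):
        self.params = list(params)
        self.lr = lr


def build_optimizer(spec, params):
    """Instantiate an optimizer from a type or a (type, kwargs) tuple."""
    if isinstance(spec, tuple):
        optimizer_type, kwargs = spec
        return optimizer_type(params, **kwargs)
    # A bare type is constructed with its own default arguments.
    return spec(params)


weights = [0.1, 0.2]
from_type = build_optimizer(ToySGD, weights)
from_tuple = build_optimizer((ToySGD, {'lr': 1e-3}), weights)
```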

train(self, trainer)

Obtain samplers and start actual training for each epoch.

Parameters

trainer (Trainer) – Experiment trainer, which provides services such as snapshotting and sampler control.
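The step-count parameters (start_steps, update_after, update_actor_interval, grad_steps_per_env_step) suggest a schedule roughly like the one below. This is a sketch of one plausible reading of the parameter descriptions, not garage's actual train loop: `plan_step` is a hypothetical helper, the boundary conditions (`<` versus `>=`) are assumptions, and gating all gradient steps on update_after is also an assumption.

```python
def plan_step(env_step, grad_step_count, *, grad_steps_per_env_step,
              start_steps=10000, update_after=1000,
              update_actor_interval=2):
    """Describe what happens at one environment step.

    Returns (action_source, critic_updates, actor_updates, grad_step_count).
    """
    # Warm-up: act with the uniform random policy before start_steps.
    if env_step < start_steps:
        action_source = 'uniform_random'
    else:
        action_source = 'exploration'

    critic_updates = 0
    actor_updates = 0
    if env_step >= update_after:
        for _ in range(grad_steps_per_env_step):
            grad_step_count += 1
            critic_updates += 1
            # Delayed policy update: the actor is trained less often
            # than the critics.
            if grad_step_count % update_actor_interval == 0:
                actor_updates += 1
    return action_source, critic_updates, actor_updates, grad_step_count


early = plan_step(0, 0, grad_steps_per_env_step=2)
late = plan_step(10000, 0, grad_steps_per_env_step=2)
```

Under these assumptions, an update_actor_interval of 2 means the actor receives one update for every two critic updates.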

property networks(self)

Return all the networks within the model.

Returns

A list of networks.

Return type

list

to(self, device=None)

Put all the networks within the model on device.

Parameters

device (str) – Identifier of the GPU or CPU device, e.g. 'cuda:0' or 'cpu'.
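A typical call pattern, sketched with a hypothetical stand-in class rather than a fully constructed TD3 instance, is to pick a device string and move every network returned by networks onto it. `ToyAlgo` and its two linear layers are assumptions made for illustration, and the fallback used when device is None is this sketch's own choice, not necessarily garage's. Only standard PyTorch calls are used.

```python
import torch
from torch import nn


class ToyAlgo:
    """Stand-in exposing `networks` and `to` in the style described above."""

    def __init__(self):
        self._policy = nn.Linear(4, 2)
        self._qf1 = nn.Linear(6, 1)

    @property
    def networks(self):
        return [self._policy, self._qf1]

    def to(self, device=None):
        # Assumption: fall back to CUDA when available, else CPU.
        if device is None:
            device = 'cuda' if torch.cuda.is_available() else 'cpu'
        for net in self.networks:
            net.to(device)


algo = ToyAlgo()
algo.to('cpu')
devices = {next(net.parameters()).device.type for net in algo.networks}
```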