garage.tf.algos.td3

This module implements a TD3 model.

TD3, or Twin Delayed Deep Deterministic Policy Gradient, is an actor-critic method that optimizes the policy and the Q-value estimates. Notably, it uses the minimum of two critics' value estimates, rather than a single critic's, to limit overestimation.
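
As a rough illustration of this clipped double-Q idea, the sketch below computes a TD3-style target in NumPy. It is not garage's internal implementation; the function name and its arguments are hypothetical.

    import numpy as np

    def td3_target(reward, next_obs, done, target_policy, target_qf1, target_qf2,
                   discount=0.99, sigma=0.2, noise_clip=0.5, max_action=1.0):
        """Clipped double-Q target in the style of TD3 (illustrative only)."""
        # Target policy smoothing: perturb the target action with clipped
        # Gaussian noise before evaluating the critics.
        action = target_policy(next_obs)
        noise = np.clip(sigma * np.random.randn(*np.shape(action)),
                        -noise_clip, noise_clip)
        next_action = np.clip(action + noise, -max_action, max_action)
        # Take the minimum of the two target critics to limit overestimation.
        min_q = np.minimum(target_qf1(next_obs, next_action),
                           target_qf2(next_obs, next_action))
        return reward + discount * (1.0 - done) * min_q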

class TD3(env_spec, policy, qf, qf2, replay_buffer, *, target_update_tau=0.01, policy_weight_decay=0, qf_weight_decay=0, policy_optimizer=tf.compat.v1.train.AdamOptimizer, qf_optimizer=tf.compat.v1.train.AdamOptimizer, policy_lr=_Default(0.0001), qf_lr=_Default(0.001), clip_pos_returns=False, clip_return=np.inf, discount=0.99, max_episode_length_eval=None, max_action=None, name='TD3', steps_per_epoch=20, n_train_steps=50, buffer_batch_size=64, min_buffer_size=10000.0, reward_scale=1.0, exploration_policy_sigma=0.2, actor_update_period=2, exploration_policy_clip=0.5, exploration_policy=None)

Bases: garage.np.algos.RLAlgorithm


Implementation of TD3.

Based on https://arxiv.org/pdf/1802.09477.pdf.

Example

$ python garage/examples/tf/td3_pendulum.py
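
The condensed sketch below shows how the pieces of that example typically fit together. Module paths, constructor arguments (e.g. for AddGaussianNoise and the network sizes), and hyperparameters differ between garage releases, so treat this as an assumption-laden illustration rather than a verbatim copy of td3_pendulum.py.

    import tensorflow as tf

    from garage import wrap_experiment
    from garage.envs import GymEnv, normalize
    from garage.np.exploration_policies import AddGaussianNoise
    from garage.replay_buffer import PathBuffer
    from garage.tf.algos import TD3
    from garage.tf.policies import ContinuousMLPPolicy
    from garage.tf.q_functions import ContinuousMLPQFunction
    from garage.trainer import TFTrainer


    @wrap_experiment
    def td3_pendulum_sketch(ctxt=None):
        """Train TD3 on a pendulum task (illustrative sketch)."""
        with TFTrainer(ctxt) as trainer:
            n_epochs = 500
            steps_per_epoch = 20
            sampler_batch_size = 250

            env = normalize(GymEnv('InvertedDoublePendulum-v2'))

            policy = ContinuousMLPPolicy(env_spec=env.spec,
                                         hidden_sizes=[400, 300],
                                         hidden_nonlinearity=tf.nn.relu,
                                         output_nonlinearity=tf.nn.tanh)

            # Gaussian action noise for exploration during rollouts.
            exploration_policy = AddGaussianNoise(
                env.spec,
                policy,
                total_timesteps=n_epochs * steps_per_epoch * sampler_batch_size,
                max_sigma=0.1,
                min_sigma=0.1)

            # Twin critics: TD3 trains two Q-functions and uses their minimum.
            qf = ContinuousMLPQFunction(env_spec=env.spec,
                                        hidden_sizes=[400, 300],
                                        hidden_nonlinearity=tf.nn.relu)
            qf2 = ContinuousMLPQFunction(env_spec=env.spec,
                                         hidden_sizes=[400, 300],
                                         hidden_nonlinearity=tf.nn.relu)

            replay_buffer = PathBuffer(capacity_in_transitions=int(1e6))

            td3 = TD3(env_spec=env.spec,
                      policy=policy,
                      qf=qf,
                      qf2=qf2,
                      replay_buffer=replay_buffer,
                      exploration_policy=exploration_policy,
                      steps_per_epoch=steps_per_epoch,
                      n_train_steps=50,
                      discount=0.99)

            trainer.setup(td3, env)
            trainer.train(n_epochs=n_epochs, batch_size=sampler_batch_size)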

Parameters
  • env_spec (EnvSpec) – Environment specification.

  • policy (Policy) – Policy.

  • qf (garage.tf.q_functions.QFunction) – First Q-function.

  • qf2 (garage.tf.q_functions.QFunction) – Second Q-function, used together with qf to form the clipped double-Q target.

  • replay_buffer (ReplayBuffer) – Replay buffer.

  • target_update_tau (float) – Interpolation parameter for doing the soft target update.

  • policy_lr (float) – Learning rate for training policy network.

  • qf_lr (float) – Learning rate for training q value network.

  • policy_weight_decay (float) – L2 weight decay factor for parameters of the policy network.

  • qf_weight_decay (float) – L2 weight decay factor for parameters of the q value network.

  • policy_optimizer (tf.compat.v1.train.Optimizer) – Optimizer for training policy network.

  • qf_optimizer (tf.compat.v1.train.Optimizer) – Optimizer for training Q-function network.

  • clip_pos_returns (boolean) – Whether or not to clip positive returns.

  • clip_return (float) – Clip return to be in [-clip_return, clip_return].

  • discount (float) – Discount factor for the cumulative return.

  • max_episode_length_eval (int or None) – Maximum length of episodes used for off-policy evaluation. If None, defaults to env_spec.max_episode_length.

  • max_action (float) – Maximum action magnitude.

  • name (str) – Name of the algorithm shown in computation graph.

  • steps_per_epoch (int) – Number of batches of samples in each epoch.

  • n_train_steps (int) – Number of optimizations in each epoch cycle.

  • buffer_batch_size (int) – Number of transitions sampled from the replay buffer for each optimization step.

  • min_buffer_size (int) – Number of samples in replay buffer before first optimization.

  • reward_scale (float) – Scaling factor applied to rewards.

  • exploration_policy_sigma (float) – Standard deviation of the Gaussian action noise.

  • exploration_policy_clip (float) – Magnitude at which the action noise is clipped, i.e. the noise is kept in [-exploration_policy_clip, exploration_policy_clip].

  • actor_update_period (int) – Actor (policy) update period, i.e. the number of critic updates between policy updates (see the sketch after this parameter list).

  • exploration_policy (ExplorationPolicy) – Exploration strategy.
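
To make target_update_tau and actor_update_period concrete, here is a generic sketch of a delayed-update schedule in the style of TD3. The names and the exact placement of the target update are illustrative assumptions, not garage's internal code.

    def soft_update(target_params, source_params, tau=0.01):
        """Polyak-average the source parameters into the target parameters."""
        return [tau * s + (1.0 - tau) * t
                for t, s in zip(target_params, source_params)]


    def run_training_steps(n_train_steps, actor_update_period, target_update_tau,
                           update_critics, update_actor,
                           target_params, source_params):
        """Illustrative delayed-update schedule, not garage's implementation."""
        for step in range(1, n_train_steps + 1):
            update_critics()                     # fit qf and qf2 to the TD target
            if step % actor_update_period == 0:  # delayed policy updates
                update_actor()                   # ascend qf(s, policy(s))
                target_params = soft_update(target_params, source_params,
                                            target_update_tau)
        return target_params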

train(self, trainer)

Obtain samplers and start actual training for each epoch.

Parameters

trainer (Trainer) – Experiment trainer, which provides services such as snapshotting and sampler control.

Returns

The average return in the last epoch cycle.

Return type

float