This module implements a TD3 model.

TD3, or Twin Delayed Deep Deterministic Policy Gradient, uses an actor-critic method to optimize the policy and the critics' value predictions. Notably, it takes the minimum of two critics' estimates instead of relying on a single critic, which limits overestimation.
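The twin-critic idea described above can be illustrated with a small, framework-free sketch. This is not the library's implementation: the function name `td3_target` and the sample numbers are hypothetical, and the handling of terminal transitions is an assumption based on the usual Bellman target.

```python
def td3_target(reward, done, q1_next, q2_next, discount=0.99):
    """Bellman target built from the smaller of two critic estimates.

    Hypothetical helper for illustration only, not part of the TD3 class.
    """
    # Taking the minimum keeps one over-optimistic critic from inflating
    # the target.
    min_q_next = min(q1_next, q2_next)
    # ASSUMPTION: no bootstrapping past a terminal transition.
    return reward + discount * (1.0 - float(done)) * min_q_next


# Made-up numbers: the two critics disagree about the next state's value.
target = td3_target(reward=1.0, done=False, q1_next=10.0, q2_next=8.0)
print(target)  # built from 8.0, the more pessimistic estimate
```

With a single critic, the target above would have been built from whichever estimate that critic produced, including an inflated one.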

class TD3(env_spec, policy, qf, qf2, replay_buffer, sampler, *, target_update_tau=0.01, policy_weight_decay=0, qf_weight_decay=0, policy_optimizer=tf.compat.v1.train.AdamOptimizer, qf_optimizer=tf.compat.v1.train.AdamOptimizer, policy_lr=_Default(0.0001), qf_lr=_Default(0.001), clip_pos_returns=False, clip_return=np.inf, discount=0.99, max_episode_length_eval=None, max_action=None, name='TD3', steps_per_epoch=20, n_train_steps=50, buffer_batch_size=64, min_buffer_size=10000.0, reward_scale=1.0, exploration_policy_sigma=0.2, actor_update_period=2, exploration_policy_clip=0.5, exploration_policy=None)



Implementation of TD3.

Based on


$ python garage/examples/tf/

  • env_spec (EnvSpec) – Environment.

  • policy (Policy) – Policy.

  • qf – Q-function.

  • qf2 – Second Q-function to use (the twin critic).

  • replay_buffer (ReplayBuffer) – Replay buffer.

  • sampler (garage.sampler.Sampler) – Sampler.

  • target_update_tau (float) – Interpolation parameter for doing the soft target update.

  • policy_lr (float) – Learning rate for training policy network.

  • qf_lr (float) – Learning rate for training the Q-value network.

  • policy_weight_decay (float) – L2 weight decay factor for parameters of the policy network.

  • qf_weight_decay (float) – L2 weight decay factor for parameters of the Q-value network.

  • policy_optimizer (tf.compat.v1.train.Optimizer) – Optimizer for training policy network.

  • qf_optimizer (tf.compat.v1.train.Optimizer) – Optimizer for training Q-function network.

  • clip_pos_returns (boolean) – Whether or not to clip positive returns.

  • clip_return (float) – Clip return to be in [-clip_return, clip_return].

  • discount (float) – Discount factor for the cumulative return.

  • max_episode_length_eval (int or None) – Maximum length of episodes used for off-policy evaluation. If None, defaults to env_spec.max_episode_length.

  • max_action (float) – Maximum action magnitude.

  • name (str) – Name of the algorithm shown in computation graph.

  • steps_per_epoch (int) – Number of batches of samples in each epoch.

  • n_train_steps (int) – Number of optimizations in each epoch cycle.

  • buffer_batch_size (int) – Batch size, i.e. the number of transitions sampled from the replay buffer for each optimization step.

  • min_buffer_size (int) – Number of samples in replay buffer before first optimization.

  • reward_scale (float) – Scaling factor applied to rewards.

  • exploration_policy_sigma (float) – Action noise sigma.

  • exploration_policy_clip (float) – Action noise clip.

  • actor_update_period (int) – Actor (policy) update period, i.e. how many critic updates occur per policy update.

  • exploration_policy (ExplorationPolicy) – Exploration strategy.
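Several of the parameters above correspond to small numerical operations. The following is a minimal sketch in plain Python, not the library's code. The helper names (`soft_update`, `smoothed_action`, `clip_target`) are hypothetical. How `exploration_policy_sigma`, `exploration_policy_clip`, and `max_action` combine, and the effect of `clip_pos_returns`, are assumptions based on the usual TD3 formulation.

```python
import random


def soft_update(target_params, online_params, tau=0.01):
    """Move each target parameter a fraction tau toward its online value."""
    return [(1.0 - tau) * t + tau * p
            for t, p in zip(target_params, online_params)]


def smoothed_action(action, max_action=1.0, sigma=0.2, clip=0.5, rng=random):
    """Add clipped Gaussian noise to an action, then bound its magnitude."""
    noise = rng.gauss(0.0, sigma)          # std plays the role of sigma
    noise = max(-clip, min(clip, noise))   # keep noise within [-clip, clip]
    return max(-max_action, min(max_action, action + noise))


def clip_target(y, clip_return=float("inf"), clip_pos_returns=False):
    """Clip a target value to [-clip_return, clip_return].

    ASSUMPTION: clip_pos_returns lowers the upper bound to 0.
    """
    upper = 0.0 if clip_pos_returns else clip_return
    return max(-clip_return, min(upper, y))


rng = random.Random(0)  # seeded only to make the demo repeatable
new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.01)
noisy = smoothed_action(0.9, max_action=1.0, rng=rng)
print(new_target, noisy, clip_target(3.0, clip_return=2.0))
```

A small `target_update_tau` makes the target networks trail the online networks slowly, which is why the default of 0.01 moves each target parameter only 1% of the way per update.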


Obtain samplers and start actual training for each epoch.


trainer (Trainer) – Experiment trainer, which provides services such as snapshotting and sampler control.


The average return in the last epoch cycle.

Return type
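To make the interaction between `steps_per_epoch`, `n_train_steps`, `min_buffer_size`, and `actor_update_period` concrete, here is a rough counting sketch of a training schedule. It is not the library's training loop. The function `count_updates` and the `transitions_per_step` argument are hypothetical, and the sketch assumes that optimization is skipped until the buffer holds `min_buffer_size` transitions and that the actor is updated once per `actor_update_period` critic updates.

```python
def count_updates(n_epochs, transitions_per_step, steps_per_epoch=20,
                  n_train_steps=50, min_buffer_size=10000,
                  actor_update_period=2):
    """Count critic and actor updates under an assumed TD3-style schedule."""
    stored = 0
    critic_updates = 0
    actor_updates = 0
    for _ in range(n_epochs):
        for _ in range(steps_per_epoch):
            # Stand-in for the sampler adding new transitions to the buffer.
            stored += transitions_per_step
            if stored < min_buffer_size:
                continue  # warm-up: collect data, no optimization yet
            for _ in range(n_train_steps):
                critic_updates += 1
                # Delayed policy update: the actor trains less often.
                if critic_updates % actor_update_period == 0:
                    actor_updates += 1
    return critic_updates, actor_updates


critic_updates, actor_updates = count_updates(n_epochs=1,
                                              transitions_per_step=1000)
print(critic_updates, actor_updates)
```

Under these assumptions, the first nine sampling steps of the epoch only fill the buffer, and the actor receives half as many updates as the critics.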