garage.tf.algos.td3

This module implements a TD3 model.

TD3, or Twin Delayed Deep Deterministic Policy Gradient, is an actor-critic method that optimizes the policy and the Q-value estimates. Notably, it uses the minimum of two critics' value estimates, rather than a single critic's, to limit overestimation.
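
As a rough illustration of this clipped double-Q idea, the sketch below computes a TD3-style target in NumPy. It is not garage's internal implementation; the function name and its arguments are hypothetical.

    import numpy as np

    def td3_target(reward, next_obs, done, target_policy, target_qf1, target_qf2,
                   discount=0.99, sigma=0.2, noise_clip=0.5, max_action=1.0):
        """Clipped double-Q target in the style of TD3 (illustrative only)."""
        # Target policy smoothing: perturb the target action with clipped
        # Gaussian noise before evaluating the critics.
        action = target_policy(next_obs)
        noise = np.clip(sigma * np.random.randn(*np.shape(action)),
                        -noise_clip, noise_clip)
        next_action = np.clip(action + noise, -max_action, max_action)
        # Take the minimum of the two target critics to limit overestimation.
        min_q = np.minimum(target_qf1(next_obs, next_action),
                           target_qf2(next_obs, next_action))
        return reward + discount * (1.0 - done) * min_q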

class TD3(env_spec, policy, qf, qf2, replay_buffer, *, target_update_tau=0.01, policy_weight_decay=0, qf_weight_decay=0, policy_optimizer=tf.compat.v1.train.AdamOptimizer, qf_optimizer=tf.compat.v1.train.AdamOptimizer, policy_lr=_Default(0.0001), qf_lr=_Default(0.001), clip_pos_returns=False, clip_return=np.inf, discount=0.99, max_episode_length_eval=None, max_action=None, name='TD3', steps_per_epoch=20, n_train_steps=50, buffer_batch_size=64, min_buffer_size=10000.0, reward_scale=1.0, exploration_policy_sigma=0.2, actor_update_period=2, exploration_policy_clip=0.5, exploration_policy=None)

Bases: garage.np.algos.RLAlgorithm


Implementation of TD3.

Based on https://arxiv.org/pdf/1802.09477.pdf.

Example

$ python garage/examples/tf/td3_pendulum.py
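
The condensed sketch below shows how the pieces of that example typically fit together. Module paths, constructor arguments (e.g. for AddGaussianNoise and the network sizes), and hyperparameters differ between garage releases, so treat this as an assumption-laden illustration rather than a verbatim copy of td3_pendulum.py.

    import tensorflow as tf

    from garage import wrap_experiment
    from garage.envs import GymEnv, normalize
    from garage.np.exploration_policies import AddGaussianNoise
    from garage.replay_buffer import PathBuffer
    from garage.tf.algos import TD3
    from garage.tf.policies import ContinuousMLPPolicy
    from garage.tf.q_functions import ContinuousMLPQFunction
    from garage.trainer import TFTrainer


    @wrap_experiment
    def td3_pendulum_sketch(ctxt=None):
        """Train TD3 on a pendulum task (illustrative sketch)."""
        with TFTrainer(ctxt) as trainer:
            n_epochs = 500
            steps_per_epoch = 20
            sampler_batch_size = 250

            env = normalize(GymEnv('InvertedDoublePendulum-v2'))

            policy = ContinuousMLPPolicy(env_spec=env.spec,
                                         hidden_sizes=[400, 300],
                                         hidden_nonlinearity=tf.nn.relu,
                                         output_nonlinearity=tf.nn.tanh)

            # Gaussian action noise for exploration during rollouts.
            exploration_policy = AddGaussianNoise(
                env.spec,
                policy,
                total_timesteps=n_epochs * steps_per_epoch * sampler_batch_size,
                max_sigma=0.1,
                min_sigma=0.1)

            # Twin critics: TD3 trains two Q-functions and uses their minimum.
            qf = ContinuousMLPQFunction(env_spec=env.spec,
                                        hidden_sizes=[400, 300],
                                        hidden_nonlinearity=tf.nn.relu)
            qf2 = ContinuousMLPQFunction(env_spec=env.spec,
                                         hidden_sizes=[400, 300],
                                         hidden_nonlinearity=tf.nn.relu)

            replay_buffer = PathBuffer(capacity_in_transitions=int(1e6))

            td3 = TD3(env_spec=env.spec,
                      policy=policy,
                      qf=qf,
                      qf2=qf2,
                      replay_buffer=replay_buffer,
                      exploration_policy=exploration_policy,
                      steps_per_epoch=steps_per_epoch,
                      n_train_steps=50,
                      discount=0.99)

            trainer.setup(td3, env)
            trainer.train(n_epochs=n_epochs, batch_size=sampler_batch_size)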

Parameters
  • env_spec (EnvSpec) – Environment specification.

  • policy (Policy) – Policy.

  • qf (garage.tf.q_functions.QFunction) – First Q-function.

  • qf2 (garage.tf.q_functions.QFunction) – Second Q-function, used together with qf to form the clipped double-Q target.

  • replay_buffer (ReplayBuffer) – Replay buffer.

  • target_update_tau (float) – Interpolation parameter for doing the soft target update.

  • policy_lr (float) – Learning rate for training policy network.

  • qf_lr (float) – Learning rate for training q value network.

  • policy_weight_decay (float) – L2 weight decay factor for parameters of the policy network.

  • qf_weight_decay (float) – L2 weight decay factor for parameters of the q value network.

  • policy_optimizer (tf.compat.v1.train.Optimizer) – Optimizer for training policy network.

  • qf_optimizer (tf.compat.v1.train.Optimizer) – Optimizer for training Q-function network.

  • clip_pos_returns (boolean) – Whether or not to clip positive returns.

  • clip_return (float) – Clip return to be in [-clip_return, clip_return].

  • discount (float) – Discount factor for the cumulative return.

  • max_episode_length_eval (int or None) – Maximum length of episodes used for off-policy evaluation. If None, defaults to env_spec.max_episode_length.

  • max_action (float) – Maximum action magnitude.

  • name (str) – Name of the algorithm shown in computation graph.

  • steps_per_epoch (int) – Number of batches of samples in each epoch.

  • n_train_steps (int) – Number of optimizations in each epoch cycle.

  • buffer_batch_size (int) – Number of transitions sampled from the replay buffer for each optimization step.

  • min_buffer_size (int) – Number of samples in replay buffer before first optimization.

  • reward_scale (float) – Scaling factor applied to rewards.

  • exploration_policy_sigma (float) – Standard deviation of the Gaussian action noise.

  • exploration_policy_clip (float) – Magnitude at which the action noise is clipped, i.e. the noise is kept in [-exploration_policy_clip, exploration_policy_clip].

  • actor_update_period (int) – Actor (policy) update period, i.e. the number of critic updates between policy updates (see the sketch after this parameter list).

  • exploration_policy (ExplorationPolicy) – Exploration strategy.
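
To make target_update_tau and actor_update_period concrete, here is a generic sketch of a delayed-update schedule in the style of TD3. The names and the exact placement of the target update are illustrative assumptions, not garage's internal code.

    def soft_update(target_params, source_params, tau=0.01):
        """Polyak-average the source parameters into the target parameters."""
        return [tau * s + (1.0 - tau) * t
                for t, s in zip(target_params, source_params)]


    def run_training_steps(n_train_steps, actor_update_period, target_update_tau,
                           update_critics, update_actor,
                           target_params, source_params):
        """Illustrative delayed-update schedule, not garage's implementation."""
        for step in range(1, n_train_steps + 1):
            update_critics()                     # fit qf and qf2 to the TD target
            if step % actor_update_period == 0:  # delayed policy updates
                update_actor()                   # ascend qf(s, policy(s))
                target_params = soft_update(target_params, source_params,
                                            target_update_tau)
        return target_params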

train(self, trainer)

Obtain samplers and start actual training for each epoch.

Parameters

trainer (Trainer) – Experiment trainer, which provides services such as snapshotting and sampler control.

Returns

The average return in the last epoch cycle.

Return type

float