garage.tf.algos.td3 module

This module implements a TD3 model.

TD3, or Twin Delayed Deep Deterministic Policy Gradient, is an actor-critic method that optimizes the policy and the Q-value prediction. Notably, it uses the minimum value of two critics instead of one when computing the target value, which limits overestimation.
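The target value that both critics are trained toward follows the clipped double-Q rule with target policy smoothing. Below is a minimal NumPy sketch of that computation; the function name, the array arguments, and the target-network callables are illustrative assumptions, not part of the garage API.

import numpy as np

def td3_target(rewards, dones, next_obs, target_policy, target_qf1, target_qf2,
               discount=0.99, sigma=0.2, noise_clip=0.5):
    """Clipped double-Q target with target policy smoothing (illustrative only)."""
    # Target policy smoothing: perturb the target action with clipped Gaussian noise.
    next_actions = target_policy(next_obs)
    noise = np.clip(np.random.normal(0.0, sigma, size=next_actions.shape),
                    -noise_clip, noise_clip)
    next_actions = next_actions + noise
    # Clipped double Q: bootstrap from the smaller of the two target critics.
    min_q = np.minimum(target_qf1(next_obs, next_actions),
                       target_qf2(next_obs, next_actions))
    # Standard TD target; terminal transitions do not bootstrap.
    return rewards + discount * (1.0 - dones) * min_q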

class TD3(env_spec, policy, qf, qf2, replay_buffer, *, target_update_tau=0.01, policy_weight_decay=0, qf_weight_decay=0, policy_optimizer=<class 'tensorflow.python.training.adam.AdamOptimizer'>, qf_optimizer=<class 'tensorflow.python.training.adam.AdamOptimizer'>, policy_lr=<garage._functions._Default object>, qf_lr=<garage._functions._Default object>, clip_pos_returns=False, clip_return=inf, discount=0.99, max_action=None, name='TD3', steps_per_epoch=20, max_path_length=None, max_eval_path_length=None, n_train_steps=50, buffer_batch_size=64, min_buffer_size=10000.0, rollout_batch_size=1, reward_scale=1.0, exploration_policy_sigma=0.2, actor_update_period=2, exploration_policy_clip=0.5, smooth_return=True, exploration_policy=None)[source]

Bases: garage.np.algos.rl_algorithm.RLAlgorithm

Implementation of TD3.

Based on https://arxiv.org/pdf/1802.09477.pdf.

Example

$ python garage/examples/tf/td3_pendulum.py

Parameters:
  • env_spec (garage.envs.EnvSpec) – Environment.
  • policy (garage.tf.policies.Policy) – Policy.
  • qf (garage.tf.q_functions.QFunction) – Q-function.
  • qf2 (garage.tf.q_functions.QFunction) – Second Q-function; the minimum of the two critics is used when computing target values.
  • replay_buffer (garage.replay_buffer.ReplayBuffer) – Replay buffer.
  • target_update_tau (float) – Interpolation parameter for doing the soft target update.
  • policy_lr (float) – Learning rate for training policy network.
  • qf_lr (float) – Learning rate for training q value network.
  • policy_weight_decay (float) – L2 weight decay factor for parameters of the policy network.
  • qf_weight_decay (float) – L2 weight decay factor for parameters of the q value network.
  • policy_optimizer (tf.python.training.optimizer.Optimizer) – Optimizer for training policy network.
  • qf_optimizer (tf.python.training.optimizer.Optimizer) – Optimizer for training q function network.
  • clip_pos_returns (boolean) – Whether or not to clip positive returns.
  • clip_return (float) – Clip return to be in [-clip_return, clip_return].
  • discount (float) – Discount factor for the cumulative return.
  • max_action (float) – Maximum action magnitude.
  • name (str) – Name of the algorithm shown in computation graph.
  • steps_per_epoch (int) – Number of batches of samples in each epoch.
  • max_path_length (int) – Maximum length of a path.
  • max_eval_path_length (int or None) – Maximum length of paths used for off-policy evaluation. If None, defaults to max_path_length.
  • n_train_steps (int) – Number of optimizations in each epoch cycle.
  • buffer_batch_size (int) – Number of transitions sampled from the replay buffer for each optimization step.
  • min_buffer_size (int) – Number of samples in replay buffer before first optimization.
  • rollout_batch_size (int) – Roll out batch size.
  • reward_scale (float) – Scaling factor applied to rewards.
  • exploration_policy_sigma (float) – Action noise sigma.
  • exploration_policy_clip (float) – Action noise clip.
  • actor_update_period (int) – Actor update period; the policy and target networks are updated once every actor_update_period optimization steps.
  • smooth_return (bool) – If True, compute return statistics over all collected samples; otherwise compute them over a single batch.
  • exploration_policy (garage.np.exploration_policies.ExplorationPolicy) – Exploration strategy.
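For orientation, here is a hedged construction sketch in the spirit of the example script above. The TD3 keyword arguments are the documented ones; the helper classes (ContinuousMLPPolicy, ContinuousMLPQFunction, PathBuffer), their constructor arguments, and the hyperparameter values are assumptions drawn from typical garage TF examples and may differ between garage versions.

from garage.replay_buffer import PathBuffer
from garage.tf.algos import TD3
from garage.tf.policies import ContinuousMLPPolicy
from garage.tf.q_functions import ContinuousMLPQFunction

def make_td3(env):
    """Build a TD3 instance for a continuous-control env (sketch only)."""
    policy = ContinuousMLPPolicy(env_spec=env.spec, hidden_sizes=[400, 300])
    # Two critics; the minimum of their target values is used in the backup.
    qf = ContinuousMLPQFunction(name='qf1', env_spec=env.spec,
                                hidden_sizes=[400, 300])
    qf2 = ContinuousMLPQFunction(name='qf2', env_spec=env.spec,
                                 hidden_sizes=[400, 300])
    # Assumption: the concrete replay buffer class varies by garage version.
    replay_buffer = PathBuffer(capacity_in_transitions=int(1e6))
    return TD3(env_spec=env.spec,
               policy=policy,
               qf=qf,
               qf2=qf2,
               replay_buffer=replay_buffer,
               steps_per_epoch=20,
               n_train_steps=50,
               buffer_batch_size=100,
               discount=0.99,
               exploration_policy_sigma=0.2,
               exploration_policy_clip=0.5,
               actor_update_period=2,
               max_path_length=250)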
init_opt()[source]

Build the loss function and init the optimizer.

optimize_policy(itr)[source]

Perform one round of policy and Q-function optimization.

Parameters:itr (int) – Iteration number.
Returns:
  • action_loss (float) – Loss of the action predicted by the policy network.
  • qval_loss (float) – Loss of the Q-value predicted by the Q-network.
  • ys (float) – Target values (ys) used in the Bellman backup.
  • qval (float) – Q-value predicted by the Q-network.
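The "delayed" part of TD3 lives in this step: the critics are updated every call, while the actor and the target networks are updated only every actor_update_period calls. The schematic below is not garage's TensorFlow graph; the callables it accepts are placeholders standing in for the corresponding graph operations.

def delayed_td3_updates(itr, sample_batch, update_critics, update_actor,
                        update_targets, actor_update_period=2):
    """One optimization round (schematic; every callable is a placeholder)."""
    batch = sample_batch()              # buffer_batch_size transitions
    qval_loss = update_critics(batch)   # regress qf and qf2 toward the min target
    action_loss = None
    # Delayed policy updates: the actor and targets move less often than the critics.
    if itr % actor_update_period == 0:
        action_loss = update_actor(batch)   # ascend qf(s, policy(s))
        update_targets()                    # soft update with target_update_tau
    return action_loss, qval_loss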
train(runner)[source]

Obtain samples and run training for each epoch.

Parameters:runner (LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.
Returns:The average return in the last epoch cycle.
Return type:float
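In typical usage train() is not called directly; the algorithm is handed to a runner, which drives the epoch loop. A minimal sketch follows, assuming the wrap_experiment/LocalTFRunner pattern used by recent garage releases; make_env and make_td3 are hypothetical helpers (the latter as in the construction sketch above), and the exact names may differ between versions.

from garage import wrap_experiment
from garage.experiment import LocalTFRunner

@wrap_experiment
def td3_experiment(ctxt=None):
    """Launch TD3 training (sketch; make_env and make_td3 are placeholders)."""
    with LocalTFRunner(snapshot_config=ctxt) as runner:
        env = make_env()        # assumption: returns a garage-wrapped gym env
        td3 = make_td3(env)     # e.g. the construction sketch above
        runner.setup(algo=td3, env=env)
        # TD3.train() iterates runner.step_epochs(), which handles snapshotting
        # and sampler control; batch_size is the number of env steps per batch.
        runner.train(n_epochs=500, batch_size=250)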
train_once(itr, paths)[source]

Perform one step of policy optimization given one batch of samples.

Parameters:
  • itr (int) – Iteration number.
  • paths (list[dict]) – A list of collected paths.
Returns:Average return.
Return type:np.float64