garage.tf.algos.td3 module

This module implements a TD3 model.

TD3, or Twin Delayed Deep Deterministic Policy Gradient, is an actor-critic method that optimizes the policy and the Q-value prediction. Notably, it uses the minimum value of two critics instead of one when computing the target value, which limits overestimation.
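The target value that both critics are trained toward follows the clipped double-Q rule with target policy smoothing. Below is a minimal NumPy sketch of that computation; the function name, the array arguments, and the target-network callables are illustrative assumptions, not part of the garage API.

import numpy as np

def td3_target(rewards, dones, next_obs, target_policy, target_qf1, target_qf2,
               discount=0.99, sigma=0.2, noise_clip=0.5):
    """Clipped double-Q target with target policy smoothing (illustrative only)."""
    # Target policy smoothing: perturb the target action with clipped Gaussian noise.
    next_actions = target_policy(next_obs)
    noise = np.clip(np.random.normal(0.0, sigma, size=next_actions.shape),
                    -noise_clip, noise_clip)
    next_actions = next_actions + noise
    # Clipped double Q: bootstrap from the smaller of the two target critics.
    min_q = np.minimum(target_qf1(next_obs, next_actions),
                       target_qf2(next_obs, next_actions))
    # Standard TD target; terminal transitions do not bootstrap.
    return rewards + discount * (1.0 - dones) * min_q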

class TD3(env_spec, policy, qf, qf2, replay_buffer, *, target_update_tau=0.01, policy_weight_decay=0, qf_weight_decay=0, policy_optimizer=<class 'tensorflow.python.training.adam.AdamOptimizer'>, qf_optimizer=<class 'tensorflow.python.training.adam.AdamOptimizer'>, policy_lr=<garage._functions._Default object>, qf_lr=<garage._functions._Default object>, clip_pos_returns=False, clip_return=inf, discount=0.99, max_action=None, name='TD3', steps_per_epoch=20, max_path_length=None, max_eval_path_length=None, n_train_steps=50, buffer_batch_size=64, min_buffer_size=10000.0, rollout_batch_size=1, reward_scale=1.0, exploration_policy_sigma=0.2, actor_update_period=2, exploration_policy_clip=0.5, smooth_return=True, exploration_policy=None)[source]

Bases: garage.np.algos.rl_algorithm.RLAlgorithm

Implementation of TD3.

Based on https://arxiv.org/pdf/1802.09477.pdf.

Example

$ python garage/examples/tf/td3_pendulum.py

Parameters:
  • env_spec (garage.envs.EnvSpec) – Environment.
  • policy (garage.tf.policies.Policy) – Policy.
  • qf (garage.tf.q_functions.QFunction) – Q-function.
  • qf2 (garage.tf.q_functions.QFunction) – Second Q-function; the minimum of the two critics is used when computing target values.
  • replay_buffer (garage.replay_buffer.ReplayBuffer) – Replay buffer.
  • target_update_tau (float) – Interpolation parameter for doing the soft target update.
  • policy_lr (float) – Learning rate for training policy network.
  • qf_lr (float) – Learning rate for training q value network.
  • policy_weight_decay (float) – L2 weight decay factor for parameters of the policy network.
  • qf_weight_decay (float) – L2 weight decay factor for parameters of the q value network.
  • policy_optimizer (tf.python.training.optimizer.Optimizer) – Optimizer for training policy network.
  • qf_optimizer (tf.python.training.optimizer.Optimizer) – Optimizer for training q function network.
  • clip_pos_returns (boolean) – Whether or not to clip positive returns.
  • clip_return (float) – Clip return to be in [-clip_return, clip_return].
  • discount (float) – Discount factor for the cumulative return.
  • max_action (float) – Maximum action magnitude.
  • name (str) – Name of the algorithm shown in computation graph.
  • steps_per_epoch (int) – Number of batches of samples in each epoch.
  • max_path_length (int) – Maximum length of a path.
  • max_eval_path_length (int or None) – Maximum length of paths used for off-policy evaluation. If None, defaults to max_path_length.
  • n_train_steps (int) – Number of optimizations in each epoch cycle.
  • buffer_batch_size (int) – Number of transitions sampled from the replay buffer for each optimization step.
  • min_buffer_size (int) – Number of samples in replay buffer before first optimization.
  • rollout_batch_size (int) – Roll out batch size.
  • reward_scale (float) – Scaling factor applied to rewards.
  • exploration_policy_sigma (float) – Action noise sigma.
  • exploration_policy_clip (float) – Action noise clip.
  • actor_update_period (int) – Actor update period; the policy and target networks are updated once every actor_update_period optimization steps.
  • smooth_return (bool) – If True, compute return statistics over all collected samples; otherwise compute them over a single batch.
  • exploration_policy (garage.np.exploration_policies.ExplorationPolicy) – Exploration strategy.
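For orientation, here is a hedged construction sketch in the spirit of the example script above. The TD3 keyword arguments are the documented ones; the helper classes (ContinuousMLPPolicy, ContinuousMLPQFunction, PathBuffer), their constructor arguments, and the hyperparameter values are assumptions drawn from typical garage TF examples and may differ between garage versions.

from garage.replay_buffer import PathBuffer
from garage.tf.algos import TD3
from garage.tf.policies import ContinuousMLPPolicy
from garage.tf.q_functions import ContinuousMLPQFunction

def make_td3(env):
    """Build a TD3 instance for a continuous-control env (sketch only)."""
    policy = ContinuousMLPPolicy(env_spec=env.spec, hidden_sizes=[400, 300])
    # Two critics; the minimum of their target values is used in the backup.
    qf = ContinuousMLPQFunction(name='qf1', env_spec=env.spec,
                                hidden_sizes=[400, 300])
    qf2 = ContinuousMLPQFunction(name='qf2', env_spec=env.spec,
                                 hidden_sizes=[400, 300])
    # Assumption: the concrete replay buffer class varies by garage version.
    replay_buffer = PathBuffer(capacity_in_transitions=int(1e6))
    return TD3(env_spec=env.spec,
               policy=policy,
               qf=qf,
               qf2=qf2,
               replay_buffer=replay_buffer,
               steps_per_epoch=20,
               n_train_steps=50,
               buffer_batch_size=100,
               discount=0.99,
               exploration_policy_sigma=0.2,
               exploration_policy_clip=0.5,
               actor_update_period=2,
               max_path_length=250)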
init_opt()[source]

Build the loss function and init the optimizer.

optimize_policy(itr)[source]

Perform one round of policy and Q-function optimization.

Parameters:itr (int) – Iteration number.
Returns:
  • action_loss (float) – Loss of the action predicted by the policy network.
  • qval_loss (float) – Loss of the Q-value predicted by the Q-network.
  • ys (float) – Target values (ys) used in the Bellman backup.
  • qval (float) – Q-value predicted by the Q-network.
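The "delayed" part of TD3 lives in this step: the critics are updated every call, while the actor and the target networks are updated only every actor_update_period calls. The schematic below is not garage's TensorFlow graph; the callables it accepts are placeholders standing in for the corresponding graph operations.

def delayed_td3_updates(itr, sample_batch, update_critics, update_actor,
                        update_targets, actor_update_period=2):
    """One optimization round (schematic; every callable is a placeholder)."""
    batch = sample_batch()              # buffer_batch_size transitions
    qval_loss = update_critics(batch)   # regress qf and qf2 toward the min target
    action_loss = None
    # Delayed policy updates: the actor and targets move less often than the critics.
    if itr % actor_update_period == 0:
        action_loss = update_actor(batch)   # ascend qf(s, policy(s))
        update_targets()                    # soft update with target_update_tau
    return action_loss, qval_loss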
train(runner)[source]

Obtain samples and run training for each epoch.

Parameters:runner (LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.
Returns:The average return in the last epoch cycle.
Return type:float
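In typical usage train() is not called directly; the algorithm is handed to a runner, which drives the epoch loop. A minimal sketch follows, assuming the wrap_experiment/LocalTFRunner pattern used by recent garage releases; make_env and make_td3 are hypothetical helpers (the latter as in the construction sketch above), and the exact names may differ between versions.

from garage import wrap_experiment
from garage.experiment import LocalTFRunner

@wrap_experiment
def td3_experiment(ctxt=None):
    """Launch TD3 training (sketch; make_env and make_td3 are placeholders)."""
    with LocalTFRunner(snapshot_config=ctxt) as runner:
        env = make_env()        # assumption: returns a garage-wrapped gym env
        td3 = make_td3(env)     # e.g. the construction sketch above
        runner.setup(algo=td3, env=env)
        # TD3.train() iterates runner.step_epochs(), which handles snapshotting
        # and sampler control; batch_size is the number of env steps per batch.
        runner.train(n_epochs=500, batch_size=250)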
train_once(itr, paths)[source]

Perform one step of policy optimization given one batch of samples.

Parameters:
  • itr (int) – Iteration number.
  • paths (list[dict]) – A list of collected paths.
Returns:Average return.
Return type:np.float64