garage.tf.algos.td3 module
This module implements a TD3 model.
TD3 (Twin Delayed Deep Deterministic Policy Gradient) is an actor-critic method that optimizes a deterministic policy together with Q-value estimates. Notably, it uses the minimum value of two critics instead of one to limit overestimation.
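To make the twin-critic idea concrete, the following is a minimal NumPy sketch of the TD3 target computation from the paper; target_policy, target_qf1, and target_qf2 are hypothetical callables standing in for the target networks' forward passes, not part of this module's API.

import numpy as np

def td3_target(reward, next_obs, done, target_policy, target_qf1, target_qf2,
               discount=0.99, sigma=0.2, noise_clip=0.5):
    """Clipped double-Q target with target policy smoothing (sketch)."""
    # Target policy smoothing: perturb the target action with clipped
    # Gaussian noise so the critic cannot exploit sharp Q-value peaks.
    next_action = target_policy(next_obs)
    noise = np.clip(sigma * np.random.randn(*next_action.shape),
                    -noise_clip, noise_clip)
    next_action = next_action + noise
    # Clipped double-Q: the minimum of the two target critics limits
    # the overestimation bias of a single critic.
    min_q = np.minimum(target_qf1(next_obs, next_action),
                       target_qf2(next_obs, next_action))
    return reward + discount * (1.0 - done) * min_q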
class TD3(env_spec, policy, qf, qf2, replay_buffer, *, target_update_tau=0.01, policy_weight_decay=0, qf_weight_decay=0, policy_optimizer=<class 'tensorflow.python.training.adam.AdamOptimizer'>, qf_optimizer=<class 'tensorflow.python.training.adam.AdamOptimizer'>, policy_lr=<garage._functions._Default object>, qf_lr=<garage._functions._Default object>, clip_pos_returns=False, clip_return=inf, discount=0.99, max_action=None, name='TD3', steps_per_epoch=20, max_path_length=None, max_eval_path_length=None, n_train_steps=50, buffer_batch_size=64, min_buffer_size=10000.0, rollout_batch_size=1, reward_scale=1.0, exploration_policy_sigma=0.2, actor_update_period=2, exploration_policy_clip=0.5, smooth_return=True, exploration_policy=None)[source]

Bases: garage.np.algos.rl_algorithm.RLAlgorithm
Implementation of TD3.
Based on https://arxiv.org/pdf/1802.09477.pdf.
Example
$ python garage/examples/tf/td3_pendulum.py
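For orientation, a hedged sketch of how such a script is typically assembled is shown below; the helper classes and import paths (GarageEnv, LocalTFRunner, ContinuousMLPPolicy, ContinuousMLPQFunction, PathBuffer) are assumptions that vary across garage versions, so treat this as illustrative rather than canonical.

from garage import wrap_experiment
from garage.envs import GarageEnv
from garage.experiment import LocalTFRunner
from garage.replay_buffer import PathBuffer
from garage.tf.algos import TD3
from garage.tf.policies import ContinuousMLPPolicy
from garage.tf.q_functions import ContinuousMLPQFunction

@wrap_experiment
def td3_pendulum(ctxt=None):
    """Train TD3 on a pendulum task (illustrative sketch)."""
    with LocalTFRunner(ctxt) as runner:
        env = GarageEnv(env_name='InvertedPendulum-v2')
        policy = ContinuousMLPPolicy(env_spec=env.spec,
                                     hidden_sizes=[400, 300])
        # TD3 trains two independent critics and backs up their minimum.
        qf = ContinuousMLPQFunction(env_spec=env.spec,
                                    hidden_sizes=[400, 300])
        qf2 = ContinuousMLPQFunction(env_spec=env.spec,
                                     hidden_sizes=[400, 300],
                                     name='ContinuousMLPQFunction2')
        replay_buffer = PathBuffer(capacity_in_transitions=int(1e6))
        algo = TD3(env_spec=env.spec,
                   policy=policy,
                   qf=qf,
                   qf2=qf2,
                   replay_buffer=replay_buffer,
                   steps_per_epoch=20,
                   n_train_steps=50,
                   discount=0.99,
                   buffer_batch_size=64)
        runner.setup(algo, env)
        runner.train(n_epochs=500, batch_size=250)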
Parameters:
- env_spec (garage.envs.EnvSpec) – Environment specification.
- policy (garage.tf.policies.Policy) – Policy.
- qf (garage.tf.q_functions.QFunction) – Q-function.
- qf2 (garage.tf.q_functions.QFunction) – Second Q-function, used to form the clipped double-Q target.
- replay_buffer (garage.replay_buffer.ReplayBuffer) – Replay buffer.
- target_update_tau (float) – Interpolation parameter for the soft target update (illustrated in the sketch after this parameter list).
- policy_lr (float) – Learning rate for training the policy network.
- qf_lr (float) – Learning rate for training the Q-value network.
- policy_weight_decay (float) – L2 weight decay factor for parameters of the policy network.
- qf_weight_decay (float) – L2 weight decay factor for parameters of the Q-value network.
- policy_optimizer (tf.python.training.optimizer.Optimizer) – Optimizer for training the policy network.
- qf_optimizer (tf.python.training.optimizer.Optimizer) – Optimizer for training the Q-function network.
- clip_pos_returns (boolean) – Whether or not to clip positive returns.
- clip_return (float) – Clip return to be in [-clip_return, clip_return].
- discount (float) – Discount factor for the cumulative return.
- max_action (float) – Maximum action magnitude.
- name (str) – Name of the algorithm shown in the computation graph.
- steps_per_epoch (int) – Number of batches of samples in each epoch.
- max_path_length (int) – Maximum length of a path.
- max_eval_path_length (int or None) – Maximum length of paths used for off-policy evaluation. If None, defaults to max_path_length.
- n_train_steps (int) – Number of optimizations in each epoch cycle.
- buffer_batch_size (int) – Number of transitions sampled from the replay buffer per optimization step.
- min_buffer_size (int) – Number of samples in replay buffer before first optimization.
- rollout_batch_size (int) – Rollout batch size.
- reward_scale (float) – Scaling factor applied to rewards.
- exploration_policy_sigma (float) – Standard deviation of the Gaussian action noise.
- exploration_policy_clip (float) – Magnitude at which the action noise is clipped.
- actor_update_period (int) – Actor update period; the policy is updated once every actor_update_period training steps (see the sketch after this parameter list).
- smooth_return (bool) – If True, compute statistics over the entire sample collection; otherwise, over a single batch.
- exploration_policy (garage.np.exploration_policies.ExplorationPolicy) – Exploration strategy.
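Two of these parameters interact in a characteristic TD3 schedule: target networks are moved toward the online networks by Polyak averaging with target_update_tau, and the actor (plus the target networks) is updated only once every actor_update_period critic updates. A minimal runnable sketch of that schedule, using toy weight arrays rather than this module's internals:

import numpy as np

def soft_update(target_params, online_params, tau=0.01):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]

# Toy weight lists standing in for the critic's variables.
qf_params = [np.ones((2, 2))]
target_qf_params = [np.zeros((2, 2))]

actor_update_period = 2
for step in range(1, 7):
    # ... a critic update would run on every step ...
    if step % actor_update_period == 0:
        # Delayed actor update, followed by a soft target update.
        target_qf_params = soft_update(target_qf_params, qf_params, tau=0.01)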
optimize_policy(itr)[source]

Perform one iteration of policy optimization.

Parameters: itr (int) – Iteration number.
Returns:
- action_loss (float) – Loss of the action predicted by the policy network.
- qval_loss (float) – Loss of the Q-value predicted by the Q network.
- ys (float) – y targets used for Q-function regression.
- qval (float) – Q-value predicted by the Q network.
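The returned quantities correspond to the standard TD3 losses. The sketch below shows how they relate; policy, qf1, and qf2 are hypothetical callables for the networks' forward passes, not this module's attributes.

import numpy as np

def td3_losses(obs, actions, ys, policy, qf1, qf2):
    """Relate the values optimize_policy reports (sketch)."""
    # qval: Q-values the first critic predicts for the sampled actions.
    qval = qf1(obs, actions)
    # qval_loss: mean squared error of both critics against the targets ys.
    qval_loss = (np.mean((qval - ys) ** 2)
                 + np.mean((qf2(obs, actions) - ys) ** 2))
    # action_loss: negated Q-value of the policy's own action; minimizing
    # it pushes the policy toward actions the first critic rates highly.
    action_loss = -np.mean(qf1(obs, policy(obs)))
    return action_loss, qval_loss, ys, qval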
train(runner)[source]

Obtain samples and run training for each epoch.

Parameters: runner (LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.
Returns: The average return in the last epoch cycle.
Return type: float
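For context on the runner protocol, an algorithm's train() typically drives runner.step_epochs(); a hedged sketch of that loop is below, with the replay-buffer and optimization calls left as placeholder comments rather than this class's exact internals.

def train(self, runner):
    """Sketch of a garage off-policy training loop (assumed structure)."""
    last_return = None
    for _ in runner.step_epochs():  # provides snapshotting and logging
        for _ in range(self.steps_per_epoch):
            paths = runner.obtain_samples(runner.step_itr)
            # ... add paths to the replay buffer, then run n_train_steps
            # optimization steps (e.g. via optimize_policy) here ...
            runner.step_itr += 1
    return last_return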