garage.torch.algos.td3
TD3 model in PyTorch.
- class TD3(env_spec, policy, qf1, qf2, replay_buffer, sampler, *, max_episode_length_eval=None, grad_steps_per_env_step, exploration_policy, uniform_random_policy=None, max_action=None, target_update_tau=0.005, discount=0.99, reward_scaling=1.0, update_actor_interval=2, buffer_batch_size=64, replay_buffer_size=1000000, min_buffer_size=10000, exploration_noise=0.1, policy_noise=0.2, policy_noise_clip=0.5, clip_return=np.inf, policy_lr=_Default(0.0001), qf_lr=_Default(0.001), policy_optimizer=torch.optim.Adam, qf_optimizer=torch.optim.Adam, num_evaluation_episodes=10, steps_per_epoch=20, start_steps=10000, update_after=1000, use_deterministic_evaluation=False)
Bases:
garage.np.algos.RLAlgorithm
Implementation of TD3.
Based on https://arxiv.org/pdf/1802.09477.pdf.
- Parameters
env_spec (EnvSpec) – Environment specification.
policy (garage.torch.policies.Policy) – Policy (actor network).
qf1 (garage.torch.q_functions.QFunction) – Q function (critic network).
qf2 (garage.torch.q_functions.QFunction) – Q function (critic network).
replay_buffer (ReplayBuffer) – Replay buffer.
sampler (garage.sampler.Sampler) – Sampler.
replay_buffer_size (int) – Size of the replay buffer.
exploration_policy (garage.np.exploration_policies.ExplorationPolicy) – Exploration strategy.
uniform_random_policy (garage.np.exploration_policies.ExplorationPolicy or None) – Uniform random exploration strategy, used to warm up the replay buffer before start_steps is reached.
target_update_tau (float) – Interpolation parameter for doing the soft target update.
discount (float) – Discount factor (gamma) for the cumulative return.
reward_scaling (float) – Reward scaling.
update_actor_interval (int) – Policy (Actor network) update interval.
max_action (float) – Maximum action magnitude.
buffer_batch_size (int) – Batch size of transitions sampled from the replay buffer per gradient step.
min_buffer_size (int) – Minimum number of transitions the replay buffer must contain before sampling begins.
policy_noise (float) – Standard deviation of the noise added to the target policy's action (target policy smoothing).
policy_noise_clip (float) – Magnitude at which the target policy noise is clipped.
exploration_noise (float) – Exploration noise.
clip_return (float) – Clip return to be in [-clip_return, clip_return].
policy_lr (float) – Learning rate for training policy network.
qf_lr (float) – Learning rate for training Q network.
policy_optimizer (Union[type, tuple[type, dict]]) – Type of optimizer for training policy network. This can be an optimizer type such as torch.optim.Adam, or a tuple of type and dictionary, where the dictionary contains arguments to initialize the optimizer, e.g. (torch.optim.Adam, {'lr': 1e-3}).
qf_optimizer (Union[type, tuple[type, dict]]) – Type of optimizer for training Q-value network. This can be an optimizer type such as torch.optim.Adam, or a tuple of type and dictionary, where the dictionary contains arguments to initialize the optimizer, e.g. (torch.optim.Adam, {'lr': 1e-3}).
steps_per_epoch (int) – Number of train_once calls per epoch.
grad_steps_per_env_step (int) – Number of gradient steps taken per environment step sampled.
max_episode_length_eval (int or None) – Maximum length of episodes used for off-policy evaluation. If None, defaults to env_spec.max_episode_length.
num_evaluation_episodes (int) – The number of evaluation trajectories used for computing eval stats at the end of every epoch.
start_steps (int) – The number of steps for warming up before selecting actions according to policy.
update_after (int) – The number of environment steps to collect before the policy is first updated.
use_deterministic_evaluation (bool) – True if the trained policy should be evaluated deterministically.
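Several of the parameters above (policy_noise, policy_noise_clip, target_update_tau, update_actor_interval) correspond to the three core TD3 tricks from the paper: target policy smoothing, clipped double-Q learning, and delayed/soft target updates. The sketch below illustrates the target computation and soft update in plain PyTorch with hypothetical stand-in linear networks; it is not garage's implementation, only a minimal illustration of the mechanics the parameters control.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy networks standing in for the policy and twin critics.
obs_dim, act_dim, max_action = 4, 2, 1.0
policy = nn.Linear(obs_dim, act_dim)
qf1 = nn.Linear(obs_dim + act_dim, 1)
qf2 = nn.Linear(obs_dim + act_dim, 1)
target_policy = copy.deepcopy(policy)
target_qf1 = copy.deepcopy(qf1)
target_qf2 = copy.deepcopy(qf2)

# Hyperparameters mirroring the constructor defaults above.
discount, policy_noise, policy_noise_clip, tau = 0.99, 0.2, 0.5, 0.005

# One sampled batch (random stand-ins for replay-buffer contents).
batch = 64
obs = torch.randn(batch, obs_dim)
next_obs = torch.randn(batch, obs_dim)
rewards = torch.randn(batch, 1)

with torch.no_grad():
    # Target policy smoothing: add clipped Gaussian noise to the
    # target policy's action, then clamp to the action bounds.
    noise = (torch.randn(batch, act_dim) * policy_noise).clamp(
        -policy_noise_clip, policy_noise_clip)
    next_action = (target_policy(next_obs) + noise).clamp(
        -max_action, max_action)
    # Clipped double-Q: take the minimum of the two target critics
    # to curb overestimation bias.
    target_in = torch.cat([next_obs, next_action], dim=-1)
    target_q = torch.min(target_qf1(target_in), target_qf2(target_in))
    y = rewards + discount * target_q  # Bellman target for both critics.

# Soft target update: target <- tau * source + (1 - tau) * target.
for src, tgt in ((policy, target_policy), (qf1, target_qf1),
                 (qf2, target_qf2)):
    for p, tp in zip(src.parameters(), tgt.parameters()):
        tp.data.mul_(1 - tau).add_(tau * p.data)
```

In the full algorithm the actor is updated only every update_actor_interval critic updates, which is why the delayed-update parameter appears in the signature.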
- property networks
Return all the networks within the model.
- Returns
A list of networks.
- Return type
list
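The tuple form accepted by policy_optimizer and qf_optimizer, described in the parameter list above, pairs an optimizer type with its keyword arguments. The helper below is a simplified stand-in (not garage's actual factory) showing how such a tuple can be unpacked into a constructed optimizer.

```python
import torch

def make_optimizer(optimizer_spec, module):
    """Build an optimizer from a type, or a (type, kwargs) tuple.

    Simplified stand-in for illustration; garage resolves these
    specs internally.
    """
    if isinstance(optimizer_spec, tuple):
        opt_type, opt_kwargs = optimizer_spec
        return opt_type(module.parameters(), **opt_kwargs)
    return optimizer_spec(module.parameters())

policy = torch.nn.Linear(4, 2)

# Bare type: constructed with the optimizer's own defaults.
opt_default = make_optimizer(torch.optim.Adam, policy)

# (type, kwargs) tuple: kwargs forwarded to the constructor.
opt_tuned = make_optimizer((torch.optim.Adam, {'lr': 1e-3}), policy)
```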