garage.torch.algos.ddpg

This module creates a DDPG model in PyTorch.

class DDPG(env_spec, policy, qf, replay_buffer, *, max_episode_length, steps_per_epoch=20, n_train_steps=50, max_episode_length_eval=None, buffer_batch_size=64, min_buffer_size=int(1e4), exploration_policy=None, target_update_tau=0.01, discount=0.99, policy_weight_decay=0, qf_weight_decay=0, policy_optimizer=torch.optim.Adam, qf_optimizer=torch.optim.Adam, policy_lr=_Default(0.0001), qf_lr=_Default(0.001), clip_pos_returns=False, clip_return=np.inf, max_action=None, reward_scale=1.0)

Bases: garage.np.algos.RLAlgorithm


A DDPG model implemented with PyTorch.

DDPG, also known as Deep Deterministic Policy Gradient, is an actor-critic method that optimizes both the policy and the Q-function. It uses a supervised regression target to update the critic network and the deterministic policy gradient to update the actor network. An exploration strategy, a replay buffer, and target networks are used to stabilize training.
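As a rough illustration of these two updates (a sketch, not the code of this class), the critic is regressed toward a bootstrapped target computed with the target networks, while the actor follows the deterministic policy gradient. All tensor and network names below are placeholders:

    import torch
    import torch.nn.functional as F

    def ddpg_losses(policy, qf, target_policy, target_qf,
                    obs, actions, rewards, next_obs, terminals,
                    discount=0.99):
        """Illustrative DDPG losses; names and shapes are assumptions."""
        with torch.no_grad():
            # Bootstrapped critic target: y = r + gamma * (1 - done) * Q'(s', mu'(s'))
            next_actions = target_policy(next_obs)
            ys = rewards + discount * (1.0 - terminals) * target_qf(next_obs, next_actions)

        # Critic update: supervised regression of Q(s, a) toward y.
        qval = qf(obs, actions)
        qval_loss = F.mse_loss(qval, ys)

        # Actor update: deterministic policy gradient, i.e. maximize Q(s, mu(s)).
        action_loss = -qf(obs, policy(obs)).mean()
        return action_loss, qval_loss, ys, qval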

Parameters:
  • env_spec (EnvSpec) – Environment specification.
  • policy (garage.torch.policies.Policy) – Policy.
  • qf (object) – Q-value network.
  • replay_buffer (ReplayBuffer) – Replay buffer.
  • steps_per_epoch (int) – Number of train_once calls per epoch.
  • n_train_steps (int) – Number of training (optimization) steps taken per train_once call.
  • max_episode_length (int) – Maximum episode length. The episode will be truncated when the length of the episode reaches max_episode_length.
  • max_episode_length_eval (int or None) – Maximum length of episodes used for off-policy evaluation. If None, defaults to max_episode_length.
  • buffer_batch_size (int) – Batch size of replay buffer.
  • min_buffer_size (int) – Minimum number of transitions the replay buffer must contain before training starts.
  • exploration_policy (garage.np.exploration_policies.ExplorationPolicy) – Exploration strategy.
  • target_update_tau (float) – Interpolation parameter for doing the soft target update.
  • discount (float) – Discount factor for the cumulative return.
  • policy_weight_decay (float) – L2 weight decay factor for parameters of the policy network.
  • qf_weight_decay (float) – L2 weight decay factor for parameters of the q value network.
  • policy_optimizer (Union[type, tuple[type, dict]]) – Type of optimizer for training the policy network. This can be an optimizer type such as torch.optim.Adam, or a tuple of a type and a dictionary, where the dictionary contains arguments used to initialize the optimizer, e.g. (torch.optim.Adam, {'lr': 1e-3}); see the construction sketch after this list.
  • qf_optimizer (Union[type, tuple[type, dict]]) – Type of optimizer for training the Q-value network. This can be an optimizer type such as torch.optim.Adam, or a tuple of a type and a dictionary, where the dictionary contains arguments used to initialize the optimizer, e.g. (torch.optim.Adam, {'lr': 1e-3}).
  • policy_lr (float) – Learning rate for policy network parameters.
  • qf_lr (float) – Learning rate for Q-value network parameters.
  • clip_pos_returns (bool) – Whether or not to clip positive returns.
  • clip_return (float) – Clip return to be in [-clip_return, clip_return].
  • max_action (float) – Maximum action magnitude.
  • reward_scale (float) – Reward scale.
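A construction sketch using the companion classes garage provides for DDPG (DeterministicMLPPolicy, ContinuousMLPQFunction, PathBuffer, AddOrnsteinUhlenbeckNoise). The environment name, hyperparameter values, and exact import paths (e.g. GymEnv vs. GarageEnv, depending on the release) are assumptions for illustration, not a drop-in script:

    import torch

    from garage.envs import GymEnv, normalize
    from garage.np.exploration_policies import AddOrnsteinUhlenbeckNoise
    from garage.replay_buffer import PathBuffer
    from garage.torch.algos import DDPG
    from garage.torch.policies import DeterministicMLPPolicy
    from garage.torch.q_functions import ContinuousMLPQFunction

    # Environment and components; names/paths may differ between releases.
    env = normalize(GymEnv('InvertedDoublePendulum-v2'))

    policy = DeterministicMLPPolicy(env_spec=env.spec, hidden_sizes=[64, 64])
    qf = ContinuousMLPQFunction(env_spec=env.spec, hidden_sizes=[64, 64])
    replay_buffer = PathBuffer(capacity_in_transitions=int(1e6))
    exploration_policy = AddOrnsteinUhlenbeckNoise(env.spec, policy, sigma=0.2)

    ddpg = DDPG(env_spec=env.spec,
                policy=policy,
                qf=qf,
                replay_buffer=replay_buffer,
                exploration_policy=exploration_policy,
                max_episode_length=100,
                steps_per_epoch=20,
                n_train_steps=50,
                min_buffer_size=int(1e4),
                buffer_batch_size=64,
                target_update_tau=1e-2,
                discount=0.99,
                # Optimizers may be given as a type or as a (type, kwargs) tuple.
                policy_optimizer=(torch.optim.Adam, {'lr': 1e-4}),
                qf_optimizer=(torch.optim.Adam, {'lr': 1e-3}))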
train(self, runner)

Obtain samplers and start training for each epoch.

Parameters: runner (LocalRunner) – Experiment runner.
Returns: The average return in the last epoch cycle.
Return type: float
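A sketch of driving train through the experiment runner, assuming a garage release in which the runner is named LocalRunner (later releases rename it Trainer); the experiment name, seed, and batch sizes are illustrative:

    from garage import wrap_experiment
    from garage.experiment import LocalRunner
    from garage.experiment.deterministic import set_seed

    @wrap_experiment
    def ddpg_example(ctxt=None, seed=1):
        set_seed(seed)
        runner = LocalRunner(ctxt)
        # `ddpg` and `env` come from the construction sketch above.
        runner.setup(algo=ddpg, env=env)
        runner.train(n_epochs=500, batch_size=100)

    ddpg_example(seed=1)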
train_once(self, itr, episodes)

Perform one iteration of training.

Parameters:
  • itr (int) – Iteration number.
  • episodes (EpisodeBatch) – Batch of episodes.
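Schematically, one iteration stores the new episodes in the replay buffer and then takes several minibatch optimization steps. The attribute and buffer-method names below (e.g. _n_train_steps, add_episode_batch, sample_transitions) are assumptions meant to convey the flow, not the class's exact internals:

    def train_once_outline(algo, itr, episodes):
        """Illustrative outline of one training iteration (not library code)."""
        # Store the freshly collected episodes.
        algo.replay_buffer.add_episode_batch(episodes)

        # Take several gradient steps on minibatches once the buffer is warm.
        for _ in range(algo._n_train_steps):
            if algo.replay_buffer.n_transitions_stored >= algo._min_buffer_size:
                samples = algo.replay_buffer.sample_transitions(
                    algo._buffer_batch_size)
                action_loss, qval_loss, ys, qval = algo.optimize_policy(samples)

        # Target networks are kept in sync via update_target(); where exactly
        # this is invoked during the loop may differ in the library.
        algo.update_target()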
optimize_policy(self, samples_data)

Perform a single optimization step for the policy and Q-function.

Parameters: samples_data (dict) – Processed batch data.
Returns:
  • action_loss – Loss of the actions predicted by the policy network.
  • qval_loss – Loss of the Q-values predicted by the Q-network.
  • ys – Regression targets y_s for the Q-network.
  • qval – Q-values predicted by the Q-network.
update_target(self)

Update parameters in the target policy and Q-value network.
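The update is a soft (Polyak) update controlled by target_update_tau; a minimal sketch of that rule, with placeholder network names:

    import torch

    def soft_update(target_net, source_net, tau):
        """Polyak averaging: target <- tau * source + (1 - tau) * target."""
        with torch.no_grad():
            for t_param, param in zip(target_net.parameters(),
                                      source_net.parameters()):
                t_param.data.mul_(1.0 - tau)
                t_param.data.add_(tau * param.data)

    # Applied to both the target policy and the target Q-function, e.g.:
    # soft_update(target_policy, policy, tau=0.01)
    # soft_update(target_qf, qf, tau=0.01)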