garage.tf.algos.ddpg

Deep Deterministic Policy Gradient (DDPG) implementation in TensorFlow.

class DDPG(env_spec, policy, qf, replay_buffer, *, steps_per_epoch=20, n_train_steps=50, max_episode_length=None, max_episode_length_eval=None, buffer_batch_size=64, min_buffer_size=int(1e4), exploration_policy=None, target_update_tau=0.01, discount=0.99, policy_weight_decay=0, qf_weight_decay=0, policy_optimizer=tf.compat.v1.train.AdamOptimizer, qf_optimizer=tf.compat.v1.train.AdamOptimizer, policy_lr=_Default(0.0001), qf_lr=_Default(0.001), clip_pos_returns=False, clip_return=np.inf, max_action=None, reward_scale=1.0, name='DDPG')

Bases: garage.np.algos.RLAlgorithm


A DDPG model based on https://arxiv.org/pdf/1509.02971.pdf.

DDPG (Deep Deterministic Policy Gradient) is an actor-critic method that optimizes a deterministic policy together with a Q-value estimate. The critic network is trained with a supervised regression objective, while the actor network is trained with the deterministic policy gradient. An exploration strategy, a replay buffer, and target networks are used to stabilize training.

Example

$ python garage/examples/tf/ddpg_pendulum.py
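
The shipped example wires the constructor arguments together roughly as follows. This is a hedged construction sketch, assuming the continuous MLP policy and Q-function, Ornstein-Uhlenbeck exploration noise, and PathBuffer used by garage/examples/tf/ddpg_pendulum.py; module paths, the environment wrapper, and the hyperparameter values shown here vary between garage releases, so check them against the example script.

    import tensorflow as tf

    from garage.envs import GymEnv
    from garage.np.exploration_policies import AddOrnsteinUhlenbeckNoise
    from garage.replay_buffer import PathBuffer
    from garage.tf.algos import DDPG
    from garage.tf.policies import ContinuousMLPPolicy
    from garage.tf.q_functions import ContinuousMLPQFunction

    # Environment name and wrapper are illustrative (older releases use GarageEnv).
    env = GymEnv('InvertedDoublePendulum-v2')

    # Deterministic actor and Q-function critic.
    policy = ContinuousMLPPolicy(env_spec=env.spec,
                                 hidden_sizes=[64, 64],
                                 hidden_nonlinearity=tf.nn.relu,
                                 output_nonlinearity=tf.nn.tanh)
    qf = ContinuousMLPQFunction(env_spec=env.spec, hidden_sizes=[64, 64])

    # Off-policy components: exploration noise and a replay buffer.
    exploration_policy = AddOrnsteinUhlenbeckNoise(env.spec, policy, sigma=0.2)
    replay_buffer = PathBuffer(capacity_in_transitions=int(1e6))

    algo = DDPG(env_spec=env.spec,
                policy=policy,
                qf=qf,
                replay_buffer=replay_buffer,
                exploration_policy=exploration_policy,
                steps_per_epoch=20,
                n_train_steps=50,
                max_episode_length=100,
                buffer_batch_size=64,
                discount=0.99,
                target_update_tau=1e-2,
                policy_lr=1e-4,
                qf_lr=1e-3)

The resulting algo object is then handed to an experiment runner (see train() below).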

Parameters:
  • env_spec (EnvSpec) – Environment specification.
  • policy (garage.tf.policies.Policy) – Policy.
  • qf (object) – The q value network.
  • replay_buffer (garage.replay_buffer.ReplayBuffer) – Replay buffer.
  • steps_per_epoch (int) – Number of train_once calls per epoch.
  • n_train_steps (int) – Number of optimization steps performed per call to train_once.
  • max_episode_length (int) – Maximum episode length. Episodes are truncated once they reach max_episode_length.
  • max_episode_length_eval (int or None) – Maximum length of episodes used for off-policy evaluation. If None, defaults to max_episode_length.
  • buffer_batch_size (int) – Batch size of replay buffer.
  • min_buffer_size (int) – The minimum buffer size for replay buffer.
  • exploration_policy (garage.np.exploration_policies.ExplorationPolicy) – Exploration strategy.
  • target_update_tau (float) – Interpolation parameter for the soft target update (see the sketch after this list).
  • policy_lr (float) – Learning rate for training policy network.
  • qf_lr (float) – Learning rate for training q value network.
  • discount (float) – Discount factor for the cumulative return.
  • policy_weight_decay (float) – L2 regularization factor for parameters of the policy network. Value of 0 means no regularization.
  • qf_weight_decay (float) – L2 regularization factor for parameters of the q value network. Value of 0 means no regularization.
  • policy_optimizer (tf.Optimizer) – Optimizer for training policy network.
  • qf_optimizer (tf.Optimizer) – Optimizer for training q function network.
  • clip_pos_returns (bool) – Whether or not to clip positive returns.
  • clip_return (float) – Clip return to be in [-clip_return, clip_return].
  • max_action (float) – Maximum action magnitude.
  • reward_scale (float) – Reward scale.
  • name (str) – Name of the algorithm shown in computation graph.
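
target_update_tau controls how quickly the target networks track the online networks: after each optimization step the target parameters are moved a fraction tau toward the current parameters. A minimal sketch of that interpolation (function and variable names are illustrative, not garage internals):

    def soft_update(target_params, online_params, tau=0.01):
        """Soft target update: target <- tau * online + (1 - tau) * target."""
        return [tau * w + (1.0 - tau) * w_target
                for w_target, w in zip(target_params, online_params)]

With a small tau (the default is 0.01) the targets change slowly, which keeps the regression targets for the critic comparatively stable.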
init_opt(self)

Build the loss functions and initialize the optimizers.
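
A hedged sketch of the two losses this step typically builds for DDPG, using stand-in linear networks so the graph is self-contained; the tensor and variable names are illustrative, not the ones garage creates:

    import tensorflow as tf

    tf.compat.v1.disable_eager_execution()

    obs_dim, act_dim = 3, 1
    obs = tf.compat.v1.placeholder(tf.float32, (None, obs_dim))
    act = tf.compat.v1.placeholder(tf.float32, (None, act_dim))
    y = tf.compat.v1.placeholder(tf.float32, (None, 1))  # Bellman targets

    # Single-layer stand-ins for the actor and critic networks.
    w_pi = tf.Variable(tf.random.normal((obs_dim, act_dim), stddev=0.1))
    w_q = tf.Variable(tf.random.normal((obs_dim + act_dim, 1), stddev=0.1))

    pi = tf.tanh(tf.matmul(obs, w_pi))                      # deterministic action
    q_pred = tf.matmul(tf.concat([obs, act], axis=1), w_q)  # Q(s, a)
    q_of_pi = tf.matmul(tf.concat([obs, pi], axis=1), w_q)  # Q(s, pi(s))

    # Critic loss: mean-squared error against the Bellman target.
    qf_loss = tf.reduce_mean(tf.square(y - q_pred))
    # Actor loss: maximize the critic's estimate of the policy's own actions.
    policy_loss = -tf.reduce_mean(q_of_pi)

    qf_train_op = tf.compat.v1.train.AdamOptimizer(1e-3).minimize(
        qf_loss, var_list=[w_q])
    policy_train_op = tf.compat.v1.train.AdamOptimizer(1e-4).minimize(
        policy_loss, var_list=[w_pi])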

train(self, runner)

Obtain samplers from the runner and start training for each epoch.

Parameters: runner (LocalRunner) – Experiment runner, which provides services such as snapshotting and sampler control.
Returns: The average return in the last epoch cycle.
Return type: float
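
A hedged sketch of launching training through a runner, following the shipped pendulum example; wrap_experiment, set_seed, and the TF runner class (LocalTFRunner here) are taken from that example, and their import paths differ between garage releases:

    from garage import wrap_experiment
    from garage.experiment import LocalTFRunner
    from garage.experiment.deterministic import set_seed


    @wrap_experiment
    def ddpg_pendulum(ctxt=None, seed=1):
        """Assemble the components from the construction sketch and train."""
        set_seed(seed)
        with LocalTFRunner(snapshot_config=ctxt) as runner:
            # `algo` and `env` are the objects built in the construction sketch above.
            runner.setup(algo=algo, env=env)
            # Each epoch performs steps_per_epoch calls to train_once; batch_size
            # is the number of environment steps sampled before each call.
            runner.train(n_epochs=500, batch_size=100)


    ddpg_pendulum(seed=1)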
train_once(self, itr, episodes)

Perform one step of policy optimization given one batch of samples.

Parameters:
  • itr (int) – Iteration number.
  • episodes (EpisodeBatch) – Batch of episodes.
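
In broad strokes, one call to train_once stores the newly collected episodes in the replay buffer and, once enough transitions are available, runs several optimization steps. A hedged outline (argument names and buffer methods are illustrative, not the exact garage internals):

    def train_once_sketch(replay_buffer, episodes, optimize_policy,
                          n_train_steps=50, min_buffer_size=int(1e4)):
        """Illustrative outline of one off-policy training step."""
        # Store the freshly collected episodes.
        replay_buffer.add_episode_batch(episodes)

        # Only start optimizing once the buffer holds enough transitions.
        if replay_buffer.n_transitions_stored >= min_buffer_size:
            for _ in range(n_train_steps):
                optimize_policy()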
optimize_policy(self)

Perform one optimization step for the policy and Q-function networks.

Returns:
  • float – Loss of the action predicted by the policy network.
  • float – Loss of the Q value predicted by the Q network.
  • float – ys, the regression targets for the Q-value update.
  • float – Q value predicted by the Q network.
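
The ys above are the regression targets for the Q-value update. A hedged sketch of how such targets are typically formed in DDPG from the target networks (function names are illustrative; garage's implementation additionally applies the clipping controlled by clip_return and clip_pos_returns):

    def bellman_targets(rewards, terminals, next_obs,
                        target_policy_fn, target_qf_fn, discount=0.99):
        """ys = r + discount * (1 - terminal) * Q'(s', pi'(s'))."""
        next_actions = target_policy_fn(next_obs)
        next_q = target_qf_fn(next_obs, next_actions)
        return rewards + discount * (1.0 - terminals) * next_q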