garage.tf.algos.dqn

Deep Q-Network (DQN) algorithm.

class DQN(env_spec, policy, qf, replay_buffer, exploration_policy=None, steps_per_epoch=20, min_buffer_size=int(1e4), buffer_batch_size=64, max_episode_length_eval=None, n_train_steps=50, qf_lr=0.001, qf_optimizer=tf.compat.v1.train.AdamOptimizer, discount=1.0, target_network_update_freq=5, grad_norm_clipping=None, double_q=False, reward_scale=1.0, name='DQN')

Bases: garage.np.algos.RLAlgorithm

Inheritance diagram of garage.tf.algos.dqn.DQN

DQN from https://arxiv.org/pdf/1312.5602.pdf.

DQN estimates the Q-value function with a deep neural network, which allows Q-learning to be applied to high-complexity environments. For pixel-based environments, a number of tricks are usually needed, e.g. skipping frames and stacking consecutive frames into a single observation.
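The core update regresses the Q-network toward a one-step bootstrapped target y = r + discount * max_a' Q_target(s', a'). The following sketch is illustrative only (it is not garage's internal code) and shows how that target is formed, including the double-Q variant selected by the double_q flag:

    import numpy as np

    def td_targets(rewards, dones, q_next_target, q_next_online=None,
                   discount=0.99):
        """One-step TD targets: y = r + discount * max_a' Q_target(s', a')."""
        if q_next_online is None:
            # Vanilla DQN: the target network both selects and evaluates
            # the greedy next action.
            best_next_q = q_next_target.max(axis=1)
        else:
            # Double DQN: the online network selects the action, the target
            # network evaluates it, reducing overestimation bias.
            best_actions = q_next_online.argmax(axis=1)
            best_next_q = q_next_target[np.arange(len(best_actions)),
                                        best_actions]
        # Terminal transitions (dones == 1) do not bootstrap.
        return rewards + discount * (1.0 - dones) * best_next_q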

Parameters
  • env_spec (EnvSpec) – Environment specification.

  • policy (Policy) – Policy.

  • qf (object) – The Q-value network.

  • replay_buffer (ReplayBuffer) – Replay buffer.

  • exploration_policy (ExplorationPolicy) – Exploration strategy.

  • steps_per_epoch (int) – Number of train_once calls per epoch.

  • min_buffer_size (int) – Minimum number of transitions that must be stored in the replay buffer before optimization starts.

  • buffer_batch_size (int) – Number of transitions sampled from the replay buffer per optimization step.

  • max_episode_length_eval (int or None) – Maximum length of episodes used for off-policy evaluation. If None, defaults to env_spec.max_episode_length.

  • n_train_steps (int) – Number of training (optimization) steps per call to train_once.

  • qf_lr (float) – Learning rate for Q-Function.

  • qf_optimizer (tf.compat.v1.train.Optimizer) – Optimizer for Q-Function.

  • discount (float) – Discount factor for rewards.

  • target_network_update_freq (int) – Frequency of updating target network.

  • grad_norm_clipping (float) – Maximum clipping value for clipping tensor values to a maximum L2-norm. It must be larger than 0. If None, no gradient clipping is done. For detail, see docstring for tf.clip_by_norm.

  • double_q (bool) – Whether to use double Q-learning (a double Q-network) when forming targets.

  • reward_scale (float) – Reward scale.

  • name (str) – Name of the algorithm.
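
A minimal construction sketch, assuming garage's GymEnv, DiscreteMLPQFunction, DiscreteQfDerivedPolicy, EpsilonGreedyPolicy, and PathBuffer components under the module paths shown (names and paths may differ between garage versions):

    from garage.envs import GymEnv
    from garage.np.exploration_policies import EpsilonGreedyPolicy
    from garage.replay_buffer import PathBuffer
    from garage.tf.algos import DQN
    from garage.tf.policies import DiscreteQfDerivedPolicy
    from garage.tf.q_functions import DiscreteMLPQFunction

    env = GymEnv('CartPole-v0')
    replay_buffer = PathBuffer(capacity_in_transitions=int(1e5))
    qf = DiscreteMLPQFunction(env_spec=env.spec, hidden_sizes=(64, 64))
    policy = DiscreteQfDerivedPolicy(env_spec=env.spec, qf=qf)
    exploration_policy = EpsilonGreedyPolicy(env_spec=env.spec,
                                             policy=policy,
                                             total_timesteps=int(1e5),
                                             max_epsilon=1.0,
                                             min_epsilon=0.02,
                                             decay_ratio=0.1)

    algo = DQN(env_spec=env.spec,
               policy=policy,
               qf=qf,
               replay_buffer=replay_buffer,
               exploration_policy=exploration_policy,
               steps_per_epoch=10,
               min_buffer_size=int(1e4),
               n_train_steps=500,
               qf_lr=1e-3,
               discount=0.99,
               double_q=True)

In practice this construction is usually done inside a TFTrainer (or equivalent) context so that the TensorFlow graph and session are managed for you; see the sketch after the train() documentation below.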

train(self, trainer)

Obtain samples and run training for each epoch.

Parameters

trainer (Trainer) – Experiment trainer, which provides services such as snapshotting and sampler control.

Returns

The average return in the last epoch cycle.

Return type

float
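
train() is normally not called directly; a Trainer drives it after setting up sampling and snapshotting. The sketch below assumes garage's wrap_experiment decorator and TFTrainer, and uses a hypothetical build_dqn() helper standing in for the construction sketch above:

    from garage import wrap_experiment
    from garage.trainer import TFTrainer

    @wrap_experiment
    def dqn_cartpole(ctxt=None):
        """Train DQN on CartPole-v0 inside a TFTrainer context."""
        with TFTrainer(snapshot_config=ctxt) as trainer:
            # build_dqn() is a hypothetical helper standing in for the
            # component construction shown after the parameter list above.
            env, algo = build_dqn()
            trainer.setup(algo, env)
            # trainer.train() repeatedly calls algo.train(trainer), which
            # collects samples, fills the replay buffer, and runs
            # n_train_steps optimization steps per train_once call.
            trainer.train(n_epochs=100, batch_size=4000)

    dqn_cartpole()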