garage.tf.algos.dqn module

Deep Q-Network (DQN) algorithm.

class DQN(env_spec, policy, qf, replay_buffer, exploration_policy=None, steps_per_epoch=20, min_buffer_size=10000, buffer_batch_size=64, rollout_batch_size=1, n_train_steps=50, max_path_length=None, max_eval_path_length=None, qf_lr=<garage._functions._Default object>, qf_optimizer=<class 'tensorflow.python.training.adam.AdamOptimizer'>, discount=1.0, target_network_update_freq=5, grad_norm_clipping=None, double_q=False, reward_scale=1.0, smooth_return=True, name='DQN')[source]

Bases: garage.np.algos.rl_algorithm.RLAlgorithm

DQN from https://arxiv.org/pdf/1312.5602.pdf.

Also known as Deep Q-Network, DQN estimates the Q-value function with a deep neural network, which allows Q-learning to be applied to high-complexity environments. For pixel-based environments a number of tricks are usually needed, e.g. skipping frames and stacking consecutive frames into a single observation. A minimal launcher sketch follows the parameter list below.

Parameters:
  • env_spec (garage.envs.env_spec.EnvSpec) – Environment specification.
  • policy (garage.tf.policies.Policy) – Policy.
  • qf (object) – The Q-value network.
  • replay_buffer (garage.replay_buffer.ReplayBuffer) – Replay buffer.
  • exploration_policy (garage.np.exploration_policies.ExplorationPolicy) – Exploration strategy.
  • steps_per_epoch (int) – Number of train_once calls per epoch.
  • min_buffer_size (int) – The minimum buffer size for replay buffer.
  • buffer_batch_size (int) – Batch size for replay buffer.
  • rollout_batch_size (int) – Rollout batch size.
  • n_train_steps (int) – Number of Q-function training (optimization) steps per call to train_once().
  • max_path_length (int) – Maximum path length. The episode will terminate when length of trajectory reaches max_path_length.
  • max_eval_path_length (int or None) – Maximum length of paths used for off-policy evaluation. If None, defaults to max_path_length.
  • qf_lr (float) – Learning rate for Q-Function.
  • qf_optimizer (tf.Optimizer) – Optimizer for Q-Function.
  • discount (float) – Discount factor for rewards.
  • target_network_update_freq (int) – Frequency of updating target network.
  • grad_norm_clipping (float) – Maximum L2 norm to which gradient tensors are clipped. Must be larger than 0. If None, no gradient clipping is done. For details, see the docstring of tf.clip_by_norm.
  • double_q (bool) – Whether to use Double DQN, i.e. select the greedy action with the online Q-network but evaluate it with the target network.
  • reward_scale (float) – Reward scale.
  • smooth_return (bool) – Whether to smooth the return.
  • name (str) – Name of the algorithm.
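A minimal launcher sketch is shown below. The companion classes and their constructor arguments (wrap_experiment, LocalTFRunner, GymEnv, DiscreteMLPQFunction, DiscreteQfDerivedPolicy, EpsilonGreedyPolicy, PathBuffer) are assumed from a contemporary garage release; exact import paths and signatures differ between versions, so treat this as an illustrative sketch rather than a verbatim launcher.

    # Illustrative sketch only: companion class names, import paths and
    # constructor arguments are assumptions and vary between garage releases.
    from garage import wrap_experiment
    from garage.envs import GymEnv
    from garage.experiment import LocalTFRunner
    from garage.np.exploration_policies import EpsilonGreedyPolicy
    from garage.replay_buffer import PathBuffer
    from garage.tf.algos import DQN
    from garage.tf.policies import DiscreteQfDerivedPolicy
    from garage.tf.q_functions import DiscreteMLPQFunction


    @wrap_experiment
    def dqn_cartpole(ctxt=None):
        with LocalTFRunner(ctxt) as runner:
            env = GymEnv('CartPole-v0')

            # Q-network and the greedy policy derived from it.
            qf = DiscreteMLPQFunction(env_spec=env.spec, hidden_sizes=(64, 64))
            policy = DiscreteQfDerivedPolicy(env_spec=env.spec, qf=qf)

            # Replay buffer and epsilon-greedy exploration around the policy.
            replay_buffer = PathBuffer(capacity_in_transitions=int(1e5))
            exploration_policy = EpsilonGreedyPolicy(env_spec=env.spec,
                                                     policy=policy,
                                                     total_timesteps=50000,
                                                     max_epsilon=1.0,
                                                     min_epsilon=0.02,
                                                     decay_ratio=0.1)

            algo = DQN(env_spec=env.spec,
                       policy=policy,
                       qf=qf,
                       replay_buffer=replay_buffer,
                       exploration_policy=exploration_policy,
                       steps_per_epoch=10,
                       min_buffer_size=int(1e3),
                       buffer_batch_size=32,
                       n_train_steps=500,
                       max_path_length=200,
                       qf_lr=1e-4,
                       discount=0.99,
                       target_network_update_freq=2,
                       double_q=True)

            runner.setup(algo, env)
            runner.train(n_epochs=10, batch_size=500)


    dqn_cartpole()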
init_opt()[source]

Initialize the networks and Ops.

A discrete action space is assumed for DQN, so the action dimension is always action_space.n.
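The ops built here regress the Q-network onto a one-step TD target. The NumPy sketch below is an explanatory rewrite of that target for both the vanilla and double_q cases; it is not the TensorFlow graph code the method actually constructs.

    import numpy as np

    def td_target(rewards, dones, target_qvals, online_qvals, discount, double_q):
        """Illustrative one-step TD target the Q-network is regressed onto.

        target_qvals / online_qvals: (batch, n_actions) Q-values of the next
        observations from the target and online networks, respectively.
        """
        if double_q:
            # Double DQN: select the greedy action with the online network,
            # evaluate it with the target network.
            best_actions = np.argmax(online_qvals, axis=1)
            future_q = target_qvals[np.arange(len(best_actions)), best_actions]
        else:
            # Vanilla DQN: max over the target network's Q-values.
            future_q = np.max(target_qvals, axis=1)
        # No bootstrapping past terminal transitions.
        return rewards + discount * (1.0 - dones) * future_q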

optimize_policy(samples_data)[source]

Optimize the Q-network using experiences from the replay buffer.

Parameters: samples_data (list) – Processed batch data.
Returns: Loss of the Q-network.
Return type: numpy.float64
train(runner)[source]

Obtain samples and start actual training for each epoch.

Parameters: runner (LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.
Returns: The average return in the last epoch cycle.
Return type: float
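Roughly, and with bookkeeping simplified, the per-epoch control flow is sketched below; apart from runner.step_epochs() and train_once(), the runner attribute and method names are assumptions about the runner API.

    # Simplified outline of the per-epoch loop in train() (illustrative; the
    # real implementation tracks additional runner state).
    def train_sketch(algo, runner):
        last_return = None
        for _ in runner.step_epochs():          # snapshotting, logging, resume
            for _ in range(algo._steps_per_epoch):
                paths = runner.obtain_samples(runner.step_itr)
                last_return = algo.train_once(runner.step_itr, paths)
                runner.step_itr += 1
        return last_return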
train_once(itr, paths)[source]

Perform one step of policy optimization given one batch of samples.

Parameters:
  • itr (int) – Iteration number.
  • paths (list[dict]) – A list of collected paths.
Returns: Average return.
Return type: numpy.float64
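The parameters min_buffer_size, n_train_steps, buffer_batch_size and target_network_update_freq all act inside this method. The sketch below is an assumed, simplified outline: the attribute and helper names are illustrative, and the exact replay-buffer API varies between garage versions.

    import numpy as np

    # Assumed, simplified outline of one train_once() step (not the verbatim
    # garage implementation).
    def train_once_sketch(algo, itr, paths):
        algo.replay_buffer.add_path(paths)        # store new transitions; buffer API varies

        for _ in range(algo._n_train_steps):
            # Optimize only once the buffer holds at least min_buffer_size
            # transitions, sampling buffer_batch_size transitions each step.
            if algo.replay_buffer.n_transitions_stored >= algo._min_buffer_size:
                samples = algo.replay_buffer.sample(algo._buffer_batch_size)
                algo.optimize_policy(samples)

        # Every target_network_update_freq iterations, copy the online
        # Q-network weights into the target network.
        if itr % algo._target_network_update_freq == 0:
            algo.update_target_network()          # helper name assumed

        # Average undiscounted return of the collected paths.
        return np.mean([np.sum(p['rewards']) for p in paths])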