garage.tf.algos.dqn
Deep Q-Network (DQN) algorithm.
class DQN(env_spec, policy, qf, replay_buffer, exploration_policy=None, steps_per_epoch=20, min_buffer_size=int(1e4), buffer_batch_size=64, n_train_steps=50, max_episode_length=None, max_episode_length_eval=None, qf_lr=_Default(0.001), qf_optimizer=tf.compat.v1.train.AdamOptimizer, discount=1.0, target_network_update_freq=5, grad_norm_clipping=None, double_q=False, reward_scale=1.0, name='DQN')

Bases: garage.np.algos.RLAlgorithm
DQN from https://arxiv.org/pdf/1312.5602.pdf.
Also known as Deep Q-Network, DQN estimates the Q-value function with a deep neural network, which allows Q-learning to be applied to high-complexity environments. Handling pixel-based environments usually requires a number of tricks, e.g. skipping frames and stacking several consecutive frames into a single observation. A construction and training sketch follows the parameter list below.
Parameters:
- env_spec (EnvSpec) – Environment specification.
- policy (garage.tf.policies.Policy) – Policy.
- qf (object) – The Q-value network.
- replay_buffer (ReplayBuffer) – Replay buffer.
- exploration_policy (ExplorationPolicy) – Exploration strategy.
- steps_per_epoch (int) – Number of train_once calls per epoch.
- min_buffer_size (int) – Minimum number of samples the replay buffer must contain before optimization starts.
- buffer_batch_size (int) – Batch size for sampling from the replay buffer.
- n_train_steps (int) – Number of optimization steps performed in each call to train_once.
- max_episode_length (int) – Maximum episode length. An episode is truncated when its length reaches max_episode_length.
- max_episode_length_eval (int or None) – Maximum length of episodes used for off-policy evaluation. If None, defaults to max_episode_length.
- qf_lr (float) – Learning rate for the Q-function.
- qf_optimizer (tf.Optimizer) – Optimizer for the Q-function.
- discount (float) – Discount factor for rewards.
- target_network_update_freq (int) – Frequency of updating the target network.
- grad_norm_clipping (float) – Maximum clipping value for clipping tensor values to a maximum L2-norm. It must be larger than 0. If None, no gradient clipping is done. For details, see the docstring of tf.clip_by_norm.
- double_q (bool) – Whether to use a double Q-network (Double DQN).
- reward_scale (float) – Scale factor applied to rewards.
- name (str) – Name of the algorithm.
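A usage sketch, wiring DQN into a garage experiment in the style of garage's DQN example scripts. The exact helper names (GarageEnv, PathBuffer, DiscreteMLPQFunction, DiscreteQfDerivedPolicy, EpsilonGreedyPolicy, LocalTFRunner, wrap_experiment) and their signatures vary across garage versions, so treat this as an assumption-laden outline rather than canonical code:

    import gym

    from garage import wrap_experiment
    from garage.envs import GarageEnv
    from garage.experiment import LocalTFRunner
    from garage.np.exploration_policies import EpsilonGreedyPolicy
    from garage.replay_buffer import PathBuffer
    from garage.tf.algos import DQN
    from garage.tf.policies import DiscreteQfDerivedPolicy
    from garage.tf.q_functions import DiscreteMLPQFunction


    @wrap_experiment
    def dqn_cartpole(ctxt=None):
        """Train DQN on CartPole-v0 with illustrative hyperparameters."""
        with LocalTFRunner(ctxt) as runner:
            env = GarageEnv(gym.make('CartPole-v0'))
            replay_buffer = PathBuffer(capacity_in_transitions=int(1e4))
            qf = DiscreteMLPQFunction(env_spec=env.spec, hidden_sizes=(64, 64))
            policy = DiscreteQfDerivedPolicy(env_spec=env.spec, qf=qf)
            exploration_policy = EpsilonGreedyPolicy(
                env_spec=env.spec,
                policy=policy,
                total_timesteps=50000,  # epochs * steps_per_epoch * batch_size
                max_epsilon=1.0,
                min_epsilon=0.02,
                decay_ratio=0.1)
            algo = DQN(env_spec=env.spec,
                       policy=policy,
                       qf=qf,
                       replay_buffer=replay_buffer,
                       exploration_policy=exploration_policy,
                       steps_per_epoch=10,
                       min_buffer_size=int(1e3),
                       double_q=True,
                       n_train_steps=500,
                       target_network_update_freq=1,
                       buffer_batch_size=32)
            runner.setup(algo, env)
            runner.train(n_epochs=10, batch_size=500)


    dqn_cartpole()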
init_opt(self)

Initialize the networks and Ops.

A discrete action space is assumed for DQN, so the action dimension is always action_space.n.
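The Ops built here compute the standard (and, when double_q is set, the double) Q-learning target. Below is a NumPy sketch of the math those Ops implement; the function and argument names are illustrative, not garage internals:

    import numpy as np


    def td_targets(rewards, dones, next_q_online, next_q_target,
                   discount=0.99, double_q=False):
        """Bellman targets for a batch: r + discount * (1 - done) * bootstrap.

        next_q_online and next_q_target are (batch, n_actions) arrays of
        Q-values for the next observations under the online and target
        networks, respectively.
        """
        if double_q:
            # Double DQN: the online network selects the action,
            # the target network evaluates it.
            best = np.argmax(next_q_online, axis=1)
            bootstrap = next_q_target[np.arange(len(best)), best]
        else:
            # Vanilla DQN: the target network both selects and evaluates.
            bootstrap = np.max(next_q_target, axis=1)
        return rewards + discount * (1.0 - dones) * bootstrap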
train(self, runner)

Obtain samplers and start actual training for each epoch.

Parameters:
- runner (LocalRunner) – Experiment runner, which provides services such as snapshotting and sampler control.

Returns: The average return in the last epoch cycle.
Return type: float
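In other words, each epoch consists of steps_per_epoch sample-and-update cycles, each of which hands freshly sampled episodes to train_once. A self-contained outline of that control flow; the callables here are hypothetical stand-ins, not garage internals:

    def train_loop(n_epochs, steps_per_epoch, obtain_episodes, train_once):
        """Run n_epochs epochs of steps_per_epoch sample/update cycles."""
        last_return = None
        itr = 0
        for _ in range(n_epochs):
            for _ in range(steps_per_epoch):
                episodes = obtain_episodes(itr)          # sample via exploration policy
                last_return = train_once(itr, episodes)  # one optimization pass
                itr += 1
        return last_return


    # Trivial stand-ins, just to show the shape of the loop.
    print(train_loop(n_epochs=2, steps_per_epoch=3,
                     obtain_episodes=lambda itr: [],
                     train_once=lambda itr, episodes: 0.0))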
train_once(self, itr, episodes)

Perform one step of policy optimization given one batch of samples.

Parameters:
- itr (int) – Iteration number.
- episodes (EpisodeBatch) – Batch of episodes.

Returns: Q-function losses.
optimize_policy(self)

Optimize the network using experiences from the replay buffer.

Returns: Loss of policy.
Return type: numpy.float64