garage.torch.algos.dqn

This module creates a DQN model in PyTorch.

class DQN(env_spec, policy, qf, replay_buffer, sampler, exploration_policy=None, eval_env=None, double_q=True, qf_optimizer=torch.optim.Adam, *, steps_per_epoch=20, n_train_steps=50, max_episode_length_eval=None, deterministic_eval=False, buffer_batch_size=64, min_buffer_size=int(10000.0), num_eval_episodes=10, discount=0.99, qf_lr=_Default(0.001), clip_rewards=None, clip_gradient=10, target_update_freq=5, reward_scale=1.0)

Bases: garage.np.algos.RLAlgorithm

DQN algorithm. See https://arxiv.org/pdf/1312.5602.pdf.

DQN (Deep Q-Network) is an off-policy algorithm that learns an action-value estimate Q(s,a) for each state-action pair. The policy then acts greedily: for a given state s, it takes the action with the highest Q(s,a) value.
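The greedy action choice described above can be illustrated with a minimal, framework-free sketch. It is not garage's implementation: `greedy_action` and the Q-values below are hypothetical names and numbers chosen for illustration only.

```python
def greedy_action(q_values):
    """Return the index of the largest Q-value.

    Ties are broken in favour of the lowest action index, because
    Python's max() returns the first maximal element it encounters.
    """
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical Q(s, a) estimates for one state with three discrete actions.
q_values = [0.1, 0.7, 0.3]
action = greedy_action(q_values)
```

In practice the Q-values would come from a forward pass of the qf network on the current observation, and the argmax would be taken over the action dimension of the resulting tensor.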

Parameters
  • env_spec (EnvSpec) – Environment specification.

  • policy (garage.torch.policies.Policy) – Policy. For DQN, this is a policy that performs the action that yields the highest Q value.

  • qf (nn.Module) – Q-value network.

  • replay_buffer (ReplayBuffer) – Replay buffer.

  • sampler (garage.sampler.Sampler) – Sampler.

  • steps_per_epoch (int) – Number of train_once calls per epoch.

  • n_train_steps (int) – Training steps.

  • eval_env (Environment) – Evaluation environment. If None, a copy of the main environment is used for evaluation.

  • double_q (bool) – Whether to use Double DQN. See https://arxiv.org/abs/1509.06461.

  • max_episode_length_eval (int or None) – Maximum length of episodes used for off-policy evaluation. If None, defaults to env_spec.max_episode_length.

  • buffer_batch_size (int) – Batch size of replay buffer.

  • min_buffer_size (int) – The minimum buffer size for replay buffer.

  • exploration_policy (ExplorationPolicy) – Exploration strategy, typically epsilon-greedy.

  • num_eval_episodes (int) – Number of evaluation episodes. Defaults to 10.

  • deterministic_eval (bool) – Whether to evaluate the policy deterministically (without exploration noise). False by default.

  • target_update_freq (int) – Number of optimization steps between each update to the target Q network.

  • discount (float) – Discount factor for the cumulative return.

  • qf_optimizer (Union[type, tuple[type, dict]]) – Type of optimizer for training the Q-value network. This can be an optimizer type such as torch.optim.Adam, or a tuple of a type and a dictionary, where the dictionary contains the arguments used to initialize the optimizer, e.g. (torch.optim.Adam, {'lr': 1e-3}).

  • qf_lr (float) – Learning rate for Q-value network parameters.

  • clip_rewards (float) – Clip reward to be in [-clip_rewards, clip_rewards]. If None, rewards are not clipped.

  • clip_gradient (float) – Clip gradient norm to clip_gradient. If None, gradients are not clipped. Defaults to 10.

  • reward_scale (float) – Reward scale.
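To show how several of these parameters (discount, double_q, clip_rewards, reward_scale) could interact, here is a rough plain-Python sketch of a one-step TD target for a single transition. It follows the standard DQN and Double DQN target definitions from the papers linked above, not garage's actual code. The function name `td_target`, its arguments, and the choice to scale the reward before clipping it are all assumptions made for illustration.

```python
def td_target(reward, done, next_q_online, next_q_target,
              discount=0.99, double_q=True, clip_rewards=None,
              reward_scale=1.0):
    """Compute a one-step TD target for a single transition (sketch)."""
    # Assumption: the reward is scaled first and clipped afterwards.
    reward = reward * reward_scale
    if clip_rewards is not None:
        reward = max(-clip_rewards, min(clip_rewards, reward))

    if double_q:
        # Double DQN: the online network selects the next action,
        # the target network evaluates it.
        best = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
        next_value = next_q_target[best]
    else:
        # Standard DQN: the target network both selects and evaluates.
        next_value = max(next_q_target)

    # Do not bootstrap past a terminal state.
    return reward + discount * (0.0 if done else next_value)


# Hypothetical next-state Q-values from the online and target networks.
next_q_online = [1.0, 3.0]
next_q_target = [2.5, 2.0]

double_target = td_target(5.0, False, next_q_online, next_q_target,
                          discount=0.9, clip_rewards=1.0)
vanilla_target = td_target(5.0, False, next_q_online, next_q_target,
                           discount=0.9, double_q=False, clip_rewards=1.0)
```

With these numbers the two variants disagree: the online network prefers action 1, which the target network values lower than action 0, so the Double DQN target is smaller than the standard one. This is the overestimation-reducing effect that double_q is meant to provide.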

train(self, trainer)

Obtain samplers and start actual training for each epoch.

Parameters

trainer (Trainer) – Experiment trainer.

Returns

The average return in the last epoch cycle.

Return type

float

to(self, device=None)

Put all the networks within the model on device.

Parameters

device (str) – ID of GPU or CPU.