`garage.torch.algos.dqn`

This module creates a DQN model in PyTorch.

class DQN(env_spec, policy, qf, replay_buffer, sampler, exploration_policy=None, eval_env=None, double_q=True, qf_optimizer=torch.optim.Adam, *, steps_per_epoch=20, n_train_steps=50, max_episode_length_eval=None, deterministic_eval=False, buffer_batch_size=64, min_buffer_size=int(10000.0), num_eval_episodes=10, discount=0.99, qf_lr=_Default(0.001), clip_rewards=None, clip_gradient=10, target_update_freq=5, reward_scale=1.0)

Bases: garage.np.algos.RLAlgorithm


DQN algorithm. See https://arxiv.org/pdf/1312.5602.pdf.

DQN, also known as the Deep Q Network algorithm, is an off-policy algorithm that learns an action-value estimate Q(s,a) for each state-action pair. The policy then acts by taking the action with the highest Q(s,a) value for the current state s.
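The greedy action choice described above can be sketched in a few lines of plain Python. This is only an illustration of the argmax rule, not garage's implementation; `greedy_action` is a hypothetical helper, and the Q-values are made-up numbers standing in for the output of a Q-network.

```python
def greedy_action(q_values):
    """Return the index of the largest Q-value.

    Hypothetical helper for illustration. Ties resolve to the lowest
    index, because max() returns the first maximal element it sees.
    """
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Made-up Q(s, a) estimates for one state with three discrete actions.
q_values = [0.1, 1.7, -0.3]
action = greedy_action(q_values)
```

During training, an exploration policy such as epsilon-greedy would occasionally override this choice with a random action.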

Parameters

env_spec (EnvSpec) – Environment specification.
policy (garage.torch.policies.Policy) – Policy. For DQN, this is a policy that performs the action that yields the highest Q value.
qf (nn.Module) – Q-value network.
replay_buffer (ReplayBuffer) – Replay buffer.
sampler (garage.sampler.Sampler) – Sampler.
steps_per_epoch (int) – Number of train_once calls per epoch.
n_train_steps (int) – Number of training (optimization) steps performed in each train_once call.
eval_env (Environment) – Evaluation environment. If None, a copy of the main environment is used for evaluation.
double_q (bool) – Whether to use Double DQN. See https://arxiv.org/abs/1509.06461.
max_episode_length_eval (int or None) – Maximum length of episodes used for off-policy evaluation. If None, defaults to env_spec.max_episode_length.
buffer_batch_size (int) – Batch size of replay buffer.
min_buffer_size (int) – Minimum number of transitions the replay buffer must hold before training starts.
exploration_policy (ExplorationPolicy) – Exploration strategy, typically epsilon-greedy.
num_eval_episodes (int) – Number of evaluation episodes. Defaults to 10.
deterministic_eval (bool) – Whether to evaluate the policy deterministically (without exploration noise). False by default.
target_update_freq (int) – Number of optimization steps between each update to the target Q network.
discount (float) – Discount factor for the cumulative return.
qf_optimizer (Union[type, tuple[type, dict]]) – Type of optimizer for training the Q-value network. This can be an optimizer type such as torch.optim.Adam, or a tuple of a type and a dictionary, where the dictionary contains the arguments used to initialize the optimizer, e.g. (torch.optim.Adam, {'lr': 1e-3}).
qf_lr (float) – Learning rate for Q-value network parameters.
clip_rewards (float) – Clip reward to be in [-clip_rewards, clip_rewards]. If None, rewards are not clipped.
clip_gradient (float) – Clip gradient norm to clip_gradient. If None, gradients are not clipped. Defaults to 10.
reward_scale (float) – Reward scale.
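To show how double_q, discount, clip_rewards and reward_scale interact, here is a minimal sketch of a one-step target for a single transition. It is not garage's code: `td_target` and its arguments are hypothetical, the Q-values are plain lists rather than tensors, and applying reward_scale before clip_rewards is an assumption made for this example.

```python
def td_target(reward, next_q_online, next_q_target, done,
              discount=0.99, double_q=True,
              clip_rewards=None, reward_scale=1.0):
    """Compute a one-step TD target for one transition (illustrative only)."""
    # Assumption: the reward is scaled first, then clipped.
    reward = reward * reward_scale
    if clip_rewards is not None:
        reward = max(-clip_rewards, min(clip_rewards, reward))

    if double_q:
        # Double DQN: the online network picks the action,
        # the target network evaluates it.
        best = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
        next_value = next_q_target[best]
    else:
        # Standard DQN: the target network both picks and evaluates.
        next_value = max(next_q_target)

    # No bootstrapping from the next state when the episode has ended.
    return reward + (0.0 if done else discount * next_value)


# Made-up next-state Q-values from the online and target networks.
online, target = [1.0, 3.0], [2.0, 0.5]
y_double = td_target(1.0, online, target, done=False, double_q=True)
y_single = td_target(1.0, online, target, done=False, double_q=False)
```

With these numbers the Double DQN target is smaller than the standard one, because the action preferred by the online network has a low value under the target network. Reducing this kind of overestimation is the motivation for Double DQN.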

train(trainer)

Obtain samples and run the actual training for each epoch.

Parameters: trainer (Trainer) – Experiment trainer.
Returns: The average return in the last epoch cycle.
Return type: float
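As a rough guide to how steps_per_epoch, n_train_steps, min_buffer_size and target_update_freq combine, the sketch below counts optimization steps and target-network updates under a simplified schedule. It is an assumption-laden model, not garage's training loop: `count_updates` and `transitions_per_step` are hypothetical, and it assumes a fixed number of transitions is collected before each train_once call.

```python
def count_updates(n_epochs, steps_per_epoch, n_train_steps,
                  transitions_per_step, min_buffer_size, target_update_freq):
    """Count gradient steps and target syncs under a simplified schedule."""
    stored = 0        # transitions currently in the replay buffer
    grad_steps = 0    # Q-network optimization steps taken so far
    target_syncs = 0  # number of target-network updates

    for _ in range(n_epochs):
        for _ in range(steps_per_epoch):
            # Assumption: each train_once call first adds a fixed
            # number of freshly sampled transitions to the buffer.
            stored += transitions_per_step
            for _ in range(n_train_steps):
                # No optimization until the buffer is large enough.
                if stored >= min_buffer_size:
                    grad_steps += 1
                    if grad_steps % target_update_freq == 0:
                        target_syncs += 1
    return grad_steps, target_syncs


# Hypothetical setting: 1 epoch, 2 train_once calls of 3 steps each.
steps, syncs = count_updates(1, 2, 3, transitions_per_step=5,
                             min_buffer_size=10, target_update_freq=2)
```

In this setting the first train_once call performs no optimization, because the buffer holds only 5 of the required 10 transitions.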

to(device=None)

Put all the networks within the model on device.

Parameters: device (str) – ID of GPU or CPU.
