garage.torch.algos.dqn
This module creates a DQN model in PyTorch.
- class DQN(env_spec, policy, qf, replay_buffer, sampler, exploration_policy=None, eval_env=None, double_q=True, qf_optimizer=torch.optim.Adam, *, steps_per_epoch=20, n_train_steps=50, max_episode_length_eval=None, deterministic_eval=False, buffer_batch_size=64, min_buffer_size=int(10000.0), num_eval_episodes=10, discount=0.99, qf_lr=_Default(0.001), clip_rewards=None, clip_gradient=10, target_update_freq=5, reward_scale=1.0)
Bases: garage.np.algos.RLAlgorithm
DQN algorithm. See https://arxiv.org/pdf/1312.5602.pdf.
DQN, also known as the Deep Q-Network algorithm, is an off-policy algorithm that learns an action-value estimate Q(s,a) for each state-action pair. The policy then acts greedily: in a given state s, it takes the action with the highest Q(s,a) value.
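The greedy rule described above can be illustrated with a minimal sketch in plain Python. It is not garage's implementation: the helper name `greedy_action` and the sample Q-values are hypothetical and chosen only for illustration.

```python
def greedy_action(q_values):
    """Return the index of the largest Q-value.

    Ties are broken in favour of the lowest action index.
    """
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical Q(s, a) estimates for one state with three discrete actions.
q_values = [0.1, 1.7, -0.4]
action = greedy_action(q_values)  # index of the action with the highest Q-value
```

During training, the exploration policy typically replaces some of these greedy choices with random actions, while evaluation can use the greedy choice alone.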
- Parameters
env_spec (EnvSpec) – Environment specification.
policy (garage.torch.policies.Policy) – Policy. For DQN, this is a policy that performs the action that yields the highest Q value.
qf (nn.Module) – Q-value network.
replay_buffer (ReplayBuffer) – Replay buffer.
sampler (garage.sampler.Sampler) – Sampler.
steps_per_epoch (int) – Number of train_once calls per epoch.
n_train_steps (int) – Number of training (optimization) steps performed in each train_once call.
eval_env (Environment) – Evaluation environment. If None, a copy of the main environment is used for evaluation.
double_q (bool) – Whether to use Double DQN. See https://arxiv.org/abs/1509.06461.
max_episode_length_eval (int or None) – Maximum length of episodes used for off-policy evaluation. If None, defaults to env_spec.max_episode_length.
buffer_batch_size (int) – Batch size of replay buffer.
min_buffer_size (int) – The minimum buffer size for replay buffer.
exploration_policy (ExplorationPolicy) – Exploration strategy, typically epsilon-greedy.
num_eval_episodes (int) – Number of evaluation episodes. Defaults to 10.
deterministic_eval (bool) – Whether to evaluate the policy deterministically (without exploration noise). False by default.
target_update_freq (int) – Number of optimization steps between each update to the target Q network.
discount (float) – Discount factor for the cumulative return.
qf_optimizer (Union[type, tuple[type, dict]]) – Type of optimizer for training the Q-value network. This can be an optimizer type such as torch.optim.Adam, or a tuple of a type and a dictionary, where the dictionary contains the arguments used to initialize the optimizer, e.g. (torch.optim.Adam, {'lr': 1e-3}).
qf_lr (float) – Learning rate for Q-value network parameters.
clip_rewards (float) – Clip rewards to the range [-clip_rewards, clip_rewards]. If None, rewards are not clipped.
clip_gradient (float) – Clip the gradient norm to clip_gradient. If None, gradients are not clipped. Defaults to 10.
reward_scale (float) – Multiplicative scale applied to rewards.
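To show how double_q, discount, clip_rewards and reward_scale could interact, here is a minimal sketch of a one-step TD target for a single transition, written in plain Python. It is an illustration under stated assumptions, not garage's code: the helper names `clip` and `td_target` are hypothetical, and applying the reward scale before clipping is an assumption about the ordering.

```python
def clip(value, limit):
    """Clamp value to [-limit, limit]; return it unchanged if limit is None."""
    if limit is None:
        return value
    return max(-limit, min(limit, value))


def td_target(reward, done, next_q_online, next_q_target,
              discount=0.99, double_q=True, clip_rewards=None,
              reward_scale=1.0):
    """Compute a one-step TD target for one transition.

    next_q_online and next_q_target hold the Q-values of every action in
    the next state, from the online and target networks respectively.
    """
    # Assumption: the reward is scaled first and clipped afterwards.
    reward = clip(reward * reward_scale, clip_rewards)
    if double_q:
        # Double DQN: the online network selects the action,
        # the target network evaluates it.
        best = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
        next_value = next_q_target[best]
    else:
        # Standard DQN: the target network selects and evaluates.
        next_value = max(next_q_target)
    # Terminal transitions do not bootstrap from the next state.
    return reward + (0.0 if done else discount * next_value)
```

With next_q_online = [1.0, 3.0] and next_q_target = [2.0, 0.5], the Double DQN variant bootstraps from 0.5 (the target network's value for the action the online network prefers), whereas the standard variant bootstraps from 2.0. Decoupling selection from evaluation in this way is what reduces the overestimation discussed in the Double DQN paper.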
- train(trainer)
Obtain samples and run the actual training for each epoch.