garage.torch.algos.vpg module

Vanilla Policy Gradient (REINFORCE).

class VPG(env_spec, policy, value_function, policy_optimizer=None, vf_optimizer=None, max_path_length=500, num_train_per_epoch=1, discount=0.99, gae_lambda=1, center_adv=True, positive_adv=False, policy_ent_coeff=0.0, use_softplus_entropy=False, stop_entropy_gradient=False, entropy_method='no_entropy')[source]

Bases: garage.np.algos.rl_algorithm.RLAlgorithm

Vanilla Policy Gradient (REINFORCE).

VPG, also known as REINFORCE, trains a stochastic policy in an on-policy way.

Parameters:
  • env_spec (garage.envs.EnvSpec) – Environment specification.
  • policy (garage.torch.policies.Policy) – Policy.
  • value_function (garage.torch.value_functions.ValueFunction) – The value function.
  • policy_optimizer (garage.torch.optimizer.OptimizerWrapper) – Optimizer for policy.
  • vf_optimizer (garage.torch.optimizer.OptimizerWrapper) – Optimizer for value function.
  • max_path_length (int) – Maximum length of a single rollout.
  • num_train_per_epoch (int) – Number of train_once calls per epoch.
  • discount (float) – Discount.
  • gae_lambda (float) – Lambda used for generalized advantage estimation.
  • center_adv (bool) – Whether to rescale the advantages so that they have mean 0 and standard deviation 1.
  • positive_adv (bool) – Whether to shift the advantages so that they are always positive. When used in conjunction with center_adv the advantages will be standardized before shifting.
  • policy_ent_coeff (float) – The coefficient of the policy entropy. Setting it to zero would mean no entropy regularization.
  • use_softplus_entropy (bool) – Whether to pass the entropy estimate through a softplus to prevent the entropy from being negative.
  • stop_entropy_gradient (bool) – Whether to stop the entropy gradient.
  • entropy_method (str) – A string from: ‘max’, ‘regularized’, ‘no_entropy’. The type of entropy method to use. ‘max’ adds the dense entropy to the reward for each time step. ‘regularized’ adds the mean entropy to the surrogate objective. See https://arxiv.org/abs/1805.00909 for more details.
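
Example usage (a minimal sketch modeled on garage's torch VPG example; names such as GarageEnv, LocalRunner, GaussianMLPPolicy and GaussianMLPValueFunction match the 2020.06-era API and may differ in other releases):

    import gym
    import torch

    from garage import wrap_experiment
    from garage.envs import GarageEnv
    from garage.experiment import LocalRunner
    from garage.experiment.deterministic import set_seed
    from garage.torch.algos import VPG
    from garage.torch.policies import GaussianMLPPolicy
    from garage.torch.value_functions import GaussianMLPValueFunction

    @wrap_experiment
    def vpg_pendulum(ctxt=None, seed=1):
        """Train VPG on InvertedDoublePendulum-v2."""
        set_seed(seed)
        runner = LocalRunner(ctxt)
        env = GarageEnv(gym.make('InvertedDoublePendulum-v2'))

        policy = GaussianMLPPolicy(env.spec,
                                   hidden_sizes=[64, 64],
                                   hidden_nonlinearity=torch.tanh,
                                   output_nonlinearity=None)
        value_function = GaussianMLPValueFunction(env_spec=env.spec,
                                                  hidden_sizes=[32, 32])

        algo = VPG(env_spec=env.spec,
                   policy=policy,
                   value_function=value_function,
                   max_path_length=100,
                   discount=0.99,
                   center_adv=True)

        runner.setup(algo, env)
        runner.train(n_epochs=100, batch_size=10000)

    vpg_pendulum(seed=1)
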
process_samples(paths)[source]

Process sample data based on the collected paths.

Notes: P is the maximum path length (self.max_path_length)

Parameters:paths (list[dict]) – A list of collected paths.
Returns:
  • torch.Tensor: The observations of the environment with shape \((N, P, O*)\).
  • torch.Tensor: The actions fed to the environment with shape \((N, P, A*)\).
  • torch.Tensor: The acquired rewards with shape \((N, P)\).
  • list[int]: Numbers of valid steps in each path.
  • torch.Tensor: Value function estimation at each step with shape \((N, P)\).
Return type:tuple
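
The per-path data have unequal lengths, so they are padded to P and stacked along a new batch dimension N. A hypothetical sketch of that padding step (not the library's implementation; it assumes each path dict carries precomputed 'baselines', whereas the algorithm itself queries its value function):

    import numpy as np
    import torch

    def pad_and_stack(paths, max_path_length):
        """Pad each path to length P and stack into (N, P, ...) tensors."""
        valids = [len(p['rewards']) for p in paths]  # valid steps per path

        def pad_key(key):
            out = np.zeros((len(paths), max_path_length) + paths[0][key].shape[1:])
            for i, p in enumerate(paths):
                out[i, :len(p[key])] = p[key]
            return torch.as_tensor(out, dtype=torch.float32)

        obs = pad_key('observations')     # (N, P, O*)
        actions = pad_key('actions')      # (N, P, A*)
        rewards = pad_key('rewards')      # (N, P)
        baselines = pad_key('baselines')  # (N, P), value-function estimates
        return obs, actions, rewards, valids, baselines
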
train(runner)[source]

Obtain samples and run training for each epoch.

Parameters:runner (LocalRunner) – LocalRunner is passed to give algorithm the access to runner.step_epochs(), which provides services such as snapshotting and sampler control.
Returns:The average return in the last epoch cycle.
Return type:float
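
In outline, the loop delegates sampling and bookkeeping to the runner (a sketch, assuming num_train_per_epoch is stored as self._n_samples; logging and snapshot details are omitted):

    def train(self, runner):
        """Sketch of the per-epoch training loop."""
        last_return = None
        # step_epochs() provides snapshotting and sampler control.
        for _ in runner.step_epochs():
            for _ in range(self._n_samples):  # num_train_per_epoch
                paths = runner.obtain_samples(runner.step_itr)
                last_return = self.train_once(runner.step_itr, paths)
                runner.step_itr += 1
        return last_return
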
train_once(itr, paths)[source]

Train the algorithm once.

Parameters:
  • itr (int) – Iteration number.
  • paths (list[dict]) – A list of collected paths.
Returns:Calculated mean value of undiscounted returns.
Return type:numpy.float64
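
The reported value is the average undiscounted return over the collected paths, equivalent to the following hypothetical helper:

    import numpy as np

    def average_undiscounted_return(paths):
        """Mean of the per-path sums of rewards."""
        return np.float64(np.mean([path['rewards'].sum() for path in paths]))
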