garage.tf.algos.npo module
Natural Policy Gradient Optimization.
class NPO(env_spec, policy, baseline, scope=None, max_path_length=100, discount=0.99, gae_lambda=1, center_adv=True, positive_adv=False, fixed_horizon=False, pg_loss='surrogate', lr_clip_range=0.01, max_kl_step=0.01, optimizer=None, optimizer_args=None, policy_ent_coeff=0.0, use_softplus_entropy=False, use_neg_logli_entropy=False, stop_entropy_gradient=False, entropy_method='no_entropy', flatten_input=True, name='NPO')

Bases: garage.np.algos.rl_algorithm.RLAlgorithm
Natural Policy Gradient Optimization.
Parameters: - env_spec (garage.envs.EnvSpec) – Environment specification.
- policy (garage.tf.policies.StochasticPolicy) – Policy.
- baseline (garage.tf.baselines.Baseline) – The baseline.
- scope (str) – Scope for identifying the algorithm. Must be specified if running multiple algorithms simultaneously, each using different environments and policies.
- max_path_length (int) – Maximum length of a single rollout.
- discount (float) – Discount factor for future rewards.
- gae_lambda (float) – Lambda used for generalized advantage estimation.
- center_adv (bool) – Whether to rescale the advantages so that they have mean 0 and standard deviation 1.
- positive_adv (bool) – Whether to shift the advantages so that they are always positive. When used in conjunction with center_adv the advantages will be standardized before shifting.
- fixed_horizon (bool) – Whether to fix horizon.
- pg_loss (str) – A string from: ‘vanilla’, ‘surrogate’, ‘surrogate_clip’. The type of loss functions to use.
- lr_clip_range (float) – The limit on the likelihood ratio between policies, as in PPO.
- max_kl_step (float) – The maximum KL divergence between old and new policies, as in TRPO.
- optimizer (object) – The optimizer of the algorithm. Should be the optimizers in garage.tf.optimizers.
- optimizer_args (dict) – The arguments of the optimizer.
- policy_ent_coeff (float) – The coefficient of the policy entropy. Setting it to zero would mean no entropy regularization.
- use_softplus_entropy (bool) – Whether to pass the estimated entropy through a softplus transformation to prevent the entropy from being negative.
- use_neg_logli_entropy (bool) – Whether to estimate the entropy as the negative log likelihood of the action.
- stop_entropy_gradient (bool) – Whether to stop the entropy gradient.
- entropy_method (str) – A string from: ‘max’, ‘regularized’, ‘no_entropy’. The type of entropy method to use. ‘max’ adds the dense entropy to the reward for each time step. ‘regularized’ adds the mean entropy to the surrogate objective. See https://arxiv.org/abs/1805.00909 for more details.
- flatten_input (bool) – Whether to flatten input along the observation dimension. If True, for example, an observation with shape (2, 4) will be flattened to 8.
- name (str) – The name of the algorithm.
Note

Sane defaults for entropy configuration:

- entropy_method='max', center_adv=False, stop_entropy_gradient=True (center_adv normalizes the advantages tensor, which will significantly alleviate the effect of entropy. It is also recommended to turn off the entropy gradient so that the agent will focus on high-entropy actions instead of increasing the variance of the distribution.)
- entropy_method='regularized', stop_entropy_gradient=False, use_neg_logli_entropy=False
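Example (a minimal usage sketch, not taken from the garage documentation). The helper classes and import paths below (LocalTFRunner, TfEnv, GaussianMLPPolicy, LinearFeatureBaseline) are assumptions that may differ between garage releases; NPO itself is constructed with the parameters documented above.

    import gym

    from garage.experiment import LocalTFRunner   # assumed location; varies by release
    from garage.np.baselines import LinearFeatureBaseline
    from garage.tf.algos import NPO
    from garage.tf.envs import TfEnv
    from garage.tf.policies import GaussianMLPPolicy

    with LocalTFRunner() as runner:
        env = TfEnv(gym.make('Pendulum-v0'))
        policy = GaussianMLPPolicy(env_spec=env.spec, hidden_sizes=(32, 32))
        baseline = LinearFeatureBaseline(env_spec=env.spec)

        algo = NPO(env_spec=env.spec,
                   policy=policy,
                   baseline=baseline,
                   max_path_length=100,
                   discount=0.99,
                   gae_lambda=0.97,
                   pg_loss='surrogate',
                   max_kl_step=0.01)

        runner.setup(algo, env)
        runner.train(n_epochs=100, batch_size=4000)  # the runner drives algo.train()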
log_diagnostics(paths)

Log diagnostic information.
Parameters: paths (list[dict]) – A list of collected paths.
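Each entry in paths is a dictionary describing one rollout. The per-path keys sketched below are an assumption based on common garage sampler output and may vary between releases.

    # Hypothetical per-rollout dictionary; keys are assumptions.
    path = {
        'observations': ...,   # array of shape (T, obs_dim)
        'actions': ...,        # array of shape (T, act_dim)
        'rewards': ...,        # array of shape (T,)
        'agent_infos': {...},  # per-step policy outputs, e.g. distribution parameters
        'env_infos': {...},    # per-step info dicts returned by the environment
    }
    algo.log_diagnostics([path])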
optimize_policy(samples_data)

Optimize policy.
Parameters: samples_data (dict) – Processed sample data. See garage.tf.paths_to_tensors() for details.
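For orientation, the processed dictionary passed to optimize_policy() typically looks like the sketch below. The keys shown are assumptions based on garage.tf.paths_to_tensors() and may differ between releases.

    # Hypothetical layout of samples_data; keys are assumptions, not a guaranteed schema.
    samples_data = {
        'observations': ...,    # padded observation tensor
        'actions': ...,         # padded action tensor
        'rewards': ...,
        'baselines': ...,       # baseline predictions used to compute advantages
        'returns': ...,         # discounted returns
        'valids': ...,          # mask marking valid (non-padding) time steps
        'agent_infos': {...},   # e.g. distribution parameters of the sampling policy
        'env_infos': {...},
        'paths': [...],         # the unprocessed rollouts
        'average_return': 0.0,
    }
    algo.optimize_policy(samples_data)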
train(runner)

Obtain samplers and start actual training for each epoch.
Parameters: runner (LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.
Returns: The average return in the last epoch cycle.
Return type: float
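For reference, the epoch loop that train() follows is roughly the pattern sketched below. This is an illustrative sketch rather than the actual garage implementation; process_samples() is a placeholder name for whatever sample-processing step the algorithm performs.

    # Illustrative sketch of the epoch loop; not the actual implementation.
    def train(self, runner):
        last_return = None
        for _ in runner.step_epochs():       # runner handles snapshotting and sampler control
            paths = runner.obtain_samples(runner.step_itr)
            samples_data = self.process_samples(runner.step_itr, paths)  # placeholder step
            self.log_diagnostics(paths)
            self.optimize_policy(samples_data)
            last_return = samples_data['average_return']
            runner.step_itr += 1
        return last_return                   # average return in the last epoch cycle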