garage.tf.algos.reps module
Relative Entropy Policy Search implementation in TensorFlow.
class REPS(env_spec, policy, baseline, max_path_length=500, discount=0.99, gae_lambda=1, center_adv=True, positive_adv=False, fixed_horizon=False, epsilon=0.5, l2_reg_dual=0.0, l2_reg_loss=0.0, optimizer=<class 'garage.tf.optimizers.lbfgs_optimizer.LbfgsOptimizer'>, optimizer_args=None, dual_optimizer=<function fmin_l_bfgs_b>, dual_optimizer_args=None, name='REPS')[source]
Bases: garage.np.algos.rl_algorithm.RLAlgorithm
Relative Entropy Policy Search.
References
[1] J. Peters, K. Mulling, and Y. Altun, "Relative Entropy Policy Search," Artif. Intell., pp. 1607-1612, 2008.
Example
$ python garage/examples/tf/reps_gym_cartpole.py
Parameters:
- env_spec (garage.envs.EnvSpec) – Environment specification.
- policy (garage.tf.policies.StochasticPolicy) – Policy.
- baseline (garage.tf.baselines.Baseline) – The baseline.
- scope (str) – Scope for identifying the algorithm. Must be specified if running multiple algorithms simultaneously, each using different environments and policies.
- max_path_length (int) – Maximum length of a single rollout.
- discount (float) – Discount.
- gae_lambda (float) – Lambda used for generalized advantage estimation.
- center_adv (bool) – Whether to rescale the advantages so that they have mean 0 and standard deviation 1.
- positive_adv (bool) – Whether to shift the advantages so that they are always positive. When used in conjunction with center_adv the advantages will be standardized before shifting.
- fixed_horizon (bool) – Whether to use a fixed horizon for all rollouts.
- epsilon (float) – Dual function parameter: the bound on the KL divergence between the sample distributions of successive iterations.
- l2_reg_dual (float) – Coefficient for L2 regularization of the dual function.
- l2_reg_loss (float) – Coefficient for L2 regularization of the policy loss.
- optimizer (object) – The optimizer of the algorithm. Should be the optimizers in garage.tf.optimizers.
- optimizer_args (dict) – Arguments of the optimizer.
- dual_optimizer (object) – Dual func optimizer.
- dual_optimizer_args (dict) – Arguments of the dual optimizer.
- name (str) – Name of the algorithm.
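To illustrate what the epsilon, l2_reg_dual, and dual_optimizer parameters control, below is a minimal, self-contained sketch of the REPS dual minimization using scipy.optimize.fmin_l_bfgs_b (the default dual_optimizer). This is an illustrative reconstruction, not garage's implementation; in particular the exact form of the L2 regularization term is an assumption.

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b


def reps_dual(eta, advantages, epsilon, l2_reg_dual=0.0):
    """REPS dual objective g(eta), minimized over the temperature eta > 0."""
    max_adv = advantages.max()
    # eta * epsilon + eta * log E[exp(delta / eta)], written in a
    # numerically stable log-sum-exp form.
    g = eta * epsilon + max_adv + eta * np.log(
        np.mean(np.exp((advantages - max_adv) / eta)))
    # Assumed regularizer: penalizing both eta and 1/eta is one common
    # form; the exact term weighted by `l2_reg_dual` may differ in garage.
    g += l2_reg_dual * (eta ** 2 + (1.0 / eta) ** 2)
    return g


rng = np.random.default_rng(0)
advantages = rng.normal(size=256)  # stand-in for per-sample advantages
epsilon = 0.5                      # the KL bound (`epsilon` above)

# Minimize the dual over eta, bounded away from zero.
eta_opt, g_opt, _ = fmin_l_bfgs_b(
    lambda x: reps_dual(x[0], advantages, epsilon),
    x0=np.array([1.0]),
    approx_grad=True,
    bounds=[(1e-6, None)])
```

The optimized temperature eta_opt is what converts advantages into sample weights for the policy update; a larger epsilon permits a lower temperature and hence a greedier reweighting.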
log_diagnostics(paths)[source]
Log diagnostic information.
Parameters: paths (list[dict]) – A list of collected paths.
optimize_policy(samples_data)[source]
Optimize the policy using the samples.
Parameters: samples_data (dict) – Processed sample data. See garage.tf.paths_to_tensors() for details.
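Conceptually, REPS reweights the collected samples by exponentiated advantages and then refits the policy by weighted maximum likelihood. The following hypothetical standalone sketch (not garage's code) shows the weight computation that this step is built on:

```python
import numpy as np


def reps_weights(advantages, eta):
    """Per-sample weights proportional to exp(advantage / eta)."""
    # Subtract the max advantage before exponentiating for stability.
    w = np.exp((advantages - advantages.max()) / eta)
    return w / w.sum()


rng = np.random.default_rng(1)
advantages = rng.normal(size=8)     # stand-in advantages for 8 samples
weights = reps_weights(advantages, eta=0.8)
# The policy is then refit by weighted maximum likelihood:
#   maximize sum_i weights[i] * log pi(a_i | s_i)
# (plus an L2 penalty controlled by the `l2_reg_loss` coefficient above).
```

Samples with higher advantage receive exponentially larger weight, with the temperature eta (obtained from the dual optimization) controlling how greedy the reweighting is.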
train(runner)[source]
Obtain samplers and start actual training for each epoch.
Parameters: runner (LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.
Returns: The average return in the last epoch cycle.
Return type: float