garage.np.algos package
Reinforcement learning algorithms which use NumPy as a numerical backend.

class RLAlgorithm

Bases: abc.ABC

Base class for all the algorithms.

Note: If the field sampler_cls exists, it will be used by LocalRunner.setup to initialize a sampler.

train(runner)

Obtain samplers and start actual training for each epoch.

Parameters:
- runner (LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.
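
As a minimal sketch of what a concrete subclass looks like, assuming the runner exposes obtain_samples() and that each sampled path carries a 'rewards' array (both assumptions about this garage version's sampler interface, not part of the interface documented above):

    from garage.np.algos import RLAlgorithm

    class MyAlgorithm(RLAlgorithm):
        """Toy subclass: sample every epoch and track the average return."""

        # If defined, LocalRunner.setup uses the sampler_cls field to build
        # a sampler (see the Note above); it is left out here, so a sampler
        # must be given to runner.setup explicitly.

        def __init__(self, env_spec, policy):
            self.env_spec = env_spec
            self.policy = policy

        def train(self, runner):
            last_return = None
            # step_epochs() yields once per epoch and provides snapshotting
            # and sampler control between iterations.
            for epoch in runner.step_epochs():
                paths = runner.obtain_samples(epoch)  # assumed sampler API
                last_return = sum(sum(p['rewards'])
                                  for p in paths) / len(paths)
            return last_return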

class CEM(env_spec, policy, baseline, n_samples, discount=0.99, max_path_length=500, init_std=1, best_frac=0.05, extra_std=1.0, extra_decay_time=100)

Bases: garage.np.algos.rl_algorithm.RLAlgorithm

Cross Entropy Method.

CEM works by iteratively fitting a Gaussian distribution over policy parameters.

In each epoch, CEM does the following (a NumPy sketch of step 3 follows the parameter list below):

1. Sample n_samples policies from a Gaussian distribution with mean cur_mean and std cur_std.
2. Do rollouts for each policy.
3. Update cur_mean and cur_std by doing Maximum Likelihood Estimation over the n_best top policies in terms of return.

Parameters:
- env_spec (garage.envs.EnvSpec) – Environment specification.
- policy (garage.np.policies.Policy) – Action policy.
- baseline (garage.np.baselines.Baseline) – Baseline for GAE (Generalized Advantage Estimation).
- n_samples (int) – Number of policies sampled in one epoch.
- discount (float) – Environment reward discount.
- max_path_length (int) – Maximum length of a single rollout.
- best_frac (float) – Fraction of the sampled policies, ranked by return, kept as the "best" set when the distribution is refit.
- init_std (float) – Initial std for policy param distribution.
- extra_std (float) – Decaying std added to param distribution.
- extra_decay_time (float) – Epochs that it takes to decay extra std.
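
A NumPy sketch of the step-3 update above, assuming the epoch's sampled parameter vectors and their returns have already been collected; the linear decay schedule for extra_std is an assumption modeled on the extra_decay_time description, not the implementation's verbatim formula:

    import numpy as np

    def cem_update(sampled_params, returns, best_frac=0.05,
                   extra_std=1.0, extra_decay_time=100, epoch=0):
        """Refit the Gaussian by MLE over the n_best elite samples.

        sampled_params: (n_samples, dim) array of parameter vectors.
        returns: (n_samples,) array of average discounted returns.
        """
        n_best = max(1, int(len(returns) * best_frac))
        elite = sampled_params[np.argsort(-returns)[:n_best]]
        cur_mean = elite.mean(axis=0)  # MLE mean of the elite samples
        # Assumed schedule: the extra variance decays linearly to zero over
        # extra_decay_time epochs to keep early exploration broad.
        extra_var = max(1.0 - epoch / extra_decay_time, 0.0) * extra_std ** 2
        cur_std = np.sqrt(elite.var(axis=0) + extra_var)
        return cur_mean, cur_std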

train(runner)

Initialize variables and start training.

Parameters:
- runner (LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.

Returns: The average return in the last epoch cycle.

Return type: float
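
Putting it together, a training script might look like the sketch below. LocalRunner's constructor arguments, the TfEnv and CategoricalMLPPolicy wrappers, and the batch_size value are assumptions drawn from the examples shipped with garage around this release; check the examples for your exact version before copying.

    from garage.experiment import LocalRunner
    from garage.np.algos import CEM
    from garage.np.baselines import LinearFeatureBaseline
    from garage.tf.envs import TfEnv
    from garage.tf.policies import CategoricalMLPPolicy

    with LocalRunner() as runner:
        env = TfEnv(env_name='CartPole-v1')
        policy = CategoricalMLPPolicy(env_spec=env.spec,
                                      hidden_sizes=(32, 32))
        baseline = LinearFeatureBaseline(env_spec=env.spec)
        algo = CEM(env_spec=env.spec,
                   policy=policy,
                   baseline=baseline,
                   n_samples=20,
                   best_frac=0.05,
                   max_path_length=100)
        # setup also picks up algo.sampler_cls when it is defined
        # (see the Note on RLAlgorithm above).
        runner.setup(algo, env)
        runner.train(n_epochs=100, batch_size=1000)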

class CMAES(env_spec, policy, baseline, n_samples, discount=0.99, max_path_length=500, sigma0=1.0)

Bases: garage.np.algos.rl_algorithm.RLAlgorithm

Covariance Matrix Adaptation Evolution Strategy.

Note: The CMA-ES method can hardly learn a successful policy even for simple tasks. It is still maintained here only for consistency with the original rllab paper.

Parameters:
- env_spec (garage.envs.EnvSpec) – Environment specification.
- policy (garage.np.policies.Policy) – Action policy.
- baseline (garage.np.baselines.Baseline) – Baseline for GAE (Generalized Advantage Estimation).
- n_samples (int) – Number of policies sampled in one epoch.
- discount (float) – Environment reward discount.
- max_path_length (int) – Maximum length of a single rollout.
- sigma0 (float) – Initial std for param distribution.

train(runner)

Initialize variables and start training.

Parameters:
- runner (LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.

Returns: The average return in the last epoch cycle.

Return type: float
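
For intuition, one epoch can be sketched as an ask/tell loop with the standalone cma package; that this class wraps cma, and the init_params, n_samples, and evaluate_returns names below, are assumptions for illustration only:

    import cma
    import numpy as np

    init_params = np.zeros(10)  # placeholder initial policy parameters
    n_samples = 20              # placeholder population size

    def evaluate_returns(params):
        """Hypothetical helper: load params into the policy, do rollouts,
        and return the average discounted return."""
        return -np.sum((params - 1.0) ** 2)  # stand-in objective

    es = cma.CMAEvolutionStrategy(init_params, 1.0,  # sigma0 = 1.0
                                  {'popsize': n_samples})
    candidates = es.ask()                       # n_samples parameter vectors
    returns = [evaluate_returns(p) for p in candidates]
    es.tell(candidates, [-r for r in returns])  # cma minimizes, so negate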

class MetaRLAlgorithm

Bases: garage.np.algos.rl_algorithm.RLAlgorithm, abc.ABC

Base class for Meta-RL Algorithms.

adapt_policy(exploration_policy, exploration_trajectories)

Produce a policy adapted for a task.

Parameters:
- exploration_policy (garage.Policy) – A policy which was returned from get_exploration_policy(), and which generated exploration_trajectories by interacting with an environment. The caller may not use this object after passing it into this method.
- exploration_trajectories (garage.TrajectoryBatch) – Trajectories to adapt to, generated by exploration_policy exploring the environment.

Returns: A policy adapted to the task represented by the exploration_trajectories.

Return type: garage.Policy
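
The intended calling sequence at meta-test time looks like the sketch below; algo, env, and the rollout helper are hypothetical placeholders, and get_exploration_policy() is the companion method referenced in the parameter description above:

    # Explore the new task with a throwaway exploration policy, then adapt.
    exploration_policy = algo.get_exploration_policy()
    trajectories = rollout(env, exploration_policy)  # a garage.TrajectoryBatch
    adapted_policy = algo.adapt_policy(exploration_policy, trajectories)
    # exploration_policy must not be used after this call; evaluate
    # adapted_policy on the same task instead.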

class NOP

Bases: garage.np.algos.rl_algorithm.RLAlgorithm

NOP (no optimization performed) policy search algorithm.

optimize_policy(paths)

Optimize the policy using the samples.

Parameters:
- paths (list[dict]) – A list of collected paths.

train(runner)

Obtain samplers and start actual training for each epoch.

Parameters:
- runner (LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.