garage.np.algos package

Reinforcement learning algorithms which use NumPy as a numerical backend.

class RLAlgorithm[source]

Bases: abc.ABC

Base class for all the algorithms.

Note

If the field sampler_cls exists, it will be used by LocalRunner.setup to initialize a sampler.

train(runner)[source]

Obtain samplers and start actual training for each epoch.

Parameters:runner (LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.
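
A minimal sketch of the contract this base class defines. MyAlgo and its _update helper are hypothetical; the obtain_samples call and the LocalSampler choice are assumptions about garage's usual sampler interface and may differ in your version.

    from garage.np.algos import RLAlgorithm
    from garage.sampler import LocalSampler  # assumed sampler; any sampler_cls works


    class MyAlgo(RLAlgorithm):
        """Hypothetical algorithm illustrating the train() contract."""

        # If present, LocalRunner.setup uses this field to build the sampler.
        sampler_cls = LocalSampler

        def __init__(self, policy, max_path_length=100):
            self.policy = policy
            self.max_path_length = max_path_length

        def train(self, runner):
            last_return = None
            # step_epochs() yields epoch numbers and provides snapshotting.
            for epoch in runner.step_epochs():
                paths = runner.obtain_samples(epoch)  # assumed LocalRunner helper
                last_return = self._update(paths)     # hypothetical update step
            return last_return

        def _update(self, paths):
            # Placeholder policy-improvement step: report mean undiscounted return.
            return sum(sum(p['rewards']) for p in paths) / len(paths)
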
class CEM(env_spec, policy, baseline, n_samples, discount=0.99, max_path_length=500, init_std=1, best_frac=0.05, extra_std=1.0, extra_decay_time=100)[source]

Bases: garage.np.algos.rl_algorithm.RLAlgorithm

Cross Entropy Method.

CEM works by iteratively optimizing a Gaussian distribution over policy parameters.

In each epoch, CEM does the following (an illustrative NumPy sketch of one such epoch follows the parameter list):
  1. Sample n_samples policies from a Gaussian distribution with mean cur_mean and std cur_std.
  2. Do rollouts for each sampled policy.
  3. Update cur_mean and cur_std by doing Maximum Likelihood Estimation over the n_best top policies in terms of return.
Parameters:
  • env_spec (garage.envs.EnvSpec) – Environment specification.
  • policy (garage.np.policies.Policy) – Action policy.
  • baseline (garage.np.baselines.Baseline) – Baseline for GAE (Generalized Advantage Estimation).
  • n_samples (int) – Number of policies sampled in one epoch.
  • discount (float) – Environment reward discount.
  • max_path_length (int) – Maximum length of a single rollout.
  • best_frac (float) – Fraction of the sampled policies, ranked by return, kept to refit the distribution.
  • init_std (float) – Initial std for policy param distribution.
  • extra_std (float) – Decaying std added to param distribution.
  • extra_decay_time (float) – Epochs that it takes to decay extra std.
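
As a reading aid, here is an illustrative NumPy sketch of one CEM epoch over a flat parameter vector. It is not garage's internal implementation: sample_return stands in for a rollout that returns the (discounted) return of a parameter vector, and the exact decay schedule for extra_std is an assumption.

    import numpy as np


    def cem_epoch(sample_return, cur_mean, cur_std, n_samples, best_frac,
                  extra_std, extra_decay_time, epoch):
        """One illustrative CEM epoch; returns the refit (mean, std)."""
        # 1. Sample n_samples parameter vectors; the extra exploration
        #    variance decays to zero over extra_decay_time epochs (assumed form).
        extra_var = extra_std ** 2 * max(0.0, 1.0 - epoch / extra_decay_time)
        sample_std = np.sqrt(cur_std ** 2 + extra_var)
        params = cur_mean + sample_std * np.random.randn(n_samples, len(cur_mean))

        # 2. Do a rollout for each sampled parameter vector.
        returns = np.array([sample_return(p) for p in params])

        # 3. MLE refit of the Gaussian on the n_best top-return samples.
        n_best = max(1, int(n_samples * best_frac))
        best = params[np.argsort(-returns)[:n_best]]
        return best.mean(axis=0), best.std(axis=0)
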
train(runner)[source]

Initialize variables and start training.

Parameters:runner (LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.
Returns:The average return in the last epoch cycle.
Return type:float
train_once(itr, paths)[source]

Perform one step of policy optimization given one batch of samples.

Parameters:
  • itr (int) – Iteration number.
  • paths (list[dict]) – A list of collected paths.
Returns:

The average return of the epoch cycle.

Return type:

float
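
A hedged usage sketch: run_cem and the chosen hyperparameter values are illustrative, the runner is assumed to be an already-constructed LocalRunner, and env and policy are assumed to come from one of garage's backends.

    from garage.np.algos import CEM
    from garage.np.baselines import LinearFeatureBaseline


    def run_cem(runner, env, policy):
        """Wire CEM into an existing LocalRunner (all three arguments assumed)."""
        baseline = LinearFeatureBaseline(env_spec=env.spec)
        algo = CEM(env_spec=env.spec,
                   policy=policy,
                   baseline=baseline,
                   n_samples=20,        # policies sampled per epoch
                   best_frac=0.05,      # top 5% kept for the MLE refit
                   max_path_length=100)
        runner.setup(algo, env)
        # runner.train() drives algo.train(runner) through step_epochs().
        return runner.train(n_epochs=100, batch_size=1000)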

class CMAES(env_spec, policy, baseline, n_samples, discount=0.99, max_path_length=500, sigma0=1.0)[source]

Bases: garage.np.algos.rl_algorithm.RLAlgorithm

Covariance Matrix Adaptation Evolution Strategy.

Note

The CMA-ES method can hardly learn a successful policy even for simple tasks. It is maintained here only for consistency with the original rllab paper.

Parameters:
  • env_spec (garage.envs.EnvSpec) – Environment specification.
  • policy (garage.np.policies.Policy) – Action policy.
  • baseline (garage.np.baselines.Baseline) – Baseline for GAE (Generalized Advantage Estimation).
  • n_samples (int) – Number of policies sampled in one epoch.
  • discount (float) – Environment reward discount.
  • max_path_length (int) – Maximum length of a single rollout.
  • sigma0 (float) – Initial std for param distribution.
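
Construction mirrors the CEM sketch above; only the search-distribution parameter differs (sigma0 in place of init_std, best_frac, extra_std, and extra_decay_time). A hedged snippet, with runner, env, and policy assumed pre-built as before:

    from garage.np.algos import CMAES
    from garage.np.baselines import LinearFeatureBaseline


    def run_cmaes(runner, env, policy):
        """Same wiring as run_cem above; all three arguments are assumed."""
        baseline = LinearFeatureBaseline(env_spec=env.spec)
        algo = CMAES(env_spec=env.spec,
                     policy=policy,
                     baseline=baseline,
                     n_samples=20,
                     sigma0=1.0)   # initial std of the CMA-ES search distribution
        runner.setup(algo, env)
        return runner.train(n_epochs=100, batch_size=1000)
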
train(runner)[source]

Initialize variables and start training.

Parameters:runner (LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.
Returns:The average return in the last epoch cycle.
Return type:float
train_once(itr, paths)[source]

Perform one step of policy optimization given one batch of samples.

Parameters:
  • itr (int) – Iteration number.
  • paths (list[dict]) – A list of collected paths.
Returns:

The average return in the last epoch cycle.

Return type:

float

class MetaRLAlgorithm[source]

Bases: garage.np.algos.rl_algorithm.RLAlgorithm, abc.ABC

Base class for Meta-RL Algorithms.

adapt_policy(exploration_policy, exploration_trajectories)[source]

Produce a policy adapted for a task.

Parameters:
  • exploration_policy (garage.Policy) – A policy which was returned from get_exploration_policy(), and which generated exploration_trajectories by interacting with an environment. The caller may not use this object after passing it into this method.
  • exploration_trajectories (garage.TrajectoryBatch) – Trajectories to adapt to, generated by exploration_policy exploring the environment.
Returns:

A policy adapted to the task represented by the exploration_trajectories.

Return type:

garage.Policy

get_exploration_policy()[source]

Return a policy used before adaptation to a specific task.

Each time it is retrieved, this policy should only be evaluated in one task.

Returns:
The policy used to obtain samples that are later
used for meta-RL adaptation.
Return type:garage.Policy
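
A hedged sketch of the two-step protocol these methods define. adapt_to_task and collect_trajectories are hypothetical names; collect_trajectories stands in for whatever sampler you use and is assumed to return a garage.TrajectoryBatch.

    def adapt_to_task(algo, task_env, collect_trajectories):
        """Explore one task, then produce a task-adapted policy."""
        # 1. Fresh exploration policy; it should be evaluated in a single task.
        exploration_policy = algo.get_exploration_policy()

        # 2. Gather exploration trajectories in that task.
        exploration_trajectories = collect_trajectories(task_env,
                                                        exploration_policy)

        # 3. Adapt. The exploration policy must not be reused after this call.
        return algo.adapt_policy(exploration_policy, exploration_trajectories)
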
class NOP[source]

Bases: garage.np.algos.rl_algorithm.RLAlgorithm

NOP (no optimization performed) policy search algorithm.

init_opt()[source]

Initialize the optimization procedure.

optimize_policy(paths)[source]

Optimize the policy using the samples.

Parameters:paths (list[dict]) – A list of collected paths.
train(runner)[source]

Obtain samplers and start actual training for each epoch.

Parameters:runner (LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.