garage.np.algos¶
Reinforcement learning algorithms which use NumPy as a numerical backend.

class CEM(env_spec, policy, n_samples, discount=0.99, init_std=1, best_frac=0.05, extra_std=1.0, extra_decay_time=100)¶
Bases: garage.np.algos.rl_algorithm.RLAlgorithm
Cross Entropy Method.
CEM works by iteratively optimizing a Gaussian distribution over policy parameters.
In each epoch, CEM does the following:
1. Sample n_samples policies from a Gaussian distribution with mean cur_mean and std cur_std.
2. Collect episodes for each sampled policy.
3. Update cur_mean and cur_std by Maximum Likelihood Estimation over the n_best top policies ranked by return.
 Parameters
env_spec (EnvSpec) – Environment specification.
policy (garage.np.policies.Policy) – Action policy.
n_samples (int) – Number of policies sampled in one epoch.
discount (float) – Environment reward discount.
init_std (float) – Initial std for the policy parameter distribution.
best_frac (float) – Fraction of the sampled policies, ranked by return, used to refit the distribution each epoch.
extra_std (float) – Decaying std added to param distribution.
extra_decay_time (float) – Epochs that it takes to decay extra std.
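The epoch loop described above can be sketched in plain NumPy. This is an illustration of the method, not garage's implementation; the `evaluate` callable is a hypothetical stand-in for collecting episodes with the sampled policy and computing its discounted return.

```python
import numpy as np

def cem_epoch(cur_mean, cur_std, evaluate, n_samples=20, best_frac=0.05,
              extra_std=1.0, extra_decay_time=100, epoch=0):
    """One CEM epoch: sample, evaluate, refit the Gaussian to the elites.

    `evaluate` maps a flat parameter vector to an estimated return; it
    stands in for collecting episodes with the parameterized policy.
    """
    # Extra exploration noise that decays to zero over extra_decay_time epochs.
    extra_var = max(1.0 - epoch / extra_decay_time, 0) * extra_std ** 2
    sample_std = np.sqrt(cur_std ** 2 + extra_var)
    # 1. Sample n_samples parameter vectors from N(cur_mean, sample_std^2).
    samples = cur_mean + sample_std * np.random.randn(n_samples, cur_mean.size)
    # 2. Evaluate each sampled policy.
    returns = np.array([evaluate(s) for s in samples])
    # 3. MLE fit of the Gaussian to the best_frac best samples by return.
    n_best = max(1, int(n_samples * best_frac))
    elites = samples[np.argsort(-returns)[:n_best]]
    return elites.mean(axis=0), elites.std(axis=0)
```

Run on a toy quadratic objective, the mean converges toward the optimum within a few dozen epochs.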

class CMAES(env_spec, policy, n_samples, discount=0.99, sigma0=1.0)¶
Bases: garage.np.algos.rl_algorithm.RLAlgorithm
Covariance Matrix Adaptation Evolution Strategy.
Note
The CMA-ES method can hardly learn a successful policy even for simple tasks. It is maintained here only for consistency with the original rllab paper.
 Parameters
env_spec (EnvSpec) – Environment specification.
policy (garage.np.policies.Policy) – Action policy.
n_samples (int) – Number of policies sampled in one epoch.
discount (float) – Environment reward discount.
sigma0 (float) – Initial std for the parameter distribution.
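Full CMA-ES adapts a complete covariance matrix and is too long to sketch here, but the role of sigma0 as the initial search scale can be illustrated with a much simpler (1+1) evolution strategy using the classic 1/5th success rule. This is not the algorithm this class implements, only a minimal relative of it; the function and its arguments are hypothetical.

```python
import numpy as np

def one_plus_one_es(evaluate, x0, sigma0=1.0, n_iters=500, seed=0):
    """(1+1)-ES with the 1/5th success rule for step-size adaptation.

    Far simpler than full CMA-ES, but it shows what sigma0 does: it seeds
    the scale of the search distribution, which is then adapted online as
    candidate steps succeed or fail.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    fx, sigma = evaluate(x), sigma0
    for _ in range(n_iters):
        y = x + sigma * rng.standard_normal(x.size)
        fy = evaluate(y)
        if fy > fx:                   # maximize return
            x, fx = y, fy
            sigma *= np.exp(1 / 3)    # success: grow the step size
        else:
            sigma *= np.exp(-1 / 12)  # failure: shrink it (1/5th rule)
    return x, sigma
```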

class MetaRLAlgorithm¶
Bases: garage.np.algos.rl_algorithm.RLAlgorithm, abc.ABC
Base class for meta-RL algorithms.

abstract get_exploration_policy(self)¶
Return a policy used before adaptation to a specific task.
Each time it is retrieved, this policy should only be evaluated in one task.
 Returns
 The policy used to obtain samples, which are later used for meta-RL adaptation.
 Return type
 Policy
abstract adapt_policy(self, exploration_policy, exploration_episodes)¶
Produce a policy adapted for a task.
 Parameters
exploration_policy (Policy) – A policy which was returned from get_exploration_policy(), and which generated exploration_trajectories by interacting with an environment. The caller may not use this object after passing it into this method.
exploration_episodes (EpisodeBatch) – Episodes with which to adapt. These are generated by exploration_policy while exploring the environment.
 Returns
 A policy adapted to the task represented by the exploration_episodes.
 Return type
 Policy
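A toy illustration of this two-method contract, without importing garage: the classes, the dict-based "policy", and the episode format below are all hypothetical stand-ins chosen only to make the protocol concrete.

```python
import abc
import copy

class MetaRLContract(abc.ABC):
    """Stand-in for the MetaRLAlgorithm interface (not garage's class)."""

    @abc.abstractmethod
    def get_exploration_policy(self):
        """Return a policy used before adaptation to a specific task."""

    @abc.abstractmethod
    def adapt_policy(self, exploration_policy, exploration_episodes):
        """Produce a policy adapted for a task."""

class ToyMetaRL(MetaRLContract):
    """Toy meta-learner whose 'policy' is a one-parameter dict."""

    def __init__(self):
        self._params = {'bias': 0.0}

    def get_exploration_policy(self):
        # A fresh copy each time: the caller may freely mutate or discard
        # it, and each copy should only be evaluated in one task.
        return copy.deepcopy(self._params)

    def adapt_policy(self, exploration_policy, exploration_episodes):
        # Adapt toward the mean reward observed during exploration.
        adapted = copy.deepcopy(exploration_policy)
        rewards = [r for ep in exploration_episodes for r in ep['rewards']]
        adapted['bias'] = sum(rewards) / len(rewards)
        return adapted
```

Note that adapt_policy returns a new object and leaves the exploration policy behind, matching the contract that the caller may not reuse it after the call.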

class NOP¶
Bases: garage.np.algos.rl_algorithm.RLAlgorithm
NOP (no optimization performed) policy search algorithm.

init_opt(self)¶
Initialize the optimization procedure.

optimize_policy(self, paths)¶
Optimize the policy using the samples.
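A minimal stand-in sketch of the same interface (a hypothetical class, not garage's NOP) shows why such an algorithm is useful: it lets a fixed or scripted policy flow through the normal training loop unchanged, e.g. when testing a sampler or establishing a no-learning baseline.

```python
class NoOpAlgo:
    """Hypothetical stand-in for NOP: implements the algorithm
    interface but never changes the policy."""

    def init_opt(self):
        """Nothing to set up: there is no optimizer."""

    def optimize_policy(self, paths):
        """Ignore the collected samples; the policy stays fixed."""
```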
