garage.np.algos.cem

Cross Entropy Method.

class CEM(env_spec, policy, baseline, n_samples, discount=0.99, max_episode_length=500, init_std=1, best_frac=0.05, extra_std=1.0, extra_decay_time=100)

Bases: garage.np.algos.rl_algorithm.RLAlgorithm

Cross Entropy Method.

CEM works by iteratively optimizing a Gaussian distribution over policy parameters.

In each epoch, CEM does the following:
  1. Sample n_samples policies from a Gaussian distribution with mean cur_mean and standard deviation cur_std.
  2. Collect episodes for each policy.
  3. Update cur_mean and cur_std by maximum likelihood estimation over the n_best policies with the highest returns.
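
The steps above can be sketched in plain NumPy. This is a minimal illustration, not garage's implementation: the function cem_minimal, the evaluate callback (which stands in for collecting episodes and measuring their return), and the toy quadratic objective are all hypothetical, and n_best is assumed to be n_samples * best_frac rounded down.

```python
import numpy as np


def cem_minimal(evaluate, dim, n_samples=50, best_frac=0.2,
                init_std=2.0, n_epochs=30, seed=0):
    """Toy CEM loop; `evaluate` maps a parameter vector to a scalar return."""
    rng = np.random.default_rng(seed)
    cur_mean = np.zeros(dim)
    cur_std = np.full(dim, init_std)
    # Assumption: the elite set size is a fixed fraction of the samples.
    n_best = max(1, int(n_samples * best_frac))
    for _ in range(n_epochs):
        # Step 1: sample candidate parameter vectors from the Gaussian.
        params = cur_mean + cur_std * rng.standard_normal((n_samples, dim))
        # Step 2: score each candidate (stands in for collecting episodes).
        returns = np.array([evaluate(p) for p in params])
        # Step 3: refit mean and std to the n_best highest-return candidates.
        elite = params[np.argsort(returns)[-n_best:]]
        cur_mean = elite.mean(axis=0)
        cur_std = elite.std(axis=0)
    return cur_mean, cur_std


# Hypothetical objective: the return is highest when params equal `target`.
target = np.array([1.0, -2.0, 0.5])
mean, std = cem_minimal(lambda p: -np.sum((p - target) ** 2), dim=3)
```

For a diagonal Gaussian, the maximum likelihood estimate is simply the per-dimension mean and standard deviation of the elite set, which is why step 3 reduces to two NumPy reductions.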
Parameters:
  • env_spec (EnvSpec) – Environment specification.
  • policy (garage.np.policies.Policy) – Action policy.
  • baseline (garage.np.baselines.Baseline) – Baseline for GAE (Generalized Advantage Estimation).
  • n_samples (int) – Number of policies sampled in one epoch.
  • discount (float) – Environment reward discount.
  • max_episode_length (int) – Maximum length of a single episode.
  • best_frac (float) – Fraction of the sampled policies, ranked by return, that are kept as the n_best policies for the update.
  • init_std (float) – Initial std for policy param distribution.
  • extra_std (float) – Decaying std added to param distribution.
  • extra_decay_time (float) – Epochs that it takes to decay extra std.
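
The parameter list does not say how extra_std decays or how it is combined with cur_std. One plausible reading, shown here as a sketch under that assumption, is that the extra variance shrinks linearly to zero over extra_decay_time epochs and is added to the current variance. The helper name sample_std is hypothetical.

```python
import numpy as np


def sample_std(cur_std, extra_std, extra_decay_time, epoch):
    """Std used for sampling at `epoch`, assuming linear decay of the extra variance."""
    # Multiplier falls linearly from 1 at epoch 0 to 0 at extra_decay_time.
    extra_var_mult = max(1.0 - epoch / extra_decay_time, 0.0)
    # Treat the two noise sources as independent, so their variances add.
    return np.sqrt(np.square(cur_std) + np.square(extra_std) * extra_var_mult)


early = sample_std(cur_std=1.0, extra_std=1.0, extra_decay_time=100, epoch=0)
late = sample_std(cur_std=1.0, extra_std=1.0, extra_decay_time=100, epoch=100)
```

Under this reading, the extra term keeps exploration wide in early epochs, when the fitted std may collapse too quickly, and has no effect once epoch reaches extra_decay_time.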
train(self, runner)

Initialize variables and start training.

Parameters: runner (LocalRunner) – Experiment runner, which provides services such as snapshotting and sampler control.
Returns: The average return in the last epoch cycle.
Return type: float
train_once(self, itr, paths)

Perform one step of policy optimization given one batch of samples.

Parameters:
  • itr (int) – Iteration number.
  • paths (list[dict]) – A list of collected paths.
Returns: The average return of the epoch cycle.
Return type: float
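
As a rough illustration of the value train_once reports, the sketch below averages per-path returns over a batch. It assumes each path dict carries a "rewards" sequence, which this page does not confirm, and it leaves the discount as an argument because the page does not say whether the reported average is discounted. The function name average_return is hypothetical.

```python
import numpy as np


def average_return(paths, discount=1.0):
    """Mean (optionally discounted) return over a list of path dicts."""
    returns = []
    for path in paths:
        # Assumption: rewards are stored per time step under the "rewards" key.
        rewards = np.asarray(path["rewards"], dtype=float)
        # Weight the reward at step t by discount**t, then sum.
        weights = discount ** np.arange(len(rewards))
        returns.append(float(np.dot(weights, rewards)))
    return float(np.mean(returns))


example_paths = [{"rewards": [1.0, 1.0]}, {"rewards": [2.0]}]
avg = average_return(example_paths, discount=0.5)
```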