garage.tf.algos.rl2 module

Module for RL2.

This module contains RL2, RL2Worker and the environment wrapper for RL2.

class NoResetPolicy(policy)[source]

Bases: object

A policy that does not reset.

For RL2 meta-test, the policy should not reset after meta-RL adaptation. The hidden state is retained, because that is where the adaptation takes place.

Parameters:policy (garage.tf.policies.Policy) – Policy itself.
Returns:The wrapped policy that does not reset.
Return type:object
get_action(obs)[source]

Get a single action from this policy for the input observation.

Parameters:obs (numpy.ndarray) – Observation from environment.
Returns:Predicted action and agent info.
Return type:tuple[numpy.ndarray, dict]
get_param_values()[source]

Return values of params.

Returns:Policy parameter values.
Return type:np.ndarray
reset()[source]

Reset the policy. Intentionally a no-op, so that the hidden state is retained.

set_param_values(params)[source]

Set param values.

Parameters:params (np.ndarray) – A numpy array of parameter values.
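The wrapping idea above can be illustrated with a short standalone sketch. The class name below is hypothetical and this is not the garage implementation; it only mirrors the interface documented for NoResetPolicy.

class NoResetPolicySketch:
    """Illustrative wrapper: forwards calls to a policy but never resets it."""

    def __init__(self, policy):
        self._policy = policy

    def get_action(self, obs):
        # Delegate action selection to the wrapped (recurrent) policy.
        return self._policy.get_action(obs)

    def get_param_values(self):
        return self._policy.get_param_values()

    def set_param_values(self, params):
        self._policy.set_param_values(params)

    def reset(self):
        # Intentionally a no-op: the recurrent hidden state, which encodes the
        # meta-RL adaptation, must survive resets during meta-test.
        pass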
class RL2(rl2_max_path_length, meta_batch_size, task_sampler, meta_evaluator, n_epochs_per_eval, **inner_algo_args)[source]

Bases: garage.np.algos.meta_rl_algorithm.MetaRLAlgorithm, abc.ABC

RL^2.

Reference: https://arxiv.org/pdf/1611.02779.pdf.

When sampling for RL^2, there is more than one environment to sample from. In the original implementation, within each task/environment, all sampled rollouts are concatenated into a single rollout and fed to the inner algorithm, so returns and advantages are calculated across that concatenated rollout.

RL2Worker is required for sampling in RL2. See example/tf/rl2_ppo_halfcheetah.py for reference.

Users should not instantiate RL2 directly. Currently garage supports PPO and TRPO as the inner algorithm; refer to garage/tf/algos/rl2ppo.py and garage/tf/algos/rl2trpo.py.

Parameters:
  • rl2_max_path_length (int) – Maximum length for trajectories with respect to RL^2. Notice that it is different from the maximum path length for the inner algorithm.
  • meta_batch_size (int) – Meta batch size.
  • task_sampler (garage.experiment.TaskSampler) – Task sampler.
  • meta_evaluator (garage.experiment.MetaEvaluator) – Evaluator for meta-RL algorithms.
  • n_epochs_per_eval (int) – If meta_evaluator is passed, meta-evaluation will be performed every n_epochs_per_eval epochs.
  • inner_algo_args (dict) – Arguments for inner algorithm.
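To make the concatenation behavior described above concrete, here is a small, self-contained numpy sketch (not garage code): rewards from two rollouts in the same task are joined before discounted returns are computed, so credit flows across the trajectory boundary.

import numpy as np

def discounted_returns(rewards, discount=0.99):
    """Compute discounted returns R_t = r_t + discount * R_{t+1}."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount * running
        returns[t] = running
    return returns

# Two rollouts sampled from the same task/environment...
rollout_1 = np.array([1.0, 0.0, 1.0])
rollout_2 = np.array([0.0, 2.0])

# ...are concatenated into a single rollout before returns are computed,
# so the returns of rollout_1 also reflect rewards earned in rollout_2.
returns_across = discounted_returns(np.concatenate([rollout_1, rollout_2]))

# For contrast, per-rollout returns would stop credit at each boundary.
returns_per_rollout = [discounted_returns(rollout_1), discounted_returns(rollout_2)]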
adapt_policy(exploration_policy, exploration_trajectories)[source]

Produce a policy adapted for a task.

Parameters:
  • exploration_policy (garage.Policy) – A policy which was returned from get_exploration_policy(), and which generated exploration_trajectories by interacting with an environment. The caller may not use this object after passing it into this method.
  • exploration_trajectories (garage.TrajectoryBatch) – Trajectories to adapt to, generated by exploration_policy exploring the environment.
Returns:A policy adapted to the task represented by the exploration_trajectories.
Return type:garage.tf.policies.Policy

get_exploration_policy()[source]

Return a policy used before adaptation to a specific task.

Each time it is retrieved, this policy should only be evaluated in one task.

Returns:The policy used to obtain samples that are later used for meta-RL adaptation.
Return type:object
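A hedged sketch of how get_exploration_policy() and adapt_policy() fit together at meta-test time. Here algo is an instance of one of the concrete subclasses referenced above (rl2ppo.py/rl2trpo.py); env and collect_trajectories are hypothetical placeholders that stand in for the environment and a sampling helper returning a garage.TrajectoryBatch.

# Hypothetical meta-test flow; `env` and `collect_trajectories` are placeholders.
exploration_policy = algo.get_exploration_policy()

# Explore a single task with the (non-resetting) exploration policy.
exploration_trajectories = collect_trajectories(exploration_policy, env)

# Adaptation happens inside the policy's hidden state; adapt_policy wraps it up.
adapted_policy = algo.adapt_policy(exploration_policy, exploration_trajectories)

# The adapted policy can now be evaluated on the same task.
obs = env.reset()
action, agent_info = adapted_policy.get_action(obs)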
max_path_length

Max path length.

Returns:Maximum path length in a trajectory.
Return type:int
policy

Policy.

Returns:Policy to be used.
Return type:garage.Policy
train(runner)[source]

Obtain samplers and start actual training for each epoch.

Parameters:runner (LocalRunner) – LocalRunner is passed to give algorithm the access to runner.step_epochs(), which provides services such as snapshotting and sampler control.
Returns:The average return in the last epoch.
Return type:float
train_once(itr, paths)[source]

Perform one step of policy optimization given one batch of samples.

Parameters:
  • itr (int) – Iteration number.
  • paths (list[dict]) – A list of collected paths.
Returns:Average return.
Return type:numpy.float64

class RL2AdaptedPolicy(policy)[source]

Bases: object

An RL2 policy after adaptation.

Parameters:policy (garage.tf.policies.Policy) – Policy itself.
get_action(obs)[source]

Get a single action from this policy for the input observation.

Parameters:obs (numpy.ndarray) – Observation from environment.
Returns:Predicted action and agent info.
Return type:tuple(numpy.ndarray, dict)
get_param_values()[source]

Return values of params.

Returns:Policy parameter values and the initial hidden state that will be set every time the policy is used for meta-test.
Return type:tuple(np.ndarray, np.ndarray)
reset()[source]

Reset the policy, restoring the initial hidden state captured after adaptation.

set_param_values(params)[source]

Set param values.

Parameters:params (tuple(np.ndarray, np.ndarray)) – Two numpy arrays of parameter values: one for the network parameters and one for the initial hidden state.
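A standalone sketch of the parameter-plus-hidden-state bookkeeping described above. It is not the garage implementation; get_hidden_state and set_hidden_state are hypothetical accessors standing in for however the wrapped recurrent policy exposes its state.

class AdaptedPolicySketch:
    """Illustrative wrapper that snapshots the adapted hidden state."""

    def __init__(self, policy):
        self._policy = policy
        # Snapshot the hidden state reached after adaptation; reset() restores it.
        self._initial_hidden = policy.get_hidden_state()  # hypothetical accessor

    def get_action(self, obs):
        return self._policy.get_action(obs)

    def get_param_values(self):
        # Network parameters plus the stored initial hidden state.
        return (self._policy.get_param_values(), self._initial_hidden)

    def set_param_values(self, params):
        inner_params, hidden = params
        self._policy.set_param_values(inner_params)
        self._initial_hidden = hidden

    def reset(self):
        # Restore the post-adaptation hidden state instead of clearing it.
        self._policy.set_hidden_state(self._initial_hidden)  # hypothetical accessor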
class RL2Env(env)[source]

Bases: gym.core.Wrapper

Environment wrapper for RL2.

In RL2, the observation is concatenated with the previous action, reward, and terminal signal to form the new observation.

Parameters:env (gym.Env) – An env that will be wrapped.
reset(**kwargs)[source]

gym.Env reset function.

Parameters:kwargs – Keyword arguments.
Returns:Augmented observation.
Return type:np.ndarray
spec

Environment specification.

Returns:Environment specification.
Return type:EnvSpec
step(action)[source]

gym.Env step function.

Parameters:action (int) – Action taken.
Returns:Augmented observation, reward, terminal signal, and environment info.
Return type:tuple[np.ndarray, float, bool, dict]
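A simplified, standalone sketch of the observation augmentation described above, assuming Box observation and action spaces and the old four-value gym step API. It is not the garage implementation, and the class name is illustrative.

import numpy as np
import gym


class ObsAugmentSketch(gym.Wrapper):
    """Append the previous action, reward, and terminal flag to the observation."""

    def __init__(self, env):
        super().__init__(env)
        obs_dim = env.observation_space.shape[0]
        act_dim = env.action_space.shape[0]
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf,
            shape=(obs_dim + act_dim + 2,), dtype=np.float32)

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        # No previous action/reward/terminal at the start of an episode.
        zero_action = np.zeros(self.env.action_space.shape)
        return np.concatenate([obs, zero_action, [0.0], [0.0]])

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        aug_obs = np.concatenate(
            [obs, np.asarray(action).ravel(), [reward], [float(done)]])
        return aug_obs, reward, done, info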
class RL2Worker(*, seed, max_path_length, worker_number, n_paths_per_trial=2)[source]

Bases: garage.sampler.default_worker.DefaultWorker

Initialize a worker for RL2.

In RL2, the policy does not reset between trajectories within a meta batch; it resets only once, at the beginning of a trial/meta batch.

Parameters:
  • seed (int) – The seed to use to initialize random number generators.
  • max_path_length (int or float) – The maximum length paths which will be sampled. Can be (floating point) infinity.
  • worker_number (int) – The number of the worker where this update is occurring. This argument is used to set a different seed for each worker.
  • n_paths_per_trial (int) – Number of trajectories sampled per trial/meta batch. The policy resets at the beginning of a meta batch and collects n_paths_per_trial trajectories in that meta batch.
agent

The worker’s agent.

Type:Policy or None
env

The worker’s environment.

Type:gym.Env or None
rollout()[source]

Sample a single rollout of the agent in the environment.

Returns:The collected trajectory.
Return type:garage.TrajectoryBatch
start_rollout()[source]

Begin a new rollout.
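The reset behavior documented for RL2Worker can be summarized with a hedged, self-contained sketch (not the garage implementation); agent and env follow the get_action/step interfaces documented in this module.

def sample_trial(agent, env, n_paths_per_trial, max_path_length):
    """Collect one trial/meta batch: the agent resets once, the env every path."""
    paths = []
    agent.reset()  # the policy's hidden state persists for the whole trial
    for _ in range(n_paths_per_trial):
        obs = env.reset()  # the environment still resets between trajectories
        path = {'observations': [], 'actions': [], 'rewards': []}
        for _ in range(max_path_length):
            action, _ = agent.get_action(obs)
            next_obs, reward, done, _ = env.step(action)
            path['observations'].append(obs)
            path['actions'].append(action)
            path['rewards'].append(reward)
            obs = next_obs
            if done:
                break
        paths.append(path)
    return paths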