garage.tf.algos.te_ppo

Proximal Policy Optimization with Task Embedding.

class TEPPO(env_spec, policy, baseline, scope=None, max_episode_length=500, discount=0.99, gae_lambda=0.98, center_adv=True, positive_adv=False, fixed_horizon=False, lr_clip_range=0.01, max_kl_step=0.01, optimizer=None, optimizer_args=None, policy_ent_coeff=0.001, encoder_ent_coeff=0.001, use_softplus_entropy=False, stop_ce_gradient=False, inference=None, inference_optimizer=None, inference_optimizer_args=None, inference_ce_coeff=0.001, name='PPOTaskEmbedding')

Bases: garage.tf.algos.te_npo.TENPO


Proximal Policy Optimization with Task Embedding.

See https://karolhausman.github.io/pdf/hausman17nips-ws2.pdf for algorithm reference.

Parameters:
  • env_spec (EnvSpec) – Environment specification.
  • policy (garage.tf.policies.TaskEmbeddingPolicy) – Policy.
  • baseline (garage.tf.baselines.Baseline) – The baseline.
  • scope (str) – Scope for identifying the algorithm. Must be specified if running multiple algorithms simultaneously, each using different environments and policies.
  • max_episode_length (int) – Maximum length of a single episode.
  • discount (float) – Discount.
  • gae_lambda (float) – Lambda used for generalized advantage estimation.
  • center_adv (bool) – Whether to rescale the advantages so that they have mean 0 and standard deviation 1.
  • positive_adv (bool) – Whether to shift the advantages so that they are always positive. When used in conjunction with center_adv the advantages will be standardized before shifting.
  • fixed_horizon (bool) – Whether to fix horizon.
  • lr_clip_range (float) – The limit on the likelihood ratio between policies, as in PPO.
  • max_kl_step (float) – The maximum KL divergence between old and new policies, as in TRPO.
  • optimizer (object) – The optimizer of the algorithm. Should be one of the optimizers in garage.tf.optimizers.
  • optimizer_args (dict) – The arguments of the optimizer.
  • policy_ent_coeff (float) – The coefficient of the policy entropy. Setting it to zero would mean no entropy regularization.
  • encoder_ent_coeff (float) – The coefficient of the policy encoder entropy. Setting it to zero would mean no entropy regularization.
  • use_softplus_entropy (bool) – Whether to pass the entropy estimate through a softplus transform so that it cannot become negative.
  • stop_ce_gradient (bool) – Whether to stop the cross entropy gradient.
  • inference (garage.tf.embeddings.encoder.StochasticEncoder) – An encoder that infers the task embedding from a state trajectory.
  • inference_optimizer (object) – The optimizer of the inference. Should be an optimizer in garage.tf.optimizers.
  • inference_optimizer_args (dict) – The arguments of the inference optimizer.
  • inference_ce_coeff (float) – The coefficient of the cross entropy of task embeddings inferred from task one-hot and state trajectory. This is effectively the coefficient of log-prob of inference.
  • name (str) – The name of the algorithm.
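The lr_clip_range, policy_ent_coeff, and use_softplus_entropy parameters combine in the familiar PPO clipped-surrogate objective with an entropy bonus. The following is a minimal numpy sketch of that objective, not garage's TensorFlow implementation; the function name clipped_surrogate is hypothetical.

```python
import numpy as np

def clipped_surrogate(log_probs_new, log_probs_old, advantages, entropy,
                      lr_clip_range=0.01, policy_ent_coeff=0.001,
                      use_softplus_entropy=False):
    """Illustrative PPO-clip objective with an entropy bonus (sketch)."""
    # Likelihood ratio between the new and old policies.
    ratio = np.exp(log_probs_new - log_probs_old)
    # Clip the ratio to [1 - lr_clip_range, 1 + lr_clip_range], as in PPO.
    clipped = np.clip(ratio, 1.0 - lr_clip_range, 1.0 + lr_clip_range)
    # Pessimistic (element-wise minimum) surrogate.
    surrogate = np.minimum(ratio * advantages, clipped * advantages)
    if use_softplus_entropy:
        # Softplus keeps the entropy bonus non-negative.
        entropy = np.log(1.0 + np.exp(entropy))
    return np.mean(surrogate) + policy_ent_coeff * np.mean(entropy)
```

With identical old and new log-probabilities the ratio is 1 everywhere and the surrogate reduces to the mean advantage, which is a quick sanity check for the clipping logic.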
init_opt(self)

Initialize optimizer.

Raises:NotImplementedError – Raised if the policy is recurrent.
train(self, runner)

Obtain samplers and start actual training for each epoch.

Parameters:runner (LocalRunner) – LocalRunner is passed to give algorithm the access to runner.step_epochs(), which provides services such as snapshotting and sampler control.
Returns:The average return in the last epoch cycle.
Return type:float
train_once(self, itr, paths)

Perform one step of policy optimization given one batch of samples.

Parameters:
  • itr (int) – Iteration number.
  • paths (list[dict]) – A list of collected paths.
Returns:Average return.
Return type:numpy.float64
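The average return reported by train_once() is the mean over paths of each path's total discounted return. A minimal numpy sketch of that computation follows; the helper names discounted_returns and average_return are illustrative, not garage API.

```python
import numpy as np

def discounted_returns(rewards, discount=0.99):
    """Per-timestep discounted return for one path (illustrative)."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    # Accumulate backwards so each entry sums its own future rewards.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount * running
        returns[t] = running
    return returns

def average_return(paths, discount=0.99):
    """Mean total discounted return across a batch of paths (sketch)."""
    # returns[0] is the discounted sum over the whole path.
    return float(np.mean([discounted_returns(p['rewards'], discount)[0]
                          for p in paths]))
```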

optimize_policy(self, itr, samples_data)

Optimize policy.

Parameters:
  • itr (int) – Iteration number.
  • samples_data (dict) – Processed sample data. See process_samples() for details.
paths_to_tensors(self, paths)

Return processed sample data based on the collected paths.

Parameters:paths (list[dict]) – A list of collected paths.
Returns:
Processed sample data, with keys
  • observations: (numpy.ndarray)
  • tasks: (numpy.ndarray)
  • actions: (numpy.ndarray)
  • trajectories: (numpy.ndarray)
  • rewards: (numpy.ndarray)
  • baselines: (numpy.ndarray)
  • returns: (numpy.ndarray)
  • valids: (numpy.ndarray)
  • agent_infos: (dict)
  • latent_infos: (dict)
  • env_infos: (dict)
  • trajectory_infos: (dict)
  • paths: (list[dict])
Return type:dict
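Conceptually, paths_to_tensors() stacks variable-length paths into fixed-size arrays of shape [N, max_episode_length, ...], using a valids mask to mark real timesteps. The sketch below mimics that padding for a few of the keys; pad_paths is a hypothetical stand-in, not garage's implementation.

```python
import numpy as np

def pad_paths(paths, max_episode_length):
    """Stack variable-length paths into fixed [N, T, ...] arrays with a
    'valids' mask marking real (non-padding) timesteps (sketch)."""
    n = len(paths)
    obs_dim = paths[0]['observations'].shape[1]
    observations = np.zeros((n, max_episode_length, obs_dim))
    rewards = np.zeros((n, max_episode_length))
    valids = np.zeros((n, max_episode_length))
    for i, path in enumerate(paths):
        t = len(path['rewards'])
        observations[i, :t] = path['observations']
        rewards[i, :t] = path['rewards']
        valids[i, :t] = 1.0  # 1 for real steps, 0 for padding
    return dict(observations=observations, rewards=rewards, valids=valids)
```

Downstream losses can then be masked by valids so that padding timesteps contribute nothing.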
evaluate(self, policy_opt_input_values, samples_data)

Evaluate rewards and compute diagnostics for logging.

Parameters:
  • policy_opt_input_values (list[np.ndarray]) – Flattened policy optimization input values.
  • samples_data (dict) – Processed sample data. See process_samples() for details.
Returns:Processed sample data.
Return type:dict

visualize_distribution(self)

Visualize encoder distribution.

classmethod get_encoder_spec(cls, task_space, latent_dim)

Get the embedding spec of the encoder.

Parameters:
  • task_space (akro.Space) – Task spec.
  • latent_dim (int) – Latent dimension.
Returns:Encoder spec.
Return type:garage.InOutSpec
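The encoder consumes a one-hot task vector and emits a latent embedding, so its spec pairs a task-sized input space with a latent-sized output space. The sketch below uses namedtuple stand-ins for akro.Box and garage.InOutSpec; encoder_spec and both stand-in types are illustrative assumptions, not garage API.

```python
from collections import namedtuple

# Hypothetical stand-ins for akro.Box and garage.InOutSpec (illustrative).
Box = namedtuple('Box', ['low', 'high', 'shape'])
InOutSpec = namedtuple('InOutSpec', ['input_space', 'output_space'])

def encoder_spec(num_tasks, latent_dim):
    """Pair the one-hot task space with the latent space (sketch)."""
    # One-hot task vectors live in [0, 1]^num_tasks.
    in_space = Box(low=0.0, high=1.0, shape=(num_tasks,))
    # Latent embeddings are unbounded real vectors.
    out_space = Box(low=float('-inf'), high=float('inf'), shape=(latent_dim,))
    return InOutSpec(input_space=in_space, output_space=out_space)
```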

classmethod get_infer_spec(cls, env_spec, latent_dim, inference_window_size)

Get the embedding spec of the inference.

Every inference_window_size timesteps in the trajectory will be used as the inference network input.

Parameters:
  • env_spec (garage.envs.EnvSpec) – Environment spec.
  • latent_dim (int) – Latent dimension.
  • inference_window_size (int) – Length of inference window.
Returns:Inference spec.
Return type:garage.InOutSpec
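Since every inference_window_size timesteps of the trajectory form one input to the inference network, its input space is a flattened window of observations. The numpy sketch below shows one plausible way to slice a trajectory into such windows; inference_windows is an illustrative helper, not garage's implementation.

```python
import numpy as np

def inference_windows(observations, window_size):
    """Slice a [T, obs_dim] trajectory into overlapping flattened windows
    of shape [T - window_size + 1, window_size * obs_dim] (sketch)."""
    T, obs_dim = observations.shape
    # Each window covers window_size consecutive steps, flattened to one row.
    return np.stack([observations[t:t + window_size].ravel()
                     for t in range(T - window_size + 1)])
```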