garage.tf.algos.te_npo module
Natural Policy Optimization with Task Embeddings.
class TENPO(env_spec, policy, baseline, scope=None, max_path_length=500, discount=0.99, gae_lambda=1, center_adv=True, positive_adv=False, fixed_horizon=False, lr_clip_range=0.01, max_kl_step=0.01, optimizer=None, optimizer_args=None, policy_ent_coeff=0.0, encoder_ent_coeff=0.0, use_softplus_entropy=False, stop_ce_gradient=False, flatten_input=True, inference=None, inference_optimizer=None, inference_optimizer_args=None, inference_ce_coeff=0.0, name='NPOTaskEmbedding')

Bases: garage.np.algos.rl_algorithm.RLAlgorithm
Natural Policy Optimization with Task Embeddings.
See https://karolhausman.github.io/pdf/hausman17nips-ws2.pdf for the algorithm reference.
Parameters:

- env_spec (garage.envs.EnvSpec) – Environment specification.
- policy (garage.tf.policies.TaskEmbeddingPolicy) – Policy.
- baseline (garage.tf.baselines.Baseline) – The baseline.
- scope (str) – Scope for identifying the algorithm. Must be specified if running multiple algorithms simultaneously, each using different environments and policies.
- max_path_length (int) – Maximum length of a single rollout.
- discount (float) – Discount factor.
- gae_lambda (float) – Lambda used for generalized advantage estimation.
- center_adv (bool) – Whether to rescale the advantages so that they have mean 0 and standard deviation 1.
- positive_adv (bool) – Whether to shift the advantages so that they are always positive. When used in conjunction with center_adv the advantages will be standardized before shifting.
- fixed_horizon (bool) – Whether to fix the horizon.
- lr_clip_range (float) – The limit on the likelihood ratio between policies, as in PPO.
- max_kl_step (float) – The maximum KL divergence between old and new policies, as in TRPO.
- optimizer (object) – The optimizer of the algorithm. Should be one of the optimizers in garage.tf.optimizers.
- optimizer_args (dict) – The arguments of the optimizer.
- policy_ent_coeff (float) – The coefficient of the policy entropy. Setting it to zero would mean no entropy regularization.
- encoder_ent_coeff (float) – The coefficient of the policy encoder entropy. Setting it to zero would mean no entropy regularization.
- use_softplus_entropy (bool) – Whether to pass the entropy estimate through a softplus function so that it cannot be negative.
- stop_ce_gradient (bool) – Whether to stop the cross entropy gradient.
- flatten_input (bool) – Whether to flatten the input along the observation dimension. If True, for example, an observation with shape (2, 4) will be flattened into a vector of length 8.
- inference (garage.tf.embeddings.StochasticEncoder) – An encoder that infers the task embedding from a trajectory.
- inference_optimizer (object) – The optimizer of the inference. Should be an optimizer in garage.tf.optimizers.
- inference_optimizer_args (dict) – The arguments of the inference optimizer.
- inference_ce_coeff (float) – The coefficient of the cross entropy between task embeddings inferred from the task one-hot vector and from the trajectory. This is effectively the coefficient of the log-probability of the inference network.
- name (str) – The name of the algorithm.
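A minimal construction sketch, assuming pre-built env_spec, policy, baseline, and inference objects of the documented types; their construction is omitted here, and the hyperparameter values are illustrative, not recommendations.

```python
from garage.tf.algos import TENPO

# `env_spec`, `policy`, `baseline`, and `inference` are assumed to be
# already-constructed instances of garage.envs.EnvSpec,
# garage.tf.policies.TaskEmbeddingPolicy, garage.tf.baselines.Baseline,
# and garage.tf.embeddings.StochasticEncoder, respectively.
algo = TENPO(env_spec=env_spec,
             policy=policy,
             baseline=baseline,
             inference=inference,
             max_path_length=100,       # illustrative values,
             discount=0.99,             # not tuned recommendations
             lr_clip_range=0.2,
             policy_ent_coeff=1e-3,
             encoder_ent_coeff=1e-3,
             inference_ce_coeff=5e-2,
             use_softplus_entropy=True)
```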
evaluate(policy_opt_input_values, samples_data)

Evaluate rewards and everything else.
Parameters:

- policy_opt_input_values (list[numpy.ndarray]) – Flattened policy optimization input values.
- samples_data (dict) – Processed sample data.

Returns: Processed sample data.

Return type: dict
classmethod get_encoder_spec(task_space, latent_dim)

Get the embedding spec of the encoder.
Parameters:

- task_space (akro.Space) – Task spec.
- latent_dim (int) – Latent dimension.
Returns: Encoder spec.

Return type: garage.InOutSpec
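A sketch of using this helper, assuming a one-hot task space for four tasks and a freely chosen latent dimension:

```python
import akro

from garage.tf.algos import TENPO

# One-hot task space for 4 tasks (assumed encoding); latent_dim is a
# free design choice.
task_space = akro.Box(low=0.0, high=1.0, shape=(4,))
encoder_spec = TENPO.get_encoder_spec(task_space, latent_dim=2)
# encoder_spec pairs the task space (encoder input) with the latent
# space (encoder output), and can be passed to an encoder constructor
# that accepts such a spec.
```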
classmethod get_infer_spec(env_spec, latent_dim, inference_window_size)

Get the embedding spec of the inference network.

Each window of inference_window_size consecutive timesteps in the trajectory is used as an input to the inference network.
Parameters:

- env_spec (garage.envs.EnvSpec) – Environment spec.
- latent_dim (int) – Latent dimension.
- inference_window_size (int) – Length of inference window.
Returns: Inference spec.

Return type: garage.InOutSpec
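A sketch under the same assumptions, with env_spec taken from an existing environment (its construction is omitted):

```python
from garage.tf.algos import TENPO

# `env_spec` is an assumed, pre-built garage.envs.EnvSpec.
infer_spec = TENPO.get_infer_spec(env_spec,
                                  latent_dim=2,
                                  inference_window_size=6)
# The spec's input side covers a window of 6 consecutive timesteps;
# its output side is the 2-dimensional latent space.
```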
init_opt()

Initialize the optimizer.

Raises: NotImplementedError – Raised if the policy is recurrent.
paths_to_tensors(paths)

Return processed sample data based on the collected paths.
Parameters: paths (list[dict]) – A list of collected paths.

Returns: Processed sample data, with keys:

- observations: (numpy.ndarray)
- tasks: (numpy.ndarray)
- actions: (numpy.ndarray)
- trajectories: (numpy.ndarray)
- rewards: (numpy.ndarray)
- baselines: (numpy.ndarray)
- returns: (numpy.ndarray)
- valids: (numpy.ndarray)
- agent_infos: (dict)
- latent_infos: (dict)
- env_infos: (dict)
- trajectory_infos: (dict)
- paths: (list[dict])

Return type: dict
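A hedged inspection sketch, assuming algo is a constructed TENPO instance (see the constructor sketch above) and paths is a list of rollout dicts as produced by the sampler:

```python
# `algo` and `paths` are assumed to exist; both are taken as given here.
samples_data = algo.paths_to_tensors(paths)

# The array-valued entries can be inspected directly.
for key in ('observations', 'tasks', 'actions', 'trajectories',
            'rewards', 'baselines', 'returns', 'valids'):
    print(key, samples_data[key].shape)
```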
train(runner)

Obtain samplers and start actual training for each epoch.

Parameters: runner (LocalRunner) – LocalRunner is passed to give the algorithm access to runner.step_epochs(), which provides services such as snapshotting and sampler control.

Returns: The average return in the last epoch cycle.

Return type: float
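A hedged end-to-end sketch following the usual garage runner pattern; runner (a LocalRunner set up inside a TensorFlow session), env, and algo are assumed to exist and are not constructed here:

```python
# runner.setup() wires the algorithm and environment together;
# runner.train() drives the epoch loop, which calls algo.train(runner)
# internally. n_epochs and batch_size are illustrative values.
runner.setup(algo, env)
last_avg_return = runner.train(n_epochs=100, batch_size=2048)
```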