garage.torch.algos.mtsac module¶
This module creates an MTSAC model in PyTorch.
class MTSAC(policy, qf1, qf2, replay_buffer, env_spec, num_tasks, *, max_path_length, max_eval_path_length=None, eval_env, gradient_steps_per_itr, fixed_alpha=None, target_entropy=None, initial_log_entropy=0.0, discount=0.99, buffer_batch_size=64, min_buffer_size=10000, target_update_tau=0.005, policy_lr=0.0003, qf_lr=0.0003, reward_scale=1.0, optimizer=<sphinx.ext.autodoc.importer._MockObject object>, steps_per_epoch=1, num_evaluation_trajectories=5)[source]¶

Bases: garage.torch.algos.sac.SAC
An MTSAC model in PyTorch.
This MTSAC implementation is the same as SAC except for a small change called “disentangled alphas”. Alpha is the entropy coefficient used to control the exploration of the agent/policy. Disentangling alphas refers to having a separate alpha coefficient for every task learned by the policy. The alphas are accessed using the one-hot encoding of an id assigned to each task.
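The per-task alpha lookup described above can be sketched in plain Python. This is a hypothetical, dependency-free illustration (the names `log_alphas` and `alpha_for` are not part of the garage API): each task keeps its own learned log-alpha, and the one-hot task id selects which coefficient applies to a given transition.

```python
import math

# Hypothetical sketch of "disentangled alphas": one entropy coefficient
# (stored as a log for numerical stability) per task.
num_tasks = 3
log_alphas = [0.0, -0.5, 0.25]  # one learned log-alpha per task

def alpha_for(one_hot):
    """Return exp(log_alpha) for the task indicated by a one-hot id."""
    task_id = one_hot.index(1)  # recover the integer task id from the one-hot
    return math.exp(log_alphas[task_id])

alpha_for([1, 0, 0])  # task 0 -> exp(0.0) == 1.0
alpha_for([0, 1, 0])  # task 1 -> exp(-0.5)
```

In the actual algorithm each `log_alpha` is a learnable parameter updated toward `target_entropy`; the lookup itself is just this one-hot indexing.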
Parameters: - policy (garage.torch.policy.Policy) – Policy/Actor/Agent that is being optimized by SAC.
- qf1 (garage.torch.q_function.ContinuousMLPQFunction) – QFunction/Critic used for actor/policy optimization. See Soft Actor-Critic and Applications.
- qf2 (garage.torch.q_function.ContinuousMLPQFunction) – QFunction/Critic used for actor/policy optimization. See Soft Actor-Critic and Applications.
- replay_buffer (garage.replay_buffer.ReplayBuffer) – Stores transitions that are previously collected by the sampler.
- env_spec (garage.envs.env_spec.EnvSpec) – The env_spec attribute of the environment that the agent is being trained in. Usually accessible by calling env.spec.
- num_tasks (int) – The number of tasks being learned.
- max_path_length (int) – The max path length of the algorithm.
- max_eval_path_length (int or None) – Maximum length of paths used for off-policy evaluation. If None, defaults to max_path_length.
- eval_env (garage.envs.GarageEnv) – The environment used for collecting evaluation trajectories.
- gradient_steps_per_itr (int) – Number of optimization steps that should occur before the training step is over and a new batch of transitions is collected by the sampler.
- fixed_alpha (float) – The entropy/temperature to be used if temperature is not supposed to be learned.
- target_entropy (float) – Target entropy to be used during entropy/temperature optimization. If None, the default heuristic from Soft Actor-Critic Algorithms and Applications is used.
- initial_log_entropy (float) – Initial entropy/temperature coefficient to be used if a fixed_alpha is not being used (fixed_alpha=None), and the entropy/temperature coefficient is being learned.
- discount (float) – The discount factor to be used during sampling and critic/q_function optimization.
- buffer_batch_size (int) – The number of transitions sampled from the replay buffer that are used during a single optimization step.
- min_buffer_size (int) – The minimum number of transitions that need to be in the replay buffer before training can begin.
- target_update_tau (float) – A coefficient that controls the rate at which the target q_functions update over optimization iterations.
- policy_lr (float) – Learning rate for policy optimizers.
- qf_lr (float) – Learning rate for q_function optimizers.
- reward_scale (float) – Reward multiplier. Changing this hyperparameter changes the effect that the reward from a transition will have during optimization.
- optimizer (torch.optim.Optimizer) – Optimizer to be used for policy/actor, q_functions/critics, and temperature/entropy optimizations.
- steps_per_epoch (int) – Number of train_once calls per epoch.
- num_evaluation_trajectories (int) – The number of evaluation trajectories used for computing eval stats at the end of every epoch.
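Because the alphas are indexed by a one-hot task id, multi-task observations carry that id alongside the raw observation. A minimal sketch, assuming observations are augmented by appending the one-hot encoding (the variable names here are illustrative, not garage identifiers):

```python
# Hypothetical sketch: an observation augmented with a one-hot task id,
# and how the two parts are split back apart.
obs_dim, num_tasks = 4, 3

# Raw 4-dim observation followed by the one-hot encoding for task 1.
augmented = [0.1, 0.2, 0.3, 0.4, 0.0, 1.0, 0.0]

obs = augmented[:obs_dim]        # the environment observation
one_hot = augmented[obs_dim:]    # the task's one-hot id
task_id = one_hot.index(1.0)     # integer id used to pick the task's alpha
```

The split uses only `obs_dim` and `num_tasks`, which is why MTSAC takes `num_tasks` as a constructor argument.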