Soft Actor-Critic¶
|  |  |
| --- | --- |
| Action Space | Continuous |
| Paper | Soft Actor-Critic Algorithms and Applications [1] |
| Framework(s) | PyTorch |
| API Reference | garage.torch.algos.SAC |
| Code |  |
| Examples | sac_half_cheetah_batch |
Soft Actor-Critic (SAC) is an algorithm which optimizes a stochastic policy in an off-policy way, forming a bridge between stochastic policy optimization and DDPG-style approaches. A central feature of SAC is entropy regularization. The policy is trained to maximize a trade-off between expected return and entropy, a measure of randomness in the policy. This has a close connection to the exploration-exploitation trade-off: increasing entropy results in more exploration, which can accelerate learning later on. It can also prevent the policy from prematurely converging to a bad local optimum.
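Concretely, SAC optimizes the maximum-entropy objective from [1], in which the policy's entropy at each state is added to the reward with a temperature coefficient α that sets the trade-off. One standard way to write that objective (a sketch of the general formulation, not garage-specific notation) is

$$
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\Big(r(s_t, a_t) + \alpha\,\mathcal{H}\big(\pi(\cdot \mid s_t)\big)\Big)\right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big) = -\mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[\log \pi(a \mid s)\big].
$$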
Default Parameters¶
initial_log_entropy=0.
discount=0.99
buffer_batch_size=64
min_buffer_size=int(1e4)
target_update_tau=5e-3
policy_lr=3e-4
qf_lr=3e-4
reward_scale=1.0
optimizer=torch.optim.Adam
steps_per_epoch=1
num_evaluation_episodes=10
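For orientation, target_update_tau corresponds to the coefficient τ of the soft (Polyak-averaged) target Q-network update that standard SAC applies after each gradient step [1]; with the default of 5e-3, the target parameters move only a small step toward the learned parameters at every update:

$$
\bar{\theta} \leftarrow \tau\,\theta + (1 - \tau)\,\bar{\theta}, \qquad \tau = 5 \times 10^{-3}.
$$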
Examples¶
#!/usr/bin/env python3
"""This is an example to train a task with SAC algorithm written in PyTorch."""
import numpy as np
import torch
from torch import nn
from torch.nn import functional as F
from garage import wrap_experiment
from garage.envs import GymEnv, normalize
from garage.experiment import deterministic
from garage.replay_buffer import PathBuffer
from garage.sampler import LocalSampler
from garage.torch import set_gpu_mode
from garage.torch.algos import SAC
from garage.torch.policies import TanhGaussianMLPPolicy
from garage.torch.q_functions import ContinuousMLPQFunction
from garage.trainer import Trainer
@wrap_experiment(snapshot_mode='none')
def sac_half_cheetah_batch(ctxt=None, seed=1):
"""Set up environment and algorithm and run the task.
Args:
ctxt (garage.experiment.ExperimentContext): The experiment
configuration used by Trainer to create the snapshotter.
seed (int): Used to seed the random number generator to produce
determinism.
"""
deterministic.set_seed(seed)
trainer = Trainer(snapshot_config=ctxt)
env = normalize(GymEnv('HalfCheetah-v2'))
policy = TanhGaussianMLPPolicy(
env_spec=env.spec,
hidden_sizes=[256, 256],
hidden_nonlinearity=nn.ReLU,
output_nonlinearity=None,
min_std=np.exp(-20.),
max_std=np.exp(2.),
)
qf1 = ContinuousMLPQFunction(env_spec=env.spec,
hidden_sizes=[256, 256],
hidden_nonlinearity=F.relu)
qf2 = ContinuousMLPQFunction(env_spec=env.spec,
hidden_sizes=[256, 256],
hidden_nonlinearity=F.relu)
replay_buffer = PathBuffer(capacity_in_transitions=int(1e6))
sac = SAC(env_spec=env.spec,
policy=policy,
qf1=qf1,
qf2=qf2,
gradient_steps_per_itr=1000,
max_episode_length_eval=1000,
replay_buffer=replay_buffer,
min_buffer_size=1e4,
target_update_tau=5e-3,
discount=0.99,
buffer_batch_size=256,
reward_scale=1.,
steps_per_epoch=1)
if torch.cuda.is_available():
set_gpu_mode(True)
else:
set_gpu_mode(False)
sac.to()
trainer.setup(algo=sac, env=env, sampler_cls=LocalSampler)
trainer.train(n_epochs=1000, batch_size=1000)
s = np.random.randint(0, 1000)
sac_half_cheetah_batch(seed=521)
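The policy in this example, TanhGaussianMLPPolicy, samples from a Gaussian whose mean and clamped log standard deviation are produced by an MLP, then squashes the sample with tanh so actions stay inside the bounded continuous action space. The snippet below is only an illustrative sketch of that sampling step in plain PyTorch, not garage's implementation; mean and log_std stand in for the policy network's outputs for one observation, and the 6-dimensional action size matches HalfCheetah.

import torch

# Sketch of tanh-squashed Gaussian sampling (illustrative, not garage's code).
mean = torch.zeros(6)                                       # placeholder for the policy MLP's mean output
log_std = torch.full((6,), -1.0).clamp(min=-20.0, max=2.0)  # bounds match min_std/max_std in the example
std = log_std.exp()

eps = torch.randn_like(mean)             # reparameterized sample: a = tanh(mu + sigma * eps)
action = torch.tanh(mean + std * eps)    # every action dimension lands in (-1, 1)
print(action)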
References¶
[1] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and others. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.
This page was authored by Ruofu Wang (@yeukfu).