
Policies for TensorFlow-based algorithms.

class Policy(name, env_spec)[source]

Bases: abc.ABC

Base class for Policies.


The action space for the environment.


Policy environment specification.

flat_to_params(flattened_params, **tags)[source]

Unflatten tensors according to their respective shapes.

  • flattened_params (np.ndarray) – A numpy array of flattened params.
  • tags (dict) – A map specifying the parameters and their shapes.

A list of parameters reshaped to the shapes specified.

Return type:

tensors (List[np.ndarray])
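
As a rough illustration of what unflattening involves, the sketch below splits a flat vector into arrays of given shapes. The helper name `unflatten_tensors` and the example shapes are assumptions made for illustration; this is not the library's implementation.

```python
import numpy as np


def unflatten_tensors(flattened, shapes):
    """Split a flat vector into arrays with the given shapes (illustrative helper)."""
    sizes = [int(np.prod(shape)) for shape in shapes]
    # Cumulative offsets mark where each tensor ends inside the flat vector.
    split_points = np.cumsum(sizes)[:-1]
    chunks = np.split(flattened, split_points)
    return [chunk.reshape(shape) for chunk, shape in zip(chunks, shapes)]


shapes = [(2, 3), (3,)]                  # e.g. one weight matrix and one bias
flat = np.arange(9, dtype=np.float32)
params = unflatten_tensors(flat, shapes)
```

Flattening the resulting list again (for example with `np.concatenate` over the raveled arrays) recovers the original vector, which is the round trip that get_param_values and set_param_values rely on conceptually.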


Get action sampled from the policy.

Parameters:observation (np.ndarray) – Observation from the environment.
Returns:Action sampled from the policy.
Return type:(np.ndarray)

Get actions sampled from the policy.

Parameters:observations (list[np.ndarray]) – Observations from the environment.
Returns:Actions sampled from the policy.
Return type:(np.ndarray)

Get global variables.

Returns:A list of global variables in the current variable scope.
Return type:List[tf.Variable]

Get parameter shapes.


Get param values.

Parameters:tags (dict) – A map of parameters for which the values are required.
Returns:Values of the parameters evaluated in the current session
Return type:param_values (np.ndarray)

Get the trainable variables.

Returns:A list of trainable variables in the current variable scope.
Return type:List[tf.Variable]

Get trainable variables.

Returns:A list of trainable variables in the current variable scope.
Return type:List[tf.Variable]

Log extra information per iteration based on the collected paths.


Name of the policy model and the variable scope.


The observation space of the environment.


Indicating if the policy is recurrent.


Reset the policy.

If dones is None, it defaults to np.array([True]), which implies the policy is not “vectorized”, i.e. the number of parallel environments used to sample training data is 1.

Parameters:dones (numpy.ndarray) – Bool that indicates terminal state(s).
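
The effect of dones can be pictured with a small stand-in class. `HiddenStateTracker` is hypothetical and only mimics the bookkeeping a stateful policy might do; the real policies may store and reset their state differently.

```python
import numpy as np


class HiddenStateTracker:
    """Hypothetical stand-in for the per-environment state a policy keeps."""

    def __init__(self, hidden_dim):
        self.hidden_dim = hidden_dim
        self.hidden = None

    def reset(self, dones=None):
        if dones is None:
            # Default: a single environment whose state must be reset.
            dones = np.array([True])
        dones = np.asarray(dones, dtype=bool)
        if self.hidden is None or len(self.hidden) != len(dones):
            # (Re)allocate one state row per parallel environment.
            self.hidden = np.zeros((len(dones), self.hidden_dim))
        # Zero only the rows whose episode has terminated.
        self.hidden[dones] = 0.0


tracker = HiddenStateTracker(hidden_dim=4)
tracker.reset()                                  # one environment
single_env_shape = tracker.hidden.shape

tracker.reset(np.array([True, True, True]))      # three parallel environments
tracker.hidden += 1.0                            # pretend the policy was stepped
tracker.reset(np.array([False, True, False]))    # only environment 1 finished
```

After the last call, only the row for environment 1 is cleared; the other two environments keep their state.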
set_param_values(param_values, name=None, **tags)[source]

Set param values.

  • param_values (np.ndarray) – A numpy array of parameter values.
  • tags (dict) – A map of parameters for which the values should be loaded.

State info keys.

Returns:keys for the information related to the policy’s state when taking an action.
Return type:List[str]

State info specification.

Returns:keys and shapes for the information related to the policy’s state when taking an action.
Return type:List[str]

Clean up operation.


Boolean for vectorized.

Returns:Indicates whether the policy is vectorized. If True, it should implement get_actions(), and support resetting with multiple simultaneous states.
Return type:bool
class StochasticPolicy(name, env_spec)[source]



dist_info(obs, state_infos)[source]

Distribution info.

Return the distribution information about the actions.

  • obs (tf.Tensor) – observation values
  • state_infos (dict) – a dictionary whose values should contain information about the state of the policy at the time it received the observation
dist_info_sym(obs_var, state_info_vars, name='dist_info_sym')[source]

Symbolic graph of the distribution.

Return the symbolic distribution information about the actions.

  • obs_var (tf.Tensor) – Symbolic variable for observations.
  • state_info_vars (dict) – A dictionary whose values should contain information about the state of the policy at the time it received the observation.
  • name (str) – Name of the symbolic graph.


class CategoricalCNNPolicy(env_spec, conv_filters, conv_filter_sizes, conv_strides, conv_pad, name='CategoricalCNNPolicy', hidden_sizes=[], hidden_nonlinearity=<function relu>, hidden_w_init=<tensorflow.python.ops.init_ops.GlorotUniform object>, hidden_b_init=<tensorflow.python.ops.init_ops.Zeros object>, output_nonlinearity=<function softmax>, output_w_init=<tensorflow.python.ops.init_ops.GlorotUniform object>, output_b_init=<tensorflow.python.ops.init_ops.Zeros object>, layer_normalization=False)[source]


A policy that contains a CNN and an MLP to make predictions based on a categorical distribution.

It only works with akro.Discrete action space.

  • env_spec (garage.envs.env_spec.EnvSpec) – Environment specification.
  • conv_filter_sizes (tuple[int]) – Dimension of the filters. For example, (3, 5) means there are two convolutional layers. The filter for first layer is of dimension (3 x 3) and the second one is of dimension (5 x 5).
  • conv_filters (tuple[int]) – Number of filters. For example, (3, 32) means there are two convolutional layers. The filter for the first layer has 3 channels and that of the second layer has 32 channels.
  • conv_strides (tuple[int]) – The stride of the sliding window. For example, (1, 2) means there are two convolutional layers. The stride of the filter for first layer is 1 and that of the second layer is 2.
  • conv_pad (str) – The type of padding algorithm to use, either ‘SAME’ or ‘VALID’.
  • name (str) – Policy name, also the variable scope of the policy.
  • hidden_sizes (list[int]) – Output dimension of dense layer(s). For example, (32, 32) means the MLP of this policy consists of two hidden layers, each with 32 hidden units.
  • hidden_nonlinearity (callable) – Activation function for intermediate dense layer(s). It should return a tf.Tensor. Set it to None to maintain a linear activation.
  • hidden_w_init (callable) – Initializer function for the weight of intermediate dense layer(s). The function should return a tf.Tensor.
  • hidden_b_init (callable) – Initializer function for the bias of intermediate dense layer(s). The function should return a tf.Tensor.
  • output_nonlinearity (callable) – Activation function for output dense layer. It should return a tf.Tensor. Set it to None to maintain a linear activation.
  • output_w_init (callable) – Initializer function for the weight of output dense layer(s). The function should return a tf.Tensor.
  • output_b_init (callable) – Initializer function for the bias of output dense layer(s). The function should return a tf.Tensor.
  • layer_normalization (bool) – Bool for using layer normalization or not.
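
To see how conv_filter_sizes, conv_strides and conv_pad interact, the sketch below computes the spatial output size of each layer using TensorFlow's standard 'SAME' and 'VALID' rules. The 32-pixel square input and the helper names are assumptions chosen for illustration.

```python
import math


def conv_output_size(input_size, filter_size, stride, padding):
    """Spatial output size of one convolution under TensorFlow's padding rules."""
    if padding == 'SAME':
        return math.ceil(input_size / stride)
    if padding == 'VALID':
        return math.ceil((input_size - filter_size + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")


def stack_output_sizes(input_size, filter_sizes, strides, padding):
    """Apply conv_output_size layer by layer and collect each layer's size."""
    sizes = []
    size = input_size
    for filter_size, stride in zip(filter_sizes, strides):
        size = conv_output_size(size, filter_size, stride, padding)
        sizes.append(size)
    return sizes


# Filter sizes (3, 5) and strides (1, 2), as in the parameter examples above.
valid_sizes = stack_output_sizes(32, (3, 5), (1, 2), 'VALID')  # [30, 13]
same_sizes = stack_output_sizes(32, (3, 5), (1, 2), 'SAME')    # [32, 16]
```

With 'VALID' padding each layer shrinks the feature map by the filter size before striding, whereas 'SAME' only divides by the stride.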
dist_info(obs, state_infos=None)[source]

Distribution info.

dist_info_sym(obs_var, state_info_vars=None, name=None)[source]

Symbolic graph of the distribution.


Policy distribution.


Return a single action.


Return multiple actions.


Vectorized or not.

class CategoricalGRUPolicy(env_spec, name='CategoricalGRUPolicy', hidden_dim=32, hidden_nonlinearity=<function tanh>, hidden_w_init=<tensorflow.python.ops.init_ops.GlorotUniform object>, hidden_b_init=<tensorflow.python.ops.init_ops.Zeros object>, recurrent_nonlinearity=<function sigmoid>, recurrent_w_init=<tensorflow.python.ops.init_ops.GlorotUniform object>, output_nonlinearity=<function softmax>, output_w_init=<tensorflow.python.ops.init_ops.GlorotUniform object>, output_b_init=<tensorflow.python.ops.init_ops.Zeros object>, hidden_state_init=<tensorflow.python.ops.init_ops.Zeros object>, hidden_state_init_trainable=False, state_include_action=True, layer_normalization=False)[source]


A policy that contains a GRU to make predictions based on a categorical distribution.

It only works with akro.Discrete action space.

  • env_spec (garage.envs.env_spec.EnvSpec) – Environment specification.
  • name (str) – Policy name, also the variable scope.
  • hidden_dim (int) – Hidden dimension for GRU cell.
  • hidden_nonlinearity (callable) – Activation function for intermediate dense layer(s). It should return a tf.Tensor. Set it to None to maintain a linear activation.
  • hidden_w_init (callable) – Initializer function for the weight of intermediate dense layer(s). The function should return a tf.Tensor.
  • hidden_b_init (callable) – Initializer function for the bias of intermediate dense layer(s). The function should return a tf.Tensor.
  • recurrent_nonlinearity (callable) – Activation function for recurrent layers. It should return a tf.Tensor. Set it to None to maintain a linear activation.
  • recurrent_w_init (callable) – Initializer function for the weight of recurrent layer(s). The function should return a tf.Tensor.
  • output_nonlinearity (callable) – Activation function for output dense layer. It should return a tf.Tensor. Set it to None to maintain a linear activation.
  • output_w_init (callable) – Initializer function for the weight of output dense layer(s). The function should return a tf.Tensor.
  • output_b_init (callable) – Initializer function for the bias of output dense layer(s). The function should return a tf.Tensor.
  • hidden_state_init (callable) – Initializer function for the initial hidden state. The function should return a tf.Tensor.
  • hidden_state_init_trainable (bool) – Bool for whether the initial hidden state is trainable.
  • state_include_action (bool) – Whether the state includes action. If True, input dimension will be (observation dimension + action dimension).
  • layer_normalization (bool) – Bool for using layer normalization or not.
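
The input dimension implied by state_include_action can be sketched as a concatenation of the observation with the previous action. Encoding the previous discrete action as a one-hot vector is an assumption made here for illustration, and `build_recurrent_input` is a hypothetical helper.

```python
import numpy as np


def build_recurrent_input(observation, prev_action, n_actions):
    """Concatenate a flat observation with a one-hot previous action."""
    one_hot = np.zeros(n_actions)
    one_hot[prev_action] = 1.0
    return np.concatenate([observation, one_hot])


obs = np.array([0.1, -0.2, 0.3])         # observation dimension 3
policy_input = build_recurrent_input(obs, prev_action=1, n_actions=4)
# Input dimension = observation dimension (3) + action dimension (4) = 7.
```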
dist_info_sym(obs_var, state_info_vars, name=None)[source]

Symbolic graph of the distribution.


Policy distribution.


Return a single action.


Return multiple actions.


Recurrent or not.


Reset the policy.


State info specification.


Vectorized or not.

class CategoricalLSTMPolicy(env_spec, name='CategoricalLSTMPolicy', hidden_dim=32, hidden_nonlinearity=<function tanh>, hidden_w_init=<tensorflow.python.ops.init_ops.GlorotUniform object>, hidden_b_init=<tensorflow.python.ops.init_ops.Zeros object>, recurrent_nonlinearity=<function sigmoid>, recurrent_w_init=<tensorflow.python.ops.init_ops.GlorotUniform object>, output_nonlinearity=<function softmax>, output_w_init=<tensorflow.python.ops.init_ops.GlorotUniform object>, output_b_init=<tensorflow.python.ops.init_ops.Zeros object>, hidden_state_init=<tensorflow.python.ops.init_ops.Zeros object>, hidden_state_init_trainable=False, cell_state_init=<tensorflow.python.ops.init_ops.Zeros object>, cell_state_init_trainable=False, state_include_action=True, forget_bias=True, layer_normalization=False)[source]


A policy that contains an LSTM to make predictions based on a categorical distribution.

It only works with akro.Discrete action space.

  • env_spec (garage.envs.env_spec.EnvSpec) – Environment specification.
  • name (str) – Policy name, also the variable scope.
  • hidden_dim (int) – Hidden dimension for LSTM cell.
  • hidden_nonlinearity (callable) – Activation function for intermediate dense layer(s). It should return a tf.Tensor. Set it to None to maintain a linear activation.
  • hidden_w_init (callable) – Initializer function for the weight of intermediate dense layer(s). The function should return a tf.Tensor.
  • hidden_b_init (callable) – Initializer function for the bias of intermediate dense layer(s). The function should return a tf.Tensor.
  • recurrent_nonlinearity (callable) – Activation function for recurrent layers. It should return a tf.Tensor. Set it to None to maintain a linear activation.
  • recurrent_w_init (callable) – Initializer function for the weight of recurrent layer(s). The function should return a tf.Tensor.
  • output_nonlinearity (callable) – Activation function for output dense layer. It should return a tf.Tensor. Set it to None to maintain a linear activation.
  • output_w_init (callable) – Initializer function for the weight of output dense layer(s). The function should return a tf.Tensor.
  • output_b_init (callable) – Initializer function for the bias of output dense layer(s). The function should return a tf.Tensor.
  • hidden_state_init (callable) – Initializer function for the initial hidden state. The function should return a tf.Tensor.
  • hidden_state_init_trainable (bool) – Bool for whether the initial hidden state is trainable.
  • cell_state_init (callable) – Initializer function for the initial cell state. The function should return a tf.Tensor.
  • cell_state_init_trainable (bool) – Bool for whether the initial cell state is trainable.
  • state_include_action (bool) – Whether the state includes action. If True, input dimension will be (observation dimension + action dimension).
  • forget_bias (bool) – If True, add 1 to the bias of the forget gate at initialization. It’s used to reduce the scale of forgetting at the beginning of the training.
  • layer_normalization (bool) – Bool for using layer normalization or not.
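
The effect of forget_bias can be checked with simple arithmetic. The sketch below assumes a zero-initialized bias and a zero pre-activation at the start of training, and shows how adding 1 moves the forget gate towards retaining the cell state.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# With a zero bias and zero pre-activation, the forget gate sits at 0.5.
gate_without_bias = sigmoid(0.0)
# Adding 1 to the forget-gate bias shifts the initial gate towards "keep".
gate_with_bias = sigmoid(0.0 + 1.0)

cell_state = 2.0
retained_without = gate_without_bias * cell_state
retained_with = gate_with_bias * cell_state
```

Here the gate rises from 0.5 to roughly 0.73, so more of the previous cell state survives each step early in training.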
dist_info_sym(obs_var, state_info_vars, name=None)[source]

Symbolic graph of the distribution.


Policy distribution.


Return a single action.


Return multiple actions.


Recurrent or not.


Reset the policy.


State info specification.


Vectorized or not.

class CategoricalMLPPolicy(env_spec, name='CategoricalMLPPolicy', hidden_sizes=(32, 32), hidden_nonlinearity=<function tanh>, hidden_w_init=<tensorflow.python.ops.init_ops.GlorotUniform object>, hidden_b_init=<tensorflow.python.ops.init_ops.Zeros object>, output_nonlinearity=<function softmax>, output_w_init=<tensorflow.python.ops.init_ops.GlorotUniform object>, output_b_init=<tensorflow.python.ops.init_ops.Zeros object>, layer_normalization=False)[source]


A policy that contains an MLP to make predictions based on a categorical distribution.

It only works with akro.Discrete action space.

  • env_spec (garage.envs.env_spec.EnvSpec) – Environment specification.
  • name (str) – Policy name, also the variable scope.
  • hidden_sizes (list[int]) – Output dimension of dense layer(s). For example, (32, 32) means the MLP of this policy consists of two hidden layers, each with 32 hidden units.
  • hidden_nonlinearity (callable) – Activation function for intermediate dense layer(s). It should return a tf.Tensor. Set it to None to maintain a linear activation.
  • hidden_w_init (callable) – Initializer function for the weight of intermediate dense layer(s). The function should return a tf.Tensor.
  • hidden_b_init (callable) – Initializer function for the bias of intermediate dense layer(s). The function should return a tf.Tensor.
  • output_nonlinearity (callable) – Activation function for output dense layer. It should return a tf.Tensor. Set it to None to maintain a linear activation.
  • output_w_init (callable) – Initializer function for the weight of output dense layer(s). The function should return a tf.Tensor.
  • output_b_init (callable) – Initializer function for the bias of output dense layer(s). The function should return a tf.Tensor.
  • layer_normalization (bool) – Bool for using layer normalization or not.
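
Conceptually, a categorical policy turns the network output into action probabilities with a softmax and then samples a discrete action. The NumPy sketch below illustrates this under assumed logits; the `prob` key in `agent_info` is an assumption, not a documented return value.

```python
import numpy as np


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, -1.0])      # assumed network output for one observation
prob = softmax(logits)
# Sample an action index according to the categorical distribution.
action = int(rng.choice(len(prob), p=prob))
agent_info = {'prob': prob}              # hypothetical key, for illustration only
```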
dist_info(obs, state_infos=None)[source]

Distribution info.

dist_info_sym(obs_var, state_info_vars=None, name=None)[source]

Symbolic graph of the distribution.


Policy distribution.


Return a single action.


Return multiple actions.


Get regularizable weight variables under the Policy scope.


Vectorized or not.

class ContinuousMLPPolicy(env_spec, name='ContinuousMLPPolicy', hidden_sizes=(64, 64), hidden_nonlinearity=<function relu>, hidden_w_init=<tensorflow.python.ops.init_ops.GlorotUniform object>, hidden_b_init=<tensorflow.python.ops.init_ops.Zeros object>, output_nonlinearity=<function tanh>, output_w_init=<tensorflow.python.ops.init_ops.GlorotUniform object>, output_b_init=<tensorflow.python.ops.init_ops.Zeros object>, input_include_goal=False, layer_normalization=False)[source]


Continuous MLP Policy Network.

The policy network selects an action based on the state of the environment. It uses a neural network to fit the function pi(s).

  • env_spec (garage.envs.env_spec.EnvSpec) – Environment specification.
  • name (str) – Policy name, also the variable scope.
  • hidden_sizes (list[int]) – Output dimension of dense layer(s). For example, (32, 32) means the MLP of this policy consists of two hidden layers, each with 32 hidden units.
  • hidden_nonlinearity (callable) – Activation function for intermediate dense layer(s). It should return a tf.Tensor. Set it to None to maintain a linear activation.
  • hidden_w_init (callable) – Initializer function for the weight of intermediate dense layer(s). The function should return a tf.Tensor.
  • hidden_b_init (callable) – Initializer function for the bias of intermediate dense layer(s). The function should return a tf.Tensor.
  • output_nonlinearity (callable) – Activation function for output dense layer. It should return a tf.Tensor. Set it to None to maintain a linear activation.
  • output_w_init (callable) – Initializer function for the weight of output dense layer(s). The function should return a tf.Tensor.
  • output_b_init (callable) – Initializer function for the bias of output dense layer(s). The function should return a tf.Tensor.
  • input_include_goal (bool) – Include goal in the observation or not.
  • layer_normalization (bool) – Bool for using layer normalization or not.

Return a clone of the policy.

It only copies the configuration of the policy, not the parameters.

Parameters:name (str) – Name of the newly created policy.
Returns:Clone of this object
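
The difference between copying configuration and copying parameters can be illustrated with a toy class. `TinyLinearPolicy` is hypothetical and unrelated to the library's classes; it only shows that a clone built from the same constructor arguments has the same architecture but its own, separately initialized weights.

```python
import numpy as np


class TinyLinearPolicy:
    """Hypothetical policy used only to illustrate clone semantics."""

    def __init__(self, name, obs_dim, action_dim, seed=0):
        self.name = name
        self._obs_dim = obs_dim
        self._action_dim = action_dim
        rng = np.random.default_rng(seed)
        # Parameters are created fresh at construction time.
        self.weights = rng.standard_normal((obs_dim, action_dim))

    def clone(self, name):
        # Reuse the constructor arguments (the configuration), but not
        # the current weight values.
        return TinyLinearPolicy(name, self._obs_dim, self._action_dim, seed=1)


policy = TinyLinearPolicy('policy', obs_dim=3, action_dim=2)
target = policy.clone('target_policy')
```

If the clone is meant to start from the same values, the parameters have to be copied explicitly afterwards, for example by reading them from one object and setting them on the other.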

Get single action from this policy for the input observation.

Parameters:observation (numpy.ndarray) – Observation from environment.
Returns:Predicted action (numpy.ndarray), followed by an empty dict, since this policy does not model a distribution.
Return type:tuple[numpy.ndarray, dict]
get_action_sym(obs_var, name=None)[source]

Symbolic graph of the action.

  • obs_var (tf.Tensor) – Tensor input for symbolic graph.
  • name (str) – Name for symbolic graph.

symbolic graph of the action.

Return type:



Get multiple actions from this policy for the input observations.

Parameters:observations (numpy.ndarray) – Observations from environment.
Returns:Predicted actions (numpy.ndarray), followed by an empty dict, since this policy does not model a distribution.
Return type:tuple[numpy.ndarray, dict]

Get regularizable weight variables under the Policy scope.

Returns:List of regularizable variables.
Return type:list(tf.Variable)

Vectorized or not.

Returns:vectorized or not.
Return type:bool
class DiscreteQfDerivedPolicy(env_spec, qf, name='DiscreteQfDerivedPolicy')[source]


DiscreteQfDerived policy.

  • env_spec (garage.envs.env_spec.EnvSpec) – Environment specification.
  • qf (garage.q_functions.QFunction) – The q-function used.
  • name (str) – Name of the policy.

Get action from this policy for the input observation.

Parameters:observation (numpy.ndarray) – Observation from environment.
Returns:Single optimal action from this policy.

Get actions from this policy for the input observations.

Parameters:observations (numpy.ndarray) – Observations from environment.
Returns:Optimal actions from this policy.
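
A policy derived from a discrete Q-function presumably picks, for each observation, the action with the highest Q-value. The sketch below shows that greedy selection on an assumed table of Q-values; `greedy_actions` is a hypothetical helper, not the library's method.

```python
import numpy as np


def greedy_actions(q_values):
    """Pick the highest-valued action for each row of Q-values."""
    return np.argmax(q_values, axis=1)


# One row per observation, one column per discrete action (assumed values).
q_values = np.array([[0.1, 0.9, 0.3],
                     [1.2, 0.4, 0.8]])
actions = greedy_actions(q_values)       # array([1, 0])
```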

Vectorized or not.

class GaussianGRUPolicy(env_spec, hidden_dim=32, name='GaussianGRUPolicy', hidden_nonlinearity=<function tanh>, hidden_w_init=<tensorflow.python.ops.init_ops.GlorotUniform object>, hidden_b_init=<tensorflow.python.ops.init_ops.Zeros object>, recurrent_nonlinearity=<function sigmoid>, recurrent_w_init=<tensorflow.python.ops.init_ops.GlorotUniform object>, output_nonlinearity=None, output_w_init=<tensorflow.python.ops.init_ops.GlorotUniform object>, output_b_init=<tensorflow.python.ops.init_ops.Zeros object>, hidden_state_init=<tensorflow.python.ops.init_ops.Zeros object>, hidden_state_init_trainable=False, learn_std=True, std_share_network=False, init_std=1.0, layer_normalization=False, state_include_action=True)[source]


Models the action distribution using a Gaussian parameterized by a GRU.

  • env_spec (garage.envs.env_spec.EnvSpec) – Environment specification.
  • name (str) – Model name, also the variable scope.
  • hidden_dim (int) – Hidden dimension for GRU cell for mean.
  • hidden_nonlinearity (Callable) – Activation function for intermediate dense layer(s). It should return a tf.Tensor. Set it to None to maintain a linear activation.
  • hidden_w_init (Callable) – Initializer function for the weight of intermediate dense layer(s). The function should return a tf.Tensor.
  • hidden_b_init (Callable) – Initializer function for the bias of intermediate dense layer(s). The function should return a tf.Tensor.
  • recurrent_nonlinearity (Callable) – Activation function for recurrent layers. It should return a tf.Tensor. Set it to None to maintain a linear activation.
  • recurrent_w_init (Callable) – Initializer function for the weight of recurrent layer(s). The function should return a tf.Tensor.
  • output_nonlinearity (Callable) – Activation function for output dense layer. It should return a tf.Tensor. Set it to None to maintain a linear activation.
  • output_w_init (Callable) – Initializer function for the weight of output dense layer(s). The function should return a tf.Tensor.
  • output_b_init (Callable) – Initializer function for the bias of output dense layer(s). The function should return a tf.Tensor.
  • hidden_state_init (Callable) – Initializer function for the initial hidden state. The function should return a tf.Tensor.
  • hidden_state_init_trainable (bool) – Bool for whether the initial hidden state is trainable.
  • learn_std (bool) – Is std trainable.
  • std_share_network (bool) – Boolean for whether mean and std share the same network.
  • init_std (float) – Initial value for std.
  • layer_normalization (bool) – Bool for using layer normalization or not.
  • state_include_action (bool) – Whether the state includes action. If True, input dimension will be (observation dimension + action dimension).
dist_info_sym(obs_var, state_info_vars, name=None)[source]

Build a symbolic graph of the distribution parameters.

  • obs_var (tf.Tensor) – Tensor input for symbolic graph.
  • state_info_vars (dict) – Extra state information, e.g. previous action.
  • name (str) – Name for symbolic graph.

Outputs of the symbolic graph of distribution


Return type:



Policy distribution.

Get a single action from this policy for the input observation.

Parameters:observation (numpy.ndarray) – Observation from environment.
Returns:Predicted action and agent info.
  • action (numpy.ndarray) – Predicted action.
  • agent_info (dict) – Distribution obtained after observing the given observation, with keys mean (numpy.ndarray), log_std (numpy.ndarray), and prev_action (numpy.ndarray). The prev_action key is only present if self._state_include_action is True.
Return type:tuple[numpy.ndarray, dict]
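
Given the mean and log_std reported in the agent info, an action can be understood as a draw from a diagonal Gaussian. The sketch below shows the usual reparameterized form, mean plus exp(log_std) times standard normal noise; the numeric values are assumed and the helper is hypothetical.

```python
import numpy as np


def sample_gaussian_action(mean, log_std, rng):
    """Draw one action from a diagonal Gaussian with the given parameters."""
    noise = rng.standard_normal(mean.shape)
    return mean + np.exp(log_std) * noise


rng = np.random.default_rng(0)
mean = np.array([0.5, -1.0])
log_std = np.array([0.0, -20.0])         # second dimension is nearly deterministic
action = sample_gaussian_action(mean, log_std, rng)
agent_info = {'mean': mean, 'log_std': log_std}
```

Because exp(-20) is tiny, the second action component stays essentially at its mean, while the first varies with unit standard deviation.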

Get multiple actions from this policy for the input observations.

Parameters:observations (numpy.ndarray) – Observations from environment.
Returns:Predicted actions and agent infos.
  • actions (numpy.ndarray) – Predicted actions.
  • agent_infos (dict) – Distribution obtained after observing the given observations, with keys mean (numpy.ndarray), log_std (numpy.ndarray), and prev_action (numpy.ndarray). The prev_action key is only present if self._state_include_action is True.
Return type:tuple[numpy.ndarray, dict]

Whether this policy is recurrent or not.


Reset the policy.


If dones is None, it defaults to np.array([True]), which implies the policy is not “vectorized”, i.e. the number of parallel environments used to sample training data is 1.

Parameters:dones (numpy.ndarray) – Bool that indicates terminal state(s).

State info specification.


Whether the policy is vectorized or not.

class GaussianLSTMPolicy(env_spec, hidden_dim=32, name='GaussianLSTMPolicy', hidden_nonlinearity=<function tanh>, hidden_w_init=<tensorflow.python.ops.init_ops.GlorotUniform object>, hidden_b_init=<tensorflow.python.ops.init_ops.Zeros object>, recurrent_nonlinearity=<function sigmoid>, recurrent_w_init=<tensorflow.python.ops.init_ops.GlorotUniform object>, output_nonlinearity=None, output_w_init=<tensorflow.python.ops.init_ops.GlorotUniform object>, output_b_init=<tensorflow.python.ops.init_ops.Zeros object>, hidden_state_init=<tensorflow.python.ops.init_ops.Zeros object>, hidden_state_init_trainable=False, cell_state_init=<tensorflow.python.ops.init_ops.Zeros object>, cell_state_init_trainable=False, forget_bias=True, learn_std=True, std_share_network=False, init_std=1.0, layer_normalization=False, state_include_action=True)[source]


A policy which models actions with a Gaussian parameterized by an LSTM.

  • env_spec (garage.envs.env_spec.EnvSpec) – Environment specification.
  • name (str) – Model name, also the variable scope.
  • hidden_dim (int) – Hidden dimension for LSTM cell for mean.
  • hidden_nonlinearity (Callable) – Activation function for intermediate dense layer(s). It should return a tf.Tensor. Set it to None to maintain a linear activation.
  • hidden_w_init (Callable) – Initializer function for the weight of intermediate dense layer(s). The function should return a tf.Tensor.
  • hidden_b_init (Callable) – Initializer function for the bias of intermediate dense layer(s). The function should return a tf.Tensor.
  • recurrent_nonlinearity (Callable) – Activation function for recurrent layers. It should return a tf.Tensor. Set it to None to maintain a linear activation.
  • recurrent_w_init (Callable) – Initializer function for the weight of recurrent layer(s). The function should return a tf.Tensor.
  • output_nonlinearity (Callable) – Activation function for output dense layer. It should return a tf.Tensor. Set it to None to maintain a linear activation.
  • output_w_init (Callable) – Initializer function for the weight of output dense layer(s). The function should return a tf.Tensor.
  • output_b_init (Callable) – Initializer function for the bias of output dense layer(s). The function should return a tf.Tensor.
  • hidden_state_init (Callable) – Initializer function for the initial hidden state. The function should return a tf.Tensor.
  • hidden_state_init_trainable (bool) – Bool for whether the initial hidden state is trainable.
  • cell_state_init (Callable) – Initializer function for the initial cell state. The function should return a tf.Tensor.
  • cell_state_init_trainable (bool) – Bool for whether the initial cell state is trainable.
  • forget_bias (bool) – If True, add 1 to the bias of the forget gate at initialization. It’s used to reduce the scale of forgetting at the beginning of the training.
  • learn_std (bool) – Is std trainable.
  • std_share_network (bool) – Boolean for whether mean and std share the same network.
  • init_std (float) – Initial value for std.
  • layer_normalization (bool) – Bool for using layer normalization or not.
  • state_include_action (bool) – Whether the state includes action. If True, input dimension will be (observation dimension + action dimension).
dist_info_sym(obs_var, state_info_vars, name=None)[source]

Build a symbolic graph of the action distribution parameters.

  • obs_var (tf.Tensor) – Tensor input for symbolic graph.
  • state_info_vars (dict) – Extra state information, e.g. previous action.
  • name (str) – Name for symbolic graph.

Output of the symbolic graph of action distribution parameters.

Return type:



Policy distribution.

Get single action from this policy for the input observation.

Parameters:observation (numpy.ndarray) – Observation from environment.
Returns:Predicted action and agent information.
  • action (numpy.ndarray) – Predicted action.
  • agent_info (dict) – Distribution obtained after observing the given observation, with keys mean (numpy.ndarray), log_std (numpy.ndarray), and prev_action (numpy.ndarray). The prev_action key is only present if self._state_include_action is True.
Return type:tuple[numpy.ndarray, dict]

Get multiple actions from this policy for the input observations.

Parameters:observations (numpy.ndarray) – Observations from environment.
Returns:Predicted actions and agent information.
  • actions (numpy.ndarray) – Predicted actions.
  • agent_infos (dict) – Distribution obtained after observing the given observations, with keys mean (numpy.ndarray), log_std (numpy.ndarray), and prev_action (numpy.ndarray). The prev_action key is only present if self._state_include_action is True.
Return type:tuple[numpy.ndarray, dict]

Whether this policy is recurrent or not.


Reset the policy.


If dones is None, it defaults to np.array([True]), which implies the policy is not “vectorized”, i.e. the number of parallel environments used to sample training data is 1.

Parameters:dones (numpy.ndarray) – Bool that indicates terminal state(s).

State info specification.


Whether this policy is vectorized.

class GaussianMLPPolicy(env_spec, name='GaussianMLPPolicy', hidden_sizes=(32, 32), hidden_nonlinearity=<function tanh>, hidden_w_init=<tensorflow.python.ops.init_ops.GlorotUniform object>, hidden_b_init=<tensorflow.python.ops.init_ops.Zeros object>, output_nonlinearity=None, output_w_init=<tensorflow.python.ops.init_ops.GlorotUniform object>, output_b_init=<tensorflow.python.ops.init_ops.Zeros object>, learn_std=True, adaptive_std=False, std_share_network=False, init_std=1.0, min_std=1e-06, max_std=None, std_hidden_sizes=(32, 32), std_hidden_nonlinearity=<function tanh>, std_output_nonlinearity=None, std_parameterization='exp', layer_normalization=False)[source]


GaussianMLPPolicy with GaussianMLPModel.

A policy that contains an MLP to make predictions based on a Gaussian distribution.

  • env_spec (garage.envs.env_spec.EnvSpec) – Environment specification.
  • name (str) – Model name, also the variable scope.
  • hidden_sizes (list[int]) – Output dimension of dense layer(s) for the MLP for mean. For example, (32, 32) means the MLP consists of two hidden layers, each with 32 hidden units.
  • hidden_nonlinearity (callable) – Activation function for intermediate dense layer(s). It should return a tf.Tensor. Set it to None to maintain a linear activation.
  • hidden_w_init (callable) – Initializer function for the weight of intermediate dense layer(s). The function should return a tf.Tensor.
  • hidden_b_init (callable) – Initializer function for the bias of intermediate dense layer(s). The function should return a tf.Tensor.
  • output_nonlinearity (callable) – Activation function for output dense layer. It should return a tf.Tensor. Set it to None to maintain a linear activation.
  • output_w_init (callable) – Initializer function for the weight of output dense layer(s). The function should return a tf.Tensor.
  • output_b_init (callable) – Initializer function for the bias of output dense layer(s). The function should return a tf.Tensor.
  • learn_std (bool) – Is std trainable.
  • adaptive_std (bool) – Is std a neural network. If False, it will be a parameter.
  • std_share_network (bool) – Boolean for whether mean and std share the same network.
  • init_std (float) – Initial value for std.
  • std_hidden_sizes (list[int]) – Output dimension of dense layer(s) for the MLP for std. For example, (32, 32) means the MLP consists of two hidden layers, each with 32 hidden units.
  • min_std (float) – If not None, the std is at least the value of min_std, to avoid numerical issues.
  • max_std (float) – If not None, the std is at most the value of max_std, to avoid numerical issues.
  • std_hidden_nonlinearity – Nonlinearity for each hidden layer in the std network.
  • std_output_nonlinearity – Nonlinearity for output layer in the std network.
  • std_parameterization (str) – How the std should be parameterized. The options are:
      • exp – the logarithm of the std is stored, and an exponential transformation is applied to recover the std.
      • softplus – the std is computed as log(1+exp(x)).
  • layer_normalization (bool) – Bool for using layer normalization or not.
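
The two std parameterizations, together with min_std and max_std, can be sketched in a few lines of plain Python. The function names below are hypothetical, and clipping the resulting std (rather than the stored parameter) is a simplifying assumption made for this sketch.

```python
import math


def std_from_param(param, parameterization, min_std=1e-6, max_std=None):
    """Map a stored parameter to a standard deviation."""
    if parameterization == 'exp':
        std = math.exp(param)                 # param stores log(std)
    elif parameterization == 'softplus':
        std = math.log1p(math.exp(param))     # log(1 + exp(param))
    else:
        raise ValueError("expected 'exp' or 'softplus'")
    if min_std is not None:
        std = max(std, min_std)
    if max_std is not None:
        std = min(std, max_std)
    return std


def param_from_init_std(init_std, parameterization):
    """Invert the mapping so that the initial std equals init_std."""
    if parameterization == 'exp':
        return math.log(init_std)
    return math.log(math.exp(init_std) - 1.0)  # inverse of softplus


exp_std = std_from_param(param_from_init_std(1.0, 'exp'), 'exp')
softplus_std = std_from_param(param_from_init_std(1.0, 'softplus'), 'softplus')
floored_std = std_from_param(-100.0, 'exp')            # clipped up to min_std
capped_std = std_from_param(5.0, 'exp', max_std=2.0)   # clipped down to max_std
```

Both parameterizations keep the std positive for any real-valued parameter; they differ in how quickly the std grows as the parameter increases.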

dist_info_sym(obs_var, state_info_vars=None, name='default')[source]

Symbolic graph of the distribution.


Policy distribution.


Get action from the policy.


Get actions from the policy.


Get the trainable variables.


Vectorized or not.