Utility functions related to sampling.
rollout(env, agent, *, max_episode_length=np.inf, animated=False, speedup=1, deterministic=False)
Sample a single episode of the agent in the environment.
agent (Policy) – Agent used to select actions.
env (Environment) – Environment to perform actions in.
max_episode_length (int) – If the episode reaches this many timesteps, it is truncated.
animated (bool) – If true, render the environment after each step.
speedup (float) – Factor by which to decrease the wait time between rendered steps. Only relevant if animated is True.
deterministic (bool) – If true, use the mean action returned by the stochastic policy instead of sampling from the returned action distribution.
- Dictionary, with keys:
- observations(np.array): Flattened array of observations.
There should be one more of these than actions. Note that observations[i] (for i < len(observations) - 1) was used by the agent to choose actions[i]. Should have shape (T + 1, S^*) (the unflattened state space of the current environment).
- actions(np.array): Non-flattened array of actions. Should have
shape (T, S^*) (the unflattened action space of the current environment).
- rewards(np.array): Array of rewards of shape (T,) (a 1D array of length T).
- agent_infos(Dict[str, np.array]): Dictionary of stacked,
non-flattened agent_info arrays.
- env_infos(Dict[str, np.array]): Dictionary of stacked,
non-flattened env_info arrays.
- episode_infos(Dict[str, np.array]): Dictionary of stacked,
non-flattened episode_info arrays.
- dones(np.array): Array of termination signals.
- Return type: dict
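The sampling loop and the shapes of the returned arrays can be illustrated with a minimal sketch. It is not the library's implementation: `ToyEnv`, `ToyAgent`, `stack_infos` and `sketch_rollout` are hypothetical names, the `reset`/`step`/`get_action` interface is an assumption made for illustration, and rendering (`animated`, `speedup`) is left out.

```python
import numpy as np


class ToyEnv:
    """Hypothetical environment: 2-D observations, terminates after 5 steps."""

    def reset(self):
        self._t = 0
        return np.zeros(2)

    def step(self, action):
        self._t += 1
        obs = np.full(2, float(self._t))
        reward = float(np.sum(action))
        done = self._t >= 5
        return obs, reward, done, {"t": self._t}


class ToyAgent:
    """Hypothetical agent: always proposes the same 1-D action."""

    def get_action(self, obs):
        mean = np.ones(1)
        return mean.copy(), {"mean": mean}


def stack_infos(infos):
    """Convert a list of info dicts into a dict of stacked arrays."""
    if not infos:
        return {}
    return {key: np.asarray([info[key] for info in infos]) for key in infos[0]}


def sketch_rollout(env, agent, *, max_episode_length=np.inf,
                   deterministic=False):
    """Sample one episode and return it as a dictionary of arrays."""
    observations = [env.reset()]
    actions, rewards, dones = [], [], []
    agent_infos, env_infos = [], []
    # Truncate the episode once max_episode_length steps have been taken.
    while len(actions) < max_episode_length:
        # observations[i] is what the agent sees when choosing actions[i].
        action, agent_info = agent.get_action(observations[-1])
        if deterministic and "mean" in agent_info:
            # Assumption: the agent reports its mean action under "mean".
            action = agent_info["mean"]
        obs, reward, done, env_info = env.step(action)
        observations.append(obs)
        actions.append(action)
        rewards.append(reward)
        dones.append(done)
        agent_infos.append(agent_info)
        env_infos.append(env_info)
        if done:
            break
    return {
        "observations": np.asarray(observations),  # shape (T + 1, ...)
        "actions": np.asarray(actions),            # shape (T, ...)
        "rewards": np.asarray(rewards),            # shape (T,)
        "agent_infos": stack_infos(agent_infos),
        "env_infos": stack_infos(env_infos),
        "dones": np.asarray(dones),                # shape (T,)
    }
```

With these toy classes, `sketch_rollout(ToyEnv(), ToyAgent(), max_episode_length=3)` stops after three steps, so `observations` has shape (4, 2) and `actions` has shape (3, 1): one more observation than actions, as described above.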
Truncate the paths so that the total number of samples is max_samples.
This is done by removing extra paths from the end of the list and shortening the last remaining path if necessary.
paths (list[dict[str, np.ndarray]]) – Samples, items with keys:
- observations (np.ndarray): Environment observations
- actions (np.ndarray): Agent actions
- rewards (np.ndarray): Environment rewards
- env_infos (dict): Environment state information
- agent_infos (dict): Agent state information
max_samples (int) – Maximum number of samples allowed.
- A list of paths, truncated so that the number of samples adds up to max_samples.
- Return type: list[dict[str, np.ndarray]]
ValueError – If a key other than ‘observations’, ‘actions’, ‘rewards’, ‘env_infos’ and ‘agent_infos’ is found.
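The truncation described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the library's code: the name `sketch_truncate_paths` is hypothetical, every array in a path is assumed to have one entry per sample, and the key check is only applied to the path that gets shortened.

```python
import numpy as np


def sketch_truncate_paths(paths, max_samples):
    """Return a copy of `paths` holding at most `max_samples` samples."""
    paths = list(paths)  # shallow copy so the caller's list is untouched
    total = sum(len(path["rewards"]) for path in paths)

    # Remove whole paths from the end while the remainder still
    # contains at least max_samples samples.
    while paths and total - len(paths[-1]["rewards"]) >= max_samples:
        total -= len(paths.pop()["rewards"])

    # Shorten the last remaining path if there is still an excess.
    if paths and total > max_samples:
        last = paths.pop()
        keep = len(last["rewards"]) - (total - max_samples)
        truncated = {}
        for key, value in last.items():
            if key in ("observations", "actions", "rewards"):
                truncated[key] = value[:keep]
            elif key in ("env_infos", "agent_infos"):
                # Info dictionaries hold one array per entry.
                truncated[key] = {k: v[:keep] for k, v in value.items()}
            else:
                raise ValueError("Unexpected key in path: {}".format(key))
        paths.append(truncated)
    return paths


def make_path(n):
    """Build a hypothetical path with n samples for demonstration."""
    return {
        "observations": np.zeros((n, 2)),
        "actions": np.zeros((n, 1)),
        "rewards": np.zeros(n),
        "env_infos": {"t": np.arange(n)},
        "agent_infos": {"mean": np.zeros((n, 1))},
    }


paths = [make_path(3), make_path(4), make_path(5)]
truncated = sketch_truncate_paths(paths, max_samples=5)
lengths = [len(path["rewards"]) for path in truncated]
```

Here the third path is dropped entirely and the second is cut from four samples to two, so `lengths` is `[3, 2]` and the total equals `max_samples`.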