The Autonomous Learning Library¶
The Autonomous Learning Library is a PyTorch-based toolkit for building and evaluating reinforcement learning agents.
Here are some common links:
The GitHub repository.
The Getting Started guide.
An example project to help you get started building your own agents.
The Benchmark Performance for our preset agents.
The all.presets documentation, including default hyperparameters.
Enjoy!
Getting Started¶
Prerequisites¶
The Autonomous Learning Library requires a recent version of PyTorch (>= 1.3). Additionally, Tensorboard is required in order to enable logging. We recommend installing these through Conda. We also strongly recommend using a machine with a fast GPU (a GTX 970 or better).
Installation¶
The autonomous-learning-library
can be installed from PyPI using pip
:
pip install autonomous-learning-library
If you don’t have PyTorch or Tensorboard previously installed, you can install them using:
pip install autonomous-learning-library[pytorch]
An alternate approach, which may be useful when following this tutorial, is to instead install by cloning the GitHub repository:
git clone https://github.com/cpnota/autonomous-learning-library.git
cd autonomous-learning-library
pip install -e .["dev"]
If you chose to clone the repository, you can test your installation by running the unit test suite:
make test
This should also tell you if CUDA (the GPU driver) is available.
Running a Preset Agent¶
The goal of the Autonomous Learning Library is to provide components for building new agents. However, the library also includes a number of “preset” agent configurations for easy benchmarking and comparison, as well as some useful scripts. For example, an A2C agent can be run on CartPole as follows:
all-classic CartPole-v0 a2c
The results will be written to runs/a2c_<id>, where <id> is a string generated by the library.
You can view these results and other information through tensorboard:
tensorboard --logdir runs
By opening your browser to <http://localhost:6006>, you should see a dashboard that looks something like the following (you may need to adjust the “smoothing” parameter):
If you want to compare agents in a nicer format, you can use the plot script:
all-plot --logdir runs
This should give you a plot similar to the following:
In this plot, each point represents the average of the episodic returns over the last 100 episodes. The shaded region represents the standard deviation over that interval.
Finally, to watch the trained model in action, we provide a watch script for each preset module:
all-watch-classic CartPole-v0 "runs/a2c_<id>"
You need to find the <id> by checking the runs directory.
Each of these scripts can be found in the scripts directory of the main repository.
Be sure to check out the atari and continuous scripts for more fun!
Basic Concepts¶
In this section, we explain the basic elements of the autonomous-learning-library
and the philosophy behind some of its basic design decisions.
Agent-Based Design¶
One of the core philosophies of the autonomous-learning-library is that RL should be agent-based, not algorithm-based.
To see what we mean by this, check out the OpenAI Baselines implementation of DQN.
There’s a giant function called learn
which accepts an environment and a bunch of hyperparameters, at the heart of which is a control loop that calls many different functions.
Which part of this function is the agent? Which part is the environment? Which part is something else?
We call this implementation algorithm-based because the central abstraction is a function called learn
which provides the complete specification of an algorithm.
What, then, should the proper abstraction for an agent be? We need look no further than the following famous diagram:
The definition of an Agent
is simple.
It accepts a state and a reward and returns an action.
That’s it.
Everything else is an implementation detail.
Here’s the Agent
interface in the autonomous-learning-library:
class Agent(ABC):
    @abstractmethod
    def act(self, state, reward):
        pass

    @abstractmethod
    def eval(self, state, reward):
        pass
The act
function is called when training the agent.
The eval
function is called when evaluating the agent, e.g., after a training run has completed.
When and how the Agent
trains inside of this function is nobody’s business except the Agent
itself.
When the Agent
is allowed to act is determined by some outer control loop, and is not of concern to the Agent
.
What might an implementation of act
look like? Here’s the act function from our DQN implementation:
def act(self, state, reward):
    self._store_transition(self._state, self._action, reward, state)
    self._train()
    self._state = state
    self._action = self.policy(state)
    return self._action
That’s it. _store_transition()
and _train()
are private helper methods.
There is no reason for the control loop to know anything about these details.
There is no tight coupling between the Agent
and the control loop.
This approach simplifies both our Agent
implementation and the control loop itself.
Separating the control loop logic from the Agent
logic allows greater flexibility in the way agents are used.
In fact, Agent
is entirely decoupled from the Environment
interface.
This means that our agents can be used outside of standard research environments, such as part of a REST API, a multi-agent system, etc.
Any code that passes a State
and a scalar reward is compatible with our agents.
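To make this decoupling concrete, here is a minimal sketch (not library code) of an agent that satisfies the interface: a uniform-random agent driven by nothing more than a state and a reward. The Agent base class is reproduced locally so the example is self-contained; RandomAgent is a hypothetical name.

```python
from abc import ABC, abstractmethod
import random

# Local stand-in for all.agents.Agent, reproduced so the example is self-contained.
class Agent(ABC):
    @abstractmethod
    def act(self, state, reward):
        pass

    @abstractmethod
    def eval(self, state, reward):
        pass

# A trivial agent: ignores the state and reward and picks a random action.
class RandomAgent(Agent):
    def __init__(self, n_actions):
        self.n_actions = n_actions

    def act(self, state, reward):
        # A real agent would store the transition and train here.
        return random.randrange(self.n_actions)

    def eval(self, state, reward):
        return random.randrange(self.n_actions)

# Any code that passes a state and a scalar reward can drive the agent.
agent = RandomAgent(n_actions=2)
action = agent.act(state=None, reward=0.0)
```

When and how training happens stays entirely inside act; the calling code never sees it.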
Function Approximation¶
Almost everything a deep reinforcement learning agent does is predicated on function approximation.
For this reason, one of the central abstractions in the autonomous-learning-library is Approximation.
By building agents that rely on the Approximation
abstraction rather than directly interfacing with PyTorch Module
and Optimizer
objects,
we can add to or modify the functionality of an Agent
without altering its source code (this is known as the Open-Closed Principle).
The default Approximation
object allows us to achieve a high level of code reuse by encapsulating common functionality such as logging, model checkpointing, target networks, learning rate schedules and gradient clipping.
The Approximation
object in turn relies on a set of abstractions that allow users to alter its behavior.
Let’s look at a simple usage of Approximation
in solving a very easy supervised learning task:
import torch
from torch import nn, optim
from all.approximation import Approximation

# create a pytorch module
model = nn.Linear(16, 1)
# create an associated pytorch optimizer
optimizer = optim.Adam(model.parameters(), lr=1e-2)
# create the function approximator
f = Approximation(model, optimizer)

for _ in range(200):
    # Generate some arbitrary data.
    # We'll approximate a very simple function:
    # the sum of the input features.
    x = torch.randn((16, 16))
    y = x.sum(1, keepdim=True)
    # forward pass
    y_hat = f(x)
    # compute loss
    loss = nn.functional.mse_loss(y_hat, y)
    # backward pass
    f.reinforce(loss)
Easy! Now let’s look at the _train() function for our DQN agent:
def _train(self):
    if self._should_train():
        (states, actions, rewards, next_states, _) = self.replay_buffer.sample(self.minibatch_size)
        # forward pass
        values = self.q(states, actions)
        targets = rewards + self.discount_factor * torch.max(self.q.target(next_states), dim=1)[0]
        # compute loss
        loss = mse_loss(values, targets)
        # backward pass
        self.q.reinforce(loss)
Just as easy!
The agent does not need to know anything about the network architecture, logging, regularization, etc.
These are all handled through the appropriate configuration of Approximation
.
Instead, the Agent
implementation is able to focus exclusively on its sole purpose: defining the RL algorithm itself.
By encapsulating these details in Approximation
, we are able to follow the single responsibility principle.
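The target computation in _train() above is ordinary Q-learning; stripped of tensors and batching, the arithmetic for a single transition is just the following illustrative sketch (the numbers are made up):

```python
# One-sample Q-learning target in plain Python (illustrative only).
discount_factor = 0.99
reward = 1.0
q_next = [0.2, 0.5, 0.1]  # target network's value estimates for the next state
target = reward + discount_factor * max(q_next)  # reward + gamma * max_a Q(s', a)
```

In the real agent, the same expression is computed over a whole minibatch at once.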
A few other quick things to note: f.no_grad(x)
runs a forward pass with torch.no_grad()
, speeding computation.
f.eval(x)
does the same, but also puts the model in eval mode first (affecting, e.g., BatchNorm or Dropout layers), and then puts the model back into its previous mode before returning.
f.target(x)
calls the target network (an advanced concept used in algorithms such as DQN; see, for example, David Silver's course notes) associated with the Approximation
, also with torch.no_grad()
.
The autonomous-learning-library
provides a few thin wrappers over Approximation
for particular purposes, such as QNetwork
, VNetwork
, FeatureNetwork
, and several Policy
implementations.
Environments¶
The importance of the Environment
in reinforcement learning nearly goes without saying.
In the autonomous-learning-library, the prepackaged environments are simply wrappers for OpenAI Gym, the de facto standard library for RL environments.
We add a few additional features:
gym primarily uses numpy.array for representing states and actions. We automatically convert to and from torch.Tensor objects so that agent implementations need not consider the difference.
We add properties to the environment for state, reward, etc. This simplifies the control loop and is generally useful.
We apply common preprocessors, such as several standard Atari wrappers. However, where possible, we prefer to perform preprocessing using Body objects to maximize the flexibility of the agents.
Below, we show how several different types of environments can be created:
from all.environments import AtariEnvironment, GymEnvironment
# create an Atari environment on the gpu
env = AtariEnvironment('Breakout', device='cuda')

# create a classic control environment on the cpu
env = GymEnvironment('CartPole-v0')

# create a PyBullet environment on the cpu
import pybullet_envs
env = GymEnvironment('HalfCheetahBulletEnv-v0')
Now we can write our first control loop:
# initialize the environment
env.reset()

# Loop for some arbitrary number of timesteps.
for timestep in range(1000000):
    env.render()
    action = agent.act(env.state, env.reward)
    env.step(action)
    if env.done:
        # terminal update
        agent.act(env.state, env.reward)
        # reset the environment
        env.reset()
Of course, this control loop is not exactly feature-packed.
Generally, it’s better to use the Experiment
module described later.
Presets¶
In the autonomous-learning-library, agents are compositional, which means that the behavior of a given Agent depends on the behavior of several other objects.
Users can compose agents with specific behavior by passing appropriate objects into the constructors of the high-level algorithms contained in all.agents.
The library provides a number of functions that compose these objects in specific ways such that they are appropriate for a given set of environments.
We call such a function a preset
, and several such presets are contained in the all.presets
package.
(This is an example of the more general factory method pattern.)
For example, all.agents.vqn
contains a high-level description of a vanilla Q-learning algorithm.
In order to actually apply this agent to a problem, for example, a classic control problem, we might define the following preset:
# The outer function signature contains the set of hyperparameters
def vqn(
    # Common settings
    device="cpu",
    # Hyperparameters
    discount_factor=0.99,
    lr=1e-2,
    exploration=0.1,
):
    # The inner function creates a closure over the hyperparameters passed into the outer function.
    # It accepts an "env" object which is passed right before the Experiment begins, as well as
    # the writer created by the Experiment which defines the logging parameters.
    def _vqn(env, writer=DummyWriter()):
        # create a pytorch model
        model = nn.Sequential(
            nn.Linear(env.state_space.shape[0], 64),
            nn.ReLU(),
            nn.Linear(64, env.action_space.n),
        ).to(device)
        # create a pytorch optimizer for the model
        optimizer = Adam(model.parameters(), lr=lr)
        # create an Approximation of the Q-function
        q = QNetwork(model, optimizer, writer=writer)
        # create a Policy object derived from the Q-function
        policy = GreedyPolicy(q, env.action_space.n, epsilon=exploration)
        # instantiate the agent
        return VQN(q, policy, discount_factor=discount_factor)

    # return the inner function
    return _vqn
Notice how there is an "outer" function and an "inner" function. This approach allows the separation of configuration and instantiation. While this may seem redundant, it can sometimes be useful. For example, suppose we want to run the same agent configuration on multiple environments. This can be done as follows:
agent = vqn()
some_custom_runner(agent(), GymEnvironment('CartPole-v0'))
some_custom_runner(agent(), GymEnvironment('MountainCar-v0'))
Now, each call to some_custom_runner
receives a unique instance of the agent.
This is sometimes achieved in other libraries by providing a “reset” function.
We find our approach allows us to keep the Agent
interface clean,
and is overall more elegant and less error prone.
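The configure-then-instantiate pattern itself is independent of the library. Here is a stripped-down sketch using hypothetical names (make_preset is not a library function); the outer function captures hyperparameters, and each call to the inner function yields a fresh instance:

```python
# Minimal sketch of the preset pattern (hypothetical names, not library code).
def make_preset(lr=1e-2):
    def _construct(env_name):
        # Stand-in for building the model, optimizer, and agent.
        return {"env": env_name, "lr": lr}
    return _construct

preset = make_preset(lr=1e-3)
a = preset("CartPole-v0")
b = preset("MountainCar-v0")
assert a is not b                  # each call yields a distinct instance
assert a["lr"] == b["lr"] == 1e-3  # both share the same configuration
```

Because construction happens inside the closure, no "reset" method is ever needed: stale state simply cannot leak between runs.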
Experiment¶
Finally, we have all of the components necessary to introduce the run_experiment helper function.
run_experiment is the built-in control loop for running reinforcement learning experiments.
It instantiates its own Writer object, which is then passed to each of the agents, and runs each agent on each environment passed to it for some number of timesteps (frames) or episodes.
Here is a quick example:
from gym import envs
from all.experiments import run_experiment
from all.presets import atari
from all.environments import AtariEnvironment
agents = [
    atari.dqn(),
    atari.ddqn(),
    atari.c51(),
    atari.rainbow(),
    atari.a2c(),
    atari.ppo(),
]
envs = [AtariEnvironment(env, device='cuda') for env in ['BeamRider', 'Breakout', 'Pong', 'Qbert', 'SpaceInvaders']]
run_experiment(agents, envs, 10e6)
The above block executes each run sequentially.
This could take a very long time, even on a fast GPU!
If you have access to a cluster running Slurm, you can replace run_experiment
with SlurmExperiment
to speed things up substantially (the magic of submitting jobs is handled behind the scenes).
By default, run_experiment
will write the results to ./runs
.
You can view the results in tensorboard
by running the following command:
tensorboard --logdir runs
In addition to the tensorboard
logs, every 100 episodes, the mean and standard deviation of the previous 100 episode returns are written to runs/[agent]/[env]/returns100.csv
.
This is much faster to read and plot than Tensorboard’s proprietary format.
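A returns100.csv-style file can be read with the standard library alone. The exact column layout may differ between library versions, so the sample data and column meanings below are assumptions for illustration only:

```python
import csv
import io

# Hypothetical contents of a returns100.csv-style file: frame, mean, std.
# The real column layout may differ, so treat this purely as a parsing sketch.
sample = "100,21.5,3.2\n200,35.0,4.1\n"
rows = [[float(v) for v in row] for row in csv.reader(io.StringIO(sample))]
frames = [r[0] for r in rows]
means = [r[1] for r in rows]
```

In practice you would open the real file with `open(...)` instead of `io.StringIO` and hand the columns to your plotting tool of choice.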
The library contains an automatic plotting utility that generates appropriate plots for an entire runs
directory as follows:
from all.experiments import plot_returns_100
plot_returns_100('./runs')
This will generate a plot that looks like the following (after tweaking the whitespace through the matplotlib
UI):
An optional parameter is test_episodes
, which is set to 100 by default.
After running for the given number of frames, the agent will be evaluated for a number of episodes specified by test_episodes
with training disabled.
This is useful for measuring the final performance of an agent.
You can also pass optional parameters to run_experiment
to change its behavior.
You can set render=True
to watch the agent during training (generally not recommended: it slows the agent considerably!).
You can set quiet=True
to silence command line output.
Lastly, you can set write_loss=False
to disable writing debugging information to tensorboard
.
These files can become large, so this is recommended if you have limited storage!
Finally, run_experiment
relies on an underlying Experiment
API.
If you don’t like the behavior of run_experiment
, you can reuse the underlying Experiment
objects to change it.
Building Your Own Agent¶
In the previous section, we discussed the basic components of the autonomous-learning-library.
While the library contains a selection of preset agents, the primary goal of the library is to be a tool to build your own agents.
To this end, we have provided an example project containing a new model predictive control variant of DQN to demonstrate the flexibility of the library.
Briefly, when creating your own agent, you will generally have the following components:
An agent.py file containing the high-level implementation of the Agent.
A model.py file containing the PyTorch models appropriate for your chosen domain.
A preset.py file that composes your Agent using the appropriate model and other objects.
A main.py or similar file that runs your agent and any autonomous-learning-library presets you wish to compare against.
While it is not necessary to follow this structure, we believe it will generally guide you towards using the autonomous-learning-library
in the intended manner and ensure that your code is understandable to other users of the library.
Benchmark Performance¶
Reinforcement learning algorithms are difficult to debug and test.
For this reason, in order to ensuring the correctness of the preset agents provided by the autonomouslearninglibrary
,
we benchmark each algorithm after every major change.
We also discuss the performance of our implementations relative to published results.
For our hyperparameters for each domain, see all.presets.
Atari Benchmark¶
To benchmark the all.presets.atari
presets, we ran each agent for 10 million timesteps (40 million ingame frames).
The learning rate was decayed over the course of training using cosine annealing.
The environment implementation uses the following wrappers:
NoopResetEnv (adds a random number of noops at the beginning of each game reset)
MaxAndSkipEnv (Repeats each action four times before the next agent observation. Takes the max pixel value over the four frames.)
FireResetEnv (Automatically chooses the “FIRE” action when env.reset() is called)
WarpFrame (Rescales the frame to 84x84 and greyscales the image)
LifeLostEnv (Adds a key to “info” indicating that a life was lost)
Additionally, we use the following agent “bodies”:
FrameStack (provides the last four frames as the state)
ClipRewards (Converts all rewards to {-1, 0, 1})
EpisodicLives (If life was lost, treats the frame as the end of an episode)
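The reward-clipping body above reduces to taking the sign of the raw reward; a minimal sketch (clip_reward is a hypothetical name, not the library's implementation):

```python
# Sign-based reward clipping, as performed by a ClipRewards-style body.
def clip_reward(reward):
    if reward > 0:
        return 1
    if reward < 0:
        return -1
    return 0

clipped = [clip_reward(r) for r in (12.0, -0.5, 0.0)]
```

Clipping keeps the scale of TD errors comparable across games with very different score magnitudes.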
The results were as follows:
For comparison, we look at the results published in the paper, Rainbow: Combining Improvements in Deep Reinforcement Learning:
In these results, the authors ran each agent for 50 million timesteps (200 million frames).
We can see that at the 10 million timestep mark, our results are similar or slightly better.
Our dqn
and ddqn
in particular were better almost across the board.
While there are almost certainly some minor implementation differences,
our agents achieved very similar behavior to the agents tested by DeepMind.
PyBullet Benchmark¶
PyBullet (https://pybullet.org/wordpress/) provides a free alternative to the popular MuJoCo robotics environments.
While MuJoCo requires a license key and can be difficult for independent researchers to afford, PyBullet is free and open.
Additionally, the PyBullet environments are widely considered more challenging, making them a more discriminant test bed.
For these reasons, we chose to benchmark the all.presets.continuous
presets using PyBullet.
Similar to the Atari benchmark, we ran each agent for 10 million timesteps (in this case, timesteps are equal to frames). The learning rate was decayed over the course of training using cosine annealing. To reduce the variance of the updates, we added an extra time feature to the state (t * 0.001, where t is the current timestep). The results were as follows:
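The extra time feature mentioned above amounts to appending t * 0.001 to the observation vector; a minimal sketch (add_time_feature is a hypothetical name):

```python
# Appending a scaled time feature to the state vector (illustrative sketch).
def add_time_feature(state, t):
    return list(state) + [t * 0.001]

augmented = add_time_feature([0.1, -0.3], t=500)
```

Giving the agent a notion of elapsed time reduces the variance introduced by fixed-length episode cutoffs.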
PPO was omitted from the plot for Humanoid because it achieved very large negative returns which interfered with the scale of the graph. Note, however, that our implementation of soft actorcritic (SAC) is able to solve even this difficult environment.
Because most research papers still use MuJoCo, direct comparisons are difficult to come by. However, George Sung helpfully benchmarked TD3 and DDPG on several PyBullet environments (https://github.com/georgesung/TD3). Note that he only ran each environment for 1 million timesteps and tuned his hyperparameters accordingly. Generally, our agents achieved higher final performance but converged more slowly.
all.agents¶

A reinforcement learning agent.
Advantage Actor-Critic (A2C).
A categorical DQN agent (C51).
Deep Deterministic Policy Gradient (DDPG).
Double Deep Q-Network (DDQN).
Deep Q-Network (DQN).
Proximal Policy Optimization (PPO).
Rainbow: Combining Improvements in Deep Reinforcement Learning.
Soft Actor-Critic (SAC).
Vanilla Actor-Critic (VAC).
Vanilla Policy Gradient (VPG/REINFORCE).
Vanilla Q-Network (VQN).
Vanilla SARSA (VSarsa).

class all.agents.A2C(features, v, policy, discount_factor=0.99, entropy_loss_scaling=0.01, n_envs=None, n_steps=4, writer=<all.logging.DummyWriter object>)¶
Bases: all.agents._agent.Agent
Advantage Actor-Critic (A2C). A2C is a policy gradient method in the actor-critic family. It is the synchronous variant of the Asynchronous Advantage Actor-Critic (A3C). The key distinguishing feature between A2C/A3C and prior actor-critic methods is the use of parallel actors interacting with a parallel set of environments. This mitigates the need for a replay buffer by providing a different mechanism for decorrelating samples. https://arxiv.org/abs/1602.01783
 Parameters
features (FeatureNetwork) – Shared feature layers.
v (VNetwork) – Value head which approximates the statevalue function.
policy (StochasticPolicy) – Policy head which outputs an action distribution.
discount_factor (float) – Discount factor for future rewards.
n_envs (int) – Number of parallel actors/environments
n_steps (int) – Number of timesteps per rollout. Updates are performed once per rollout.
writer (Writer) – Used for logging.

act(states)¶
Select an action for the current timestep and update internal parameters.
In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.
 Parameters
state (all.environment.State) – The environment state at the current timestep.
 Returns
The action to take at the current timestep.
 Return type
torch.Tensor

eval(states)¶
Select an action for the current timestep in evaluation mode.
Unlike act, this method should NOT update the internal parameters of the agent. Most of the time, this method should return the greedy action according to the current policy. This method is useful when using evaluation methodologies that distinguish between the performance of the agent during training and the performance of the resulting policy.
 Parameters
state (all.environment.State) – The environment state at the current timestep.
 Returns
The action to take at the current timestep.
 Return type
torch.Tensor

class all.agents.Agent¶
Bases: abc.ABC, all.optim.scheduler.Schedulable
A reinforcement learning agent.
In reinforcement learning, an Agent learns by interacting with an Environment. Usually, an agent tries to maximize a reward signal. It does this by observing environment "states", taking "actions", receiving "rewards", and in doing so, learning which state-action pairs correlate with high rewards. An Agent implementation should encapsulate some particular reinforcement learning algorithm.

abstract act(state)¶
Select an action for the current timestep and update internal parameters.
In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.
 Parameters
state (all.environment.State) – The environment state at the current timestep.
 Returns
The action to take at the current timestep.
 Return type
torch.Tensor

abstract eval(state)¶
Select an action for the current timestep in evaluation mode.
Unlike act, this method should NOT update the internal parameters of the agent. Most of the time, this method should return the greedy action according to the current policy. This method is useful when using evaluation methodologies that distinguish between the performance of the agent during training and the performance of the resulting policy.
 Parameters
state (all.environment.State) – The environment state at the current timestep.
 Returns
The action to take at the current timestep.
 Return type
torch.Tensor


class all.agents.C51(q_dist, replay_buffer, discount_factor=0.99, eps=1e-05, exploration=0.02, minibatch_size=32, replay_start_size=5000, update_frequency=1, writer=<all.logging.DummyWriter object>)¶
Bases: all.agents._agent.Agent
A categorical DQN agent (C51). Rather than making a point estimate of the Q-function, C51 estimates a categorical distribution over possible values. The 51 refers to the number of atoms used in the categorical distribution used to estimate the value distribution. https://arxiv.org/abs/1707.06887
 Parameters
q_dist (QDist) – Approximation of the Q distribution.
replay_buffer (ReplayBuffer) – The experience replay buffer.
discount_factor (float) – Discount factor for future rewards.
eps (float) – Stability parameter for computing the loss function.
exploration (float) – The probability of choosing a random action.
minibatch_size (int) – The number of experiences to sample in each training update.
replay_start_size (int) – Number of experiences in replay buffer when training begins.
update_frequency (int) – Number of timesteps per training update.

act(state)¶
Select an action for the current timestep and update internal parameters.
In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.
 Parameters
state (all.environment.State) – The environment state at the current timestep.
 Returns
The action to take at the current timestep.
 Return type
torch.Tensor

eval(state)¶
Select an action for the current timestep in evaluation mode.
Unlike act, this method should NOT update the internal parameters of the agent. Most of the time, this method should return the greedy action according to the current policy. This method is useful when using evaluation methodologies that distinguish between the performance of the agent during training and the performance of the resulting policy.
 Parameters
state (all.environment.State) – The environment state at the current timestep.
 Returns
The action to take at the current timestep.
 Return type
torch.Tensor

class all.agents.DDPG(q, policy, replay_buffer, action_space, discount_factor=0.99, minibatch_size=32, noise=0.1, replay_start_size=5000, update_frequency=1)¶
Bases: all.agents._agent.Agent
Deep Deterministic Policy Gradient (DDPG). DDPG extends the ideas of DQN to a continuous action setting. Unlike DQN, which uses a single joint Q/policy network, DDPG uses separate networks for approximating the Q-function and approximating the policy. The policy network outputs a vector action in some continuous space. A small amount of noise is added to aid exploration. The Q-network is used to train the policy network. A replay buffer is used to allow for batch updates and decorrelation of the samples. https://arxiv.org/abs/1509.02971
 Parameters
q (QContinuous) – An Approximation of the continuous action Q-function.
policy (DeterministicPolicy) – An Approximation of a deterministic policy.
replay_buffer (ReplayBuffer) – The experience replay buffer.
action_space (gym.spaces.Box) – Description of the action space.
discount_factor (float) – Discount factor for future rewards.
minibatch_size (int) – The number of experiences to sample in each training update.
noise (float) – the amount of noise to add to each action (before scaling).
replay_start_size (int) – Number of experiences in replay buffer when training begins.
update_frequency (int) – Number of timesteps per training update.

act(state)¶
Select an action for the current timestep and update internal parameters.
In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.
 Parameters
state (all.environment.State) – The environment state at the current timestep.
 Returns
The action to take at the current timestep.
 Return type
torch.Tensor

eval(state)¶
Select an action for the current timestep in evaluation mode.
Unlike act, this method should NOT update the internal parameters of the agent. Most of the time, this method should return the greedy action according to the current policy. This method is useful when using evaluation methodologies that distinguish between the performance of the agent during training and the performance of the resulting policy.
 Parameters
state (all.environment.State) – The environment state at the current timestep.
 Returns
The action to take at the current timestep.
 Return type
torch.Tensor

class all.agents.DDQN(q, policy, replay_buffer, discount_factor=0.99, loss=<function weighted_mse_loss>, minibatch_size=32, replay_start_size=5000, update_frequency=1)¶
Bases: all.agents._agent.Agent
Double Deep Q-Network (DDQN). DDQN is an enhancement to DQN that uses a "double Q-style" update, wherein the online network is used to select target actions and the target network is used to evaluate these actions. https://arxiv.org/abs/1509.06461 This agent also adds support for weighted replay buffers, such as prioritized experience replay (PER). https://arxiv.org/abs/1511.05952
 Parameters
q (QNetwork) – An Approximation of the Q function.
policy (GreedyPolicy) – A policy derived from the Q-function.
replay_buffer (ReplayBuffer) – The experience replay buffer.
discount_factor (float) – Discount factor for future rewards.
loss (function) – The weighted loss function to use.
minibatch_size (int) – The number of experiences to sample in each training update.
replay_start_size (int) – Number of experiences in replay buffer when training begins.
update_frequency (int) – Number of timesteps per training update.

act(state)¶
Select an action for the current timestep and update internal parameters.
In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.
 Parameters
state (all.environment.State) – The environment state at the current timestep.
 Returns
The action to take at the current timestep.
 Return type
torch.Tensor

eval(state)¶
Select an action for the current timestep in evaluation mode.
Unlike act, this method should NOT update the internal parameters of the agent. Most of the time, this method should return the greedy action according to the current policy. This method is useful when using evaluation methodologies that distinguish between the performance of the agent during training and the performance of the resulting policy.
 Parameters
state (all.environment.State) – The environment state at the current timestep.
 Returns
The action to take at the current timestep.
 Return type
torch.Tensor

class
all.agents.
DQN
(q, policy, replay_buffer, discount_factor=0.99, loss=torch.nn.functional.mse_loss, minibatch_size=32, replay_start_size=5000, update_frequency=1)¶ Bases:
all.agents._agent.Agent
Deep Q-Network (DQN). DQN was one of the original deep reinforcement learning algorithms. It extends the ideas behind Q-learning to work well with modern convolutional networks. The core innovation is the use of a replay buffer, which allows the use of batch-style updates with decorrelated samples. It also uses a “target” network in order to improve the stability of updates. https://www.nature.com/articles/nature14236
 Parameters
q (QNetwork) – An Approximation of the Q function.
policy (GreedyPolicy) – A policy derived from the Q-function.
replay_buffer (ReplayBuffer) – The experience replay buffer.
discount_factor (float) – Discount factor for future rewards.
exploration (float) – The probability of choosing a random action.
loss (function) – The weighted loss function to use.
minibatch_size (int) – The number of experiences to sample in each training update.
n_actions (int) – The number of available actions.
replay_start_size (int) – Number of experiences in replay buffer when training begins.
update_frequency (int) – Number of timesteps per training update.
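A sketch of how replay_start_size and update_frequency typically gate training updates (illustrative only; the library's internal scheduling may differ):

```python
def should_update(timestep, buffer_size, replay_start_size, update_frequency):
    """Return True when a DQN-style agent would perform a training update:
    the buffer must be warmed up and the timestep must land on the cadence."""
    return buffer_size >= replay_start_size and timestep % update_frequency == 0

# With the defaults (replay_start_size=5000, update_frequency=1), no updates
# happen until 5000 experiences have been collected.
```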

act
(state)¶ Select an action for the current timestep and update internal parameters.
In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.
 Parameters
state (all.environment.State) – The environment state at the current timestep.
 Returns
The action to take at the current timestep.
 Return type
torch.Tensor

eval
(state)¶ Select an action for the current timestep in evaluation mode.
Unlike act, this method should NOT update the internal parameters of the agent. Most of the time, this method should return the greedy action according to the current policy. This method is useful when using evaluation methodologies that distinguish between the performance of the agent during training and the performance of the resulting policy.
 Parameters
state (all.environment.State) – The environment state at the current timestep.
 Returns
The action to take at the current timestep.
 Return type
torch.Tensor

class
all.agents.
PPO
(features, v, policy, discount_factor=0.99, entropy_loss_scaling=0.01, epochs=4, epsilon=0.2, lam=0.95, minibatches=4, n_envs=None, n_steps=4, writer=<all.logging.DummyWriter object>)¶ Bases:
all.agents._agent.Agent
Proximal Policy Optimization (PPO). PPO is an actor-critic style policy gradient algorithm that allows for the reuse of samples by using importance weighting. This often increases sample efficiency compared to algorithms such as A2C. To avoid overfitting, PPO uses a special “clipped” objective that prevents the algorithm from changing the current policy too quickly.
 Parameters
features (FeatureNetwork) – Shared feature layers.
v (VNetwork) – Value head which approximates the state-value function.
policy (StochasticPolicy) – Policy head which outputs an action distribution.
discount_factor (float) – Discount factor for future rewards.
entropy_loss_scaling (float) – Contribution of the entropy loss to the total policy loss.
epochs (int) – Number of times to reuse each sample.
lam (float) – The Generalized Advantage Estimate (GAE) decay parameter.
minibatches (int) – The number of minibatches to split each batch into.
n_envs (int) – Number of parallel actors/environments.
n_steps (int) – Number of timesteps per rollout. Updates are performed once per rollout.
writer (Writer) – Used for logging.
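The clipped objective for a single sample can be written out directly (a sketch of the standard PPO loss, using a hypothetical helper name):

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """PPO's clipped surrogate objective for one sample.

    ratio is pi_new(a|s) / pi_old(a|s); taking the min with the clipped
    term removes the incentive to move the policy far from the one that
    collected the data.
    """
    clipped_ratio = max(min(ratio, 1 + epsilon), 1 - epsilon)
    return min(ratio * advantage, clipped_ratio * advantage)

# A large ratio with a positive advantage is capped at (1 + epsilon) * advantage.
capped = clipped_surrogate(1.5, 1.0)  # 1.2
```

Note that the min makes the bound pessimistic in both directions: for negative advantages, moving the ratio below 1 - epsilon yields no further gain either.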

act
(states)¶ Select an action for the current timestep and update internal parameters.
In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.
 Parameters
state (all.environment.State) – The environment state at the current timestep.
 Returns
The action to take at the current timestep.
 Return type
torch.Tensor

eval
(states)¶ Select an action for the current timestep in evaluation mode.
Unlike act, this method should NOT update the internal parameters of the agent. Most of the time, this method should return the greedy action according to the current policy. This method is useful when using evaluation methodologies that distinguish between the performance of the agent during training and the performance of the resulting policy.
 Parameters
state (all.environment.State) – The environment state at the current timestep.
 Returns
The action to take at the current timestep.
 Return type
torch.Tensor

class
all.agents.
Rainbow
(q_dist, replay_buffer, discount_factor=0.99, eps=1e05, exploration=0.02, minibatch_size=32, replay_start_size=5000, update_frequency=1, writer=<all.logging.DummyWriter object>)¶ Bases:
all.agents.c51.C51
Rainbow: Combining Improvements in Deep Reinforcement Learning. Rainbow combines C51 with five other “enhancements” to DQN: double Q-learning, dueling networks, noisy networks, prioritized replay, and n-step rollouts. https://arxiv.org/abs/1710.02298
Whether this agent is Rainbow or C51 depends on the objects that are passed into it. Dueling networks and noisy networks are part of the model used for q_dist, while prioritized replay and n-step rollouts are handled by the replay buffer. Double Q-learning is always used.
 Parameters
q_dist (QDist) – Approximation of the Q distribution.
replay_buffer (ReplayBuffer) – The experience replay buffer.
discount_factor (float) – Discount factor for future rewards.
eps (float) – Stability parameter for computing the loss function.
exploration (float) – The probability of choosing a random action.
minibatch_size (int) – The number of experiences to sample in each training update.
replay_start_size (int) – Number of experiences in replay buffer when training begins.
update_frequency (int) – Number of timesteps per training update.
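n-step rollouts bootstrap from a value estimate after accumulating n rewards. The scalar version of the return looks like this (a sketch; the library applies the same idea to value distributions via the replay buffer):

```python
def n_step_return(rewards, bootstrap_value, discount_factor=0.99):
    """Discounted n-step return: fold the n observed rewards around a
    bootstrapped value estimate of the state reached after the rollout."""
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + discount_factor * g
    return g

# Two rewards of 1.0, a bootstrap value of 10.0, and a discount of 0.5:
# 1.0 + 0.5 * (1.0 + 0.5 * 10.0) = 4.0
```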

class
all.agents.
SAC
(policy, q_1, q_2, v, replay_buffer, discount_factor=0.99, entropy_target=2.0, lr_temperature=0.0001, minibatch_size=32, replay_start_size=5000, temperature_initial=0.1, update_frequency=1, writer=<all.logging.DummyWriter object>)¶ Bases:
all.agents._agent.Agent
Soft Actor-Critic (SAC). SAC is a proposed improvement to DDPG that replaces the standard mean-squared Bellman error (MSBE) objective with a “maximum entropy” objective that improves exploration. It also uses a few other tricks, such as the “Clipped Double-Q Learning” trick introduced by TD3. This implementation uses automatic temperature adjustment to replace the difficult-to-set temperature parameter with a more easily tuned entropy target parameter. https://arxiv.org/abs/1801.01290
 Parameters
policy (DeterministicPolicy) – An Approximation of a deterministic policy.
q1 (QContinuous) – An Approximation of the continuous action Qfunction.
q2 (QContinuous) – An Approximation of the continuous action Qfunction.
v (VNetwork) – An Approximation of the state-value function.
replay_buffer (ReplayBuffer) – The experience replay buffer.
discount_factor (float) – Discount factor for future rewards.
entropy_target (float) – The desired entropy of the policy. Usually env.action_space.shape[0]
minibatch_size (int) – The number of experiences to sample in each training update.
replay_start_size (int) – Number of experiences in replay buffer when training begins.
temperature_initial (float) – The initial temperature used in the maximum entropy objective.
update_frequency (int) – Number of timesteps per training update.
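The automatic temperature adjustment can be sketched as a simple gradient step on the temperature (a simplified scalar version under stated assumptions; the actual implementation may update a log-temperature instead):

```python
def adjust_temperature(temperature, policy_entropy, entropy_target, lr_temperature):
    """One step of automatic temperature tuning: raise the temperature when
    the policy's entropy falls below the target (weighting exploration more
    heavily), and lower it when entropy is above the target."""
    return temperature + lr_temperature * (entropy_target - policy_entropy)

# Entropy (1.0) below the target (2.0) nudges the temperature upward.
new_temp = adjust_temperature(0.1, policy_entropy=1.0, entropy_target=2.0,
                              lr_temperature=0.01)
```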

act
(state)¶ Select an action for the current timestep and update internal parameters.
In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.
 Parameters
state (all.environment.State) – The environment state at the current timestep.
 Returns
The action to take at the current timestep.
 Return type
torch.Tensor

eval
(state)¶ Select an action for the current timestep in evaluation mode.
Unlike act, this method should NOT update the internal parameters of the agent. Most of the time, this method should return the greedy action according to the current policy. This method is useful when using evaluation methodologies that distinguish between the performance of the agent during training and the performance of the resulting policy.
 Parameters
state (all.environment.State) – The environment state at the current timestep.
 Returns
The action to take at the current timestep.
 Return type
torch.Tensor

class
all.agents.
VAC
(features, v, policy, discount_factor=1)¶ Bases:
all.agents._agent.Agent
Vanilla Actor-Critic (VAC). VAC is an implementation of the actor-critic algorithm found in the Sutton and Barto (2018) textbook. This implementation tweaks the algorithm slightly by using a shared feature layer. It is also compatible with the use of parallel environments. https://papers.nips.cc/paper/1786actorcriticalgorithms.pdf
 Parameters
features (FeatureNetwork) – Shared feature layers.
v (VNetwork) – Value head which approximates the state-value function.
policy (StochasticPolicy) – Policy head which outputs an action distribution.
discount_factor (float) – Discount factor for future rewards.
n_envs (int) – Number of parallel actors/environments.
n_steps (int) – Number of timesteps per rollout. Updates are performed once per rollout.
writer (Writer) – Used for logging.
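The actor-critic update is driven by a one-step TD advantage estimate, which can be sketched as follows (hypothetical helper, not the library's code):

```python
def td_advantage(reward, value, next_value, discount_factor=1.0, done=False):
    """One-step TD advantage: how much better the observed transition was
    than the critic's estimate. A positive value makes the taken action
    more likely under the policy gradient; a negative value, less likely."""
    td_target = reward if done else reward + discount_factor * next_value
    return td_target - value
```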

act
(state)¶ Select an action for the current timestep and update internal parameters.
In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.
 Parameters
state (all.environment.State) – The environment state at the current timestep.
 Returns
The action to take at the current timestep.
 Return type
torch.Tensor

eval
(state)¶ Select an action for the current timestep in evaluation mode.
Unlike act, this method should NOT update the internal parameters of the agent. Most of the time, this method should return the greedy action according to the current policy. This method is useful when using evaluation methodologies that distinguish between the performance of the agent during training and the performance of the resulting policy.
 Parameters
state (all.environment.State) – The environment state at the current timestep.
 Returns
The action to take at the current timestep.
 Return type
torch.Tensor

class
all.agents.
VPG
(features, v, policy, discount_factor=0.99, min_batch_size=1)¶ Bases:
all.agents._agent.Agent
Vanilla Policy Gradient (VPG/REINFORCE). VPG (also known as REINFORCE) is the least biased implementation of the policy gradient theorem. It uses complete episode rollouts as unbiased estimates of the Q-function, rather than the n-step rollouts found in most actor-critic algorithms. The state-value function approximation reduces variance, but does not introduce any bias. This implementation introduces two tweaks. First, it uses a shared feature layer. Second, it introduces the capacity for training on multiple episodes at once. These enhancements often improve learning without sacrificing the essential character of the algorithm. https://link.springer.com/article/10.1007/BF00992696
 Parameters
features (FeatureNetwork) – Shared feature layers.
v (VNetwork) – Value head which approximates the state-value function.
policy (StochasticPolicy) – Policy head which outputs an action distribution.
discount_factor (float) – Discount factor for future rewards.
min_batch_size (int) – Updates will occurs when an episode ends after at least this many stateaction pairs are seen. Set this to a large value in order to train on multiple episodes at once.
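The complete-episode returns that VPG uses as Q-function estimates can be computed with a single backward pass over the rewards (a sketch with a hypothetical helper name):

```python
def episode_returns(rewards, discount_factor=0.99):
    """Discounted return at each timestep of a complete episode rollout.
    Each entry is an unbiased Monte Carlo estimate of the Q-function."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + discount_factor * g
        returns.append(g)
    return list(reversed(returns))

# With a discount of 0.5: [1 + 0.5 * (1 + 0.5 * 1), 1 + 0.5 * 1, 1]
```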

act
(state)¶ Select an action for the current timestep and update internal parameters.
In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.
 Parameters
state (all.environment.State) – The environment state at the current timestep.
 Returns
The action to take at the current timestep.
 Return type
torch.Tensor

eval
(state)¶ Select an action for the current timestep in evaluation mode.
Unlike act, this method should NOT update the internal parameters of the agent. Most of the time, this method should return the greedy action according to the current policy. This method is useful when using evaluation methodologies that distinguish between the performance of the agent during training and the performance of the resulting policy.
 Parameters
state (all.environment.State) – The environment state at the current timestep.
 Returns
The action to take at the current timestep.
 Return type
torch.Tensor

class
all.agents.
VQN
(q, policy, discount_factor=0.99)¶ Bases:
all.agents._agent.Agent
Vanilla Q-Network (VQN). VQN is an implementation of the Q-learning algorithm found in the Sutton and Barto (2018) textbook. Q-learning algorithms attempt to learn the optimal policy while executing a (generally) suboptimal policy (typically epsilon-greedy). In theory, this allows the agent to gain the benefits of exploration without sacrificing the performance of the final policy. However, the cost of this is that Q-learning is generally less stable than its on-policy brethren, SARSA. http://www.cs.rhul.ac.uk/~chrisw/new_thesis.pdf
 Parameters
q (QNetwork) – An Approximation of the Q function.
policy (GreedyPolicy) – A policy derived from the Q-function.
discount_factor (float) – Discount factor for future rewards.
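Q-learning's off-policy character comes from its TD target, which bootstraps from the best next action rather than the action the behavior policy actually takes (a sketch):

```python
def q_learning_target(reward, next_action_values, discount_factor=0.99, done=False):
    """Q-learning TD target: bootstrap from the max over next-action values,
    regardless of which action the (epsilon-greedy) behavior policy selects."""
    if done:
        return reward
    return reward + discount_factor * max(next_action_values)
```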

act
(state)¶ Select an action for the current timestep and update internal parameters.
In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.
 Parameters
state (all.environment.State) – The environment state at the current timestep.
 Returns
The action to take at the current timestep.
 Return type
torch.Tensor

eval
(state)¶ Select an action for the current timestep in evaluation mode.
Unlike act, this method should NOT update the internal parameters of the agent. Most of the time, this method should return the greedy action according to the current policy. This method is useful when using evaluation methodologies that distinguish between the performance of the agent during training and the performance of the resulting policy.
 Parameters
state (all.environment.State) – The environment state at the current timestep.
 Returns
The action to take at the current timestep.
 Return type
torch.Tensor

class
all.agents.
VSarsa
(q, policy, discount_factor=0.99)¶ Bases:
all.agents._agent.Agent
Vanilla SARSA (VSarsa). SARSA (State-Action-Reward-State-Action) is an on-policy alternative to Q-learning. Unlike Q-learning, SARSA attempts to learn the Q-function for the current policy rather than the optimal policy. This approach is more stable but may not result in the optimal policy. However, this problem can be mitigated by decaying the exploration rate over time.
 Parameters
q (QNetwork) – An Approximation of the Q function.
policy (GreedyPolicy) – A policy derived from the Q-function.
discount_factor (float) – Discount factor for future rewards.
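The on-policy difference from Q-learning is visible in the TD target: SARSA bootstraps from the action the current policy actually selects next, rather than the max (a sketch, mirroring the Q-learning version above):

```python
def sarsa_target(reward, next_action_values, next_action, discount_factor=0.99,
                 done=False):
    """SARSA TD target: bootstrap from the value of the next action chosen by
    the current (possibly exploratory) policy."""
    if done:
        return reward
    return reward + discount_factor * next_action_values[next_action]

# If an exploratory policy picks action 0, SARSA's target reflects that choice,
# whereas Q-learning would still bootstrap from the best action (value 2.0).
```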

act
(state)¶ Select an action for the current timestep and update internal parameters.
In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.
 Parameters
state (all.environment.State) – The environment state at the current timestep.
 Returns
The action to take at the current timestep.
 Return type
torch.Tensor

eval
(state)¶ Select an action for the current timestep in evaluation mode.
Unlike act, this method should NOT update the internal parameters of the agent. Most of the time, this method should return the greedy action according to the current policy. This method is useful when using evaluation methodologies that distinguish between the performance of the agent during training and the performance of the resulting policy.
 Parameters
state (all.environment.State) – The environment state at the current timestep.
 Returns
The action to take at the current timestep.
 Return type
torch.Tensor
all.approximation¶

class
all.approximation.
Approximation
(model, optimizer, checkpointer=None, clip_grad=0, loss_scaling=1, name='approximation', scheduler=None, target=None, writer=<all.logging.DummyWriter object>)¶ Bases:
object
Base function approximation object.
This defines a PyTorch-based function approximation object that wraps key functionality useful for reinforcement learning, including decaying learning rates, model checkpointing, loss scaling, gradient clipping, target networks, and tensorboard logging. This enables increased code reusability and simpler Agent implementations.
 Parameters
model (torch.nn.Module) – A PyTorch module representing the model used to approximate the function. This could be a convolutional network, a fully connected network, or any other PyTorch-compatible model.
optimizer (torch.optim.Optimizer) – An optimizer initialized with the model parameters, e.g. SGD, Adam, RMSprop, etc.
checkpointer (all.approximation.checkpointer.Checkpointer) – A Checkpointer object that periodically saves the model and its parameters to disk. Default: A PeriodicCheckpointer that saves the model once every 200 updates.
clip_grad (float, optional) – If nonzero, clips the norm of the gradient to this value in order to prevent large updates and improve stability. See torch.nn.utils.clip_grad.
loss_scaling (float, optional) – Multiplies the loss by this value before performing a backwards pass. Useful when used with multi-headed networks with shared feature layers.
name (str, optional) – The name of the function approximator used for logging.
scheduler (torch.optim.lr_scheduler._LRScheduler, optional) – A learning rate scheduler initialized with the given optimizer. step() will be called after every update.
target (all.approximation.target.TargetNetwork, optional) – A target network object to be used during optimization. A target network updates more slowly than the base model being optimized, allowing for a more stable optimization target.
writer (all.logging.Writer, optional) – A Writer object used for logging. The standard object logs to tensorboard; however, other types of Writer objects may be implemented by the user.
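Gradient clipping by norm, as enabled by clip_grad, works like this on a flat list of gradient values (a sketch of the idea behind torch.nn.utils.clip_grad, not the library's code):

```python
import math

def clip_gradient_norm(grads, max_norm):
    """Scale gradients down so their L2 norm is at most max_norm;
    leave them untouched if they are already within the bound."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return grads
    scale = max_norm / norm
    return [g * scale for g in grads]

# A gradient of norm 5.0 clipped to norm 1.0 keeps its direction:
# approximately [0.6, 0.8].
clipped = clip_gradient_norm([3.0, 4.0], 1.0)
```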

eval
(*inputs)¶ Run a forward pass of the model in eval mode with no_grad. The model is returned to its previous mode after the forward pass is made.

no_grad
(*inputs)¶ Run a forward pass of the model in no_grad mode.

reinforce
(loss)¶ Backpropagate the loss through the model and make an update step. Internally, this will perform most of the activities associated with a control loop in standard machine learning environments, depending on the configuration of the object: Gradient clipping, learning rate schedules, logging, checkpointing, etc.
 Parameters
loss (torch.Tensor) – The loss computed for a batch of inputs.
 Returns
The current Approximation object
 Return type
self

step
()¶ Given that a backward pass has been made, run an optimization step. Internally, this will perform most of the activities associated with a control loop in standard machine learning environments, depending on the configuration of the object: gradient clipping, learning rate schedules, logging, checkpointing, etc.
 Returns
The current Approximation object
 Return type
self

target
(*inputs)¶ Run a forward pass of the target network.

zero_grad
()¶ Clears the gradients of all optimized tensors.
 Returns
The current Approximation object
 Return type
self

class
all.approximation.
DummyCheckpointer
¶ Bases:
all.approximation.checkpointer.Checkpointer

init
(*inputs)¶


class
all.approximation.
FeatureNetwork
(model, optimizer=None, name='feature', **kwargs)¶ Bases:
all.approximation.approximation.Approximation
A special type of Approximation that accumulates gradients before backpropagating them. This is useful when features are shared between network heads.
The __call__ function caches the computation graph and detaches the output. Then, various function approximators may backpropagate to the output. The reinforce() function will then backpropagate the accumulated gradients on the output through the original computation graph.

reinforce
()¶ Backward pass of the model.


class
all.approximation.
FixedTarget
(update_frequency)¶ Bases:
all.approximation.target.abstract.TargetNetwork

init
(model)¶

update
()¶


class
all.approximation.
PeriodicCheckpointer
(frequency)¶ Bases:
all.approximation.checkpointer.Checkpointer

init
(model, filename)¶


class
all.approximation.
PolyakTarget
(rate)¶ Bases:
all.approximation.target.abstract.TargetNetwork
A TargetNetwork that updates its parameters using Polyak averaging.

init
(model)¶

update
()¶
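Polyak averaging moves each target parameter a small fixed fraction toward the corresponding online parameter at every update (a scalar sketch with hypothetical names, not the library's implementation):

```python
def polyak_update(target_params, online_params, rate):
    """Exponential moving average of parameters: each target value moves a
    fraction `rate` toward the corresponding online value."""
    return [(1 - rate) * t + rate * o
            for t, o in zip(target_params, online_params)]

# Repeated updates converge smoothly toward the online parameters:
# 0.0 -> 0.5 -> 0.75 -> 0.875
params = [0.0]
for _ in range(3):
    params = polyak_update(params, [1.0], 0.5)
```

Compared with FixedTarget's periodic hard copy, this produces a target that trails the online network continuously.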


class
all.approximation.
QContinuous
(model, optimizer, name='q', **kwargs)¶ Bases:
all.approximation.approximation.Approximation

class
all.approximation.
QDist
(model, optimizer, n_actions, n_atoms, v_min, v_max, name='q_dist', **kwargs)¶ Bases:
all.approximation.approximation.Approximation

project
(dist, support)¶


class
all.approximation.
QNetwork
(model, optimizer, name='q', **kwargs)¶ Bases:
all.approximation.approximation.Approximation

class
all.approximation.
TrivialTarget
¶ Bases:
all.approximation.target.abstract.TargetNetwork

init
(model)¶

update
()¶


class
all.approximation.
VNetwork
(model, optimizer, name='v', **kwargs)¶ Bases:
all.approximation.approximation.Approximation
all.bodies¶

class
all.bodies.
Body
(agent)¶ Bases:
all.agents._agent.Agent
A Body wraps a reinforcement learning Agent, altering its inputs and outputs.
The Body API is identical to the Agent API from the perspective of the rest of the system. This base class is provided only for semantic clarity.

act
(state)¶ Select an action for the current timestep and update internal parameters.
In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.
 Parameters
state (all.environment.State) – The environment state at the current timestep.
 Returns
The action to take at the current timestep.
 Return type
torch.Tensor

property
agent
¶

eval
(state)¶ Select an action for the current timestep in evaluation mode.
Unlike act, this method should NOT update the internal parameters of the agent. Most of the time, this method should return the greedy action according to the current policy. This method is useful when using evaluation methodologies that distinguish between the performance of the agent during training and the performance of the resulting policy.
 Parameters
state (all.environment.State) – The environment state at the current timestep.
 Returns
The action to take at the current timestep.
 Return type
torch.Tensor

process_action
(action)¶

process_state
(state)¶


class
all.bodies.
DeepmindAtariBody
(agent, lazy_frames=False, episodic_lives=True, frame_stack=4)¶ Bases:
all.bodies._body.Body
all.environments¶

class
all.environments.
AtariEnvironment
(name, *args, **kwargs)¶ Bases:
all.environments.gym.GymEnvironment

duplicate
(n)¶ Create n copies of this environment.

property
name
¶ The name of the environment.


class
all.environments.
Environment
¶ Bases:
abc.ABC
A reinforcement learning Environment.
In reinforcement learning, an Agent learns by interacting with an Environment. An Environment defines the dynamics of a particular problem: the states, the actions, the transitions between states, and the rewards given to the agent. Environments are often used to benchmark reinforcement learning agents, or to define real problems that the user hopes to solve using reinforcement learning.

abstract property
action_space
¶ The Space representing the range of possible actions.
 Returns
An object of type Space that represents possible actions the agent may take
 Return type
Space

abstract
close
()¶ Clean up any extraneous environment objects.

abstract property
device
¶ The torch device the environment lives on.

abstract
duplicate
(n)¶ Create n copies of this environment.

abstract property
name
¶ The name of the environment.

property
observation_space
¶ Alias for Environment.state_space.
 Returns
An object of type Space that represents possible states the agent may observe
 Return type
Space

abstract
render
(**kwargs)¶ Render the current environment state.

abstract
reset
()¶ Reset the environment and return a new initial state.
 Returns
The initial state for the next episode.
 Return type
State

abstract property
state
¶ The State of the Environment at the current timestep.

abstract property
state_space
¶ The Space representing the range of observable states.
 Returns
An object of type Space that represents possible states the agent may observe
 Return type
Space

abstract
step
(action)¶ Apply an action and get the next state.
 Parameters
action (Action) – The action to apply at the current timestep.
Returns
all.environments.State – The State of the environment after the action is applied. This State object includes both the done flag and any additional “info”.
float – The reward achieved by the previous action.


class
all.environments.
GymEnvironment
(env, device=torch.device)¶ Bases:
all.environments.abstract.Environment

property
action_space
¶ The Space representing the range of possible actions.
 Returns
An object of type Space that represents possible actions the agent may take
 Return type
Space

close
()¶ Clean up any extraneous environment objects.

property
device
¶ The torch device the environment lives on.

duplicate
(n)¶ Create n copies of this environment.

property
env
¶

property
name
¶ The name of the environment.

render
(**kwargs)¶ Render the current environment state.

reset
()¶ Reset the environment and return a new initial state.
 Returns
The initial state for the next episode.
 Return type
State

seed
(seed)¶

property
state
¶ The State of the Environment at the current timestep.

property
state_space
¶ The Space representing the range of observable states.
 Returns
An object of type Space that represents possible states the agent may observe
 Return type
Space

step
(action)¶ Apply an action and get the next state.
 Parameters
action (Action) – The action to apply at the current timestep.
Returns
all.environments.State – The State of the environment after the action is applied. This State object includes both the done flag and any additional “info”.
float – The reward achieved by the previous action.

all.experiments¶

class
all.experiments.
Experiment
(writer, quiet)¶ Bases:
abc.ABC
An Experiment manages the basic train/test loop and logs results.
 Parameters
writer (Writer) – A Writer object used for logging.
quiet (bool) – If False, the Experiment will print information about episode returns to standard out.

abstract property
episode
¶ The index of the current training episode

abstract property
frame
¶ The index of the current training frame.

abstract
test
(episodes=100)¶ Test the agent in eval mode for a certain number of episodes.
 Parameters
episodes (int) – The number of test episodes.
 Returns
A list of all returns received during testing.
 Return type
list(float)

abstract
train
(frames=inf, episodes=inf)¶ Train the agent for a certain number of frames or episodes. If both frames and episodes are specified, then the training loop will exit when either condition is satisfied.
 Parameters
frames (int) – The maximum number of training frames.
episodes (int) – The maximum number of training episodes.
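The dual stopping condition can be sketched as follows (illustrative only; the real loop also steps the agent and logs results):

```python
def train_loop(max_frames, max_episodes, episode_length):
    """Run whole episodes until either the frame budget or the episode
    budget is exhausted, whichever comes first."""
    frames, episodes = 0, 0
    while frames < max_frames and episodes < max_episodes:
        frames += episode_length   # stand-in for running one episode
        episodes += 1
    return frames, episodes

# Capped by frames: train_loop(100, float('inf'), 10) stops after 10 episodes.
# Capped by episodes: train_loop(float('inf'), 3, 10) stops after 30 frames.
```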

class
all.experiments.
ExperimentWriter
(experiment, agent_name, env_name, loss=True)¶ Bases:
tensorboardX.writer.SummaryWriter, all.logging.Writer
The Writer object used by all.experiments.Experiment. Writes logs using tensorboard into the current runs directory, tagging the run with a combination of the agent name, the commit hash of the current git repo of the working directory (if any), and the current time. Also writes summary statistics into CSV files.
Parameters
experiment (all.experiments.Experiment) – The Experiment associated with the Writer object.
agent_name (str) – The name of the Agent the Experiment is being performed on.
env_name (str) – The name of the environment the Experiment is being performed in.
loss (bool, optional) – Whether or not to log loss/scheduling metrics, or only evaluation and summary metrics.

add_evaluation
(name, value, step='frame')¶ Log the evaluation metric.
 Parameters
name (str) – The tag to associate with the loss
value (number) – The evaluation metric at the current step
step (str, optional) – Which step to use (e.g., “frame” or “episode”)

add_loss
(name, value, step='frame')¶ Log the given loss metric at the current step.
 Parameters
name (str) – The tag to associate with the loss
value (number) – The value of the loss at the current step
step (str, optional) – Which step to use (e.g., “frame” or “episode”)

add_scalar
(name, value, step='frame')¶ Log an arbitrary scalar.
Parameters
name (str) – The tag to associate with the scalar
value (number) – The value of the scalar at the current step
step (str, optional) – Which step to use (e.g., “frame” or “episode”)

add_schedule
(name, value, step='frame')¶ Log the current value of a hyperparameter according to some schedule.
 Parameters
name (str) – The tag to associate with the hyperparameter schedule
value (number) – The value of the hyperparameter at the current step
step (str, optional) – Which step to use (e.g., “frame” or “episode”)

add_summary
(name, mean, std, step='frame')¶ Log a summary statistic.
 Parameters
name (str) – The tag to associate with the summary statistic
mean (float) – The mean of the statistic at the current step
std (float) – The standard deviation of the statistic at the current step
step (str, optional) – Which step to use (e.g., “frame” or “episode”)


class
all.experiments.
GreedyAgent
(action_space, feature=None, q=None, policy=None)¶ Bases:
all.agents._agent.Agent

act
(state, _)¶ Select an action for the current timestep and update internal parameters.
In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.
 Parameters
state (all.environment.State) – The environment state at the current timestep.
 Returns
The action to take at the current timestep.
 Return type
torch.Tensor
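
The timestep contract described above can be sketched with a hypothetical tabular agent (pure Python, not the library's GreedyAgent): act() updates the value function from the previous transition and returns an action, while eval() only returns the greedy action.

```python
import random

class TabularGreedyAgent:
    """Hypothetical minimal agent illustrating the act()/eval() contract."""

    def __init__(self, num_actions, epsilon=0.1, lr=0.5, discount=0.99):
        self.q = {}          # state -> list of action values
        self.num_actions = num_actions
        self.epsilon = epsilon
        self.lr = lr
        self.discount = discount
        self._last = None    # (state, action) from the previous timestep

    def _values(self, state):
        return self.q.setdefault(state, [0.0] * self.num_actions)

    def act(self, state, reward):
        # 1. Update the value function from the previous transition (TD(0)).
        if self._last is not None:
            s, a = self._last
            target = reward + self.discount * max(self._values(state))
            self.q[s][a] += self.lr * (target - self.q[s][a])
        # 2. Choose an action (epsilon-greedy) and remember it.
        if random.random() < self.epsilon:
            action = random.randrange(self.num_actions)
        else:
            values = self._values(state)
            action = values.index(max(values))
        self._last = (state, action)
        return action

    def eval(self, state, reward):
        # No learning: return the greedy action only.
        values = self._values(state)
        return values.index(max(values))
```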

choose_continuous
(state)¶

choose_discrete
(state)¶

eval
(state, reward)¶ Select an action for the current timestep in evaluation mode.
Unlike act, this method should NOT update the internal parameters of the agent. Most of the time, this method should return the greedy action according to the current policy. This method is useful when using evaluation methodologies that distinguish between the performance of the agent during training and the performance of the resulting policy.
 Parameters
state (all.environment.State) – The environment state at the current timestep.
 Returns
The action to take at the current timestep.
 Return type
torch.Tensor

static
load
(dirname, env)¶


class
all.experiments.
ParallelEnvExperiment
(agent, env, render=False, quiet=False, write_loss=True)¶ Bases:
all.experiments.experiment.Experiment
An Experiment object for training and testing agents that use parallel training environments.

property
episode
¶ The index of the current training episode

property
frame
¶ The index of the current training frame.

test
(episodes=100)¶ Test the agent in eval mode for a certain number of episodes.
 Parameters
episodes (int) – The number of test episodes.
 Returns
A list of all returns received during testing.
 Return type
list(float)

train
(frames=inf, episodes=inf)¶ Train the agent for a certain number of frames or episodes. If both frames and episodes are specified, then the training loop will exit when either condition is satisfied.
 Parameters
frames (int) – The maximum number of training frames.
episodes (int) – The maximum number of training episodes.
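
The frames/episodes exit condition can be sketched as a plain loop (illustrative only; the real Experiment drives an actual agent and environment):

```python
def train_loop(max_frames=float('inf'), max_episodes=float('inf'), episode_length=10):
    """Sketch of the exit condition: stop when EITHER limit is reached."""
    frames = 0
    episodes = 0
    while frames < max_frames and episodes < max_episodes:
        for _ in range(episode_length):   # one simulated episode
            frames += 1
            if frames >= max_frames:
                break
        episodes += 1
    return frames, episodes
```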


class
all.experiments.
SingleEnvExperiment
(agent, env, render=False, quiet=False, write_loss=True)¶ Bases:
all.experiments.experiment.Experiment
An Experiment object for training and testing agents that interact with one environment at a time.

property
episode
¶ The index of the current training episode

property
frame
¶ The index of the current training frame.

test
(episodes=100)¶ Test the agent in eval mode for a certain number of episodes.
 Parameters
episodes (int) – The number of test episodes.
 Returns
A list of all returns received during testing.
 Return type
list(float)

train
(frames=inf, episodes=inf)¶ Train the agent for a certain number of frames or episodes. If both frames and episodes are specified, then the training loop will exit when either condition is satisfied.
 Parameters
frames (int) – The maximum number of training frames.
episodes (int) – The maximum number of training episodes.


class
all.experiments.
SlurmExperiment
(agents, envs, frames, test_episodes=100, job_name='autonomous-learning-library', sbatch_args=None)¶ Bases:
object

create_sbatch_script
()¶

make_output_directory
()¶

parse_args
()¶

queue_jobs
()¶

run_experiment
()¶

run_sbatch_script
()¶


all.experiments.
load_and_watch
(dir, env, fps=60)¶

all.experiments.
run_experiment
(agents, envs, frames, test_episodes=100, render=False, quiet=False, write_loss=True)¶

all.experiments.
watch
(agent, env, fps=60)¶
all.logging¶

class
all.logging.
DummyWriter
¶ Bases:
all.logging.Writer
A default Writer object that performs no logging and has no side effects.

add_evaluation
(name, value, step='frame')¶ Log the evaluation metric.
 Parameters
name (str) – The tag to associate with the evaluation metric
value (number) – The evaluation metric at the current step
step (str, optional) – Which step to use (e.g., “frame” or “episode”)

add_loss
(name, value, step='frame')¶ Log the given loss metric at the current step.
 Parameters
name (str) – The tag to associate with the loss
value (number) – The value of the loss at the current step
step (str, optional) – Which step to use (e.g., “frame” or “episode”)

add_scalar
(name, value, step='frame')¶ Log an arbitrary scalar.
 Parameters
name (str) – The tag to associate with the scalar
value (number) – The value of the scalar at the current step
step (str, optional) – Which step to use (e.g., “frame” or “episode”)

add_schedule
(name, value, step='frame')¶ Log the current value of a hyperparameter according to some schedule.
 Parameters
name (str) – The tag to associate with the hyperparameter schedule
value (number) – The value of the hyperparameter at the current step
step (str, optional) – Which step to use (e.g., “frame” or “episode”)

add_summary
(name, mean, std, step='frame')¶ Log a summary statistic.
 Parameters
name (str) – The tag to associate with the summary statistic
mean (float) – The mean of the statistic at the current step
std (float) – The standard deviation of the statistic at the current step
step (str, optional) – Which step to use (e.g., “frame” or “episode”)


class
all.logging.
Writer
¶ Bases:
abc.ABC

abstract
add_evaluation
(name, value, step='frame')¶ Log the evaluation metric.
 Parameters
name (str) – The tag to associate with the evaluation metric
value (number) – The evaluation metric at the current step
step (str, optional) – Which step to use (e.g., “frame” or “episode”)

abstract
add_loss
(name, value, step='frame')¶ Log the given loss metric at the current step.
 Parameters
name (str) – The tag to associate with the loss
value (number) – The value of the loss at the current step
step (str, optional) – Which step to use (e.g., “frame” or “episode”)

abstract
add_scalar
(name, value, step='frame')¶ Log an arbitrary scalar.
 Parameters
name (str) – The tag to associate with the scalar
value (number) – The value of the scalar at the current step
step (str, optional) – Which step to use (e.g., “frame” or “episode”)

abstract
add_schedule
(name, value, step='frame')¶ Log the current value of a hyperparameter according to some schedule.
 Parameters
name (str) – The tag to associate with the hyperparameter schedule
value (number) – The value of the hyperparameter at the current step
step (str, optional) – Which step to use (e.g., “frame” or “episode”)

abstract
add_summary
(name, mean, std, step='frame')¶ Log a summary statistic.
 Parameters
name (str) – The tag to associate with the summary statistic
mean (float) – The mean of the statistic at the current step
std (float) – The standard deviation of the statistic at the current step
step (str, optional) – Which step to use (e.g., “frame” or “episode”)

log_dir
= 'runs'¶

all.memory¶

class
all.memory.
ExperienceReplayBuffer
(size, device=torch.device('cpu'))¶ Bases:
all.memory.replay_buffer.ReplayBuffer

sample
(batch_size)¶ Sample from the stored transitions

store
(state, action, next_state)¶ Store the transition in the buffer

update_priorities
(td_errors)¶ Update priorities based on the TD error
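
A minimal sketch of the idea behind ExperienceReplayBuffer, assuming uniform sampling and oldest-first overwriting (pure Python, not the library's implementation):

```python
import random

class SimpleReplayBuffer:
    """Hypothetical fixed-size experience replay buffer."""

    def __init__(self, size):
        self.size = size
        self.buffer = []
        self.pos = 0  # next write position once the buffer is full

    def store(self, state, action, next_state):
        transition = (state, action, next_state)
        if len(self.buffer) < self.size:
            self.buffer.append(transition)
        else:
            self.buffer[self.pos] = transition  # overwrite the oldest entry
        self.pos = (self.pos + 1) % self.size

    def sample(self, batch_size):
        # Uniform sampling without replacement over the stored transitions.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```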


class
all.memory.
GeneralizedAdvantageBuffer
(v, features, n_steps, n_envs, discount_factor=1, lam=1)¶ Bases:
object

advantages
(states)¶

store
(states, actions, rewards)¶


class
all.memory.
NStepAdvantageBuffer
(v, features, n_steps, n_envs, discount_factor=1)¶ Bases:
object

advantages
(states)¶

store
(states, actions, rewards)¶


class
all.memory.
NStepReplayBuffer
(steps, discount_factor, buffer)¶ Bases:
all.memory.replay_buffer.ReplayBuffer
Converts any ReplayBuffer into an NStepReplayBuffer

sample
(*args, **kwargs)¶ Sample from the stored transitions

store
(state, action, next_state)¶ Store the transition in the buffer

update_priorities
(*args, **kwargs)¶ Update priorities based on the TD error
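
The n-step transformation can be illustrated by collapsing a reward sequence into a single discounted return (a sketch of the concept, not the buffer's actual code):

```python
def n_step_return(rewards, discount_factor, bootstrap_value=0.0):
    """Collapse an n-step reward sequence into one discounted return.

    This is the quantity an n-step buffer stores in place of a one-step
    reward: r_0 + g*r_1 + ... + g^(n-1)*r_{n-1} + g^n * V(s_n).
    """
    g = 1.0
    total = 0.0
    for r in rewards:
        total += g * r
        g *= discount_factor
    return total + g * bootstrap_value
```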


class
all.memory.
PrioritizedReplayBuffer
(buffer_size, alpha=0.6, beta=0.4, epsilon=1e-05, device=torch.device('cpu'))¶ Bases:
all.memory.replay_buffer.ExperienceReplayBuffer, all.optim.scheduler.Schedulable

sample
(batch_size)¶ Sample from the stored transitions

store
(state, action, next_state)¶ Store the transition in the buffer

update_priorities
(priorities)¶ Update priorities based on the TD error
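
A sketch of the prioritization math behind PrioritizedReplayBuffer, assuming the standard PER formulas with the alpha, beta, and epsilon parameters shown above (not the library's implementation):

```python
def priority_distribution(td_errors, alpha=0.6, epsilon=1e-5):
    """Sampling probabilities p_i proportional to (|delta_i| + eps)^alpha."""
    prios = [(abs(d) + epsilon) ** alpha for d in td_errors]
    total = sum(prios)
    return [p / total for p in prios]

def importance_weights(probs, beta=0.4):
    """IS correction w_i = (N * P(i))^(-beta), normalized by the max weight."""
    n = len(probs)
    weights = [(n * p) ** (-beta) for p in probs]
    w_max = max(weights)
    return [w / w_max for w in weights]
```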

all.nn¶

all.nn.
td_loss
(loss)¶

all.nn.
weighted_mse_loss
(input, target, weight, reduction='mean')¶

all.nn.
weighted_smooth_l1_loss
(input, target, weight, reduction='mean')¶
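
A pure-Python sketch of what weighted_mse_loss presumably computes, assuming it follows the usual weighted squared-error definition with the reduction modes implied by the signature:

```python
def weighted_mse_loss(input, target, weight, reduction='mean'):
    """Per-element squared error scaled by a weight (e.g. PER IS weights)."""
    losses = [w * (i - t) ** 2 for i, t, w in zip(input, target, weight)]
    if reduction == 'mean':
        return sum(losses) / len(losses)
    if reduction == 'sum':
        return sum(losses)
    return losses  # 'none'
```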
all.optim¶

class
all.optim.
LinearScheduler
(initial_value, final_value, decay_start, decay_end, name='variable', writer=<all.logging.DummyWriter object>)¶ Bases:
all.optim.scheduler.Scheduler

class
all.optim.
Schedulable
¶ Bases:
object
Allow “instance” descriptors to implement parameter scheduling.
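
The linear annealing a LinearScheduler presumably performs, given its initial_value/final_value/decay_start/decay_end parameters, can be sketched as (illustrative, not the library's code):

```python
def linear_schedule(step, initial_value, final_value, decay_start, decay_end):
    """Linearly anneal a hyperparameter between decay_start and decay_end."""
    if step < decay_start:
        return initial_value
    if step >= decay_end:
        return final_value
    fraction = (step - decay_start) / (decay_end - decay_start)
    return initial_value + fraction * (final_value - initial_value)
```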
all.policies¶

class
all.policies.
DeterministicPolicy
(model, optimizer, space, name='policy', **kwargs)¶ Bases:
all.approximation.approximation.Approximation
A DDPG-style deterministic policy.
 Parameters
model (torch.nn.Module) – A PyTorch module representing the policy network. The input shape should be the same as the shape of the state space, and the output shape should be the same as the shape of the action space.
optimizer (torch.optim.Optimizer) – An optimizer initialized with the model parameters, e.g. SGD, Adam, RMSprop, etc.
action_space (gym.spaces.Box) – The Box representing the action space.
kwargs (optional) – Any other arguments accepted by all.approximation.Approximation

class
all.policies.
GaussianPolicy
(model, optimizer, space, name='policy', **kwargs)¶ Bases:
all.approximation.approximation.Approximation
A Gaussian stochastic policy.
This policy will choose actions from a distribution represented by a spherical Gaussian. The first n outputs of the model will be squashed to [-1, 1] through a tanh function and then scaled to the given action_space, and the remaining n outputs will define the amount of noise added.
 Parameters
model (torch.nn.Module) – A PyTorch module representing the policy network. The input shape should be the same as the shape of the state (or feature) space, and the output shape should be double the size of the action space. The first n outputs will be the unscaled mean of the action for each dimension, and the second n outputs will be the logarithm of the variance.
optimizer (torch.optim.Optimizer) – An optimizer initialized with the model parameters, e.g. SGD, Adam, RMSprop, etc.
action_space (gym.spaces.Box) – The Box representing the action space.
kwargs (optional) – Any other arguments accepted by all.approximation.Approximation
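
The tanh squashing and rescaling described above can be sketched for a single action dimension (illustrative; low/high are hypothetical stand-ins for the Box bounds):

```python
import math

def squash_to_box(raw_mean, low, high):
    """Map an unbounded model output to a Box action space.

    tanh squashes to [-1, 1]; the result is then rescaled to [low, high].
    """
    squashed = math.tanh(raw_mean)
    center = (high + low) / 2
    scale = (high - low) / 2
    return center + scale * squashed
```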

class
all.policies.
GreedyPolicy
(q, num_actions, epsilon=0.0)¶ Bases:
all.optim.scheduler.Schedulable
An “epsilon-greedy” action selection policy for discrete action spaces.
This policy will usually choose the optimal action according to an approximation of the action value function (the “q-function”), but with probability epsilon will choose a random action instead. GreedyPolicy is a Schedulable, meaning that epsilon can be varied over time by passing a Scheduler object.
 Parameters
q (all.approximation.QNetwork) – The action-value or “q-function”
num_actions (int) – The number of available actions.
epsilon (float, optional) – The probability of selecting a random action.

eval
(state)¶

no_grad
(state)¶
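
The epsilon-greedy rule itself is simple enough to sketch in pure Python (illustrative, operating on a plain list of action values rather than a QNetwork):

```python
import random

def epsilon_greedy(action_values, epsilon):
    """With probability epsilon take a random action, else the argmax."""
    if random.random() < epsilon:
        return random.randrange(len(action_values))
    return max(range(len(action_values)), key=lambda a: action_values[a])
```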

class
all.policies.
ParallelGreedyPolicy
(q, num_actions, epsilon=0.0)¶ Bases:
all.optim.scheduler.Schedulable
A parallel version of the “epsilon-greedy” action selection policy for discrete action spaces.
This policy will usually choose the optimal action according to an approximation of the action value function (the “q-function”), but with probability epsilon will choose a random action instead. ParallelGreedyPolicy is a Schedulable, meaning that epsilon can be varied over time by passing a Scheduler object.
 Parameters
q (all.approximation.QNetwork) – The action-value or “q-function”
num_actions (int) – The number of available actions.
epsilon (float, optional) – The probability of selecting a random action.

eval
(state)¶

no_grad
(state)¶

class
all.policies.
SoftDeterministicPolicy
(model, optimizer, space, name='policy', **kwargs)¶ Bases:
all.approximation.approximation.Approximation
A “soft” deterministic policy compatible with soft actor-critic (SAC).
 Parameters
model (torch.nn.Module) – A PyTorch module representing the policy network. The input shape should be the same as the shape of the state (or feature) space, and the output shape should be double the size of the action space. The first n outputs will be the unscaled mean of the action for each dimension, and the second n outputs will be the logarithm of the variance.
optimizer (torch.optim.Optimizer) – An optimizer initialized with the model parameters, e.g. SGD, Adam, RMSprop, etc.
action_space (gym.spaces.Box) – The Box representing the action space.
kwargs (optional) – Any other arguments accepted by all.approximation.Approximation

class
all.policies.
SoftmaxPolicy
(model, optimizer, name='policy', **kwargs)¶ Bases:
all.approximation.approximation.Approximation
A softmax (or Boltzmann) stochastic policy for discrete actions.
 Parameters
model (torch.nn.Module) – A PyTorch module representing the policy network. The input shape should be the same as the shape of the state (or feature) space, and the output should be a vector the size of the action set.
optimizer (torch.optim.Optimizer) – An optimizer initialized with the model parameters, e.g. SGD, Adam, RMSprop, etc.
kwargs (optional) – Any other arguments accepted by all.approximation.Approximation
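
The softmax (Boltzmann) distribution over a vector of action scores can be sketched as follows (a numerically stable illustration, not the library's code):

```python
import math

def softmax(logits):
    """Convert unnormalized scores into action probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```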
all.presets¶
all.presets.atari¶

A2C Atari preset. 

C51 Atari preset. 

Dueling Double DQN with Prioritized Experience Replay (PER). 

DQN Atari preset. 

PPO Atari preset. 

Rainbow Atari Preset. 

Vanilla Actor-Critic Atari preset. 

Vanilla Policy Gradient Atari preset. 

Vanilla Q-Network Atari preset. 

Vanilla SARSA Atari preset. 

all.presets.atari.
a2c
(device='cuda', discount_factor=0.99, last_frame=40000000.0, lr=0.0007, eps=0.00015, clip_grad=0.1, entropy_loss_scaling=0.01, value_loss_scaling=0.5, n_envs=16, n_steps=5, feature_model_constructor=<function nature_features>, value_model_constructor=<function nature_value_head>, policy_model_constructor=<function nature_policy_head>)¶ A2C Atari preset.
 Parameters
device (str) – The device to load parameters and buffers onto for this agent.
discount_factor (float) – Discount factor for future rewards.
last_frame (int) – Number of frames to train.
lr (float) – Learning rate for the Adam optimizer.
eps (float) – Stability parameters for the Adam optimizer.
clip_grad (float) – The maximum magnitude of the gradient for any given parameter. Set to 0 to disable.
entropy_loss_scaling (float) – Coefficient for the entropy term in the total loss.
value_loss_scaling (float) – Coefficient for the value function loss.
n_envs (int) – Number of parallel environments.
n_steps (int) – Length of each rollout.
feature_model_constructor (function) – The function used to construct the neural feature model.
value_model_constructor (function) – The function used to construct the neural value model.
policy_model_constructor (function) – The function used to construct the neural policy model.

all.presets.atari.
c51
(device='cuda', discount_factor=0.99, last_frame=40000000.0, lr=0.0001, eps=0.00015, minibatch_size=32, update_frequency=4, target_update_frequency=1000, replay_start_size=80000, replay_buffer_size=1000000, initial_exploration=0.02, final_exploration=0.0, atoms=51, v_min=-10, v_max=10, model_constructor=<function nature_c51>)¶ C51 Atari preset.
 Parameters
device (str) – The device to load parameters and buffers onto for this agent.
discount_factor (float) – Discount factor for future rewards.
last_frame (int) – Number of frames to train.
lr (float) – Learning rate for the Adam optimizer.
eps (float) – Stability parameters for the Adam optimizer.
minibatch_size (int) – Number of experiences to sample in each training update.
update_frequency (int) – Number of timesteps per training update.
target_update_frequency (int) – Number of timesteps between updates of the target network.
replay_start_size (int) – Number of experiences in replay buffer when training begins.
replay_buffer_size (int) – Maximum number of experiences to store in the replay buffer.
initial_exploration (int) – Initial probability of choosing a random action, decayed over the course of training.
final_exploration (int) – Final probability of choosing a random action.
atoms (int) – The number of atoms in the categorical distribution used to represent the distributional value function.
v_min (int) – The expected return corresponding to the smallest atom.
v_max (int) – The expected return corresponding to the largest atom.
model_constructor (function) – The function used to construct the neural model.

all.presets.atari.
ddqn
(device='cuda', discount_factor=0.99, last_frame=40000000.0, lr=0.0001, eps=0.00015, minibatch_size=32, update_frequency=4, target_update_frequency=1000, replay_start_size=80000, replay_buffer_size=1000000, initial_exploration=1.0, final_exploration=0.01, final_exploration_frame=4000000, alpha=0.5, beta=0.5, model_constructor=<function nature_ddqn>)¶ Dueling Double DQN with Prioritized Experience Replay (PER).
 Parameters
device (str) – The device to load parameters and buffers onto for this agent.
discount_factor (float) – Discount factor for future rewards.
last_frame (int) – Number of frames to train.
lr (float) – Learning rate for the Adam optimizer.
eps (float) – Stability parameters for the Adam optimizer.
minibatch_size (int) – Number of experiences to sample in each training update.
update_frequency (int) – Number of timesteps per training update.
target_update_frequency (int) – Number of timesteps between updates of the target network.
replay_start_size (int) – Number of experiences in replay buffer when training begins.
replay_buffer_size (int) – Maximum number of experiences to store in the replay buffer.
initial_exploration (int) – Initial probability of choosing a random action, decayed until final_exploration_frame.
final_exploration (int) – Final probability of choosing a random action.
final_exploration_frame (int) – The frame where the exploration decay stops.
alpha (float) – Amount of prioritization in the prioritized experience replay buffer. (0 = no prioritization, 1 = full prioritization)
beta (float) – The strength of the importance sampling correction for prioritized experience replay. (0 = no correction, 1 = full correction)
model_constructor (function) – The function used to construct the neural model.

all.presets.atari.
dqn
(device='cuda', discount_factor=0.99, last_frame=40000000.0, lr=0.0001, eps=0.00015, minibatch_size=32, update_frequency=4, target_update_frequency=1000, replay_start_size=80000, replay_buffer_size=1000000, initial_exploration=1.0, final_exploration=0.01, final_exploration_frame=4000000, model_constructor=<function nature_dqn>)¶ DQN Atari preset.
 Parameters
device (str) – The device to load parameters and buffers onto for this agent.
discount_factor (float) – Discount factor for future rewards.
last_frame (int) – Number of frames to train.
lr (float) – Learning rate for the Adam optimizer.
eps (float) – Stability parameters for the Adam optimizer.
minibatch_size (int) – Number of experiences to sample in each training update.
update_frequency (int) – Number of timesteps per training update.
target_update_frequency (int) – Number of timesteps between updates of the target network.
replay_start_size (int) – Number of experiences in replay buffer when training begins.
replay_buffer_size (int) – Maximum number of experiences to store in the replay buffer.
initial_exploration (int) – Initial probability of choosing a random action, decayed until final_exploration_frame.
final_exploration (int) – Final probability of choosing a random action.
final_exploration_frame (int) – The frame where the exploration decay stops.
model_constructor (function) – The function used to construct the neural model.

all.presets.atari.
ppo
(device='cuda', discount_factor=0.99, last_frame=40000000.0, lr=0.00025, eps=1e-05, clip_grad=0.5, entropy_loss_scaling=0.01, value_loss_scaling=0.5, clip_initial=0.1, clip_final=0.01, epochs=4, minibatches=4, n_envs=8, n_steps=128, lam=0.95, feature_model_constructor=<function nature_features>, value_model_constructor=<function nature_value_head>, policy_model_constructor=<function nature_policy_head>)¶ PPO Atari preset.
 Parameters
device (str) – The device to load parameters and buffers onto for this agent.
discount_factor (float) – Discount factor for future rewards.
last_frame (int) – Number of frames to train.
lr (float) – Learning rate for the Adam optimizer.
eps (float) – Stability parameters for the Adam optimizer.
clip_grad (float) – The maximum magnitude of the gradient for any given parameter. Set to 0 to disable.
entropy_loss_scaling (float) – Coefficient for the entropy term in the total loss.
value_loss_scaling (float) – Coefficient for the value function loss.
clip_initial (float) – Value for epsilon in the clipped PPO objective function at the beginning of training.
clip_final (float) – Value for epsilon in the clipped PPO objective function at the end of training.
epochs (int) – Number of times to iterate through each batch.
minibatches (int) – The number of minibatches to split each batch into.
n_envs (int) – Number of parallel actors.
n_steps (int) – Length of each rollout.
lam (float) – The Generalized Advantage Estimate (GAE) decay parameter.
feature_model_constructor (function) – The function used to construct the neural feature model.
value_model_constructor (function) – The function used to construct the neural value model.
policy_model_constructor (function) – The function used to construct the neural policy model.
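
The lam parameter above controls Generalized Advantage Estimation; the backward recursion it governs can be sketched as follows (illustrative plain Python; the rollout rewards and values are hypothetical inputs, not the library's buffer code):

```python
def generalized_advantage_estimates(rewards, values, next_value,
                                    discount_factor=0.99, lam=0.95):
    """Compute GAE advantages for one rollout.

    delta_t = r_t + g*V(s_{t+1}) - V(s_t); each advantage is a
    (g*lam)-discounted sum of deltas, accumulated backwards.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        v_next = values[t + 1] if t + 1 < len(values) else next_value
        delta = rewards[t] + discount_factor * v_next - values[t]
        gae = delta + discount_factor * lam * gae
        advantages[t] = gae
    return advantages
```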

all.presets.atari.
rainbow
(device='cuda', discount_factor=0.99, last_frame=40000000.0, lr=0.0001, eps=0.00015, minibatch_size=32, update_frequency=4, target_update_frequency=1000, replay_start_size=80000, replay_buffer_size=1000000, initial_exploration=0.02, final_exploration=0.0, alpha=0.5, beta=0.5, n_steps=3, atoms=51, v_min=-10, v_max=10, sigma=0.5, model_constructor=<function nature_rainbow>)¶ Rainbow Atari Preset.
 Parameters
device (str) – The device to load parameters and buffers onto for this agent.
discount_factor (float) – Discount factor for future rewards.
last_frame (int) – Number of frames to train.
lr (float) – Learning rate for the Adam optimizer.
eps (float) – Stability parameters for the Adam optimizer.
minibatch_size (int) – Number of experiences to sample in each training update.
update_frequency (int) – Number of timesteps per training update.
target_update_frequency (int) – Number of timesteps between updates of the target network.
replay_start_size (int) – Number of experiences in replay buffer when training begins.
replay_buffer_size (int) – Maximum number of experiences to store in the replay buffer.
initial_exploration (int) – Initial probability of choosing a random action, decayed over the course of training.
final_exploration (int) – Final probability of choosing a random action.
alpha (float) – Amount of prioritization in the prioritized experience replay buffer. (0 = no prioritization, 1 = full prioritization)
beta (float) – The strength of the importance sampling correction for prioritized experience replay. (0 = no correction, 1 = full correction)
n_steps (int) – The number of steps for nstep Qlearning.
atoms (int) – The number of atoms in the categorical distribution used to represent the distributional value function.
v_min (int) – The expected return corresponding to the smallest atom.
v_max (int) – The expected return corresponding to the largest atom.
sigma (float) – Initial noisy network noise.
model_constructor (function) – The function used to construct the neural model.

all.presets.atari.
vac
(device='cuda', discount_factor=0.99, lr_v=0.0005, lr_pi=0.0001, eps=0.00015, clip_grad=0.5, value_loss_scaling=0.25, n_envs=16, feature_model_constructor=<function nature_features>, value_model_constructor=<function nature_value_head>, policy_model_constructor=<function nature_policy_head>)¶ Vanilla Actor-Critic Atari preset.
 Parameters
device (str) – The device to load parameters and buffers onto for this agent.
discount_factor (float) – Discount factor for future rewards.
last_frame (int) – Number of frames to train.
lr_v (float) – Learning rate for value network.
lr_pi (float) – Learning rate for policy network and feature network.
eps (float) – Stability parameters for the Adam optimizer.
clip_grad (float) – The maximum magnitude of the gradient for any given parameter. Set to 0 to disable.
value_loss_scaling (float) – Coefficient for the value function loss.
n_envs (int) – Number of parallel environments.
feature_model_constructor (function) – The function used to construct the neural feature model.
value_model_constructor (function) – The function used to construct the neural value model.
policy_model_constructor (function) – The function used to construct the neural policy model.

all.presets.atari.
vpg
(device='cuda', discount_factor=0.99, last_frame=40000000.0, lr=0.0007, eps=0.00015, clip_grad=0.5, value_loss_scaling=0.25, min_batch_size=1000, feature_model_constructor=<function nature_features>, value_model_constructor=<function nature_value_head>, policy_model_constructor=<function nature_policy_head>)¶ Vanilla Policy Gradient Atari preset.
 Parameters
device (str) – The device to load parameters and buffers onto for this agent.
discount_factor (float) – Discount factor for future rewards.
last_frame (int) – Number of frames to train.
lr (float) – Learning rate for the Adam optimizer.
eps (float) – Stability parameters for the Adam optimizer.
clip_grad (float) – The maximum magnitude of the gradient for any given parameter. Set to 0 to disable.
value_loss_scaling (float) – Coefficient for the value function loss.
min_batch_size (int) – Continue running complete episodes until at least this many states have been seen since the last update.
feature_model_constructor (function) – The function used to construct the neural feature model.
value_model_constructor (function) – The function used to construct the neural value model.
policy_model_constructor (function) – The function used to construct the neural policy model.

all.presets.atari.
vqn
(device='cuda', discount_factor=0.99, lr=0.001, eps=0.00015, initial_exploration=1.0, final_exploration=0.02, final_exploration_frame=1000000, n_envs=64, model_constructor=<function nature_ddqn>)¶ Vanilla Q-Network Atari preset.
 Parameters
device (str) – The device to load parameters and buffers onto for this agent.
discount_factor (float) – Discount factor for future rewards.
lr (float) – Learning rate for the Adam optimizer.
eps (float) – Stability parameters for the Adam optimizer.
initial_exploration (int) – Initial probability of choosing a random action, decayed until final_exploration_frame.
final_exploration (int) – Final probability of choosing a random action.
final_exploration_frame (int) – The frame where the exploration decay stops.
n_envs (int) – Number of parallel environments.
model_constructor (function) – The function used to construct the neural model.

all.presets.atari.
vsarsa
(device='cuda', discount_factor=0.99, lr=0.001, eps=0.00015, final_exploration_frame=1000000, final_exploration=0.02, initial_exploration=1.0, n_envs=64, model_constructor=<function nature_ddqn>)¶ Vanilla SARSA Atari preset.
 Parameters
device (str) – The device to load parameters and buffers onto for this agent.
discount_factor (float) – Discount factor for future rewards.
lr (float) – Learning rate for the Adam optimizer.
eps (float) – Stability parameters for the Adam optimizer.
initial_exploration (int) – Initial probability of choosing a random action, decayed until final_exploration_frame.
final_exploration (int) – Final probability of choosing a random action.
final_exploration_frame (int) – The frame where the exploration decay stops.
n_envs (int) – Number of parallel environments.
model_constructor (function) – The function used to construct the neural model.
all.presets.classic_control¶

A2C classic control preset. 

C51 classic control preset. 

Dueling Double DQN with Prioritized Experience Replay (PER). 

DQN classic control preset. 

PPO classic control preset. 

Rainbow classic control preset. 

Vanilla Actor-Critic classic control preset. 

Vanilla Policy Gradient classic control preset. 

Vanilla Q-Network classic control preset. 

Vanilla SARSA classic control preset. 

all.presets.classic_control.
a2c
(device='cpu', discount_factor=0.99, lr=0.003, clip_grad=0.1, entropy_loss_scaling=0.001, n_envs=4, n_steps=32, feature_model_constructor=<function fc_relu_features>, value_model_constructor=<function fc_value_head>, policy_model_constructor=<function fc_policy_head>)¶ A2C classic control preset.
 Parameters
device (str) – The device to load parameters and buffers onto for this agent.
discount_factor (float) – Discount factor for future rewards.
lr (float) – Learning rate for the Adam optimizer.
clip_grad (float) – The maximum magnitude of the gradient for any given parameter. Set to 0 to disable.
entropy_loss_scaling (float) – Coefficient for the entropy term in the total loss.
n_envs (int) – Number of parallel environments.
n_steps (int) – Length of each rollout.
feature_model_constructor (function) – The function used to construct the neural feature model.
value_model_constructor (function) – The function used to construct the neural value model.
policy_model_constructor (function) – The function used to construct the neural policy model.

all.presets.classic_control.
c51
(device='cpu', discount_factor=0.99, lr=0.0001, minibatch_size=128, update_frequency=1, replay_start_size=1000, replay_buffer_size=20000, initial_exploration=1.0, final_exploration=0.02, final_exploration_frame=10000, atoms=101, v_min=-100, v_max=100, model_constructor=<function fc_relu_dist_q>)¶ C51 classic control preset.
 Parameters
device (str) – The device to load parameters and buffers onto for this agent.
discount_factor (float) – Discount factor for future rewards.
last_frame (int) – Number of frames to train.
lr (float) – Learning rate for the Adam optimizer.
minibatch_size (int) – Number of experiences to sample in each training update.
update_frequency (int) – Number of timesteps per training update.
replay_start_size (int) – Number of experiences in replay buffer when training begins.
replay_buffer_size (int) – Maximum number of experiences to store in the replay buffer.
initial_exploration (int) – Initial probability of choosing a random action, decayed over the course of training.
final_exploration (int) – Final probability of choosing a random action.
atoms (int) – The number of atoms in the categorical distribution used to represent the distributional value function.
v_min (int) – The expected return corresponding to the smallest atom.
v_max (int) – The expected return corresponding to the largest atom.
model_constructor (function) – The function used to construct the neural model.

all.presets.classic_control.
ddqn
(device='cpu', discount_factor=0.99, lr=0.001, minibatch_size=64, update_frequency=1, target_update_frequency=100, replay_start_size=1000, replay_buffer_size=10000, initial_exploration=1.0, final_exploration=0.0, final_exploration_frame=10000, alpha=0.2, beta=0.6, model_constructor=<function dueling_fc_relu_q>)¶ Dueling Double DQN with Prioritized Experience Replay (PER).
 Parameters
device (str) – The device to load parameters and buffers onto for this agent.
discount_factor (float) – Discount factor for future rewards.
last_frame (int) – Number of frames to train.
lr (float) – Learning rate for the Adam optimizer.
minibatch_size (int) – Number of experiences to sample in each training update.
update_frequency (int) – Number of timesteps per training update.
target_update_frequency (int) – Number of timesteps between updates of the target network.
replay_start_size (int) – Number of experiences in replay buffer when training begins.
replay_buffer_size (int) – Maximum number of experiences to store in the replay buffer.
initial_exploration (float) – Initial probability of choosing a random action, decayed until final_exploration_frame.
final_exploration (float) – Final probability of choosing a random action.
final_exploration_frame (int) – The frame where the exploration decay stops.
alpha (float) – Amount of prioritization in the prioritized experience replay buffer. (0 = no prioritization, 1 = full prioritization)
beta (float) – The strength of the importance sampling correction for prioritized experience replay. (0 = no correction, 1 = full correction)
model_constructor (function) – The function used to construct the neural model.
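The alpha and beta parameters can be illustrated with a small sketch (not library code): alpha exponentiates the priorities before sampling, and beta scales the importance-sampling weights that correct for the resulting bias.

```python
# Sketch: how alpha and beta shape prioritized experience replay.
def per_probabilities(priorities, alpha):
    """Sampling probabilities: alpha=0 is uniform, alpha=1 is fully prioritized."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]

def importance_weights(probs, beta):
    """Importance-sampling weights: beta=0 is no correction, beta=1 is full."""
    n = len(probs)
    weights = [(n * p) ** -beta for p in probs]
    max_w = max(weights)
    return [w / max_w for w in weights]  # normalize by the max for stability

probs = per_probabilities([1.0, 2.0, 4.0], alpha=0.2)
weights = importance_weights(probs, beta=0.6)
```

With alpha=0.2 the sampling distribution is only mildly skewed toward high-priority transitions, which suits the short episodes of classic control tasks.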

all.presets.classic_control.dqn(device='cpu', discount_factor=0.99, lr=0.001, minibatch_size=64, update_frequency=1, target_update_frequency=100, replay_start_size=1000, replay_buffer_size=10000, initial_exploration=1.0, final_exploration=0.0, final_exploration_frame=10000, model_constructor=<function fc_relu_q>)¶ DQN classic control preset.
 Parameters
device (str) – The device to load parameters and buffers onto for this agent.
discount_factor (float) – Discount factor for future rewards.
lr (float) – Learning rate for the Adam optimizer.
minibatch_size (int) – Number of experiences to sample in each training update.
update_frequency (int) – Number of timesteps per training update.
target_update_frequency (int) – Number of timesteps between updates of the target network.
replay_start_size (int) – Number of experiences in replay buffer when training begins.
replay_buffer_size (int) – Maximum number of experiences to store in the replay buffer.
initial_exploration (float) – Initial probability of choosing a random action, decayed until final_exploration_frame.
final_exploration (float) – Final probability of choosing a random action.
final_exploration_frame (int) – The frame where the exploration decay stops.
model_constructor (function) – The function used to construct the neural model.
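The three exploration parameters together define an annealing schedule. A minimal sketch, assuming a linear decay (illustrative, not the library's scheduler):

```python
# Sketch: linear epsilon decay from initial_exploration to final_exploration
# over the first final_exploration_frame frames.
def epsilon_at(frame, initial=1.0, final=0.0, final_frame=10000):
    """Exploration probability at a given frame under linear decay."""
    if frame >= final_frame:
        return final
    return initial + (final - initial) * frame / final_frame

epsilon_at(0)      # 1.0: fully random at the start
epsilon_at(5000)   # 0.5: halfway through the decay
epsilon_at(20000)  # 0.0: decay finished, fully greedy
```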

all.presets.classic_control.ppo(device='cpu', discount_factor=0.99, lr=0.001, clip_grad=0.1, entropy_loss_scaling=0.001, epsilon=0.2, epochs=4, minibatches=4, n_envs=8, n_steps=8, lam=0.95, feature_model_constructor=<function fc_relu_features>, value_model_constructor=<function fc_value_head>, policy_model_constructor=<function fc_policy_head>)¶ PPO classic control preset.
 Parameters
device (str) – The device to load parameters and buffers onto for this agent.
discount_factor (float) – Discount factor for future rewards.
lr (float) – Learning rate for the Adam optimizer.
clip_grad (float) – The maximum magnitude of the gradient for any given parameter. Set to 0 to disable.
entropy_loss_scaling (float) – Coefficient for the entropy term in the total loss.
epsilon (float) – Value for epsilon in the clipped PPO objective function.
epochs (int) – Number of times to iterate through each batch.
minibatches (int) – The number of minibatches to split each batch into.
n_envs (int) – Number of parallel actors.
n_steps (int) – Length of each rollout.
lam (float) – The Generalized Advantage Estimate (GAE) decay parameter.
feature_model_constructor (function) – The function used to construct the neural feature model.
value_model_constructor (function) – The function used to construct the neural value model.
policy_model_constructor (function) – The function used to construct the neural policy model.

all.presets.classic_control.rainbow(device='cpu', discount_factor=0.99, lr=0.0002, minibatch_size=64, update_frequency=1, replay_buffer_size=20000, replay_start_size=1000, alpha=0.5, beta=0.5, n_steps=5, atoms=101, v_min=-100, v_max=100, sigma=0.5, model_constructor=<function fc_relu_rainbow>)¶ Rainbow classic control preset.
 Parameters
device (str) – The device to load parameters and buffers onto for this agent.
discount_factor (float) – Discount factor for future rewards.
lr (float) – Learning rate for the Adam optimizer.
minibatch_size (int) – Number of experiences to sample in each training update.
update_frequency (int) – Number of timesteps per training update.
replay_start_size (int) – Number of experiences in replay buffer when training begins.
replay_buffer_size (int) – Maximum number of experiences to store in the replay buffer.
alpha (float) – Amount of prioritization in the prioritized experience replay buffer. (0 = no prioritization, 1 = full prioritization)
beta (float) – The strength of the importance sampling correction for prioritized experience replay. (0 = no correction, 1 = full correction)
n_steps (int) – The number of steps for n-step Q-learning.
atoms (int) – The number of atoms in the categorical distribution used to represent the distributional value function.
v_min (int) – The expected return corresponding to the smallest atom.
v_max (int) – The expected return corresponding to the largest atom.
sigma (float) – Initial noisy network noise.
model_constructor (function) – The function used to construct the neural model.
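The n_steps parameter controls how many real rewards are accumulated before bootstrapping. A minimal sketch of the n-step return (illustrative, not library code):

```python
# Sketch: the n-step return used by n-step Q-learning. With n_steps=5,
# five real rewards are discounted and summed, then a value estimate
# for the state reached after those steps is bootstrapped on the end.
def n_step_return(rewards, bootstrap_value, discount=0.99):
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + discount * g
    return g

# e.g. five rewards of 1.0 with a terminal bootstrap of 0.0
n_step_return([1.0, 1.0, 1.0, 1.0, 1.0], bootstrap_value=0.0)
```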

all.presets.classic_control.vac(device='cpu', discount_factor=0.99, lr_v=0.005, lr_pi=0.001, eps=1e-05, feature_model_constructor=<function fc_relu_features>, value_model_constructor=<function fc_value_head>, policy_model_constructor=<function fc_policy_head>)¶ Vanilla Actor-Critic classic control preset.
 Parameters
device (str) – The device to load parameters and buffers onto for this agent.
discount_factor (float) – Discount factor for future rewards.
lr_v (float) – Learning rate for value network.
lr_pi (float) – Learning rate for policy network and feature network.
eps (float) – Stability parameter for the Adam optimizer.
feature_model_constructor (function) – The function used to construct the neural feature model.
value_model_constructor (function) – The function used to construct the neural value model.
policy_model_constructor (function) – The function used to construct the neural policy model.

all.presets.classic_control.vpg(device='cpu', discount_factor=0.99, lr=0.005, min_batch_size=500, feature_model_constructor=<function fc_relu_features>, value_model_constructor=<function fc_value_head>, policy_model_constructor=<function fc_policy_head>)¶ Vanilla Policy Gradient classic control preset.
 Parameters
device (str) – The device to load parameters and buffers onto for this agent.
discount_factor (float) – Discount factor for future rewards.
last_frame (int) – Number of frames to train.
lr (float) – Learning rate for the Adam optimizer.
min_batch_size (int) – Continue running complete episodes until at least this many states have been seen since the last update.
feature_model_constructor (function) – The function used to construct the neural feature model.
value_model_constructor (function) – The function used to construct the neural value model.
policy_model_constructor (function) – The function used to construct the neural policy model.
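The min_batch_size rule can be sketched as follows (illustrative only; the names below are hypothetical, not library API): episodes are run to completion and their states accumulated until the batch is large enough to update.

```python
# Sketch: accumulate complete episodes until at least min_batch_size
# states have been collected, then trigger a policy update.
def episodes_until_batch(episode_lengths, min_batch_size=500):
    """Return (episodes consumed, total states collected) before an update."""
    total, count = 0, 0
    for length in episode_lengths:
        total += length
        count += 1
        if total >= min_batch_size:
            break
    return count, total
```

Because whole episodes are always kept, the actual batch can overshoot min_batch_size; with three 200-step CartPole episodes the update fires after 600 states.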

all.presets.classic_control.vqn(device='cpu', discount_factor=0.99, lr=0.01, eps=1e-05, epsilon=0.1, n_envs=8, model_constructor=<function fc_relu_q>)¶ Vanilla Q-Network classic control preset.
 Parameters
device (str) – The device to load parameters and buffers onto for this agent.
discount_factor (float) – Discount factor for future rewards.
lr (float) – Learning rate for the Adam optimizer.
eps (float) – Stability parameter for the Adam optimizer.
epsilon (float) – Probability of choosing a random action.
n_envs (int) – Number of parallel environments.
model_constructor (function) – The function used to construct the neural model.
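Unlike the DQN presets, this preset uses a fixed epsilon rather than a decay schedule. A minimal epsilon-greedy sketch (illustrative, not library code):

```python
# Sketch: epsilon-greedy action selection with the fixed epsilon=0.1
# used by this preset.
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit: argmax
```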

all.presets.classic_control.vsarsa(device='cpu', discount_factor=0.99, lr=0.01, eps=1e-05, epsilon=0.1, n_envs=8, model_constructor=<function fc_relu_q>)¶ Vanilla SARSA classic control preset.
 Parameters
device (str) – The device to load parameters and buffers onto for this agent.
discount_factor (float) – Discount factor for future rewards.
lr (float) – Learning rate for the Adam optimizer.
eps (float) – Stability parameter for the Adam optimizer.
epsilon (float) – Probability of choosing a random action.
n_envs (int) – Number of parallel environments.
model_constructor (function) – The function used to construct the neural model.
all.presets.continuous¶

DDPG continuous control preset. 

PPO continuous control preset. 

SAC continuous control preset. 

all.presets.continuous.ddpg(device='cuda', discount_factor=0.98, last_frame=2000000.0, lr_q=0.001, lr_pi=0.001, minibatch_size=100, update_frequency=1, polyak_rate=0.005, replay_start_size=5000, replay_buffer_size=1000000.0, noise=0.1, q_model_constructor=<function fc_q>, policy_model_constructor=<function fc_deterministic_policy>)¶ DDPG continuous control preset.
 Parameters
device (str) – The device to load parameters and buffers onto for this agent.
discount_factor (float) – Discount factor for future rewards.
last_frame (int) – Number of frames to train.
lr_q (float) – Learning rate for the Q network.
lr_pi (float) – Learning rate for the policy network.
minibatch_size (int) – Number of experiences to sample in each training update.
update_frequency (int) – Number of timesteps per training update.
polyak_rate (float) – Speed with which to update the target network towards the online network.
replay_start_size (int) – Number of experiences in replay buffer when training begins.
replay_buffer_size (int) – Maximum number of experiences to store in the replay buffer.
noise (float) – The amount of exploration noise to add.
q_model_constructor (function) – The function used to construct the neural q model.
policy_model_constructor (function) – The function used to construct the neural policy model.
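The polyak_rate parameter can be made concrete with a sketch of polyak (soft) target-network averaging (illustrative only; real code would operate on network parameter tensors):

```python
# Sketch: polyak averaging moves each target parameter a small fraction
# of the way toward its online counterpart on every update.
def polyak_update(target, online, polyak_rate=0.005):
    return [(1 - polyak_rate) * t + polyak_rate * o
            for t, o in zip(target, online)]

polyak_update([0.0], [1.0])  # [0.005]: the target creeps toward the online value
```

A small rate like 0.005 keeps the target network slowly moving and therefore stabilizes the Q-learning bootstrap; polyak_rate=1.0 would copy the online network outright.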

all.presets.continuous.ppo(device='cuda', discount_factor=0.98, last_frame=2000000.0, lr=0.0003, eps=1e-05, entropy_loss_scaling=0.01, value_loss_scaling=0.5, clip_grad=0.5, clip_initial=0.2, clip_final=0.01, epochs=20, minibatches=4, n_envs=32, n_steps=128, lam=0.95, ac_model_constructor=<function fc_actor_critic>)¶ PPO continuous control preset.
 Parameters
device (str) – The device to load parameters and buffers onto for this agent.
discount_factor (float) – Discount factor for future rewards.
last_frame (int) – Number of frames to train.
lr (float) – Learning rate for the Adam optimizer.
eps (float) – Stability parameter for the Adam optimizer.
entropy_loss_scaling (float) – Coefficient for the entropy term in the total loss.
value_loss_scaling (float) – Coefficient for the value function loss.
clip_grad (float) – The maximum magnitude of the gradient for any given parameter. Set to 0 to disable.
clip_initial (float) – Value for epsilon in the clipped PPO objective function at the beginning of training.
clip_final (float) – Value for epsilon in the clipped PPO objective function at the end of training.
epochs (int) – Number of times to iterate through each batch.
minibatches (int) – The number of minibatches to split each batch into.
n_envs (int) – Number of parallel actors.
n_steps (int) – Length of each rollout.
lam (float) – The Generalized Advantage Estimate (GAE) decay parameter.
ac_model_constructor (function) – The function used to construct the neural feature, value and policy model.

all.presets.continuous.sac(device='cuda', discount_factor=0.98, last_frame=2000000.0, lr_q=0.001, lr_v=0.001, lr_pi=0.0001, minibatch_size=100, update_frequency=2, polyak_rate=0.005, replay_start_size=5000, replay_buffer_size=1000000.0, temperature_initial=0.1, lr_temperature=1e-05, entropy_target_scaling=1.0, q1_model_constructor=<function fc_q>, q2_model_constructor=<function fc_q>, v_model_constructor=<function fc_v>, policy_model_constructor=<function fc_soft_policy>)¶ SAC continuous control preset.
 Parameters
device (str) – The device to load parameters and buffers onto for this agent.
discount_factor (float) – Discount factor for future rewards.
last_frame (int) – Number of frames to train.
lr_q (float) – Learning rate for the Q networks.
lr_v (float) – Learning rate for the state-value networks.
lr_pi (float) – Learning rate for the policy network.
minibatch_size (int) – Number of experiences to sample in each training update.
update_frequency (int) – Number of timesteps per training update.
polyak_rate (float) – Speed with which to update the target network towards the online network.
replay_start_size (int) – Number of experiences in replay buffer when training begins.
replay_buffer_size (int) – Maximum number of experiences to store in the replay buffer.
temperature_initial (float) – Initial value of the temperature parameter.
lr_temperature (float) – Learning rate for the temperature. Should be low compared to other learning rates.
entropy_target_scaling (float) – The target entropy will be (entropy_target_scaling * env.action_space.shape[0]).
q1_model_constructor (function) – The function used to construct the neural q1 model.
q2_model_constructor (function) – The function used to construct the neural q2 model.
v_model_constructor (function) – The function used to construct the neural v model.
policy_model_constructor (function) – The function used to construct the neural policy model.
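The interplay of entropy_target_scaling, temperature_initial, and lr_temperature can be sketched as follows. This is illustrative only: the target computation follows the formula stated above, while the adjustment rule is one plausible scheme, not the library's implementation.

```python
# Sketch: SAC's automatic temperature tuning toward an entropy target.
def entropy_target(action_dim, entropy_target_scaling=1.0):
    """Target entropy, per the formula entropy_target_scaling * action_dim."""
    return entropy_target_scaling * action_dim

def temperature_step(temperature, current_entropy, target, lr_temperature=1e-5):
    """One plausible adjustment: raise the temperature (more exploration)
    when policy entropy is below target, lower it when above."""
    return temperature + lr_temperature * (target - current_entropy)
```

The small lr_temperature (1e-5, far below the other learning rates) keeps the temperature drifting slowly, so the entropy bonus does not whipsaw the policy objective.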