all.agents

Agent()

A reinforcement learning agent.

Multiagent()

A multiagent RL agent.

ParallelAgent()

A reinforcement learning agent that chooses actions for multiple states simultaneously.

A2C(features, v, policy[, discount_factor, ...])

Advantage Actor-Critic (A2C).

A2CTestAgent(features, policy)

C51(q_dist, replay_buffer[, ...])

A categorical DQN agent (C51).

C51TestAgent(q_dist, n_actions[, exploration])

DDPG(q, policy, replay_buffer, action_space)

Deep Deterministic Policy Gradient (DDPG).

DDPGTestAgent(policy)

DDQN(q, policy, replay_buffer[, ...])

Double Deep Q-Network (DDQN).

DDQNTestAgent

alias of DQNTestAgent

DQN(q, policy, replay_buffer[, ...])

Deep Q-Network (DQN).

DQNTestAgent(policy)

PPO(features, v, policy[, discount_factor, ...])

Proximal Policy Optimization (PPO).

PPOTestAgent

alias of A2CTestAgent

Rainbow(q_dist, replay_buffer[, ...])

Rainbow: Combining Improvements in Deep Reinforcement Learning.

RainbowTestAgent

alias of C51TestAgent

SAC(policy, q1, q2, replay_buffer[, ...])

Soft Actor-Critic (SAC).

SACTestAgent(policy)

VAC(features, v, policy[, discount_factor])

Vanilla Actor-Critic (VAC).

VACTestAgent

alias of A2CTestAgent

VPG(features, v, policy[, discount_factor, ...])

Vanilla Policy Gradient (VPG/REINFORCE).

VPGTestAgent

alias of A2CTestAgent

VQN(q, policy[, discount_factor])

Vanilla Q-Network (VQN).

VQNTestAgent(policy)

VSarsa(q, policy[, discount_factor])

Vanilla SARSA (VSarsa).

VSarsaTestAgent

alias of VQNTestAgent

IndependentMultiagent(agents)

class all.agents.A2C(features, v, policy, discount_factor=0.99, entropy_loss_scaling=0.01, n_envs=None, n_steps=4, logger=<all.logging.dummy.DummyLogger object>)

Bases: ParallelAgent

Advantage Actor-Critic (A2C). A2C is a policy gradient method in the actor-critic family. It is the synchronous variant of the Asynchronous Advantage Actor-Critic (A3C). The key distinguishing feature of A2C/A3C compared to prior actor-critic methods is the use of parallel actors interacting with a parallel set of environments. This mitigates the need for a replay buffer by providing a different mechanism for decorrelating samples. https://arxiv.org/abs/1602.01783

Parameters:
  • features (FeatureNetwork) – Shared feature layers.

  • v (VNetwork) – Value head which approximates the state-value function.

  • policy (StochasticPolicy) – Policy head which outputs an action distribution.

  • discount_factor (float) – Discount factor for future rewards.

  • entropy_loss_scaling (float) – Contribution of the entropy loss to the total policy loss.

  • n_envs (int) – Number of parallel actors/environments

  • n_steps (int) – Number of timesteps per rollout. Updates are performed once per rollout.

  • logger (Logger) – Used for logging.
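
For intuition, the following is a minimal sketch of the A2C update for a single n-step rollout, written in plain PyTorch rather than using the library's internals; values, returns, log_probs, and entropies are hypothetical tensors collected from the v and policy heads over the rollout.

    import torch.nn.functional as F

    def a2c_loss(values, returns, log_probs, entropies, entropy_loss_scaling=0.01):
        # Advantages are treated as constants when differentiating the policy loss.
        advantages = (returns - values).detach()
        policy_loss = -(log_probs * advantages).mean()
        value_loss = F.mse_loss(values, returns)
        entropy_loss = -entropies.mean()
        return value_loss + policy_loss + entropy_loss_scaling * entropy_loss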

act(states)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters:

state_array (all.environment.StateArray) – An array of states for each parallel environment.

Returns:

The actions to take for each parallel environment.

Return type:

torch.Tensor

class all.agents.A2CTestAgent(features, policy)

Bases: Agent, ParallelAgent

act(state)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters:

state (all.environment.State) – The environment state at the current timestep.

Returns:

The action to take at the current timestep.

Return type:

torch.Tensor

class all.agents.Agent

Bases: ABC, Schedulable

A reinforcement learning agent.

In reinforcement learning, an Agent learns by interacting with an Environment. Usually, an Agent tries to maximize a reward signal. It does this by observing environment “states”, taking “actions”, receiving “rewards”, and learning which state-action pairs correlate with high rewards. An Agent implementation should encapsulate some particular reinforcement learning algorithm.

abstract act(state)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters:

state (all.environment.State) – The environment state at the current timestep.

Returns:

The action to take at the current timestep.

Return type:

torch.Tensor
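
As an illustration of this interface, here is a minimal (and deliberately trivial) Agent subclass that acts uniformly at random; it assumes only the abstract act() contract documented above and performs no learning.

    import torch
    from all.agents import Agent

    class RandomAgent(Agent):
        def __init__(self, n_actions):
            self.n_actions = n_actions

        def act(self, state):
            # A learning agent would also update its value function and/or policy here.
            return torch.randint(self.n_actions, (1,))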

class all.agents.C51(q_dist, replay_buffer, discount_factor=0.99, eps=1e-05, exploration=0.02, minibatch_size=32, replay_start_size=5000, update_frequency=1, logger=<all.logging.dummy.DummyLogger object>)

Bases: Agent

A categorical DQN agent (C51). Rather than making a point estimate of the Q-function, C51 estimates a categorical distribution over possible values. The 51 refers to the number of atoms in the categorical distribution used to estimate the value distribution. https://arxiv.org/abs/1707.06887

Parameters:
  • q_dist (QDist) – Approximation of the Q distribution.

  • replay_buffer (ReplayBuffer) – The experience replay buffer.

  • discount_factor (float) – Discount factor for future rewards.

  • eps (float) – Stability parameter for computing the loss function.

  • exploration (float) – The probability of choosing a random action.

  • minibatch_size (int) – The number of experiences to sample in each training update.

  • replay_start_size (int) – Number of experiences in replay buffer when training begins.

  • update_frequency (int) – Number of timesteps per training update.
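
To illustrate the distributional idea, the sketch below shows how a greedy action can be recovered from a categorical value distribution; probs and atoms are hypothetical tensors, not part of the library API.

    import torch

    def greedy_action(probs, atoms):
        # probs: (n_actions, n_atoms) probabilities; atoms: (n_atoms,) support values.
        q_values = (probs * atoms).sum(dim=-1)  # expected return of each action
        return torch.argmax(q_values)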

act(state)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters:

state (all.environment.State) – The environment state at the current timestep.

Returns:

The action to take at the current timestep.

Return type:

torch.Tensor

eval(state)
class all.agents.C51TestAgent(q_dist, n_actions, exploration=0.0)

Bases: Agent

act(state)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters:

state (all.environment.State) – The environment state at the current timestep.

Returns:

The action to take at the current timestep.

Return type:

torch.Tensor

class all.agents.DDPG(q, policy, replay_buffer, action_space, discount_factor=0.99, minibatch_size=32, noise=0.1, replay_start_size=5000, update_frequency=1)

Bases: Agent

Deep Deterministic Policy Gradient (DDPG). DDPG extends the ideas of DQN to a continuous action setting. Unlike DQN, which uses a single joint Q/policy network, DDPG uses separate networks to approximate the Q-function and the policy. The policy network outputs a vector action in some continuous space. A small amount of noise is added to aid exploration. The Q-network is used to train the policy network. A replay buffer is used to allow for batch updates and decorrelation of the samples. https://arxiv.org/abs/1509.02971

Parameters:
  • q (QContinuous) – An Approximation of the continuous action Q-function.

  • policy (DeterministicPolicy) – An Approximation of a deterministic policy.

  • replay_buffer (ReplayBuffer) – The experience replay buffer.

  • action_space (gymnasium.spaces.Box) – Description of the action space.

  • discount_factor (float) – Discount factor for future rewards.

  • minibatch_size (int) – The number of experiences to sample in each training update.

  • noise (float) – the amount of noise to add to each action (before scaling).

  • replay_start_size (int) – Number of experiences in replay buffer when training begins.

  • update_frequency (int) – Number of timesteps per training update.
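
The exploration scheme described above can be sketched in a few lines of plain PyTorch; policy_output, low, and high are hypothetical tensors (the action and the action space bounds), and this is not the library's implementation.

    import torch

    def noisy_action(policy_output, low, high, noise=0.1):
        # Perturb the deterministic action with Gaussian noise, then clip to the action space.
        perturbed = policy_output + noise * torch.randn_like(policy_output)
        return torch.max(torch.min(perturbed, high), low)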

act(state)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters:

state (all.environment.State) – The environment state at the current timestep.

Returns:

The action to take at the current timestep.

Return type:

torch.Tensor

eval(state)
class all.agents.DDPGTestAgent(policy)

Bases: Agent

act(state)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters:

state (all.environment.State) – The environment state at the current timestep.

Returns:

The action to take at the current timestep.

Return type:

torch.Tensor

class all.agents.DDQN(q, policy, replay_buffer, discount_factor=0.99, loss=<function weighted_mse_loss>, minibatch_size=32, replay_start_size=5000, update_frequency=1)

Bases: Agent

Double Deep Q-Network (DDQN). DDQN is an enhancement to DQN that uses a “double Q-style” update, wherein the online network is used to select target actions and the target network is used to evaluate these actions. https://arxiv.org/abs/1509.06461 This agent also adds support for weighted replay buffers, such as prioritized experience replay (PER). https://arxiv.org/abs/1511.05952

Parameters:
  • q (QNetwork) – An Approximation of the Q function.

  • policy (GreedyPolicy) – A policy derived from the Q-function.

  • replay_buffer (ReplayBuffer) – The experience replay buffer.

  • discount_factor (float) – Discount factor for future rewards.

  • loss (function) – The weighted loss function to use.

  • minibatch_size (int) – The number of experiences to sample in each training update.

  • replay_start_size (int) – Number of experiences in replay buffer when training begins.

  • update_frequency (int) – Number of timesteps per training update.
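
The “double Q-style” target can be sketched as follows in plain PyTorch (illustrative only; q_online and q_target stand for the online and target networks, which map a batch of states to per-action values).

    import torch

    def double_q_target(q_online, q_target, rewards, next_states, not_done, discount_factor=0.99):
        with torch.no_grad():
            # The online network selects the greedy actions...
            next_actions = q_online(next_states).argmax(dim=1, keepdim=True)
            # ...and the target network evaluates them.
            next_values = q_target(next_states).gather(1, next_actions).squeeze(1)
        return rewards + discount_factor * not_done * next_values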

act(state)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters:

state (all.environment.State) – The environment state at the current timestep.

Returns:

The action to take at the current timestep.

Return type:

torch.Tensor

eval(state)
all.agents.DDQNTestAgent

alias of DQNTestAgent

class all.agents.DQN(q, policy, replay_buffer, discount_factor=0.99, loss=torch.nn.functional.mse_loss, minibatch_size=32, replay_start_size=5000, update_frequency=1)

Bases: Agent

Deep Q-Network (DQN). DQN was one of the original deep reinforcement learning algorithms. It extends the ideas behind Q-learning to work well with modern convolutional networks. The core innovation is the use of a replay buffer, which allows batch-style updates on decorrelated samples. It also uses a “target” network in order to improve the stability of updates. https://www.nature.com/articles/nature14236

Parameters:
  • q (QNetwork) – An Approximation of the Q function.

  • policy (GreedyPolicy) – A policy derived from the Q-function.

  • replay_buffer (ReplayBuffer) – The experience replay buffer.

  • discount_factor (float) – Discount factor for future rewards.

  • exploration (float) – The probability of choosing a random action.

  • loss (function) – The weighted loss function to use.

  • minibatch_size (int) – The number of experiences to sample in each training update.

  • n_actions (int) – The number of available actions.

  • replay_start_size (int) – Number of experiences in replay buffer when training begins.

  • update_frequency (int) – Number of timesteps per training update.
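
The standard DQN target and loss can be sketched as follows (illustrative; the networks and tensors are hypothetical, and the library computes this internally).

    import torch
    import torch.nn.functional as F

    def dqn_loss(q_net, q_target_net, states, actions, rewards, next_states, not_done,
                 discount_factor=0.99):
        q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            next_values = q_target_net(next_states).max(dim=1).values
            targets = rewards + discount_factor * not_done * next_values
        return F.mse_loss(q_sa, targets)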

act(state)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters:

state (all.environment.State) – The environment state at the current timestep.

Returns:

The action to take at the current timestep.

Return type:

torch.Tensor

eval(state)
class all.agents.DQNTestAgent(policy)

Bases: Agent

act(state)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters:

state (all.environment.State) – The environment state at the current timestep.

Returns:

The action to take at the current timestep.

Return type:

torch.Tensor

class all.agents.IndependentMultiagent(agents)

Bases: Multiagent

act(state)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters:

multiagent_state (all.core.MultiagentState) – The environment state at the current timestep.

Returns:

The action for the current agent to take at the current timestep.

Return type:

torch.Tensor

class all.agents.Multiagent

Bases: ABC, Schedulable

A multiagent RL agent. Differs from standard agents in that it accepts a multiagent state.

In reinforcement learning, an Agent learns by interacting with an Environment. Usually, an agent tries to maximize a reward signal. It does this by observing environment “states”, taking “actions”, receiving “rewards”, and learning which state-action pairs correlate with high rewards. An Agent implementation should encapsulate some particular reinforcement learning algorithm.

abstract act(multiagent_state)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters:

multiagent_state (all.core.MultiagentState) – The environment state at the current timestep.

Returns:

The action for the current agent to take at the current timestep.

Return type:

torch.Tensor

class all.agents.PPO(features, v, policy, discount_factor=0.99, entropy_loss_scaling=0.01, epochs=4, epsilon=0.2, lam=0.95, minibatches=4, compute_batch_size=256, n_envs=None, n_steps=4, logger=<all.logging.dummy.DummyLogger object>)

Bases: ParallelAgent

Proximal Policy Optimization (PPO). PPO is an actor-critic style policy gradient algorithm that allows for the reuse of samples by using importance weighting. This often increases sample efficiency compared to algorithms such as A2C. To avoid overfitting, PPO uses a special “clipped” objective that prevents the algorithm from changing the current policy too quickly.

Parameters:
  • features (FeatureNetwork) – Shared feature layers.

  • v (VNetwork) – Value head which approximates the state-value function.

  • policy (StochasticPolicy) – Policy head which outputs an action distribution.

  • discount_factor (float) – Discount factor for future rewards.

  • entropy_loss_scaling (float) – Contribution of the entropy loss to the total policy loss.

  • epochs (int) – Number of times to reuse each sample.

  • lam (float) – The Generalized Advantage Estimate (GAE) decay parameter.

  • minibatches (int) – The number of minibatches to split each batch into.

  • compute_batch_size (int) – The batch size to use for computations that do not need backpropagation.

  • n_envs (int) – Number of parallel actors/environments.

  • n_steps (int) – Number of timesteps per rollout. Updates are performed once per rollout.

  • logger (Logger) – Used for logging.
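
The “clipped” objective can be sketched in plain PyTorch as follows (illustrative only; the inputs are hypothetical tensors gathered over a rollout).

    import torch

    def ppo_policy_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
        ratios = torch.exp(new_log_probs - old_log_probs)
        unclipped = ratios * advantages
        clipped = torch.clamp(ratios, 1 - epsilon, 1 + epsilon) * advantages
        # Taking the minimum prevents the policy from moving too far in a single update.
        return -torch.min(unclipped, clipped).mean()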

act(states)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters:

state_array (all.environment.StateArray) – An array of states for each parallel environment.

Returns:

The actions to take for each parallel environment.

Return type:

torch.Tensor

eval(states)
all.agents.PPOTestAgent

alias of A2CTestAgent

class all.agents.ParallelAgent

Bases: ABC, Schedulable

A reinforcement learning agent that chooses actions for multiple states simultaneously. Differs from Agent in that it accepts a StateArray instead of a State, allowing it to process input from multiple environments in parallel.

In reinforcement learning, an Agent learns by interacting with an Environment. Usually, an Agent tries to maximize a reward signal. It does this by observing environment “states”, taking “actions”, receiving “rewards”, and learning which state-action pairs correlate with high rewards. An Agent implementation should encapsulate some particular reinforcement learning algorithm.

abstract act(state_array)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters:

state_array (all.environment.StateArray) – An array of states for each parallel environment.

Returns:

The actions to take for each parallel environment.

Return type:

torch.Tensor
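
A minimal sketch of the ParallelAgent contract is shown below: return one action per state in the StateArray. It assumes only the abstract act() contract documented above (plus the StateArray having a length) and performs no learning.

    import torch
    from all.agents import ParallelAgent

    class RandomParallelAgent(ParallelAgent):
        def __init__(self, n_actions):
            self.n_actions = n_actions

        def act(self, state_array):
            # One random action for each parallel environment.
            return torch.randint(self.n_actions, (len(state_array),))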

class all.agents.Rainbow(q_dist, replay_buffer, discount_factor=0.99, eps=1e-05, exploration=0.02, minibatch_size=32, replay_start_size=5000, update_frequency=1, logger=<all.logging.dummy.DummyLogger object>)

Bases: C51

Rainbow: Combining Improvements in Deep Reinforcement Learning. Rainbow combines C51 with five other “enhancements” to DQN: double Q-learning, dueling networks, noisy networks, prioritized replay, and n-step rollouts. https://arxiv.org/abs/1710.02298

Whether this agent is Rainbow or C51 depends on the objects that are passed into it. Dueling networks and noisy networks are part of the model used for q_dist, while prioritized replay and n-step rollouts are handled by the replay buffer. Double Q-learning is always used.

Parameters:
  • q_dist (QDist) – Approximation of the Q distribution.

  • replay_buffer (ReplayBuffer) – The experience replay buffer.

  • discount_factor (float) – Discount factor for future rewards.

  • eps (float) – Stability parameter for computing the loss function.

  • exploration (float) – The probability of choosing a random action.

  • minibatch_size (int) – The number of experiences to sample in each training update.

  • replay_start_size (int) – Number of experiences in replay buffer when training begins.

  • update_frequency (int) – Number of timesteps per training update.
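
Of the components listed above, n-step rollouts amount to bootstrapping from an n-step discounted return; a plain-Python sketch follows (illustrative; the library's replay buffer handles this internally).

    def n_step_return(rewards, bootstrap_value, discount_factor=0.99):
        # rewards: r_0..r_{n-1} for one rollout; bootstrap_value: value estimate after step n.
        ret = bootstrap_value
        for r in reversed(rewards):
            ret = r + discount_factor * ret
        return ret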

all.agents.RainbowTestAgent

alias of C51TestAgent

class all.agents.SAC(policy, q1, q2, replay_buffer, discount_factor=0.99, entropy_backups=True, entropy_target=-2.0, lr_temperature=0.0001, minibatch_size=32, replay_start_size=5000, temperature_initial=0.1, update_frequency=1, logger=<all.logging.dummy.DummyLogger object>)

Bases: Agent

Soft Actor-Critic (SAC). SAC is a proposed improvement to DDPG that replaces the standard mean-squared Bellman error (MSBE) objective with a “maximum entropy” objective that improves exploration. It also uses a few other tricks, such as the “Clipped Double-Q Learning” trick introduced by TD3. This implementation uses automatic temperature adjustment to replace the difficult-to-set temperature parameter with a more easily tuned entropy target parameter. https://arxiv.org/abs/1801.01290

Parameters:
  • policy (DeterministicPolicy) – An Approximation of a deterministic policy.

  • q1 (QContinuous) – An Approximation of the continuous action Q-function.

  • q2 (QContinuous) – An Approximation of the continuous action Q-function.

  • replay_buffer (ReplayBuffer) – The experience replay buffer.

  • discount_factor (float) – Discount factor for future rewards.

  • entropy_target (float) – The desired entropy of the policy. Usually -env.action_space.shape[0]

  • minibatch_size (int) – The number of experiences to sample in each training update.

  • replay_start_size (int) – Number of experiences in replay buffer when training begins.

  • temperature_initial (float) – The initial temperature used in the maximum entropy objective.

  • update_frequency (int) – Number of timesteps per training update.
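
The “Clipped Double-Q Learning” trick combined with the maximum entropy objective yields a soft critic target of roughly the following form (a sketch in plain PyTorch; all inputs are hypothetical tensors and this is not the library's implementation).

    import torch

    def soft_q_target(rewards, not_done, next_q1, next_q2, next_log_probs,
                      temperature=0.1, discount_factor=0.99):
        # Take the minimum of the two critics and subtract the scaled log-probability,
        # which acts as an entropy bonus.
        soft_value = torch.min(next_q1, next_q2) - temperature * next_log_probs
        return rewards + discount_factor * not_done * soft_value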

act(state)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters:

state (all.environment.State) – The environment state at the current timestep.

Returns:

The action to take at the current timestep.

Return type:

torch.Tensor

class all.agents.SACTestAgent(policy)

Bases: Agent

act(state)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters:

state (all.environment.State) – The environment state at the current timestep.

Returns:

The action to take at the current timestep.

Return type:

torch.Tensor

class all.agents.VAC(features, v, policy, discount_factor=1)

Bases: ParallelAgent

Vanilla Actor-Critic (VAC). VAC is an implementation of the actor-critic algorithm found in the Sutton and Barto (2018) textbook. This implementation tweaks the algorithm slightly by using a shared feature layer. It is also compatible with the use of parallel environments. https://papers.nips.cc/paper/1786-actor-critic-algorithms.pdf

Parameters:
  • features (FeatureNetwork) – Shared feature layers.

  • v (VNetwork) – Value head which approximates the state-value function.

  • policy (StochasticPolicy) – Policy head which outputs an action distribution.

  • discount_factor (float) – Discount factor for future rewards.

  • n_envs (int) – Number of parallel actors/environments

  • n_steps (int) – Number of timesteps per rollout. Updates are performed once per rollout.

  • logger (Logger) – Used for logging.
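
A one-step actor-critic update of the kind described above can be sketched as follows (illustrative; value, next_value, reward, and log_prob are hypothetical tensors for a single transition).

    import torch.nn.functional as F

    def vac_losses(value, next_value, reward, log_prob, discount_factor=0.99):
        td_target = reward + discount_factor * next_value.detach()
        advantage = (td_target - value).detach()
        value_loss = F.mse_loss(value, td_target)
        policy_loss = -log_prob * advantage
        return value_loss, policy_loss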

act(state)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters:

state_array (all.environment.StateArray) – An array of states for each parallel environment.

Returns:

The actions to take for each parallel environment.

Return type:

torch.Tensor

eval(state)
all.agents.VACTestAgent

alias of A2CTestAgent

class all.agents.VPG(features, v, policy, discount_factor=0.99, min_batch_size=1)

Bases: Agent

Vanilla Policy Gradient (VPG/REINFORCE). VPG (also known as REINFORCE) is the least biased implementation of the policy gradient theorem. It uses complete episode rollouts as unbiased estimates of the Q-function, rather than the n-step rollouts found in most actor-critic algorithms. The state-value function approximation reduces variance, but does not introduce any bias. This implementation introduces two tweaks. First, it uses a shared feature layer. Second, it introduces the capacity for training on multiple episodes at once. These enhancements often improve learning without sacrificing the essential character of the algorithm. https://link.springer.com/article/10.1007/BF00992696

Parameters:
  • features (FeatureNetwork) – Shared feature layers.

  • v (VNetwork) – Value head which approximates the state-value function.

  • policy (StochasticPolicy) – Policy head which outputs an action distribution.

  • discount_factor (float) – Discount factor for future rewards.

  • min_batch_size (int) – Updates occur when an episode ends after at least this many state-action pairs have been seen. Set this to a large value in order to train on multiple episodes at once.
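
The complete-episode update described above can be sketched as follows (illustrative; rewards is a list of per-step rewards, while values and log_probs are hypothetical tensors from the v and policy heads).

    import torch
    import torch.nn.functional as F

    def vpg_losses(rewards, values, log_probs, discount_factor=0.99):
        # Compute the discounted return from each timestep to the end of the episode.
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + discount_factor * g
            returns.insert(0, g)
        returns = torch.tensor(returns)
        advantages = (returns - values).detach()
        policy_loss = -(log_probs * advantages).mean()
        value_loss = F.mse_loss(values, returns)
        return policy_loss, value_loss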

act(state)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters:

state (all.environment.State) – The environment state at the current timestep.

Returns:

The action to take at the current timestep.

Return type:

torch.Tensor

eval(state)
all.agents.VPGTestAgent

alias of A2CTestAgent

class all.agents.VQN(q, policy, discount_factor=0.99)

Bases: ParallelAgent

Vanilla Q-Network (VQN). VQN is an implementation of the Q-learning algorithm found in the Sutton and Barto (2018) textbook. Q-learning algorithms attempt to learn the optimal policy while executing a (generally) suboptimal policy (typically epsilon-greedy). In theory, this allows the agent to gain the benefits of exploration without sacrificing the performance of the final policy. However, the cost is that Q-learning is generally less stable than its on-policy counterpart, SARSA. http://www.cs.rhul.ac.uk/~chrisw/new_thesis.pdf

Parameters:
  • q (QNetwork) – An Approximation of the Q function.

  • policy (GreedyPolicy) – A policy derived from the Q-function.

  • discount_factor (float) – Discount factor for future rewards.
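
The off-policy Q-learning target described above bootstraps from the best next action; a one-transition sketch follows (illustrative, with hypothetical tensors).

    def q_learning_target(reward, next_q_values, discount_factor=0.99):
        # Bootstrap from the maximum next-state action value (off-policy).
        return reward + discount_factor * next_q_values.max()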

act(state)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters:

state_array (all.environment.StateArray) – An array of states for each parallel environment.

Returns:

The actions to take for each parallel environment.

Return type:

torch.Tensor

eval(state)
class all.agents.VQNTestAgent(policy)

Bases: Agent, ParallelAgent

act(state)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters:

state (all.environment.State) – The environment state at the current timestep.

Returns:

The action to take at the current timestep.

Return type:

torch.Tensor

class all.agents.VSarsa(q, policy, discount_factor=0.99)

Bases: ParallelAgent

Vanilla SARSA (VSarsa). SARSA (State-Action-Reward-State-Action) is an on-policy alternative to Q-learning. Unlike Q-learning, SARSA attempts to learn the Q-function for the current policy rather than the optimal policy. This approach is more stable but may not result in the optimal policy. However, this problem can be mitigated by decaying the exploration rate over time.

Parameters:
  • q (QNetwork) – An Approximation of the Q function.

  • policy (GreedyPolicy) – A policy derived from the Q-function.

  • discount_factor (float) – Discount factor for future rewards.
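
By contrast with Q-learning, SARSA bootstraps from the action the policy actually chose; a one-transition sketch follows (illustrative, with hypothetical tensors).

    def sarsa_target(reward, next_q_values, next_action, discount_factor=0.99):
        # Bootstrap from the value of the action actually taken (on-policy).
        return reward + discount_factor * next_q_values[next_action]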

act(state)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters:

state_array (all.environment.StateArray) – An array of states for each parallel environment.

Returns:

The actions to take for each parallel environment.

Return type:

torch.Tensor

eval(state)
all.agents.VSarsaTestAgent

alias of VQNTestAgent