all.agents

Agent() – A reinforcement learning agent.
A2C(features, v, policy[, discount_factor, …]) – Advantage Actor-Critic (A2C).
C51(q_dist, replay_buffer[, …]) – A categorical DQN agent (C51).
DDPG(q, policy, replay_buffer, action_space) – Deep Deterministic Policy Gradient (DDPG).
DDQN(q, policy, replay_buffer[, …]) – Double Deep Q-Network (DDQN).
DQN(q, policy, replay_buffer[, …]) – Deep Q-Network (DQN).
PPO(features, v, policy[, discount_factor, …]) – Proximal Policy Optimization (PPO).
Rainbow(q_dist, replay_buffer[, …]) – Rainbow: Combining Improvements in Deep Reinforcement Learning.
SAC(policy, q_1, q_2, v, replay_buffer[, …]) – Soft Actor-Critic (SAC).
VAC(features, v, policy[, discount_factor]) – Vanilla Actor-Critic (VAC).
VPG(features, v, policy[, discount_factor, …]) – Vanilla Policy Gradient (VPG/REINFORCE).
VQN(q, policy[, discount_factor]) – Vanilla Q-Network (VQN).
VSarsa(q, policy[, discount_factor]) – Vanilla SARSA (VSarsa).

class all.agents.A2C(features, v, policy, discount_factor=0.99, entropy_loss_scaling=0.01, n_envs=None, n_steps=4, writer=<all.logging.DummyWriter object>)

Bases: all.agents._agent.Agent

Advantage Actor-Critic (A2C). A2C is a policy gradient method in the actor-critic family. It is the synchronous variant of the Asynchronous Advantage Actor-Critic (A3C). The key distinguishing feature of A2C/A3C compared to prior actor-critic methods is the use of parallel actors interacting with a parallel set of environments. This removes the need for a replay buffer by providing a different mechanism for decorrelating samples. https://arxiv.org/abs/1602.01783

Parameters
  • features (FeatureNetwork) – Shared feature layers.

  • v (VNetwork) – Value head which approximates the state-value function.

  • policy (StochasticPolicy) – Policy head which outputs an action distribution.

  • discount_factor (float) – Discount factor for future rewards.

  • n_envs (int) – Number of parallel actors/environments

  • n_steps (int) – Number of timesteps per rollout. Updates are performed once per rollout.

  • writer (Writer) – Used for logging.
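
The core update can be sketched in plain PyTorch. The snippet below is an illustrative sketch rather than this library's implementation; it assumes that rewards, values, and log_probs are tensors collected over one n-step rollout and that final_value is a detached bootstrap estimate of the final state's value.

    import torch

    def a2c_losses(rewards, values, log_probs, final_value, discount_factor=0.99):
        # rewards, values, log_probs: shape (n_steps,); final_value: detached V(s_n)
        returns = []
        ret = final_value
        for r in reversed(rewards):
            ret = r + discount_factor * ret          # bootstrapped n-step return
            returns.append(ret)
        returns = torch.stack(list(reversed(returns)))
        advantages = returns - values.detach()       # advantage estimate for the actor
        value_loss = (returns - values).pow(2).mean()
        policy_loss = -(advantages * log_probs).mean()
        return value_loss, policy_loss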

act(states, rewards)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters
  • state (all.environment.State) – The environment state at the current timestep.

  • reward (torch.Tensor) – The reward from the previous timestep.

Returns

The action to take at the current timestep.

Return type

torch.Tensor

eval(states, _)

Select an action for the current timestep in evaluation mode.

Unlike act, this method should NOT update the internal parameters of the agent. Most of the time, this method should return the greedy action according to the current policy. This method is useful when using evaluation methodologies that distinguish between the performance of the agent during training and the performance of the resulting policy.

Parameters
  • state (all.environment.State) – The environment state at the current timestep.

  • reward (torch.Tensor) – The reward from the previous timestep.

Returns

The action to take at the current timestep.

Return type

torch.Tensor

class all.agents.Agent

Bases: abc.ABC, all.optim.scheduler.Schedulable

A reinforcement learning agent.

In reinforcement learning, an Agent learns by interacting with an Environment. Usually, an agent tries to maximize a reward signal. It does this by observing environment “states”, taking “actions”, receiving “rewards”, and in doing so, learning which state-action pairs correlate with high rewards. An Agent implementation should encapsulate a particular reinforcement learning algorithm.
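
As a rough illustration of this interface (not an example taken from the library itself), a concrete agent only needs to implement act and eval; the experiment loop calls act once per timestep with the new state and the reward earned by the previous action. The RandomAgent below is a hypothetical sketch that assumes a discrete action space and that Agent can be subclassed with no additional constructor arguments.

    import torch
    from all.agents import Agent

    class RandomAgent(Agent):
        """Hypothetical agent that ignores rewards and acts uniformly at random."""

        def __init__(self, n_actions):
            self.n_actions = n_actions

        def act(self, state, reward):
            # A learning agent would use `reward` here to update its parameters.
            return torch.randint(self.n_actions, (1,))

        def eval(self, state, reward):
            # No learning in evaluation mode; a trained agent would act greedily.
            return torch.randint(self.n_actions, (1,))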

abstract act(state, reward)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters
  • state (all.environment.State) – The environment state at the current timestep.

  • reward (torch.Tensor) – The reward from the previous timestep.

Returns

The action to take at the current timestep.

Return type

torch.Tensor

abstract eval(state, reward)

Select an action for the current timestep in evaluation mode.

Unlike act, this method should NOT update the internal parameters of the agent. Most of the time, this method should return the greedy action according to the current policy. This method is useful when using evaluation methodologies that distinguish between the performance of the agent during training and the performance of the resulting policy.

Parameters
  • state (all.environment.State) – The environment state at the current timestep.

  • reward (torch.Tensor) – The reward from the previous timestep.

Returns

The action to take at the current timestep.

Return type

torch.Tensor

class all.agents.C51(q_dist, replay_buffer, discount_factor=0.99, eps=1e-05, exploration=0.02, minibatch_size=32, replay_start_size=5000, update_frequency=1, writer=<all.logging.DummyWriter object>)

Bases: all.agents._agent.Agent

A categorical DQN agent (C51). Rather than making a point estimate of the Q-function, C51 estimates a categorical distribution over possible values. The 51 refers to the number of atoms in the categorical distribution used to estimate the value distribution. https://arxiv.org/abs/1707.06887

Parameters
  • q_dist (QDist) – Approximation of the Q distribution.

  • replay_buffer (ReplayBuffer) – The experience replay buffer.

  • discount_factor (float) – Discount factor for future rewards.

  • eps (float) – Stability parameter for computing the loss function.

  • exploration (float) – The probability of choosing a random action.

  • minibatch_size (int) – The number of experiences to sample in each training update.

  • replay_start_size (int) – Number of experiences in replay buffer when training begins.

  • update_frequency (int) – Number of timesteps per training update.
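
For intuition, expected Q-values can be recovered from the predicted atom probabilities and actions chosen epsilon-greedily. The following is a conceptual sketch (not this agent's actual code) that assumes q_probs has shape (n_actions, n_atoms) and atoms holds the fixed support of the distribution.

    import torch

    def c51_act(q_probs, atoms, exploration=0.02):
        # q_probs: (n_actions, n_atoms) probabilities over the support `atoms`
        q_values = (q_probs * atoms).sum(dim=-1)   # expected return of each action
        if torch.rand(1).item() < exploration:
            return torch.randint(q_values.shape[0], (1,))
        return q_values.argmax().unsqueeze(0)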

act(state, reward)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters
  • state (all.environment.State) – The environment state at the current timestep.

  • reward (torch.Tensor) – The reward from the previous timestep.

Returns

The action to take at the current timestep.

Return type

torch.Tensor

eval(state, _)

Select an action for the current timestep in evaluation mode.

Unlike act, this method should NOT update the internal parameters of the agent. Most of the time, this method should return the greedy action according to the current policy. This method is useful when using evaluation methodologies that distinguish between the performance of the agent during training and the performance of the resulting policy.

Parameters
  • state (all.environment.State) – The environment state at the current timestep.

  • reward (torch.Tensor) – The reward from the previous timestep.

Returns

The action to take at the current timestep.

Return type

torch.Tensor

class all.agents.DDPG(q, policy, replay_buffer, action_space, discount_factor=0.99, minibatch_size=32, noise=0.1, replay_start_size=5000, update_frequency=1)

Bases: all.agents._agent.Agent

Deep Deterministic Policy Gradient (DDPG). DDPG extends the ideas of DQN to a continuous action setting. Whereas DQN derives its policy directly from the Q-function, DDPG uses separate networks to approximate the Q-function and the policy. The policy network outputs a vector action in some continuous space. A small amount of noise is added to aid exploration. The Q-network is used to train the policy network. A replay buffer is used to allow for batch updates and decorrelation of the samples. https://arxiv.org/abs/1509.02971

Parameters
  • q (QContinuous) – An Approximation of the continuous action Q-function.

  • policy (DeterministicPolicy) – An Approximation of a deterministic policy.

  • replay_buffer (ReplayBuffer) – The experience replay buffer.

  • action_space (gym.spaces.Box) – Description of the action space.

  • discount_factor (float) – Discount factor for future rewards.

  • minibatch_size (int) – The number of experiences to sample in each training update.

  • noise (float) – the amount of noise to add to each action (before scaling).

  • replay_start_size (int) – Number of experiences in replay buffer when training begins.

  • update_frequency (int) – Number of timesteps per training update.
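
During training, exploration comes from adding Gaussian noise to the deterministic action and clipping the result to the action space. A hedged sketch of that step (not the library's implementation), assuming policy maps a state to an action tensor and action_space is a gym.spaces.Box:

    import torch

    def ddpg_explore(policy, state, action_space, noise=0.1):
        with torch.no_grad():
            action = policy(state)                           # deterministic action
        action = action + noise * torch.randn_like(action)   # Gaussian exploration noise
        low = torch.as_tensor(action_space.low, dtype=action.dtype)
        high = torch.as_tensor(action_space.high, dtype=action.dtype)
        return torch.max(torch.min(action, high), low)       # clip to the valid range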

act(state, reward)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters
  • state (all.environment.State) – The environment state at the current timestep.

  • reward (torch.Tensor) – The reward from the previous timestep.

Returns

The action to take at the current timestep.

Return type

torch.Tensor

eval(state, _)

Select an action for the current timestep in evaluation mode.

Unlike act, this method should NOT update the internal parameters of the agent. Most of the time, this method should return the greedy action according to the current policy. This method is useful when using evaluation methodologies that distinguish between the performance of the agent during training and the performance of the resulting policy.

Parameters
  • state (all.environment.State) – The environment state at the current timestep.

  • reward (torch.Tensor) – The reward from the previous timestep.

Returns

The action to take at the current timestep.

Return type

torch.Tensor

class all.agents.DDQN(q, policy, replay_buffer, discount_factor=0.99, loss=<function weighted_mse_loss>, minibatch_size=32, replay_start_size=5000, update_frequency=1)

Bases: all.agents._agent.Agent

Double Deep Q-Network (DDQN). DDQN is an enhancement to DQN that uses a “double Q-style” update, wherein the online network is used to select target actions and the target network is used to evaluate these actions. https://arxiv.org/abs/1509.06461 This agent also adds support for weighted replay buffers, such as prioritized experience replay (PER). https://arxiv.org/abs/1511.05952

Parameters
  • q (QNetwork) – An Approximation of the Q function.

  • policy (GreedyPolicy) – A policy derived from the Q-function.

  • replay_buffer (ReplayBuffer) – The experience replay buffer.

  • discount_factor (float) – Discount factor for future rewards.

  • loss (function) – The weighted loss function to use.

  • minibatch_size (int) – The number of experiences to sample in each training update.

  • replay_start_size (int) – Number of experiences in replay buffer when training begins.

  • update_frequency (int) – Number of timesteps per training update.
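
The defining difference from DQN is the target computation: the online network selects the next action and the target network evaluates it. A minimal sketch (not the library's code), assuming plain tensor-valued batches:

    import torch

    def ddqn_targets(rewards, next_states, dones, q_online, q_target, discount_factor=0.99):
        with torch.no_grad():
            next_actions = q_online(next_states).argmax(dim=1)            # online net selects
            next_values = q_target(next_states).gather(
                1, next_actions.unsqueeze(1)).squeeze(1)                  # target net evaluates
            return rewards + discount_factor * next_values * (1 - dones.float())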

act(state, reward)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters
  • state (all.environment.State) – The environment state at the current timestep.

  • reward (torch.Tensor) – The reward from the previous timestep.

Returns

The action to take at the current timestep.

Return type

torch.Tensor

eval(state, _)

Select an action for the current timestep in evaluation mode.

Unlike act, this method should NOT update the internal parameters of the agent. Most of the time, this method should return the greedy action according to the current policy. This method is useful when using evaluation methodologies that distinguish between the performance of the agent during training and the performance of the resulting policy.

Parameters
  • state (all.environment.State) – The environment state at the current timestep.

  • reward (torch.Tensor) – The reward from the previous timestep.

Returns

The action to take at the current timestep.

Return type

torch.Tensor

class all.agents.DQN(q, policy, replay_buffer, discount_factor=0.99, loss=torch.nn.functional.mse_loss, minibatch_size=32, replay_start_size=5000, update_frequency=1)

Bases: all.agents._agent.Agent

Deep Q-Network (DQN). DQN was one of the original deep reinforcement learning algorithms. It extends the ideas behind Q-learning to work well with modern convolutional networks. The core innovation is the use of a replay buffer, which allows the use of batch-style updates with decorrelated samples. It also uses a “target” network to improve the stability of updates. https://www.nature.com/articles/nature14236

Parameters
  • q (QNetwork) – An Approximation of the Q function.

  • policy (GreedyPolicy) – A policy derived from the Q-function.

  • replay_buffer (ReplayBuffer) – The experience replay buffer.

  • discount_factor (float) – Discount factor for future rewards.

  • exploration (float) – The probability of choosing a random action.

  • loss (function) – The weighted loss function to use.

  • minibatch_size (int) – The number of experiences to sample in each training update.

  • n_actions (int) – The number of available actions.

  • replay_start_size (int) – Number of experiences in replay buffer when training begins.

  • update_frequency (int) – Number of timesteps per training update.
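
Each update samples a minibatch from the replay buffer and regresses the online network toward a bootstrapped target computed with the frozen target network. A minimal sketch of the loss (assuming plain tensor-valued batches rather than the library's State objects):

    import torch
    import torch.nn.functional as F

    def dqn_loss(states, actions, rewards, next_states, dones, q_online, q_target, discount_factor=0.99):
        q_sa = q_online(states).gather(1, actions.unsqueeze(1)).squeeze(1)  # Q(s, a) of taken actions
        with torch.no_grad():
            next_values = q_target(next_states).max(dim=1).values
            targets = rewards + discount_factor * next_values * (1 - dones.float())
        return F.mse_loss(q_sa, targets)  # minimize the TD error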

act(state, reward)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters
  • state (all.environment.State) – The environment state at the current timestep.

  • reward (torch.Tensor) – The reward from the previous timestep.

Returns

The action to take at the current timestep.

Return type

torch.Tensor

eval(state, _)

Select an action for the current timestep in evaluation mode.

Unlike act, this method should NOT update the internal parameters of the agent. Most of the time, this method should return the greedy action according to the current policy. This method is useful when using evaluation methodologies that distinguish between the performance of the agent during training and the performance of the resulting policy.

Parameters
  • state (all.environment.State) – The environment state at the current timestep.

  • reward (torch.Tensor) – The reward from the previous timestep.

Returns

The action to take at the current timestep.

Return type

torch.Tensor

class all.agents.PPO(features, v, policy, discount_factor=0.99, entropy_loss_scaling=0.01, epochs=4, epsilon=0.2, lam=0.95, minibatches=4, n_envs=None, n_steps=4, writer=<all.logging.DummyWriter object>)

Bases: all.agents._agent.Agent

Proximal Policy Optimization (PPO). PPO is an actor-critic style policy gradient algorithm that allows for the reuse of samples by using importance weighting. This often increases sample efficiency compared to algorithms such as A2C. To avoid destructively large updates, PPO uses a special “clipped” objective that prevents the algorithm from changing the current policy too quickly.

Parameters
  • features (FeatureNetwork) – Shared feature layers.

  • v (VNetwork) – Value head which approximates the state-value function.

  • policy (StochasticPolicy) – Policy head which outputs an action distribution.

  • discount_factor (float) – Discount factor for future rewards.

  • entropy_loss_scaling (float) – Contribution of the entropy loss to the total policy loss.

  • epochs (int) – Number of times to reuse each sample.

  • lam (float) – The Generalized Advantage Estimate (GAE) decay parameter.

  • minibatches (int) – The number of minibatches to split each batch into.

  • n_envs (int) – Number of parallel actors/environments.

  • n_steps (int) – Number of timesteps per rollout. Updates are performed once per rollout.

  • writer (Writer) – Used for logging.
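
The clipped surrogate objective can be written in a few lines. The sketch below is illustrative only; it assumes per-sample log-probabilities under the new and old policies and precomputed advantage estimates.

    import torch

    def ppo_policy_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
        ratios = torch.exp(new_log_probs - old_log_probs)          # importance weights
        unclipped = ratios * advantages
        clipped = torch.clamp(ratios, 1 - epsilon, 1 + epsilon) * advantages
        return -torch.min(unclipped, clipped).mean()               # pessimistic (clipped) surrogate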

act(states, rewards)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters
  • state (all.environment.State) – The environment state at the current timestep.

  • reward (torch.Tensor) – The reward from the previous timestep.

Returns

The action to take at the current timestep.

Return type

torch.Tensor

eval(states, _)

Select an action for the current timestep in evaluation mode.

Unlike act, this method should NOT update the internal parameters of the agent. Most of the time, this method should return the greedy action according to the current policy. This method is useful when using evaluation methodologies that distinguish between the performance of the agent during training and the performance of the resulting policy.

Parameters
  • state (all.environment.State) – The environment state at the current timestep.

  • reward (torch.Tensor) – The reward from the previous timestep.

Returns

The action to take at the current timestep.

Return type

torch.Tensor

class all.agents.Rainbow(q_dist, replay_buffer, discount_factor=0.99, eps=1e-05, exploration=0.02, minibatch_size=32, replay_start_size=5000, update_frequency=1, writer=<all.logging.DummyWriter object>)

Bases: all.agents.c51.C51

Rainbow: Combining Improvements in Deep Reinforcement Learning. Rainbow combines C51 with five other “enhancements” to DQN: double Q-learning, dueling networks, noisy networks, prioritized replay, and n-step rollouts. https://arxiv.org/abs/1710.02298

Whether this agent is Rainbow or C51 depends on the objects that are passed into it. Dueling networks and noisy networks are part of the model used for q_dist, while prioritized replay and n-step rollouts are handled by the replay buffer. Double Q-learning is always used.

Parameters
  • q_dist (QDist) – Approximation of the Q distribution.

  • replay_buffer (ReplayBuffer) – The experience replay buffer.

  • discount_factor (float) – Discount factor for future rewards.

  • eps (float) – Stability parameter for computing the loss function.

  • exploration (float) – The probability of choosing a random action.

  • minibatch_size (int) – The number of experiences to sample in each training update.

  • replay_start_size (int) – Number of experiences in replay buffer when training begins.

  • update_frequency (int) – Number of timesteps per training update.
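
Because double Q-learning is always used, next actions are selected with the online distribution's expected values and evaluated with the target distribution. A conceptual sketch of that selection step (not the library's code), assuming probability tensors of shape (batch, n_actions, n_atoms):

    import torch

    def double_q_next_dist(online_probs, target_probs, atoms):
        expected_q = (online_probs * atoms).sum(dim=-1)   # (batch, n_actions) expected values
        next_actions = expected_q.argmax(dim=1)           # online network selects the action
        batch_idx = torch.arange(target_probs.shape[0])
        return target_probs[batch_idx, next_actions]      # target network's distribution for it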

class all.agents.SAC(policy, q_1, q_2, v, replay_buffer, discount_factor=0.99, entropy_target=-2.0, lr_temperature=0.0001, minibatch_size=32, replay_start_size=5000, temperature_initial=0.1, update_frequency=1, writer=<all.logging.DummyWriter object>)

Bases: all.agents._agent.Agent

Soft Actor-Critic (SAC). SAC is a proposed improvement to DDPG that replaces the standard mean-squared Bellman error (MSBE) objective with a “maximum entropy” objective that improves exploration. It also uses a few other tricks, such as the “Clipped Double-Q Learning” trick introduced by TD3. This implementation uses automatic temperature adjustment to replace the difficult-to-set temperature parameter with a more easily tuned entropy target parameter. https://arxiv.org/abs/1801.01290

Parameters
  • policy (DeterministicPolicy) – An Approximation of a deterministic policy.

  • q1 (QContinuous) – An Approximation of the continuous action Q-function.

  • q2 (QContinuous) – An Approximation of the continuous action Q-function.

  • v (VNetwork) – An Approximation of the state-value function.

  • replay_buffer (ReplayBuffer) – The experience replay buffer.

  • discount_factor (float) – Discount factor for future rewards.

  • entropy_target (float) – The desired entropy of the policy. Usually -env.action_space.shape[0]

  • minibatch_size (int) – The number of experiences to sample in each training update.

  • replay_start_size (int) – Number of experiences in replay buffer when training begins.

  • temperature_initial (float) – The initial temperature used in the maximum entropy objective.

  • update_frequency (int) – Number of timesteps per training update.
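
The critic target combines the clipped double-Q trick with the entropy bonus: the pessimistic minimum of the two Q-estimates, minus the temperature-weighted log-probability of the sampled action. A hedged sketch (not the library's code), assuming q_1 and q_2 map (states, actions) to value tensors:

    import torch

    def soft_value_target(q_1, q_2, states, actions, log_probs, temperature):
        with torch.no_grad():
            q_min = torch.min(q_1(states, actions), q_2(states, actions))  # clipped double-Q
            return q_min - temperature * log_probs                         # maximum entropy bonus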

act(state, reward)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters
  • state (all.environment.State) – The environment state at the current timestep.

  • reward (torch.Tensor) – The reward from the previous timestep.

Returns

The action to take at the current timestep.

Return type

torch.Tensor

eval(state, _)

Select an action for the current timestep in evaluation mode.

Unlike act, this method should NOT update the internal parameters of the agent. Most of the time, this method should return the greedy action according to the current policy. This method is useful when using evaluation methodologies that distinguish between the performance of the agent during training and the performance of the resulting policy.

Parameters
  • state (all.environment.State) – The environment state at the current timestep.

  • reward (torch.Tensor) – The reward from the previous timestep.

Returns

The action to take at the current timestep.

Return type

torch.Tensor

class all.agents.VAC(features, v, policy, discount_factor=1)

Bases: all.agents._agent.Agent

Vanilla Actor-Critic (VAC). VAC is an implementation of the actor-critic algorithm found in the Sutton and Barto (2018) textbook. This implementation tweaks the algorithm slightly by using a shared feature layer. It is also compatible with the use of parallel environments. https://papers.nips.cc/paper/1786-actor-critic-algorithms.pdf

Parameters
  • features (FeatureNetwork) – Shared feature layers.

  • v (VNetwork) – Value head which approximates the state-value function.

  • policy (StochasticPolicy) – Policy head which outputs an action distribution.

  • discount_factor (float) – Discount factor for future rewards.

  • n_envs (int) – Number of parallel actors/environments

  • n_steps (int) – Number of timesteps per rollout. Updates are performed once per rollout.

  • writer (Writer) – Used for logging.
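
A one-step version of the update can be sketched as follows; this is illustrative only and uses plain tensors rather than the library's State objects. The one-step TD error serves both as the critic's regression error and as the actor's advantage estimate.

    import torch

    def vac_losses(reward, value, next_value, log_prob, discount_factor=1.0):
        td_error = reward + discount_factor * next_value.detach() - value
        value_loss = td_error.pow(2).mean()
        policy_loss = -(td_error.detach() * log_prob).mean()
        return value_loss, policy_loss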

act(state, reward)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters
  • state (all.environment.State) – The environment state at the current timestep.

  • reward (torch.Tensor) – The reward from the previous timestep.

Returns

The action to take at the current timestep.

Return type

torch.Tensor

eval(state, _)

Select an action for the current timestep in evaluation mode.

Unlike act, this method should NOT update the internal parameters of the agent. Most of the time, this method should return the greedy action according to the current policy. This method is useful when using evaluation methodologies that distinguish between the performance of the agent during training and the performance of the resulting policy.

Parameters
  • state (all.environment.State) – The environment state at the current timestep.

  • reward (torch.Tensor) – The reward from the previous timestep.

Returns

The action to take at the current timestep.

Return type

torch.Tensor

class all.agents.VPG(features, v, policy, discount_factor=0.99, min_batch_size=1)

Bases: all.agents._agent.Agent

Vanilla Policy Gradient (VPG/REINFORCE). VPG (also known as REINFORCE) is the least biased implementation of the policy gradient theorem. It uses complete episode rollouts as unbiased estimates of the Q-function, rather than the n-step rollouts found in most actor-critic algorithms. The state-value function approximation reduces variance but does not introduce any bias. This implementation introduces two tweaks. First, it uses a shared feature layer. Second, it adds the capacity to train on multiple episodes at once. These enhancements often improve learning without sacrificing the essential character of the algorithm. https://link.springer.com/article/10.1007/BF00992696

Parameters
  • features (FeatureNetwork) – Shared feature layers.

  • v (VNetwork) – Value head which approximates the state-value function.

  • policy (StochasticPolicy) – Policy head which outputs an action distribution.

  • discount_factor (float) – Discount factor for future rewards.

  • min_batch_size (int) – Updates occur when an episode ends after at least this many state-action pairs have been seen. Set this to a large value in order to train on multiple episodes at once.
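
Because it trains on complete episodes, the update computes Monte Carlo returns-to-go before forming the losses. A hedged sketch for a single episode (not the library's code), assuming rewards, values, and log_probs are tensors of shape (episode_length,):

    import torch

    def vpg_losses(rewards, values, log_probs, discount_factor=0.99):
        returns = torch.zeros_like(rewards)
        ret = 0.0
        for t in reversed(range(len(rewards))):
            ret = rewards[t] + discount_factor * ret    # discounted return-to-go G_t
            returns[t] = ret
        advantages = returns - values.detach()          # baseline reduces variance, adds no bias
        value_loss = (returns - values).pow(2).mean()
        policy_loss = -(advantages * log_probs).mean()  # REINFORCE with a learned baseline
        return value_loss, policy_loss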

act(state, reward)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters
  • state (all.environment.State) – The environment state at the current timestep.

  • reward (torch.Tensor) – The reward from the previous timestep.

Returns

The action to take at the current timestep.

Return type

torch.Tensor

eval(state, _)

Select an action for the current timestep in evaluation mode.

Unlike act, this method should NOT update the internal parameters of the agent. Most of the time, this method should return the greedy action according to the current policy. This method is useful when using evaluation methodologies that distinguish between the performance of the agent during training and the performance of the resulting policy.

Parameters
  • state (all.environment.State) – The environment state at the current timestep.

  • reward (torch.Tensor) – The reward from the previous timestep.

Returns

The action to take at the current timestep.

Return type

torch.Tensor

class all.agents.VQN(q, policy, discount_factor=0.99)

Bases: all.agents._agent.Agent

Vanilla Q-Network (VQN). VQN is an implementation of the Q-learning algorithm found in the Sutton and Barto (2018) textbook. Q-learning algorithms attempt to learn the optimal policy while executing a (generally) suboptimal policy (typically epsilon-greedy). In theory, this allows the agent to gain the benefits of exploration without sacrificing the performance of the final policy. However, the cost is that Q-learning is generally less stable than its on-policy counterpart, SARSA. http://www.cs.rhul.ac.uk/~chrisw/new_thesis.pdf

Parameters
  • q (QNetwork) – An Approximation of the Q function.

  • policy (GreedyPolicy) – A policy derived from the Q-function.

  • discount_factor (float) – Discount factor for future rewards.
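
The Q-learning target bootstraps from the greedy value of the next state, regardless of which action the epsilon-greedy policy actually takes next. A conceptual sketch for a single transition, assuming q maps a state to a vector of action values:

    import torch
    import torch.nn.functional as F

    def q_learning_loss(q, state, action, reward, next_state, discount_factor=0.99):
        with torch.no_grad():
            target = reward + discount_factor * q(next_state).max()  # greedy (off-policy) bootstrap
        return F.mse_loss(q(state)[action], target)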

act(state, reward)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters
  • state (all.environment.State) – The environment state at the current timestep.

  • reward (torch.Tensor) – The reward from the previous timestep.

Returns

The action to take at the current timestep.

Return type

torch.Tensor

eval(state, _)

Select an action for the current timestep in evaluation mode.

Unlike act, this method should NOT update the internal parameters of the agent. Most of the time, this method should return the greedy action according to the current policy. This method is useful when using evaluation methodologies that distinguish between the performance of the agent during training and the performance of the resulting policy.

Parameters
  • state (all.environment.State) – The environment state at the current timestep.

  • reward (torch.Tensor) – The reward from the previous timestep.

Returns

The action to take at the current timestep.

Return type

torch.Tensor

class all.agents.VSarsa(q, policy, discount_factor=0.99)

Bases: all.agents._agent.Agent

Vanilla SARSA (VSarsa). SARSA (State-Action-Reward-State-Action) is an on-policy alternative to Q-learning. Unlike Q-learning, SARSA attempts to learn the Q-function for the current policy rather than the optimal policy. This approach is more stable but may not result in the optimal policy. However, this problem can be mitigated by decaying the exploration rate over time.

Parameters
  • q (QNetwork) – An Approximation of the Q function.

  • policy (GreedyPolicy) – A policy derived from the Q-function.

  • discount_factor (float) – Discount factor for future rewards.
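
In contrast to Q-learning, the SARSA target uses the value of the action the policy actually selects in the next state. A conceptual sketch for a single transition, paralleling the VQN example above:

    import torch
    import torch.nn.functional as F

    def sarsa_loss(q, state, action, reward, next_state, next_action, discount_factor=0.99):
        with torch.no_grad():
            target = reward + discount_factor * q(next_state)[next_action]  # on-policy bootstrap
        return F.mse_loss(q(state)[action], target)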

act(state, reward)

Select an action for the current timestep and update internal parameters.

In general, a reinforcement learning agent does several things during a timestep: (1) choose an action, (2) compute the TD error from the previous timestep, and (3) update the value function and/or policy. The order of these steps differs depending on the agent. This method allows the agent to do whatever is necessary for itself on a given timestep. However, the agent must ultimately return an action.

Parameters
  • state (all.environment.State) – The environment state at the current timestep.

  • reward (torch.Tensor) – The reward from the previous timestep.

Returns

The action to take at the current timestep.

Return type

torch.Tensor

eval(state, _)

Select an action for the current timestep in evaluation mode.

Unlike act, this method should NOT update the internal parameters of the agent. Most of the time, this method should return the greedy action according to the current policy. This method is useful when using evaluation methodologies that distinguish between the performance of the agent during training and the performance of the resulting policy.

Parameters
  • state (all.environment.State) – The environment state at the current timestep.

  • reward (torch.Tensor) – The reward from the previous timestep.

Returns

The action to take at the current timestep.

Return type

torch.Tensor