all.presets.continuous

ddpg([device, discount_factor, last_frame, …]) – DDPG continuous control preset.

ppo([device, discount_factor, last_frame, …]) – PPO continuous control preset.

sac([device, discount_factor, last_frame, …]) – SAC continuous control preset.

all.presets.continuous.ddpg(device='cuda', discount_factor=0.98, last_frame=2000000.0, lr_q=0.001, lr_pi=0.001, minibatch_size=100, update_frequency=1, polyak_rate=0.005, replay_start_size=5000, replay_buffer_size=1000000.0, noise=0.1)

DDPG continuous control preset.

Parameters
  • device (str) – The device to load parameters and buffers onto for this agent.

  • discount_factor (float) – Discount factor for future rewards.

  • last_frame (int) – Number of frames to train.

  • lr_q (float) – Learning rate for the Q network.

  • lr_pi (float) – Learning rate for the policy network.

  • minibatch_size (int) – Number of experiences to sample in each training update.

  • update_frequency (int) – Number of timesteps per training update.

  • polyak_rate (float) – Speed with which to update the target network towards the online network.

  • replay_start_size (int) – Number of experiences in replay buffer when training begins.

  • replay_buffer_size (int) – Maximum number of experiences to store in the replay buffer.

  • noise (float) – The amount of exploration noise to add.
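
Usage sketch (not part of the generated reference): the preset returns an agent constructor that the library's experiment utilities can run against a continuous-control environment. The run_experiment and GymEnvironment helpers below are assumptions about the surrounding API and may differ between library versions.

# Hypothetical usage; run_experiment, GymEnvironment, and the environment id
# are assumptions, not taken from this page.
from all.environments import GymEnvironment
from all.experiments import run_experiment
from all.presets.continuous import ddpg

device = 'cuda'
# Construct the agent builder with a few overridden hyperparameters.
agent = ddpg(device=device, lr_q=1e-3, lr_pi=1e-3, noise=0.1)
env = GymEnvironment('LunarLanderContinuous-v2', device=device)
run_experiment([agent], [env], 2e6)  # 2e6 frames matches the default last_frame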

all.presets.continuous.ppo(device='cuda', discount_factor=0.98, last_frame=2000000.0, lr=0.0003, eps=1e-05, entropy_loss_scaling=0.01, value_loss_scaling=0.5, clip_grad=0.5, clip_initial=0.2, clip_final=0.01, epochs=20, minibatches=4, n_envs=32, n_steps=128, lam=0.95)

PPO continuous control preset.

Parameters
  • device (str) – The device to load parameters and buffers onto for this agent.

  • discount_factor (float) – Discount factor for future rewards.

  • last_frame (int) – Number of frames to train.

  • lr (float) – Learning rate for the Adam optimizer.

  • eps (float) – Stability parameter for the Adam optimizer.

  • entropy_loss_scaling (float) – Coefficient for the entropy term in the total loss.

  • value_loss_scaling (float) – Coefficient for the value function loss.

  • clip_grad (float) – The maximum magnitude of the gradient for any given parameter. Set to 0 to disable.

  • clip_initial (float) – Value for epsilon in the clipped PPO objective function at the beginning of training.

  • clip_final (float) – Value for epsilon in the clipped PPO objective function at the end of training.

  • epochs (int) – Number of times to iterate through each batch.

  • minibatches (int) – The number of minibatches to split each batch into.

  • n_envs (int) – Number of parallel actors.

  • n_steps (int) – Length of each rollout.

  • lam (float) – The Generalized Advantage Estimate (GAE) decay parameter.
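
Usage sketch (same assumptions as above): PPO is an on-policy, parallel-actor preset, so each update collects n_envs * n_steps transitions (32 * 128 = 4096 with the defaults), splits them into `minibatches` minibatches, and iterates over them for `epochs` epochs. The experiment helpers named below are assumed and may differ between library versions.

# Hypothetical usage; run_experiment and GymEnvironment are assumed helper names.
from all.environments import GymEnvironment
from all.experiments import run_experiment
from all.presets.continuous import ppo

device = 'cuda'
# 32 parallel actors, 128 steps each -> 4096 transitions per update,
# split into 4 minibatches and replayed for 20 epochs.
agent = ppo(device=device, n_envs=32, n_steps=128, minibatches=4, epochs=20)
env = GymEnvironment('LunarLanderContinuous-v2', device=device)
run_experiment([agent], [env], 2e6)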

all.presets.continuous.sac(device='cuda', discount_factor=0.98, last_frame=2000000.0, lr_q=0.001, lr_v=0.001, lr_pi=0.0001, minibatch_size=100, update_frequency=2, polyak_rate=0.005, replay_start_size=5000, replay_buffer_size=1000000.0, temperature_initial=0.1, lr_temperature=1e-05, entropy_target_scaling=1.0)

SAC continuous control preset.

Parameters
  • device (str) – The device to load parameters and buffers onto for this agent.

  • discount_factor (float) – Discount factor for future rewards.

  • last_frame (int) – Number of frames to train.

  • lr_q (float) – Learning rate for the Q networks.

  • lr_v (float) – Learning rate for the state-value networks.

  • lr_pi (float) – Learning rate for the policy network.

  • minibatch_size (int) – Number of experiences to sample in each training update.

  • update_frequency (int) – Number of timesteps per training update.

  • polyak_rate (float) – Speed with which to update the target network towards the online network.

  • replay_start_size (int) – Number of experiences in replay buffer when training begins.

  • replay_buffer_size (int) – Maximum number of experiences to store in the replay buffer.

  • temperature_initial (float) – Initial value of the temperature parameter.

  • lr_temperature (float) – Learning rate for the temperature. Should be low compared to other learning rates.

  • entropy_target_scaling (float) – The target entropy will be set to -(entropy_target_scaling * env.action_space.shape[0]).
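
For reference, a minimal sketch of the target-entropy formula documented above, evaluated for a concrete action space (the environment id is only an example, not taken from this page):

# Sketch of the documented target-entropy formula; the environment is illustrative.
import gym

env = gym.make('LunarLanderContinuous-v2')
entropy_target_scaling = 1.0
entropy_target = -(entropy_target_scaling * env.action_space.shape[0])
print(entropy_target)  # -2.0: this action space has 2 dimensions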