all.presets.continuous

ddpg([device, discount_factor, last_frame, …]) – DDPG continuous control preset.

ppo([device, discount_factor, last_frame, …]) – PPO continuous control preset.

sac([device, discount_factor, last_frame, …]) – SAC continuous control preset.

all.presets.continuous.ddpg(device='cuda', discount_factor=0.98, last_frame=2000000.0, lr_q=0.001, lr_pi=0.001, minibatch_size=100, update_frequency=1, polyak_rate=0.005, replay_start_size=5000, replay_buffer_size=1000000.0, noise=0.1, q_model_constructor=<function fc_q>, policy_model_constructor=<function fc_deterministic_policy>)

DDPG continuous control preset.

Parameters
  • device (str) – The device to load parameters and buffers onto for this agent.

  • discount_factor (float) – Discount factor for future rewards.

  • last_frame (int) – Number of frames to train.

  • lr_q (float) – Learning rate for the Q network.

  • lr_pi (float) – Learning rate for the policy network.

  • minibatch_size (int) – Number of experiences to sample in each training update.

  • update_frequency (int) – Number of timesteps per training update.

  • polyak_rate (float) – Speed with which to update the target network towards the online network.

  • replay_start_size (int) – Number of experiences in replay buffer when training begins.

  • replay_buffer_size (int) – Maximum number of experiences to store in the replay buffer.

  • noise (float) – The amount of exploration noise to add.

  • q_model_constructor (function) – The function used to construct the neural q model.

  • policy_model_constructor (function) – The function used to construct the neural policy model.
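
Example usage (a minimal sketch): the GymEnvironment and run_experiment helpers and the 'LunarLanderContinuous-v2' environment id below are assumptions for illustration, not part of this preset's signature.

    from all.environments import GymEnvironment   # assumed helper, for illustration
    from all.experiments import run_experiment    # assumed helper, for illustration
    from all.presets.continuous import ddpg

    # Build the agent factory with the defaults documented above,
    # overriding only the device.
    agent = ddpg(device='cuda')

    # Train on a continuous control task (environment id is illustrative).
    env = GymEnvironment('LunarLanderContinuous-v2', device='cuda')
    run_experiment([agent], [env], frames=2e6)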

all.presets.continuous.ppo(device='cuda', discount_factor=0.98, last_frame=2000000.0, lr=0.0003, eps=1e-05, entropy_loss_scaling=0.01, value_loss_scaling=0.5, clip_grad=0.5, clip_initial=0.2, clip_final=0.01, epochs=20, minibatches=4, n_envs=32, n_steps=128, lam=0.95, ac_model_constructor=<function fc_actor_critic>)

PPO continuous control preset.

Parameters
  • device (str) – The device to load parameters and buffers onto for this agent.

  • discount_factor (float) – Discount factor for future rewards.

  • last_frame (int) – Number of frames to train.

  • lr (float) – Learning rate for the Adam optimizer.

  • eps (float) – Stability parameter for the Adam optimizer.

  • entropy_loss_scaling (float) – Coefficient for the entropy term in the total loss.

  • value_loss_scaling (float) – Coefficient for the value function loss.

  • clip_grad (float) – The maximum magnitude of the gradient for any given parameter. Set to 0 to disable.

  • clip_initial (float) – Value for epsilon in the clipped PPO objective function at the beginning of training.

  • clip_final (float) – Value for epsilon in the clipped PPO objective function at the end of training.

  • epochs (int) – Number of times to iterate through each batch.

  • minibatches (int) – The number of minibatches to split each batch into.

  • n_envs (int) – Number of parallel actors.

  • n_steps (int) – Length of each rollout.

  • lam (float) – The Generalized Advantage Estimate (GAE) decay parameter.

  • ac_model_constructor (function) – The function used to construct the neural feature, value, and policy model.
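
Any of the arguments above can be overridden by keyword; a brief sketch (the specific values are illustrative, not tuned recommendations):

    from all.presets.continuous import ppo

    # Arguments not passed keep the defaults documented above.
    make_agent = ppo(
        device='cpu',
        lr=1e-4,       # smaller Adam learning rate
        n_envs=16,     # fewer parallel actors
        n_steps=256,   # longer rollouts
    )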

all.presets.continuous.sac(device='cuda', discount_factor=0.98, last_frame=2000000.0, lr_q=0.001, lr_v=0.001, lr_pi=0.0001, minibatch_size=100, update_frequency=2, polyak_rate=0.005, replay_start_size=5000, replay_buffer_size=1000000.0, temperature_initial=0.1, lr_temperature=1e-05, entropy_target_scaling=1.0, q1_model_constructor=<function fc_q>, q2_model_constructor=<function fc_q>, v_model_constructor=<function fc_v>, policy_model_constructor=<function fc_soft_policy>)

SAC continuous control preset.

Parameters
  • device (str) – The device to load parameters and buffers onto for this agent.

  • discount_factor (float) – Discount factor for future rewards.

  • last_frame (int) – Number of frames to train.

  • lr_q (float) – Learning rate for the Q networks.

  • lr_v (float) – Learning rate for the state-value networks.

  • lr_pi (float) – Learning rate for the policy network.

  • minibatch_size (int) – Number of experiences to sample in each training update.

  • update_frequency (int) – Number of timesteps per training update.

  • polyak_rate (float) – Speed with which to update the target network towards the online network.

  • replay_start_size (int) – Number of experiences in replay buffer when training begins.

  • replay_buffer_size (int) – Maximum number of experiences to store in the replay buffer.

  • temperature_initial (float) – Initial value of the temperature parameter.

  • lr_temperature (float) – Learning rate for the temperature. Should be low compared to other learning rates.

  • entropy_target_scaling (float) – The target entropy is set to -(entropy_target_scaling * env.action_space.shape[0]).

  • q1_model_constructor (function) – The function used to construct the neural q1 model.

  • q2_model_constructor (function) – The function used to construct the neural q2 model.

  • v_model_constructor (function) – The function used to construct the neural v model.

  • policy_model_constructor (function) – The function used to construct the neural policy model.
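
As with the other presets, keyword arguments override the defaults above; a brief sketch (the values are illustrative, not tuned recommendations):

    from all.presets.continuous import sac

    # Adjust the entropy temperature schedule; other arguments keep the defaults above.
    make_agent = sac(
        device='cuda',
        temperature_initial=0.05,   # start with weaker entropy regularization
        lr_temperature=1e-5,        # keep low relative to the other learning rates
    )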