Reinforcement Learning Workflow

Reinforcement Learning (RL) is one of the primary focus areas of the Space Robotics Bench. While several RL frameworks exist, each with its own conventions, SRB offers a unified interface for training and evaluating policies across a diverse set of space robotics tasks.

1. Train your 1st RL Agent

Reference: srb agent train — Train Agent

The fastest way to get started with training an RL agent is by using the srb agent train command, which provides a streamlined interface for all integrated RL frameworks. In general, you want to specify the RL algorithm to use, the environment to train on, and the number of parallel environment instances used for rollout collection.

Let's start with a simple landing environment using the sbx_ppo algorithm (the PPO implementation from SBX). For now, omit the --headless flag so that you can observe the convergence in real time:

srb agent train --algo sbx_ppo --env landing env.num_envs=512 --hide_ui

While you watch the simulation, you can also monitor the training progress in your terminal. After about 25M timesteps, you should see that the agent has found a stable policy that successfully solves the task. Checkpoints are saved regularly, so you are free to stop the training at any point by sending an interrupt signal (Ctrl+C in most terminals).
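
If you want to inspect the saved checkpoints of this run before evaluating, you can list them directly (a sketch assuming the default log layout used with --model in the next step):

ls ./space_robotics_bench/logs/landing/sbx_ppo/ckpt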

2. Evaluate your Agent

Reference: srb agent eval — Evaluate Agent

Once training is complete, you can evaluate your agent with the srb agent eval command:

srb agent eval --algo sbx_ppo --env landing env.num_envs=16

By default, the latest checkpoint from the training run is loaded for evaluation. However, you can also evaluate a specific checkpoint by passing its path via --model:

srb agent eval --algo sbx_ppo --env landing env.num_envs=16 --model space_robotics_bench/logs/landing/sbx_ppo/ckpt/${CHECKPOINT}

3. Try a Different Algorithm

SRB directly supports several popular RL algorithms from different frameworks:

| Algorithm Type  | DreamerV3 | Stable-Baselines3 | SBX        | skrl         |
|-----------------|-----------|-------------------|------------|--------------|
| Model-based     | dreamer   |                   |            |              |
| On-Policy       |           | sb3_a2c           |            | skrl_a2c     |
|                 |           | sb3_ppo           | sbx_ppo    | skrl_ppo     |
|                 |           | sb3_ppo_lstm      |            | skrl_ppo_rnn |
|                 |           |                   |            | skrl_rpo     |
|                 |           | sb3_trpo          |            | skrl_trpo    |
| Off-Policy      |           | sb3_ddpg          | sbx_ddpg   | skrl_ddpg    |
|                 |           | sb3_td3           | sbx_td3    | skrl_td3     |
|                 |           | sb3_sac           | sbx_sac    | skrl_sac     |
|                 |           | sb3_crossq        | sbx_crossq |              |
|                 |           | sb3_tqc           | sbx_tqc    |              |
| Evolutionary    |           | sb3_ars           |            |              |
|                 |           |                   |            | skrl_cem     |
| Imitation-based |           |                   |            | skrl_amp     |

This time, you can train another agent using an algorithm of your choice:

srb agent train --headless --algo <ALGO> --env landing env.num_envs=1024

Hint: Use --headless mode with more parallel environments for faster convergence.

4. Monitor Training Progress

While training, you might want to monitor the progress and compare different runs through a visual interface. By default, TensorBoard logs are saved for all algorithms and environments in the space_robotics_bench/logs directory. You can start TensorBoard to visualize the training progress:

tensorboard --logdir ./space_robotics_bench/logs --bind_all

Furthermore, you can enable Weights & Biases (wandb) logging by passing framework-specific flags [subject to future standardization]:

  • DreamerV3: srb agent train ... +agent.logger.outputs=wandb
  • SB3 & SBX: srb agent train ... agent.track=true
  • skrl: srb agent train ... +agent.experiment.wandb=true
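
For example, the SB3 flag can be appended to the earlier training command (a sketch combining options already shown above; the other frameworks use their respective flags from this list):

srb agent train --headless --algo sb3_ppo --env landing env.num_envs=1024 agent.track=true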

Note: Logging to Weights & Biases requires an account and API key.

5. Configure Hyperparameters

Reference: Agent Configuration

The default hyperparameters for all algorithms and environments are available under the space_robotics_bench/hyperparams directory. Similar to the environment configuration, you can adjust the hyperparameters of the selected RL algorithm through Hydra. However, the available hyperparameters and their structure are specific to each framework and algorithm.

Here are some examples (consult hyperparameter configs for more details):

srb agent train --algo dreamer      agent.run.train_ratio=128  ...
srb agent train --algo sb3_ppo      agent.gamma=0.99           ...
srb agent train --algo sbx_sac      agent.learning_rate=0.0002 ...
srb agent train --algo skrl_ppo_rnn agent.models.separate=True ...
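
Environment and agent overrides can also be combined in a single command. For instance, a sketch that reuses values already shown above (the exact agent.* keys depend on the chosen framework):

srb agent train --headless --algo sb3_ppo --env landing env.num_envs=1024 agent.gamma=0.99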