Performance Tuning Guide

Ray PPO

To achieve optimal performance, we recommend allocating nodes in the ratio vLLM:Actor:Critic = 1:1:1.

  • For example, for a 70B model with 48 A100 GPUs, allocate 16 A100 GPUs to the vLLM engine, 16 GPUs to the Actor model, and the remaining 16 GPUs to the Critic model (see the launch sketch after this list).

  • Enable asynchronous training with --async_train when the convergence of the RL algorithm still meets your requirements.

  • Use the hybrid engine (--colocate_all_models with --vllm_enable_sleep and --deepspeed_enable_sleep) rather than distributed RLHF when there is enough GPU memory (see the second sketch after this list).

  • Enable the --colocate_critic_reward and --colocate_actor_ref options to merge nodes.

  • Increase the rollout_micro_batch_size (and minimize the TP size of the vLLM engine) as much as possible. During the training phase, a larger --micro_train_batch_size is better; also enable --packing_samples.

  • When there is enough GPU memory, disable --adam_offload and enable --overlap_comm. Also enable --deepcompile to speed up training.

  • For vLLM, please use --vllm_sync_backend nccl with vLLM 0.8.5.post1+.

  • Enable enable_prefix_caching in vLLM generation when n_samples_per_prompts > 1.

  • For a large base model, if an OOM occurs, do not use any --colocate_xxxx options.
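
As a rough illustration of the 1:1:1 split above (48 A100 GPUs for a 70B model: 16 GPUs for the vLLM engine, 16 for the Actor, and 16 for the Critic), a launch command might look like the sketch below. The flag names follow the openrlhf.cli.train_ppo_ray command-line examples, but the model paths, batch sizes, and Ray address are placeholders; verify everything against your installed OpenRLHF version.

    # Sketch only: 4 vLLM engines x TP 4 = 16 GPUs, Actor 2x8 = 16 GPUs, Critic 2x8 = 16 GPUs.
    # The Reference and Reward models are merged onto the Actor/Critic nodes via the colocate flags.
    ray job submit --address="http://127.0.0.1:8265" \
       --runtime-env-json='{"working_dir": "/openrlhf"}' \
       -- python3 -m openrlhf.cli.train_ppo_ray \
       --vllm_num_engines 4 \
       --vllm_tensor_parallel_size 4 \
       --actor_num_nodes 2 \
       --actor_num_gpus_per_node 8 \
       --ref_num_nodes 2 \
       --ref_num_gpus_per_node 8 \
       --colocate_actor_ref \
       --critic_num_nodes 2 \
       --critic_num_gpus_per_node 8 \
       --reward_num_nodes 2 \
       --reward_num_gpus_per_node 8 \
       --colocate_critic_reward \
       --pretrain <path_to_70B_actor_model> \
       --reward_pretrain <path_to_reward_model> \
       --micro_train_batch_size 4 \
       --micro_rollout_batch_size 16 \
       --packing_samples \
       --vllm_sync_backend nccl \
       --bf16 \
       --flash_attn \
       --gradient_checkpointing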
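
When GPU memory is sufficient, the hybrid engine replaces the per-role allocation above: all models are colocated on the same GPUs, and vLLM/DeepSpeed are put to sleep when idle. A minimal sketch under the same caveats (flag names taken from the OpenRLHF command-line examples, placeholders for paths and sizes):

    # Sketch only: everything is colocated on a single 8-GPU node.
    # --adam_offload is intentionally omitted (i.e. disabled) because memory is sufficient.
    ray job submit --address="http://127.0.0.1:8265" \
       --runtime-env-json='{"working_dir": "/openrlhf"}' \
       -- python3 -m openrlhf.cli.train_ppo_ray \
       --colocate_all_models \
       --vllm_enable_sleep \
       --deepspeed_enable_sleep \
       --vllm_num_engines 8 \
       --vllm_tensor_parallel_size 1 \
       --vllm_gpu_memory_utilization 0.5 \
       --actor_num_nodes 1 \
       --actor_num_gpus_per_node 8 \
       --ref_num_nodes 1 \
       --ref_num_gpus_per_node 8 \
       --critic_num_nodes 1 \
       --critic_num_gpus_per_node 8 \
       --reward_num_nodes 1 \
       --reward_num_gpus_per_node 8 \
       --pretrain <path_to_actor_model> \
       --reward_pretrain <path_to_reward_model> \
       --micro_train_batch_size 8 \
       --micro_rollout_batch_size 32 \
       --packing_samples \
       --overlap_comm \
       --deepcompile \
       --vllm_sync_backend nccl \
       --enable_prefix_caching \
       --bf16 \
       --flash_attn \
       --gradient_checkpointing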

SFT/RM/DPO/PPO training

  • Enable --packing_samples in the training scripts (see the sketch below).
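
For example, a minimal SFT launch with sample packing enabled might look like the sketch below; the model path, dataset, keys, and batch sizes are placeholders, and the flag names follow the openrlhf.cli.train_sft command-line examples.

    # Sketch only: --packing_samples packs multiple short samples into one sequence
    # so that each micro batch is fully utilized.
    deepspeed --module openrlhf.cli.train_sft \
       --pretrain <path_to_base_model> \
       --dataset <path_to_sft_dataset> \
       --input_key question \
       --output_key response \
       --max_len 4096 \
       --train_batch_size 256 \
       --micro_train_batch_size 2 \
       --max_epochs 1 \
       --zero_stage 2 \
       --learning_rate 5e-6 \
       --packing_samples \
       --bf16 \
       --flash_attn \
       --gradient_checkpointing \
       --save_path ./checkpoints/sft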

Batch Inference