Performance Tuning
This guide collects the knobs that matter most for throughput and memory. For hybrid-engine specifics see Hybrid Engine; for error triage see Troubleshooting.
Rule of thumb: start from a known-good recipe (see examples/scripts) and adjust one knob
at a time.
Choosing a deployment mode
Pick based on what you’re optimizing for:
- Max throughput → async training (--train.async_enable). Rollout and training overlap through a bounded queue; tune the degree of asynchrony via --train.async_queue_size (start at 1 and raise only if rollout bottlenecks training). Add --train.partial_rollout_enable to overlap weight sync with generation. See Async Training & Partial Rollout.
- Max stability → Hybrid Engine (--train.colocate_all --vllm.enable_sleep --ds.enable_sleep). Fully on-policy, excellent GPU utilization through role-swapping on shared GPUs. The safe default for convergence-sensitive workloads. See Hybrid Engine.
- Distributed mode (separate GPU groups for vLLM / Actor / Critic) is a fallback for very large models or mixed-hardware clusters where colocation isn’t viable. Size each group to its own memory and compute needs.
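The choice above can be sketched as a small lookup. This is only an illustration: the helper name flags_for_goal and the goal labels are hypothetical, and the sketch assumes --train.async_queue_size takes its value as the next argument. The flag names themselves are the ones listed in this section.

```python
from __future__ import annotations


def flags_for_goal(goal: str, partial_rollout: bool = False,
                   async_queue_size: int = 1) -> list[str]:
    """Return the CLI flags this guide associates with an optimization goal.

    Hypothetical helper for illustration; it is not part of the training CLI.
    """
    if goal == "throughput":
        # Async training: rollout and training overlap through a bounded queue.
        flags = ["--train.async_enable",
                 "--train.async_queue_size", str(async_queue_size)]
        if partial_rollout:
            # Optional: overlap weight sync with generation.
            flags.append("--train.partial_rollout_enable")
        return flags
    if goal == "stability":
        # Hybrid Engine: all roles colocated, with sleep enabled on both sides.
        return ["--train.colocate_all", "--vllm.enable_sleep", "--ds.enable_sleep"]
    if goal == "distributed":
        # Separate GPU groups: no colocation flags at all.
        return []
    raise ValueError(f"unknown goal: {goal!r}")


print(" ".join(flags_for_goal("throughput", partial_rollout=True)))
```

Keeping the queue size at 1 by default mirrors the advice to start there and raise it only if rollout bottlenecks training.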
Speed knobs
| Knob | Flag | When / why |
|---|---|---|
| Sample packing | --ds.packing_samples | Always on: removes padding, large training speedup. |
| NCCL weight sync | | Always on for multi-GPU: faster than the default. |
| Dynamic batch | | Variable sequence lengths: better utilization than fixed micro-batches. |
| Async training | --train.async_enable | Fastest path. Tune degree of asynchrony with --train.async_queue_size. |
| Partial rollout | --train.partial_rollout_enable | With async training: overlaps weight sync with generation. |
| Hybrid Engine | --train.colocate_all --vllm.enable_sleep --ds.enable_sleep | Most stable high-throughput option. Best choice when off-policy noise is a concern. |
| Overlap comm | --ds.overlap_comm | Enough GPU memory: overlap backward and gradient reduce. |
| DeepCompile | | PyTorch 2.0+: DeepSpeed graph compilation. |
| Prefix caching | | |
Memory management
When memory is plentiful:
- Disable --ds.adam_offload; enable --ds.overlap_comm.
- Use --train.colocate_all (Hybrid Engine), or at least --train.colocate_critic_reward + --train.colocate_actor_ref.
When hitting OOM (priority order):
1. Enable --ds.packing_samples and --actor.gradient_checkpointing_enable.
2. Reduce --train.micro_batch_size / --rollout.micro_batch_size.
3. Lower --vllm.gpu_memory_utilization (e.g., 0.6 → 0.5 → 0.4).
4. Enable --ds.adam_offload; raise --ds.zero_stage (2 → 3).
5. Disable colocation (remove --train.colocate_*) and move to distributed mode.
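The priority order above can be treated as a ladder that you climb one rung at a time, re-running after each change. The sketch below encodes it as data; OOM_LADDER and next_oom_step are hypothetical names for illustration, and the flag names are copied from the list.

```python
from __future__ import annotations

# Each rung: (what to do, flags involved), in the priority order given above.
OOM_LADDER: list[tuple[str, list[str]]] = [
    ("enable sample packing and gradient checkpointing",
     ["--ds.packing_samples", "--actor.gradient_checkpointing_enable"]),
    ("reduce micro-batch sizes",
     ["--train.micro_batch_size", "--rollout.micro_batch_size"]),
    ("lower vLLM GPU memory utilization (e.g., 0.6 -> 0.5 -> 0.4)",
     ["--vllm.gpu_memory_utilization"]),
    ("enable Adam offload and raise the ZeRO stage (2 -> 3)",
     ["--ds.adam_offload", "--ds.zero_stage"]),
    ("remove colocation flags and move to distributed mode",
     ["--train.colocate_*"]),
]


def next_oom_step(steps_tried: int) -> tuple[str, list[str]] | None:
    """Return the next rung to try, or None once the ladder is exhausted."""
    if steps_tried < 0:
        raise ValueError("steps_tried must be non-negative")
    if steps_tried >= len(OOM_LADDER):
        return None
    return OOM_LADDER[steps_tried]


action, flags = next_oom_step(0)
print(f"try first: {action} ({', '.join(flags)})")
```

Applying one rung per run keeps the change attributable, in line with the one-knob-at-a-time rule of thumb.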
Note
--ds.adam_offload is incompatible with --actor.optim muon / --critic.optim muon
— DeepSpeed’s Muon keeps optimizer state on GPU. Use Adam or disable Muon if you need
adam-offload for memory.
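A pre-launch check for this incompatibility might look like the sketch below. The function name and its arguments are hypothetical, and the optimizer value "adam" in the usage line is an assumed spelling; only the "muon" value and the flag names come from the note above.

```python
from __future__ import annotations


def check_optimizer_offload(adam_offload: bool, actor_optim: str,
                            critic_optim: str) -> list[str]:
    """Return one message per role whose optimizer conflicts with Adam offload.

    Hypothetical validation helper; an empty list means no conflict was found.
    """
    problems: list[str] = []
    if adam_offload:
        for role, optim in (("actor", actor_optim), ("critic", critic_optim)):
            # Muon keeps optimizer state on GPU, so offload cannot apply to it.
            if optim.lower() == "muon":
                problems.append(
                    f"--ds.adam_offload conflicts with --{role}.optim muon")
    return problems


for message in check_optimizer_offload(True, "muon", "adam"):
    print(message)
```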
Batch size tuning
- Generation: maximize --rollout.micro_batch_size and minimize vLLM TP size (prefer more engines over larger TP).
- Training: maximize --train.micro_batch_size with --ds.packing_samples enabled.
- Batch relation: a common choice is train.batch_size = rollout.batch_size * rollout.n_samples_per_prompt.
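As a worked instance of the batch relation, the sketch below computes the training batch size from the two rollout settings. The numbers 128 and 8 are illustrative values chosen here, not recommendations from this guide.

```python
def train_batch_size(rollout_batch_size: int, n_samples_per_prompt: int) -> int:
    """Training batch size under the common choice described above.

    Each rollout batch of prompts yields n_samples_per_prompt responses, so one
    training batch consumes all samples produced by one rollout batch.
    """
    if rollout_batch_size <= 0 or n_samples_per_prompt <= 0:
        raise ValueError("both arguments must be positive")
    return rollout_batch_size * n_samples_per_prompt


# 128 prompts per rollout batch, 8 samples per prompt.
print(train_batch_size(128, 8))  # 1024
```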
Long context (>8K tokens)
- Enable RingAttention (--ds.ring_attn_size); see Sequence Parallelism (RingAttention).
- Keep --ds.packing_samples on.
- Increase --ds.zero_stage (typically 3) and watch memory closely.
For the launch recipes see Hybrid Engine and RL Training Guide; for error triage see Troubleshooting.