Async Training & Partial Rollout

Sync RLHF (the default) executes one phase at a time: rollout finishes, then training runs, then rollout starts again, so each side idles while the other works. Async training removes that bubble by running rollout and training concurrently. Partial rollout goes one step further by overlapping weight sync with generation itself.

These are throughput-oriented features: they trade a small amount of on-policy-ness for substantially higher GPU utilization. Use them only after convergence has been validated in sync mode.

For the underlying architecture see Architecture Foundation: Ray + vLLM Distribution; for the alternative (sync) pipeline see Hybrid Engine.

How it works 

Sync pipeline (default):

┌────────────┐    ┌──────────┐    ┌────────────┐    ┌──────────┐
│  rollout   │───▶│  train   │───▶│  rollout   │───▶│  train   │
└────────────┘    └──────────┘    └────────────┘    └──────────┘
                       (one phase at a time, GPUs alternate roles)

Async pipeline (--train.async_enable):

┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐
│  rollout   │─▶│  rollout   │─▶│  rollout   │─▶│  rollout   │
└─────┬──────┘  └─────┬──────┘  └─────┬──────┘  └─────┬──────┘
      │ (queue)       │               │               │
      ▼               ▼               ▼               ▼
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│  train   │───▶│  train   │───▶│  train   │───▶│  train   │
└──────────┘    └──────────┘    └──────────┘    └──────────┘
(rollout and train run concurrently; bounded queue between them)
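The bounded queue in the diagram above can be modeled with a toy producer/consumer pair. This is a minimal sketch, not OpenRLHF's implementation: `run_async_pipeline`, the shared `version` counter, and the `None` sentinel are hypothetical, and real generation and training are replaced by bookkeeping. It records how many optimizer steps each batch is behind the trainer at the moment it is consumed.

```python
import queue
import threading


def run_async_pipeline(num_batches: int, queue_size: int) -> list[int]:
    """Toy model: return the off-policy lag (in trainer steps) of each batch."""
    batches = queue.Queue(maxsize=queue_size)  # bounded queue between the two sides
    state = {"version": 0}                     # hypothetical policy-weight version
    lags = []

    def rollout_worker():
        for step in range(num_batches):
            generated_with = state["version"]    # weights visible when generation starts
            batches.put((step, generated_with))  # blocks while the queue is full
        batches.put(None)                        # sentinel: no more batches

    def trainer():
        while True:
            item = batches.get()
            if item is None:
                break
            _step, generated_with = item
            lags.append(state["version"] - generated_with)
            state["version"] += 1                # one optimizer step per batch

    workers = [threading.Thread(target=rollout_worker), threading.Thread(target=trainer)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return lags


if __name__ == "__main__":
    print(run_async_pipeline(num_batches=6, queue_size=1))
```

In this toy model the lag never exceeds `queue_size + 1` (the queued batches plus the one being generated while the queue is full), which is why a shallow queue keeps samples close to on-policy. The exact lag per batch depends on thread scheduling.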

Async + Partial Rollout (--train.partial_rollout_enable): vLLM never fully stops. When the trainer pushes new weights, vLLM pauses the in-flight requests, swaps weights, and resumes, so a single sample can contain tokens generated under both old and new weights. The cost is extra off-policy noise; the benefit is full overlap between generation and weight sync.
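To make the mixed-weights effect concrete, here is a small illustrative sketch. It does not call vLLM; `decode_with_weight_swaps` and its arguments are hypothetical, and decoding is reduced to tagging each token position with the weight version that would have produced it.

```python
def decode_with_weight_swaps(num_tokens: int, swap_before: set[int]) -> list[int]:
    """Return, per token position, the weight version active when it was decoded.

    swap_before holds the token positions right before which a weight sync
    arrives (pause in-flight requests, swap weights, resume).
    """
    version = 0
    token_versions = []
    for position in range(num_tokens):
        if position in swap_before:
            version += 1  # new weights take effect mid-sample
        token_versions.append(version)
    return token_versions


if __name__ == "__main__":
    # One weight sync lands before token 3 of a 6-token sample.
    print(decode_with_weight_swaps(6, {3}))  # [0, 0, 0, 1, 1, 1]
```

The first three tokens come from the old policy and the last three from the new one, which is the per-token mismatch that off-policy correction is meant to absorb.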

Flags 

Flag	Meaning
`--train.async_enable`	Enable the async pipeline. Rollout and training overlap through a bounded queue.
`--train.async_queue_size`	Queue depth between rollout and training (default `1`). Larger values raise throughput but increase off-policy lag.
`--train.partial_rollout_enable`	Use vLLM pause/resume for weight sync, so generation overlaps with weight broadcast. Requires `--train.async_enable`.
`--rollout.vllm_generate_batch_size`	vLLM generation batch size; setting it larger than `--rollout.batch_size` enables oversampling. Requires `--train.async_enable` when greater than `--rollout.batch_size`.

Compatibility notes:

--train.async_enable is incompatible with --vllm.enable_sleep. The trainer asserts this; remove --vllm.enable_sleep before adding --train.async_enable.
--train.colocate_all may be combined with --train.async_enable — but in async mode it only colocates the DeepSpeed models (Actor / Ref / Critic / Reward) on shared GPUs; vLLM keeps its own GPU group so it can keep generating.
For maximum GPU utilization without async, prefer the sync Hybrid Engine (--train.colocate_all --vllm.enable_sleep --ds.enable_sleep).
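The constraints from the flag table and the notes above can be collected into a pre-launch check. This is a sketch under stated assumptions: `LaunchFlags` and `check_flags` are hypothetical helpers that are not part of OpenRLHF, the defaults are placeholders, and the trainer's own assertions remain the source of truth.

```python
from dataclasses import dataclass


@dataclass
class LaunchFlags:
    """Hypothetical mirror of the CLI flags discussed in this section."""
    async_enable: bool = False
    partial_rollout_enable: bool = False
    vllm_enable_sleep: bool = False
    rollout_batch_size: int = 128        # placeholder default
    vllm_generate_batch_size: int = 128  # placeholder default


def check_flags(flags: LaunchFlags) -> list[str]:
    """Return a list of human-readable conflicts; empty means no known conflict."""
    problems = []
    if flags.async_enable and flags.vllm_enable_sleep:
        problems.append("--train.async_enable is incompatible with --vllm.enable_sleep")
    if flags.partial_rollout_enable and not flags.async_enable:
        problems.append("--train.partial_rollout_enable requires --train.async_enable")
    oversampling = flags.vllm_generate_batch_size > flags.rollout_batch_size
    if oversampling and not flags.async_enable:
        problems.append(
            "--rollout.vllm_generate_batch_size > --rollout.batch_size "
            "requires --train.async_enable"
        )
    return problems


if __name__ == "__main__":
    bad = LaunchFlags(partial_rollout_enable=True, vllm_enable_sleep=True)
    for problem in check_flags(bad):
        print(problem)
```

Running a check like this before `ray job submit` surfaces a conflicting combination without waiting for the cluster to start.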

Launch recipe (async + partial rollout)

This is the upstream train_reinforce_baseline_ray_agent_async.sh flattened into a single ray job submit invocation:

export VLLM_USE_V1=1

ray job submit --address="http://127.0.0.1:8265" \
   --runtime-env-json='{"working_dir": "/openrlhf"}' \
   -- python3 -m openrlhf.cli.train_ppo_ray \
   --actor.model_name_or_path Qwen/Qwen3-4B-Thinking-2507 \
   --train.agent_func_path examples/python/agent_func.py \
   --data.prompt_dataset zhuzilin/dapo-math-17k \
   --data.input_key prompt \
   --data.label_key label \
   --data.apply_chat_template \
   --ds.packing_samples \
   \
   --train.async_enable \
   --train.partial_rollout_enable \
   \
   --ref.num_nodes 1 \
   --ref.num_gpus_per_node 4 \
   --actor.num_nodes 1 \
   --actor.num_gpus_per_node 4 \
   --vllm.num_engines 2 \
   --vllm.tensor_parallel_size 2 \
   --vllm.gpu_memory_utilization 0.95 \
   --train.colocate_all \
   --ds.enable_sleep \
   --vllm.sync_backend nccl \
   --vllm.enforce_eager \
   \
   --rollout.batch_size 128 \
   --rollout.n_samples_per_prompt 8 \
   --train.batch_size 1024 \
   --algo.dynamic_filtering_enable \
   --algo.dynamic_filtering_range 0.0 1.0 \
   --train.dynamic_batch_enable \
   --train.max_tokens_per_gpu 16192 \
   --rollout.max_tokens_per_gpu 32768 \
   --train.micro_batch_size 1 \
   --rollout.micro_batch_size 8 \
   --data.max_len 74240 \
   --rollout.max_new_tokens 64000 \
   --data.max_samples 128000 \
   --train.max_epochs 1 \
   --train.num_episodes 1 \
   \
   --algo.advantage.estimator reinforce_baseline \
   --actor.adam.lr 5e-7 \
   --actor.entropy_coef 0.0 \
   --algo.kl.init_coef 1e-5 \
   --algo.kl.use_loss \
   --algo.kl.estimator k2 \
   --algo.advantage.is_correction_enable \
   --algo.advantage.is_correction_type icepop \
   \
   --ds.zero_stage 3 \
   --actor.gradient_checkpointing_enable \
   --ds.ring_attn_size 2 \
   --ds.ring_attn_head_stride 2 \
   --ds.param_dtype bf16 \
   \
   --ckpt.output_dir ./exp/Qwen3-4B-Thinking \
   --ckpt.path ./exp/Qwen3-4B-Thinking/ckpt \
   --ckpt.save_hf \
   --ckpt.max_num 3 \
   --ckpt.save_steps 10 \
   --logger.logging_steps 1 \
   --eval.steps -1
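A quick sanity check of the batch and GPU numbers in the recipe above. This is back-of-the-envelope arithmetic under two assumptions that are not stated in the recipe itself: `--train.batch_size` is counted in samples, and the colocated Actor and Ref share one 4-GPU group while vLLM keeps its own GPUs (as the compatibility notes describe for async mode). Dynamic filtering may drop samples, so the first figure is an upper bound per rollout batch.

```python
# Values copied from the launch recipe.
rollout_batch_size = 128       # prompts per rollout batch
n_samples_per_prompt = 8
train_batch_size = 1024        # assumed to be counted in samples

samples_per_rollout_batch = rollout_batch_size * n_samples_per_prompt
optimizer_steps_per_rollout_batch = samples_per_rollout_batch // train_batch_size

deepspeed_gpus = 1 * 4         # actor: 1 node x 4 GPUs; ref assumed colocated on the same GPUs
vllm_gpus = 2 * 2              # 2 engines x tensor parallel size 2
total_gpus = deepspeed_gpus + vllm_gpus

print(samples_per_rollout_batch)          # 1024
print(optimizer_steps_per_rollout_batch)  # 1
print(total_gpus)                         # 8
```

Under those assumptions each rollout batch feeds exactly one optimizer step, and the recipe fits on a single 8-GPU node.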

Tuning guide 

Start with --train.async_queue_size 1, which gives the smallest off-policy lag (~1 step). Increase it only if rollout is bottlenecking training.
Pair partial rollout with off-policy correction. Because in-flight samples mix old and new weights, enable --algo.advantage.is_correction_enable (typically --algo.advantage.is_correction_type icepop for reasoning workloads).
Validate convergence in sync mode first. Async + partial rollout can mask convergence regressions caused by other knobs.
Use a larger --rollout.vllm_generate_batch_size when generation underutilizes vLLM; async mode lets generation oversample without blocking the trainer.
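As an illustration of what off-policy correction does with mixed-weight samples, the sketch below computes per-token importance ratios between the training policy and the rollout policy and masks tokens whose ratio leaves a band. This is an assumption about the general shape of a masking-style correction, not the definition of --algo.advantage.is_correction_type icepop; `masked_importance_weights` and the thresholds `low` and `high` are hypothetical.

```python
import math


def masked_importance_weights(
    train_logprobs: list[float],
    rollout_logprobs: list[float],
    low: float = 0.5,    # hypothetical lower bound on the ratio
    high: float = 2.0,   # hypothetical upper bound on the ratio
) -> list[float]:
    """Per-token weight: the importance ratio if it lies in [low, high], else 0."""
    weights = []
    for train_lp, rollout_lp in zip(train_logprobs, rollout_logprobs):
        ratio = math.exp(train_lp - rollout_lp)  # pi_train(token) / pi_rollout(token)
        weights.append(ratio if low <= ratio <= high else 0.0)
    return weights


if __name__ == "__main__":
    train = [-1.0, -1.0, -3.0]
    rollout = [-1.0, -1.2, -1.0]
    print(masked_importance_weights(train, rollout))
```

Tokens generated under stale weights tend to have ratios further from 1, so a band like this drops the most off-policy tokens from the gradient while leaving near-on-policy tokens almost untouched.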

When not to use async 

Tasks sensitive to off-policy noise (small models, short rollouts, sparse rewards).
Runs whose convergence has not yet been validated in sync mode.
Single-node small-scale runs where the sync overhead is already low.

In those cases, use the Hybrid Engine for the best balance of throughput and stability.

Warning

Async training and partial rollout deliver the highest throughput but can affect convergence on sensitive tasks. Always validate convergence in sync mode before switching.
