Async Training & Partial Rollout
================================

Sync RLHF (the default) executes one phase at a time: rollout finishes, then training runs, then rollout starts again. **Async training** removes that bubble by running rollout and training concurrently. **Partial rollout** goes one step further by overlapping weight sync with generation itself.

These are throughput-oriented features: they trade a small amount of on-policy-ness for substantially higher GPU utilization. Use them only after convergence has been validated in sync mode.

For the underlying architecture see :doc:`architecture`; for the alternative (sync) pipeline see :doc:`hybrid_engine`.

.. contents::
   :local:
   :depth: 2

How it works
------------

**Sync pipeline (default)**::

   ┌────────────┐    ┌──────────┐    ┌────────────┐    ┌──────────┐
   │  rollout   │───▶│  train   │───▶│  rollout   │───▶│  train   │
   └────────────┘    └──────────┘    └────────────┘    └──────────┘
   (one phase at a time, GPUs alternate roles)

**Async pipeline (** ``--train.async_enable`` **)**::

   ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐
   │  rollout   │─▶│  rollout   │─▶│  rollout   │─▶│  rollout   │
   └─────┬──────┘  └─────┬──────┘  └─────┬──────┘  └─────┬──────┘
         │ (queue)       │               │               │
         ▼               ▼               ▼               ▼
    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
    │  train   │───▶│  train   │───▶│  train   │───▶│  train   │
    └──────────┘    └──────────┘    └──────────┘    └──────────┘
   (rollout and train run concurrently; bounded queue between them)

**Async + Partial Rollout (** ``--train.partial_rollout_enable`` **)**:
vLLM never fully stops. When the trainer pushes new weights, vLLM **pauses** the in-flight requests, swaps weights, and **resumes**, so a single sample can contain tokens generated under both old and new weights. This accepts some off-policy noise in exchange for full overlap.

Flags
-----

.. list-table::
   :header-rows: 1
   :widths: 38 62

   * - Flag
     - Meaning
   * - ``--train.async_enable``
     - Enable the async pipeline. Rollout and training overlap through a bounded queue.
   * - ``--train.async_queue_size``
     - Queue depth between rollout and training (default ``1``). Larger values raise throughput but increase off-policy lag.
   * - ``--train.partial_rollout_enable``
     - Use vLLM pause/resume for weight sync, so generation overlaps with weight broadcast. **Requires** ``--train.async_enable``.
   * - ``--rollout.vllm_generate_batch_size``
     - vLLM generation batch size; setting it larger than ``--rollout.batch_size`` enables oversampling. **Requires** ``--train.async_enable`` when greater than ``--rollout.batch_size``.

Compatibility notes:

- ``--train.async_enable`` is **incompatible** with ``--vllm.enable_sleep``. The trainer asserts this; remove ``--vllm.enable_sleep`` before adding ``--train.async_enable``.
- ``--train.colocate_all`` may be combined with ``--train.async_enable``, but in async mode it colocates only the **DeepSpeed** models (Actor / Ref / Critic / Reward) on shared GPUs; vLLM keeps its own GPU group so that it can keep generating.
- For maximum GPU utilization without async, prefer the sync :doc:`hybrid_engine` (``--train.colocate_all --vllm.enable_sleep --ds.enable_sleep``).

Launch recipe (async + partial rollout)
---------------------------------------

This is the upstream ``train_reinforce_baseline_ray_agent_async.sh`` script, flattened into a single ``ray job submit`` invocation:

.. code-block:: bash

   export VLLM_USE_V1=1

   ray job submit --address="http://127.0.0.1:8265" \
      --runtime-env-json='{"working_dir": "/openrlhf"}' \
      -- python3 -m openrlhf.cli.train_ppo_ray \
      --actor.model_name_or_path Qwen/Qwen3-4B-Thinking-2507 \
      --train.agent_func_path examples/python/agent_func.py \
      --data.prompt_dataset zhuzilin/dapo-math-17k \
      --data.input_key prompt \
      --data.label_key label \
      --data.apply_chat_template \
      --ds.packing_samples \
      \
      --train.async_enable \
      --train.partial_rollout_enable \
      \
      --ref.num_nodes 1 \
      --ref.num_gpus_per_node 4 \
      --actor.num_nodes 1 \
      --actor.num_gpus_per_node 4 \
      --vllm.num_engines 2 \
      --vllm.tensor_parallel_size 2 \
      --vllm.gpu_memory_utilization 0.95 \
      --train.colocate_all \
      --ds.enable_sleep \
      --vllm.sync_backend nccl \
      --vllm.enforce_eager \
      \
      --rollout.batch_size 128 \
      --rollout.n_samples_per_prompt 8 \
      --train.batch_size 1024 \
      --algo.dynamic_filtering_enable \
      --algo.dynamic_filtering_range 0.0 1.0 \
      --train.dynamic_batch_enable \
      --train.max_tokens_per_gpu 16192 \
      --rollout.max_tokens_per_gpu 32768 \
      --train.micro_batch_size 1 \
      --rollout.micro_batch_size 8 \
      --data.max_len 74240 \
      --rollout.max_new_tokens 64000 \
      --data.max_samples 128000 \
      --train.max_epochs 1 \
      --train.num_episodes 1 \
      \
      --algo.advantage.estimator reinforce_baseline \
      --actor.adam.lr 5e-7 \
      --actor.entropy_coef 0.0 \
      --algo.kl.init_coef 1e-5 \
      --algo.kl.use_loss \
      --algo.kl.estimator k2 \
      --algo.advantage.is_correction_enable \
      --algo.advantage.is_correction_type icepop \
      \
      --ds.zero_stage 3 \
      --actor.gradient_checkpointing_enable \
      --ds.ring_attn_size 2 \
      --ds.ring_attn_head_stride 2 \
      --ds.param_dtype bf16 \
      \
      --ckpt.output_dir ./exp/Qwen3-4B-Thinking \
      --ckpt.path ./exp/Qwen3-4B-Thinking/ckpt \
      --ckpt.save_hf \
      --ckpt.max_num 3 \
      --ckpt.save_steps 10 \
      --logger.logging_steps 1 \
      --eval.steps -1

Tuning guide
------------

- **Start with** ``--train.async_queue_size 1``. This gives the smallest off-policy lag (~1 step). Increase it only if rollout is bottlenecking training.
- **Pair partial rollout with off-policy correction.** Because in-flight samples mix old and new weights, enable ``--algo.advantage.is_correction_enable`` (typically ``--algo.advantage.is_correction_type icepop`` for reasoning workloads).
- **Validate convergence in sync mode first.** Async + partial rollout can mask convergence regressions caused by other knobs.
- **Use a larger** ``--rollout.vllm_generate_batch_size`` when generation underutilizes vLLM; async mode lets generation oversample without blocking the trainer.

When **not** to use async
-------------------------

- Tasks sensitive to off-policy noise (small models, short rollouts, sparse rewards).
- When convergence in sync mode hasn't been validated yet.
- Single-node small-scale runs where the sync overhead is already low.

In those cases use the :doc:`hybrid_engine` for the best balance of throughput and stability.

.. warning::

   Async training and partial rollout deliver the highest throughput but can affect convergence on sensitive tasks. Always validate convergence in sync mode before switching.
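
The flag constraints listed under *Flags* and in the compatibility notes can be checked before a job is submitted. The following is a minimal sketch, not part of OpenRLHF: the function name ``check_async_flags`` is hypothetical, and it encodes only the three rules stated on this page (async is incompatible with vLLM sleep, partial rollout requires async, and oversampling requires async). It does not know about any other constraint the trainer may enforce.

.. code-block:: bash

   #!/usr/bin/env bash
   # Hypothetical pre-flight check (not an OpenRLHF tool). It encodes only
   # the flag rules documented on this page and ignores every other flag.
   check_async_flags() {
       local async=0 partial=0 vllm_sleep=0 rollout_bs="" gen_bs="" rc=0

       # Scan the argument list for the flags this page describes.
       while [ "$#" -gt 0 ]; do
           case "$1" in
               --train.async_enable)           async=1 ;;
               --train.partial_rollout_enable) partial=1 ;;
               --vllm.enable_sleep)            vllm_sleep=1 ;;
               --rollout.batch_size)
                   rollout_bs="${2:-}"
                   if [ "$#" -gt 1 ]; then shift; fi ;;
               --rollout.vllm_generate_batch_size)
                   gen_bs="${2:-}"
                   if [ "$#" -gt 1 ]; then shift; fi ;;
           esac
           shift
       done

       # Rule 1: async mode cannot be combined with vLLM sleep.
       if [ "$async" -eq 1 ] && [ "$vllm_sleep" -eq 1 ]; then
           echo "error: --train.async_enable is incompatible with --vllm.enable_sleep" >&2
           rc=1
       fi

       # Rule 2: partial rollout only works on top of the async pipeline.
       if [ "$partial" -eq 1 ] && [ "$async" -eq 0 ]; then
           echo "error: --train.partial_rollout_enable requires --train.async_enable" >&2
           rc=1
       fi

       # Rule 3: oversampling (generate batch > rollout batch) requires async.
       if [ -n "$rollout_bs" ] && [ -n "$gen_bs" ] \
           && [ "$gen_bs" -gt "$rollout_bs" ] && [ "$async" -eq 0 ]; then
           echo "error: --rollout.vllm_generate_batch_size > --rollout.batch_size requires --train.async_enable" >&2
           rc=1
       fi

       return "$rc"
   }

   # Example: validate the async flags of a launch command before submitting it.
   check_async_flags --train.async_enable --train.partial_rollout_enable \
       --rollout.batch_size 128 && echo "flags look consistent"

The function returns a non-zero status and prints one message per violated rule, so it can gate a ``ray job submit`` call in a wrapper script. The trainer's own assertions remain the authoritative check.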