Welcome to OpenRLHF’s documentation!
OpenRLHF is the first high-performance, production-ready open-source RLHF framework that combines a Ray + vLLM distributed architecture with a unified agent-based design paradigm for scalable and extensible reinforcement learning from human feedback.
Highlights
Ray + vLLM distributed architecture — scales to 70B+ models. vLLM-accelerated generation eliminates the dominant RLHF bottleneck; DeepSpeed ZeRO-3 trains directly from HuggingFace checkpoints with no model conversion.
Unified agent-based paradigm — token-in-token-out pipeline that decouples execution mode (single-turn / multi-turn) from RL algorithm. Any algorithm pairs with any mode through a single shared loss layer.
State-of-the-art RL algorithms — PPO, REINFORCE++, REINFORCE++-baseline, GRPO, Dr. GRPO, RLOO; switchable with one flag.
Hybrid Engine — colocate Actor / Critic / Reward / Reference / vLLM on the same GPUs with sleep-mode memory sharing. Highest utilization on small clusters; simplest deployment.
Async training & Partial Rollout — overlap rollout with training, and overlap weight sync with generation via vLLM pause / resume. Highest throughput when convergence is validated.
Single-turn rewards & multi-turn agents — HTTP remote RM, custom Python reward functions (Reinforced Fine-Tuning), full multi-turn environments, or wrap vLLM as an OpenAI-compatible chat server.
Vision-Language Model RLHF (0.10) — train VLMs (e.g., Qwen3.5) end-to-end with image inputs through the same agent pipeline.
Muon optimizer (new in 0.10.2) — DeepSpeed-native MuonWithAuxAdam for SFT / RM / DPO / PPO. Per-entity selection in PPO (--actor.optim muon with Adam critic is a one-liner). Requires DeepSpeed ≥ 0.18.2.
Hierarchical CLI (new in 0.10.2) — every flag lives under a named section (--ds.*, --vllm.*, --rollout.*, --data.*, --train.*, --eval.*, --ckpt.*, --logger.*, --algo.*, --actor.*, --critic.*, --ref.*, --reward.*). Hierarchy mirrors ownership and is self-documenting at --help.
Off-policy correction — TIS / ICEPOP / Seq-Mask-TIS handle vLLM ↔ training log-prob mismatches.
Production essentials — resumable checkpoints, best-checkpoint tracking, EMA, Wandb / TensorBoard logging, SLURM multi-node, LoRA / QLoRA for SFT / RM / DPO.
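Several of the algorithms listed above (GRPO, Dr. GRPO, RLOO) differ mainly in how they turn the rewards of a group of responses sampled for the same prompt into per-response advantages. The sketch below illustrates those textbook definitions in plain Python. It is not OpenRLHF's implementation: the function name `group_advantages`, the estimator strings, and the use of the population standard deviation for GRPO are assumptions made for illustration, and the strings are not OpenRLHF flag values.

```python
from statistics import mean, pstdev


def group_advantages(rewards, estimator="grpo", eps=1e-8):
    """Return one advantage per response sampled for the same prompt.

    `estimator` is a hypothetical switch for this sketch only:
      "grpo"    -> (r - group mean) / (group std + eps)
      "dr_grpo" -> r - group mean (no std normalization)
      "rloo"    -> r - mean of the *other* responses (leave-one-out)
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("need at least two responses per prompt")

    mu = mean(rewards)
    if estimator == "grpo":
        # Assumption: population std; implementations may use the sample std.
        sigma = pstdev(rewards)
        return [(r - mu) / (sigma + eps) for r in rewards]
    if estimator == "dr_grpo":
        return [r - mu for r in rewards]
    if estimator == "rloo":
        total = sum(rewards)
        # Baseline for response i is the average reward of the other k - 1.
        return [r - (total - r) / (k - 1) for r in rewards]
    raise ValueError(f"unknown estimator: {estimator!r}")


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]  # e.g. pass/fail rewards for 4 samples
    for name in ("grpo", "dr_grpo", "rloo"):
        print(name, group_advantages(rewards, name))
```

Because only the advantage computation changes between these estimators, the rollout and loss code around it can stay the same, which is the kind of separation the "switchable with one flag" highlight describes.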
Start here
New users | Quick Start — install + first training run.
Mental model | Architecture Foundation: Ray + vLLM Distribution (components) and Design Paradigm: Agent-Based Execution (design).
Pick a recipe | RL Training Guide (RL training) or Supervised & Preference Training (SFT / RM / DPO).
Look up a flag | Common CLI Options (shared) or the trainer-specific page above.
Upgrading from 0.9.x / early 0.10 | Flag migration (0.9.x / early 0.10 → 0.10.2) in Common CLI Options.
Scale or tune | Hybrid Engine, Async Training & Partial Rollout, Performance Tuning, Multi-node Training.
Something broke |
Resources
Note
This project is under active development.
Contents
Getting Started
Core Concepts
Training Guides