Welcome to OpenRLHF’s documentation!

OpenRLHF is the first high-performance, production-ready open-source RLHF framework that combines a Ray + vLLM distributed architecture with a unified agent-based design paradigm for scalable and extensible reinforcement learning from human feedback.

Highlights

Ray + vLLM distributed architecture — scales to 70B+ models. vLLM-accelerated generation eliminates the dominant RLHF bottleneck; DeepSpeed ZeRO-3 trains directly from HuggingFace checkpoints with no model conversion.
Unified agent-based paradigm — token-in-token-out pipeline that decouples execution mode (single-turn / multi-turn) from RL algorithm. Any algorithm pairs with any mode through a single shared loss layer.
State-of-the-art RL algorithms — PPO, REINFORCE++, REINFORCE++-baseline, GRPO, Dr. GRPO, RLOO; switchable with one flag.
Hybrid Engine — colocate Actor / Critic / Reward / Reference / vLLM on the same GPUs with sleep-mode memory sharing. Highest utilization on small clusters; simplest deployment.
Async training & Partial Rollout — overlap rollout with training, and overlap weight sync with generation via vLLM pause / resume. Highest throughput, best used once you have validated that training still converges in asynchronous mode.
Single-turn rewards & multi-turn agents — HTTP remote RM, custom Python reward functions (Reinforced Fine-Tuning), full multi-turn environments, or wrap vLLM as an OpenAI-compatible chat server.
Vision-Language Model RLHF (new in 0.10) — train VLMs (e.g., Qwen3.5) end-to-end with image inputs through the same agent pipeline.
Muon optimizer (new in 0.10.2) — DeepSpeed-native MuonWithAuxAdam for SFT / RM / DPO / PPO. In PPO the optimizer is selected per model, so pairing --actor.optim muon with an Adam critic is a one-line change. Requires DeepSpeed ≥ 0.18.2.
Hierarchical CLI (new in 0.10.2) — every flag lives under a named section (--ds.*, --vllm.*, --rollout.*, --data.*, --train.*, --eval.*, --ckpt.*, --logger.*, --algo.*, --actor.*, --critic.*, --ref.*, --reward.*). Hierarchy mirrors ownership and is self-documenting at --help.
Off-policy correction — TIS / ICEPOP / Seq-Mask-TIS handle vLLM ↔ training log-prob mismatches.
Production essentials — resumable checkpoints, best-checkpoint tracking, EMA, Wandb / TensorBoard logging, SLURM multi-node, LoRA / QLoRA for SFT / RM / DPO.
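
As a rough illustration of the off-policy correction listed above, the sketch below shows truncated importance sampling in plain Python, assuming that is what TIS refers to here. The function name, the cap of 2.0, and the sample log-probs are hypothetical and chosen only for illustration; they are not OpenRLHF's API or defaults. The idea is that each token's loss is reweighted by the ratio between the training-side probability and the vLLM rollout probability, with the ratio clipped from above so that a large mismatch cannot dominate the update.

```python
import math


def truncated_is_weights(train_logprobs, rollout_logprobs, cap=2.0):
    """Per-token importance weights exp(train - rollout), clipped at `cap`.

    `cap` is a hypothetical illustration value, not a library default.
    """
    if len(train_logprobs) != len(rollout_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return [
        min(math.exp(lp_train - lp_rollout), cap)
        for lp_train, lp_rollout in zip(train_logprobs, rollout_logprobs)
    ]


# Token 1: both engines agree, so the weight is 1.
# Token 2: the trainer assigns much higher probability, so the weight is capped.
# Token 3: the trainer assigns lower probability, so the weight falls below 1.
weights = truncated_is_weights([-1.0, -0.5, -2.0], [-1.0, -1.5, -1.0])
```

In a real training loop these weights would multiply the per-token policy loss; refer to the --algo.* options in the CLI reference for the flags OpenRLHF actually exposes.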

Start here

New users	Quick Start — install + first training run.
Mental model	Architecture Foundation: Ray + vLLM Distribution (components) and Design Paradigm: Agent-Based Execution (design).
Pick a recipe	RL Training Guide (RL training) or Supervised & Preference Training (SFT / RM / DPO).
Look up a flag	Common CLI Options (shared) or the trainer-specific page above.
Upgrading from 0.9.x / early 0.10	Flag migration (0.9.x / early 0.10 → 0.10.2) in Common CLI Options.
Scale or tune	Hybrid Engine, Async Training & Partial Rollout, Performance Tuning, Multi-node Training.
Something broke	Troubleshooting.

Resources

Note

This project is under active development.

Contents

Getting Started

Quick Start

Core Concepts

Scaling & Operations