Welcome to OpenRLHF’s documentation!

OpenRLHF is the first high-performance, production-ready open-source RLHF framework that combines a Ray + vLLM distributed architecture with a unified agent-based design paradigm for scalable and extensible reinforcement learning from human feedback.

[Figure: OpenRLHF Architecture]

Highlights

  • Ray + vLLM distributed architecture — scales to 70B+ models. vLLM-accelerated generation eliminates the dominant RLHF bottleneck; DeepSpeed ZeRO-3 trains directly from HuggingFace checkpoints with no model conversion.

  • Unified agent-based paradigm — token-in-token-out pipeline that decouples execution mode (single-turn / multi-turn) from RL algorithm. Any algorithm pairs with any mode through a single shared loss layer.

  • State-of-the-art RL algorithms — PPO, REINFORCE++, REINFORCE++-baseline, GRPO, Dr. GRPO, RLOO; switchable with one flag.

  • Hybrid Engine — colocate Actor / Critic / Reward / Reference / vLLM on the same GPUs with sleep-mode memory sharing. Highest utilization on small clusters; simplest deployment.

  • Async training & Partial Rollout — overlap rollout with training, and overlap weight sync with generation via vLLM pause / resume. Highest throughput when convergence is validated.

  • Single-turn rewards & multi-turn agents — HTTP remote RM, custom Python reward functions (Reinforced Fine-Tuning), full multi-turn environments, or wrap vLLM as an OpenAI-compatible chat server.

  • Vision-Language Model RLHF (0.10) — train VLMs (e.g., Qwen3.5) end-to-end with image inputs through the same agent pipeline.

  • Muon optimizer (new in 0.10.2) — DeepSpeed-native MuonWithAuxAdam for SFT / RM / DPO / PPO. Per-entity selection in PPO (--actor.optim muon with Adam critic is a one-liner). Requires DeepSpeed ≥ 0.18.2.

  • Hierarchical CLI (new in 0.10.2) — every flag lives under a named section (--ds.*, --vllm.*, --rollout.*, --data.*, --train.*, --eval.*, --ckpt.*, --logger.*, --algo.*, --actor.*, --critic.*, --ref.*, --reward.*). Hierarchy mirrors ownership and is self-documenting at --help.

  • Off-policy correction — TIS / ICEPOP / Seq-Mask-TIS handle vLLM ↔ training log-prob mismatches.

  • Production essentials — resumable checkpoints, best-checkpoint tracking, EMA, Wandb / TensorBoard logging, SLURM multi-node, LoRA / QLoRA for SFT / RM / DPO.
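For readers new to the critic-free algorithms named in the Highlights, one visible difference between GRPO, Dr. GRPO, and RLOO is how a group of rewards sampled for the same prompt becomes per-response advantages. The sketch below is a minimal plain-Python illustration of those three baselines, not OpenRLHF's implementation. The function name `group_advantages`, the method labels, the `eps` default, and the choice of population standard deviation are assumptions made for this example; they are not OpenRLHF flag values.

```python
from statistics import mean, pstdev


def group_advantages(rewards, method="grpo", eps=1e-6):
    """Return one advantage per response for a single prompt's sample group.

    `method` is a label used only in this sketch (hypothetical, not a CLI value).
    """
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two sampled responses per prompt")

    mu = mean(rewards)

    if method == "grpo":
        # GRPO: subtract the group mean and divide by the group std.
        # Population std is an assumption here; implementations may differ.
        sigma = pstdev(rewards)
        return [(r - mu) / (sigma + eps) for r in rewards]

    if method == "dr_grpo":
        # Dr. GRPO: subtract the group mean, skip the std normalization.
        return [r - mu for r in rewards]

    if method == "rloo":
        # RLOO: baseline for each response is the mean of the *other* rewards.
        total = sum(rewards)
        return [r - (total - r) / (n - 1) for r in rewards]

    raise ValueError(f"unknown method: {method!r}")


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]
    for name in ("grpo", "dr_grpo", "rloo"):
        print(name, group_advantages(rewards, name))
```

With rewards `[1.0, 0.0, 0.0, 1.0]`, the Dr. GRPO variant yields `[0.5, -0.5, -0.5, 0.5]`; the GRPO variant rescales the same centered values by the group spread, and the RLOO variant compares each response against the average of the other three.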

Start here

New users

Quick Start — install + first training run.

Mental model

Architecture Foundation: Ray + vLLM Distribution (components) and Design Paradigm: Agent-Based Execution (design).

Pick a recipe

RL Training Guide or Supervised & Preference Training (SFT / RM / DPO).

Look up a flag

Common CLI Options (flags shared by all trainers) or the trainer-specific guide listed under "Pick a recipe".

Upgrading from 0.9.x / early 0.10

Flag migration (0.9.x / early 0.10 → 0.10.2) in Common CLI Options.

Scale or tune

Hybrid Engine, Async Training & Partial Rollout, Performance Tuning, Multi-node Training.

Something broke

Troubleshooting.

Resources

Note

This project is under active development.

Contents