Troubleshooting
===============

This page consolidates the most common issues you may hit when running OpenRLHF.

"argparse: unrecognized arguments" after upgrade
------------------------------------------------

OpenRLHF 0.10.2 moved every CLI flag under a dotted section prefix. Old flat names (``--pretrain``, ``--zero_stage``, ``--vllm_num_engines``, ``--learning_rate``, ...) no longer parse, and argparse will error out. Port your launch scripts to the new surface: :ref:`flag_migration` in :doc:`common_options` has the full old → new table, and every file under ``examples/scripts/`` has already been migrated.

GPU device index / DeepSpeed init errors
----------------------------------------

If you see GPU device mapping issues (often in DeepSpeed initialization), try:

.. code-block:: bash

   export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1   # NVIDIA
   export RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1   # AMD

Then restart Ray and rerun the job.

Out-of-memory (OOM)
-------------------

Common mitigations (rough priority):

- Reduce batch sizes (``--train.micro_batch_size``, ``--rollout.micro_batch_size``).
- Reduce vLLM memory fraction (``--vllm.gpu_memory_utilization``).
- Disable colocation (remove ``--train.colocate_*``).
- Enable memory savers (``--ds.adam_offload``, ``--actor.gradient_checkpointing_enable`` / ``--model.gradient_checkpointing_enable``, higher ``--ds.zero_stage``).

See :doc:`performance` and :doc:`hybrid_engine` for detailed tuning.

Muon + DeepSpeed compatibility
------------------------------

``--optim muon`` / ``--actor.optim muon`` is **incompatible** with ``--ds.adam_offload``, because DS's Muon implementation keeps optimizer state on GPU. If you need adam-offload for memory, switch back to Adam.

``--muon.ns_steps`` and ``--muon.nesterov`` / ``--muon.no_nesterov`` are **placeholders** on DeepSpeed 0.18.x: the DS ``muon_update()`` kernel hard-codes ``ns_steps=5`` and Nesterov ``True``. Changing them fires a runtime warning and has no effect.
These slots are retained for forward-compat with future DeepSpeed releases.

Muon requires **DeepSpeed ≥ 0.18.2**. On older DS you will see an init-time error when the ``MuonWithAuxAdam`` type is not registered; upgrade DeepSpeed or revert to ``--optim adam``.

vLLM hangs / NCCL issues
------------------------

If vLLM hangs during weight sync or you see NCCL-related issues:

- Try ``--vllm.enforce_eager`` (disables CUDA graphs).
- Prefer ``--vllm.sync_backend nccl`` on multi-GPU setups.

See :doc:`hybrid_engine` for more troubleshooting tips.

Ray runtime environment problems
--------------------------------

If workers are missing dependencies, let Ray install them via runtime env:

.. code-block:: bash

   --runtime-env-json='{"setup_commands": ["pip install openrlhf[vllm]"]}'

Debug with py-spy (in-container)
--------------------------------

Use ``py-spy`` to quickly see what a running OpenRLHF Python process is doing on CPU.

1) Install inside the container:

.. code-block:: bash

   pip install py-spy

2) Find the training PID (common keywords: openrlhf / ray / vllm):

.. code-block:: bash

   ps auxww | rg "openrlhf|ray::|train_ppo_ray|train_sft|train_rm|train_dpo|vllm"

3) Attach and inspect, replacing ``<PID>`` with the process ID found in step 2:

.. code-block:: bash

   py-spy top --pid <PID>
   py-spy record --pid <PID> --duration 30 -o profile.svg

If attach fails in Docker, start the container with ptrace enabled:

.. code-block:: bash

   docker run ... --cap-add=SYS_PTRACE --security-opt seccomp=unconfined ...
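
When a job hangs, it can help to combine the PID lookup and the attach step and take a one-off stack snapshot of every matching process. The following is a minimal sketch and not part of OpenRLHF: the function names ``openrlhf_pids`` and ``dump_all`` are hypothetical, and it assumes ``ps``, ``grep``, ``awk`` and ``py-spy`` are available in the container. It uses ``grep -E`` rather than ``rg`` so that it also works where ripgrep is not installed.

.. code-block:: bash

   # Hypothetical helper: read `ps -eo pid,args`-style lines on stdin and
   # print the PID of every line matching the keywords from step 2.
   openrlhf_pids() {
       grep -E 'openrlhf|ray::|train_ppo_ray|train_sft|train_rm|train_dpo|vllm' \
           | grep -v 'grep' \
           | awk '{print $1}'
   }

   # Hypothetical helper: dump the current Python stack of each matching process.
   dump_all() {
       local pid
       for pid in $(ps -eo pid,args | openrlhf_pids); do
           echo "=== PID ${pid} ==="
           # `py-spy dump` prints one stack snapshot per thread and then exits.
           py-spy dump --pid "${pid}" \
               || echo "py-spy could not attach to ${pid}" >&2
       done
   }

After defining these functions in a shell inside the container, run ``dump_all``. Processes that are not Python, or that py-spy is not permitted to trace, are reported on stderr and skipped.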