vLLM V0 to V1: Correctness Before Corrections in RL

TL;DR

Hugging Face reports that RL training on vLLM V1 now matches the vLLM V0 reference after four backend issues were fixed. The fixes restore consistent rollout logprobs, which stable and predictable RL training depends on.

Hugging Face reports that vLLM V1 now reliably matches the vLLM V0 reference after four critical backend issues were fixed, a significant step in the team’s migration of reinforcement learning (RL) workflows.

The work was driven by discrepancies in the rollout logprobs used during RL training, which caused training metrics such as clip rate, KL divergence, entropy, and reward to diverge from the reference. The team identified and addressed four issues: the semantics of the logprobs returned by the backend, runtime defaults affecting inference paths, the inflight weight update process, and the use of an fp32 lm_head for the final projection. These fixes were made before any change to the RL objective, so that backend parity with the vLLM V0 reference was established first. The V0 reference ran version 0.8.5, while the vLLM V1 runs used version 0.18.1.
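To make the link between rollout logprobs and these metrics concrete, here is a minimal Python sketch. It assumes a PPO-style clipped objective with a clip range of 0.2 and uses made-up logprob values. The helper name mismatch_metrics and the choice of KL estimator are illustrative assumptions, not details from the Hugging Face write-up.

```python
import math


def mismatch_metrics(rollout_logprobs, trainer_logprobs, clip_eps=0.2):
    """Compare per-token logprobs from the inference engine and the trainer.

    Hypothetical helper: returns the fraction of tokens whose importance
    ratio falls outside [1 - clip_eps, 1 + clip_eps], plus a sample-based
    KL estimate between the two sets of logprobs.
    """
    if len(rollout_logprobs) != len(trainer_logprobs):
        raise ValueError("logprob sequences must have the same length")
    n = len(rollout_logprobs)
    clipped = 0
    kl_sum = 0.0
    for lp_rollout, lp_trainer in zip(rollout_logprobs, trainer_logprobs):
        log_ratio = lp_trainer - lp_rollout
        ratio = math.exp(log_ratio)
        if ratio < 1.0 - clip_eps or ratio > 1.0 + clip_eps:
            clipped += 1
        # Non-negative estimator of KL(rollout || trainer) for tokens
        # sampled under the rollout policy: (r - 1) - log r.
        kl_sum += (ratio - 1.0) - log_ratio
    return {"clip_rate": clipped / n, "kl": kl_sum / n}


# Identical logprobs: nothing is clipped and the KL estimate is zero.
matched = mismatch_metrics([-1.0, -2.0, -0.5], [-1.0, -2.0, -0.5])

# Illustrative mismatch: two of three tokens leave the clip range.
drifted = mismatch_metrics([-1.0, -2.0, -0.5], [-1.3, -2.0, -0.1])

print(matched)
print(drifted)
```

If the backend and the trainer agree, both numbers stay at zero before any policy update has happened. A nonzero clip rate or KL at that point indicates a backend mismatch rather than a problem with the RL objective.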

Specifically, the semantic mismatch was corrected by setting logprobs-mode to ‘processed_logprobs’, so that the returned logprobs match what the trainer expects. Runtime defaults such as prefix caching, async scheduling, and override flags were explicitly configured to match the original environment. The inflight weight update path was synchronized so that weight changes during online RL updates did not produce inconsistent inference results. Finally, the final projection through lm_head was run in fp32.
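The settings described above can be pinned explicitly when launching the server instead of being left to version-dependent defaults. The following Python sketch builds such a command line. The flag spellings assume a recent vLLM release and should be checked against vllm serve --help for the installed version. The model name and the boolean values are placeholders, since the article does not say which values the reference environment used, and build_vllm_serve_args is a hypothetical helper.

```python
import shlex


def build_vllm_serve_args(model, enable_prefix_caching, async_scheduling,
                          logprobs_mode="processed_logprobs"):
    """Build a `vllm serve` command with behavior-affecting options spelled out.

    Hypothetical helper. Flag names are assumptions about the installed
    vLLM version; verify them with `vllm serve --help`.
    """
    args = ["vllm", "serve", model, "--logprobs-mode", logprobs_mode]
    # State both settings explicitly so an upgrade cannot change them silently.
    args.append("--enable-prefix-caching" if enable_prefix_caching
                else "--no-enable-prefix-caching")
    args.append("--async-scheduling" if async_scheduling
                else "--no-async-scheduling")
    return args


# Placeholder model name and illustrative values only.
args = build_vllm_serve_args(
    "my-org/my-model",
    enable_prefix_caching=False,
    async_scheduling=False,
)
print(shlex.join(args))
```

Writing out every option, including those that match the current default, keeps the launch command a complete record of the inference configuration used for a run.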

Why It Matters

Consistent backend behavior directly affects the stability and reliability of RL training. When the logprobs returned by the inference engine do not match what the trainer expects, policy updates are computed from the wrong quantities, which hurts model performance and training efficiency. Backend parity lets researchers trust the training process and compare results across versions.

Background

vLLM V1 is a major rewrite of the engine aimed at improving inference performance and flexibility. Early in the migration from V0, training metrics diverged significantly, pointing to problems with logprob computation and inference behavior. The team initially suspected a mismatch in the RL objective but traced the problem to backend semantics and runtime defaults. Fixes were implemented over several weeks, and the latest update confirms that the core issues are resolved, restoring confidence in the vLLM V1 engine for RL tasks.

“We have fixed four key backend issues that caused divergence in logprobs and training metrics, bringing vLLM V1 in line with the vLLM V0 reference.”

— Hugging Face team

“Setting logprobs-mode=processed_logprobs was essential to align the semantics of returned logprobs with our training expectations.”

— Hugging Face engineer
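The difference between raw and processed logprobs can be shown with a toy example. vLLM's documentation describes raw values as those computed before logits processors are applied and processed values as those computed after them, including temperature. The sketch below models only temperature; the logits and the temperature of 0.7 are made-up values, and both function names are illustrative.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def raw_and_processed_logprobs(logits, temperature):
    """Return logprobs before and after temperature scaling.

    Illustrative only: a real sampler may also apply top-k, top-p,
    and other processors before sampling.
    """
    raw = log_softmax(logits)
    processed = log_softmax([x / temperature for x in logits])
    return raw, processed


logits = [2.0, 1.0, 0.0]
raw, processed = raw_and_processed_logprobs(logits, temperature=0.7)
for token_id, (r, p) in enumerate(zip(raw, processed)):
    print(f"token {token_id}: raw={r:.4f} processed={p:.4f}")
```

Tokens sampled at temperature 0.7 are drawn from the processed distribution. A trainer that applies the same temperature when recomputing logprobs will therefore agree with the processed values and disagree with the raw ones, even when the model weights are identical.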

What Remains Unclear

It is not yet clear whether these backend fixes will fully address other potential discrepancies in more complex RL algorithms or larger models. Ongoing testing is required to confirm stability across diverse training scenarios.

What’s Next

Next steps include comprehensive validation across different RL algorithms such as PPO and GRPO, as well as testing with larger models and datasets. The team plans to monitor training metrics for any residual issues and prepare for broader deployment of vLLM V1 in production RL workflows.

Key Questions

What specific issues were fixed in vLLM V1?

The fixes addressed logprobs semantics, runtime default settings (prefix caching, async scheduling), inflight weight updates, and the use of fp32 lm_head for projection, ensuring backend behavior matches vLLM V0.
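As a rough illustration of why the precision of the lm_head projection matters, the Python sketch below rounds a few logits to bfloat16 and compares the resulting logprobs with the unrounded ones. It assumes the lower-precision alternative is bfloat16, which the article does not state. The logits are made-up values, Python floats stand in for fp32, and to_bf16 is a hypothetical helper that emulates the rounding with the standard library.

```python
import math
import struct


def to_bf16(x):
    """Round a float to the nearest bfloat16 value (round-half-to-even).

    Emulation for ordinary finite values only; NaN and inf are not handled.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias on the dropped bits
    bits &= 0xFFFF0000                   # keep sign, exponent, 7 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


# Two near-tied logits and one unlikely token (made-up values).
logits_full = [10.30, 10.26, 3.0]
logits_bf16 = [to_bf16(x) for x in logits_full]

lp_full = log_softmax(logits_full)
lp_bf16 = log_softmax(logits_bf16)

for i, (a, b) in enumerate(zip(lp_full, lp_bf16)):
    print(f"token {i}: full={a:.4f} bf16={b:.4f} shift={b - a:+.4f}")
```

In this toy case the gap between the two leading logits grows from 0.04 to 0.0625 after rounding, which shifts their logprobs by a few hundredths. Shifts of that size on near-tied tokens are enough to move per-token importance ratios away from 1.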

Why did the initial vLLM V1 attempt diverge from the reference?

The initial divergence was caused by mismatched logprobs semantics, runtime defaults that differed from the reference, the handling of inflight weight updates, and the precision of the lm_head projection, which together led to discrepancies in training metrics.

Will these fixes improve training stability in all RL systems?

While these fixes improve backend consistency for vLLM V1, further testing is needed to confirm stability across different RL algorithms and larger models.

When will vLLM V1 be ready for widespread use?

Following validation and testing, the team anticipates broader deployment in the coming months, pending confirmation of stability across various training scenarios.
