vLLM V0 to V1: Correctness Before Corrections in RL

TL;DR

Hugging Face reports that its RL training runs on vLLM V1 now match the vLLM V0 reference after four backend issues were fixed. The fixes restore consistency in the rollout logprobs that RL training depends on for stable, predictable updates.

The four fixes mark a significant step in Hugging Face’s migration of reinforcement learning (RL) workflows from vLLM V0 to V1.

The update was driven by the need to eliminate discrepancies in the rollout logprobs used during RL training. Those discrepancies initially caused training metrics such as clip rate, KL divergence, entropy, and reward to diverge from the reference run. The team identified and addressed four issues: the semantics of the logprobs returned by the backend, runtime defaults affecting inference paths, the inflight weight update process, and the use of an fp32 lm_head for the final projection. The fixes were made before any changes to the RL objective, so that backend parity with the vLLM V0 reference came first. The V0 reference used vLLM 0.8.5, while the V1 runs used vLLM 0.18.1.

Specifically, the semantic mismatch was corrected by setting logprobs-mode to ‘processed_logprobs’, so that the returned logprobs match what the trainer expects. Runtime defaults such as prefix caching, async scheduling, and override flags were configured explicitly to match the original environment. The inflight weight update path was also synchronized so that weight changes during online RL updates did not produce inconsistent inference results.
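
To make the logprobs semantics issue concrete, the sketch below contrasts two ways a backend could report a token's logprob. It assumes that "processed" means the logprob is taken after sampling-time logit processing, here just temperature scaling, while "raw" means it is taken from the unmodified logits. The helper names (log_softmax, raw_logprobs, processed_logprobs) and the example values are hypothetical and are not vLLM's API.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def raw_logprobs(logits):
    # Hypothetical "raw" semantics: logprobs of the unmodified model logits.
    return log_softmax(logits)


def processed_logprobs(logits, temperature):
    # Hypothetical "processed" semantics: logprobs of the distribution that
    # was actually sampled from, i.e. after temperature scaling.
    return log_softmax([x / temperature for x in logits])


# Illustrative values only; not taken from the article.
logits = [2.0, 1.0, 0.0]
temperature = 0.7
token = 0

raw = raw_logprobs(logits)[token]
processed = processed_logprobs(logits, temperature)[token]

# If the trainer expects one convention and the backend returns the other,
# every rollout token carries this gap into the importance ratio.
print("raw logprob:      ", raw)
print("processed logprob:", processed)
print("gap:              ", processed - raw)
```

At a temperature of 1.0 the two conventions coincide in this sketch, which is one reason a mismatch of this kind can go unnoticed until sampling settings change.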

Why It Matters

Consistent backend behavior directly affects the stability and reliability of RL training. Discrepancies in logprobs can lead to incorrect policy updates, which hurt model performance and training efficiency. Backend parity lets researchers trust the training process and compare results across versions.
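
One way to see how a logprob discrepancy turns into distorted training metrics is to compare the logprobs the trainer computes with the ones the rollout backend returned for the same tokens. The following is a minimal sketch, not the team's tooling: the function name mismatch_report, the clip_eps value of 0.2, and the sample numbers are all assumptions made for illustration. It reports the mean absolute difference, the fraction of tokens whose importance ratio falls outside a PPO-style clip range, and a simple non-negative KL estimate.

```python
import math


def mismatch_report(trainer_logprobs, rollout_logprobs, clip_eps=0.2):
    """Summarize disagreement between trainer and rollout logprobs.

    Both inputs are per-token logprobs of the same sampled tokens.
    clip_eps is an assumed PPO-style clip range, not a value from the article.
    """
    if len(trainer_logprobs) != len(rollout_logprobs):
        raise ValueError("logprob lists must have the same length")
    n = len(trainer_logprobs)
    if n == 0:
        raise ValueError("logprob lists must not be empty")

    # log(ratio) per token, where ratio = p_trainer / p_rollout.
    diffs = [t - r for t, r in zip(trainer_logprobs, rollout_logprobs)]
    ratios = [math.exp(d) for d in diffs]

    clipped = sum(1 for r in ratios if r < 1.0 - clip_eps or r > 1.0 + clip_eps)

    # Estimate of KL(rollout || trainer) from rollout samples:
    # mean of (ratio - 1 - log ratio), which is never negative.
    approx_kl = sum(r - 1.0 - d for r, d in zip(ratios, diffs)) / n

    return {
        "mean_abs_diff": sum(abs(d) for d in diffs) / n,
        "clip_rate": clipped / n,
        "approx_kl": approx_kl,
    }


# Illustrative values only: identical logprobs versus one mismatched token.
print(mismatch_report([-1.0, -2.0], [-1.0, -2.0]))
print(mismatch_report([-1.0, -2.0], [-1.5, -2.0]))
```

With matching backends and no policy update in between, all three numbers should sit at or near zero, so nonzero values at that point indicate a backend mismatch rather than a learning signal.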

Background

vLLM V1 is a major rewrite of the engine aimed at improving inference performance and flexibility. Early in the migration, training metrics diverged significantly, pointing to issues with logprob computation and inference behavior. The team initially suspected a mismatch in the RL objective but traced the problem to backend semantics and runtime defaults. Fixes were implemented over several weeks, and the latest update confirms that the core issues are resolved, restoring confidence in the vLLM V1 engine for RL tasks.

“We have fixed four key backend issues that caused divergence in logprobs and training metrics, bringing vLLM V1 in line with the vLLM V0 reference.”

— Hugging Face team

“Setting logprobs-mode=processed_logprobs was essential to align the semantics of returned logprobs with our training expectations.”

— Hugging Face engineer

What Remains Unclear

It is not yet clear whether these backend fixes will fully address other potential discrepancies in more complex RL algorithms or larger models. Ongoing testing is required to confirm stability across diverse training scenarios.

What’s Next

Next steps include comprehensive validation across different RL algorithms such as PPO and GRPO, as well as testing with larger models and datasets. The team plans to monitor training metrics for any residual issues and prepare for broader deployment of vLLM V1 in production RL workflows.

Key Questions

What specific issues were fixed in vLLM V1?

The fixes addressed logprobs semantics, runtime default settings (prefix caching, async scheduling), inflight weight updates, and the use of fp32 lm_head for projection, ensuring backend behavior matches vLLM V0.

Why did the initial vLLM V1 attempt diverge from the reference?

The initial divergence was caused by mismatched logprobs semantics, inconsistent runtime defaults, inflight weight update handling, and the fp32 lm_head projection, which together led to discrepancies in training metrics.

Will these fixes improve training stability in all RL systems?

While these fixes improve backend consistency for vLLM V1, further testing is needed to confirm stability across different RL algorithms and larger models.

When will vLLM V1 be ready for widespread use?

Following validation and testing, the team anticipates broader deployment in the coming months, pending confirmation of stability across various training scenarios.
