TL;DR
Hugging Face reports that its RL rollouts on vLLM V1 now match its vLLM V0 reference after four backend fixes: logprob semantics, runtime defaults, the inflight weight update path, and an fp32 lm_head. Matching rollout logprobs keeps training metrics comparable across the two engine versions, which is a precondition for stable, predictable RL training.
Hugging Face reports that vLLM V1 now reliably matches vLLM V0’s reference after fixing four critical backend issues, marking a significant step in their migration process for reinforcement learning (RL) workflows.
The update was driven by the need to eliminate discrepancies in the rollout logprobs used during RL training, which initially caused training metrics such as clip rate, KL divergence, entropy, and reward to diverge. The team identified and addressed four main issues: the semantics of the logprobs returned by the backend, runtime defaults affecting inference paths, the inflight weight update process, and the use of an fp32 lm_head for the final projection. The fixes were made before any change to the RL objective, so parity was restored at the backend level first. The vLLM V0 reference ran version 0.8.5, while the vLLM V1 runs used version 0.18.1.
Specifically, the semantic mismatch was corrected by setting logprobs-mode to ‘processed_logprobs’, aligning the returned logprobs with what the trainer expects. Runtime defaults such as prefix caching, async scheduling, and override flags were explicitly configured to match the original environment. The inflight weight update path was also synchronized to prevent discrepancies during online RL updates, ensuring that weight changes did not produce inconsistent inference results.
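The difference between raw and processed logprobs can be illustrated with a small sketch. It does not call vLLM; it assumes, for illustration only, that the sole processing step is temperature scaling, and the logits, temperature, and function names are hypothetical. The point is that a trainer expecting logprobs of the distribution actually sampled from will see a systematic offset if the backend reports logprobs of the unmodified distribution.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def raw_logprobs(logits):
    # Log-probabilities of the unmodified model distribution.
    return log_softmax(logits)


def processed_logprobs(logits, temperature):
    # Log-probabilities after temperature scaling, i.e. the distribution
    # the sampler draws from in this simplified (assumed) setup.
    return log_softmax([x / temperature for x in logits])


logits = [2.0, 1.0, 0.0]  # hypothetical next-token logits
temperature = 0.7         # hypothetical sampling temperature

raw = raw_logprobs(logits)
processed = processed_logprobs(logits, temperature)

# Per-token offset a trainer would see if it expected one and received the other.
gap = [p - r for p, r in zip(processed, raw)]
print("raw:      ", [round(x, 4) for x in raw])
print("processed:", [round(x, 4) for x in processed])
print("gap:      ", [round(x, 4) for x in gap])
```

At temperature 1.0 the two functions agree, so in this sketch the mismatch only appears when sampling settings reshape the distribution.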
Why It Matters
Consistent backend behavior directly affects the stability and reliability of RL training. If the logprobs reported by the inference backend differ from those the trainer expects, the resulting policy updates can be incorrect, which hurts model performance and training efficiency. Backend parity lets researchers trust the training process and compare results across versions.
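To make the link between logprob discrepancies and policy updates concrete, here is a minimal diagnostic sketch. It is not the team's tooling: the function name, the clip threshold of 0.2, and the sample values are assumptions. It compares per-token logprobs of the sampled tokens as seen by the trainer and by the rollout backend, then reports the mean absolute difference, a sample-based KL estimate, and the fraction of importance ratios falling outside the clip range (a common proxy for clip rate).

```python
import math


def rollout_mismatch_stats(trainer_logprobs, rollout_logprobs, clip_eps=0.2):
    """Compare per-token logprobs of the same sampled tokens from two sources."""
    if len(trainer_logprobs) != len(rollout_logprobs) or not trainer_logprobs:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(trainer_logprobs)
    diffs = [t - r for t, r in zip(trainer_logprobs, rollout_logprobs)]
    # Importance ratio trainer_prob / rollout_prob for each token.
    ratios = [math.exp(d) for d in diffs]
    return {
        "mean_abs_diff": sum(abs(d) for d in diffs) / n,
        # Tokens were sampled by the rollout backend, so the mean of
        # (rollout - trainer) estimates KL(rollout || trainer).
        "kl_estimate": sum(-d for d in diffs) / n,
        # Fraction of ratios outside [1 - clip_eps, 1 + clip_eps].
        "outside_clip_range": sum(1 for x in ratios if abs(x - 1.0) > clip_eps) / n,
    }


# Hypothetical per-token logprobs for four sampled tokens.
trainer = [-1.0, -0.5, -2.0, -0.1]
rollout_matching = [-1.0, -0.5, -2.0, -0.1]
rollout_shifted = [-1.3, -0.5, -1.6, -0.1]

print(rollout_mismatch_stats(trainer, rollout_matching))
print(rollout_mismatch_stats(trainer, rollout_shifted))
```

With identical weights and matching backend semantics, all three values should sit near zero; a persistent offset on fresh rollouts points at the backend rather than the objective.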

Background
The migration from vLLM V0 to V1 was a major rewrite aimed at improving inference performance and flexibility. Early in the process, training metrics diverged significantly, indicating issues with logprob computation and inference behavior. The team initially suspected objective mismatches but pinpointed the problem to backend semantics and runtime defaults. Fixes were implemented over several weeks, with the latest update confirming that the core issues are resolved, restoring confidence in the vLLM V1 engine for RL tasks.
“We have fixed four key backend issues that caused divergence in logprobs and training metrics, bringing vLLM V1 in line with the vLLM V0 reference.”
— Hugging Face team
“Setting logprobs-mode=processed_logprobs was essential to align the semantics of returned logprobs with our training expectations.”
— Hugging Face engineer

What Remains Unclear
It is not yet clear whether these backend fixes will fully address other potential discrepancies in more complex RL algorithms or larger models. Ongoing testing is required to confirm stability across diverse training scenarios.

What’s Next
Next steps include comprehensive validation across different RL algorithms such as PPO and GRPO, as well as testing with larger models and datasets. The team plans to monitor training metrics for any residual issues and prepare for broader deployment of vLLM V1 in production RL workflows.

Key Questions
What specific issues were fixed in vLLM V1?
The fixes addressed logprobs semantics, runtime default settings (prefix caching, async scheduling), inflight weight updates, and the use of fp32 lm_head for projection, ensuring backend behavior matches vLLM V0.
Why did the initial vLLM V1 attempt diverge from the reference?
The initial divergence was caused by mismatched logprobs semantics, inconsistent runtime defaults, and weight update handling, leading to discrepancies in training metrics.
Will these fixes improve training stability in all RL systems?
While these fixes improve backend consistency for vLLM V1, further testing is needed to confirm stability across different RL algorithms and larger models.
When will vLLM V1 be ready for widespread use?
Following validation and testing, the team anticipates broader deployment in the coming months, pending confirmation of stability across various training scenarios.