vLLM V0 to V1: Correctness Before Corrections in RL

TL;DR

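vLLM V1 now matches the vLLM V0 reference in RL training runs reported on Hugging Face, after four backend issues were fixed: logprob semantics, runtime defaults, the inflight weight update path, and the use of an fp32 lm_head. Backend parity was restored first, before any change to the RL objective, which keeps training stable and predictable across engine versions.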

Hugging Face reports that vLLM V1 now reliably matches vLLM V0’s reference after fixing four critical backend issues, marking a significant step in their migration process for reinforcement learning (RL) workflows.

The work was driven by discrepancies in the rollout logprobs used during RL training, which initially caused training metrics such as clip rate, KL divergence, entropy, and reward to diverge from the reference. The team identified and addressed four issues: the semantics of the logprobs returned by the backend, runtime defaults that affect the inference path, the inflight weight update process, and the use of an fp32 lm_head for the final projection. The fixes were made before any change to the RL objective, so that the backend itself reached parity with the reference. The vLLM V0 reference runs used vLLM 0.8.5; the vLLM V1 runs used vLLM 0.18.1.
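To make the metrics above concrete, here is a minimal sketch of how a mismatch between trainer and rollout logprobs shows up as a nonzero clip rate and KL estimate. The function name `rollout_mismatch_metrics`, the clip threshold of 0.2, and the choice of KL estimator are illustrative assumptions, not details taken from the report.

```python
import math

def rollout_mismatch_metrics(trainer_logprobs, rollout_logprobs, clip_eps=0.2):
    """Compare per-token logprobs from the trainer and the rollout backend.

    Hypothetical helper: both inputs are logprobs of the same sampled tokens.
    """
    if len(trainer_logprobs) != len(rollout_logprobs):
        raise ValueError("logprob sequences must have the same length")
    n = len(trainer_logprobs)
    # log ratio = log pi_trainer(token) - log pi_rollout(token)
    log_ratios = [t - r for t, r in zip(trainer_logprobs, rollout_logprobs)]
    ratios = [math.exp(d) for d in log_ratios]
    # A token counts as clipped when its ratio leaves [1 - eps, 1 + eps].
    clipped = sum(1 for x in ratios if x < 1 - clip_eps or x > 1 + clip_eps)
    # Non-negative per-token estimate of KL(rollout || trainer):
    # (ratio - 1) - log(ratio), averaged over tokens sampled by the rollout.
    approx_kl = sum((x - 1.0) - d for x, d in zip(ratios, log_ratios)) / n
    return {
        "max_abs_logprob_diff": max(abs(d) for d in log_ratios),
        "clip_rate": clipped / n,
        "approx_kl": approx_kl,
    }

# Identical logprobs give zero on every metric; a 0.5 nat gap on one of
# two tokens already pushes that token outside the clip range.
print(rollout_mismatch_metrics([-1.0, -2.0], [-1.0, -2.0]))
print(rollout_mismatch_metrics([-1.0, -2.0], [-1.0, -2.5]))
```

If the backend and the trainer agree on the same weights, all three numbers should sit near zero before the first optimizer step, which makes them a convenient parity check.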

Specifically, the semantic mismatch was corrected by setting logprobs-mode to ‘processed_logprobs’, aligning the returned logprobs with what the trainer expects. Runtime defaults such as prefix caching, async scheduling, and override flags were explicitly configured to match the original environment. The inflight weight update path was also synchronized to prevent discrepancies during online RL updates, ensuring that weight changes did not produce inconsistent inference results.
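The semantic mismatch can be illustrated without running an inference engine. The sketch below assumes that "processed" logprobs are computed after sampling-time transforms such as temperature scaling, while "raw" logprobs come straight from the model's logits; that reading, the helper names, and the toy logits are assumptions for illustration only.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def raw_logprobs(logits):
    # Logprobs of the untouched model distribution.
    return log_softmax(logits)

def processed_logprobs(logits, temperature):
    # Logprobs of the distribution the sampler actually draws from,
    # assuming temperature is the only transform applied.
    return log_softmax([x / temperature for x in logits])

logits = [2.0, 1.0, 0.0]   # toy three-token vocabulary
temperature = 0.7

raw = raw_logprobs(logits)
processed = processed_logprobs(logits, temperature)
gap = [p - r for p, r in zip(processed, raw)]

for i, (r, p, g) in enumerate(zip(raw, processed, gap)):
    print(f"token {i}: raw={r:.4f} processed={p:.4f} gap={g:+.4f}")
```

At a temperature of 1.0 the two modes coincide. At any other temperature, a trainer that expects one kind of logprob but receives the other sees a systematic per-token offset even when the weights are identical.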

Why It Matters

Backend consistency matters because the rollout logprobs feed directly into the policy update. If the inference engine reports logprobs that differ from what the trainer expects, the update is computed from the wrong quantities, which degrades model performance and training efficiency. With backend parity in place, researchers can trust the training process and compare results across vLLM versions.

Background

vLLM V1 is a major rewrite of the engine aimed at improving inference performance and flexibility. Early in the migration from V0, training metrics diverged significantly, pointing to problems with logprob computation and inference behavior. The team initially suspected a mismatch in the objective but traced the problem to backend semantics and runtime defaults. Fixes were implemented over several weeks, and the latest update confirms that the core issues are resolved, restoring confidence in the vLLM V1 engine for RL tasks.

“We have fixed four key backend issues that caused divergence in logprobs and training metrics, bringing vLLM V1 in line with the vLLM V0 reference.”

— Hugging Face team

“Setting logprobs-mode=processed_logprobs was essential to align the semantics of returned logprobs with our training expectations.”

— Hugging Face engineer

What Remains Unclear

It is not yet clear whether these backend fixes will fully address other potential discrepancies in more complex RL algorithms or larger models. Ongoing testing is required to confirm stability across diverse training scenarios.

What’s Next

Next steps include comprehensive validation across different RL algorithms such as PPO and GRPO, as well as testing with larger models and datasets. The team plans to monitor training metrics for any residual issues and prepare for broader deployment of vLLM V1 in production RL workflows.

Key Questions

What specific issues were fixed in vLLM V1?

The fixes addressed logprobs semantics, runtime default settings (prefix caching, async scheduling), inflight weight updates, and the use of fp32 lm_head for projection, ensuring backend behavior matches vLLM V0.
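As a rough illustration of why the precision of the lm_head projection can matter, the sketch below emulates a bfloat16 projection in pure Python and compares the resulting logprobs with an fp32 projection. Everything here is an assumption for demonstration: the helper names, the toy sizes, the random weights, and the pessimistic choice to round the accumulator after every step (real kernels often accumulate in higher precision).

```python
import math
import random
import struct

def to_fp32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def to_bf16(x):
    """Round a finite Python float to the nearest bfloat16 value (emulated)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # Round to nearest even on the 16 mantissa bits that bfloat16 drops.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def dot(a, b, rnd):
    """Dot product with every intermediate rounded by `rnd`."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = rnd(acc + rnd(rnd(x) * rnd(y)))
    return acc

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def lm_head_logprobs(hidden, weight, rnd):
    # One logit per vocabulary row, computed at the precision given by `rnd`.
    return log_softmax([dot(hidden, row, rnd) for row in weight])

def max_logprob_gap(dim=64, vocab=16, seed=0):
    rng = random.Random(seed)
    hidden = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    weight = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(vocab)]
    fp32 = lm_head_logprobs(hidden, weight, to_fp32)
    bf16 = lm_head_logprobs(hidden, weight, to_bf16)
    return max(abs(a - b) for a, b in zip(fp32, bf16))

print(f"max |logprob(fp32) - logprob(bf16)| = {max_logprob_gap():.6f}")
```

Because the projection is the last step before the softmax, any rounding error in the logits passes straight into the reported logprobs, which is one plausible reason a higher-precision final projection helps the backend and the trainer agree.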

Why did the initial vLLM V1 attempt diverge from the reference?

The initial divergence was caused by mismatched logprobs semantics, inconsistent runtime defaults, and weight update handling, leading to discrepancies in training metrics.

Will these fixes improve training stability in all RL systems?

While these fixes improve backend consistency for vLLM V1, further testing is needed to confirm stability across different RL algorithms and larger models.

When will vLLM V1 be ready for widespread use?

Following validation and testing, the team anticipates broader deployment in the coming months, pending confirmation of stability across various training scenarios.

Author

Geek Salad Team
