vLLM V0 To V1: Correctness Before Corrections In RL

TL;DR

Hugging Face reports that its RL runs on vLLM V1 now match the vLLM V0 reference after four backend issues were fixed: logprob semantics, runtime defaults, inflight weight updates, and lm_head precision. Backend parity was restored before any change to the RL objective, which keeps RL training behavior consistent across the two engines.

Hugging Face reports that vLLM V1 now reliably matches vLLM V0’s reference after fixing four critical backend issues, marking a significant step in their migration process for reinforcement learning (RL) workflows.

The work was driven by discrepancies in the rollout logprobs used during RL training, which initially caused training metrics such as clip rate, KL divergence, entropy, and reward to diverge from the reference. The team identified and addressed four issues: the semantics of the logprobs returned by the backend, runtime defaults that affect the inference path, the inflight weight update process, and the precision of the lm_head final projection, which now uses fp32. These fixes were made before any change to the RL objective, so that the backend first reached parity with the reference. The vLLM V0 reference used version 0.8.5; the vLLM V1 runs used version 0.18.1.
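Of the four issues, the lm_head precision one is the simplest to illustrate in isolation. The sketch below is a toy, not the team's code: it uses float16 rounding from the Python standard library as a stand-in for a low-precision dtype, a hypothetical two-token vocabulary, and an identity weight matrix so the effect is easy to follow. The article does not say which precision was used before the fix; the point is only that rounding in the final projection can shift the resulting logprobs.

```python
import math
import struct


def to_half(x):
    """Round a Python float to IEEE float16 and back (stand-in for a low-precision dtype)."""
    return struct.unpack("e", struct.pack("e", x))[0]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def project(hidden, weight, rnd=lambda v: v):
    """Toy lm_head: one dot product per vocabulary row, rounding after every operation."""
    logits = []
    for row in weight:
        acc = 0.0
        for h, w in zip(hidden, row):
            acc = rnd(acc + rnd(rnd(h) * rnd(w)))
        logits.append(acc)
    return logits


# Hypothetical values chosen for illustration only.
hidden = [0.1, 0.3]
weight = [[1.0, 0.0], [0.0, 1.0]]

full = log_softmax(project(hidden, weight))              # full-precision projection
low = log_softmax(project(hidden, weight, rnd=to_half))  # rounded projection
max_diff = max(abs(a - b) for a, b in zip(full, low))
print(f"max |logprob difference| = {max_diff:.2e}")
```

The difference is small per token, but a trainer that recomputes logprobs at a different precision than the inference engine sees it on every sampled token.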

The semantic mismatch was corrected by setting logprobs-mode to ‘processed_logprobs’, so that the returned logprobs match what the trainer expects. Runtime defaults such as prefix caching, async scheduling, and override flags were configured explicitly to match the original environment. The inflight weight update path was also synchronized so that weight changes during online RL updates did not produce inconsistent inference results.
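To see why the logprobs mode matters, it helps to compare the two kinds of values a backend could return. The sketch below is a simplified illustration in plain Python, not vLLM's implementation: "raw" logprobs are taken from the model's logits as they are, while "processed" logprobs are taken after sampling-time processing. The function names, the temperature of 0.7, and the top-k of 3 are assumptions made for the example.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def raw_logprobs(logits):
    # "Raw": log-softmax of the logits before any sampling-time processing.
    return log_softmax(logits)


def processed_logprobs(logits, temperature=1.0, top_k=None):
    # "Processed": log-softmax after temperature scaling and optional top-k masking.
    scaled = [v / temperature for v in logits]
    if top_k is not None and top_k < len(scaled):
        # Keep the top_k largest values; ties at the cutoff are all kept.
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [v if v >= cutoff else float("-inf") for v in scaled]
    return log_softmax(scaled)


# Hypothetical logits for a four-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
raw = raw_logprobs(logits)
proc = processed_logprobs(logits, temperature=0.7, top_k=3)

for i, (r, p) in enumerate(zip(raw, proc)):
    print(f"token {i}: raw={r:.4f} processed={p:.4f}")
```

If the sampler draws tokens from the processed distribution while the backend reports raw values (or the reverse), the trainer compares its own logprobs against numbers that describe a different distribution, even when the model weights are identical.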

Why It Matters

Consistent backend behavior matters because it directly affects the stability and reliability of RL training. Discrepancies between the logprobs the inference engine returns and those the trainer expects can lead to incorrect policy updates, which hurts model performance and training efficiency. With backend parity in place, researchers can trust the training process and compare results across versions.
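A minimal sketch of how such a discrepancy shows up in training metrics follows. The article does not give formulas, so this assumes a PPO-style setup: the importance ratio is the exponential of the trainer logprob minus the rollout logprob, a token counts as clipped when the ratio leaves the interval [1 - eps, 1 + eps], and KL is estimated as the mean logprob difference over sampled tokens. The function name, the eps of 0.2, and the sample values are hypothetical.

```python
import math


def mismatch_metrics(rollout_logprobs, trainer_logprobs, clip_eps=0.2):
    """Compare per-token logprobs of the sampled tokens from the engine and the trainer."""
    if not rollout_logprobs or len(rollout_logprobs) != len(trainer_logprobs):
        raise ValueError("expected two non-empty sequences of equal length")
    n = len(rollout_logprobs)
    # log(pi_trainer / pi_rollout) for each sampled token.
    log_ratios = [t - r for r, t in zip(rollout_logprobs, trainer_logprobs)]
    ratios = [math.exp(d) for d in log_ratios]
    clipped = sum(1 for x in ratios if x < 1.0 - clip_eps or x > 1.0 + clip_eps)
    return {
        "clip_rate": clipped / n,
        # Sample estimate of KL(rollout || trainer) on tokens drawn by the rollout policy.
        "kl_estimate": -sum(log_ratios) / n,
        "max_abs_diff": max(abs(d) for d in log_ratios),
    }


# With identical weights and a matching backend, every metric should be zero.
same = mismatch_metrics([-0.5, -1.2, -2.0, -0.1], [-0.5, -1.2, -2.0, -0.1])

# One token whose logprob differs by 0.5 is enough to register as clipped.
off = mismatch_metrics([-0.5, -1.2, -2.0, -0.1], [-0.5, -1.2, -2.5, -0.1])
print(same)
print(off)
```

Nonzero values on a step where the trainer and the engine hold the same weights point at the backend rather than the RL objective, which is the reason for checking parity first.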

Background

vLLM V1 was a major rewrite of the V0 engine, aimed at improving inference performance and flexibility. Early in the migration, training metrics diverged significantly, indicating issues with logprob computation and inference behavior. The team initially suspected a mismatch in the RL objective but traced the problem to backend semantics and runtime defaults. Fixes were implemented over several weeks, and the latest update confirms that the core issues are resolved, restoring confidence in the vLLM V1 engine for RL tasks.

“We have fixed four key backend issues that caused divergence in logprobs and training metrics, bringing vLLM V1 in line with the vLLM V0 reference.”

— Hugging Face team

“Setting logprobs-mode=processed_logprobs was essential to align the semantics of returned logprobs with our training expectations.”

— Hugging Face engineer

What Remains Unclear

It is not yet clear whether these backend fixes will fully address other potential discrepancies in more complex RL algorithms or larger models. Ongoing testing is required to confirm stability across diverse training scenarios.

What’s Next

Next steps include comprehensive validation across different RL algorithms such as PPO and GRPO, as well as testing with larger models and datasets. The team plans to monitor training metrics for any residual issues and prepare for broader deployment of vLLM V1 in production RL workflows.

Key Questions

What specific issues were fixed in vLLM V1?

The fixes addressed logprobs semantics, runtime default settings (prefix caching, async scheduling), inflight weight updates, and the use of fp32 lm_head for projection, ensuring backend behavior matches vLLM V0.

Why did the initial vLLM V1 attempt diverge from the reference?

The initial divergence was caused by mismatched logprobs semantics, inconsistent runtime defaults, and weight update handling, leading to discrepancies in training metrics.

Will these fixes improve training stability in all RL systems?

While these fixes improve backend consistency for vLLM V1, further testing is needed to confirm stability across different RL algorithms and larger models.

When will vLLM V1 be ready for widespread use?

Following validation and testing, the team anticipates broader deployment in the coming months, pending confirmation of stability across various training scenarios.

Author

Geek Salad Team
