Orthrus-Qwen3: up to 7.8×tokens/forward on Qwen3, identical output distribution

TL;DR

Orthrus-Qwen3 is a new framework that accelerates large language model inference by up to 7.8× without sacrificing output accuracy. It unifies autoregressive and diffusion methods, promising faster, lossless generation.

Orthrus-Qwen3 has been introduced as a new framework that achieves up to 7.8 times faster token generation on Qwen3 models, while ensuring strictly lossless output fidelity. This development, confirmed by its creators, represents a significant advancement in large language model (LLM) inference technology, combining the high accuracy of autoregressive models with the speed advantages of diffusion-based approaches.

The Orthrus framework employs a dual-architecture design that allows it to generate tokens in parallel without losing the predictive fidelity of the original autoregressive model. It guarantees exact distribution matching through an intra-model consensus mechanism, meaning the output remains identical to that of the base Qwen3 models. The system leverages a shared high-fidelity key-value cache for both views, resulting in zero redundant memory overhead, and fine-tunes only 16% of model parameters to enable parallel inference.

Orthrus-Qwen3 has demonstrated notable performance improvements over existing speculative decoding methods like EAGLE-3 and DFlash, especially at larger context lengths, where it maintains high throughput and token acceptance rates. It also surpasses recent diffusion language models in both speed and accuracy, avoiding the conditional drift and degradation observed in other parallel decoding approaches. The developers have provided benchmarks showing a roughly 6× speedup over baseline models while preserving exact output distributions, making it a promising solution for large-scale deployment.

Why It Matters

This development matters because it addresses a key bottleneck in deploying large language models at scale—namely, inference speed—without compromising output quality. By enabling faster, lossless generation, Orthrus-Qwen3 could significantly reduce computational costs and latency for applications like chatbots, code generation, and complex reasoning tasks. Its ability to deliver high throughput at long context lengths also positions it as a practical solution for real-time AI services, potentially transforming how LLMs are integrated into commercial and research workflows.

ASUS ROG Astral GeForce RTX 5090 White OC Edition GPU, 32GB GDDR7, 3352 AI Tops, DLSS 4, 512-bit, DP 2.1b x3, HDMI 2.1b x2, AI Content Creation, LLM Inference, with GPU Holder

[3352 AI TOPS, 5th Gen Tensor Cores, AI Content Creation] Accelerate AI-powered photo and video workflows like upscaling,…

As an affiliate, we earn on qualifying purchases.

Background

Prior efforts to accelerate LLM inference have included speculative decoding and diffusion-based models, but these often faced trade-offs between speed and accuracy. Orthrus builds on recent advances in shared key-value caching and parameter-efficient fine-tuning to create a hybrid approach that maintains fidelity while improving speed. The framework’s announcement follows ongoing research into parallel decoding methods and aims to set a new standard for high-performance LLM inference in 2026.

“Orthrus achieves a 7.8× speedup while guaranteeing lossless output, a breakthrough in scalable LLM inference.”

— Chien Van Nguyen, lead researcher

“By sharing the exact same KV cache across dual views, Orthrus avoids redundant memory overhead, enabling efficient parallel decoding.”

— Chaitra Hegde, co-author

Amazon

large language model acceleration hardware

As an affiliate, we earn on qualifying purchases.

What Remains Unclear

It is not yet clear how Orthrus-Qwen3 performs across a broader range of tasks outside benchmark tests, or how it will scale with even larger models. Details about deployment in real-world applications are still emerging.

AI Systems Performance Engineering: Optimizing Model Training and Inference Workloads with GPUs, CUDA, and PyTorch

As an affiliate, we earn on qualifying purchases.

What’s Next

Next steps include broader testing across diverse tasks, integration with existing AI frameworks like vLLM and SGLang, and potential commercial deployment. Researchers are also expected to explore further optimizations and extensions of the dual-view diffusion approach.

Hands-On LLM Serving and Optimization: Hosting LLMs at Scale

As an affiliate, we earn on qualifying purchases.

Key Questions

How does Orthrus-Qwen3 achieve such high speed without losing fidelity?

It employs a dual-architecture framework that shares a high-fidelity key-value cache across both views, enabling parallel token generation that exactly matches the autoregressive model’s distribution.

Can Orthrus-Qwen3 be used with models other than Qwen3?

Currently, it is specifically designed for Qwen3 models, but future developments may extend its architecture to other LLMs with similar structures.

What are the main technical innovations behind Orthrus?

The key innovations include the shared high-fidelity KV cache, exact intra-model consensus mechanism, and fine-tuning only 16% of model parameters for parallel inference.

Orthrus-Qwen3: up to 7.8×tokens/forward on Qwen3, identical output distribution

Up next

I Bought a “Junk” PSP From Japan

Author

Geek Salad Team

Share article

Why It Matters

ASUS ROG Astral GeForce RTX 5090 White OC Edition GPU, 32GB GDDR7, 3352 AI Tops, DLSS 4, 512-bit, DP 2.1b x3, HDMI 2.1b x2, AI Content Creation, LLM Inference, with GPU Holder

Background

large language model acceleration hardware

What Remains Unclear

AI Systems Performance Engineering: Optimizing Model Training and Inference Workloads with GPUs, CUDA, and PyTorch

What’s Next

Hands-On LLM Serving and Optimization: Hosting LLMs at Scale

Key Questions

How does Orthrus-Qwen3 achieve such high speed without losing fidelity?

Can Orthrus-Qwen3 be used with models other than Qwen3?

What are the main technical innovations behind Orthrus?

Chuwi Minibook X: the netbook we deserve

New Beam Spring Keyboards

The bridge. Why the AI buildout runs on a nuclear story and a gas reality.

Trump-Xi summit live: Beijing prepares for US president’s evening arrival

zelda ocarina of time remake price

ULA launches final Atlas 5 rocket supporting Amazon Leo’s broadband internet satellite constellation

AI compliance brief generator for small clinics

Europe Regulated the Interface and Forgot to Build the Engine

Orthrus-Qwen3: up to 7.8×tokens/forward on Qwen3, identical output distribution

Up next

Author

Geek Salad Team

Share article

Why It Matters

ASUS ROG Astral GeForce RTX 5090 White OC Edition GPU, 32GB GDDR7, 3352 AI Tops, DLSS 4, 512-bit, DP 2.1b x3, HDMI 2.1b x2, AI Content Creation, LLM Inference, with GPU Holder

Background

large language model acceleration hardware

What Remains Unclear

AI Systems Performance Engineering: Optimizing Model Training and Inference Workloads with GPUs, CUDA, and PyTorch

What’s Next

Hands-On LLM Serving and Optimization: Hosting LLMs at Scale

Key Questions

How does Orthrus-Qwen3 achieve such high speed without losing fidelity?

Can Orthrus-Qwen3 be used with models other than Qwen3?

What are the main technical innovations behind Orthrus?

You May Also Like