Undervolting Your GPU for Local Inference: Lower Heat, Same Tokens/sec

TL;DR

Thorsten Meyer AI published guidance saying GPU power limits and undervolting can reduce heat, power draw and fan noise for local LLM inference while preserving much of throughput. The figures cited show an RTX 4090 at a 70% power limit using about 300W, running cooler and keeping 93.4% of token speed, though results vary by card, model and workload.

Thorsten Meyer AI has published a GPU tuning guide arguing that local inference users can cut heat and noise by power-limiting or undervolting high-end NVIDIA cards, with little or no loss in tokens per second in many workloads. The guidance matters for users running local LLMs on hot, high-power workstations, where GPU heat can affect noise, comfort and sustained performance.

The guide says local inference is often limited by memory bandwidth rather than GPU core compute. On that basis, it recommends starting with a power limit before changing hardware, especially for high-power cards such as the RTX 4090 and RTX 5090.

In the source material, Thorsten Meyer AI presents measured RTX 4090 results showing stock operation at 390W, 72°C and 100% relative speed. At a 70% power limit, the guide lists 300W, 67°C and 93.4% of tokens per second, a reduction of 90W for a reported 6.6% speed loss. At 60%, it lists 260W, 62°C and 91.5% speed; at 40%, performance drops sharply to 61.3%.
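
To make the tradeoff in those figures concrete, here is a minimal sketch that computes watts saved, speed lost and tokens per watt relative to stock. The numbers are the ones cited for a single RTX 4090 sample; the function and variable names are illustrative and do not come from the guide. The 40% row is left out because the cited material gives no wattage for it.

```python
# Figures as cited in the guide for one RTX 4090 sample:
# (power-limit label, measured watts, relative tokens/sec in percent).
CITED = [
    ("stock", 390, 100.0),
    ("70%", 300, 93.4),
    ("60%", 260, 91.5),
]


def relative_efficiency(watts, speed_pct, base_watts=390, base_speed_pct=100.0):
    """Tokens per watt relative to stock (1.0 means no change)."""
    return (speed_pct / watts) / (base_speed_pct / base_watts)


def summarize(rows=CITED):
    """Return (label, watts saved, % speed lost, relative tokens/watt) per row."""
    stock_watts = rows[0][1]
    result = []
    for label, watts, speed_pct in rows:
        result.append((
            label,
            stock_watts - watts,
            round(100.0 - speed_pct, 1),
            round(relative_efficiency(watts, speed_pct), 2),
        ))
    return result


for row in summarize():
    print(row)
```

On these cited numbers, the 70% setting delivers roughly 1.2 times the tokens per watt of stock operation. Other cards, models and quantizations may behave differently.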

The guide separates two methods: power limiting and direct undervolting. It describes power limiting as the safer starting point because the user restricts GPU power and lets the card adjust voltage and clocks. It describes undervolting as a more advanced voltage-frequency curve change that may keep more performance at a given heat level but requires testing under the user’s real workload.

Why It Matters

The practical impact is cost and comfort. If the reported pattern holds for a user’s workload, a single power-limit setting can lower GPU heat output, reduce fan noise and cut electricity use without buying a cooler, changing a case or moving fans.
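
As a rough illustration of the electricity side, the sketch below converts a wattage reduction into energy and cost per year. The 90W figure is the one reported for the RTX 4090 example; the daily hours and the price per kWh are hypothetical placeholders, not values from the guide, and should be replaced with your own.

```python
def yearly_savings_kwh(watts_saved, hours_per_day, days=365):
    """Energy saved per year in kWh for a constant wattage reduction."""
    return watts_saved * hours_per_day * days / 1000.0


def yearly_savings_cost(watts_saved, hours_per_day, price_per_kwh, days=365):
    """Cost saved per year, in the currency of price_per_kwh."""
    return yearly_savings_kwh(watts_saved, hours_per_day, days) * price_per_kwh


# Assumptions for illustration only: 8 hours of inference load per day
# and a price of 0.30 per kWh.
kwh = yearly_savings_kwh(90, 8)
cost = yearly_savings_cost(90, 8, 0.30)
print(f"{kwh:.1f} kWh/year, {cost:.2f} per year")
```

The estimate assumes the GPU is under inference load for the whole period; idle time saves far less.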

The finding is more relevant for local inference than for gaming because the performance bottleneck can be different. The guide states that gaming loads are often compute-bound, while many inference runs wait on VRAM bandwidth. If the GPU core is not the limiting part of the pipeline, lowering core power may reduce heat faster than it reduces token throughput.
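
One common back-of-the-envelope way to see why decoding tends to be memory-bound: for a dense model generating one token at a time, every token requires reading roughly all of the weights from VRAM, so memory bandwidth divided by model size gives an approximate ceiling on tokens per second. The sketch below is that approximation only, not the guide's method. The 1,008 GB/s figure is NVIDIA's published RTX 4090 memory bandwidth, and the 8 GB model size is a hypothetical example.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s, model_size_gb):
    """Approximate upper bound on single-stream decode speed.

    Assumes each generated token reads all model weights once from VRAM
    and ignores KV-cache traffic, compute time and kernel overhead.
    """
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Assumed inputs: 1008 GB/s (RTX 4090 spec), 8 GB of quantized weights.
print(decode_ceiling_tokens_per_s(1008, 8))
```

Real throughput sits below this ceiling. If measured tokens per second track the ceiling rather than the core clock, that suggests lowering core power is unlikely to cost much speed.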

Background

The article is framed as the first lever in Thorsten Meyer AI’s broader series on reducing heat and noise in high-power AI workstations. The source says the guidance is based on published RTX 4090 fine-tuning power-scaling measurements and RTX 5090/4090 power-cap tests from 2025-2026.

The guide recommends a practical starting range of 60% to 80% power limit, with 70% presented as a recommended setting for the RTX 4090 example. It also cites RTX 5090 power-cap figures, saying a cap to 450W is about 5% slower and 400W is about 10% slower, though those numbers are presented as workload-dependent.

“This is the first thing you should do to a high-power AI workstation, and it costs nothing.”

— Thorsten Meyer AI guide

“Local inference is memory-bound — the GPU core spends much of its time waiting on VRAM, not maxing out compute.”

— Thorsten Meyer AI guide

“Sweet spot: 90W of heat gone, only ~7% slower.”

— Thorsten Meyer AI guide

“This is a tuning guide, not a warranty document.”

— Thorsten Meyer AI guide

What Remains Unclear

The exact speed loss is not fixed. The source says results vary by GPU sample, model, quantization, cooling, case airflow and workload length. It is also unclear from the supplied material how broadly the listed RTX 4090 and RTX 5090 figures apply across consumer cards, workstation cards and non-NVIDIA setups.

The guide says a curve that appears stable for 10 minutes can fail hours later, so any undervolt needs validation under the user’s own inference workload. Power caps may also reset after reboot unless saved through a startup profile or Linux service.
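
For the Linux case, a startup service ultimately has to re-run a power-limit command after each boot. The sketch below builds that command with `nvidia-smi -pl`, which sets a limit in watts and normally requires root. The helper names are hypothetical, and the 450W default limit used in the example is an assumption about a stock RTX 4090; check your own card's default before relying on it.

```python
import subprocess


def percent_to_watts(default_limit_w, percent):
    """Convert a percentage power limit into whole watts."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return round(default_limit_w * percent / 100)


def power_limit_command(gpu_index, watts):
    """Build the nvidia-smi argument list that sets a power limit in watts."""
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def apply_power_limit(gpu_index, watts):
    """Run the command. Needs root and an installed NVIDIA driver."""
    subprocess.run(power_limit_command(gpu_index, watts), check=True)


# Assumed default limit of 450W; 70% of that is 315W.
watts = percent_to_watts(450, 70)
print(" ".join(power_limit_command(0, watts)))
```

A script like this could be called from a systemd unit or another boot-time mechanism. The driver rejects values outside the range the card allows, so `apply_power_limit` may raise if the requested wattage is too low or too high.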

What’s Next

The next step for users is measurement: set a conservative power limit, run the same local model and prompt workload for a sustained period, and compare power draw, GPU temperature, held clock and actual tokens per second. The guide recommends tools such as MSI Afterburner on Windows and nvidia-smi or LACT on Linux, then saving the setting only after the workload remains stable.
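
The measurement step can be sketched as a small logger around `nvidia-smi --query-gpu`. This is one possible approach under stated assumptions, not the guide's tooling: the function names are hypothetical, and the token count and elapsed time are assumed to come from whatever your inference runtime reports.

```python
import subprocess
from statistics import mean

# Power draw (W), GPU temperature (C) and current SM clock (MHz).
QUERY_FIELDS = "power.draw,temperature.gpu,clocks.sm"


def query_command(gpu_index=0):
    """Build the nvidia-smi argument list for one CSV sample without units."""
    return [
        "nvidia-smi", "-i", str(gpu_index),
        f"--query-gpu={QUERY_FIELDS}",
        "--format=csv,noheader,nounits",
    ]


def parse_sample(line):
    """Parse one CSV line such as '301.52, 67, 2520' into a dict."""
    power, temp, clock = (float(x) for x in line.strip().split(","))
    return {"power_w": power, "temp_c": temp, "sm_clock_mhz": clock}


def read_sample(gpu_index=0):
    """Take one live sample. Requires an NVIDIA driver on this machine."""
    out = subprocess.run(query_command(gpu_index), capture_output=True,
                         text=True, check=True)
    return parse_sample(out.stdout.splitlines()[0])


def summarize_run(samples, tokens_generated, seconds):
    """Reduce a run to the four numbers worth comparing across settings."""
    return {
        "avg_power_w": mean(s["power_w"] for s in samples),
        "max_temp_c": max(s["temp_c"] for s in samples),
        "min_sm_clock_mhz": min(s["sm_clock_mhz"] for s in samples),
        "tokens_per_s": tokens_generated / seconds,
    }
```

Call `read_sample()` every few seconds during a sustained run at each power limit, then compare the `summarize_run` output between settings using the same model and prompts.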

Key Questions

Is this a new GPU feature?

No. The guidance concerns existing controls: GPU power limits and undervolting. What is new is the inference-focused recommendation and the reported token-throughput tradeoff.

What setting does the guide recommend first?

The guide recommends starting with a power limit, around 70% in its RTX 4090 example. It says this captures much of the heat reduction while avoiding direct voltage-curve editing.

Does undervolting always keep the same tokens per second?

No. The source says throughput can stay close in many inference workloads, but the measured result depends on the model, quantization, card, cooling and whether the workload is memory-bound.

Can this damage a GPU?

The guide describes power limiting as restrictive rather than aggressive, because it lowers the card’s allowed power. Direct undervolting is also described as reversible, but the source warns that users make changes at their own risk and should test stability.

Why does this matter for local AI users?

Local inference can run GPUs for long periods. Lowering heat output may reduce fan noise, room heat and power use while preserving enough throughput for daily work.

Source: Thorsten Meyer AI
