Undervolting Your GPU for Local Inference: Lower Heat, Same Tokens/sec

TL;DR

A Thorsten Meyer AI guide says GPU power limits and undervolting can cut heat, power draw and fan noise for local LLM inference while preserving most of the throughput. The cited figures show an RTX 4090 at a 70% power limit drawing about 300W, running cooler and keeping 93.4% of token speed, though results vary by card, model and workload.

Thorsten Meyer AI has published a GPU tuning guide arguing that local inference users can cut heat and noise by power-limiting or undervolting high-end NVIDIA cards, with little or no loss in tokens per second in many workloads. The guidance matters for users running local LLMs on hot, high-power workstations, where GPU heat can affect noise, comfort and sustained performance.

The guide says local inference is often limited by memory bandwidth rather than GPU core compute. On that basis, it recommends starting with a power limit before changing hardware, especially for high-power cards such as the RTX 4090 and RTX 5090.

In the source material, Thorsten Meyer AI presents measured RTX 4090 results showing stock operation at 390W, 72°C and 100% relative speed. At a 70% power limit, the guide lists 300W, 67°C and 93.4% of tokens per second, a reduction of 90W for a reported 6.6% speed loss. At 60%, it lists 260W, 62°C and 91.5% speed; at 40%, performance drops sharply to 61.3%.
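
The arithmetic behind those figures can be checked with a short Python sketch. It uses only the three data points for which the guide lists both wattage and relative speed; the 40% point is left out because no wattage is quoted for it. The function name and the "tokens per watt relative to stock" metric are illustrative choices, not something the guide defines.

```python
# Figures as quoted from the guide for the RTX 4090 example:
# (label, board power in watts, tokens/sec relative to stock in percent)
REPORTED = [
    ("stock", 390, 100.0),
    ("70% limit", 300, 93.4),
    ("60% limit", 260, 91.5),
]


def relative_efficiency(watts, speed_pct, base_watts=390, base_speed_pct=100.0):
    """Tokens per watt relative to stock (1.0 means the same as stock)."""
    return (speed_pct / base_speed_pct) / (watts / base_watts)


for label, watts, speed_pct in REPORTED:
    saved = 390 - watts
    eff = relative_efficiency(watts, speed_pct)
    print(f"{label:>10}: {watts} W, {saved} W saved, "
          f"{speed_pct}% speed, {eff:.2f}x tokens per watt")
```

On the quoted numbers, the 70% setting works out to roughly 1.21 times the stock tokens per watt and the 60% setting to roughly 1.37 times, which is the tradeoff the guide is pointing at.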

The guide separates two methods: power limiting and direct undervolting. It describes power limiting as the safer starting point because the user restricts GPU power and lets the card adjust voltage and clocks. It describes undervolting as a more advanced voltage-frequency curve change that may keep more performance at a given heat level but requires testing under the user’s real workload.
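
As a rough illustration of the power-limit route on Linux, the Python sketch below builds the nvidia-smi call that sets a limit in watts. The helper names are hypothetical, and the 450W default limit is an assumption for an RTX 4090; the actual default and allowed range for a given card can be read with `nvidia-smi -q -d POWER`. Setting a limit normally requires root or administrator rights.

```python
import subprocess


def watts_for_percent(default_limit_w, percent):
    """Convert a percentage of the card's default power limit to whole watts."""
    return round(default_limit_w * percent / 100)


def power_limit_command(watts, gpu_index=0):
    """Build the nvidia-smi call that sets a power limit in watts on one GPU."""
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def apply_power_limit(watts, gpu_index=0):
    """Run the command; this usually needs root or administrator rights."""
    subprocess.run(power_limit_command(watts, gpu_index), check=True)


if __name__ == "__main__":
    # Assumed default limit of 450 W; query your own card before relying on it.
    target = watts_for_percent(450, 70)
    print(" ".join(power_limit_command(target)))
```

The sketch only prints the command rather than running it, so the value can be reviewed before anything is applied. Direct undervolting through a voltage-frequency curve is not covered here, since it is done in tools such as MSI Afterburner rather than through this command.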

Why It Matters

The practical impact is cost and comfort. If the reported pattern holds for a user’s workload, a single power-limit setting can lower GPU heat output, reduce fan noise and cut electricity use without buying a cooler, changing a case or moving fans.

The finding is more relevant for local inference than for gaming because the performance bottleneck can be different. The guide states that gaming loads are often compute-bound, while many inference runs wait on VRAM bandwidth. If the GPU core is not the limiting part of the pipeline, lowering core power may reduce heat faster than it reduces token throughput.
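
The memory-bound argument can be made concrete with a common rule of thumb, sketched here in Python: if generating each token requires reading all model weights from VRAM once, memory bandwidth divided by model size gives a ceiling on tokens per second. This is a simplification that ignores the KV cache, batching and software overhead, and the inputs are illustrative assumptions rather than figures from the guide.

```python
def decode_tokens_per_sec_ceiling(bandwidth_gb_s, model_size_gb):
    """Upper bound if each generated token reads all weights from VRAM once."""
    return bandwidth_gb_s / model_size_gb


# Illustrative inputs, not figures from the guide:
# about 1008 GB/s is the published RTX 4090 memory bandwidth, and
# 20 GB stands in for a quantized model that fits in 24 GB of VRAM.
ceiling = decode_tokens_per_sec_ceiling(1008, 20)
print(f"bandwidth-bound ceiling: about {ceiling:.0f} tokens/sec")
```

The core clock does not appear anywhere in this estimate, which is why a modest cut in core power can leave token throughput largely intact when a workload sits near this ceiling.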

Background

The article is framed as the first lever in Thorsten Meyer AI’s broader series on reducing heat and noise in high-power AI workstations. The source says the guidance is based on published RTX 4090 fine-tuning power-scaling measurements and RTX 5090/4090 power-cap tests from 2025-2026.

The guide recommends a practical starting range of 60% to 80% power limit, with 70% presented as a recommended setting for the RTX 4090 example. It also cites RTX 5090 power-cap figures, saying a cap to 450W is about 5% slower and 400W is about 10% slower, though those numbers are presented as workload-dependent.

“This is the first thing you should do to a high-power AI workstation, and it costs nothing.”

— Thorsten Meyer AI guide

“Local inference is memory-bound — the GPU core spends much of its time waiting on VRAM, not maxing out compute.”

— Thorsten Meyer AI guide

“Sweet spot: 90W of heat gone, only ~7% slower.”

— Thorsten Meyer AI guide

“This is a tuning guide, not a warranty document.”

— Thorsten Meyer AI guide

What Remains Unclear

The exact speed loss is not fixed. The source says results vary by GPU sample, model, quantization, cooling, case airflow and workload length. It is also unclear from the supplied material how broadly the listed RTX 4090 and RTX 5090 figures apply across consumer cards, workstation cards and non-NVIDIA setups.

The guide says a curve that appears stable for 10 minutes can fail hours later, so any undervolt needs validation under the user’s own inference workload. Power caps may also reset after reboot unless saved through a startup profile or Linux service.
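
One way to make a power cap survive a reboot on Linux is a oneshot systemd service. The Python sketch below only renders the text of such a unit; the unit description, the `/usr/bin/nvidia-smi` path and the 315W value are assumptions for illustration, and the guide does not prescribe this exact file.

```python
UNIT_TEMPLATE = """[Unit]
Description=Apply GPU power limit at boot

[Service]
Type=oneshot
ExecStart={nvidia_smi} -i {gpu_index} -pl {watts}

[Install]
WantedBy=multi-user.target
"""


def render_unit(watts, gpu_index=0, nvidia_smi="/usr/bin/nvidia-smi"):
    """Return systemd unit text that reapplies a power limit at boot."""
    return UNIT_TEMPLATE.format(
        nvidia_smi=nvidia_smi, gpu_index=gpu_index, watts=watts
    )


# 315 W is an assumed value (70% of an assumed 450 W default limit).
print(render_unit(315))
```

The rendered text would be saved under `/etc/systemd/system/` with a name of the user's choosing and turned on with `systemctl enable`. It should only be enabled once the chosen limit has held up under a long run of the real workload.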

What’s Next

The next step for users is measurement: set a conservative power limit, run the same local model and prompt workload for a sustained period, and compare power draw, GPU temperature, sustained core clock and actual tokens per second against the stock baseline. The guide recommends tools such as MSI Afterburner on Windows and nvidia-smi or LACT on Linux, and saving the setting only after the workload remains stable.
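
A minimal Python sketch of that measurement loop on Linux, assuming nvidia-smi is on the PATH and a single GPU is being watched, might look like the following. The helper names are hypothetical, and tokens per second has to come from the inference tool itself, since nvidia-smi does not report it.

```python
import statistics
import subprocess

# Query power draw (W), GPU temperature (C) and SM clock (MHz) as bare CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=power.draw,temperature.gpu,clocks.sm",
    "--format=csv,noheader,nounits",
]


def parse_sample(line):
    """Parse one CSV line such as '301.25, 67, 2520' into floats."""
    power_w, temp_c, sm_mhz = (float(x) for x in line.split(","))
    return {"power_w": power_w, "temp_c": temp_c, "sm_mhz": sm_mhz}


def read_sample():
    """Take one reading from the first GPU (requires the NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_sample(out.splitlines()[0])


def summarize(samples):
    """Average each metric over a run."""
    return {key: statistics.mean(s[key] for s in samples) for key in samples[0]}


def relative_speed_pct(tokens_per_sec, baseline_tokens_per_sec):
    """Throughput at the new setting as a percentage of the stock baseline."""
    return 100.0 * tokens_per_sec / baseline_tokens_per_sec
```

In practice `read_sample()` would be called every few seconds while the model runs, once at stock settings and once at the reduced limit, and the two summaries compared alongside the relative speed. The parser assumes every field is numeric and will raise an error if the driver reports a field as unavailable.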

Key Questions

Is this a new GPU feature?

No. The guide covers existing controls: GPU power limits and undervolting. What is new is the inference-focused recommendation and the reported token-throughput tradeoff.

Where does the guide recommend starting?

The guide recommends starting with a power limit, often around 70% for the RTX 4090 example. It says this captures much of the heat reduction while avoiding direct voltage-curve editing.

Does undervolting always keep the same tokens per second?

No. The source says throughput can stay close in many inference workloads, but the measured result depends on the model, quantization, card, cooling and whether the workload is memory-bound.

Can this damage a GPU?

The guide describes power limiting as restrictive rather than aggressive, because it lowers the card’s allowed power. Direct undervolting is also described as reversible, but the source warns that users make changes at their own risk and should test stability.

Why does this matter for local AI users?

Local inference can run GPUs for long periods. Lowering heat output may reduce fan noise, room heat and power use while preserving enough throughput for daily work.

Source: Thorsten Meyer AI

Author

Geek Salad Team
