Undervolting Your GPU for Local Inference: Lower Heat, Same Tokens/sec

TL;DR

Thorsten Meyer AI has published guidance saying that GPU power limits and undervolting can reduce heat, power draw and fan noise for local LLM inference while preserving most of the throughput. The cited figures show an RTX 4090 at a 70% power limit drawing about 300W, running cooler and keeping 93.4% of its token speed, though results vary by card, model and workload.

Thorsten Meyer AI has published a GPU tuning guide arguing that local inference users can cut heat and noise by power-limiting or undervolting high-end NVIDIA cards, with little or no loss in tokens per second in many workloads. The guidance matters for users running local LLMs on hot, high-power workstations, where GPU heat can affect noise, comfort and sustained performance.

The guide says local inference is often limited by memory bandwidth rather than GPU core compute. On that basis, it recommends starting with a power limit before changing hardware, especially for high-power cards such as the RTX 4090 and RTX 5090.

In the source material, Thorsten Meyer AI presents measured RTX 4090 results showing stock operation at 390W, 72C and 100% relative speed. At a 70% power limit, the guide lists 300W, 67C and 93.4% of tokens per second, a reduction of 90W for a reported 6.6% speed loss. At 60%, it lists 260W, 62C and 91.5% speed; at 40%, performance drops sharply to 61.3%.
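
The tradeoff in those figures is easier to see as speed per watt. The following is a minimal sketch that uses only the numbers reported above; the function name and the "relative efficiency" metric are illustrative choices, not something the guide defines. The 40% row is left out because its wattage is not listed.

```python
# Figures as reported in the guide for one RTX 4090:
# label -> (watts, relative speed in percent).
# The 40% row is omitted because its wattage is not given.
REPORTED = {
    "stock": (390, 100.0),
    "70%": (300, 93.4),
    "60%": (260, 91.5),
}


def relative_efficiency(watts, speed_pct, base_watts, base_speed_pct):
    """Speed per watt relative to the baseline (1.0 means same as stock)."""
    return (speed_pct / watts) / (base_speed_pct / base_watts)


base_watts, base_speed = REPORTED["stock"]
for label, (watts, speed) in REPORTED.items():
    eff = relative_efficiency(watts, speed, base_watts, base_speed)
    print(
        f"{label:>5}: {watts} W, {speed:.1f}% speed, "
        f"{base_watts - watts} W saved, {eff:.2f}x tokens per watt"
    )
```

On these numbers the 70% setting delivers about 21% more tokens per watt than stock and the 60% setting about 37% more, which is why the guide treats the first 90W as nearly free.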

The guide separates two methods: power limiting and direct undervolting. It describes power limiting as the safer starting point because the user restricts GPU power and lets the card adjust voltage and clocks. It describes undervolting as a more advanced voltage-frequency curve change that may keep more performance at a given heat level but requires testing under the user’s real workload.
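
On Linux, the power-limit method can be sketched with nvidia-smi, which accepts a limit in watts through its `-pl` option. The helper names below are hypothetical, and the sketch assumes the percentage is taken relative to the card's default power limit, which the article does not state. Setting the limit normally requires root, and the driver rejects values outside the card's allowed range.

```python
import subprocess


def watts_for_percent(default_limit_w, percent):
    """Convert a percentage power limit into watts.

    Assumption: the percentage is relative to the card's default limit.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return round(default_limit_w * percent / 100)


def read_default_limit_w(gpu_index=0):
    """Ask nvidia-smi for the card's default power limit in watts."""
    out = subprocess.run(
        [
            "nvidia-smi", "-i", str(gpu_index),
            "--query-gpu=power.default_limit",
            "--format=csv,noheader,nounits",
        ],
        check=True, capture_output=True, text=True,
    ).stdout
    return float(out.strip())


def power_limit_command(gpu_index, watts):
    """Build the nvidia-smi call that sets a power limit in watts."""
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def apply_percent_limit(percent, gpu_index=0):
    """Read the default limit, scale it, and apply the result (needs root)."""
    watts = watts_for_percent(read_default_limit_w(gpu_index), percent)
    subprocess.run(power_limit_command(gpu_index, watts), check=True)
    return watts
```

Calling `apply_percent_limit(70)` would cap GPU 0 at 70% of its default limit. Nothing runs on import, so the helpers can be inspected on a machine without an NVIDIA driver.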

Why It Matters

The practical impact is cost and comfort. If the reported pattern holds for a user’s workload, a single power-limit setting can lower GPU heat output, reduce fan noise and cut electricity use without buying a cooler, changing a case or moving fans.

The finding is more relevant for local inference than for gaming because the performance bottleneck can be different. The guide states that gaming loads are often compute-bound, while many inference runs wait on VRAM bandwidth. If the GPU core is not the limiting part of the pipeline, lowering core power may reduce heat faster than it reduces token throughput.
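
A common back-of-envelope way to see the memory-bound argument is to assume that generating each token reads every model weight from VRAM once, so single-stream decode speed cannot exceed memory bandwidth divided by model size. The sketch below applies that rule of thumb; it is not a calculation from the guide, the 8 GB model size is hypothetical, and the 1008 GB/s figure is the RTX 4090's spec-sheet bandwidth. Real throughput lands below this ceiling because of KV-cache reads, kernel overhead and sampling.

```python
def decode_ceiling_tokens_per_sec(bandwidth_gb_s, weights_gb):
    """Upper bound on single-stream decode speed.

    Assumes every generated token reads all model weights from VRAM once.
    """
    return bandwidth_gb_s / weights_gb


# Illustrative inputs, not taken from the guide.
bandwidth_gb_s = 1008.0  # RTX 4090 spec-sheet memory bandwidth
weights_gb = 8.0         # hypothetical quantized model size

ceiling = decode_ceiling_tokens_per_sec(bandwidth_gb_s, weights_gb)
print(f"Bandwidth-bound ceiling: {ceiling:.0f} tokens/sec")
```

Because the ceiling depends on memory bandwidth and not on core clocks, trimming core power leaves it unchanged as long as the power cap does not also force memory clocks down.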

Background

The article is framed as the first lever in Thorsten Meyer AI’s broader series on reducing heat and noise in high-power AI workstations. The source says the guidance is based on published RTX 4090 fine-tuning power-scaling measurements and RTX 5090/4090 power-cap tests from 2025-2026.

The guide recommends a practical starting range of 60% to 80% power limit, with 70% presented as the recommended setting for the RTX 4090 example. It also cites RTX 5090 power-cap figures: a 450W cap runs about 5% slower and a 400W cap about 10% slower, though it presents those numbers as workload-dependent.
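
One way to turn figures like these into a setting is to decide how much slowdown is acceptable and pick the lowest cap that stays above it. The sketch below does that with the approximate RTX 5090 numbers cited above; the function name and the 92% floor are hypothetical, and measurements from the user's own workload should replace the table.

```python
def lowest_cap_meeting(points, min_speed_pct):
    """Return the lowest wattage cap whose relative speed meets the floor.

    points maps a cap in watts to relative speed in percent.
    Returns None if no cap qualifies.
    """
    acceptable = [watts for watts, speed in points.items() if speed >= min_speed_pct]
    return min(acceptable) if acceptable else None


# Approximate values derived from "about 5% slower" and "about 10% slower".
RTX_5090_APPROX = {450: 95.0, 400: 90.0}

print(lowest_cap_meeting(RTX_5090_APPROX, 92.0))
```

With a 92% floor the function selects the 450W cap, and lowering the floor to 90% selects 400W.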

“This is the first thing you should do to a high-power AI workstation, and it costs nothing.”

— Thorsten Meyer AI guide

“Local inference is memory-bound — the GPU core spends much of its time waiting on VRAM, not maxing out compute.”

— Thorsten Meyer AI guide

“Sweet spot: 90W of heat gone, only ~7% slower.”

— Thorsten Meyer AI guide

“This is a tuning guide, not a warranty document.”

— Thorsten Meyer AI guide

What Remains Unclear

The exact speed loss is not fixed. The source says results vary by GPU sample, model, quantization, cooling, case airflow and workload length. It is also unclear from the supplied material how broadly the listed RTX 4090 and RTX 5090 figures apply across consumer cards, workstation cards and non-NVIDIA setups.

The guide says a curve that appears stable for 10 minutes can fail hours later, so any undervolt needs validation under the user’s own inference workload. Power caps may also reset after reboot unless saved through a startup profile or Linux service.
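
For the Linux service route, one option is a oneshot systemd unit that reapplies the limit at boot. The sketch below only renders the unit text; the function name is hypothetical, and the `/usr/bin/nvidia-smi` path and the 300W value are assumptions to adjust for the actual system.

```python
def power_limit_unit(watts, gpu_index=0, nvidia_smi="/usr/bin/nvidia-smi"):
    """Render a oneshot systemd unit that reapplies a GPU power limit at boot."""
    lines = [
        "[Unit]",
        "Description=Apply GPU power limit",
        "",
        "[Service]",
        "Type=oneshot",
        f"ExecStart={nvidia_smi} -i {gpu_index} -pl {watts}",
        "",
        "[Install]",
        "WantedBy=multi-user.target",
    ]
    return "\n".join(lines) + "\n"


print(power_limit_unit(300))
```

The output could be saved under a name such as `/etc/systemd/system/gpu-power-limit.service` (a hypothetical file name) and enabled with `systemctl enable`, but only after the chosen limit has survived a long run of the real workload.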

What’s Next

The next step for users is measurement. Set a conservative power limit, run the same local model and prompt workload for a sustained period, and compare power draw, GPU temperature, sustained clock speed and actual tokens per second. The guide recommends tools such as MSI Afterburner on Windows and nvidia-smi or LACT on Linux, and advises saving the setting only after the workload remains stable.
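
A measurement pass along those lines can be sketched by polling nvidia-smi while the workload runs and summarizing the samples afterward. The helper names are hypothetical, and the token count and elapsed time are assumed to come from whatever inference runtime is in use.

```python
import subprocess

# nvidia-smi query fields: board power (W), GPU temperature (C), SM clock (MHz).
QUERY_FIELDS = ["power.draw", "temperature.gpu", "clocks.sm"]


def parse_sample(csv_line):
    """Parse one 'csv,noheader,nounits' line from nvidia-smi."""
    power_w, temp_c, sm_mhz = (float(x) for x in csv_line.split(","))
    return {"power_w": power_w, "temp_c": temp_c, "sm_mhz": sm_mhz}


def sample_gpu(gpu_index=0):
    """Take one reading from the GPU (requires the NVIDIA driver)."""
    out = subprocess.run(
        [
            "nvidia-smi", "-i", str(gpu_index),
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_sample(out.strip())


def summarize(samples, tokens_generated, elapsed_s):
    """Reduce a run to the numbers worth comparing between power limits."""
    return {
        "avg_power_w": sum(s["power_w"] for s in samples) / len(samples),
        "max_temp_c": max(s["temp_c"] for s in samples),
        "min_sm_mhz": min(s["sm_mhz"] for s in samples),
        "tokens_per_sec": tokens_generated / elapsed_s,
    }
```

Calling `sample_gpu()` every few seconds during a long generation and passing the collected samples to `summarize` gives one comparable row per power limit. Using the same model, prompt and run length each time keeps the rows comparable.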

Key Questions

Is this a new GPU feature?

No. The guide covers existing controls: GPU power limits and undervolting. What is new is the inference-focused recommendation and the reported token-throughput tradeoff.

What setting does the guide recommend first?

The guide recommends starting with a power limit, often around 70% for the RTX 4090 example. It says this captures much of the heat reduction while avoiding direct voltage-curve editing.

Does undervolting always keep the same tokens per second?

No. The source says throughput can stay close in many inference workloads, but the measured result depends on the model, quantization, card, cooling and whether the workload is memory-bound.

Can this damage a GPU?

The guide describes power limiting as restrictive rather than aggressive, because it lowers the card’s allowed power. Direct undervolting is also described as reversible, but the source warns that users make changes at their own risk and should test stability.

Why does this matter for local AI users?

Local inference can run GPUs for long periods. Lowering heat output may reduce fan noise, room heat and power use while preserving enough throughput for daily work.

Source: Thorsten Meyer AI
