The article said:
the attack could drop an AI model’s accuracy from 80% to under 1%—just by flipping a single bit in memory.
This is somewhat alarmist. Very few bits in an AI model would have that sort of impact. In the vast majority of cases, you could flip a bit and the impact would barely be detectable.
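To make that concrete, here's a quick sketch (my own toy example, not code from the article or the GPUHammer paper) that flips each bit of a single float32 weight in turn. Low-order flips get lost in the noise; it takes a flip in the top exponent bit to turn a small weight into an astronomically large one, and that's the kind of single flip behind headline-grabbing accuracy drops.

    # A weight in float32; flip each of its 32 bits and see how far the value moves.
    # My own toy example, not code from the GPUHammer paper.
    import numpy as np

    w = np.array([0.0123], dtype=np.float32)      # a typical small model weight
    bits = w.view(np.uint32)                      # same 32 bits, reinterpreted as an integer

    for i in range(32):
        corrupted = (bits ^ np.uint32(1 << i)).view(np.float32)[0]
        print(f"flip bit {i:2d}: {w[0]:.6g} -> {corrupted:.6g}")

    # Mantissa flips (bits 0-22) change the weight by at most ~30%, and a sign flip
    # (bit 31) just negates it -- for one weight among millions, barely detectable.
    # Bit 30, the top exponent bit, turns 0.0123 into roughly 4e+36, and that's the
    # kind of flip that can wreck a model's accuracy.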
The article said:
GPUHammer proves it can happen on GDDR6 VRAM too
There was never a reason to suspect otherwise. GDDR works on the same principles as regular DDR memory; it's DRAM in both cases.
The article said:
The attacker just needs to share the same GPU in a cloud environment or server, and they could potentially interfere with your workload however they want.
It's not accurate to say they could "interfere however they want". You can't control which bits flip. More importantly, in a virtualized environment the address layout is effectively unpredictable, if not deliberately randomized. So the best an attacker could do is just try to mess with whichever of your programs are using the GPU; they wouldn't have any way to control exactly how it affects you.
The article said:
NVIDIA has published a full list of affected models and recommends ECC for most of them.
Cloud GPUs should all support out-of-band ECC.
Most consumer models do not, but then client PCs aren't at much risk from such exploits (they would effectively require running some malware on your PC). There's an outside chance that some WebGPU-enabled code running in your browser could trash your GPU state using such an exploit.
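If you want to check what you've actually got, NVML will tell you whether a GPU supports ECC and whether it's currently on. A rough sketch using the pynvml bindings (nvidia-ml-py), assuming they expose nvmlDeviceGetEccMode the way the NVML C API defines it; "nvidia-smi -q -d ECC" reports the same information.

    # Sketch: ask NVML whether each GPU supports ECC and whether it's enabled.
    # Assumes the nvidia-ml-py / pynvml bindings expose nvmlDeviceGetEccMode as in
    # the NVML C API; GPUs without ECC raise NVMLError_NotSupported.
    import pynvml

    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            try:
                current, pending = pynvml.nvmlDeviceGetEccMode(handle)
                state = "enabled" if current == pynvml.NVML_FEATURE_ENABLED else "disabled"
                print(f"GPU {i} ({name}): ECC {state}, pending mode {pending}")
            except pynvml.NVMLError_NotSupported:
                print(f"GPU {i} ({name}): no ECC support")
    finally:
        pynvml.nvmlShutdown()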
The article said:
newer GPUs like the RTX 5090 and H100 have built-in ECC directly on the chip, which handles this automatically—no user setup required.
That's because the RTX 5090 uses GDDR7, which has on-die ECC like DDR5 (the H100 uses HBM, which supports ECC as well). On-die ECC isn't as good as out-of-band ECC, since it uses fewer check bits per data word and only corrects errors inside the DRAM die, but it should provide a little protection.
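For anyone who hasn't looked at how ECC actually works: the classic side-band scheme on server DIMMs is a (72,64) SECDED code, i.e. 8 check bits riding alongside every 64 data bits, which can correct any single flipped bit and detect (but not fix) a double flip. Here's a toy version, purely my own illustration of the principle; real memory controllers do this in hardware and often use stronger codes.

    # Toy (72,64) SECDED code: 64 data bits, 7 Hamming parity bits, 1 overall
    # parity bit. Purely illustrative; real memory controllers do this in hardware.
    K, R = 64, 7                 # 2**7 >= 64 + 7 + 1, so 7 Hamming parity bits suffice
    N = K + R                    # 71-bit Hamming codeword; +1 overall parity bit = 72

    DATA_POS = [p for p in range(1, N + 1) if p & (p - 1)]   # non-power-of-two slots

    def encode(data):
        code = {p: 0 for p in range(1, N + 1)}
        for i, p in enumerate(DATA_POS):                      # scatter the data bits
            code[p] = (data >> i) & 1
        for i in range(R):                                    # parity bit at 2**i covers
            pb = 1 << i                                       # every position with that bit set
            code[pb] = sum(code[p] for p in range(1, N + 1) if p & pb) & 1
        overall = sum(code.values()) & 1                      # distinguishes 1 flip from 2
        return code, overall

    def decode(code, overall):
        syndrome = 0
        for i in range(R):
            pb = 1 << i
            if sum(code[p] for p in range(1, N + 1) if p & pb) & 1:
                syndrome |= pb                                # failed checks spell out the bad position
        parity_err = (sum(code.values()) + overall) & 1
        if syndrome and parity_err:                           # single flip: repair it
            code[syndrome] ^= 1
            status = "corrected"
        elif syndrome:                                        # double flip: detect only
            status = "uncorrectable"
        else:
            status = "corrected" if parity_err else "ok"      # lone flip in the parity bit itself
        data = sum(code[p] << i for i, p in enumerate(DATA_POS))
        return data, status

    code, ov = encode(0x0123456789ABCDEF)
    code[37] ^= 1                                             # simulate one Rowhammer-style flip
    data, status = decode(code, ov)
    print(hex(data), status)                                  # -> 0x123456789abcdef corrected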
The article said:
NVIDIA has responded with a simple but important recommendation: turn on ECC (Error Correction Code) if your GPU supports it
Cloud operators should already be doing this. I wouldn't trust my data to any who hadn't been enabling it since day one. DRAM errors happen in the course of normal operation, and become more frequent as DRAM ages. That's why GPUs have an ECC capability in the first place: for when you're using them for anything where data integrity is important.
The same is true for server memory. I would be shocked and appalled if any cloud operators were running without ECC enabled on their server DRAM. Failing to enable it on their GPUs is nearly as bad.
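For completeness, turning ECC on is close to a one-liner if the GPU supports it. A sketch, assuming the pynvml bindings expose nvmlDeviceSetEccMode; it needs admin/root rights, and the new mode only takes effect after a GPU reset or reboot. The command-line equivalent is "nvidia-smi -i 0 -e 1".

    # Sketch: enable ECC on GPU 0 via NVML (assumes pynvml exposes
    # nvmlDeviceSetEccMode). Needs admin/root, and the change only takes
    # effect after a GPU reset or reboot.
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    pynvml.nvmlDeviceSetEccMode(handle, pynvml.NVML_FEATURE_ENABLED)
    current, pending = pynvml.nvmlDeviceGetEccMode(handle)
    print("current:", current, "pending:", pending)   # pending should now read 1 (enabled)
    pynvml.nvmlShutdown()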