News Massive VRAM pools on AMD Instinct accelerators drown Linux's hibernation process — 1.5 TB of memory per server creates headaches

Linux isn't the only thing with hibernation problems!

My Dell Precision laptop (for my job), running Windows 11, doesn't always come out of hibernation successfully. The laptop is still under warranty and fully supported under Windows 11, so it's new enough that this should work!
 
Linux isn't the only thing with hibernation problems!

My Dell Precision laptop (for my job), running Windows 11, doesn't always come out of hibernation successfully. The laptop is still under warranty and fully supported under Windows 11, so it's new enough that this should work!
Hibernation has always been an issue on the two major PC platforms.

In contrast to PCs, this is a benefit Apple gives itself by keeping everything so tightly controlled. Fewer choices and fewer user allowances do have benefits for the provider from a support and testing perspective. That in turn lets Apple sell its customers on the idea that "everything just works".
 
Linux isn't the only thing with hibernation problems!

My Dell Precision laptop (for my job), running Windows 11, doesn't always come out of hibernation successfully. The laptop is still under warranty and fully supported under Windows 11, so it's new enough that this should work!
Yeah, Windows has pretty much always had issues resuming from hibernation and other sleep states, across versions and monthly build levels. That's true pretty much regardless of make and model, though some are certainly way more troublesome than others. It's a crapshoot, really!

We actually disable hibernation with:

powercfg /h off

Run from an elevated Command Prompt; we push this out as a GPO script. Laptops do have sleep and deep sleep (S3) enabled, even though that can cause problems as well. So our laptop users just shut down (and it's a true shutdown, not the hibernation that Windows started preferring several years ago) when the laptop isn't going to be used for some time. Anyway, not writing the hibernation file to disk also saves some SSD writes. 😉
 
In contrast to PCs, this is a benefit Apple gives itself by keeping everything so tightly controlled. Fewer choices and fewer user allowances do have benefits for the provider from a support and testing perspective. That in turn lets Apple sell its customers on the idea that "everything just works".
Contrary to popular belief, Macs, even the newer M-series, are not immune to sleep breaking in some way.
The likely cause is a third-party driver rather than the OS itself, but YMMV.
 
Years ago in the high-altitude deserts of the Sierra Nevada I ran a small computing cluster of about 50 nodes that would pause computations during the summer days and resume them at night when the outside temperatures were much cooler. Otherwise, the air-conditioning tended to fail.

Although AI inference is usually performed on demand, I can well imagine a training cluster which benefits enough from lower electrical and cooling costs at night to the point that hibernation during the day is cost effective.
 
Linux isn't the only thing with hibernation problems!

My Dell Precision laptop (for my job), running Windows 11, doesn't always come out of hibernation successfully. The laptop is still under warranty and fully supported under Windows 11, so it's new enough that this should work!

I have a ThinkPad X1 (also for work, also running the Windows 11 it came with) that never gets hibernation right. Sometimes it works fine, sometimes it refuses to wake up and I have to cold-boot it, sometimes it doesn't hibernate at all and stays powered on while stored away. Sometimes it wakes up only to freeze on the lock screen, forcing a hard reboot. Sometimes everything works except the fingerprint scanner (or random other things, like the touchpad / touchscreen not recognising gestures)...

Pretty much everything is still at default settings, as I don't want to screw up anything on a laptop that I use for my job. I have simply accepted it as a quirk of the laptop, and I remind myself not to use the hibernate (or sleep) feature.

So I think it's not the fault of that particular model of Dell. Maybe it's more of a Windows thing.
 
Although AI inference is usually performed on demand, I can well imagine a training cluster which benefits enough from lower electrical and cooling costs at night to the point that hibernation during the day is cost effective.
The hardware and opportunity costs are so high that I'm sure they'll just plow ahead, 24/7. At the very most, they might slightly reduce clock speeds during the heat of the day and perhaps at peak demand times, for the grid.
 
The hardware and opportunity costs are so high that I'm sure they'll just plow ahead, 24/7. At the very most, they might slightly reduce clock speeds during the heat of the day and perhaps at peak demand times, for the grid.
True - but then it may be a grid capacity problem, because A/C may be overworked during that time too, in which case you'd need to reduce clock speeds by quite a lot for it to be effective. Since those parts already run at lower clock speeds because they rely more on pure parallelism than max clock speed, it is not unlikely that the only way to really reduce power draw is to actually shut down some racks.
 
Linux developers claim hibernation is still an unstable feature, specifically on the Linux desktop. The desktop itself is still unstable. The default installations of Ubuntu and Linux Mint do not even offer hibernation.
Linux may fail not only to hibernate but also to go to standby (called Suspend in Linux), with or without any VRAM involved.
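
For anyone who wants to check what their own kernel advertises, here is a minimal Python sketch. It assumes the standard sysfs file `/sys/power/state`, where `mem` means suspend and `disk` means hibernation; the helper names are made up for illustration. A state being listed only means the kernel offers it, not that resume will actually work on a given machine.

```python
from pathlib import Path


def parse_sleep_states(text):
    """Split the whitespace-separated contents of /sys/power/state into a set."""
    return set(text.split())


def supported_sleep_states(path="/sys/power/state"):
    """Return the sleep states the kernel advertises, or an empty set if unreadable."""
    try:
        return parse_sleep_states(Path(path).read_text())
    except OSError:
        # Not Linux, or sysfs is not mounted: report nothing rather than crash.
        return set()


states = supported_sleep_states()
print("suspend (mem) advertised:", "mem" in states)
print("hibernate (disk) advertised:", "disk" in states)
```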

Linux was not user friendly 30 years ago, 20 years ago, or 10 years ago, and it is continuing that trend today. Too small a user pool and too many distros hurt Linux: there is no time or money to polish them. Plus it looks like Torvalds does not care about the Linux desktop, always enjoying his beloved Terminal and the orgasmic list of his creations that every user must permanently enjoy seeing too:
bin
boot
cdrom
dev
etc
lib
lib32
lib64
libx32
mnt
opt
proc
root
run
sbin
srv
sys
tmp
usr
var
(sorry for the torture of exposing what no one wants to see).
When will Linux devs learn something from Android?
 
Since those parts already run at lower clock speeds because they rely more on pure parallelism than max clock speed, it is not unlikely that the only way to really reduce power draw is to actually shut down some racks.
Have you seen the power requirements of these datacenter GPUs, lately? Blackwell Ultra (B300) is using up to 1.4 kW!

The silicon is so expensive that they're running the clock speeds as high as they possibly can, in order to maximize perf/$. At this point, the hardware costs are like 10x the lifetime energy costs, so it makes sense for them to essentially clock it to the limit of how much heat they can dissipate.
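
A rough back-of-the-envelope version of that energy figure, as a Python sketch. Only the 1.4 kW number comes from the post above; the five-year service life and the $0.10/kWh electricity price are assumptions picked for illustration, and cooling overhead is ignored. Plug in your own numbers.

```python
HOURS_PER_YEAR = 24 * 365


def lifetime_energy_kwh(power_kw, years):
    """Energy used by a device running flat out, 24/7, for the given number of years."""
    return power_kw * HOURS_PER_YEAR * years


def lifetime_energy_cost(power_kw, years, price_per_kwh):
    """Electricity cost of that energy at a flat price per kWh."""
    return lifetime_energy_kwh(power_kw, years) * price_per_kwh


GPU_POWER_KW = 1.4            # figure quoted in the post above
ASSUMED_YEARS = 5             # assumption: service life
ASSUMED_PRICE_PER_KWH = 0.10  # assumption: flat electricity price in USD

cost = lifetime_energy_cost(GPU_POWER_KW, ASSUMED_YEARS, ASSUMED_PRICE_PER_KWH)
print(f"Lifetime energy cost per GPU: ${cost:,.0f}")
```

Comparing that result against whatever you believe one accelerator costs gives the hardware-to-energy ratio being argued about.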
 
Now, most high-end AI servers run continuously, so it's fair to ask why anyone would hibernate them.
Zactly.
I've never used hibernate, not even to take the laptop home where I would open it up again, or vice-versa.
And, on a server???
Do these fancy GPUs have a sleep mode? That would seem to be a better idea than flushing terabytes every day. Sleep might generally make more sense than hibernate for servers, since recovery is much faster.
Actually if the server is going to hibernate, aren't the odds about 99% that the GPUs can just be flushed, not stored?
 
Zactly.
I've never used hibernate, not even to take the laptop home where I would open it up again, or vice-versa.
And, on a server???
Do these fancy GPUs have a sleep mode? That would seem to be a better idea than flushing terabytes every day.
Maybe hibernation is used as a way to snapshot their state, during training runs? You could then copy the swap file/device and later restore from it, in order to pick up where you left off.
 
Maybe hibernation is used as a way to snapshot their state, during training runs? You could then copy the swap file/device and later restore from it, in order to pick up where you left off.
Logically valid, but I can't see it being much used in the real world, so hardly worth developing and supporting.
Except as a challenge, maybe, LOL.
 
AMD AI Linux-powered servers are failing to hibernate due to excessive VRAM and a high number of AMD Instinct accelerators per system.

Why on Earth would you hibernate such a server system with full VRAM load?

If you first stop the processes that allocate lots of VRAM, the system doesn't need to save the contents of VRAM in order to hibernate. The system must obviously store the active VRAM contents somewhere before cutting power, or they cannot be restored after waking from hibernation, right?

And if you truly have to hibernate such a system under full load, you just have to get fast enough storage. For example, an Intel P5800X 3.2 TB should be good enough for this kind of requirement. It's not cheap, but neither is offlining GPUs with hundreds of gigs of VRAM.
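
To put a number on "fast enough", here is a small Python sketch of the write-time arithmetic. The 1.5 TB comes from the article headline; the 6 GB/s sustained sequential write speed is an assumption for a fast NVMe drive, not a measured figure for any particular model. It also ignores system RAM, image compression and any staging of VRAM through host memory, so treat it as a lower bound.

```python
def hibernation_write_seconds(image_bytes, write_bytes_per_s):
    """Time to stream a hibernation image to storage at a sustained write rate."""
    if write_bytes_per_s <= 0:
        raise ValueError("write throughput must be positive")
    return image_bytes / write_bytes_per_s


VRAM_BYTES = 1.5e12      # 1.5 TB of VRAM per server, from the headline
ASSUMED_WRITE_BPS = 6e9  # assumption: 6 GB/s sustained sequential write

seconds = hibernation_write_seconds(VRAM_BYTES, ASSUMED_WRITE_BPS)
print(f"Writing the VRAM image takes about {seconds:.0f} s ({seconds / 60:.1f} min)")
```

Under those assumptions the write alone is about 250 seconds, so the storage speed matters a lot more than it does on a laptop.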