News AMD's EPYC Rome Chips Crash After 1,044 Days of Uptime

I don't know of any server type or ops team that would allow machines to go over a year without a restart. Not every patching cycle forces a restart, but a few do, and if you're not patching the kernel and not restarting the machines, then you're doing ops wrong IMO.

Even for critical infrastructure, you plan for such scenarios.

That being said, I don't know 100% of the industry, and there may be valid use cases for a machine to be always on from the moment it's put in service, but I just don't know of any, nor can I rationalize that being the case.

Anyway, the bug sounds funny and easily avoidable.

Regards.
 
It's a massive bug. Not every server is security-critical and has to be updated regularly. I have Linux servers which have been running for 6+ years straight, without any changes. And I don't intend to change anything in the foreseeable future. They work perfectly well, and they will continue to do so. I would be mightily angry if it turned out I have to reboot them every 3 years.
 
I guess I could see instances when you might want the server to never restart but honestly it seems like a good idea to restart periodically anyway, not just to avoid weird timer bugs. I wouldn't patch this either if I was AMD, not a big issue and I think they don't even sell this CPU anymore.
 
I agree that this is more a curiosity than a major issue. Nice that it can be fixed by a first line maintainer without any knowledge of the situation. "Did you try restarting it?"

The fact that they know about it probably means someone ran up against the limit and cared enough to ask AMD to look into it.

I've never maintained any major piece of datacenter equipment but I can totally see why some would not want to shut down and restart something that is working just fine right now. It's another chance to introduce a problem.
 
I agree that this is more a curiosity than a major issue. Nice that it can be fixed by a first line maintainer without any knowledge of the situation. "Did you try restarting it?"

The fact that they know about it probably means someone ran up against the limit and cared enough to ask AMD to look into it.

I've never maintained any major piece of datacenter equipment but I can totally see why some would not want to shut down and restart something that is working just fine right now. It's another chance to introduce a problem.
Hello IT, did you try turning it off and on again?

View: https://youtu.be/nn2FB1P_Mn8
 
While not a great flaw I doubt most servers are up 1,044 days continuously. They certainly aren't if they are doing regular security updates or use a Windows OS. I'm not going to worry about it.
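For anyone who would rather check than guess, here is a minimal sketch of how one might see how far a box is from the 1,044-day mark. It assumes a Linux host exposing `/proc/uptime`; the 1,044-day figure is just the number from the headline, and the function names are made up for illustration. Note that inside a VM this reports the guest's uptime, not the physical host's.

```python
# Sketch only: compares uptime against the 1,044-day figure from the headline.
# Function names are hypothetical; /proc/uptime is Linux-specific.

LIMIT_DAYS = 1044          # figure taken from the article headline
SECONDS_PER_DAY = 86_400


def days_remaining(uptime_seconds: float, limit_days: float = LIMIT_DAYS) -> float:
    """Return how many days are left before uptime reaches limit_days."""
    return limit_days - uptime_seconds / SECONDS_PER_DAY


def read_uptime_seconds(path: str = "/proc/uptime") -> float:
    """Read the first field of /proc/uptime (seconds since boot)."""
    with open(path) as f:
        return float(f.read().split()[0])


if __name__ == "__main__":
    try:
        left = days_remaining(read_uptime_seconds())
        print(f"{left:.1f} days left before the {LIMIT_DAYS}-day mark")
    except OSError:
        # Not a Linux system, or /proc is not mounted.
        print("could not read /proc/uptime")
```

Run on the bare-metal host (or the hypervisor), a small or negative number would be the cue to schedule a reboot.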
 
The fix is simple ...
@PaulAlcorn , that's a workaround, not a fix. The fix would be some microcode update, or at least a kernel patch that resolves the issue without any functional deficits.

What's unclear to me is how it affects users in virtualization environments, where probably the bulk of these CPUs are being deployed. If you merely restart all the VMs more frequently than the issue occurs, is that sufficient? Or would you have to actually restart the hypervisor? I'm thinking it's a lot more likely the VMs get patched and restarted enough to avoid this than the hypervisor.
 
@PaulAlcorn , that's a workaround, not a fix. The fix would be some microcode update, or at least a kernel patch that resolves the issue without any functional deficits.

What's unclear to me is how it affects users in virtualization environments, where probably the bulk of these CPUs are being deployed. If you merely restart all the VMs more frequently than the issue occurs, is that sufficient? Or would you have to actually restart the hypervisor? I'm thinking it's a lot more likely the VMs get patched and restarted enough to avoid this than the hypervisor.
Good point, fixed!

I believe this requires a full reset of the chip itself, so a reboot, but that is not 100% clear -- I'm following up with AMD to learn more.
 
I guess I could see instances when you might want the server to never restart but honestly it seems like a good idea to restart periodically anyway, not just to avoid weird timer bugs. I wouldn't patch this either if I was AMD, not a big issue and I think they don't even sell this CPU anymore.
With VMware 6.7 and later, when you do vSphere updates you might not have to reboot the entire host, just ESXi. That changes your reboot schedule for the entire host considerably.
 
@PaulAlcorn , that's a workaround, not a fix. The fix would be some microcode update, or at least a kernel patch that resolves the issue without any functional deficits.

What's unclear to me is how it affects users in virtualization environments, where probably the bulk of these CPUs are being deployed. If you merely restart all the VMs more frequently than the issue occurs, is that sufficient? Or would you have to actually restart the hypervisor? I'm thinking it's a lot more likely the VMs get patched and restarted enough to avoid this than the hypervisor.
You have to restart the host. Just restarting the hypervisor doesn't reboot the entire host.
 
That reminds me of an odd bug in the Oracle client runtime I once faced...

We were running 32-bit Oracle clients linked to a 32-bit app in OpenVZ containers on a 64-bit kernel at the time, which had migrated straight off 32-bit Linux bare metal hosts.

After running somewhere between 30 and 40 days, our applications would disconnect from the database, and nothing but restarting the container (and app) would fix it.

That seemed a little odd so I tried to find out what was going wrong and killed one such instance with a SIGQUIT after it started misbehaving to have a look at the stack on the dump file and see where it got stuck.

It turned out it was trying to detect a time-out in the connect, but was using an old Unix time() syscall which counted the ticks (timer interrupts, at HZ per second) since boot.

On 32-bit Unix systems HZ was 100 per second, which wouldn't overflow that easily. But on 64-bit Unix systems HZ was upped to 1000 per second, which was OK if your application was using 64-bit code and interpreting the 64-bit native syscall return value. But when using 32-bit libraries, those 1 kHz ticks would overflow somewhere after 30 days, and the Oracle client software didn't properly interpret the error codes but got stuck.
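To put rough numbers on that, here is a small sketch of the arithmetic, not the actual Oracle client logic. It assumes a 32-bit tick counter and shows both the unsigned and the signed wrap point, since I can't say which one the client hit; the function names are made up for illustration. It also shows why a naive "now minus start" time-out check goes wrong after a wrap, and how modular arithmetic avoids it.

```python
# Sketch only: wraparound arithmetic for a fixed-width tick counter.
# The 32-bit width and both HZ values come from the story above;
# everything else (names, the example tick values) is illustrative.

SECONDS_PER_DAY = 86_400


def days_until_wrap(hz: int, bits: int = 32, signed: bool = False) -> float:
    """Days until a tick counter of the given width wraps at hz ticks/second."""
    usable_bits = bits - 1 if signed else bits
    return (2 ** usable_bits) / hz / SECONDS_PER_DAY


def naive_elapsed(start: int, now: int) -> int:
    """Elapsed ticks computed without accounting for wraparound."""
    return now - start


def wrap_safe_elapsed(start: int, now: int, bits: int = 32) -> int:
    """Elapsed ticks modulo 2**bits, correct across a single wrap."""
    return (now - start) % (2 ** bits)


if __name__ == "__main__":
    for hz in (100, 1000):
        print(
            f"HZ={hz}: unsigned wraps after {days_until_wrap(hz):.1f} days, "
            f"signed after {days_until_wrap(hz, signed=True):.1f} days"
        )

    # A time-out started 500 ticks before the wrap, checked 1500 ticks after it.
    start, now = 2 ** 32 - 500, 1500
    print("naive:", naive_elapsed(start, now))          # hugely negative
    print("wrap-safe:", wrap_safe_elapsed(start, now))  # 2000 ticks
```

At 1000 ticks per second the two wrap points come out at roughly 25 and 50 days, which brackets the 30 to 40 days we were seeing; at 100 per second they are roughly 249 and 497 days.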

Sure enough, a newer variant of the Oracle client had already fixed the issue by the time I had diagnosed it, and we switched to pure 64-bit shortly after that anyway.

Yeah, our Linux server guys used to pride themselves if a system they dismantled had not rebooted once during its entire 4-5 year lifetime. Perhaps it's because they'd been running Stratus machines before 9/11 made that not good enough.

Then came PCI-DSS and any uptime beyond a month could get you fired...
 
Not a huge issue unless someone was thinking about using them for high uptime applications.
Even hypervisors need to get patched regularly, and you'd just move the VMs to another one, patch, reboot and get on with your life.
I'd completely agree, because it's true for anything I operate myself.

But I also know that computers are used by some in ways the vendors never imagined.

And that's not even always a bad thing, because I'd argue that it's mostly the abuse of computers for things like gaming that advances the science.

And if someone built an EPYC into something embedded, with a truly unbreachable network gap that makes OS updates much less of a necessity, this could come as a nasty surprise.

Thank God the Voyagers aren't running EPYC! Because juggling VMs and patches is no fun with signal round trip times of 35 hours!
 
Thank God the Voyagers aren't running EPYC! Because juggling VMs and patches is no fun with signal round trip times of 35 hours!
They're getting so low on power that some subsystems have had to get powered down. I'm not sure how much longer they can keep doing anything useful.

For that sort of thing, you'd want a processor that operates on mW of power. No way it'd be an EPYC-class CPU. The main draw I can see for even having multi-core is redundancy + spares.
 