News AMD's EPYC Rome Chips Crash After 1,044 Days of Uptime

Admin · Jun 3, 2023

AMD's EPYC 7022 chips can hang after 1,044 days due to an errata that AMD posted to its revision CKVojpvtPMwcbNfFLRF7TKe.

AMD's EPYC Rome Chips Crash After 1,044 Days of Uptime : Read more

The Historical Fidelity · Jun 3, 2023

I’m getting Y2K vibes with this errata lol

AtomicRobotMan0101 · Jun 3, 2023

!RemindMe 1043 days

usertests · Jun 3, 2023

Kinda bad. I think Epyc/server users are more likely to want and actually achieve 3-yr/indefinite uptime than a Ryzen desktop user.

Friesiansam · Jun 3, 2023

Far worse things to worry about, than a minor impact bug that can be prevented with a triyearly restart.

/=^>3-O-B= · Jun 3, 2023

You misspelled error, duh.
To flex in Latin, it's called an "erratum"; "errata" is the plural.

-Fran- · Jun 3, 2023

I don't know of any server type and ops team in which they'd allow machines to spend over a year without a restart. While not all, there's a few patching cycles that force you to restart machines and if you're not patching kernel and not restarting them, then you're doing OPs wrong IMO.

Even for critical infrastructure, you plan for such scenarios.

This being said, I don't know 100% of the industry and there may be cases where they do have a valid use case for a machine to be always on from the moment it's put in service, but I just don't know of any or even could rationalize that being the case.

Anyway, the bug sounds funny and easily avoidable.

Regards.

kaalus · Jun 3, 2023

It's a massive bug. Not every server is security-critical and has to be updated regularly. I have Linux servers which have been running for 6+ years straight, without any changes. And I don't intend to change anything in the foreseeable future. They work perfectly well, and they will continue to do so. I would be mightily angry if it turned out I have to reboot them every 3 years.

johnrock2 · Jun 3, 2023

I guess I could see instances when you might want the server to never restart but honestly it seems like a good idea to restart periodically anyway, not just to avoid weird timer bugs. I wouldn't patch this either if I was AMD, not a big issue and I think they don't even sell this CPU anymore.

Paul Alcorn · Jun 3, 2023

The Historical Fidelity said:
I’m getting Y2K vibes with this errata lol

Same! It was the first thing I thought of, actually

Co BIY · Jun 3, 2023

I agree that this is more a curiosity than a major issue. Nice that it can be fixed by a first line maintainer without any knowledge of the situation. "Did you try restarting it ?"

The fact that they know about it does probably means someone ran up against the limit and cared enough to ask AMD to look into it.

I've never maintained any major piece of datacenter equipment but I can totally see why some would not want to shut down and restart something that is working just fine right now. It's another chance to introduce a problem.

The Historical Fidelity · Jun 3, 2023

/=^>3-O-B= said:
You misspelled error, duh.
To flex in Latin, it's called an "erratum"; "errata" is the plural.

Thank you for your service…

Deleted member 14196 · Jun 3, 2023

Co BIY said:
I agree that this is more a curiosity than a major issue. Nice that it can be fixed by a first line maintainer without any knowledge of the situation. "Did you try restarting it ?"

The fact that they know about it does probably means someone ran up against the limit and cared enough to ask AMD to look into it.

I've never maintained any major piece of datacenter equipment but I can totally see why some would not want to shut down and restart something that is working just fine right now. It's another chance to introduce a problem.

Hello IT, did you try turning it off an on again?

View: https://youtu.be/nn2FB1P_Mn8

TechieTwo · Jun 3, 2023

While not a great flaw I doubt most servers are up 1,044 days continuously. They certainly aren't if they are doing regular security updates or use a Windows OS. I'm not going to worry about it.

bit_user · Jun 3, 2023

The fix is simple ...

@PaulAlcorn , that's a workaround, not a fix. The fix would be some microcode update, or at least a kernel patch that resolves the issue without any functional deficits.

What's unclear to me is how it affects users in virtualization environments, where probably the bulk of these CPUs are being deployed. If you merely restart all the VMs more frequently than the issue occurs, is that sufficient? Or would you have to actually restart the hypervisor? I'm thinking it's a lot more likely the VMs get patched and restarted enough to avoid this than the hypervisor.

bit_user · Jun 3, 2023

The Historical Fidelity said:
I’m getting Y2K vibes with this errata lol

My mind went straight to that infamous Windows 98 timer bug that crashed after about a month of uptime. Perhaps this is it?

Why Windows 95 and 98 would crash after 49.7 days of uptime | Hacker News

news.ycombinator.com

Paul Alcorn · Jun 3, 2023

bit_user said:
@PaulAlcorn , that's a workaround, not a fix. The fix would be some microcode update, or at least a kernel patch that resolves the issue without any functional deficits.

What's unclear to me is how it affects users in virtualization environments, where probably the bulk of these CPUs are being deployed. If you merely restart all the VMs more frequently than the issue occurs, is that sufficient? Or would you have to actually restart the hypervisor? I'm thinking it's a lot more likely the VMs get patched and restarted enough to avoid this than the hypervisor.

Good point, fixed!~

I believe this requires a full reset of the chip itself, so a reboot, but that is not 100% clear -- I'm following up with AMD to learn more.

brandonjclark · Jun 3, 2023

AMD - 'I can only count to (104)4!'

View: https://www.youtube.com/watch?v=u8ccGjar4Es

/=^>3-O-B= · Jun 3, 2023

The Historical Fidelity said:
Thank you for your service…

I know it's a trivia but as an alumni of the axes of evil school of grammer I couldn't resist 😉

jeremyj_83 · Jun 3, 2023

johnrock2 said:
I guess I could see instances when you might want the server to never restart but honestly it seems like a good idea to restart periodically anyway, not just to avoid weird timer bugs. I wouldn't patch this either if I was AMD, not a big issue and I think they don't even sell this CPU anymore.

With VMware 6.7 and later, when you do vSphere updates you might not have to reboot the entire host and just ESXi. That changes your reboot schedule for the entire host considerably.

jeremyj_83 · Jun 3, 2023

bit_user said:
@PaulAlcorn , that's a workaround, not a fix. The fix would be some microcode update, or at least a kernel patch that resolves the issue without any functional deficits.

What's unclear to me is how it affects users in virtualization environments, where probably the bulk of these CPUs are being deployed. If you merely restart all the VMs more frequently than the issue occurs, is that sufficient? Or would you have to actually restart the hypervisor? I'm thinking it's a lot more likely the VMs get patched and restarted enough to avoid this than the hypervisor.

You have to restart the host. Just restarting the hypervisor doesn't reboot the entire host.

abufrejoval · Jun 3, 2023

That reminds me of an odd bug in the Oracle client runtime I once faced...

We were running 32-bit Oracle clients linked to a 32-bit app in OpenVZ containters on a 64-bit kernel at the time, which had migrated straight off 32-bit Linux bare metal hosts.

After running somewhere between 30 to 40 days our applications would disconnect from the database and nothing but restarting the container (and app) would fix it.

That seemed a little odd so I tried to find out what was going wrong and killed one such instance with a SIGQUIT after it started misbehaving to have a look at the stack on the dump file and see where it got stuck.

Turned out it was trying to detect a time-out in the connect but was using an old Unix time() syscall which counted the ticks or timer interrupts in HZ since boot.

On 32-bit Unix systems HZ was 100 per second, which wouldn't overflow that easily. But on 64-bit Unix systems HZ was upped to 1000 per second, which was ok if your application was using 64 bit code and interpreting the 64-bit native syscall return value. But when using 32-bit libraries those 1KHz ticks would overflow somewhere after 30 days and the application the Oracle client software didn't properly interpret the error codes but got stuck.

Sure enough a newer variant of the Oracle client had already fixed the issue by the time I had diagnosed it and we switched to pure 64-bits shortly after that anyway.

Yeah, our Linux server guys used to pride themselves if a system they dismantled had not done a reboot during its entire 4-5 year life time. Perhaps it's because they'd been running Stratus machines before 9/11 made that not good enough.

Then came PCI-DSS and any uptime beyond a month could get you fired...

Sleepy_Hollowed · Jun 3, 2023

Not a huge issue unless someone was thinking about using them for high uptime applications.
Even hypervisors need to get patched at least regularly, and you'd just move VMs to another one, patch, reboot and get on with your life.

abufrejoval · Jun 3, 2023

Sleepy_Hollowed said:
Not a huge issue unless someone was thinking about using them for high uptime applications.
Even hypervisors need to get patched at least regularly, and you'd just move VMs to another one, patch, reboot and get on with your life.

I'd completely agree, because it's true for anything I operate myself.

But I also know that computers are used by some in ways the vendors never imagined.

And that's not even always a bad thing, because I'd argue that it's mostly also abusing computers for things like gaming which advances the science.

And if someone built built an EPYC into something embedded and with a true unbreachable network gap to make OS updates much less of a necessity, this could come as a nasty surprise.

Thank God the Voyagers aren't running EPYC! Because juggling VMs and patches are no fun with signal round trip times of 35 hours!

bit_user · Jun 4, 2023

abufrejoval said:
Thank God the Voyagers aren't running EPYC! Because juggling VMs and patches are no fun with signal round trip times of 35 hours!

They're getting so low on power that some subsystems have had to get powered down. I'm not sure how much longer they can keep doing anything useful.

For that sort of thing, you'd want a processor that operates on mW of power. No way it'd be an EPYC-class CPU. The main draw I can see for even having multi-core is redundancy + spares.

News AMD's EPYC Rome Chips Crash After 1,044 Days of Uptime

Administrator

Reputable

Distinguished

Distinguished

Glorious

Distinguished

Honorable

Editor in Chief (Interim)

Splendid

Reputable

Deleted member 14196

Guest

Respectable

Titan

Titan

Editor in Chief (Interim)

Distinguished

Glorious

Glorious

Honorable

Distinguished

Honorable

Titan

Share this page