Question Random hard system freezes

Status
Not open for further replies.

ddawson

Honorable
Sep 23, 2016
I could really use some help with this. It's been driving me crazy, not to mention interfering with my activities, for quite some time. I'm not even sure where to post this, as I'm not sure just where the root cause is.

There are two systems here. Both are similarly set up and have Gentoo Linux installed with similar configurations. Here are the hardware setups of the two systems:

System 1 (almost all data on the problem is from this one)
Motherboard: ASRock Z390M Pro4 rev. 1.01
CPU: Intel Core i7-9700
RAM: Crucial CT16G4DFD8266.C16FP DDR4 PC4-21300 16 GiB ×2 (tested a couple of times with MemTest86, but no errors reported)
Video: integrated graphics

System 2
Motherboard: ASRock Z370M Pro4
CPU: Intel Core i3-8100
RAM: DDR4 8 GiB ×2 (tested at least once with MemTest86 not too long ago)
Video: integrated graphics

For quite a while, I've been seeing this issue on system 1 (and it seems to be happening on system 2, though I'm not the one using it regularly, so I get less data there), where it will, seemingly randomly, hang. By this I mean it becomes almost completely unresponsive. The picture freezes, and if any audio is playing, it goes into an endless loop, replaying whatever is in the buffer at the time.

It responds to almost no input, and cannot be reached over the network; even magic SysRq keys do nothing. All I can do at that point is either soft-off with the power button or press the reset button; the latter results in a reset after nearly 20 seconds. Until I do this, the system almost always stays in its hung state, possibly indefinitely, as I often leave it alone for hours and find it hung when I return. I have observed exactly one instance in which it reset on its own, which happened while I was away for a few minutes; I have no idea why, but it left the same type of evidence as every other instance.

After every reboot from a hang on system 1 (but not system 2, which is the biggest reason almost all the data is from 1), the kernel gives a BERT dump, which always starts like this:
Code:
BERT: Error records from previous boot:
[Hardware Error]: event severity: fatal
[Hardware Error]:  Error 0, type: fatal
[Hardware Error]:   section_type: Firmware Error Record Reference
[Hardware Error]:   Firmware Error Record Type: SOC Firmware Error Record Type1 (Legacy CrashLog Support)
[Hardware Error]:   Revision: 0
[Hardware Error]:   Record Identifier: 100300100000000
[Hardware Error]:   00000000: 00000000 00000000 00000000 00000000  ................

What leads to this happening is still not something I understand very well (never mind the underlying cause). It used to seem to happen only while I was watching a video, but it now also often happens with no video playback and while I'm not actively doing anything. It's also not unusual for it to be triggered right when I do something in an application, usually Firefox.

I have tried doing/changing various things, thinking that maybe they have something to do with the problem, but pretty much every time it seems a hang occurs eventually. Some of these things include:
  • Disabling desktop compositing (thinking of video playback, as I mentioned above, since that's using the GPU)
  • Disabling VT-x (I run some VMs, so I need that on normally)
  • Making sure microcode is up-to-date (at one point, it turned out not to be loaded)
  • Not leaving Firefox running (since a lot of hangs have been triggered while Firefox is being used actively)
  • Compiling the kernel with GCC, instead of Clang (for LTO) as I have been for a while (in case there is a compiler-related problem, as I seem to recall the problem got worse around the same time I switched to Clang)
  • Compiling the kernel with a specific processor family option, instead of MNATIVE_INTEL (surprisingly, no hang for several days after I changed it, but I guess it was just a fluke)
  • Probably some other things I forgot
The only thing I have tried so far that seems to have stopped the hangs completely is to stay in a virtual console other than the one X is running in. I did this for a while, running dmesg -w, hoping in vain I might catch some output that would give a clue (before realizing some X applications block undesirably when X's VC is inactive). Still, that makes me think graphics might have something to do with this.

It's also interesting that this is happening on different systems. That suggests to me the problem is not a one-off defect on one system's hardware, but is instead either something wrong with this family of ASRock boards or something in the HW or SW configuration. But what exactly that could be has so far eluded me.

I am providing the kernel config for each system (since they are custom config/build), along with the BERT data from the system log on system 1. For the latter, I wrote a script to parse the output of journalctl -o short-full, extracting the best approximation of the time of the hang (the last timestamp before the "-- Boot" marker, which is converted to UNIX epoch seconds), the current kernel, and the hexdump from the BERT report. The script converted the hexdump to binary and Base64-encoded it. Please note that Linux seems to skip dumping the start of the BERT data that it shows in decoded form.
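For illustration only, here is a minimal Python sketch of the kind of parsing described above. It is not the actual script. It assumes that `journalctl -o short-full` lines look like `Mon 2021-12-06 14:23:01 UTC host kernel: message`, that timestamps are in UTC, that the boot separator line starts with `-- Boot`, that the kernel of interest is the one named in the `Linux version` line of the boot that hung, and that the hexdump words are little-endian 32-bit values (as on x86). The names `parse_journal` and `to_epoch` and the dict keys are hypothetical.

```python
import base64
import re
import struct
from datetime import datetime, timezone

# Assumed line shape: "Mon 2021-12-06 14:23:01 UTC host kernel: message"
LINE_RE = re.compile(r"^\w{3} (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \S+ \S+ (.*)$")
# One hexdump row: 8-digit offset, then 32-bit words, each followed by a space.
HEX_RE = re.compile(r"^\[Hardware Error\]:\s+[0-9a-f]{8}: ((?:[0-9a-f]{8} )+)")
VERSION_RE = re.compile(r"^Linux version (\S+)")


def to_epoch(stamp):
    """Convert 'YYYY-MM-DD HH:MM:SS' to epoch seconds, assuming UTC."""
    parsed = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


def parse_journal(lines):
    """Return one dict per boot that logged a BERT hexdump (hypothetical helper)."""
    records = []
    last_ts = None   # epoch seconds of the most recent log line
    kernel = None    # kernel version of the boot currently being read
    current = None   # record collecting hexdump words after a boot marker

    def flush():
        if current and current["words"]:
            # Assumption: words were printed as little-endian 32-bit values.
            raw = b"".join(struct.pack("<I", int(w, 16)) for w in current["words"])
            records.append({
                "hang_epoch": current["hang_epoch"],
                "kernel": current["kernel"],
                "bert_b64": base64.b64encode(raw).decode("ascii"),
            })

    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("-- Boot"):
            flush()
            # The last timestamp before the marker approximates the hang time.
            current = {"hang_epoch": last_ts, "kernel": kernel, "words": []}
            continue
        match = LINE_RE.match(line)
        if not match:
            continue
        last_ts = to_epoch(match.group(1))
        rest = match.group(2)
        if not rest.startswith("kernel: "):
            continue
        message = rest[len("kernel: "):]
        version = VERSION_RE.match(message)
        if version:
            kernel = version.group(1)
        row = HEX_RE.match(message)
        if row and current is not None:
            current["words"].extend(row.group(1).split())
    flush()
    return records
```

Feeding it the lines of a saved `journalctl -o short-full` dump should yield, for each reboot that logged a dump, the approximate hang time, the kernel version, and the Base64-encoded payload.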

Also note that this data does not include all instances, as I have kept logs only as far back as late October last year.

The information can be downloaded either as a .zip, or as separate zstd-compressed text files. If you graph the dates, you may notice one seemingly anomalous gap of nearly a week. This is not an error; the system really did not hang for that long. All I can say about this is that I had built and installed Linux 5.15.6, for what that is worth.
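For anyone who would rather not graph the dates, the gap can also be found by comparing consecutive hang times directly. A small Python sketch follows; `largest_gap` is a hypothetical helper, and the timestamps in the usage lines are made up for illustration, not taken from the logs.

```python
def largest_gap(epochs):
    """Return (gap_seconds, start, end) for the widest gap between consecutive hang times."""
    ordered = sorted(epochs)
    if len(ordered) < 2:
        return None
    # Pair each timestamp with its successor and keep the widest interval.
    return max((b - a, a, b) for a, b in zip(ordered, ordered[1:]))


# Made-up example values (epoch seconds), not real data from the logs.
gap, start, end = largest_gap([0, 86_400, 100_000, 700_000])
print(f"{gap / 86_400:.1f} days between {start} and {end}")
```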

I'm also considering whether I can save the firmware config to provide even more information. But I'm stopping here for the moment. I appreciate any help anyone can provide.
 
What BIOS version? Is it up to date?
I'll have to check and see what version it says, but it at least seems to be up to date. I have tried the online update function several times, but it always says it's up to date. Maybe I should download one? System info available from the OS says, "P4.30", which I think is the latest.

And the drivers provided by Gentoo (video, audio, etc.)?
Most drivers on Linux are in the kernel source. Gentoo, which is a mostly build-from-source distro, does carry a few patches for the kernel, but it certainly doesn't provide its own drivers. A cursory look at those patches doesn't show anything that looks related. X.org has its own video drivers, but the patches are not related to those. In short, if the problem is in software, I think it's most likely upstream. I suppose I could try building a kernel without Gentoo's patches, though.

Besides that, as X.org is usermode, I would expect most severe bugs to crash it but not lock up the system this hard. Then again, you never know, right?

There's also Wayland, which I've tried, but haven't switched to due to various shortcomings I experienced. If X.org is causing this, it's quite possible Wayland won't.
 
A fundamental problem with Linux is that motherboard manufacturers do not guarantee that their boards will work with any particular distro, given the sheer number of distros.

At some point I would test something like Ubuntu on your secondary PC, just to determine whether the OS is directly to blame.
 
And what Linux desktop?
I somehow missed this question until now. Well, mainly I'm using KDE/Plasma on X.org. That said, I have tried others, and as I recall, I still experienced a hang. That was some time ago, though, so I could check it again.

In the meantime, it might be good to know some more info about my graphics. The kernel config is already linked in the OP. In addition, here is a current Xorg.0.log. Anything else that would be helpful to know?
 
Hm, I strongly support the earlier suggestion to try another desktop (using a live desktop) to see if it makes a difference. Also check whether the issues you experience on the installed system are also present when using a live desktop.

The reason I support this idea is that some months ago I installed Fedora 35 with KDE/Plasma on my laptop and experienced a freeze during logoff; after I installed the Cinnamon spin of Fedora, that problem was gone.
 
Here is my make.conf. Note that there is quite a bit in package.use (a directory with package-specific files), but I don't imagine most of it is important here. Any suggestions on whether, and how, you want to see any content there?

And since you asked about that, you probably want to see what emerge --info says, I imagine. Well, here you go. And before anyone says anything about all of those overlays, I do have most of them locked down with */*::repo in my package.mask, so I don't get overlay versions I don't need or want.
 
Package.use is also a source of problems if you get aggressive with the build flags.

I see:
ACCEPT_KEYWORDS="~amd64"

You're living on the edge of stability there. If you want to run there you've got to accept some glitches from time to time. Same goes for any unmasks you may be using. They're masked for a reason.

Your best bet is to build something like an Ubuntu Live USB and run from there for a while to see if it clears up. I'm betting it's an overaggressive set of build flags mixed with a healthy dose of not-quite-ready-for-prime-time code pulled in by that "~amd64" flag.

And, if you haven't already, give the Gentoo forums a try: https://forums.gentoo.org/
 
Package.use is also a source of problems if you get aggressive with the build flags.
Of course. It's an extension of the flags in make.conf, after all. I just doubt most applications are capable of causing or triggering this type of problem. The kernel and X server, sure. Maybe Mesa? But not most regular stuff, which has no direct access to most hardware. In any case, I can post some or all of those files, though maybe you only want to see the flags in some of them?

I see:
ACCEPT_KEYWORDS="~amd64"

You're living on the edge of stability there.
No doubt you're right about this. But...

If you want to run there you've got to accept some glitches from time to time.
Key phrase: "from time to time". I upgrade the software once a week, and these hangs are happening a couple of times a week, give or take, and have been for quite a while. I would think that any instabilities in new versions would only rarely cause severe problems. Certainly, I don't see many such problems other than this. Application crashes are pretty unusual, for instance.

Same goes for any unmasks you may be using. They're masked for a reason.
The only unmasked packages I have installed right now are the few from overlays that I blanket-mask as I described earlier, so potential problems here are limited to those.

Your best bet is to build something like an Ubuntu Live USB and run from there for a while to see if it clears up.
Working on it. Already ran it for a little while, but it definitely needs more time. It's a balance of uptime. I'll report once it either hangs, or I think it's run for sufficient time that it doesn't look like it ever will hang.

I'm betting it's an overaggressive set of build flags mixed with a healthy dose of not-quite-ready-for-prime-time code pulled in by that "~amd64" flag.
Could be. What do you think of the flags in make.conf?

And, if you haven't already, give the Gentoo forums a try: https://forums.gentoo.org/
I'll think about it. But I came here because of my suspicion it's something that goes beyond the distro, or possibly even the software.

Incidentally, what about all of those BERT reports? Is there anything useful that can be done with them? Or do they really only exist so hardware/firmware engineers can debug things?
 
"Build Flag Bingo" can cause all sorts of problems that don't, on the surface, look even remotely related to the packages involved. The fact that you are reporting mostly X-related problems points directly at X and your desktop environment (KDE, GNOME, etc.), so start with the build flags there. Keep in mind, though, that you're in the testing tree, so the source itself may have some gotchas in it. The whole point of the testing tree ("~amd64" in this case) is to hold code that's still in the testing phase, for both the developers and brave souls (and sometimes the foolhardy as well). Therefore, do not expect anything resembling the stability of the non-testing tree for any packages available there. You should also be compiling your kernel and system with GCC only, as Clang has always given problems, going back many years.

Your flags in make.conf don't look that bad on an initial pass. Mine have been MUCH more aggressive over the years (and you don't even want to see what's in my package.use 😎). It's at the package level that things might be dicey.

As for Gentoo.org, that's your #1 source for everything Gentoo related. The developers live there and can likely pinpoint your issue in a matter of seconds (that's what the BERT reports are intended for).
 