I could really use some help with this. It's been driving me crazy, not to mention interfering with my activities, for quite some time. I'm not even sure where to post this, as I'm not sure just where the root cause is.
There are two systems here. Both are similarly set up and have Gentoo Linux installed with similar configurations. Here are the hardware setups of the two systems:
System 1 (almost all data on the problem is from this one)
Motherboard: ASRock Z390M Pro4 rev. 1.01
CPU: Intel Core i7-9700
RAM: Crucial CT16G4DFD8266.C16FP DDR4 PC4-21300 16 GiB ×2 (tested a couple of times with MemTest86, but no errors reported)
Video: integrated graphics
System 2
Motherboard: ASRock Z370M Pro4
CPU: Intel Core i3-8100
RAM: DDR4 8 GiB ×2 (tested at least once with MemTest86 not too long ago)
Video: integrated graphics
For quite a while, I've been seeing this issue on system 1 (and it seems to be happening on system 2, though I'm not the one using it regularly, so I get less data there), where it will, seemingly randomly, hang. And by this, I mean, it becomes nearly completely unresponsive. The picture freezes, and if there is any audio playing, it will go into an endless loop, replaying whatever is in the buffer at the time.
It responds to almost no input, and cannot be reached over the network; even magic SysRq keys do nothing. All I can do at that point is either soft-off with the power button or press the reset button; the latter results in a reset after nearly 20 seconds. Until I do this, the system almost always stays in its hung state, possibly indefinitely, as I often leave it alone for hours and find it hung when I return. I have observed exactly one instance in which it reset on its own, which happened while I was away for a few minutes; I have no idea why, but it left the same type of evidence as every other instance.
After every reboot from a hang on system 1 (but not system 2, which is the biggest reason almost all the data is from 1), the kernel gives a BERT dump, which always starts like this:
Code:
BERT: Error records from previous boot:
[Hardware Error]: event severity: fatal
[Hardware Error]: Error 0, type: fatal
[Hardware Error]: section_type: Firmware Error Record Reference
[Hardware Error]: Firmware Error Record Type: SOC Firmware Error Record Type1 (Legacy CrashLog Support)
[Hardware Error]: Revision: 0
[Hardware Error]: Record Identifier: 100300100000000
[Hardware Error]: 00000000: 00000000 00000000 00000000 00000000 ................
What leads to this happening is still not something I understand very well (never mind the underlying cause). At first it seemed to happen only while I was watching a video, but it also often happens with no video playback and while I'm not actively doing anything. It's also not unusual for it to be triggered right when I do something in an application, usually Firefox.
I have tried doing/changing various things, thinking that maybe they have something to do with the problem, but pretty much every time it seems a hang occurs eventually. Some of these things include:
- Disabling desktop compositing (thinking of video playback, as I mentioned above, since that's using the GPU)
- Disabling VT-x (I run some VMs, so I need that on normally)
- Making sure microcode is up-to-date (at one point, it turned out not to be loaded)
- Not leaving Firefox running (since a lot of hangs have been triggered while Firefox is being used actively)
- Compiling the kernel with GCC, instead of Clang (for LTO) as I have been for a while (in case there is a compiler-related problem, as I seem to recall the problem got worse around the same time I switched to Clang)
- Compiling the kernel with a specific processor family option, instead of MNATIVE_INTEL (surprisingly, no hang for several days after I changed it, but I guess it was just a fluke)
- Probably some other things I forgot
I also tried leaving the system on a text VC running dmesg -w, hoping in vain I might catch some output that would give a clue (before realizing some X applications block undesirably when X's VC is inactive). Still, that makes me think graphics might have something to do with this.
It's also interesting that this is happening on different systems. That suggests to me the problem is not a one-off defect on one system's hardware, but is instead either something wrong with this family of ASRock boards or something in the HW or SW configuration. But what exactly that could be has so far eluded me.
I am providing the kernel config for each system (since they are custom config/build), along with the BERT data from the system log on system 1. For the latter, I wrote a script to parse the output of journalctl -o short-full, extracting the best approximation of the time of the hang (the last timestamp before the "-- Boot" marker, converted to UNIX epoch seconds), the current kernel, and the hexdump from the BERT report. The script converts the hexdump to binary and Base64-encodes it. Please note that Linux seems to skip dumping the start of the BERT data that it shows in decoded form.
Also note that this data does not include all instances, as I have kept logs only as far back as late October last year.
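For anyone who wants to reproduce the extraction, here is a minimal Python sketch of the approach described above. It is not the actual script, and it leaves out the kernel-version extraction. The names parse_hangs and to_epoch are made up for illustration. It assumes that timestamps can be treated as UTC, and that the 32-bit words in the hexdump are in little-endian host order (as expected on x86), so each word is packed back as little-endian bytes.

```python
import base64
import re
import struct
from datetime import datetime, timezone

# Prefix of a "journalctl -o short-full" line, e.g. "Mon 2021-12-06 14:03:12 UTC host ..."
STAMP = re.compile(r"^\w{3} (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ")
# One BERT hexdump row: an offset, then up to four 32-bit words in hex.
HEX_ROW = re.compile(r"\[Hardware Error\]:\s+[0-9a-f]{8}:((?: [0-9a-f]{8}){1,4})")


def to_epoch(stamp):
    """Convert 'YYYY-MM-DD HH:MM:SS' to epoch seconds (assumption: UTC)."""
    dt = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return int(dt.replace(tzinfo=timezone.utc).timestamp())


def encode(raw):
    """Base64-encode the reconstructed binary dump."""
    return base64.b64encode(bytes(raw)).decode("ascii")


def parse_hangs(lines):
    """Return (hang_epoch, base64_dump) for each boot that logged a hexdump."""
    records = []
    last_stamp = None   # most recent timestamp seen so far
    hang_epoch = None   # last timestamp before the current boot marker
    raw = bytearray()   # bytes rebuilt from the current boot's hexdump
    for line in lines:
        if line.startswith("-- Boot"):
            if raw:  # close out the previous boot's record
                records.append((hang_epoch, encode(raw)))
                raw = bytearray()
            hang_epoch = to_epoch(last_stamp) if last_stamp else None
            continue
        stamp = STAMP.match(line)
        if stamp:
            last_stamp = stamp.group(1)
        row = HEX_ROW.search(line)
        if row:
            for word in row.group(1).split():
                # Assumption: words were printed in little-endian host order.
                raw += struct.pack("<I", int(word, 16))
    if raw:
        records.append((hang_epoch, encode(raw)))
    return records
```

Feeding it the lines of a saved journalctl -o short-full dump (for example, via open(path).read().splitlines()) should give one tuple per hang. A short final hexdump row could be misread if its ASCII column happens to look like hex words, so the result is worth checking against the raw log.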
The information can be downloaded either as a .zip, or as separate zstd-compressed text files. If you graph the dates, you may notice one seemingly anomalous gap of nearly a week. This is not an error; the system really did not hang for that long. All I can say about this is that I had built and installed Linux 5.15.6, for what that is worth.
I'm also considering whether I can save the firmware config to provide even more information. But I'm stopping here for the moment. I appreciate any help anyone can provide.