[SOLVED] Very, very technical issue, I'm not expecting anybody to have an answer!

May 27, 2020
4
1
15
Hello guys, I have 4 identical systems:
Motherboard: ASRock X570M Pro4
GPU: RTX 2060 Super Founders Edition (I have 2 installed in this build, will explain why in a sec)
CPU: Ryzen 7 3700X
M.2 NVMe: SN750 1TB
The only thing that changes across the 4 systems is the RAM:
- 1st & 2nd systems have 32GB (2 x 16GB) G.Skill DDR4-3200MHz Ripjaws V
- 3rd system has 32GB (4 x 8GB) G.Skill DDR4-3200MHz Ripjaws V
- 4th system has 2 normal Nemix RAM sticks, 32GB (2 x 16GB) DDR4-2666MHz UDIMM

So these 4 computers are part of a medical machine that uses machine learning to help people every day, a very cool humanitarian project. Anyway, the two GPUs are there because the real-time analysis requires a lot of power (we are running Ubuntu).
For the past few weeks I've been running into continuous crashes of the GUI we created: as soon as the program loads the different environments, it crashes. Everybody reading this will rightfully think it's a software problem, NOT hardware, but here is where things get confusing. The only machine that doesn't crash at all is the 4th machine, with the slower unbuffered UDIMMs, while the ones with the G.Skill crash all the time. I swapped the two kits of RAM and got the same result: the normal 2666MHz UDIMMs do just fine and the 3200MHz G.Skill crashes. Everything else is identical; I even duplicated the same hard drive and ran multiple tests. Can somebody direct me to an actual explanation?
I'm not super amazingly trained in computers, but this doesn't make any sense. How is slower RAM preventing a crash?

I know that's not much data to work with and there are a million different variables, so it's not black and white, but if any more experienced person can clue me in on the direction to take, I may be able to figure it out.

Thanks to anybody who is willing to help me with this!
 
May 27, 2020
I guess the obvious question is "Have you tested the memory?"
I actually didn't test it, so it wasn't an obvious question! I took for granted that it's very unlikely I have 8 defective sticks of G.Skill RAM, because all the G.Skill machines behave identically under stress, but maybe I should give it a go. Thanks for the suggestion.

By the way, I just re-read my original message and I used the word "crashes" loosely. The system itself doesn't crash; the GUI just interrupts its analysis after 20 or so seconds, so it's a "program crash", not an overall system crash.
 
I don't think the memory has to be 'defective'; Ryzen CPUs just seem to be super picky about RAM.

I think Memtest takes about an hour to test 32 Gig. I usually call it good after 1 pass if I don't suspect a problem, but you probably want to run several passes if nothing shows up in the 1st pass.
 
May 27, 2020
The test is running right now; I'll do multiple passes.
Could you expand a little more on your first statement (Ryzen CPUs just seem to be super picky about RAM)?
I've heard that same line a few times by now, and I'm just trying to understand a bit more.
 

Deicidium369

Permanantly banned.
BANNED
Mar 4, 2020
390
61
290
The main issue is obvious. NeXT used the term "Mission Critical Computing"... if this is important. The most obvious question is: why is each machine a different config? Standardizing around one design would make sense. Reduce complexity by standardizing hardware.

Use a sniffer to see what is happening preceding the crashes. Test the memory, as someone above pointed out. If all the machines do not need to be connected / powered on, try taking one out of the loop and seeing if the crashes stop. If the systems stop crashing when one particular system is powered off, then you can narrow the testing to that one machine.

It could be software, but I lean towards hardware. TBH it sounds like a poor design.
 

Karadjgne

Titan
Ambassador
I'm leaning towards a memory controller thing. I'm betting the RAM is good, and even stable at rated speeds on simpler tests, but the memory controller is giving out at the minimal voltages when all 32GB comes into play on the higher-speed RAM.

I'd go into the BIOS on a G.Skill machine and bump up the SoC voltage by 0.05V-0.1V, not to exceed 1.2V (try this on the 4x8GB).

On another G.Skill machine, bump up the DRAM voltage by 0.05V-0.1V, not to exceed 1.4V (try this on a 2x16GB).

Try both on the last G.Skill machine.

See if that makes a difference.
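
For anyone who wants to sanity-check the numbers before touching the BIOS, here is a rough sketch of the arithmetic in the suggestion above. The starting voltages are assumptions for illustration only (1.10V SoC is a made-up placeholder, 1.35V DRAM is the XMP value mentioned later in this thread), and `bump_steps` is a hypothetical helper name; read the real values from your own board.

```python
def bump_steps(start_v, cap_v, low=0.05, high=0.10):
    """Return the voltages after the small and large bump, clamped to cap_v."""
    return tuple(round(min(start_v + step, cap_v), 3) for step in (low, high))


# Hypothetical starting points, NOT values read from the machines in this thread.
soc = bump_steps(start_v=1.10, cap_v=1.20)   # SoC voltage, cap of 1.2V
dram = bump_steps(start_v=1.35, cap_v=1.40)  # DRAM voltage, cap of 1.4V

print("SoC candidates: ", soc)
print("DRAM candidates:", dram)
```

With these assumed starting points the larger DRAM bump would overshoot the 1.4V ceiling, so the helper clamps it; from a lower starting voltage both steps would fit under the cap.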
 
Solution

Zerk2012

Titan
Ambassador
^^^^^^^^^^ Pretty much that, BUT I would lower the clock speed to 3000 and add 0.02 volts to the memory and work from there.

Following on from the posters above: if I run memtest, I let it run all night while I'm sleeping. Start memtest and let it run for about 10 minutes to get the memory nice and warm, then rerun it all night. One error after it has warmed up is a fail.
 
May 27, 2020

Thanks everybody for the help!
So I tested the RAM overnight and after many, many passes there were no errors at all, as you correctly guessed.
Then I went into the BIOS to change the RAM voltage, and I found out that I'm an idiot!
For some reason all the G.Skill RAM was running at 2133MHz, while the normal 2666MHz RAM was running at its stock speed. So I turned on XMP profile 1, and the voltage went from 1.2V to 1.35V (on the 2x16GB) and the speed to 3200MHz.
That solved the crashes completely, so increasing the voltage was the right thing to do. I just didn't know that the RAM doesn't run at its rated speed by default. Sorry for the inconvenience.

I really appreciate the help on this matter; otherwise I would have lost many more days figuring things out. Thanks again, guys!
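
Since the machines run Ubuntu, the configured RAM speed can probably be checked without rebooting into the BIOS. Below is a minimal sketch that parses text in the style of `sudo dmidecode --type memory` and flags DIMMs running below the kit's advertised speed. The sample text, slot names, and the function names `configured_speeds` and `below_rated` are made up for illustration; the field is usually labelled "Configured Memory Speed" (older dmidecode versions print "Configured Clock Speed"), but check the output on your own system.

```python
import re

# Hypothetical excerpt in the style of `dmidecode --type memory` output.
SAMPLE = """\
Handle 0x0011, DMI type 17, 84 bytes
Memory Device
    Locator: DIMM 0
    Speed: 3200 MT/s
    Configured Memory Speed: 2133 MT/s

Handle 0x0012, DMI type 17, 84 bytes
Memory Device
    Locator: DIMM 1
    Speed: Unknown
    Configured Memory Speed: Unknown
"""


def _first_int(value):
    """Return the leading integer in a field value, or None (e.g. 'Unknown')."""
    match = re.match(r"\s*(\d+)", value or "")
    return int(match.group(1)) if match else None


def configured_speeds(dmidecode_text):
    """Map each populated slot to its configured speed in MT/s."""
    speeds = {}
    for block in dmidecode_text.split("\n\n"):
        if "Memory Device" not in block:
            continue
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        raw = fields.get("Configured Memory Speed") or fields.get("Configured Clock Speed")
        speed = _first_int(raw)
        if speed is not None:  # skip empty slots that report "Unknown"
            speeds[fields.get("Locator", "?")] = speed
    return speeds


def below_rated(speeds, rated_mts):
    """Return only the slots running slower than the kit's advertised speed."""
    return {slot: s for slot, s in speeds.items() if s < rated_mts}


print(below_rated(configured_speeds(SAMPLE), rated_mts=3200))
```

The rated speed is passed in by hand (3200 for the G.Skill kits here) rather than read from the "Speed" field, because what the firmware reports there may not match the XMP rating printed on the box.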
 