Dual Xeon Build - Memory Issues

Status
Not open for further replies.

kuakman

Oct 21, 2017
Hi guys,

After spending a while looking for recommendations on Tom's Hardware for my next workstation rig, I finally decided to go with a dual-Xeon build. Here is the list of components:

- Chassis: Thermaltake 91 RGB Tempered Glass
- Motherboard: SuperMicro MBD-X11DAI-N-O
- Processors: 2x Intel Xeon Silver 4114 - 10 cores, 2.2 GHz - Socket LGA 3647 (Xeon Scalable processor family)
- Heatsinks: Supermicro 4U - SNK-P0070APS4
- Memory: 4x Crucial (Micron) 16GB DDR4-2666 RDIMM, 1.2 V, CL19 - Model#: CT16G4RFD8266
- Storage: Samsung 960 Evo M.2 - NVMe PCIe x4, 500GB
- Power Supply: EVGA SuperNova P2 1200 W
- GPU: EVGA FTW3 1080 Ti

Here are a few pictures:

[image]

[image]


The assembly of all the components went great.
But unfortunately, I ran into an issue when I booted the system for the first time.

One memory module is not detected by the BIOS, and I can't figure out a solution.
Here is the only warning (or message) I see during initialization (it doesn't always appear, only after I change something and save my BIOS configuration):

P1-DIMMD1: Control Clock Margin Eye width is too small

[image]


Here are more screenshots showing the CPUs detected, the total memory detected and the memory topology list:

[image]

[image]

[image]


This misconfiguration or memory issue is making my system unstable. For example, the machine restarts (shuts down and powers back on) after a few minutes with no errors. It seems random, and I assume it may be related to the unbalanced memory population between the two sockets. It is also preventing me from installing Windows 10 Pro from a USB drive, since the machine suddenly powers off and back on before the installation can complete.

Has anyone run into this problem before? I'm starting to get a little desperate, since I contacted Supermicro technical support and the answer I received was kind of vague. They suggested that I swap the CPUs, but they didn't mention anything about the warning I'm getting on that specific slot. I also tried a new memory module of the same type and speed (because I thought the old module in slot P1-DIMMD1 of CPU 1 was defective), but I get the same symptom even with the brand-new one. There are no beep codes from the internal speaker (I checked against the manual); everything seems perfect except for this.

I would appreciate any recommendations or directions from the experts here in the community on anything else I can do to solve this issue.
I haven't swapped the CPUs yet as Supermicro technical support suggested. I'd rather hear other thoughts before moving forward and trying more things. Disassembling the PHMs to swap them means cleaning off the thermal paste and matching the airflow direction of the air coolers again, which is a hard job, and honestly, the BIOS seems to detect the CPUs perfectly.

Thank you!
- kuakman
 
Maybe you can try to isolate the issue by installing only 2 memory modules.

Skip the slot named in the error message and leave the other slot associated with it empty.

In this case, you still get 32GB of memory, which should be more than enough for the OS installation.

Then run some stress tests to make sure all the current components are working reliably.
 


@rjsq1989 you mean leaving both CPUs installed the way they are right now and removing both memory modules from CPU 1? To give you more insight, here is the memory population I found in the motherboard manual:

For (2x) CPUs, and (4x) DIMMs:

CPU 1:
P1-DIMMA1 - OK [Works]
P1-DIMMD1 - [This is the problematic one; in the BIOS it appears blank, as shown in one of my pictures]

CPU 2:
P2-DIMMA1 - OK [Works]
P2-DIMMD1 - OK [Works]
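
To make the population above concrete, here is a small Python sketch that adds up what the BIOS should report with one slot missing. It is only an illustration: the names `slots` and `detected_per_cpu` are made up for this post, and the only inputs are the four 16GB modules and the slot names listed above.

```python
# Slot map from the list above; True means the BIOS detects the module.
DIMM_SIZE_GB = 16

slots = {
    "P1-DIMMA1": True,
    "P1-DIMMD1": False,  # appears blank in the BIOS
    "P2-DIMMA1": True,
    "P2-DIMMD1": True,
}


def detected_per_cpu(slot_map, dimm_size_gb=DIMM_SIZE_GB):
    """Sum the detected capacity per CPU, keyed by the 'P1'/'P2' slot prefix."""
    totals = {}
    for name, detected in slot_map.items():
        cpu = name.split("-")[0]
        totals[cpu] = totals.get(cpu, 0) + (dimm_size_gb if detected else 0)
    return totals


per_cpu = detected_per_cpu(slots)
total_gb = sum(per_cpu.values())
# The population is balanced only if every CPU sees the same capacity.
balanced = len(set(per_cpu.values())) == 1

print(per_cpu, total_gb, balanced)
```

With P1-DIMMD1 missing, CPU 1 sees 16GB and CPU 2 sees 32GB, so the total is 48GB and the two sockets are unbalanced.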

I will try removing both modules from CPU 1: P1-DIMMA1 and P1-DIMMD1.
Hopefully that will work... technically that motherboard can work with one CPU, but I guess socket 1 is mandatory.
Anyway, I will try what you suggest.

If I could avoid swapping the CPUs it would be a win, since it's tedious to clean off the thermal paste, reapply it, assemble the PHMs and set them up so that the airflow direction of the air coolers is correct, and so on. A lot of handling.
I would rather not mess with the CPUs, to avoid any possible risk of burning them out. The processors are $730 each LOL.

Alright, Thanks Much!
 


In this case, install 1 RAM stick for each CPU

in the P1-DIMMA1 & P2-DIMMA1 slots.

This should balance the RAM setup and allow the OS installation, if those sticks are working fine.

If this is successful, then you can verify each RAM stick by swapping them into these slots.

Then, if all the RAM sticks are in good condition, try installing RAM only in the P1-DIMMD1 & P2-DIMMD1 slots.

These steps are what you can do to troubleshoot. However, if the last step fails with the same error, you may still need to swap the CPUs to troubleshoot further.
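
The steps above can be written down as a small test plan. This is just a sketch in Python to keep track of which configuration to try next; the stick labels and the `isolation_plan` helper are hypothetical, and only the slot names come from this thread.

```python
# Hypothetical labels for the four RDIMMs, e.g. written on a piece of tape.
STICKS = ["stick1", "stick2", "stick3", "stick4"]

A1_SLOTS = ["P1-DIMMA1", "P2-DIMMA1"]
D1_SLOTS = ["P1-DIMMD1", "P2-DIMMD1"]


def isolation_plan(sticks):
    """Return a list of (description, {slot: stick}) configurations to test in order."""
    plan = []
    # First, run every stick through the A1 slots, one stick per CPU at a time.
    for i in range(0, len(sticks), 2):
        pair = sticks[i:i + 2]
        plan.append(("verify sticks in A1 slots", dict(zip(A1_SLOTS, pair))))
    # Then, with all sticks verified, populate only the D1 slots.
    plan.append(("test D1 slots only", dict(zip(D1_SLOTS, sticks[:2]))))
    return plan


plan = isolation_plan(STICKS)
for description, config in plan:
    print(description, config)
```

If the last configuration is the only one that fails, the sticks are ruled out and the suspects are the P1-DIMMD1 slot or the CPU behind it.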
 


Gotcha. OK, I finally got some time with the PC, so I'm ready to start debugging. I will get back to you with my findings... Thank you!
 
Alright, I just removed P1-DIMMD1 and P2-DIMMD1, and the PC restarts itself once the Windows 10 Pro installation reaches 26 ~ 28%, when it tries to copy the config files to the SSD. Same symptom. It shuts off and restarts the boot cycle after a random number of minutes, never more than 4-5.
I think the problem is something different. Could overheating on one of my processors be causing these sudden restarts? The motherboard doesn't give me any message, beep code or any sign of overheating. Could there be a setting in the BIOS I need to change so it doesn't shut off automatically? This is weird.

Here is more insight into how I built the machine:

The Supermicro heatsinks I bought came with a pre-applied thermal pad. On my first attempt at installing both CPUs, I assembled the PHMs (CPU + heatsink) and chose the wrong airflow direction, so I had to disassemble CPU 1 and fix the direction so that both air coolers point the same way. Unfortunately, to do that I had to reapply thermal compound using thermal paste (Supermicro doesn't sell the thermal pads that originally come with the heatsinks separately). So I used this thermal paste: https://www.amazon.com/gp/product/B004ULZITS/ref=oh_aui_detailpage_o02_s00?ie=UTF8&psc=1

So I guess what I can do is keep removing parts, like CPU 2 (the one with the original thermal pad) plus P2-DIMMA1, leaving CPU 1 with P1-DIMMA1.
What do you think?

Unfortunately the BIOS doesn't provide information about the CPU temperatures, or even the fan speed of the air coolers, to see if they are running hot. That is a bummer.
 
OK, here is more information. Reading the manual, I found a utility I can use called IPMI. I connected a network cable to my router and accessed the BMC admin page from my laptop on the same network. There I could verify that all the temperatures are normal, with no overheating or anything like that.

[image]


All is green, so that's good. I removed the memory sticks you mentioned to isolate the issue.
Temperatures seem normal. The only one that is a little high is the South Bridge (PCH), which climbs up to 52 degrees Celsius; the utility indicates that is still normal.
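
For anyone who would rather check these readings from a script than from the BMC web page, here is a rough Python sketch. It assumes the readings are available as text in the pipe-separated style that `ipmitool sdr type Temperature` prints. The sample below is hypothetical: the sensor names, IDs and CPU values are made up, and only the 52 degree PCH reading comes from my screenshot.

```python
# Hypothetical sample in the pipe-separated style of `ipmitool sdr type Temperature`.
SAMPLE = """\
CPU1 Temp        | 01h | ok  |  3.1 | 38 degrees C
CPU2 Temp        | 02h | ok  |  3.2 | 36 degrees C
PCH Temp         | 0Ah | ok  |  7.1 | 52 degrees C
"""


def parse_temps(text):
    """Return {sensor name: (degrees C, status)} for every temperature line."""
    readings = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        # Skip anything that is not a five-column temperature reading.
        if len(fields) != 5 or not fields[4].endswith("degrees C"):
            continue
        readings[fields[0]] = (int(fields[4].split()[0]), fields[2])
    return readings


readings = parse_temps(SAMPLE)
hottest = max(readings, key=lambda name: readings[name][0])
not_ok = [name for name, (_, status) in readings.items() if status != "ok"]

print(hottest, readings[hottest][0], not_ok)
```

An empty `not_ok` list corresponds to the "all green" state on the BMC page, and `hottest` points at the PCH, which matches what I see.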

I think I'm going to need a different USB drive; maybe there is something wrong with the Windows 10 Pro image I copied onto the pen drive. I will keep investigating why on earth the motherboard restarts the system. Even when the PC is doing nothing, it reboots again and again after a few minutes. I'm pretty sure it's one of those issues that is right in front of your eyes and you can't see it. I can't find any option in the BIOS to tell the motherboard to stop restarting the system suddenly and let me continue with the installation.
I will keep investigating and working on it...
 
Oh jeez! It turned out to be that watchdog utility that was causing these restarts. That guy made my day.
Alright, I'm working on the OS installation. After that I will start installing the memory sticks again and continue my debugging...
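
For anyone hitting the same thing, here is a toy Python model of why an enabled watchdog looks like random restarts. It is only a sketch of the general idea, not the board's actual logic: the 300 second timeout is an assumption picked to match the 4-5 minutes I was seeing, and the `Watchdog` class and `first_reset` helper are made up.

```python
class Watchdog:
    """Toy hardware watchdog: it fires unless something feeds it in time."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_fed_s = 0

    def feed(self, now_s):
        self.last_fed_s = now_s

    def expired(self, now_s):
        return now_s - self.last_fed_s >= self.timeout_s


def first_reset(timeout_s, feed_times_s, horizon_s):
    """Return the second at which the watchdog fires, or None within the horizon."""
    dog = Watchdog(timeout_s)
    feeds = set(feed_times_s)
    for now in range(1, horizon_s + 1):
        if now in feeds:
            dog.feed(now)
        elif dog.expired(now):
            return now
    return None


# Nothing feeds the timer (e.g. sitting in the BIOS or an OS installer): reset.
no_agent = first_reset(300, [], 3600)
# Something feeds it every minute: no reset within the hour.
with_agent = first_reset(300, range(60, 3600, 60), 3600)

print(no_agent, with_agent)
```

With nothing feeding the timer the machine resets at the timeout no matter what it is doing, which is why it looked unrelated to the memory or the temperatures.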
 
Alright, I installed Windows 10 Pro successfully, and the processors are fine.
I tried switching memory sticks, and slot P1-DIMMD1 still gets that warning in the BIOS.
It detects 48 GB of RAM, with the same warning. I'm afraid the motherboard slot is damaged or there is something wrong with it.

P1-DIMMD1: Control Clock Margin Eye width is too small

I'm not sure where that issue comes from. Honestly, I don't think swapping the processors will help, since everything is normal when I run with 2 memory sticks.

Well, I want to try more things before I start thinking of returning the motherboard and getting a replacement.
I will do some more research and, as a last resort, run memtest86 from a flash drive. Other than that, I can't find any information about that error anywhere. Supermicro is not being helpful about that specific error or warning.

At least I got the OS going so far.
 