Supermicro X8DAi can't install 2nd cpu


KansasA

Distinguished
Jul 13, 2011
My backordered second CPU arrived, but when I install it in my 7046A-T (X8DAi motherboard) the power LED stays off and I get a flashing red LED indicating chassis fan failure. The monitor never lights up. I have been running WHS 2011 fine with one CPU for almost a week. The CPUs are Intel Xeon E5506 Quad Core 2.13GHz LGA1366 4MB 4.8GT/s Nehalem retail processors. The machine runs fine with one CPU, and I've even swapped the CPUs and heatsinks around: each one works on its own, but not when both are installed. I can't figure out what the problem is. Does anyone know?

Edit to add: I updated my BIOS with a beta from Supermicro and it didn't help. I've switched CPUs and both work fine independently in either the CPU 1 or CPU 2 socket, so it's not bent pins. I'm at a loss here and hoping to hear back from Supermicro soon.

Edit: Supermicro has issued me an RMA. It's going to cost a bundle to ship this server back!
 
Well they are finally shipping me a replacement board. It seems as though they have thoroughly tested it, and everything seems OK. I will keep my fingers crossed and pray that it makes it here fine and still works OK. It is scheduled to be delivered by 3 PM tomorrow. Maybe the motherboard will help with some of the other problems.

My LSI 9260-4i MegaRAID controller card with 512 MB DDR2 onboard cache has an alarm present, but I have disabled it. In MegaRAID manager, it says that the Virtual Disk State is Optimal, and that all four drives in the RAID5 array are Online. There are no Media Count Errors or Predicted Failure Counts on any of them. The MRM does not list any errors. Any idea what could be going on? Is it safe to ignore the alarm (which is set to disabled by default I do believe), or should I be worried?

I will keep everyone updated.
 


Wow you have really been around the block with this haven't you? On the plus side you will know the system inside and out. :) But what a hassle for sure.
No idea why an alarm would be present, especially if no error is listed, google might help? Firmware update maybe? It would bug me knowing there's something there.
 
I did in fact flash the MegaRAID controller with the newest BIOS. It is getting worse and worse. The maximum read speeds from the RAID5 array have gone from around 850 MB/s to around 400, and the Sustained Read speeds have gone from around 350 MB/s to 275, and I have been having system reboots with Fatal cache errors. Event log said that the file system on a certain partition had become corrupt and unusable. I ran chkdsk and that seemingly fixed the problem, but I have been having unexplained reboots. I am going to RMA the controller.

As for good news, my replacement motherboard is working just fine with both physical CPUs, and all 24 threads. I did some more testing on the bank of memory, but it was only one stick that was bad, so I am RMAing it too. I have the other two 4GB modules, but because this machine uses triple channel memory, I will wait until I get the replacement before I add it and the other two. I am hoping that the failing LSI MegaRAID card is the cause of the random reboots. Other than the reboots, the MB has been working fine, and I think they may have shipped me a brand new one, but I am not certain.

There is another strange occurrence. BIOS, CPU-Z, System Information, and other applications all register 36GB. Task Manager registers 36GB under Total Physical Memory, but the Commit Charge is 10/35, for example, depending on how much memory I am using, not 10/36. No big deal but I still wonder why. I can't wait to add the other three modules for a total of 48GB.

I made another purchase recently, because I do racing sims and listen to music occasionally (I usually listen to talk radio, so that doesn't really justify it, but even that is notably better). The main thing I couldn't stand about my old speakers was that the wires always needed to be adjusted to get sound, or to get rid of static. They were a cheap model, and I replaced them with the Bose Companion 3, series 2.

I bought it mainly because my old speakers sucked, and I wanted decent audio quality. I have always wanted to try Bose. The deciding factor is that Sam's Club had it on their discount rack for $135.79, instead of the $194.74 regular price. The only reason for the markdown was it was open box. I cut it open and looked inside, and it was in pristine condition, and didn't even look like it had been used. For a savings of over $60 (with tax), I couldn't go wrong. They usually sell closer to $200 online.

I know it is not the absolute best sound system in the world, but it is excellent and more than met my expectations. The sound is truly amazing for the size and price. I would recommend this system to anyone. The desktop speakers take up a very small amount of space, and the Acoustimass Module hides away and the effect is amazing. It really spreads out the sound and makes it come to life. If you close your eyes, you can visualize where all of the musicians are, and so forth. If you don't mind paying full price, go ahead and find one online or even better check for deals.

I have several rules for deciding on online purchases. For a well-known retailer such as Newegg or Amazon, I look for a high customer rating: at least 98% satisfaction for sellers, or, for egg ratings, around 80% of reviews at 4 or 5 eggs combined (for example, out of 100 reviews, I would prefer at least 70 with 5 eggs and 10 or more with 4 eggs). For unknown sellers or companies, I first use NetCraft to determine how long they have had a website up and such. If a deal seems too good to be true, it probably is. Then I check RipoffReport, BBB, and ConsumerReports.org to start with. I have a Consumer Reports subscription, which I renewed for about $19.00 for one year; I think the normal price is $26 a year. It is well worth it. It pays for itself in one purchase in many cases.

My parents chose not to use it, or maybe I did not offer first. They purchased an inexpensive dryer, and out of about 90 models, they picked one that had only three models rated worse than it. That means about 85 models were better than this one, and a Consumer Reports Best Buy model which was way up on the scale only cost around $100 extra. How stupid on their part. Oh well, it is their dryer and their money. OK, enough of my rants. I am starting to sound like a salesman.

Hopefully when I get my controller and memory module replaced, I will have the perfect system (until I can afford an 8 way server). I have over $5750 in the whole system, including everything, so I just want everything to work properly. I wonder if the motherboard has not been responsible for some of the other problems. Oh well, I see light at the end of the tunnel. Hopefully, with the new stick of RAM, and controller, I will be golden.

Next is an 800 watt Honeywell inverter-type generator, which is designed for sensitive electronics, and an onboard battery backup for the LSI MegaRAID card; then I can turn off Windows write cache flushing to disk altogether. An SSD for my pagefile and CacheCade software are also on the wanted list. I have room for 4 more 3.5" hard drives in the backplane, but I am not certain whether I can use the onboard LSI 2008 SAS controller for the other 4 drives.
 
Supermicro got their board back last Thursday, and I am waiting on my $536.00 refund. Won't that be nice? Everything seems to be working great, except I get random reboots from time to time. Seeing as how I have already RMA'd two motherboards, and this one works with both physical CPUs (all 12 cores, and all 24 threads), and all 48 GB of ECC Registered Server Memory, I think I will keep it, and hope to find a solution elsewhere. I got my replacement memory module, and RAID card. The memory works fine, but I will get into the Controller Card later.

I have had 10 reboots since 10/1/2011, and they do not seem to be related, but I am coming up with a theory. Yesterday it crashed while playing F1 2010 (a racing sim), and today during a GPU stress test the screen turned white a little after an hour. The GPU never got over 60C, and it is a Radeon IceQ 5670, the only piece of consumer grade hardware that I have on my server. I could not do anything. It locked up, and Ctrl+Alt+Del did nothing, nor did Ctrl+Shift+Esc. The strange thing is the Num Lock worked, but not Scroll Lock or Caps Lock. I had to manually press the power button, and I got the exact same event in the event log as the other nine: Event 41. Then, during or after POST, I get this message from the LSI controller:

Cache data was lost due to an unexpected power-off or reboot during a write operation, but the adapter has recovered. This could be due to memory problems, bad battery, or you may not have a battery installed. Press any key to continue, or 'C' to load the configuration utility.
_

Well I don't yet have a BBU for the controller, so the battery is not the problem. I am going to get one soon, hopefully this week. The system memory is just fine, but I assume they may be talking about the memory on the card, but if the RMA replacement is having memory problems, then it needs to go back too. Could the controller card work with bad memory, and all the speed is just from the four disk RAID5 array? It has a sticker on it that does not instill a great deal of confidence. It reads, "Serviceable Used Part". That means to me, it was ready for the trash and someone said no, it still works.

The funny thing is that it is slower than the one I was planning on RMAing. I moved it from the lowest slot, closest to the bottom of the case, where there was less than half an inch between the heat spreader and the case to the PCI-Express slot above it, and the speeds went up, dramatically. My guess is that it got better airflow there, or for some reason that slot performs better, although not as good as my card when new. I thought they were all X8. It makes no sense to me.

I have run Windows Memory Diagnostics several times, including the option where you press F1 and can run all of the more advanced memory tests, which takes several hours. It came back fine with no problems. The other day I ran System Stability Test - AIDA64 [TRIAL VERSION] for over one and a half hours, with the options checked to Stress CPU, FPU, cache and system memory. None of the cores really got much hotter than 65C, maybe 68C for a second, but there were no problems. I ran Prime95 with 24 execution threads for three hours, 16 minutes, in which time it completed 78 tests with 0 errors, and 0 warnings. I monitor the voltage in Supermicro SD3, and they are always well within tolerance (although I have never been staring at that screen when a crash happens).


Even though I have told the computer not to automatically restart on system crashes, there still is no blue screen, which makes me all warm and fuzzy inside, but keeps me from figuring out what the problem is. Do power switches ever go bad? I know they may quit working, but have you ever known of them to shut down the computer themselves? I am just trying to think of anything. I want to offer web-hosting service, but I must have a rock-solid machine which does not crash in order to do so.

Gag me with a spoon, I could shut the machine down, change the jumper, take out my 1GB 5670 Radeon card, and switch back from HDMI to VGA, then hook it onto the motherboard's onboard 8MB video. If it ran fine for a week or so, I think I could safely say the video card is the culprit. I don't want to do that because it would suck. My buddy said I should put it in a colo and vpn when I need to, but just let it sit there and run. I know this is a server, but I built it as a dual purpose machine. With this much money in it, I want to play with it too!

Here is the simple error that I get:

The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.

Here is the more technical log (which I have put an X in the place of some numbers or letters)


<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Kernel-Power" Guid="{XXXXXXXX-XXXX-XXX-XXXX-XXXXXXXXXXXX}" />
    <EventID>41</EventID>
    <Version>2</Version>
    <Level>1</Level>
    <Task>63</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000002</Keywords>
    <TimeCreated SystemTime="2011-11-14T01:34:35.938850000Z" />
    <EventRecordID>278807</EventRecordID>
    <Correlation />
    <Execution ProcessID="4" ThreadID="8" />
    <Channel>System</Channel>
    <Computer>XXXX</Computer>
    <Security UserID="X-X-X-XX" />
  </System>
  <EventData>
    <Data Name="BugcheckCode">0</Data>
    <Data Name="BugcheckParameter1">0x0</Data>
    <Data Name="BugcheckParameter2">0x0</Data>
    <Data Name="BugcheckParameter3">0x0</Data>
    <Data Name="BugcheckParameter4">0x0</Data>
    <Data Name="SleepInProgress">false</Data>
    <Data Name="PowerButtonTimestamp">0</Data>
  </EventData>
</Event>
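
For anyone who wants to check several of these events at once, here is a minimal Python sketch that reads one Kernel-Power event in the XML form shown above and pulls out the fields that matter. The function name `summarize_event_41` and the trimmed `SAMPLE` text are made up for illustration. The three-way hint follows the usual reading of Event 41 as I understand it (a nonzero BugcheckCode points to a stop error, a nonzero PowerButtonTimestamp points to the power button being held, and zeros in both point to a hard hang or a loss of power), so treat it as a starting point, not a diagnosis.

```python
import xml.etree.ElementTree as ET

# Namespace used by Windows event XML (copied from the log above).
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Trimmed copy of the event from the post, kept only for demonstration.
SAMPLE = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Kernel-Power" />
    <EventID>41</EventID>
    <TimeCreated SystemTime="2011-11-14T01:34:35.938850000Z" />
  </System>
  <EventData>
    <Data Name="BugcheckCode">0</Data>
    <Data Name="SleepInProgress">false</Data>
    <Data Name="PowerButtonTimestamp">0</Data>
  </EventData>
</Event>"""


def summarize_event_41(xml_text):
    """Hypothetical helper: extract the key fields of one Event 41 record."""
    root = ET.fromstring(xml_text)
    event_id = int(root.find("e:System/e:EventID", NS).text)
    when = root.find("e:System/e:TimeCreated", NS).get("SystemTime")
    data = {d.get("Name"): d.text for d in root.findall("e:EventData/e:Data", NS)}

    # Base 0 lets int() accept both plain decimal ("0") and hex ("0x0") text.
    bugcheck = int(data.get("BugcheckCode", "0"), 0)
    button = int(data.get("PowerButtonTimestamp", "0"), 0)

    if bugcheck != 0:
        hint = "a stop error was recorded; look up the bugcheck code"
    elif button != 0:
        hint = "the power button was held down"
    else:
        hint = "no stop error and no button press recorded: hard hang or power loss"

    return {"event_id": event_id, "time": when, "bugcheck": bugcheck,
            "power_button": button, "hint": hint}


if __name__ == "__main__":
    print(summarize_event_41(SAMPLE))
```

One way to get the raw XML out of Windows is `wevtutil qe System /q:"*[System[(EventID=41)]]" /f:xml`, with each resulting event passed to the function separately.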

Here is the friendly version:

- System

- Provider

[ Name] Microsoft-Windows-Kernel-Power
[ Guid] {XXXXXXXX-XXXX-XXX-XXXX-XXXXXXXXXXXX}

EventID 41

Version 2

Level 1

Task 63

Opcode 0

Keywords 0x8000000000000002

- TimeCreated

[ SystemTime] 2011-11-04T04:34:56.531651000Z

EventRecordID 267944

Correlation

- Execution

[ ProcessID] 4
[ ThreadID] 8

Channel System

Computer XXXX

- Security

[ UserID] X-X-X-XX


- EventData

BugcheckCode 0
BugcheckParameter1 0x0
BugcheckParameter2 0x0
BugcheckParameter3 0x0
BugcheckParameter4 0x0
SleepInProgress false
PowerButtonTimestamp 0

This is driving me nuts. The main reason I spent this type of money on this machine was to make it bulletproof, but my old P4 that I spent about a third the price on 5 years ago, when parts were expensive, has proven much more reliable in some cases. I know I should have started a new thread, but someone please help! I am at my wits' end on this. It is not right to have over $5,000 in an Enterprise Server, and have all these problems.

I will be up to $6250 with a generator and a battery backup unit for my LSI 9260-4i controller, for everything. Let's not even start thinking about adding four more hard drives, and then we will be topping the $7K mark! What have I gotten myself into? Not bragging, but am seriously starting to wonder why I put so much money into this system.

I could enable Hibernation at 100%, and use a full sized Pagefile. I wonder if that would help. That would only waste 98 Gigabytes of hard drive space. That is 2GB less than my Boot, Page File and Crash Dump Partition (where Windows is located). My system reserved partition is a measly 100MB. I don't suppose that I can put the Hibernation and Pagefile files on a different partition, and expect it to work properly. I am of the impression that they have to be on the root partition.
 
Sorry, I'm at a loss as to what it could be. How long is the system on for before you get the error? Could it be a heat issue? Have you considered installing more fans?
 
It did it yesterday as I was replying to this message. I think it might have to do with simultaneous access of the SMBus, but I am not sure how. SpeedFan's Exotics page killed it. I had a game paused and was writing an email when it happened today. Whatever the cause, I am getting sick of it. Intel lists the TCASE as 81.3°C, although I am not exactly certain what that means. I could probably benefit from two more fans and will most likely get them, but nothing in my system has ever gotten anywhere near that temperature. Under the most severe CPU testing, with all 24 threads running at 100%, the cores rarely go above 65C, and at rest they usually read lower than the system temperature. That is why Intel started reporting a reading of "Low": they said that below around 45 or 50C, I think, the core temperature measurements are not that accurate. Most everything in my system stays at or below 45C during general usage, with the power supply closer to 50C (it is the super-quiet 875 watt model).

Just Error 41, your system did not cleanly restart. I think I might have another bad LSI adapter. What are the odds of that? Who knows. It took three motherboards from Supermicro to get it right. It seems like CPUID's Hardware Monitor and CPU-Z work fine, but their PC Wizard 2010 makes it crash (I think, so I don't want to try again). How about this for good measure: I have seen this exact thing posted only one other time, so maybe I need to leave off information. There was no resolution, although it came out that he had things in the 120C range in there. Gee, I wonder why increased CPU usage causes my computer to slow down. The guy asked for answers for weeks before that was discovered. There was not another post after that. Either his computer bit the dust, or he wised up and purchased some canned air. I found this in Device Manager:

General tab, Device status -No drivers are installed for this device

Driver tab lists below named as installed

Intel(R) 7500/5520/5500/X58 I/O Hub I/OxAPIC Interrupt Controller - 342D

Resource settings:
Memory Range 00000000FEC8A000 - 00000000FEC8AFFF

Conflicting device list:
Memory Range 00000000FEC8A000 - 00000000FEC8AFFF used by:
ACPI x64-based PC
System board

Under the details tab, with Power Data selected, it lists:
Current power state:
D3

Power capabilities:
00000099
PDCAP_D0_SUPPORTED
PDCAP_D3_SUPPORTED
PDCAP_WAKE_FROM_D0_SUPPORTED
PDCAP_WAKE_FROM_D3_SUPPORTED

Power state mappings:
S0 -> D0
S1 -> D3
S2 -> Unspecified
S3 -> Unspecified
S4 -> D3
S5 -> D3
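
The power capabilities value 00000099 listed above is a bitmask, and the four PDCAP names under it are simply the bits that are set. A small sketch of the decoding follows. The bit values in the table are the ones I recall from the Windows SDK headers, so treat them as an assumption; they do reproduce exactly the four names Device Manager printed for 0x99, which is reassuring. `decode_power_caps` is a name made up for this example.

```python
# Bit values as recalled from the Windows SDK headers (assumption, see above).
PDCAP_FLAGS = {
    0x001: "PDCAP_D0_SUPPORTED",
    0x002: "PDCAP_D1_SUPPORTED",
    0x004: "PDCAP_D2_SUPPORTED",
    0x008: "PDCAP_D3_SUPPORTED",
    0x010: "PDCAP_WAKE_FROM_D0_SUPPORTED",
    0x020: "PDCAP_WAKE_FROM_D1_SUPPORTED",
    0x040: "PDCAP_WAKE_FROM_D2_SUPPORTED",
    0x080: "PDCAP_WAKE_FROM_D3_SUPPORTED",
    0x100: "PDCAP_WARM_EJECT_SUPPORTED",
}


def decode_power_caps(value):
    """Return the names of all capability bits set in value, lowest bit first."""
    return [name for bit, name in sorted(PDCAP_FLAGS.items()) if value & bit]


if __name__ == "__main__":
    # 0x99 = 0x80 + 0x10 + 0x08 + 0x01
    for name in decode_power_caps(0x99):
        print(name)
```

Read this way, the device only reports D0 (fully on) and D3 (off) states, which would be consistent with S2 and S3 showing as Unspecified in the mappings.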

I do not know if I am looking squarely at the problem, or if this is pretty much standard. I am generally of the opinion that conflicting devices are not a good thing, although I am not certain that anything can be done in this case. I also don't know about the S2 and S3 states. This might be worth a further look, although I do not know how I could fix such a thing. Is the computer blaming itself?
 
In system Info, the following conflicts/sharing are listed:

I/O Port 0x00000000-0x0000000F Direct memory access controller
I/O Port 0x00000000-0x0000000F PCI bus

I/O Port 0x000003C0-0x000003DF ATI Radeon HD 5600 Series
I/O Port 0x000003C0-0x000003DF Intel(R) 7500/5520/X58 I/O Hub PCI Express Root Port 5 - 340C

IRQ 10 Intel(R) Chipset QuickData Technology device - 3431
IRQ 10 Intel(R) Chipset QuickData Technology device - 342A

IRQ 11 Intel(R) Chipset QuickData Technology device - 3429
IRQ 11 Intel(R) Chipset QuickData Technology device - 3430

IRQ 23 Intel(R) ICH10 Family USB Enhanced Host Controller - 3A3A
IRQ 23 Intel(R) ICH10 Family USB Universal Host Controller - 3A34

IRQ 14 Intel(R) Chipset QuickData Technology device - 3432
IRQ 14 Intel(R) Chipset QuickData Technology device - 342B
IRQ 14 Intel(R) ICH10 Family SMBus Controller - 3A30

IRQ 15 Intel(R) Chipset QuickData Technology device - 3433
IRQ 15 Intel(R) Chipset QuickData Technology device - 342C

IRQ 16 Intel(R) ICH10 Family USB Universal Host Controller - 3A37
IRQ 16 Intel(R) ICH10 Family PCI Express Root Port 6 - 3A4A

IRQ 17 Intel(R) ICH10 Family PCI Express Root Port 1 - 3A40
IRQ 17 Intel(R) ICH10 Family PCI Express Root Port 5 - 3A48

Memory Address 0xD0000000-0xDFFFFFFF ATI Radeon HD 5600 Series
Memory Address 0xD0000000-0xDFFFFFFF Intel(R) 7500/5520/X58 I/O Hub PCI Express Root Port 5 - 340C

IRQ 18 Intel(R) ICH10 Family USB Enhanced Host Controller - 3A3C
IRQ 18 Intel(R) ICH10 Family USB Universal Host Controller - 3A36

IRQ 19 Intel(R) ICH10 Family 4 port Serial ATA Storage Controller 1 - 3A20
IRQ 19 Intel(R) ICH10 Family USB Universal Host Controller - 3A39
IRQ 19 Intel(R) ICH10 Family 2 port Serial ATA Storage Controller 2 - 3A26
IRQ 19 Intel(R) ICH10 Family USB Universal Host Controller - 3A35

Memory Address 0xA0000-0xBFFFF ATI Radeon HD 5600 Series
Memory Address 0xA0000-0xBFFFF PCI bus
Memory Address 0xA0000-0xBFFFF Intel(R) 7500/5520/X58 I/O Hub PCI Express Root Port 5 - 340C

I/O Port 0x000003B0-0x000003BB ATI Radeon HD 5600 Series
I/O Port 0x000003B0-0x000003BB Intel(R) 7500/5520/X58 I/O Hub PCI Express Root Port 5 - 340C

Memory Address 0xFEC8A000-0xFEC8AFFF Intel(R) 7500/5520/5500/X58 I/O Hub I/OxAPIC Interrupt Controller - 342D
Memory Address 0xFEC8A000-0xFEC8AFFF System board

I/O Port 0x0000D000-0x0000D0FF ATI Radeon HD 5600 Series
I/O Port 0x0000D000-0x0000D0FF Intel(R) 7500/5520/X58 I/O Hub PCI Express Root Port 5 - 340C

Does that look bad? Is there an easy fix? I am not sure how to assign an IRQ, or if I should just let the system handle it. It looks to my untrained eye like the graphics card might be wreaking havoc, but I am no expert. What do you think?
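
A list like this is easier to judge once it is grouped by resource. Below is a rough Python sketch that does the grouping and then reports which devices share anything with the graphics card. The `SAMPLE` text is a few lines copied from the listing above; `group_by_resource` and `partners` are names invented for the example, and the line pattern is only an assumption about how System Information formats these rows.

```python
import re
from collections import defaultdict

# One row = a resource ("IRQ 14", "I/O Port 0x..-0x..", "Memory Address 0x..-0x..")
# followed by the device name. The pattern is an assumption based on the rows above.
LINE = re.compile(
    r"^(IRQ \d+|(?:I/O Port|Memory Address) 0x[0-9A-Fa-f]+-0x[0-9A-Fa-f]+)\s+(.+)$"
)

SAMPLE = """\
I/O Port 0x000003C0-0x000003DF ATI Radeon HD 5600 Series
I/O Port 0x000003C0-0x000003DF Intel(R) 7500/5520/X58 I/O Hub PCI Express Root Port 5 - 340C
IRQ 14 Intel(R) Chipset QuickData Technology device - 3432
IRQ 14 Intel(R) Chipset QuickData Technology device - 342B
IRQ 14 Intel(R) ICH10 Family SMBus Controller - 3A30
Memory Address 0xA0000-0xBFFFF ATI Radeon HD 5600 Series
Memory Address 0xA0000-0xBFFFF PCI bus
Memory Address 0xA0000-0xBFFFF Intel(R) 7500/5520/X58 I/O Hub PCI Express Root Port 5 - 340C
"""


def group_by_resource(text):
    """Map each resource to the list of devices reported as using it."""
    groups = defaultdict(list)
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            groups[match.group(1)].append(match.group(2))
    return dict(groups)


def partners(groups, keyword):
    """Devices that share at least one resource with a device matching keyword."""
    found = set()
    for devices in groups.values():
        if any(keyword in d for d in devices):
            found.update(d for d in devices if keyword not in d)
    return sorted(found)


if __name__ == "__main__":
    groups = group_by_resource(SAMPLE)
    for resource, devices in groups.items():
        print(resource, "->", len(devices), "devices")
    print(partners(groups, "Radeon"))
```

In the listing above, everything the Radeon shares is shared with Root Port 5 - 340C (plus the PCI bus for the legacy 0xA0000 range). As far as I know that is what you would expect for a card plugged in behind that root port, and shared IRQs are normal on PCI hardware, so sharing by itself does not prove the graphics card is the culprit.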
 
Casually looking over the MegaRAID SAS 9260-4i RAID Controllers Quick Installation Guide, at the bottom of the right-hand column on page 2 of 4, I found this: "Step 4 Insert the RAID controller in a PCI Express slot on the motherboard, as shown in figure 2......
Note: This is a PCI Express X8 card and it can operate in X8 or X16 slots"

Well, silly old me somehow got the mistaken idea that all of my PCI Express slots were X8. According to the manual, there are four X8 slots and two X4 slots, but in reality my motherboard has three X4 slots, with a grand total of five. I had moved the card originally because it was sandwiched between the CPU heatsink and the double-width graphics card. There is about a half inch on either side.

The performance has improved drastically; I have not had any problems. I think I will give it a few more days, then decide whether to keep it or send back the RMA replacement part. I cannot believe I did not think of that. That may also explain why my video card (or system) has hung during game play. I know you are wondering what that has to do with anything, but here is my answer: the graphics card is X16 in an X8 slot, which has an opening in the back. The RAID controller should be installed in an X8 or X16 slot. That is my theory. The graphics adapter takes to a narrower slot better, but not the LSI MegaRAID card. I am hoping this is the answer to my prayers.

As I said before, I have an APC 950 VA UPS, but it does nothing when the computer just shuts down with error 41, with no record in the event log of what may have caused it. I am considering purchasing an LSI00161 MegaRAID LSIiBBU07 Battery Backup Unit for my array. My question is twofold. First, is there a need for both, or is the APC sufficient? The BBU can retain cache data for up to 72 hours during an extended power outage, so that should answer my question. I have had the crashes, and the controller says you either have bad memory, a bad battery, or no battery installed, and that the cache data has been lost, but the adapter has recovered. If it could not recover, I might be in trouble. What I am saying is that apart from the problems I have experienced, this machine runs well, and I do not want to redo anything if I can avoid it.

So here is the second and probably most important question. Say the machine reboots all on its own again, and I had the LSI battery backup unit attached to the MegaRAID card. Would that save the data during an unexpected shutdown where the UPS doesn't even help, and the machine just goes down in an unclean fashion?

I am seriously considering purchasing it, but if it is not going to do anything better than my UPS, then why waste the money? As a calculated risk, even though I do not yet have the LSI BBU installed, I have already set the write cache policy to Always Write Back on my server. There are three selections in WebBIOS or MegaRAID Manager: Write Through, Always Write Back, and Write Back With BBU. I have heard folks say that it does go faster with the BBU installed and set to Write Back With BBU. Would there really be any difference between that and Always Write Back, or are they just trying to sell their batteries?
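
One way to think about the three policies is with a toy model. The sketch below is not LSI firmware and all the names in it are hypothetical. It assumes that "Write Back With BBU" falls back to write-through behaviour when no healthy battery is present, and that cached writes survive an unclean restart only if a battery keeps the cache memory alive, which is how I read the controller's "cache data was lost" message.

```python
from enum import Enum


class Policy(Enum):
    WRITE_THROUGH = "Write Through"
    ALWAYS_WRITE_BACK = "Always Write Back"
    WRITE_BACK_WITH_BBU = "Write Back With BBU"


class ToyController:
    """Toy model of a caching RAID controller (illustration only)."""

    def __init__(self, policy, has_bbu):
        self.policy = policy
        self.has_bbu = has_bbu
        self.cache = []  # writes acknowledged to the OS but not yet on disk
        self.disk = []   # writes that have reached the drives

    def caching_enabled(self):
        if self.policy is Policy.ALWAYS_WRITE_BACK:
            return True
        if self.policy is Policy.WRITE_BACK_WITH_BBU:
            return self.has_bbu  # assumed fallback to write-through without a battery
        return False

    def write(self, block):
        if self.caching_enabled():
            self.cache.append(block)  # fast path: acknowledged from cache
        else:
            self.disk.append(block)   # slow path: acknowledged once on disk

    def unclean_restart(self):
        """Simulate a sudden reset or power loss; return the blocks that are lost."""
        if self.has_bbu:
            self.disk.extend(self.cache)  # battery preserved the cache; flush it
            lost = []
        else:
            lost = list(self.cache)       # volatile cache contents are gone
        self.cache.clear()
        return lost


if __name__ == "__main__":
    for policy in Policy:
        for has_bbu in (False, True):
            ctrl = ToyController(policy, has_bbu)
            ctrl.write("block-1")
            ctrl.write("block-2")
            print(policy.value, "| BBU:", has_bbu, "| lost:", ctrl.unclean_restart())
```

In this model the two write-back settings behave identically once a battery is fitted, so any speed gain would come from write-back caching itself and not from the battery. The only combination that loses data is Always Write Back with no BBU, which is the current setup. A UPS does not appear in the model at all, because it only covers loss of mains power and not a spontaneous reset.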