Supermicro got their board back last Thursday, and I am waiting on my $536.00 refund. Won't that be nice? Everything seems to be working great, except I get random reboots from time to time. Seeing as how I have already RMA'd two motherboards, and this one works with both physical CPUs (all 12 cores, and all 24 threads), and all 48 GB of ECC Registered Server Memory, I think I will keep it, and hope to find a solution elsewhere. I got my replacement memory module, and RAID card. The memory works fine, but I will get into the Controller Card later.
I have had 10 reboots since 10/1/2011, and they do not seem to be related, but I am coming up with a theory. Yesterday it crashed while playing F1 2010 (a racing sim), and today during a GPU stress test the screen turned white a little after an hour. The GPU never got over 60C, and it is a Radeon IceQ 5670, the only piece of consumer-grade hardware that I have on my server. The machine locked up and I could not do anything: Ctrl+Alt+Del did nothing, nor did Ctrl+Shift+Esc. The strange thing is that Num Lock still worked, but not Scroll Lock or Caps Lock. I had to manually press the power button, and I got the exact same event in the event log as the other nine: Event 41. Then, during or after POST, I get this message from the LSI controller:
Cache data was lost due to an unexpected power-off or reboot during a write operation, but the adapter has recovered. This could be due to memory problems, bad battery, or you may not have a battery installed. Press any key to continue, or 'C' to load the configuration utility.
Well, I don't yet have a BBU for the controller, so a bad battery is not the problem. I am going to get one soon, hopefully this week. The system memory is just fine, so I assume they may be talking about the memory on the card. If the RMA replacement is having memory problems, then it needs to go back too. Could the controller card work with bad memory, with all the speed coming just from the four-disk RAID 5 array? It has a sticker on it that does not instill a great deal of confidence. It reads, "Serviceable Used Part". To me, that means it was ready for the trash and someone said no, it still works.
The funny thing is that it is slower than the one I was planning on RMAing. I moved it from the lowest slot, closest to the bottom of the case, where there was less than half an inch between the heat spreader and the case, to the PCI-Express slot above it, and the speeds went up dramatically. My guess is that it gets better airflow there, or for some reason that slot performs better, although it still does not perform as well as my original card did when new. I thought the slots were all x8. It makes no sense to me.
I have run Windows Memory Diagnostics several times, including the option where you press F1 and can run all of the more advanced memory tests, which takes several hours. It came back fine with no problems. The other day I ran System Stability Test - AIDA64 [TRIAL VERSION] for over one and a half hours, with the options checked to Stress CPU, FPU, cache and system memory. None of the cores really got much hotter than 65C, maybe 68C for a second, but there were no problems. I ran Prime95 with 24 execution threads for three hours, 16 minutes, in which time it completed 78 tests with 0 errors, and 0 warnings. I monitor the voltage in Supermicro SD3, and they are always well within tolerance (although I have never been staring at that screen when a crash happens).
Even though I have told the computer not to automatically restart on system crashes, there still is no blue screen, which makes me all warm and fuzzy inside, but keeps me from figuring out what the problem is. Do power switches ever go bad? I know they may quit working, but have you ever known of them to shut down the computer themselves? I am just trying to think of anything. I want to offer web-hosting service, but I must have a rock-solid machine which does not crash in order to do so.
Gag me with a spoon, I could shut the machine down, change the jumper, take out my 1GB 5670 Radeon card, switch back from HDMI to VGA, and hook the monitor up to the motherboard's onboard 8MB video. If it ran fine for a week or so, I think I could safely say the video card is the culprit. I don't want to do that because it would suck. My buddy said I should put it in a colo and VPN in when I need to, but otherwise just let it sit there and run. I know this is a server, but I built it as a dual-purpose machine. With this much money in it, I want to play with it too!
Here is the simple error that I get:
The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.
Here is the more technical log (in which I have put an X in place of some numbers or letters):
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Kernel-Power" Guid="{XXXXXXXX-XXXX-XXX-XXXX-XXXXXXXXXXXX}" />
    <EventID>41</EventID>
    <Version>2</Version>
    <Level>1</Level>
    <Task>63</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000002</Keywords>
    <TimeCreated SystemTime="2011-11-14T01:34:35.938850000Z" />
    <EventRecordID>278807</EventRecordID>
    <Correlation />
    <Execution ProcessID="4" ThreadID="8" />
    <Channel>System</Channel>
    <Computer>XXXX</Computer>
    <Security UserID="X-X-X-XX" />
  </System>
  <EventData>
    <Data Name="BugcheckCode">0</Data>
    <Data Name="BugcheckParameter1">0x0</Data>
    <Data Name="BugcheckParameter2">0x0</Data>
    <Data Name="BugcheckParameter3">0x0</Data>
    <Data Name="BugcheckParameter4">0x0</Data>
    <Data Name="SleepInProgress">false</Data>
    <Data Name="PowerButtonTimestamp">0</Data>
  </EventData>
</Event>
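In case it helps anyone compare several of these Event 41 records side by side, here is a minimal Python sketch that pulls the interesting fields out of the XML view. The function name summarize_event and the trimmed SAMPLE_EVENT string are my own illustration (the sample is a cut-down copy of the log above, not a separate capture), and it assumes the XML has been copied without the "-" collapse markers the viewer shows. As far as I understand it, a BugcheckCode of 0 just means no stop error was recorded, which fits with never seeing a blue screen.

```python
import xml.etree.ElementTree as ET

# Namespace used by Windows event XML (taken from the log above).
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Trimmed copy of the Event 41 record, for illustration only.
SAMPLE_EVENT = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Kernel-Power" />
    <EventID>41</EventID>
    <TimeCreated SystemTime="2011-11-14T01:34:35.938850000Z" />
  </System>
  <EventData>
    <Data Name="BugcheckCode">0</Data>
    <Data Name="BugcheckParameter1">0x0</Data>
    <Data Name="SleepInProgress">false</Data>
    <Data Name="PowerButtonTimestamp">0</Data>
  </EventData>
</Event>"""


def summarize_event(xml_text):
    """Return the fields worth comparing between Kernel-Power records."""
    root = ET.fromstring(xml_text)
    # Collect every <Data Name="...">value</Data> pair into a dict.
    data = {
        d.get("Name"): (d.text or "")
        for d in root.findall("e:EventData/e:Data", NS)
    }
    return {
        "event_id": int(root.findtext("e:System/e:EventID", namespaces=NS)),
        "time": root.find("e:System/e:TimeCreated", NS).get("SystemTime"),
        # int(..., 0) accepts both plain decimal ("0") and hex ("0x0") text.
        "bugcheck_code": int(data.get("BugcheckCode", "0"), 0),
        "power_button_timestamp": int(data.get("PowerButtonTimestamp", "0"), 0),
        "sleep_in_progress": data.get("SleepInProgress") == "true",
    }


if __name__ == "__main__":
    print(summarize_event(SAMPLE_EVENT))
```

Running it over each saved record and comparing the results would at least show whether every crash has the same all-zero signature.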
Here is the friendly version:
- System
- Provider
[ Name] Microsoft-Windows-Kernel-Power
[ Guid] {XXXXXXXX-XXXX-XXX-XXXX-XXXXXXXXXXXX}
EventID 41
Version 2
Level 1
Task 63
Opcode 0
Keywords 0x8000000000000002
- TimeCreated
[ SystemTime] 2011-11-04T04:34:56.531651000Z
EventRecordID 267944
Correlation
- Execution
[ ProcessID] 4
[ ThreadID] 8
Channel System
Computer XXXX
- Security
[ UserID] X-X-X-XX
- EventData
BugcheckCode 0
BugcheckParameter1 0x0
BugcheckParameter2 0x0
BugcheckParameter3 0x0
BugcheckParameter4 0x0
SleepInProgress false
PowerButtonTimestamp 0
This is driving me nuts. The main reason I spent this kind of money on this machine was to make it bulletproof, but my old P4, which I spent about a third the price on 5 years ago when parts were expensive, has proven much more reliable in some cases. I know I should have started a new thread, but someone please help! I am at my wit's end on this. It is not right to have over $5,000 in an Enterprise Server, and have all these problems.
With a generator and a battery backup unit for my LSI 9260-4i controller, I will be up to $6,250 for everything. Let's not even start thinking about adding four more hard drives, because then we will be topping the $7K mark! What have I gotten myself into? I am not bragging, but I am seriously starting to wonder why I put so much money into this system.
I could enable Hibernation at 100%, and use a full-sized Pagefile. I wonder if that would help. That would only waste 98 GB of hard drive space. That is 2 GB less than my Boot, Page File and Crash Dump partition (where Windows is located). My System Reserved partition is a measly 100MB. I don't suppose that I can put the Hibernation and Pagefile files on a different partition, and expect it to work properly. I am under the impression that they have to be on the root partition.