LOTS of strange drive problems...

bobbintb

Distinguished
Oct 21, 2010
38
0
18,560
OK, so bear with me, as this is lengthy and has a lot of history to it. I'm a lifelong IT guy and this is a tough one for me. I have an ASRock Z87 Extreme11/ac. I have an mSATA drive I use for my main OS and 12 other hard drives I use for storage in a software RAID (tRAID). I think there is something wrong with my controller. I keep getting hard drive errors, and they are too numerous and too inconsistent for the drives themselves to be the likely cause.

Here are a few of the common messages I get in the event log:

Error
Source LSI_SAS3
Event 11
The driver detected a controller error on \Device\RaidPort0.

Error
Source LSI_SAS3
Event 11
The driver detected a controller error on \Device\RaidPort4.

Warning
Source LSI_SAS3
Event 129
Reset to device, \Device\RaidPort0, was issued.

Error
Source disk
Event 11
The driver detected a controller error on \Device\Harddisk5\DR6.

Warning
Source iaStorA
Event 129
Reset to device, \Device\RaidPort1, was issued.

Warning
Source iaStorA
Event 129
Reset to device, \Device\RaidPort2, was issued.

Warning
Source disk
Event 153
The IO operation at logical block address 0x475ea7a0 for Disk 13 (PDO name: \Device\0000005b) was retried.

Warning
Source disk
Event 153
The IO operation at logical block address 0x734f804e for Disk 1 (PDO name: \Device\00000034) was retried.

Warning
Source disk
Event 153
The IO operation at logical block address 0x14508dc8 for Disk 4 (PDO name: \Device\00000037) was retried.

Error
Source cbfs5
Event 1
(no details for some reason)

Warning
Source disk
Event 51
An error was detected on device \Device\Harddisk13\DR13 during a paging operation.

Error
Source disk
Event 7
The device, \Device\Harddisk17\DR17, has a bad block.

Error
Source disk
Event 7
The device, \Device\Harddisk18\DR18, has a bad block.

Error
Source Ntfs (Ntfs)
Event 55
A corruption was discovered in the file system structure on volume \\?\Volume{c167489f-d8c1-11e3-8259-240a649f9450}.
A corruption was found in a file system index structure. The file reference number is 0x9000000000009. The name of the file is "<unable to determine file name>". The corrupted index attribute is ":$SII:$INDEX_ALLOCATION".

Error
Source Ntfs (Microsoft-Windows-Ntfs)
Event 98
Volume \\?\Volume{c167489f-d8c1-11e3-8259-240a649f9450} (\Device\HarddiskVolume7) needs to be taken offline to perform a Full Chkdsk. Please run "CHKDSK /F" locally via the command line, or run "REPAIR-VOLUME <drive:>" locally or remotely via PowerShell.

Error
Source Ntfs (Microsoft-Windows-Ntfs)
Event 98
Volume \\?\Volume{c167489f-d8c1-11e3-8259-240a649f9450} (\Device\HarddiskVolume8) needs to be taken offline to perform a Full Chkdsk. Please run "CHKDSK /F" locally via the command line, or run "REPAIR-VOLUME <drive:>" locally or remotely via PowerShell.

Error
Source Ntfs (Microsoft-Windows-Ntfs)
Event 98
Volume H: (\Device\HarddiskVolume6) needs to be taken offline to perform a Full Chkdsk. Please run "CHKDSK /F" locally via the command line, or run "REPAIR-VOLUME <drive:>" locally or remotely via PowerShell.

I am also getting a lot of "disk has been surprise removed" warnings.

So, I don't quite know what to make of it. I have run chkdsk on all the disks and only two had errors that needed fixing. Even so, I keep getting these errors. The drivers are all up to date. I thought flashing the embedded LSI 3008 to IT mode might help, but it didn't. As far as I can recall the drives don't have problems in other machines, but I might be off on that. Any ideas?
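One way to put a number on "too many and too inconsistent" is to tally the entries above by source and event ID and count how many distinct disks they name. This is a minimal sketch in Python, assuming the events have been transcribed by hand into (source, event ID, message) tuples. The list holds only some of the entries quoted above, and the helper names `tally_by_source` and `disks_mentioned` are hypothetical, not part of any Windows tool.

```python
import re
from collections import Counter

# Some of the entries quoted above, transcribed by hand as
# (source, event_id, message). This is not a full export of the log.
EVENTS = [
    ("LSI_SAS3", 11, r"The driver detected a controller error on \Device\RaidPort0."),
    ("LSI_SAS3", 11, r"The driver detected a controller error on \Device\RaidPort4."),
    ("LSI_SAS3", 129, r"Reset to device, \Device\RaidPort0, was issued."),
    ("iaStorA", 129, r"Reset to device, \Device\RaidPort1, was issued."),
    ("disk", 153, r"The IO operation at logical block address 0x475ea7a0 for Disk 13 (PDO name: \Device\0000005b) was retried."),
    ("disk", 153, r"The IO operation at logical block address 0x734f804e for Disk 1 (PDO name: \Device\00000034) was retried."),
    ("disk", 153, r"The IO operation at logical block address 0x14508dc8 for Disk 4 (PDO name: \Device\00000037) was retried."),
    ("disk", 51, r"An error was detected on device \Device\Harddisk13\DR13 during a paging operation."),
    ("disk", 7, r"The device, \Device\Harddisk17\DR17, has a bad block."),
    ("disk", 7, r"The device, \Device\Harddisk18\DR18, has a bad block."),
]

# Matches "Disk 13" and "\Harddisk13" but not "\HarddiskVolume7".
DISK_PATTERN = re.compile(r"(?:Disk |\\Harddisk)(\d+)")


def tally_by_source(events):
    """Count events per (source, event_id) pair."""
    return Counter((source, event_id) for source, event_id, _ in events)


def disks_mentioned(events):
    """Return the sorted disk numbers named in any message."""
    found = set()
    for _, _, message in events:
        found.update(int(n) for n in DISK_PATTERN.findall(message))
    return sorted(found)


for (source, event_id), count in sorted(tally_by_source(EVENTS).items()):
    print(f"{source:10} event {event_id:<4} x{count}")
print("disks named:", disks_mentioned(EVENTS))
```

If the same handful of disks kept showing up, the drives would be the first suspects; errors spread across many disks and more than one driver point at something they share.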
 
Might as well start with the basics, if you haven't already. I'd start by replacing all the SATA cables; use the ones with a metal latch at both ends. You might also want to take a look at the power supply, which could be in its death throes. Question: what are you using for your software RAID?
 

bobbintb

Distinguished
Oct 21, 2010
38
0
18,560
I forgot to mention that I checked all of that a long time ago: cables, PSU, etc. The RAID software is tRAID and I'm on Windows 8.1 x64. I'm checking with the mobo manufacturer as well in case the board is faulty.
 

bobbintb

Distinguished
Oct 21, 2010
38
0
18,560
OK, I just spent the last two weeks verifying that my drives are OK in two other machines. I added them back and now I am getting a lot of this:

The device, \Device\Harddisk19\DR19, has a bad block.
The device, \Device\Harddisk15\DR15, has a bad block.

and a little of this:
The system failed to flush data to the transaction log. Corruption may occur in VolumeId: \\?\Volume{16fe4cd7-a38d-11e4-831f-240a649f9450}, DeviceName: \Device\HarddiskVolume16.
(STATUS_DEVICE_DATA_ERROR)

{Delayed Write Failed} Windows was unable to save all the data for the file \\?\Volume{16fe4cd7-a38d-11e4-831f-240a649f9450}\$Mft. The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere.

The system failed to flush data to the transaction log. Corruption may occur in VolumeId: \\?\Volume{947a16e3-711a-490c-8dba-0be01af39381}, DeviceName: \Device\HarddiskVolume9.
(STATUS_DEVICE_DATA_ERROR)

I think I am going to see about an RMA for the board. Any other ideas?
 

bobbintb

Distinguished
Oct 21, 2010
38
0
18,560
OK, well, after months, maybe even a year, I think I have found the issue. The drives were overheating. I never really considered this because I use a 4U server chassis and figured it would have adequate cooling, but I guess not. I have also never had a drive overheat on me. I have had CPUs and video cards overheat, just never a hard drive. While I knew they could, I just never thought about it. I also did not account for the fact that server chassis are typically kept in a cold room. In hindsight, it is consistent with the machine sometimes taking a few minutes to boot and other times just seconds: I just didn't realize the common factor was whether it was starting cold or already warmed up. During another round of troubleshooting one of the drives was extremely hot, so much so that I could not hold onto it for very long. So I took all of the drives out of the chassis, stacked them on top of each other using Legos as spacers, pointed a small room fan at them, and started testing again. It has been almost a month and I have had no errors in the Event Viewer and no performance issues at all. I still get that generic LSI_SAS3 error, but I don't think it is affecting anything. Anyway, thanks for the help.
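For anyone who lands here with similar symptoms: drive temperature is usually reported as SMART attribute 194 (190 on some drives), so it can be checked without touching the drives. This is a minimal sketch in Python, assuming smartmontools is installed and that `smartctl -A` prints its usual attribute table. The sample text is made up for illustration and is not output from the drives in this thread, and the 50 °C limit is an arbitrary assumption, not a manufacturer figure.

```python
import re

# Illustrative text in the shape of `smartctl -A` output; values are made up.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   045   038   000    Old_age   Always       -       55 (Min/Max 21/62)
"""

TEMPERATURE_IDS = ("190", "194")


def drive_temperature(smart_text):
    """Return the raw temperature in Celsius from the first 190/194 row, or None."""
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns; the raw value is the 10th.
        if len(fields) >= 10 and fields[0] in TEMPERATURE_IDS:
            match = re.match(r"\d+", fields[9])
            if match:
                return int(match.group())
    return None


def is_too_hot(temp_c, limit_c=50):
    """True when a temperature was found and is at or above the assumed limit."""
    return temp_c is not None and temp_c >= limit_c


temp = drive_temperature(SAMPLE)
print(temp, "C", "(too hot)" if is_too_hot(temp) else "(ok)")
```

On a real system the text would come from running `smartctl -A` against each device, for example through `subprocess.run`, and logging the result over time would show whether errors line up with temperature spikes.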
 

simonchipmunk

Reputable
Apr 8, 2014
619
0
5,010
Thank goodness. I am glad you were able to determine the source of the problem. My computer did somewhat the same thing, and it turned out that the bracket holding the heatsink fan had broken. I got it fixed quickly.
 