[SOLVED] Debugging drives in RAID 6?

May 19, 2022
14
0
10
0
Hello,
I have a local server with 6 drives connected to a 3ware 9650SE-8LPML controller, and I'm getting strange errors.

But, from the beginning. Last year I purchased 6 SATA HDDs. They were unused, but their manufacturing date was 2014 (6x Hitachi HUS724040AL).
Ever since I set up the RAID I have had issues with random drives disconnecting. And, what seems strange, they spin up, and later I can hear some of them slowing down and spinning up again (hard to say which ones, since they sit together in a bunch).

I'm wondering if it's a power problem or a drive problem. I don't think it's a SATA cable problem, since I replaced the old SFF-8087 cables with new ones and the problem persists.
Since it's a 3ware controller, I can't access SMART that easily... (haven't figured out how yet).

After installing the 3ware utility to check, I first saw something like this:

So the problem is on ports 5, 6 and 7. I thought that changing a cable might help, so I replaced one cable (and moved the drive from port 7 to port 2, only this one). After booting up the system I had a table like this:

But after a few hours... bang


So basically, the disk moved from port 7 to port 2 is causing problems again.

I have 2 cables plugged into the controller: ports 0-3 on Cable 1 (red cable) and ports 4-7 on Cable 2 (blue cable).
The drives on ports 4, 5, 6 and 7 (now 2) share the same SATA power cable (a splitter connected to a Molex plug); the rest are connected directly to the PSU.

I'm wondering what could be the cause of this. Am I unlucky and 3 out of 6 drives are failing? They have only been working for 6-8 months, and the problems have been there almost from the beginning.
Or could they have been used, with someone resetting the SMART statistics? (I checked them before putting them into the RAID, and they were fine.)

PS: I once saw a SMART error in the Status field.

Edit: I inspected it just now, and I can see:
 
I see a lot of Power Off Retracts. I suspect that these are emergency retracts of the headstack which the drive executes in response to critical faults. The failing drive appears to have maxed out the raw value (65535 = 0xFFFF in hexadecimal).
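
As a quick check of the arithmetic above: 65535 is the largest value an unsigned 16-bit counter can hold, which is why a raw value stuck there reads as "maxed out". A minimal Python sketch; the helper name `is_saturated` and the assumption that this counter is 16 bits wide are illustrative only and do not come from any vendor documentation:

```python
def is_saturated(raw_value, bits=16):
    """Return True if raw_value equals the maximum of an unsigned counter of the given width.

    The 16-bit default is an assumption for illustration.
    """
    return raw_value == (1 << bits) - 1


print(hex(65535))           # 0xffff
print(is_saturated(65535))  # True
```

A saturated counter only says "at least this many events", so the real number of retracts on that drive could be higher.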

Do your drives all have rotational vibration sensors? These are the two horizontal white components at the top left and right of the PCB:

https://www.storagereview.com/review/hgst-ultrastar-7k4000-review-hus724040ale640

https://www.storagereview.com/wp-content/uploads/2013/04/StorageReview-Hitachi-Ultrastar-7K4000-Circuit-Board.jpg

These sensors detect the sinusoidal vibrations from other drives in the same rack. The drive then injects a compensatory sinusoidal adjustment into its own track servo to counterbalance these external disturbances. Many desktop PCBs have optional RV sensors, but their circuit locations are usually not populated.

It would be worth noting the positions of the failing drives. Are they neighbours? Are they closest to the fan(s)?
 
Maybe it's not drives, but something else that's making them die?
Some speculative comments since you asked this way.

Any drive can die at any time, and the drives are all the same model and maybe even from the same batch. So maybe they tend to stay good for a similar period of time, i.e. a production fault may affect more than one drive.

Maybe (I haven't seen your case), depending on the case and mounting structure, the HDDs hit a resonance frequency, either from spinning or from head movement patterns, which in turn causes unexpected wear.
 
Also asked if these are SMR or CMR. However, I don't get what they mean... do they keep it secret, or they don't know?
Just before I answered your first post, I remember I found a website that had benchmark results for that HDD model, with no specific indication of SMR drive issues, afaik.

https://www.storagereview.com/review/hgst-ultrastar-7k4000-review-hus724040ale640

However, it does say:
the website mentioned above said:
Cons
  • Weaker performance in Web and File Server Profiles

But it also says:
the website mentioned above said:
Pros

  • Excellent Performance in 8K and 128K testing
---------------

Bottom Line

The HGST enterprise-class Ultrastar 7K4000 offers high-end performance and robust capacity and is a logical upgrade for capacity-hungry applications.
And also, given that the HDDs in that model series have capacities up to 4TB, my overall assessment is that this HDD must be a CMR drive (the regular type, without the poor random-write performance that SMR drives have).
 
Those sensor readings make no sense to me. I would start by observing the readings reported by BIOS. That should tell you which supply voltage (or current) each "inX" is sensing.

Here is a Gigabyte motherboard schematic which shows the IT8720/IT8721 LPC IO chip (page 18):

http://kythuatphancung.vn/uploads/download/170f8_GA-H55M-UD2H_r101.pdf

"in3" is identified as "VIN3/ATXPG" and is routed to the POK signal from the PSU. If all the PSU voltages are in spec, then the PSU outputs 3.3V on this pin. This would be a logic level rather than an analogue value, so anything above 1.66V would be sensed as POK or ATX Power Good. Therefore I don't understand why this should raise an alarm.

Your motherboard's sensors may be configured differently, but in the Gigabyte example VIN5 (in5) senses the +12V supply. If this is how your motherboard does it, then an input of 2.18V would correspond to the maximum allowable excursion of the 12V supply. A reading of 2.51V would then suggest that the supply is now sitting at 13.8V or above.

(2.51 / 2.18) x 12 = 13.8V

That said, each motherboard is different, so the scale factors may also be different.
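
The scaling in the calculation above can be written out as a small Python sketch. It assumes a purely linear divider and, as in that calculation, treats a sensed 2.18V as corresponding to a 12V rail; the function name and parameters are made up for illustration, and the real scale factor depends on the motherboard:

```python
def estimate_rail_voltage(sensed_v, reference_sensed_v, reference_rail_v):
    """Scale a sensor-pin reading up to a rail voltage.

    Assumes the sensor input is a linear fraction of the rail, so one known
    (sensed, rail) pair fixes the scale factor. That pair is an assumption.
    """
    return sensed_v / reference_sensed_v * reference_rail_v


estimate = estimate_rail_voltage(2.51, 2.18, 12.0)
print(f"{estimate:.1f} V")  # 13.8 V
```

If the BIOS shows the rail voltage directly, comparing it with the "inX" reading at the same moment would give the actual reference pair for a given board.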
 
I have found a way to access SMART.
https://pastebin.pl/view/a9e59db0 (I couldn't access it for one of the drives, as it's marked as INOPERABLE by the controller)

The 2 worst disks have problems with Seek_Error_Rate, but none have problems with the reallocated sector count.
I can hear some of these drives "clicking", and it sounds like they are slowing down and speeding up.

Can this be because of the power supply? Or are the disks really failing at head positioning and about to break?
I have already ordered replacement disks, but I want to be prepared in case the cause is not the drives but something else.

I have doubts that it's a drive issue, since 3 drives died at almost the same time. Maybe it's not drives, but something else that's making them die?
 
@Grobe
Since I had no space in the case and didn't think it was a good idea to stack them in one column without shock absorption, I built a self-made rack to quickly assemble the array. I also made it heavy, so that it doesn't resonate much.


The failing one is exactly in the middle, 4th from the left and 4th from the right :)

I left some spacing and adjusted its width to hold the drives with screws like these:



I also ran hdparm on them:

Code:
hdparm -tT /dev/sd[a,b,d-f]

/dev/sda:
 Timing cached reads:   7860 MB in  2.00 seconds = 3937.05 MB/sec
 Timing buffered disk reads: 432 MB in  3.00 seconds = 143.83 MB/sec

/dev/sdb:
 Timing cached reads:   7896 MB in  2.00 seconds = 3954.68 MB/sec
 Timing buffered disk reads: 500 MB in  3.00 seconds = 166.66 MB/sec

/dev/sdd:
 Timing cached reads:   7866 MB in  2.00 seconds = 3939.72 MB/sec
 Timing buffered disk reads: 480 MB in  3.01 seconds = 159.67 MB/sec

/dev/sde:
 Timing cached reads:   7764 MB in  2.00 seconds = 3888.31 MB/sec
 Timing buffered disk reads: 502 MB in  3.00 seconds = 167.06 MB/sec

/dev/sdf:
 Timing cached reads:   7858 MB in  2.00 seconds = 3935.89 MB/sec
 Timing buffered disk reads:   2 MB in  7.91 seconds = 258.92 kB/sec

-----------------

 hdparm -tT /dev/sd[a,b,d-f]

/dev/sda:
 Timing cached reads:   7892 MB in  2.00 seconds = 3953.23 MB/sec
 Timing buffered disk reads: 414 MB in  3.00 seconds = 137.83 MB/sec

/dev/sdb:
 Timing cached reads:   7840 MB in  2.00 seconds = 3926.79 MB/sec
 Timing buffered disk reads: 488 MB in  3.00 seconds = 162.58 MB/sec

/dev/sdd:
 Timing cached reads:   7706 MB in  2.00 seconds = 3859.74 MB/sec
 Timing buffered disk reads: 476 MB in  3.00 seconds = 158.49 MB/sec

/dev/sde:
 Timing cached reads:   7894 MB in  2.00 seconds = 3953.98 MB/sec
 Timing buffered disk reads: 500 MB in  3.00 seconds = 166.58 MB/sec

/dev/sdf:
 Timing cached reads:     2 MB in  7.76 seconds = 263.82 kB/sec
 Timing buffered disk reads:   2 MB in  5.07 seconds = 403.57 kB/sec

-----------------

sudo hdparm -tT /dev/sdf

/dev/sdf:
 Timing cached reads:   7936 MB in  2.00 seconds = 3975.12 MB/sec
 Timing buffered disk reads:   2 MB in  5.61 seconds = 364.85 kB/sec
In the 2nd test, /dev/sdf seems to be failing on cached reads as well. The third test looks more like the first one.
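
For anyone repeating this on a larger array, the comparison can be automated. The following is only a rough Python sketch: it parses `hdparm -tT` text in the format shown above and flags any device whose buffered read speed is far below the median of the group. The function names and the 50% cut-off are arbitrary choices for illustration, and the kB-to-MB factor of 1024 is inferred from the figures in the output above (2 MB in 7.91 seconds = 258.92 kB/sec):

```python
import re
from statistics import median

# Conversion to MB/sec; the factor of 1024 is inferred from the sample output.
UNIT_TO_MB = {"kB": 1 / 1024, "MB": 1.0}

BUFFERED_RE = re.compile(r"Timing buffered disk reads:.*=\s*([\d.]+)\s*(kB|MB)/sec")


def buffered_read_speeds(hdparm_output):
    """Map each device to its buffered read speed in MB/sec."""
    speeds = {}
    device = None
    for line in hdparm_output.splitlines():
        line = line.strip()
        if line.startswith("/dev/") and line.endswith(":"):
            device = line[:-1]  # remember which device the next lines belong to
            continue
        match = BUFFERED_RE.search(line)
        if match and device:
            speeds[device] = float(match.group(1)) * UNIT_TO_MB[match.group(2)]
    return speeds


def slow_devices(speeds, fraction=0.5):
    """Return devices slower than `fraction` of the median speed (arbitrary cut-off)."""
    typical = median(speeds.values())
    return sorted(dev for dev, speed in speeds.items() if speed < typical * fraction)


SAMPLE = """\
/dev/sda:
 Timing cached reads:   7860 MB in  2.00 seconds = 3937.05 MB/sec
 Timing buffered disk reads: 432 MB in  3.00 seconds = 143.83 MB/sec

/dev/sdb:
 Timing buffered disk reads: 500 MB in  3.00 seconds = 166.66 MB/sec

/dev/sdd:
 Timing buffered disk reads: 480 MB in  3.01 seconds = 159.67 MB/sec

/dev/sde:
 Timing buffered disk reads: 502 MB in  3.00 seconds = 167.06 MB/sec

/dev/sdf:
 Timing buffered disk reads:   2 MB in  7.91 seconds = 258.92 kB/sec
"""

print(slow_devices(buffered_read_speeds(SAMPLE)))  # ['/dev/sdf']
```

Comparing against the median rather than a fixed number keeps the check usable on arrays of different drive models, though a single run is noisy and several passes would give a fairer picture.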

PS: The GPU is attached temporarily for debugging purposes. Normally no GPU is installed, as both PCI slots are in use (one by the NVMe drive, the other by the 3ware controller).
 
Does the CrystalDiskInfo data on operating hours support the seller's claim of 'unused drives', or show any drives in less than 'Good' health?

Perhaps run a GSmartControl Short and Long test on each subject drive... (Many people test a drive or group of drives for a few days before even considering them worthy of inclusion in an actual RAID for production purposes...)
 
For now, I have added 3 IronWolf Pro drives and am trying to restore the degraded array. The funny thing is that so far (a few hours now) none of the disks has failed... I only added 3 new drives and disconnected one of the faulty drives.
Maybe when I disconnected all the drives and reconnected them, some connection got fixed. Maybe the Molex to SATA power splitter was failing.
These old Molex connectors tend to misbehave sometimes...
We will see how it goes. I will try to stress test these drives one by one and see if they're really faulty...

@fzabkar Also, as for RVS, I wrote to support to ask whether the drives have RVS or not.

Thank you for contacting Western Digital Customer Service and Support. My name is Ramsey E..

I assure you that i will try my best in order to assist you with your inquiry.

The models you inquiry have "Enhanced Rotational Vibration Safeguard (RVS)" however i regret to inform you that we cannot provide information regarding the drives being SMR or CMR since it is not common knowledge.

You can see data sheet from this direct link for more information. A possible case escalation may be created if the drive being SMR or CMR is crucial for you however this will not mean that the information you required will be provided for certain.
Also asked if these are SMR or CMR. However, I don't get what they mean... do they keep it secret, or they don't know?
 
I already have a DCB Read failure on an IronWolf drive, so it seems that the disks may be fine and most probably the power supply is failing. Maybe I put too many disks on it. I have ordered a new 700W PSU. The old PSU is really old and does not have any 80 PLUS certification, so maybe it does not hold its voltages properly.

:rolleyes: (while writing this, the author thought of checking the voltages from the command line...):

Edit: Still, the readings and min/max values are quite strange... I will check with a multimeter today.
 
@Grobe I'm going to test some then. If a brand new IronWolf behaves the same way, it's either the power supply, the 3ware controller (I got a replacement just in case :) ) or the cables (but I already replaced them some time ago).
I will come back in a few days with my findings.
 
I just checked the voltages with a multimeter (a Voltcraft, a decent one, not the cheapest there is) and replaced the PSU, going from a 350W unit (with only an 80+ label) to a 750W 80+ Gold.

On the old PSU:
3.3V was 3.6V
12V was 11.6V on the SATA cable coming directly from the PSU and 11.46V on the Molex -> SATA splitter, and it was jumping around. I monitored it for a while, and when the voltage dropped to 11.43V the disks clicked and restarted (I think).
5V was 4.66V, so not bad.

On the new PSU:
12V is 11.92V on the longest cable (Molex -> SATA splitter)
5V is 4.93V on the longest cable (Molex -> SATA splitter)

And the sound of them slowing down and spinning up seems to have stopped. Maybe they retracted their heads because of the low voltages.
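
To put the multimeter readings above into perspective, here is a small Python sketch that expresses each one as a percentage deviation from its nominal rail. The ±5% band used as a yardstick is the tolerance commonly quoted for the ATX 3.3V, 5V and 12V rails; it is an assumption brought in for illustration and was not measured or stated in this thread, and the function name is made up:

```python
TOLERANCE = 0.05  # assumed +/-5% band, not taken from this thread


def check_rail(nominal_v, measured_v, tolerance=TOLERANCE):
    """Return (relative deviation, True if inside the tolerance band)."""
    deviation = (measured_v - nominal_v) / nominal_v
    return deviation, abs(deviation) <= tolerance


# (label, nominal, measured) taken from the readings reported above
readings = [
    ("old PSU 3.3V", 3.3, 3.60),
    ("old PSU 12V, direct SATA cable", 12.0, 11.60),
    ("old PSU 12V, Molex -> SATA splitter", 12.0, 11.46),
    ("old PSU 12V, lowest value seen", 12.0, 11.43),
    ("old PSU 5V", 5.0, 4.66),
    ("new PSU 12V, Molex -> SATA splitter", 12.0, 11.92),
    ("new PSU 5V, Molex -> SATA splitter", 5.0, 4.93),
]

for label, nominal, measured in readings:
    deviation, ok = check_rail(nominal, measured)
    verdict = "within" if ok else "outside"
    print(f"{label}: {measured:.2f}V ({deviation:+.1%}), {verdict} the assumed band")
```

Under that assumption the 12V readings stay just inside the band even at 11.43V, while the old PSU's 3.3V and 5V readings fall outside it, so the 5V rail may deserve as much suspicion as the 12V dips. A multimeter also averages out short sags, so the real minimum during spin-up could have been lower than anything seen here.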

The old PSU had 300W max on the 12V lines; the new one has 744W on 12V.
With 10 drives, each probably drawing around 12W, that should make 120W for the disks, so 300W should be plenty... but that PSU was quite old, so maybe that was the reason the voltages were so low.
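
The back-of-the-envelope estimate above, written out as a Python sketch. The 12W per drive is the rough guess from the post, and charging all of it to the 12V rail is a simplification (drives also draw from 5V, and spin-up current is higher than the running figure); the function name is hypothetical:

```python
def rail_load(drive_count, watts_per_drive, rail_capacity_w):
    """Return (total drive load in W, share of the rail capacity used)."""
    load_w = drive_count * watts_per_drive
    return load_w, load_w / rail_capacity_w


load, share_old = rail_load(10, 12.0, 300.0)
_, share_new = rail_load(10, 12.0, 744.0)
print(f"{load:.0f}W for the disks: {share_old:.0%} of the old 12V capacity, "
      f"{share_new:.0%} of the new one")
```

On paper the old unit had plenty of headroom for the disks alone, which fits the suspicion that age rather than rated capacity was the issue.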

Now the RAID is rebuilding:
Code:
md25 : active raid6 sdc1[7] sdg1[3] sda1[1] sdf1[2] sdb1[6] sdd1[4]
      15585404928 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/4] [_UUUU_]
      [>....................]  recovery =  1.3% (53517568/3896351232) finish=4573.3min speed=14004K/sec
      bitmap: 4/30 pages [16KB], 65536KB chunk
Before, the speed was something like 2K/sec; now it is 14,000K/sec. So if this keeps up, then the PSU may well have been the problem.
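
The finish estimate in the status output above can be reproduced from the progress counters and the speed. A minimal Python sketch, assuming the two counters and the K/sec figure are in the same units (the reported finish time appears to bear that out); the function name and regular expression are made up for illustration and only handle a recovery line in the format shown above:

```python
import re

RECOVERY_RE = re.compile(r"recovery\s*=\s*[\d.]+%\s*\((\d+)/(\d+)\).*speed=(\d+)K/sec")


def recovery_eta_minutes(mdstat_line):
    """Estimate the remaining rebuild time in minutes from one recovery line."""
    match = RECOVERY_RE.search(mdstat_line)
    if match is None:
        raise ValueError("no recovery progress found in line")
    done, total, speed_k = (int(group) for group in match.groups())
    return (total - done) / speed_k / 60


LINE = ("[>....................]  recovery =  1.3% (53517568/3896351232) "
        "finish=4573.3min speed=14004K/sec")

eta = recovery_eta_minutes(LINE)
print(f"about {eta / 60 / 24:.1f} days remaining")
```

The result lands within a minute of the reported 4573.3min. Plugging roughly 2K/sec into the same formula gives a rebuild time measured in decades, which shows how badly the array was stalling before.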

I really hope that was the reason and that the errors will not come back. The old setup was not reporting errors right after start, but only after some time, so it is still possible that errors will appear... hopefully that will not be the case anymore :)
When I think of it now, the only disks with problems were the ones connected via the splitter to Molex. So I assume that may have been the reason for all these problems.
Keep your fingers crossed! :)
 
@Grobe @fzabkar Thank you for your help and knowledge.
Replacing the PSU seems to have solved the problem. It's been 20h since the replacement and no problems have occurred. The RAID is rebuilding at proper speeds. No more disk errors. No failed drives show up anymore. The system has become more responsive.

PS: Bonus question: which answer should I mark as the best answer (if I should)? I don't want to commit a faux pas. Should I mark my last post as the best answer, since it describes how this problem got solved?
 
Ok. I was wondering whether I should mark the one that summarizes the problem resolution (so later viewers get a quick answer right at the top) or the one from whoever helped most.
@Grobe weak GPU or PSU?
 
The PSU, because that is the component that can typically affect all other components when it becomes faulty. Now, "weak" may not be the correct term for a PSU that still works but doesn't deliver stable voltage; "faulty" or "bad" probably expresses the situation better (I'm not a native English speaker, so language errors may occur).
 

USAFRet

Titan
Moderator
Mar 16, 2013
156,320
11,712
176,090
24,279
"weak" may refer to a PSU that otherwise works perfectly, but is underpowered for what you want it to do.

A 350W PSU that otherwise works perfectly is too "weak" for a new RTX 3090 GPU.

Faulty may mean a PSU that actually does not work right.
An 850W PSU would be good enough for a 3090, but not if it is faulty and cannot actually deliver stable power to the system.
 
I think it's just broken, as the server had no GPU installed and the PSU was (theoretically) loaded at 60%. But maybe some CPU usage spikes pushed it beyond that. Judging by a power socket meter, it drew a constant 180W at idle to keep everything running. And the problem only affected disks that were connected on longer cables (so I assume the voltage dropped too much at the far ends due to resistance).

Anyway, I mainly came here to report that since the PSU replacement (2 weeks ago) everything works fine. There are no problems at all at the moment. Thank you again, everyone, for your help and advice!
 
