[SOLVED] Future planning for NVMe failure

Sep 20, 2019
Hi, I'm building a server using a Threadripper and 6 NVMe drives (non-boot). I'm using an ASRock Taichi X399 and an ASUS M.2 Hyper x16 card (by the way, they all work wonderfully!). I'm also running Proxmox with ZFS RAID 5 across the 6 NVMe drives. But I was wondering: when a drive fails, how will I know which physical drive it is? They are all named nvme0n1, nvme1n1, etc., and I don't know where nvme3n1 is physically located, whether on the motherboard or on the ASUS M.2 add-on card. When I used a RAID card and SSDs, each drive was cabled directly to a port on the card and I labeled everything. I can label the NVMe drives too, but when one fails, how do I tie the device name back to a label?
 
Sep 20, 2019
So I answered my own question. @fzabkar helped partially, so I'll give him credit. I was looking for a way to identify where each drive sits on the motherboard and in the add-on card. It turns out to be pretty much the same as with spinning drives, except that it takes advance planning, because there is no physical cable to trace back to an interface.

Here is what I did.

NVMe drives support SMART, so you can use smartmontools on Linux to get the serial numbers of the drives (thank you, fzabkar).

1. Physically mark the drive slots 1-6.
2. Get the serial number of each drive and write it down.
3. Put the drives in the slots, noting which serial number goes in which marked slot.
4. Run smartctl on each device to get its serial number, then match the serial numbers to the device names Linux assigned.
So now I have the following:

M2 Hyper x16 / MKNSSDHL1TB-D8
nvme0n1 - MK19071610054794D (Slot 1)
nvme1n1 - MK19071810054A0B5 (Slot 3)
nvme3n1 - MK19071810054A087 (Slot 2)
nvme5n1 - MK19071810054A0B7 (Slot 4)

Motherboard / Samsung SSD 970 EVO 1TB
nvme2n1 - S467NX0M823106R (Slot 5)
nvme4n1 - S467NX0M823089V (Slot 6)

Note that there is no rhyme or reason to how Linux assigns the default device names across the add-in card and the motherboard slots.
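
The matching in the steps above can also be scripted instead of running smartctl once per drive. What follows is only a minimal sketch in Python, under two assumptions that should be checked on your own system: that the kernel exposes each controller's serial number in /sys/class/nvme/<controller>/serial, and that namespace nvmeXn1 belongs to controller nvmeX (the usual case, but not guaranteed). The names SLOT_BY_SERIAL, read_nvme_serials and describe are made up for this illustration; the serial numbers and slots are the ones recorded above.

```python
from pathlib import Path

# Slot labels written down by hand when the drives were installed
# (serial numbers taken from the list in this post).
SLOT_BY_SERIAL = {
    "MK19071610054794D": "Hyper card, slot 1",
    "MK19071810054A087": "Hyper card, slot 2",
    "MK19071810054A0B5": "Hyper card, slot 3",
    "MK19071810054A0B7": "Hyper card, slot 4",
    "S467NX0M823106R": "motherboard, slot 5",
    "S467NX0M823089V": "motherboard, slot 6",
}


def read_nvme_serials(sysfs_root="/sys/class/nvme"):
    """Return {controller_name: serial} for each NVMe controller found.

    Assumption: every controller directory contains a 'serial' file.
    Returns an empty dict if the directory does not exist.
    """
    serials = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return serials
    for ctrl in sorted(root.iterdir()):
        serial_file = ctrl / "serial"
        if serial_file.is_file():
            # The value may be padded with spaces, so strip it.
            serials[ctrl.name] = serial_file.read_text().strip()
    return serials


def describe(serials, slot_by_serial):
    """Build one 'name: serial -> slot' line per controller."""
    lines = []
    for name, serial in sorted(serials.items()):
        slot = slot_by_serial.get(serial, "unknown slot")
        lines.append(f"{name}: {serial} -> {slot}")
    return lines


if __name__ == "__main__":
    for line in describe(read_nvme_serials(), SLOT_BY_SERIAL):
        print(line)
```

Any serial number that is not in the hand-written table shows up as "unknown slot", which is a hint that the table needs updating after a drive swap.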

My ZFS pool had already failed, with 1 drive being suspect. Now, with the information above, ZFS will tell me which drive "was" there, because it reports the disk by-id name (/dev/disk/by-id), which includes the serial number. Phew.

Using the serial number and my saved info, I now know exactly which drive failed and needs to be replaced.
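
That last lookup, from the name ZFS reports back to a physical slot, can be sketched as a small Python helper. This is a hedged illustration, not a tested tool: it assumes the by-id name contains the serial number as a plain substring, and the example name "nvme-Samsung_SSD_970_EVO_1TB_S467NX0M823106R" is an assumed form that may differ on your system. SLOT_BY_SERIAL and slot_for_vdev are hypothetical names; the table holds the serial numbers and slots recorded earlier.

```python
# Hand-recorded table: drive serial number -> physical slot label.
SLOT_BY_SERIAL = {
    "MK19071610054794D": "Hyper card, slot 1",
    "MK19071810054A087": "Hyper card, slot 2",
    "MK19071810054A0B5": "Hyper card, slot 3",
    "MK19071810054A0B7": "Hyper card, slot 4",
    "S467NX0M823106R": "motherboard, slot 5",
    "S467NX0M823089V": "motherboard, slot 6",
}


def slot_for_vdev(vdev_name, slot_by_serial):
    """Return (serial, slot) for the recorded serial found in vdev_name.

    vdev_name is the device name as ZFS reports it. Returns None when no
    recorded serial number appears in the name.
    """
    for serial, slot in slot_by_serial.items():
        if serial in vdev_name:
            return serial, slot
    return None


if __name__ == "__main__":
    # Assumed example of a by-id style name; substitute the name from your pool.
    failed = "nvme-Samsung_SSD_970_EVO_1TB_S467NX0M823106R"
    print(slot_for_vdev(failed, SLOT_BY_SERIAL))
```

A substring match is used so that a partition suffix on the name does not break the lookup. It relies on no recorded serial number being a substring of another, which holds for the six listed here.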