Question: Where to store my dataset (read-only)?

Feb 20, 2023
I have a 2TB training dataset to run on my desktop PC with four NVIDIA RTX 2080 Ti graphics cards. The entire 2TB dataset will be read continuously from disk (without a break) during each training epoch, and there are 200 epochs to train (total training time is estimated at two months).
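
To make the read pattern concrete, here is a minimal sketch of how this kind of per-epoch disk streaming typically looks in PyTorch. The file layout (individual .npy sample files under a root directory) and the path D:\dataset are hypothetical, just for illustration; the point is that with num_workers > 0 the DataLoader prefetches from disk in background processes, so the drive's sustained throughput is what matters, not single-read latency.

```python
import glob
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class DiskDataset(Dataset):
    """Streams samples from disk; assumes one .npy file per sample (hypothetical layout)."""
    def __init__(self, root):
        self.paths = sorted(glob.glob(f"{root}/*.npy"))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Each __getitem__ is one disk read; workers run these in parallel.
        return torch.from_numpy(np.load(self.paths[idx]))

loader = DataLoader(
    DiskDataset(r"D:\dataset"),  # swap in whichever drive is being considered
    batch_size=32,
    num_workers=4,    # background reader processes that hide disk latency
    pin_memory=True,  # faster host-to-GPU copies
)
```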

My desktop storage configuration is as follows:

C drive: 4TB Samsung 970 EVO NVMe SSD with 320MB/sec R/W speed. The Windows 10 OS, the Anaconda environment, and all PyTorch program files reside on the C drive (there is still 2.5TB of free space available).

D drive: 8TB Western Digital HDD with approx. 30MB/sec R/W speed.

External (USB): 4TB portable SSD connected over USB 3.1, with 240MB/sec R/W speed.

Given this hardware, I am weighing where to store my (read-only) training dataset.

The C drive is the fastest. However, if I store the training dataset on the C drive, won't the Windows OS, Anaconda, all program scripts, and the dataset reads end up contending for the same drive's bandwidth?
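
One way to settle the contention question empirically: read a few GB sequentially from each candidate drive and measure the MB/sec actually achieved. The sketch below does this with plain Python; the benchmark file paths are hypothetical, and the test file should be much larger than RAM (or the OS file cache will inflate the numbers).

```python
import time

def read_throughput(path, chunk_mb=64, max_gb=4):
    """Sequentially read up to max_gb from path and return MB/sec."""
    chunk = chunk_mb * 1024 * 1024
    limit = max_gb * 1024 ** 3
    read = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while read < limit:
            buf = f.read(chunk)
            if not buf:
                break
            read += len(buf)
    secs = time.perf_counter() - start
    return read / (1024 * 1024) / secs

# Hypothetical test files placed on each drive beforehand.
for drive, path in [("C", r"C:\bench\big.bin"),
                    ("D", r"D:\bench\big.bin"),
                    ("USB", r"E:\bench\big.bin")]:
    print(f"{drive}: {read_throughput(path):.0f} MB/sec")
```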

The HDD on the D drive is quite slow (approx. 30MB/sec), and I am also afraid that two months of continuous, massive reads may damage this cheap HDD.

The external SSD might be a good choice, but I have seen warnings on the internet that continuous, massive data transfer through a USB port gets slower and slower over a long period (eventually dropping to 1~2MB/sec), and that any single read failure (due to a power or heat issue over time) might terminate the program without warning. Is USB storage really that fragile?
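
On the single-read-failure worry: a transient I/O error does not have to kill the run if reads are wrapped defensively. A minimal sketch, assuming failures surface as OSError (the retry count and backoff values below are arbitrary choices, not recommendations):

```python
import time

def read_with_retry(path, retries=3, backoff_s=5.0):
    """Read a file, retrying transient I/O errors so one hiccup
    does not terminate a two-month training run."""
    for attempt in range(retries):
        try:
            with open(path, "rb") as f:
                return f.read()
        except OSError as e:
            if attempt == retries - 1:
                raise  # give up after the final attempt
            print(f"read failed ({e}), retrying in {backoff_s}s")
            time.sleep(backoff_s)
```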

I wonder whether anyone has been in a similar situation and can suggest the right choice for this environment.