[SOLVED] Achieving high I/O speed onto an NVMe SSD?

Sep 9, 2021
Hello,

I have a Windows 10 Lenovo P620 Workstation with two NVMe SSD drives, and I have an application where I need to write to the drive at around 3 GB/s. I have benchmarked the drives using CrystalDiskMark, and get something like the following for both.

(CrystalDiskMark screenshot: sequential read/write results)


Now, when I try to write a C program to do the same thing, the best I can achieve is about 2700 MB/s. I've tried fwrite, fstream, the Windows CreateFile and WriteFile functions, etc., with different block sizes in powers of two. I've also tried going through the CrystalDiskMark and DiskSpd source code to find out what they're doing differently, but I can't find the information I'm looking for.

Should I expect to hit these >4 GB/s write speeds with my own program if I do everything correctly, or is the benchmark tool's number something that isn't practical to achieve? If it is possible, any tips on increasing my speeds? I've included a sample program below that gets me ~2700 MB/s.

Thank you.

C-like:
#include <stdio.h>
#include <windows.h>

#define buf_size 8388608     /* 8 MiB per fwrite call */
#define iterations 100       /* 100 x 8 MiB = 800 MiB written in total */

int main()
{
    unsigned char* buf = new unsigned char[buf_size] {0};   // zero-filled source buffer

    LARGE_INTEGER frequency;
    LARGE_INTEGER start;
    LARGE_INTEGER end;
    double interval;
    double bandwidth;

    FILE* fp = fopen("D:\\test.bin", "wb");
    if (fp == NULL)
    {
        printf("Could not open D:\\test.bin for writing.\n");
        delete[] buf;
        return 1;
    }

    QueryPerformanceFrequency(&frequency);
    printf("Starting Timer.\n");

    QueryPerformanceCounter(&start);
    for (int i = 1; i <= iterations; i++)
    {
        fwrite(buf, buf_size, 1, fp);
    }
    fflush(fp);                       // push any remaining stdio buffer before stopping the timer
    QueryPerformanceCounter(&end);

    interval = (double)(end.QuadPart - start.QuadPart) / frequency.QuadPart;
    bandwidth = (double)buf_size * (double)iterations / interval / 1048576.0;   // bytes/s -> MiB/s

    printf("The interval is %f s.\n", interval);
    printf("The estimated bandwidth is %f MB/s.\n", bandwidth);

    fclose(fp);
    delete[] buf;
    return 0;
}
 
Sep 9, 2021
I can't say if this is the reason, but storage companies use gigabytes (1000 megabytes) while Windows uses gibibytes (1024 mebibytes), so it cuts displayed storage down to ~91%. That's also about 91% of the advertised speeds, so could it be that? (Also, I'm pretty sure they advertise the best case for their product, so you probably shouldn't expect to get more than what they advertise.)
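(For scale: 1 GB/s = 10^9 bytes/s ≈ 0.93 GiB/s, and 1 MB/s = 10^6 bytes/s ≈ 0.95 MiB/s, so the decimal-versus-binary unit difference by itself changes the reported number by well under 10%.)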
 
Sep 9, 2021
I can't say if this is the reason, but storage companies use gigabytes (1000 megabytes) while Windows uses gibibytes (1024 mebibytes), so it cuts displayed storage down to ~91%. That's also about 91% of the advertised speeds, so could it be that? (Also, I'm pretty sure they advertise the best case for their product, so you probably shouldn't expect to get more than what they advertise.)

But I have benchmarked my SSDs on the same machine I'm running my code on. If the benchmark software gets 4.5 GB/s, why can I only get about 2.5-2.7 GB/s?
 
Sep 9, 2021
But I have benchmarked my SSDs on the same machine I'm running my code on. If the benchmark software gets 4.5 GB/s, why can I only get about 2.5-2.7 GB/s?
It was just a guess. You could try checking in Task Manager whether something else is hogging the drive, but I don't really have any ideas other than that.
 
One theory:
The benchmark app is generating data to write.

Your test app reads data from one device and writes to a second in an alternating manner.
The total time is the combination of both.
It is more difficult to write an app that overlaps the reading with writing.
 
Sep 9, 2021
One theory:
The benchmark app is generating data to write.

Your test app reads data from one device and writes to a second in an alternating manner.
The total time is the combination of both.
It is more difficult to write an app that overlaps the reading with writing.


My test app writes a buffer full of zeroes to a file; there is no data generation. I start the timer before the write loop and end it after. Where are you seeing this alternating manner?
 
A sophisticated app will have one task reading the source while a second, separate task writes the output, so the two can overlap.
The trick is in coordinating the two.

To your point, since your app and the benchmark both generate the data they write, that difference should not be so apparent here.

When you are writing to an SSD, you need available NAND blocks to receive the data.
In time, all of the free NAND blocks get filled, and TRIM and NAND management have to get involved in freeing up blocks; that causes a read and rewrite.
The size of the SSD matters: larger drives have more readily available free space.

What is the make/model of the SSD devices involved?

Drives can have different controllers.

Unless your PC has PCIe 4.0 capability, you may not be able to do 3000 MB/s.
SSDs with QLC NAND are cheaper and read well, but do not accept writes at your desired rate.

Some SSDs have fast MLC write buffers which can accept writes faster.
On a short test you may go fast, but eventually, for a large file, the buffers get filled and the underlying NAND chips become the limiter.
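As a rough illustration of the overlap idea, here is a minimal double-buffered sketch (not from this thread; the 8 MiB chunk size, the chunk count, the output path, and the memset stand-in for real data generation are all placeholders): one thread fills a buffer while a second thread writes the previously filled buffer, and the two swap slots each iteration.

C-like:
#include <stdio.h>
#include <string.h>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <vector>

#define chunk_size 8388608    // 8 MiB per chunk (placeholder)
#define chunk_count 100       // number of chunks produced and written (placeholder)

int main()
{
    std::vector<unsigned char> bufs[2] = {
        std::vector<unsigned char>(chunk_size, 0),
        std::vector<unsigned char>(chunk_size, 0)
    };
    bool ready[2] = { false, false };   // true = buffer filled, waiting to be written
    std::mutex m;
    std::condition_variable cv;

    FILE* fp = fopen("D:\\test_overlap.bin", "wb");
    if (fp == NULL) { printf("fopen failed\n"); return 1; }

    // Producer: fills the next free buffer (stands in for reading a source device).
    std::thread producer([&] {
        for (int i = 0; i < chunk_count; i++)
        {
            int slot = i & 1;
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !ready[slot]; });        // wait until the writer drained this slot
            lk.unlock();
            memset(bufs[slot].data(), i & 0xFF, chunk_size);  // placeholder "data generation"
            lk.lock();
            ready[slot] = true;
            cv.notify_all();
        }
    });

    // Consumer: writes whichever buffer is ready while the producer refills the other one.
    std::thread consumer([&] {
        for (int i = 0; i < chunk_count; i++)
        {
            int slot = i & 1;
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return ready[slot]; });         // wait until the producer filled this slot
            lk.unlock();
            fwrite(bufs[slot].data(), 1, chunk_size, fp);
            lk.lock();
            ready[slot] = false;                              // hand the slot back to the producer
            cv.notify_all();
        }
    });

    producer.join();
    consumer.join();
    fclose(fp);
    return 0;
}

Both loops run the same number of chunks, so the two ready flags alone keep them in lockstep; a real app would also check the fwrite return value and handle uneven chunk counts.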
 
Solution
CrystalDiskMark by default tests with 1 GiB of data as a single transfer for the raw bandwidth test. What it looks like you're doing is transferring 8 MiB chunks of data 100 times, which means you're hitting the storage driver stack and file system 100 times. The overhead of crossing those two layers adds up.

For raw bandwidth testing, you need to issue as much data as practicable in a single transfer.
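A minimal sketch of that idea (assumptions, not from the thread: a hypothetical D:\test_big.bin output path and a 1 GiB transfer that fits comfortably in RAM): allocate one large buffer and hand it to a single WriteFile call, so the file system and driver stack are crossed once instead of a hundred times. Benchmark tools also typically open the file unbuffered (FILE_FLAG_NO_BUFFERING), which requires sector-aligned buffers and sizes; the FlushFileBuffers call below is a cruder way to keep the Windows cache from flattering the number.

C-like:
#include <stdio.h>
#include <windows.h>

int main()
{
    const DWORD chunk = 1u << 30;   // 1 GiB issued as one WriteFile call (placeholder size)

    // VirtualAlloc returns page-aligned, zero-filled memory.
    unsigned char* buf = (unsigned char*)VirtualAlloc(NULL, chunk,
                                                      MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    if (buf == NULL) { printf("VirtualAlloc failed.\n"); return 1; }

    HANDLE h = CreateFileA("D:\\test_big.bin", GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) { printf("CreateFile failed.\n"); return 1; }

    LARGE_INTEGER frequency, start, end;
    QueryPerformanceFrequency(&frequency);
    QueryPerformanceCounter(&start);

    DWORD written = 0;
    BOOL ok = WriteFile(h, buf, chunk, &written, NULL);   // one trip through the storage stack
    FlushFileBuffers(h);                                  // force the data out of the OS cache

    QueryPerformanceCounter(&end);

    double interval = (double)(end.QuadPart - start.QuadPart) / frequency.QuadPart;
    printf("ok=%d, wrote %lu bytes in %f s -> %f MiB/s\n",
           ok, (unsigned long)written, interval, written / interval / 1048576.0);

    CloseHandle(h);
    VirtualFree(buf, 0, MEM_RELEASE);
    return 0;
}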