News SMI CEO claims Nvidia wants SSDs with 100M IOPS — up to 33X performance uplift could eliminate AI GPU bottlenecks

Nvidia should buy Optane from Intel then.
Was high IOPS one of the benefits of Optane? I thought it was mostly about latency, low queue depth performance, and cost-per-bit being less than DRAM.

There have been companies searching for or working on would-be NAND and DRAM replacements for decades. If the hundreds of billions flowing into AI get one of those technologies past the vaporware stage, that could have immense benefits for everyone.

We don't even need a universal memory, necessarily. You could kick NAND to the curb if you could match or beat it on some combination of latency, performance and endurance (both of which suffer as you go to QLC and beyond), and density/cost. Cost can start out higher and fall as production scales up.
 
Is this really so hard? I mean, to fake? Get thirty-three slower drives, a boatload of DRAM for buffers, a pool of processors, and a little hack code to ensure transaction consistency, and there you are. Sounds like a Google interview question.
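For what it's worth, the aggregation half of that really is simple in principle. Here is a minimal C sketch of round-robin striping across 33 drives; all names and the striping scheme are invented for illustration, and the buffering, consistency, and failure handling are omitted entirely:

```c
#include <stdio.h>
#include <stdint.h>

/* Toy sketch of the "33 slower drives" idea: stripe logical block addresses
 * across N_DRIVES so random I/O fans out and aggregate IOPS scale up. */
#define N_DRIVES 33

typedef struct {
    unsigned drive;      /* which physical drive services this block */
    uint64_t drive_lba;  /* block address within that drive */
} stripe_loc;

static stripe_loc map_lba(uint64_t logical_lba) {
    stripe_loc loc;
    loc.drive = (unsigned)(logical_lba % N_DRIVES);  /* round-robin striping */
    loc.drive_lba = logical_lba / N_DRIVES;
    return loc;
}

int main(void) {
    for (uint64_t lba = 0; lba < 5; lba++) {
        stripe_loc loc = map_lba(lba);
        printf("logical block %llu -> drive %u, block %llu\n",
               (unsigned long long)lba, loc.drive,
               (unsigned long long)loc.drive_lba);
    }
    return 0;
}
```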
 
Read: yes, down to the bit or byte level.
Write: not so much; it was slow, power hungry, and it ran the chip hot.

No flash SSD is going to enjoy being written to for hours at ludicrous speed either, but I don't think that should be a problem; the requirements on that point need major clarification.
Writes: very much yes, in fact. The big benefit of Optane (and why database users gobbled it up) was that, unlike NAND flash, it had true bit-level writes without the associated block wear.

For NAND flash to write a bit, you need to:
- first read an entire block (typically 4 KB)
- store it temporarily in RAM (either on the drive or on the host; if on the host, you have to shuffle it over the PCIe bus too)
- then erase the entire 4 KB block (this is where NAND wear occurs)
- then modify the bit in the 4 KB copy in RAM
- finally rewrite the modified block
(TRIM and wear-levelling mean you skip the erase and instead write to a 'fresh' block, waiting to erase until either enough 'writes' accumulate to that block or you run out of unTRIMed blocks, but both of those happen rapidly if you are doing bit-level operations rather than block-level ones)

3DXP/Optane:
- write the bit
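To make the contrast concrete, here's a toy C sketch with in-memory buffers standing in for the flash media; the 4 KB block size, the wear counter, and the function names are just assumptions for illustration:

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define BLOCK_SIZE 4096  /* assumed 4 KB block, per the list above */

static uint8_t nand_block[BLOCK_SIZE];  /* stands in for one flash block */
static unsigned erase_count = 0;        /* wear accumulates here */

/* NAND-style: modifying one bit costs a read-modify-write of the whole block. */
void nand_set_bit(size_t bit_index) {
    uint8_t buf[BLOCK_SIZE];

    memcpy(buf, nand_block, BLOCK_SIZE);         /* 1. read the whole block into RAM */
    memset(nand_block, 0xFF, BLOCK_SIZE);        /* 2. erase the block (the wear step) */
    erase_count++;
    buf[bit_index / 8] |= 1u << (bit_index % 8); /* 3. modify the bit in the RAM copy */
    memcpy(nand_block, buf, BLOCK_SIZE);         /* 4. program the whole block back */
}

/* Optane/3DXP-style write-in-place: just flip the bit. */
void xpoint_set_bit(uint8_t *media, size_t bit_index) {
    media[bit_index / 8] |= 1u << (bit_index % 8);
}

int main(void) {
    nand_set_bit(12345);
    printf("erase cycles consumed for one bit write: %u\n", erase_count);
    return 0;
}
```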
 
You can never eliminate bottlenecks; you just push them around to other components.

If you make one part faster, the math says the next-slowest component becomes the bottleneck.
 
The article said:
increasing the number of I/O operations per second by 33 times is hard, given the limitations of both SSD controllers and NAND memory.
The main way to do it is to get the controller out of the way as much as possible. It should be little more than a CXL frontend for the NAND chips. Let the GPU run the FTL and error-correction algorithms. Nvidia knows how to scale performance a lot better than SMI does, and they're on better nodes than SMI can afford.

SMI won't like that, because it means reducing their ability to "add value".
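For context, an FTL is mostly a big logical-to-physical remapping table plus housekeeping, which is why hosting it somewhere other than the controller is at least conceivable. A toy C sketch (sizes and names invented for illustration; garbage collection, wear levelling, and ECC omitted):

```c
#include <stdint.h>
#include <stdio.h>

#define LOGICAL_BLOCKS  1024
#define PHYSICAL_BLOCKS 1280   /* overprovisioned pool of flash blocks */

static int32_t  l2p[LOGICAL_BLOCKS];  /* logical -> physical map */
static uint32_t next_free = 0;        /* naive free-block allocator */

static void ftl_init(void) {
    for (int i = 0; i < LOGICAL_BLOCKS; i++) l2p[i] = -1;  /* unmapped */
}

/* Out-of-place write: allocate a fresh physical block and remap,
 * instead of erasing and rewriting the old location. */
static uint32_t ftl_write(uint32_t logical) {
    uint32_t phys = next_free++ % PHYSICAL_BLOCKS;
    l2p[logical] = (int32_t)phys;
    return phys;
}

static int32_t ftl_read(uint32_t logical) {
    return l2p[logical];   /* -1 means never written */
}

int main(void) {
    ftl_init();
    printf("write LBA 7 -> PBA %u\n", (unsigned)ftl_write(7));
    printf("rewrite LBA 7 -> PBA %u\n", (unsigned)ftl_write(7)); /* lands in a new block */
    printf("read LBA 7 -> PBA %d\n", (int)ftl_read(7));
    return 0;
}
```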

The article said:
However, the head of SMI believes that achieving 100 million IOPS on a single drive featuring conventional NAND with decent cost and power consumption will be extremely hard, so a new type of memory might be needed.
IMO, that's a rather self-serving narrative. That's not to say you won't need NAND in the form of HBF, or at least many XL-Flash chips that can be addressed in parallel, but let's just say I'm taking SMI's words with a grain of salt.
 
Was high IOPS one of the benefits of Optane? I thought it was mostly about latency, low queue depth performance, and cost-per-bit being less than DRAM.
People have gotten at least 6.5M 512B IOPS out of the PCIe 4.0 P5800X. I wonder what the equivalent IOPS from their PMem DIMMs was.
That article is referencing 13M IOPS from a pair of Optane SSDs, and he's quoting numbers he achieved using a single core. However, the fact that he needed two drives to hit 13M IOPS probably means a single drive couldn't do much more than 6.5M.
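Just as a sanity check on the scale being discussed, here's the raw bandwidth that 100M IOPS would imply at a couple of common transfer sizes (pure arithmetic, nothing assumed about any particular drive):

```c
#include <stdio.h>

/* Back-of-the-envelope: bandwidth implied by 100M IOPS at 512 B and 4 KB. */
int main(void) {
    const double   iops    = 100e6;
    const unsigned sizes[] = {512, 4096};

    for (unsigned i = 0; i < 2; i++) {
        double gbytes_per_s = iops * sizes[i] / 1e9;
        printf("%u B transfers: %.1f GB/s\n", sizes[i], gbytes_per_s);
    }
    return 0;  /* prints 51.2 GB/s for 512 B and 409.6 GB/s for 4 KB */
}
```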
 
Is this really so hard? I mean, to fake? Get thirty-three slower drives,
No, what Nvidia probably wants is 100M IOPS per-drive. They probably plan to scale up from there!

If it were just as simple as assembling a big farm of drives, we wouldn't even be talking about it, since people already do this as a matter of course. No, I'm sure they want a big farm of 100M IOPS drives!
 
For NAND flash to write a bit, you need to:
- first read an entire block (typically 4 KB)
A lot of SSDs still come formatted to 512B blocks. I just had to reformat a Samsung datacenter drive to 4k sectors, once I noticed it came in 512B mode.

Weirdly, I hear Optane SSDs actually come formatted in 4k blocks. I don't know if they even support 512B.

- store it temporarily in RAM (either on the drive or on the host; if on the host, you have to shuffle it over the PCIe bus too)
- then erase the entire 4 KB block (this is where NAND wear occurs)
- then modify the bit in the 4 KB copy in RAM
- finally rewrite the modified block
Read-modify-write is how filesystems work, so it's doing what you said, but in host memory. Sure, if you did a smaller write to the SSD, it would have to work the way you described.

On Linux, even O_DIRECT I/O limits you to writing multiples of 4k at 4k-aligned offsets.
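A minimal C illustration of that constraint (the file name is a placeholder, and the filesystem has to support O_DIRECT at all; tmpfs, for example, typically doesn't):

```c
#define _GNU_SOURCE          /* O_DIRECT is a GNU extension in <fcntl.h> */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    /* The buffer, the transfer length, and the file offset all need to be
     * aligned to the logical block size; 4096 covers the usual cases. */
    if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return 1; }
    memset(buf, 0xAB, 4096);

    /* A 512-byte or unaligned write here would fail with EINVAL. */
    if (pwrite(fd, buf, 4096, 0) < 0)
        perror("pwrite");

    free(buf);
    close(fd);
    return 0;
}
```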

3DXP/Optane:
- write the bit
I'm pretty sure that only applies to the PMem DIMMs. Even then, the way CPUs work is by doing read-modify-write at 64B granularity, via the cache hierarchy. You could configure them to treat Optane as an uncacheable memory region and then get byte-addressability, but performance would suck.
 
You can never eliminate bottlenecks; you just push them around to other components.

If you make one part faster, the math says the next-slowest component becomes the bottleneck.
Actually, what you can do is create a balanced design, where all the parts are scaled appropriately so that there's no single choke point.

Even then, it's difficult to achieve this for all workloads. Somebody is going to have a workload (or go searching for one) where this balance is upset, and then there will probably be a small number of places that are limiting.

I think CPU design is a good example of such a balancing act. In order to use silicon and power efficiently, designers try to scale each part of the architecture to match the others.
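A trivial way to picture both points is to treat the system as a pipeline and take the minimum of the stage throughputs; the stage names and numbers below are made up purely for illustration:

```c
#include <stdio.h>

/* End-to-end throughput is the minimum of the stage throughputs, so speeding
 * up one stage only helps until another stage becomes the limiter. */
static int find_limiter(const double *stages, int n) {
    int limiter = 0;
    for (int i = 1; i < n; i++)
        if (stages[i] < stages[limiter]) limiter = i;
    return limiter;
}

int main(void) {
    double stages[] = {10.0, 4.0, 8.0};              /* GB/s, invented numbers */
    const char *names[] = {"GPU", "SSD", "network"};
    int n = 3;

    int limiter = find_limiter(stages, n);
    printf("end-to-end: %.1f GB/s, limited by %s\n", stages[limiter], names[limiter]);

    stages[1] *= 33.0;   /* make the SSD 33x faster... */
    limiter = find_limiter(stages, n);
    printf("after 33x SSD: %.1f GB/s, limited by %s\n", stages[limiter], names[limiter]);
    return 0;
}
```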