News Intel Announces Optane DIMMs for Workstations, SSD 665P With 96-Layer QLC, Roadmap

DavidC1

Distinguished
PCIe 4.0 doesn't even compare. While the bandwidth is there, the latency is not. The Optane drives are at 10us. These DIMMs are an additional 20-40x faster in terms of latency.

You could get the bandwidth to be at 500GB/s, but it'll still be unusable for system memory if the latencies are in the 100us range as with SSDs.
 
PCIe 4.0 doesn't even compare. While the bandwidth is there, the latency is not. The Optane drives are at 10us. These DIMMs are an additional 20-40x faster in terms of latency.

You could get the bandwidth to be at 500GB/s, but it'll still be unusable for system memory if the latencies are in the 100us range as with SSDs.

These are not Optane drives, they are Optane DIMMs. They are basically non-volatile memory. They are still in the ns range for latency, although they are higher than regular DRAM. However, most systems allow for mixed setups with some DRAM and some Optane DIMMs, at least the servers do. So it might be possible to have half of the DIMMs be faster DRAM and the other half be larger Optane DIMMs, allowing for storage that has vastly higher bandwidth and vastly lower latency than NVMe (average is 20ms).
 

bit_user

Polypheme
Ambassador
They are still in the ns range for latency, although they are higher than regular DRAM. However, most systems allow for mixed setups with some DRAM and some Optane DIMMs, at least the servers do. So it might be possible to have half of the DIMMs be faster DRAM and the other half be larger Optane DIMMs, allowing for storage that has vastly higher bandwidth and vastly lower latency than NVMe (average is 20ms).
I'm not sure where you got 20 ms for an average NVMe, but that ain't right. Maybe you meant 20 microseconds, but 20 ms is hard disk territory. 20 us is in the ballpark, for SSD reads.

Here's a fairly recent measurement of SATA drive IOPS:
http://media.bestofmicro.com/L/D/736897/original/image006.png

Source: https://www.tomshardware.com/reviews/crucial-mx500-ssd-review-nand,5390-3.html

So, if you just take the reciprocal of the QD1 IOPS, then you get 83 to 183 usec. However, that includes the time to read & transfer the whole 4k of data, as well as potentially traversing the entire OS I/O stack & probably also the filesystem driver. I don't know what the standard is for measuring read latency, but it might be less.

Here's the same measurement of high-end NVMe drives, from the Intel 905P review:
http://media.bestofmicro.com/Q/E/769478/original/image007.png

Source: https://www.tomshardware.com/reviews/intel-optane-ssd-905p,5600-2.html

Again, the same math for the 905P indicates 14.8 usec for the fastest Optane NVMe and 58.1 to 98.5 usec for the NAND-based NVMe drives.
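
For anyone who wants to redo that arithmetic, here's a minimal sketch of the reciprocal-of-QD1-IOPS estimate. The IOPS values below are illustrative placeholders back-calculated from the latencies quoted above, not numbers read off the charts:

```c
#include <stdio.h>

/* At queue depth 1, each 4k read finishes before the next is issued,
 * so the average latency is roughly the reciprocal of the IOPS figure. */
static double qd1_latency_us(double iops)
{
    return 1.0e6 / iops;            /* convert seconds to microseconds */
}

int main(void)
{
    /* Illustrative QD1 IOPS values (placeholders, not chart readings). */
    double sata_nand_iops   = 12000.0;  /* -> ~83 us   */
    double nvme_nand_iops   = 17000.0;  /* -> ~59 us   */
    double optane_905p_iops = 67500.0;  /* -> ~14.8 us */

    printf("SATA NAND:   %5.1f us\n", qd1_latency_us(sata_nand_iops));
    printf("NVMe NAND:   %5.1f us\n", qd1_latency_us(nvme_nand_iops));
    printf("Optane 905P: %5.1f us\n", qd1_latency_us(optane_905p_iops));
    return 0;
}
```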


As for how latencies and bandwidth compare between Optane DIMMs and DDR4, there's a handy table in this article:

https://www.tomshardware.com/news/intel-optane-dimm-pricing-performance,39007.html

According to that, the bandwidth of a single Optane DIMM is roughly in the same ballpark as some of the high-end PCIe 4.0 NVMe drives - not vastly higher. AFAIK, Intel CPUs only support up to two channels, meaning your max is only double that.

And latency is about 100x lower, for Optane DIMMs vs NVMe. However, I'm certain the measurement methodology for the DIMMs is just like a memory access. They can't be going through the OS, any filesystem, or even the kernel. And it's not going to be 4k, either. If you would measure the 905P in the same way, maybe the difference could be brought to within 1 order of magnitude.
 
One step closer. While not as fast as traditional memory, these will be a big game changer for storage, as they are vastly faster than even PCIe 4.0 x4.
That's not even the important part of it.
The CPU will be aware that the memory addresses of the Optane DIMM are storage, so all the OS calls of "go fetch this file, go fetch that file" will fall away. The CPU will spend less time (as in none at all) senselessly copying files from storage to RAM, only to copy them back after the CPU has worked on them.
 

bit_user

Polypheme
Ambassador
That's not even the important part of it.
The CPU will be aware that the memory addresses of the Optane DIMM are storage, so all the OS calls of "go fetch this file, go fetch that file" will fall away. The CPU will spend less time (as in none at all) senselessly copying files from storage to RAM, only to copy them back after the CPU has worked on them.
That requires OS support, which will take time.

https://www.phoronix.com/scan.php?page=news_item&px=Intel-PMEMFILE

Then, if you want apps to use the data in-place, that will take yet more time (unless they're already written to use mmap()).
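
For reference, here's a minimal sketch of what "already written to use mmap()" looks like. The path is hypothetical, and it assumes the file sits on a persistent-memory-aware (DAX) filesystem, so loads and stores reach the media without a read()/write() syscall per access:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical file on a DAX-mounted persistent-memory filesystem. */
    int fd = open("/mnt/pmem/data.bin", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    size_t len = 4096;
    /* Map the file into the address space; plain loads/stores then hit
     * the media directly instead of going through read()/write().      */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    memcpy(p, "updated in place", 17);   /* plain store, no syscall   */
    msync(p, len, MS_SYNC);              /* flush to persistent media */

    munmap(p, len);
    close(fd);
    return 0;
}
```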
 
That requires OS support, which will take time.

https://www.phoronix.com/scan.php?page=news_item&px=Intel-PMEMFILE

Then, if you want apps to use the data in-place, that will take yet more time (unless they're already written to use mmap()).
The very article you link to states that Intel tries to keep the kernel, i.e. the OS, out of it to get better performance.
It's all about the CPU having direct access without having to go through OS calls.
Apps don't have to be optimized for it either, at least not the apps us common people care about; they will automatically load up, and do all of their I/O, faster.
Very specialized, very I/O-heavy apps will have to be specially written to take full advantage, but that's the worry of server/workstation IT.

" Intel's new user-space file-system for persistent memory is designed for maximum performance and thus they want to get the kernel out of the way. "
 
The very article you link to states that Intel tries to keep the kernel, i.e. the OS, out of it to get better performance.
It's all about the CPU having direct access without having to go through OS calls.
Apps don't have to be optimized for it either, at least not the apps us common people care about; they will automatically load up, and do all of their I/O, faster.
Very specialized, very I/O-heavy apps will have to be specially written to take full advantage, but that's the worry of server/workstation IT.

" Intel's new user-space file-system for persistent memory is designed for maximum performance and thus they want to get the kernel out of the way. "

I am all for best performance, but there is a downside to bypassing the kernel. If written wrong, it could cause hard crashes. It's much like the old days of gaming, where a game could crash a system, while now it will typically crash the program and API instead.

However, I have some faith that since Intel is developing these heavily for HPC and enterprise first, they would work to help prevent that sort of issue. Can't have a poorly written application taking a server down.
 

bit_user

Polypheme
Ambassador
The very article you link to states that Intel tries to keep the kernel, i.e. the OS, out of it to get better performance.
It's all about the CPU having direct access without having to go through OS calls.
The OS must necessarily be involved at some point, for security, if nothing else. I believe the goal is to remove the OS from the "fast path" (i.e. individual data-accesses).

Apps don't have to be optimized for it either, at least not the apps us common people care about; they will automatically load up, and do all of their I/O, faster.
Apps currently make OS calls. So, either the user has to use a shim, like that article cites, or the OS needs to be modified to do effectively the same thing. So, I think we're looking at having some level of OS support, before this stuff is ready for prime time.
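
To make the "shim" idea concrete, here's a minimal sketch of the classic LD_PRELOAD interposition pattern; this is not Intel's actual pmemfile code, and a real persistent-memory shim would service the call from a memory mapping instead of just forwarding it:

```c
/* Build as a shared object (gcc -shared -fPIC -o shim.so shim.c -ldl)
 * and load with LD_PRELOAD. This version only logs and forwards.       */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <unistd.h>

ssize_t read(int fd, void *buf, size_t count)
{
    static ssize_t (*real_read)(int, void *, size_t) = NULL;
    if (!real_read)                       /* look up libc's read() once */
        real_read = (ssize_t (*)(int, void *, size_t))dlsym(RTLD_NEXT, "read");

    /* A pmem-aware shim would check whether fd lives on persistent
     * memory and, if so, copy straight from the mapping here.          */
    fprintf(stderr, "intercepted read(fd=%d, %zu bytes)\n", fd, count);
    return real_read(fd, buf, count);
}
```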
 

bit_user

Polypheme
Ambassador
I am all for best performance, but there is a downside to bypassing the kernel. If written wrong, it could cause hard crashes. It's much like the old days of gaming, where a game could crash a system, while now it will typically crash the program and API instead.
Right. I think we're not looking at completely bypassing the kernel. I'd imagine something more like how virtual memory works, where the OS uses the page table to ensure users don't scribble on filesystem datastructures or access files/areas they don't have permission to read or write.

However, I have some faith that since Intel is developing these heavily for HPC and enterprise first, they would work to help prevent that sort of issue. Can't have a poorly written application taking a server down.
Yeah, this stuff sounded like early days, but it shows one possible direction they're headed.
 
Apps currently make OS calls. So, either the user has to use a shim, like that article cites, or the OS needs to be modified to do effectively the same thing. So, I think we're looking at having some level of OS support, before this stuff is ready for prime time.
They could make some sort of DX12-like thing, where it's still a driver but basically direct access to the hardware.
 
Nov 15, 2019
Yes, you have to have access to a file to map it, whether you are creating it or accessing an existing file.
Then the TLB has to be populated for each page you are trying to access.
After that, reads and writes go directly from your program to the controller on the device.
 

DavidC1

Distinguished
And it's not going to be 4k, either. If you would measure the 905P in the same way, maybe the difference could be brought to within 1 order of magnitude.

I know, I know, late late reply.

The 905P is on a different interface, so half of its latency is due to that. So if the media had zero latency, then it could be halved.

4K access being at 2-3us with Optane DIMMs is only possible because it's in the DIMM format.

Of course the real thing about the DIMM is avoiding the 4K access. Hopefully we'll get to see the line blurred very soon and applications using the 256B mode.

And initially it'll be on a workstation, so 4 channels. Right now it's for 6-channel chips.
 

bit_user

Polypheme
Ambassador
The 905P is on a different interface, so half of its latency is due to that. So if the media had zero latency, then it could be halved.
You're saying PCIe vs. DIMM adds equivalent latency to what the 3D XPoint memory itself has? How did you arrive at this conclusion?

4K access being at 2-3us with Optane DIMMs is only possible because it's in the DIMM format.
Huh? In that post, I linked an article saying that Optane DIMMs had sequential read speeds of 7.6 GB/sec. So, by my calculations, that gives you a 4 kB latency of 0.54 usec.
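
The arithmetic behind that figure, as a quick sketch:

```c
#include <stdio.h>

int main(void)
{
    double bytes     = 4096.0;    /* one 4 kB transfer        */
    double bandwidth = 7.6e9;     /* 7.6 GB/s sequential read */

    printf("%.2f us\n", bytes / bandwidth * 1e6);   /* prints ~0.54 */
    return 0;
}
```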

Of course the real thing about the DIMM is avoiding the 4K access.
So should NVMe. In fact, because 3D XPoint is byte-addressable, it seems like even SATA shouldn't need to read more bytes from the media than what you actually request.

4 kB is just a common benchmarking size, which conveniently matches the smallest sector size on more recent HDDs and SSDs. The sector size dictates the smallest block the disk can write. And, if you're talking about a real HDD or a NAND-based SSD, maybe it also has to read that much, for ECC, but I think it shouldn't send the entire sector over the bus - just the part you wanted.

And initially it'll be on a workstation, so 4 channels. Right now it's for 6-channel chips.
I believe the server Xeons that support it actually devote a pair of channels to it, in addition to their DDR4 channels. So, they actually have a total of 8 channels.

I don't know much about Intel's Optane plans, on quad-channel & dual-channel CPUs, but I'm pretty sure they're not expecting people to use Optane instead of DDR4. Not in the short term, at least. For now, I think it's intended as supplemental.
 

DavidC1

Distinguished
You're saying PCIe vs. DIMM adds equivalent latency to what the 3D XPoint memory itself has? How did you arrive at this conclusion?

The Optane DIMMs have a Read latency of 180-340ns. Yes, the difference is due to PCI Express vs DIMMs, and the fact that it uses 256B addressing mode.

Huh? In that post, I linked an article saying that Optane DIMMs had sequential read speeds of 7.6 GB/sec. So, by my calculations, that gives you a 4 kB latency of 0.54 usec.

https://www.tomshardware.com/news/intel-optane-dimm-pricing-performance,39007.html
https://www.intel.com/content/www/u...cing-bandwidth-and-latency-article-brief.html
https://cdn.mos.cms.futurecdn.net/AW9oYo9kXB5gu5eyx9SdFF.jpg

4 kB is just a common benchmarking size, which conveniently matches the smallest sector size on more recent HDDs and SSDs. The sector size dictates the smallest block the disk can write. And, if you're talking about a real HDD or a NAND-based SSD, maybe it also has to read that much, for ECC, but I think it shouldn't send the entire sector over the bus - just the part you wanted.

For compatibility reasons, all storage devices have to support the 4K format. When NAND SSDs first came out, they did that to be as equivalent to HDDs as possible. Same with Optane SSDs.

I believe the server Xeons that support it actually devote a pair of channels to it, in addition to their DDR4 channels. So, they actually have a total of 8 channels.

No. You need at least one DDR4 DIMM for every Optane DIMM. Each channel can support two DIMM slots, and with 6 channels that's 12 slots, so with 256GB DIMMs you can have 3TB capacity.

With Optane, 6x are filled with DRAM and 6x are filled with Optane. So you have (256GB x 6) + (512GB Optane x 6) = 4.5TB capacity if used in App Direct mode, or 3TB if used in Memory mode. Storage over App Direct, like for running the CrystalDiskMark tests, also ends up with 3TB.
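
A quick sketch of that capacity math, using the per-DIMM sizes given above:

```c
#include <stdio.h>

int main(void)
{
    double dram_tb   = 6 * 256.0 / 1024;   /* six 256GB DDR4 DIMMs   -> 1.5 TB */
    double optane_tb = 6 * 512.0 / 1024;   /* six 512GB Optane DIMMs -> 3 TB   */

    /* App Direct mode: DRAM stays system memory, Optane is addressed separately. */
    printf("App Direct total: %.1f TB\n", dram_tb + optane_tb);   /* 4.5 TB */

    /* Memory mode: DRAM becomes a cache, so only the Optane capacity is visible. */
    printf("Memory mode:      %.1f TB\n", optane_tb);             /* 3 TB   */
    return 0;
}
```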

Have you done any reading on the Optane DIMMs?
 