3D XPoint SSD Pictured, Performance And Endurance Revealed At FMS

Some may not be so impressed, but this is just the very beginning of a tech that's still in its infancy (just 2 layers...lol). Give it 2-3 years and I bet it'll be mind-blowing.
 
Allyn Malventano at pcper.com writes about the U.2 chart:

http://www.pcper.com/news/Storage/FMS-2016-Micron-Keynote-Teases-XPoint-QuantX-Real-World-Performance

"These are the performance figures from an U.2 device with a PCIe 3.0 x4 link. Note the outstanding ramp up to full saturation of the bus at a QD of only 4."

If the bus can be saturated that easily, it's time to up the clock rate on PCIe lanes, at a minimum to 12 Gbps just like 12G SAS.

I'm optimistic that these non-volatile memory technologies will get refined with time, and get cheaper with mass production. Also, the PCIe 4.0 clock will oscillate at 16G, so it's time the industry embraced the concept of "overclocking" storage a/k/a variable channel bandwidth.
 
Well, 4-5 times more expensive should be expected, maybe even more. It's a new technology, and when it comes to market the performance leap could be as big as the jump from HDD to SSD, and SSDs were 10 times more expensive at the beginning.

I hope this continues, but I'm also afraid of a scenario where all XPoint technology belongs to just 1 or 2 big companies, which would be awful for pricing and compatibility.
 


Rather than 4-5 times more expensive than NAND, you should be thinking of it as half the price of DRAM (as it's closer to DRAM specs than it is to NAND specs). The charts show it being crippled by the interface, not the tech.
 
"Much of the leading-edge developmental work to increase performance is focusing on peeling back these layers... such as reducing interrupts by utilizing polling drivers ... can offer drastic performance increases" - I really don't know how realistic this idea is, but maybe its worth trying using pcie entries (cards, cables) inside big data (fastest nand etc..) just as SAS controllers are used in SCSI /SATA/ethernet. Does that make sense? If many pcie connections to fast data can work simultaneously and manipulate the data, then we can drop Ethernet style switches with load balancing, latency and other last millennium concerns .. no?
 
I expect it to be at least 10 times more expensive than normal NAND at release, maybe even more. In the longer run, 4 times more than normal NAND sounds plausible.

The first devices will go to big corporations that handle a lot of data, like Google and Facebook, who have more money than is needed to get very high-speed storage.
So the first devices are like luxury sports cars: fast and expensive.
 
> I really don't know how realistic this idea is

From an analytical point of view, I share your concern.

Here's why:

When doing optimization research, we focus on where
a system is spending most of its time.

For example, if the time spent is distributed 90% / 10%,
cutting the 90% in half is MUCH more effective
than cutting the 10% in half.

Now, modern 6G SSDs are reaching 560 MB/second
on channels oscillating at 6 GHz ("6G").

The theoretical maximum is 6G / 10 bits per byte = 600 MB/second
(8b/10b encoding puts 10 line bits on the wire per data byte: the "legacy frame")

So, what percentage of that ceiling is overhead?

Answer: (600 - 560) / 600 = 40 / 600 = 6.7%

WHAT IF we merely increase the channel clock to 8 GHz?

Then, 8G / 10 bits per byte = 800 MB/second

That alone raises the ceiling by 33.3% (800/600 = 1.333)

Now, add PCIe 3.0's 128b/130b jumbo frame
(130 bits / 16 bytes).

Then, 8G / 8.125 bits per byte = 984.6 MB/second

984.6 / 600 = 1.64, a 64% improvement

And, what if we increase the channel clock to 12G (like SAS)
-and- we add jumbo frames too:

Then, 12G / 8.125 bits per byte = 1,476.9 MB/second

1,476.9 / 600 = 2.46, a 146% improvement

I submit to you that the clock rate AND the frame size
are much more sensitive factors.

Yes, latencies are also a factor, but we must be
very realistic about "how sensitive" each factor is, in fact,
and be empirical about this question, NOT allowing
theories to "morph" into fact without experimental proof.
 


I'm thinking of the usage, not the specs. In the end, no matter where it's deployed, it's still memory as storage, like NAND. It's a NAND replacement on a memory bus. The early adopters will pay a premium. I accept that. What I won't accept is 32-64 GB of storage, for the price of 2 TB of storage, in my precious memory slot. I'll just populate with memory and go on about my business in that case. Unless and until Intel and Micron get this price structure under control, I'll wait.
 
> Unless and until Intel and Micron get this price structure under control, I'll wait.

I agree 200%.

Case in point: note well how Intel's emphasis is now the big data centers,
at the expense of individual prosumers.


("We are NOT an oligopoly," cried the entire group of SSD manufacturers.)

Plug-and-Play is also being neglected, and ways to circumvent the
ceiling imposed by Intel's narrow DMI 3.0 link must come from third-party vendors.

Why not populate 2.5" SSDs with 3D XPoint, and work "backwards"
to raise the ceilings imposed by the upstream circuitry?

If SAS clocks can oscillate at 12G, then the data channels
to 2.5" Optane SSDs can go at least that fast, if not also 16G.

So, start out with SAS-only 2.5" Optane SSDs.

Think about the large number of 12G SAS RAID controllers
to choose from, with full support for all modern RAID modes.

Plug and Play, remember? :)

The infrastructure is already in place to cool 2.5" form factors
with any of several hundred chassis now being marketed worldwide.

3.5-to-2.5" adapters are a dime a dozen.

If I were a decision maker at ASUS, I would start designing
a motherboard with at least 4 x U.2 ports, support
for all modern RAID modes and a BIOS setting allowing
the clock speed to vary -- perhaps with pre-sets like 6G, 8G,
12G and 16G.

While they are at it, add 128b/130b jumbo frames as another option!

This approach seems a lot more sensible to me,
anyway, than 3 x M.2 slots that are not bootable,
that are also prone to overheating and hence
thermal throttling, and that can't exceed the
DMI 3.0 ceiling because those M.2 slots are
all downstream of that DMI link.
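
To put a rough number on that DMI ceiling, here is a quick sketch (my own illustration, under the assumptions that each M.2 slot is x4 PCIe 3.0 and that DMI 3.0 is electrically equivalent to the same x4 link, ignoring everything but the 128b/130b encoding overhead):

```python
# Illustrative sketch: N x4 NVMe SSDs hung off the chipset all share the
# single DMI 3.0 uplink, which is itself only as wide as one of them.

def pcie3_gb_per_s(lanes: int) -> float:
    """Theoretical PCIe 3.0 payload bandwidth: 8 GT/s per lane, 128b/130b."""
    return lanes * 8 * (128 / 130) / 8  # GT/s -> GB/s of payload

dmi_uplink = pcie3_gb_per_s(4)    # ~3.94 GB/s shared by everything downstream
per_m2_slot = pcie3_gb_per_s(4)   # each M.2 slot is also x4 PCIe 3.0

for drives in (1, 2, 3):
    demand = drives * per_m2_slot
    usable = min(demand, dmi_uplink)
    print(f"{drives} x M.2 behind DMI: demand {demand:.2f} GB/s, usable {usable:.2f} GB/s")
```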

Heck, I remain convinced that the original architecture
of PCI-Express was to permit expansion slots to be populated
with new Add-On Cards withOUT needing to upgrade
an entire motherboard. Wasn't that one of the main
reasons why PCI-Express was developed in the first place?
 


These same items are covered in our piece, but in much greater detail.
 
How many small boot SSDs are coupled with at least one spin drive for capacity? The race to the bottom has moved flash into the realm of a primary drive system today. XPoint is going to be similar. I expect a 128GB/256GB XPoint setup to be coupled with some slower flash drive(s) for capacity. The entertaining part is that the slow drives will be NVMe cards, and the spin drive will finally be dead in the enthusiast-class machine. Continuing to reduce the storage bottleneck will be much more of a perceivable user-experience change than increasing CPU performance. With all of this being piled onto the PCIe bus, available channels are going to become more relevant due to storage concerns rather than multiple GPU cards.
 


If we look at SATA, its primary design target was a spin drive that topped out at around 30-60 megabytes per second with 8-9 millisecond seek times. The design was really there to optimize the CPU's ability to mitigate this bottleneck. For the first time we started to see serialization of the consumer hard drive bus, which allowed for simpler cabling and better error handling, but at a theoretical cost in latency because we now have to serialize, and then de-serialize, the data. In real-world performance, the answer to this is to simply increase bandwidth and thus it is faster, right? Not really. If we look at spin drive performance on PATA vs. first-generation SATA, there wasn't a real magical performance leap. A spin drive still performed pretty much the same. Some manufacturers were throwing bigger caches on the board and increasing performance, but not because of SATA. On the other hand, SATA cables are definitely more reliable, easier to install and cheaper, so the industry was all good with SATA. As the tech moved forward, Native Command Queuing stepped in as well, which did give performance gains, but this wasn't really dependent on SATA and was a carryover from SCSI. It was a controller-evolution technology that was just part of the SATA progression because by that time PATA was dead.
In step SSD flash drives. Suddenly, instead of 8-9 millisecond responses, we move into 65 microseconds in an X25-M-class drive, which is also reading at 250 megabytes per second and writing at 70! Suddenly your 1.5 gigabit (remember to divide by 8 for bytes) SATA 1.0 bandwidth, which is theoretically good for 188 megabytes per second (without overheads), is completely saturated. So the easy answer here is to keep jumping the bandwidth to keep up with the drives. So in concept we just keep bumping up the SATA spec and we will keep having faster drives? Well, it doesn't work quite like that. Remember that issue of serializing/deserializing? It takes time. Also, SATA controllers typically run on top of the PCI bus, and the PCI bus has interrupts that have to be flagged before the CPU will even look at them. Real-world workloads on a computer are not about extremely large files that we move around in a linear fashion, but a ton of tiny files being appended and changed at unexpected times, and most of the CPU time spent on a computer goes into an idle cycle. So the first-generation SSD's magic over spin drives wasn't really the bandwidth, but actually the latency reduction. Suddenly IOPS became a thing.

As flash SSDs have matured, the controller technology has significantly reduced the latency, so much so that the SATA controller and bus simply being there became a bottleneck. Current NVMe drives are now down to 20-microsecond latency, and that is with the drive sitting directly on the PCIe bus. If XPoint at the lowest level is really 1000x faster than flash, the PCIe bus itself is already a latency bottleneck. And latency here will be key for perceivable performance change, not simply bandwidth. The wild part here is that even as an infant prototype tech it is already maxing out x4 PCIe lanes in the bandwidth arena. DRAM, which has operated in the nanosecond arena for quite a while now (15-25 nanoseconds), isn't put on top of the PCIe bus for this very reason: it has its own special bus, linked to the CPU, so that its ultra-low latency can do its thing. XPoint's latency figures put it into a class that is somewhat slower than DRAM, but still in the nanosecond class. That means that this stuff sitting on a PCIe bus will probably only show 10x-20x gains, because PCIe is the primary bottleneck. On the other hand, tweaked into a new DRAM super-set, that gain will get much closer to the 1000x figure.

As storage latency approaches the CPU clock cycle, gains of this magnitude will bring real-world performance changes that are much more perceivable than doubling CPU performance. Being in these forums, though, everyone seems focused on bandwidth. In real-world applications, if I am retrieving a 150-kilobyte file with 200-300 nanoseconds of latency on the RAM bus, in contrast with 2-20 microseconds on PCIe, I will have finished the total operation before the high-bandwidth (but higher-latency) connection even starts (even if we bind 40 PCIe lanes to the NVMe controller and have an unwieldy ability to move a massive file). Quit worrying about 12 Gb/s channels and start focusing on getting this stuff closer to the CPU. x16 PCIe is already at 16 GB per second, but binding channels won't reduce latency.
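
To illustrate the small-file point numerically, here is a rough sketch (my own; the latencies are loosely taken from the figures above, the bandwidths are assumptions): total fetch time is access latency plus transfer time, and for a 150 KB file the latency term decides which tier wins.

```python
# Rough illustration (assumed bandwidths, latencies loosely from the post):
# time to fetch one small file = access latency + transfer time.

def fetch_time_us(file_kb: float, latency_us: float, bandwidth_gb_s: float) -> float:
    """Total fetch time in microseconds for one file."""
    transfer_us = (file_kb * 1024) / (bandwidth_gb_s * 1e9) * 1e6
    return latency_us + transfer_us

FILE_KB = 150
tiers = [
    ("DRAM-class bus  (0.25 us, 20 GB/s)", 0.25, 20.0),
    ("NVMe over PCIe  (20 us, 3.9 GB/s)", 20.0, 3.9),
    ("SATA flash SSD  (65 us, 0.55 GB/s)", 65.0, 0.55),
    ("Spinning disk   (8500 us, 0.15 GB/s)", 8500.0, 0.15),
]
for name, latency, bandwidth in tiers:
    print(f"{name}: {fetch_time_us(FILE_KB, latency, bandwidth):9.1f} us total")
```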

 
" the four chips on the two daughterboards are the first 3D XPoint packages we have seen in the wild. The packages are unmarked, but the overall capacity of the PCIe 3.0 x8 card weighs in at 128 GB. We are unsure if there are more packages on the board, as our time with the prototype was short.

If there were only four packages on the board, it would equate to 32 GB per package. Micron noted that 20nm 3D XPoint would feature a die density of 128 Gbit (16 GB) per die, so that implies that each of the four packages features a two-die stack."

Um, I'd say it's pretty obvious there are 8 packages there. The two riser cards have two each on top, and the patterns diagonally from them clearly indicate another two mounted underneath each.

They must've had a trace-width / layer-count issue for the dev board to fan out all the connections, hence the top/bottom staggering.
 
....or the boards were designed for 4 packages each but only 2 positions were populated, and I'm seeing the landing pads for the two that weren't placed. ;-) But that wouldn't equate to the die density statement from Micron. Hm....
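
For what it's worth, the arithmetic both ways, using the article's figures of a 128 GB card and 128 Gbit (16 GB) per 20nm die:

```python
# Package-count arithmetic from the article's figures: 128 GB total capacity
# and 128 Gbit (16 GB) per 20nm 3D XPoint die.

CARD_GB = 128
GB_PER_DIE = 128 / 8  # 128 Gbit = 16 GB

for packages in (4, 8):
    gb_per_package = CARD_GB / packages
    dies_per_package = gb_per_package / GB_PER_DIE
    print(f"{packages} packages -> {gb_per_package:.0f} GB each "
          f"= {dies_per_package:.0f} die(s) per package")
```

So 8 populated packages would mean a single die per package, while 4 would mean the two-die stacks the article inferred.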
 
It seems to me at this point local computing hardware needs to get away from the memory/storage archetype and move to 100% memory, with secondary external storage for archiving and low-use data. There seems to be very little separating RAM and NAND, and I think this is the next step in faster, more responsive computing. This would be a paradigm shift, though, and industry momentum will likely keep it from happening for a long time. The idea of having 1 TB of storage located at the memory level is pretty interesting to me. I am sure there are issues I am not addressing while considering this, but at this point having a separate memory system makes no sense. Just load everything into solid-state storage. Instant boots? No caching delays? Yes sir, thanks. Because as far as I know, almost 100% of waiting on computers has to do with memory/storage swaps and loading cached data to run programs. I could be wrong, I am not an expert.
 
> a 200 GB QuantX SSD saturates the x4 bus at QD4
> The second chart is an identical test with an x8 connection.

x4 saturates bus at 900
x8 saturates bus at 1800

= almost exactly linear scaling.

Thus, x16 saturates bus at 3600??

Looks like lane count makes a BIG DIFFERENCE.

I'd like to see the same comparison of sequential speeds:
that will permit an empirical calculation of controller overhead
vs. theoretical maximum bandwidth.
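
A quick sanity check of that extrapolation (my own sketch; units are whatever the chart uses, and the x16 figure is a projection, not a measurement):

```python
# Linear-scaling check: if the per-lane rate is constant, doubling the lane
# count doubles the saturation point. The x16 value is extrapolated only.

measured = {4: 900, 8: 1800}  # lanes -> saturation level read off the chart

per_lane = {lanes: value / lanes for lanes, value in measured.items()}
print("per-lane rate:", per_lane)  # identical values -> linear so far

if len(set(per_lane.values())) == 1:
    rate = next(iter(per_lane.values()))
    print("extrapolated x16 saturation:", rate * 16)  # 3600, if linearity holds
```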
 


Actually, it is very common to have empty pads, with nothing underneath, on smaller-capacity SSDs. The pads are populated for the denser variants.
 


I noticed that as well. My problem is that I don't want to pay an extra $200 for a motherboard with 7 x16 PCIe slots, or $100 extra for a Z-series board with 7 x8 slots. Intel should have addressed this problem a long time ago in the consumer space. A top of the line processor should have no less than 48 PCIe lanes and the chipset that goes with it should have 96. Even an S-series processor should have at least 32 lanes. I know it will up the pin count dramatically, but I find myself thinking, "So what." Am I wrong?
 
> Intel should have addressed this problem a long time ago in the consumer space. A top of the line processor should have no less than 48 PCIe lanes and the chipset that goes with it should have 96. ... Am I wrong?

AGREED!

Your point is well proven by the MAX HEADROOM of Intel's DMI 3.0 link
which is THE LATEST DMI spec = EXACT SAME bandwidth
as a single M.2 NVMe SSD (x4 PCIe 3.0 lanes @ 8 GHz).

We had 4.0 GB/s upstream bandwidth 5 YEARS ago with
a cheap Highpoint 2720SGL HBA (x8 lanes @ 500 MB/second).

Just to illustrate, FOR YEARS we have been hammering on the fact
that multiple GPUs have had x16 edge connectors, and only lately
have NVMe HBAs started showing up with x16 edge connectors.

Since Intel is shifting focus to the large data centers,
I say: leave the DMI link where it is for now, and
bump its clock to 16G at PCIe 4.0.

But, allow the third-party HBA vendors to do their thing
with x16 PCIe expansion slots.

That's why I'm making such a fuss about Highpoint's new RocketRAID 3840A:
just what the doctor ordered: PCIe 3.0 NVMe RAID with x16 edge connector
and 4 x U.2 ports.

For years now we've been populating the first x16 slot with a RAID controller,
to ensure assignment of the maximum available PCIe lanes,
because we don't need high-performance graphics and
our GPUs work fine in the lower x16 slots (e.g. with x8 lanes assigned).

One obsolete PCIe motherboard complains at Startup,
but it still works.
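
A back-of-the-envelope check of those bandwidth figures (my own sketch, assuming DMI 3.0 behaves like an x4 PCIe 3.0 link with 128b/130b encoding, the 2720SGL is an x8 Gen2 card at roughly 500 MB/s per lane, and an x16 Gen3 HBA like the 3840A gets all 16 lanes):

```python
# Theoretical payload bandwidths for the links discussed above.

def pcie_gb_per_s(lanes: int, gt_per_s: float, payload_ratio: float) -> float:
    """Theoretical payload bandwidth of a PCIe link in GB/s."""
    return lanes * gt_per_s * payload_ratio / 8

dmi3 = pcie_gb_per_s(4, 8, 128 / 130)           # DMI 3.0 uplink, ~3.94 GB/s
hba_x8_gen2 = 8 * 0.5                           # 2720SGL: x8 lanes @ ~0.5 GB/s each
hba_x16_gen3 = pcie_gb_per_s(16, 8, 128 / 130)  # x16 Gen3 HBA, ~15.75 GB/s

print(f"DMI 3.0 uplink        : {dmi3:5.2f} GB/s")
print(f"x8 Gen2 HBA (2720SGL) : {hba_x8_gen2:5.2f} GB/s")
print(f"x16 Gen3 NVMe HBA     : {hba_x16_gen3:5.2f} GB/s")
```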

 