News AMD 3D V-Cache enables RAM disk to hit 182 GB/s speeds — over 12X faster than the fastest PCIe 5.0 SSDs

Status
Not open for further replies.
What would make this even more worthless is if you do manage to fill up the cache with stuff from the RAM disk, you've now evicted everyone else's data and instructions. So performance drops for everyone else because they all have cache misses now.

It's amusing to see that number in a benchmark though.
 
This is a good time to speculate: when will AMD go for a bigger L3 cache (or even add L4)?

3D V-Cache 1st/2nd gen are a single layer of 64 MiB SRAM, but AMD/TSMC said from the start that multiple layers are possible. There are diminishing returns for games and more layers would make it more expensive, but multi-layer 3D V-Cache chiplets could be made for the benefit of Epyc-X customers at least.

AMD might want to grow or shrink the SRAM layer capacity at some point. There's also a possibility of future microarchitectures sharing L3 between multiple chiplets, potentially allowing a single core to use more than the current 96 MiB limit.
 
A fun foray into ram disks, but apples and oranges...😉 A ram disk is way fast--but when you power down the machine, you lose your data. NVMe SSDs are not system ram! They hold their data when powered off. Very big difference. The correct performance comparison for NVMe is HDD. But HDD wins hands down when it comes to capacity comparison.

This story reminded me of when I was using ram disks way back with my Amigas--I had a startup sequence that took a few minutes at boot to create a ram disk and copy a bunch of disk-based data into it automatically. It was fun, for a while, until I ran out of ram, among other major inconveniences, like the time it consumed at boot. Try and imagine having more than, say, a couple of terabytes of system ram, so that when you booted the system it could make the ram disk and then copy ~2 TB of program data to the ram disk. What about people with 10 TB of programs, games, and data? You quickly run into the practical limitations of ram disks. System ram is very fast and so cheap these days because it cannot store data when the power is off as that is not its purpose.
 
I think you missed the point of the comparison. The point was that for most use cases, DRAM RAM disks are not beneficial anymore because top-tier SSD speeds nearly reach DRAM speeds.

On build servers, I've used RAM disks on occasion because there's usually a very good speed up on huge projects, like LLVM.
There is zero use for this on Ryzen outside of a one time "hey, that's cool". No matter how hard tech bloggers try to make it seem there is so they can pad out their articles.
Shouldn't you be on Reddit or something, what with your idiotic "AkTShuAlly"? They also mentioned the currently larger sizes like 1.3GB, and it being possibly useful in the future. Read: FUTURE.

Having the ability to easily map files and ensure they remain in cache can be very beneficial. It's not something most programs should be doing, especially without explicitly making the user aware. But aptly used, it can be an extreme boon.
 
It could be, but cache is transparent to software in most ISAs. MIPS is the only one I'm aware of that allows software to directly manipulate cache. Besides the method that the person used to get the figure, at least as the article writes it, is how cache is supposed to work anyway. If you want something to remain in cache, just have the software poke at the memory location enough times.
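The "poke at the memory location" idea could be sketched roughly like this: read one byte from every cache line of a buffer so the whole working set gets pulled in. Treat it as an illustration only. The 64-byte line size is an assumption (typical for current x86 CPUs), `touch_lines` is a made-up helper, and in CPython the interpreter overhead dwarfs any cache effect. Nothing here pins the data in cache; another thread can still evict it.

```python
def touch_lines(buf, line_size=64):
    """Read one byte from each cache-line-sized chunk of buf.

    line_size=64 is an assumption (typical for x86).
    Returns the number of lines touched.
    """
    view = memoryview(buf)
    checksum = 0
    touched = 0
    for offset in range(0, len(view), line_size):
        checksum ^= view[offset]  # the read is what pulls the line into cache
        touched += 1
    return touched


# Hypothetical 1 MiB working set, small enough to sit in L3.
working_set = bytearray(1 << 20)
lines = touch_lines(working_set)
print(lines)  # 1 MiB / 64 B = 16384 lines
```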
 
File Under: "Hold my beer"

This is a neat parlor trick, but utterly and completely useless.

Seeing enthusiasts find new usage for AMD's 3D V-Cache is fascinating. While the performance figures look extraordinary, they're still far from fulfilling 3D V-Cache's potential. For instance, the first-generation 3D V-Cache has a peak throughput of 2 TB/s.
First, it's not a new use. It's a cheap trick. You can't actually use it for any practical purpose, because it stops working as soon as your PC starts doing literally anything else. This can only work when it's completely idle, and isn't even reliable then.

Second, the V-Cache is shared by all of the cores on the chiplet. A single core can't max out the 2 TB/s of bandwidth.
[Attached image: Bandwidth-MLNvsMLN-X.png]

AMD's EPYC processors, such as Genoa-X, which has 1.3GB of L3 cache, could be an interesting use case.
Nope. Won't work for two reasons. The first is that AMD's L3 cache is segmented. Each CCD only gets exclusive access to its own slice. That means even the mighty EPYC will top out at 96 MB, for something like a RAM drive. Worse, if the thread that's running your benchmark gets migrated to another CCD, then your performance will drop because now it has to fault in the contents from the other CCD.
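To make the migration point concrete, here is a minimal sketch of pinning a process to a fixed set of logical CPUs so the scheduler can't move it to another CCD. Assumptions: `os.sched_setaffinity` only exists on some Unix systems (Linux, mainly), `pin_to_cpus` is a made-up helper, and which logical CPUs belong to which CCD is machine-specific, so the CPU picked below is just a stand-in.

```python
import os


def pin_to_cpus(cpus):
    """Restrict the current process to the given logical CPUs.

    Returns the resulting affinity set, or None where the call is
    unavailable or refused by the OS.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    try:
        os.sched_setaffinity(0, cpus)  # pid 0 means "this process"
        return os.sched_getaffinity(0)
    except OSError:
        return None


# Assumption: the lowest-numbered allowed CPU stands in for "a core on the
# CCD that holds the data". The real CPU-to-CCD mapping must be looked up.
if hasattr(os, "sched_getaffinity"):
    original = set(os.sched_getaffinity(0))
    pinned = pin_to_cpus({min(original)})
    print(pinned)
    pin_to_cpus(original)  # restore the previous affinity
```

Pinning only stops the benchmark thread from wandering. It does nothing about other processes evicting the data from that CCD's L3.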

Finally - and this gets to the heart of how useless the trick actually is - your system must be completely idle. Some other background process spinning up can blow your cache contents, forcing the benchmark to re-fetch it from DRAM. That's why the benchmark is so temperamental and must be run multiple times to get a good result. So, good job taking a 96-core/192-thread CPU and turning it into a single-core, single-thread one!
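The "run it multiple times to get a good result" part looks roughly like a best-of-N measurement. The sketch below is not the benchmark from the article; it just copies a buffer in memory and keeps the fastest run, which is the number that tends to get reported. `best_throughput_gbps` and the 32 MiB size are made up for illustration.

```python
import time


def best_throughput_gbps(buf, runs=5):
    """Copy buf `runs` times and return the best throughput in GB/s.

    Slower runs (e.g. after something else evicted the buffer from
    cache) are silently discarded, which is why best-of-N flatters
    the result.
    """
    best = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        copy = bytes(buf)  # one full pass over the buffer
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, len(copy) / elapsed / 1e9)
    return best


# Hypothetical 32 MiB buffer: fits in a 96 MiB L3, at least on an idle system.
data = bytearray(32 * 1024 * 1024)
print(f"{best_throughput_gbps(data):.1f} GB/s (best of 5)")
```

Comparing the best run against the median or worst run on a busy machine gives a feel for how fragile the headline figure is.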

we think there's potential with a 3D V-Cache and a RAM disk. It's a clever way of making old-school and new technologies gel together. SSDs have made RAM disks obsolete, but maybe massive slabs of 3D V-Cache can revive them.
Nope. Nothing you do on your PC is that I/O-bound, especially not if it fits in such a small amount of space.

RAM disks started to go obsolete even before the SSD era, when operating systems began doing sophisticated caching and read-ahead optimizations. Those are still at play now, but you don't notice them as much because the difference is much smaller vs. reading from storage.

Just think of the possibilities if AMD embraced the idea and put out a fail-safe implementation where consumers can turn the 3D V-Cache into a RAM disk with a flip of a switch.
Absolutely terrible idea! You're saying you want to reserve a huge chunk of your L3 cache for storage? The ratio of memory reads/writes to storage reads/writes is many orders of magnitude higher. This would absolutely tank performance.
 
What would make this even more worthless is if you do manage to fill up the cache with stuff from the RAM disk, you've now evicted everyone else's data and instructions. So performance drops for everyone else because they all have cache misses now.
Nah, it wouldn't stay there. You might fill it up, but then any other thread would come along and evict chunks of your RAM disk when it reads literally anything from RAM.

Is there any use for your comment other than to make yourself sound impressive by being so unimpressed by everything?
Nah, @Pete Mitchell is right. Given that the article's author didn't have such a firm grasp on how L3 cache works, it stands to reason that others in this thread might not, either.

IMO, the article's text is all the proof you need that such comments as Pete's are warranted.

On build servers, I've used RAM disks on occasion because there's usually a very good speed up on huge projects, like LLVM.
No, I don't believe it. I think you're better off just letting the OS use free memory for the page cache.

Shouldn't you be on Reddit or something, what with your idiotic "AkTShuAlly"?
Welcome, but we don't tolerate ad hominem attacks, here. Please make your points on merit and do not belittle or insult other members.
 
cache is transparent to software in most ISAs. MIPS is the only one I'm aware of that allows software to directly manipulate cache.
Cache is never truly invisible, but the amount of control varies. All CPU ISAs provide the ability to flush and invalidate the cache, since that's needed for doing non-coherent memory-mapped I/O. They also tend to provide the ability to configure address windows for restricting cache operation (on x86, see MTRR (Memory Type Range Registers)).

Beyond that, you often see data prefetching instructions, which have obvious implications on the cache.

Over the years, Intel has added further features, like:
  • SSE added non-temporal reads & writes (i.e. memory operations which don't cause cache pollution - in reality, I've observed the way early CPUs implemented this is simply by restricting the pollution to a dedicated cache set)
  • Tremont added CLWB instruction - Force cache line write-back without flush
  • Also, the CLDEMOTE instruction - Cache line demote
  • Direct store instructions: MOVDIRI, MOVDIR64B
  • QoS oriented features, like CAT (Cache Allocation Technology)
The last point is for managing QoS in real-time systems. They describe it like this:
"Cache Allocation Technology (CAT) provides a method to partition processor caches and assign these partitions to a Class-of-Service (COS). Associating workloads to different COS can effectively isolate parts of cache available to a workload, thus preventing cache contention altogether."​

QoS is also relevant for ensuring equal sharing of resources between workloads in a virtualized environment. I know Intel added features for this, but I'm not sure if it relies on CAT or some other technology. Note that CAT is only supported on specific models of Intel CPUs.
 
would be nice to boot an os out of this ???
The way operating systems and modern CPUs work, you sort of already enjoy the same benefits, but without having to devote a chunk of expensive and scarce SRAM to it.

In modern operating systems, memory is divided up into pages. These are most commonly 4 KiB, but larger pages are starting to get more popular - especially on servers. Anyway, when you open a file, the OS can seamlessly map memory pages to blocks of the file, in order to cache its contents. These memory pages (or parts thereof, usually at 64 B granularity) can then get cached by the CPU's L3, just like anything else currently held in DRAM.
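As a rough illustration of that mapping, here is a minimal sketch that memory-maps a throwaway file and reads it through the mapping, so the bytes come out of the OS page cache rather than a separate buffer. The 64 B line size is an assumption, `read_via_mapping` is a made-up helper, and the page size is whatever the platform reports.

```python
import mmap
import os
import tempfile

PAGE_SIZE = mmap.PAGESIZE   # commonly 4096 bytes, but platform-dependent
LINE_SIZE = 64              # assumption: 64 B cache lines


def read_via_mapping(path):
    """Map a file into memory and return its first and last byte."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[0], m[len(m) - 1]


# Throwaway demo file spanning four pages.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"A" + b"\0" * (4 * PAGE_SIZE - 2) + b"Z")
    demo_path = tmp.name

first, last = read_via_mapping(demo_path)
os.remove(demo_path)

print(first, last)             # 65 90 (ASCII codes for "A" and "Z")
print(PAGE_SIZE // LINE_SIZE)  # cache lines per page
```

Only the pages actually touched need to be resident, and only the touched lines of those pages end up competing for L3. The system manages both levels without the program asking for anything special.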

The advantage of this approach is that the OS is tuned (hopefully) to know when it makes sense to hold a file's contents in its page cache, and the CPU's cache hierarchy is tuned to know when it makes sense to hold the contents of a certain address range in various levels of cache.

So... that's a long way of saying that you're better off just letting the system manage what to keep in RAM, when. By making a RAMDISK, you're asserting you know what's best to hold in RAM, and preventing the OS from using that memory as part of a larger pool that it manages dynamically.

If you have I/O problems, and it's not from having too little RAM, then there are parameters you can tweak to adjust things like read-ahead and "swappiness". I haven't gone far down this rabbit hole, but I'm sure you can easily find lots of information on tuning I/O performance, if anyone finds it particularly interesting.
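For anyone curious what those knobs look like from a program, here is a minimal sketch assuming Linux: it reads the current `vm.swappiness` value from `/proc` and hints sequential access on a file with `posix_fadvise`, which the kernel may use to adjust read-ahead. Both helpers are made up for illustration, both fall back quietly on systems without these interfaces, and the hint is advisory only.

```python
import os
import tempfile


def read_swappiness(path="/proc/sys/vm/swappiness"):
    """Return the Linux vm.swappiness value, or None if it is not exposed."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def read_sequentially(path, chunk_size=1 << 20):
    """Read a whole file after hinting sequential access; return bytes read."""
    total = 0
    with open(path, "rb") as f:
        if hasattr(os, "posix_fadvise"):
            try:
                # Advisory only: the kernel may enlarge read-ahead for this file.
                os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_SEQUENTIAL)
            except OSError:
                pass  # some filesystems reject the hint; reading still works
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


# Throwaway 3 MiB demo file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * (3 << 20))
    demo_path = tmp.name

total_read = read_sequentially(demo_path)
os.remove(demo_path)

print(read_swappiness())  # an integer on Linux, None elsewhere
print(total_read)         # 3145728
```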
 
"Who needs Intel's Optane when you have AMD's 3D V-Cache?"

I use the 900P 280 GB, and the system response is way better than with the 2022 PCIe 4.0 drives, which shine and then catch fire after heavy use. The 2017 tech (low grade) is 3x faster on random 4K, and yes, maybe one day I will get another dead tech.

Because who needs Optane 😉
 
So you can install Windows 95 and all its applications entirely in cache.
Yeah, I think Win 95 is about how far you'd have to go back, to find an OS that would fit in such a small amount of space.

The only reason booting a Win 95 VM on a modern CPU (much less an X3D model) might not be instantaneous is if its drivers have built-in delays for trying to probe various devices.
 
Nonetheless, we think there's potential with a 3D V-Cache and a RAM disk. It's a clever way of making old-school and new technologies gel together. SSDs have made RAM disks obsolete, but maybe massive slabs of 3D V-Cache can revive them. Just think of the possibilities if AMD embraced the idea and put out a fail-safe implementation where consumers can turn the 3D V-Cache into a RAM disk with a flip of a switch.

Don't know who this "we" you're talking about is, but Intel's HBM2 equipped Xeons are not only faster than AMD's 3D Cache EPYC, they also feature up to 64GB of HBM2 vs 96MB.

 
I don't follow what you're referring to, with either "we" or why HBM entered into the discussion.

Yes, HBM is faster for certain things. More L3 cache is faster for others. To say one is better than the other is fairly reductive, since it's very workload-dependent. They're not mutually exclusive, either.

BTW, your link cites Intel's first-party benchmarks. A more trustworthy source would be independent benchmarks. These are the only ones I'm aware of:

In those benchmarks, the EPYC 9684X beats the Xeon Max 9480 by a country mile! It's a complete blowout (Geomean 330.30 vs. 143.43 - a 130% margin!), even if we disregard the EPYC's 400 W results.
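For anyone checking the arithmetic, the margin is just the ratio of the two geomeans minus one. A quick sketch, with `margin_percent` as a made-up helper and the two numbers taken from the post above:

```python
def margin_percent(winner, loser):
    """Return how much larger `winner` is than `loser`, in percent."""
    return (winner / loser - 1.0) * 100.0


epyc_geomean = 330.30   # EPYC 9684X, from the post
xeon_geomean = 143.43   # Xeon Max 9480, from the post

print(f"{margin_percent(epyc_geomean, xeon_geomean):.1f}%")  # about 130%
```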
 