News Patriot developing high-speed DDR5 RAM that guarantees speeds up to DDR5-7200 regardless of CPU IMC quality — unlocks higher overclocking potential...

Very informative. I really appreciate the background info. I had seen RCDs mentioned before, but I thought it was a nice touch to relate these CKDs to them.

Also, thanks for mentioning the relation between voltage and timings. It's not surprising to me, but I didn't know, as I hadn't been tracking that.

I hope this sort of thing becomes virtually standard on all but the cheapest DDR5 DIMMs.
 
I'd love to see if this has any real-world impact on overclocking. I'm not aware of any high-frequency XMP/EXPO kits featuring a clock driver outside of RDIMMs, where it's required.

The timings referred to in the article are just JEDEC 6400 timings, so I could see where it could help with frequency while keeping voltage at 1.1 V, since no IMC natively supports 6400. If the clock driver is effectively required to maintain those clocks to meet JEDEC specs, I could see them becoming fairly common once AMD/Intel move up in base CPU memory spec. Of course, at the same time, most DDR4 high-frequency memory used a 2133 baseline instead of 3200, so who knows.
 
Rather than highest speed, memory needs to become smart and automatically do tasks like filling an area of memory with zeros or a constant, or basic bitwise operations like a constant XOR, a mask, vector addition, or other simple operations. Even carryless products or carryless additions would be useful and would save a lot of computing time.
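To sketch the kind of interface I have in mind (everything here is hypothetical; no such device or driver exists):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical "smart memory" command set -- a sketch of the idea only;
 * none of these names exist in any real hardware or driver. */
typedef enum { SM_FILL, SM_XOR_CONST, SM_AND_MASK, SM_ADD_VEC } sm_op_t;

/* Ask the DIMM to apply `op` with `operand` across [addr, addr + len).
 * Stub: a real version would enqueue a command to the memory device. */
static int sm_submit(volatile void *addr, size_t len, sm_op_t op, uint64_t operand)
{
    (void)addr; (void)len; (void)op; (void)operand;
    return 0;
}

/* Usage: zero a buffer, then XOR a constant into it, without the CPU
 * ever streaming the data through its caches. */
void example(volatile uint64_t *buf, size_t n)
{
    sm_submit(buf, n * sizeof *buf, SM_FILL, 0);            /* fill with zeros */
    sm_submit(buf, n * sizeof *buf, SM_XOR_CONST, 0xA5A5);  /* constant XOR */
}
```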
 

Zhiye Liu,

I have enjoyed this article as well as others you have written. Quite simply, I would like to know where this new Patriot memory fits into the "Best RAM for Gaming: DDR4, DDR5 Kits for 2024" article you wrote.
Don't short-circuit and burn the place down trying to figure it out. 😉
 
Rather than highest speed, memory needs to become smart and automatically do tasks like filling an area of memory with zeros or a constant, or basic bitwise operations like a constant XOR, a mask, vector addition, or other simple operations. Even carryless products or carryless additions would be useful and would save a lot of computing time.
I think that wouldn't help much. First, those aren't things CPUs spend much time doing. Zeroing memory is the only one worth taking seriously, but even that is really tweaking at the margins.

Second, the operations can't break cache-coherency, which means work for the CPU, anyhow. You'd have to flush or invalidate all of the cache lines for the address range + bring in the new data from memory, since it would presumably happen when you're about to use it. Together, you've now spent a large fraction of the cost of just having the CPU do it.

Third, if the operation would be truly expensive, then it's going to be asynchronous and that would introduce synchronization overhead.

To provide a significant benefit, you have to look at more complex operations, like convolutions. That's what Samsung and SK Hynix have been working on.
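To illustrate the second point, here's roughly the coherency work the CPU would still have to do over the whole range before a hypothetical in-memory operation could safely run (a sketch, assuming x86 and 64-byte lines; real protocols are more involved):

```c
#include <emmintrin.h>  /* _mm_clflush, _mm_mfence (SSE2, x86) */
#include <stddef.h>
#include <stdint.h>

/* Flush every cache line covering [p, p + len) so an offloaded
 * operation sees current data and the CPU doesn't keep stale copies. */
static void flush_range(const void *p, size_t len)
{
    const uint8_t *b = (const uint8_t *)p;
    for (size_t off = 0; off < len; off += 64)  /* 64 B cache lines */
        _mm_clflush(b + off);
    _mm_mfence();  /* order the flushes before signalling the device */
}
```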
 
I'd love to see if this has any real-world impact on overclocking. I'm not aware of any high-frequency XMP/EXPO kits featuring a clock driver
Did you see this part?

"It isn't a novel idea, nor is Patriot the first to do it. TeamGroup did the same thing with its mainstream Elite and Elite Plus lineups."

Are those kits not XMP/EXPO?
 
First, those aren't things CPUs spend much time doing.
... because they are too expensive.
Second, the operations can't break cache-coherency, which means work for the CPU, anyhow.
Not a problem, because the programmer would know before using it, and would know when to use it.

A use case is to process an array speculatively assuming that only 5 bits are required, but then finding that 6 bits are actually required. The past calculations then become useless, so it is better to assume 8 or 16 bits, even though that wastes space, leaves most bits unused, and is slower to process.
But a smart memory could be commanded to convert the past data to 6 bits (which involves carryless arithmetic), so the CPU could continue processing the rest of the array while the RAM reformats the already-processed part. There is no cache conflict, because the CPU is occupied with a different area of memory.
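Here's the reformatting step I'm describing, as plain C (on a smart DIMM this would run in the memory itself; the function name is made up):

```c
#include <stdint.h>
#include <stddef.h>

/* Widen an array of packed 5-bit values into packed 6-bit values. */
static void repack_5_to_6(const uint8_t *src, uint8_t *dst, size_t count)
{
    uint32_t in_buf = 0, out_buf = 0;
    unsigned in_bits = 0, out_bits = 0;
    size_t si = 0, di = 0;

    for (size_t i = 0; i < count; i++) {
        while (in_bits < 5) {             /* refill the input bit buffer */
            in_buf |= (uint32_t)src[si++] << in_bits;
            in_bits += 8;
        }
        uint32_t v = in_buf & 0x1F;       /* take one 5-bit value */
        in_buf >>= 5;
        in_bits -= 5;

        out_buf |= v << out_bits;         /* store it as a 6-bit field */
        out_bits += 6;
        while (out_bits >= 8) {           /* drain whole output bytes */
            dst[di++] = (uint8_t)out_buf;
            out_buf >>= 8;
            out_bits -= 8;
        }
    }
    if (out_bits > 0)
        dst[di] = (uint8_t)out_buf;       /* flush the remaining bits */
}
```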
 
How are there so many makers of CKD chips, when none of the performance-oriented DIMMs are using them? Seems contradictory, but maybe it's just the next big thing and enough chipmakers figured that out?
I thought it was just part of the spec in more of an "if you need it" capacity, but according to Renesas it's a requirement once you hit 6400. So I'd assume everyone making CKDs is also making RCDs, and they just got a jump on manufacturing since DDR5 base specs have been moving rapidly. Granite Rapids is 6400, which leads me to believe ARL will be as well, which means volume production will already be necessary.
 
... because they are too expensive.
No, I meant they're too cheap & infrequent to tie up a significant amount of CPU time. For something to make sense to embed into DRAM, it has to be a common operation (in some context) and data-intensive (relative to the amount of computation involved). That's how you achieve a worthwhile savings by keeping the data in memory instead of shipping it back & forth between memory and the CPU.

I provided you three links to efforts SK Hynix has made to tightly couple compute with memory. There are similar examples from Samsung. If you're interested in the subject, I'd recommend reading those linked articles as a starting point.

As far as I'm aware, these are the main efforts currently under way to do compute-in-memory that are likely to bear fruit.
 
No, I meant they're too cheap & infrequent to tie up a significant amount of CPU time.
Carryless arithmetic has applications in sorting algorithms, data compression, hash tables, and remapping arrays.
Those are basic tasks that computers do 24/7, and present hardware doesn't support them (not even the ALU).

You fail to tell the difference between "is not used because it is not supported" and "is not used because it is not needed".
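To be concrete, a carryless product is the ordinary shift-and-add multiply with the carries dropped, i.e. addition replaced by XOR. In plain C:

```c
#include <stdint.h>

/* Carryless (GF(2) polynomial) multiplication: the shift-and-add
 * multiply algorithm with XOR in place of addition, so no carries
 * propagate between bit positions. Illustration only. */
static uint64_t clmul32(uint32_t a, uint32_t b)
{
    uint64_t acc = 0;
    for (unsigned i = 0; i < 32; i++) {
        if ((b >> i) & 1)
            acc ^= (uint64_t)a << i;   /* XOR instead of add */
    }
    return acc;
}
```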
 
Carryless arithmetic has applications in sorting algorithms, data compression, hash tables, and remapping arrays.
The type of arithmetic you're describing is cheap for CPU cores to do. Data movement and latency are expensive. Make sure you're solving the right problem.

As technology evolves, the costs & bottlenecks in the system shift around. What seemed like a good idea 30 years ago might no longer make sense. Conversely, some things which seemed trivial in the past are now becoming increasingly expensive.

Again, you don't have to take my word for it. Look at what the memory makers themselves are working on.
 
Carryless arithmetic has applications in sorting algorithms, data compression, hash tables, and remapping arrays.
Those are basic tasks that computers do 24/7, and present hardware doesn't support them (not even the ALU).

You fail to tell the difference between "is not used because it is not supported" and "is not used because it is not needed".
If you move a workload from the CPU to the memory modules, it has to be a task that makes sense to run asynchronously – if your CPU sits waiting for the workload to complete, you gain nothing. Doing async stuff like this has overhead and complexity, and your examples make no sense for our current hardware and operating system designs.

From regular application code, even memory copying / zeroing (which sounds simple to implement, and is something all applications deal with) doesn't make sense to offload. It probably doesn't even make sense for your OS's zero-deallocated-pages scrubber – that sounds like a good fit for the async use case, but the individual pages are *tiny* and the implementations are highly optimized.
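For scale, the entire job for one page is essentially the following, and libc/kernel implementations already reduce it to a handful of wide stores (4096 is the typical x86 page size):

```c
#include <string.h>

#define PAGE_SIZE 4096  /* typical x86 page */

/* Zero one freed page. ~4 KiB is far too little work to justify
 * offloading it to the memory module and synchronizing afterwards. */
static void zero_page(void *page)
{
    memset(page, 0, PAGE_SIZE);
}
```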

Go read the articles @bit_user linked to :)
 
From regular application code, even memory copying / zeroing (which sounds simple to implement, and is something all applications deal with) doesn't make sense to offload. It probably doesn't even make sense for your OS's zero-deallocated-pages scrubber – that sounds like a good fit for the async use case, but the individual pages are *tiny* and the implementations are highly optimized.
There's an interesting footnote to the memory-zeroing case, which is that Apple has apparently built in some special-case handling for optimizing the transfer of zero-blocks across its SoC interconnects. However, that's presumably not only for writing runs of zeros, but also reading them?

Another whole aspect we haven't even touched on is memory encryption. Compute-in-memory would seem to be incompatible with it, unless you build encryption hardware into the actual memory chips and send a key with these operations, which sounds expensive and could have implications for the security of the system (i.e. if someone can sniff those keys being sent off-chip, they should then be able to decrypt the data the CPU is writing out).

On the other hand, if you stack the memory dies with the processing elements, as Nvidia seems to be working on, then it should be almost as hard to sniff encryption keys between the compute & memory dies as it is to sniff them within existing compute dies.
 
I would like to know where this new Patriot memory fits into the "Best RAM for Gaming: DDR4, DDR5 Kits for 2024" article.
In time, yes. Right now, they only have an "engineering preview". I'm not sure how long it takes to go from that to a launched product available for reviewers & purchase, but expect to wait several months.
 