News AMD RX 7600 Has Better Cache Latency Compared to RX 7900 XTX

InvalidError

Titan
Moderator
If you want to keep stuff as fast as possible, you keep it on-die as close as possible, as much as possible. Shouldn't surprise anybody.

I wonder if anybody will manage to solve the challenge of consistently scalable multi-die GPUs. I wouldn't be surprised if the SM-to-SM bandwidth needed across the chip-to-chip interface for seamless integration turns out to be too steep for that to ever work well without software implementing some variation of explicit multi-GPU to minimize die-to-die traffic, much like how OS schedulers and programs need to be CCD-aware to avoid getting bogged down by CCD-to-CCD latency.
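
Just to illustrate what "CCD-aware" means in practice, here's a minimal Python sketch of pinning a process to a single CCD so its threads never pay the cross-CCD penalty. The core numbering is a made-up example; the real CPU-to-CCD mapping depends on the SKU and SMT layout and should be read from the system rather than hard-coded.

    import os

    # Hypothetical topology: logical CPUs 0-7 live on CCD0, 8-15 on CCD1.
    # Real mappings vary by SKU and SMT config; query /sys/devices/system/cpu
    # (or hwloc) on an actual system instead of hard-coding them.
    CCD0 = set(range(0, 8))

    # Restrict the current process (pid 0 = self) to CCD0 so its threads
    # never bounce across the die-to-die link.
    os.sched_setaffinity(0, CCD0)

    print("Now restricted to CPUs:", sorted(os.sched_getaffinity(0)))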
 

bit_user

Polypheme
Ambassador
the RX 7950 XTX takes up to RX 7600 enjoys a massive 58% longer to retrieve data from its Infinity Cache
Uh... not only does that not parse, but I strongly reject the ironic use of "enjoys". Next thing you know, people are using it in a negative sense, without a hint of irony, and the word "enjoys" loses all connotations of pleasure.

Kind of like how a couple of the authors at WCCFTech use the phrase "sips power", when talking about something that actually guzzles it. With no hint of irony, they're depriving the word "sips" of any sense of rate or quantity.

Please have some pride, as a writer. Use the language, don't abuse it.
 
  • Like
Reactions: Makaveli

bit_user

Polypheme
Ambassador
If you want to keep stuff as fast as possible, you keep it on-die as close as possible, as much as possible. Shouldn't surprise anybody.
IIRC, the RX 7600 review talked about L2 cache. I'm glad Chips & Cheese investigated the matter, because I was curious just how much difference it made vs. the RX 7900's L3 cache.

I hope AMD does a version of the RX 7900 with 3D V-cache. That would seem to help justify their use of chiplets, as well as stepping down from the RX 6900's 128 MB of cache to a mere 96 MB.

I wonder if anybody will manage to solve the challenge of consistently scalable multi-die GPUs.
Apple's M1 Ultra.
 

bit_user

Polypheme
Ambassador
So basically the 7900xtx sucks
Nobody said it "sucks". Chips & Cheese specializes in architectural analysis, largely fueled by micro-benchmarking. They estimated the latency of the chiplet approach as worse, but they did not say how much impact it had on final performance!

GPUs are designed to be pretty good at latency-hiding. So, the impact of the additional latency could be fairly minor.
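
As a rough first-order illustration (Little's law: in-flight data ≈ bandwidth × latency), here's a toy calculation. The latency figures are made up for the example; only the ~58% delta comes from the article.

    # Toy Little's-law estimate of how much more work must stay in flight
    # to hide extra cache latency. Latencies are illustrative, not measured.
    bandwidth_gbs = 960       # RX 7900 XTX memory bandwidth ballpark, GB/s
    latency_base_ns = 100     # hypothetical on-die cache hit latency (RX 7600-like)
    latency_chiplet_ns = 158  # ~58% longer when the cache sits on an MCD (per the article)

    def inflight_bytes(bw_gbs, lat_ns):
        # GB/s * ns conveniently works out to bytes
        return bw_gbs * lat_ns

    print(inflight_bytes(bandwidth_gbs, latency_base_ns))     # 96,000 bytes
    print(inflight_bytes(bandwidth_gbs, latency_chiplet_ns))  # 151,680 bytes
    # As long as the SIMDs keep enough wavefronts resident to cover the larger
    # figure, throughput is unchanged and the extra latency is mostly hidden.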
 
Uh... not only does that not parse, but I strongly reject the ironic use of "enjoys". Next thing you know, people are using it in a negative sense, without a hint of irony, and the word "enjoys" loses all connotations of pleasure.

Kind of like how a couple of the authors at WCCFTech use the phrase "sips power", when talking about something that actually guzzles it. With no hint of irony, they're depriving the word "sips" of any sense of rate or quantity.

Please have some pride, as a writer. Use the language, don't abuse it.
Sorry, it was a bad edit on my part. I removed the "RX 7600 enjoys a massive" bit, because even if latency is theoretically that much worse, you can build around that. Ultimately, the proof of the pudding is in the eating, and the RX 7900 XTX is much faster. But I do think it could have been faster still had it not gone the GPU chiplet route.
 
  • Like
Reactions: bit_user
Cache latency aside, RDNA 3's decoupled frontend and shader clocks don't actually benefit the RX 7600 die here.

Since the RX 7600's smaller shader array can be fed easily even at lower shader clocks, both clocks run at the same frequency. As a result, the RX 7600 doesn't get to save power by clocking down the shader array when the frontend is the bottleneck.
 

InvalidError

Titan
Moderator
But that would cost more to manufacture.
Maybe, maybe not. Depends on how much performance got left on the table by going with cache/memory chiplets.

If Nvidia dropped all of the features it has that AMD has no equivalent for and downscaled everything else to match AMD, it would likely hit similar performance in a smaller total footprint, without the added costs of multi-chip packaging, and not necessarily be much more expensive for a given performance level.
 

bit_user

Polypheme
Ambassador
Anyone can make a multi-die GPU. All of the challenge is in making one that scales well.
But, you haven't shown it doesn't scale well. All you showed is that its absolute performance wasn't Earth-shattering, which we knew. Neither is that of gaming consoles, but few seem to criticize them for being APUs.

I found one data point, which is the GPU Compute benchmark from Geekbench 5.


Here, they quote a 52.5% speedup, which is not particularly good. However, both benchmarks used a Mac Studio and I don't know if the Ultra version was capable of providing 2x the power or dissipating 2x the heat. As the Mac Studio is an SFF machine of diminutive stature (7.7" x 7.7" x 3.7"), with no air vents on the sides or top, I rather doubt it. For a fair scaling metric, you'd really want to ensure that power & cooling were scaled up correspondingly.
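
For reference, the arithmetic behind calling that "not particularly good" — this just treats the Ultra as exactly two Max-sized GPUs, which is itself an assumption:

    # Scaling efficiency implied by the quoted Geekbench 5 GPU Compute speedup.
    # Assumes the M1 Ultra GPU is exactly 2x the M1 Max GPU, which is a
    # simplification (the power/cooling caveats above still apply).
    speedup = 1.525   # Ultra vs. Max, per the quoted 52.5% gain
    dies = 2
    print(f"{speedup / dies:.1%}")   # ~76.2% of perfect 2x scaling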
 

InvalidError

Titan
Moderator
But, you haven't shown it doesn't scale well.
The numbers I have seen show that it doesn't scale consistently (well): the performance gap against the RTX 3090 in the benchmarks I've seen spans more than a 10X range, going from slightly faster in some productivity workloads (which confirms that it does have the power to do it) to well under half as fast.

Of course, the inconsistency could be down to drivers not agreeing with the specific benchmarks and games the M1 struggles on or the game/benchmark not being optimized for the M1, much like Intel's ARC GPUs.
 

bit_user

Polypheme
Ambassador
the inconsistency could be down to drivers not agreeing with the specific benchmarks and games the M1 struggles on or the game/benchmark not being optimized for the M1, much like Intel's ARC GPUs.
Yes, you really don't know what's behind that variation. It could have a lot more to do with something that's well-optimized on Nvidia/Direct3D vs. poorly-optimized on Metal or vice versa. Comparing with a completely different product, using completely different APIs, isn't a good way to measure scaling. Worse even than comparing Intel Alchemist vs. Nvidia.
 
Yes, you really don't know what's behind that variation. It could have a lot more to do with something that's well-optimized on Nvidia/Direct3D vs. poorly-optimized on Metal or vice versa. Comparing with a completely different product, using completely different APIs, isn't a good way to measure scaling. Worse even than comparing Intel Alchemist vs. Nvidia.
Yeah, fundamentally, you would need to know the workload and what it's doing, and then analyze how it scales. It could be that some tests are more latency sensitive, and I can pretty much guarantee that when the first chip in an Apple dual-chip setup has to access data from the other chip, it will incur a big latency penalty. Then the question is whether Apple (or developers) could code around that in some fashion to make the hit less painful.

There are going to be plenty of workloads that use GPUs where bandwidth is less of a factor, and I suspect that's where Apple sees the best scaling. Games on the other hand rarely tend to be in that category. Some need more bandwidth, but all games (at settings that push the GPU to ~99% load) will usually be at least moderately bandwidth limited.

I think AMD's GPU chiplet approach isn't a bad idea; it's just version 1.0. They'll have learned a lot about how things scale (or don't scale), and then RDNA 4 will improve things. Maybe AMD just needs another level of cache? Put a 16MB or 32MB L3 on the GCD, then have all the MCDs function as an L4 cache. That would further reduce the latency hit for the chiplets, and when the GCD moves to, say, TSMC N3 or N2, having some extra cache might not be that terrible a use of die space. Alternatively, just make the L2 cache even bigger than it is in RDNA 3.
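
To put rough numbers on that idea, here's a toy average-memory-access-time comparison. Every hit rate and latency below is an invented placeholder chosen only to show the shape of the trade-off; none of them are RDNA measurements.

    # Toy AMAT model: each level's latency is the full load-to-use latency when a
    # request hits there; hit rates are local (of the traffic reaching that level).
    # All numbers are hypothetical placeholders, not measured RDNA figures.
    DRAM_NS = 300.0  # guessed GDDR6 latency for requests that miss everywhere

    def amat(levels):
        total, reach = 0.0, 1.0
        for hit_rate, latency_ns in levels:
            total += reach * hit_rate * latency_ns
            reach *= (1.0 - hit_rate)
        return total + reach * DRAM_NS

    current = [(0.60, 30), (0.55, 100)]              # L2 on GCD, Infinity Cache on MCDs
    with_l3 = [(0.60, 30), (0.35, 55), (0.45, 100)]  # + small on-GCD L3 ahead of the MCDs

    print(f"current cache setup: {amat(current):.0f} ns")   # ~94 ns
    print(f"with on-GCD L3:      {amat(with_l3):.0f} ns")   # ~80 ns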
 

InvalidError

Titan
Moderator
Yes, you really don't know what's behind that variation. It could have a lot more to do with something that's well-optimized on Nvidia/Direct3D vs. poorly-optimized on Metal or vice versa.
Don't forget x86 vs. ARM, Windows vs. macOS. Lots of stuff can go horribly wrong there, well beyond the graphics API swap.

As flawed as the results may be, they are still what we've got. Much like the Steam survey: not perfect, but the closest thing to transparent numbers with a meaningful sample size that we've got.
 

bit_user

Polypheme
Ambassador
As flawed as the results may be, they are still what we've got.
With so many variables, I'm not sure how you can even extract anything meaningful from that data. I put way more stock in the Geekbench GPU Compute results I found, in spite of my concerns about power & cooling.

If I were more invested in the matter, I'd dig around to see who in the Mac community has done scaling analysis because the M1 has been such an intense area of interest that people surely must've.

My only point was that Apple brought a multi-die GPU product to market. They can claim that "first". At the time (maybe still?), it also had the highest chip-to-chip bandwidth ever achieved.
 
  • Like
Reactions: helper800
Maybe, maybe not. Depends on how much performance got left on the table by going with cache/memory chiplets.

If Nvidia dropped all of the features it has that AMD has no equivalent for and downscaled everything else to match AMD, it would likely hit similar performance in a smaller total footprint, without the added costs of multi-chip packaging, and not necessarily be much more expensive for a given performance level.
That is a gigantic assumption. I would say it is so far-fetched an inference that it's not worth stating...
 
  • Like
Reactions: bit_user