News AMD RX 7600 Has Better Cache Latency Compared to RX 7900 XTX

InvalidError

Titan
Moderator
If you want to keep stuff as fast as possible, you keep it on-die as close as possible, as much as possible. Shouldn't surprise anybody.

I wonder if anybody will manage to solve the challenge of consistently scalable multi-die GPUs. I wouldn't be surprised if the SM-to-SM bandwidth required across the chip-to-chip interface for seamless integration turns out to be too steep for that to ever work well without software implementing some variation of explicit multi-GPU to minimize die-to-die traffic, much like how OS schedulers and programs need to be CCD-aware to avoid getting bogged down by CCD-to-CCD latency.
 

bit_user

Titan
Ambassador
the RX 7950 XTX takes up to RX 7600 enjoys a massive 58% longer to retrieve data from its Infinity Cache
Uh... not only does that not parse, but I strongly reject the ironic use of "enjoys". Next thing you know, people are using it in a negative sense, without a hint of irony, and the word "enjoys" loses all connotations of pleasure.

Kind of like how a couple of the authors at WCCFTech use the phrase "sips power", when talking about something that actually guzzles it. With no hint of irony, they're depriving the word "sips" of any sense of rate or quantity.

Please have some pride, as a writer. Use the language, don't abuse it.
 
  • Like
Reactions: Makaveli

bit_user

Titan
Ambassador
If you want to keep stuff as fast as possible, you keep it on-die as close as possible, as much as possible. Shouldn't surprise anybody.
IIRC, the RX 7600 review talked about L2 cache. I'm glad Chips & Cheese investigated the matter, because I was curious just how much difference it made vs. the RX 7900's L3 cache.

I hope AMD does a version of the RX 7900 with 3D V-cache. That would seem to help justify their use of chiplets, as well as the step down from the RX 6900's 128 MB of cache to a mere 96 MB.

I wonder if anybody will manage to solve the challenge of consistently scalable multi-die GPUs.
Apple's M1 Ultra.
 

bit_user

Titan
Ambassador
So basically the 7900xtx sucks
Nobody said it "sucks". Chips & Cheese specializes in architectural analysis, largely fueled by micro-benchmarking. They estimated the latency of the chiplet approach as worse, but they did not say how much impact it had on final performance!

GPUs are designed to be pretty good at latency-hiding. So, the impact of the additional latency could be fairly minor.
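To put rough numbers on the latency-hiding point, here's a back-of-envelope Little's law sketch. The bandwidth and latency figures are purely illustrative placeholders, not measured values:

```python
# Little's law: bytes that must be in flight = bandwidth * latency.
# If a GPU can keep that many requests outstanding (via lots of resident
# wavefronts), extra cache latency mostly disappears from the throughput picture.
# All numbers below are made-up placeholders, not measured values.

def bytes_in_flight(bandwidth_gbs: float, latency_ns: float) -> float:
    """Bytes that must be outstanding to keep the memory pipe full."""
    return bandwidth_gbs * latency_ns  # GB/s * ns conveniently works out to bytes

monolithic = bytes_in_flight(bandwidth_gbs=960, latency_ns=120)  # hypothetical on-die cache
chiplet = bytes_in_flight(bandwidth_gbs=960, latency_ns=190)     # hypothetical cross-die cache

print(f"outstanding bytes needed: {monolithic:,.0f} -> {chiplet:,.0f}")
# As long as the shader array can cover the larger number with in-flight work,
# the added latency costs little in delivered bandwidth (and thus frame rate).
```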
 
Uh... not only does that not parse, but I strongly reject the ironic use of "enjoys". Next thing you know, people are using it in a negative sense, without a hint of irony, and the word "enjoys" loses all connotations of pleasure.

Kind of like how a couple of the authors at WCCFTech use the phrase "sips power", when talking about something that actually guzzles it. With no hint of irony, they're depriving the word "sips" of any sense of rate or quantity.

Please have some pride, as a writer. Use the language, don't abuse it.
Sorry, it was a bad edit on my part. I removed the "RX 7600 enjoys a massive" bit, because even if latency is theoretically that much worse, you can build around that. Ultimately, the proof is in the eating of the pudding, and the RX 7900 XTX is much faster. But I do think it could have been faster still had it not gone the GPU chiplet route.
 
  • Like
Reactions: bit_user

Deleted member 2731765

Guest
Cache latency aside, RDNA 3's decoupled frontend and shader clocks don't actually benefit the RX 7600 die here.

Since the RX 7600's smaller shader array can easily be kept fed, both clock domains run at the same frequency. By the same token, the RX 7600 doesn't save power by clocking down the shader array when there is a bottleneck in the frontend.
 

InvalidError

Titan
Moderator
But that would cost more to manufacture.
Maybe, maybe not. Depends on how much performance got left on the table by going with cache/memory chiplets.

If Nvidia dropped all of the features it has that AMD has no equivalent for and downscaled everything else to match AMD, Nvidia would likely hit similar performance in a smaller total footprint, without the added costs related to multi-chip packages, and not necessarily be much more expensive for a given performance level.
 

bit_user

Titan
Ambassador
Anyone can make a multi-die GPU. All of the challenge is in making one that scales well.
But, you haven't shown it doesn't scale well. All you showed is that its absolute performance wasn't Earth-shattering, which we knew. Neither is that of gaming consoles, but few seem to criticize them for being APUs.

I found one data point, which is the GPU Compute benchmark from Geekbench 5.


Here, they quote a 52.5% speedup, which is not particularly good. However, both benchmarks used a Mac Studio, and I don't know if the Ultra version was capable of providing 2x the power or dissipating 2x the heat. As the Mac Studio is an SFF machine of diminutive stature (7.7" x 7.7" x 3.7"), with no air vents on the sides or top, I rather doubt it. For a fair scaling metric, you'd really want to ensure that power & cooling were scaled up correspondingly.
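For what it's worth, that quoted figure works out to roughly 76% scaling efficiency for a doubled GPU. A quick sanity check, using only the number cited above:

```python
# Scaling efficiency implied by the quoted Geekbench 5 GPU Compute result.
speedup = 1.525  # a "52.5% speedup" means 1.525x the single-die score
dies = 2         # the M1 Ultra is two M1 Max dies joined by UltraFusion

efficiency = speedup / dies
print(f"scaling efficiency: {efficiency:.1%}")  # roughly 76%
```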
 

InvalidError

Titan
Moderator
But, you haven't shown it doesn't scale well.
The numbers I have seen show that it doesn't scale consistently (well): the performance gap against the RTX 3090 in the benchmarks I've seen has a greater than 10X spread, going from slightly faster in some productivity workloads (which confirms that it does have the power to do it) to well under half as fast.

Of course, the inconsistency could be down to drivers not agreeing with the specific benchmarks and games the M1 struggles on or the game/benchmark not being optimized for the M1, much like Intel's ARC GPUs.
 

bit_user

Titan
Ambassador
the inconsistency could be down to drivers not agreeing with the specific benchmarks and games the M1 struggles on or the game/benchmark not being optimized for the M1, much like Intel's ARC GPUs.
Yes, you really don't know what's behind that variation. It could have a lot more to do with something that's well-optimized on Nvidia/Direct3D vs. poorly-optimized on Metal or vice versa. Comparing with a completely different product, using completely different APIs, isn't a good way to measure scaling. Worse even than comparing Intel Alchemist vs. Nvidia.
 
Yes, you really don't know what's behind that variation. It could have a lot more to do with something that's well-optimized on Nvidia/Direct3D vs. poorly-optimized on Metal or vice versa. Comparing with a completely different product, using completely different APIs, isn't a good way to measure scaling. Worse even than comparing Intel Alchemist vs. Nvidia.
Yeah, fundamentally, you would need to know the workload and what it's doing, and then analyze how it scales. It could be that some tests are more latency sensitive, and I can pretty much guarantee that when the first chip in an Apple dual-chip setup has to access data from the other chip, it will incur a big latency penalty. Then the question is whether Apple (or developers) could code around that in some fashion to make the hit less painful.

There are going to be plenty of workloads that use GPUs where bandwidth is less of a factor, and I suspect that's where Apple sees the best scaling. Games, on the other hand, rarely fall into that category. Some need more bandwidth than others, but all games (at settings that push the GPU to ~99% load) will usually be at least moderately bandwidth limited.

I think AMD's GPU chiplet approach isn't a bad idea; it's just version 1.0. They'll have learned a lot about how things scale (or don't scale), and then RDNA 4 will improve things. Maybe AMD just needs another level of cache? Put a 16MB or 32MB L3 on the GCD, then have all the MCDs function as an L4 cache. That would further reduce the latency hit for the chiplets, and when the GCD is moved to, say, TSMC N3 or N2, having some extra cache might not be that terrible a use of die space. Or alternatively, just make the L2 cache even bigger than it is with RDNA 3.
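To illustrate why an extra on-GCD cache level could soften the latency hit, here's a crude average-memory-access-time sketch. The hit fractions and latencies are invented for the sake of the example, not real RDNA 3 figures:

```python
# Crude AMAT (average memory access time) model for the hypothetical
# "L3 on the GCD, MCDs as L4" idea. All fractions and latencies are invented
# placeholders, purely to show the shape of the math.

def amat(breakdown):
    """breakdown: list of (fraction_of_accesses, latency_ns); fractions sum to 1.
    Simplification: ignores the cost of probing the levels that miss."""
    return sum(frac * lat for frac, lat in breakdown)

# Hypothetical today: L2 on the GCD, Infinity Cache across the MCDs, then GDDR6.
current = amat([(0.60, 30), (0.25, 190), (0.15, 300)])

# Hypothetical extra on-GCD L3 absorbing part of the cross-die traffic.
with_l3 = amat([(0.60, 30), (0.15, 60), (0.10, 190), (0.15, 300)])

print(f"AMAT today: {current:.1f} ns, with an extra on-GCD L3: {with_l3:.1f} ns")
```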
 

InvalidError

Titan
Moderator
Yes, you really don't know what's behind that variation. It could have a lot more to do with something that's well-optimized on Nvidia/Direct3D vs. poorly-optimized on Metal or vice versa.
Don't forget x86 vs ARM, Windows vs MacOS. Lots of stuff can go horribly wrong well beyond the graphics API swap there.

As flawed as the results may be, they are still what we've got. Much like the Steam survey: not perfect, but the closest thing to transparent numbers with a meaningful sample size that we've got.
 

bit_user

Titan
Ambassador
As flawed as the results may be, they are still what we've got.
With so many variables, I'm not sure how you can even extract anything meaningful from that data. I put way more stock in the Geekbench GPU Compute results I found, in spite of my concerns about power & cooling.

If I were more invested in the matter, I'd dig around to see who in the Mac community has done scaling analysis because the M1 has been such an intense area of interest that people surely must've.

My only point was that Apple brought a multi-die GPU product to market. They can claim that "first". At the time (maybe still?), it also had the highest chip-to-chip bandwidth ever achieved.
 
  • Like
Reactions: helper800
Maybe, maybe not. Depends on how much performance got left on the table by going with cache/memory chiplets.

If Nvidia dropped all of the features it has that AMD has no equivalent for and downscaled everything else to match AMD, Nvidia would likely hit similar performance in a smaller total footprint, without the added costs related to multi-chip packages, and not necessarily be much more expensive for a given performance level.
That is a gigantic assumption. I would say it's such a stretch of an inference that it's not worth stating...
 
  • Like
Reactions: bit_user