Yes, you really don't know what's behind that variation. It could have a lot more to do with something that's well-optimized on Nvidia/Direct3D versus poorly-optimized on Metal, or vice versa. Comparing against a completely different product that uses completely different APIs isn't a good way to measure scaling; it's even worse than comparing Intel Alchemist vs. Nvidia.
Yeah, fundamentally, you would need to know the workload and what it's doing, and then analyze how it scales. It could be that some tests are more latency-sensitive, and I can pretty much guarantee that when the first chip in an Apple dual-chip setup has to access data from the other chip, it will incur a big latency penalty. Then the question is whether Apple (or developers) could code around that in some fashion to make the hit less painful.
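To make "code around it" a bit more concrete, here's a rough sketch of the kind of data partitioning I mean. It assumes a hypothetical setup where each GPU die shows up as its own Metal device (the Ultra-style chips actually present as a single GPU, so today this pattern really applies to something like a multi-GPU Mac Pro), so treat it as illustrative only: keep each die working on memory it owns, and it never has to reach across the interconnect.

```swift
import Metal

// Hypothetical sketch: assumes each GPU die is exposed as its own MTLDevice
// (an Ultra-style part actually shows up as one device; today this pattern
// really applies to something like a multi-GPU Mac Pro). The idea is simply
// to split the working set so each die only ever touches memory it "owns".
let devices = MTLCopyAllDevices()              // all visible GPU devices
let elementCount = 1_000_000                   // made-up workload size
let chunk = elementCount / max(devices.count, 1)

for (index, device) in devices.enumerated() {
    guard let queue = device.makeCommandQueue(),
          // Allocate this slice on the device that will process it, so the
          // kernel never has to reach across the interconnect for its data.
          let localBuffer = device.makeBuffer(length: chunk * MemoryLayout<Float>.stride,
                                              options: .storageModePrivate),
          let commandBuffer = queue.makeCommandBuffer() else { continue }

    // ... encode a compute pass here that reads and writes only localBuffer ...
    // (kernel omitted; the point is the data partitioning, not the math)

    commandBuffer.commit()
    print("Dispatched slice \(index): \(localBuffer.length) bytes on \(device.name)")
}
```

The hard part isn't the dispatch, it's splitting the problem so the slices genuinely don't need each other's data mid-frame; whether Apple could do something equivalent transparently in the driver or scheduler is the open question.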
There are going to be plenty of GPU workloads where bandwidth is less of a factor, and I suspect that's where Apple sees the best scaling. Games, on the other hand, rarely fall into that category. Some need more bandwidth than others, but virtually all games (at settings that push the GPU to ~99% load) will be at least moderately bandwidth limited.
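As a quick back-of-the-envelope illustration (the card spec and frame numbers here are just assumptions, roughly a 384-bit, 20 Gbps configuration pushing 4K at 120 fps), it's worth looking at how many bytes of DRAM traffic per pixel per frame you actually get to play with:

```swift
import Foundation

// Illustrative only: a 960 GB/s card (roughly a 384-bit, 20 Gbps GDDR6 config)
// rendering 4K at 120 fps. How many bytes of DRAM traffic does each pixel get?
let bandwidthBytesPerSecond = 960e9
let pixelsPerFrame = 3840.0 * 2160.0
let framesPerSecond = 120.0

let bytesPerPixelPerFrame = bandwidthBytesPerSecond / (pixelsPerFrame * framesPerSecond)
print(String(format: "%.0f bytes of DRAM traffic per pixel per frame", bytesPerPixelPerFrame))
// Roughly 965 bytes. That sounds generous, but fat G-buffers, shadow maps,
// texture sampling, and post-processing chains chew through it surprisingly
// fast, which is why big on-package caches (Infinity Cache, Apple's SLC) matter.
```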
I think AMD's GPU chiplet approach isn't a bad idea; it's just version 1.0. AMD will have learned a lot about how things scale (or don't), and RDNA 4 should improve on that. Maybe AMD just needs another level of cache? Put a 16MB or 32MB L3 on the GCD, then have all the MCDs function as an L4 cache. That would further reduce the latency hit from the chiplets, and once the GCD moves to, say, TSMC N3 or N2, spending some extra die area on cache might not be a terrible trade. Alternatively, just make the L2 cache even bigger than it is in RDNA 3.
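Here's a toy average-latency model of that idea. Every hit rate and latency below is invented, so it's only meant to show the shape of the argument: any access that hits in a cache on the GCD is a trip across the chiplet interconnect (or out to GDDR6) that never happens.

```swift
import Foundation

// Toy average-latency model: all hit rates and latencies below are invented.
struct Level {
    let hitRate: Double   // fraction of remaining accesses that hit here
    let latency: Double   // access latency in ns (hypothetical)
}

// Levels are checked in order; anything that misses every level goes to DRAM.
func averageLatency(_ levels: [Level], dramLatency: Double) -> Double {
    var missProbability = 1.0
    var total = 0.0
    for level in levels {
        total += missProbability * level.hitRate * level.latency
        missProbability *= (1.0 - level.hitRate)
    }
    return total + missProbability * dramLatency
}

// RDNA-3-style today: L2 on the GCD, Infinity Cache out on the MCDs.
let today = averageLatency([Level(hitRate: 0.55, latency: 25),    // on-die L2
                            Level(hitRate: 0.60, latency: 90)],   // MCD cache, cross-chiplet
                           dramLatency: 250)

// Hypothetical RDNA 4: extra L3 on the GCD, MCDs demoted to an L4.
let withL3 = averageLatency([Level(hitRate: 0.55, latency: 25),   // on-die L2
                             Level(hitRate: 0.40, latency: 45),   // new on-GCD L3
                             Level(hitRate: 0.50, latency: 90)],  // MCD cache, now L4
                            dramLatency: 250)

print(String(format: "average access latency today: %.0f ns, with on-GCD L3: %.0f ns",
             today, withL3))
```

With these made-up numbers the on-GCD L3 knocks a noticeable chunk off the average access latency; whether the real-world gain justifies the die area is the part only AMD can answer.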