It's a bit more complicated than that. Remember, cache serves as a buffer between the CPU and the memory subsystem. L1 is the fastest and is what the CPU executes from; if what it's looking for isn't there, it looks in the slower L2, then the much slower L3, and finally suffers the catastrophe of going out to system memory while everything waits. This makes things like branch prediction and prefetching more important, and historically Intel has had the better algorithms.
Even the best branch prediction and prefetching won't help in a memory-intensive workload, where there's simply not enough bandwidth to keep the cores fed. Assuming at least some of the data has a significant amount of reuse, increasing cache sizes would also help such cases.
Also, I'm not sure how far back you're going when you say Intel is historically better at branch prediction. That was one of the areas where Zen 1 saw the greatest improvement. Overall, I don't have a sense of how big a gap exists between Zen 4 and Golden Cove, but the sense I get is that they're at least in roughly the same ballpark. Here are the two comparative plots in Chips & Cheese's analysis which give that impression:
Since today's games have such large datasets,
You mean like the amount of assets, which range well into the hundreds of GB? Whether you have 32 or 96 MB of L3 cache would not seem to make a dent in that. Furthermore, if the game assets are the main bottleneck, that seems like data with a low reuse rate, as the GPU is really what's interested in those. Where L3 differences should have the greatest impact is in the realm of thread-to-thread communication, by avoiding the data needing to do a round-trip through DRAM and with things like collision detection, which do MxN interactions.
To get a sense of just how much better it is to keep thread-to-thread communication in L3 cache, consider the 7950X line in this plot of aggregate bandwidth: a dataset of 64 MiB (remember, this is two regular CCDs, each with 32 MiB) sees about 1.4 TB/s of bandwidth, but once you go above that, you hit DRAM speeds of roughly 1/20th that.
AMD just brute-forced it by making a large L3 cache and cramming in as much as possible. That is why those X3D CPUs perform so damn well in games, but need special attention to prevent games from running on the wrong CCD.
The wrong CCD is a two-fold issue for CPUs like the 7900X3D and 7950X3D. Not only do those threads not gain the benefit of the extra L3, but they also have a slower communication path to the cores on the other CCD. This shows the penalty for going between CCDs.
As with any latency benchmark, keep in mind that it only represents a best-case scenario: an otherwise idle CPU, with queues not running deep with other requests.
BTW, Chips & Cheese actually did an analysis of the 7950X3D, which included a couple of games:
Compute performance has been held back by memory performance for decades, with DRAM performance falling even further behind with every year.
They also went on to do some detailed analysis of gaming workloads on Zen 4. Here are a few interesting tidbits:
"Both gaming workloads are overwhelmingly frontend bound. They’re significantly backend bound as well, and lose further throughput from bad speculation. Useful work occupies a relatively minor proportion of available pipeline slots, explaining the low IPC."
"AMD has invested heavily in making a very capable branch predictor. It does achieve very high accuracy, but we still see 4-5 mispredicts per 1000 instructions. That results in 13-15% of core throughput getting lost due to going down the wrong path. Again, we have a problem because these games simply have a ton of branches. Even with over 97% accuracy, you’re going to run into mispredicts fairly often if there’s a branch every four to five instructions."
"Memory loads are the biggest culprit (for backend stalls). Adding more execution units or improving instruction execution latency wouldn’t do much, because the problem is feeding those execution units in the first place. Out of order execution can hide memory latency to some extent by moving ahead of a stalled instruction. ... AMD could address this by adding more entries to that scheduling queue. But as we all know, coping with latency is just one way to go about things. Better caching means you don’t have to cope as hard"
"Zen 4’s backend is primarily memory bound in both games. That brings up the question of whether it’s latency or bandwidth bound. ... Zen 4 is often waiting for data from L2, L3, or memory, but rarely had more than four such outstanding requests ... indicating that bandwidth isn’t an issue."
"Zen 4’s improvements over Zen 3 are concentrated in the right areas. The larger ROB and supporting structures help absorb memory latency. Tracking more branches at faster BTB levels helps the frontend deal with giant instruction footprints, as does the larger L2 cache. But AMD still has room to improve. A 12K entry BTB like the one in Golden Cove could improve frontend instruction delivery. Branch predictor accuracy can always get better. The store queue’s size has not kept pace and sometimes limits effective reordering capacity."
AMD didn’t present a lot of new info about the Zen 4 core at their Hot Chips 2023 presentation.
So, it sounds like you're right to focus on branch prediction as an issue for games. Unfortunately, in spite of using a 7950X3D for the article, I see virtually no comparison between the different CCDs to show where the 3D cache is having the greatest impact. For that, it seems the best we've got is the prior article, which offers only this one tidbit:
"VCache makes Zen 4 less backend-bound, which makes sense because the execution units can be better fed with a higher cache hitrate. However, the core suffers heavily from frontend bottlenecks. VCache seems to have little effect on frontend performance, suggesting that most of the L3 hitrate gain came from catching more data-side misses."
It's just one game, but that suggests the benefit of 3D VCache isn't primarily to compensate for any weakness in branch prediction.