Not every workload is bandwidth intensive, and while the cache helps in some areas, it does not help in all of them. Gamers Nexus focused more on the cache in their comparison of the 4060 Ti 8GB and 3060 Ti 8GB. Keep in mind that the RTX 4060 Ti has around 21% more raw compute performance than the 3060 Ti and around 8-12% better performance at 1080p, though the benefit diminishes, and in some games even turns negative, at higher resolutions.
https://youtu.be/Y2b0MWGwK_U
To see more of the impact of reduced VRAM throughput, look at workloads where the data being worked on cannot all fit in cache, so the VRAM is in constant use:
https://www.pugetsystems.com/labs/articles/nvidia-rtx-4070-and-4060-ti-8gb-content-creation-review/
PS, overclocking the VRAM offers a decent performance boost in DaVinci Resolve, especially for the Fusion functions and temporal noise reduction, though good GPU compute performance is still needed. For example, a VRAM overclock on a GTX 1070 yields only a very small performance boost.
PS, Blender strongly favors the Nvidia architecture and also strongly favors cache in its tile rendering, so the RTX 40xx series gets a large performance boost.
I am not claiming to know more than Nvidia, but the same can be said about your claims about balancing compute and bandwidth, as we do not know whether that was their goal. The most those of us on the outside can do is draw logical inferences as to why they might have made some changes, and it strongly seems like they are forcing additional resolution segmentation.
When running a game, there are aspects of the game engine that deal in small data sets that are compute intensive, and added cache helps in those areas. Aspects of the game engine leaning more heavily on the cache will see a performance improvement, while aspects leaning more on VRAM throughput will struggle more.
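To put rough numbers on that cache versus VRAM trade-off, here is a minimal back-of-the-envelope sketch. It assumes the usual spec-sheet memory bandwidths (about 288 GB/s for the 4060 Ti and about 448 GB/s for the 3060 Ti), and the cache hit rates are made-up illustrative values, not measurements. The model is deliberately naive: only cache misses touch VRAM, and nothing else is accounted for.

```python
# Naive model: only cache misses generate VRAM traffic.
# Bandwidth figures are assumed spec-sheet values; hit rates are hypothetical.
BW_4060TI_GBPS = 288.0  # assumed: 128-bit bus, 18 Gbps GDDR6
BW_3060TI_GBPS = 448.0  # assumed: 256-bit bus, 14 Gbps GDDR6


def dram_time_ms(requested_gb, hit_rate, dram_gbps):
    """Milliseconds spent moving the cache misses of `requested_gb` over VRAM."""
    dram_traffic_gb = requested_gb * (1.0 - hit_rate)
    return dram_traffic_gb / dram_gbps * 1000.0


def break_even_hit_rate(narrow_gbps, wide_gbps, wide_hit_rate):
    """Hit rate the narrow-bus card needs to match the wide-bus card.

    Solves narrow / (1 - h) = wide / (1 - wide_hit_rate) for h.
    """
    return 1.0 - narrow_gbps * (1.0 - wide_hit_rate) / wide_gbps


if __name__ == "__main__":
    # Hypothetical: the 3060 Ti's small L2 catches 30% of requests.
    needed = break_even_hit_rate(BW_4060TI_GBPS, BW_3060TI_GBPS, 0.30)
    print(f"4060 Ti needs a hit rate of about {needed:.0%} to break even")

    # A streaming workload that barely hits cache on either card.
    for name, bw in (("4060 Ti", BW_4060TI_GBPS), ("3060 Ti", BW_3060TI_GBPS)):
        print(f"{name}: {dram_time_ms(1.0, 0.05, bw):.2f} ms per GB requested")
```

Under those assumptions, the narrower bus only keeps up while the larger cache absorbs a clearly higher share of requests. Once the working set spills out of cache, the hit rates on both cards converge and the raw bandwidth gap shows through.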
PS, those applications are not PCIe bus intensive workloads, so the drop to PCIe 4.0 x8 does not make a difference.
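For a sense of scale on the PCIe point, here is a quick calculation of theoretical link bandwidth. It uses the standard figures of 16 GT/s per lane and 128b/130b encoding for PCIe 4.0, and the function name is just something I made up for illustration. Real-world throughput will be somewhat lower because of protocol overhead.

```python
def pcie_gb_per_s(gt_per_s, lanes, encoding=128 / 130):
    """Theoretical one-direction PCIe bandwidth in GB/s.

    gt_per_s: transfer rate per lane (16 for PCIe 4.0)
    encoding: line-code efficiency (128b/130b for PCIe 3.0 and later)
    """
    return gt_per_s * encoding * lanes / 8  # 8 bits per byte


if __name__ == "__main__":
    print(f"PCIe 4.0 x8:  {pcie_gb_per_s(16, 8):.2f} GB/s")
    print(f"PCIe 4.0 x16: {pcie_gb_per_s(16, 16):.2f} GB/s")
```

Either way the link is far slower than the card's own VRAM, so it only matters for workloads that constantly stream data across the bus.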