InvalidError :
bit_user :
the core counts of these server chips seem to be growing past the point of diminishing returns.
Depends on the workloads. For HPC-style applications where algorithms are finely tuned to the underlying system architecture and scale beyond 100k cores for current record-holders, ...
I specifically said "core counts of these server chips". I don't know why you would broaden the conversation to core counts of clusters, but it's not relevant to my point.
InvalidError :
there would no doubt be many cases where you could put 64+ full-blown cores on a CPU with a 256-bit memory architecture and still have no meaningful bottleneck, because the algorithms are designed to keep the bulk of their working set within the CPU's caches, with the rest flowing smoothly and timely within the limits of available bandwidth and latency.
We've been down this rat hole before. My premise is that cache coherency ain't free. Even if your workload is not bottlenecked by memory bandwidth, your energy efficiency will drop, by virtue of more cache-coherency overhead and more on-chip interconnect links to traverse.
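To make that concrete, here's a minimal C sketch (pthreads, C11; the iteration count, struct layout, and 64-byte cache-line assumption are all mine, chosen for illustration). False sharing is just the easiest slice of coherency overhead to demonstrate: two counters sharing one line force that line to ping-pong between cores, while padded counters don't, and the first run typically comes out several times slower. Build with something like gcc -O2 -pthread.

/* false sharing vs. padded counters: the coherency tax made visible */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL /* arbitrary; just big enough to measure */

/* two counters that land in the same 64-byte cache line */
static struct { volatile unsigned long a, b; } same_line;

/* two counters padded onto separate cache lines */
static struct { _Alignas(64) volatile unsigned long v; char pad[56]; } padded[2];

static void *bump(void *arg)
{
    volatile unsigned long *c = arg;
    for (unsigned long i = 0; i < ITERS; i++)
        (*c)++; /* each write may force a coherency transfer of the line */
    return NULL;
}

static double timed_pair(volatile unsigned long *x, volatile unsigned long *y)
{
    pthread_t t0, t1;
    struct timespec beg, end;
    clock_gettime(CLOCK_MONOTONIC, &beg);
    pthread_create(&t0, NULL, bump, (void *)x);
    pthread_create(&t1, NULL, bump, (void *)y);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (end.tv_sec - beg.tv_sec) + (end.tv_nsec - beg.tv_nsec) / 1e9;
}

int main(void)
{
    printf("same cache line: %.2f s\n", timed_pair(&same_line.a, &same_line.b));
    printf("separate lines:  %.2f s\n", timed_pair(&padded[0].v, &padded[1].v));
    return 0;
}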
Workloads that are highly scalable can already be run across multiple machines, so the benefit of ultra-high core-count chips is negligible. It's really a question of when the TCO of adding another cluster node is less than the cost of the efficiency loss from adding more cores per node.
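That crossover can be sketched with back-of-envelope numbers. Everything in this little C program (node cost, per-core cost, the exponential efficiency decay) is a made-up assumption, not data; the point is only the shape of the curve. With these particular numbers, cost per unit of effective throughput bottoms out around 64 cores per node, after which another node is the cheaper way to buy throughput. Link with -lm.

/* sketch: cost per "effective core" as cores-per-node grows,
 * assuming each added core loses a bit to coherency/interconnect tax */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double node_cost = 10000.0; /* assumed TCO of one more node    */
    const double core_cost = 100.0;   /* assumed cost of one more core   */
    const double decay     = 0.01;    /* assumed per-core efficiency tax */

    for (int n = 16; n <= 256; n *= 2) {
        double eff = n * exp(-decay * n); /* effective cores after the tax */
        printf("%3d cores/node: %5.1f effective, $%.0f per effective core\n",
               n, eff, (node_cost + n * core_cost) / eff);
    }
    return 0;
}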
This ties into memory, in the sense that additional memory channels are needed only as long as core counts keep climbing. If/when core counts plateau, then memory just needs to keep pace with core clock increases. Packing HBM2/HMC-style cache in-package, as Intel did with the MCDRAM on the Xeon Phi 7200-series, might even enable them to roll back some of these DDR channel increases.
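The per-core bandwidth arithmetic is easy to spell out: a DDR4-3200 channel is good for 25.6 GB/s (3200 MT/s x 8 bytes). The channel/core combinations in this sketch are just illustrative picks of mine, but they show why channel counts had to climb alongside core counts:

/* aggregate DDR bandwidth split across cores, for a few configurations */
#include <stdio.h>

int main(void)
{
    const double chan_bw = 25.6; /* GB/s per DDR4-3200 channel (3200 MT/s * 8 B) */
    const int channels[] = { 4, 6, 8, 12 };
    const int cores[]    = { 16, 32, 64 };

    for (size_t i = 0; i < sizeof channels / sizeof channels[0]; i++)
        for (size_t j = 0; j < sizeof cores / sizeof cores[0]; j++)
            printf("%2d ch x %2d cores: %5.2f GB/s per core\n",
                   channels[i], cores[j], channels[i] * chan_bw / cores[j]);
    return 0;
}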