juanrga :
Let me clarify my point, because it has been misunderstood.
I expect AMD to develop Zen2 around a 6-core CCX module. So I expect AMD to release a 12-core die for CPUs and a 6-core die for APUs: two CCXs in the first die and one CCX in the second. I expect Zen2 ThreadRipper CPUs to go up to 24 cores (two 12-core dies). I expect Starship CPUs to go up to 48 cores (four 12-core dies).
Different models will be obtained by disabling cores. For instance, I expect the cheapest Zen2 CPU to be a 6-core part (3+3).
I expect Zen2 to bring about 7% higher IPC than Zen. What I said is that AMD engineers cannot use all the extra space given by the 7LP node to increase the IPC of each core. They cannot, because of the IPC wall. That is the reason why AMD engineers will use only a small amount of that space to increase the IPC of Zen2 cores, whereas most of the extra space will be used to add more cores to the die. As stated above, I expect the 4-core CCX to be replaced by a 6-core CCX, i.e. I am expecting 50% more cores.
What I said is that whereas a 6-core Zen2 APU will be relevant for 99% of users, those future 12-core CPUs will be useless for most users, because most desktop code doesn't scale to many cores. One can simply compare reviews of the 12-core ThreadRipper vs the 8-core RyZen and see that, overall, ThreadRipper isn't 50% faster.
AMD will not be doing 12-core Zen2 dies because it benefits most desktop users. AMD will be doing 12-core Zen2 dies because of (i) the IPC wall and (ii) the fact that those dies will be used in servers, where workloads scale to many cores.
...and you know this how? A crystal ball? How can you possibly know about an IPC wall?
A lot of things that give you more power just require more transistors to build them. Wider buses scale the transistor count up in almost all processor components. High-speed caches add transistors according to cache size. If you lengthen a pipeline you need to add stages and more complex control units. If you add execution units to help mitigate a bottleneck in the pipeline, each of those requires more transistors, and then the control logic that keeps the execution units allocated adds still more transistors.
The thing is, in an electronic circuit, everything happens in parallel. In the software world, the default is for things to be sequential, and software designers go to great pains to get parallelism built into the software so that it can take advantage of the parallel nature of hardware. Parallelism just means more stuff happening at the same time, so roughly equates to speed; the more things that can be done in parallel, the faster you can get things done. The only real parallelism is what you get when you have more transistors on the job.
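To make that concrete, here is a minimal C sketch (assuming GCC or Clang with OpenMP, compiled with something like gcc -O2 -fopenmp; the array size is arbitrary). The multiple cores are always sitting there in silicon, but they only help once the programmer explicitly marks the work as parallel:

```c
#include <stdio.h>
#include <stdlib.h>

#define N 10000000L

int main(void)
{
    double *a = malloc(N * sizeof *a);
    if (!a)
        return 1;
    for (long i = 0; i < N; i++)
        a[i] = 1.0;

    /* Sequential by default: one core walks the whole array. */
    double s_seq = 0.0;
    for (long i = 0; i < N; i++)
        s_seq += a[i];

    /* The programmer has to state that the iterations are independent
     * before the hardware's parallel resources (multiple cores) can be
     * used; without -fopenmp the pragma is ignored and the loop stays
     * sequential. */
    double s_par = 0.0;
    #pragma omp parallel for reduction(+:s_par)
    for (long i = 0; i < N; i++)
        s_par += a[i];

    printf("sequential: %.0f  parallel: %.0f\n", s_seq, s_par);
    free(a);
    return 0;
}
```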
First, instructions are not necessarily "executed sequentially" even on a non-VLIW ISA; execution only needs to appear sequential. An in-order superscalar implementation can execute more than one instruction in parallel. To do this effectively, the hardware for decoding instructions must be increased (widened), hardware must be added to ensure data independence of the instructions to be executed in parallel, the execution resources must be increased, and the number of register file ports is generally increased. All of these add transistors.
An out-of-order implementation, which allows later instructions to be executed before earlier ones as long as there are no data dependencies, uses additional hardware to handle scheduling of instructions as soon as data becomes available and adds rename registers and hardware for mapping, allocating, and freeing them (more transistors) to avoid write-after-read and write-after-write hazards. Out-of-order execution allows the processor to avoid stalling.
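A rough C illustration of why this hardware matters to software (the choice of four accumulators and the array size are arbitrary): with a single accumulator every add depends on the previous one, so even a wide out-of-order core is limited by that dependency chain; independent accumulators give the superscalar/out-of-order machinery something to keep in flight at the same time.

```c
#include <stdio.h>

#define N 1000000

/* One accumulator: every add waits for the result of the previous add. */
static double sum1(const double *a)
{
    double s = 0.0;
    for (long i = 0; i < N; i++)
        s += a[i];
    return s;
}

/* Four independent accumulators: the chains carry no data dependence on
 * each other, so several adds can execute per cycle on a wide core. */
static double sum4(const double *a)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    long i;
    for (i = 0; i + 4 <= N; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < N; i++)
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}

int main(void)
{
    static double a[N];
    for (long i = 0; i < N; i++)
        a[i] = 1.0;
    printf("%f %f\n", sum1(a), sum4(a));
    return 0;
}
```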
The reordering of loads and stores in an out-of-order processor requires ensuring that stores earlier in program order will forward their results to later loads of the same address. This implies address-comparison logic as well as storage for the addresses (and sizes) of stores, and storage for their data, until the store has been committed to memory (the cache). (For an ISA with a stronger memory consistency model, it is also necessary to check that loads are ordered properly with respect to stores from other processors--more transistors.)
Pipelining adds some additional control and buffering overhead and prevents the reuse of logic for different parts of instruction handling, but allows the different parts of handling an instruction to overlap in time for different instructions.
Pipelining and superscalar execution increase the impact of control hazards (i.e., conditional branches and jumps). Pipelining (and also out-of-order execution) can delay the availability of the target of even unconditional jumps, so adding hardware to predict targets (and direction for conditional branches) allows fetching of instructions to continue without waiting for the execution portion of the processor to make the necessary data available. More accurate predictors tend to require more transistors.
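A well-known way to see the cost of hard-to-predict branches from C (a sketch; exact timings depend on the CPU, and an optimizing compiler may turn the branch into a conditional move, which hides the effect). The same loop runs over the same data twice; only the predictability of the branch changes:

```c
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

static int cmp_int(const void *p, const void *q)
{
    return *(const int *)p - *(const int *)q;
}

/* The branch "v[i] >= 128" is essentially random on unsorted data, so
 * the predictor is wrong roughly half the time; after sorting, long runs
 * of the same outcome make it almost perfectly predictable. */
static long sum_large(const int *v)
{
    long s = 0;
    for (long i = 0; i < N; i++)
        if (v[i] >= 128)
            s += v[i];
    return s;
}

int main(void)
{
    static int v[N];
    for (long i = 0; i < N; i++)
        v[i] = rand() % 256;

    long unsorted = sum_large(v);        /* hard-to-predict branch */
    qsort(v, N, sizeof v[0], cmp_int);
    long sorted = sum_large(v);          /* easy-to-predict branch */

    printf("%ld %ld\n", unsorted, sorted);
    return 0;
}
```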
For an out-of-order processor, it can be desirable to allow a load from memory to execute before the addresses of all preceding stores have been computed, so some hardware to handle such speculation is required, potentially including a predictor.
Caches can reduce the latency and increase the bandwidth of memory accesses, but add transistors to store the data and to store tags (and compare tags with the requested address). Additional hardware is also needed to implement the replacement policy. Hardware prefetching will add more transistors.
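A small C sketch of the cache effect (the matrix size is arbitrary): both functions perform exactly the same additions over the same data, but one walks memory in cache-line order, which the prefetcher handles easily, and the other strides across lines, so far more accesses miss.

```c
#include <stdio.h>
#include <stdlib.h>

#define N 4096

/* Row-major walk: consecutive accesses hit the same cache line and form
 * a simple stream that a hardware prefetcher can stay ahead of. */
static long sum_rows(const int *m)
{
    long s = 0;
    for (long i = 0; i < N; i++)
        for (long j = 0; j < N; j++)
            s += m[i * N + j];
    return s;
}

/* Column-major walk over the same data: each access lands in a different
 * cache line (stride of N * sizeof(int)), so most accesses miss. */
static long sum_cols(const int *m)
{
    long s = 0;
    for (long j = 0; j < N; j++)
        for (long i = 0; i < N; i++)
            s += m[i * N + j];
    return s;
}

int main(void)
{
    int *m = malloc((size_t)N * N * sizeof *m);
    if (!m)
        return 1;
    for (long i = 0; i < (long)N * N; i++)
        m[i] = 1;
    printf("%ld %ld\n", sum_rows(m), sum_cols(m));
    free(m);
    return 0;
}
```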
Implementing functionality in hardware rather than software can increase performance (while requiring more transistors). E.g., TLB management, complex operations like multiplication or floating point operations, specialized operations like count leading zeros. (Adding instructions also increases the complexity of instruction decode and typically the complexity of execution as well--e.g., to control which parts of the execution hardware will be used.)
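For example, counting leading zeros in software takes a loop of shifts and tests, while many ISAs provide a single instruction for it. The sketch below uses the GCC/Clang builtin __builtin_clz (which is undefined for zero, so the argument here is nonzero):

```c
#include <stdio.h>

/* Software count-leading-zeros: a loop of shifts and compares. */
static int clz_soft(unsigned int x)
{
    int n = 0;
    if (x == 0)
        return 32;
    while (!(x & 0x80000000u)) {
        x <<= 1;
        n++;
    }
    return n;
}

int main(void)
{
    unsigned int x = 0x00ffu;
    /* __builtin_clz typically maps to one hardware instruction (e.g.
     * LZCNT/BSR on x86, CLZ on ARM) when the ISA provides it. */
    printf("soft: %d  hw: %d\n", clz_soft(x), __builtin_clz(x));
    return 0;
}
```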
SIMD/vector operations increase the amount of work performed per instruction but require more data storage (wider registers) and typically use more execution resources.
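A minimal sketch using x86 SSE intrinsics (so it only builds with an x86 compiler; GCC, Clang, and MSVC all provide immintrin.h): one 128-bit add instruction does the work of four scalar adds, which is exactly why the registers and execution units have to be wider.

```c
#include <stdio.h>
#include <immintrin.h>   /* x86 SSE intrinsics */

#define N 16

int main(void)
{
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) {
        a[i] = (float)i;
        b[i] = 2.0f;
    }

    /* Scalar: one addition per instruction. */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    /* SSE: one 128-bit instruction adds four floats at once, at the cost
     * of wider registers and wider execution units in the hardware. */
    for (int i = 0; i < N; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }

    for (int i = 0; i < N; i++)
        printf("%g ", c[i]);
    printf("\n");
    return 0;
}
```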
(Speculative multithreading could also allow multiple processors to execute a single threaded program faster. Obviously adding processors to a chip will increase the transistor count.)
Having more transistors available can also allow computer architects to provide an ISA with more registers visible to software, potentially reducing the frequency of memory accesses which tend to be slower than register accesses and involve some degree of indirection (e.g., adding an offset to the stack pointer) which increases latency.
Integration--which increases the number of transistors on a chip but not in the system--reduces communication latency and increases bandwidth, obviously allowing an increase in performance. (There is also a reduction in power consumption which may be translated into increased performance.)
Even at the level of instruction execution, adding transistors can increase performance. E.g., a carry select adder adds upper bits twice in parallel with different assumptions of the carry-in from the lower bits, selecting the correct sum of upper bits when the carry out from the lower bits is available, obviously requiring more transistors than a simple ripple carry adder but reducing the delay in producing the full sum. Similarly a multiplier with a single row of carry-save adders uses fewer transistors (but is slower) than a Dadda (or Wallace) tree multiplier and cannot be pipelined (so would have to be replicated to allow another multiply to begin execution while an earlier multiply was in progress).
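Here is a toy C model of the carry-select idea (the 16-bit width and the test values are arbitrary; in software both versions necessarily run one step at a time, but the structure shows what the hardware computes in parallel):

```c
#include <stdio.h>
#include <stdint.h>

/* Ripple-carry add of the low 'bits' bits: the carry propagates through
 * every bit position in sequence, so delay grows with the width. */
static uint32_t ripple_add(uint32_t a, uint32_t b, int bits, int cin, int *cout)
{
    uint32_t sum = 0;
    int carry = cin;
    for (int i = 0; i < bits; i++) {
        int ai = (a >> i) & 1, bi = (b >> i) & 1;
        sum |= (uint32_t)(ai ^ bi ^ carry) << i;
        carry = (ai & bi) | (ai & carry) | (bi & carry);
    }
    *cout = carry;
    return sum;
}

/* 16-bit carry-select add: the upper 8 bits are computed twice, once
 * assuming carry-in 0 and once assuming 1 (in hardware this happens in
 * parallel with the lower half); the real carry out of the lower half
 * then just selects one of the two precomputed results. */
static uint32_t carry_select_add16(uint32_t a, uint32_t b)
{
    int c_low, c0, c1;
    uint32_t low = ripple_add(a, b, 8, 0, &c_low);
    uint32_t hi0 = ripple_add(a >> 8, b >> 8, 8, 0, &c0);  /* assume cin = 0 */
    uint32_t hi1 = ripple_add(a >> 8, b >> 8, 8, 1, &c1);  /* assume cin = 1 */
    uint32_t hi  = c_low ? hi1 : hi0;   /* a multiplexer in hardware */
    return (hi << 8) | low;
}

int main(void)
{
    uint32_t a = 0x1234, b = 0x0FCD;
    printf("%#x + %#x = %#x (expected %#x)\n",
           a, b, carry_select_add16(a, b), (a + b) & 0xFFFF);
    return 0;
}
```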
Microprocessors have advanced significantly in recent years; things like longer pipelines, branch prediction, and on-chip caches have all added to the complexity of a processor.
Sure, the basics of CPU processing (fetch, decode, ALU, write) are still the same, but to speed things up, longer pipelines are used. Longer pipelines increase performance for continuous code execution, but incur bigger penalties when the code branches, which hurts performance. The remedy is branch prediction. Branch prediction is largely a trade secret: Intel does not normally disclose the full workings, but simply uses it to keep performance as high as possible on their CPUs.
Cache memory is much faster than RAM, but what should be moved from RAM into cache, and from cache back to RAM? That is, again, proprietary stuff, but it again takes transistors to implement.
So the extra transistors go into things like the longer pipeline, branch-prediction algorithms, cache memory, and cache-management algorithms.
This is without mentioning multi-core processors and shared memory/resource access controllers.
------
A steady increase in single-threaded performance has been achieved consistently year on year.
Matching specific code to the CPU architecture is also a factor.