(Question) How many 6th/7th-gen x86 cores would fit in the space of a 640 million (640 x 10^6) transistor processor (65nm can do this), assuming a suitable amount of cache is used and said cache is built from Z-RAM?
- Note that said cores include OoO execution, register renaming, etc.
The clock speeds would still be in the 2 to 3 GHz range.
So I decided to do some research....
(Answer) 6 x 6th/7th-generation x86/x64 processor cores, with 'only' 6 MB of shared L2 cache, would fit within 630 million transistors, leaving room for 10 million more (extra logic, etc.).
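Rough working below, in C since it's easy to paste and run. The per-core logic count is my assumption (a knob to play with), not a published figure; the cache cost assumes conventional 6T SRAM, with Z-RAM at roughly 1 transistor per bit as the alternative.

```c
/* Back-of-the-envelope transistor budget for the figures above.
 * core_logic is an assumed knob, not a published number. */
#include <stdio.h>

int main(void) {
    const double MB_BITS = 8.0 * 1024 * 1024;  /* bits per megabyte */
    double cores      = 6;
    double core_logic = 55e6;  /* assumed transistors per core (knob) */
    double cache_mb   = 6;     /* shared L2 size; try 12 for the later example */
    double t_per_bit  = 6;     /* 6T SRAM; try 1 for Z-RAM */

    double cache_t = cache_mb * MB_BITS * t_per_bit;
    double total   = cores * core_logic + cache_t;

    printf("cache: %.0fM transistors\n", cache_t / 1e6);  /* ~302M at 6T */
    printf("total: %.0fM of the 640M budget\n", total / 1e6);  /* ~632M */
    return 0;
}
```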
TDP for such a processor would be 75 - 90 watts (depending on load), which can be cooled by today's 'standard' heatsink + fan combinations (thanks to the Prescott we have some rather advanced cooling technology available 😛, ironic no?).
Power usage would be up to 115 watts (give or take), which is 'acceptable' as well, and likely lower assuming 4 of the 6 cores are in power-saving modes most of the time, leaving the 6 MB shared L2 cache to only 2 or 3 of the cores under 'actual usage' patterns.... at least in most software.
The processor would require an external link to a northbridge / MCH (thin but high speed, likely 64 to 128 bits 'thin'), as the die would be too complex if the memory controller were integrated within it, and it would need somewhere between 12.8 and 25.6 GB/sec of peak memory throughput (with the memory side likely 'wider' at 256 - 512 bits in total) to keep the cores well fed with instructions and data.
eg: Sort of like how the typical current Intel FSB is 800 MHz (QDR) x 64 bits wide for 6.4 GB/sec, while the 400 MHz (DDR) x 128-bit-wide (dual-channel) memory behind it feeds it at up to the same 6.4 GB/sec.
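A quick sanity check of that arithmetic (both sides come out to the same 6.4 GB/sec):

```c
/* FSB vs dual-channel memory bandwidth, from the figures above. */
#include <stdio.h>

int main(void) {
    /* FSB: 800 MT/sec effective (200 MHz quad-pumped), 64 bits = 8 bytes wide. */
    double fsb_gbs = 800e6 * (64 / 8.0) / 1e9;
    /* Memory: DDR-400, 400 MT/sec effective, 2 x 64-bit channels = 16 bytes wide. */
    double mem_gbs = 400e6 * (128 / 8.0) / 1e9;
    printf("FSB:    %.1f GB/sec\n", fsb_gbs);  /* 6.4 */
    printf("Memory: %.1f GB/sec\n", mem_gbs);  /* 6.4 */
    return 0;
}
```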
This could be done using today's 65nm technology; ironically, XDR Rambus would be the best choice for it as well (the lowest pin-count-to-bandwidth ratio of any solution)..... however I don't see Intel going back to them for 'help'. (Pin counts would become an issue in getting enough sustained memory throughput.)
It is possible that 2-socket systems using 2 ccNUMA nodes may become the norm, just to aggregate and 'share' the total available memory throughput and keep performance scaling.
Because of manufacturer control it would be unlikely to happen for some time, and Intel tends to burn space on large L2 caches IMHO, when an increase in L1 (data and instruction) cache sizes would be a more effective way (transistor-wise) to keep performance up. As 6 cores + 12 MB of cache would require around 840 million transistors, it is not likely to happen at Intel until 45nm manufacturing begins.
eg: Intel would most likely pair 12 MB with 6 cores, maybe even more, and then keep the interface to memory slower, while keeping 'overall complexity' down to reduce mainboard costs. (Frankly I'd rather pay up to $500 for a processor, and up to $500 for a mainboard, excluding RAM, and 'typically' expect to pay $750 for a video card.)
eg2: A 1 billion (10^9) transistor processor could house 40 Celeron (P6) processor cores, including ~5 MB of (ideally shared) L2 cache and overheads, such as logic to control the large number of cores and a shared interface to memory for them all. (Now that is cool, as they'd be clocked at over 3 GHz using a 52nm, or more likely 45nm, manufacturing process.) Go ahead, crunch the numbers yourself (see the sketch below).... it sounds pretty cool to me.
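Here's that crunch. The ~9.5 million transistors per P6 core is my assumption (roughly the Pentium III 'Katmai' logic count, cache excluded); whatever is left of the billion is the 'overheads' budget mentioned above.

```c
/* Crunching the 1-billion-transistor, 40-core P6 example. */
#include <stdio.h>

int main(void) {
    double budget  = 1e9;
    double cores   = 40;
    double p6_core = 9.5e6;                      /* assumed logic per core */
    double cache_t = 5 * 8.0 * 1024 * 1024 * 6;  /* 5 MB shared L2, 6T SRAM */

    double used = cores * p6_core + cache_t;
    printf("cores: %.0fM, cache: %.0fM\n",
           cores * p6_core / 1e6, cache_t / 1e6);  /* 380M + 252M */
    printf("used:  %.0fM, leaving %.0fM for overheads\n",
           used / 1e6, (budget - used) / 1e6);     /* ~632M, ~368M spare */
    return 0;
}
```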
IBM Cell, just like Intel IA-64 (Itanium), requires a very good compiler; that is where many of the advancements are made. The point is to make the Out of Order / register renaming units on the processors take up less silicon space, so more can be dedicated to combinations of the FPU (Floating Point Unit), cache(s), and other processor cores.
Eventually we'd hit a point where the Out of Order / register renaming units grew so large they'd need half the processor space to keep scaling performance. When (or if) we hit that point without moving to another architecture, gamers will not be a happy market.
CPUs (general and central processing units) are going to become more like GPUs / VPUs are today... we don't bash ATI or nVidia for not having advanced, large, and complex Out of Order execution units in their GPUs / VPUs, and physics calculations on thousands of objects at once do not require OoO features to make effective use of 'silicon real-estate' within a processor.
==================
Cell is just a cut-down (cheaper to produce and scaled) variant of the above.... (that still has good performance), as you don't require Out of Order execution when you can execute multiple threads at once in well-coded and well-compiled software, especially if each core can issue 3 - 4 instructions per clock cycle.
Cell has applications in gaming, but it might take a while before people migrate to it for gaming.... it also has excellent video encoding and decoding performance for a 'single chip' solution. Bear in mind gaming consoles are getting closer and closer to desktop PCs in functionality (eg: surfing the net while watching a DVD, Blu-ray, HD-DVD, etc. How long would it be until consoles can encode video faster, and better, than existing PCs?).
The compiler's job shifts to 'making code that avoids pipeline stalls at all costs' (which isn't really that much of a jump for compilers anyway) and away from producing (streams of) code that relies on being executed out of order.
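A minimal illustration of the kind of transformation meant here (plain C, nothing Cell-specific): breaking one long dependency chain into several independent ones, so an in-order pipeline never sits waiting on the previous result.

```c
#include <stdio.h>

#define N 1024

/* Naive: every add depends on the previous one, so an in-order core
 * stalls for the full add latency on each iteration. */
float sum_serial(const float *a) {
    float s = 0.0f;
    for (int i = 0; i < N; i++)
        s += a[i];
    return s;
}

/* Unrolled with four independent accumulators: four add chains are in
 * flight at once, hiding the latency without any OoO hardware. */
float sum_unrolled(const float *a) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < N; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}

int main(void) {
    float a[N];
    for (int i = 0; i < N; i++) a[i] = 1.0f;
    printf("%f %f\n", sum_serial(a), sum_unrolled(a));
    return 0;
}
```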
It is a far more elegant solution and will result in performance scaling far better than it has since 2002.
Honestly, since 2002 has CPU performance, excluding the 2 x 4-issue-core Intel Conroe, really scaled that well compared to previous 3-to-4-year periods?