(Question) How many 6th/7th-gen x86 cores would fit in the space of a 640 million (640 x 10^6) transistor processor (65nm can do this), assuming a suitable amount of cache is used and said cache is built from Z-RAM?
- Note that said cores include OoO execution, register renaming, etc.
The clock speeds would still be in the 2 to 3 GHz range.
So I decided to do some research....
(Answer) 6 x 6th/7th-generation x86/x64 processor cores, with 'only' 6 MB of shared L2 cache, would fit within 630 million transistors, leaving room for 10 million more (extra logic, etc.).
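Rough working below, in C since it's easy to paste and run. The per-core logic count is my assumption (a knob to play with), not a published figure; the cache cost assumes conventional 6T SRAM, with Z-RAM at roughly 1 transistor per bit as the alternative.

```c
/* Back-of-the-envelope transistor budget for the figures above.
 * core_logic is an assumed knob, not a published number. */
#include <stdio.h>

int main(void) {
    const double MB_BITS = 8.0 * 1024 * 1024;  /* bits per megabyte */
    double cores      = 6;
    double core_logic = 55e6;  /* assumed transistors per core (knob) */
    double cache_mb   = 6;     /* shared L2 size; try 12 for the later example */
    double t_per_bit  = 6;     /* 6T SRAM; try 1 for Z-RAM */

    double cache_t = cache_mb * MB_BITS * t_per_bit;
    double total   = cores * core_logic + cache_t;

    printf("cache: %.0fM transistors\n", cache_t / 1e6);  /* ~302M at 6T */
    printf("total: %.0fM of the 640M budget\n", total / 1e6);  /* ~632M */
    return 0;
}
```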
TDP for such a processor would be 75 - 90 watts (depending on load), which can be cooled by today's 'standard' heatsink + fan combinations (thanks to the Prescott we have some rather advanced cooling technology available 😛, ironic no?).
Power usage would be up to 115 watts (give or take), which is 'acceptable' as well, and likely lower assuming 4 of the 6 cores are in power-saving modes most of the time, leaving the 6 MB shared L2 cache to only 2 or 3 of the cores under 'actual usage' patterns.... at least in most software.
The processor would require an external link to a northbridge / MCH (thin but high speed, likely 64 to 128 bits 'thin'), as the die would be too complex if the memory controller were integrated within it, and it would need somewhere between 12.8 and 25.6 GB/sec of peak memory throughput (with the memory side likely 'wider' at 256 - 512 bits in total) to keep the cores well fed with instructions and data.
eg: Sort of like how the typical current Intel FSB is 800 MHz (QDR) x 64 bits wide for 6.4 GB/sec, while the 400 MHz (DDR) x 128-bit-wide (dual-channel) memory behind it feeds it at up to the same 6.4 GB/sec.
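A quick sanity check of that arithmetic (both sides come out to the same 6.4 GB/sec):

```c
/* FSB vs dual-channel memory bandwidth, from the figures above. */
#include <stdio.h>

int main(void) {
    /* FSB: 800 MT/sec effective (200 MHz quad-pumped), 64 bits = 8 bytes wide. */
    double fsb_gbs = 800e6 * (64 / 8.0) / 1e9;
    /* Memory: DDR-400, 400 MT/sec effective, 2 x 64-bit channels = 16 bytes wide. */
    double mem_gbs = 400e6 * (128 / 8.0) / 1e9;
    printf("FSB:    %.1f GB/sec\n", fsb_gbs);  /* 6.4 */
    printf("Memory: %.1f GB/sec\n", mem_gbs);  /* 6.4 */
    return 0;
}
```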
This could be done using today's 65nm technology; ironically, XDR Rambus would be the best choice for it as well (the lowest pin-count-to-bandwidth ratio of any solution)..... however I don't see Intel going back to them for 'help'. (Pin counts would become an issue in getting enough sustained memory throughput.)
It is possible that 2-socket systems using 2 ccNUMA nodes may become the norm, just to aggregate and 'share' the total available memory throughput and keep performance scaling.
Because of manufacturer control it would be unlikely to happen for some time, and Intel tends to burn space on large L2 caches IMHO, when an increase in L1 (data and instruction) cache sizes would be a more effective way (transistor-wise) to keep performance up. As 6 cores + 12 MB of cache would require around 840 million transistors, it is not likely to happen at Intel until 45nm manufacturing begins.
eg: Intel would most likely pair 12 MB with 6 cores, maybe even more, and then keep the interface to memory slower, while keeping 'overall complexity' down to reduce mainboard costs. (Frankly I'd rather pay up to $500 for a processor, and up to $500 for a mainboard, excluding RAM, and 'typically' expect to pay $750 for a video card.)
eg2: A 1 billion (10^9) transistor processor could house 40 Celeron (P6) processor cores, including ~5 MB of (ideally shared) L2 cache and overheads, such as logic to control the large number of cores and a shared interface to memory for them all. (Now that is cool, as they'd be clocked at over 3 GHz using a 52nm, or more likely 45nm, manufacturing process.) Go ahead, crunch the numbers yourself (see the sketch below).... it sounds pretty cool to me.
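Here's that crunch. The ~9.5 million transistors per P6 core is my assumption (roughly the Pentium III 'Katmai' logic count, cache excluded); whatever is left of the billion is the 'overheads' budget mentioned above.

```c
/* Crunching the 1-billion-transistor, 40-core P6 example. */
#include <stdio.h>

int main(void) {
    double budget  = 1e9;
    double cores   = 40;
    double p6_core = 9.5e6;                      /* assumed logic per core */
    double cache_t = 5 * 8.0 * 1024 * 1024 * 6;  /* 5 MB shared L2, 6T SRAM */

    double used = cores * p6_core + cache_t;
    printf("cores: %.0fM, cache: %.0fM\n",
           cores * p6_core / 1e6, cache_t / 1e6);  /* 380M + 252M */
    printf("used:  %.0fM, leaving %.0fM for overheads\n",
           used / 1e6, (budget - used) / 1e6);     /* ~632M, ~368M spare */
    return 0;
}
```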
IBM Cell, just like Intel IA-64 (Itanium), requires a very good compiler; that is where many of the advancements are made. The point is to make the Out of Order / register renaming units on the processors take up less silicon space, so more can be dedicated to combinations of the FPU (Floating Point Unit), cache(s), and other processor cores.
Eventually we'd hit a point where the Out of Order / register renaming units grew so large they'd need half the processor space to keep scaling performance. When (or if) we hit that point without moving to another architecture, gamers will not be a happy market.
CPUs (general and central processing units) are going to become more like GPUs / VPUs are today... we don't bash ATI or nVidia for not having advanced, large, and complex Out of Order execution units in their GPUs / VPUs, and physics calculations on thousands of objects at once do not require OoO features to make effective use of 'silicon real-estate' within a processor.
==================
Cell is just a cut-down (cheaper to produce and scaled) variant of the above.... (that still has good performance), as you don't require Out of Order execution when you can execute multiple threads at once in well-coded and well-compiled software, especially if each core can issue 3 - 4 instructions per clock cycle.
Cell has applications in gaming, but it might take a while before people migrate to it for gaming.... it also has excellent video encoding and decoding performance for a 'single chip' solution. Bear in mind gaming consoles are getting closer and closer to desktop PCs in functionality (eg: surfing the net while watching a DVD, Blu-ray, HD-DVD, etc. How long would it be until consoles can encode video faster, and better, than existing PCs?).
The compiler's job shifts to 'making code that avoids pipeline stalls at all costs' (which isn't really that much of a jump for compilers anyway) and away from producing (streams of) code that relies on being executed out of order.
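A minimal illustration of the kind of transformation meant here (plain C, nothing Cell-specific): breaking one long dependency chain into several independent ones, so an in-order pipeline never sits waiting on the previous result.

```c
#include <stdio.h>

#define N 1024

/* Naive: every add depends on the previous one, so an in-order core
 * stalls for the full add latency on each iteration. */
float sum_serial(const float *a) {
    float s = 0.0f;
    for (int i = 0; i < N; i++)
        s += a[i];
    return s;
}

/* Unrolled with four independent accumulators: four add chains are in
 * flight at once, hiding the latency without any OoO hardware. */
float sum_unrolled(const float *a) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < N; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}

int main(void) {
    float a[N];
    for (int i = 0; i < N; i++) a[i] = 1.0f;
    printf("%f %f\n", sum_serial(a), sum_unrolled(a));
    return 0;
}
```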
It is a far more elegant solution and will result in performance scaling far better than it has since 2002.
Honestly, since 2002 has CPU performance, excluding the 2 x 4-issue-core Intel Conroe, really scaled that well compared to previous 3-to-4-year periods?