I recently attended a talk on Cell given by IBM. The presenter turned out to be Dr. Peter Hofstee, chief scientist of the Cell project, so he is "THE MAN" to talk to about Cell. Let's start with some pics:
Cell is of course a Power-based architecture, with 1 PowerPC core and 8 SPEs (1 SPE is disabled in the PS3 to improve yield). Each SPE has 256 KB of SRAM local storage, which sits where a cache would but is managed by software, and a DMA (direct memory access) unit, which is VERY important in this architecture.
To use an analogy, say you are working on a plumbing project. A traditional processor is like getting a pipe from Home Depot, running back, installing it, and figuring out you need a fitting; then going back to Home Depot, buying it, running back, installing it, and figuring out you need an elbow. Very inefficient. Traditional techniques are akin to having a long supply line from Home Depot to your house, so ideally you reach out the window and grab the part you need. With Cell, the SPE creates a "shopping list", and the DMA unit fetches everything on it from memory and puts it in the local storage. Memory latency is the biggest drag on performance, and hiding it is what Cell is good at.
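To make the "shopping list" idea concrete, here is a toy sketch in Python. It is not real Cell code: `MAIN_MEMORY`, `CHUNK`, `dma_get`, and `process` are made-up names, and the double-buffering pattern (request the next chunk before working on the current one) is my assumption about how such a list is typically consumed, not something shown in the talk. Python runs this sequentially, so the overlap of transfer and compute is only indicated by the ordering of the calls.

```python
# Toy model of the "shopping list" idea. Not real Cell code.
# MAIN_MEMORY, CHUNK, dma_get and process are hypothetical names.

MAIN_MEMORY = list(range(32))   # stand-in for data sitting in main memory
CHUNK = 8                       # elements per DMA request (assumed size)


def dma_get(offset, size):
    """Pretend DMA transfer: copy a slice of main memory into a local buffer."""
    return MAIN_MEMORY[offset:offset + size]


def process(buffer):
    """Stand-in for SPE compute on data already in local storage."""
    return sum(buffer)


def run_double_buffered():
    # The "shopping list": every transfer is known before compute starts.
    shopping_list = [(off, CHUNK) for off in range(0, len(MAIN_MEMORY), CHUNK)]

    total = 0
    # Prime the pipeline by requesting the first chunk up front.
    next_buffer = dma_get(*shopping_list[0])
    for i in range(len(shopping_list)):
        current = next_buffer
        # Request the next chunk before computing. On real hardware the DMA
        # unit could run this transfer while the SPE works on `current`.
        if i + 1 < len(shopping_list):
            next_buffer = dma_get(*shopping_list[i + 1])
        total += process(current)
    return total


if __name__ == "__main__":
    print(run_double_buffered())
```

The point of the pattern is that the processor never asks for a part and then waits for it; the request for the next chunk is always already in flight.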
Each SPE has 128 registers, because the opcodes have 7 bits available to address a register (2^7 = 128). There are no reorder buffers, branch predictors, or reservation stations. Loop unrolling is used to speed things up, taking advantage of the large number of registers. Since each core is in-order, a lot of the complexity associated with out-of-order execution is eliminated.
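Here is a rough sketch of what loop unrolling looks like, again in Python purely for illustration. The dot product, the function names, and the unroll factor of 4 are my own choices, not from the talk. The four accumulators stand in for separate registers: because they don't depend on each other, an in-order core could keep several multiply-adds in flight at once.

```python
REGISTER_FIELD_BITS = 7
NUM_REGISTERS = 2 ** REGISTER_FIELD_BITS  # 7 address bits give 128 registers


def dot_rolled(a, b):
    """Plain loop: every iteration depends on the single running total."""
    total = 0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total


def dot_unrolled4(a, b):
    """4-way unrolled loop with four independent accumulators."""
    n = len(a)
    acc0 = acc1 = acc2 = acc3 = 0
    i = 0
    while i + 4 <= n:
        # Four independent updates per trip; none waits on another.
        acc0 += a[i] * b[i]
        acc1 += a[i + 1] * b[i + 1]
        acc2 += a[i + 2] * b[i + 2]
        acc3 += a[i + 3] * b[i + 3]
        i += 4
    # Handle leftover elements when the length is not a multiple of 4.
    while i < n:
        acc0 += a[i] * b[i]
        i += 1
    return acc0 + acc1 + acc2 + acc3


if __name__ == "__main__":
    xs = list(range(10))
    ys = list(range(10, 20))
    print(dot_rolled(xs, ys), dot_unrolled4(xs, ys))
```

In Python the unrolled version is not actually faster; the sketch only shows the shape of the transformation a Cell compiler or programmer would apply, and why it eats registers.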
Every program has branches, and some are quite branch heavy. So Cell relies on branch hint instructions that the compiler generates to guide it, as well as a select instruction that lets it calculate both outcomes and then pick one instead of committing to a path up front. Cell is not the best at every type of application, but it is great for many types of apps. For example, the variable-length decode step in video codecs has a lot of branches, and a lot of people predicted it would run poorly on Cell, but it actually runs great.
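The select trick can be sketched like this. It is a Python illustration on 32-bit unsigned values, with made-up function names; I believe the corresponding SPU instruction is a bitwise select called `selb`, but treat that mapping as an assumption. Both candidate results already exist, a comparison produces an all-ones or all-zeros mask, and the mask picks the winner without any branch.

```python
MASK32 = 0xFFFFFFFF  # work on 32-bit unsigned values


def select_bits(a, b, mask):
    """Bitwise select: take bits from b where mask is 1, from a where it is 0."""
    return ((a & ~mask) | (b & mask)) & MASK32


def branchy_max(x, y):
    """Ordinary version with a branch."""
    if x > y:
        return x
    return y


def branchless_max(x, y):
    """Same result without a branch: compare, then select."""
    # True becomes -1, which masks to 0xFFFFFFFF; False becomes 0.
    mask = -(x > y) & MASK32
    return select_bits(y, x, mask)


if __name__ == "__main__":
    for x, y in [(3, 7), (7, 3), (5, 5), (0, MASK32)]:
        print(x, y, branchy_max(x, y), branchless_max(x, y))
```

The trade-off is that you always pay for computing both outcomes, which is usually cheap next to a mispredicted branch on a core with no branch predictor.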
Some more highlights:
- 300 GB/s peak, 200 GB/s sustained transfer on the internal bus
- 26 GB/s bandwidth to memory, which is utilized much more efficiently by the Cell architecture than traditional ones. It's not uncommon to see 90% sustained utilization during a program execution.
- Rambus Flex I/O - configurable I/O, could be used as coherency links between multiple Cells
- Very power efficient, runs at 3.2 GHz at 1 V.
- Highly scalable, up to 6 GHz achieved in the lab
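To put the bandwidth figures in perspective, here is a quick back-of-the-envelope calculation. It only uses the numbers quoted in the list above; the variable names are mine, and the bytes-per-cycle figure is just memory bandwidth divided by clock speed, not something from the slides.

```python
# Figures quoted in the talk (GB/s and GHz); variable names are my own.
PEAK_BUS_GBPS = 300
SUSTAINED_BUS_GBPS = 200
MEMORY_GBPS = 26
MEMORY_UTILIZATION = 0.90
CLOCK_GHZ = 3.2

# Fraction of the internal bus peak that is sustained: about two thirds.
bus_efficiency = SUSTAINED_BUS_GBPS / PEAK_BUS_GBPS

# Memory bandwidth actually delivered at 90% utilization: about 23.4 GB/s.
effective_memory_gbps = MEMORY_GBPS * MEMORY_UTILIZATION

# Bytes arriving from memory per clock cycle: a bit over 7.
bytes_per_cycle = effective_memory_gbps / CLOCK_GHZ

if __name__ == "__main__":
    print(f"bus efficiency:        {bus_efficiency:.0%}")
    print(f"effective memory rate: {effective_memory_gbps:.1f} GB/s")
    print(f"bytes per clock cycle: {bytes_per_cycle:.1f}")
```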
By 2008, an enhanced Cell will be released featuring more double-precision floating-point units for each SPE. Notice that by 2010, there will be a Cell with a total of 34 cores!
Advanced Cell blade servers. IBM is actually building a supercomputer that consists of 2000 AMD + Cell processors (I don't remember if it's 2000 each or 2000 total). They are linked together, and the Cell is there to greatly increase floating-point performance.
If you know anything about computer graphics, you know that ray tracing is EXTREMELY processing intensive and hard to accelerate in hardware. So although it gives superior quality, no existing graphics card uses ray tracing; they use "hacks" to approximate lighting instead. So it's insane to see real-time ray tracing. The system uses 8 Cell blades.
There is another slide that I forgot to take a picture of. It's Cell doing terrain rendering: it takes color maps and height data and renders a 3D terrain. A 2.4 GHz Power5 (from the same Power family as the PowerPC 970 in the Apple G5) renders about 2/3 fps, which is about twice as fast as an Intel processor of the time. A Cell renders the scene at 30 fps.
Lastly, the question y'all have been waiting for. You asked, I deliver.
Q: What is more powerful, the 3 general purpose PPC cores in the Xbox 360, or the Cell in the PS3?
A: He was a bit evasive at first. He said the two are different concepts, and that IBM isn't biased toward either and helps all game developers. So I asked, "What about raw performance between the two?" And Dr. Hofstee answered, "raw peak performance wise, Cell is about 2x as much, but it really depends on the developer to deliver."
So there you have it, straight from the horse's mouth.