Personally, I'll remain a bit wary of the concept of GPGPU in an industrial/scientific environment. Sure, both nVidia and AMD *claim* massive theoretical FLOPS figures for their cards, but truth be told, in a real-world computing application they fall far further below their theoretical maximums than more conventional CPUs do; that's because while a GPU was designed for raw theoretical throughput, a CPU has loads more branch predictors, registers, cache, and other things to make sure every clock cycle counts no matter what it's running.
[citation][nom]rhodesar[/nom]This is interesting, however I would like to know more about the performance expectations in relation to non-GPU based predecessors.[/citation]
We actually have a real-world example; earlier this year China took nVidia up on their offer, and built a massive supercomputer primarily out of nVidia Tesla cards.
The machine, the "Nebulae," consists of 9,280 Intel Xeon X5650 CPUs coupled with 4,640 Tesla C2050 cards. The figures for them are:
- Xeon X5650: 2.66 GHz, 6 cores, 64 GFLOPS theoretical total (10.6 GFLOPS per core), 95 watts per CPU.
- Tesla C2050: 1.15 GHz, 448 SPs, 515.2 GFLOPS theoretical total (1.15 GFLOPS per SP), 238 watts per card.
- Total theoretical performance: 593.92 TFLOPS (9,280 x 64) for the CPUs and 2,390.5 TFLOPS (4,640 x 515.2) for the GPUs, ~2.98 PetaFLOPS total (arithmetic sketched below).
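If you want to check that arithmetic, here's a quick Python sketch; the only assumption of mine in it is the 4 double-precision FLOPs per cycle per Westmere core behind the 64 GFLOPS figure:

# Theoretical peak of the Nebulae, from the per-chip figures above
# (2.66 GHz x 6 cores x 4 DP FLOPs/cycle ~= 64 GFLOPS per Xeon X5650).
xeon_gflops  = 64.0
tesla_gflops = 515.2

cpu_total = 9280 * xeon_gflops  / 1000.0    # 593.92 TFLOPS
gpu_total = 4640 * tesla_gflops / 1000.0    # ~2390.5 TFLOPS
print(cpu_total, gpu_total, cpu_total + gpu_total)   # ~2984 TFLOPS, i.e. ~2.98 PFLOPS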
Now, in the real-world benchmarking (using LINPACK, the standard benchmark), the computer only gets 1,271 TFLOPS, or 42.6% of its projected power. Other benchmarks of pure-Xeon supercomputers (with the same Westmere cores) show that they manage to hit 80-90% of their theoretical performance in real-world testing, so roughly 475-535 TFLOPS came from the CPUs, leaving 736-796 TFLOPS for the Tesla cards; that's 30.8-33.3% efficiency.
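Here's how I'm backing those Tesla numbers out, again in Python; the 80-90% CPU efficiency range is an assumption carried over from the pure-Xeon machines, not something measured on Nebulae itself:

# Split the measured LINPACK score between CPUs and GPUs.
linpack  = 1271.0    # measured TFLOPS
cpu_peak = 593.92    # theoretical CPU TFLOPS
gpu_peak = 2390.5    # theoretical GPU TFLOPS

for cpu_eff in (0.80, 0.90):
    cpu_part = cpu_peak * cpu_eff     # ~475-535 TFLOPS from the Xeons
    gpu_part = linpack - cpu_part     # ~736-796 TFLOPS left for the Teslas
    print(cpu_part, gpu_part, gpu_part / gpu_peak)   # GPU efficiency ~0.31-0.33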
Of course, to put this into perspective, we also have to consider performance-per-watt; while the THEORETICAL number for a Tesla may look attractive, offering 2.16 GFLOPS/watt compared to only 0.67 for the Xeon, real-world results come out to roughly 0.72 and 0.6 for the Tesla and Xeon, respectively, a far smaller margin. Similarly, there isn't much savings in hardware cost; the Xeon costs $1,025, or 62.4 MFLOPS/dollar, vs. the Tesla, which costs $2,500, or 68.7 MFLOPS/dollar. Given that using the Tesla involves wrangling with CUDA to write your app, a 20% increase in performance-per-watt and a 10% increase in FLOPS-per-dollar seems a dubious trade-off.
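For anyone who wants to reproduce the per-watt and per-dollar figures, here's a rough Python sketch; the $1,025 and $2,500 prices are the street prices I'm assuming above, and the sustained rates use the efficiencies worked out a moment ago:

# Theoretical vs. sustained GFLOPS/watt, plus MFLOPS/dollar both ways.
def metrics(peak_gflops, efficiency, watts, price):
    sustained = peak_gflops * efficiency
    return (peak_gflops / watts,          # theoretical GFLOPS/watt
            sustained / watts,            # sustained GFLOPS/watt
            peak_gflops * 1000 / price,   # theoretical MFLOPS/dollar
            sustained * 1000 / price)     # sustained MFLOPS/dollar

print("Xeon X5650 ", metrics(64.0,  0.90,  95.0, 1025.0))   # ~0.67, ~0.61, ~62.4, ~56
print("Tesla C2050", metrics(515.2, 0.333, 238.0, 2500.0))  # ~2.16, ~0.72, ~206,  ~68.7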
Where this REALLY starts to fall apart is once you start comparing it to alternative designs. A comparable CPU from AMD, such as a 2.6 GHz 6-core Opteron, can be had for $231, or 216.1 MFLOPS/dollar, though the performance-per-watt ratio sinks to 0.43 GFLOPS/watt (or 0.67 GFLOPS/watt if AMD's "ACP" figure is the more accurate one).
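The Opteron figures can be reproduced the same way; to get there I'm assuming an Istanbul-class part doing 4 DP FLOPs per cycle per core, roughly 80% LINPACK efficiency, a 115 W TDP against a 75 W ACP, and the $231 price:

# 2.6 GHz, 6-core Opteron, back-of-the-envelope.
peak      = 2.6 * 6 * 4      # 62.4 GFLOPS theoretical
sustained = peak * 0.80      # ~49.9 GFLOPS at an assumed 80% efficiency

print(sustained * 1000 / 231)   # ~216 MFLOPS per dollar
print(sustained / 115)          # ~0.43 GFLOPS/watt against the TDP
print(sustained / 75)           # ~0.67 GFLOPS/watt against AMD's "ACP"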
Worse yet is if you compare this to the PowerXCell, which boasts a theoretical 102.4 GFLOPS at 3.2 GHz; at 92 watts per chip and 80% efficiency, that gets you up to 0.89 GFLOPS/watt for the 65nm version, and the 45nm version would likely push that well over 1 GFLOPS/watt. Assuming a CPU price of $1,000, that's 81.9 MFLOPS-per-dollar. Note that this doesn't apply to the PS3, which uses a version of the Cell not designed for double-precision floating point; it only gets a measly 0.10 GFLOPS/watt and 25.5 MFLOPS/dollar even with the cheapest Slim model, so the PS3 doesn't make nearly as good a supercomputer node as it does a console.
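The PowerXCell math, for completeness (the $1,000 price and the 80% efficiency are the assumptions stated above):

# PowerXCell 8i, 65nm version.
sustained = 102.4 * 0.80          # ~81.9 GFLOPS at an assumed 80% efficiency
print(sustained / 92)             # ~0.89 GFLOPS/watt at 92 W per chip
print(sustained * 1000 / 1000)    # ~81.9 MFLOPS/dollar at an assumed $1,000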
Potentially there's some promise in this area, but for now, for true supercomputers, I don't think Tesla and FireStream belong. They could be great in the future, especially if they implemented true native double-precision math, but for now I see it largely as a marketing gimmick.