"One petaflops is equal to 1,000 teraflops, or 1,000,000,000,000,000 FLOPS."
"FLOPS can be recorded in different measures ...
Is there a point in your "story" or is this simply complete nonsense?
I explicitly referred to FP16 performance (half precision) because of the claim of "a massive processor featuring 47 components, over 100 billion transistors, and offering PetaFLOPS-class AI performance".
Whether this claim is correct or not is a completely different story, but it's not unlikely, because nVidia's A100 already offers 0.31 PFlops (~1/3 PFlops) with a monolithic 826 mm² chip via its Tensor Cores v3 and MMA operations.
And in my initial post I measured/compared all the example designs correctly according to their best FP16 performance. So, what didn't you understand?
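For reference, a quick sanity check of that ~0.31 PFlops A100 figure, based on nVidia's published specs (108 SMs, 4 third-gen tensor cores per SM, ~1410 MHz boost clock; these inputs come from the spec sheet, not from my original post):

```python
# Rough sanity check of the A100's dense FP16 tensor throughput
# (publicly listed specs; exact figures may vary by SKU).
sms            = 108        # streaming multiprocessors on the A100
tc_per_sm      = 4          # 3rd-gen tensor cores per SM
fma_per_tc     = 256        # FP16 FMA operations per tensor core per clock
flops_per_fma  = 2          # one FMA = multiply + add
boost_clock_hz = 1.41e9     # ~1410 MHz boost clock

peak_fp16_tensor = sms * tc_per_sm * fma_per_tc * flops_per_fma * boost_clock_hz
print(f"{peak_fp16_tensor / 1e15:.3f} PFlops")   # ~0.312 PFlops
```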
"
Do you still think that Intel has a so-called one petaFLOP GPU ..."
Of course I do, because you simply haven't understood what I was writing about. There's nothing wrong with the numbers, and it is also very likely that such a big design even exceeds 1.0 PFlops FP16 performance with MMA operations. (Btw., it's funny how much you emphasize the "s" in "Petaflops in Your Palm"; besides, that claim would already be fulfilled if it were only 1.1 PFlops. 😉)
And no, nobody is lying to me (or anybody else). The problem here is that you mix up different workloads, types of calculations and units.
And maybe you missed this fact during your misguided quotation orgy: the Top500 system Summit uses Volta cards (GV100), which only provide Tensor Cores v1 but still deliver 0.13 PFlops FP16 performance for AI workloads. The system has over 27,600 GPUs in total and therefore a combined/theoretical (GPU-only) performance of ~3,450 PetaFlops, or ~3.5 ExaFlops, already!
With its current Rmax value of ~149 PFlops, the system is still no. 2 in the Top500 list, but again, that is FP64 performance, not the FP16 performance that I (and obviously also R. Koduri) referred to.
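For the record, the back-of-envelope math behind that GPU-only figure (using the V100's ~0.125 PFlops dense tensor FP16 peak, i.e. the unrounded value behind the 0.13 above, and Summit's roughly 27,600 GPUs):

```python
# Back-of-envelope GPU-only FP16 (tensor) peak for Summit.
gpus         = 27_600      # roughly 4,608 nodes x 6 V100 GPUs
fp16_per_gpu = 0.125       # PFlops dense tensor FP16 per V100 (GV100)

summit_fp16 = gpus * fp16_per_gpu
print(f"{summit_fp16:.0f} PFlops ~ {summit_fp16 / 1000:.1f} ExaFlops")  # ~3,450 PFlops ~ 3.5 EFlops
```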
And likewise, nVidia's own DGX supercomputer Selene (currently no. 5 in the Top500) uses the latest A100 cards. It has an Rmax of ~63 PFlops, but: "By that metric, using the A100's 3rd generation tensor core, Selene delivers over 2,795 petaflops, or nearly 2.8 exaflops, of peak AI performance." **)
To do the math (with a lot of uncertainty, because many assumptions have to be made; see the sketch after the results below):
- approx. 9000+ nodes for the Aurora
- 1 node with 6x Xe-HPC (and 2x Xeon)
- assuming only 1 PFlops FP16 per Ponte Vecchio
- also assuming that the big 600 W version is used ***)
- therefore about 4+ kW per node
Resulting in:
- 36+ MW
- 54,000 PFlops or about 54 ExaFlops FP16/AI performance via GPGPUs
- about 1.5 ExaFlops/MW (Selene achieves about 1.08 ExaFlops/MW, the much older Summit achieves only about 0.34 ExaFlops/MW)
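Spelled out as a quick script (every input is one of the assumptions from the list above, plus a rough per-node overhead for the Xeons and the rest of the node, which is my own guess):

```python
# Back-of-envelope Aurora estimate; every input below is an assumption,
# not an official spec.
nodes           = 9_000     # assumed node count
gpus_per_node   = 6         # 6x Xe-HPC (Ponte Vecchio) per node
fp16_per_gpu    = 1.0       # PFlops FP16 assumed per Ponte Vecchio package
watts_per_gpu   = 600       # assumed "big" 600 W version
node_overhead_w = 400       # rough guess for 2x Xeon, memory, etc.

total_gpus = nodes * gpus_per_node                    # 54,000 packages
fp16_total = total_gpus * fp16_per_gpu                # in PFlops
power_mw   = nodes * (gpus_per_node * watts_per_gpu + node_overhead_w) / 1e6

print(f"{fp16_total:,.0f} PFlops ~ {fp16_total / 1000:.0f} ExaFlops FP16")
print(f"{power_mw:.1f} MW -> {fp16_total / 1000 / power_mw:.2f} ExaFlops/MW")
```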
Estimates for Rmax and the Top500 ranking are more problematic, because Intel hasn't disclosed any FP64 performance numbers for Xe-HPC and the size of a single HPC compute tile is unknown; therefore it doesn't make sense to try to extrapolate from Xe-HP with its roughly 10 TFlops FP32 per compute tile. (Additionally, the composition of function units in HPC and HP will most likely differ considerably.)
But we can reverse the process and assume that the system reaches exactly 1 ExaFlops FP64 performance (and no more).
If we ignore the Xeons, we have about 54,000 Xe-HPC packages/sockets in use, and to reach this goal a single "chip" only has to achieve about 18.5 TFlops FP64, which is already in the range of today's hardware *) and therefore nothing special. In fact, it is more likely that such a massive "chip" will achieve even more, and the whole system will most likely exceed 1.0 ExaFlops FP64 (Rmax) performance.
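And the reverse calculation in the same style (again just assuming 54,000 packages and exactly 1 ExaFlops):

```python
# Reverse calculation: required FP64 per package to hit 1 ExaFlops
# with ~54,000 Xe-HPC packages (both values are assumptions).
target_fp64_tflops = 1_000_000     # 1 ExaFlops expressed in TFlops
packages           = 54_000

required_per_package = target_fp64_tflops / packages
print(f"{required_per_package:.1f} TFlops FP64 per package")   # ~18.5 TFlops
```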
*) Note: The Instinct MI100 has 11.5 TFlops FP64 peak performance. An A100 has 9.7 TFlops, but for the Ampere design this is only half the truth, because the chip has additional FP64 functionality inside the Tensor Cores v3. (With FP64 MMA operations, the A100 can theoretically reach up to 19.5 TFlops.)
**) I assume nVidia calculates the performance in this quote with the sparsity feature in mind. Without it (i.e. basic FP16/bfloat16 performance via Tensor Cores v3), the system should achieve up to 1.4 ExaFlops. Still quite impressive for this relatively small system. (The system only uses about 4,480 A100s.)
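For transparency, the math behind this footnote (assuming ~4,480 A100s, 0.312 PFlops dense tensor FP16 each, and nVidia's 2x structured-sparsity factor):

```python
# Quick check of the Selene footnote: dense vs. sparse tensor FP16 peak
# (GPU count and per-GPU figure as assumed above).
a100s           = 4_480
dense_fp16      = 0.312     # PFlops per A100, dense tensor FP16
sparsity_factor = 2         # nVidia's structured-sparsity speedup

dense_total  = a100s * dense_fp16              # ~1,400 PFlops ~ 1.4 EFlops
sparse_total = dense_total * sparsity_factor   # ~2,800 PFlops ~ 2.8 EFlops
print(f"dense:  {dense_total / 1000:.1f} ExaFlops")
print(f"sparse: {sparse_total / 1000:.1f} ExaFlops")
```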
***) Intel has already stressed the fact that they are quite flexible regarding Ponte Vecchio-like designs because of Foveros/EMIB; therefore they can custom-tailor different designs for different customers and use cases. For example, it may be possible to also provide an AI-only or FP64-only design with much more performance instead of a general purpose/all-in-one design.