Respectfully disagree here. Most AI workloads are highly parallel vector workloads that benefit from a very large number of execution units, which usually means you can aim for the sweet spot of the power curve and spend the extra power budget on adding compute units to the chip.
Okay, so let's look at the Nvidia H100 as an example (source: https://www.anandtech.com/show/1878...-memory-server-card-for-large-language-models ):
- The PCIe version is limited to 350 W and achieves 756 fp16 tensor TFLOPS (2.16 TFLOPS/W).
- The SXM version is limited to 700 W and achieves 990 fp16 tensor TFLOPS (1.41 TFLOPS/W).
Clearly, the SXM version is pushed well beyond the efficiency sweet spot.
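For anyone who wants to sanity-check the perf/W figures, here's the arithmetic in Python, just plugging in the spec-sheet numbers quoted above (nothing here beyond those two data points):

```python
# fp16 tensor TFLOPS and power limits from the H100 specs quoted above
h100_pcie = {"power_w": 350, "fp16_tflops": 756}
h100_sxm  = {"power_w": 700, "fp16_tflops": 990}

for name, gpu in (("PCIe", h100_pcie), ("SXM", h100_sxm)):
    # efficiency = throughput divided by board power limit
    print(f"H100 {name}: {gpu['fp16_tflops'] / gpu['power_w']:.2f} TFLOPS/W")

# H100 PCIe: 2.16 TFLOPS/W
# H100 SXM:  1.41 TFLOPS/W
```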
The cost of the silicon is not much of an issue; most data center costs are dominated by power/cooling over the lifetime of the product.
According to this, the average datacenter PUE ratio is 1.58.
According to this, datacenter electricity costs range from $0.047 to $0.15 per kWh.
Let's take the upper end of that range: at $0.15/kWh, one Watt running for a year (8.76 kWh) costs about $1.31. So, even if the GPU is running at its full 700 W, 24/7, then once you include the 1.58 PUE overhead it's only costing about $1454 per year to power. Now, you've also got the overhead of the server and networking gear, but those stay relatively fixed irrespective of how fast the GPU is running, so factoring them in wouldn't strengthen the case for slowing the GPU.
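Same calc in Python, for anyone who wants to check it. The only inputs are the figures already cited above (the $0.15/kWh upper bound, the 1.58 average PUE, and the 700 W SXM limit):

```python
price_per_kwh  = 0.15          # upper end of the quoted electricity price range ($/kWh)
pue            = 1.58          # average datacenter PUE cited above
gpu_power_w    = 700           # H100 SXM power limit

hours_per_year   = 24 * 365.25                          # ~8766 hours
cost_per_watt_yr = price_per_kwh * hours_per_year / 1000  # $ per Watt-year
yearly_power_cost = gpu_power_w * pue * cost_per_watt_yr  # 100% duty cycle, incl. cooling

print(f"${cost_per_watt_yr:.2f} per Watt-year")   # $1.31 per Watt-year
print(f"${yearly_power_cost:.0f} per year")       # $1454 per year
```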
So, let's compare that to the cost of the hardware. A DGX H100, containing 8x H100 SXM cards, had a launch price of $482k, which works out to about $60k per GPU. The current street price of an H100 (PCIe) seems to be about $30k. The useful service life of this hardware is only about 3-4 years before it becomes obsolete.
So, even if we assume a 4-year service life and the lowest price of $30k per H100, we're still talking about hardware that costs at least 5.16x as much to purchase as its energy costs over that entire lifetime (even at a 100% duty cycle and including the cooling overhead)!
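And the ratio itself, continuing from the yearly power cost worked out above:

```python
hw_price          = 30_000   # lowest quoted street price per H100 ($)
yearly_power_cost = 1_454    # from the energy estimate above ($/yr)
service_life_yrs  = 4

lifetime_energy = yearly_power_cost * service_life_yrs   # $5,816 over 4 years
print(f"hardware / lifetime energy = {hw_price / lifetime_energy:.2f}x")

# hardware / lifetime energy = 5.16x
```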
Amiright, @helper800?
Also agreed. GB6 is a completely irrelevant benchmark. This is an AI accelerator. Using a CPU benchmark is ignoring three quarters of the transistors on the silicon.
I'm pretty certain the CCDs in it are the same ones used in Genoa or Genoa-X EPYC CPUs. So, it would be weird if there were a real discrepancy between models with equal core counts, after accounting for their relative clock speeds.
One wild card is the HBM3. I don't know how its best-case latency compares with DDR5. However, its latency under load should certainly be better, due to the higher bandwidth. And latency under load should be what the MT benchmark is seeing, especially if I'm right about it being memory-heavy.