News: Nvidia announces Blackwell Ultra B300 — 1.5X faster than B200 with 288GB HBM3e and 15 PFLOPS dense FP4

Gotta have those ultra halo products to firmly seat oneself as "#1".

I just have to say no. It's more marketing, more "WOW!" numbers, but it's really a distortion of the compute market: synthesizing Moore's Law into a reality that only works if you ignore cost and efficiency scaling. As far as "AI" has come, I'm just not impressed when a quarter of the earth's population still doesn't have access to fresh drinking water, and I don't see that needle moving rapidly.

The technologists and industrialists have hoarded wealth and IP to unimaginable levels. It's kind of a strange post for me, but I stand my ground: the more successful you are, individually and as an organization, the more you should contribute to society. This just screams of excess to me... and it's only getting worse. Look at the gaming dGPU situation today, with both AMD and Nvidia claiming they have decent supply and are selling in record numbers.

If you don't have deep pockets, you're out. You - are - out. That's our "bright" future.
 
Reactions: George³
The article said:
By leveraging FP4 instructions, using B300 alongside its new Dynamo software library to help with serving reasoning models like DeepSeek, Nvidia says an NV72L rack can deliver 30X more inference performance than a similar Hopper configuration. That figure naturally derives from improvements to multiple areas of the product stack, so the faster NVLink, increased memory, added compute, and FP4 all factor into the equation.
It's not multiplicative. The only way you get to multiply improvements together is if you manage to improve the overlap between compute and communication, so that computation isn't blocked on it as much. That multiple could then be applied to the gains from going to FP4, and the product would be the final speedup. Algorithmic improvements should also factor in, and I suspect a good chunk of the claimed improvement comes from those. And the 30X number is almost certainly an outlier rather than typical; Blackwell just isn't that much faster than Hopper.
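
The point above can be sketched with a toy timing model. All numbers below are made up for illustration (nothing here is a measured Hopper or Blackwell figure): when compute and communication serialize, speeding each up 2X yields 2X overall, not 4X, and a compute-side gain like FP4 only stacks with overlap until communication becomes the bottleneck:

```python
# Toy timing model for one inference step. All numbers are made up
# for illustration -- nothing here is a measured Hopper/Blackwell figure.

def step_time(compute_ms, comm_ms, overlap=False):
    """With no overlap, compute and communication serialize;
    with perfect overlap, the slower of the two hides the other."""
    return max(compute_ms, comm_ms) if overlap else compute_ms + comm_ms

base = step_time(10.0, 10.0)                      # 20 ms, fully serialized

# "2x faster compute" AND "2x faster interconnect", still serialized:
faster = step_time(5.0, 5.0)
print(base / faster)                              # 2.0 -- not 4x

# Overlap alone, on the same hardware, hides communication behind compute:
overlapped = step_time(10.0, 10.0, overlap=True)
print(base / overlapped)                          # 2.0

# FP4 halving compute only stacks with overlap until communication
# becomes the bottleneck:
fp4 = step_time(5.0, 10.0, overlap=True)
print(base / fp4)                                 # still 2.0 -- comm-bound
```

The multiplicative case only appears when both legs shrink together and the overlap is maintained, which is why a single headline multiplier hides a lot of assumptions.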

Improvements in things like HBM bandwidth should only be enough to (hopefully) keep pace with the compute improvements. Same with NVLink. I doubt it will be any less bottlenecked on HBM or NVLink than Hopper was.

It could also be something like the increased HBM capacity enabling local storage of weights that previously had to be fetched from an attached GPU or Grace node.
 
Reactions: JarredWaltonGPU
It's not multiplicative. The only way you get to multiply improvements together is if you manage to improve the overlap between compute and communication, so that computation isn't blocked on it as much. That multiple could then be applied to the gains from going to FP4, and the product would be the final speedup. Algorithmic improvements should also factor in, and I suspect a good chunk of the claimed improvement comes from those. And the 30X number is almost certainly an outlier rather than typical; Blackwell just isn't that much faster than Hopper.

Improvements in things like HBM bandwidth should only be enough to (hopefully) keep pace with the compute improvements. Same with NVLink. I doubt it will be any less bottlenecked on HBM or NVLink than Hopper was.

It could also be something like the increased HBM capacity enabling local storage of weights that previously had to be fetched from an attached GPU or Grace node.
Yes, which is precisely what I'm saying. We haven't tested this. Nvidia isn't giving exact details. It's just saying it's 30X faster. Now, I personally think "increased memory" is a big part of the increased performance. It could also be the faster NVLink, but I don't know that it's as pertinent. And there's MIG and other stuff in play where perhaps the software is better able to distribute work across the GPUs. Someone else will have to test that; this is just what Nvidia has claimed.
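
A back-of-envelope sketch of why the extra memory could matter. The bandwidths and layer size below are assumed round numbers, not official Nvidia specifications: streaming a layer's weights from local HBM versus pulling them over NVLink from a neighboring GPU differs by roughly the ratio of the two bandwidths:

```python
# Back-of-envelope: local HBM vs. fetching weights from a neighbor GPU.
# Bandwidths and layer size are assumed round numbers for illustration,
# not official Nvidia specifications.

HBM_BW_TBS = 8.0      # assumed local HBM3e read bandwidth, TB/s
NVLINK_BW_TBS = 1.8   # assumed per-GPU NVLink bandwidth, TB/s

def read_ms(size_gb, bw_tbs):
    """Milliseconds to stream `size_gb` of weights at `bw_tbs` TB/s."""
    return size_gb / (bw_tbs * 1000.0) * 1000.0

LAYER_GB = 4.0        # assumed weights for one large transformer layer

local = read_ms(LAYER_GB, HBM_BW_TBS)       # ~0.5 ms
remote = read_ms(LAYER_GB, NVLINK_BW_TBS)   # ~2.2 ms

print(f"local HBM:   {local:.2f} ms")
print(f"over NVLink: {remote:.2f} ms ({remote / local:.1f}x slower)")
```

If the larger capacity lets every layer's weights live locally, that per-layer gap disappears entirely, which is consistent with capacity being a big contributor to the claimed speedup.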
 
Reactions: bit_user
30X only if you can squint hard enough to make an orange look like an apple... in the same vein as the 5070 being "faster" than a 4090: different compute process, different precision, etc. That's the reason Hopper shows up in the comparison; the H100 doesn't support the same data types as B200 and B300.

As much as I hate that from a technical standpoint, it's perfect for marketing, because their target market for upgrades is H100 owners, not B200.
 
Reactions: bit_user