News Nvidia's B100 and B200 processors could draw an astounding 1000 Watts per GPU — Dell spills the beans in earnings call

Status
Not open for further replies.

Fruban

Distinguished
Apr 19, 2015
11
1
18,515
Yikes. I hope they power all the new AI data centers with solar, wind, and battery... Unsustainable power draw.
 

bit_user

Titan
Ambassador
Simple, oil immersion cooling with a double heat exchanger, with the waste heat being used to perform work, much as some cryptominers and server operators already do.
Immersion cooling is expensive, but probably something we're going to see more of. From Dell's perspective, requiring immersion cooling in their mainstream products would be a fail, as it would greatly restrict their market.

Also, there are limited opportunities to harness waste heat, especially in the places where datacenters exist or where it makes sense to build them. It's great if you can do it, but it certainly won't be the norm.
 

JTWrenn

Distinguished
Aug 5, 2008
331
234
19,170
It's only astounding if you are thinking about this being in a desktop. For a high-end server system, as long as the perf per watt and size per watt are better than the last gen, it's good. The question is: are they doing that, or are they just cranking up the clocks to push off AMD for a while as they prep their next big architecture jump? I hope that is not the case, but sometimes performance is king no matter what... it really just depends on the customer.

Let's hope it is a step in the right efficiency direction and not just a ploy for the fastest card.
 
  • Like
Reactions: PEnns

CmdrShepard

Prominent
BANNED
Dec 18, 2023
531
428
760
Start with TamperMonkey:

let nodes = document.querySelectorAll('h3.article-name');

Then iterate the nodes and hide every one whose textContent contains "AI":

for (const node of nodes) {
    if (node.textContent.includes('AI')) node.style.display = 'none';
}
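For anyone who wants to try it, here is a minimal complete userscript along those lines. It assumes the site's listing pages render headlines as h3.article-name (per the selector above) and that more articles can be injected after load, hence the MutationObserver; the @match pattern and the crude "AI" substring test are only illustrative, and the substring test will also catch unrelated words that happen to contain "AI".

// ==UserScript==
// @name     Hide AI headlines
// @match    https://www.tomshardware.com/*
// @grant    none
// ==/UserScript==
(function () {
    'use strict';
    // Hide every article headline whose text mentions "AI".
    const hideAiHeadlines = () => {
        document.querySelectorAll('h3.article-name').forEach((node) => {
            if (node.textContent.includes('AI')) node.style.display = 'none';
        });
    };
    hideAiHeadlines();
    // Re-run whenever the page injects more articles (infinite scroll, etc.).
    new MutationObserver(hideAiHeadlines).observe(document.body, { childList: true, subtree: true });
})();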
 

bit_user

Titan
Ambassador
It's only astounding if you are thinking about this being in a desktop.
Not really. If you assume the density (i.e. nodes per rack) of the new generation will be at least the same as the current one, then it would likely represent a net increase in power consumption by data centers. That's not exactly surprising, but also not great news.

For a high-end server system, as long as the perf per watt and size per watt are better than the last gen, it's good.
More perf/W is certainly nice, but if the speed of deployment is limited primarily by the supply chain, then power dissipation by AI will increase at a faster rate than it does today (assuming production of these chips eventually reaches a capacity similar to that of today's Hopper models).

However, if Nvidia were somehow stuck on Hopper for a few more years, market dynamics should achieve efficient allocation and incentivize efficient use of those resources.

The question is: are they doing that, or are they just cranking up the clocks to push off AMD for a while as they prep their next big architecture jump?
They're quite simply trying to deliver the maximum perf/$. That means running the highest clocks possible, in order to extract the most performance from a given silicon area. This is the world we live in, especially with AI being so area-intensive and fab costs increasing on an exponential curve.
 
  • Like
Reactions: castl3bravo

e_fok

Distinguished
Nov 9, 2016
4
2
18,515
It's also possible that using dual chiplets allows for double the number of HBM stacks. It would make sense to have a truly massive amount of fast memory; 12 stacks of 36 GB each would be a good place to start.
 

watzupken

Reputable
Mar 16, 2020
1,181
663
6,070
I am expecting power consumption to go up with Blackwell. The jump from 4nm to 3nm is actually not that significant, based on what we see with Apple's SoCs, so squeezing more performance out of Blackwell only means increased power consumption. Ada was an exception, mostly because of the jump from a cheap Samsung 10nm-class node to a cutting-edge TSMC 5nm-class node.
 

bit_user

Titan
Ambassador
Ada was an exception, mostly because of the jump from a cheap Samsung 10nm-class node to a cutting-edge TSMC 5nm-class node.
Only the client-oriented Ampere GPUs used Samsung "8 nm". The A100 actually uses TSMC N7.

And I agree the leap from that Samsung node to the optimized TSMC N5 variant (called "4N") is responsible for a lot of the gains between the RTX 3000 and RTX 4000 generations (at least, if you compare equivalent models and don't just blindly follow their product tier numbering).
 

jp7189

Distinguished
Feb 21, 2012
532
303
19,260
Simple, oil immersion cooling with a double heat exchanger, with the waste heat being used to perform work, much as some cryptominers and server operators already do.

https://www.tomshardware.com/news/japanese-data-center-starts-eel-farming-side-hustle
Not really very simple to set up immersion cooling. The server racks have to be custom-made and waterproof. It would be exceptionally difficult to add this to a standard datacenter.

Also not simple if oil is the fluid... it's way too messy to work with and hazardous due to flammability.
 
  • Like
Reactions: bit_user

bit_user

Titan
Ambassador
Not really very simple to set up immersion cooling. The server racks have to be custom-made and waterproof. It would be exceptionally difficult to add this to a standard datacenter.

Also not simple if oil is the fluid... it's way too messy to work with and hazardous due to flammability.
I think single-phase immersion cooling is much less attractive than phase-change cooling, but the latter is defunct due to health concerns/litigation, at least for the time being.

 
Yikes. I hope they power all the new AI data centers with solar, wind, and battery... Unsustainable power draw.
Have you seen all the mountains of non-recyclable fiberglass wind turbine blades getting buried in landfills? The same goes for solar cells. Today's wind and solar energy is not as clean as many like to think. And batteries are just a way to store energy, not produce it, and if anything, a portion of the energy is wasted when stored in a battery. And again, they tend to only last so long before they need to be replaced.

As for the power draw of these GPUs, they are most likely going to be significantly more efficient than current models. If a single 1000 watt GPU can do the work of two 700 watt GPUs, for example, then that's a big increase in energy efficiency. Fewer GPUs are required to do a given amount of work. When more GPU cores are put on a single card, that tends to improve efficiency, so while the power draw of a single card might go up, the power required to perform a given workload will actually go down.
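A quick back-of-the-envelope sketch of that comparison, treating the "one new GPU equals two old ones" figure purely as an illustrative assumption:

// Rough perf-per-watt comparison; the 2x per-GPU figure is just the rumored number.
const oldGpu = { watts: 700, perf: 1 };    // normalize a current GPU to 1 unit of work
const newGpu = { watts: 1000, perf: 2 };   // assumed: one new GPU ~= two old GPUs

const oldEff = oldGpu.perf / oldGpu.watts; // ~0.00143 work per watt
const newEff = newGpu.perf / newGpu.watts; // 0.002 work per watt

console.log(newEff / oldEff);                   // ~1.4x better perf/W
console.log(newGpu.watts / (2 * oldGpu.watts)); // ~0.71x the power for the same work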
 
  • Like
Reactions: jbo5112

bit_user

Titan
Ambassador
Have you seen all the mountains of non-recyclable fiberglass wind turbine blades getting buried in landfills? The same goes for solar cells. Today's wind and solar energy is not as clean as many like to think. And batteries are just a way to store energy, not produce it, and if anything, a portion of the energy is wasted when stored in a battery. And again, they tend to only last so long before they need to be replaced.
Perfect is the enemy of good. Every type of power plant has an effective lifespan, after which it must be scrapped or overhauled, producing quite a lot of waste in the process. Fossil fuel extraction also produces a continuous waste stream. As long as a new technology is a net win over existing tech, we mustn't let its downsides stand in the way of progress. Over time, better materials, manufacturing, and recycling techniques can smooth the rough edges.

If a single 1000 watt GPU can do the work of two 700 watt GPUs, for example, then that's a big increase in energy efficiency. Fewer GPUs are required to do a given amount of work.
I think we all know the industry is at a point where it's consuming all the AI compute power it can get. There's not currently a fixed amount of compute that's "enough". So, if they produce these 1 kW GPUs at the same rate they're currently producing 0.7 kW GPUs, we can expect total power consumption to increase at a faster rate than before.
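To put a toy number on that, here is the same point as arithmetic; the shipment rate below is a made-up placeholder, and only the 0.7 kW and 1 kW ratings come from the discussion above:

// Toy model: shipments are supply-limited, so the same number of GPUs goes out either way.
const gpusShippedPerQuarter = 500_000;  // illustrative placeholder, not a real figure

const hopperWatts = 700;
const blackwellWatts = 1000;

// Power added to datacenters per quarter, in megawatts:
const hopperMW    = gpusShippedPerQuarter * hopperWatts    / 1e6; // 350 MW
const blackwellMW = gpusShippedPerQuarter * blackwellWatts / 1e6; // 500 MW

// Total draw grows ~43% faster, even though perf/W per GPU improved.
console.log(blackwellMW / hopperMW); // ≈ 1.43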
 

JTWrenn

Distinguished
Aug 5, 2008
331
234
19,170
They're quite simply trying to deliver the maximum perf/$. That means running the highest clocks possible, in order to extract the most performance from a given silicon area. This is the world we live in, especially with AI being so area-intensive and fab costs increasing on an exponential curve.
Perf per dollar at this level depends on running costs as well. If it costs twice as much to run and you get 50% more perf, it might not be a step in the right direction. Perf per $ is really perf / (upfront cost + cost per token fulfilled). The running-cost side could shift it out of contention, and is why both really matter. The supply constraints definitely play into it, but I think that back-end issue is going to keep coming up.
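As a rough sketch of that math (every dollar figure, token rate, and lifetime below is a made-up placeholder, not real pricing):

// Toy total-cost-of-ownership view of perf per dollar; all numbers are placeholders.
function perfPerDollar({ tokensPerSecond, upfrontCost, watts, dollarsPerKWh, lifetimeHours }) {
    const energyCost = (watts / 1000) * lifetimeHours * dollarsPerKWh;  // running cost
    const lifetimeTokens = tokensPerSecond * 3600 * lifetimeHours;      // total output
    return lifetimeTokens / (upfrontCost + energyCost);                 // perf per total $
}

const oldGen = perfPerDollar({ tokensPerSecond: 1000, upfrontCost: 30000, watts: 700,  dollarsPerKWh: 0.10, lifetimeHours: 35000 });
const newGen = perfPerDollar({ tokensPerSecond: 2000, upfrontCost: 40000, watts: 1000, dollarsPerKWh: 0.10, lifetimeHours: 35000 });

console.log(newGen / oldGen); // > 1 means the new part wins on tokens per total dollar

With placeholder numbers like these the energy term is small next to the upfront cost; different assumptions about power prices and hardware lifetime can shift that balance.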

Market forces aside, though, moving to much more efficient AI is necessary in a world that keeps getting hotter and keeps arguing about energy production. If AI ends up looking like Bitcoin, it's going to get a lot of pushback. They are massively different, but the "AI uses the power of Venezuela" comparisons are going to start, I am sure.
 
  • Like
Reactions: jbo5112

bit_user

Titan
Ambassador
Perf per dollar at this level depends on running costs as well. If it costs twice as much to run and you get 50% more perf, it might not be a step in the right direction.
The point about operating costs probably makes more sense if we're talking about server CPUs. GPU hardware is priced so astronomically and becomes obsolete so quickly that I think operating costs are a lesser concern.

Furthermore, I think nobody is forcing the operators to run the hardware at the maximum-rated speed. If they find it more profitable to run it at a reduced clock speed, I presume they have that option.

Market forces aside, though, moving to much more efficient AI is necessary in a world that keeps getting hotter and keeps arguing about energy production.
Efficiency isn't an end in itself. Increasing efficiency just decreases costs, which I expect to be answered by increasing demand.

If AI ends up looking like Bitcoin, it's going to get a lot of pushback. They are massively different, but the "AI uses the power of Venezuela" comparisons are going to start, I am sure.
Perhaps AI will ultimately become power-limited, especially if production capacity ramps up anything like how Sam Altman wants. In that case, we could definitely see an adjustment towards greater power-efficiency, but that won't mean total power consumption will actually be any less. Rather, it would just be a tactic to extract the maximum amount of performance from a datacenter's fixed power budget.
 
I think we all know the industry is at a point where it's consuming all the AI compute power it can get. There's not currently a fixed amount of compute that's "enough". So, if they produce these 1 kW GPUs at the same rate they're currently producing 0.7 kW GPUs, we can expect total power consumption to increase at a faster rate than before.
More compute power will be dedicated to AI whether it's made more efficient or not. These new GPUs will likely be notably more efficient than the old ones, though, if the rumors that they offer around double the AI compute performance are true, allowing that additional processing power to be added without increasing energy usage as much as it otherwise would have. So if anything, these 1000 watt GPUs should be a win for efficiency compared to their 700 watt predecessors. Comparing the wattage of individual GPUs is a bit pointless when they will be used in server farms, where the newer cards will allow a given level of compute performance to be reached with fewer cards.

Perfect is the enemy of good. Every type of power plant has an effective lifespan, after which it must be scrapped or overhauled, producing quite a lot of waste in the process. Fossil fuel extraction also produces a continuous waste stream. As long as a new technology is a net win over existing tech, we mustn't let its downsides stand in the way of progress. Over time, better materials, manufacturing, and recycling techniques can smooth the rough edges.
Yeah, but some people like to assume that wind and solar are completely clean, and don't seem to realize that the hardware still has a limited lifespan and needs to be replaced eventually, and that manufacturing this equipment has an environmental impact of its own. Rainforests have been logged and replaced with balsa plantations to support the high demand for balsa wood in turbine blades, and a significant portion of the world's production of fiberglass and carbon fiber goes toward them, all combined into composite materials that have generally been considered impractical to recycle. The blades are then often replaced after not much more than a decade of use, at which point they typically just get buried in landfills. There have been improvements on that front, and overall wind and solar will likely be better for the environment than fossil fuel plants. But the current supply of wind and solar energy is still very low compared to other sources, and the immediate costs much higher, so it's probably a bit impractical to expect server farms to operate on those energy sources at this point.
 