Datacenter GPU service life can be surprisingly short: only one to three years expected, according to an unnamed Google architect

The article said:
There is a way to prolong the lifespan of a GPU, according to the speaker: reduce their utilization rate.
Precisely what is this "utilization rate"? Is it referring to the duty cycle, or perhaps the product of the duty cycle and the clock speed? I assume it's the latter. If so, just telling someone "use it less" isn't a good answer; a modest clock speed reduction might be.

IMO, what would be useful to see is a curve of clock speed vs. failure rate for the core, memory, and NVLink clocks. My guess is that Nvidia is driving these clocks as high as it thinks they can realistically go, but I'll bet that's well past the point where longevity suffers.

I also can't help but wonder about temperatures and cooling.

BTW, this is potentially of some interest to us mere mortals, as it turns out these "GPUs" can be purchased for very little money once they're a few generations old. Check out prices on eBay for Nvidia P100s. If you just want something with a lot of fp64 horsepower and HBM, you can practically get them for a song.
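A minimal sketch of what that would look like in practice, assuming the pynvml/NVML bindings (nothing here is from the article, and the 10% trim is purely illustrative). NVML's "utilization" metric is essentially a duty cycle: the fraction of the sample window during which at least one kernel was running. Capping the SM clock ceiling is the "modest clock speed reduction" alternative, and it needs admin rights on a Volta-or-newer part:

```python
# Sketch only: what NVML-level "utilization" reports, and how a modest clock
# cap could be applied instead of "use it less".
# Assumes the pynvml bindings (pip install nvidia-ml-py); setting locked
# clocks requires admin/root and a Volta-or-newer datacenter GPU.
import pynvml

pynvml.nvmlInit()
try:
    gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

    # NVML "utilization" is closer to a duty cycle: the percentage of the
    # sample period in which at least one kernel (or memory transfer) was active.
    util = pynvml.nvmlDeviceGetUtilizationRates(gpu)
    print(f"GPU busy: {util.gpu}%  memory busy: {util.memory}%")

    # The "modest clock speed reduction" idea: clamp the SM clock ceiling to
    # ~90% of the rated maximum instead of idling the card.
    max_sm = pynvml.nvmlDeviceGetMaxClockInfo(gpu, pynvml.NVML_CLOCK_SM)
    cap = int(max_sm * 0.9)            # illustrative 10% trim, not a tested value
    pynvml.nvmlDeviceSetGpuLockedClocks(gpu, 0, cap)
    print(f"SM clocks now limited to {cap} MHz (rated max {max_sm} MHz)")

    # pynvml.nvmlDeviceResetGpuLockedClocks(gpu)  # undo the cap
finally:
    pynvml.nvmlShutdown()
```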
 
IMO, what would be useful to see is a curve of clock speed vs. failure rate for the core, memory, and NVLink clocks. My guess is that Nvidia is driving these clocks as high as it thinks they can realistically go, but I'll bet that's well past the point where longevity suffers.
Clocks have zero to do with longevity. Voltage and heat kill chips. You could argue that more voltage is required to hit the signal high threshold faster in order to hit a higher clock rate, but that's looking at it backwards.
 
BTW, this is potentially of some interest to us mere mortals, as it turns out these "GPUs" can be purchased for very little money once they're a few generations old. Check out prices on eBay for Nvidia P100s. If you just want something with a lot of fp64 horsepower and HBM, you can practically get them for a song.
Is it any different from the Ethereum mining GPUs? Buying them used was always a gamble.
 
Back when Intel was stuck at 14nm+++, part of the issue was its insistence on cobalt interconnects, because the pure copper planned for smaller features wasn't hitting their longevity targets, whereas TSMC had no such qualms. I have always wondered what that would mean in the real world. Has the usable life of chips been getting shorter with each subsequent shrink?
 
Clocks have zero to do with longevity. Voltage and heat kill chips. You could argue that more voltage is required to hit the signal high threshold faster in order to hit a higher clock rate, but that's looking at it backwards.
I'd argue it's not looking at it backwards, because the voltage/frequency curve has already been established by Nvidia for operating these products in a safe and error-free manner. You can scale back the clockspeeds without invalidating that, but as soon as you start monkeying with things like undervolting, you're now operating it outside those safety margins. Coloring outside the lines might be okay for gamers, but not datacenter operators.

So, with clockspeed being effectively the only variable they can directly control, that's exactly the terms in which they'd need to look at it.
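As a rough illustration of staying on the factory V/F curve, a sketch assuming pynvml: the driver only exposes clock steps Nvidia has already validated, so capping to a lower supported step doesn't involve undervolting or any out-of-spec operation.

```python
# Sketch, assuming pynvml: enumerate the (memory clock, SM clock) steps the
# driver exposes. These are the points Nvidia has already validated on its
# voltage/frequency curve, so picking a lower SM step as a ceiling (e.g. via
# nvmlDeviceSetGpuLockedClocks) stays inside the factory margins.
import pynvml

pynvml.nvmlInit()
try:
    gpu = pynvml.nvmlDeviceGetHandleByIndex(0)
    for mem_mhz in pynvml.nvmlDeviceGetSupportedMemoryClocks(gpu):
        sm_steps = pynvml.nvmlDeviceGetSupportedGraphicsClocks(gpu, mem_mhz)
        print(f"mem {mem_mhz} MHz -> SM steps {max(sm_steps)}..{min(sm_steps)} MHz")
finally:
    pynvml.nvmlShutdown()
```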
 
Back when Intel was stuck at 14nm+++, part of the issue was its insistence on cobalt interconnects, because the pure copper planned for smaller features wasn't hitting their longevity targets, whereas TSMC had no such qualms. I have always wondered what that would mean in the real world. Has the usable life of chips been getting shorter with each subsequent shrink?

The voltage has been slowly but surely dropping as process nodes shrink, which should counteract some of the potential degradation issues, though we can see with Intel what happens when you go out of bounds. Tighter and tighter restrictions on overclocking and boost are about the only way this is going to go, much like Nvidia already clamping down hard on power and voltage limits.
 
Hmm, cost of doing business. Lots of components in server ops have lifetimes much shorter than what we DIYers are used to. I remember replacing 60 and 80 mm 5k+ RPM fans in servers. Don't put a finger near those when testing. Also, don't forget hearing protection.
 
I'd argue it's not looking at it backwards, because the voltage/frequency curve has already been established by Nvidia for operating these products in a safe and error-free manner. You can scale back the clockspeeds without invalidating that, but as soon as you start monkeying with things like undervolting, you're now operating it outside those safety margins. Coloring outside the lines might be okay for gamers, but not datacenter operators.

So, with clockspeed being effectively the only variable they can directly control, that's exactly the terms in which they'd need to look at it.
No, clockspeeds aren't the primary knob for DC GPUs. Power and therefore thermals are the primary knob which has an effect on the voltage and dictates clockspeeds. No one is setting clock speeds (in a datacenter) and hoping the other parameters fall in line. They set the other parameters and accept the clockspeed as a result.
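A sketch of that power-first approach, assuming pynvml (the 75% figure is illustrative, and changing the limit requires admin rights): you set a board power target inside the vendor's allowed window and let the driver derive voltage and clocks from it.

```python
# Sketch, assuming pynvml: set a board power limit within the vendor's
# allowed constraints and let the driver pick voltage and clocks from there.
# Requires admin/root; the 0.75 factor is illustrative only.
import pynvml

pynvml.nvmlInit()
try:
    gpu = pynvml.nvmlDeviceGetHandleByIndex(0)
    min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(gpu)
    default_mw = pynvml.nvmlDeviceGetPowerManagementDefaultLimit(gpu)

    target_mw = max(min_mw, int(default_mw * 0.75))   # e.g. run at 75% of stock
    pynvml.nvmlDeviceSetPowerManagementLimit(gpu, target_mw)
    print(f"power limit set to {target_mw / 1000:.0f} W "
          f"(allowed {min_mw / 1000:.0f}-{max_mw / 1000:.0f} W); "
          "clocks and voltage now follow from this cap")
finally:
    pynvml.nvmlShutdown()
```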
 
No, clockspeeds aren't the primary knob for DC GPUs. Power and therefore thermals are the primary knob which has an effect on the voltage and dictates clockspeeds.
Now you're the one who has it backwards, because power is even more loosely coupled to either clockspeed or voltage. If you set a power limit of half, but the workload has low shader utilization, the clocks (and therefore voltage) might still get ramped up, because the power management controller sees that it has enough headroom to clock up the non-idle units without going over the power limit. That will still cause accelerated wear on those blocks.

If the issue you want to control is wearout, then the best solution definitely involves lowering the peak frequencies. I'll bet it would only take a modest clipping of peak clocks. To achieve the same effect by limiting power, you might have to reduce it much more. You might also choose to limit power, based on how much thermals are thought to be a factor.

No one is setting clock speeds (in a datacenter) and hoping the other parameters fall in line.
What part of anything I said involves "hope"? I said measure how they correlate, in order to set limits that favor a longer service life.

They set the other parameters and accept the clockspeed as a result.
I'm not talking about current practice, which I'd expect is mostly centered around balancing against short-term operational costs (e.g. energy costs and cooling capacity). I'm talking about what you would do, if you really wanted to increase the service life of these components.

Hey, it's a heck of a lot better than reducing the duty cycle! That essentially means letting your hardware idle, where it's wasting space and still drawing some power!
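For what it's worth, the correlation could be checked directly on real hardware. A minimal polling sketch, assuming pynvml, that logs SM clock, board power, utilization, and temperature side by side, so you can see whether clocks still ramp to the top of the range under a power cap when shader utilization is low:

```python
# Sketch, assuming pynvml: sample SM clock, board power, utilization and
# temperature together to see how they actually correlate on a given workload.
import time
import pynvml

pynvml.nvmlInit()
try:
    gpu = pynvml.nvmlDeviceGetHandleByIndex(0)
    for _ in range(30):                      # ~30 s of one-second samples
        sm_mhz = pynvml.nvmlDeviceGetClockInfo(gpu, pynvml.NVML_CLOCK_SM)
        watts = pynvml.nvmlDeviceGetPowerUsage(gpu) / 1000
        util = pynvml.nvmlDeviceGetUtilizationRates(gpu).gpu
        temp = pynvml.nvmlDeviceGetTemperature(gpu, pynvml.NVML_TEMPERATURE_GPU)
        print(f"SM {sm_mhz:4d} MHz  {watts:6.1f} W  util {util:3d}%  {temp:3d} C")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```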
 
Now you're the one who has it backwards, because power is even more loosely coupled to either clockspeed or voltage. If you set a power limit of half, but the workload has low shader utilization, the clocks (and therefore voltage) might still get ramped up, because the power management controller sees that it has enough headroom to clock up the non-idle units without going over the power limit. That will still cause accelerated wear on those blocks.
Are you intentionally misunderstanding me just to argue? Clockspeed is the rate at which the line is read. Sampling that data faster or slower has nothing at all to do with longevity. You argue that setting the clock rate changes the voltage because you move down the V/F curve, but that is backwards. DC integrators have power and thermal targets for the overall systems and specific cards, which affect voltage and that in turn sets clockspeed. Then you argue that you aren't talking about what they do but rather what they should do (change clockspeed), but I'll go back to my very first point and say there is no doubt that voltage and power are the primary tuning points for longevity. How fast the line is read (clockspeed) can't possibly have any effect at all on chip degradation.
 
Precisely what is this "utilization rate"? Is it referring to the duty cycle, or perhaps the product of the duty cycle and the clock speed? I assume it's the latter. If so, just telling someone "use it less" isn't a good answer; a modest clock speed reduction might be.
It's better to underclock and run constantly than to run on a short duty cycle; even the miners got that.
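A back-of-the-envelope sketch of why that tends to hold (all numbers below are assumptions, not measurements): model dynamic power as roughly proportional to f·V², and compare delivering the same average throughput by duty-cycling at full clocks versus running continuously at reduced clocks.

```python
# Rough comparison, all figures illustrative: same average throughput delivered
# either by duty-cycling at full clocks or by running continuously at a lower
# clock. Dynamic power modelled as P ~ f * V^2, with voltage assumed to drop
# somewhat as frequency drops along the usable part of the V/F curve.

full_clock = 1.00          # normalized
duty_cycle = 0.70          # option A: run 70% of the time at full clock

# Option A: duty-cycle at full clock and full voltage
avg_power_a = duty_cycle * full_clock * 1.00**2

# Option B: spread the same work out -> 70% clock, running 100% of the time.
# Assume voltage can drop to ~85% at 70% clock (illustrative figure only).
clock_b, volt_b = 0.70, 0.85
avg_power_b = 1.0 * clock_b * volt_b**2

print(f"duty-cycled @ full clock : avg dynamic power {avg_power_a:.2f} (norm.)")
print(f"continuous  @ 70% clock  : avg dynamic power {avg_power_b:.2f} (norm.)")
# Same throughput, lower average power, lower peak temperature and no thermal
# cycling -- which is the miners' argument for underclocking over duty-cycling.
```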
 