News Nvidia Hits Record $13.5 Billion Revenue on Explosive AI GPU Sales

Except that you know as well as I do that if datacenters were only looking to buy machines on the basis of consolidating the same amount of processing power they already have, they'd probably be down to a single rack, by now. The problem is that compute demands go up at least as fast as density increases, which is why they build more & bigger datacenters and keep looking for better ways to cool them.
So what's your point?
Compute demands go up anyway and using fewer total systems is still better than using more total systems.
I don't know exactly what you mean, by that.
I mean that you only show power going up, and that alone doesn't tell you anything. If power increases at the same rate as performance, then the efficiency scaling isn't changing.
That basically means you have to make smaller and smaller chips, as you move to each new node, if you don't want power consumption to go up. We actually saw this, back during Intel's "quad core era", which is the last time they achieved consistent TDP reductions from one generation to the next. As soon as they started adding more cores, the efficiency improvements were overwhelmed and TDPs started to go up.
i7-7700k 91W TDP divided by 4 cores = 22.7W per core
i9-13900k 253W TDP divided by 24 cores = 10.5W per core

You can't take 'TDP for the full CPU going up' as a number in a vacuum for efficiency scaling, because power alone doesn't tell you anything about performance. Efficiency is the power needed for a given amount of performance, i.e. power divided by performance, not power divided by nothing.

TDP per core has dropped to half of what it was during the "quad core era".
If you don't want to use the full 253W, then get a smaller CPU, or run the 13900K at 90W (see der8auer's video). Both of those options still exist; nobody, including physics, is forcing you to do anything here.
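For what it's worth, here's the per-core arithmetic spelled out, plus a toy perf-per-watt comparison. The relative-throughput figures in it are invented purely for illustration, not measurements.

Code:
# Per-core TDP math from the post above, plus a toy perf/W comparison.
# The "relative all-core throughput" figures are hypothetical, for illustration only.

cpus = {
    #             TDP (W), cores, relative all-core throughput (assumed)
    "i7-7700K":  (91,       4,    1.0),
    "i9-13900K": (253,     24,    4.0),
}

for name, (tdp, cores, perf) in cpus.items():
    print(f"{name}: {tdp / cores:.2f} W per core, "
          f"{perf / tdp * 1000:.1f} perf per kW (hypothetical units)")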
 
So what's your point?
Compute demands go up anyway and using fewer total systems is still better than using more total systems.
I guess the main thing that could be done better is building wider cores designed to clock lower, like Apple's. That's more expensive, up front, but then cheaper to operate. Apple can more easily absorb the cost of additional silicon than pure chip-makers, because it only sells complete devices. Plus, their chips don't even have very many CPU cores. Too bad they don't make proper server CPUs (yet).

As a plan B, E-core CPUs aren't too bad. The main downside is their limited applicability to certain workloads, but it's something.

I mean that you only show power going up, and that alone doesn't tell you anything. If power increases at the same rate as performance, then the efficiency scaling isn't changing.

i7-7700k 91W TDP divided by 4 cores = 22.7W per core
i9-13900k 253W TDP divided by 24 cores = 10.5W per core
That's really leaning into the hybrid architecture to make your point, but you know I generally agree with it.
; )

It's worth pointing out that the same thing I said about chip area not holding constant across node-shrinks applies to core size, if you're going to normalize power dissipation by core count. The way your math works is essentially by relying on the E-cores to lower the average core size.
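To make that normalization concrete, here's a rough sketch of power per "P-core equivalent" instead of power per raw core. The 4-to-1 P-core-to-E-core area ratio is only a ballpark assumption for illustration, not a measured figure.

Code:
# Sketch: normalize the i9-13900K's TDP by "P-core equivalents" rather than
# by raw core count. Assumes an E-core takes roughly 1/4 the area of a P-core
# (a ballpark assumption for illustration, not a measured number).

tdp_w = 253
p_cores, e_cores = 8, 16
e_core_area_ratio = 0.25      # assumed

p_core_equivalents = p_cores + e_cores * e_core_area_ratio
print(f"W per raw core:          {tdp_w / (p_cores + e_cores):.2f}")   # ~10.54
print(f"W per P-core equivalent: {tdp_w / p_core_equivalents:.2f}")    # ~21.08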

You can't take 'TDP for the full CPU going up' as a number in a vacuum for efficiency scaling, because power alone doesn't tell you anything about performance. Efficiency is the power needed for a given amount of performance, i.e. power divided by performance, not power divided by nothing.
Yes, the graph lacks efficiency data. As I was digging into that data, it did seem like a lost opportunity that it lacks all-core performance data. It's hard to have an argument about efficiency without sufficient data, so I'll drop the point. Anyway, I was mentioning an area of general concern rather than something more definite.

Bringing this back on-topic, I'm eager to see some real-world data on Nvidia's Grace. I think it shouldn't be too long, now.
 
I guess the main thing that could be done better is building wider cores designed to clock lower, like Apple's. That's more expensive, up front, but then cheaper to operate. Apple can more easily absorb the cost of additional silicon than pure chip-makers, because it only sells complete devices. Plus, their chips don't even have very many CPU cores. Too bad they don't make proper server CPUs (yet).
A wide core just gives you more transistors that sit there powered up, using power with nothing to do.
If you think that e-cores are limited, wide cores are even worse.
It would need software that is just as wide as the core, and that type of software runs very efficiently on narrower cores.
That's really leaning into the hybrid architecture to make your point, but you know I generally agree with it.
; )

It's worth pointing out that the same thing I said about chip area not holding constant across node-shrinks applies to core size, if you're going to normalize power dissipation by core count. The way your math works is essentially by relying on the E-cores to lower the average core size.
All of these are full-core-only CPUs (no E-cores), and all of them reduce the power per core, because the total power stays basically the same.
And every newer one is faster than every previous one.

i7-7700k 91W TDP divided by 4 cores = 22.7W per core
i7-8700k 95W TDP divided by 6 cores = 15.8W per core
i7-9700k 95W TDP divided by 8 cores = 11.8W per core
 
A wide core just gives you more transistors that sit there powered up, using power with nothing to do.
Clock gating is frequently used to avoid that.

If you think that e-cores are limited, wide cores are even worse.
Apple's are the most energy-efficient cores, bar none. The M1's cores were the widest at its launch. I'm not sure if that's still true, but the other point about them is that they're not designed to clock as high. That means they can have a longer critical path, and that enables higher IPC, for a variety of reasons.

Designing to a lower clockspeed target is a triple-win for efficiency, because you not only get higher IPC, but also avoid energy-wasting high clockspeeds, and cache misses take fewer clock cycles. Reducing the cycle-impact of cache misses makes them easier to hide with techniques like speculative execution.
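Here's a toy model of that tradeoff: dynamic power scales roughly with C·V²·f, and within a DVFS range the voltage generally has to rise with frequency, so power grows much faster than throughput. The linear V-f curve and the constants below are crude assumptions, just to show the shape of the relationship, not real silicon data.

Code:
# Toy model: dynamic power ~ C * V^2 * f, with voltage assumed to rise
# linearly with frequency (a crude DVFS approximation, not real silicon data).

def perf_per_watt(freq_ghz, v_at_0ghz=0.7, v_per_ghz=0.1, capacitance=1.0):
    voltage = v_at_0ghz + v_per_ghz * freq_ghz   # assumed linear V-f curve
    power = capacitance * voltage**2 * freq_ghz  # dynamic power, arbitrary units
    throughput = freq_ghz                        # assume IPC held constant
    return throughput / power

for f in (2.0, 3.5, 5.5):
    print(f"{f:.1f} GHz -> relative perf/W = {perf_per_watt(f):.2f}")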

It would need software that is just as wide as the core, and that type of software runs very efficiently on narrower cores.
So far, we haven't seen many examples of software that doesn't benefit from higher-IPC microarchitectures.

All of these are full-core-only CPUs (no E-cores), and all of them reduce the power per core, because the total power stays basically the same.
And every newer one is faster than every previous one.

i7-7700k 91W TDP divided by 4 cores = 22.7W per core
i7-8700k 95W TDP divided by 6 cores = 15.8W per core
i7-9700k 95W TDP divided by 8 cores = 11.8W per core
Ah, but look at the base clocks:
  • i7-7700K: 4.2 GHz
  • i7-8700K: 3.7 GHz
  • i7-9700K: 3.6 GHz
  • i7-10700K: 3.5 GHz

In addition to the successive 14 nm process node refinements, they also had to reduce all-core clockspeeds. That means you don't get linear scaling, as you increase core counts.
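Treating the base clocks above as a crude proxy, here's the kind of scaling estimate I mean: all-core throughput roughly proportional to cores × base clock, ignoring IPC gains, turbo behavior, and memory limits entirely. It's only a sketch of the trend, not a benchmark.

Code:
# Crude scaling estimate: all-core throughput ~ cores * base clock.
# Ignores IPC differences, turbo behavior, and memory bottlenecks.

parts = [
    ("i7-7700K",  4, 4.2),
    ("i7-8700K",  6, 3.7),
    ("i7-9700K",  8, 3.6),
    ("i7-10700K", 8, 3.5),
]

_, base_cores, base_clock = parts[0]
for name, cores, clock in parts:
    core_ratio = cores / base_cores
    throughput_ratio = (cores * clock) / (base_cores * base_clock)
    print(f"{name}: {core_ratio:.1f}x cores -> ~{throughput_ratio:.2f}x estimated throughput")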
 
Clock gating is frequently used to avoid that.
Then what's the point?!
Why have a very wide core if you have to shut down most of it all of the time?!
Apple's are the most energy-efficient cores, bar none. The M1's cores were the widest at its launch. I'm not sure if that's still true, but the other point about them is that they're not designed to clock as high. That means they can have a longer critical path, and that enables higher IPC, for a variety of reasons.

Designing to a lower clockspeed target is a triple-win for efficiency, because you not only get higher IPC, but also avoid energy-wasting high clockspeeds, and cache misses take fewer clock cycles. Reducing the cycle-impact of cache misses makes them easier to hide with techniques like speculative execution.
You don't have to design your core for low clocks and force your customers to only use low clocks. As long as your cores CAN run at low clocks, your customers can CHOOSE THEMSELVES what they need on any occasion.
They want to do something very energy efficiently?
They can run it at low clocks.
They want something to finish fast?
They can run it at high clocks.
[attached chart: image-27-1.png]

So far, we haven't seen many examples of software that doesn't benefit from higher-IPC microarchitectures.
We haven't seen many examples period.
All the benchmarks that we do see boil down to three things: 3D rendering, video rendering, and compression.
And then there is SPEC, where nobody* even knows or has any understanding of what it does or what it measures, but, just as above, it is made for servers.

*I'm sure there are some people that do, but none being your normal PC user.
Ah, but look at the base clocks:
  • i7-7700K: 4.2 GHz
  • i7-8700K: 3.7 GHz
  • i7-9700K: 3.6 GHz
  • i7-10700K: 3.5 GHz

In addition to the successive 14 nm process node refinements, they also had to reduce all-core clockspeeds. That means you don't get linear scaling, as you increase core counts.
Ah but look at the first part of your post...
Designing to a lower clockspeed target is a triple-win for efficiency, because you not only get higher IPC, but also avoid energy-wasting high clockspeeds, and cache misses take fewer clock cycles. Reducing the cycle-impact of cache misses makes them easier to hide with techniques like speculative execution.

Also I just said that every newer one is faster than every previous one.
I did not suggest any linear scaling going on.
 
Then what's the point?!
Why have a very wide core if you have to shut down most of it all of the time?!
The point is that you get the performance benefit of the extra width with minimal energy drain when it's not used. If you don't know what clock-gating is, you should look it up and then you might understand.

You don't have to design your core for low clocks and force your customers to only use low clocks. As long as your cores CAN run at low clocks, your customers can CHOOSE THEMSELVES what they need on any occasion.
That's actually not how it works.

Designing for higher clockspeeds involves tradeoffs and most of those reduce IPC and efficiency in general. That's why mobile-oriented cores don't clock very high. They opted for lower clockspeed targets, because they prioritize energy-efficiency.

Another famous example was Pentium 4 vs. Core 2. Even though Core 2 didn't clock nearly as high (IIRC, launching at like 2.4 GHz), it was about 50% faster than much higher-clocked P4's (3.6 GHz or thereabouts) and used less power! That held true, even when comparing ones fabbed on the same node (some Pentium D's did get made on 65 nm, where the Core 2 launched).

We haven't seen many examples period.
That doesn't show what you think it does. You've previously claimed that Gracemont is area-optimized as a primary goal. Optimizing for area is at odds with optimizing for energy-efficiency. In fact, building a high-clocking core is essentially an area-optimization.

Ah but look at the first part of your post...

"Designing to a lower clockspeed target is a triple-win for efficiency, because you not only get higher IPC, but also avoid energy-wasting high clockspeeds, and cache misses take fewer clock cycles. Reducing the cycle-impact of cache misses makes them easier to hide with techniques like speculative execution."
There's a fundamental difference between designing a core for low-clocks vs. simply running a high-clocking core at lower speeds. Yes, reducing clockspeeds increases efficiency, but without the benefits you could've gained by actually designing it to run at those speeds (i.e. fewer & more sophisticated pipeline stages; lower latencies for things like cache lookups).

Also I just said that every newer one is faster than every previous one.
I did not suggest any linear scaling going on.
You were talking about increasing performance per CPU, in order to need fewer CPUs. The best way to do that is by adding cores. However, if you're clocking them a lot lower, then you're hurting your scaling story. What's worse is that the area-optimizations (i.e. PPA) made in order for the cores to clock high have gone to waste. You'd have been better off just designing cores for that specific clockspeed. Then, even with the same area, you could achieve better IPC!
 
The point is that you get the performance benefit of the extra width with minimal energy drain when it's not used. If you don't know what clock-gating is, you should look it up and then you might understand.


That's actually not how it works.

Designing for higher clockspeeds involves tradeoffs and most of those reduce IPC and efficiency in general. That's why mobile-oriented cores don't clock very high. They opted for lower clockspeed targets, because they prioritize energy-efficiency.

Another famous example was Pentium 4 vs. Core 2. Even though Core 2 didn't clock nearly as high (IIRC, launching at like 2.4 GHz), it was about 50% faster than much higher-clocked P4's (3.6 GHz or thereabouts) and used less power! That held true, even when comparing ones fabbed on the same node (some Pentium D's did get made on 65 nm, where the Core 2 launched).


That doesn't show what you think it does. You've previously claimed that Gracemont is area-optimized as a primary goal. Optimizing for area is at odds with optimizing for energy-efficiency. In fact, building a high-clocking core is essentially an area-optimization.


There's a fundamental difference between designing a core for low-clocks vs. simply running a high-clocking core at lower speeds. Yes, reducing clockspeeds increases efficiency, but without the benefits you could've gained by actually designing it to run at those speeds (i.e. fewer & more sophisticated pipeline stages; lower latencies for things like cache lookups).
So basically your issue is that consumer oriented products are consumer oriented and not server oriented....
You want desktop CPUs and GPUs that are so wide and so slow that they're only good for distributed computing and will suck big time at everything else.
You were talking about increasing performance per CPU, in order to need fewer CPUs. The best way to do that is by adding cores. However, if you're clocking them a lot lower, then you're hurting your scaling story. What's worse is that the area-optimizations (i.e. PPA) made in order for the cores to clock high have gone to waste. You'd have been better off just designing cores for that specific clockspeed. Then, even with the same area, you could achieve better IPC!
You still have increased performance per CPU.

And if we take these two as an example, then you could run the i7-9700K at 180W and replace two i7-7700K CPUs while keeping the same performance; you would probably even need a bit less than that, because of architecture/node advances.

i7-7700k 91W TDP divided by 4 cores = 22.7W per core
i7-9700k 95W TDP divided by 8 cores = 11.8W per core

Nobody cares about area, except for you and the FABs...
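Rough math on that consolidation claim, treating cores as roughly interchangeable and ignoring per-core clock/IPC differences (which, as noted, should actually favor the newer chip):

Code:
# Back-of-the-envelope consolidation math: two 4-core i7-7700K systems
# vs. one i7-9700K with its power limit raised to ~180 W (figure from the
# post above). Treats cores as interchangeable and ignores clock/IPC gains.

old_systems, old_cores_each, old_tdp_each = 2, 4, 91
new_cores, new_power_limit = 8, 180

print(f"old: {old_systems * old_cores_each} cores, "
      f"{old_systems * old_tdp_each} W across {old_systems} sockets")
print(f"new: {new_cores} cores, {new_power_limit} W in 1 socket")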
 
So basically your issue is that consumer oriented products are consumer oriented and not server oriented....
No, I'm talking about server CPUs that are built primarily by extending consumer CPU cores. If the server CPUs had their own cores, then you could design them to a lower clockspeed target, which would give you a bigger timing budget.

You still have increased performance per CPU.
The point was never about performance, in isolation. It was about energy-efficiency. If you wanted the most energy-efficient CPU cores @ the 1.9 GHz Base frequency of something like a Xeon Platinum 8490H, you're not going to get there by using cores designed to make timing @ 5.5 GHz, when they're placed in a desktop CPU!

The Xeon 8490H has a max turbo clock speed of only 3.5 GHz. So, that means they could've had up to 57% longer critical paths, if we consider that essentially the same core in an i9-12900KS reaches 5.5 GHz. Longer critical paths let you reduce the number of pipeline stages and various latencies. You can also potentially have more sophisticated algorithms in your branch predictor, prefetcher, scheduler, etc.
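The cycle-time arithmetic behind that 57% figure, for anyone who wants to check it:

Code:
# Cycle-time budget at each clock target: period = 1 / frequency.
server_ghz, desktop_ghz = 3.5, 5.5

server_period_ps = 1e3 / server_ghz     # ~285.7 ps per cycle
desktop_period_ps = 1e3 / desktop_ghz   # ~181.8 ps per cycle

extra_budget = server_period_ps / desktop_period_ps - 1
print(f"{server_period_ps:.1f} ps vs. {desktop_period_ps:.1f} ps "
      f"-> {extra_budget:.0%} longer critical-path budget")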

It's not just Intel, of course. AMD obviously does it, too.
 
The point is that you get the performance benefit of the extra width with minimal energy drain when it's not used. If you don't know what clock-gating is, you should look it up and then you might understand.
One thing not to forget about clock gating is the knock-on effect: when data stops propagating through clock-frozen latches, the combinational logic downstream from them also freezes (static inputs) and won't be consuming dynamic power either.
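A toy way to express that: dynamic power is roughly α·C·V²·f, and gating a block drives its activity factor α (and that of the combinational logic it feeds) toward zero. The numbers below are arbitrary, just to show the relationship; leakage isn't modeled.

Code:
# Toy dynamic-power model: P_dyn ~ activity * C * V^2 * f.
# Clock-gating a block drives its activity factor toward zero, and the
# combinational logic it feeds sees static inputs, so it stops toggling too.

def dynamic_power(activity, capacitance=1.0, voltage=1.0, freq_ghz=4.0):
    return activity * capacitance * voltage**2 * freq_ghz  # arbitrary units

print("ungated wide block:", dynamic_power(activity=0.3))
print("clock-gated block: ", dynamic_power(activity=0.0), "(leakage not modeled)")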
 