News Nvidia's Jensen Huang expects GAA-based technologies to bring a 20% performance uplift

- If it results in limited performance, prices will rise and Nvidia will increase profit.
- If it results in plenty of performance, prices will rise and Nvidia will increase profit.
- If the chips are knitted by Grandma using metal scouring pads, prices will rise and Nvidia will increase profit.
 
The article said:
TSMC itself expects its first GAA-based process technology — N2 — to increase performance by 10% to 15% compared to N3E, the company's second generation 3nm-class process technology that precedes N3P. Given these metrics, it looks like Nvidia's Huang is a bit more optimistic about N2 than TSMC is.
Anton, are you for real?

I have a ton of respect for Anton, but every once in a while, he says something he really should know better than to believe. All of TSMC's estimates are based on a mountain of assumptions and organized around a theoretical (or actual?) reference CPU core. This dictates the mix of different cells and doesn't permit designers to adapt the microarchitecture to the process node, but instead looks only at the impact of directly porting the core between nodes. I've previously seen articles mention specifically which CPU core ARM uses for these estimates, but I'm having trouble digging it up.

Essentially, when they quote a performance number, it's just looking at how much higher you could clock the same design at the same power, when doing a direct port. But, the thing is that people rarely do direct ports from one node to the next, especially if they're in different families.

So, the first point of divergence between TSMC's & Nvidia's numbers is that GPUs have a different mix of cells than CPUs. Secondly, if you use the new node's additional density and timing budget to increase IPC, then you can definitely beat their performance estimates. As I said, the way they estimate performance gains is essentially just by looking at how much you could increase the clock speed. Yet, in most cases, you can do better with a mix of IPC and clock speed improvements. This is especially true of something like a GPU or NPU, where a feature like "tensor cores" really falls into the category of an IPC increase.
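Just to illustrate with some made-up numbers (these percentages are placeholders I picked for the sketch, not TSMC's or Nvidia's figures), here's the difference in miniature:

```python
# Purely illustrative arithmetic -- the percentages below are placeholders,
# not TSMC or Nvidia figures.  A foundry's quoted gain is roughly "how much
# higher you could clock the same design at the same power" (a direct port).
# A redesign can also spend the node's extra density and timing slack on IPC.

node_quoted_gain = 0.15   # hypothetical: a direct port clocks 15% higher

clock_gain = 0.08         # hypothetical: a redesign takes less of the clock headroom...
ipc_gain   = 0.12         # ...and converts the extra density/timing slack into IPC

direct_port = 1 + node_quoted_gain
redesign    = (1 + clock_gain) * (1 + ipc_gain)   # throughput ~ clock x IPC

print(f"direct port: +{direct_port - 1:.0%}")     # +15%
print(f"redesign:    +{redesign - 1:.0%}")        # ~+21%, beating the quoted node gain
```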
 
- If it results in limited performance, prices will rise and Nvidia will increase profit.
- If it results in plenty of performance, prices will rise and Nvidia will increase profit.
- If the chips are knitted by Grandma using metal scouring pads, prices will rise and Nvidia will increase profit.
That's too simplistic. Nvidia actually does need to compete with alternatives on perf/$. The more they can increase performance, the higher a price premium they can justify. If the increase is small, it gives others a chance to close the gap and will put more pricing pressure on Nvidia.
 
TSMC isn't quite as optimistic: clocks (alone) could reach 20% higher at 2nm; that is all.

It's not until the next node, which isn't far off, that we get that much of a gain:

https://www.tsmc.com/english/dedicatedFoundry/technology/logic/l_A16

"A16 is best suited for HPC products with complex signal routes and dense power delivery network, as they can benefit the most from backside power delivery. Compared with N2P (mine: "better than 2nm"), A16 offers 8%~10% speed improvement at the same Vdd, 15%~20% power reduction at the same speed, and 1.07~1.10X chip density.".

Still, glad they are pushing forward; or downward, if you prefer.
 
I wouldn't buy a new GPU for a 20% uplift in performance - especially not when Nvidia would increase prices and TDP by more than 20% to get there. If in 5 years I have to make the hard choice between 1440p144 high and 1440p120 ultra, I think I'll manage to find a way to survive. Come back when you can double performance at the same price/power, or when there's some revolutionary new rendering technology that a game I want to play absolutely requires.

But Nvidia doesn't care about me, or fps, or games, or gamers in general anymore. It's all about how many Tegraflops they can FP4ize per jiggawatt.
So if the new Nvidia can Amazon suggestive-sell me some ALLCAPS junk in a slightly faster in-browser pop-up ad, or help Google employees with direct access to uncensored Gemini see their fictional interpretations of celebrity body parts faster ... I guess I just don't care. If it takes me 36 seconds instead of 30 to figure out whether Stable Diffusion randomly decided to insert a stripper into the image I was upscaling, again, it's not a big deal. Or it's at least not a problem anyone should pay $2000+ to fix.
 
Anton, are you for real?

I have a ton of respect for Anton, but every once in a while, he says something he really should know better than to believe. All of TSMC's estimates are based on a mountain of assumptions and organized around a theoretical (or actual?) reference CPU core. This dictates the mix of different cells and doesn't permit designers to adapt the microarchitecture to the process node, but instead looks only at the impact of directly porting the core between nodes. I've previously seen articles mention specifically which CPU core ARM uses for these estimates, but I'm having trouble digging it up.

Essentially, when they quote a performance number, it's just looking at how much higher you could clock the same design at the same power, when doing a direct port. But, the thing is that people rarely do direct ports from one node to the next, especially if they're in different families.

So, the first point of divergence between TSMC's & Nvidia's numbers is that GPUs have a different mix of cells than CPUs. Secondly, if you use the new node's additional density and timing budget to increase IPC, then you can definitely beat their performance estimates. As I said, the way they estimate performance gains is essentially just by looking at how much you could increase the clock speed. Yet, in most cases, you can do better with a mix of IPC and clock speed improvements. This is especially true of something like a GPU or NPU, where a feature like "tensor cores" really falls into the category of an IPC increase.
Anton wasn't at the Q&A, and of course he understands there are a LOT of factors going into the discussion. In fact, I would say Jensen Huang wasn't even speaking about N2 and GAA when he said "20%." You'd have to have been there, and I don't have a recording, but the "20%" figure was more of a quip than an actual estimate.

IIRC, some analyst from Korea asked about the possibility of Nvidia using Samsung's upcoming GAA node, and how much of a gain it might bring. Jensen responded very quickly and said something to the effect of: "20%. That's about all we can expect from a process node transition these days, with the end of Moore's Law. So if you ask what we'll get from a newer, more advanced node, I'd say 20%." I'm paraphrasing here, but it was something along those lines.

He then went into a lengthier discussion to reiterate how Nvidia is now an AI infrastructure company, how performance per watt is the limiting factor for data centers, and how the plan is to have even higher-power, denser racks in the future.

Another caveat: I may be mixing together different responses to different questions, because there's a lot of overlap in what was said. Anyway, I'll tweak that sentence, because I most definitely don't feel like Jensen was implying TSMC N2 GAA would be 20% higher while TSMC is only projecting 10~15 percent. And Jensen was focused more on performance per watt than on raw performance.
 
Anton, are you for real?

I have a ton of respect for Anton, but every once in a while, he says something he really should know better than to believe. All of TSMC's estimates are based on a mountain of assumptions and organized around a theoretical (or actual?) reference CPU core. This dictates the mix of different cells and doesn't permit designers to adapt the microarchitecture to the process node, but instead looks only at the impact of directly porting the core between nodes. I've previously seen articles mention specifically which CPU core ARM uses for these estimates, but I'm having trouble digging it up.

Essentially, when they quote a performance number, it's just looking at how much higher you could clock the same design at the same power, when doing a direct port. But, the thing is that people rarely do direct ports from one node to the next, especially if they're in different families.

So, the first point of divergence between TSMC's & Nvidia's numbers is that GPUs have a different mix of cells than CPUs. Secondly, if you use the new node's additional density and timing budget to increase IPC, then you can definitely beat their performance estimates. As I said, the way they estimate performance gains is essentially just by looking at how much you could increase the clock speed. Yet, in most cases, you can do better with a mix of IPC and clock speed improvements. This is especially true of something like a GPU or NPU, where a feature like "tensor cores" really falls into the category of an IPC increase.
Bud, chill. Half of those gains you're talking about are what would be referred to as "architectural gains". You know I agree with you 9 times out of 10, but you just seem like you're making a big deal out of node gains vs. architectural gains for no reason here. I mean, IPC gains literally have nothing to do with the node. The only way a node affects IPC in any way is through secondary benefits from increased density or efficiency.
 
The gain from the transition from "3"nm to "1.6"nm is 40%, but the results deteriorate by 20%+ because the architecture is designed more for LLM training and other HPC tasks than for games. The final result is no more than a 20% pure gain in FPS, if we don't count the next MFG with 8x more frames (the original plus 7 fake ones).
 
I wouldn't buy a new GPU for a 20% uplift in performance
I think they're talking about the performance boost you get for the same number of CUDA cores. However, if you made a die the same size, then you should get almost proportionately more CUDA cores/SMs/GPCs.

Since TSMC is claiming 15% greater density than N3E, the actual speedup would be something more like 38%, at only 15% more power. Again, that's comparing to N3E, not the N4P-class GPUs we have today.
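Here's that back-of-the-envelope math, assuming the 15% density gain translates into ~15% more SMs on a same-size die (it won't be perfectly linear in practice):

```python
# Back-of-the-envelope iso-die-size comparison vs. N3E.  Assumptions: the 15%
# density gain turns into ~15% more SMs/CUDA cores, and the ~20% figure is the
# iso-power per-core speedup, so power grows roughly with the core count.

density_gain = 1.15   # TSMC: N2 logic density vs. N3E
perf_gain    = 1.20   # upper end of the iso-power speed gain for the same design

same_size_die_speedup = density_gain * perf_gain   # ~1.38
power_increase        = density_gain               # ~15% more cores at the same power each

print(f"same-size die: ~{same_size_die_speedup - 1:.0%} faster, ~{power_increase - 1:.0%} more power")
```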

- especially not when Nvidia would increase prices and TDP by more than 20% to get there.
The figure quoted by TSMC is the amount of performance increase at the same power. Presumably, that's what Jensen was also referring to.
 
I mean IPC gains literally have nothing to do with the node.
IPC has a lot to do with the node, because you have a certain clock speed target and denser nodes can fit more gates in the critical path. Also, on a worse node, more gates -> more power, which is another thing constraining your complexity.
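To make the clock-target point concrete with made-up numbers (the gate delays below are illustrative placeholders, not real process data):

```python
# Illustrative only: the gate delays are made-up placeholders, not real node data.
# At a fixed clock target, the cycle time is a fixed budget; a node with faster
# gates fits more logic levels per pipeline stage -- headroom you can spend on IPC.

target_ghz    = 5.0
cycle_time_ps = 1000.0 / target_ghz   # 200 ps per pipeline stage

gate_delay_old_ps = 10.0   # hypothetical older node
gate_delay_new_ps = 8.0    # hypothetical denser/faster node

levels_old = int(cycle_time_ps // gate_delay_old_ps)   # 20 logic levels per stage
levels_new = int(cycle_time_ps // gate_delay_new_ps)   # 25 logic levels per stage

print(f"{target_ghz:.1f} GHz -> {cycle_time_ps:.0f} ps per stage")
print(f"old node: {levels_old} levels/stage, new node: {levels_new} levels/stage")
```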

I clearly recall reading an interview with someone from AMD (Mike Clark?) about Zen 5, where he mentioned how they were designing Zen 5 for both N4P and some N3 node (probably referring to the Zen 5C CCD). He said there were opportunities they saw with N3 that they couldn't pursue, because they were held back by having to design the microarchitecture for N4P. Sorry, I looked for the source but have yet to find it.

I think another example we can look at is what happened with Rocket Lake. Intel basically backported the Ice Lake core from a rough 10 nm node to a very mature 14 nm node. During the process, I'm sure they had to split some pipeline stages across more clock cycles, just to meet timing at a competitive clock speed. Anyway, the resulting Rocket Lake ran tremendously hotter than its predecessors, which shows just how much the microarchitecture was designed around the performance & efficiency characteristics of their 10 nm process.
 
IPC has a lot to do with the node, because you have a certain clock speed target and denser nodes can fit more gates in the critical path. Also, on a worse node, more gates -> more power, which is another thing constraining your complexity.

I clearly recall reading an interview with someone from AMD (Mike Clark?) about Zen 5, where he mentioned how they were designing Zen 5 for both N4P and some N3 node (probably referring to the Zen 5C CCD). He said there were opportunities they saw with N3 that they couldn't pursue, because they were held back by having to design the microarchitecture for N4P. Sorry, I looked for the source but have yet to find it.

I think another example we can look at is what happened with Rocket Lake. Intel basically backported the Ice Lake core from a rough 10 nm node to a very mature 14 nm node. During the process, I'm sure they had to split some pipeline stages across more clock cycles, just to meet timing at a competitive clock speed. Anyway, the resulting Rocket Lake ran tremendously hotter than its predecessors, which shows just how much the microarchitecture was designed around the performance & efficiency characteristics of their 10 nm process.
You’re selectively quoting there. I specifically said “other than secondary effects like density and efficiency”. Of course greater density and efficiency gives them more leeway to do new things. As far as Rocket Lake, it only runs hot when given nuclear level TDPs to compete with TSMC N7. People have a tendency to forget that even on 14nm the 11700k managed to go blow for blow with the 5800x.
 
You’re selectively quoting there. I specifically said “other than secondary effects like density and efficiency”. Of course greater density and efficiency gives them more leeway to do new things.
But they're not secondary. They've never been secondary. In the history of CPUs, the primary way that CPUs got faster, on a newer node, was through microarchitecture.

As far as Rocket Lake, it only runs hot when given nuclear level TDPs to compete with TSMC N7. People have a tendency to forget that even on 14nm the 11700k managed to go blow for blow with the 5800x.
Not quite. The i9 managed some wins, while also being more expensive. Here's roughly how the i7 compared, on both single- & multi-threaded benchmarks:

With the rate-1, Zen 3 is still 5.7% and 1.2% faster than Cypress Cove. I'll grant you that's close, and you're right that juicing the clock speed was significant, because the i7-11700K would turbo up to 5.0 GHz, while the 5800X was limited to 4.7 GHz. So, clock-for-clock, Zen 3 was still faster.
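Rough math on that, assuming single-threaded performance scales linearly with the turbo clock (it only approximately does):

```python
# Rough clock-for-clock comparison using the numbers above.  Assumption:
# single-thread performance scales linearly with the turbo clock, which is
# only approximately true.

zen3_rate1_lead = (0.012, 0.057)   # Zen 3's rate-1 advantage over the i7-11700K
clock_ratio     = 5.0 / 4.7        # Rocket Lake's turbo clock advantage (~6.4%)

per_clock_lead = [(1 + lead) * clock_ratio - 1 for lead in zen3_rate1_lead]
print([f"+{x:.1%}" for x in per_clock_lead])   # roughly +7.7% to +12.4% per clock
```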

It's with Rate-N that we really see the efficiency gap take its toll, since Rocket Lake is clearly being power-limited in some cases.

None of this changes my main/original point that TSMC's numbers do not incorporate any microarchitecture improvements and do not account for any increase in core counts. They're strictly telling you how much better the new node is, if you take the exact same design and port it to the new node. I believe that's the primary reason for the divergence between TSMC's and Jensen's numbers.

That said, the additional context provided by Jarred makes it sound like Jensen's comments weren't actually meant to contradict or align with TSMC's, but more of a generalization.
 
I wouldn't buy a new GPU for a 20% uplift in performance - especially not when Nvidia would increase prices and TDP by more than 20% to get there. If in 5 years I have to make the hard choice between 1440p144 high and 1440p120 ultra, I think I'll manage to find a way to survive. Come back when you can double performance at the same price/power, or when there's some revolutionary new rendering technology that a game I want to play absolutely requires.

But Nvidia doesn't care about me, or fps, or games, or gamers in general anymore. It's all about how many Tegraflops they can FP4ize per jiggawatt.
So if the new Nvidia can Amazon suggestive-sell me some ALLCAPS junk in a slightly faster in-browser pop-up ad, or help Google employees with direct access to uncensored Gemini see their fictional interpretations of celebrity body parts faster ... I guess I just don't care. If it takes me 36 seconds instead of 30 to figure out whether Stable Diffusion randomly decided to insert a stripper into the image I was upscaling, again, it's not a big deal. Or it's at least not a problem anyone should pay $2000+ to fix.
Nvidia did not increase prices in the 5000 series except for the 5090. So I'm not sure why you're trying to spread misinformation. The 5000 series is faster than the previous generation.
 
Nvidia did not increase prices in the 5000 series except for the 5090. So I'm not sure why you're trying to spread misinformation. The 5000 series is faster than the previous generation.
Yes, but I think we need to take a step back and see where prices settle. The MSRP is one thing, but if the street prices settle at a higher point than with the RTX 4000 series, I'm not sure we can really say it's not more expensive.

To me, it would only make sense for prices to be higher, as everything about the RTX 5000 should make them more expensive. Using more power means they need bigger VRMs and cooling solutions. Using GDDR7 should also make them more expensive. Lastly, the GPU dies are bigger.

So, the only way they wouldn't be more expensive is if the wafer costs of the GPUs dropped by so much that Nvidia is selling the chips to its partners for less money than the corresponding RTX 4000 dies, and by enough to offset the added costs of GDDR7 and bigger cooling + power overheads. That just seems incredibly unlikely, especially since this is Nvidia we're talking about.
 
Nvidia did not increase prices in the 5000 series except for the 5090. So I'm not sure why you're trying to spread misinformation. The 5000 series is faster than the previous generation.
Please calculate for me the average per-transistor performance for the RTX 5000 series and the RTX 4000 series. I feel that, despite moar gigahertz, performance per transistor is lower in the 5000 series.
 
Please calculate for me the average per-transistor performance for the RTX 5000 series and the RTX 4000 series. I feel that, despite moar gigahertz, performance per transistor is lower in the 5000 series.
Okay, so the RTX 4090 has 76.3 BTr, while the RTX 5090 has 92.2 BTr, which is 20.8% more.

I suspect you're right, but it depends on what workload you mean. For anything that utilizes the fixed-function units they enhanced, the RTX 5090 probably performs more than 20.8% faster. That includes AI models that can effectively utilize fp4.

In general compute or raster graphics, performance increases should be sub-linear, especially considering the RTX 5090's clock speed reductions. Also, it has 33% more memory channels and I presume GDDR7 memory controllers occupy more die area than GDDR6X. Yet, I think the RTX 4090 wasn't really bandwidth constrained in gaming, so you don't get a good "return" on the extra transistors devoted to it.

In terms of cost efficiency, the RTX 5090's list price puts it at 46.12 MTr/$, whereas the RTX 4090 was 47.72 MTr/$. Then again, the RTX 5090 has 33% more memory and it's a newer type of memory that would be more expensive per GB, if anything. So, if you didn't expect wafer pricing on the GPUs to change much, then it'd be fair to have the 5090 cost a little more per GPU transistor.
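For anyone who wants to check my math, here it is, assuming the $1,599 and $1,999 list prices that those MTr/$ figures work out to:

```python
# The arithmetic behind the figures above.  Assumption: $1,599 and $1,999 list
# prices, which are the prices the quoted MTr/$ numbers work out to.

rtx_4090 = {"transistors_btr": 76.3, "list_price_usd": 1599}
rtx_5090 = {"transistors_btr": 92.2, "list_price_usd": 1999}

more_transistors = rtx_5090["transistors_btr"] / rtx_4090["transistors_btr"] - 1
mtr_per_dollar_4090 = rtx_4090["transistors_btr"] * 1000 / rtx_4090["list_price_usd"]
mtr_per_dollar_5090 = rtx_5090["transistors_btr"] * 1000 / rtx_5090["list_price_usd"]

print(f"RTX 5090 has {more_transistors:.1%} more transistors")   # ~20.8%
print(f"RTX 4090: {mtr_per_dollar_4090:.2f} MTr/$")              # ~47.72
print(f"RTX 5090: {mtr_per_dollar_5090:.2f} MTr/$")              # ~46.12
```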

BTW, I see your statements addressed "series" and I'm well aware that I latched onto the "90" cards, which aren't necessarily characteristic of the whole range. That's just what I had numbers for, readily at hand.
 
BTW, I see your statements addressed "series" and I'm well aware that I latched onto the "90" cards,
Yes, my question was about the series as a whole, but the xx90 models you picked have additional specification differences. I think the situation for models below the flagships also needs to be considered separately for each class.
 