News Leak Suggests 'RTX 4090' Could Have 75% More Cores Than RTX 3090

The RTX 3090 had over twice as many shaders as the RTX 2080 Ti, but certainly didn't perform twice as fast.
Nvidia changed the definition of a CUDA core going from Turing to Ampere, so while Ampere could offer twice the performance in certain compute workloads, in games that required a mix of INT and FP calculations there weren't really twice as many execution units. Nvidia itself stated that Ampere was at most 1.7 times faster than Turing in rasterized graphics. Memory bandwidth also didn't come close to doubling between Turing and Ampere.
 
Then let's look at some values. Or just one because I'm feeling lazy.

Looking at page 13 (or 19 in the PDF) of the Turing whitepaper, there's a graph showing the mix of INT and FP instructions in various games. I'm just going to pick out Far Cry 5. From that graph, let's assume that for every 1 FP instruction there were 0.4 INT instructions. Laid onto one of Turing's SMs, that means for every 64 FP instructions there are only about 25 INT instructions, so only 89 of the SM's 128 execution units are busy, a utilization rate of ~70%. For Ampere, since half the CUDA cores can switch between FP and INT, we can balance which half does what for better utilization. Doing some math, the best case for Far Cry 5's mix works out to roughly 90 FP instructions plus 36 INT instructions, using 126 of the 128 CUDA cores versus 89 on Turing.

So right off the bat, without adding any more CUDA cores, Ampere should have about a 1.4x lead over Turing in this example. So how much does Ampere get in practice? 1.28x based on TechPowerUp's 4K benchmark.
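
If it helps, here's a rough back-of-the-envelope version of that math in Python, assuming the whitepaper layouts (64 FP32 + 64 INT32 units per Turing SM, 64 FP32-only plus 64 FP32-or-INT32 units per Ampere SM) and the ~0.4 INT per FP mix above:

Code:
# Toy per-clock utilization model for one SM's worth of execution units.
# Assumes 64 FP32 + 64 INT32 units on Turing, and 64 FP32-only plus
# 64 FP32-or-INT32 units on Ampere, with 0.4 INT ops per FP op (Far Cry 5).

INT_PER_FP = 0.4
TOTAL_UNITS = 128

# Turing: FP32 issue is capped at 64 per clock, so much of the INT32 side idles.
turing_fp = 64
turing_int = int(turing_fp * INT_PER_FP)                # 25
turing_util = (turing_fp + turing_int) / TOTAL_UNITS    # ~0.70

# Ampere: split the flexible half so the FP and INT work drain together,
# i.e. find the largest fp with fp + 0.4*fp <= 128.
ampere_fp = int(TOTAL_UNITS / (1 + INT_PER_FP))         # 91 (rounded to 90 above)
ampere_int = int(ampere_fp * INT_PER_FP)                # 36
ampere_util = (ampere_fp + ampere_int) / TOTAL_UNITS    # ~0.99

print(f"Turing: {turing_fp} FP + {turing_int} INT -> {turing_util:.0%} of units busy")
print(f"Ampere: {ampere_fp} FP + {ampere_int} INT -> {ampere_util:.0%} of units busy")
print(f"Per-clock FP throughput gain: ~{ampere_fp / turing_fp:.2f}x")

Give or take rounding, that lands on the ~90 + 36 split and the ~1.4x figure, before clocks, bandwidth, or anything else enters the picture.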

And sure, Ampere doesn't have 2x the memory bandwidth, but it has almost double the L1 cache, which should soak up some of that deficiency.

Either way, my commentary is pointing out the odd conclusion the article seems to hint at: that NVIDIA only needs to add 75% more shaders to get double the performance.
 
Below is a comparison between 1/4 of a Turing SM and 1/4 of an Ampere SM:

[Attached image: Turng_Ampere_SM.jpg (1/4 Turing SM vs. 1/4 Ampere SM)]


As you can see, the number of FP32 CUDA cores doubled in Ampere, which is how Nvidia claims twice as many CUDA cores. However, you can also see that not everything doubled. Whereas each Turing SM partition could concurrently execute 16x INT32 and 16x FP32 operations, in Ampere one half of the partition can execute either 16x INT32 or 16x FP32 operations while the other half can only execute 16x FP32 operations. For purely FP32 workloads you could see up to a theoretical doubling of performance, but for purely INT32 workloads you wouldn't see any improvement at all. Because game workloads are usually more floating-point heavy than integer heavy (Nvidia claims 36 integer operations for every 100 floating-point operations), Nvidia chose this layout that favors floating-point throughput. In the real world of mixed workloads in games, you're going to land somewhere between those 0% and 100% figures, with Nvidia claiming a maximum of a 70% improvement over Turing.
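
If you want to play with that mix, here's a toy issue-rate model for one SM partition, assuming (as described above) Turing issues 16 FP32 + 16 INT32 per clock and Ampere issues 16 FP32 plus 16 FP32-or-INT32. It ignores clocks, memory, SM counts and everything else, so don't expect it to line up exactly with Nvidia's 1.7x marketing figure:

Code:
# f = fraction of instructions that are INT32 (0.0 = pure FP32, 1.0 = pure INT32).

def turing_ops_per_clock(f):
    # Whichever pipe saturates first caps the 16 FP32 + 16 INT32 partition.
    return 16 / max(f, 1 - f)

def ampere_ops_per_clock(f):
    # FP32 can spill onto the flexible datapath; INT32 stays capped at 16/clock.
    return 32.0 if f == 0 else min(32.0, 16 / f)

for label, f in [("pure FP32", 0.0),
                 ("36 INT per 100 FP", 36 / 136),
                 ("pure INT32", 1.0)]:
    gain = ampere_ops_per_clock(f) / turing_ops_per_clock(f)
    print(f"{label:>18}: {gain:.2f}x")                  # 2.00x, 1.47x, 1.00x

Same story in numbers: a doubling only in the pure FP32 corner, no gain at all for pure INT32, and a realistic game mix landing somewhere in between.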
 
The RTX 3090 had over twice as many shaders as the RTX 2080 Ti, but certainly didn't perform twice as fast.

I agree. Even if you could develop a driver that could dish out draw calls fast enough, there's a limit to how well the scheduler on the GPU can keep the cores fed. While more cores handle more complex objects better, most objects aren't that complex. When you have 100 trees in the background, you need them to be simple, so scenes are composed of many low- to moderate-complexity objects. This is also part of what variable rate shading improvements address. As a result, most draw calls underutilize the GPU's full potential.

This is part of the genius of Unreal's new engine. It reduces the complexity of a scene dramatically in software by eliminating unseen detail and simplifying the rest. The math behind it is genius. Mesh reduction has never been a simple CS problem, yet they reduced it to a simple O(n*n) problem, where n represents the number of layers of reduction.
 
Time will tell if it will indeed bring 2x the performance as the leakers have claimed. Historically, I've never seen a next-gen GPU deliver 2x the performance. The massive step up in recent years was really from Maxwell to Pascal. Turing did not bring a massive jump in performance, while Ampere delivered higher performance at the expense of a much higher power requirement.
 

When they quote that number they usually mean raw compute figures (GFLOPS). But that's meaningless unless you can keep the engines fed. Polaris had the same issue.
 
And then I realized I glossed over the detail when I was looking at Ampere's whitepaper.

But whenever you're done, can we get back to what I was pointing out originally? Claiming we'll see a 2x performance increase from just a 75% increase in cores (or whatever it is) doesn't seem likely.
 

Clock increase + IPC increase? But again, those are raw compute numbers; reality is a different matter. The RX 480 (Polaris) blew away the GTX 1060 (Pascal) in raw compute, but they were really neck and neck when it came to games.
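
For reference, a quick sanity check on those raw compute numbers, using the advertised shader counts and boost clocks (peak FP32 = shaders x 2 FLOPs per clock via FMA):

Code:
# Peak FP32 throughput from shader count and boost clock (2 FLOPs/clock via FMA).
def fp32_gflops(shaders, boost_mhz):
    return shaders * 2 * boost_mhz / 1000

print(f"RX 480   (2304 SPs   @ 1266 MHz): {fp32_gflops(2304, 1266):.0f} GFLOPS")  # ~5834
print(f"GTX 1060 (1280 cores @ 1708 MHz): {fp32_gflops(1280, 1708):.0f} GFLOPS")  # ~4372

Roughly a third more raw compute for Polaris on paper, yet the two traded blows in actual games, which is exactly the point about keeping the engines fed.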
 
But whenever you're done, can we get back to what I was pointing out originally? Claiming we'll see a 2x performance increase from just a 75% increase in cores (or whatever it is) doesn't seem likely.
They didn't claim or predict that would happen. All they said is it could be possible, which is an accurate statement. Other rumors have claimed Ada would be twice as fast as Ampere. I highly doubt it will be, but that's probably why this article chose to focus on a 2x performance improvement and how it could be achieved.

Ampere did get pretty close to double the performance of Turing in certain in-game ray tracing benchmarks, with increases in the low 90% range. I think double the ray tracing performance is a possibility, and maybe that's what the leaks were referencing.
 
"Not had any problems with Ray Tracing on my 3080. "

Let me rephrase: I mean not running with aggressive upscaling at 4K in games like Dying Light 2 or CP2077.

Because even the most factory overclocked 3090 struggles with that. I have one.

To me, required upscaling = not really being able to handle ray tracing, no matter how good DLSS actually is in Quality mode.