News Xbox Series X: 12 Teraflops of GPU Performance Confirmed, More Details Revealed

bwana

Thank you. So it looks like the Xbox is very close to the Titan in single precision. But why is the 2080 Ti so low in tensor performance if its CUDA count is on par with the others?

bit_user

why is the 2080 Ti so low in tensor performance if its CUDA count is on par with the others?
Because Nvidia intentionally nerfed it. The same GPU delivers 2x the fp16-multiply/fp32-accumulate performance in the Titan RTX and equivalent Quadro RTX model. They just didn't want people buying gaming cards for AI training workloads, which is the main purpose of that feature. So, they cut the throughput of that particular instruction in half.

However, if you compare the fp16-multiply/fp16-accumulate performance (not shown in that table), they're on par. That's for inference, which is used to accelerate things like global illumination ray tracing. So, they kept it at full performance.

There's a wealth of information buried in this page: https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units
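
If you want to see why the accumulator width matters, here's a toy numpy sketch; it only models the per-add rounding, not actual Tensor Core hardware:

import numpy as np

# Toy model of accumulator precision, not of Tensor Core internals.
# Summing many fp16 products into an fp16 accumulator rounds on every
# add; an fp32 accumulator keeps the running total exact to ~7 digits.
rng = np.random.default_rng(0)
a = rng.standard_normal(10_000).astype(np.float16)
b = rng.standard_normal(10_000).astype(np.float16)
products = a * b                      # fp16 multiplies

acc16 = np.float16(0.0)
for p in products:                    # fp16 accumulate (inference-style)
    acc16 = acc16 + p                 # result rounds back to fp16

acc32 = products.astype(np.float32).sum()   # fp32 accumulate (training-style)
print("fp16 accumulate:", float(acc16))
print("fp32 accumulate:", float(acc32))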

Giroro

Thank you. So it looks like the Xbox is very close to the Titan in single precision. But why is the 2080 Ti so low in tensor performance if its CUDA count is on par with the others?

Nvidia likes to reduce certain features (especially Double Precision) in their gaming cards, so that customers who need them, like professionals and data centers, have to upgrade to the far more expensive Quadro line. Tensor performance isn't really that important to gaming right now compared to how important it is to Nvidia's big AI customers.

bit_user

Nvidia likes to reduce certain features (especially Double Precision) in their gaming cards, so that customers who need them, like professionals and data centers, have to upgrade to the far more expensive Quadro line.
Fun fact: they haven't done that since Kepler. Since then, the consumer GPUs simply haven't had the hardware on die for more fp64 performance. In the case of the Titan V, their only data center GPU to reach consumers since then, they kept fp64 performance at full speed.

The Titan RTX is not built on a proper datacenter GPU; it's just an uncrippled version of the RTX 2080 Ti, which is a consumer GPU with no more than token fp64.

BTW, AMD's Radeon VII is built on a datacenter GPU, and AMD crippled its fp64 to 1/4th of the native capability. Even after that, it's still the fastest fp64 you can get below the $3000 Titan V.

hannibal

It'll be interesting to see how big a part of that 12 teraflops of computational power comes from the ray-tracing hardware...
It's possible that this has less rasterisation power than the 5700 but still more computational power!

Giroro

Fun fact: they haven't done that since Kepler. Since then, the consumer GPUs simply haven't had the hardware on die for more fp64 performance. In the case of the Titan V, their only data center GPU to reach consumers since then, they kept fp64 performance at full speed.

The Titan RTX is not built on a proper datacenter GPU; it's just an uncrippled version of the RTX 2080 Ti, which is a consumer GPU with no more than token fp64.

BTW, AMD's Radeon VII is built on a datacenter GPU, and AMD crippled its fp64 to 1/4th of the native capability. Even after that, it's still the fastest fp64 you can get below the $3000 Titan V.

If it really is the case that the Tensor and RT cores weren't left over from their datacenter GPUs... then I really can't explain why they wasted so much of the TU102 die space and power consumption on them.

bit_user

If it really is the case that the Tensor and RT cores weren't left over from their datacenter GPUs... then I really can't explain why they wasted so much of the TU102 die space and power consumption on them.
See for yourself, there's no Tesla card with a TU102:

https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#Tesla

Quadro RTX? Yes, it's in the 6000 and 8000 cards. You can put them in servers, but they're mainly workstation-oriented cards.

If servers were a big market for the TU102, there should be a Tesla model - like the Tesla P40, which featured the GP102 (of GTX 1080 Ti fame).

Giroro

See for yourself, there's no Tesla card with a TU102:

https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#Tesla

Quadro RTX? Yes, it's in the 6000 and 8000 cards. You can put them in servers, but they're mainly workstation-oriented cards.

If servers were a big market for the TU102, there should be a Tesla model - like the Tesla P40, which featured the GP102 (of GTX 1080 Ti fame).

Interesting; I guess I was off base, but I'm no expert in server or even workstation GPUs.
To me, Nvidia's handling of RTX features has felt like they've been scraping to find ways to sell datacenter features to gamers, not the other way around.
Or maybe render farms and AI don't need DP? That's not in my realm of experience.

alextheblue

If it really is the case that the Tensor and RT cores weren't left over from their datacenter GPUs... then I really can't explain why they wasted so much of the TU102 die space and power consumption on them.
Like Bit said, they're a repurposed workstation design. They decided they could push those features into the gaming space, especially with the aid of Developer Bucks™. In their estimation this was a better move than putting together a new chip for the high-end gaming market. I think they were right, even if I feel RT is not incredibly useful below the 2080 (due to the performance hit).
BTW, AMD's Radeon VII is built on a datacenter GPU, and AMD crippled its fp64 to 1/4th of the native capability. Even after that, it's still the fastest fp64 you can get below the $3000 Titan V.
Actually, it's half of the native capability: the VII's FP64 rate is 1/4 of its FP32 rate, while the native Vega 20 silicon runs FP64 at 1/2 the FP32 rate.
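
Putting rough numbers on it (back-of-envelope Python; the fp32 figure is the commonly cited boost-clock peak, so treat it as approximate):

# Back-of-envelope peak rates for Vega 20 / Radeon VII.
fp32_tflops = 13.8               # commonly cited Radeon VII fp32 peak
native_fp64 = fp32_tflops / 2    # Vega 20 silicon: fp64 at 1/2 the fp32 rate
vii_fp64 = fp32_tflops / 4       # Radeon VII: capped at 1/4 the fp32 rate
print(native_fp64, vii_fp64)     # ~6.9 vs ~3.45 TFLOPS: the cap is half of native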

bit_user

Eh, the Vega 56 has 10.5 TFLOPS and the 64 had a shade under 13 (the 2070 comes in at 7.5). Not really a measure of a GPU's power in gaming.
True, but within a product line it is. Also, note that the 2070 Super peaks at 9 TFLOPS.

Anyway, it's probably reasonable to compare a 12 TFLOPS RDNA2 GPU against the 9.8 TFLOPS RDNA RX 5700 XT. That assumes that RDNA2 is at least as efficient as RDNA, and that they roughly scale up memory bandwidth to match. Two ways they could add bandwidth are by going to a 384-bit bus, like the Xbox One X, or adding some in-package memory, like the original Xbox One.
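
For the curious, the peak-rate arithmetic behind that comparison, as a quick Python sketch (the 12-TFLOPS configuration is made up to hit the number, not a leaked spec):

def fp32_tflops(shaders: int, clock_ghz: float) -> float:
    # peak fp32 = 2 FLOPs per shader per clock (fused multiply-add)
    return 2 * shaders * clock_ghz / 1000.0

def bandwidth_gbs(bus_bits: int, gbps_per_pin: float) -> float:
    # bytes/s = (bus width in bits / 8) x per-pin data rate
    return bus_bits / 8 * gbps_per_pin

# RX 5700 XT: 2560 shaders, ~1.905 GHz boost, 256-bit GDDR6 @ 14 Gbps
print(fp32_tflops(2560, 1.905), bandwidth_gbs(256, 14))  # ~9.75 TFLOPS, 448 GB/s

# Hypothetical 12-TFLOPS part on a 384-bit bus (shader count and clock
# are illustrative guesses that happen to reach 12 TFLOPS)
print(fp32_tflops(3328, 1.805), bandwidth_gbs(384, 14))  # ~12.0 TFLOPS, 672 GB/s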

bit_user

To me, Nvidia's handling of RTX features has felt like they've been scraping to find ways to sell datacenter features to gamers, not the other way around.
Oh, I totally agree. They were definitely stretching to find justifications to put Tensor Cores in gaming GPUs.

Or maybe render farms and AI don't need DP? That's not in my realm of experience.
I think the distinction we're tripping over is that the datacenter market has fragmented. For inferencing, AI can use 8-bit or 4-bit, and people are even trying 1-bit (not very successfully, AFAIK). I've never heard of deep learning using fp64, for either inferencing or training. For most inferencing scenarios, 32-bit and potentially even 16-bit are overkill, although training is a different story.

Meanwhile, traditional HPC continues to need fp64, while also starting to take advantage of AI.
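
As a toy illustration of why inference tolerates narrow types, here's a minimal symmetric int8 quantization sketch in numpy (real frameworks add calibration, per-channel scales, etc.):

import numpy as np

# Toy post-training quantization with a single per-tensor scale.
rng = np.random.default_rng(1)
w = rng.standard_normal((4, 4)).astype(np.float32)   # pretend fp32 weights

scale = np.abs(w).max() / 127                 # map the largest weight to 127
w_q = np.round(w / scale).astype(np.int8)     # 8-bit storage / compute
w_dq = w_q.astype(np.float32) * scale         # dequantized approximation

print("max abs error:", np.abs(w - w_dq).max())   # small vs. weight magnitudes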
True, but within a product line it is. Also, note that the 2070 Super peaks at 9 TFLOPS.

Anyway, it's probably reasonable to compare a 12 TFLOPS RDNA2 GPU against the 9.8 TFLOPS RDNA RX 5700 XT. That assumes that RDNA2 is at least as efficient as RDNA, and that they roughly scale up memory bandwidth to match. Two ways they could add bandwidth are by going to a 384-bit bus, like the Xbox One X, or adding some in-package memory, like the original Xbox One.
It's a semi-custom chip though, so even though the arch is the same, it's not the same product line and won't have the same layout.

bit_user

It's a semi-custom chip though, so even though the arch is the same, it's not the same product line and won't have the same layout.
Uh, probably the compute units and even higher-level blocks are the same as those destined for AMD's mainstream GPU line. Note it's a semi-custom chip, not full-custom. Of course, being a different generation will mean differences at that level between it and the first-gen RDNA products.

Where you'll see differences vs. RDNA2 dGPUs is in how they're connected to the memory subsystem(s).

Anyway, I stand by my earlier claim that performance should scale relative to first-gen RDNA, if not better (i.e. due to things like variable-rate shading).
Uh, probably the compute units and even higher-level blocks are the same as those destined for AMD's mainstream GPU line. Note it's a semi-custom chip, not full-custom. Of course, being a different generation will mean differences at that level between it and the first-gen RDNA products.

Where you'll see differences vs. RDNA2 dGPUs is in how they're connected to the memory subsystem(s).

Anyway, I stand by my earlier claim that performance should scale relative to first-gen RDNA, if not better (i.e. due to things like variable-rate shading).
Well, performance should be better than a like-for-like part anyway, due to console optimisation.