News Nvidia Unveils Its Next-Generation 7nm Ampere A100 GPU for Data Centers, and It's Absolutely Massive

@JarredWaltonGPU , thanks for the coverage, as always.

...but,
For workloads that use TF32, the A100 can provide 312 TFLOPS of compute power in a single chip. That's up to 20X faster than the V100's 15.7 TFLOPS of FP32 performance, but that's not an entirely fair comparison since TF32 isn't quite the same as FP32.
Right, a better comparison would be with the V100's fp16 tensor TFLOPS. Comparing it with general-purpose FP32 TFLOPS is apples vs. oranges, whereas TF32 vs fp16 tensor TFLOPS is maybe apples vs. pears.
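To put rough numbers on that, here is a minimal sketch of the two comparisons. The 312 and 15.7 TFLOPS figures come from the article quote above; the 125 TFLOPS fp16 tensor figure for the V100 is not stated in this thread and is assumed here from Nvidia's published V100 peak spec. The names are just for illustration.

```python
# Peak throughput figures in TFLOPS.
A100_TF32_TENSOR = 312.0  # from the article quote above
V100_FP32 = 15.7          # from the article quote above
V100_FP16_TENSOR = 125.0  # assumption: Nvidia's published V100 tensor peak, not from this thread


def speedup(new_tflops: float, old_tflops: float) -> float:
    """Return how many times larger the new peak figure is than the old one."""
    return new_tflops / old_tflops


if __name__ == "__main__":
    # Apples vs. oranges: tensor TF32 against general-purpose FP32.
    print(f"TF32 tensor vs. FP32:        {speedup(A100_TF32_TENSOR, V100_FP32):.1f}x")
    # Apples vs. pears: tensor TF32 against tensor fp16.
    print(f"TF32 tensor vs. fp16 tensor: {speedup(A100_TF32_TENSOR, V100_FP16_TENSOR):.1f}x")
```

Under that assumption the tensor-to-tensor ratio comes out near 2.5X rather than 20X, which is why the choice of baseline matters so much here.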

Nvidia says elsewhere that the third-generation Tensor cores support FP64, which is where the 2.5X increase in performance comes from, but details are scarce. Assuming the FP64 figure allows similar scaling as previous Nvidia GPUs, the A100 will have twice the FP32 throughput and four times the FP16 performance.
Um... don't assume that! Tensor cores are very specialized and support only a few, specific operations. You want to look at their spec for general-purpose fp64 TFLOPS.

It seems likely the A100 will follow a similar path and leave RT cores for other Ampere GPUs.
Agreed. Ray tracing is a different market segment. Even talking about photorealistic rendering for movies and such, it would make sense for them to take a page out of the Turing playbook and use the gaming GPUs for that.

the A100 can be has multi-GPU instancing that allows it to be partitioned into seven separate instances.
rats. You almost said "can has". I'd have enjoyed that.
; )
 
Nvidia has posted additional details, which I'm currently parsing so I can update the article. There's a LOT to digest this morning!
 
@JarredWaltonGPU
Isn't ECC a native feature on HBM2? If so then why would you need those extra stacks if not for redundancy?
I'm not positive on this, but ECC is an optional feature for HBM2 I think. I've found reports saying Samsung's chips have it... but is it always on? That's unclear. I have to think the sixth chip is simply for redundancy, which is still pretty nuts. Like, yields are so bad that we're going to stuff on six HBM2 stacks and hope the interfaces and everything else for at least five of them work properly? I asked for clarification on the six stacks -- TWICE -- but my question somehow got missed.
 
"And, of course, as a native feature of HBM2, it offers ECC memory protection."
https://www.anandtech.com/show/15777/amd-reveals-radeon-pro-vii (bottom of the 6th paragraph)
I just remembered where I had read the ECC reference to HBM2.

Crazy idea, but instead of redundancy, could they use a stack as a large L3/L4 cache of some sort?
 
So I just got an answer from an Nvidia post-keynote roundtable. It didn't really answer anything, of course, but I was told, "A100 supports up to six HBM2 stacks as you see in our images, but we are currently only shipping products with five HBM2 stacks. There is no stack for ECC." Which didn't actually say if it's a dummy chip, or disabled, or if the product has ECC, or anything else. But I've pulled the ECC reference in the article and clarified on the stacks part.
 
@JarredWaltonGPU
Isn't ECC a native feature on HBM2? If so then why would you need those extra stacks if not for redundancy?
Look at how the Titan V had one of its 4 stacks disabled. That suggests to me that Nvidia had some serious yield problems with the HBM2 on the V100s, either the RAM or the connectivity to it. So, maybe the 6th stack is just there in case one of them is bad.

After yields improve, maybe Nvidia will even start selling a variant with the 6th stack enabled, for extra bandwidth and capacity.
 
So I just got an answer from an Nvidia post-keynote roundtable. It didn't really answer anything, of course, but I was told, "A100 supports up to six HBM2 stacks as you see in our images, but we are currently only shipping products with five HBM2 stacks. There is no stack for ECC." Which didn't actually say if it's a dummy chip, or disabled, or if the product has ECC, or anything else. But I've pulled the ECC reference in the article and clarified on the stacks part.
The current V100 in the DGX-2 supports ECC. What is not known is whether the HBM ECC is inline (unlikely) or native/sideband, where a small section of memory is reserved for ECC. I don't expect the A100 systems to be significantly different.
 
Look at how the Titan V had one of its 4 stacks disabled. That suggests to me that Nvidia had some serious yield problems with the HBM2 on the V100s, either the RAM or the connectivity to it. So, maybe the 6th stack is just there in case one of them is bad.

After yields improve, maybe Nvidia will even start selling a variant with the 6th stack enabled, for extra bandwidth and capacity.
Massive die, new (to them) process node... so yeah, the fully enabled GPU is not going to be here for a while. I would imagine the 8 stacks will be used for up to 48 or 64GB in a refresh later.
 
I'm not positive on this, but ECC is an optional feature for HBM2 I think. I've found reports saying Samsung's chips have it... but is it always on? That's unclear. I have to think the sixth chip is simply for redundancy, which is still pretty nuts. Like, yields are so bad that we're going to stuff on six HBM2 stacks and hope the interfaces and everything else for at least five of them work properly? I asked for clarification on the six stacks -- TWICE -- but my question somehow got missed.
https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Page 22 talks about HBM ECC
 
NVIDIA: "here is your gpu for supercomputers and whatever, just take it and go"
Consumers: "but what about us daddy??"
NVIDIA: "dont worry kid, I'm not done with you, never will, because u are so naive and will probably buy anything I throw at you"
Also NVIDIA: "Commander Jensen, execute order 66 and eliminate competition, we need more money to remove AMD"
 
The Ampere A100 isn't going into the RTX 3080 Ti or any other consumer graphics cards. Instead, it's a powerhouse built for the next generation of exascale supercomputers and deep learning.

I swear, whenever Jensen says "the more you buy, the more you save", I picture Scrooge McDuck or something jumping into his pile of gold. That said, as someone who unfortunately bought into the RTX line of cards immediately (thankfully I only purchased the RTX 2060), I won't be doing the same with Ampere regardless of what the outcome looks like. AMD has restored my faith in them with how well Ryzen has done and even the first batch of Navi. I'll be waiting to see what they offer and then make my choice. I hope everyone else does the same.
 
Look at how the Titan V had one of its 4 stacks disabled. That suggests to me that Nvidia had some serious yield problems with the HBM2 on the V100s, either the RAM or the connectivity to it. So, maybe the 6th stack is just there in case one of them is bad.

After yields improve, maybe Nvidia will even start selling a variant with the 6th stack enabled, for extra bandwidth and capacity.

In another article I read, the author said they asked Nvidia and kept getting a vague answer along the lines of "we are only selling 5 stack products at this time". So it's very likely they are going to at some point, if for none of the reasons you mentioned then just for more money.
 
Massive die, new (to them) process node... so yeah, the fully enabled GPU is not going to be here for a while. I would imagine the 8 stacks will be used for up to 48 or 64GB in a refresh later.
I doubt there's more than 1 unused memory channel on that die, and I'd imagine they equipped it with as many channels as they thought it would ultimately need.

In fact, it can (eventually) get both more speed and capacity without adding channels:
 
One thing we have to note: if GA102 gets 115 SMs, or 7360 CUDA cores, we can expect the Quadro RTX 8000/6000 and Titan offerings to get the fully enabled chip, while the RTX 3080 Ti may get 100 SMs, or 6400 CUDA cores, and the RTX 3080 may get 4480 CUDA cores. This is all guesswork on my part, based on the core count jump from GV100 to GA100, which is a 60% increase, and it assumes full chip utilization and the entire lineup being treated equally.
 
It will be interesting to see what happens with GA102. It's not really possible to determine what GA102 will have just by looking at GA100. And looking back at Pascal can lead to some poor assumptions.

The GP100 was the data center / deep learning version, and it had up to 60 SMs total with 64 CUDA cores per SM (plus 32 FP64 CUDA cores). That's 3840 total FP32 CUDA cores. GP102 by comparison had up to 30 SMs with 128 FP32 CUDA cores per SM. It had the same total number of FP32 cores (3840), but got there in a very different fashion. So you might think GA102 will do the same thing relative to GA100 -- and it might -- but they're going to be very different beasts.

GA102 will of necessity have RT cores -- there's no way Nvidia can walk back ray tracing support. In fact, the RT cores are supposedly enhanced and could be up to 4X faster than the Turing RT cores, which probably means they're going to use a lot more die space. I'd assume the enhanced Ampere Tensor cores will also make an appearance, though that might not be the case. If they're the same ratio as in GA100, though, the GA102 SMs will ditch FP64 and add RT, which might not be much of a difference in size when all is said and done. And that's the problem.

There's no way Nvidia is going to make an 800mm square 'consumer' GPU on 7nm right now. It's just way too expensive. GA100 is going to have a price of roughly $18,750 per GA100 card based on the $199,000 asking price of the DGX A100. ($50K for all the server stuff, $150K for the eight GA100 parts). GP100 was a 610mm square part with a very high price, but GP102 was only 471mm square -- and still went into $1000+ graphics cards.

So, given the massive size of GA100, I suspect Nvidia is going to be aiming closer to 500mm square or less on GA102. And to get there, it will have to trim down the SM and core counts. Where GA100 has up to 128 SMs, I think 80 SMs in GA102 seems far more plausible, giving a maximum of 5120 cores. Clocks could be higher, though, and power will of necessity be less than 300W (and probably 250W for the RTX 3080 Ti). That's my bet, anyway.
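For anyone who wants to check the arithmetic in this post, here's a quick sketch. The SM counts, cores per SM, and dollar figures are the ones quoted above; the 64 FP32 cores per SM for GA100 and the guessed GA102 is implied by the 80 SM / 5120 core estimate rather than stated outright, and the helper names are made up for illustration.

```python
def fp32_cores(sm_count: int, cores_per_sm: int) -> int:
    """Total FP32 CUDA cores for a chip with the given SM layout."""
    return sm_count * cores_per_sm


def per_gpu_price(gpu_budget: float, gpu_count: int) -> float:
    """Rough per-GPU price when a system's GPU budget is split evenly."""
    return gpu_budget / gpu_count


if __name__ == "__main__":
    # (SM count, FP32 cores per SM) as quoted or guessed in the post above.
    chips = {
        "GP100": (60, 64),
        "GP102": (30, 128),
        "GA100 (full)": (128, 64),    # 64 per SM is inferred, not quoted
        "GA102 (guess)": (80, 64),    # speculative, per the post
    }
    for name, (sms, per_sm) in chips.items():
        print(f"{name}: {fp32_cores(sms, per_sm)} FP32 cores")

    # DGX A100: roughly $150K of the $199,000 price attributed to eight GPUs.
    print(f"Estimated price per GA100: ${per_gpu_price(150_000, 8):,.0f}")
```

It shows how GP100 and GP102 land on the same core total through different SM layouts, which is the reason GA100's layout alone doesn't pin down GA102.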
 