Google's 'Cloud TPU' Does Both Training And Inference, Already 50% Faster Than Nvidia Tesla V100

Lucian Armasu · May 17, 2017

At Google I/O 2017, Google revealed its next-generation Tensor Processing Unit, called the Cloud TPU. The chip is able to perform both training and inference computation, unlike the first generation TPU, and it has much higher performance.

redgarl · May 17, 2017

Nvidia is selling shovel ware. They are selling an image of breakthrough while they are far from being even key players outside the GPU market.

mavikt · May 17, 2017

But, can it play Crysis?

bit_user · May 17, 2017

redgarl :

Well, they're the leading GPU vendor, and machine learning is earning them quite a bit of revenue. But this was predictable - as long as they're still building general-purpose GPUs, they're going to be at a disadvantage relative to anyone building dedicated machine learning ASICs.

I wish we knew power dissipation, die sizes, and what fab node. I'll bet Google's new TPU has smaller die size and burns less power. From their perspective, it'd probably be too risky to build dies as huge as the GV100.

mavikt :

Well, no, it can't run Crysis. Perhaps it can play it, with a bit of practice.

hahmed330 · May 17, 2017

It's four 45 DL TFLOPS chips grouped together to produce an aggregate 180 DL TFLOPS. Nvidia's Tesla V100 is a single chip producing 120 DL TFLOPS, so no: a Tesla V100 is a more powerful chip than a single TPU2.
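
The per-chip arithmetic behind this comparison can be sketched as follows. This is only a quick check that uses the figures quoted in the post; the variable names are made up for illustration.

```python
# Figures quoted in the post above (deep-learning TFLOPS).
TPU2_CHIPS_PER_DEVICE = 4
TPU2_TFLOPS_PER_CHIP = 45
V100_TFLOPS = 120

# Aggregate throughput of one 4-chip TPU2 device.
device_tflops = TPU2_CHIPS_PER_DEVICE * TPU2_TFLOPS_PER_CHIP

# How one V100 compares with a single TPU2 chip, and how the
# whole TPU2 device compares with one V100.
per_chip_ratio = V100_TFLOPS / TPU2_TFLOPS_PER_CHIP
device_ratio = device_tflops / V100_TFLOPS

print(f"TPU2 device: {device_tflops} TFLOPS")
print(f"V100 vs one TPU2 chip: {per_chip_ratio:.2f}x")
print(f"TPU2 device vs one V100: {device_ratio:.2f}x")
```

Both headline claims come from the same numbers: the device as a whole is ahead of one V100, while a single V100 is ahead of a single TPU2 chip.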

aldaia · May 18, 2017

hahmed330 :

Technically speaking, a V100 has 9 chips: 1 GPU and 8 HBM2 chips (and each HBM2 is actually 4 stacked chips).
Anyway, the number of chips in a package is totally irrelevant. Chip size alone is only good for bragging rights. What is relevant is the total size, power consumption, performance and cost of the package.
Right now we only know the performance, but I bet that the 4-chip TPU2 solution is significantly cheaper than a single V100, consumes significantly less power and, by the looks of it, uses less space.
The 4 TPU2 chips could probably be implemented as a single chip that would still be smaller than the Nvidia one, but that would not improve performance or power consumption, while it would increase cost and require a more complex cooling solution.

hahmed330 · May 18, 2017

aldaia :

Actually, it is pretty much guaranteed that TPU2 has HBM2, since there are no DIMMs on the board and the first TPU had DDR3. They are not as power efficient as you think they are, since now they are doing deep learning as well.

"Chip size alone is only good for bragging rights"

1. Rubbish. A Tesla V100 is 2.6 times faster per chip than a TPU2, so spending more die area on ALUs is worth it. By the way, the next generation of the digital world is the use of deep learning, analytics and simulation simultaneously in a single application. The combined horsepower of a DGX-1V is capable of doing that, unlike TPU2, so Nvidia is the furthest along in that game.
2. You cannot scale efficiently across multiple chips for almost all neural networks. Think of Crysis: it would almost always run faster on a single 12-teraflop chip than on four 3-teraflop chips using SLI... A lot of the computation in deep learning done on a single neural network is quite sensitive to inter-node communication.
3. Just 4 TPU2s in a single node, really, Google??? Nvidia puts 8 V100s, each 2.6 times faster, in a single node. This will have dramatically bad effects on the scalability of "TPU2 Devices", since matching a single DGX-1V node would take 6 TPU2 nodes. This defeats your main argument: it won't be cost effective at all, especially since they are going to use HBM2.
4. DGX-1Vs are prebuilt server racks; the client wouldn't need to worry about the cooling.
5. "DGX-1V" and "TPU2 Device" are both a single node, so this is a fair comparison.

TPU2s are alright for small neural networks and for those which do not require a lot of intra- and inter-node communication between the processors. It is possible the TPU2 has better perf/power than the Tesla V100, but TPU2 is a disappointment, since it's going to take a long time to get TPU3: ASICs take a lot of time and money to design.

Source for HBM2:

http://www.eetimes.com/document.asp?doc_id=1331753&page_number=2

renz496 · May 18, 2017

So does Google intend to sell this hardware to other companies, or just provide a cloud service? But the Google TPU beating the Nvidia Tesla V100 is probably not that unexpected, because ultimately the Tesla V100 is more of a general solution for compute needs (the Tesla V100 also functions as an FP64 accelerator, for example).

aldaia · May 18, 2017

hahmed330 :

I agree on HBM. That is actually quite evident from the pictures and, if you read Google's paper to be published at ISCA, it's also quite evident from their repeated comments about memory bandwidth being the main bottleneck.

Regarding single-chip performance, I insist: it's nice to say "mine is bigger than yours", but it's totally irrelevant for datacenters. Datacenter priorities are:

1 Performance/power
We don't know the real numbers, but according to some estimates a 4-chip package may use (in the worst case) the same power as a single Tesla V100, so at least 50% better performance/power.

2 Performance/$
You seem to assume that cost is proportional to the number of chips; however, building one big chip is way, way, way more expensive than building 4 with the same total aggregate area. We may never know the real cost for Google (Google is very secretive about that), but I have no doubt that 6 TPU nodes cost (much) less than $149,000. Note: I follow your assumption that a single DGX-1V has the performance of 6 TPU nodes, but actually a DGX-1V is closer to 5 TPU nodes than to 6.

3 Performance/volume
One TPU pod contains 64 TPU nodes @ 180 TOPS and uses the space of 2 racks.
Fujitsu is building a supercomputer based on DGX-1 nodes. 12 nodes also use 2 racks of similar size.
12 DGX-1Vs contain 96 V100s @ 120 TOPS.
So it's a tie here; both have the same performance/volume efficiency.
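
The numbers behind the three points above can be checked with a short sketch. It uses only the throughput figures quoted in this thread and takes the "same power as a single V100" worst case as given; it is a back-of-the-envelope check, not a measurement.

```python
# Throughput figures quoted in the thread (TOPS).
TPU_NODE_TOPS = 180      # one 4-chip TPU2 node
V100_TOPS = 120          # one Tesla V100
V100_PER_DGX1V = 8       # GPUs in one DGX-1V

# 1) Performance/power: if a TPU2 node drew the same power as one V100
#    (the worst case assumed above), its perf/W advantage would be:
perf_per_watt_gain = TPU_NODE_TOPS / V100_TOPS - 1

# 2) Performance/$: how many TPU2 nodes match one DGX-1V in raw throughput.
dgx_tops = V100_PER_DGX1V * V100_TOPS
nodes_per_dgx = dgx_tops / TPU_NODE_TOPS

# 3) Performance/volume: two racks of each kind.
pod_tops = 64 * TPU_NODE_TOPS        # one TPU pod
dgx_racks_tops = 12 * dgx_tops       # 12 DGX-1V systems

print(f"perf/W gain at equal power: {perf_per_watt_gain:.0%}")
print(f"TPU2 nodes per DGX-1V: {nodes_per_dgx:.2f}")
print(f"two racks: TPU pod {pod_tops} TOPS vs DGX-1V {dgx_racks_tops} TOPS")
```

The second figure is where "closer to 5 than to 6" comes from, and the third shows why the two-rack comparison is a tie.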

You also assume that the computations involved with neural networks don't scale well with nodes (using Crysis as an example is a joke) and that TPUs are only good for small neural nets. Do you really think that Google's neural nets are small? Actually, the computations mostly involve dense linear algebra matrix operations (this is why GPUs are well suited to them). Linear algebra matrix operations scale amazingly well with the number of nodes. Supercomputers with thousands, even millions, of nodes have been used for decades to run such applications. Neural networks are no exception: they can be run on traditional CPU-based supercomputers and achieve nearly linear speed-up with the number of nodes. The reason they are not run on thousands of CPU nodes is not performance scalability. It's actually because of:
1 Performance/power
2 Performance/$
3 Performance/volume

All that being said, I agree that GPUs are more flexible than TPUs, in the same way that CPUs are more flexible than GPUs. Specialization is the only way of improving the aforementioned points 1, 2 and 3, but you pay a cost in flexibility. The best hardware (CPU, GPU or TPU) really depends on your particular needs, but if you are as big as Google and run thousands or millions of instances 24/7, specialization is the way to go.

redgarl · May 18, 2017

bit_user :

That they are selling more GPUs is irrelevant. They have nothing more to offer than GPUs; even if they are talking about all this AI, self-driving cars, programming courses, etc., they are still only selling GPUs.

renz496 · May 18, 2017

redgarl :

And yet they are very successful at it despite "only" selling GPUs.

hahmed330 · May 19, 2017

aldaia :

"You also assume that the computations involved with neural networks don't scale well with nodes"
"Linear algebra matrix operations scale amazingly well with number of nodes"

Neural networks come in a huge number of varieties and sizes. While a single machine learning model can be scaled across multiple nodes, a singular giant neural network is not easily scaled across multiple nodes efficiently. Linear algebra matrix operations do scale well as long as there is enough internodal and intranodal bandwidth, which is a huge issue in supercomputers and when running a single neural network on multiple nodes. Internodal communication is fast enough, while intranodal is not fast enough, nor is it very power friendly.

Perf/Power
The worst-case scenario is actually unknown, since the chip is night-and-day different from TPU1: doing training and inferencing on a single chip drastically alters the scenario. The worst case would probably be 125 watts, due to the advent of HBM2 at 900 GB/s per TPU2; the best case, 75 watts per TPU2 with 768 GB/s; the most probable, 100 watts per TPU2 with 768 GB/s. When you consider the total TDP of a node, don't forget the TDP of the nodal interconnect, nodal intraconnect, storage, etc. This is actually one of the reasons why supercomputers have lower power efficiency than off-the-shelf PCs.
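
For a rough feel of what these three guesses would mean, the sketch below turns them into TFLOPS per watt using the 45 TFLOPS per TPU2 chip quoted earlier in the thread. The 300 W figure for the V100 is an assumption added here for comparison (it is not stated in the thread), and interconnect and storage power are ignored entirely.

```python
TPU2_TFLOPS_PER_CHIP = 45   # quoted earlier in the thread
V100_TFLOPS = 120           # quoted earlier in the thread
V100_ASSUMED_WATTS = 300    # ASSUMPTION: not stated anywhere in the thread

# The three per-chip power guesses from the post above.
scenarios_watts = {"best": 75, "probable": 100, "worst": 125}

# Chip-level efficiency for each guess (no interconnect, storage or cooling).
tflops_per_watt = {
    name: TPU2_TFLOPS_PER_CHIP / watts
    for name, watts in scenarios_watts.items()
}
v100_tflops_per_watt = V100_TFLOPS / V100_ASSUMED_WATTS

for name, value in tflops_per_watt.items():
    print(f"TPU2 {name}: {value:.2f} TFLOPS/W")
print(f"V100 (assumed {V100_ASSUMED_WATTS} W): {v100_tflops_per_watt:.2f} TFLOPS/W")
```

Under these assumptions the outcome flips depending on which guess is right, which is why the actual power numbers matter so much to this argument.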

Scalability of nvidia's GPUs:
All of the talk above becomes purely academic, since Nvidia could decide 3 months from today, "Hey! We will sell a smaller version of our monstrous chip right now!" Scaling a large GPU down to a smaller version is just too easy, whereas scaling a smaller custom ASIC up to a larger one is ridiculously hard (you might as well design ASIC3). Also, the P100 had a smaller cousin known as the P40... I will bet a mountain that a few months from today there will be a V40 that is faster than TPU2 and much, much smaller than the V100:

You remove the FP64 units; reduce the number of general-purpose FP32 units by half; use GDDR6 instead of HBM2; have twice or four times as many tensor cores per SM; halve the number of texture processing units; halve the number of ROPs, etc.
Voila! V40

aldaia · May 20, 2017

hahmed330 :

"Neural networks come in huge amount of varieties and sizes"
Yes, but the underlying computation behind most neural networks is matrix multiply, and the ones that cannot be mapped onto matrix multiplications are totally irrelevant for this discussion. Tensor cores can do only one thing: matrix multiply. The basic operation performed by the V100 when using tensor cores is a 16x16x16 block matrix multiply. Matrix multiply has an interesting property: while the input/output data is of the order of NxN, the computations required are of the order of NxNxN. So the amount of computation per data movement is really large, and it therefore scales very well with the number of nodes even if inter-node bandwidth is relatively low.
So, either the neural network:
A) Requires mostly matrix multiplications -> then it scales perfectly with smaller nodes
B) Requires something else -> then the tensor cores sit idle
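
The NxN versus NxNxN property can be made concrete with a small sketch. It assumes the simplest possible accounting: square matrices, each element of the two inputs and the output moved exactly once, and one multiply-accumulate per (i, j, k) triple. Real hardware with caches and blocking will differ; the function name is made up for illustration.

```python
def matmul_intensity(n: int) -> float:
    """Multiply-accumulates per matrix element moved for C = A x B,
    where A, B and C are all n x n (idealized: each element moved once)."""
    macs = n ** 3                 # one MAC per (i, j, k) triple
    elements_moved = 3 * n ** 2   # read A, read B, write C
    return macs / elements_moved  # simplifies to n / 3


for n in (16, 256, 4096):
    print(f"N={n}: {matmul_intensity(n):.1f} MACs per element moved")
```

The ratio grows linearly with N, so larger blocks need proportionally less bandwidth per unit of compute, which is the basis of the scaling argument above.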

Regarding power, a TPU pod (64 nodes) is equivalent in performance to two racks with 12 DGX-1Vs. To match their power, a TPU node would need to be rated at 600 W. Just by looking at the picture it's clear that a node uses much less than that, so there is ample power left for communications, disks and so on. By the way, the Fujitsu supercomputer also has disks and communication besides the DGX-1 nodes.
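
The post does not show how the 600 W figure is derived. One way to arrive at it, sketched below, is to assume roughly 3,200 W per DGX-1 system; that number is an assumption introduced here only because it reproduces the result, and it is not stated anywhere in the thread.

```python
DGX_SYSTEMS = 12             # DGX-1V systems in two racks (from the thread)
TPU_NODES_PER_POD = 64       # TPU2 nodes in one pod (from the thread)
DGX_ASSUMED_WATTS = 3200     # ASSUMPTION: per-system power, not from the thread

# Total power of the two DGX racks, then spread evenly over the TPU nodes
# to get the per-node budget at which both setups would draw the same power.
dgx_total_watts = DGX_SYSTEMS * DGX_ASSUMED_WATTS
watts_per_tpu_node = dgx_total_watts / TPU_NODES_PER_POD

print(f"DGX-1V racks: {dgx_total_watts} W total")
print(f"Break-even budget per TPU2 node: {watts_per_tpu_node:.0f} W")
```

Any TPU2 node drawing less than that break-even budget would give the pod better performance per watt than the two DGX racks.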

"You remove the FP64s; Reduce the number of General Purpose FP32 by half; Use GDDR6 instead of HBM2; Have twice or four times as many tensor cores per SM; Half the number of texture processing units; half the number of ROPs e.t.c
Voila! V40"

Even better, remove all FP64, remove all FP32, remove all texture units, remove all ROPs, and remove anything that resembles a GPU, leaving just memory and tensor cores, keep HBM2 (Google already determined that bandwidth is important). And voila, you have custom hardware as specialized as a TPU. No need to waste power and silicon on hardware you don't need ;-)

hahmed330 · May 20, 2017

aldaia :

"Even better, remove all FP64, remove all FP32, remove all texture units, remove all rops, and remove anything that resembles a GPU, leaving just memory and tensor cores, keep HBM2"

Here's the thing: you need general-purpose FP32 units, SFUs, LD/ST units, L0 cache, even an H.265 encoder, etc. alongside the very large number of tensor cores. Without them, these cores wouldn't be able to do as much as before, which reduces the use cases for these chips, such as:

Intelligent video analytics in AI-based surveillance systems
Robotics, which requires heavy use of vision computing
etc.

The thing is, Nvidia has to sell these chips, whereas Google has to reduce its costs, so Nvidia has to cover a range of markets with a few chips.
I think AMD and Nvidia are the only ones who can simultaneously do AI-optimised simulation (e.g. AI-optimised ray tracing), deep learning of the whole process (having massive arrays of many simulations and then AutoML-ing them continuously, though that's just one option), and then deep-learning-based visual analytics representing the whole model (because clients need to understand the data easily, and visual representation is the best way).

" Just by looking at the picture its clear that a node uses much less"
Would you please expand on that??

aldaia · May 21, 2017

hahmed330 :

Not really, just a rough estimate based on personal experience in datacenters, judging by the apparent size of the board relative to the components (plus, some sites show a picture of what seems to be an Open Compute-sized chassis with 4 nodes per row). I would rate the board at around 250-350 W, and certainly no more than 450 W.

hahmed330 · May 21, 2017

aldaia :

You are probably right about the power consumption. I would say 300-350 watts for the 4 TPU2s combined, which would correspond to your estimates. The TCO of the entire pod and the nodes is where the matter would truly be resolved, but these TPU2s are not meant to be sold; maybe only when hell freezes over can we get those pesky numbers.

What kind of servers did you work on in data centres? Garden-variety x86, or some HPC-based server with a GPU/CPU mix plotting world domination?

aldaia · May 21, 2017

hahmed330 :

HPC. Our datacenter has had a supercomputer in the Top500 since 2004.

hahmed330 · May 22, 2017

aldaia :

I found a heck of a lot of information about the TPU2 device hyper mesh:

Each TPU2 is consuming 200 watts!!!!

https://www.nextplatform.com/2017/05/22/hood-googles-tpu2-machine-learning-clusters/

That is, I don't think Google is deploying this TPU2-based infrastructure in its current topology for its cloud business; it's much more likely for research, specifically for improving scalability.

aldaia · May 23, 2017

hahmed330 :

Hey thanks a lot for sharing that info. A lot of interesting stuff there.

I must note that 200 W (actually they lean more towards 250 W) per TPU2 chip is their estimate. The actual values only Google knows, and maybe they will disclose them in 1 or 2 years. Of course their estimate is much more reasoned than my "eyeball" estimate, but I think they are off by a factor of two (I admit that I base my opinion mostly on intuition, and Google may prove me wrong anytime, lol).
However, Next Platform makes several assumptions that are just speculation or even plain wrong:
1- Doubling performance means doubling energy. That is only true if the internal organization stays the same (I'm OK with that) AND the process technology also stays the same. The first TPU was 28 nm; we don't know yet if TPU2 stayed at 28 nm or moved to 20-14 nm (I discard 10 nm since it's too recent).
2- 16-bit floating point requires 2x the energy of 16-bit ints. That is plain wrong. The basic operation performed by the TPUs is MAC (multiply-accumulate), which is dominated by the energy of the multiplication. A floating-point multiplication does not require 2x the energy of an integer multiplication; it's more in the range of 1 to 1.1x.
3- Google will never over-provision the heat sink, so heat sink size is directly related to power consumption. Actually, the first TPU already set a precedent: while it consumed 40 W, its TDP was 75 W (the heat sink is sized for TDP). They over-provisioned by a factor of almost 2x.
4- They also speculate about 36 kilowatts per rack to enable 250 watt power delivery to each TPU2 socket. 36 kW is plausible, but that is very tight; it's usual to over-provision power delivery by a significant margin.

Example: our supercomputer is rated at 1 MW power consumption. It is estimated that 20% is due to cooling (though I'm not sure if it is included in that 1 MW) and another 20% is due to the interconnection network and storage. That leaves 600 kW (800 at most) for the 3K compute nodes. So each compute node consumes about 200-250 W, yet every 2 compute nodes share a 900 W power supply: about 2x over-provisioning.
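
The supercomputer example can be worked through as follows. This sketch takes "3K" as exactly 3,000 nodes and treats the two readings of the cooling share (inside or outside the 1 MW) as the low and high cases; all variable names are illustrative.

```python
TOTAL_POWER_W = 1_000_000    # rated power of the machine
COOLING_PCT = 20             # share attributed to cooling
NETWORK_STORAGE_PCT = 20     # share attributed to interconnect and storage
COMPUTE_NODES = 3000         # "3K" taken as exactly 3,000
PSU_W = 900                  # one power supply...
NODES_PER_PSU = 2            # ...shared by two compute nodes

# Low case: cooling is part of the 1 MW. High case: it is not.
compute_w_low = TOTAL_POWER_W * (100 - COOLING_PCT - NETWORK_STORAGE_PCT) // 100
compute_w_high = TOTAL_POWER_W * (100 - NETWORK_STORAGE_PCT) // 100

node_w_low = compute_w_low / COMPUTE_NODES
node_w_high = compute_w_high / COMPUTE_NODES

# Supply capacity available to each node, and the resulting headroom.
psu_w_per_node = PSU_W / NODES_PER_PSU
headroom_max = psu_w_per_node / node_w_low
headroom_min = psu_w_per_node / node_w_high

print(f"per-node draw: {node_w_low:.0f} to {node_w_high:.0f} W")
print(f"PSU capacity per node: {psu_w_per_node:.0f} W")
print(f"over-provisioning: {headroom_min:.2f}x to {headroom_max:.2f}x")
```

The high case lands slightly above the rounded 250 W quoted above, but either way the supply capacity per node is roughly twice the estimated draw.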

hahmed330 · May 24, 2017

aldaia :

TPU2 is probably at 28 nm, probably due to time constraints at Google, especially since designing a custom ASIC that can do both inferencing and training (which can be very complex), unlike TPU1, is a highly time-consuming and difficult task. Working on a different manufacturing process altogether would increase the time needed even further.
"16 bit floating point requires 2x the energy of 16 bit int": this really depends on the architecture itself and its flexibility.
Even over-provisioning itself can be a nightmare if you measure power on a millisecond basis instead of a second basis.
Does your supercomputer use any exotic cooling methods, e.g. mineral oil cooling?

TJohn · May 24, 2017

Good writing habits include defining acronyms the first time they are used.

TJohn · May 24, 2017

"The Tensor Processing Unit (TPU) is an 8-bit matrix multiply engine, driven with Complex Instruction Set Computing (CISC) instructions by the host processor across a Peripheral Component Interconnect Express (PCIe) 3.0 bus. It is manufactured on a 28 nm process with a die size ≤ 331 mm². The clock speed is 700 MHz and it has a thermal design power of 28-40 W. It has 28 MiB of on-chip memory, and 4 MiB of 32-bit accumulators taking the results of a 256x256 array of 8-bit multipliers. Instructions transfer data to or from the host, perform matrix multiplies or convolutions, and apply activation functions."
