Chip Fights: Nvidia Takes Issue With Intel's Deep Learning Benchmarks

Status
Not open for further replies.

bit_user

Polypheme
Ambassador
Thanks for the article. I appreciate your coverage of Deep Learning, Lucian, and generally find your articles to be both well-written and accessible. A couple minor points, though...

First, please try to clarify which Xeon Phi product you mean. I wish Intel had called the new chip Xeon Theta, or something, but they didn't. Xeon Phi names a product line, which now has two generations. The old generation is codenamed Knights Corner, and the new one is Knights Landing. Intel is referring to the new one, while Nvidia is probably referring to the old one, as Knights Landing isn't yet publicly available.

four-year-old Kepler-based Titan X
Titan X is neither Kepler-based nor four years old.

Intel's paper references a "Titan" supercomputer, containing 32 K20s, which are Kepler-based. The paper also mentions K80 GPUs (also Kepler-based). They don't appear to compare themselves to a Titan-series graphics card at any point (not least because it's a consumer product, and much cheaper than Nvidia's Tesla GPUs).

It added that a single one of its latest DGX-1 “supercomputer in a box” is slightly faster than 21 Xeon Phi
Which generation? It might be the same thing Intel did - comparing their latest against 4-year-old hardware. Given that the new Xeon Phi generation hasn't yet launched, this seems likely.

It’s likely that Xeon Phi is still quite behind GPU systems when it comes to deep learning, in both the performance and software support dimensions.
The new Xeon Phi did take one significant step backward, which is dropping fast fp16 support - something they even added to their Gen9 HD Graphics GPU. Knights Corner had it, but it doesn't appear to exist in AVX-512 (which is what Knights Landing uses).

Anyway, one way to see through the smokescreen of each company's PR is to simply look at the specs. Both Knights Landing and the GP100 have 16 GB of HBM2-class memory (although Knights Landing has an additional 6-channel DDR4 interface). The rated floating-point performance is 3/6 and 4.7/9.3 TFLOPS (double/single-precision), respectively. So, I'm expecting something on the order of a 2-3x advantage for Nvidia (partly due to their superior fp16). But maybe Intel can close that gap, if they can harness their strong integer performance on the problem.
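As a rough back-of-envelope using those rated figures (and taking GP100's fp16 throughput as roughly 2x its fp32 rate, per Nvidia's own numbers): 9.3 / 6 ≈ 1.6x at single precision, and (2 x 9.3) / 6 ≈ 3.1x where fp16 suffices - hence the 2-3x ballpark.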

One thing is for sure: this beef isn't new, and it's not going away anytime soon. Intel has been comparing Xeon Phi to Tesla GPUs since Knights Corner launched, and Nvidia has been making counter-claims ever since.
 
G

Guest

Guest
I love it when the mud slinging starts. Means lower prices and higher performance. Ahhh, the beauty of capitalism. Why can't more people see it? Competition is good, folks.
 

ZolaIII

Distinguished
Sep 26, 2013
178
0
18,690
Both of them are just bitching at each other to sell their bad, overpriced, ineffective products!

Atom cores aren't even competitive in the mobile space; that is why Intel retreated from that market. ARM's Cortex-A72/A73 are much better, and the A32 is the most efficient Cortex-A ever; ARM even compares it to its latest microcontroller-class M series on efficiency. They are also available as POP IP, ready-to-print designs. Still, general-purpose cores are not suitable for deep learning because of their design. The only architectural advantage the Xeon Phi has over any other single-die piece of compute hardware to date is in executing many tasks simultaneously, simply because it has so many cores. It is possible that Intel used highly optimized math libraries for the benchmarking, but that still isn't a game changer.
GPUs also aren't the best architectural match for deep learning, because they are graphics processors first and foremost. I would grant that they could be used as commercial, already-available hardware, but that simply isn't the case, because consumer cards ship with deliberately crippled compute performance. Although I would like to see how the latest ARM Mali generation performs.
DSPs are, for the time being, the most suitable for the task. The new ones from Synopsys, Tensilica, and CEVA (possibly others I am not aware of) that have been redesigned for deep learning workloads are far more efficient (5-8 times), cheaper to produce, available for custom implementations (IP and POP IP blocks), come with ready-to-use, non-proprietary open-source software for visual computing and deep learning, and are also reprogrammable. ASICs simply cannot fit the purpose, because they are basically hard-wired functions with no flexibility whatsoever; even though they are the most cost/performance effective, they still aren't suited to something like deep learning (because tomorrow you may need to perform other functions, or an altered algorithm for the same or another purpose). FPGAs, especially new ones with a couple of general-purpose cores and DSPs (you can throw in a basic GPU too, if you need it) plus huge programmable gate arrays, are the best fit for general, all-around accelerators covering visual processing, deep learning, and simultaneous multiprocessing. The problem with them is that they are neither cheap (few designs, if any, are offered as IP) nor fast or easy to reprogram.
Just my two cents, without going into details, to make some sense of this hollow, empty discussion about CPUs/GPUs and deep learning.
 

bit_user

Polypheme
Ambassador
Basically, using big neural networks capable of higher-level abstractions. Not human-level consciousness or structured thought, but certainly human-level pattern recognition and classification.

It was facilitated mainly by GPUs and big data. The industry is already starting to move past GPUs, into specialized ASICs and neural processing engines, however. Big data is a significant part of it, given that it takes a huge amount of data to effectively train a big neural network.
 

bit_user

Polypheme
Ambassador
They certainly aren't bad, if you need the flexibility. Intel's is a general-purpose x86-64 chip that can run off-the-shelf OS and software (so, yeah, it can run Crysis).

In contrast, Nvidia's GPUs and Tesla accelerators (technically, their GP100 isn't a GPU, since it lacks any graphics-specific engines) must be programmed via CUDA, and are applicable to only a narrower set of tasks.

Your error is extrapolating from mobile and old-generation Atom cores to HPC and the new Atom cores. These have 4 threads and two 512-bit vector engines per core. With so many cores, SMT is important for hiding memory access latency, yet A72/A73 have just one thread per core. Regarding vector arithmetic, ARMv8 has only 128-bit vector instructions, and those cores probably have just one vector unit each. Also, I don't know how many issue slots the Knights Landing cores have, but the A72 is 3-way, while the A73 is only 2-way.

I can't comment much on FPGAs, but they won't have 16 GB of HBM2 memory. Deep learning is very sensitive to memory bandwidth. And if you need full floating-point, I also question whether you can fit as many units as either the GP100 or Knights Landing pack.
 

Gzerble

Commendable
May 21, 2016
4
0
1,510
Deep learning is loosely based on neural networks; it's an iterative method of stochastic optimization. These have achieved some very good results in things computers are traditionally very bad at by finding a "cheap" solution that is close enough to accurate (object recognition being a clear example; the latest software of the Kinect is obviously based on this). This can be abstracted to plenty of single precision mathematical operations over a decent chunk of memory; a domain in which GPUs excel. Caffe, the software they are using, is the most popular deep learning software out there. It has been optimized to work very well with CUDA, which is why Nvidia cards are so very popular with the deep learning community.
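To make the "lots of single-precision math over a decent chunk of memory" point concrete, here is a minimal sketch in plain C (purely illustrative, not Caffe's or any vendor's actual code) of a fully-connected layer's forward pass. Convolutions, once lowered to matrix form, reduce to the same thing: huge numbers of fp32 multiply-accumulates, which is exactly what wide SIMD units and GPUs are built to churn through.

#include <stddef.h>

/* Minimal fully-connected forward pass: out[i] = bias[i] + sum_j W[i][j] * in[j].
   The inner-loop multiply-accumulate is what dominates deep-learning runtime. */
void fc_forward(const float *W, const float *in, const float *bias,
                float *out, size_t n_out, size_t n_in)
{
    for (size_t i = 0; i < n_out; i++) {
        float acc = bias[i];
        for (size_t j = 0; j < n_in; j++)
            acc += W[i * n_in + j] * in[j];   /* fp32 MAC */
        out[i] = acc;
    }
}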

Now, the Xeon Phi has a lot working for it; Intel are about to release a 72 core behemoth, with each core capable of four threads, and AVX512 operations which allow insane parallelization over what will now be extremely high performance memory. Caffe is obviously not optimized for this architecture yet, but it is obvious that Intel have put a lot of thought into this market, and it seems that the Xeon Phi may offer a higher throughput per watt.

As for people thinking that ARM chips are suited to these kinds of workloads: their memory controller is too weak, they don't have the right instruction sets for this kind of work, and you need an entire platform for this kind of work - and ARM's weakness is that each platform is custom built, rather than having a standardized offer which can promise long term support and service. ARM servers are not for these kinds of applications.
 

bit_user

Polypheme
Ambassador
I might be mistaken, but I think Nvidia's GP100 has 224 cores, each with one 512-bit vector engine. Unlike Knights Landing, I think these are in-order cores. Traditionally, GPU threads weren't as closely bound to cores as in CPUs, but I need to refresh myself on Maxwell/Pascal.

Well, there is an OpenCL fork of Caffe, but their whitepaper (see article, for link) says they used an internal, development version of Caffe. I'm going to speculate that it's not using the OpenCL backend.

I doubt that. It has the baggage of 6-channel DDR4, it lacks meaningful fp16 support, has lower peak FLOPS throughput, and its cores are more optimized for general-purpose software. I'm not a hater, though. Just being realistic.

Machine learning doesn't play to the advantages Knights Landing has over a GPU, as it mostly boils down to vast numbers of dot-products. It's quite a good fit for GPUs, really. Even better than bitcoin mining, I'd say.

There are a few chips with comparable numbers of ARM cores, and scaled-up memory subsystems. But I'm not aware of any with HBM2 or HMC. Nor do they have extra-wide vector pipelines (I don't even know if ARM would let you do that, and still call it an ARM core).

I see Knights Landing as a response not only to the threat from GPUs, but also from the upcoming 50-100 core ARM chips. Already, we can say that it won't kill off the GPUs. However, it might grab a big chunk of the market targeted by the many-core ARMs.

Yeah, definitely amusing. What we need is some good, independent benchmarks. Especially for Capitalism to work as well as Andy Chow would like to believe it does. Not to knock capitalism, but you can't really point to this he-said/she-said flurry of PR-driven misinformation as any kind of exemplar.
 

Zerstorer1

Commendable
Aug 17, 2016
5
0
1,510
This is all classical computing, which is fruitless for deep learning. You need quantum computing to build these neural nets. Why do you think Google, NASA, the NSA, and Apple are using D-Wave quantum computers, which scale vastly, for real deep learning, which is actually just another name for artificial intelligence?
 

ZolaIII

Distinguished
Sep 26, 2013
178
0
18,690
"Your error is extrapolating from mobile and old-generation Atom cores to HPC and the new Atom cores. These have 4 threads and two 512-bit vector engines per core."
It's basically an old Atom with 4 threads per core and two AVX SIMD engines. So let's assume it's 2x the size of an old Atom, while quad hyper-threading can give it up to 60% more performance (the assumption is based on similar technology from Imagination Tech). "Regarding vector arithmetic, ARMv8 has only 128-bit vector instructions, and those have probably just 1 per core." Actually, there are two. The old Atom was bigger than a Cortex-A57 and less power efficient, while it had similar performance if you don't count SIMD. Now the A73 is approximately half the size of the A57 while offering 30% more performance. The rest of the math is simple: 4x130=520 (although they don't scale like that in single-task SMP use, it's safe to say they will stay around the 400 mark) vs. 160%, and now even SIMD performance should be close enough. With cache cluster coherence, a much wider ABI, and the much better interconnect that ARM now offers, you will find that these and other performance aspects are much better now. A73s also have more L1 cache than Atoms. There is also a high-bandwidth port on them for connecting external accelerators. So, all in all, they are much better. A quad-cluster A73 will cost less, consume less energy, and deliver much better general-purpose performance, while being close in SIMD performance, compared to a single "new" Atom.
As I stated before, neither of them is really suitable for deep learning, and the traditional software stack doesn't play any role here, as the parameters are still not defined here.
Still, if you look at things from the other direction: you can license ARM cores and pair them with the new generation of Mali (which is now pretty much on par with what Nvidia offers from an architectural standpoint, while still being a better design for low power consumption), throw in a DSP (or anything else you please, via the link I already mentioned), and all of it will actually have cache coherency and full HSA; you can see how this is actually a big win. Not to mention that you can architect a system to your needs (which is a major thing for big clients with specific needs, like Google for instance).
The funniest thing is that even your home desktop GPU uses a DSP for its audio and video decoding/encoding engine (for instance, AMD uses Tensilica ones). That tells you more than enough about how GPUs aren't the best way to go for even the simplest highly parallel tasks. Intel iGPUs use ASICs, but that didn't exactly prove future-proof... HBM is not tied to any particular architecture; implementing it is tied only to consortium membership, and I don't think it's particularly hard to get one.

P.S. By now you've probably also heard that Intel came to a similar conclusion and licensed ARM IP, so enough with this dumb discussion already.
 

bit_user

Polypheme
Ambassador
Need? Plenty of people are using GPUs, FPGAs, and ASICs. So, no, you don't need quantum computers - but you might want them. For training, anyway.
 

Zerstorer1

Commendable
Aug 17, 2016
5
0
1,510


https://youtu.be/PUlYV--lLAA
 

bit_user

Polypheme
Ambassador
AVX-512. Twice as wide as AVX, and 4x as wide as ARM's NEON. That's a 4x advantage, right there.

But we are counting SIMD. That's where Knights Landing, GP100, and most GPUs get the floating point performance that makes them so good at deep learning!
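To put a number on that width gap, here's a hedged sketch in C intrinsics (illustrative only, not either vendor's library code): a single AVX-512 fused multiply-add operates on 16 packed floats, while a single NEON FMA operates on 4.

/* One fused multiply-add per ISA, showing the 4x vector-width difference.
   Build the x86 half with -mavx512f; the ARM half targets ARMv8 NEON. */
#if defined(__AVX512F__)
#include <immintrin.h>
__m512 fma16(__m512 a, __m512 b, __m512 c)
{
    return _mm512_fmadd_ps(a, b, c);   /* 16 single-precision MACs per instruction */
}
#elif defined(__ARM_NEON)
#include <arm_neon.h>
float32x4_t fma4(float32x4_t a, float32x4_t b, float32x4_t c)
{
    return vfmaq_f32(c, a, b);         /* 4 single-precision MACs: c + a*b */
}
#endif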

But these are the accelerators.

First, why would you put ARM cores in your GPU? Second, Mali isn't designed to scale this large.

I'm sure that's actually a case where they needed audio IP and the ability to do it in hard-realtime, and didn't want to engineer that into their GPU's software stack. It's not that the GPU has any lack of horsepower. But, if you can afford to drop in some special-purpose hardware, then it not only takes care of the realtime audio processing requirement, but is also a cheap way to add capacity to the GPU (i.e. by offloading those tasks from it).

Update: check out the new article on TrueAudio Next. This is a good use of the GPU's horsepower, for audio. More to the point, it seems like they sorted out the hard-realtime issues, using the CU reservation feature.

Of course, but show me a FPGA with 16 GB of embedded DRAM, and then we'll talk about deep learning with it.

The first words you quoted from me were:
Your error is extrapolating from mobile
So, I repeat them here. Apples and Oranges.
 

bit_user

Polypheme
Ambassador
Thanks, but I get that. Like I said, quantum computers will be great for training neural nets.

However, we'll still need fast conventional computers (like these) for most things we do today, including application of neural nets trained on quantum computers.
 

ZolaIII

Distinguished
Sep 26, 2013
178
0
18,690
"AVX-512. Twice as wide as AVX, and 4x as wide as ARM's NEON. That's a 4x advantage, right there." nope! 4x per core & same per quad core cluster ([128x2]x4). that is the same size as one Atom. But honestly NEON SIMD is not on pair with Intel SIMD's but I think gap is smaller now then it whose before.

"But we are counting SIMD. That's where Knights Landing, GP100, and most GPUs get the floating point performance that makes them so good at deep learning!"
Actually, it makes them sound horrible!
Honestly, why on earth would you use them when you can pair a single ARM core with this:
http://www.bdti.com/InsideDSP/2016/07/05/Synopsys
You will get at least about 40x the performance for the same power consumption, compared to an Atom. And you won't need a large number of general-purpose ARM cores - just a couple of them.

"First, why would you put ARM cores in your GPU? Second, Mali isn't designed to scale this large."
Because I need a basic GPU anyway.
The second reason: if I could choose, I would prefer one that is also well suited to, and capable of, additional offloading for vision and deep learning tasks, while remaining really small and power efficient. I don't need it to scale up really large. Actually, MP32 is more than enough.
http://www.anandtech.com/show/10375/arm-unveils-bifrost-and-mali-g71/4

"Of course, but show me a FPGA with 16 GB of embedded DRAM, and then we'll talk about deep learning with it."
Both Altera & Xilinh have them for some time now with many more consumers available (high end) products soon to come as more and more manufacturers start mass production of HBM2 chips.
http://www.anandtech.com/show/10527/sk-hynix-adds-hbm2-4-gb-memory-q3

Sure, GPUs can process vision and audio, but they are far less efficient at it than DSPs.

Your error is extrapolating from the classic desktop!
We don't need high-and-mighty general-purpose cores any more; just good enough (and highly efficient) will do. What we do need are multi-purpose (capable), high-and-mighty accelerators.
Just so you know, current servers (which are built on the same classic philosophy) threaten to soon consume more electrical energy than is produced in the world, so something must be done quickly. I just tried to point out a path for resolving this problem that is cheap, globally available now, and acceptable in production cost (IP, with an accent on POP IP).
That would be all. Anything else?
 

bit_user

Polypheme
Ambassador
I don't see how you can say that without gate counts, which Intel hasn't released on these new cores.

How do you get 40x performance? The way they're counting GOPS, Knights Landing's cores would probably score about 600. So, even with their CNN engine, you wouldn't get 40x performance.

I think the actual difference between GPUs and DSPs is mostly in your head. GPUs are basically like massively multi-core DSPs.

I didn't think I was. Neither of these companies is trying to build "just good enough". They're trying to provide the most compute power for scientific and numerical workloads (financial modelling, etc.). They're good at deep learning, but not as good as a purpose-built chip might be. Their strength is in their versatility, and the ease of development and debugging, relative to what developers are used to. Intel actually has an edge, there, since you can use standard development and debugging tools, and many existing codebases.

That's a big problem, and it's good you're thinking about it. You're right that the growth in energy usage must level off. I heard someone estimated that if the current trend in energy use by computers were to continue, the earth's oceans would boil from all the waste heat, within a couple hundred years. Of course, we wouldn't have that much energy to power them, but it just goes to show that the current trend is unsustainable.

Clearly, what we need is to put our data centers in orbit!
; )
 

ZolaIII

Distinguished
Sep 26, 2013
178
0
18,690
"I don't see how you can say that without gate counts, which intel hasn't released on these new cores."
Call it a rough estimate based on the old Atom's size compared to the A57... Then again, I did actually explain it all before, so I won't be doing that again. How about you actually read what I wrote?

"How do you get 40x performance? The way they're counting GOPS, Knights Landing's cores would probably score about 600. So, even with their CNN engine, you wouldn't get 40x performance."
You didn't read again! It's 155 GOPS per core and 620 per quad cluster at 800 MHz, on a planar 28nm process, at 8.2 mm2 (all inclusive), while consuming about 2W. Try 800 MACs/cycle; nothing else even comes close. They also claim performance is up to 100x the previous EV5 generation, but I don't think so.
And where exactly are you pulling the Knights Landing data from?

"I think the actual differences between GPUs and DSPs is mostly in your head. GPUs are basically like massively multi-core DSPs."
At least I have it in my head. Thanks for showing me that you don't even know the architectural basics. I don't think you even know what a DSP actually is, or why they were developed in the first place, because if you did you wouldn't be writing nonsense about how a general-purpose CPU can be better than them (in any of the tasks we are arguing about here), or how they are the same as a GPU...

You are very, very ignorant!
I won't comment on the rest of what you wrote, as there is nothing really worth commenting on.
 

bit_user

Polypheme
Ambassador
Zola, we've been down this road, before. For the last time, I will not dignify ad hominem attacks with a response. If you can't make your points with solid data and sound arguments, then we're done.

Furthermore, you'd do well to consider that reasonable people can disagree, and you might actually be wrong about certain things. If you're not willing to entertain those possibilities, then you might find more productive uses of your time, and reduce frustration all around.

Yes, I did read all of your posts, in full. I simply don't accept your power and performance claims, based solely on extrapolations of old generation designs, using old process technology. So, if you don't have better data, then let's move on.

Yes, I did read, but they are rather vague about exactly how they compute GOPS, for instance. Anyway, they are clearly talking about 8-bit MACs, of which Knights Landing cores can each do 128/cycle. However, these run at 3x the clock speed, resulting in the equivalent of 384.

Regarding power, Knights Landing can run 72 cores @ 1.5 GHz + 16 GB of HBM + 6-channel DDR4 memory controller + 40 lanes of PCIe 3.0 + the cache-coherent bus tying it all together, in only 245 W. So, while we don't know exactly how many watts each core uses, it's likely in the vicinity of 2 W.

And, BTW, the 8.2 mm^2 figure you cited excludes FPU. Also, their vision processor doesn't support floating-point, while AVX-512 does. So, it might be roughly twice as fast at integer MACs as a Knights Landing core, but the latter offers more flexibility (and vastly better floating-point performance, if you need it). And I did say there might be faster ASICs dedicated to the task.

You might recall that your position was that it's bad, ineffective, and overpriced. My position is simply that it's neither bad, nor ineffective. I never said anything about cost-effectiveness, nor that it was the fastest machine learning accelerator chip that could be built with current technology. So, you can argue that hypothetical, but I think I've shown that Knights Landing (a real product, which one can actually buy and use with ease) is neither bad nor ineffective, relative to the standards you've raised.

Sorry, that was bad wording, on my part. So, let me rephrase that:

I don't understand the strong distinction you're drawing between DSPs and GPUs, and I think I do know a bit about each. Please enlighten us, if you wish.

That's entirely fair. You can drop any point, if you wish. It doesn't mean you concede, and nobody is keeping score, anyway.
 

ZolaIII

Distinguished
Sep 26, 2013
178
0
18,690
So, here we go.
"Yes, I did read all of your posts, in full. I simply don't accept your power and performance claims, based solely on extrapolations of old generation designs, using old process technology. So, if you don't have better data, then let's move on."
Then you have a problem with accepting things.
As a matter of fact, I have better data, and I am going to discredit you totally now, since you asked for it. I will also show that you have no idea what you are talking about.

"Yes, I did read, but they are rather vague about exactly how they compute GOPS, for instance. Anyway, they are clearly talking about 8-bit MACs, of which Knights Landing cores can each do 128/cycle. However, these run at 3x the clock speed, resulting in the equivalent of 384."
Yes they are & not only SIMD ones but combined.
Hate to brake it up to you but AVX 512 bit SIMDs on those Phi's don't support 8 bit precision they only supports single (32 bit) & duble (64 bit) ones (how ever it does support unaligned access but no one will use it) & neither is suitable for deep learning that actually uses 4,8 & 12 bit ones. So it's actually 16 (32 bit ones) per one 512 bit SIMD & 32 per Atom compared to 64 per 512 bit SIMD on that DSP & 256 for quad cluster one. That comes to 8x per clock. You did mention the 3x clock speed of Atoms (even it's not a case it's only 1.9x)& tried to say how that is equal to 3x performance but how ever you aren't counting in architectural disadvantages CISC have compared to DSP's there simultaneous combined MAC's per cycle come in play & they will how ever be vastly faster in any case. Even that AVX implementation on PHI doesn't support 8 & 16 bit (bite & word) precision (how ever their is AVX512BW extension so far implement only on some upcoming Skylark Xeons) it how ever have a patch instructions extension (PF) & that certainly helps a lot it's still a far cry from one find in this CNN block of this DSP.
Source:
https://en.m.wikipedia.org/wiki/Advanced_Vector_Extensions
This actually explains how the Phi could come close to Nvidia's new GPU offering (along with Nvidia dropping support for FF16-bit math). So you can see how neither Atoms nor Nvidia GPUs are exactly the best match for deep learning algorithms, judging by their architecture.

"Regarding power, Knights Landing can run 72 cores @ 1.5 GHz + 16 GB of HBM + 6-channel DDR4 memory controller + 40 lanes of PCIe 3.0 + the cache-coherent bus tying it all together, in only 245 W. So, while we don't know exactly how many watts each core uses, it's likely in the vicinity of 2 W."
Speculating again, aren't you? You really need to work on your math! It's more likely they use a bit over 3W per core, as HBM is really low on power consumption. Really, your math amounts to zero (3x clock speed and 2W per core)...

"And, BTW, the 8.2 mm^2 figure you cited excludes FPU" not really relevant in this case"

Now the calculations:
Let's say this DSP quad-core cluster, on the same advanced node (Intel's 14nm FinFET), uses about 45% of the energy of one on the 28nm LL planar process (assuming we change nothing else); that equals 585 mW, call it 600 mW to keep it fair. The Atom consumes 3W at 1.9 GHz, and that is more than 2x what it would consume running at 800 MHz... For the sake of calculation, let's say it would consume around 1200 mW at 800 MHz (and I am being generous here); that is actually twice as much, so we can draw the conclusion that it has 2x the gates, so it's twice the size, which is actually much worse than I first suggested. So 8x more DSP words per clock, times 2 for size, equals 16x, but that is only theoretical throughput. Now we add the architectural difference and say the DSP will have at least 150% of the performance of the SIMD/CISC Atom; that makes it 24x. That is much less than 40x, but there is no compiler or math library that can actually harvest all the capabilities of the Phi's AVX SIMD units, and that will significantly cripple its performance; the Synopsys compiler, on the contrary, is well optimized and can utilize the full potential of their DSP, so you easily come to a 40x index. If you use open-source compilers, which are about 2x slower, it rises to 80x.

"You might recall that your position was that it's bad, ineffective, and overpriced" so isn't it? You can quote me on that as much as you like in the future.

"My position is simply that it's neither bad, nor ineffective" so your position is deadly wrong. Seams somehow you are stuck with it.

After reconsidering all of this: Knights Landing is an actual product that is highly overpriced, inefficient, and one you don't want to have. Although they did make significant improvements compared to the previous generation.

"I don't understand the strong distinction you're drawing between DSPs and GPUs, and I think I do know a bit about each. Please enlighten us, if you wish".
I won't really go into any details here!
Although both of them employ SIMD ALU units, they are significantly different from an architectural standpoint. The most colorful way I can think of to describe it: GPUs are like one-way roads, while DSPs are two-way highways with lots of on-ramps and off-ramps along the way. So basically you can drive only one car in one direction on a GPU, while on a DSP you can drive a couple of different cars in different directions simultaneously, and they can jump in and out along the way much faster, to be replaced by others. This is much more efficient if you need it; if you don't, GPUs could win at basic, highly parallel tasks, but then again they won't, as they also contain other building blocks needed for graphics processing that specialized digital signal processors don't have. Besides, only about 3% of all tasks would benefit from such a high level of parallelism. Again, FPGAs are like an open field that you can program however you want (for any task), and helped by DSP blocks they can do MACs much more efficiently than ever before, so all in all they are the best possible choice for architecting anything (making your highways, and as many of them as you need).

I hope this is enough for you; if not, try to find some good scientific papers and read them. Although I didn't find any that use modern DSPs purpose-adapted for deep learning, you can certainly find ones that will give you a better overview of the architecture.

Best regards. As far as I'm concerned, this topic is closed.
 

bit_user

Polypheme
Ambassador
You're correct about the first part. However, rather than use 32-bit floats, one would likely use AVX2, which allows 32 x 2 bytes to be processed in parallel, per core. I checked and Knights Landing does support AVX2.
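For illustration, here's a hedged sketch in C (an assumed kernel step, not Intel's library code; real deep-learning kernels are more involved) of the usual AVX2 idiom for 8-bit multiply-accumulates: vpmaddubsw multiplies 32 pairs of 8-bit values and sums adjacent products in a single instruction.

#include <immintrin.h>

/* Illustrative AVX2 8-bit dot-product step. _mm256_maddubs_epi16 multiplies 32
   unsigned-by-signed 8-bit pairs and adds adjacent products into 16 signed 16-bit
   values; _mm256_madd_epi16 widens those into 8 x 32-bit partial sums. Build with -mavx2. */
__m256i dot8_accumulate(__m256i acc, __m256i a_u8, __m256i b_s8)
{
    __m256i prod16 = _mm256_maddubs_epi16(a_u8, b_s8);
    __m256i prod32 = _mm256_madd_epi16(prod16, _mm256_set1_epi16(1));
    return _mm256_add_epi32(acc, prod32);
}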

So, you accuse me of not reading things, but the EV6x designs (the ones including the 800 MAC/cycle option) are all 500 MHz. Go and see for yourself. So, 1.5 GHz / 500 MHz = the 3x multiplier I used.

The GP100 definitely does support fp16. I only said that Knights Landing lost it, relative to Knights Corner. Sad loss, but I can see why they wanted to move to a standard instruction set extension. I hope we'll get it back, in the next generation.

3W is definitely too high. You're forgetting the other things I mentioned, including the DDR4 controller. Also, the cache-coherent bus connecting all the tiles and memory controllers probably uses a significant amount of power, itself.

You're pretty close to being reported.

And where did you get that number?

Intel already has libraries optimized for it, and GCC already supports the new instructions. Most developers tuning their code for this chip will use the compiler intrinsics. However, if you read Intel's whitepaper, which is linked from the news article, they cite a version of Caffe that they've already optimized for it.

Intel does provide a compiler, so it's not a given that people will use open source compilers. However I'm curious where you got this "2x" number.

Okay, it seems we have a disagreement about fundamental semantics, here. You've created a strawman, in the form of the fastest special-purpose accelerator of which you can conceive. And through some mysterious arithmetic of your own, shown that it's sufficiently faster to satisfy your position. Which is where we get to the bit about semantics (i.e. "bad and ineffective" being open to interpretation).

I'm okay with whatever you'd like to believe. It seems whatever I argue, you're going to put the goal posts out of reach. And there's no way either of us can prove anything about hardware that doesn't exist. So, I guess that's it.

We agree on something!

Thanks for the interesting analogy. I don't happen to agree, but that's a completely different debate.

Since you expressed interest in energy-efficient computation, I'll leave you with this link: http://www.nextplatform.com/2015/03/12/the-little-chip-that-could-disrupt-exascale-computing/

I'm more than a bit skeptical of their claims. I think they're underestimating the threat posed by GPUs (or GPU-like chips, such as Nvidia's GP100), but their core idea of simplifying the hardware to the bare bones & letting software deal with caching and memory indirection is rather compelling. It's like taking the VLIW philosophy to another level. I expect that idea to grow some legs.
 

ZolaIII

Distinguished
Sep 26, 2013
178
0
18,690
FF16 stands for Fast Float, and no, the latest Nvidia generation doesn't support it.
Only the old first generation of AVX, the 128-bit one, had 16-bit support, and the upcoming generation should have it again, along with 8-bit. So I don't actually know what you read, but you didn't read it well, which is hard to manage when you're provided with a good source.

"Intel does provide a compiler, so it's not a given that people will use open source compilers. However I'm curious where you got this "2x" number."
Here you go, knock yourself out:
http://www.yeppp.info/benchmarks.html
There are open-source projects intended for Intel AVX, but they are mostly still works in progress. Yeppp! is mainly the work of one university professor and was never seriously backed, so it supports only basic functions; however, it does show the way forward for many libraries that are being worked on. That benchmark list also includes all of Intel's proprietary compilers to date, and no, they are still not fully capable of supporting all the capabilities of the AVX extensions found on the latest Phis.
Get a grip on yourself! Claiming that the memory controller (which isn't even in use), the PCIe interface, and 16GB of HBM will consume 100W is more than childish.

 

bit_user

Polypheme
Ambassador
I don't know what you have in mind, but here's what I'm talking about:
new half-precision instructions to deliver more than 21 teraflops of peak performance for deep learning
Source: http://nvidianews.nvidia.com/news/nvidia-launches-world-s-first-deep-learning-supercomputer

That's per-chip, BTW. I'm sure Intel wishes they'd put fp16 in AVX-512.

AVX and AVX2 are 256 bits. The SSE family of instructions is 128 bits. SSE1 introduced floating-point vector operations, while SSE2 extended the MMX family of integer vector instructions to 128 bits. The same thing happened with AVX - first they did floats, then AVX2 extended integer vector operations from 128-bit to 256-bit vectors.

Before you pointed it out, I wasn't aware that AVX-512 split up the instructions like that. In any case, Knights Landing supports AVX2, so anyone wishing to do integer vector operations just has to do them at 256 bits instead of 512.

That's a math library - not a compiler. And the fact that it exists for x86 makes it a rather moot point. As an aside, they only specify accuracy on log/exp, and aren't best in either case. So, whether one can actually use it depends on their accuracy requirements - it's not necessarily a "free" optimization. I see no accuracy data on their polynomial evaluation, and I wouldn't count on any compiler's autovectorization, anyway.

Why do you think the memory controller isn't in use? That's wrong. You can't use Knights Landing without external memory, and almost no one would. In HPC, 16 GB of RAM is tiny.

Secondly, I didn't say 100 W, I only ever said in the vicinity of 2 W, which I think is closer than your estimate of 3 W.

Third, you're omitting the cache-coherent bus needed to interconnect the 36 tiles, DDR4 controllers, HBM, PCIe, etc. The aggregate traffic it must carry is quite substantial, meaning it's got to use a decent chunk of power.

What are you even talking about?

Now, I just want to take a step back and ask: are you trying to convince me, or yourself? I've learned a couple things, in this exchange, but nothing that's changed my opinion. I dare say you're in roughly the same boat? Just thought I'd ask.
 

ZolaIII

Distinguished
Sep 26, 2013
178
0
18,690
FF16: Google it. It's even been discussed here before, by other people.
"That's a math library - not a compiler. And the fact that it exists for x86 makes it a rather moot point"
Sport, read a little better: it also supports ARM NEON SIMD, but that's not the point. The point is that the benchmark list gives you a comparison of different compilers, including the very expensive proprietary Intel ones as well as the open-source ones, all of that across a variety of hardware configurations and the variety of math libraries that go with them. Remember, you asked me how I know performance with open-source compilers would be half of what is achievable with Intel's proprietary (top) ones.

"Secondly, I didn't say 100 W, I only ever said in the vicinity of 2 W, which I think is closer than your estimate of 3 W." well 72x2=144; 246-144=102
So you actually did say 100W. On the contrary 72x3=216 which leaves 30W for everything else which is pretty much (more then) enough & if it's not then they have serious design flow with it.

The main thing for processing deep learning algorithms is 8-bit and 4-bit precision. Doing them in 16-bit is bad and inefficient; doing them in 32-bit is doubly so. If you can't understand that much, I cannot help you.

Well, I don't know about you, but this actually did make me re-examine my opinion of the Synopsys DSP as I presented it here. I was a little bit wrong about its efficiency at processing deep learning algorithms, guided mostly by its MAC output... Now I see that Tensilica's latest generation is actually better (as it supports 4-bit processing). But that is all that has changed in my opinion.

By the way, this is Parallella:
http://arstechnica.com/information-technology/2013/07/creating-a-99-parallel-computing-machine-is-just-as-hard-as-it-sounds/
and it was total bull, so the company went down. I don't have a problem with opinions; you have problems with basic math.
 