Intel Xeon Phi Knights Landing Now Shipping; Omni Path Update, Too

  • Thread starter: Guest
Status
Not open for further replies.
first processor with ... AVX-512 support.
BTW, its predecessor also had 512-bit vector instructions, but of a slightly different flavor. I think AVX-512 might be inferior, due to the lack of fp-16 support. This will certainly put it at a disadvantage vs. the GP100, for certain machine learning applications.

I want to see Intel put a 480+ core (20+ slice) GPU in a Socket-P, paired with an 8-core CPU and HBM. That would probably be more competitive with GPUs.
 
I saw that. Already fixed. Check the last-edited timestamp.

Anyway, what's with this?
Each quad-threaded Silvermont core features two AVX-512 VPU (Vector Processing Units) per core, for a total of 288 VPU.
If it's two per core, wouldn't it be 144 VPUs? And I think this would equate to 2304 "cores", as AMD and Nvidia would count them.
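The arithmetic behind those two numbers can be checked with a quick sketch. It assumes 72 cores (the figure quoted later in this thread) and that a GPU-style "core" means one 32-bit lane of a 512-bit vector unit; that lane definition is an assumption for illustration, not an Intel figure.

```python
# Figures from the thread: 72 cores, two AVX-512 VPUs per core.
CORES = 72
VPUS_PER_CORE = 2
VECTOR_BITS = 512
FP32_BITS = 32  # assumption: count single-precision lanes, GPU-vendor style


def vpu_count(cores, vpus_per_core):
    """Total vector units on the chip."""
    return cores * vpus_per_core


def lane_count(vpus, vector_bits, element_bits):
    """Total SIMD lanes, i.e. 'cores' in GPU marketing terms."""
    return vpus * (vector_bits // element_bits)


total_vpus = vpu_count(CORES, VPUS_PER_CORE)
gpu_style_cores = lane_count(total_vpus, VECTOR_BITS, FP32_BITS)
print(total_vpus, gpu_style_cores)
```

Counting 64-bit lanes instead would halve the second number.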

IMO, Intel missed the boat on machine learning. As you point out, Google is already beyond FPGAs. I predict they're already working on a hard-wired accelerator block, not unlike the QuickSync video accelerator block in their desktop & mobile GPUs.
 
Good catch, I got the number wrong :) Google is certainly nebulous, but it surely has something newer than its TPU in the works, or already deployed, as otherwise I doubt it would have shown the TPU.
 
I'm talking about Intel, though.

It seems AVX-512 is missing half-precision floats, though their Gen9 HD Graphics and Nvidia's Pascal (and Maxwell, for that matter) have it. So, that puts Knights Landing at a disadvantage vs. GPUs. Then, consider that Google & others are even beyond FPGAs, for certain things, which it doesn't even have. So, if I were Intel, I'd be scrambling to build a dedicated machine learning engine into my next generation of chips (see Qualcomm's Zeroth engine).

And, actually, the HD Graphics GPUs are pretty nice. So, I'd love to see them scale up & do more with that architecture.
 


I think that would be a good idea; it definitely should hardcode something a la QuickSync.
In some odd way Intel seems to be setting itself up to compete against itself on the machine learning side. It has the KNL out, but it also purchased Altera to bring the FPGA on-die with Xeons, and judging from the early silicon they are teasing, it is well underway. I see that as a competitor of sorts to the KNL...but perhaps they are just tailoring for specific audiences.
I have no idea how Zeroth managed to fly under my radar, some interesting reading for tonight!
 


It is interesting, but in all of the documentation Intel carefully denotes that it is a "heavily modified Atom Silvermont core", which means it definitely has some heavy duty alterations. Hard to tell what mutant is really under the hood in that aspect.
 
I had the same questions about its predecessor, which was allegedly based on the P54C Pentium. But, after they extended it to support x86-64, added 4:1 hyperthreading, and added the 512-bit vector engine, how much Pentium was really left? I mean, was it more like the spiritual foundation? Or is there actually a fair amount of unmodified Pentium in there, if even just at the HDL level?

And that's another thing to ponder - if they take some 20-year-old HDL, how much do they have to change, just to get the code to work with their current tools? I think the situation should've been much better, with Silvermont.

Anyway, I hope Goldmont still surfaces in some form, or another. Even though they cancelled the phone/tablet SoCs, it could still appear in laptop or IoT-oriented SoCs.
 
I remember reading about those, back in the day.

But does Intel say how the dual-core tiles are connected? I know some recent multi-core server & communications processors use a mesh topology, but I wonder what this uses.

To be honest, I never liked the asymmetry of meshes. I don't like the idea that some nodes might be much farther from a memory controller than others. That said, they do scale pretty well.

Lately, Intel seems to favor ring buses, and I can see the benefits for cache coherency. When I put out a transaction on a bus, all the nodes on the bus can see it and potentially respond. As long as you've got just one ring, it's all pretty simple. But their Xeon processors have (I think) as many as 2 rings, and this would need like 4 or 8. So, what then?
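To put rough numbers on the ring-versus-mesh trade-off, here is a minimal sketch that compares hop counts on a single bidirectional ring and on a 2D mesh. It assumes 36 tiles (72 cores in dual-core tiles) and a 6x6 mesh layout; the layout is a guess for illustration, not something Intel has stated in this thread, and hop count ignores bandwidth and coherency traffic entirely.

```python
from itertools import product


def ring_hops(n):
    """Shortest hop counts between all distinct node pairs on a bidirectional ring."""
    return [min(abs(a - b), n - abs(a - b))
            for a, b in product(range(n), repeat=2) if a != b]


def mesh_hops(rows, cols):
    """Manhattan hop counts between all distinct node pairs on a rows x cols mesh."""
    nodes = list(product(range(rows), range(cols)))
    return [abs(r1 - r2) + abs(c1 - c2)
            for (r1, c1), (r2, c2) in product(nodes, repeat=2)
            if (r1, c1) != (r2, c2)]


def summarize(hops):
    """Return (mean, worst-case) hop count."""
    return sum(hops) / len(hops), max(hops)


TILES = 72 // 2  # dual-core tiles
ring_mean, ring_max = summarize(ring_hops(TILES))
mesh_mean, mesh_max = summarize(mesh_hops(6, 6))  # hypothetical 6x6 layout
print(f"ring: mean {ring_mean:.2f}, max {ring_max}")
print(f"mesh: mean {mesh_mean:.2f}, max {mesh_max}")
```

Under these assumptions a single ring has a worse average and a worse worst case than the mesh, which is one reason a design with this many tiles might not stay with rings. The mesh's asymmetry shows up as the spread between corner and center nodes.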
 
Also unveiled at ISC 2016 was the new Sunway TaihuLight supercomputer, which now ranks as the fastest supercomputer in the world (unfortunately Tom's has not yet reported the news). With a LINPACK benchmark rating of 93 petaflops, it is nearly three times faster than the previous record holder, the Tianhe-2, which ran at 34 petaflops. Each ShenWei SW26010 processor rivals Knights Landing in all aspects. If 72 cores in KNL look like a lot, each SW26010 processor has 260 cores. The Sunway TaihuLight uses 40,960 processors, for a total of 10,649,600 cores across the entire system.
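The totals in this post are easy to verify. A small sketch using only the figures quoted above:

```python
# Figures quoted in the post above.
PROCESSORS = 40_960
CORES_PER_PROCESSOR = 260
TAIHULIGHT_PFLOPS = 93
TIANHE2_PFLOPS = 34

total_cores = PROCESSORS * CORES_PER_PROCESSOR
speedup = TAIHULIGHT_PFLOPS / TIANHE2_PFLOPS

print(f"total cores: {total_cores:,}")
print(f"LINPACK ratio vs Tianhe-2: {speedup:.2f}x")
```

The ratio comes out a little under 3, consistent with "nearly three times".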
 
I'm not surprised, as Tom's isn't really an HPC-oriented site.

Well, it depends a lot on how you define a "core", and exactly what the capabilities of those cores are. GPUs have far more "cores" than either, but they're far less capable. Knights Landing is notable, because all of its cores are fully general and can run a general-purpose OS and be programmed using the same tools and libraries used to develop desktop OS and apps.

By comparison, the only thing I can find about the SW26010 suggests that 256 of those cores are like the Cell processor's vector engines - that only 4 cores are general-purpose. This gives me the sense that it's much more GPU-like. Perhaps it's even akin to a quad-core CPU paired with a 256-core GPU? Above, I was even suggesting Intel should do something like that, with their HD graphics GPU.

I also wonder whether the SW26010 has anything akin to Omni-Path or HBM2, yet you say "in all aspects"? What type is its external memory interface, and how wide? How many PCIe lanes has it got?

Anyway, that is a lot of Petaflops. Good for apps that can use 'em!
 

Agreed.
However, both Knights Landing and Tesla P100 (also unveiled at ISC 2016) are HPC-oriented too, yet they are covered in the news.
Not to mention that Tom's German site has an article (even if a very short one) about the new supercomputer (http://www.tomshardware.de/supercomputer-systeme-pc,news-256035.html).



Actually, it is more akin to a quad-core CPU paired with an old Xeon Phi accelerator. Each of the general-purpose cores is connected to 64 accelerator cores, in a similar way that a Xeon Phi still required a "conventional" Xeon CPU. The only difference is that the general-purpose and accelerator cores are all integrated in the same chip. But each of the 256 cores is a "simplified" general-purpose CPU core, in the same way that each Xeon Phi core is a simplified core compared with a regular Xeon core.

As far as I know, each chip produces 3 TFlops double-precision, very close to Xeon Phi Knights Landing, and is extremely power efficient. The supercomputer is far more efficient than supercomputers based on either GPUs or Xeon Phi, which is an amazing feat considering that the chips are produced on 28 nm technology.

 
I'm guessing the Tom's editors just search the newswire for Intel & Nvidia, which they cover quite a bit.

That doesn't comport with what's written on Wikipedia. They describe a much more Cell-like architecture, where each of the vector cores works out of scratch-pad memory. This made the Cell notoriously hard to program. Really, much more like a GPU.

Furthermore, they state that each MPE+CPE group has its own memory space, which is another big difference from Knights Landing.

I'd be curious to know how much power it dissipates. AMD's 28 nm FirePro S9170 achieved 2.6 DP TFLOPS in 275 W. Nvidia's 28 nm K40 achieved 1.7 DP TFLOPS in 235 W.

I'm not saying the SW26010 is bad, just that it's much more comparable to a GPU (or 4 beefy APUs sharing a die) than Knights Landing. You'd need to program it using something like OpenCL or OpenMP. It's not a small distinction.

And we don't know what kind of interconnectivity it has. In supercomputing, the problem has long been one of providing usable performance, more than just aggregating a large amount of FLOPS.
 


Regarding power efficiency, the power required by an isolated GPU or Xeon Phi is one thing, and the power required by the complete system is another. The Sunway has a power efficiency of 6 Gflops/W, significantly better than the most efficient supercomputer based on Xeon Phi (2.4 Gflops/W). Keep in mind also that efficiency is much better for smaller supercomputers: Xeon Phi achieves 2.4 Gflops/W on a small supercomputer (#257 of the Top 500), while #2, which also uses Xeon Phi, achieves only 1.95 Gflops/W.
The most efficient supercomputer based on the K40 achieves 3.7 Gflops/W, but it's listed in position #305. A bigger supercomputer based on the K40 appears at position #19 with only 1.5 Gflops/W, and keep in mind that it achieves only a fraction of Sunway's performance (3.57 Pflops). For its performance, Sunway is an extremely power-efficient supercomputer: it requires less power than the #2 supercomputer (as I said, based on Xeon Phi) while providing nearly 3 times the HPL performance. Incidentally, it also requires a smaller footprint.


Actually, if we compare the top 3 supercomputers, Sunway has higher HPL efficiency (74.16% of peak performance) than the #3 ORNL Titan, based on GPUs (65.19%), and the #2 Tianhe-2, based on Xeon Phi (55.83%). The reported 93 Pflop/s is actually usable performance; peak performance is 125.4 Pflop/s.
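The efficiency percentage here is simply sustained HPL performance divided by theoretical peak. A minimal sketch using the Sunway figures quoted above (the Titan and Tianhe-2 inputs are not given in this thread, so they are left out):

```python
def hpl_efficiency(rmax_pflops, rpeak_pflops):
    """Sustained HPL performance as a percentage of theoretical peak."""
    return 100.0 * rmax_pflops / rpeak_pflops


# Sunway TaihuLight figures quoted in the post above.
SUNWAY_RMAX = 93.0
SUNWAY_RPEAK = 125.4

sunway_eff = hpl_efficiency(SUNWAY_RMAX, SUNWAY_RPEAK)
print(f"Sunway HPL efficiency: {sunway_eff:.2f}%")
```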

However, HPL only measures performance for dense linear algebra (which covers a broad range of applications). On sparse linear algebra, Sunway is much less efficient, actually providing less effective performance than the #2 supercomputer. GPUs and Xeon Phi (although marginally better) are also very inefficient with sparse linear algebra; here the efficiency crown goes to the traditional vector supercomputers (NEC is still producing some of those).
 
I know it's not quite a fair comparison, but the FirePro S9170 I mentioned achieves 9.5 GFLOPS/W. If you add 100 W for a host processor and NIC, it'd still be about 7.
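For reference, the per-card numbers work out as follows. This sketch uses the TFLOPS and wattage figures quoted earlier in the thread plus the 100 W host allowance from this post; the host figure is the poster's rough guess, not a measurement.

```python
def gflops_per_watt(tflops, watts):
    """Double-precision efficiency in GFLOPS per watt."""
    return tflops * 1000.0 / watts


# Card figures as quoted earlier in the thread.
S9170_TFLOPS, S9170_WATTS = 2.6, 275
K40_TFLOPS, K40_WATTS = 1.7, 235
HOST_OVERHEAD_WATTS = 100  # assumed allowance for host CPU and NIC

s9170_card_only = gflops_per_watt(S9170_TFLOPS, S9170_WATTS)
s9170_with_host = gflops_per_watt(S9170_TFLOPS, S9170_WATTS + HOST_OVERHEAD_WATTS)
k40_card_only = gflops_per_watt(K40_TFLOPS, K40_WATTS)

print(f"S9170 card only: {s9170_card_only:.1f} GFLOPS/W")
print(f"S9170 with host: {s9170_with_host:.1f} GFLOPS/W")
print(f"K40 card only:   {k40_card_only:.1f} GFLOPS/W")
```

These are peak figures for a single card, so they are an upper bound rather than something comparable to a measured full-system HPL result.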

I think people using Xeon Phi accelerators aren't running such workloads. If they were, then an architecture built around more conventional GPUs would be more sensible.

Perhaps APUs are the way to go, since they could remove the overhead of powering two memories, the PCIe bus, etc. AMD has announced a server/workstation-oriented APU, built around their Zen core and their new Polaris graphics. It'll be interesting to see if it goes anywhere in HPC.
 
That would be great for virtualization: run 50 desktops on one server and watch this bad boy keep up, instead of having to put several GPUs in.
 


The estimate may be reasonable for a small server, but it's certainly off for a big supercomputer. For instance, the Xeon Phi co-processors in the Tianhe-2 (the only supercomputer with a size remotely comparable to Sunway) have a power efficiency of 7.5 Gflops/W, yet the full system achieves an effective efficiency of only 1.95 Gflops/W on HPL. Moreover, Tianhe-2 has only 16,000 nodes; had they built a version the size of Sunway (40,960 nodes), the efficiency would be significantly lower.

I would like to reiterate that the bigger the supercomputer, the lower the efficiency (even using the same components). All supercomputers at the top of the Green 500 chart are relatively small (except for Sunway). Right now the most efficient supercomputer in the world achieves 6.7 Gflops/W, but its HPL performance is only 1 PFlops. Sunway achieves 6 Gflops/W (almost as power efficient) but has 93 times the sustained HPL performance.
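Turning those efficiency figures into total power makes the difference in scale concrete. A rough sketch, using only the rounded numbers quoted in this post, so the results are approximate:

```python
def system_power_mw(hpl_pflops, gflops_per_watt):
    """Implied system power in megawatts from HPL performance and efficiency."""
    gflops = hpl_pflops * 1_000_000  # 1 PFLOPS = 1,000,000 GFLOPS
    watts = gflops / gflops_per_watt
    return watts / 1_000_000


# Rounded figures quoted in the post above.
sunway_mw = system_power_mw(93, 6.0)
green_leader_mw = system_power_mw(1, 6.7)

print(f"Sunway:           ~{sunway_mw:.1f} MW")
print(f"Green 500 leader: ~{green_leader_mw:.2f} MW")
```

So the two machines are close in efficiency but differ by roughly two orders of magnitude in total power draw.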

As I said, the fact that they can get such an efficiency out of this monster using homegrown processors built on the 28 nm node is really amazing.



 
Why? It seems to me that the only thing whose power draw scales more than linearly with system size is networking.

There's also a question of what the power figures include, and whether everyone is measuring these the same. Like storage? Air conditioning? Power distribution? Lighting? If it includes air conditioning, then you could lower your power usage simply by locating the computer somewhere naturally cool. If it includes storage, then a computer with low storage requirements and SSDs would also score better.

It's always tough to compare different supercomputers. What really matters is getting the right one for your needs. If Sunway was built more as a prestige project than to meet certain application requirements and budget constraints, it would be easier to break records. It's impressive progress, nonetheless.
 