IBM, Stone Ridge Technology, Nvidia Break Supercomputing Record


bit_user

Such massive simulations are rarely used in the field, but the result does highlight the performance advantage of using GPUs instead of CPUs for this class of simulation.
Interestingly, Blue Waters still has about 10x-20x the raw compute performance of the new system, even though it's pretty old and employs Kepler-era GK110 GPUs.

https://bluewaters.ncsa.illinois.edu/hardware-summary

So, this really seems like a story about higher-bandwidth, lower-latency connectivity, rather than simply raw compute power. Anyway, progress is good.

It would be interesting to see how a Knights Landing-equipped (KNL) cluster would stack up.
A Xeon Phi x200 is probably about half as fast as a GP100 for tasks well-suited to the latter: it's got about half the bandwidth and half the compute. Where it shines is when your task isn't so GPU-friendly, or when it's legacy code that you can't afford to rewrite in CUDA (or don't have source access to).

Regarding connectivity, the x200 Xeon Phis are at a significant disadvantage. They've only got a single (optional) 100 Gbps Omni-Path link, whereas the GP100s each have 3x NVLink ports.
 

Matt1685

Exxon and Stone Ridge Technology say they simulated a billion-cell model, not a 100-billion-cell model.

As far as Intel's claims about KNL go, take them with a grain of salt. Firstly, they were probably making the comparison against Kepler processors, which are two generations behind Pascal, not one; the P100 probably has 4 times the energy efficiency of the processor they used for the comparison. Secondly, any performance comparison depends highly on the application. Thirdly, KNL uses vector units, which I think might be more difficult in practice to use at high efficiency than NVIDIA's scalar ALUs. Fourthly, KNL is restricted to a 1-processor-per-node architecture, which makes it harder to scale a lot of applications compared to Minsky's 2 CPU + 4 GPU per node topology; there would be much less high-speed memory per node, as well as much lower aggregate memory bandwidth per node. BIT_USER has already pointed out some of these things.

In reply to BIT_USER: from the way it sounds, Exxon only used the XE CPU nodes for the simulation, not the XK GPU nodes. Each XE node consists of 2 AMD Opteron 6276 processors, which came out in November 2011. NCSA counts each 6276 as having 8 cores, but each CPU can run 16 threads concurrently, and Exxon seems to be counting threads to get its 716,800 "processors". The XE nodes have 7.1 PF of peak compute according to NCSA; the P100s in the Stone Ridge simulation have about 1.2 PF. So the increased performance has a lot to do with the increased memory bandwidth available, and perhaps with the topology (fewer, stronger nodes) and the more modern InfiniBand EDR interconnect.
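(Checking that arithmetic, and guessing at how Exxon counted: 716,800 / (2 CPUs per node x 16 threads per CPU) = 22,400 XE nodes, which is close to the XE node count NCSA lists, so the thread-counting explanation lines up.)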

Also, as far as using KNL when you can't afford to rewrite in CUDA: in traditional HPC, NVIDIA GPUs are often programmed with OpenACC or OpenMP, so they don't need CUDA at all. Used that way, they still seem to hold up well against Xeon Phi, both in performance and in the time it takes to optimize code, since HPC customers keep choosing GPUs. Stone Ridge, however, surely has hand-optimized CUDA code.
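For what it's worth, here's a rough sketch of what that directive-based approach looks like; the function and array names are just my own illustration, not anything from the article. With OpenACC, the loop stays plain C++ and the pragma asks the compiler to generate the GPU kernel and handle data movement (a compiler without OpenACC support simply ignores the pragma and runs the loop on the CPU):

#include <cstddef>
#include <vector>

// Illustrative only: a SAXPY-style loop offloaded via an OpenACC directive.
// The raw pointers are pulled out of the vectors because OpenACC data
// clauses want contiguous arrays, not container objects.
void saxpy(float a, std::vector<float> &x, std::vector<float> &y)
{
    float *xp = x.data();
    float *yp = y.data();
    const std::size_t n = x.size();

    // copyin: read-only on the device; copy: copied in and back out.
    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (std::size_t i = 0; i < n; ++i)
        yp[i] = a * xp[i] + yp[i];
}

OpenMP 4.x has an analogous "#pragma omp target teams distribute parallel for" directive, which is why neither path requires touching CUDA.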

Blue Waters was built in 2012, so the supercomputer is 4-5 years old. The Stone Ridge Technology simulation is still very impressive, but the age of Blue Waters should be kept in mind when considering the power efficiency, cost, and performance comparisons.
 

bit_user

I was with you until this. Pascal's "Streaming Multiprocessor" cores each have 2x 32-lane (fp32) SIMD units; all GPUs rely on SIMD engines to get the big compute throughput numbers they deliver. KNL features two similar, 16-lane pipelines per core.

Aside from the work of extracting good SIMD throughput, KNL (a.k.a. Xeon Phi x200) cores are no more difficult to program than any other sort of x86 core. That said, its "cluster on a chip" mode is an acknowledgement of the challenges posed by offering cache coherency at such a scale.

Right. That's what I was getting at.

This still assumes even recompilation is an option. For most HPC code, this is a reasonable assumption. But, the KNL Xeon Phi isn't strictly for HPC.

Right. I was also alluding to that.

Progress is good, but it's important to look beyond the top line numbers.

Anyway, the next place where HMC/HBM2 should have a sizeable impact is mobile SoCs. We should anticipate a boost in both raw performance and power efficiency. And when high-speed in-package memory eventually makes its way into desktop CPUs from Intel and AMD, it will remove a bottleneck that has been holding back iGPUs for a while.
 

bit_user

When Intel decides it really wants to get serious about competing head-on with GPUs, it'll deliver a scaled-up HD Graphics GPU. I see Xeon Phi and their FPGA strategy as playing at the margins, not really going for the jugular.

x86 is fine for higher-power, lower-core-count CPUs. But when power efficiency really matters, x86's big front end and memory consistency model start to take a toll.
 

Matt1685

I'm on shaky ground here, but I think something you said is misleading. NVIDIA uses an SIMT (single instruction, multiple thread) architecture, while KNL uses AVX-512 vector units (two per core). I believe SIMT offers flexibility that KNL's vector units don't, and that flexibility apparently makes it easier to extract data-level parallelism from applications. Of course there are downsides to it as well. Here is a better explanation than I could give: http://yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html.
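To make the contrast concrete, here's a rough sketch (my own illustration, not from the article or that blog post) of what divergence handling looks like when you write explicit SIMD with AVX-512 intrinsics. Under SIMT you'd just write the scalar "if" once per thread and the hardware's predication takes care of inactive lanes; with explicit SIMD, you (or the compiler) build the mask yourself:

#include <cstddef>
#include <immintrin.h>

// Illustrative only (requires AVX-512F): add 1.0f to the elements of
// 'data' that are >= 0. The programmer builds a lane mask and uses
// masked operations; under SIMT the per-thread code would stay scalar.
void add_one_if_nonneg(float *data, std::size_t n)
{
    const __m512 zero = _mm512_setzero_ps();
    const __m512 one  = _mm512_set1_ps(1.0f);

    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 v = _mm512_loadu_ps(data + i);
        __mmask16 m = _mm512_cmp_ps_mask(v, zero, _CMP_GE_OQ); // lanes where v >= 0
        v = _mm512_mask_add_ps(v, m, v, one);                  // add only on those lanes
        _mm512_storeu_ps(data + i, v);
    }
    for (; i < n; ++i)                                          // scalar remainder
        if (data[i] >= 0.0f) data[i] += 1.0f;
}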

I think "KNL cores are no more difficult to program than another other sort of x86 core" is again a bit misleading. Strictly speaking, that's true. However, KNL cores are more difficult to program /well/ for. Code must be carefully optimized to properly extract parallelism in such a way that makes efficient use of the processor. In terms of achieving high performance, I haven't seen it claimed, except by Intel, that Xeon Phi processors or accelerators are easier to program for than GPUs.

I am confused by what you meant by "But, the KNL Xeon Phi isn't strictly for HPC." I only brought up HPC to provide a concrete example of GPUs being commonly used to run legacy code that users either can't afford to or simply don't want to rewrite in CUDA. It seems to me that the main advantages of Xeon Phi at the moment are that there are lots of programmers more comfortable with x86, and that Xeon Phi is an alternative to GPUs. Certain institutions don't want to be locked into any particular vendor or type of processor (any more than they already are with x86, I guess). They have lots of x86 legacy code and try to keep their code more or less portable as far as acceleration goes, so it's not difficult to use Xeon Phi to help ensure alternatives to GPUs stay available. I haven't heard of many people being excited by Xeon Phi, though.

BTW, the last paragraph in my original message was not directed at you. After I sent the message I realized it might look like it was, sorry.
 

Matt1685

Intel can't develop a GPU now. They've spent the last 5 years telling people they don't need GPUs. Even if Intel did decide to risk pissing off their customers by moving full force to GPUs, I think they'd be at a big disadvantage. First, it would take them years to develop a good GPU. Then, after they had one, a lot of work would have to be done to optimize important code for Intel's new GPUs. They probably could have done it back in 2009, or whenever they started the Larrabee project, but instead they chose to try to leverage their x86 IP in the graphics world. As a result, graphics processors have been leveraged into their compute world.

Besides, NVIDIA is a large, rich, ambitious, nimble, and focused company with expertise in both GPU hardware and software. Intel would not find it easy to compete with NVIDIA directly in GPUs. Intel couldn't just outspend NVIDIA and beat them technologically; the best they could hope for is to match them, and that wouldn't be good enough considering NVIDIA's lead with CUDA. As for using their clout, Intel has already been in trouble with regulatory bodies for actions it took against AMD. I think they'd be on a short leash if they tried similar tricks to get customers to buy their GPUs over NVIDIA's.
 

bit_user

I don't know if Pascal still uses SIMT, but the distinction between that and SIMD is pretty minor. Reading that post, it sounds like SIMT could be implemented entirely as a software abstraction, so long as you've got scatter/gather memory access.

You're drinking too much of the Nvidia Kool-Aid. KNL has 4-way SMT, in addition to SIMD. So, it's no big deal if one or two threads stall out. And you can always use OpenMP or OpenCL.

It's not an issue of programmer comfort, so much as it is that 99.999% of software in existence is written for single-threaded or SMP-style architectures. You could recompile that on an ARM, but porting most of it to a GPU involves a lot more than just a recompile.

KNL supports Windows 10 and enterprise Linux distros. You can literally run whatever you want on it. Webservers, databases, software-defined networking... anything.
 

bit_user

They'll go wherever they think the money is.

Done, and done. It's called their HD Graphics architecture, and they've been working on it at least that long. All they'd have to do is scale it up and slap some HMC2 on the side.

They did a pretty good job at destroying the market for discrete, sub-$100 GPUs. Once they start shipping CPUs with in-package memory, I predict a similar fate could befall the $150 or even $200 segment. In revenue terms, that's the bread and butter of the consumer graphics market.

Regulatory bodies have been toothless since about 2000, when the MS anti-trust case got dropped. I somehow doubt Trump is about to revive them, and the EU has bigger things to worry about.
 

bit_user

BTW, I would agree that there's something holding Intel back from offering a non-x86 solution. I don't think it's their marketing department, however. I think they've learned the lesson about not straying from x86 a little too well; in the past, they've only gotten burned when they've gone in that direction.

But x86 has overheads that GPUs don't (and that ARM has to a lesser degree). You could even take their recent acquisitions of Movidius, Altera, and possibly even Mobileye & Nervana as acknowledgements of this fact.
 

Matt1685

Pascal definitely still uses SIMT. So does AMD's GCN, even though they call their units vector units. (Their previous architecture, TeraScale I think, was something they called Very Long Instruction Word, which worked a bit differently, but I'm not sure exactly how.)

The difference that SIMT makes is not just a minor one, to my understanding, as that blog post I sent you (and other things I have read, including this research paper from AMD: https://people.engr.ncsu.edu/hzhou/ipdps14.pdf) suggest.

"You're drinking too much of the Nvidia kool aide. KNL has 4-way SMT, in addition to SIMD. So, it's no big deal if one or two threads stall out. And you can always use OpenMP or OpenCL."

I'm not drinking Kool-Aid; my information doesn't come from NVIDIA. You apparently didn't read the blog post I sent you. Having an either/or choice of SMT or SIMD does not equal the data-parallelism extraction possibilities of SIMT. Without looking into it, I'm guessing that KNL offers SMT through the Silvermont cores, not the AVX-512 units, not that it matters. Running 70 cores x 4 threads per core = 280 threads isn't going to give you much parallelization. And "you can always use OpenMP or OpenCL"? What are you on about? That makes no sense. How do you expect to extract data-parallelism without them?

"It's not an issue of programmer comfort, so much as it is that 99.999% of software in existence is written for single-threaded or SMP-style architectures. You could recompile that on an ARM, but porting most of it to a GPU involves a lot more than just a recompile."

Sure, but that has NOTHING to do with Xeon Phi. No one in his right mind is going to purchase and install a Xeon Phi system in order to run legacy code, without modification, on its vastly sub-par scalar execution units; they'd just buy regular Xeons and get much better performance. My point was quite obviously about the reasons to use Xeon Phis, not the reasons to use Xeons.

"KNL supports Windows 10 and enterprise Linux distros. You can literally run whatever you want on it. Webservers, databases, software-defined networking... anything."

Yes, it can run whatever you want, as slowly as you want. Most users, even those that could take advantage of parallelization, are sitting on the sidelines using CPUs because they are among that 99.999% whose software was written for single-threaded or SMP architectures. The point is that they aren't jumping to Xeon Phi despite the magical properties you've mentioned. That's because there isn't an advantage to it: in order to take advantage of Xeon Phi, they have to do the same sort of work they would need to do to take advantage of GPUs. This isn't coming from NVIDIA; it's coming from the market, from vendors such as Cray.
 

Matt1685

"They'll go wherever they think the money is."

Exactly. They won't spend billions of dollars preparing themselves for what they know would be a losing battle.

"done. and done. It's called their HD Graphics architecture, and they've been working on it at least that long. All they'd have to do is scale it up and slap some HMC2 on the side."

No. Their HD graphics architecture is slow and energy inefficient, even in graphics-only modes. The purpose of that GPU is to provide onboard graphics and tick all the API feature boxes, and that's all it accomplishes. There's a big gulf between that architecture and NVIDIA's.

"They did a pretty good job at destroying the market for discrete, sub-$100 GPUs. Once they start shipping CPUs with in-package memory, I predict a similar fate could befall the $150 or even $200 segment. In revenue terms, that's the bread and butter of the consumer graphics market."

Do you remember when they were predicting the demise of NVIDIA after AMD bought ATI? Everything was supposedly going to move to integrated graphics. But no one games on integrated graphics except for casual or graphically undemanding games. The market for discrete GPUs is expanding, not shrinking. You're drinking the Intel Kool-Aid.

"Regulatory bodies have been toothless since about 2000, when the MS anti-trust case got dropped. I somehow doubt Trump is about to revive them, and the EU has bigger things to worry about."

Not true. A judgment was handed down against Intel with respect to AMD, and Qualcomm has just been dealt losses.
 

bit_user

VLIW is like MIMD that's statically scheduled.

I said I did look at it, and thought they were making a big deal out of something that's essentially a software construct layered atop SIMD hardware. Also, we're not talking about SMT or SIMD, we're talking about SMT + SIMD. I think both Nvidia and Intel are doing this, but I'm not 100% sure about Nvidia.

the AVX-512 engines aren't cores. AVX-512 is an instruction set extension for performing 512-bit vector arithmetic, operating on a correspondingly widened register file. It's very much SMT-compatible.

Well, Nvidia's P100 only has 60 cores. I'm not sure exactly how many concurrent threads it supports, but they're not as different as you make them sound.

Unlike GPUs, KNL natively supports a classical SMP programming model. You can just boot Windows on the sucker and run a bunch of VMs, single-threaded programs, and/or multi-threaded programs. But if you care about efficient utilization of its resources and have a compatible workload, you have the option of OpenMP or OpenCL (which is patterned after CUDA). That's all I was saying.
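To illustrate what I mean (my own toy example, not anything from Intel's docs): the same source builds and runs on any x86 box, and a couple of OpenMP directives are all it takes to spread the work across the cores and onto the 512-bit vector units when you do care about utilization.

#include <cstddef>

// Toy example of the "classical SMP, plus optional explicit parallelism"
// model. On an ordinary Xeon this compiles and runs as-is (serially if
// OpenMP is disabled); on a many-core part like KNL the same directive
// fans the loop out across the cores, with 'simd' hinting vectorization.
double dot(const double *a, const double *b, std::size_t n)
{
    double sum = 0.0;

    #pragma omp parallel for simd reduction(+:sum)
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];

    return sum;
}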


"It's not an issue of programmer comfort, so much as it is that 99.999% of software in existence is written for single-threaded or SMP-style architectures. You could recompile that on an ARM, but porting most of it to a GPU involves a lot more than just a recompile."

That's your opinion. It's got up to 72 cores, and a 384 GB/sec path to 16 GB of HMC2. I'm sure it makes sense for more non-HPC workloads than you think.

Have you got some sales data you'd like to share? If not, then I will disregard this as pure speculation.
 

bit_user

Not competing is the surest way to lose. Everybody wants a piece of Intel's datacenter market. They can't remain so dominant in the datacenter CPU market, so they've got to diversify.

You're basing this on what, exactly?

The sub-$100 market was a massive chunk of revenue, for these guys. It took a while for them to recover from it, but big data, deep learning, and VR helped bring them back from the brink. Deep learning will soon be dominated by ASICs. And the CPU-integrated market is currently limited by memory bottlenecks that are soon to be alleviated. Nvidia is trying to play in the embedded space, but again they face stiff competition from ASICs. So, either they go all-in with some specialized hardware of their own, or they're probably going to be hurting in a couple years' time.

This exchange isn't going anywhere positive.

I see your green-tinted glasses, but I have no dog in this fight - recall that I started out by saying the P100 was about 2x as fast as KNL, and that x86 could never win at GPU-friendly workloads. If you want to believe that everything Nvidia is the greatest thing since the electric light bulb, I'm happy to leave you to it.
 

bit_user

BTW, I've been thinking my description of VLIW was really sub-standard.

It's where you've got a wide superscalar CPU, as wide as Skylake or maybe even wider. However, instead of burning power and die area on dynamically scheduling instructions, everything is baked into the instruction stream generated by the compiler. All the hardware can really do is stall when you get a cache miss or the write buffer fills up.

BTW, many VLIW implementations also have SIMD engines. I've programmed on two of them, myself. VLIW is popular for DSPs, where binary backwards compatibility of programs is typically a non-issue, and where cost and power-efficiency are at a premium.
 

RoyTyrell

I think I'm going to have to go with Matt on this one. Nvidia itself didn't realize it had an HPC strategy until 10 years ago (see: Tesla G80). Even then, they had a heads-down focus on graphical end users.

You simply cannot replicate that level of complex silicon overnight. Nvidia made a good bet. Intel didn't.

Intel is playing the role of a snubbed French aristocrat, peering down its nose at the peasant eating its lunch.
 


RoyTyrell

BTW... BIT_USER...

An ASIC is... "application-specific"... meaning whatever narrow computation it is built to solve, that's *all* it does. It is not just narrow but single-purpose to the nth degree. And really, that is just semantics. You cannot create an HPC system that adds 2 + 2 (and only 2 + 2) all day, every day. What is that accomplishing? You are solving the same puzzle over and over again. Hence "application-specific".

Optimized. Indeed.
 

bit_user

Yeah, sort of. GPUs were originally considered ASICs. Just because something is regarded as an ASIC or uses that design/manufacturing process doesn't mean it has no programmability or programmable cores.

Anyway, what I mean by ASICs are things like Google's machine-learning chip and Movidius' embedded computer-vision chip. Google said their chip has some degree of programmability, but we don't know exactly what that means. In the case of Movidius, it's actually got a number of DSPs arrayed around a fast, central memory. It's architected in a way that makes it much more power-efficient for deep-learning tasks than GPUs such as Nvidia's Tegra X1.

A lot of ASIC designers favor some degree of programmability, as a way of dealing with shifting requirements, markets, and even hardware bugs. Certainly, there are tasks at the very edge of performance, where only hard-wired logic can meet the technical or market requirements.

I don't know if you saw this, but it's worth a look:

http://www.tomshardware.com/news/google-tpu-comparison-haswell-k80,34069.html

You might also try searching for the term ASIC on eetimes.com. Here's a good example: even if Ethernet chipsets have hard-wired routing circuitry, they contain programmable cores for control and management. And Google wants them to support a standard API.

http://www.eetimes.com/document.asp?doc_id=1331556
 

Matt1685

"I said I did look at it, and thought they were making a big deal out of something that's essentially a software construct layered atop SIMD hardware. Also, we're not talking about SMT or SIMD, we're talking about SMT + SIMD. I think both Nvidia and Intel are doing this, but I'm not 100% sure about Nvidia."

Go look for an alternate source if you want to. I've seen it suggested in other places as well that it's easier to extract parallelism with SIMT than with SIMD (I already sent you a second source).

"the AVX-512 engines aren't cores. AVX-512 is an instruction set extension for performing 512-bit vector arithmetic, operating on a correspondingly widened register file. It's very much SMT-compatible."

Well, maybe "core" is the wrong word; let's call them "units". AVX-512 is an instruction set, and KNL contains vector units capable of carrying out the instructions. I myself don't see how running SMT on top of SIMD units changes much, but perhaps you can explain it.

"Well, Nvidia's P100 only has 60 cores. I'm not sure exactly how many concurrent threads it supports, but they're not as different as you make them sound."

The difference is that each of the P100's 60 "cores" consists of 64 scalar ALUs, not 2 vector units. I only make them sound different because the people who program these things make them sound different.

"Unlike GPUs, KNL natively supports a classical SMP programming model. You can just boot windows on the sucker and run a bunch of VMs, single-threaded programs, and/or multi-threaded programs. But, if you care about efficient utilization of its resources & have a compatible workload, you have the option of OpenMP or OpenCL (which is patterned after CUDA). That's all I was saying."

Yes, this is the third time we've gone through the same hoop. You can do it, but there's no good reason to, so why talk about it? Automatic parallelization doesn't exist, and you don't buy a parallel processor to not take advantage of parallelization in your code. Our discussion was about SIMT and SIMD, so obviously we are concerned with parallelization in this context.

"That's your opinion. It's got up to 72 of them, and a 384 GB/sec path to 16 GB of HMC2."

Sure, it's my opinion. It also seems to be Intel's opinion if you look at their marketing and technical presentations. It also seems to be Cray's experience.

"Have you got some sales data you'd like to share? If not, then I will disregard this as pure speculation."

Your claiming "I'm sure it makes sense for more non-HPC workloads than you think" is pure speculation. I've been actively searching, at sporadic times over the past several months, to try to find out the uptake of Xeon Phi. What I can say is that, as far as I can tell, hyperscalers have very little uptake whatsoever, either for their own internal use or for public clouds. I also know that Cray has noted slow adoption of KNL outside of 4 major HPC projects. The Top500 list will show you many other large non-Cray HPC projects that use them. Unfortunately, HPE doesn't talk about it much, and Dell is private. I have not yet come across evidence of a significant and growing Xeon Phi community, unlike the GPU programming communities. I also saw a quote from an oil and gas industry source that made it sound like they were less than thrilled with Knights Landing. So, no, obviously I don't know for sure how successful Knights Landing has been, but it's not just pure speculation. Is it your standard practice to assume every product is successful despite having no evidence that it is, and various (if incomplete) evidence that it is not outside a narrow realm?
 

Matt1685

"Not competing is the surest way to lose. Everybody wants a piece of Intel's datacenter market. They can't remain so dominant in the datacenter CPU market, so they've got to diversify."

I didn't say they wouldn't compete. They'd just pick their strategy more wisely.


I said: Their HD graphics architecture is slow and energy inefficient, even in graphics-only modes.

You replied: "You're basing this on what, exactly?"

I'm basing it on hardware review sites such as AnandTech, PCPerspective, Guru3D, etc. They measure the performance and the power draw.

"The sub-$100 market was a massive chunk of revenue, for these guys."

NVIDIA has voluntarily moved away from the low-end OEM market, even the part that's above integrated-graphics level. NVIDIA wants gross margins like Intel's. Intel's are about 60%; NVIDIA's are about 50%, I believe. NVIDIA's revenues and margins are both much higher now than they were before integrated graphics. Of course NVIDIA would love to have that revenue, but integrated graphics are not a threat. You're speculating that they are, but you're wrong. Your opinion would have been mainstream 10 years ago, but you'd be hard-pressed to find an analyst who agrees with you these days.

"It took a while for them to recover from it, but big data, deep learning, and VR helped bring them back from the brink."

No, 75% of NVIDIA's revenue is gaming revenue. Data center and automotive are growth areas for the future. VR has been a drop in the bucket thus far; it's also a future growth area.

"Deep learning will soon be dominated by ASICs."

Maybe. That's yet to be determined. For the inference portion, that's probably true; training-wise, it would take a long time, since there's a whole lot of deep-learning training software out there optimized for GPUs. I'm not sure why this has become an anti-NVIDIA rant as far as you are concerned; our discussion wasn't about NVIDIA. In fact, I think it's strange that you first say Intel is going to develop a GPU and beat NVIDIA, and now you say that GPUs are doomed. You've lost the plot in your argumentative zeal.

"And the CPU-integrated market is currently limited by memory bottlenecks that are soon to be alleviated."

The CPU market is limited by the death of Dennard scaling. Faster memory isn't going to reverse the need for parallelization to continue performance improvements.

"Nvidia is trying to play in the embedded space, but again they face stiff competition from ASICs. So, either they go all-in with some specialized hardware of their own, or they're probably going to be hurting in a couple years' time."

Yeah, NVIDIA can build ASICs, too. Look into the Eyeriss Project.

"I see your green-tinted glasses, but I have no dog in this fight - recall that I started out by saying the P100 was about 2x as fast as KNL, and that x86 could never win at GPU-friendly workloads. If you want to believe that everything Nvidia is the greatest thing since the electric light bulb, I'm happy to leave you to it."

You may have started with no dog in the fight, but you obviously picked one up along the way. I don't think NVIDIA is the greatest; I have analyzed the situation and determined that NVIDIA's GPUs are in a stronger position than Intel's Xeon Phi. I've also seen that parallelization is the future. So Intel, though it has data-center dominance, has an inferior parallelization method at the moment, and thus an inferior path to the future. I think you agree with this, but somewhere along the way you got locked into arguing against it because you didn't want to consider that something you said about NVIDIA and SIMD might not be accurate.

Anyway, take care. Nice arguing with you.
 

Ross_30

If IBM has achieved this at about 1/100th the cost of Exxon's version, I don't think it'll be financially viable for oil and gas companies to go with Exxon's outdated version. It's simply not cost-effective!
 

bit_user

It's easier because their tools and runtime are doing more work for you, the programmer. SIMT is just a fancy name for something programmers have been doing on SIMD hardware since its inception.

Nvidia is doing exactly the same thing; they just use different terminology. In x86 lingo, you have CPU threads carrying out work assembled from data-parallel work items that the compiler or programmer mapped to a set of slots in vector registers and instructions. When one thread stalls on a memory access, or on the result of another computation that's in flight, another thread can get scheduled onto the AVX-512 unit it was using. That's SMT.

Nvidia is doing the same thing, but they call it a warp. When one warp stalls, the instruction scheduler tries to find another warp to execute. An entire warp stalls or executes as a unit, just like a classical CPU thread executing SIMD instructions. At the hardware level, this is classic SMT.
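If it helps, here's a toy model of that analogy (purely illustrative, not how either chip is actually built): treat a "warp" as one scheduler slot holding 32 lanes plus an active-lane mask, and the scheduler as something that simply picks any slot that isn't waiting on memory, which is exactly what SMT does with CPU threads issuing wide SIMD instructions.

#include <array>
#include <bitset>
#include <vector>

// Toy model only: a "warp" is one scheduler slot with 32 lanes and a
// predication mask; scheduling skips stalled slots, the way SMT picks
// another hardware thread when one is blocked on a cache miss.
struct Warp {
    std::array<float, 32> lanes{};        // one register lane per "thread"
    std::bitset<32> active{0xFFFFFFFFu};  // divergence handled by masking lanes off
    bool stalled = false;                 // e.g. waiting on a memory access
};

int pick_runnable(const std::vector<Warp> &warps, int last)
{
    const int n = static_cast<int>(warps.size());
    for (int step = 1; step <= n; ++step) {
        const int idx = (last + step) % n;
        if (!warps[idx].stalled)
            return idx;                   // issue this warp's next SIMD instruction
    }
    return -1;                            // everything is stalled
}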

That's the definition of SIMD. They have SIMD engines, no matter what Nvidia's marketing department chooses to call them.

If you want to know the difference, I'll tell you: scale. Whereas KNL has only 4 threads per core, Pascal has up to 64 warps per SM. KNL cores have 2x 16-lane (fp32) vector units, whereas Pascal SM's have 2x 32-lane (fp32) vector units.

Another meaningful difference is probably the GigaThread engine. The task of scheduling multiple threads across different KNL cores falls to the OS, whereas Nvidia and AMD employ specialized hardware for this.

Finally, all x86 cores have their own MMU state. This provides memory protection, so that one process can't arbitrarily access the memory of another or the kernel. GPUs have one MMU for the entire chip.

Except we were talking about non-hyperscalers. We're already agreed that KNL can't beat GPUs at their own game. So, for KNL to make sense, there's got to be something about your workload that limits your ability to exploit a proper GPU.
 

bit_user

The reason I ask is that this is a bit tricky to measure. First, GPUs get more efficient with scale; if you scale down Nvidia or AMD GPUs, they'll also look much worse. More importantly, Intel's GPUs are limited to DDR3 and now DDR4, neither of which has the energy efficiency of GDDR5 or GDDR5X (as measured in GB/s per W).

Most software using deep learning is built atop one of the frameworks like Caffe, TensorFlow, etc. Using different hardware is easy, if the hardware vendor supports the framework you're using.

Anyone who might happen to read this can decide for themselves who lost the plot. The reason we got here is that I claimed that if Intel were serious about tackling GPUs, they'd just scale up their HD Graphics architecture. You seem to have some sort of anti-Intel bias, because you're refusing to accept that it's competent for its size & memory limitations.

In GPU-compute tests, Intel's GPUs hold their own against AMD's APUs. As far as I'm concerned, that's evidence enough that the architecture has potential.

You missed an important word. I said "CPU-integrated", which was meant to imply graphics. I think we'll see AMD offering APUs with HBM2 and substantially more powerful integrated graphics, in the near future. Perhaps Intel will beat them to the punch.

No, I still have no dog in the fight of Intel vs. Nvidia. All my recent CPUs are Intel and my latest GPU is Nvidia. I like AMD's commitment to open standards, but I'm a pragmatist and pick the best product for my needs.


If I have an issue, it's with being accused, by someone who's clearly no expert in SIMD or GPU programming, of making misleading statements simply because they don't line up with Nvidia's marketing spin. If you see a discrepancy between what someone is saying and other information on the topic, it's fine to ask about it. But please save the accusations for when you really know what the heck you're talking about.

If you say so.
 