Xeon Phi: Intel's Larrabee-Derived Card In TACC's Supercomputer


army_ant7

Distinguished
May 31, 2009
629
0
18,980
With the different light sources, I was reminded of AMD's Leo demo, where they used DirectCompute. Aside from that, we may get the hardware to do those different effects that require quite an amount of calculation, but to what extent would they be practical? Like, would we notice every leaf on a tree moving based on the angle and force of the wind pushing it? Sometimes there are more efficient ways to sufficiently simulate real-world stuff. :)

Imagine calculating the collision of each molecule and atom generated in a game (actually, even generating molecules and atoms in a game to mimic real life would be quite a feat on its own :lol:). Even if we had the computational performance to do that, it might add little to the aesthetic appeal of games (since we wouldn't be able to see those particles normally anyway). I think they do things like that for scientific simulations, though, where they do matter. :)

Anyway, if a game were made that sufficiently simulated the real world, I doubt it would fully use what these supercomputers could offer. Though I can't really say... :)

Or Intel is just taking a shot at this market by making an appropriate implementation of what they have (x86). It does seem to have reasonable advantages. Though we'll see where this goes. :)

 

daggs

Distinguished
May 12, 2009
712
0
19,010
[citation][nom]ThatsMyNameDude[/nom]Holy shit. Someone tell me if this will work. Maybe, if we pair this thing up with enough xeons and enough quadros and teslas, we can connect it with a gaming system and we could use the xeons to render low load games like cod mw3 and tf2 and feed it to the gaming system.[/citation]
It works. In fact, there was a demo at SC12 showcasing Sandy Bridge + Phi running under one OS.
 
Guest
I'm not convinced; my experience with OpenMP was that it was rather inefficient. The '#pragma plz parallel this loop thx' directives all gave an increase in performance, but still paled in comparison to some rather simple worker-pthread code. I'd even go as far as to say 'you're threading it wrong', although they do make it much easier to gain at least something from multithreading.

The raw pthreads code I've used works well, but always seems to have problems scaling past about 4 cores on one CPU, unless either the workload rarely requires synchronization, or the code is processing multiple parallel workloads that are completely separate and need no synchronization. Having said that, I don't see how this would be better for HPC than something like a GPU (really, truly parallel), or even a couple of 10-core Xeons (really, truly programmable)...
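
For reference, this is roughly what the loop-level OpenMP style being described looks like. A minimal sketch only; the array size, scheduling clause and timing are placeholders, not a benchmark:

[code]
/* Minimal OpenMP loop example. Build with e.g. gcc -O2 -fopenmp saxpy_omp.c */
#include <stdio.h>
#include <omp.h>

#define N 10000000

int main(void)
{
    static float x[N], y[N];
    const float a = 2.0f;

    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    double t0 = omp_get_wtime();

    /* One directive spreads the loop across all available cores; how well it
       scales still depends on memory bandwidth, scheduling and synchronisation. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    double t1 = omp_get_wtime();
    printf("saxpy with %d threads: %.3f s (y[0] = %f)\n",
           omp_get_max_threads(), t1 - t0, y[0]);
    return 0;
}
[/code]

The equivalent pthreads version needs explicit thread creation, work partitioning and joining, which is exactly the boilerplate the pragma hides.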
 

IndignantSkeptic

Distinguished
Apr 19, 2011
507
0
18,980
[citation][nom]multithreadedman[/nom]I'm not convinced, my experience with OpenMP was that it was rather inefficient.. The '#pragma plz parallel this loop thx' directives all gave an increase in performance, but still paled in comparison to some rather simple worker pthreads code, I'd even go as far to say 'your threading it wrong', albeit they do make it much easier to gain at least something from multithreading.The raw pthreads code I've used works well, but always seems to have problems scaling past about 4 cores on one CPU, unless either the workload rarely require synchronization, or the code is processing multiple parallel workloads that are completely separate and require no synchronization. Having said that, I don't see how this would be better for HPC than something like a GPU(really truly parallel), or even a couple of 10 core Xeons(really truly programmable)...[/citation]

Maybe you didn't notice much speed-up because current CPUs don't have that many cores yet, and because OpenACC maybe hasn't been fully incorporated into OpenMP yet, which would let it target GPUs?
 
Guest
@IndignantSkeptic, OpenACC is not part of OpenMP. The plan is eventually to roll the OpenACC functionality into OpenMP, but that is not the case yet.

On the general question of the benefit of Phi being x86-based, think of the toolchain. Intel (and others) have many years invested in writing tools (compilers, debuggers, profilers, libraries, etc.) that care very much about the instruction set. Those tools are much easier to bring to the Phi platform since it is also x86-based. Of course it is not completely trivial, given the different nature of the vector units and so on, but it is certainly much easier than writing a new suite of tools for a completely new architecture.
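
For what it's worth, the two directive sets do look very similar on the surface, even though they are separate specifications. A minimal sketch (the function names are made up for illustration, and whether either pragma does anything depends entirely on the compiler you build with):

[code]
/* Illustrative only: the same loop annotated for OpenMP (host threads)
   and for OpenACC (accelerator offload). */
void scale_omp(int n, float a, float *restrict x)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        x[i] *= a;
}

void scale_acc(int n, float a, float *restrict x)
{
    /* copy(x[0:n]) moves the array to device memory and back around the region. */
    #pragma acc parallel loop copy(x[0:n])
    for (int i = 0; i < n; i++)
        x[i] *= a;
}
[/code]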
 

devBunny

Distinguished
Jan 22, 2012
181
0
18,690
[citation][nom]PudgyChicken[/nom]It'd be pretty neat to use a supercomputer like this to play a game like Metro 2033 at 4K, fully ray-traced. I'm having nerdgasms just thinking about it.[/citation]

Lol. Where's the imagination? You've got a Top 10 supercomputer - http://www.top500.org/list/2012/11/ - and you want to run a game on a 4K monitor??

If I had this baby for game playing then I'd have it feeding an IMAX cinema in 360-surround mode with an array of Kinect-type inputs following my actions. I'd be positioned on a treadmill capable of rolling in any direction so as to keep me centred in the arena and there'd be hydraulics to provide vertical input/feedback.

(Don't ask me how it works, this is a science fiction fantasy, the very, very, very early days of holodeck-esque facilities. ;o)
 

devBunny

Distinguished
Jan 22, 2012
181
0
18,690
Oh, and with every Stampede/Xeon Phi/IMAX game, er, "console" you get a free gym membership. You're going to need real-life fitness to play these games! Come get some. ;O)
 

jaybus

Distinguished
Oct 20, 2006
54
0
18,630
I think Phi could turn out to be more broadly useful. Where GPGPU vector solutions probably have an edge in scientific and statistical floating-point number crunching, Phi is much more general-purpose. I can imagine a 60-core, 4-threads-per-core card handling 240 simultaneous web server requests, for example, so web server farms could reduce the number of servers needed per rack.
 

PreferLinux

Distinguished
Dec 7, 2010
1,023
0
19,460
[citation][nom]IndignantSkeptic[/nom]@PreferLinux, I'm shocked programmers still program in low level. Also how can OpenMP not work on GPUs when a section of OpenMP is OpenACC?!![/citation]
Yes, they do – how else do you intelligently optimise something? (A compiler can't do everything, and definitely not as well as a person.) Besides, what is high-level? C and C++ aren't really (they are in some ways, but not in others), yet they are very common, especially for performance-sensitive code. Anyhow, CUDA requires its own language, and I don't think Nvidia would let you use anything else... As for OpenMP on GPUs, I based that on the following from Wikipedia:
Pros and cons ... Cons ... "Can't be used on GPU"

[citation][nom]blazorthon[/nom]I doubt that you could do a whole lot with that much performance even if we made games that could somehow utilize that many cores. You might be able to do something like one hell of a physics processing game with very high FPS that way, but otherwise, IDK how much you could actually do with it.[/citation]
Run your software renderer with it, maybe? By the way, it does still have Larrabee's graphics-specific parts, just turned off – at least according to (I think it was) SemiAccurate.

[citation][nom]technoholic[/nom]Excuse me, but aren't GPUs more efficient/optimised for paralleled work loads already? Is GPUs' nature (more coars + higher memory bandwidth of graphics memory + their abily to act alongside a CPU) not more suitable for parallel computing? Also GPUs are relatively cheap and widely available even for a home user. Even a home user can do accelerated work using their GPU. IMO x86 has got its age. Intel doesn't want to adopt new tech[/citation]
Not really – they have disgusting per-thread performance, and very limited operations compared to this kind of thing. Hit a serial portion of code (and everything has them) on a GPU, and your performance drops through the floor. Not quite the same with this, and besides, it'd be a lot easier to prevent that happening. This has over 1 TFLOPS of peak double-precision performance, so I hardly think performance is a problem – and the power usage is about the same too. This has graphics-memory bandwidth, and it will "act alongside a CPU" just fine. So what about the availability? Far more people could program for this than for GPGPU.

SemiAccurate mightn't be the most reliable source of information, but they do know about tech. Check out their articles on the Phi for some ideas...
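
That "serial portion" point above is essentially Amdahl's law. A toy calculation, with made-up parallel fractions and core counts just to show the shape of the curve:

[code]
/* Toy Amdahl's-law calculation; the fractions and core counts are examples,
   not measurements of any real workload. */
#include <stdio.h>

/* Speedup on n cores when a fraction p of the work can be parallelised. */
static double amdahl(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    const double fractions[] = { 0.90, 0.99 };
    const int cores[] = { 4, 16, 60, 240 };

    for (int f = 0; f < 2; f++)
        for (int c = 0; c < 4; c++)
            printf("parallel fraction %.2f on %3d cores -> %6.2fx speedup\n",
                   fractions[f], cores[c], amdahl(fractions[f], cores[c]));
    return 0;
}
[/code]

Even at 99% parallel, 240 cores only get you roughly 70x, which is why the serial bits end up dominating.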
 

PreferLinux

Distinguished
Dec 7, 2010
1,023
0
19,460
[citation][nom]jaybus[/nom]I think Phi could turn out to have more usefulness. Where GPGPU vector solutions probably have an edge in scientific and statistical analysis floating point number crunching, Phi is much more general purpose. I can imagine a 60 core / 4 threads per core card able to handle 240 simultaneous web server requests, for example, so web server farms could reduce the number of servers needed per rack.[/citation]
GPGPU doesn't have much of an edge at all, even considering theoretical performance. Add in that they are far more difficult to program for, and I think you'll find it completely disappears.
 

tipoo

Distinguished
May 4, 2006
1,183
0
19,280
IBM discontinued the Cell family AFAIK, so they have nothing comparable to the Phi. What I've always been curious about is how the Power7 processors compare to the Xeons or Itaniums. Anyone know? That would be an interesting TH feature, if very niche.
 

army_ant7

Distinguished
May 31, 2009
629
0
18,980
I believe it's already been done (not with IMAX screens, though). You were missing one thing: what happens when you get shot? :lol: The answer, along with the implementation, is here:
http://www.tomsguide.com/us/FPS-Simulator-BF3-Battlefield-Paintballs,news-12925.html

If you thought this up by yourself, then I applaud you, sir (or ma'am), for your creativity (though the opposite if you stole the idea and passed it off as your own). :)

Oh yeah, back to how this is somewhat related to the article... I don't think they needed a supercomputer to make this work. :)
 

cuba_pete

Distinguished
Mar 12, 2010
34
0
18,530
[citation][nom]memadmax[/nom]Ok, the hardware looks decent.But, there's one problem: Intel is relying on programmers being able to optimize to a high degree for this setup...And that is the problem, programmers these days are lazy...It used to be you would spend a whole day optimizing loops using counters and timers to squeeze every last drop out of that loop...Now, programmers are like kids with legos, they just merge crap together that was written by someone else and as long as it runs decent, they call it good...[/citation]
...and then become contractors for the government...
 

dozerman

Honorable
Nov 14, 2012
94
0
10,630
[citation][nom]ThatsMyNameDude[/nom]Holy shit. Someone tell me if this will work. Maybe, if we pair this thing up with enough xeons and enough quadros and teslas, we can connect it with a gaming system and we could use the xeons to render low load games like cod mw3 and tf2 and feed it to the gaming system.[/citation]

I think he means creating an OpenGL or DirectX implementation that runs on x86 so that graphics could be processed on a Phi instead of a traditional GPU, thus fulfilling his dreams of an Intel GPU.
 

dozerman

Honorable
Nov 14, 2012
94
0
10,630
[citation][nom]jaybus[/nom]I think Phi could turn out to have more usefulness. Where GPGPU vector solutions probably have an edge in scientific and statistical analysis floating point number crunching, Phi is much more general purpose. I can imagine a 60 core / 4 threads per core card able to handle 240 simultaneous web server requests, for example, so web server farms could reduce the number of servers needed per rack.[/citation]

What stops a Tesla or FireStream from doing this, too?
 
palladin9479

Guys, the reason Intel is using x86 is to maintain code compatibility with current software.

These are only partially x86 CPUs; more specifically, they're really slow x86 CPUs with a huge a$$ SIMD vector processor glued onto them. There are special instructions that take full advantage of that vector processor to achieve high rates of computation.

There are a few reasons to do this. The first is how code and data are treated by the CPU. In the case of OpenCL / OpenMP and CUDA you have two sets of code. One is the x86 binary that is loaded to set up the session and pass the payload, which is the second set of code, to the dedicated vector processor for processing. This is administratively intensive and involves a lot of repackaging of code to be sent off and then interpreting of the results. The second reason is memory addressability: dedicated vector processors have a different memory address space and often their own dedicated memory. The data to be worked on must be moved from your system's main memory into the vector processor's dedicated memory space, and the address tables then have to be passed on so that everything remains coherent.

By treating the huge a$$ vector processor as a tightly integrated co-processor, you can execute instructions on it in the middle of your general code without first having to send a batch off to a remote vector CPU. You can easily mix and match general-purpose code with vector code: one moment you do a bunch of logical operations and reads, the next you execute a large number of vector instructions on the results of those operations, all without having to move the data around or copy it into a special memory space. Parallel code is extremely sensitive to memory bandwidth; there are a ton of simultaneous reads, copies and writes that have to happen. Hence GDDR5 being used as "system" memory for the card.
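
To make the "administratively intensive" point concrete, here's a rough sketch of that offload flow using the OpenCL host API. Error checking is stripped out, and the offload_scale / "scale" names and the kernel source are made up for illustration; the point is just how many steps sit between the host code and the vector hardware:

[code]
/* Sketch of the offload model described above: build the device-side payload,
   copy data into the device's separate memory, launch, copy results back. */
#include <CL/cl.h>

void offload_scale(float *data, size_t n, const char *kernel_src)
{
    cl_platform_id plat;     clGetPlatformIDs(1, &plat, NULL);
    cl_device_id   dev;      clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context     ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q  = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* 1. Build the second set of code (the kernel "payload") for the device. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "scale", NULL);

    /* 2. Copy the working set into the device's own memory space. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), NULL, NULL);
    clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, n * sizeof(float), data, 0, NULL, NULL);

    /* 3. Launch the kernel, then copy the results back to host memory. */
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, n * sizeof(float), data, 0, NULL, NULL);

    clReleaseMemObject(buf); clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
}
[/code]

On a tightly coupled co-processor with a shared address space, the middle of that goes away: you just run the vector code on the data where it already sits.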
 

cuba_pete

Distinguished
Mar 12, 2010
34
0
18,530
[citation][nom]palladin9479[/nom]...These are only partially x86 CPUs, or more specifically their really slow x86 CPU's with a huge a$$ SIMD vector processor glued onto it.[/citation]

Sounds simple...why didn't you or any other internet engineer do this? Or is it really not that simple?
 


Probably just software, AFAIK. Intel already has software compatibility for such work; Nvidia doesn't, nor does AMD. They have the platforms to make such software, but it's not there yet. Besides, I have to wonder who'd do it better.
 

cuba_pete

Distinguished
Mar 12, 2010
34
0
18,530


Yes... Palladin made it sound like this was an easy thing to do... However, as the axiom goes, there should go Palladin.
 



palladin9479

It's someone's sock account. Probably from the AMD speculation thread, hence the attack on me.

What I said was true: they're slow x86 CPUs with gigantic SIMD vector arrays attached to them. This means they ~can~ process generic x86 instructions, so they maintain code compatibility with current software. They have special instructions that enable them to execute huge amounts of vector math on their SIMD engines; think SSE4 / AVX multiplied by 10. It's not as fast as a dedicated external vector coprocessor (Nvidia / AMD), but it can run its own OS and is much easier to program for. Very smart on Intel's part. Developers like to program on what's easiest / cheapest, and users will use what the programmers develop on.

AMD's been working on something very much like this; they call it "Fusion", and it's aimed at the consumer market, whereas Intel's is aimed at the HPC world.
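
To put a rough shape on the "SSE4 / AVX multiplied by 10" idea: Knights Corner has its own 512-bit vector instruction set, but the sketch below, written with the 512-bit _mm512_* intrinsics exposed by newer compilers, shows what 16 single-precision lanes per instruction looks like. Illustrative only; it assumes a compiler and target that support those intrinsics, and the Phi's own intrinsics differ in details like alignment requirements:

[code]
/* Illustrative 512-bit SIMD loop: one fused multiply-add instruction handles
   16 single-precision elements at a time. Requires 512-bit intrinsic support
   in the compiler/target (e.g. gcc -mavx512f on a suitable CPU). */
#include <immintrin.h>

void saxpy512(float a, const float *x, float *y, int n)
{
    __m512 va = _mm512_set1_ps(a);           /* broadcast a into all 16 lanes */

    for (int i = 0; i + 16 <= n; i += 16) {
        __m512 vx = _mm512_loadu_ps(x + i);
        __m512 vy = _mm512_loadu_ps(y + i);
        vy = _mm512_fmadd_ps(va, vx, vy);     /* 16x (a*x + y) in one go */
        _mm512_storeu_ps(y + i, vy);
    }

    for (int i = n - (n % 16); i < n; i++)    /* scalar tail for leftovers */
        y[i] = a * x[i] + y[i];
}
[/code]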
 