Xeon Phi: Intel's Larrabee-Derived Card In TACC's Supercomputer

Status
Not open for further replies.

cuba_pete

Distinguished
Mar 12, 2010
34
0
18,530
Sock account? Nah...and not an attack if you re-read the thread.

Unless one is a hardware developer/mfr, one cannot just "stick" these things together...that was my point.

If you are a hardware developer/mfr, then it sounds like you could do better. I suggest that...then we can spec it against Intel and whatever AMD comes up with in the future.
 



This makes no sense at all....

I was using layman's terms to describe a complicated implementation of a hybrid-architecture CPU. We could go into very long discussions involving the ALU, AGU, MMU and various internal bus topologies and how they relate to SIMD arrays. Instead we just say it's a slow (as in low clock speed / single-core performance) CPU that uses the x86 ISA and has a large SIMD array. If you knew anything about vector CPUs, you'd know that modern graphics cards are just large, expensive vector processors that utilize SIMD arrays to process tons of simultaneous transactions.

Now kindly remove yon foot from yon mouth and lurk more.
 

army_ant7

Distinguished
May 31, 2009
629
0
18,980
What you're saying does seem to agree with what the article said about how the Xeon Phi only has a very small portion of it that is actually x86. If you would kindly entertain me, I have a few questions. :)

Are you saying that the x86 portion of it is really just there for compatibility's sake? As serves as some sort of abstraction, in a way, for normal x86 instructions to be translated into instructions for the vector processor?

Would you agree with what jaybus said about it being able to handle more general processing than GPGPUs can?
You see, I have heard that what a GPU can process is limited compared to what a CPU can, and that's why you don't see OSes running plainly on GPUs. Is this true?
If so, putting coding difficulty and differences aside, is the Phi pretty much as limited in what its hardware is capable of, compared to GPGPUs?

Just checking, do they still call it "Fusion" or "HSA" now? (Or are they two different things, or is the former just the consumer marketing term for the latter? :)

Thanks for your prior posts, I think I understand it more; and based on how thorough your explanations are, I don't think you're trolling or trying to mislead us here--you seem pretty believable. :)
 
[citation][nom]army_ant7[/nom]Are you saying that the x86 portion of it is really just there for compatibility's sake? As serves as some sort of abstraction, in a way, for normal x86 instructions to be translated into instructions for the vector processor?[/citation]

First we gotta explain that not all processor architectures are equal in all respects. Vector processors are designed to handle large amounts of simultaneous intense math: calculating the molecular density of a neutron star, or doing particle simulations for plasma physics. What they're really bad at is regular logic operations and memory manipulation, things like data loads / stores, compares and instruction management. Those last things are what modern general purpose CPUs do using their ALU / AGU / MMUs, aka "integer operations". Now the CPU inside your desktop has been optimized to process large quantities of those operations; they tend to be very serial in nature and don't work very well in parallel. To try to speed up the process of crunching through them, processor designers create branch predictors, data / instruction caches and other mechanisms to try to process several serial instructions at once.

Now this CPU, "the Phi", is designed to have many small x86 "cores", with each "core" running at 1GHz and being very simple. Each of these "cores" has a limited ALU/AGU/MMU/caching system but a very large SIMD array attached to it. This results in the processor having very low integer operation capability, but a ridiculously high vector processing capability. And as it's x86 you can load it with its own OS, with memory management, device drivers, the entire works. With dedicated add-on co-processors (GPGPU / OpenCL / CUDA) you have to use software libraries on the host OS to separate the native integer / operational code from the specialized vector batch code used for that co-processor. It's this overhead that Intel is avoiding completely, though at the cost of more silicon devoted to non-vector capabilities.

Ex:
You have one server running <insert favorite OS here> that has three Phis plugged in. Each of those Phis is loaded with its own Linux OS and uses the host OS for I/O, network connectivity and permanent storage (HDD). They even get a virtual frame buffer (video output) on the host OS. Now you configure each of these OSes with its own IP address and cluster them together using Linux's existing technology / software. You can now run already-existing software on this due to the x86 compatibility. To run your production software you would want to recompile it with a compiler that supports the specialized Phi vector instructions (GCC / Intel Compiler). Those programs would execute their integer / memory operations like normal, but their SIMD vector instructions would run on the insane arrays the Phi has.

It's a supercomputer in a can.
 

army_ant7

Distinguished
May 31, 2009
629
0
18,980
@palladin9479
Thanks for the reply (I also noticed a wrong word in that quote of my post. I meant "That," not "As." :lol: But anyway...). Interesting stuff and I'm learning something new here, so really, thanks! :D

The only thing I know about vector processors is that they process multiple operations in parallel (I looked it up on Wikipedia a while back out of curiosity). I see that you only mentioned integer and vector operations. I'm wondering, do vector operations only handle floating-point numbers? I have heard that (GP)GPUs are good for floating-point operations. I'm not really sure if that's all they can really process.

With the Xeon Phi cores, do you mean that they don't process integer operations as fast, or that they're really limited as to what they can do (meaning they really can't do some things)?

I noticed mention of x87. I read on Wikipedia that it's related to floating-point operations. If it isn't too much trouble, could you maybe explain how the x87 instructions fit into this? I remember reading about it here: CPU PhysX: The x87 Story. I think I see why x87 was used (because of large (in terms of bits) floating-point data). I also read these to help me understand things a bit: Differences between x87 FPU and SSE2, and the Wikipedia articles on SSE4 and AVX (though just enough to somewhat get a feel of what they are). I'm not sure how they relate to each other, like if AVX is better than SSE4 (which I think is just an update after SSE3) or if they work together. This kinda relates back to what I said about whether vector operations (which I'm not sure are synonymous with SIMD) only handle floating-point data.

Also, just a fun (but maybe not practical thought). Would it be possible to make a Xeon Phi in a way that it's a standalone device (doesn't need a host computer) and install an OS on it?

Sorry about all these questions. Feel free to say something like "just research on these" or "go to school if you wanna learn about them." It's understandable and I wouldn't think any less of you. :)
 
Vector processing is using a single instruction and cycle to do multiple computations at once.

Say

ADD value a to values b c d e

In a general purpose CPU that would be a ton of instructions. (pseudo code)

load  a, r01    # load value of a into register 1
store 0, r02    # clear register 2 for use and to set it to a known state
store 0, r08    # clear register 8 for final output and set it to a known state

start loop:     # repeat once each for b, c, d, e
load  b, r02            # load the next value into register 2
add   r01, r02, r08     # add value a and the current value, output to register 8
store r08, <memory address>   # copy contents of register 8 to a predetermined memory location
return loop:

You would have to run that loop for each iteration of the adds. A vector instruction would add the value of a to all those other values at once; you would only have to copy the contents out of the register stack to a memory location. The instruction doesn't have to be add, it could be sub, mul, div or any combination of them (FMA is a mul followed by an add). The idea is you do all of them simultaneously in a single cycle without many memory operations, rather than 10+ instructions with many memory moves that have latencies.

The downside to vector processors is that by design they're terrible at single integer instructions. "Integer operations" are things like logical compares (<, >, ==, !=) and memory operations. In a HLL this would be the equivalent of comparing variables to each other and acting on the results (true / false) by moving to a different section of code, or copying data from one location to another (memory moves).

What we find is that 90% of the code inside programs is integer code, comparing values and moving things around. That is why modern x86 Intel / AMD CPUs are so popular: they're amazing at integer / memory operations. It's the 10% of code that is used heavily in the HPC world that this seeks to accelerate: financial modeling, scientific calculations and other intense mathematical operations. Those operations tend to be easier to express in vector language.

There is no more "FPU" these days. ALUs / AGUs / MMUs all operate on integer values that have no decimal place. They do binary math on those values, which makes decimals pretty much impossible to express directly. You can do math on decimal numbers by separating the number into two integer sections, doing the operation on each separately, then putting them back together. To do this purely in HW requires that the HW units be made to an exact integer value size (8/16/32/64 bit). To handle decimals that have an unknown (at compile time) decimal size you need a special HW unit that can dynamically adjust itself.

That was all in the '80s and early '90s. Nowadays we use SIMD vector units to do floating-point work. Break the decimal number into multiple pieces and it's easy to express in vector language. So the "FPU" is now emulated through the SIMD units.

AVX / SSE / FMA are all vector instructions. Specifically they define a standard language to express your math in so that the vector processor knows what to do with it.
 

army_ant7

Distinguished
May 31, 2009
629
0
18,980
@palladin9479
Thank God I have some background in programming and computer concepts, or else I might be asking a whole lot of other questions or doing a lot more research/studying. I get it (at least I think so)!

I found the explanation on how floating-point data is dealt with by the CPU, interesting. I guess you really can't express a decimal number plainly in binary unlike integers, and there has to be a certain system to it. (I think my brother pointed this out to me a while back, but I didn't particularly note it in my mind.) I think I see now why floats eat a lot more memory compared to at least short ints.

I see now how SIMD is related to floats. Why is the FPU now emulated with SIMD instructions? Don't they work in the same way FPUs did back then (except maybe that they can handle these broken-down floats in a less memory-operation-intensive manner)?

Please tell me if I got this right. Vector processing doesn't actually do the arithmetic operations concurrently (unless you have multiple ALU's maybe), but just saves on having to do redundant memory operations (like moving data around). Is that right?

Mind if I ask what your educational background is and what you do for a living? :D I applaud you for your knowledge on these topics. I also heard from blazorthon how he learned of the PSCheck technique to attempt to optimize the Bulldozer CPU, from you. Really, thanks! :D
 
[citation][nom]army_ant7[/nom]I see now how SIMD is related to floats. Why is the FPU now emulated with SIMD instructions? Don't they work in the same way FPUs did back then (except maybe that they can handle these broken-down floats in a less memory-operation-intensive manner)?

Please tell me if I got this right. Vector processing doesn't actually do the arithmetic operations concurrently (unless you have multiple ALUs maybe), but just saves on having to do redundant memory operations (like moving data around). Is that right?

Mind if I ask what your educational background is and what you do for a living? :D I applaud you for your knowledge on these topics. I also heard from blazorthon how he learned of the PSCheck technique to attempt to optimize the Bulldozer CPU, from you. Really, thanks! :D[/citation]

When I say the "FPU is emulated", it's in reference to x87 FPU instructions being handled by the SIMD array inside modern x86 CPUs. A modern SIMD array processor technically is an FPU (it does floating-point math), it just also does much more than that. It's basically faster.

Vector processing does do all those instructions simultaneously, typically in ~three CPU cycles (one for load, one for execution, one for store). All modern GPUs are vector processors; well, they're more than that, but their primary engines are vector processors.

http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf

That should show you how many "cores" a GeForce 680 has. Each of those "cores" is capable of a single operation, and multiple can bond together to perform larger operations. Whenever you're rendering a scene you're doing tons of math for each pixel on the screen; most of this math is done simultaneously for each pixel, many pixels at once.

That is why GPGPU / OpenCL and other "GPU accelerated" tasks are able to get such huge benefits. If they can break their work down into many parallel operations, then they can feed those all to the vector processor to be done many times faster than any general purpose CPU could possibly do. That above-referenced 680 is running at approx 1GHz yet smokes the most powerful Intel CPU on the market in vector math. The downside is twofold: one is the insane memory bandwidth vector CPUs need to keep all their units fed with data, many times more than what a typical general purpose CPU needs, and the second is that their slow clock speed and non-speculative nature means serial logic operations are very slow on them.

As for what I do, I'm currently a Systems Engineer. I'm contracted out to do systems analysis work; basically I build on paper the systems that get put into data centers (occasionally I go out and install them myself). My education is highly home-grown, coupled with getting sent to various classes. I'm not an EE nor a software developer, but I need to understand their language in order to do my job. When you're planning to spend a million plus on a system, picking the wrong set of equipment is costly.
 

army_ant7

Distinguished
May 31, 2009
629
0
18,980
I see! :D This is all so interesting to me! I mean, I'm not much of an HPC kind of guy, but the inner (low-level) workings of computer hardware do fascinate me a lot. :p And I have you to thank for the things I'm learning now and the enjoyment associated with it. Thanks!

So you do need multiple ALUs (I'm not sure if I'm correctly referring to the part that does what I'm saying), or the cores they're found in, to be able to fulfill all those arithmetic and logical operations in parallel with vector processing? I mean, if I had a single-core CPU, would SIMD instructions be processed in parallel? (I'm referring to the execution phase at least.) (Now I feel like I'm not totally getting something with this, as I remember you mentioning how memory operations are "integer operations" as well, but I'm also thinking that the MMU and other components might be responsible for those. So please bear with me. Sorry. Hehe...)

I bet you just did this to simplify the explanation, and I highly appreciate that :), but I do remember learning that graphics calculations also involve vector calculations (the geometric-graphics variety; funny how the terms are similar with the topics at hand :p), which involve matrix calculations, which are highly parallelizable I think. (I was thinking it was because you can do arithmetic operations on the individual values in the matrices in parallel, based on what I learned in my Algebra lessons (actually, I only realized this when it was mentioned in that article here about AMD's history posted a month or a few ago), but now that you've opened my mind to how floats are dealt with, I'm thinking there's more to how graphics are processed on a GPU.) Anyway, with what I've just mentioned, I'm not implying that you didn't know all this already. :)

With that occupation, no wonder you're so knowledgeable about this stuff! Well, aside from the fact that, like you seem to have said, you've studied this stuff as an enthusiast as well. Though, I'm not sure if you're telling me that you're being contracted by companies without a degree. That's quite amazing if so! When you said you took some classes, the "some" makes it seem so. Hehe... Thanks again for the replies. I'm really enjoying what you're teaching me, just like how I feel when I read (some) TH articles. :D
 

utomo

Distinguished
Jan 15, 2012
16
0
18,510
Why don't processor manufacturers consider a bigger processor size?
For example, by making it double the size, we could get double the power easily.
 

army_ant7

Distinguished
May 31, 2009
629
0
18,980
Because I don't think it can be done as easily as you claim. For instance, the software that a person uses would have to be able to take advantage of whatever that "bigger" processor has to offer. I'm thinking you mean more cores or something. :)

Aside from that, there are the temperature constraints, as was mentioned in the article as to why the Xeons had a lower clockrate than the Sandy Bridge-E Core i7's. There's probably even more to it than these two reasons I've given. :)

 
G

Guest

Guest
There is no need to rewrite code to use GPUs; it's actually less effort than coding for Xeon Phi, since you don't have to worry about vectorization. Also it seems the K20s are easier to get close to peak performance on. I saw the results of the DGEMM and STREAM benchmarks, and the Phi only gets about 50% of its memory bandwidth and 80% of its flops, and that's with code optimized by Intel engineers.
 

cuba_pete

Distinguished
Mar 12, 2010
34
0
18,530


I understand fully what you were stating.

I have no foot in my mouth. I am not lurking, just another passing peasant...such as yourself.

My point (since you need it spelled out) is that you overstated your position on one's ability to manufacture such a device.

To put it bluntly: if it's so easy, why don't you do it? Answer: because you cannot, apparently.

 


That he/she can't do it is irrelevant because large companies can (they usually have a lot more money than the average person) and really should look into it. That is his/her point that you seem to be ignoring.
 

cuba_pete

Distinguished
Mar 12, 2010
34
0
18,530


I am not "trying" to ignore anything. My point in this is that we hear people poo-pooing products when a company delivers them...play them down because "all they are is...." and such. Intel and AMD fans are good at this.

I would equate this to someone who states that all you really have to do to get a car to make 60 mpg is blah blah blah...and my response to them would be, well, go ahead if you think it's so easy. Intel would be the big car maker in this case which can and does.

It started out as a simple comment. We don't need to get all bent out of shape about it.
 


So, you're saying that we as consumers can't educate ourselves and others about the situation just because we should be happy with products as they are? If so, then I have to disagree. Of course products get talked about; that's what people do. We don't just hide our knowledge about them when they are the subject of discussion (such as this thread) or brought into it for some other reason, such as being part of an example or analogy to prove a point.

It doesn't take anyone being a fan of any company to do this, it just takes intelligent conversation. There's nothing wrong with pointing out the mistakes of the companies and there most certainly is even less right about not pointing them out in relevant places such as this thread.
 
The hilarious part is that cuba is so ignorant of the topic at hand that they think I'm "poo pooing" and / or putting down the Phi, when in actuality I'm praising it as a very innovative idea.

Eventually he / she will wonder what that funny asphalt like taste is.
 
Every time I see something like this, it just makes me drool and realize that even though in my world I just upgraded, I am really far behind in reality. I just got a quad-Xeon setup for home server and my own personal use, and it looks like a wimp compared to something like this.....
 

cuba_pete

Distinguished
Mar 12, 2010
34
0
18,530


Hahahaha...yeah, I thought it was a funny taste...

[ignore]palladin[/ignore]
 
[citation][nom]parallel_progrmr[/nom]there is no need to rewrite code to use GPUs, its actually less effort than coding for Xeon Phi, since you dont have to worry about vectorization. Also it seems the K20's are easier to get performance close to the peak, I saw the results of the DGEMM and Stream Benchmarks, and the Phi only gets about 50% of its memory bandwidth and 80% of its flops, and thats with code optimized by intel engineers.[/citation]

Code has nothing to do with how much memory bandwidth it has and you don't seem to know what you're talking about otherwise, let alone what the rest of us were talking about. Read palladin9479's posts here and you'll probably understand things better.
 
G

Guest

Guest
K-zon

Within that of what has said been posted and say placed within interest of post for rather in regards of any other post and say place of, mainly for comment(s) and article/s of for, would seem of on be an interest within to have a say of such within here of on.

Not in short of where there is to say of not to say for or on, but of such about to finds within that of say many maybe almost on that of just a few if not many of that to say and still just the one.

On place of for what to say find within regards of is left to where any can be found if not just regard on and say with to is no better for without finding better on to no place of. Which along about to would not say of such with.

Almost though as mention with would try to place for say though that interest to would be a means of more within a place of less. And even on is probably said no for much of to in means about but still of to be short less with no more would be to say no more at all for all in all about of.

Which can be fine and such at a time but of within to say of is with that of plae to be without even more for for what ore is of on would say within that on to would be no less about.

To say is loss to all of about within for for about all of to say might still be less though.

Rather there be a difference on any towards the interest of regards to find about is rather the regards about are that of almost to say any.

Some of such to find might be that of what is not within regards of not for about on any interest of difference to towards on.

For all work placed say before to that of any is to almost say that of all is short for what there is to find for any at all. And any left is not long at all to almost say the most for what there could be to say if any at all to say.

Along what could still be within interest on is rather interest is still of to for for where say an interest is.

Of simply put is placed within question to interest on for say use of and on might be left to where use to is still at to say beforehand to rather say, what is still in no use when in interest of use to say find of on?? (Interest of short or less might be within say question of but might be a question still for.)

Just on that of to with of about still almost alone let alone additional still of such to say mention on or just any additional to though probably maybe still at a time and even then still of is still on that of what is not to that of a place to say of question more for to say about. But still, might be said wrong of such on still.. (To long.)

As on within HPC to its' interest of is probably within some interest of to say find at times within on for use to rather within say large or "larger". Small given only to its place for what might be its' regards of then its' interest about perhaps.



 