Intel to Share Next-Gen Poulson CPU Details

Status
Not open for further replies.

ta152h

Distinguished
Apr 1, 2009
1,207
2
19,285
[citation][nom]palladin9479[/nom]Here is another problem with VLIW and the Itanium architecture. The binary encoding of instructions is static and the HW is incapable of executing them out of order. The original Itanium had six execution units, so binaries compiled for that CPU can execute up to six instructions in one pass. But if the user later upgrades their CPU to one with eight or twelve instruction units, the binary could still only execute on six and the other units would be permanently stalled. You would have to go back and recompile ~everything~ to support the 12 instruction unit model. And if in the future they introduced a 16 or 24 unit model, then you'd have to do all that recompiling all over again.

Now let's reverse it. Let's say MS goes out and compiles W2K8 for the *newer* Itanium with 12 instruction units. All the application makers go out and do this too; your entire software base goes out and does this. Guess what happens should you try to run those binaries on the older 6 instruction unit hardware? They will not execute properly, if at all. They are statically encoded to send up to 12 instructions to a 6 instruction system. It will work just fine until the code tries to send a 7th simultaneous instruction, and suddenly you will get an exception which will cause a nonmaskable interrupt (NMI); most likely this will cause the system to crash. This would require the compiler to compile binary code for multiple instances of the CPU and then have the binary check and determine if it should execute 6, 12, 16 or 24 instruction code.

This is all because VLIW based architecture is perfect for when the software is being written directly to a very specific known architecture, stuff used in DSPs or GPUs. It's absolutely a bad idea in a general purpose CPU which can be upgraded or switched out, comes in multiple flavors, and may be expected to execute any random amount of code at any random time.[/citation]

Sadly, you don't know what you're talking about. The hardware uses the compiler to optimize instruction packing, but it does not execute the instructions. You'd just be sending four packets instead of two, and they would execute fine on older hardware, although possibly with a small performance penalty since it's not optimized for it.

The same is true for x86. Remember the Pentium 4 and how code had to be optimized for it to execute well?
 

mikem_90

Distinguished
Jun 24, 2010
449
0
18,780
[citation][nom]mayankleoboy1[/nom]i do want one, but i really have no use for it[/citation]

These are CPUs for the kind of systems IBM built back in the 70's and 80's, where the global market in number of sales (systems, not CPUs; they'll sell a few thousand CPUs in each system install) was maybe five... six.

You don't sell a kidney to get one, you mortgage a small European country.

I too am wondering if this will be as epic a failure as the previous Itaniums, or if Intel has truly learned their lesson and redid most of the x86 microcode.

The cooling for these systems was typically a "Brick" module, liquid cooled, using some material I forget (barium?) which, if it leaked, was a big hazmat incident.

A buddy of mine from SGI used to play with these. He was a MIPS fan mostly.
 
[citation][nom]ta152h[/nom]Sadly, you don't know what you're talking about. The hardware uses the compiler to optimize instruction packing, but it does not execute the instructions. You'd just be sending four packets instead of two, and they would execute fine on older hardware, although possibly with a small performance penalty since it's not optimized for it.

The same is true for x86. Remember the Pentium 4 and how code had to be optimized for it to execute well?[/citation]

Not even close. The HW doesn't use a compiler to do a damn thing; the compile operation happens long before the code gets near the HW. VLIW architecture has no instruction ordering / branch prediction units; instead the compiler is supposed to do all that and render the binary. The idea is that the compiler has vastly more time and resources to determine the proper branching of instructions than a HW branch prediction unit would have, and thus should be more accurate. The compiler would analyze the code, determine the code path and which instructions can be executed in parallel, then encode everything into binary format. The maximum number of simultaneous instructions would be based on the target architecture: if the target could process 6 instructions, the compiler would encode a maximum of 6 simultaneous instructions; if the target could process 12, the compiler would encode a maximum of 12. This is what ILP is all about: encoding multiple instructions into a single command, then sending it to the CPU to be processed at once.
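To make the packing step concrete, here is a rough Python sketch. It is a toy model of my own, not how any real Itanium compiler works: the `Op` type, the register names, and the greedy in-order strategy are all made up for illustration, and it only checks read-after-write and write-after-write conflicts inside a bundle. The point is just that the bundle width is a number picked at compile time.

```python
from typing import NamedTuple


class Op(NamedTuple):
    """Hypothetical instruction: one destination register, some source registers."""
    dest: str
    srcs: tuple


def pack_bundles(ops, width):
    """Greedily pack ops, in program order, into bundles of at most `width` ops.

    An op may join the current bundle only if it neither reads nor writes a
    register already written by an op in that bundle (simplified hazard check).
    """
    bundles, current, written = [], [], set()
    for op in ops:
        depends = op.dest in written or any(s in written for s in op.srcs)
        if depends or len(current) == width:
            # Close the current bundle and start a new one.
            bundles.append(current)
            current, written = [], set()
        current.append(op)
        written.add(op.dest)
    if current:
        bundles.append(current)
    return bundles


# Four independent loads, then a small reduction tree over their results.
ops = [
    Op("r1", ("a",)), Op("r2", ("b",)), Op("r3", ("c",)), Op("r4", ("d",)),
    Op("r5", ("r1", "r2")), Op("r6", ("r3", "r4")),
    Op("r7", ("r5", "r6")),
]

print(len(pack_bundles(ops, 2)))  # 4 bundles when the target is 2 wide
print(len(pack_bundles(ops, 4)))  # 3 bundles when the target is 4 wide
print(len(pack_bundles(ops, 6)))  # still 3: the dependency chain is the limit
```

In this model the same source program produces a different binary layout for each target width, which is the static encoding I'm describing.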

On a typical superscalar CPU there is instruction ordering and branch prediction hardware that analyzes incoming code and determines which instructions can be processed simultaneously and which branches will be taken. This HW doesn't exist inside a VLIW processor.

In theory VLIW should be faster, and it is if the software is encoded explicitly for VLIW and doesn't change or have much random branching. The breakdown happens the moment your architecture changes, as older binaries are not able to take advantage of it, and in real world scenarios where the context of the operation cannot be predicted at compile time. A compiler has no way to know exactly what external inputs will be fed to the program and thus cannot possibly predict which way those dependent branches will go. A HW branch prediction unit does have knowledge of the context of the operation and can make somewhat accurate predictions if the dependent instruction references something in memory.

Ex,

Something like this would be easy for a VLIW (pseudo ASM)
STORE AX, 0 (set AX register to 0)
ADD 1, 2 (Add the value of 1 and 2 and store to AX)
ADD AX, 4 (Add the value of 4 to the AX register)
CMP AX, 7 (is AX 7)
JE AX_EQUAL_TRUE (jump to label AX_EQUAL_TRUE if the value is true)
JMP AX_NOT_EQUAL (else jump to AX_NOT_EQUAL)

:AX_EQUAL_TRUE
more code

:AX_NOT_EQUAL
more code

A compiler could easily see what AX would be (1 + 2 + 4 = 7) and build the binary so that AX_EQUAL_TRUE is executed at the same time that AX is being added to, knowing that the code would jump there anyway. But suppose we modify the code like this:

STORE AX, 0 (set AX register to 0)
ADD 1, C800:007F (Add the value of 1 and whatever is in C800:007F and store to AX)
ADD AX, 4 (Add the value of 4 to the AX register)
CMP AX, 7 (is AX 7)
JE AX_EQUAL_TRUE (jump to label AX_EQUAL_TRUE if the value is true)
JMP AX_NOT_EQUAL (else jump to AX_NOT_EQUAL)

:AX_EQUAL_TRUE
more code

:AX_NOT_EQUAL
more code

In this example a memory address is being referenced for the value to be added to AX. If a previous predictable operation wrote a value to this memory address then it's fine. If not, and the value at this address is not known at compile time, then the compiler would have no idea whether to jump or not at the branch. The compiler would have to make a WAG (wild a$$ guess) and hope it was right. If during execution the compiler's guess turns out to be wrong, the CPU will stall as the instruction pipeline is flushed and we take the branch and reload from there. Hopefully the information is in L1 cache so the stall won't be long; if it's not, then we must go to L2 cache and things start to look much worse for our little program.
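Here is a small Python sketch of the difference between a fixed compile-time guess and a predictor that watches the branch at run time. Everything in it is an assumption for illustration: the branch history is invented, and the 2-bit saturating counter is just the textbook scheme, not what any particular CPU ships with.

```python
def static_mispredicts(outcomes, guess_taken):
    """Misses when one fixed, compile-time guess is used on every execution."""
    return sum(1 for taken in outcomes if taken != guess_taken)


def two_bit_mispredicts(outcomes, state=1):
    """Misses for a 2-bit saturating counter.

    States 0-1 predict "not taken", states 2-3 predict "taken"; the counter
    moves one step toward the actual outcome after every branch.
    """
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses


# Hypothetical history: whatever sits at the memory address makes the branch
# taken 20 times in a row, then not taken 20 times in a row.
outcomes = [True] * 20 + [False] * 20

print(static_mispredicts(outcomes, guess_taken=True))   # 20
print(static_mispredicts(outcomes, guess_taken=False))  # 20
print(two_bit_mispredicts(outcomes))                    # 3
```

With this made-up history, either static guess is wrong half the time, while the counter only misses while it is adapting to the change in behavior. A history that never changes would make the correct static guess just as good.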

The issue hinges on whether the compiler can make the proper decision without knowing the context the program will be operating under. If so then great, else a HW branch prediction engine would be better. The industry heavyweights (SUN / IBM) have already shown that VLIW will not work in a real production environment; there is no magic secret sauce inside the compiler that can make the impossible possible. And yes, if a VLIW binary is encoded with 7 simultaneous instructions and tries to execute that on a 6 process system, the CPU will trigger an exception and an NMI; it's an illegal instruction. VLIW CPUs cannot reorder instructions. They must execute what they're given, and if they can't execute the instruction then they must discard it and let the program know.
 