News Intel Itanium gets a new lease on life, sort of — GCC 15 "un-deprecates" Linux compiler support

The article said:
Itanium relied on what's known as the IA-64 architecture, a Very Long Instruction Word (VLIW) architecture that took advantage of a software compiler to calculate instructions that were to be executed in parallel in advance of when those instructions needed to be executed.
It's a relative of VLIW, but they're not the same. Intel coined the term EPIC (Explicitly Parallel Instruction Computing) for it, specifically in contrast to VLIW, RISC, or CISC. The difference is that the compiler only computes the data dependencies (hence the "Explicitly Parallel" part), whereas VLIW code is also fully scheduled at compile time.

It's not a trivial difference, since VLIW binaries are usually tied to one specific generation of a CPU and must be recompiled when you want to run them on a newer one. This is one of the factors that limited VLIW to embedded computing applications, although it also doesn't do well on very branchy code or with difficult-to-predict data dependencies.

EPIC binaries are forward-compatible with newer CPUs, which can dispatch their instructions across a greater number of execution units to provide higher IPC without recompilation.
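To make the difference concrete, here's a rough toy model (Python, with made-up instruction names; real IA-64 marks group boundaries with stop bits in its bundles): the compiler only records where a group of mutually independent instructions ends, and machines of different issue widths can all run the same encoding, the wider one just finishes in fewer cycles. A classic VLIW would instead bake the slot assignment for one specific width into the binary, which is why it has to be recompiled for a wider machine.

def cycles_to_run(program, issue_width):
    """program: list of (instruction, ends_group) pairs; ends_group marks a 'stop'."""
    cycles = 0
    i = 0
    while i < len(program):
        # Collect the current group of mutually independent instructions.
        group = []
        while i < len(program):
            name, ends_group = program[i]
            group.append(name)
            i += 1
            if ends_group:
                break
        # A group may take several cycles on a narrow machine,
        # but the encoding never changes for a wider one.
        cycles += -(-len(group) // issue_width)   # ceiling division
    return cycles

# Hypothetical stream: three independent adds, then a multiply that uses their results.
prog = [("add r1", False), ("add r2", False), ("add r3", True),
        ("mul r4", True)]

print(cycles_to_run(prog, issue_width=2))   # older, narrower core: 3 cycles
print(cycles_to_run(prog, issue_width=4))   # newer, wider core: 2 cycles, same binary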

The article said:
x86 CPUs use a hardware-based scheduler to schedule instructions to the cores (or execution units)
I don't know why you kept "cores" in there, but it's wrong. Just "execution units".


BTW, the guy who stepped up to maintain GCC's IA-64 support is the same one targeted in this swatting incident:
 
What I find interesting in hindsight is that the most primitive instruction set (and register) architecture turned out to be the easiest to push forward for decades, while all those which tried to exploit hardware advances for additional performance suffered when those tricks were no longer useful, or no longer worked, a generation later.

E.g. very early microprocessors would take several clock cycles to do anything, but once certain operations could be done in a single cycle, wasting even a single cycle became something engineers wanted to avoid. So when they designed the very first SPARC CPUs, they knew that a control transfer (branch or call) would require some extra time during which the ALUs would idle; thus, if the code had anything useful to do that was independent of the control transfer, the compiler could exploit that single instruction slot to "run it in parallel" after the branch instruction, otherwise it would insert a NOP.

Now, this is all just off the top of my head 40 years later, so details could be wrong, but imagine designing your CPU and your compilers to exploit one cycle of potential parallelism...!
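For anyone who never had to deal with it, what the compiler did with that one slot boils down to something like this toy sketch (Python, with a made-up instruction representation, not real SPARC assembly): if the last instruction before the branch doesn't feed the branch condition, move it into the slot after the branch; otherwise pad with a NOP.

def fill_delay_slot(block, branch):
    """block: (text, reads, writes) tuples before the branch; branch: (text, reads)."""
    branch_text, branch_reads = branch
    # The last instruction of the block is the simple, safe candidate:
    # nothing after it (except the branch) can depend on its result.
    if block and not (block[-1][2] & branch_reads):
        slot = block.pop()[0]    # hoist it into the delay slot
    else:
        slot = "nop"             # nothing safe to move: waste the cycle
    return [text for text, _, _ in block] + [branch_text, slot]

block = [("add r3, r1, r2", {"r1", "r2"}, {"r3"}),
         ("sub r5, r4, r4", {"r4"}, {"r5"})]
branch = ("beq r3, r0, loop", {"r3", "r0"})

print(fill_delay_slot(block, branch))
# ['add r3, r1, r2', 'beq r3, r0, loop', 'sub r5, r4, r4']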

These days physical registers outnumber architectural registers many times over, and the hardware schedules quite a few parallel instructions on every clock cycle, which is only possible with runtime information and quite a bit of speculative execution on top.

Another architecture perhaps even more fatally tied to its compilers was early MIPS: the R2000/R3000 CPUs relied on the compiler to deal with pipeline hazards, to save transistors better spent elsewhere.

But getting it wrong resulted in a locked-up CPU (and a bus fault?), which quickly negated the transistor-saving benefits, and perhaps that lost the race for MIPS even though it was fixed a generation later.

Again, this is all from fading memory and very anecdotal.

I'd love to see a lecture where someone in the know explores how some of these early architectures could have fared technically, had they been given the kind of resources x86 got. Ivan Godard might be able to do that, but he's a little too focused on the Belt architecture currently, which isn't moving forward as far as I can tell.

Note the "technically": I'm well aware that the fate of most architectures was really decided by business, politics, markets, etc.

In the case of Itanium, I'm pretty sure (in my gut) that the overhead of EPIC would have turned into nothing but deadly ballast, while MIPS might have done as well as ARM (and it actually still survives in China).

With SPARC I'm less sure; it's primitive enough, but the register windows seem far too close to the metal.

I see no technical reason why VAX and the IBM 360 couldn't have done just as well as x86; both seem to be at similar levels of abstraction and simplicity.
 
when they designed the very first SPARC CPUs, they knew that a control transfer (branch or call) would require some extra time during which the ALUs would idle; thus, if the code had anything useful to do that was independent of the control transfer, the compiler could exploit that single instruction slot to "run it in parallel" after the branch instruction, otherwise it would insert a NOP.
While delay slots are no longer something you find in modern ISAs, I think SPARC also pioneered register windows. I believe Loongson adopted those.
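For anyone who hasn't run into register windows: roughly, each call slides a window over a larger physical register file, so the caller's "out" registers become the callee's "in" registers and arguments never touch memory until the windows run out and have to be spilled. A toy sketch (Python; sizes and register numbering are illustrative, not the exact SPARC layout):

PHYS_REGS = [0] * 64        # the large physical register file
STEP = 8                    # window overlap: the caller's "outs" are the callee's "ins"
cwp = 0                     # current window pointer

def reg(index):
    """Map a window-relative register number to a physical register."""
    return (cwp * STEP + index) % len(PHYS_REGS)

def save():                 # procedure entry: slide the window forward
    global cwp
    cwp += 1

def restore():              # procedure return: slide it back
    global cwp
    cwp -= 1

PHYS_REGS[reg(8)] = 42      # caller places an argument in its "out" register 8...
save()
print(PHYS_REGS[reg(0)])    # 42: after SAVE the callee sees it as its "in" register 0
restore()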

Didn't ARM also have branch delay slots, at one point?

MIPS might have done as well as ARM (and it actually still survives in China).
Imagination ended up with MIPS, but they couldn't make a go of it and switched to designing RISC-V cores.

With SPARC I'm less sure; it's primitive enough, but the register windows seem far too close to the metal.
Well, SPARC went royalty-free several years before RISC-V gained critical mass. That wasn't enough to save it, nor does it seem to be enough to save OpenPOWER. When companies finally do something right, it's usually much too late.
 
I'd love to see a lecture where someone in the know explores how some of these early architectures could have fared technically, had they been given the kind of resources x86 got. Ivan Godard might be able to do that, but he's a little too focused on the Belt architecture currently, which isn't moving forward as far as I can tell.
One interesting thing about The Mill's Belt is that it does not actually exist. It is mostly an abstraction used for instruction encoding.
Each result is stored in the producing execution unit's output latches, and when a new instruction produces a result there, the old output gets lost. The length of the belt is the issue width of the machine.
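As a programming model it boils down to something like this toy sketch (Python; the length and operation names are made up, and as said above the real hardware keeps no actual queue): results are addressed by how recently they were produced instead of by register name, and old ones simply fall off.

from collections import deque

BELT_LENGTH = 8
belt = deque([0] * BELT_LENGTH, maxlen=BELT_LENGTH)

def drop(value):
    """Every result is dropped onto the front of the belt; the oldest falls off the end."""
    belt.appendleft(value)

def b(pos):
    """Read a value by its position on the belt (0 = most recent)."""
    return belt[pos]

drop(3)                 # some producer drops 3    -> belt: 3, 0, 0, ...
drop(4)                 # another producer drops 4 -> belt: 4, 3, 0, ...
drop(b(0) + b(1))       # "add b0, b1" reads operands by position and drops the sum
print(b(0))             # 7, and no register was ever named or overwritten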

Itanium was supposedly based on work by Bob Rau, who died prematurely.
Ivan Godard (who is Mill Computing's main compiler guy and public face) has described The Mill as being close to what Intel might have come up with had Bob Rau's vision been kept intact and they had iterated a few times before release.

There are still a few similarities between Itanium and The Mill:
- Both have a separate hidden "safe stack" for temporaries and return addresses.
- The Mill's spiller vs. the Register Stack Engine (which Itanium was supposed to have but AFAIK never got): it works in the background, shuffling the hidden stack between core and memory during spare bus cycles.
- The belt encoding vs. rotating registers in loops.
- The Mill is also a type of VLIW, but flipped onto the temporal axis, which supposedly is easier to pipeline.
- Both are explicitly scheduled by the compiler, like EPIC.

The latest I read on their forum is that they are supposedly running compiled code in simulation, with simulated timing, and thus have proven (to themselves) that the concept works.
I see no technical reason why VAX and the IBM 360 couldn't have done just as well as x86; both seem to be at similar levels of abstraction and simplicity.
Some people would call z/Architecture an incarnation of the IBM 360. The latest CPU, the "Telum II" in the z17, was released this year.
While delay slots are no longer something you find in modern ISAs, I think SPARC also pioneered register windows. I believe Loongson adopted those.
LoongArch does not have delay slots, and it didn't adopt register windows. It omitted more quirks of MIPS, making it 64-bit (ready) from the start, but the MIPS legacy still shows in some of its behaviour and specification.
Imagination ended up with MIPS, but they couldn't make a go of it and switched to designing RISC-V cores.
Imagination sold MIPS. Then it was resold, and the new owner was restructured and adopted the name MIPS.
But yes, that company ditched the MIPS architecture to design RISC-V cores.
 
One interesting thing about The Mill's Belt is that it does not actually exist. It is mostly an abstraction used for instruction encoding.
Each result is stored in the producing execution unit's output latches, and when a new instruction produces a result there, the old output gets lost. The length of the belt is the issue width of the machine.

Itanium was supposedly based on work by Bob Rau, who died prematurely.
Ivan Godard (who is Mill Computing's main compiler guy and public face) has described The Mill as being close to what Intel might have come up with had Bob Rau's vision been kept intact and they had iterated a few times before release.
I've watched all of Ivan's lectures end-to-end and they were absolutely fascinating, while also a pleasure to watch because Ivan speaks one of my favorite languages so beautifully.

And at first glance the Belt is so incredibly elegant and convincing: clearly a much better way to do general-purpose compute than the ordinary approaches.

So why does nobody care?

I've tried to ask around the HPC crowd I occasionally mingle with, and I even tried to get a word on it from Calista Redmond when I met her in Grenoble a few years ago: they had clearly never heard of him or the Belt. But RISC-V people are all sold on general-purpose CPUs being a dead horse: it's all in the special-purpose accelerators and extensions.

The Belt is all about doing general-purpose compute with far fewer transistors. But unfortunately there is no longer any scarcity of transistors, unlike in the DSP days when he formed his mindset. In fact, CPU designers gain so many transistors with every shrink that they really struggle to turn them into any kind of value.

Of course you could use those transistors to multiply the cores. But getting value out of 255 general purpose cores on a single chip isn't that easy. Even the hyperscalers can't think of any great way to have them cooperate on anything. Instead they have them serving an ever greater number of separate customers.

Some people would call z/Architecture an incarnation of the IBM 360. The latest CPU, the "Telum II" in the z17, was released this year.
Oh yes, z/Arch is living proof of how the 360 lives on, and it is the closest equivalent to x86 in terms of evolutionary potential.

Except, of course, that Gene Amdahl & Co. actually designed the 360 to be forward-looking and modular, with instructions that might be microcode on smaller machines and hardwired on bigger ones, while x86 (8086) started as a hack on a hack (8080) on a hack (8008) on a hack (4004) and was arguably never designed at all.

But I guess the forward-looking parts of the 360 were at such a primitive level that they never had a negative impact on the evolution of the ISA. I've never looked in detail at, e.g., the instruction encodings and formats, but I'm pretty sure leftovers such as BCD would no longer be useful, and whole addressing variants would be useless.

So yes, if it weren't for IBM's lawyers, x86 might have died long ago and PCs might be running z.
LoongArch does not have delay slots, and it didn't adopt register windows. It omitted more quirks of MIPS, making it 64-bit (ready) from the start, but the MIPS legacy still shows in some of its behaviour and specification.

Imagination sold MIPS. Then it was resold, and the new owner was restructured and adopted the name MIPS.
But yes, that company ditched the MIPS architecture to design RISC-V cores.
I never dipped deeply into LoongArch, but I can appreciate the attraction from the Chinese perspective.
I believe MIPS was very popular in textbooks, so whole generations of CPU design engineers might have been raised on it. And then there are things like compilers and OSes ready to use and play with.
Finally: no lawyers chasing after you as long as you don't try to sell to Western markets.

RISC-V has all those advantages as well now, and then some. So with it being the most widely used in teaching and being so accessible, what remains of MIPS probably has little room to grow, unless lawyers and politicians change the rules.
 
The Belt is all about doing general-purpose compute with far fewer transistors. But unfortunately there is no longer any scarcity of transistors, unlike in the DSP days when he formed his mindset. In fact, CPU designers gain so many transistors with every shrink that they really struggle to turn them into any kind of value.
I wouldn't reduce the value of The Mill to just transistor count and issue width. There are unorthodox solutions on multiple fronts intended to make it faster than conventional processors on general-purpose code.

Personally I don't really care about that at all. I think that general-purpose processors have been fast enough for a decade.
What I like the most about The Mill are its features for OS and user-space programs, especially when it comes to security.
It is not vulnerable to most speculation-based attacks, including Spectre and Meltdown. Return-oriented programming is not possible with a hidden safe stack. An inter-procedure call can be done in one clock cycle, and you can pass a region of memory (at byte granularity) with the access rights rescinded afterwards.
When you enter a procedure, the stack frame reads as zero without having to clear physical RAM, and when you leave, the old frame gets scrubbed.
I have a hobby project to develop a virtual machine with an ahead-of-time compiler to emulate some of those properties on conventional processors: ARM, x86 and RISC-V. (At some performance cost, of course.)
I did originally want to support The Mill as well (of course), but because its future seems uncertain and the team is holding important details secret, I don't see how I can.
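For the stack-frame behaviour, the emulation on a conventional machine boils down to something like this toy sketch (Python, purely illustrative: the real Mill makes new frames read as zero without physically clearing anything, and my VM doesn't necessarily do it exactly this way either):

class Stack:
    def __init__(self, size):
        self.memory = bytearray(size)   # the VM's data stack
        self.frames = []                # (base, length) of live frames
        self.top = 0

    def enter(self, frame_size):
        base = self.top
        # Software emulation: actually zero the bytes so the new frame reads as zero.
        self.memory[base:base + frame_size] = bytes(frame_size)
        self.frames.append((base, frame_size))
        self.top += frame_size
        return base

    def leave(self):
        base, frame_size = self.frames.pop()
        # Scrub the old frame so a later callee can never read stale data.
        self.memory[base:base + frame_size] = bytes(frame_size)
        self.top = base

stack = Stack(1024)
base = stack.enter(16)
print(stack.memory[base])          # 0: uninitialized locals read as zero
stack.memory[base] = 0xAB          # the callee writes a local...
stack.leave()
print(stack.memory[base])          # 0 again: nothing leaks into the next frame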
 
I wouldn't reduce the value of The Mill to just transistor count and issue width. There are unorthodox solutions on multiple fronts intended to make it faster than conventional processors on general-purpose code.

Personally I don't really care about that at all. I think that general-purpose processors have been fast enough for a decade.
What I like the most about The Mill are its features for OS and user-space programs, especially when it comes to security.
It is not vulnerable to most speculation-based attacks, including Spectre and Meltdown. Return-oriented programming is not possible with a hidden safe stack. An inter-procedure call can be done in one clock cycle, and you can pass a region of memory (at byte granularity) with the access rights rescinded afterwards.
When you enter a procedure, the stack frame reads as zero without having to clear physical RAM, and when you leave, the old frame gets scrubbed.
I have a hobby project to develop a virtual machine with an ahead-of-time compiler to emulate some of those properties on conventional processors: ARM, x86 and RISC-V. (At some performance cost, of course.)
I did originally want to support The Mill as well (of course), but because its future seems uncertain and the team is holding important details secret, I don't see how I can.
I was most attracted by how elegantly it relieves resource pressures by stepping outside the conventional. And when that implied it was also less susceptible, or even immune, to control-flow hijacking, that was icing on the cake. Much of my work is in mission-critical systems, so I keep a lookout for that sort of thing.

Shadow-stack support and tagged-memory extensions have now finally found their way into the more conventional architectures, and there are other, even tougher hardware approaches coming on RISC-V, but it's become strangely quiet on that front, to be quite honest.

I had been trying to wrap my head around how strangely evolution works in IT, and how things like Unixoids could still survive when they are clearly 1960s designs and no match for unknown code from unknown actors running on your most personal computers. Yet here we are: Linux thrives and Plan 9 never got anywhere. Capability-based addressing is still extremely niche even though it's been around just as long and could avoid so many problems.

But being off the beaten path only ever pays off when the main path suffers very severe issues, and evidently we haven't gotten there yet.

And I wouldn't like that happening either 🙂