News Tachyum, long on wild performance claims and short on actual silicon, delays the Prodigy Universal processor to the second half of 2024 — meme chip...

I never did believe their hype. This chip/company fits the mold of many before it: companies that make lofty promises but either fail to deliver or are so late that the industry has already moved beyond their solution by the time they finally do.

Not that I don't feel some sympathy, but the amount of hubris and overpromising in their messaging doesn't exactly make it easy.

Also, the whole episode of using Cadence as a scapegoat didn't exactly improve my feelings towards them. While I can believe some of their complaints, I highly doubt their chip wouldn't have been held up for other reasons, even if Cadence had executed perfectly.
 
Tachyum describes its first-gen Prodigy as a universal monolithic processor integrating up to 192 proprietary 64-bit VLIW-style cores
No, they revised their ISA some time ago. They ditched the original plan to use VLIW, which I found mildly disappointing.

"Tachyum still calls their architecture ‘Prodigy’. But they’ve overhauled it in response to customer feedback. VLIW bundles are gone in favor of a more conventional ISA. Hardware scheduling is much more capable, improving performance per clock. And the cache hierarchy has seen major modifications. 2022 Prodigy’s changes are sweeping enough that much of the analysis done on 2018’s Prodigy no longer applies."

Source: https://chipsandcheese.com/2022/08/26/tachyums-revised-prodigy-architecture/

Prodigy now has two 1024-bit matrix units per core.
Like I said above about how hard it is to leapfrog the industry: even ARM now has matrix-multiply extensions!


I'm not aware of any ARM CPUs actually implementing it, but probably by the time Prodigy comes to market...
 
Yeah, these delays have been disappointing. I was eager to see their promises reached, though as bit_user mentions, it's hard to leapfrog the industry. I think the days when leapfrogging the industry was 'common' have left the building; as far as I can tell, the frequency of it happening has significantly diminished over my lifetime.
 
it's hard to leapfrog the industry. I think the days when leapfrogging the industry was 'common' have left the building; as far as I can tell, the frequency of it happening has significantly diminished over my lifetime.
The last examples I'm aware of were all Japanese.

However, PEZY - the most interesting, with an exotic RF-coupled alternative to HBM - got hit with some massive corruption case that seems to have derailed their business, at least for quite a while.

As for Preferred Networks, I'm not sure if their processors are for sale outside Japan.

Lastly, Fujitsu's A64FX did finally reach global markets, but probably not at a compelling price point by the time it did. It's probably of limited interest to anyone not working on auto-vectorization or ARM SVE.

The way things are going, the next instance will likely be Chinese.
 
No, they revised their ISA some time ago. They ditched the original plan to use VLIW, which I found mildly disappointing.
"Tachyum still calls their architecture ‘Prodigy’. But they’ve overhauled it in response to customer feedback. VLIW bundles are gone in favor of a more conventional ISA. Hardware scheduling is much more capable, improving performance per clock. And the cache hierarchy has seen major modifications. 2022 Prodigy’s changes are sweeping enough that much of the analysis done on 2018’s Prodigy no longer applies."​


Like I said above about how hard it is to leapfrog the industry: even ARM now has matrix-multiply extensions!

I'm not aware of any ARM CPUs actually implementing it, but probably by the time Prodigy comes to market...
So, basically an entirely different chip than the one announced in 2018. Not holding my breath on this one.
 
Tachyum has posted a bunch of command-line demos of supposedly running operating systems and programs on Prodigy on FPGA. So they do have something, but it takes more than some Verilog to make a chip.

Like I said above about how hard it is to leapfrog the industry: even ARM now has matrix-multiply extensions!
So does Intel, and there have been discussions about RISC-V getting some too.

Matrix multiplication of small data types is used for AI / "deep learning"; those workloads don't require much precision, though.
ARM and RISC-V have extensions for fp16 and bf16 (32-bit IEEE floating point with the bottom 16 fraction bits chopped off). There are also ops using integers. Tachyum's Prodigy is even supposed to support 4-bit integers, for stupidly great throughput at stupidly low precision.

BTW, a matrix multiplication consists of many dot products, so if you see a low-precision dot-product instruction in an ISA, you can bet that it was also put there for AI.
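To make that concrete, here's a minimal C sketch (my own illustrative code, not taken from any ISA manual): a truncating float32-to-bf16 conversion, plus the kind of int8 dot product that a matrix multiply decomposes into and that instructions like Arm's SDOT/UDOT accelerate.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <stdio.h>

/* bf16 keeps float32's sign bit and 8-bit exponent but only the top 7
 * fraction bits; the simplest conversion just chops off the low 16 bits
 * (real hardware usually rounds to nearest-even instead of truncating). */
static uint16_t float_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);      /* type-pun without aliasing issues */
    return (uint16_t)(bits >> 16);
}

/* Low-precision dot product: int8 multiplies, int32 accumulation.
 * Hardware dot-product instructions (e.g. Arm's SDOT/UDOT) do essentially
 * this, several pairs per SIMD lane, per instruction.                    */
static int32_t dot_i8(const int8_t *a, const int8_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

int main(void)
{
    int8_t a[4] = { 1, -2, 3, -4 };
    int8_t b[4] = { 5,  6, 7,  8 };
    /* A matrix multiply is just one such dot product per output element:
     * C[i][j] = dot(row i of A, column j of B).                          */
    printf("dot = %d, bf16(1.5f) = 0x%04x\n", dot_i8(a, b, 4), float_to_bf16(1.5f));
    return 0;
}
```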
 
So does Intel, and there have been discussions about RISC-V getting some too.
Intel's AMX is not quite as general as ARM's SME.

BTW, a matrix multiplication consists of many dot products, so if you see a low-precision dot-product instruction in an ISA, you can bet that it was also put there for AI.
Dot products on individual rows & columns were the old way of doing it. However, the downsides are that you have to perform redundant fetches of the same data to compute a matrix product, and parallelism is limited to however many independent vector dot products the CPU can do at once. The point of a matrix or tensor product engine is to enable greater parallelism while reducing redundant data movement.
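A toy illustration of the data-reuse point (plain C, arbitrary 4x4 tile, not any particular engine's scheme): each loaded value is reused several times inside the tile, versus two loads per multiply-accumulate in the naive row-times-column loop.

```c
#include <stddef.h>

#define TILE 4   /* arbitrary tile size; a real matrix engine fixes this in hardware */

/* Tiled matrix multiply (A is MxK, B is KxN, C is MxN, row-major).
 * For each k step, 2*TILE values are loaded and feed TILE*TILE
 * multiply-accumulates, versus two loads per multiply-accumulate in the
 * naive row-times-column form. Assumes M and N are multiples of TILE.   */
static void matmul_tiled(const float *A, const float *B, float *C,
                         size_t M, size_t N, size_t K)
{
    for (size_t i0 = 0; i0 < M; i0 += TILE) {
        for (size_t j0 = 0; j0 < N; j0 += TILE) {
            float acc[TILE][TILE] = {{0}};            /* ideally lives in registers */
            for (size_t k = 0; k < K; k++) {
                float b_row[TILE];
                for (size_t j = 0; j < TILE; j++)
                    b_row[j] = B[k * N + (j0 + j)];   /* loaded once per k...       */
                for (size_t i = 0; i < TILE; i++) {
                    float a = A[(i0 + i) * K + k];    /* loaded once per k...       */
                    for (size_t j = 0; j < TILE; j++)
                        acc[i][j] += a * b_row[j];    /* ...each reused TILE times  */
                }
            }
            for (size_t i = 0; i < TILE; i++)
                for (size_t j = 0; j < TILE; j++)
                    C[(i0 + i) * N + (j0 + j)] = acc[i][j];
        }
    }
}
```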
 
No, they revised their ISA some time ago. They ditched the original plan to use VLIW, which I found mildly disappointing.
Do you mean that (1) the original VLIW was mildly disappointing or (2) the fact that they ditched VLIW was disappointing?

In my opinion, changing the original Tachyum design (which already had high enough performance-per-watt to be respected on the market) before any actual Tachyum V1 VLIW chips had been manufactured in sufficient quantities was a mistake. They should have concentrated on first delivering an actual product to the market in sufficient quantities, and only then abandoned the V1 VLIW architecture.
 
Do you mean that (1) the original VLIW was mildly disappointing or (2) the fact that they ditched VLIW was disappointing?
I was hoping for some kind of fresh take on VLIW, or an effective use of JIT + runtime profile-driven optimization.

In my opinion, changing the original Tachyum design (which already had high enough performance-per-watt to be respected on the market) before any actual Tachyum V1 VLIW chips had been manufactured in sufficient quantities was a mistake. They should have concentrated on first delivering an actual product to the market in sufficient quantities, and only then abandoned the V1 VLIW architecture.
Yes, there's something to be said for actually delivering "good enough". I think Ventana got this part right. They supposedly completed their first generation, even though I think it's not quite a product anyone is willing to deploy. However, completing it seems to have given them enough credibility to line up customers for their v2 product.
 
I was hoping for some kind of fresh take on VLIW, or an effective use of JIT + runtime profile-driven optimization.
In my opinion, JIT + runtime profile-driven optimization can be less effective on a VLIW CPU than OoO execution, in the sense that the decisions of a JIT (which is mostly implemented in software, with some hardware assistance) are too coarse-grained: it takes at least N*100 CPU cycles for a JIT engine to examine and rewrite code in reaction to a real-time change, while out-of-order (OoO) execution, implemented mostly directly in hardware, can respond to a real-time change in just a few CPU cycles. The difference in response time between a JIT and OoO is 100-fold or more. Secondly, a JIT cannot (by definition) have access to a large part of the telemetry accessible to an OoO engine - much of the real-time telemetry that matters for optimization is valid for just a few CPU cycles.

I am not against JIT, and not against VLIW in-order CPUs without OoO - I am just saying that there is a fundamental dichotomy between where JIT compilation is useful and where OoO execution is useful: JIT compilation is incapable of fully replacing OoO execution. JIT and OoO can coexist (obvious example: Java VM running on an i386 CPU) and their individual speedups are additive, but there is a very large performance gap between a VLIW in-order CPU with an optional JIT (e.g., the Tachyum V1 CPU from the Hot Chips 2018 PDF running a Java application) and an OoO CPU with an optional JIT (e.g., the current Tachyum "V1+" architecture from 2022 running a Java application).
 
In my opinion, JIT + runtime profile-driven optimization can be less effective on a VLIW CPU than OoO execution, in the sense that the decisions of a JIT (which is mostly implemented in software, with some hardware assistance) are too coarse-grained: it takes at least N*100 CPU cycles for a JIT engine to examine and rewrite code in reaction to a real-time change, while out-of-order (OoO) execution, implemented mostly directly in hardware, can respond to a real-time change in just a few CPU cycles.
They solve (mostly) different problems, IMO.

Secondly, a JIT cannot (by definition) have access to a large part of the telemetry accessible to an OoO engine - much of the real-time telemetry that matters for optimization is valid for just a few CPU cycles.
For the most part, you don't need it. Merely having branch probabilities is enough for PGO to deliver solid benefits. That can drive inlining, hoisting, auto-vectorization, and a range of other optimizations.
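A toy illustration of that (assuming GCC; the flags in the comments are GCC's -fprofile-generate / -fprofile-use workflow, quoted from memory): the training run records which side of the branch is hot, and the rebuild uses those counts to drive inlining and code layout.

```c
#include <stddef.h>
#include <stdint.h>

static int64_t fast_path(int64_t x)        { return x + 1; }
static int64_t slow_path(int64_t x, int s) { return x * s + 42; }

/* With profile data, the compiler learns that 'scale == 1' is taken nearly
 * all the time here, so it can inline fast_path() into the loop, push the
 * cold call out of line, and vectorize the hot path.                      */
static int64_t accumulate(const int64_t *v, size_t n, int scale)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (scale == 1)                    /* hot, per the profile  */
            sum += fast_path(v[i]);
        else                               /* cold, rarely executed */
            sum += slow_path(v[i], scale);
    }
    return sum;
}

int main(void)
{
    int64_t v[1024];
    for (size_t i = 0; i < 1024; i++)
        v[i] = (int64_t)i;
    /* GCC workflow (from memory):
     *   gcc -O2 -fprofile-generate pgo_demo.c -o demo && ./demo   # record branch counts
     *   gcc -O2 -fprofile-use pgo_demo.c -o demo                  # rebuild using them   */
    return (int)(accumulate(v, 1024, 1) & 0xff);
}
```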

I am not against JIT, and not against VLIW in-order CPUs without OoO - I am just saying that there is a fundamental dichotomy between where JIT compilation is useful and where OoO execution is useful:
I realize I walked into this by talking about "a variation on VLIW", but what I was hoping for was something more like EPIC, where you have a compiler-generated dependency graph that's runtime-scheduled. That lets you do OoO without having to search for data dependencies. I think if Intel's IA64 had stuck around long enough, it eventually would've gotten there.

Another example to consider, involving JIT and PGO, is GPU shaders. Currently, they do both. Last I heard, GPUs are still in-order, but that could obviously change.

JIT and OoO can coexist (obvious example: Java VM running on an i386 CPU) and their individual speedups are additive,
DEC Alpha allegedly had JIT translation of x86 -> native ISA + PGO. I'm just not sure if the PGO part would happen dynamically, or only the next time you run the program.
 
For the most part, you don't need it. Merely having branch probabilities is enough for PGO to deliver solid benefits. That can drive inlining, hoisting, auto-vectorization, and a range of other optimizations.
Note 1: I think there is some misunderstanding here. The issue isn't whether JIT or PGO needs real-time telemetry to optimize code - the issue is that it is mathematically/physically impossible for JIT and PGO to access such information, given the timescales on which JIT and PGO operate. It is meaningless to speak of "needing" something that is by definition inaccessible.

Note 2: PGO delivers benefits almost irrespective of whether the CPU is VLIW, OoO, IA64, or some other architecture X. By 'almost irrespective' I mean that PGO cannot be used to discern whether VLIW, OoO, IA64, or X is the right choice in terms of final application performance.
 
Yes, but the incentives to use PGO are greater, for VLIW. That's certainly where I first encountered the technique, long before GCC supported it or LLVM even existed.
Thinking about it, I am not sure whether the incentives to use PGO are greater for VLIW than for a non-VLIW architecture.

If VLIW means pure VLIW - for example, the encoding of every instruction word is {INT1 or NOP, INT2 or NOP, FLOAT1 or NOP, FLOAT2 or NOP, BRANCH or NOP}, where INT1 and INT2 have to be proven independent by the compiler (likewise FLOAT1 and FLOAT2) - how can PGO be more helpful than it is on ARM/RISC-V/x86/etc.?

The aspect of VLIW that I certainly like is the fact that the INT1 and INT2 opcodes have to be proven independent by the compiler. But VLIW is not a requirement for this - it would be possible to add compiler-generated instruction dependency bits to x86, for example, if x86 evolution weren't bound by backward-compatibility constraints.
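Purely to illustrate what I mean (field names and layout are made up, not Tachyum's or anyone's real encoding), the two ideas above might look something like this:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical "pure VLIW" bundle, as described above: five fixed slots,
 * each holding either an operation of its class or an explicit NOP.
 * The compiler only packs operations into the same bundle if it can prove
 * they are independent, so the core can issue all five slots in one cycle
 * with no dependency checking at all.                                     */
enum { SLOT_NOP = 0 };             /* opcode 0 reserved for NOP in every slot */

typedef struct {
    uint32_t int_op0;              /* INT1   or NOP                          */
    uint32_t int_op1;              /* INT2   or NOP (independent of int_op0) */
    uint32_t fp_op0;               /* FLOAT1 or NOP                          */
    uint32_t fp_op1;               /* FLOAT2 or NOP (independent of fp_op0)  */
    uint32_t branch_op;            /* BRANCH or NOP                          */
} vliw_bundle;

/* The non-VLIW variant mentioned above: keep a conventional instruction
 * stream, but let the compiler attach a bit saying "this instruction does
 * not depend on its predecessor", so the decoder can multi-issue without
 * comparing register operands at run time.                                */
typedef struct {
    uint32_t opcode;
    bool     independent_of_prev;  /* compiler-proven; purely a hint to issue logic */
} tagged_insn;
```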

I only agree that the incentives to use PGO are greater for VLIW in the sense that PGO is required for VLIW to catch up, in terms of performance, with competing non-VLIW instruction set architectures such as ARM/RISC-V/x86.
 
The aspect of VLIW that I certainly like is the fact that the INT1 and INT2 opcodes have to be proven independent by the compiler. But VLIW is not a requirement for this - it would be possible to add compiler-generated instruction dependency bits to x86, for example, if x86 evolution weren't bound by backward-compatibility constraints.
That's what IA64 did. Instructions are bundled into triplets and each triplet has explicit dependency tags. This enables the CPU to do runtime scheduling (out-of-order, even). They couldn't use classical VLIW, since the instruction stream embeds so many details of the microarchitecture - not just issue ports, but also latencies, register bank conflicts, etc. So, you could hardly touch those aspects of a CPU, without breaking backward compatibility.
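Rough toy model of why that helps the front end (plain C, names are mine, nothing like real Itanium decode logic): with compiler-set stop tags, finding the next issue group is a linear scan; without them, the hardware has to rediscover the same boundary by comparing register operands.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t dst;         /* destination register                     */
    uint8_t src0, src1;  /* source registers                         */
    bool    stop;        /* compiler-set "group ends after this" tag */
} insn;

/* With explicit stop tags, forming an issue group is just a scan for the
 * tag: everything from 'begin' up to and including the tagged instruction
 * is guaranteed independent and can be dispatched together.              */
static size_t group_end_tagged(const insn *code, size_t begin, size_t n)
{
    size_t i = begin;
    while (i < n && !code[i].stop)
        i++;
    return (i < n) ? i + 1 : n;    /* exclusive end of the group */
}

/* Without tags, a conventional front end discovers the same boundary by
 * comparing registers: each new instruction against every older one in
 * the window (RAW/WAW only here; real hardware does this in parallel).   */
static size_t group_end_discovered(const insn *code, size_t begin, size_t n)
{
    for (size_t i = begin + 1; i < n; i++)
        for (size_t older = begin; older < i; older++)
            if (code[i].src0 == code[older].dst ||
                code[i].src1 == code[older].dst ||
                code[i].dst  == code[older].dst)
                return i;          /* dependency found: group ends before i */
    return n;
}
```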

IA64 could still be used by computer engineering researchers on FPGAs and simulators, but it's been out of use for long enough that you'd have to get a snapshot of a Linux distro for it, from like 10 years ago.
 
That's what IA64 did. Instructions are bundled into triplets and each triplet has explicit dependency tags. This enables the CPU to do runtime scheduling (out-of-order, even). They couldn't use classical VLIW, since the instruction stream embeds so many details of the microarchitecture - not just issue ports, but also latencies, register bank conflicts, etc. So, you could hardly touch those aspects of a CPU, without breaking backward compatibility.
IA64 could still be used by computer engineering researchers on FPGAs and simulators, but it's been out of use for long enough that you'd have to get a snapshot of a Linux distro for it, from like 10 years ago.
I strongly disagree. Stops in IA64 (see, for example, section "9.3 Instruction Dependencies" in http://refspecs.linux-foundation.org/IA64-softdevman-vol1.pdf) are actually a very bad (pointless) idea, for at least the following reasons:

- IA64 stops are too simplistic to be useful. For real-world code performance, only a more sophisticated instruction-dependency-tracking mechanism makes sense.

- An OoO CPU can easily auto-detect the equivalent of IA64 stops in a sequential ARM/RISC-V/x86/etc. instruction stream, which makes IA64 stops (= bundles) completely pointless from a performance viewpoint.

When I wrote previously in this discussion that instruction dependency bits could (in theory) be added to x86, I did not mean IA64. IA64 is a counterexample of good design.
 
I strongly disagree. Stops in IA64 (see, for example, section "9.3 Instruction Dependencies" in http://refspecs.linux-foundation.org/IA64-softdevman-vol1.pdf) are actually a very bad (pointless) idea,
Okay, thanks for sharing that. My knowledge of IA64 is comparatively superficial. It seemed to me like it had some good ideas, but I never delved into such detail.

It seems to me like optimal ISA design is a delicate balance between minimizing runtime computation and maximizing instruction density. I like the idea of explicit dependency information, but it's somewhat redundant with what's already contained in the instruction stream and therefore reduces density, for the sake of something that might not be that expensive to compute on-the-fly, for all I know.
 
Okay, thanks for sharing that. My knowledge of IA64 is comparatively superficial. It seemed to me like it had some good ideas, but I never delved into such detail.

It seems to me like optimal ISA design is a delicate balance between minimizing runtime computation and maximizing instruction density. I like the idea of explicit dependency information, but it's somewhat redundant with what's already contained in the instruction stream and therefore reduces density, for the sake of something that might not be that expensive to compute on-the-fly, for all I know.
Just a clarification: IA64 is OK if its only competition is something like the Pentium/Pentium MMX (released in 1993/1996), because the IA64 ISA alone is the reason why IA64 could, in theory, outperform the Pentium MMX by something like 25% at the same clock frequency. However, IA64 is incapable of outperforming more advanced x86 OoO designs like the AMD K6 and Intel Pentium III, which are able to do at run time what IA64 does at compile time.
 