News Startup claims it can boost any processor's performance by 100X — Flow Computing introduces its 'CPU 2.0' architecture

Admin · Jun 12, 2024

Flow Computing, a spinout from Finland’s acclaimed VTT Technical Research Center, says its Parallel Processing Unit (PPU) can enable '100X Improved performance for any CPU architecture.'

peachpuff · Jun 12, 2024

Intel: buy'em out boys!

rluker5 · Jun 12, 2024

Sounds suspiciously like an NPU.

Notton · Jun 12, 2024

I hope you guys get one to review before they fold or get bought out and never see the light of day again.

Findecanor · Jun 12, 2024

I still have a few papers to read, but from my understanding so far, I think this approach is basically about taking the SIMT technology that you'd typically find in a GPU and adapting it to run general-purpose code that you'd otherwise only run on a CPU.

An NPU on the other hand, can only do one thing: matrix multiplication, and often does it at low precision.

Deleted member 2731765 · Jun 12, 2024

Tom's article seems to be short on details. I got some insight.

Flow claims it can 100x any CPU's power with its companion chip and some elbow grease | TechCrunch

A Finnish startup called Flow Computing is making one of the wildest claims ever heard in silicon engineering: by adding its proprietary companion chip,

techcrunch.com

CPUs have gotten very fast, but even with nanosecond-level responsiveness, there’s a tremendous amount of waste in how instructions are carried out simply because of the basic limitation that one task needs to finish before the next one starts.

What Flow claims to have done is remove this limitation, turning the CPU from a one-lane street into a multi-lane highway.

The CPU is still limited to doing one task at a time, but Flow’s PPU, as they call it, essentially performs nanosecond-scale traffic management on-die to move tasks into and out of the processor faster than has previously been possible.

Think of the CPU as a chef working in a kitchen.

The chef can only work so fast, but what if that person had a superhuman assistant swapping knives and tools in and out of the chef’s hands, clearing the prepared food and putting in new ingredients, removing all tasks that aren’t actual chef stuff?

The chef still only has two hands, but now the chef can work 10 times as fast. It’s not a perfect analogy, but it gives you an idea of what’s happening here, at least according to Flow’s internal tests and demos with the industry (and they are talking with everyone).

The PPU doesn’t increase the clock frequency or push the system in other ways that would lead to extra heat or power; in other words, the chef is not being asked to chop twice as fast. It just more efficiently uses the CPU cycles that are already taking place.

Flow’s big achievement, in other words, isn’t high-speed traffic management, but rather doing it without having to modify any code on any CPU or architecture that it has tested. It sounds kind of unhinged to say that arbitrary code can be executed twice as fast on any chip with no modification beyond integrating the PPU with the die.

Therein lies the primary challenge to Flow’s success as a business: Unlike a software product, Flow’s tech needs to be included at the chip-design level, meaning it doesn’t work retroactively, and the first chip with a PPU would necessarily be quite a ways down the road.

Chart showing improvements in an FPGA PPU-enhanced chip versus unmodified Intel chips. Increasing the number of PPU cores continually improves performance.

TerryLaze · Jun 12, 2024

peachpuff said:
Intel: buy'em out boys!

Intel already failed at trying to do this some 10 years ago, this is just intel's xeon phi / knights landing but without intel's backing.

Choiman1559 · Jun 12, 2024

There are already products on the market that are very specific to OOoE technology, such as the POWER architecture, which produces similar results to this chip. (Although not x86 compatible and not for consumer use!)

Alvar "Miles" Udell · Jun 12, 2024

It sounds good in theory, but it does sound suspiciously like the NPUs that Qualcomm, AMD, and Intel have themselves, so its impact may be limited to other companies, like Samsung, if they don't license Qualcomm's.

Murissokah · Jun 12, 2024

Metal Messiah. said:
What Flow claims to have done is remove this limitation, turning the CPU from a one-lane street into a multi-lane highway.

Uberthreading? Seems like the kind of thing that is very powerful for some applications, but useless for others.

bit_user · Jun 12, 2024

My guess is that it's basically like a tightly-integrated GPU. Like, if the compute cores of a GPU were integrated almost as tightly into a CPU as their FPU.

Metal Messiah. said:
CPUs have gotten very fast, but even with nanosecond-level responsiveness, there’s a tremendous amount of waste in how instructions are carried out simply because of the basic limitation that one task needs to finish before the next one starts.

There are a couple of things this could mean. They could be talking about the way that CPU cores can operate on only 1 or 2 threads at a time (although I think POWER has done up to 8-way SMT), but the main thing I think they're probably talking about is how CPUs (especially x86) have tight memory synchronization requirements. GPUs famously have very weak memory models. Those tight requirements can limit concurrency in CPUs.

Metal Messiah. said:
Chart showing improvements in an FPGA PPU-enhanced chip versus unmodified Intel chips. Increasing the number of PPU cores continually improves performance.

This is super sketchy, if they don't even tell us which models of CPUs they're talking about or show us what code they ran, compiler + options, etc. I find it pretty funny they used a core i7 that was 10x as fast as a Xeon W. Did they compare the newest Raptor Lake i7 against the slowest and worst Skylake Xeon W??

I'm guessing the code they used on the x86 CPUs was probably brain dead and got compiled to use all scalar operations.

bit_user · Jun 12, 2024

peachpuff said:
Intel: buy'em out boys!

Indeed. Like they did with Soft Machines:

https://www.tomshardware.com/news/soft-machines-virtual-cores-visc,31127.html

TerryLaze said:
Intel already failed at trying to do this some 10 years ago, this is just intel's xeon phi / knights landing

LOL, no. Xeon Phi was very simply just a bunch of Atom-derived CPU cores with SMT and big AVX-512 units bolted on.

There was really nothing more to it than that. It was x86 in every sense, which they used as a selling-point. It also had all the limitations of x86, which is why it failed.

bit_user · Jun 12, 2024

Findecanor said:
An NPU on the other hand, can only do one thing: matrix multiplication, and often does it at low precision.

That's not true. What they do best is matrix multiplies, but that's not all they do!

Intel's NPU: https://chipsandcheese.com/2024/04/22/intel-meteor-lakes-npu/
Qualcomm's NPU: https://chipsandcheese.com/2023/10/04/qualcomms-hexagon-dsp-and-now-npu/
AMD's XDNA (formerly Xilinx Versal AI cores): https://chipsandcheese.com/2023/09/16/hot-chips-2023-amds-phoenix-soc/ (search for XDNA)

ekio · Jun 12, 2024

This claim looks too good to be true…
That would mean this company is worth 1 trillion overnight.

Also, if this allows x86 to continue, that is a shame.

TerryLaze · Jun 12, 2024

bit_user said:
LOL, no. Xeon Phi was very simply just a bunch of Atom-derived CPU cores with SMT and big AVX-512 units bolted on.

There was really nothing more to it than that. It was x86 in every sense, which they used as a selling-point. It also had all the limitations of x86, which is why it failed.

And this will be the exact same thing only with arm cores, it's still the same thing, a bunch of cores on an add in card.
Literally the only difference they show from xeon phi is that they will have shared cache/memory which means it will only work on special mobos and special cpus made for those mobos....so even worse than phi.

bit_user · Jun 12, 2024

TerryLaze said:
And this will be the exact same thing only with arm cores, it's still the same thing, a bunch of cores on an add in card.

It sure doesn't sound that way to me. Even the graphic you included in your post rules out the possibility of it being physically outside of the CPU, because there's a direct arrow between the other CPU cores and the PPU.

blargh4 · Jun 12, 2024

Startup claims extraordinary things about non-existing product, news at 11

Wake me up when it breaks Amdahl's law
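For anyone who wants the numbers behind that remark, here is a minimal sketch of Amdahl's law in Python. The function name and the 100x figure plugged in are illustrative assumptions, not anything Flow has published: it simply shows how the overall speedup is capped by whatever fraction of the runtime stays serial.

```python
def amdahl_speedup(parallel_fraction: float, parallel_speedup: float) -> float:
    """Overall speedup when only `parallel_fraction` of the runtime is accelerated.

    Amdahl's law: S = 1 / ((1 - p) + p / s)
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if parallel_speedup <= 0.0:
        raise ValueError("parallel_speedup must be positive")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / parallel_speedup)


if __name__ == "__main__":
    # Hypothetical accelerator that makes the parallel part 100x faster.
    for p in (0.50, 0.90, 0.99, 0.999):
        print(f"{p:.3f} parallel -> {amdahl_speedup(p, 100.0):.1f}x overall")
```

Under these assumptions, a program that is 90% parallel tops out a little above 9x, and even at 99% parallel the overall gain is only about 50x, so a headline 100x needs almost no serial work at all.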

oofdragon · Jun 12, 2024

If it can really make processors run 100x faster, it will either be sold for the tech never to be used, or everyone in the know will suicide with 100 bullets in their back

Alvar "Miles" Udell · Jun 12, 2024

This is completely believable, and it has to do with the fact that Mark Tyson didn't bother to include any explanation, whereas The Virge, in their article yesterday, did, by including Flow Computing's FAQ, which states:

23. How much die space does adding a PPU require to achieve 100X performance over standard architectures? It depends on the system configuration. In case the number of processor cores is high, it is expected that several CPU cores could be substituted by the PPU. Then PPU uses the leftover die space without the need to add any extra silicon area. Our initial silicon area estimation model is based on legacy silicon technology parameters and public scaling factors. For the 64-core PPU that achieves 38X - 107X speedup in laboratory tests, the initial silicon area estimate is 21.7 mm^2 area in 3 nm silicon process. The silicon area estimate for a 256-core PPU achieving 148X - 421X speedup is 103.8 mm^2, respectively.

A 64-core PPU that achieves a 38-107X speedup (0.59-1.67x per core), and a 256-core PPU that achieves a 148-421X speedup (0.58-1.64x per core). We know an x86 CPU is horribly inefficient compared to an ARM design, but programs need to be specifically coded for them, or else they can perform worse than on an x86 CPU. It sounds like they're basically saying that the PPU will increase performance by brute force with existing code, yet much more efficiently with specifically coded programs.
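The per-core figures can be checked with a few lines of Python. This is only a sketch of the arithmetic: the speedup ranges and area estimates are copied from the FAQ excerpt quoted above, while the names `FAQ_CONFIGS`, `per_core_range`, and `area_per_core` are made up for illustration.

```python
# Figures quoted from item 23 of Flow Computing's FAQ, keyed by PPU core count.
FAQ_CONFIGS = {
    64: {"speedup": (38, 107), "area_mm2": 21.7},
    256: {"speedup": (148, 421), "area_mm2": 103.8},
}


def per_core_range(cores: int, low: float, high: float) -> tuple:
    """Claimed speedup range divided evenly across the PPU cores."""
    return low / cores, high / cores


def area_per_core(cores: int, area_mm2: float) -> float:
    """Estimated silicon area per PPU core, in mm^2."""
    return area_mm2 / cores


if __name__ == "__main__":
    for cores, cfg in FAQ_CONFIGS.items():
        low, high = per_core_range(cores, *cfg["speedup"])
        area = area_per_core(cores, cfg["area_mm2"])
        print(f"{cores:>3} cores: {low:.2f}x-{high:.2f}x per core, "
              f"{area:.3f} mm^2 per core")
```

Dividing the claimed speedup by the core count is a crude normalization, since it assumes the gain scales linearly with cores, but it is enough to see that both configurations land in roughly the same per-core range.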

This would have been great a few years ago, but with Qualcomm already having an NPU on the market with AMD and Intel soon to follow, what incentive do they have to ditch all their efforts and license Flow Computing's design? Even if it's half as performant as their claims are, they're all no doubt working on higher performance second generation products now even before Flow Computing even has a prototype fabbed, which they don't intend to do anyway. If anyone would be interested one would have to think it's Apple or China.

truerock · Jun 12, 2024

I have reviewed the literature at https://flow-computing.com/

This is my interpretation of what I think this is.

When I did my CPU circuit diagram as an undergrad computer science student, I did it with pencil on paper. It was a 2-dimensional thing.

What I think Flow Computing is doing is adding a third dimension to a CPU design. So, those of you who know how a CPU works can probably imagine how each clock-tick on the instruction register would not only perform one step - but, would be able to do multiple things per clock tick depending on the instruction being executed.
But, not only that - multiple instruction registers would simultaneously be doing the same thing, and all of the 3rd dimension steps would possibly combine with each other to output results more quickly. It would be like a neural network of instruction registers.

Anyway - any of you can go read the same documentation I read. I think I may have provided a simplification of what this is about.

TerryLaze · Jun 13, 2024

bit_user said:
It sure doesn't sound that way to me. Even the graphic you included in your post rules out the possibility of it being physically outside of the CPU, because there's a direct arrow between the other CPU cores and the PPU.

Does this look like something that will be inside a CPU?!
Because to me it looks like a GPU type of a deal.

TerryLaze said:
Intel already failed at trying to do this some 10 years ago, this is just intel's xeon phi / knights landing but without intel's backing.

TJ Hooker · Jun 13, 2024

TerryLaze said:
Does this look like something that will be inside a CPU?!
Because to me it looks like a GPU type of a deal.

No, they quite clearly envision it as being on-die (or at least on-package) with the CPU.

"The Parallel Processing Unit (PPU) is an IP block that integrates tightly with the CPU on the same silicon."

Flow Computing is the enabler of next generation SuperCPUs

Flow's Parallel Processing Unit (PPU) gives 100X CPU performance for demanding applications, such as locally-hosted AI and general-purpose parallel computing.

flow-computing.com

TJ Hooker · Jun 13, 2024

Metal Messiah. said:
CPUs have gotten very fast, but even with nanosecond-level responsiveness, there’s a tremendous amount of waste in how instructions are carried out simply because of the basic limitation that one task needs to finish before the next one starts.

What Flow claims to have done is remove this limitation, turning the CPU from a one-lane street into a multi-lane highway.

The CPU is still limited to doing one task at a time, but Flow’s PPU, as they call it, essentially performs nanosecond-scale traffic management on-die to move tasks into and out of the processor faster than has previously been possible.

[...]

Flow’s big achievement, in other words, isn’t high-speed traffic management, but rather doing it without having to modify any code on any CPU or architecture that it has tested. It sounds kind of unhinged to say that arbitrary code can be executed twice as fast on any chip with no modification beyond integrating the PPU with the die.

This analysis by TechCrunch is iffy. CPUs have been taking advantage of instruction-level parallelism (ILP) for 20+ years, with multiple execution units in each core, pipelining, out-of-order execution, branch prediction/speculative execution, etc. Describing a modern CPU (core) as being strictly serial isn't correct.

If Flow Computing has found a way to extract ILP better/faster/with less power/die space/whatever, great. But acting like they invented the idea is a bit rich.

Edit: And the existing techniques I mention above for extracting ILP don't require code to be specifically written to support them. So this quote from the Flow CEO, later in the TechCrunch article, seems suspect as well: "You can already do parallelization, but it breaks legacy code, and then it's useless."
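To illustrate what hardware-extracted ILP means, here is a toy Python model, not a description of any real core. It assumes every instruction takes one cycle and that the core can issue up to `issue_width` instructions per cycle once their inputs are ready; the six-instruction program and all names are hypothetical.

```python
def schedule_cycles(deps: dict, issue_width: int) -> int:
    """Cycles needed to run a program under greedy in-order-of-readiness issue.

    `deps` maps each instruction name to the list of instructions it depends on.
    Each cycle, up to `issue_width` instructions whose inputs have all completed
    in earlier cycles are issued; every instruction takes exactly one cycle.
    """
    if issue_width < 1:
        raise ValueError("issue_width must be at least 1")
    done = set()
    pending = list(deps)  # program order
    cycles = 0
    while pending:
        ready = [i for i in pending if all(d in done for d in deps[i])]
        ready = ready[:issue_width]
        if not ready:
            raise ValueError("dependency cycle or unknown instruction")
        done.update(ready)
        pending = [i for i in pending if i not in done]
        cycles += 1
    return cycles


# Hypothetical program: f = (load x + load y) + (load z * 2)
PROGRAM = {
    "a": [],          # a = load x
    "b": [],          # b = load y
    "c": ["a", "b"],  # c = a + b
    "d": [],          # d = load z
    "e": ["d"],       # e = d * 2
    "f": ["c", "e"],  # f = c + e
}

if __name__ == "__main__":
    for width in (1, 2, 4):
        print(f"issue width {width}: {schedule_cycles(PROGRAM, width)} cycles")
```

The same unmodified program finishes in fewer cycles as the issue width grows, which is the sense in which existing cores already find parallelism without any changes to the code. A fully dependent chain, on the other hand, takes the same number of cycles at any width.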

hotaru251 · Jun 13, 2024

this is the key point....it needs optimization for anywhere near that 100x.

While I support any advancement that benefits end users, optimization is generally an afterthought to many people who make stuff.
"As long as it works" is the motto.

bit_user · Jun 13, 2024

hotaru251 said:
this is the key point....it needs optimization for anywhere near that 100x.

What's funny to me is that probably everyone in this thread is aware & accepts that GPUs are at least an order of magnitude faster than CPUs, for computational tasks they're good at. I think that's pretty non-controversial. I think we're also aware that GPUs need programs to be written specifically to harness their specialized, parallel nature (not to mention their differing ISA).

So, I don't find the idea they're spinning completely implausible. I think what's at the core of their idea is a tighter level of integration than iGPUs normally have, which is basically a memory-level interface (communication can happen via the cache hierarchy, but I think it still happens primarily via memory-mapped reads & writes).

If you can launch a small sequence of operations on some data and get the results back within dozens of nanoseconds, rather than microseconds, that could meaningfully impact how iGPUs can factor into computation. However, if that's really the main thing they're doing differently, it seems like all of the biggest players in the industry would be well-equipped to implement it, if they wanted. They wouldn't need to take this entire PPU idea.
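A back-of-the-envelope model of why launch latency matters so much, as a Python sketch. The 10x accelerator speedup and the 50 ns and 5 µs latencies are assumptions picked to stand in for "dozens of nanoseconds" versus "microseconds", not measured values, and the function names are made up.

```python
def offload_time_ns(cpu_time_ns: float, accel_speedup: float,
                    launch_latency_ns: float) -> float:
    """Total time to run a task on the accelerator, including launch overhead."""
    return launch_latency_ns + cpu_time_ns / accel_speedup


def break_even_cpu_time_ns(accel_speedup: float, launch_latency_ns: float) -> float:
    """Smallest CPU-side task length at which offloading stops being a loss.

    Offloading wins when latency + t / s < t, i.e. t > latency * s / (s - 1).
    """
    if accel_speedup <= 1.0:
        raise ValueError("offloading never pays off unless accel_speedup > 1")
    return launch_latency_ns * accel_speedup / (accel_speedup - 1.0)


if __name__ == "__main__":
    speedup = 10.0  # assumed accelerator speedup on the offloaded work
    for latency_ns in (50.0, 5000.0):
        t = break_even_cpu_time_ns(speedup, latency_ns)
        print(f"launch latency {latency_ns:>6.0f} ns -> "
              f"worth offloading tasks longer than {t:.0f} ns")
```

With these assumed numbers, cutting the launch latency by 100x also shrinks the smallest task worth offloading by 100x, which is why tight integration could open up much finer-grained work than a conventional iGPU dispatch.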

I guess we'll have to wait and see. Maybe there are a few more clever things they're doing, but the key question is whether those are just icing on the cake or really fundamental to the value proposition.
