News Startup claims it can boost any processor's performance by 100X — Flow Computing introduces its 'CPU 2.0' architecture

Findecanor

Distinguished
Apr 7, 2015
I still have a few papers to read, but from my understanding so far, I think this approach is basically about taking SIMT technology that you'd typically find in a GPU and adapting it to run general-purpose code that you'd otherwise only run on a CPU.

An NPU on the other hand, can only do one thing: matrix multiplication, and often does it at low precision.
 

Deleted member 2731765

Guest
Tom's article seems to be short on details. I got some insight.


CPUs have gotten very fast, but even with nanosecond-level responsiveness, there’s a tremendous amount of waste in how instructions are carried out simply because of the basic limitation that one task needs to finish before the next one starts.

What Flow claims to have done is remove this limitation, turning the CPU from a one-lane street into a multi-lane highway.

The CPU is still limited to doing one task at a time, but Flow’s PPU, as they call it, essentially performs nanosecond-scale traffic management on-die to move tasks into and out of the processor faster than has previously been possible.

Think of the CPU as a chef working in a kitchen.

The chef can only work so fast, but what if that person had a superhuman assistant swapping knives and tools in and out of the chef’s hands, clearing the prepared food and putting in new ingredients, removing all tasks that aren’t actual chef stuff?

The chef still only has two hands, but now the chef can work 10 times as fast. It’s not a perfect analogy, but it gives you an idea of what’s happening here, at least according to Flow’s internal tests and demos with the industry (and they are talking with everyone).

The PPU doesn’t increase the clock frequency or push the system in other ways that would lead to extra heat or power; in other words, the chef is not being asked to chop twice as fast. It just more efficiently uses the CPU cycles that are already taking place.

Flow’s big achievement, in other words, isn’t high-speed traffic management, but rather doing it without having to modify any code on any CPU or architecture that it has tested. It sounds kind of unhinged to say that arbitrary code can be executed twice as fast on any chip with no modification beyond integrating the PPU with the die.

Therein lies the primary challenge to Flow’s success as a business: Unlike a software product, Flow’s tech needs to be included at the chip-design level, meaning it doesn’t work retroactively, and the first chip with a PPU would necessarily be quite a ways down the road.


Chart showing improvements in an FPGA PPU-enhanced chip versus unmodified Intel chips. Increasing the number of PPU cores continually improves performance.

flow-chart.png
 
Jun 12, 2024
There are already products on the market that are very specific to OOoE technology, such as the POWER architecture, which produces similar results to this chip. (Although not x86 compatible and not for consumer use!)
 
It sounds good in theory, but it does sound suspiciously like the NPU that Qualcomm, AMD, and Intel have themselves, so its impact may be limited to other companies, like Samsung if they don't license Qualcomm's.
 

bit_user

Titan
Ambassador
My guess is that it's basically like a tightly-integrated GPU. Like, if the compute cores of a GPU were integrated almost as tightly into a CPU as their FPU.

CPUs have gotten very fast, but even with nanosecond-level responsiveness, there’s a tremendous amount of waste in how instructions are carried out simply because of the basic limitation that one task needs to finish before the next one starts.
There are a couple of things this could mean. They could be talking about the way that CPU cores can operate on only 1 or 2 threads at a time (although I think POWER has done up to 8-way SMT), but the main thing I think they're probably talking about is how CPUs (especially x86) have tight memory synchronization requirements, which can limit concurrency. GPUs, by contrast, famously have very weak memory models.

Chart showing improvements in an FPGA PPU-enhanced chip versus unmodified Intel chips. Increasing the number of PPU cores continually improves performance.

flow-chart.png
This is super sketchy if they don't even tell us which models of CPUs they're talking about or show us what code they ran, compiler + options, etc. I find it pretty funny they used a Core i7 that was 10x as fast as a Xeon W. Did they compare the newest Raptor Lake i7 against the slowest and worst Skylake Xeon W??

I'm guessing the code they used on the x86 CPUs was probably brain dead and got compiled to use all scalar operations.
 

bit_user

Titan
Ambassador
Intel: buy'em out boys!
Indeed. Like they did with Soft Machines:

Intel already failed at trying to do this some 10 years ago; this is just Intel's Xeon Phi / Knights Landing
LOL, no. Xeon Phi was very simply just a bunch of Atom-derived CPU cores with SMT and big AVX-512 units bolted on.

There was really nothing more to it than that. It was x86 in every sense, which they used as a selling-point. It also had all the limitations of x86, which is why it failed.
 

bit_user

Titan
Ambassador
An NPU on the other hand, can only do one thing: matrix multiplication, and often does it at low precision.
That's not true. What they do best is matrix multiplies, but that's not all they do!
 
LOL, no. Xeon Phi was very simply just a bunch of Atom-derived CPU cores with SMT and big AVX-512 units bolted on.

There was really nothing more to it than that. It was x86 in every sense, which they used as a selling-point. It also had all the limitations of x86, which is why it failed.
And this will be the exact same thing, only with ARM cores: still a bunch of cores on an add-in card.
Literally the only difference they show from Xeon Phi is that they will have shared cache/memory, which means it will only work on special mobos and special CPUs made for those mobos... so even worse than Phi.
94h4onmwrnBQ4PJn79gNnU-1200-80.jpg.webp
 

bit_user

Titan
Ambassador
And this will be the exact same thing, only with ARM cores: still a bunch of cores on an add-in card.
It sure doesn't sound that way to me. Even the graphic you included in your post rules out the possibility of it being physically outside of the CPU, because there's a direct arrow between the other CPU cores and the PPU.

94h4onmwrnBQ4PJn79gNnU-1200-80.jpg.webp

 

oofdragon

Honorable
Oct 14, 2017
If it can really make processors run 100x faster, it will either be bought up so the tech is never used, or everyone in the know will "commit suicide" with 100 bullets in their back.
 
This is completely believable, and it has to do with the fact that Mark Tyson didn't bother to include any explanation, whereas The Virge, in their article yesterday, did, by including Flow Computing's FAQ, which states:

23. How much die space does adding a PPU require to achieve 100X performance over standard architectures? It depends on the system configuration. In case the number of processor cores is high, it is expected that several CPU cores could be substituted by the PPU. Then PPU uses the leftover die space without the need to add any extra silicon area. Our initial silicon area estimation model is based on legacy silicon technology parameters and public scaling factors. For the 64-core PPU that achieves 38X - 107X speedup in laboratory tests, the initial silicon area estimate is 21.7 mm^2 area in 3 nm silicon process. The silicon area estimate for a 256-core PPU achieving 148X - 421X speedup is 103.8 mm^2, respectively.

That is a 64-core PPU achieving a 38-107X speedup (0.59-1.67x per core) and a 256-core PPU achieving a 148-421X speedup (0.58-1.64x per core). We know an x86 CPU is horribly inefficient compared to an ARM design, but programs need to be coded specifically for ARM or they can perform worse than on an x86 CPU. It sounds like they're basically saying the PPU will increase performance by brute force with existing code, and much more efficiently with specifically coded programs.
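For anyone who wants to check the per-core arithmetic, here is a minimal Python sketch. It only uses the speedup and area figures quoted from the FAQ above; the names `configs`, `per_core_speedup`, and `area_per_core_mm2` are made up for this example, and dividing by core count is just the same rough normalization used in this post, not a claim about how Flow measures anything.

```python
# Figures quoted from Flow Computing's FAQ (item 23) in the post above.
configs = {
    64: {"speedup": (38, 107), "area_mm2": 21.7},
    256: {"speedup": (148, 421), "area_mm2": 103.8},
}


def per_core_speedup(cores):
    """Return the (low, high) claimed speedup divided by the PPU core count."""
    low, high = configs[cores]["speedup"]
    return round(low / cores, 2), round(high / cores, 2)


def area_per_core_mm2(cores):
    """Return the estimated silicon area per PPU core in mm^2."""
    return round(configs[cores]["area_mm2"] / cores, 3)


for cores in configs:
    print(cores, per_core_speedup(cores), area_per_core_mm2(cores))
```

Rounded to two decimals, the per-core speedup comes out at 0.59 to 1.67 for the 64-core PPU and 0.58 to 1.64 for the 256-core PPU, so the claimed scaling is close to linear in core count.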

This would have been great a few years ago, but with Qualcomm already having an NPU on the market and AMD and Intel soon to follow, what incentive do they have to ditch all their efforts and license Flow Computing's design? Even if it's only half as performant as claimed, they're all no doubt working on higher-performance second-generation products already, before Flow Computing even has a prototype fabbed, which it doesn't intend to do anyway. If anyone would be interested, one would have to think it's Apple or China.
 

truerock

Distinguished
Jul 28, 2006
I have reviewed the literature at https://flow-computing.com/

This is my interpretation of what I think this is.

When I did my CPU circuit diagram as an undergrad computer science student, I did it with pencil on paper. It was a 2-dimensional thing.

What I think Flow Computing is doing is adding a third dimension to a CPU design. So, those of you who know how a CPU works can probably imagine how each clock-tick on the instruction register would not only perform one step - but, would be able to do multiple things per clock tick depending on the instruction being executed.
But, not only that - multiple instruction registers would simultaneously be doing the same thing, and all of the 3rd dimension steps would possibly combine with each other to output results more quickly. It would be like a neural network of instruction registers.

Anyway - any of you can go read the same documentation I read. I think I may have provided a simplification of what this is about.
 
It sure doesn't sound that way to me. Even the graphic you included in your post rules out the possibility of it being physically outside of the CPU, because there's a direct arrow between the other CPU cores and the PPU.
94h4onmwrnBQ4PJn79gNnU-1200-80.jpg.webp
Does this look like something that will be inside a CPU?!
Because to me it looks like a GPU type of a deal.
Intel already failed at trying to do this some 10 years ago; this is just Intel's Xeon Phi / Knights Landing, but without Intel's backing.

xxYC787ry8tyo7mbBbMgRV-1200-80.jpg
 

TJ Hooker

Titan
Ambassador
Does this look like something that will be inside a CPU?!
Because to me it looks like a GPU type of a deal.
No, they quite clearly envision it as being on-die (or at least on-package) with the CPU.

"The Parallel Processing Unit (PPU) is an IP block that integrates tightly with the CPU on the same silicon."
 

TJ Hooker

Titan
Ambassador
CPUs have gotten very fast, but even with nanosecond-level responsiveness, there’s a tremendous amount of waste in how instructions are carried out simply because of the basic limitation that one task needs to finish before the next one starts.

What Flow claims to have done is remove this limitation, turning the CPU from a one-lane street into a multi-lane highway.

The CPU is still limited to doing one task at a time, but Flow’s PPU, as they call it, essentially performs nanosecond-scale traffic management on-die to move tasks into and out of the processor faster than has previously been possible.

[...]

Flow’s big achievement, in other words, isn’t high-speed traffic management, but rather doing it without having to modify any code on any CPU or architecture that it has tested. It sounds kind of unhinged to say that arbitrary code can be executed twice as fast on any chip with no modification beyond integrating the PPU with the die.
This analysis by TechCrunch is iffy. CPUs have been taking advantage of instruction-level parallelism (ILP) for 20+ years, with multiple execution units in each core, pipelining, out-of-order execution, branch prediction/speculative execution, etc. Describing a modern CPU (core) as being strictly serial isn't correct.

If Flow Computing has found a way to extract ILP better/faster/with less power/die space/whatever, great. But acting like they invented the idea is a bit rich.

Edit: And the existing techniques I mention above for extracting ILP don't require code to be specifically written to support them. So this quote from the Flow CEO, later in the TechCrunch article, seems suspect as well: "You can already do parallelization, but it breaks legacy code, and then it's useless."
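As a rough illustration of the ILP point, here is a toy scheduling model in Python. It is a sketch under loud assumptions: every operation takes one cycle, the core can issue up to `width` ready operations per cycle, and the names `cycles_needed`, `chain`, and `independent` are made up for this example. It does not model any real CPU or Flow's PPU; it only shows that extra issue width helps independent instructions and does nothing for a strict dependency chain, all without touching the "program".

```python
def cycles_needed(deps, width):
    """Count cycles to run every op in `deps` on a toy core.

    deps maps each op name to the ops it depends on (must be acyclic).
    Assumption: each op takes one cycle and up to `width` ready ops
    can issue in the same cycle.
    """
    done = set()
    remaining = set(deps)
    cycles = 0
    while remaining:
        # An op is ready once all of its inputs finished in an earlier cycle.
        ready = sorted(op for op in remaining if all(d in done for d in deps[op]))
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        issued = ready[:width]
        done.update(issued)
        remaining.difference_update(issued)
        cycles += 1
    return cycles


# Eight ops where each one needs the previous result (a strict chain).
chain = {i: ([] if i == 0 else [i - 1]) for i in range(8)}
# Eight ops with no dependencies between them.
independent = {i: [] for i in range(8)}

for width in (1, 2, 4):
    print(width, cycles_needed(chain, width), cycles_needed(independent, width))
```

In this toy model the chain stays at 8 cycles at every width, while the independent ops drop from 8 to 4 to 2 cycles as the width goes from 1 to 2 to 4.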
 

bit_user

Titan
Ambassador
this is the key point....it needs optimization for anywhere near that 100x.
What's funny to me is that probably everyone in this thread is aware & accepts that GPUs are at least an order of magnitude faster than CPUs, for computational tasks they're good at. I think that's pretty non-controversial. I think we're also aware that GPUs need programs to be written specifically to harness their specialized, parallel nature (not to mention their differing ISA).

So, I don't find the idea they're spinning completely implausible. I think what's at the core of their idea is a tighter level of integration than iGPUs normally have, which is basically a memory-level interface (communication can happen via the cache hierarchy, but I think it still happens primarily via memory-mapped reads & writes).

If you can launch a small sequence of operations on some data and get the results back within dozens of nanoseconds, rather than microseconds, that could meaningfully impact how iGPUs can factor into computation. However, if that's really the main thing they're doing differently, it seems like all of the biggest players in the industry would be well-equipped to implement it, if they wanted. They wouldn't need to take this entire PPU idea.
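A back-of-the-envelope way to see why launch latency matters: offloading only pays off when the work is large enough to hide the round trip. The Python sketch below assumes a very simple model (offload time = round-trip latency + CPU time / speedup). The function names and the 50 ns, 5 µs, and 10x figures are illustrative assumptions, not measurements of any iGPU or of Flow's PPU.

```python
def offload_time_ns(cpu_time_ns, latency_ns, speedup):
    """Time to run a task on the accelerator, including the round trip."""
    return latency_ns + cpu_time_ns / speedup


def break_even_ns(latency_ns, speedup):
    """Smallest CPU-side task time at which offloading stops losing.

    Solves cpu_time = latency + cpu_time / speedup for cpu_time.
    """
    if speedup <= 1:
        raise ValueError("offloading never pays off without a speedup above 1")
    return latency_ns * speedup / (speedup - 1)


ASSUMED_SPEEDUP = 10  # hypothetical accelerator advantage
for label, latency_ns in (("tight coupling", 50), ("memory-level interface", 5_000)):
    threshold = break_even_ns(latency_ns, ASSUMED_SPEEDUP)
    print(f"{label}: break-even at about {threshold:.0f} ns of CPU work")
```

Under these made-up numbers, the break-even task size scales directly with the latency, so cutting the round trip from microseconds to tens of nanoseconds makes much finer-grained pieces of work worth offloading.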

I guess we'll have to wait and see. Maybe there are a few more clever things they're doing, but the key question is whether those are just icing on the cake or really fundamental to the value proposition.
 