News Startup claims it can boost any processor's performance by 100X — Flow Computing introduces its 'CPU 2.0' architecture

I have reviewed the literature at https://flow-computing.com/

This is my interpretation of what I think this is.

When I did my CPU circuit diagram as an undergrad computer science student, I did it with pencil on paper. It was a 2-dimensional thing.

What I think Flow Computing is doing is adding a third dimension to a CPU design. So, those of you who know how a CPU works can probably imagine how each clock tick on the instruction register would not perform just one step, but would be able to do multiple things per clock tick, depending on the instruction being executed.
But not only that: multiple instruction registers would simultaneously be doing the same thing, and all of the third-dimension steps could possibly combine with each other to output results more quickly. It would be like a neural network of instruction registers.

Anyway - any of you can go read the same documentation I read. I think I may have provided a simplification of what this is about.

I think you may be out of the loop; you have basically explained modern CPUs. It has been a long time since CPU cores were only able to do one thing per clock cycle. Just look at AMD's recent keynote: they claim the new CPUs have a 16% average (35% maximum) increase in IPC, Instructions Per Cycle. They don't publish absolute IPC figures, but rough figures I have seen suggest that modern desktop CPUs (before this latest generation) may achieve around 50 IPC, so they can already do multiple things per clock tick.

Then there are multiple other things they do to increase performance further and utilise CPU resources better. Out-of-order execution: the CPU reorders instructions based on what resources they need, what other instructions it is trying to execute, and what data each instruction depends on. Speculative execution and branch prediction: when the CPU comes across a conditional, it preemptively picks the most likely outcome and starts executing down that path before the condition is even checked; once the condition is resolved, it either continues down that path if it guessed right, or it scraps everything it did and starts again on the correct path. SIMD, Single Instruction Multiple Data: one instruction is executed on a batch of data in parallel, sometimes called "vector processing". SMT, Simultaneous Multi-Threading: a core can execute multiple threads at once; desktop CPUs can generally execute two threads per core, although this doesn't give a 2x performance boost, since it works by letting the second thread use parts of the core that the first thread isn't using.
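To make the SIMD point concrete, here is a small C sketch (SSE intrinsics, purely illustrative and not tied to any particular CPU discussed here). The scalar loop performs one addition per element, while each _mm_add_ps adds four floats with a single instruction; in practice, compilers will often auto-vectorize the scalar loop anyway.

```c
#include <immintrin.h>  /* SSE intrinsics */

/* Scalar: one addition per loop iteration. */
void add_scalar(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* SIMD: four additions per _mm_add_ps instruction. */
void add_sse(const float *a, const float *b, float *out, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            /* load 4 floats  */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb)); /* 4 adds at once */
    }
    for (; i < n; i++)                              /* scalar tail    */
        out[i] = a[i] + b[i];
}
```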

So things definitely aren't as simple as a CPU only being able to execute one instruction per clock tick, and it hasn't been that way in general computing for a long time. So if the claimed advantage of this "PPU" is that it lets you execute multiple instructions per clock cycle, then that already exists and is already part of a lot of CPUs, as is parallel processing. A CPU you drew a diagram of at undergraduate level in university is very likely nothing like modern CPUs (especially x86 CPUs), and probably not much like CPUs of the time either, depending on how long ago it was.
 
Sounds suspiciously like an NPU.
Seems like it. It could just be a maths accelerator or coprocessor. CPUs are general purpose, and modern desktop CPUs are large and complex; adding a specialised bit of hardware to perform certain calculations faster will always get better performance and efficiency, which is why people use GPUs and are now starting to use NPUs.

Even look at GPUs; I will use the 4060 Ti for this example. It can achieve 22.06 TFLOPS at 16- and 32-bit floating point, but can achieve 384 TOPS using its tensor cores. That is a big difference from using a specialised processor. Looking at CPUs, most are on the order of a few hundred GFLOPS, so an NPU that can do 50 TOPS is a big improvement for certain tasks. Assuming a CPU can manage 500 GFLOPS and an NPU can manage 50 TOPS, that is already a 100x increase, and the NPU is a relatively small system with pretty low power draw.

Basically, adding specialised coprocessors or accelerators will always get substantially better performance than a general-purpose processor, even though CPUs now have features built in to make them better at certain things, like SIMD, because being general purpose inherently means a CPU can't be optimised for a specific task unless it becomes very, very large and inefficient.

I hope for their sake that this company isn't trying to claim they have invented parallel processing or that they have invented maths coprocessors.
 
I think you may be out of the loop; you have basically explained modern CPUs. It has been a long time since CPU cores were only able to do one thing per clock cycle. Just look at AMD's recent keynote: they claim the new CPUs have a 16% average (35% maximum) increase in IPC, Instructions Per Cycle. They don't publish absolute IPC figures, but rough figures I have seen suggest that modern desktop CPUs (before this latest generation) may achieve around 50 IPC, so they can already do multiple things per clock tick.
IPC always heavily depends on what the instructions are...
Yeah, Zen 4 is about the same level as Intel, other than the clocks, which is why Intel has released the same CPU three times as different generations and is releasing a new one now that Zen 5 is about to launch.
These are per core, so a full 16-core CPU will be far above 50 IPC, again depending on what you run; some things will have much higher or lower IPC than others.
Instruction sequence | Zen 4 IPC | Zen 3 IPC | Zen 2 IPC | Golden Cove IPC | Sunny Cove IPC
Dependent MOV r,r | 5.71 | 5.72 | 4.54 | 5.62 | 4.76
Independent MOV r,r | 5.73 | 5.7 | 4.55 | 5.68 | 4.77
Zero integer register using XOR r,r | 5.73 | 5.72 | 3.63 (ALU pipes still used) | 5.73 | 4.77
Zero integer register using MOV r, 0 | 3.77 (ALU pipes still used) | 3.81 | 3.64 (ALU pipes still used) | 5.64 | 3.81 (ALU pipes still used)
Zero integer register by subtracting it from itself | 5.71 | 5.7 | 3.64 (ALU pipes still used) | 5.73 | 4.77
Dependent MOV xmm,xmm | 5.73 | 5.00 | 4.01 (limited by allocation rate to FP/vector side?) | Not tested | 4.06 (elimination sometimes fails?)
Independent MOV xmm,xmm | 5.71 | 5.71 | 3.84 | Not tested | 4.77
Zero vector register using xorps | 4.00 | 4.00 | 4.00 | Not tested | 4.76
Zero vector register using subps | 5.73 | 5.71 | 3.51 | Not tested | 4.77
Dependent Add Immediate | 1.00 | 1.00 | 1.00 | 5.61 | 1.00
 
These are per core, so a full 16-core CPU will be far above 50 IPC, again depending on what you run; some things will have much higher or lower IPC than others.
I've normally seen IPC compared on a per-core and per-workload basis. It's weird to see it claimed as a summation across a CPU's parts, and on such a theoretical basis.

Typically, performance on some defined workload is used as a proxy for "instructions", which makes sense because CPUs can execute instructions of different classes at different rates (as the table you quoted shows, though the figures for floating point and vector instructions vary even more widely). There's no single IPC figure that would fully characterize a given CPU/core across all workloads.

If you look at how Intel, AMD, and ARM tend to characterize their CPUs on an IPC basis, it's across a diversity of workloads, and they just pick the median (midpoint) as the one to use for their claims. This still makes the figure susceptible to meddling, since they could bias their benchmark selection toward the ones more partial to the new CPU.

[AMD slide: claimed Zen 5 IPC uplift across a selection of workloads, averaging 16%]


Source: https://www.tomshardware.com/pc-com...-7-and-5-processors-with-a-16-ipc-improvement
 
I think you may be out of the loop; you have basically explained modern CPUs. [...]
Yes, I understand your point.

My analogy for what you are bringing up is having multiple instruction registers performing tasks in parallel.

My interpretation of what I read at https://flow-computing.com/ is one instruction register that does multiple things at each tick of the CPU clock (clock tick... NOT each CPU clock cycle).

I mean.... it is difficult to convey what Flow Computing has patented... but, I think my analogy is not horribly off base.

So, if you were creating a compiler for this, there are certain things that are done in a sequence of machine-code-level instructions that don't necessarily need to be done sequentially. For example, a common pattern would be to move memory location A to register A and memory location B to register B, then add RegA to RegB, put the result in RegC, and then move RegC to memory location C. A very simple implementation of Flow Computing would be for the first two memory moves to be performed simultaneously, in a single CPU clock tick, by one (single) instruction register.
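As a rough C sketch of that dependency structure (hypothetical, not anything Flow Computing has published): the two loads have no dependency on each other, so a sufficiently wide core, or a compiler packing a VLIW-style bundle, could in principle issue them in the same cycle, while the add has to wait for both loads and the store has to wait for the add.

```c
#include <stdint.h>

int64_t add_from_memory(const int64_t *a, const int64_t *b, int64_t *c)
{
    int64_t reg_a = *a;             /* load A: independent of load B */
    int64_t reg_b = *b;             /* load B: independent of load A */
    int64_t reg_c = reg_a + reg_b;  /* add: depends on both loads    */
    *c = reg_c;                     /* store: depends on the add     */
    return reg_c;
}
```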
 
So, if you were creating a compiler for this, there are certain things that are done in a sequence of machine-code-level instructions that don't necessarily need to be done sequentially. For example, a common pattern would be to move memory location A to register A and memory location B to register B, then add RegA to RegB, put the result in RegC, and then move RegC to memory location C. A very simple implementation of Flow Computing would be for the first two memory moves to be performed simultaneously, in a single CPU clock tick, by one (single) instruction register.
Doing this at compile-time is what VLIW CPUs do (or rather what optimizing compilers for them try to do). For out-of-order CPUs, the classical formulation is probably Tomasulo's Algorithm:

It's probably worth having a look at this page, too:
 
Yes, you bring up some important ideas.

I don't think Flow Computing is about register management or instruction queue management. Current CPUs already have that.

I think - and I obviously could be wrong - Flow Computing is about multiple things occurring at a single tick of a CPU clock inside the instruction register... which subsequently initiates other instruction registers within a 3-dimensional network of instruction registers.

Robert Tomasulo didn't have the idea in the 1960s that there would be a CPU with a matrix of thousands of instruction registers. Something like that wouldn't be possible or conceivable for decades.

 
I think - and I obviously could be wrong - Flow Computing is about multiple things occurring at a single tick of a CPU clock inside the instruction register...
You keep talking about an instruction register containing multiple operations. That's basically the definition of VLIW.

... which subsequently initiates other instruction registers within a 3-dimensional network of instruction registers.
I don't see anywhere on their website that they talk about a 3-dimensional anything.

Robert Tomasulo didn't have the idea in the 1960s that ...
To be clear, the only reason I mentioned Tomasulo is that you started musing about how to do instruction scheduling. I just wanted to point out the classical formulation of that, which happens to be the foundation of modern OoO processors.

...there would be a CPU with a matrix of thousands of instruction registers.
Where are you reading that? Quote the part where it talks about that, please.
 
I apologize. English is not my native language, but I really wanted to express my opinion. I will be more conservative in my judgments. In my opinion, this article and all the PR just have nothing to do with computer science, and their explanations lead the reader away from any real answers. I have two main points:

  1. Many algorithms are context-dependent. That is, at step n we need the calculations of step n-1. This thought alone kills the idea of parallel computations in some independent co-processor. No recompilation will help you get a 100x speedup. Let's take the first simple example that comes to mind: calculation of a factorial, f(x) = f(x-1)*x. Now try to make the compiler optimize this code so that it parallelizes well. It's very easy to break it into intervals in your head and calculate each in its own "branch", but making the machine do this automatically is not an easy task (see the sketch after this list). If I add an if statement here, then it will not be possible to come up with an optimization either. So why is this thing needed? How can it deliver 2x-100x in general calculations? It's just obviously a lie.
  2. Let's assume that we don't have problem number 1. We have a CPU that can do magic and turn any algorithm into a parallelizable one. What do they themselves say about this? "Latency of memory references is hidden by executing other threads while accessing the memory. No coherency problems since no caches are placed in front of the network. Scalability is provided via a high-bandwidth network-on-chip." What? How will the absence of L1, L2 cache help the program run faster? Are you seriously going to wait for data to come from DDR, and that is your explanation of why everything should run fast and synchronously? Are you aware of what a cache line is, why they were made, why each core has its own cache with instructions? They even made hyper-threading so that while one instruction pipeline is waiting for data from main memory, the core can keep its resources busy with other work by switching to another thread. And now you're telling me that you have a lot of other work and you will be busy with it. What do I care about other work, run the program faster!
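A minimal C sketch of the factorial argument, assuming nothing about Flow Computing's actual toolchain: the straightforward version is one long dependency chain, while the manually split version exposes two independent ranges that a human sees immediately but that a compiler has to prove safe and profitable on its own.

```c
#include <stdint.h>

/* Serial: each multiply depends on the previous result (step n needs step n-1). */
uint64_t factorial_serial(uint64_t n)
{
    uint64_t f = 1;
    for (uint64_t i = 2; i <= n; i++)
        f *= i;
    return f;
}

/* Manually split into two independent ranges that could run in parallel
 * and be combined at the end. (Both versions overflow uint64_t past 20!.) */
uint64_t factorial_split(uint64_t n)
{
    uint64_t mid = n / 2, lo = 1, hi = 1;
    for (uint64_t i = 2; i <= mid; i++)
        lo *= i;                    /* range 1: independent of range 2 */
    for (uint64_t i = mid + 1; i <= n; i++)
        hi *= i;                    /* range 2: independent of range 1 */
    return lo * hi;
}
```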
In general, I see no hint of novelty or a breakthrough solution here. If this is an NPU, then their explanation is simply incorrect at best; at worst, they are trying to hype and want to use industry optimism with the aim of simply deceiving investors.
 
So essentially a co-processor. Are they certain most, if not all, programs can properly make use of this, when most of them still struggle with multi-threading after all these years?
 
Many algorithms are context-dependent. That is, at step n we need the calculations of step n-1.
Yes, I like to use computation of the Fibonacci sequence as a way to benchmark out-of-order CPU cores on a low-IPC task. It's one of the least energy-intensive things they can do, because the algorithm allows for very little parallelism. Consequently, it can be a good way to find out what the all-core clock limits are, without first running into energy or thermal limits.
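For reference, here is the kind of dependent-chain loop being described, as a plain C sketch (the exact benchmark loop isn't quoted here, so treat this as an illustration): every addition needs the result of the previous one, so even a very wide out-of-order core gets roughly one useful addition per cycle, no matter how many execution ports sit idle.

```c
#include <stdint.h>

/* Fibonacci as a serial dependency chain; wraps modulo 2^64, which is
 * fine for a clock/IPC benchmark where only the work pattern matters. */
uint64_t fib_chain(uint64_t n)
{
    uint64_t a = 0, b = 1;
    for (uint64_t i = 0; i < n; i++) {
        uint64_t next = a + b;  /* depends on the previous iteration */
        a = b;
        b = next;
    }
    return a;
}
```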

How will the absence of L1, L2 cache help the program run faster?
It doesn't. What they said is that they use parallelism to hide latency, not that the latency doesn't exist. However, this requires lots of threads which requires lots of concurrency in your software. Most modern CPU cores stop at just 2 threads, which isn't enough for much latency-hiding. Scaling up to more threads could be a problem, as many common tasks (e.g. gaming & web browsing) tend not to even fully utilize all of the threads and cores that conventional CPUs have.
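As a software analogue of that latency-hiding idea (illustrative only; Flow Computing hasn't described its mechanism in enough detail to reproduce): a single pointer chase is bound by memory latency because each load waits for the previous one, but several independent chases let the loads overlap, which is the same principle a heavily multi-threaded design applies with hardware threads instead of loop bodies.

```c
#include <stddef.h>

struct node { struct node *next; };

/* One chain: each load must finish before the next address is known. */
size_t chase_one(const struct node *p, size_t steps)
{
    size_t hops = 0;
    while (steps--) { p = p->next; hops++; }
    return hops;
}

/* Four independent chains: the four loads per iteration can all be in
 * flight at once, hiding much of the per-load latency. */
size_t chase_four(const struct node *a, const struct node *b,
                  const struct node *c, const struct node *d, size_t steps)
{
    size_t hops = 0;
    while (steps--) {
        a = a->next; b = b->next; c = c->next; d = d->next;
        hops += 4;
    }
    return hops;
}
```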

Are you aware of what a cache line is, why they were made, why each core has its own cache with instructions?
Yes, which is why it uses L3. However, cache burns quite a lot of power. If your working set tends to be too big to fit in L1 or L2, then having one is just a waste of die space and energy. I'm just speculating, but I think that's probably the reason they don't use L1 or L2 - that most of their data is either streamed to/from memory with little reuse, or streamed to/from the conventional CPU cores.

What do I care about other work, run the program faster!
This is a classic speed vs. throughput tradeoff. If we look at GPUs, they don't run any one thread very fast, but their power comes from being able to run a very large number of threads concurrently.

at worst, they are trying to hype and want to use industry optimism with the aim of simply deceiving investors.
Since I don't know exactly what their "secret sauce" is, I don't claim to know whether this is smoke & mirrors, but I think they wouldn't be wasting so much time & resources on something pointless and wouldn't have gotten enough investment capital to take their idea this far, if it really didn't have some potential. Maybe we'll learn more, or perhaps they will just fade away.

It does annoy me that we get these huge claims but no real details, leaving us to try and fill in the gaps on our own. I'd almost rather not hear anything if I'm not going to get enough details even about how they collected their supposed benchmark data to know whether or not their claims are credible.
 