This analysis by techcrunch is iffy.
Exactly my thoughts when I first read the article.
This analysis by techcrunch is iffy.
I have reviewed the literature at https://flow-computing.com/
This is my interpretation of what I think this is.
When I did my CPU circuit diagram as an undergrad computer science student, I did it with pencil on paper. It was a 2-dimensional thing.
What I think Flow Computing is doing is adding a third dimension to a CPU design. So, those of you who know how a CPU works can probably imagine how each clock tick on the instruction register would not only perform one step - but would be able to do multiple things per clock tick, depending on the instruction being executed.
But not only that - multiple instruction registers would simultaneously be doing the same thing, and all of the third-dimension steps could possibly combine with each other to output results more quickly. It would be like a neural network of instruction registers.
Anyway - any of you can go read the same documentation I read. I think I may have provided a simplification of what this is about.
Sounds suspiciously like an NPU.

Seems like it. Seems like it could just be a maths accelerator or coprocessor. CPUs are general purpose, and modern desktop CPUs are large and complex; adding a specialised bit of hardware to perform certain calculations faster will always get better performance and efficiency, hence why people use GPUs and are now starting to use NPUs.
I think you may be out of the loop; you have basically explained modern CPUs. It has been a long time since CPU cores were only able to do one thing per clock cycle. Just look at AMD's recent keynote: they say the new CPUs have, on average, a 16% increase in IPC, Instructions Per Cycle (35% maximum). They don't publish the actual IPC figures, but rough figures I have seen suggest that modern desktop CPUs (before this latest generation) may achieve around 50 IPC, so they can already do multiple things per clock tick.

IPC always heavily depends on what the instructions are...
Test | Zen 4 IPC | Zen 3 IPC | Zen 2 IPC | Golden Cove IPC | Sunny Cove IPC
Dependent MOV r,r | 5.71 | 5.72 | 4.54 | 5.62 | 4.76
Independent MOV r,r | 5.73 | 5.7 | 4.55 | 5.68 | 4.77
Zero integer register using XOR r,r | 5.73 | 5.72 | 3.63 (ALU pipes still used) | 5.73 | 4.77
Zero integer register using MOV r, 0 | 3.77 (ALU pipes still used) | 3.81 | 3.64 (ALU pipes still used) | 5.64 | 3.81 (ALU pipes still used)
Zero integer register by subtracting it from itself | 5.71 | 5.7 | 3.64 (ALU pipes still used) | 5.73 | 4.77
Dependent MOV xmm, xmm | 5.73 | 5.00 | 4.01 (limited by allocation rate to FP/vector side?) | Not tested | 4.06 (elimination sometimes fails?)
Independent MOV xmm, xmm | 5.71 | 5.71 | 3.84 | Not tested | 4.77
Zero vector register using xorps | 4.00 | 4.00 | 4.00 | Not tested | 4.76
Zero vector register using subps | 5.73 | 5.71 | 3.51 | Not tested | 4.77
Dependent Add Immediate | 1.00 | 1.00 | 1.00 | 5.61 | 1.00
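To make the dependent-vs-independent distinction in the table above concrete, here is a small self-contained C sketch of my own (nothing from the thread or from Flow Computing's material): a chain of adds where each result feeds the next retires roughly one add per cycle, while several independent accumulators let a wide core use multiple ALU ports in the same cycle.

/*
 * Toy IPC demo: dependent vs. independent integer adds.
 * Build with e.g.  gcc -O1 ipc_demo.c -o ipc_demo
 * (Higher optimization levels may fold these loops away entirely.)
 */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

#define N 400000000ULL

int main(void) {
    volatile uint64_t sink = 0;

    /* Dependent chain: every add needs the result of the previous add. */
    double t0 = seconds();
    uint64_t a = 1;
    for (uint64_t i = 0; i < N; i++)
        a += i;
    double t1 = seconds();
    sink += a;

    /* Independent adds: four accumulators with no dependency between them,
     * so a superscalar core can issue several of them per cycle. */
    double t2 = seconds();
    uint64_t b0 = 1, b1 = 1, b2 = 1, b3 = 1;
    for (uint64_t i = 0; i < N; i += 4) {
        b0 += i;
        b1 += i + 1;
        b2 += i + 2;
        b3 += i + 3;
    }
    double t3 = seconds();
    sink += b0 + b1 + b2 + b3;

    printf("dependent adds:   %.3f s\n", t1 - t0);
    printf("independent adds: %.3f s\n", t3 - t2);
    printf("(checksum %llu)\n", (unsigned long long)sink);
    return 0;
}

On a recent out-of-order core the second loop typically finishes noticeably faster even though it performs the same number of useful additions; the exact ratio depends on compiler flags and on how wide the core is.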
I've normally seen IPC compared on a per-core and per-workload basis. It's weird to see it claimed on the basis of a summation of a CPU's parts and on such a theoretical basis.

These are per core, so a full 16-core CPU will be far above 50 IPC, again depending on what you run; some things will have much higher or lower IPC than others.
I think you may be out of the loop; you have basically explained modern CPUs. It has been a long time since CPU cores were only able to do one thing per clock cycle. Just look at AMD's recent keynote: they say the new CPUs have, on average, a 16% increase in IPC, Instructions Per Cycle (35% maximum). They don't publish the actual IPC figures, but rough figures I have seen suggest that modern desktop CPUs (before this latest generation) may achieve around 50 IPC, so they can already do multiple things per clock tick.

Then there are multiple other things they do to increase performance further and utilise the CPU's resources better. Out of Order Execution, where the CPU reorders instructions based on what resources they need, what other instructions it is trying to do, and what data each instruction depends upon. Speculative execution and branch prediction, where, when it comes across a conditional statement, the CPU preemptively picks which outcome is most likely and starts executing it before the condition is even checked; when the condition is checked it either continues down that path if it was right, or it scraps everything it did on the chosen path and starts again on the correct one. SIMD, Single Instruction Multiple Data, where one instruction is executed on a batch of data in parallel, sometimes called "vector processing". SMT, Simultaneous Multi-Threading, where a core can execute multiple threads at once; desktop CPUs can generally run two threads per core, although this doesn't give a 2x performance boost since it works by letting the second thread use parts of the core the first thread isn't using.

So things definitely aren't as simple as a CPU only being able to execute one instruction per clock tick, and it hasn't been that way in general computing for a long time. So if the claimed advantage of this "PPU" is that it lets you execute multiple instructions per clock cycle, then that already exists and is already a part of a lot of CPUs, as is parallel processing. A CPU you drew a diagram of at undergraduate level in university is very likely nothing like modern CPUs (especially x86 CPUs), and is likely not much like CPUs of the time either, depending on how long ago it was.

Yes, I understand your point.
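As a concrete illustration of the SIMD point in the explanation above, here is a minimal C sketch of my own (not anything from Flow Computing): the second loop uses SSE intrinsics so that a single addps instruction adds four floats at once, where the scalar loop needs one add per element.

/*
 * Scalar vs. SIMD addition of two small float arrays.
 * SSE2 is baseline on x86-64, so this builds with e.g.  gcc -O2 simd_demo.c
 */
#include <emmintrin.h>   /* SSE/SSE2 intrinsics */
#include <stdio.h>

#define N 16

int main(void) {
    float a[N], b[N], c_scalar[N], c_simd[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* Scalar: one add instruction per element. */
    for (int i = 0; i < N; i++)
        c_scalar[i] = a[i] + b[i];

    /* SIMD: one addps instruction handles four floats at a time. */
    for (int i = 0; i < N; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c_simd[i], _mm_add_ps(va, vb));
    }

    for (int i = 0; i < N; i++)
        printf("%5.1f %5.1f\n", c_scalar[i], c_simd[i]);
    return 0;
}

In practice a compiler will often auto-vectorise the scalar loop anyway, which is exactly the point: batch parallelism per instruction is already standard on current CPUs.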
So, if you were creating a compiler for this, there are certain things that are done in a sequence of machine-code-level instructions that don't necessarily need to be done sequentially. For example, a common pattern would be to move memory location A to register A and memory location B to register B, then add RegA to RegB, put the result in RegC, and then move RegC to memory location C. A very simple implementation of Flow Computing would be for the first two memory moves to be performed simultaneously, in a single CPU clock tick, by one (single) instruction register.

Doing this at compile time is what VLIW CPUs do (or rather what optimizing compilers for them try to do). For out-of-order CPUs, the classical formulation is probably Tomasulo's algorithm.
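For what it's worth, that quoted sequence can be written out in a few lines of C; the names here are purely illustrative. Any current out-of-order x86 or Arm core can already issue the two loads in the same cycle (load ports permitting), because nothing makes one depend on the other; only the add has to wait for both.

/* The quoted memA + memB -> memC sequence, with the dependencies spelled out. */
long add_via_memory(const long *loc_a, const long *loc_b, long *loc_c) {
    long reg_a = *loc_a;        /* load A  -- independent of the next load  */
    long reg_b = *loc_b;        /* load B  -- can issue in the same cycle   */
    long reg_c = reg_a + reg_b; /* add     -- must wait for both loads      */
    *loc_c = reg_c;             /* store   -- must wait for the add         */
    return reg_c;
}

A VLIW compiler would make the same observation statically and pack the two loads into one wide instruction word; Tomasulo-style hardware discovers it dynamically at run time.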
I think - and I obviously could be wrong - Flow Management is about multiple things occurring at a single tick of a CPU clock inside the instruction register...

You keep talking about an instruction register containing multiple operations. That's basically the definition of VLIW.
... which subsequently initiates other instruction registers within a 3-dimensional network of instruction registers.

I don't see anywhere on their website that they talk about a 3-dimensional anything.
Robert Tomasulo didn't have the idea in the 1960s that ...

To be clear, the only reason I mentioned Tomasulo is that you started musing about how to do instruction scheduling. I just wanted to point out the classical formulation of that, which happens to be the foundation of modern OoO processors.
...there would be a CPU with a matrix of thousands of instruction registers.

Where are you reading that? Quote the part where it talks about that, please.
Many algorithms are context-dependent. That is, at step n we need the calculations of step n-1.

Yes, I like to use computation of the Fibonacci sequence as a way to benchmark out-of-order CPU cores on a low-IPC task. It's one of the least energy-intensive things they can do, because the algorithm allows for very little parallelism. Consequently, it can be a good way to find out what the all-core clock limits are, without first running into energy or thermal limits.
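For reference, that benchmark idea is trivial to reproduce; this is just my own minimal version, not any official test. Each iteration consumes the previous iteration's results, so the loop is one long dependency chain and IPC stays low no matter how wide the core is.

/* Iterative Fibonacci: a pure dependency chain, roughly one add per cycle. */
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t prev = 0, curr = 1;
    for (uint64_t i = 0; i < 2000000000ULL; i++) {
        uint64_t next = prev + curr;  /* needs the two previous values         */
        prev = curr;                  /* step n-1 feeds step n, no overlap     */
        curr = next;                  /* (values wrap; we only want the chain) */
    }
    printf("%llu\n", (unsigned long long)curr);  /* keep the result live */
    return 0;
}

Running one copy per core keeps every core busy while drawing comparatively little power, which is what makes it useful for finding all-core clock limits.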
How will the absence of L1, L2 cache help the program run faster?

It doesn't. What they said is that they use parallelism to hide latency, not that the latency doesn't exist. However, this requires lots of threads, which requires lots of concurrency in your software. Most modern CPU cores stop at just 2 threads, which isn't enough for much latency hiding. Scaling up to more threads could be a problem, as many common tasks (e.g. gaming & web browsing) tend not to even fully utilize all of the threads and cores that conventional CPUs have.
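A rough software analogue of that latency-hiding idea, in case it helps (my own sketch; Flow's actual mechanism is not public): a single pointer chase stalls on every cache miss, but running several independent chases in the same loop lets the misses overlap, which is what piling on more hardware threads does at the core level.

/* Pointer chasing: one dependent chain vs. four independent chains. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NODES (1u << 24)   /* 16M entries * 8 B = 128 MiB, far bigger than L3 */

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    size_t *next = malloc(NODES * sizeof *next);
    if (!next) return 1;

    /* Sattolo's algorithm: builds one random cycle through all entries. */
    for (size_t i = 0; i < NODES; i++) next[i] = i;
    for (size_t i = NODES - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    /* One chain: every load must wait for the previous load to finish. */
    double t0 = seconds();
    size_t p = 0;
    for (size_t i = 0; i < NODES; i++) p = next[p];
    double t1 = seconds();

    /* Four chains: the four loads per step are independent and can overlap. */
    size_t a = 0, b = NODES / 4, c = NODES / 2, d = 3 * (NODES / 4);
    for (size_t i = 0; i < NODES / 4; i++) {
        a = next[a]; b = next[b]; c = next[c]; d = next[d];
    }
    double t2 = seconds();

    printf("1 chain:  %.3f s\n", t1 - t0);
    printf("4 chains: %.3f s (same total number of loads)\n", t2 - t1);
    printf("(checksum %zu)\n", p + a + b + c + d);
    free(next);
    return 0;
}

The catch, as pointed out above, is that you only get this effect if the software actually exposes many independent streams of work.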
Are you aware of what a cache line is, why they were made, why each core has its own cache with instructions?

Yes, which is why it uses L3. However, cache burns quite a lot of power. If your working set tends to be too big to fit in L1 or L2, then having one is just a waste of die space and energy. I'm just speculating, but I think that's probably the reason they don't use L1 or L2 - that most of their data is either streamed to/from memory with little reuse, or streamed to/from the conventional CPU cores.
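To put a rough number on the working-set argument (again just my own toy example, not a claim about Flow's memory system): the function below performs the same count of cache-line-sized reads twice, once over a buffer small enough to live in L1/L2 and once over a buffer that streams past every cache level.

/* Same number of reads: small reused buffer vs. large streamed buffer. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

#define SMALL (16u * 1024)            /* 16 KiB: fits in L1   */
#define BIG   (256u * 1024 * 1024)    /* 256 MiB: way past L3 */
#define READS (256ULL * 1024 * 1024)  /* identical work for both buffers */

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static uint64_t touch(const uint8_t *buf, uint64_t len, uint64_t reads) {
    uint64_t s = 0;
    for (uint64_t i = 0; i < reads; i++)
        s += buf[(i * 64) % len];      /* one read per 64-byte cache line */
    return s;
}

int main(void) {
    uint8_t *small = malloc(SMALL), *big = malloc(BIG);
    if (!small || !big) return 1;
    for (uint64_t i = 0; i < SMALL; i++) small[i] = (uint8_t)i;
    for (uint64_t i = 0; i < BIG; i++)   big[i]   = (uint8_t)i;

    double t0 = seconds();
    uint64_t s1 = touch(small, SMALL, READS);
    double t1 = seconds();
    uint64_t s2 = touch(big, BIG, READS);
    double t2 = seconds();

    printf("16 KiB working set:  %.2f s\n", t1 - t0);
    printf("256 MiB working set: %.2f s\n", t2 - t1);
    printf("(checksums %llu %llu)\n", (unsigned long long)s1, (unsigned long long)s2);
    free(small); free(big);
    return 0;
}

Whether it is worth spending die area and power on L1/L2 for a unit whose traffic looks like the second case is exactly the trade-off being speculated about here.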
What do I care about other work, run the program faster!

This is a classic speed vs. throughput tradeoff. If we look at GPUs, they don't run any one thread very fast, but their power comes from being able to run a very large number of threads concurrently.
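If it helps, here is a tiny throughput-versus-latency illustration of my own (OpenMP, nothing Flow-specific): each simulated request takes just as long on its own, but running many of them concurrently multiplies requests per second, which is the GPU-style trade described above.

/*
 * Latency vs. throughput: no single request gets faster, but many can
 * complete per second. Build with:  gcc -O2 -fopenmp throughput_demo.c
 */
#include <stdio.h>
#include <omp.h>

/* Simulate one request with a fixed amount of busy work (~1 ms). */
static double handle_request(void) {
    double x = 0.0;
    for (int i = 0; i < 200000; i++)
        x += 1.0 / (i + 1);
    return x;
}

int main(void) {
    const int requests = 256;
    double sink = 0.0;

    double t0 = omp_get_wtime();
    for (int i = 0; i < requests; i++)       /* one request at a time */
        sink += handle_request();
    double serial = omp_get_wtime() - t0;

    t0 = omp_get_wtime();
    #pragma omp parallel for reduction(+:sink)
    for (int i = 0; i < requests; i++)       /* many requests in flight */
        sink += handle_request();
    double parallel = omp_get_wtime() - t0;

    printf("serial:   %.3f s  (%.0f req/s)\n", serial, requests / serial);
    printf("parallel: %.3f s  (%.0f req/s)\n", parallel, requests / parallel);
    printf("(checksum %f)\n", sink);
    return 0;
}

Each individual call to handle_request still costs the same; only the aggregate rate improves, and only if there are enough independent requests to keep all the workers busy.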
At worst, they are just trying to generate hype and use industry optimism with the aim of simply deceiving investors.

Since I don't know exactly what their "secret sauce" is, I don't claim to know whether this is smoke & mirrors, but I think they wouldn't be wasting so much time & resources on something pointless, and wouldn't have gotten enough investment capital to take their idea this far, if it really didn't have some potential. Maybe we'll learn more, or perhaps they will just fade away.