Soft Machines Startup Promises 'Virtual Cores' Twice As Fast As Physical Cores


everygamer

As a programmer, I look at this and wonder how they get a single-threaded application to take the next step when the previous step has not completed. You would still need to queue and execute in order, as that is the point of a single-threaded application: it processes data linearly, in order. They would likely need to use predictive methods to "guess" what the next step will be, pre-process that step or a number of possible steps, and then have those results waiting to speed up the application. Microsoft is looking at this type of model for cloud-assisted applications, though it remains to be seen how effective it will be in the long run.

In reality, if developers spent more time learning how to write multi-threaded applications, we would see bigger performance gains.
 

Jaroslav Jandek


That is not how it is done in modern CPUs. While instructions are fetched and decoded in order and their results are retired in order, their execution isn't.
If you are familiar with the C# language (most programmers are), it is conceptually similar to the async/await functionality - while you await results, you let other tasks run asynchronously.

As a programmer, you should be aware that many tasks can't be efficiently parallelized - in many cases the overhead of parallelization (synchronization and management) is so large that running in one thread is both faster and more efficient.
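
To put a rough number on that overhead point, here is a tiny C++ sketch (my own toy example, nothing from the article; names and sizes are made up and timings will vary by machine and compiler): splitting a very small sum across two threads usually loses to the plain serial loop, because creating and joining the threads costs more than the work being saved.

#include <chrono>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    // Deliberately tiny workload: the whole sum takes a few microseconds.
    std::vector<int> data(10000, 1);

    auto t0 = std::chrono::steady_clock::now();
    long serial = std::accumulate(data.begin(), data.end(), 0L);
    auto t1 = std::chrono::steady_clock::now();

    // Same sum split across two threads: thread creation, joining and
    // cache traffic typically cost more than the work saved.
    long left = 0, right = 0;
    std::thread a([&] { left  = std::accumulate(data.begin(), data.begin() + 5000, 0L); });
    std::thread b([&] { right = std::accumulate(data.begin() + 5000, data.end(), 0L); });
    a.join();
    b.join();
    auto t2 = std::chrono::steady_clock::now();

    using us = std::chrono::microseconds;
    std::cout << "serial:   " << serial << " in "
              << std::chrono::duration_cast<us>(t1 - t0).count() << " us\n"
              << "threaded: " << (left + right) << " in "
              << std::chrono::duration_cast<us>(t2 - t1).count() << " us\n";
}

Scale the vector up a few orders of magnitude and the threaded version starts winning - which is exactly the granularity question.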
 

alextheblue

If this tech was used on AMD it could seriously help them get back into business, with their multicore CPUs being quite good at multithreaded performance but sucking at single-threaded.
Well AMD is one of several firms investing in them, and they said their technology is ISA-independent. So there's no reason to believe that it won't be integrated into an x86 chip or two in the future. However, it won't be any time soon, and I suspect it will hit dense servers and highly mobile devices first.
 

hst101rox

TechyInAZ, I was thinking the same thing. There had to be a way to split the work of one thread across multiple threads so that programming for parallel computing becomes unnecessary. If this takes off, single-threaded performance will go through the roof.
 

InvalidError


Except that a multi-threaded CPU has the extra flexibility to fill pipeline stalls due to dependencies in one thread's instruction stream with instructions from any other hardware thread. This is part of the reason why modern GPUs are designed to juggle thousands of threads - individual threads spend a fair amount of time waiting on memory accesses so having more threads to pick from gives more opportunities to keep execution units busy.

This is a bit like Sun's T-series CPUs.
 

Untruest

I am not sure what everyone's background is, but I can see the potential for this, and of course the numbers are already showing in the benchmarks.

This was designed to utilize processors more efficiently. In essence, by creating a virtual core from, say, a quad-core CPU, it's like the thread sees 4 integer units, 4 floating-point units, 4 fetch and load/store units... which is great. Ultimately, however, they will still go underutilized if there are data hazards/dependencies. Compilers would need an update to make best use of the resources. At least that takes the responsibility off less-skilled programmers, or saves man-hours if you're a software company.
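
To make the data-hazard point concrete, here is a small C++ sketch (my own illustration, not Soft Machines' code; the function names are mine): the first loop is one long dependency chain, so extra execution units have nothing to do, while the second breaks the chain with independent accumulators and exposes instruction-level parallelism the hardware can actually overlap.

#include <cstddef>
#include <vector>

// One long dependency chain: every add waits on the previous one,
// so duplicated integer/FP units sit idle.
double chained_sum(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v)
        s += x;                 // s depends on the s from the last iteration
    return s;
}

// Four independent accumulators break the chain, giving the hardware
// (or a virtual core built from several physical ones) work it can overlap.
double unrolled_sum(const std::vector<double>& v) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < v.size(); ++i)
        s0 += v[i];             // leftover elements
    return (s0 + s1) + (s2 + s3);
}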

Where I'd like to see this go is the same idea combined with something similar to Intel's Hyper-Threading. Then you'd have multiple threads more efficiently using the one virtual core, which in turn is itself like 4 cores. Those benchmarks would be interesting.
 

Jaroslav Jandek

You did not get my point at all - I was responding to a specific question, not making a general claim.
What you describe is called simultaneous multithreading, which Intel implements as Hyper-Threading, and they have chosen only 2 logical cores for a reason. IIRC, there are implementations with even 8 logical cores, but those processors have much higher IPC.
In GPUs, SMT is implemented by streaming multiprocessors! That is a completely different architecture, with tens of thousands of registers and fast HW context switching - that is why it can actually process so many instructions in parallel. This is not something you can do with "a beefy single-core CPU", which is what I was responding to.
 

mamailo

@theLiminator

You missed the point. It is not about parallelizing the code flow but about executing different parts of it on separate cores.

In your example, one core can check whether Y is in range, another can preemptively add X to Y, and another can print the last value of X.


Overall, the speed increase would be noticeable - faster than a single core doing all the work.

 

xenol

The idea of this doesn't seem new and feels very reminiscent of EPIC and VLIW, which HP/Intel and Transmeta tried in the early 2000s. And even before NVIDIA did Denver, Kepler was following the same design route.

I dunno, maybe they found a solution where EPIC and VLIW failed.
 

bit_user

This will not help highly-optimized code, such as most games.

If we're talking about compiled code, most of what they do you could accomplish with a good parallelizing compiler. And there's no fundamental reason compiled code can't optimize itself as it runs. This could yield even better results than would be possible with Soft Machines' tech.

I'm guessing some JIT compilers for Java, JavaScript, C#, etc. also do auto-parallelization and profile-based optimization.

Anyway, it'll be interesting to see who buys them. Someone who currently lacks a good, parallelizing JIT compiler, I'd guess.
 

bit_user

Huh? Itanium was never about making x86 run faster - the x86 interpreter was just about providing a means to run legacy programs. The first few Itaniums also had only one core, so they definitely didn't try to do multicore optimizations like this.

However, since the Pentium Pro, all Intel CPUs (with the possible exception of Atom and Xeon Phi) consist of an x86 front-end bolted onto a RISC core. So you're not even correct in your statement about Itanium proving anything about instruction-set translation.

And all the article says about Soft Machines implies that they translate programs written for the same architecture, making your point even more of a non sequitur.

Apologies if this sounds harsh, but I suggest you learn some facts before you post.
 
There are some very real problems with doing what they are talking about. Ultimately, code is just 1's and 0's running on physical transistors - all binary math. You cannot arbitrarily reorder that math without some serious consequences. The best way to describe what code looks like from the machine's point of view is to imagine a long stream of 1's and 0's stretching to infinity. It is always linear, a single long stream (also known as a single thread). Each instruction is in sequence, with jumps pointing to another position in that stream. You can't arbitrarily rearrange the code in that stream without serious logic consequences.

Current processors do prediction, where the front end analyzes the code and makes guesses about what the results of the operations should be, then splits up the code and executes multiple instructions at once. If one of those guesses turns out to be wrong, all that work is discarded and started over again with the correct results rather than the wrongly guessed ones. This happens to be one of the biggest differences between the Intel and AMD CPU designs. AMD's execution core is not "weaker" than Intel's - binary math is simple to execute, and we haven't had any major changes in over a decade. AMD's ability to "guess" the correct results and act on those predictions just isn't nearly as good as Intel's.
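
Here is a toy C++ sketch of how much wrong guesses cost (my own example; results depend on the compiler and optimization level, since a clever compiler may turn the branch into a conditional move): the same loop over the same data runs noticeably faster once the data is sorted, because the branch predictor then guesses right almost every time instead of throwing away speculative work.

#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <vector>

// Sum the elements >= 128. With random data the branch is a coin flip,
// so the front end keeps guessing wrong and discarding speculative work.
long sum_big(const std::vector<int>& v) {
    long total = 0;
    for (int x : v)
        if (x >= 128)
            total += x;
    return total;
}

int main() {
    std::vector<int> data(1 << 22);
    for (int& x : data)
        x = std::rand() % 256;

    auto run = [&](const char* label) {
        auto t0 = std::chrono::steady_clock::now();
        long s = sum_big(data);
        auto t1 = std::chrono::steady_clock::now();
        std::cout << label << " sum=" << s << " in "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
                  << " ms\n";
    };

    run("unsorted (hard to predict):");
    std::sort(data.begin(), data.end());   // same data, now the branch is predictable
    run("sorted   (easy to predict):");
}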

Anyhow, this concept has some severe limitations. You can't arbitrarily combine high-level computational elements ("cores") into a single virtual one and execute binary code. You could only execute a form of pseudo-code that was purpose-designed to be optimized at run time rather than compile time. What they are talking about is implementing a hardware version of software pseudo-code - essentially a chip that runs Java natively. It could have some uses in appliances, but it is most definitely not efficient enough to build an OS around.
 

Jaroslav Jandek

You are guessing wrong. None of those actually has automatic parallelization. Microsoft has been working on that for years - it is actually possible (esp. with a tracing JITter) but really hard to do right. There are parallel libraries that make parallelization extremely easy to do explicitly, though - e.g. tasks in C#, java.util.concurrent in Java...
Ironically, auto-parallelization was much easier in the past with fixed inputs and workloads. Today, virtually all OO workloads have dynamic inputs and workloads - that's why it is so hard...

FYI, C++ actually does have auto-parallelizing compilers - the Intel one is pretty good, and the M$ one has optional auto-parallelization (/Qpar).
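
For reference, the kind of loop those auto-parallelizers can handle looks roughly like this minimal sketch (my own example; /Qpar is the switch mentioned above, the report flag is from memory, so check the compiler docs): each iteration touches only its own element, so there is no loop-carried dependency and the compiler is allowed to split the range across cores.

#include <cstddef>

// No loop-carried dependency: iteration i only reads x[i]/y[i] and writes y[i].
// MSVC:  cl /O2 /Qpar /Qpar-report:1 saxpy.cpp
// (the Intel compiler has an equivalent auto-parallelization switch)
// Note: the compiler may also need to prove x and y don't alias, e.g. via a
// runtime check or __restrict, before it will actually parallelize this.
void saxpy(float* y, const float* x, float a, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}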
 

xenol


And yet you pick one aspect out of my entire post and claim that I should learn some facts.

The way I interpreted the approach this company is using is that they're taking instructions from one process and shoving them down as many execution units as possible using a software scheduler. Itanium used a software scheduler (the program was scheduled at compile time) in order to feed a VLIW-derived processor.

And sure, Itanium only had one core, but I never said that Itanium was trying to do this. I said it's very reminiscent of this because, at a higher level, it sounds like the same operation, just with a different implementation.

I believe that's the real takeaway here. Someone's trying to do a software scheduler, which has been a sort of holy grail for general-purpose processors for a while. MCST has done it with Elbrus, with the 1.3 GHz Elbrus-8C apparently matching the performance of a Xeon E5 @ 2.6 GHz. NVIDIA has done it on the GPU side with Kepler and on the CPU side with Denver. If Soft Machines can pull this off on a larger scale, with performance to boot, then it will be the closest thing Intel has had to competition since AMD's Athlon, and I'm sure Intel will start changing its roadmaps real soon.

EDIT: because I feel like writing more...

Why is a software scheduler valuable? It's efficient. As a real-world example, I have a test I do at work on a unit. Some steps require different equipment, which we have to check out, and different configurations, which require power cycling the unit (power cycles require a minimum of 3 minutes before another can be done). The test as written requires us to dynamically reorder a few steps because a step needs equipment that has to be checked out, or the configuration needs to change (and two steps later it needs to go back). This is akin to out-of-order execution (though the order of the results isn't important).

If we were to sort the test ahead of time, such that all of the steps that require the same equipment or configuration were lumped together, then we wouldn't use up valuable time thinking about how we should do the test, waiting for equipment to be checked out, or reconfiguring the unit.
 

justvisiting

This will not work - there is what you call diminishing returns. For example, a single CPU uses registers (0-cycle access) to process information. Even if the cores share the L1 cache (I guess this is where they would share data - the boundary that separates your multicore CPU's ALU+FPU units), you introduce a delay of 2-3 cycles, and that only covers ~32 KB of data.

So, for example: PhysicalProc1 can process a task in 10 cycles. PROCVirtual (made of 2 physical processors, proc2 and proc3) could in theory give a 100% speedup - 5 cycles on each proc. But introduce the L1 latency (delay) of 2 cycles and it takes 12 cycles to complete, 6 per proc - a speedup of only around 80 percent. And that assumes the 10 cycles split cleanly into 5-cycle chunks. What if the work is only splittable into 1-cycle pieces? That's 10 divisions in total, times 2 cycles of overhead per division = 20 cycles in total; split that between the two procs (proc2 and proc3) and you get 10 cycles on each proc in PROCVirtual - no real gain there....

The overhead hurts the performance....

And I'm not even touching data that is larger than the 32 KB L1 cache.
 

bit_user


Again, this is apples and oranges. First, Itanium/EPIC is not a strict implementation of VLIW, which does depend exclusively on software scheduling. What Intel did with EPIC was to make the data dependencies of ops explicit, in order to simplify the logic needed to schedule execution in hardware at runtime. There are two benefits to this approach, with the primary one being binary compatibility across different processor generations (a traditional weakness of true VLIW). The second benefit is that optimal scheduling decisions often can't be made until runtime, due to limited information available to the compiler (caused both by language-level ambiguities like pointer aliasing and by runtime effects like cache misses). When one instruction's data dependency is blocked on a cacheline fetch, the EPIC coding makes it easy for the CPU to see which other ops must be held back vs. which can go ahead and execute.

But as it relates to this story, scheduling execution on ALU resources is very different from parallelizing across cores. As many have pointed out in this thread, there are significant communication and synchronization overheads between cores, which can take hundreds of cycles at best. Therefore, I suspect that Soft Machines is doing fairly coarse dependency analysis and detecting larger blocks of code and data that can be parallelized.
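
If anyone wants to see that cross-core cost for themselves, here is a rough C++ sketch (my own, not from the article; the exact numbers depend heavily on the CPU and on which cores the OS schedules the threads): two threads bounce a value back and forth through a shared atomic, so each round trip pays the cache-coherency latency between cores.

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

int main() {
    constexpr int kRounds = 100000;
    std::atomic<int> turn{0};   // this one cacheline ping-pongs between two cores

    std::thread other([&] {
        for (int i = 0; i < kRounds; ++i) {
            while (turn.load(std::memory_order_acquire) != 1) {}  // wait for main thread
            turn.store(0, std::memory_order_release);             // hand it back
        }
    });

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kRounds; ++i) {
        turn.store(1, std::memory_order_release);                 // pass to the other core
        while (turn.load(std::memory_order_acquire) != 0) {}      // spin until it comes back
    }
    auto t1 = std::chrono::steady_clock::now();
    other.join();

    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    std::cout << "average round trip: " << ns / kRounds << " ns\n";
}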

I don't mean to minimize the difficulty of doing this well. My hat goes off to them. I just think the benefit is fairly limited, as most performance-intensive tasks will have this done by hand (which is often better, or at least minimizes any gains remaining for Soft Machines). And if/when JIT compilers for Java/C#/JavaScript become smart enough to do this internally (which they're better able to do because they have a higher-level view of the program), then Soft Machines will be left with mere crumbs.

IMO, the big win for them is to get bought by someone big who wants to build their tech into a JIT compiler. Like MS, Google, etc.

BTW, I apologize for criticizing your post. There's a lot of misunderstanding of Itanium, which I think is a nice architecture that was defeated by Intel's own lawyers. Whenever I see what looks like uninformed Itanium bashing, it sets me off on the wrong foot. I think we're all just here to learn and share.
 
The Itanium was a flawed design to begin with; the magic compiler it needed never materialized because it's impossible to create. Intel cannot predict the future, which is what is necessary for any VLIW-oriented architecture to perform well as a general-purpose processor. It would have made a great specialized co-processor for large chunks of vector math, but there is nothing Intel could have done to make it a good general-purpose CPU. Anyhow, this doesn't relate to Itanium in any way. This isn't even "software scheduling", as it's not referencing any current ISA - they created their own for this. It's basically a JIT compiler implemented in hardware that takes in the VISC instructions and compiles/schedules them on the physical resources.

This is actually already done by both Intel and AMD, though they both go to great lengths to protect it as a trade secret. Neither AMD nor Intel produces native x86 CPUs, and they haven't for a very long time. Both of them design RISC CPU cores with an external x86 instruction decoder/scheduler bolted on. It takes in the x86 instructions, converts them into RISC-like load/store operations, and then dispatches them to the internal computing resources for execution. When the execution is done, the result is returned in compliance with the x86 ISA. This methodology lets both AMD and Intel redesign the processor's internal ISA between families while still maintaining compatibility with existing software.

So really, all these guys have done is develop a new instruction set that separates the front-end decoder from the internal execution resources. A program compiled for VISC would be compatible with virtually any processor regardless of what ISA it was running internally. Of course, they aren't marketing it as that - too technical and not enough buzzwords - so instead they go for the "cloud computing / virtualization" buzzword angle.
 

bit_user

Huh? They don't need a magic compiler. The only thing a compiler for EPIC has to do beyond RISC or x86 is separately encode the data dependencies in the instruction stream. Optimizing compilers already do software scheduling, based on modeling of hardware resources and instruction latencies, and have for a very long time. In fact, it was arguably more important in the early days of pipelined and superscalar CPUs, and also with the small scheduling windows used by early out-of-order CPUs.

I don't see how you can argue that EPIC should be slower than x86, if Intel put the same resources behind custom layout and built it on the same process node as their flagship x86 CPUs. For that to ever happen, it had to reach critical mass, which it never did because no one wanted to be locked into buying CPUs only from Intel. Intel's lawyers made it impossible for a competitor to offer EPIC-compatible CPUs, by patenting every single aspect of the architecture and the technologies around it.

It is not a VLIW CPU. It performs on-the-fly instruction scheduling in hardware. It has branch prediction and speculative execution. VLIW CPUs have none of these.

The reason for encoding the op dependencies explicitly was to simplify the scheduling logic, which grows nonlinearly with the number of pipelines and the size of the scheduling window. Given a limited transistor budget, they chose to simplify the instruction stream so they could spend more of their die area on ALUs and still end up with a wide chip that could achieve good yield and hit cost and power targets.

The only point of agreement here is that a benefit from very wide superscalar CPUs is only gained when there's loads of instruction-level parallelism, which is scarce in most code. However, anything that runs fast on a GPU has loads of ILP (and it's no coincidence that VLIW has been used in GPUs over the years). Fortunately, high ILP is not uncommon in performance-intensive workloads.

You're telling me? I already said that!
 

GreaseMonkey_62


AMD shouldn't waste time and buy into this.
 

*blink blink*

Ok sure you go with that buddy
 

bit_user

If you're not going to make any case or present any facts, why'd you bother replying? Stooping to insults or mockery basically amounts to a concession, whereas one can infer nothing if you simply don't respond. I'm just saying this as advice - not trying to continue the debate.

BTW, anytime you find yourself claiming to be wiser than scores of the smartest and most experienced chip designers and architects, you should probably check yourself. Big companies fail all the time, but the reasons are rarely as simple or foreseeable as they may seem in hindsight and from a distance.
 