AMD CPU speculation... and expert conjecture


jdwii

Splendid


http://www.extremetech.com/extreme/192858-visc-cpu-virtual-core-design-emerges-could-this-be-the-conceptual-breakthrough-weve-been-waiting-for
the CPU interleaves the serial workload across two physical cores

Sounds like a horrible idea to me, but hey, it scored higher.
 

szatkus

Honorable
Jul 9, 2013


Looks like they are targeting IoT, which makes a lot of sense at such a low clock. Of course it also depends on power consumption, but with such a low clock it could be good.
 

szatkus

Honorable
Jul 9, 2013


Actually it sounds like a great idea. Similar to SpMT (maybe it is SpMT, hard to say without details). We can place a lot of cores on one chip, but single-thread performance has progressed very slowly over the past few years.
 

jdwii

Splendid
^ It's just so new. Typically, splitting one thread across two cores would cause major latency issues; many thought AMD was doing exactly that with Bulldozer. I hope we see more info on this in the future.
 

Cazalan

Distinguished
Sep 4, 2011


I think the low clocks are due to it being implemented in an FPGA.

Anyhow, when start-ups come out with stuff like this, it's generally because they blew through their first rounds of funding and need another big chunk to stay alive.

SPEC 2006 scores are also misleading. SPECrate, which takes the memory and I/O subsystems into account, is typically far more important.
 

szatkus

Honorable
Jul 9, 2013


Looks like it's not an FPGA: http://semiaccurate.com/assets/uploads/2014/10/Soft_Machines_prototype_board.jpg
 

Reepca

Honorable
Dec 5, 2012
So I suppose I'm a bit confused about how this whole virtual-core business works. From what I see, it looks like the processor automatically parallelizes workloads somehow... but that seems to me to be a software thing. I guess what confuses me is this: if the "virtual core" technology makes it possible to run a single thread on multiple cores, what exactly is the advantage of, say, brute-force searching for prime numbers on one virtual core vs. doing the same across (number of CPU cores) threads? I guess it would be easier to program, but I'm not comprehending where the performance benefit comes from.

I imagine a number of you can explain it to me pretty well - enlighten me!
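To put rough numbers on what I mean (toy Python sketch; the 1.5x "serial boost" figure below is purely made up): explicit threads only speed up the part I can already split myself, so any extra win from a virtual core would have to come from the serial part.

```python
# Back-of-the-envelope: where explicit threads stop helping.
# Amdahl's law: speedup = 1 / ((1 - p) + p / n) for parallel fraction p on n cores.

def amdahl_speedup(p, n):
    """Overall speedup when a fraction p of the work runs on n cores."""
    return 1.0 / ((1.0 - p) + p / n)

p = 0.5        # assume half the workload is embarrassingly parallel (e.g. prime search)
cores = 4

threads_only = amdahl_speedup(p, cores)   # programmer splits only the parallel half
serial_boost = 1.5                        # assumed ILP win on the serial half (made up)
visc_like = 1.0 / ((1.0 - p) / serial_boost + p / cores)

print(f"explicit threads only:       {threads_only:.2f}x")
print(f"threads + serial-half boost: {visc_like:.2f}x")
```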
 

genz

Distinguished



Virtual cores, I believe, are not cores at all, hence the term "virtual". Either we're talking about electronically reconfigurable, FPGA-style core flexibility (in which a core can change mode from being one core at 2x clock to two cores and back), which seems a little far-fetched at that process node, or it's something a lot closer to HT on Intel chips, where a core splits complex operations into atomically multithreaded ops in microcode or hardware.

The problem with software is that it is inherently slower than hardware due to layers of abstraction; this is where the Apple super-CPU myth that szatkus brought up comes from. Android, which all the competing CPUs would be running as their OS, uses Dalvik on top of a form of Linux. Linux probably has the least optimized drivers and software APIs of just about any non-science platform, because their open-source, non-profit status means nobody wants to sit there for hours scrubbing for optimizations, and Dalvik as a VM on top means we are running our phone benchmarks in a virtual machine on the phone we're trying to evaluate. iOS is far more lightweight without all that, and thus seems much faster even on slower CPUs. This is why a CPU could never be faster if it relies on software to manage itself: now your application has to talk to software before it talks to firmware before it talks to the hardware. Another layer has been added to the stack.

On a side note, about the Apple A-series chips vs. the competition... comparing the two is like comparing early Win XP running your benchmark with early Win XP running a Win 7 VM that is running your benchmark.
 

Reepca

Honorable
Dec 5, 2012
I guess that makes sense; I just never really thought the cost of what the OS does with running threads on different cores, scheduling and whatnot, was high enough to halve the end-result IPC. So the difference is basically multithreading at the software level vs. multithreading at the hardware level?
 

szatkus

Honorable
Jul 9, 2013


I don't think it works like an FPGA. But it does in fact look similar to HT. I'll explain by example. Let's say that one of VISC's "true" (not virtual) cores has 3 ALUs and we have 4 instructions which can be executed in parallel (all 4 instructions need ALUs, and I'll assume the VISC magic works perfectly in this case). The virtual core then throws three instructions to the three ALUs in one core and the remaining instruction to an ALU in the second core. That's impossible in Bulldozer, because there are only 2 ALUs in one cluster and there are no indirect interactions between them. I guess the VISC ISA is focused on parallelization (maybe something similar to the Mill ISA). Because of that, the performance of the ARM flavor of VISC will depend mainly on the quality of the ARM->VISC translation.
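To put that example in code (a toy issue model only, and it assumes the virtual core really can dispatch to both cores' ALU ports in the same cycle):

```python
# Toy issue model for the example above: 4 independent ALU ops,
# each physical core has 3 ALU ports.

def cycles_to_issue(num_ops, alu_ports_per_core, cores_usable):
    """Greedy count of cycles needed to issue a group of independent ALU ops."""
    ports = alu_ports_per_core * cores_usable
    cycles = 0
    while num_ops > 0:
        num_ops -= min(num_ops, ports)
        cycles += 1
    return cycles

independent_ops = 4
print("conventional core (3 ALUs):", cycles_to_issue(independent_ops, 3, 1), "cycles")
print("virtual core over 2 cores: ", cycles_to_issue(independent_ops, 3, 2), "cycle")
```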



Most Android benchmarks use Dalvik only to bootstrap execution of native code (via the NDK - Native Development Kit). That's a reasonably close comparison to iOS native binaries as long as we stick to numerical benchmarks.
 

genz

Distinguished


A VM pretends to be an entire system. In the case of Android, it's as if your system were emulating another OS inside itself, much as you'd see running WINE on Linux or Crossover on Mac.

Looking at the white paper, the Soft Machines design effectively appears to be a twin-threaded core, much like AMD's Bulldozer, etc. However, rather than reallocating CPU time as demand changes the way a single-threaded core does, or having units on the die that straddle cores, it shifts hardware resources on the fly from the less demanding thread to the more demanding thread, seemingly just like Bulldozer but with better power efficiency. Personally I don't think it's dual-core at all. It's either a single core that is really fast at thread switches, or an asymmetric processor, if it's to truly meet its claims.

What's more, this could not be done in software, as software could never process quickly enough to keep up with the demand for changes. Non-native is slow. Whenever you run something in software, you need multiple instructions to carry out that software's calculations, so you can't really use software to control something on the die that needs input as often as threading does. You'd be looking at a bare minimum of 20 instructions of latency, and God forbid the instructions aren't in cache. Even at 350 MHz the CPU is still reallocating resources 350 million times a second, and 50 clocks of latency would wipe something like 90% of the average single-thread performance gains off the board.
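Rough arithmetic behind that claim (Python; the 50-cycle figure is the pessimistic number above, the rest are ballpark):

```python
# If resource steering happened in software, the scheduler alone would starve the machine.

clock_hz = 350e6              # prototype clock from the article
sw_overhead_cycles = 50       # pessimistic cost of one software scheduling decision

# Deciding every cycle is clearly impossible: you'd need 50x more cycles than exist.
print(f"cycles needed per second: {clock_hz * sw_overhead_cycles:.2e} "
      f"vs {clock_hz:.2e} available")

# Even deciding only once every 1000 cycles, the overhead is 5% of all execution
# time before any useful work gets done.
print(f"overhead at 1 decision per 1000 cycles: {sw_overhead_cycles / 1000:.1%}")
```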

Modern OSes, by the way, are like pilots who use the autopilot switch. They may say to the CPU "prioritize this", but that's all they say; the CPU itself actually does all of the adjusting and controlling per clock.
 

genz

Distinguished


Ah! I missed that. Didn't realize it was hardware virtualizing ARM instructions. You're pretty much spot on about the code conversion then. Doubt it's at all possible to parallelize x86 though simply due to race issues, cache and branch prediction.

Unified Shader Arch style IS for the CPU would be welcome.
 
It's doable, though, and I don't think the performance hit would be excessive if done in HW. At the end of the day, most CPUs at their lowest level are executing the same instructions, so you just need a really fast way to map the different architectures' instructions onto one another. I could see a 5-10% performance hit on conversion.

As for what they are doing, it basically looks like, rather than binding threads to cores, the CPU is internally breaking up the instruction stream and making its own internal threads in such a way as to avoid pipeline stalls. If you can do that fast enough, I could see increases in single-threaded workloads. I'm very much interested in how the hundreds-of-threads case looks, though, because I could see this falling flat there.
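Roughly what I mean, as a toy sketch (Python, with a made-up five-instruction stream; real hardware would do this with renaming and scheduling logic, not code):

```python
# Carve one instruction stream into independent chains ("internal threads")
# by following register dependencies.

# Each instruction: (destination register, source registers)
stream = [
    ("r1", []),        # r1 = load A
    ("r2", ["r1"]),    # r2 = r1 + 4     -> depends on the r1 chain
    ("r3", []),        # r3 = load B     -> an independent chain starts here
    ("r4", ["r3"]),    # r4 = r3 * 2
    ("r5", ["r2"]),    # r5 = r2 - 1
]

chains = {}            # chain id -> list of instruction indices
reg_to_chain = {}      # which chain last wrote each register
next_chain = 0

for i, (dest, srcs) in enumerate(stream):
    producers = {reg_to_chain[s] for s in srcs if s in reg_to_chain}
    if not producers:
        chain = next_chain         # no producer seen yet: start a new chain
        next_chain += 1
    else:
        chain = min(producers)     # naive: join the oldest producing chain
    chains.setdefault(chain, []).append(i)
    reg_to_chain[dest] = chain

for cid, idxs in chains.items():
    print(f"internal thread {cid}: instructions {idxs}")
```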
 

ComradeVlad

Reputable
Oct 27, 2014


Well, it is a good deal, but not the best if you just want a tablet. At $550 it's put in a price range where you can get very good (simpler) alternatives for around $450, including tablets with a Tegra K1.

It will depend on what you really need to do with the tablet.

Still, great to see a design win for AMD.

Cheers!
 
S/A analysis of VISC:

http://semiaccurate.com/2014/10/23/soft-machines-breaks-cover-visc-architecture/

Honestly, if we're ever to get away from x86, this is EXACTLY how you'd go about it. The main issue I see here is getting someone to put it in a machine, but I could see this taking off in a decade or two.
 

Reepca

Honorable
Dec 5, 2012


But I guess the idea would be that you don't need anything beyond a single-threaded workload since it will be internally split across the hardware anyway. The only reason I could see for using multiple threads in programming on these systems is if you want asynchronous processing.

Next on my list of silly questions to ask: why couldn't this technology be implemented in GPUs, where (from what I understand) gains should be even larger since there are so many more cores to take advantage of?
 


GPUs aren't designed to handle serial workloads. For those tasks, their performance STINKS compared to CPUs. So it's kinda pointless to design a GPU to improve single-threaded IPC, since you'll never send those types of workloads to a GPU in the first place.
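A toy example of what "serial workload" means here (Python; any iterated update with a loop-carried dependence behaves the same way): step i can't start before step i-1 finishes, so thousands of GPU threads have nothing independent to chew on.

```python
# Loop-carried dependence: the chain's total latency is the sum of every step's latency,
# no matter how many cores or GPU threads you have.

def serial_chain(x, steps):
    for _ in range(steps):
        x = (x * 6364136223846793005 + 1442695040888963407) % 2**64  # one LCG step
    return x

print(serial_chain(42, 100_000))
```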
 