AMD CPU speculation... and expert conjecture


jdwii

Splendid


http://www.extremetech.com/extreme/192858-visc-cpu-virtual-core-design-emerges-could-this-be-the-conceptual-breakthrough-weve-been-waiting-for
the CPU interleaves the serial workload across two physical cores

Sounds like a horrible idea to me, but hey, it scored higher.
 

szatkus

Honorable
Jul 9, 2013


Looks like they are targeting IoT, which makes a lot of sense at such a low clock. Of course it also depends on power consumption, but with such a low clock it could be good.
 

szatkus

Honorable
Jul 9, 2013


Actually it sounds like a great idea. Similar to SpMT (maybe it is SpMT, hard to say without details). We can place a lot of cores on one chip, but single-thread performance has progressed very slowly over the past few years.
 

jdwii

Splendid
^ It's just so new. Typically, splitting one thread across two cores would cause major latency issues; many thought AMD was doing exactly that with Bulldozer. I hope we see more info on this in the future.
 

Cazalan

Distinguished
Sep 4, 2011


I think the low clocks are due to it being implemented in an FPGA.

Anyhow, when start-ups come out with stuff like this, it's generally because they blew through their first rounds of funding and need another big chunk to stay alive.

SPEC 2006 scores are also misleading. SPECrate, which takes the memory and I/O subsystems into account, is typically far more important.
 

szatkus

Honorable
Jul 9, 2013


Looks like it's not an FPGA: http://semiaccurate.com/assets/uploads/2014/10/Soft_Machines_prototype_board.jpg
 

Reepca

Honorable
Dec 5, 2012
So I suppose I'm a bit confused about how this whole virtual-core business works. From what I see, it looks like the processor automatically parallelizes workloads somehow... but that seems to me to be a software thing. I guess what confuses me is this: if the "virtual core" technology makes it possible to run a single thread on multiple cores, what exactly is the advantage of, say, brute-force searching for prime numbers on one virtual core vs. doing the same across (number of CPU cores) threads? I guess it would be easier to program, but I'm not comprehending where the performance benefit comes from.

I imagine a number of you can explain it to me pretty well - enlighten me!
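To put rough numbers on what I mean (toy Python sketch; the 1.5x "serial boost" figure below is purely made up): explicit threads only speed up the part I can already split myself, so any extra win from a virtual core would have to come from the serial part.

```python
# Back-of-the-envelope: where explicit threads stop helping.
# Amdahl's law: speedup = 1 / ((1 - p) + p / n) for parallel fraction p on n cores.

def amdahl_speedup(p, n):
    """Overall speedup when a fraction p of the work runs on n cores."""
    return 1.0 / ((1.0 - p) + p / n)

p = 0.5        # assume half the workload is embarrassingly parallel (e.g. prime search)
cores = 4

threads_only = amdahl_speedup(p, cores)   # programmer splits only the parallel half
serial_boost = 1.5                        # assumed ILP win on the serial half (made up)
visc_like = 1.0 / ((1.0 - p) / serial_boost + p / cores)

print(f"explicit threads only:       {threads_only:.2f}x")
print(f"threads + serial-half boost: {visc_like:.2f}x")
```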
 

genz

Distinguished



Virtual cores, I believe, are not cores at all, hence the term "virtual". Either we're talking about electronically reconfigurable, FPGA-style core flexibility (in which a core can change mode from being one core at 2x clock to two cores and back), which seems a little far-fetched at that process node, or it's something a lot closer to HT on Intel chips, where a core splits complex operations into atomically multithreaded ops in microcode or hardware.

The problem with software is that it is inherently slower than hardware due to layers of abstraction; this is where the Apple super-CPU myth that szatkus brought up comes from. Android, which all the competing CPUs would be running as their OS, uses Dalvik on top of a form of Linux. Linux probably has the least optimized drivers and software APIs of just about any non-science platform, because their open-source, non-profit status means nobody wants to sit there for hours scrubbing for optimizations, and Dalvik as a VM on top means we are running our phone benchmarks in a virtual machine on the phone we're trying to evaluate. iOS is far more lightweight without all that, and thus seems much faster even on slower CPUs. This is why a CPU could never be faster if it relies on software to manage itself: now your application has to talk to software before it talks to firmware before it talks to the hardware. Another layer has been added to the stack.

On a side note, about the Apple A-series chips vs. the competition... comparing the two is like comparing early Win XP running your benchmark with early Win XP running a Win 7 VM that is running your benchmark.
 

Reepca

Honorable
Dec 5, 2012
I guess that makes sense; I just never really thought the cost of what the OS does with running threads on different cores, scheduling and whatnot, was high enough to halve the end-result IPC. So the difference is basically multithreading at the software level vs. multithreading at the hardware level?
 

szatkus

Honorable
Jul 9, 2013


I don't think it works like an FPGA. But it does in fact look similar to HT. I'll explain by example. Let's say that one of VISC's "true" (not virtual) cores has 3 ALUs and we have 4 instructions which can be executed in parallel (all 4 instructions need ALUs, and I'll assume the VISC magic works perfectly in this case). The virtual core then throws three instructions to the three ALUs in one core and the remaining instruction to an ALU in the second core. That's impossible in Bulldozer, because there are only 2 ALUs in one cluster and there are no indirect interactions between them. I guess the VISC ISA is focused on parallelization (maybe something similar to the Mill ISA). Because of that, the performance of the ARM flavor of VISC will depend mainly on the quality of the ARM->VISC translation.
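To put that example in code (a toy issue model only, and it assumes the virtual core really can dispatch to both cores' ALU ports in the same cycle):

```python
# Toy issue model for the example above: 4 independent ALU ops,
# each physical core has 3 ALU ports.

def cycles_to_issue(num_ops, alu_ports_per_core, cores_usable):
    """Greedy count of cycles needed to issue a group of independent ALU ops."""
    ports = alu_ports_per_core * cores_usable
    cycles = 0
    while num_ops > 0:
        num_ops -= min(num_ops, ports)
        cycles += 1
    return cycles

independent_ops = 4
print("conventional core (3 ALUs):", cycles_to_issue(independent_ops, 3, 1), "cycles")
print("virtual core over 2 cores: ", cycles_to_issue(independent_ops, 3, 2), "cycle")
```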



Most Android benchmarks use Dalvik only to bootstrap execution of native code (via the NDK - Native Development Kit). That's a reasonably close comparison to iOS native binaries as long as we stick to numerical benchmarks.
 

genz

Distinguished


A VM pretends to be an entire system. In the case of Android, it's as if your system were emulating another OS inside itself, much as you'd see running WINE on Linux or Crossover on Mac.

Looking at the white paper, the Soft Machines design effectively appears to be a twin-threaded core, much like AMD's Bulldozer, etc. However, rather than reallocating CPU time as demand changes the way a single-threaded core does, or having units on the die that straddle cores, it shifts hardware resources on the fly from the less demanding thread to the more demanding thread, seemingly just like Bulldozer but with better power efficiency. Personally I don't think it's dual-core at all. It's either a single core that is really fast at thread switches, or an asymmetric processor, if it's to truly meet its claims.

What's more, this could not be done in software, as software could never process quickly enough to keep up with the demand for changes. Non-native is slow. Whenever you run something in software, you need multiple instructions to carry out that software's calculations, so you can't really use software to control something on the die that needs input as often as threading does. You'd be looking at a bare minimum of 20 instructions of latency, and God forbid the instructions aren't in cache. Even at 350 MHz the CPU is still reallocating resources 350 million times a second, and 50 clocks of latency would wipe something like 90% of the average single-thread performance gains off the board.
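Rough arithmetic behind that claim (Python; the 50-cycle figure is the pessimistic number above, the rest are ballpark):

```python
# If resource steering happened in software, the scheduler alone would starve the machine.

clock_hz = 350e6              # prototype clock from the article
sw_overhead_cycles = 50       # pessimistic cost of one software scheduling decision

# Deciding every cycle is clearly impossible: you'd need 50x more cycles than exist.
print(f"cycles needed per second: {clock_hz * sw_overhead_cycles:.2e} "
      f"vs {clock_hz:.2e} available")

# Even deciding only once every 1000 cycles, the overhead is 5% of all execution
# time before any useful work gets done.
print(f"overhead at 1 decision per 1000 cycles: {sw_overhead_cycles / 1000:.1%}")
```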

Modern OSes, by the way, are like pilots who use the autopilot switch. They may say to the CPU "prioritize this", but that's all they say; the CPU itself actually does all of the adjusting and controlling per clock.
 

genz

Distinguished


Ah! I missed that. Didn't realize it was hardware virtualizing ARM instructions. You're pretty much spot on about the code conversion then. Doubt it's at all possible to parallelize x86 though simply due to race issues, cache and branch prediction.

Unified Shader Arch style IS for the CPU would be welcome.
 
It's doable, though, and I don't think the performance hit would be excessive if done in HW. At the end of the day, most CPUs at their lowest level are executing the same instructions, so you just need a really fast way to map the different architectures' instructions onto one another. I could see a 5-10% performance hit on conversion.

As for what they are doing, it basically looks like, rather than binding threads to cores, the CPU is internally breaking up the instruction stream and making its own internal threads in such a way as to avoid pipeline stalls. If you can do that fast enough, I could see increases in single-threaded workloads. I'm very much interested in how the hundreds-of-threads case looks, though, because I could see this falling flat there.
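Roughly what I mean, as a toy sketch (Python, with a made-up five-instruction stream; real hardware would do this with renaming and scheduling logic, not code):

```python
# Carve one instruction stream into independent chains ("internal threads")
# by following register dependencies.

# Each instruction: (destination register, source registers)
stream = [
    ("r1", []),        # r1 = load A
    ("r2", ["r1"]),    # r2 = r1 + 4     -> depends on the r1 chain
    ("r3", []),        # r3 = load B     -> an independent chain starts here
    ("r4", ["r3"]),    # r4 = r3 * 2
    ("r5", ["r2"]),    # r5 = r2 - 1
]

chains = {}            # chain id -> list of instruction indices
reg_to_chain = {}      # which chain last wrote each register
next_chain = 0

for i, (dest, srcs) in enumerate(stream):
    producers = {reg_to_chain[s] for s in srcs if s in reg_to_chain}
    if not producers:
        chain = next_chain         # no producer seen yet: start a new chain
        next_chain += 1
    else:
        chain = min(producers)     # naive: join the oldest producing chain
    chains.setdefault(chain, []).append(i)
    reg_to_chain[dest] = chain

for cid, idxs in chains.items():
    print(f"internal thread {cid}: instructions {idxs}")
```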
 

ComradeVlad

Reputable
Oct 27, 2014


Well, it is a good deal, but not the best if you just want a tablet. At $550 it's put in a price range where you can get very good (simpler) alternatives for around $450, including tablets with a Tegra K1.

It will depend on what you really need to do with the tablet.

Still, great to see a design win for AMD.

Cheers!
 
S/A analysis of VISC:

http://semiaccurate.com/2014/10/23/soft-machines-breaks-cover-visc-architecture/

Honestly, if we're ever to get away from x86, this is EXACTLY how you'd go about it. The main issue I see here is getting someone to put it in a machine, but I could see this taking off in a decade or two.
 

Reepca

Honorable
Dec 5, 2012


But I guess the idea would be that you don't need anything beyond a single-threaded workload since it will be internally split across the hardware anyway. The only reason I could see for using multiple threads in programming on these systems is if you want asynchronous processing.

Next on my list of silly questions to ask: why couldn't this technology be implemented in GPUs, where (from what I understand) gains should be even larger since there are so many more cores to take advantage of?
 


GPUs aren't designed to handle serial workloads. For those tasks, their performance STINKS compared to CPUs. So it's kinda pointless to design a GPU to improve single-threaded IPC, since you'll never send those types of workloads to a GPU in the first place.
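A toy example of what "serial workload" means here (Python; any iterated update with a loop-carried dependence behaves the same way): step i can't start before step i-1 finishes, so thousands of GPU threads have nothing independent to chew on.

```python
# Loop-carried dependence: the chain's total latency is the sum of every step's latency,
# no matter how many cores or GPU threads you have.

def serial_chain(x, steps):
    for _ in range(steps):
        x = (x * 6364136223846793005 + 1442695040888963407) % 2**64  # one LCG step
    return x

print(serial_chain(42, 100_000))
```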
 