AMD CPU speculation... and expert conjecture


Cazalan

Distinguished
Sep 4, 2011
2,672
0
20,810


I read somewhere that AVX2 replaces about 19 legacy x86 instructions. ICC can even recode those old instructions automatically into AVX2.
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


You are giving lessons, but of a different kind...

In the above image AMD gives a total score of 28.9 for the Jaguar Opteron. There are four cores, so a simple division, 28.9/4, gives 7.2, i.e. ~7. This is the same score of 7 per core that AMD mentioned recently.

You are confusing the total throughput with the per-core single-thread score. Your claim that AMD is lying about benchmarks is once again only in your imagination. :nan:

The A57 continues to score 10 per core and Jaguar 7 per core, so the A57 is still about 43% faster (10/7 is about 1.43). Nothing has changed after your "lesson".
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


HTT is Intel's implementation of SMT. CMT is something completely different. Details below.



I enjoy it when you use your typical condescending tone while asking me to put you in your place.

CMP does not mean "multiple chips", as you pretend; it means Chip Multi-Processing. Pay attention to the first word: Chip. For instance, the Athlon 5350 is a CMP architecture with four cores. You are confusing chips with cores.

Yes, SMT is Simultaneous Multi-Threading. It already appears in the figure I gave in the spoiler above, so you are not giving any new information. Simultaneous Multi-Threading doesn't mean what you pretend: you are going for a literal interpretation of the words while avoiding the technical, precise definition of the concept.

The proper definition of SMT already appears in graphical form in the figure that I gave above, but I will give the definition in textual form:

Simultaneous Multi-Threading (SMT) allows the concurrent execution of multiple instruction streams on the same processor core.

http://pic.dhe.ibm.com/infocenter/lnxinfo/v3r0m0/index.jsp?topic=%2Fliaai.saptuning%2Fsapturningsmt4.htm

There are different degrees of SMT. SMTn means n threads are executed per core. Intel only implements SMT2. IBM implements SMT2, SMT4, and SMT8; SMT8 means that each core executes 8 threads simultaneously.

An SMT2 core runs two threads simultaneously, whereas CMT is about two clustered cores running two threads (see the figure above). Evidently CMT doesn't satisfy the SMT definition, because each core in a CMT module executes only one thread (again, see the figure) and SMT requires more than one thread per core.

Aka SMT != CMT

I note how you ignored the quote from Anand mentioning that SMT and CMT are different. AMD also considers that CMT and SMT are different; there is a quote from AMD at the start of the link that I gave above (in the spoiler).

AMD also puts the difference between CMT, SMT, and CMP in graphical form:

[Image: 3663732_9bc35365d1_l.png, AMD slide comparing CMP, SMT, and CMT]

Top left: CMP
Bottom left: SMT
Top right: CMT

The conclusion once again is

CMP != SMT != CMT
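If it helps, here is a minimal sketch in Python (my own toy model, not AMD's or IBM's definitions) of the thread-per-core difference the figure shows:

configs = {
    "CMP (four separate cores)":        {"cores": 4, "threads_per_core": 1},
    "SMT2 (one core, two threads)":     {"cores": 1, "threads_per_core": 2},
    "CMT module (two clustered cores)": {"cores": 2, "threads_per_core": 1},
}
for name, c in configs.items():
    total = c["cores"] * c["threads_per_core"]
    print(f'{name}: {total} hardware threads, {c["threads_per_core"]} per core')
# Only the SMT case has more than one thread per core; the CMT module gets its
# two threads from two cores, which is exactly why CMT != SMT.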
 
Oh wow this discussion again.

Guys, stop talking about "IPC"; it's a useless number in the context you're using it in. Instructions Per Clock is just the number of instructions a particular design can execute every clock cycle, and it's a useless metric when the instructions are not the same. Since ARM is pretty vanilla RISC and x86 is its own hybrid ISA, direct comparisons between them are impossible. Heck, because different x86 instructions have different execution times, comparisons between different x86 chips are pretty much useless too. IPC only makes sense when discussing standard RISC designs, and only when discussing different chips within that same ISA. This is because a standard RISC instruction takes exactly one clock cycle to execute, so comparisons using "IPC" are a good metric for determining the efficiency of the cache, OoO scheduler and branch predictor. In something like x86, where instructions have variable execution lengths, the concept of X per Y doesn't make sense, since the weight of X changes from one instruction to the next.

The best way to measure the differences is to create a predefined set of "work" that needs to be done, then measure how long each platform takes to complete that "work". The accuracy of this metric is then highly dependent on how realistic that unit of work is, which is the #1 reason why so many benchmarks are useless. This is also why I like SPEC so much: real, unbiased industrial benchmarks that test the entire platform. Provided the reader knows what they're reading.
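To make that concrete, here is a hedged toy calculation in Python (every number below is made up purely for illustration):

def seconds_to_finish(instructions, ipc, ghz):
    # time = (dynamic instruction count / IPC) cycles, divided by the clock rate
    cycles = instructions / ipc
    return cycles / (ghz * 1e9)

# The same "work" compiled for two hypothetical chips with different ISAs:
chip_a = seconds_to_finish(instructions=8e9,  ipc=2.0, ghz=3.0)   # fewer, "heavier" instructions
chip_b = seconds_to_finish(instructions=12e9, ipc=3.5, ghz=2.0)   # more, simpler instructions
print(f"chip A: {chip_a:.2f} s, chip B: {chip_b:.2f} s")
# chip B has the higher "IPC" yet finishes the work later; time-to-completion
# on a realistic workload is the number that actually matters.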
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


Precisely, HSA introduces a common ISA for both CPUs and GPUs. Under HSA, GPUs are first-class processors, just like CPUs.

There is lots of ARM software ready. In fact, there are some basic ARM desktops already on sale.

Yes, history repeats. The same people who are now pretending that x86 will never be replaced, because ARM cannot scale up, are repeating the old arguments used by the Alpha/MIPS/POWER guys in the '80s. They also pretended that x86 could never be used in workstations, servers and supercomputers because it couldn't scale up.
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790

Well, if you ask me, I will say that I find about 95% of your posts completely useless; but as for your question, my guess is that we will see the FX logo on day 4.

No, it is not a Steamroller FX CPU. It is a new APU.
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


As mentioned before, CPUs also need drivers. Drivers connect a specific hardware layer with a general software layer.

Yes, under HSA GPUs become first-class citizens like CPUs.

HSA, with its widespread industry support from companies like AMD, ARM, Imagination Technologies, Qualcomm, Texas Instruments and many others, not only elevates the GPU to a first-class citizen in the world of general purpose compute, it also brings a number of benefits for developers who want to harness its power.

For example, in the past the GPU needed the CPU to feed it jobs, adding an extra step in the process. Programmers had to explicitly dictate which parts of the code would be offloaded from the CPU to the GPU, which meant the programmer needed to have some knowledge of the underlying processor and which workloads are best suited to each. This takes time and advanced knowledge of complex processor architectures.

To alleviate this burden on developers, HSA deployed "heterogeneous queuing." This technology means that the GPU can initiate its own processes rather than relying on a CPU to tell it to do so.

Such technology is the first step toward a truly integrated APU, in which workloads will automatically execute on the processor that provides the best performance and energy efficiency characteristics.

http://www.techradar.com/news/computing-components/graphics-cards/how-to-enhance-the-power-of-the-gpu-1250934
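To picture what "heterogeneous queuing" buys you, here is a tiny conceptual sketch in Python (my own toy model, not the HSA runtime API): both agents push work into the same user-level queue, so the GPU can enqueue a follow-up job for itself without a CPU round-trip.

from collections import deque

work_queue = deque()                      # shared user-mode queue, visible to both agents

def gpu_kernel(data):
    result = [x * 2 for x in data]
    if len(result) > 2:                   # the GPU decides it needs another pass...
        work_queue.append(("gpu", result[:2]))   # ...and enqueues it itself, no CPU involved
    return result

work_queue.append(("gpu", [1, 2, 3, 4]))  # the CPU only seeds the first job
while work_queue:
    agent, payload = work_queue.popleft()
    print(agent, gpu_kernel(payload))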
 


Ok, stop. You need to understand exactly what "drivers" are first and foremost. Drivers are just pieces of software that translate API calls into proprietary hardware functions. Essentially they manage and abstract the hardware in such a way that a piece of software can be generic in its approach. You don't see "CPU drivers" because software on a platform is already compiled for the processor's language, and processors rarely change languages. In this sense any emulator would act as a sort of "driver". The Windows NT scheduler is also a kind of "driver", in that it manages which instructions get sent to which CPU.

There actually was an era of "driver-less" GPUs; it was back in the '90s with 3DFX and DOS-based gaming. Each piece of software used the GLIDE language, which is the native language of the 3DFX Voodoo series of cards. You can equate GLIDE to x86, ARM, PPC, SPARC and MIPS. Languages like OpenGL and Direct3D are more like Java in that they are abstracted and not natively run on the cards. The GeForce GTX 770 does not run Direct3D; it runs a proprietary language that nVidia designed just for that series of cards. The nVidia GeForce drivers translate the DirectX API calls into that native language and manage the card's resources in this abstracted way. The GeForce GTX 580 doesn't run DirectX either, and it's not even the same language as the GTX 770. Same with all the ATI / AMD cards. All these different chips use their own language, which the manufacturer designs. The drivers that are then released translate a standard API (DirectX/OpenGL) into that native vendor / model specific language.

Now note that this incurs a performance penalty, as each layer of abstraction always reduces the maximum theoretical performance of a system. This is where consoles have a large advantage over PC gaming: because their hardware is locked in at design time, the software vendors don't need to worry about performance loss from "drivers" and can compile their program for that native language. Actually it's more of a pseudo-language, as there will be a vendor-provided library that contains all the functions precompiled for that hardware, industrial secrets and all.

So the order is: software -> standardized API -> driver -> vendor / model language -> hardware. DirectX provides that standardized API for Windows PCs, and OpenGL provides it for everyone else.
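As a rough sketch of that chain in Python (the class names and opcodes below are invented for illustration, not any real driver):

class StandardAPI:                        # what the game is written against (think D3D/OpenGL)
    def __init__(self, driver):
        self.driver = driver
    def draw_triangles(self, count):
        self.driver.submit("DRAW", count) # the call itself is generic

class VendorDriver:                       # translates generic calls into model-specific commands
    def __init__(self, native_opcodes):
        self.native_opcodes = native_opcodes
    def submit(self, api_call, arg):
        opcode = self.native_opcodes[api_call]   # the translation step is the "driver" work
        print(f"hardware receives opcode {opcode:#x} with arg {arg}")

# Two different cards, two different native "languages", one common API:
card_a = StandardAPI(VendorDriver({"DRAW": 0x3A}))
card_b = StandardAPI(VendorDriver({"DRAW": 0x91}))
card_a.draw_triangles(1000)
card_b.draw_triangles(1000)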

That should make it very clear why GPUs are "second class citizens" on modern PCs. Because the software designer can never be 100% sure which vendor / model you went with for your graphics option, they use a standardized interface and allow the vendor-provided drivers to do the translation work. To do otherwise would cause massive compatibility issues all around and incur obscene development costs: you'd have to have a specialized engine for every possible vendor / model of graphics card it could run on, instead of an engine for each possible API.

Now for Mantle / HSA. Mantle is AMD's stab at the old GLIDE concept. Mantle is kind of a hybrid API; it's not 100% abstract, nor is it 100% native. Most of its functions would be mapped 1:1 to a native function in the GCN language, but some of its functions would be purely abstracted. Because it starts with the assumption that most of the grunt work doesn't require careful translation, it immediately gets an increase in theoretical performance. But it's not a 100% native language; there is still function mapping taking place. This is done to allow model-specific language to exist and be handled by a micro-driver inside the Mantle implementation. This is how nVidia and Intel could make Mantle-compatible products: they merely need to create a micro-driver that would handle any non-native functions they had, which in their current products would be quite a bit. And this is also why it's impossible for AMD to create an "nVidia / Intel" compatible implementation of Mantle: AMD doesn't have access to the technical engineering trade secrets that Intel and nVidia use for their own hardware.

HSA is slightly different. It defines a standard way for all these components to inter-operate (glue), but you still need a driver because you CAN NOT be 100% sure which language the other components will use. HSA is just a method that allows multiple processors to natively share resources and address space; the language of those processors isn't defined by the standard. You can have an x86 CPU, a POWER CPU and an ARM CPU all inside the same system working in the same address space / code stream. Or you could have a POWER8 CPU, a GCN GPU and an ARM CPU all working together. The actual executable code still needs to be either in the native language of "whatever" is present, or in a pseudo-language that will be interpreted by a driver at run time.
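A toy sketch of that hybrid idea in Python (purely illustrative, not Mantle's actual design): most calls map 1:1 onto a native opcode, and only the leftover non-native calls pay for a micro-driver translation.

NATIVE_MAP = {"draw": 0x10, "dispatch": 0x11}     # hypothetical functions the hardware runs directly

def micro_driver(call, arg):
    # the slow path: expand a non-native call into several native commands
    print(f"micro-driver expanding '{call}'({arg}) into native commands")

def submit(call, arg):
    if call in NATIVE_MAP:                        # 1:1 mapping, near-zero translation overhead
        print(f"native opcode {NATIVE_MAP[call]:#x}, arg {arg}")
    else:                                         # abstracted path, costs translation time
        micro_driver(call, arg)

submit("draw", 4096)
submit("fancy_blend", 7)                          # a made-up call the hardware lacks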
 



I have an x86 emulator on my UltraSPARC IIIi system; it's called DosBox. A long time ago I posted a picture of me running Windows 3.11 inside a DosBox instance on my Sun Blade 2500 Silver. I only needed to recompile DosBox for sparcv9 and configure it to use the full-emulation engine instead of the dynamic one, and voila, instant x86 emulator. QEMU also emulates x86 inside whatever environment you want. The problem isn't software, it's the inherent massive performance hit you get from having to translate a language that was never meant for translation into another. In the case of those abstracted APIs and pseudo-languages like Java, the designers started with the assumption that something else would be translating them and so made it easy and fast. x86 was built long before we were doing that, and it's simply not possible to do it fast due to how it addresses memory and I/O.

It's far easier to emulate ARM on an x86 platform than it is to emulate x86 on an ARM platform.
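For anyone wondering where the performance hit actually comes from, here is a stripped-down fetch/decode/execute loop of the kind a full-emulation engine runs, as a toy machine in Python (nothing like real x86): every guest instruction costs many host instructions.

def emulate(program):
    regs = {"a": 0}
    pc = 0
    while pc < len(program):
        op, arg = program[pc]           # fetch
        if op == "add":                 # decode + execute: a branch per opcode, every time
            regs["a"] += arg
        elif op == "jnz":
            if regs["a"] != 0:
                pc = arg
                continue
        pc += 1
    return regs

print(emulate([("add", 5), ("add", -1), ("jnz", 1)]))   # counts 5 down to 0
# A dynamic-recompilation core (like DosBox's "dynamic" engine) avoids re-decoding
# the same guest instructions over and over, which is why it is so much faster.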
 

colinp

Honorable
Jun 27, 2012
217
0
10,680


One of those projects has been abandoned, while the commercial one doesn't seem to have released a product yet, so there's no way of telling how well it really works. Two years on from that news and not a single case study: it makes you wonder how well they are getting on.

They claim "up to 80%" performance, but don't define what they mean by that. If it means (up to) 80% of the speed of x86, with all instructions working perfectly, then that would be amazing of course. It would probably be fine for many apps where performance isn't an issue at all. But if you want to game as well as a 2500K, then you'll need an ARM chip at least 25% faster. Where can I buy one of those?

And then, if you have to pay for software to run x86 on ARM, why wouldn't you just buy an x86 chip instead and know that your software will run with 100% speed and reliability? Isn't that the flexibility that AMD is soon to be offering?
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


I also have DosBox and use it only to play some ancient games. I have a customized conf file that preloads my keyboard map, mounts C: to my folder of old games, switches to C: and does a DIR of the folder contents. Last time I used it I played my copy of Doom to the end for the nth time :)
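For anyone curious, the relevant part of such a conf file looks roughly like this (the keyboard layout code and the path are just placeholders for my own setup):

[dos]
# preload a keyboard map (Spanish here, just as an example)
keyboardlayout=sp

[autoexec]
# mount the old-games folder as drive C:, switch to it and list the contents
mount c /home/me/oldgames
c:
dir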

There is a version of DosBox ported to ARM architecture.

The next video shows the classic x86 Duke Nukem running in DosBox on an ARM board:

http://www.youtube.com/watch?v=MmjbBUUbocM

For those who don't know DosBox: it not only emulates a 32-bit x86 machine and the DOS OS, but also typical sound cards of the era such as the SoundBlaster.

A patched QEMU is used to run some Windows applications on ARM hardware using WINE.

A Chinese outfit has recently filed a patent application for WINE on ARM:

http://www.theregister.co.uk/2014/04/01/chinese_patent_app_tries_to_own_wine_on_arm/



The guy claims in his blog: "The project is not dead, I just had no time to continue the development." But I think that it is Windows RT which is really dead now, and that his project has no future.

Here is a video of someone playing the x86 game Fallout 2 on a WinRT tablet, thanks to his work:

http://www.youtube.com/watch?v=I-hVbcRVre4

The Russian team released a commercial product. They partnered with hardware makers and are already selling ARM servers that run x86 software (jdwii, please don't say this to your boss).

By 80% efficiency they mean:

We provide up to 80% out of native ARM application performance. Running x86 applications on ARM delivers close to native performance with dramatic energy savings.

http://eltechs.com/product/

It is fair to mention that their claimed 20% performance penalty is about the same percentage that Bulldozer/Piledriver modules pay for using a CMT architecture with a shared front-end/decoder.

Of course running x86 natively on x86 hardware will always bring higher performance than emulating it. The real benefit of emulation is when you run most of your software on the native ISA (e.g. ARM) and only a few ancient x86 applications on the emulator.

I already said before that I expect x86 hardware to remain for running legacy applications. I already mentioned AMD's Feldman quote about releasing the new x86 Warsaw CPU only for legacy customers still running x86 software.

In any case my post was aimed at breaking gamerk's new myth that x86 cannot be emulated on ARM, that it "isn't happening".

You know very well that 64-bit ARM hardware is still not released, and you know very well that AMD's K12 will not be ready before 2016. Stop silly questions like "Where can I buy one of those?"
 


I'm a little confused as to what your argument is here (or against mine, for that matter)?!

I didn't and don't expect ARM to replace x86 in specific industrial / business circumstances like that. It wouldn't make sense to. What I do think makes sense (and what AMD is actually doing) is moving ARM into a space that already largely supports it (servers), and as I stated before the market doesn't need to be that big for it to prove a success for AMD.

Their entire strategy is about offering the correct part for the correct job. In some circumstances an ARM core might be a better fit than an x86 design; in others, having an on-board GPU would be a real benefit. Their strategy is about creating mix-and-match building blocks they can glue together to suit a customer's wants. There is a lot of talk about blending ARM and x86 and GPU together and somehow that making a better processor FOR A PC, and I agree with you, I really don't see that being practical. However, having the capability to say to a customer: you want ARM cores? Sure, here you go. Graphics? Yep, we can do that. Multi-core x86? Of course! AMD are uniquely placed in that regard and it might just pay off.
 
it's that time of the year, time for some on-topic amd cpu stuff! leave your pointless bickering and gather around the links! or not!

at's kabini testing pt2, has some nice numbers and comparos
http://www.anandtech.com/show/8067/amd-am1-kabini-part-2-athlon-53505150-and-sempron-38502650-tested
The Jaguar cores in Kabini are listed as 3.1mm2, and AMD is quoting that four of these cores will fit into a single Steamroller module. Unfortunately the dimensions of a Steamroller module are not known - a 32nm SOI Bulldozer module clocked in at 30.9 mm2 for example, but no equivalent number is available for 28nm Steamroller. However some quick math shows four Jaguar cores populates 12.4 mm2. This leaves the rest of the core for the L2 cache, IGP and a large amount of IO.
...
Given that one core is 3.1 mm2, extrapolating out gives the size of the die at 31.4x the size of a single core, or 97.3 mm2. The GPU area is approximately 5.2x the size of a core, giving ~16.1 mm2 for 128 GCN cores, compared to 12.4 mm2 for CPU cores. The Video Codec Engine and Unified Video Decoder are not part of these totals, located on other parts of the APU. The memory controller clocks in at ~9.4 mm2 and the display/IO portion runs at ~7.3 mm2.
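a quick back-of-the-envelope check of those area figures, using only the per-core number quoted above (Python):

core_mm2 = 3.1                                        # quoted Jaguar core size
print("4 CPU cores:", 4 * core_mm2)                   # 12.4 mm^2
print("whole die:", round(31.4 * core_mm2, 1))        # ~97.3 mm^2
print("128-core GCN GPU:", round(5.2 * core_mm2, 1))  # ~16.1 mm^2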

an interesting entry on the benches is the $171 atom c2750, avoton chip with 20w tdp, 2.4 ghz base clockrate and 8 cores. i think the x264 2nd pass bench is a good indication why the consoles' 8-core cpus won't be able to match most of the desktop quads if they were running pc software. hyperthreading and multi-module amd cpus might run circles around those. you can compare all of the tested cpus by clicking the "all results" button on the page.

and the R3 igpu is faster than hd3000, LOL.
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


Consoles are limited to lower TDPs than desktops, thus you cannot really pretend that the ~10W CPU in the PS4 can offer the same raw performance as a 40W or a 125W CPU!

The main difference is that consoles are not running a bloated and slow OS such as Windows, and that the hardware is fixed, so developers can extract all its potential, as the Metro LL developer remarks:

You just cannot compare consoles to PC directly. Consoles could do at least 2x what a comparable PC can due to the fixed platform and low-level access to hardware.

Thus a ~100 GFLOP/s console CPU can perform like a 200 GFLOP/s PC CPU, more or less.
 
PC Specialist Magma A10
http://hexus.net/tech/reviews/systems/70369-pc-specialist-magma-a10/
small size kaveri desktop pc. one caution: get a good 3rd party cpu cooler instead of the stock one.


in kabini's case, it might have to do with the lower clocked sempron skus, not the athlons. if the browser in question is mozilla, i can guess why the athlon would slow down (absolute, not relative to intel) - addons causing abrupt cpu loading, disk lockups and memory leaks.
pentiums and amd trinity/kaveri apus have dual channel imc, bigger and faster cache, Much higher clockrate. a Y series pentium or core i3 would be a good performance comparison (but not price comparison).
 

Cazalan

Distinguished
Sep 4, 2011
2,672
0
20,810


How many of these cards do you think they actually sell though? Not many. So why not mark them up for the chumps with money to burn. Apple marks up their phones more than 100%.

 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


Nvidia is not Apple. Everyone in the industry is laughing at the commercial/marketing fiasco of the Z. The Z is from Zuck.
 