AMD CPU speculation... and expert conjecture

hcl123 · Jun 12, 2013

gamerk316 :

Yes, but games are the least to worry about; at least they are much saner than other software, especially synthetics. And to complement that, even here at TH the tests showed that Trinity is still better for games than an HSW 4+2 (GT2, I think). Intel has still managed to close much of the gap, which is always good for end users; it would be really extraordinary for a GPU with 80sp (HSW GT2) to beat a GPU with 384sp.

It shows that AMD has a lot of work to do. Richland is most probably better, as announced, and I doubt a 160sp part (HSW GT3) will beat it (it depends)... yet it shows that Intel has done quite good development work on their GPUs and their drivers.

So games follow the logic, nothing to point there... and drivers are important alright, perhaps more than the DirectX compiler.

UPDATE:

Secondly, there never was a "cripple-AMD" function. Intel simply decided to optimize more aggressively its own chips. You could always specify via compiler switch what level to compile to.

Well, a "cripple function" is an abuse of language... it's not a "function", it's the whole compiler infrastructure: ICC emits 2 code paths with different characteristics, selected by CPUID detection. (All compilers can do that: emit 2 different code paths chosen on some condition, mostly hardware support for an ISA extension that some CPUs have and others don't, so the binary runs well on nearly all CPUs.)

The problem with ICC's "different characteristics" is that it selected those code paths based on the CPUID vendor, not on whether the CPU actually supports a given instruction set. Selecting on support would be OK, since Intel does not make all CPUs and so cannot guarantee that every CPU natively supports all of its instruction sets. But forcing a different instruction set regardless of support, only because the CPU is not genuine Intel, is quite biased. At least the judge also thought it was deliberately biased, and ruled for Intel to change its ways with ICC.

Yes, the case already deserved a lawsuit.
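To make the distinction concrete, here is a minimal Python sketch of the two dispatch policies described above. It is a simulation only: the `Cpu` class, the path names, and both functions are hypothetical illustrations, not ICC's actual dispatcher code.

```python
from dataclasses import dataclass, field

# Code paths ordered from most compatible to fastest (illustrative names).
PATHS = ["generic", "sse2", "sse3", "avx"]

@dataclass
class Cpu:
    vendor: str                                   # CPUID vendor string
    features: set = field(default_factory=set)    # flags the CPU reports

def feature_dispatch(cpu):
    """Pick the fastest path whose instruction set the CPU reports."""
    for path in reversed(PATHS[1:]):
        if path in cpu.features:
            return path
    return "generic"

def vendor_dispatch(cpu):
    """Vendor-gated policy: non-Intel CPUs get the lowest path,
    no matter which feature flags they report."""
    if cpu.vendor != "GenuineIntel":
        return PATHS[0]
    return feature_dispatch(cpu)

# Two CPUs reporting identical feature flags, differing only in vendor.
intel = Cpu("GenuineIntel", {"sse2", "sse3", "avx"})
amd = Cpu("AuthenticAMD", {"sse2", "sse3", "avx"})

print("vendor-gated: ", vendor_dispatch(intel), vendor_dispatch(amd))
print("feature-gated:", feature_dispatch(intel), feature_dispatch(amd))
```

With identical feature sets, only the vendor-gated policy treats the two CPUs differently, which is the bias being argued about.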

8350rocks · Jun 12, 2013

AMD should promote GCC, honestly...though with Koduri working on all things software, Open64 might become a much better compiler in the next few years.

EDIT: The compute section of intel's iGPU has improved DRAMATICALLY...outside of that, performance is still on par with FM1 socket Llano architecture in AMD (2 generations old).

noob2222 · Jun 12, 2013

gamerk316 :

not when another vendor is paying you. if that's the case, might as well buy a ford pickup and have them deliver a fiesta since they both have 2 seats. It's ford's vehicle, they can do whatever they want, right?

If Intel didn't want AMD to have SSE, they shouldn't have sold them the rights.

Instead they wanted money from AMD and just disabled it through other means. Would be like buying a photoshop license that can edit text.

mlscrow · Jun 12, 2013

cowboy44mag :

We already know what the performance of FX Piledriver chips is like and we can assume the performance gains by increasing the clock to 5.0GHz, it will not be very impressive. A 4770K will still likely outperform it most of the time.

As for Steamroller FX - I seriously doubt we'll be seeing 5.5GHz-6GHz, at least not right away, if ever. You have to remember that Steamroller will be made with High Density Libraries which are going to increase heat and lower power thresholds, therefore requiring lower clock rates than previous designs. If we're lucky, we may see a 5GHz part, but I honestly think it's going to be clocked in the 4GHz range. Only time will tell.

de5_Roy · Jun 12, 2013

hcl123 :

i've never paid much attention into gpu specs. just trying to make some sense of these things using my newbie math:
amd's 7660d igpu has 384 radeon cores.
can intel igpu's execution units (each with 8 shader alus) be analogous to amd's simd units? gt2 with 20 eus has 160 shader alus, gt3 would have 320 shader alus. with less than half the shader alus, gt2 or hd4k shouldn't be able to come close to 7660d. gt3 may come close, and if the bandwidth is fed, it can outperform 7660d... although i can't figure out how gt3e outperforms 7660d by that big a margin. makes me wonder how much memory bandwidth 7660d actually needs to flex its muscles. i always assumed that intel's l3 cache makes up a very small portion of the performance deficit compared to 7660d.
according to my newbie math, silvermont soc will need full hd4000 igpu to be able to compete with kabini (128 alus). however, hd4k seems to use around 5 watts for 16 eus (128 shader alus). kinda don't see how intel can fit hd4k on a tablet/smartphone soc. 4-8 eus might be possible.
my newbie math also tells me that kaveri might be able to house 8-12 compute units. or at least 6. hopefully.
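The counting above can be written out as a short Python sketch. The 8-ALUs-per-EU figure and the EU counts are the poster's own assumptions; raw ALU counts ignore clocks, memory bandwidth and drivers, so the ratios are only a rough first cut.

```python
ALUS_PER_EU = 8  # assumption from the post: 8 shader ALUs per Intel EU

def intel_alus(eus, alus_per_eu=ALUS_PER_EU):
    """Raw shader ALU count for an Intel iGPU with the given EU count."""
    return eus * alus_per_eu

RADEON_7660D = 384  # Radeon cores, as stated in the post

parts = {
    "HD 4000 (16 EUs)": intel_alus(16),
    "GT2 (20 EUs)": intel_alus(20),
    "GT3 (40 EUs)": intel_alus(40),
    "Kabini": 128,
    "7660D": RADEON_7660D,
}

for name, alus in parts.items():
    print(f"{name}: {alus} ALUs, {alus / RADEON_7660D:.2f}x the 7660D")
```

Calling `intel_alus(20, alus_per_eu=4)` instead reproduces the 4-per-EU counting argued elsewhere in the thread, so the function also shows how much the comparison depends on which convention is used.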

8350rocks · Jun 12, 2013

mlscrow :

Well, Richland saw an increase in efficiency, and these FX 9XXX twkr's are supposed to be "Vishera 2.0" which means I think we will likely see some improvements there as well as some minor performance gains above and beyond just clock speed.

Additionally, the way the FX 9590 and 9370 will be binned, I wouldn't be surprised if these SKUs end up setting a drastically higher world record for OC. If the silicon is anything like that last batch of P2X4 twkr chips, then we might see near 10 GHz on LN2.

If they can get 6.0+ GHz out of this thing with a good water cooling setup...that would be quite impressive. FX 9590 @ 6.0 GHz > Hasfail 4770k @ 4.2 GHz all day long.

noob2222 · Jun 12, 2013

mlscrow :

the HDL should only be a minor heat increase.

"are yielding a 15 to 30 percent lower energy per operation" http://www.tomshardware.com/news/Steamroller-High_Density_Libraries-hot-chips-cpu-gpu,17218.html

so we have a 30% area shrink and 15% lower power worst case, or break even. we don't know yet how much power will be reduced.

In comparison, ivy bridge dropped power by 10-20% while shrinking 45%. of course everyone blames it on "thermal paste". If AMD cpus heat up it will be simply because "AMD sucks".

I think the bigger issue with SR is going to be the 28nm bulk vs 32nm soi. AMD already has a better heat-layout than Intel, cores spread to corners instead of sitting next to each other. Could be interesting as AMD cpus have always been easier to keep cool even while drawing more power.

-Fran- · Jun 12, 2013

noob2222 :

I really doubt AMD paid Intel to have AMD support in their compiler. One thing is to license the tech and another completely different is to give support to that tech. In that aspect, AMD is at fault (half) for not pushing more aggressively another compiler at the time and for accepting the deal with Intel and not taking them to court. I'm sure Intel forced the hand on most devs to use their compiler, or worst case scenario, most of those devs were raging Intel fanbois; who knows. So, under that light, AMD had (still has) a VERY tough battle against Intel in the compiler arena. Even MS can't do a proper job with their bloody Visual Studio.

Also, it's not a correct car analogy 😛

Cheers!

hcl123 · Jun 12, 2013

de5_Roy :

No. The GT2 in HSW is 80 MADs, roughly equivalent to 80sp, because each Intel EU is not VLIW; it is more like 4x SMT per EU. So in reality it is 20 EUs (20 x 4 = 80).

The GT3 is 40 EUs and so 160 MAD ops, roughly equivalent to 160sp.

http://translate.googleusercontent.com/translate_c?depth=1&hl=pt-BR&rurl=translate.google.com&sl=ja&tl=en&u=http://pc.watch.impress.co.jp/docs/column/kaigai/20130507_598286.html&usg=ALkJrhjWpQPgo4ct5Km-XL7qc1D2VbS42A

see images:

Global
http://pc.watch.impress.co.jp/img/pcw/docs/598/286/08.jpg

GT3 scheme which will be HD5000, GT2 is half
http://pc.watch.impress.co.jp/img/pcw/docs/598/286/15.jpg

Compare with Trinity
http://pc.watch.impress.co.jp/img/pcw/docs/598/286/23.jpg

No... Intel has hardly caught AMD in this, yet they have done quite a good job; that strange uarch with slices and half slices shows good promise with more... slices lol..

Compared with Kabini, I doubt that Intel has caught up either. Kabini is GCN with 128sp, but this is different from VLIW. I doubt Intel with its 20 EUs (even if possible) or 80 MAD pipes will match 128sp. But here the results could be much closer (it depends on a lot of things).

juanrga · Jun 12, 2013

gamerk316 :

juanrga :

By "most", you mean "a handful". For the cross-platform stuff, I've seen a LOT of GCC.

Secondly, there never was a "cripple-AMD" function. Intel simply decided to optimize more aggressively its own chips. You could always specify via compiler switch what level to compile to.

And as an aside, optimal code would show bias, due to not needing any optimization. Likewise, disabling the optimizer entirely would yield CPU-independent code (though it would run like crap). [As a general rule, when comparing like architectures, disabling the optimizer is probably the *best* way to get a fair performance benchmark]

Regarding games. There are two points to be considered. First, there are games optimized for Intel; they are listed on Intel's site and include Batman, Skyrim, Civ... Precisely games such as Skyrim and Civ are notorious because they run badly on AMD chips.

Question: Anyone care to tell me what "optimized for Intel/AMD" actually means? Because NO ONE plops down any manual CPU opcodes, and it's not like you are going to see low-level X86 stubs throughout the code to make Intel/AMD chips run better/worse. It's all 100% marketing, pure and simple. Granted, some games, by their design, are likely biased; you see this in GPUs too: games that need memory bandwidth favor AMD, games that rely on pure shader performance favor NVIDIA. No intentional bias, just the outcome of the design chosen.

The only time I can recall ever "biasing" something was shortly after the first C2D's showed up. As the L2 was split oddly (cores 0 and 2 share the same L2, as does 1 and 3), there was a bias toward using cores within the same L2 context. But then again, I only did that for apps whose performance was dominated by cache misses, which was exactly ONE program in about a decade.

The irony is that current consoles NEED to be "threaded" better to run at all, which makes the whole argument silly, because ITS WRONG. The 360 has a tri-core CPU with 2-way SMT, allowing 6 threads to be executed at a time. The PS3 has 7 functional PPE's, allowing 7 threads to be executed at a time. Not bad considering at the time, PC's had JUST introduced dual-core CPU's.

Crysis 3 scales better because the devs moved processing off the GPU, which I again mention is not optimal programming, given how much faster GPU performance grows per generation. You've essentially created a CPU bottleneck which won't go away, rather than a short-term GPU bottleneck. Simply put, while other, GPU-bottlenecked games will run faster generation over generation, you are going to find Crysis 3 performance will NOT improve due to being heavily bottlenecked by the CPU, which is not showing significant performance gains generation over generation.

Seriously, i could peg 16 cores at 100% easy: just handle all the rendering in software. Sure, the program will get maybe 5 FPS, but the program threads well!

There was a Cripple_AMD function, which was hidden by Intel, then discovered by academics and reviewers. Later Intel denied its existence, until the FTC settlement obligated Intel to reveal the truth, add a disclaimer to the ICC documentation, and pay money to the cheated users.

It is a fact that games performing badly on AMD chips, such as Skyrim and Civ, are listed as "optimized for Intel".

The 360 is tri-core, the PS3 is single-core (two threads) plus 6 SPUs. That is why most PC games are using few cores. Next consoles will be both eight cores. Moreover, both use lower clocks. Therefore game developers are not going to use one or two cores and ignore the rest (as happens today with most PC games). The first demo of a PS4 game is already using six cores.

Cazalan · Jun 12, 2013

The compiler issue has been beaten like a dead horse.

If you want better benchmarks then stop talking about Intel/AMD and put your foot to the websites doing the reviews.

Promote GCC test suites like this one.
http://www.phoronix-test-suite.com/

8350rocks · Jun 12, 2013

I read a developer article somewhere where the developers for Thief were talking about using 7-8 cores frequently...I cannot find the article now...but I expect better optimization to come with familiarity with the console.

8350rocks · Jun 12, 2013

Cazalan :

Phoronix and PARSEC and SPEC are the better ones out there...though SPEC is expensive as hell...PARSEC is quite "reasonable" considering...

hcl123 · Jun 12, 2013

mlscrow :

Roughly, by clock alone Centurion is around 18% better in performance, which is already more gain than from SandyBridge to Failwell LOL; SNB and HSW maintain practically the same clocks.

OTOH, it's not only clock: if DDR3 2400 is supported from the start it means it's a different SKU and other things have been minimally tweaked... and even Vishera 1 and Orochi 1 (BD) already showed extraordinary scalability with clock, with performance gains above the clock increases (quite possible if the uarch was tuned from the start for those high clocks, but only reaches them by OC)... ~25% is my guess...
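The clock-only estimate can be checked with a few lines of Python. The clock speeds below are not given in the post; they are the published figures for the FX-8350 (4.0 GHz base, 4.2 GHz turbo) and FX-9590 (4.7 GHz base, 5.0 GHz turbo), and the calculation assumes performance scales linearly with clock, which is a simplification.

```python
def clock_gain(new_ghz, old_ghz):
    """Percent performance gain if performance scaled linearly with clock."""
    return (new_ghz / old_ghz - 1) * 100

# Assumed clocks (GHz): FX-8350 vs FX-9590, base and turbo.
FX8350 = {"base": 4.0, "turbo": 4.2}
FX9590 = {"base": 4.7, "turbo": 5.0}

for mode in ("base", "turbo"):
    gain = clock_gain(FX9590[mode], FX8350[mode])
    print(f"{mode}: {FX8350[mode]} -> {FX9590[mode]} GHz = +{gain:.1f}%")
```

Both results land within a point or so of the ~18% quoted above; the ~25% guess would then require scaling somewhat better than linear, as the post argues.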

And that is the main point behind AMD's decision, I think: the clock "potential" was there, but Glofo's process was not up to it within the targeted TDPs. Well, the process was tweaked for Richland (which should be quite a good OCer too), as we are led to suspect. So the decision is not that crazy, I think, even with the 220W TDP label (average consumption was already well below the rated TDP, even for Intel, so the same applies here).

About SteamRoller, I doubt it will reach such high clocks on "bulk" processes. But there must/should/could be 2 cases here: what is mobile, that is SR in APU format, will be tweaked for low power and lower clocks on bulk processes. And as we are led to suspect, IF Glofo's 28 nm SHP is PD-SOI, it is possible for SR in desktop APUs and FX to reach clocks as high as its previous brethren.

If this high-performance, high-clock process (PD-SOI) doesn't materialize, then AMD is in trouble unless SR is well ahead in performance; 30% more ops per cycle might not be enough to overcome possible benchmark bias. But then again this could prove to be a moving target that always escapes: it's software, it can change and improve easily while carrying exactly the same name, so no one will be the wiser about this dynamic nature (without an md5 it is sooo easy!). That means that, for those extrapolations about the hardware, what people are in fact seeing in those charts may be an improvement of the bench software, which can change in any direction. (Frankly it is a supposition that spells fraud from every pore... but what can I do? It's honestly what it leads me to suspect.)

hcl123 · Jun 12, 2013

8350rocks :

I think Open64 is a gone experiment. What will be the compiler of AMD in the next years, is the HSA one, which is based on LLVM. And according to those results with PostgreSQL pgbench

http://www.phoronix.com/scan.php?page=article&item=llvm_clang33_3way&num=4
(PostgreSQL is an OSS database engine and Clang is the front-end for LLVM)

.. it starts to show enormous potential... clearly it will not be fair to Intel (ironic) if any benchmark is made out of it (which I doubt they will allow to be shown or used in any case on their heavily sponsored sites like AT).

8350rocks · Jun 12, 2013

Yeah, I would think Intel would try to bury that pretty quick...LOL!!!!

588% more transactions per second with FX 8350 over i7-3960x!!!

palladin9479 · Jun 12, 2013

Cazalan :

There's actually a very easy way to check to see if the ICC was used. Don't search for Compiler strings or other test information, it's trivial to change that. Instead search the binary for the specific machine language that indicates the Intel dispatcher (the part that cheats) is present.

http://www.softpedia.com/get/Programming/Patchers/Intel-Compiler-Patcher.shtml

That will search the executable (I'm working to see if I can get it to work with libraries) for the exact instructions used. What the dispatcher essentially does is a sequence of string checks with the result being a binary 1 (Genuine Intel found) or 0 (non-Intel found). If the result is 1 then the dispatcher will do a check of the CPUID flags and set the appropriate code path; if 0 then it will force the lowest code path compiled. The way around that is to change the function to always return 1, which is as simple as changing a few values in a hex editor. Then the dispatcher will always check the flags and choose the proper code path.

I've been checking various programs and I see the intel dispatcher code in many of them. When I get home I'll post more info on what I found. Suffice to say that "Maxon Cinebench 11.5" reports it's MSVC yet contains the ICC dispatcher instructions.
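As a rough illustration of searching a binary for a vendor check rather than for compiler strings, the Python sketch below looks for the three 4-byte chunks of "GenuineIntel" close together. CPUID leaf 0 returns the vendor string split across EBX, EDX and ECX as "Genu", "ineI", "ntel", so code that compares those registers tends to carry the chunks as nearby immediates. This is only a heuristic and not the method the linked patcher uses: the function names and the 64-byte window are assumptions, and a hit shows that some vendor check exists, not that it is the ICC dispatcher.

```python
# The three register-sized pieces of the "GenuineIntel" vendor string.
VENDOR_CHUNKS = [b"Genu", b"ineI", b"ntel"]

def find_vendor_compares(data, window=64):
    """Return offsets of b"Genu" where the other two chunks also
    appear within `window` bytes (heuristic, may give false positives)."""
    hits = []
    start = 0
    while True:
        i = data.find(VENDOR_CHUNKS[0], start)
        if i == -1:
            return hits
        region = data[i:i + window]
        if all(chunk in region for chunk in VENDOR_CHUNKS):
            hits.append(i)
        start = i + 1

def scan_file(path):
    """Read a binary (executable or library) and return candidate offsets."""
    with open(path, "rb") as f:
        return find_vendor_compares(f.read())
```

Something like `[hex(o) for o in scan_file("some_program.exe")]` (hypothetical file name) would list candidate locations to inspect in a disassembler or hex editor.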

-Fran- · Jun 12, 2013

palladin9479 :

I'll ask friends to bench games with that program. If we get improvements, we should spread this like fire in a dried jungle.

I've always wondered why no one created this program before and went viral. Uhm...

Cheers!

8350rocks · Jun 12, 2013

-Fran- :

Agreed...imagine if it did go viral...curious to see someone run cinebench 11.5 with this now...just to see what the results would be.

palladin9479 · Jun 12, 2013

Found some more info.

http://www.swallowtail.org/naughty-intel.shtml

Guy released patches to modify the ICC to not create code that checks for CPUID, and also released another patch that modifies the compiled binary to remove the trademark check entirely. When I get home I'll start screwing around; it would be really useful if someone who has direct access to the ICC could test binaries produced to see if the patch method works on newer code. This stuff's been on the internet for a while, just the knowledge required to know WTF they're talking about is uncommon.

His conclusion is funny as sh!t.

Conclusions

It is a shame that the Intel compiler, which use to be almost the no-brainer choice if your primary concern was fast code, is now being coerced into being a marketing tool. Crippling the output for non-Intel chips may mean that some published benchmarks may end up bogusly favouring Intel over AMD, but the cost is that if you want to release fast production code I can't recommend the (unpatched) compiler. There are an awful lot of AMD machines out there!

To Intel: there is a standard mechanism out there (invented by you!) for questioning a CPU as to its capabilities. You should be using that, not checking for the presence of your trademark. I don't expect Intel to support AMD-specific extensions, and I also don't expect Intel to have to test its compiler on AMD CPUs. However, if a CPU states that it can do SSE3 or whatever then I expect the code produced by the Intel compiler to use SSE3 instructions rather than to check first if the chip was made by Intel. It was not acceptable for Microsoft to go out and deliberately cripple Windows under DR-DOS, and likewise it isn't acceptable for you to cripple a product that you sell for not inconsequential sums of money so that it won't perform properly on competitors' hardware.
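The "ask the CPU what it can do" approach from the quoted conclusion can be sketched in Python on Linux by reading the feature flags the kernel exposes in /proc/cpuinfo (which it obtains from CPUID). The function names and the sample text are illustrative assumptions; note that Linux reports SSE3 under the flag name `pni`.

```python
def parse_cpuinfo(text):
    """Return (vendor_id, set_of_flags) from /proc/cpuinfo-style text,
    using the first processor entry found."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        if key in ("vendor_id", "flags") and key not in info:
            info[key] = value.strip()
    return info.get("vendor_id", ""), set(info.get("flags", "").split())

def supports_sse3(text):
    """Decide on the reported feature flag alone; the vendor is ignored."""
    _vendor, flags = parse_cpuinfo(text)
    return "pni" in flags  # Linux names the SSE3 flag "pni"

# Made-up sample in /proc/cpuinfo format, for illustration only.
SAMPLE = """\
processor\t: 0
vendor_id\t: AuthenticAMD
flags\t\t: fpu sse sse2 pni ssse3 sse4_1 sse4_2 avx
"""

print(parse_cpuinfo(SAMPLE)[0], "SSE3:", supports_sse3(SAMPLE))
```

On a Linux machine, `supports_sse3(open("/proc/cpuinfo").read())` applies the same check to the real CPU.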

Here is a program for code compiled by the ICC for linux (whatever of it actually exists). It's a bit more complicated and I wouldn't trust it on anything newer. Changing a few opcodes via hex is fine, but trying to insert complex checks into already compiled code is dangerous.

https://github.com/jimenezrick/patch-AuthenticAMD

Finally please be sure to check libraries. What we find is that most of the heavy lifting is done by functions compiled into the libraries with the main executable acting as a front end to organize and dynamically load it all.

8350rocks · Jun 12, 2013

palladin9479 :

+1 to the quote in its entirety...

Also, I do agree...trying to do something on the scale of changing code for a Linux version of ICC would best be undertaken by someone with more than a little coding experience...

hcl123 · Jun 12, 2013

8350rocks :

Don't get overexcited. What the ICC dispatcher does in CB11.5 is use AVX on intel and SSE2 on others (AMD)... IIRC AVX performance on "BD FlexFPU" is not a big improvement. The interesting part would be to use SSE4.*, then the improvement could easily reach 2 digits of percentage on FlexFPU... but of course for this you'll have to recompile lol

But then we can say it will not be fair for intel to compare AVX against SSE4... not oranges to oranges... (EDIT)

hcl123 · Jun 12, 2013

mlscrow :

That is not what previous presented links points... like
http://translate.googleusercontent.com/translate_c?act=url&depth=2&hl=en&ie=UTF8&prev=_t&rurl=translate.google.com&sl=auto&tl=en&u=http://pclab.pl/art52841.html&usg=ALkJrhjshE8ZvbPi0zETTaCaVtRvu2gzCw

The HOT Chips 2012, clearly said "maintain high frequency engine" , which means SR is made with the same FO4 and the same high frequency prone latch ups ...

Now it depends on fab process, if bulk processes, very much doubt... even with finfet, and even with these specifically tweaked for high frequency (which intel might present for HSW-E ), case of high performance that none "foundry" will follow because its an enormous investment on all process details( but intel can $ do it)

If it is PD-SOI and follows the previous 32nm tweaks, it's quite possible Glofo 28nm SHP (PD-SOI) allows even greater clocks than now. (But PD-SOI on 28nm is also pretty much the end of the line, so where does the money come from? ... perhaps the deal for AMD to get out of GF and pay ~700 million has a hand in there, otherwise it is certainly not a process that any foundry would invest in.)

hcl123 · Jun 12, 2013

And even this 32nm shows it could have more lol

A test of the A10 6700 "Richland" 65W

~40% efficiency shown for a similar clocked 5800 on perf/power

results showing possibilities of above 20% gain for graphics stuff (games very much included here; more if the mem controller is used with higher speed DRAM... I think...)

http://translate.googleusercontent.com/translate_c?act=url&depth=2&hl=en&ie=UTF8&prev=_t&rurl=translate.google.com&sl=auto&tl=en&u=http://pclab.pl/news52829.html&usg=ALkJrhiJ92M0vjknmh3p5jNrEP9LNQhX7A

UPDATE:
also wonder if a FM2+(edit lol) platform is used along with higher clock DRAM (which richland might support better) then the 31% (salt) better graphics/games of Richland can materialize

http://wccftech.com/asus-shows-a88xm-pro-fm2-socket-motherboard-kaveri-apus-2/

palladin9479 · Jun 12, 2013

hcl123 :

Depends on the version of ICC used. Older versions were i386 vs SSE2/3. Newer versions depend on whether it's x86 or x64 and what libraries were present. Agner did a big breakdown of the "new" ICC and what code is supported where; his conclusion was that even with the "fixed" ICC there was a ton of code discrimination going on. The best you can currently hope for is SSE2 on non-Intel CPUs, and if they were using ICC 10 (2008) then it's i386 vs SSE4.2.

The differences are in small functions that tend to do lots of brute force lifting, so anywhere from 5~30% (mostly around 10%) improvements are possible. Value performance crowns have been given away for less.
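How a speedup in a few hot functions turns into an overall gain can be estimated with Amdahl's law. The sketch below is a generic illustration: the runtime fractions and the 2x local speedup are made-up inputs, not measurements from any of the programs discussed.

```python
def overall_gain(fraction, local_speedup):
    """Percent overall speedup when `fraction` of the runtime is spent
    in code that becomes `local_speedup` times faster (Amdahl's law)."""
    new_time = (1 - fraction) + fraction / local_speedup
    return (1 / new_time - 1) * 100

# Hypothetical: the affected functions take 10%, 25% or 50% of the
# runtime and run twice as fast once the better code path is used.
for fraction in (0.10, 0.25, 0.50):
    gain = overall_gain(fraction, 2.0)
    print(f"{fraction:.0%} of runtime at 2x: +{gain:.1f}% overall")
```

Under those assumed inputs the overall gains come out at roughly 5%, 14% and 33%, the same order of magnitude as the range quoted above.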
