AMD CPU speculation... and expert conjecture

de5_Roy · Oct 11, 2013

Cazalan :

? per core or per module? [strike]by core i mean the integer processing part.[/strike] edit: i don't even know what to call it... i meant to ask if SR will have 3 alus per thread executing cluster (can't word it better..).
edit: my first time hearing about 3 alus per core, too.

tracker45 · Oct 11, 2013

any am3+ steamrollers ?

Cazalan · Oct 11, 2013

de5_Roy :

You can call them cores. Someone actually updated the SR Wiki to say 3 ALU as well. Wishful thinking or is that why the rumors are saying Kaveri 2.0 like 1.0 was cancelled.

http://en.wikipedia.org/wiki/Steamroller_%28microarchitecture%29

Cazalan · Oct 11, 2013

tracker45 :

None expected for 2013/2014 according to the latest roadmaps. Possibly in 2015.

8350rocks · Oct 11, 2013

Cazalan :

As much as I would love to see it be 3 ALUs per core...I cannot find any hard documentation of such a drastic architectural change. Outside of this one slide juanrga has posted, everything else I have read released from AMD says nothing about 3 ALUs per core. If they were going to do that, I would think it would be something they would be advertising pretty heavily to the tech industry insiders.

Additionally, an additional ALU per core would take up quite a bit of die space. You're talking core sizes would be quite a bigger portion of the die than they already are.

If this revelation turns out to be fact, it could explain why the APUs are not looking at a 6 core offering as of yet; however, that would be quite a drastic departure from the current architecture as it stands. The only explanation I could see for such information so late in the game would be if Jim Keller himself stepped in and said, "hold on, you're not quite there..." and made this change to the architecture of his own accord.

I wouldn't at all be upset, were that to be the case; though, I can safely say I am skeptical at this point...we will see once the ES's are dissected as to what is really under the hood for sure.

tracker45 · Oct 11, 2013

What cpu's are they releasing this year ?

Cazalan · Oct 11, 2013

tracker45 :

Desktop Kaveri is supposed to start shipping in December.

There's this table that SA dug up from INF files as potential candidates.

The Q6660 Inside · Oct 11, 2013

anxiousinfusion :

blackkstar :

The issue about gamerk, is that he uses "ERMEHGAWD DUHULL COARS ARE ENUFF, DEVS LIAK NEVER EVOLVE" logic over and over again.

griptwister · Oct 12, 2013

I think someone is getting tired of having to code for multiple threads. :lol:

Maybe that's why we see him on one or 2 threads on the forums?

palladin9479 · Oct 12, 2013

I just want a decent upgrade to my PhenomII x4 980, the FX-8350 just does not cut it... but anything they release later (if not piledriver based) i will pick, the performance will probably be about 30% at minimum compared to my current PhenomII, so that should be enough... if it`s 6 or 8 cores, the better.

The 8350 is definitely an upgrade for the 970~980BE. I know because I had one and did indeed upgrade to a fx8350.

To the rest, going with 3 ALU's per core would alleviate the need for a true six core setup in the mainstream segment. Also you need to be careful with benchmarks of the BF nature. Similar to what we say between SP and MP modes, whats actually happening in the environment will have a very large impact on the core loading. If there isn't a ton of stuff exploding and all those environmental effects / collisions / ect.. happening then your not gonna get solid core loading.

Actually I think this is more about minimum FPS then anything else, it doesn't matter what FPS a particular setup can do when nothings going on. Only the FPS you can when stuff's exploding and your actually in heavy MP combat is important.

de5_Roy · Oct 12, 2013

Cazalan :

even for the sake of speculation, this seems encouraging. i was thinking along the lines of amd re-tweaking sr and kaveri to better fit laptops and possibly hybrid 2-in-1s and low power apus since bd derivatives are predominantly desktop architectures.

btw, if amd implements 3 alus per 'core', doesn't this mean 6 alus per module and 12 alus per dual module cpu?
does it mean amd implemented hyperthreading (or similar)?
how will the die area be impacted by this design decision (speculation)?
i am guessing switching to 28nm and smaller (than vliw4) gcn igpu opened up enough die area...?

juanrga · Oct 12, 2013

8350rocks :

LOL. Do you want a 64 bit ARM baseline for the _first_ 64 bit ARM chip? That is very easy to solve, because only one baseline is possible. Do you need some hint?

8350rocks :

Nope. I said «A9/A15 are not "x86 CPU baseline"».

The A9 and the A15 can be found in phones. The A9 and the A15 can be found in some _early_ ARM servers

8350rocks :

You don't need "equivalent architectural baseline" to compare the performance of ARM to x86, PowerPc to x86, MIPS to ARM, ARM to PowerPC...

If you don't want to discuss more about ARM, you can stop posting. If you stop posting more nonsense, I will stop correcting it.

8350rocks :

How many extrapolations do you believe are being made? 200? 58? 3.1416?

Your pretension that comparing A57 to jaguar, Piledriver, or Sandy Bridge is like comparing a 1960 supercomputer with punch cards to a modern x86-64 is either an intelligent joke or you are going crazy with _your_ pretension that x86 is modern and complete, whereas ARM is incomplete and old.

8350rocks :

You must recommend that when it is about x86. When you said that Steamroller will be 30% you didn't mentioned "grain of salt", neither you used any benchmark. You merely claimed the expected performance from "something that doesn't exist". The fact you use the "that doesn't exist" argument only against ARM, reflect again how you play by double standard rules.

8350rocks :

You mix things. Don't strange you are so confused. You cannot compare the _raw performance_ of a ARM phone chip to the _raw performance_ of a x86 DT chip and pretend, as _you_ do, that x86 is faster. If someone make this either he doesn't have absolutely any idea about the topic or is being dishonest.

Of course, that is different from using a phone chip as baseline to obtain the performance of a server chip, which can be made with independence of the arch.

Nobody is _extrapolating_ the performance of Xeon E5 from an Atom, neither _asumming_ Steamroller performance from Temash.

8350rocks :

Do your lack of answer and your refusal to cite the same link again implies that you did knew that you cannot evaluate the performance/efficiency of two CPU architectures from comparing a chip to a cluster?

Regarding CISC/RISC, I know enough to say you the following.

In the first place, any modern CISC processor is so complex than cannot be directly implemented in the silicon. In any AMD or Intel modern chip CISC is internally translated to a RISC-like uops, which are then executed on hardware.

_Your_ old pretension that a RISC architecture cannot do everything a CISC architecture in a natural way is based in _your_ confusion that CISC mean Complete and RISC is for Incomplete.

Since the CISC instruction are never executed in a modern CISC processor, but are first translated to RISC uops all your arguments vanish in the air. But I will bite.

First mistake. A RISC CPU doesn't spend more time procesing instructions. It merely process more instructions, since the RISC arch minimizes the execution time for each instruction, because each one is simpler, the time needed to execute the program is minimized _not_ maximized. This is RISC 101 stuff.

Second mistake. Further repetition of the first. Your any-ARM-version-of-Windows-will-be-slower and nonsense as that was solved above.

Third mistake. A ___genuine___ CISC CPU doesn't spend less time procesing instructions. It merely process less instructions, but since each instruction is more complex the execution time for each instruction is longer, and the time needed to execute the program is _not_ minimized. Again, this is for a genuine CISC CPU that executes CISC on silicon. Modern CISC CPUs are implemented as RISC on silicon and the code really executed by the hardware is RISC-like.

The more funny part of your anti-RISC rant is that your lovely CISC CPUs are really executing RISC code at the metal level. When your write rants against RISC you are really writting rants about how really work a FX-8350 at the most fundamental level (silicon level). I think it was Intel the first to notice the advantages of the RISC approach and the first to implement RISC uops on silicon; AMD did a bit latter if my memory doesn't fail.

Fourt mistake. To use _your_ misunderstanding of RISC for pretending that ARM cannot be faster than x86 is lovely, but that you are unaware of that some of the more powerful CPUs in the planet are RISC gives a degree of tragedy to your posts.

For instance, POWER 7 offers 33 GFLOP per core. Ivy Bridge i7-3770k offers 28 GFLOP per core. POWER 7 was replaced by POWER7+ and this by POWER8. IBM states that it is two to three times as fast as the POWER7. Stop from pretending that RISC cannot be as fast as x86.

Again, if you don't want to discuss more about ARM, you can stop posting. If you stop posting more nonsense, I will stop correcting it.

noob2222 :

Nope. I already explained to you how were obtained. Maybe you would stop from taking nigthmare-code lessons from Mr. if(0!=1) then{} else{} and pay some attention to the material on the posts that you reply...

juanrga · Oct 12, 2013

de5_Roy :

Evidently.

de5_Roy :

No.

de5_Roy :

Everything else the same, one core with 3 ALUs occupies more space than one core with 2 ALUs.

de5_Roy :

Yes, smaller node implies you can put more stuff on the same area, but 3 ALU per core could be also implemented in >28nm. Moreover, Steamroller introduces a streamlined FPU that reduces CPU die space.

juanrga · Oct 12, 2013

Cazalan :

Don't be fooled my friend. That is not for CPUs. I was discussing only CPUs.

That link is for Intel MIC project. The Xeon Phi coprocessor is somewhat based in x86. But it doesn't run the same x86 binaries that you can run in an i7-3770k for instance. For using the Phi you have to recompile the code at least and if you want to use all its potential you have to rewrite the code.

Take a look to "What Xeon Phy is NOT" slide

https://hpcforge.org/plugins/mediawiki/wiki/pracewp8/images/6/68/XeonPhi.pdf.

Intel MIC project is aimed to competing with GPUs, _not_ with CPUs.

ARM will beat Intel CPUs on supercomputers and Nvidia GPUs are already beating Xeon Phi on HPC. The two first positions in the green500 list are for Nvidia K20 supercomputers. Xeon Phi is third but with a 31% less efficiency (even when has a node size advantage over Nvidia).

ARM+CUDA or ARM+GCN will be the interesting combos.

Cazalan :

I also find very interesting that work. They tested ancient A8 and A9 against ATOM and i7. To put things a bit on perspective, the A8 is made on 65nm process.

Their node scaling procedure is open to objections. The multiply the A8 by 0.8 but the i7 only by a 1.3. This is favouring the i7.

Their procedure to measure power is also is suspicious. They measure total power and then subtract board power. Problem the A8 and the A9 are SoCs, the i7 and the ATOM are not. Therefore they are really comparing the power of a CPU to the power of a SoC. This is again favouring Intel.

Why did they use gcc 4.4? Why did they run the same binaries, when one of main advantages of RISC is that allows for more aggressive compiler optimizations?

Why did they choose Intel but not AMD chips? If they had tested Bulldozer FX instead of an SB i7, they had found poor efficiency and even poor performance in several tests. Why did they not test a modern ARM chip?

The team did not look at newer ARM designs, including the 64-bit ARMv8 architecture, but they believe the results would look much the same — in terms of the relationship between power and performance.

Really? LOL the A15 introduces both performance and efficiency improvements over the A9, and the new ARMv8 architecture is a new ISA, not a mere extension as Intel64, The A57 core based in ARMv8 introduced huge advantages of the order 2x--3x over predecessors.

Huumm. This looks all suspicious. Then a simple search reveals that this technical paper is being cited in news sites that claim that Intel will compete with ARM on mobiles. Huuum more suspicion. Then we take a look to the conference website and we found Intel name among 'sponsors'

http://www.carch.ac.cn/~hpca19

Then we search the main author of the article and... surprise surprise he has an Intel folk as collaborator and several Intel students

http://research.cs.wisc.edu/vertical/wiki/index.php/Main/People

Yeah Intel is competing with ARM :lol:

Cazalan :

Well I have been saying for months that FX Steamroller is not needed... In my article about Kaveri I was _conservative_ and used only a 20% boost, when some details suggest could be up to a 40%.

The Q6660 Inside · Oct 12, 2013

juanrga :

Cazalan :

Don't be fooled my friend. That is not for CPUs. I was discussing only CPUs.

That link is for Intel MIC project. The Xeon Phi coprocessor is somewhat based in x86. But it doesn't run the same x86 binaries that you can run in an i7-3770k for instance. For using the Phi you have to recompile the code at least and if you want to use all its potential you have to rewrite the code.

Take a look to "What Xeon Phy is NOT" slide

https://hpcforge.org/plugins/mediawiki/wiki/pracewp8/images/6/68/XeonPhi.pdf.

Intel MIC project is aimed to competing with GPUs, _not_ with CPUs.

ARM will beat Intel CPUs on supercomputers and Nvidia GPUs are already beating Xeon Phi on HPC. The two first positions in the green500 list are for Nvidia K20 supercomputers. Xeon Phi is third but with a 31% less efficiency (even when has a node size advantage over Nvidia).

ARM+CUDA or ARM+GCN will be the interesting combos.

Cazalan :

I also find very interesting that work. They tested ancient A8 and A9 against ATOM and i7. To put things a bit on perspective, the A8 is made on 65nm process.

Their node scaling procedure is open to objections. The multiply the A8 by 0.8 but the i7 only by a 1.3. This is favouring the i7.

Their procedure to measure power is also is suspicious. They measure total power and then subtract board power. Problem the A8 and the A9 are SoCs, the i7 and the ATOM are not. Therefore they are really comparing the power of a CPU to the power of a SoC. This is again favouring Intel.

Why did they use gcc 4.4? Why did they run the same binaries, when one of main advantages of RISC is that allows for more aggressive compiler optimizations?

Why did they choose Intel but not AMD chips? If they had tested Bulldozer FX instead of an SB i7, they had found poor efficiency and even poor performance in several tests. Why did they not test a modern ARM chip?

The team did not look at newer ARM designs, including the 64-bit ARMv8 architecture, but they believe the results would look much the same — in terms of the relationship between power and performance.

Really? LOL the A15 introduces both performance and efficiency improvements over the A9, and the new ARMv8 architecture is a new ISA, not a mere extension as Intel64, The A57 core based in ARMv8 introduced huge advantages of the order 2x--3x over predecessors.

Huumm. This looks all suspicious. Then a simple search reveals that this technical paper is being cited in news sites that claim that Intel will compete with ARM on mobiles. Huuum more suspicion. Then we take a look to the conference website and we found Intel name among 'sponsors'

http://www.carch.ac.cn/~hpca19

Then we search the main author of the article and... surprise surprise he has an Intel folk as collaborator and several Intel students

http://research.cs.wisc.edu/vertical/wiki/index.php/Main/People

Yeah Intel is competing with ARM :lol:

Cazalan :

Well I have been saying for months that FX Steamroller is not needed... In my article about Kaveri I was _conservative_ and used only a 20% boost, when some details suggest could be up to a 40%.

ARM is not going anywhere in the short term, period.

noob2222 · Oct 12, 2013

juanrga :

too bad you haven't figured out this code yet.

While (Juanrga = wrong) {printf("JUANRGA IS NEVER WRONG")};

The A57 core based in ARMv8 introduced huge advantages of the order 2x--3x over predecessors.

ya ... vs what the cortex a5?

The out-of-order design will deliver performance measured at 1250 SpecInt2000 at 1.7 GHz, or about 25-30 percent more umph than today’s 32-bit A15 cores. It can be clocked at up to 2.5 GHz in a 20 nm process, said Mike Filippo, principal design architect for the part.

...

The A57 will also deliver slightly greater power efficiency than the A15, but it could require as much as 30 percent more die area, Filippo said. The core was designed for equally effective use on 32- and 64-bit code, he added.

jdwii · Oct 12, 2013

Doesn't it take 128 arm processors in a cluster just to get a 4 core Xeon performance, isn't that Arm processor incapable of running x86 instructions? IF they found out how to emulate the x86 instructions wouldn't there be a HUGE performance penalty. So except for part of the server market i just can't see Arm coming anywhere close to Intel or even Amd on the performance level when its needed.

Not to mention I was impressed that Arm is not even on the consoles i actually thought they would of done that.

juanrga · Oct 12, 2013

The Q6660 Inside :

juanrga :

Cazalan :

Don't be fooled my friend. That is not for CPUs. I was discussing only CPUs.

That link is for Intel MIC project. The Xeon Phi coprocessor is somewhat based in x86. But it doesn't run the same x86 binaries that you can run in an i7-3770k for instance. For using the Phi you have to recompile the code at least and if you want to use all its potential you have to rewrite the code.

Take a look to "What Xeon Phy is NOT" slide

https://hpcforge.org/plugins/mediawiki/wiki/pracewp8/images/6/68/XeonPhi.pdf.

Intel MIC project is aimed to competing with GPUs, _not_ with CPUs.

ARM will beat Intel CPUs on supercomputers and Nvidia GPUs are already beating Xeon Phi on HPC. The two first positions in the green500 list are for Nvidia K20 supercomputers. Xeon Phi is third but with a 31% less efficiency (even when has a node size advantage over Nvidia).

ARM+CUDA or ARM+GCN will be the interesting combos.

Cazalan :

I also find very interesting that work. They tested ancient A8 and A9 against ATOM and i7. To put things a bit on perspective, the A8 is made on 65nm process.

Their node scaling procedure is open to objections. The multiply the A8 by 0.8 but the i7 only by a 1.3. This is favouring the i7.

Their procedure to measure power is also is suspicious. They measure total power and then subtract board power. Problem the A8 and the A9 are SoCs, the i7 and the ATOM are not. Therefore they are really comparing the power of a CPU to the power of a SoC. This is again favouring Intel.

Why did they use gcc 4.4? Why did they run the same binaries, when one of main advantages of RISC is that allows for more aggressive compiler optimizations?

Why did they choose Intel but not AMD chips? If they had tested Bulldozer FX instead of an SB i7, they had found poor efficiency and even poor performance in several tests. Why did they not test a modern ARM chip?

The team did not look at newer ARM designs, including the 64-bit ARMv8 architecture, but they believe the results would look much the same — in terms of the relationship between power and performance.

Really? LOL the A15 introduces both performance and efficiency improvements over the A9, and the new ARMv8 architecture is a new ISA, not a mere extension as Intel64, The A57 core based in ARMv8 introduced huge advantages of the order 2x--3x over predecessors.

Huumm. This looks all suspicious. Then a simple search reveals that this technical paper is being cited in news sites that claim that Intel will compete with ARM on mobiles. Huuum more suspicion. Then we take a look to the conference website and we found Intel name among 'sponsors'

http://www.carch.ac.cn/~hpca19

Then we search the main author of the article and... surprise surprise he has an Intel folk as collaborator and several Intel students

http://research.cs.wisc.edu/vertical/wiki/index.php/Main/People

Yeah Intel is competing with ARM :lol:

Cazalan :

Well I have been saying for months that FX Steamroller is not needed... In my article about Kaveri I was _conservative_ and used only a 20% boost, when some details suggest could be up to a 40%.

ARM is not going anywhere in the short term, period.

That is the mistake made by Intel and AMD.

AMD is correcting it, adopting now ARM for servers, tablets... and proclaiming that ARM will win.

juanrga · Oct 12, 2013

noob2222 :

I already commented about your code... twice :lol:

noob2222 :

2x--3x when compared with something more modern

Try again

noob2222 · Oct 12, 2013

^^ marketing slides ... HAHAHAHAHAH!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

While (Juanrga = wrong) {printf("JUANRGA IS NEVER WRONG")};

Marketing didn't play any role in calculating the GFLOPs.

go ahead, deny it again.

juanrga · Oct 12, 2013

jdwii :

As mentioned before...

A single Seattle CPU competes in raw performance with a top Xeon CPU.

ARM can simulate x86 instructions.

Nobody will be using ARM to run x86 legacy code exclusively. That is why AMD is releasing Warsaw in servers, for those customer who will be slow on migrating to ARM. Rest of people will be using ARM software on ARM chips.

Nvidia has already ported CUDA to ARM for instance. CUDA will be not used in servers.

ARM is not in the consoles because it was not ready 3 years ago when the development of the new consoles started. Many people agrees that next gen consoles will be ARM.

juanrga · Oct 12, 2013

noob2222 :

It didn't play any role in calculating the GFLOPs

Try again :lol:

The Q6660 Inside · Oct 12, 2013

The FX 9370 seems pretty good now, you know, with an Asetek LCS and pre-ocd to 4.7Ghz and Far Cry 3, it can hold quite the grudge against a 4770K with a sh*t stock cooler...
http://www.newegg.com/Product/Product.aspx?Item=N82E16819113352

Meanwhile.., AMD is secretly killing the tin FX 83XX boxes :evil:

Cazalan · Oct 12, 2013

juanrga :

You're confusing an IEEE conference sponsor with the actual people who wrote that particular paper, which largely have nothing to do with the conference other than submitting their paper for review and presenting it. The conference doesn't pay people or provide funding to write papers.

There is a huge difference. A conference sponsor means they get their names on the SWAG (stuff we all get, t-shirts, coffee mugs, and crap like that) and signs and other hand outs that are given to the people who PAY to go to the conference.

There is an Intel person listed as a collaborator for the research group, but there is also a Broadcomm person. Broadcomm uses ARM cores in their products not x86 cores.

"Support for this research was provided by NSF grants CCF-0845751, CCF-0917238, and CNS-0917213,
and the Cisco Systems Distinguished Graduate Fellowship."

Cisco uses X86 and ARM products so your Intel bias assumption doesn't have much merit.

It's amazing how quick you are to cut down the paper but take marketing slides as gospel.

juanrga :

When it comes to conference papers they're often submitted a year in advance of the conference. And the work probably started a year or more before that, so you're looking at a 2+ year lag from start to presentation (Feb 2013).

So the question why they didn't test an A15 is because they simply weren't out yet. Apple was the first to ship a phone with one and that wasn't until September 2012, which is only 5 months before the paper was presented.

They're not doing speculation here. They have real world hardware they had to get in the lab, code to write, testing, getting results, taking measurements, getting feedback from advisers, peer reviews, etc.

They used what they had access to. The i7 Sandy is old as well. It came out in January 2011, which is around the time A9s were introduced to the market. I.e the first Tegra 2 phone in January 2011.

http://www.networkworld.com/news/2011/010511-nvidia-shows-off-lg-optimus.html

juanrga :

No one is making CPU only HPC from here on out. Even IBM systems are incorporating NVidia GPUs. BlueGene/Q is the only one that could be considered CPU only on the top500 list but it's floating point accelerator is essentially a GPU.

Your beloved project denver (Tegra 6) isn't getting it's massive performance gains from the ARM cores. The 100x performance increase is largely due to the Maxwell GPU cores that support CUDA, instead of their ancient GeForce ULP cores.

Recompiling for HPC environments is common place. They're all unique with various bandwidth, storage, cpu variations. Dig on the Phi all you want but it currently powers the #1 supercomputer in the world.

The same will have to be done for a supercomputer powered by Tegra 6. It will mostly be CUDA code with some ARM code for moving data around.

Cazalan · Oct 12, 2013

juanrga :

Or course CUDA will be used in NVidia servers. That's how you program/execute Kepler/Maxwell cores.

https://developer.nvidia.com/content/cuda-arm-platforms-now-available

AMD CPU speculation... and expert conjecture

Splendid

Distinguished

Distinguished

Distinguished

Distinguished

Distinguished

Distinguished

Honorable

Distinguished

Splendid

Splendid

Distinguished

Distinguished

Distinguished

Honorable

Distinguished

Splendid

Distinguished

Distinguished

Distinguished

Distinguished

Distinguished

Honorable

Distinguished

Distinguished

Share this page