AMD CPU speculation... and expert conjecture

8350rocks · Jun 24, 2013

Upendra09 :

AMD may very well be ahead after Steamroller...it's that dramatic an improvement.

hcl123 · Jun 24, 2013

I am quite happy with my Deneb but i don't see AMD surpassing intel anytime soon.......do any of you have any more hope for AMD than i?

(edit)

This is ridiculous

Surpassing in what ? ... in single-thread integer power ? ... who the hell cares, if not for kind of trolling bragging rights ?

Even intel surpassed intel, as seen by the OpenCL tests, even BD in MCM format surpassed SNB-E at multithreading in Linux, even a 10x less expensive chip surpasses an Exterme 6 cores CPU (see chart of Expresso transcode http://www.guru3d.com/articles_pages/amd_a10_6800k_review_apu,14.html )

If nothing radical happens, none of this chips and contenders in the pure CPU format will get 10, 15 or 20% above the other, the traditional x86 has reached a scaling dead end... which is utter ridiculous the language employed, if not for the entertainment factor, it looks like the inmates have taken over the asylum.

Can we have a serious discussion ?... because who ever wins at what, none of us as "end-users" will gain anything, a system is much more than a processor for all matters, power or performance, so please stop being "USED" by this vendors for making propaganda for free, even if you have a 2x more powerful box, it only means you payed triple or quadruple for it, if is worthed if is suitable and if you don't mind is up to you, but in the end you gain nothing, you payed for it.

Upendra09 · Jun 24, 2013

8350rocks :

that is what they said about bulldozer......can one of you give me a short run down of today's technology in terms of processors? its been a while since i followed the computer market so closely

-Fran- · Jun 24, 2013

Upendra09 :

Like you said, there's nothing that can tell us where the performance will be in reality. AMD has been REALLY tight lipped about SR, just like it was with BD and PD.

So, wait and see is the only realistic approach we can take.

@amdfangirl: Heh, have an interesting history to tell us? 😛

Cheers!

hcl123 · Jun 24, 2013

gamerk316 :

No its not "we can make everything parallel, if we try"... the problem is that it was not really tried.

Soft devs are right, some things don't scale and that is it. But that is why something like HTM (hasfail has it), automatic vectorization and speculative multithreading, might very well appear... for the things that does scale... and those are invariably the most important jobs a computer can have, the rest is nonsense. Else we wont have "more core chips" for the desktop/client, it just doesn't make sense, so the non scaling software will dictate the non-scaling hardware, and all of the sudden all this discussions lose interest. Lets hope the paradigma tries harder with better tools (perhaps like HSA), lets hope a dead end will not happen.

8350rocks · Jun 24, 2013

Upendra09 :

If you're not very versed in hardware, it's a bit difficult to explain, however, this is probably the best layman's breakdown with technical speak I have seen describing what AMD is doing and why it helps:

http://www.brightsideofnews.com/news/2013/3/6/analysis-amd-kaveri-apu-and-steamroller-core-architectural-enhancements-unveiled.aspx

juanrga · Jun 24, 2013

gamerk316 :

jdwii :

juanrga :

Again, for the several hundredth time: The majority of tasks do not scale to multiple threads. Some, like encoding, do, and FX tends to win those benchmarks (as I predicted back before BD even launched, remember?) But single-threaded benchmarks are still perfectly valid, because guess what? The tasks they are simulated also tend to be single threaded, because they DO NOT SCALE. And guess what? Some people like to know how CPU's do at those tasks.

The mindset of "we can make everything parallel, if we try", is stupid and wrong. Some stuff does NOT scale, period. Disregarding benchmarks on that basis alone is silly and shows bias on your part.

Well I wrote "most" and not "all". Moreover, I think that you missed my point about multitasking. It does not bother if some apps are only single or double threaded when you can run two or three of those at one time. As said above the usual archaic benchmarking methodology (run one benchmark, wait, run other, wait...) is not representative of how most people use their computers today.

juanrga · Jun 24, 2013

Some info about Steamroller. This is supposedly a leaked shot of Steamroller compared to Piledriver

Here some technical analysis of the die

http://semiaccurate.com/forums/showpost.php?p=184500&postcount=932

hcl123 · Jun 24, 2013

If you're not very versed in hardware, it's a bit difficult to explain, however, this is probably the best layman's breakdown with technical speak I have seen describing what AMD is doing and why it helps:

http://www.brightsideofnews.com/news/2013/3/6/analysis-amd-kaveri-apu-and-steamroller-core-architectural-enhancements-unveiled.aspx

That BSN article has been deprecated

Kaveri will not have 3 modules 6 cores, neither it will have GDDR5.

the first is easy to guess why... the traditional desktop windows software is not made with multithreading in mind... the second i think AMD will follow Wuii and XB one, and get ESRAM on board, then if things are balanced it can get much more effective bandwidth for the GPU side than 4 channels of GDDR4M (128bit low power) ( in gross mode the ESRAM provides 100GB/s, 128bit GDDR5M provides ~50GBs)

It could be 1T SRAM or 4T SRAM if bulk... it could be T-RAM SRAM if SOI... just wonder (edit)

Upadte:
http://www.brightsideofnews.com/Data/2013_3_6/Analysis-AMD-Kaveri-APU-and-Steamroller-x86-64-Architectural-Enhancements-Unveiled/AMD_hsa_evolution.jpg

Also Kaveri is the 3th phase and the one ready for HSA software because of hUMA, but the 4th phase is the same for software, only the GCN CUS might transition inside the CPU modules... those modules already have provisions for context and exception handling and preemption

hcl123 · Jun 24, 2013

juanrga :

Yes it looks legit, but i'm finding too much fruit for steamroller.. perhaps is Excavator with 4 threads per module, it suits better this http://diybbs.zol.com.cn/11/11_106489.html

The kind of wave cache could be adaptive L2, the more than double branch prediction CAM tables (edit) could be "thread control"... with as it seems 4 ALUS and 4 AGUS per cluster/core of the CMT, that modules could have easely 4 threads (2+ 2 SMT(->hyperthreading in intel lingo)). More so because it has clearly double front-end not only double decode.. that is double the L1(edit), double prefetch, double fetch, doubled size pick buffers and perhaps double local branch prediction, only the global prediction with L1 and L2 BTB (branch target buffers) remain single.

UPDATE:

It also has double FlexFPU in a traditional format!!...

What if it is a die that was scrapped ?... it will be the base for Excavator, since it seems an attempt at lower node than 28nm, only with excavator the 2 FlexFPUs will be replaced by a kind of GCN CU... the MMX kind pipes will be the Scaler pipes in GCN and the 4 FMACS will be the 4 SIMD of GCN

hcl123 · Jun 24, 2013

OTOH if its Steamroller expect a kavery with 8 threads(like hasfail) and a FX with 16 lol

A MCM server part could have 32 threads in a single socket(the same of 2 Xeon sockets), then DDR4 will be needed

It will be a tremendous push in all fronts... i would like to see it, but still find too much fruit lol

gamerk316 · Jun 24, 2013

The Q6660 Inside :

1: GCC

No one uses it; MSVC or bust.

2: 30% faster

Right now, 15% seems to be the number, based on my analysis a few pages back. I'm on the lookout for more recent benchmarks, but 15% seems reasonable.

GOM3RPLY3R · Jun 24, 2013

jdwii :

Completely agree, not everything is multithreaded. Even so, if the HT cores aren't up to par, Overclocking is always (usually) an option.

gamerk316 · Jun 24, 2013

No its not "we can make everything parallel, if we try"... the problem is that it was not really tried.

Yes it has; its been tried since the 70's, with the same exact results every single time: Most tasks DO NOT SCALE.

Soft devs are right, some things don't scale and that is it. But that is why something like HTM (hasfail has it), automatic vectorization and speculative multithreading, might very well appear... for the things that does scale... and those are invariably the most important jobs a computer can have, the rest is nonsense. Else we wont have "more core chips" for the desktop/client, it just doesn't make sense, so the non scaling software will dictate the non-scaling hardware, and all of the sudden all this discussions lose interest. Lets hope the paradigma tries harder with better tools (perhaps like HSA), lets hope a dead end will not happen.

Taking the case of speculative multithreading and HTM, both have MAJOR performance downsides in the worst case outcome. You have to do a performance analysis against your system to determine if you will gain an overall benefit using these techniques. For example, HTM, if its found that another thread has modified the data you have processed, the thread has to dump its results, put a traditional lock in place, and begin the computation a second time. Thats over 2x the performance loss of using standard software locks. If your fail rate is high enough, you can easily tank performance. In large databases, where you have dozens of CPU cores, and a lock can kill performance, HTM makes sense. For a consumer PC? Not so much.

As for automatic vectorization: Its a good performance enhancement, but not one thats going to appear on benchmarks. Algorithmic performance tends to drive application performance, and if the algorithm itself is serial, there isn't much you can do to improve performance.

gamerk316 · Jun 24, 2013

hcl123 :

Yes, its Windows fault, despite the fact is has one of the better thread schedulers out there (as of NT 6.1 at least). GDDR is expensive, and thus not attractive for the consumer market, and ESRAM as a local buffer is suboptimal and power hungry. [Note: The XB1 is roumered to be having SIGNIFICANT yield issues due to its ESRAM.]

mayankleoboy1 · Jun 24, 2013

gamerk316 :

Adding to this, all automatic vectorizers are super conservative, and do a pathetic pathetic job of vectorization. Unless you specifically use intrinsics, auto vectorization will 99% time ignore your code. And the only reason you can/will use intrinsics is when your data will benefit from it.
If your data will benefit from this, it already is inherently parallel.

noob2222 · Jun 24, 2013

hafijur :

50% faster ... rofl ... Intel can't even get 10% from one gen to the next ... and for the record Intel does have a 6 core cpu. It costs $1000+

Sure, Llano wasn't that efficient, have to go back 2 years to find that cpu ... lets look at something newer instead.

your fail is trying to pretend certain products don't exist and that nothing can beat the power draw of 22nm, especially something in 32nm ... oh wait, it just did.

^^ Edit ... WTF ...

Anyway intel fans will be happy with steamroller as intel are just releasing virtually identical cpu except better igpu and instructions and better idles since the 2011 sandy bridge cpu when people want more cores for cheap on desktop side.

BETTER IGP .... HAHAHAHAHAHAHA!!!!!!!!!!!!!!!!!!

Intel has better idle power? Hasbeen didn't even bring idle power down to trinity levels.

8350rocks · Jun 24, 2013

hafijur :

Why does it not surprise me at all that you present one benchmarks out of thousands that has such skewed power numbers...?

The average max power consumption I see out of the FX 8350 in reviews looks more like this:

Also, your benchmark is a bit misleading, as they are testing system power consumption...they cannot attribute that entirely to the CPU, and differences between motherboards and components can make a big difference in power consumption, as you yourself have point out on numerous occasions..."Part of the power consumption is the motherboard".

8350rocks · Jun 24, 2013

hafijur :

If you want to compare Server CPUs, pick on something designed for that:

griptwister · Jun 24, 2013

Well... some good news for Intel at least... maybe they need a recovery from Haswell? Lol!

http://www.tweaktown.com/news/31285/rumortt-intel-s-ivy-e-to-use-solder-not-paste-for-die-heat-spreader-thermal-transfer/index.html

In my personal experience, I have noticed no difference between a i7 and a FX 6300 for gaming. Maybe if I had a 120Hz monitor I'd see that difference, But in all honesty. I think this whole thing get's blown out of per-portion.

8350rocks · Jun 24, 2013

hafijur :

What, total power drawn over the entire course of the benchmark suite they used? It clearly took them somewhere around 1.5-2 hours to run the benchmarks...that's where the power drawn figures come from...those aren't for maximum power drawn at once.

Cazalan · Jun 24, 2013

griptwister :

I was wondering when they were going to do that. Of the couple Ivy de-lidding videos I watched the thermal paste was put on incredibly sparse.

Most probably exaggerate the temp difference but ~10c should be easily achieved. The K processors should be done that way too.

Ranth · Jun 24, 2013

griptwister :

And now we know how Intel "justify" their 1000$ price tag :sarcastic:

noob2222 · Jun 24, 2013

hafijur :

your the one pretending its 2011 and that Llano is the peak of AMD's technology, but sure, lets look at more power consumption of Intel's 6-core cpus

what!!! 30W more than AMD, idle and load?!!! how can this be ... intel owns all.

as for your bit-tech ... rofl

apples to apples, Hasbeen increased power over ivy bridge by 14W and only beats AMD by 69W at full load and a mere 3 watts idle ...

bit -techs numbers are always laughable compared to any other review. Here is why:CPU: 1.45V, CPU/NB: 1.2V, NB: 1.2V, NB HT: 1.25V, RAM: 1.65V Artificial inflation is easy to do.

aside from that, your the one trying to mix-up parts for comparisons.

I can see in 2015 amd will be releasing 220w tdp chips to compete with intels 35w tdp chips
The i7 980x is faster then the fx8350 still and that was an early 2010 cpu.
I reckon 2016 will be the year they get to sandy bridge levels and that will take a gigantic leap from where amd are now.
Intel cpus currently are trying to outperform amd chips taking 90-100w less on peak load.
intel are just releasing virtually identical cpu except better igpu

1. so what is Intel's 35W cpu? I3 3200?

well thats 100W vs 35w, who won that?

2. ... argue power consumption, then bring up the 980x. might want to look up power draw on nahalem

3. 2016 ...

so amd needs to speed up by 6 seconds

4. not that much, ~65w

5. ... LOL.

hcl123 · Jun 24, 2013

gamerk316 :

No Its the same thing of automatic vectorization, its only different in your mind... like the "force" lol

But yes you are right, some form of performance analyses have to be done, but remember that even BDver1 has hardware profiling mechanisms with a few instructions of ISA extansion, and the purpose of speculative multithreading is being "transparent" for the developers, they only have to insert the possible break points for profiling according to an analyses, which could be done kind of automatically by the compiler toolchains, and continue more or less designing the code for low thread count.

The speculative multithreading apparatus will decide to split the sequential code "ON-THE-FLY" according to the profiles... it could tremendously accelerate things, most logical it will be done by the data graphs analyses, and in there is where the HTM apparatus can enter to help, since it will be speculative threads as transactions by teh data flow grahs... or thread level data speculation...

OF course HTM code design from origin to be "transactional" (like DB engines) will be better, but it will also be extensivelly MT, it will be for HPC/server... a totally different paradigma and approach, not invalidating HTM can help in spMT.

In this case there is no penalty... neither a hardware bloat explosion, since like HSA the compiler is JIT, and there is a very advanced "runtime" to decide what to split what not and when... the code can even be single-thread.

For this a "dataflow" kind of uarch like advanced by a paper of The University of Washington can be the best approach, more so because GPU are already half way to a full blown dataflow paradigma, and spMT with HTM can also go for GPUs, for the compute jobs not kernel based, even for game physics.

So if there is an approach in model of CPU and GPU, a "dataflow" approach, having GPU ALUs and CPU ALUs very close tied together is not so complicated.

http://wavescalar.cs.washington.edu/wavescalar.pdf

That is why i suspected, the "excavator" exposé can be a 1th April joke, but the inventor knows what he is talking about http://diybbs.zol.com.cn/11/11_106489.html (it could be truth... but exposed only in china ? .. why not ?.. its an exploding market, and the west is pretty much intel turf on minds and hearts )

AMD CPU speculation... and expert conjecture

Distinguished

Honorable

Distinguished

Glorious

Honorable

Distinguished

Distinguished

Distinguished

Honorable

Honorable

Honorable

Glorious

Honorable

Glorious

Glorious

Distinguished

Distinguished

Distinguished

Distinguished

Distinguished

Distinguished

Distinguished

Honorable

Distinguished

Honorable

Share this page