AMD CPU speculation... and expert conjecture

Page 236
Status
Not open for further replies.


Despite the fact I've already shown that almost no applications in everyday use, benchmarking or otherwise, are compiled with ICC?

As for Linux:

http://www.phoronix.com/scan.php?page=article&item=amd_fx8350_visherabdver2&num=1

Intel wins the majority of tests, but tends to lose in parallel-bound benchmarks, such as C-Ray. When Intel wins, though, it's generally by a very large margin (upwards of 30% in some cases). C-Ray was really the only "bad" benchmark for Intel, but then again, ray tracing is naturally going to benefit from more cores.

-------------------------------------------------

At the end of the day, if the CPU is getting its work done, you will not see a benefit from more cores. This is nothing new; programmers figured this out back in the 70's. I could have one hundred threads running on a single CPU core; if that core can process all those threads quickly enough, then even though adding more cores will decrease latency and reduce overall CPU load, in terms of application performance [how long it takes to process], you will see no performance benefit.

The reverse is also true; you could have one hundred thousand threads, but if any one of them is a heavy enough workload that a CPU core can't keep up, then despite all those extra cores, that single thread WILL be the main reason application performance does not improve. You see this effect in games, where 2 threads tend to do the bulk of the work [main application thread + main rendering thread]. How quickly those two threads complete will drive performance, and that favors a processor with high clocks and high IPC, rather than lots of cores. Hence why Intel outperforms AMD in gaming.
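A rough way to picture the argument above is a toy model in which a thread cannot be split across cores, so the time to finish is set by the most loaded core. The function name and the workload numbers below are made up for illustration; this is a sketch of the reasoning, not a measurement of any real game or CPU.

```python
import heapq

def finish_time(thread_costs, cores):
    """Greedy longest-first scheduling: time until every thread is done.

    thread_costs: work per thread in arbitrary time units (hypothetical).
    cores: number of identical cores; a thread runs on exactly one core.
    """
    loads = [0.0] * cores                     # current load on each core
    heapq.heapify(loads)
    for cost in sorted(thread_costs, reverse=True):
        lightest = heapq.heappop(loads)       # least-loaded core so far
        heapq.heappush(loads, lightest + cost)
    return max(loads)

# Hypothetical game-like workload: two heavy threads plus many light ones.
game = [10.0, 8.0] + [0.1] * 20

for n in (1, 2, 4, 8):
    print(n, "cores ->", round(finish_time(game, n), 2))
```

With these made-up numbers the finish time stops improving once the two heavy threads each have a core of their own; after that it is pinned at the cost of the heaviest thread, and only a faster core would lower it.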
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790




From

http://www.extremetech.com/computing/158901-amd-server-assault-2014-roadmap

The fact that AMD is listing Berlin as a replacement for the Opteron 3300 series says good things about the potential efficiency of the new hardware. The Current Opteron 3300 parts are a mixture of four and eight-core parts clocked at 1900MHz – 2800MHz. That should put the Berlin family ahead of the quad-core Opteron 3300s, but likely behind the eight-core Opteron 3380. While we expect Steamroller to be significantly faster than Piledriver, it’s unlikely to deliver a 2x performance improvement.

We don't know the clocks. I did an estimate, and a 4C SR @ 3.5GHz would have about 80% of the performance of the Opteron 3380, or about 90% if clocked @ 4GHz.
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


AMD claims 1.05 TFLOP for the APU (CPU+GPU).

Assuming the CPU clock is the same as Richland's:

CPU: 4C x 8 x 4.1GHz = 131.2 GFLOP
GPU: 512MAD x 2 x 0.9GHz = 921.6 GFLOP
APU: 921.6 + 131.2 = 1052.8 GFLOP ~ 1.05 TFLOP

If the CPU is clocked @ 4GHz then

APU: 921.6 + 128 = 1049.6 GFLOP ~ 1.05 TFLOP

To put this in context the 4770k has 848 GFLOP total.
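For anyone who wants to replay the arithmetic above, here is a small sketch. The unit counts, the 8 FLOPs per core per cycle, and the 0.9GHz GPU clock are the figures assumed in this post, not confirmed specifications, and `peak_gflops` is just an illustrative helper name.

```python
def peak_gflops(units, flops_per_unit_per_cycle, clock_ghz):
    """Theoretical peak = units x FLOPs per unit per cycle x clock (GHz)."""
    return units * flops_per_unit_per_cycle * clock_ghz

gpu = peak_gflops(512, 2, 0.9)           # 512 MAD units, 2 FLOPs each
for cpu_clock in (4.1, 4.0):
    cpu = peak_gflops(4, 8, cpu_clock)   # 4 cores, 8 FLOPs per core per cycle
    print(f"CPU @ {cpu_clock} GHz: {cpu:.1f} + GPU {gpu:.1f} = {cpu + gpu:.1f} GFLOP")
```

These are theoretical peaks only; they say nothing about sustained throughput in real workloads.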
 

cowboy44mag

Guest
Jan 24, 2013
315
0
10,810


Here's my take on all of that. Every leap in technology in human history has always been followed by a very large "lag" time to figure out how to implement that new technology.

*Just as an example, fuel injection was introduced in fighter aircraft in the 1940s; how long did it take to be implemented in cars and trucks?*

With that particular example the "lag" time was decades. With video games you can clearly see the "lag" time of the software catching up to more cores when you look at the Phenom II x4 and x6 processors from 4 years ago. Back then when the 4 and 6 core Phenom IIs were new to the market they could be outperformed by overclocked high end i3s running the games of that time period. In the games of that era everything ran on one or two cores and the better single core performance of the i3 meant they could "game" better. Fast forward to today and the "old" x4 and x6 Phenom IIs aren't doing too bad with modern games, whereas the i3s from the same era can no longer run the modern games as well. Why? Because in modern games the software has finally started to catch up to the hardware and implement more cores. Four years from now when the software catches up to the hardware again we may see x8 Piledriver systems outperforming x4 Haswell systems of the same era. Four years from now we may actually have games that take full advantage of 8 cores.

My take on it is that AMD's hardware is suffering more from delayed software support for new hardware than from any real problems with the arch.
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


Intel wins 20 of 28 tests, but in the majority of them the advantage is minimal:
169 vs 151
13559.82 vs 13427.95
46.80 vs 42.26
8.30 vs 7.96
25 vs 26
...

Intel wins by a large amount in only a few tests, and those are the tests relying on the FPU. On the other hand, AMD also wins some tests by a large margin and others by a small margin.

On average the 8350 is about 14% behind the 3770k. What you did not mention is the explanation for some of the bad results of the AMD chip. From the same review that you linked:

In not all of the Linux CPU benchmarks did the Piledriver-based FX-8350 do well. For some Linux programs, AMD CPUs simply don't perform well and the 2012 FX CPU was even beaten out by older Core i5 and i7 CPUs. We can hopefully see improvements here later on through compiler optimizations and other software enhancements. As shown in my earlier AMD Piledriver compiler tuning tests from the A10-5800K Trinity, with the current GCC release there isn't much improvement out of the "bdver2" optimizations for this processor that should expose the CPU's BMI, TBM, F16C, and FMA3 capabilities over the original AMD Bulldozer processors. I hope that we will see further compiler improvements out of AMD to close some of these performance gaps.

Moreover, look at the AMD score in C-Ray that you mention

http://openbenchmarking.org/embed.php?i=1210227-RA-AMDFX835085&sha=293f200&p=2

and then look at the AMD score with a newer version of GCC

http://openbenchmarking.org/embed.php?i=1305170-UT-LLVMCLANG75&sha=1593a32&p=2

23.34 --> 19.26

A 17% increase in a test where GCC code was already good.

AMD has already introduced compiler support for Steamroller in GCC. I would expect good performance from the first minute.
 

8350rocks

Distinguished


Software is always a generation (or more) behind hardware. Look at Unreal 4, which is only just now incorporating the newest hardware features...and it's a couple years old. Currently the list of games on Unreal 4 Engine you can count on your fingers...that's because a great many game developers do not want to "cut down" their target market. Though the way to push adoption of newer hardware is by making games that are on the newest tech.

It's a double-edged sword as a developer...as you want a maximum size target audience, but you also want to innovate. As long as there are Core 2 Duo gaming rigs out there, some developers will aim for the lowest common denominator. Though, they are straying from that methodology more and more now...the newest engines have not been widely adopted yet, because a large portion of the general public is running 2-4 year old hardware. Some of that hardware (Phenom 2 systems with 4/6 cores) is seeing massive gains from the more stringent engines, because it can actually benefit from more demanding engines. Whereas some of the newer systems (8 core FX for example) are not really being maxed out, because their potential is greater than what current software can efficiently use, and certainly more than it demands.

Things will be interesting...in the next 24 months, you will see much better software support for larger core count CPUs because the 3-4 year old tech crowd is upgrading now to systems with a quad core.
 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


BULL !...

Cray being in the red or black has nothing to do "per se" with losses of late chips...

Bet their contracts were paid in full under the established deadlines, meaning no penalties... if they were paid later rather than sooner, and so Cray couldn't avoid looking bad on the financial side for one quarter, that is unfortunate but it happens (no direct losses)... after all, what is a quarter? ... bet Cray would rather have double the contracts and every one of them late than have half the contracts and all ahead of schedule.
 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


It could be, but...

CPU: 4C x 8 x 4.1GHz = 131.2 GFLOP

What is 4C for ? ... 4 cores ?... the integer cores don't do FLOP operations.

If it is like BD/PD, 8 operations by 4 cores means 4 operations per module, and since each FMAC is 2 operations, it means that the FlexFPU has the same 2 128bit FMAC pipes per module for the 64bit double precision operations, which is what counts in those FlexFPUs.

I think the FLOP rate of "single precision" in the FlexFPU is the same as that of double precision... I really don't know; single precision is not that important for those CPU workloads. It could double... it could be less... most probably it will be the same...

The GPU FLOP count is almost exclusively "single precision", meaning 32bit ops. The GPU can also do double precision, but I think it is 1/16 of the single precision rate... or at most, like the top Radeon, 1/4 of the single precision rate...

So I think it is necessary to make a distinction.

CPU at 4GHz (rounded for the sake of the numbers)

Single/double precision

4 FMAC pipes(2 modules) x 2 ops x 4000 = 32GFLOPS (single or double)

Quite below your 131.2 GFLOPS.

Now if each FMAC is 256bit wide, which I suspect is the case, and each FMAC pipe can do 4 64/32bit ops (vectors have that capability), it would be double, or...

4 FMAC pipes(2 modules) x 4 ops (256bit vectors) x 4000 = 64GFLOPS (single or double).. yet too small

But if each module has 2 FlexFPUs (one per "cluster"... it could be the case), capable of 4 256bit AVX ops per cycle in the 2 module 4 core config of the APUs, so matching the theoretical peak of 2 AVX ops per core (never possible due to the shared exec nature of the design) of Intel chips... then it comes very close...

8 FMAC pipes(2 modules, 2 FlexFPU per module) x 4 ops (256bit vectors) x 4000 = 128 GFLOPS (single or double)... in this case comes very close

So if you stand by your numbers (AMD confirmed it, you say)... what you are saying is that each *Module* on SR APUs will have a double FlexFPU with 2 256bit FMAC pipes each, capable of 4 64bit ops per FMAC pipe (vectors). (edt)

It means it could have the same design of the supposed, if not fake, probable Excavator
http://farm6.staticflickr.com/5321/9104546631_4c7a4a023b_o.jpg

It's half the FLOP count of that EXv, because that EXv has 4 cores/threads per module and has what are probably 8 256bit FMAC pipes all in one module, that is, 4 FlexFPUs in one module (or 4x the size)... that is my "speculation" too (a 4 module, 16 core/thread EXv die will have 1 TFLOP double precision; an MCM G34-like socket would have 8 modules, 32 cores/threads, 2 TFLOPS double precision... which is why "bye bye" Phy X)

Nevertheless this is CPU side, GPU will be

512MAD x 2 x 0.9GHz = 921.6 GFLOP "single precision"; double precision will be at most 921.6/4 = 230.4 GFLOPS

Kaveri APU (speculation)

~1TFLOP single precision ( 920 GPU + 128 CPU)

~360GFLOPS double precision ( 230 GPU + 128 CPU)
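The three CPU scenarios and the single/double precision totals above can be tabulated with a short sketch. Everything here is taken from the speculation in this post: the pipe counts, the ops per pipe, the rounded 4GHz clock, and the best-case 1/4 double precision rate for the GPU are guesses, not confirmed Kaveri specifications. `cpu_gflops` is only an illustrative helper.

```python
CLOCK_GHZ = 4.0  # rounded CPU clock used in the post

def cpu_gflops(fmac_pipes, ops_per_pipe_per_cycle, clock_ghz=CLOCK_GHZ):
    """Peak CPU GFLOPS = FMAC pipes x ops per pipe per cycle x clock in GHz."""
    return fmac_pipes * ops_per_pipe_per_cycle * clock_ghz

cases = [
    ("4 pipes x 2 ops (128-bit FMACs)", cpu_gflops(4, 2)),
    ("4 pipes x 4 ops (256-bit FMACs)", cpu_gflops(4, 4)),
    ("8 pipes x 4 ops (2 FlexFPUs per module)", cpu_gflops(8, 4)),
]
for label, gflops in cases:
    print(f"{label}: {gflops:.0f} GFLOPS")

gpu_sp = 512 * 2 * 0.9      # single precision GPU peak from the thread
gpu_dp = gpu_sp / 4         # assumes the best-case 1/4 DP rate guessed above
cpu_best = cases[-1][1]     # the 128 GFLOPS case, taken as SP = DP
print(f"APU single precision: {gpu_sp + cpu_best:.1f} GFLOPS")
print(f"APU double precision: {gpu_dp + cpu_best:.1f} GFLOPS")
```

Only the third scenario lands near the CPU figure implied by the 1.05 TFLOP claim, which is the point the post is making.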
 

hcl123

Honorable
Mar 18, 2013
425
0
10,780
CORRECTION
It means it could have the same design of the supposed, if not fake, probable Excavator
http://farm6.staticflickr.com/5321/9104546631_4c7a4a023...

It's half the FLOP count of that EXv, because that EXv has 4 cores/threads per module and has what are probably 8 256bit FMAC pipes all in one module, that is, 4 FlexFPUs in one module (or 4x the size)... that is my "speculation" too (a 4 module, 16 core/thread EXv die will have 1 TFLOP double precision; an MCM G34-like socket would have 8 modules, 32 cores/threads, 2 TFLOPS double precision... which is why "bye bye" Phy X)

Each EXv module is 128 GFLOPS double precision, so it's not a 1 TFLOP die and 2 TFLOPS MCM... it's half of that. IF single precision is double, it could be that.

Nevertheless, I think a 6 module die at 28nm FD-SOI, which is roughly half the size of 32nm PD-SOI, could be possible, so it could be 1.5 TFLOPS double precision per MCM (128x12 = 1536), which is already above the current Phy X, which is not a CPU (so power per flop will be superior) (edt)... (DDR4 will be indispensable here, and at least 3 channels per die, but most probably this will go to 2.5D interposer-stacked HBM memory)
 

noob2222

Distinguished
Nov 19, 2007
2,722
0
20,860


What is one benchmark that nearly every review site on the web uses? What is it compiled with?

So while Cinebench is a valid bench as quite a few people use the Intel OpenMP libraries, it is not representative of all render engines. In fact, Cinebench probably only represent the smaller part of the market that uses the Intel OpenMP API. On dual CPU systems, the Opteron machines run a bit slower than they should; on quad CPU systems, this lack of "AMD NUMA" awareness will have a larger impact.
http://www.anandtech.com/show/5058/amds-opteron-interlagos-6200/10

What does that tell you about a cpu that can keep up with Intel on their own compiler? What would happen if cinebench supported sse4.1 and avx on AMD chips? This is known as cinema 4d r14. AMD gets sse 4.1 and Intel still gets a lesser advantage of AVX instead of sse2 and avx seen in cinebench 11.5.

My question is why haven't we received a cinebench r13/14/15? Kinda odd we go from 9.5 - 10 - 11.5 ... _____. Are Intel shill tactics still in play on cinebench 11.5, since the newer cinema 4d versions have better AMD support? Why is it that every review site around uses cinebench 11.5 to show how superior avx is to sse2?

The compiler benchmarks you showed indicate that going from sse2 to sse3 had the largest speedup, but cinebench is lacking that support on AMD.

The truth is that's what you're comparing with Intel vs AMD: AVX to sse2.
 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


*SIC*

I think big cores... so propagandized... will continue; what I laid out above is a solution. The "traditional" server market is not only very small for AMD (thanks, Intel, for the anti-competitive exclusive deals), it's also shrinking fast... new markets: they will go after Phy X and HPC. The module uarch is just so much better than any MIC or tiled uarch, and it is already a CPU...

 

blackkstar

Honorable
Sep 30, 2012
468
0
10,780


Compare FX overclocking to Intel Bridge overclocking. Notice how FX can take 1.7v under water no problem and still run perfectly fine without degrading? Now do some detective google work and find what it takes to degrade Sandy Bridge. I've seen folks on XS start to degrade after running just a little over 1.5v for a month or so. And there are people with FX 8350s that run 1.6v+ 24/7 since release and have no problems.

This is what people are talking about when they say SOI is superior to bulk. You seem to have confused architecture with fabrication.

But I would not be surprised if, were AMD on bulk with Piledriver, the gap between Intel and AMD were much, much larger, as we'd probably still see a 3GHz-range FX 8350.
 

Cazalan

Distinguished
Sep 4, 2011
2,672
0
20,810


Just because there are no penalties doesn't mean there wasn't a loss of profit. Whenever a product is delayed there is a profit loss. A 1 quarter delay is a 1 quarter loss of support/maintenance contract, which is sizable for that type of product. Not to mention interest payments on credit lines.

Yes it is the cost of doing business and Cray has to anticipate that, which is why they didn't directly blame AMD. They have business relationships to maintain. They have since recovered and are doing quite well now, but they have shifted their new lead product to Intel instead of AMD. Which ultimately hurts AMD (and us consumers) in the long run and has contributed to their decline in server chip sales. The domino effect.

My speculation is losing a key new design win like Cray XC30 has had a direct effect in how much R&D AMD is contributing towards their high end server chips and thus FX, since it's the same die. It's basically at a support mode level now based on supply contracts, and the 2014 roadmap reflects that.

 

noob2222

Distinguished
Nov 19, 2007
2,722
0
20,860
on another note, AMD popped a 6W cpu

http://legitreviews.com/news/15806/

3W normal use, 6w Maximum = haswell "SDP" gimmick at 4.5W that has a maximum TDP of 11.5W

this g-series uses the jaguar core.
 


I disagree. The reason AMD quads are outperforming i3's has less to do with being quads, and more to do with their clock advantage. Clock an i3 2300 at the same speed as a FX-4300, and the i3 wins due to its IPC advantage. Hence why i3's are typically hanging around the FX-4000 and PII X4 series of chips.

And as I have said many, many times, this is NOT going to change. Though I'm sure you won't hesitate to somehow blame consoles for holding PC's back.

A 17% increase in a test where GCC code was already good.

And intel likely got a performance gain as well, keeping the performance difference between the two roughly the same.

What is one benchmark that nearly every review site on the web uses? What is it compiled with?

How about this: Give me a group of, say, 7-8 benchmarks to test, and I'll run them through my various detection tools and tell you what they were compiled with.
 

noob2222

Distinguished
Nov 19, 2007
2,722
0
20,860


what 7-8 benchmarks are used on every review site? missed that point tho, one benchmark seems to appear on every site. why is that?

Aside from that, what makes you certain that your "detection" is detecting all possibilities?

For example using Intel libraries under MSVC. Wouldn't that detect MSVC yet optimize under "Genuine_Intel"?
 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


Microsoft and Sony must be stupid lol... or they are going for 8 thread games that don't need that much single-thread power (why choose Jaguar ?)...

Perhaps it is not MSFT or Sony that are stupid... at least someone is...

Did you do any tests yourself to come to those wonderful conclusions ?... i bet MSFT and Sony did...

Yes, there is some validity to Amdahl's law... but there is also Gustafson's law... in the HPC multithreading world there is more than meets the eye; study both.
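For readers who want to compare the two laws side by side, here is a minimal sketch using their textbook forms. Note that the parallel fraction is defined differently in each: Amdahl's law measures it on a fixed-size problem run on one core, while Gustafson's law measures it on a problem that is scaled up to fill the available cores. The 0.5 fraction below is an arbitrary illustrative value, not a figure from any benchmark in this thread.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Fixed problem size: speedup = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

def gustafson_speedup(parallel_fraction, cores):
    """Problem size grows with core count: speedup = (1 - p) + p * n."""
    return (1.0 - parallel_fraction) + parallel_fraction * cores

p = 0.5  # hypothetical parallel fraction
for n in (2, 4, 8):
    print(f"{n} cores: Amdahl {amdahl_speedup(p, n):.2f}x, "
          f"Gustafson {gustafson_speedup(p, n):.2f}x")
```

Amdahl's form flattens out as cores are added because the serial part never shrinks; Gustafson's keeps growing because the extra cores are assumed to be given more parallel work. Which one describes a game engine better depends on whether the workload can actually be scaled that way.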

Besides, soon Amdahl's law will be a relic. SpMT (speculative multithreading) and "dataflow" approaches with "hardware transactional memory" are just the BEST ways to extend all forms of parallelism; soon it will be very blurred where a single thread begins and ends... and just as a reminder, a GPU is a "dataflow" engine, which is why a single chip can cope so well with thousands of threads in those Waves or Warps and not choke...

Your conclusions are rather *prejudice* more than anything else( you lose not gain)

 

8350rocks

Distinguished


Cinebench 11.5

PCMark

Sysmark

SuperPi

 

hcl123

Honorable
Mar 18, 2013
425
0
10,780


yes... and the millions of those "financial" derivative papers, ABSOLUTELY, COMPLETELY worthless pieces of paper... are indeed worth the trillions and trillions of dollars that the Too Big To Fail banks claim they are...

Perhaps when they need another bailout, you can jump first in like LOL...

Before, there was the "marketing campaign" for the status quo that filled perhaps more than 60 pages... now there will be 60 more pages to defend the banksters' POVs? :??: crazy :pt1cable:

[ you don't do business in the real world like that... everything is under contract with well established deadlines... a direct fault of any party would be immediately attributable, with compensation for losses or failed profits already foreseen (edt)... nothing of the kind happened with BD/Cray... it was a bit later than expected, too bad Cray needed the income very badly... and you are just using Cray's laments to attack AMD... for your sake I hope AMD dies, so that you can buy celerons at $2000; then of course you will have nothing to complain about LOL ]

 

griptwister

Distinguished
Oct 7, 2012
1,437
0
19,460


LOL! That's jacked up! I hope they die the day hafijur goes to buy a new craptop! xD
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


Yes, "4C" stands for 4 cores. I used the standard formula from AMD performance labs

(CPU Cores x freq x 8 FLOPS)

the 8 FLOPS (per core) follows from 4SP x 2FMA

You can rewrite the formula in terms of modules if you want. The result is the same

2M x 16 FLOPS/M x 4.1GHz = 131.2 GFLOP
2M x 16 FLOPS/M x 4.0GHz = 128 GFLOP
 

Cazalan

Distinguished
Sep 4, 2011
2,672
0
20,810


Saying a product was late is not an attack, it's just pointing out a fact. A fact which had a number of negative repercussions for AMD.

Your implying I want AMD to die is absurd. I buy AMD products for both personal and professional use on a yearly basis. Of course I need to know where AMD is heading for procuring new systems.
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


Hardware performance depends on IPC, clock speed, and number of cores. Increase any of those and you increase performance in general. Evidently, if you overclock an i3 enough it will outperform a quad.

Hardware is reaching physical limits with current technology, and tomorrow we will not see a single core clocked at 30GHz, but more and more multicores. Intel is going to release their first eight-core chip because there is no way that they could fabricate a quad or a dual core chip with the same performance.

I don't know what you mean by "blame consoles for holding PC's back", because it is evident that new consoles will increase PC gaming quality a lot; even Nvidia is already saying that.




GCC 4.7 --> 4.8.1

FX-8350: 23.34 --> 19.27.
i7-3770k: 33.05 --> 28.21.

Performance gain for the FX is 17.4%. Performance gain for the i7 is 14.6%. That is 19% more for the FX.
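The percentages above can be reproduced with a few lines. This sketch assumes the C-Ray scores are run times where lower is better, which is how the post treats them; `gain_percent` is just an illustrative helper, and the four input numbers are the ones quoted above.

```python
def gain_percent(before, after):
    """Relative reduction in run time, in percent (lower time is better)."""
    return (before - after) / before * 100.0

fx = gain_percent(23.34, 19.27)      # FX-8350, GCC 4.7 -> 4.8.1
i7 = gain_percent(33.05, 28.21)      # i7-3770k, GCC 4.7 -> 4.8.1
print(f"FX-8350: {fx:.1f}%  i7-3770k: {i7:.1f}%  ratio: {fx / i7:.2f}")

# How far ahead the FX is in this one test, before and after the compiler update.
lead_before = 33.05 / 23.34
lead_after = 28.21 / 19.27
print(f"FX lead: {lead_before:.2f}x -> {lead_after:.2f}x")
```

Both chips gain from the newer compiler, so the gap between them in this test moves only a little; the FX's gain is the larger of the two in relative terms.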

Of course, the relevant improvements in performance for AMD chips will come from better support for the bdver2 flag. I don't know how much more improvement to expect, but 30-50% does not seem exaggerated. Better support for bdver2 will not improve the performance of Intel chips.
 

GOM3RPLY3R

Honorable
Mar 16, 2013
658
0
11,010


Put a 6300 and a 8350 on there, and do the overclocked version as well. 1 AMD vs 5 Intels doesn't really prove a point.
 

griptwister

Distinguished
Oct 7, 2012
1,437
0
19,460
Neither does your friend saying: My CPU runs at 1C under load xD

Put a GOOD cooler on a FX 8350 and I can almost guarantee you, the FX 8350 would run cooler under the same HS as a 3770K would. Only if I had money... I'd do this myself!!!
 