cdrkf :
RE CMT, I'm not saying I think CMT is a good idea now- my point was that back when they started designing Bulldozer (5+ years before it launched) I can understand why they went that way. The one thing CMT has allowed AMD to do (that other design schemes might not have) is to cram a lot of cores onto a reasonably sized die at a relatively large node (compared to what Intel is working with, at least). Also, the design scheme behind CMT doesn't actually dictate poor single-thread performance: if you're using one module for one thread, then much like HT all resources are dedicated to that thread and speed improves.
I understood your point, but I disagree. I think CMT was a bad idea back then too. That is why no other processor maker followed the 1996 DEC architecture except AMD.
AMD's CMT approach makes little sense (then and now) because it is a self-inconsistent hybrid of a throughput compute unit (TCU) and a latency compute unit (LCU):
(1) Cramming lots of cores into a relatively small die makes sense when you are designing a manycore, but this is a multicore, and multicore != manycore. This confusion was a major design flaw of CMT.
(2) As already covered in early FSA (the precursor to HSA), CPUs are for latency, GPUs for throughput. With CMT, however, AMD went for a throughput-optimized CPU, "moar cores", which not only makes little sense per se (serial code, branching...) but also contradicts what HSA is about: latency + throughput.
(3) The shared FPU was chosen because someone at AMD intended to use the CPU for integer work and, in the future, an external GPU as a kind of giant FPU. The CMT split into integer vs floating-point 'clusters' is a copy of the old DEC boxes. But this 'clustering' ignores that one also needs FP for latency workloads. Moreover, the shared FPU complicated the design, instead of simplifying it, compared to a CMP architecture, because the shared FPU has to implement a form of SMT.
Moreover, the two FMAC units can be fused to execute 256-bit code, but then they are accessible to only one of the cores in the module at a time, halving throughput again and eliminating any performance gain from running AVX-like code, which again makes no sense.
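A back-of-the-envelope illustration of that last point (just the usual peak-FLOP bookkeeping, assuming 2x 128-bit FMAC units per module as in Bulldozer's FlexFPU):

```python
# Rough peak-FLOP bookkeeping for one Bulldozer-style module.
# Assumption: 2x 128-bit FMAC units shared by the two cores.

SP_LANES_128BIT = 4   # 32-bit lanes in a 128-bit vector
FLOPS_PER_FMA = 2     # fused multiply-add = 2 FLOPs per lane

# Two threads each issuing 128-bit FMAs: each thread owns one FMAC.
per_thread_128 = SP_LANES_128BIT * FLOPS_PER_FMA   # 8 FLOP/cycle
module_128 = 2 * per_thread_128                    # 16 FLOP/cycle

# 256-bit AVX FMA: the two FMACs fuse into one 256-bit unit,
# so only one thread can issue per cycle.
per_issue_256 = 2 * SP_LANES_128BIT * FLOPS_PER_FMA  # 16 FLOP/cycle
module_256 = per_issue_256                           # still 16 FLOP/cycle

print(module_128, module_256)  # 16 16 -> no module-level gain from AVX
```

Module-level peak is unchanged, which is exactly the "no gain from AVX-like code" point.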
(4) Modules were aimed at providing more throughput, but the shared decoder brings a ~20% penalty, which ends with multithreaded code running faster when only one thread is scheduled per module and the companion cores are left unused. Modules were also aimed at improving power consumption, but in the above case the full module is active even when only one of its cores is working, increasing power consumption compared to a traditional CMP design, where each core can be parked when unused.
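To put a number on it (the ~20% front-end penalty is the figure quoted above; the rest is just arithmetic):

```python
# Aggregate throughput of two independent threads, normalized so that
# one thread alone on a module = 1.0. The ~20% shared-decoder penalty
# is the figure quoted above; treat it as illustrative.

FRONTEND_PENALTY = 0.20

two_threads_one_module = 2 * (1.0 - FRONTEND_PENALTY)   # 1.6
two_threads_two_modules = 2 * 1.0                        # 2.0

print(two_threads_one_module, two_threads_two_modules)
# 1.6 vs 2.0: spreading the threads across modules is ~25% faster,
# but it keeps both modules powered up.
```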
(5) As mentioned above, modules were also aimed at improving power consumption. The CMT approach tries to do this by reducing die area through shared elements:
Power ~ Area
But then IPC was reduced by the area constraints. Net throughput of CMT would still be superior thanks to "moar cores"; the problem is that low IPC would be a step backward for a CPU, so it had to be compensated by an emphasis on a high-frequency engine (~4GHz). But power is not linear with frequency:
Power ~ frequency^n
where n >= 2. Thus any power advantage from sharing resources was eliminated by the high frequencies, resulting in higher TDPs for the CMT design. Power consumption is the reason why throughput machines are clocked at low frequencies: 1-2 GHz.
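A minimal sketch of the trade-off, assuming the classic dynamic-power model P ~ C*V^2*f with voltage roughly tracking frequency (so n lands between 2 and 3; the exact exponent depends on process and operating point):

```python
# Rough power scaling: P ~ f^n with n >= 2, because voltage must rise
# with frequency and dynamic power C*V^2*f then grows superlinearly in f.

def relative_power(f_ratio, n=2):
    """Power of the same design clocked f_ratio times faster."""
    return f_ratio ** n

# Throughput-oriented parts sit at 1-2 GHz; Bulldozer targeted ~4 GHz.
for n in (2, 3):
    print(f"n={n}: 2x clock -> {relative_power(2, n):.0f}x power "
          f"for only 2x throughput")
# n=2: 4x power; n=3: 8x power. Any area/power savings from
# sharing module resources are easily eaten by this.
```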
(6) To tolerate the increased latencies needed to hit a high frequency target, branch prediction is critically important; however, in CMT the branch predictor is shared by the two cores in each module, leaving each core with effectively half the predictor resources.
(7) This is all about the hardware; CMT also introduces problems on the software side. A CMP scheduler is trivial. An SMT scheduler is easy (first fill real cores, then virtual ones). A CMT scheduler is complex. For throughput, independent threads have to be scheduled on free cores in separate modules with the companion core unused; otherwise performance is reduced by the front-end bottleneck. For data-dependent threads, it is better to schedule on the same module to avoid the performance penalty of moving cache data across modules. For efficiency, the scheduler has to follow yet other strategies, because scheduling two threads on two modules increases power consumption (see the toy sketch below).
In the end, no scheduler can extract maximum efficiency from a CMT approach.
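To make the conflict concrete, here is a toy model (entirely hypothetical code, not how Windows or Linux actually implement their schedulers) showing that the three goals pick different cores from the same state:

```python
# Toy CMT topology: module i contains cores 2*i and 2*i+1.
# Hypothetical sketch; only meant to show the policies conflict.

NUM_MODULES = 4

def module_of(core):
    return core // 2

def pick_core(busy, policy, partner=None):
    """Pick a core for a new thread. busy = set of busy core ids."""
    free = [c for c in range(2 * NUM_MODULES) if c not in busy]
    busy_modules = {module_of(b) for b in busy}
    if policy == "throughput":
        # Prefer a completely idle module: no shared front-end contention.
        for c in free:
            if module_of(c) not in busy_modules:
                return c
    elif policy == "cache_affinity" and partner is not None:
        # Prefer the sibling core of a data-dependent partner thread.
        sibling = partner ^ 1
        if sibling in free:
            return sibling
    elif policy == "power":
        # Prefer an already-active module so the others can stay parked.
        for c in free:
            if module_of(c) in busy_modules:
                return c
    return free[0] if free else None

busy = {0}  # one thread already running on core 0
print(pick_core(busy, "throughput"))         # 2 -> a fresh module
print(pick_core(busy, "cache_affinity", 0))  # 1 -> sibling of core 0
print(pick_core(busy, "power"))              # 1 -> keep module 0 hot
```

Each policy chooses a different core given the same state, which is exactly why no single heuristic can win on all three fronts at once.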
(8) This is all about CMT. Bulldozer introduced further flaws/problems on top of that.
cdrkf :
I agree with you on Kaveri- it is a shame that the GDDR5 DIMMs weren't available, as the iGPU in Kaveri + GDDR5 would be epic (1920x1080 @ 30+ fps in pretty much any title). I think as new memory technologies become available AMD's APUs are going to get better and better. I'm waiting for the day Tom's recommends an APU in either (or both?) of their Best CPU / GPU for the Money articles.
Agree. But at least we know that it was not AMD's fault, just a case of bad luck. In any case the GDDR5M solution was only temporary, because GDDR5 cannot scale. As I mentioned in a previous post, the new 2016 APUs will offer fast stacked DRAM for the high-end range.
It is expected that APUs will replace both CPUs and dGPUs around 2018, although Intel could accelerate things and kill discrete GPUs even sooner.