AMD CPU speculation... and expert conjecture



It would certainly help, but you'd still be severely bottlenecked by main memory, so the cost probably isn't justified.
 

logainofhades

Titan
Moderator


Lack of L3 cache kills performance. That has been the case since the Athlon II X4 vs Phenom II X4 days. It hurt the FM2 chips badly enough that Phenom II was still superior, clock for clock. I think Kaveri finally caught up with, or slightly surpassed, Phenom II. Trinity and Richland were slower than both the Piledriver FX 4xxx and the Phenom II X4, clock for clock. Had they had an L3 cache, they would have been far better chips. A 7850K with L3 cache would probably have been a much nicer chip.
 

juanrga

Distinguished
BANNED


AMD is limited by its own schedule. They lack an efficient GPU architecture, so they need to jump on HBM first to compensate for that and recover part of their market share. Nvidia will jump to HBM later, because their efficient architecture allows them to remain competitive despite using GDDR5.

AMD will have access only to the first HBM modules, whereas Nvidia will have access to more advanced modules. That was the point of the Fudzilla article and also my point. Evidently AMD will have access to more advanced modules later, but not now for the 390/390X.

Yes, each stack of memory is 3D, but you are confused about the standard terminology of stacking. 2.5D stacking of an APU/GPU means the memory stacks sit side by side with the APU/GPU die, with an interposer connecting the two. 3D stacking of an APU/GPU means the memory stacks are placed directly on top of the APU/GPU die.

[Image: AMD Radeon 3D vs. 2.5D high-bandwidth memory diagram]


And the article about the 20 TFLOP APU mentions some advantages and disadvantages of both kinds of stacking.



I mentioned three articles from 2015.... ironically.
 


Could it be a yield issue if they decide to include L3?

It's one thing to get defective APUs with disabled GPU and CPU components, but now you'd be adding L3 to the "it could fail" pool as well. Kaveri was the first dense 28nm process they used, so *maybe*, and just *maybe*, they could add L3 once they get the CPU component good enough and stop playing catch-up with Intel. I mean, Excavator might be good and all, but even if Intel removed the L3, they'd still be way faster. AMD needs to re-make the CPU portion before adding L3, because otherwise it would be too expensive, IMO, to justify the small improvement over having no L3 cache. Because, you know, games are not the big market they're aiming for.

Cheers!
 

8350rocks

Distinguished



So you just learned what 3D vs 2.5D stacking is, and now you answer everything with the same marketing slide and explanation...with or without relevance to the subject?
 

Embra

Distinguished


Thanks all for the insight.

 

logainofhades

Titan
Moderator


It could be a yield issue. I think the AM3 Athlons might have been Phenom IIs with a defective L3 that was fused off so they could be sold as Athlons. They probably could have done something like that with FM2/FM2+ as well. No matter how you look at it, there are bound to be yield issues, and selling a chip at a reduced price is better than just throwing it away. Hence why we had unlockable Phenom II X2s and X3s. Kaveri with L3 cache would have been superior to the FX 4300, IMO. I honestly think it would have been very competitive with at least an i3.
 


But here's the problem: the GPU side is going to be constrained by main memory long before the CPU side gets constrained by L3. So there is literally ZERO performance benefit, since the GPU is still going to be starved. And L3 is very large, expensive, and power hungry, three things you DON'T want on a chip that (we think) is going into mobile platforms.

Point being, the extra CPU-side performance gained by the L3 doesn't mesh with the markets where APUs are being sold. So the cost justification isn't there.
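To put rough numbers on it (purely illustrative assumptions on my part: dual-channel DDR3-2133 feeding the APU versus a mid-range 256-bit GDDR5 card at 5.6 Gbps per pin), a quick sketch:

```python
# Back-of-the-envelope peak bandwidth comparison (illustrative numbers only):
# dual-channel DDR3-2133 feeding an APU vs. a mid-range 256-bit GDDR5 card.

def peak_bandwidth_gbs(bus_bits, transfers_mts):
    """Peak bandwidth in GB/s = bus width in bytes * transfer rate."""
    return (bus_bits / 8) * transfers_mts * 1e6 / 1e9

ddr3_dual_channel = peak_bandwidth_gbs(bus_bits=128, transfers_mts=2133)
gddr5_256bit = peak_bandwidth_gbs(bus_bits=256, transfers_mts=5600)

print(f"DDR3-2133 dual channel:   {ddr3_dual_channel:.0f} GB/s")  # ~34 GB/s
print(f"256-bit GDDR5 @ 5.6 Gbps: {gddr5_256bit:.0f} GB/s")       # ~179 GB/s
```

Roughly a 5x gap, so no amount of CPU-side cache changes what the iGPU is starved of.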
 

+1 to large and expensive, especially expensive. amd's apus are designed to be inexpensive, value chips.

i think intel uses its L3 cache a bit differently from the rest. amd's and arm's way of using/configuring socs with L3 cache seems similar to me.

however, amd can take advantage of the node shrink and stick some on the next apu/soc. i wonder if amd will make something like a consumer version of xbone(R)'s apu with zen/a57/k12.
 

truegenius

Distinguished
BANNED


the price looks like a guesstimate
4096 bit ? :chaudar:
well, they must be smoking too much :rofl:



these things alone are enough to skip L3 on an APU
L3 is the last trick to get performance, especially in the budget segment, and to compensate for skipping it they used more L2



exactly my observation, but you got ahead of me because i was struggling with the quoting system



time for an internet high five
 

noob2222

Distinguished
As for the L3 cache, it's a difference in perspective. Intel makes a CPU with an IGP, aka CPU > IGP.

AMD went the other way, GPU > CPU. The problem is that where L3 cache would help performance, AMD has none in an APU. They don't need 16MB like the FX chips; they could have gone with something more reasonable, like 4MB, since the max is a pseudo quad core.
 

logainofhades

Titan
Moderator


Yes, something would have been better than nothing. 2x2mb, or 4mb if you will, like the FX 4300, would have been sufficient.
 

juanrga

Distinguished
BANNED


You should re-read the thread, because it is Fudzilla which confused 2.5D and 3D, and then another poster added to the confusion. I clarified things, but if you have some other doubt, ask me in private.



The 390X comes with a 2048-bit interface. The 395X2 has two GPU dies, thus the interface is 2048 bit x 2 = 4096 bit.



From one of them:
We've seen very strong uptake on our 64-bit processors, mainly for the tablet and mobile market, with two main ecosystems being Android and iOS,
:sarcastic:
 
I'm willing to bet that reducing L2 in favor of L3 would hurt the CPU component far more than keeping L2 big enough to make up for the lack of L3.

We could dig into this a bit by looking at the great Barton core (Athlon XP) and the Toledo core (Athlon 64 X2). There were L2 cache differences between them, and benchies from the time could illustrate how they compare. I remember it well, because I had a Toledo 4400+ running at 2.5GHz. Damn, that CPU was nice.

Anyway, this is the best I could find: http://cpuboss.com/cpus/AMD-Athlon-X2-4200-vs-AMD-Athlon-64-X2-4400

Cheers!

EDIT: APU context, not CPU alone.
 

truegenius

Distinguished
BANNED


the article was mainly about the 395X2, but they were discussing the 390X too. even saying that because it uses 2 GPUs it's 4096 bit (at 2048 bit per GPU) is wrong; it would be 2x2048 bit, since current CrossFire techniques (the same data in both memory pools for alternate frame rendering) don't add up the capacity or width of the memory systems dedicated to the different GPUs
and they said
Furthermore, the R9 390X will also possess a 4096 bit HBM memory design and is going to be the first ever graphic card to be released by either company to boast such as feature.
so they were referring to the 390X for the 4096 bits (typo?)
and afaik the 390X will use 1024 bit
so can you provide a link for 2048 bits? (guessing it will be from wtftech (that's how my mind automatically renders that name :whistle: ))
 

blackkstar

Honorable


Juan, mate, I think you are having difficulty grasping the fact that scaling up performance in a workload increases transistor count and decreases efficiency. You see a simple ARM core with a horrible FPU being far more efficient than x86, so you just assume that ARM will eventually catch up because of how efficient it is at its current level of performance.
http://www.bitsandchips.it/gaming/9-hardware/5214-roundup-arm-board-odroid-u3-marsboard-a20-e-rk3066?start=6

You can clearly see the FPU is missing a lot of functionality and performance. Adding that performance and functionality will increase transistors and decrease efficiency. You cannot have both.

You also seem to assume that GCN and Maxwell have the same design goals. They don't. Maxwell is clearly missing parts of the architecture needed for certain tasks that GCN handles.
http://www.anandtech.com/show/8568/the-geforce-gtx-970-review-feat-evga/14

Direct your attention to the double-precision F@H and video rendering benchmarks. Maxwell just doesn't have the transistors there to do those tasks properly. And that's why it's "more efficient": when GCN is doing things that don't need those parts of the GPU, they still receive power and drag efficiency down, because AFAIK there is no power gating to prevent that from happening.

Nvidia is aiming for an efficient gaming card with the ability to do consumer-oriented GPGPU tasks. AMD is aiming for a GPGPU beast that can handle far more use cases with GPGPU.

That's why Nvidia fans and Nvidia-friendly tech media like to push efficiency so hard. GCN will never be as efficient as Maxwell unless AMD does something magical. But efficiency is a straw man. No sane person is buying these cards to save money on their electricity bill. Efficiency is just a ruse in the HEDT GPU market to deflect from the fact that features are missing from one company's products while the other's products have them. If Maxwell were great at DP F@H and video rendering, it wouldn't be as efficient.

There are markets where efficiency is important. But we're discussing HEDT graphics cards for gaming and GPGPU. The only people who care about efficiency there are people who live somewhere electricity is rationed or costs a lot of money. And at that point, I'd question why they'd choose a hobby like PC gaming, especially since you could buy a PS4 or Xbone and game on a whole system drawing under 150W.
 


i was running 7xxx series cards and still had issues
also, the drivers for the HD 6450 had more than a full product life cycle and AMD still couldn't get them right
I understand that when a new card comes out the first few driver releases will have some glitches, but by the end of a card's life cycle that driver should be flawless

 

logainofhades

Titan
Moderator


I had a nicely clocked Brisbane 3600+. Lost my validation link for it, and my posts from back then are gone. Had a bit of a dispute with former moderation, a few years ago, and my posts were purged. Fortunately, the power hungry person involved got the boot.
 

juanrga

Distinguished
BANNED
I will write here a collective answer to several ARM issues some of you have raised.

First, the Anand article is wrong. They are not controlling some relevant variables during the power measurements. Moreover, Yuka mentioned another fundamental point: our lack of knowledge of several fundamental parameters of the new 20nm node. There is a reason why several companies (including Nvidia) are skipping the 20nm node. It is also worth mentioning that Intel is having similar power issues with its 14nm node.

Second, I don't know why some of you insist on misunderstanding the issue about efficiency again and again. Evidently a 100W ARM CPU @3.0GHz will not be as efficient as a single-digit-watt ARM CPU @1.5GHz. Efficiency decreases with higher IPC and frequency:

Efficiency ∝ I⁻¹ f⁻²  (where I is IPC and f is frequency)

What the whole industry is claiming is that a ~100W ARM CPU will be more efficient than a ~100W x86 CPU. Several of us estimate the ARM ISA provides about a 20-30% advantage over the x86 ISA at Xeon levels of performance, with everything else the same. David Kanter claims that K12 will be about 10% better than Zen. At the lowest TDPs the gap between x86 and ARM is bigger, 2x or more, which is why Intel cannot enter the mobile market even with a process node advantage.
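As a toy illustration of that scaling (the proportionality is my claim above, not a measured law; the IPC and frequency numbers here are made up, only the trend matters):

```python
# Toy model of the claimed scaling: efficiency proportional to 1 / (IPC * f^2).
# The relation is the claim made above; the design points below are invented.

def relative_efficiency(ipc, freq_ghz):
    return 1.0 / (ipc * freq_ghz ** 2)

small_core = relative_efficiency(ipc=1.0, freq_ghz=1.5)  # low-power design point
big_core = relative_efficiency(ipc=2.0, freq_ghz=3.0)    # high-performance design point

# Under this model the big core is about 8x less efficient than the small one.
print(f"big core / small core efficiency = {big_core / small_core:.2f}")
```

Which is exactly why comparing a phone SoC with a Xeon tells you nothing; you have to compare at the same design point, e.g. ~100W vs ~100W.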

I note that the 20-30% (or Kanter's 10%) is the overall ISA advantage. I see too many people on forums still believing that the x86 penalty is only in the decoder. I will simply copy and paste from an engineer:

It's a myth that ISA overhead is just in decode. There are many aspects of an ISA that affect the overall microarchitecture. Just to mention one example, x86 requires more load/store units due to having fewer registers and load+op instructions. x86 also uses a more complex memory ordering model.

Several ARM servers have been measured and provide more performance than Xeons while consuming less power. Jdwii gave the link to Cray announcing a partnership with Cavium to build ARM-based supercomputers. That is because there is an efficiency advantage. The 80W SoC gives more performance than 95W Xeons but consumes less power. Linley Gwennap wrote:

Compared with Xeon, ThunderX could deliver 50% to 100% more performance per watt and per dollar, particularly when considering the additional chips that Intel needs to complete the server design.

Third, x86-64 is an extension of x86-32. This means that all the bloat, legacy, and oddities of the old ISA are present in the new one. For low-power Intel chips it is estimated that about 20% of the die space is used to support the microcode ROM. Any engineer knows that designing and implementing an x86 decoder is more complex than an A64 decoder. The complexity is not because the x86 ISA is better, but for two reasons: (i) decoding instructions of variable length is more complex than decoding instructions of fixed length, and (ii) the x86 decoder has to support hundreds of legacy instructions that nobody uses anymore but remain in the ISA.
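A toy sketch of point (i), using invented encodings rather than real A64 or x86 ones, just to show why fixed-width boundaries can be found in parallel while variable-length boundaries must be found one after another:

```python
# Toy sketch: finding instruction boundaries in a fixed-width ISA vs. a
# variable-length ISA. The encodings below are invented for illustration only.

def boundaries_fixed(code: bytes, width: int = 4):
    # Fixed width: every boundary is known up front, so the instructions can
    # be handed to the decoders in parallel.
    return list(range(0, len(code), width))

def boundaries_variable(code: bytes):
    # Variable length: the length of instruction N must be decoded before the
    # start of instruction N+1 is even known, which forces sequential work.
    length_of = {0x90: 1, 0xB8: 5, 0x0F: 2}  # made-up opcode -> length table
    offsets, pc = [], 0
    while pc < len(code):
        offsets.append(pc)
        pc += length_of.get(code[pc], 1)
    return offsets

print(boundaries_fixed(bytes(16)))                                       # [0, 4, 8, 12]
print(boundaries_variable(bytes([0x90, 0xB8, 0, 0, 0, 0, 0x0F, 0x05])))  # [0, 1, 6]
```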

I know of studies showing that only about 30% of x86 opcodes are used in modern scientific applications. The rest are not used anymore but have to be supported by the hardware for formal compliance with the ISA. x86 is the most bloated and poorly designed ISA in use.

ARM is different. A64 is not an extension of A32 but a new, separate ISA. Moreover, ARM engineers not only moved to 64-bit but used the change to clean up the ISA, eliminating the legacy and bloated aspects. The new A64 is a clean, elegant, and efficient ISA. Engineers can support A64, or A32, or A64+A32. The first chips support both ISAs, A64+A32 (A57, A53, Denver, Cyclone). The next chips will focus on A64 and support only part of A32 (e.g., Vulcan supports full A64 but only the user mode of A32). And later ARM chips will support only A64, which will provide another boost in efficiency.

Finally, A64 software already exists: both OS and applications. I gave some links many pages ago.
 

juanrga

Distinguished
BANNED




I gave the wrong interfaces before. The 2048-bit interface is for the APU, which has only two stacks. The single GPU is 4096 bit and the dual GPU is 4096 bit x 2. The article is correct that the 390X has a 4096-bit memory interface. They didn't smoke anything.

You can check for yourself:

http://www.sisoftware.eu/rank2011d/show_run.php?q=c2ffccfddbbadbe6deeddaeaddfb89b484a2c7a29faf89fac7f7&l=en
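For reference, the rough bandwidth math behind those widths, assuming first-generation HBM at 1024 bits per stack and 1 Gbps per pin (the figures being floated at the time; shipping parts may differ):

```python
# Rough HBM bandwidth math behind the interface widths above. Assumes
# first-generation HBM: 1024 bits per stack at 1 Gbps per pin.

def hbm_bandwidth_gbs(stacks, bits_per_stack=1024, gbps_per_pin=1.0):
    return stacks * bits_per_stack * gbps_per_pin / 8  # GB/s

print(f"2 stacks (2048-bit, APU):        {hbm_bandwidth_gbs(2):.0f} GB/s")  # 256 GB/s
print(f"4 stacks (4096-bit, single GPU): {hbm_bandwidth_gbs(4):.0f} GB/s")  # 512 GB/s
```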
 

truegenius

Distinguished
BANNED