AMD CPU speculation... and expert conjecture


juanrga

Distinguished
BANNED


I will underline the relevant part of my post you ignored:



You seem to believe that anything posted on the internet is accurate or right. A beautiful example was when you wrote "100 mhz higher clock on the a10 5700 vs 4100-fx", because your beloved techspot review gives a wrong base frequency for the A10-5700. The A10 is not clocked 100MHz higher but 200MHz lower. The author also wrote

We were surprised by how poorly some of AMD's older processors performed.

Therefore even the author admits that the results are odd, but you want to present them as the norm.
 

jdwii

Splendid

noob2222

Distinguished
You bashed techspot as being the culprit, so I provided an alternative using the same motherboard and memory for the Phenom 975 and 840. You completely missed it and left your crosshairs on techspot.

The minor frequency discrepancy doesn't make up for the massive performance deficit. The only reason you want to blackball techspot is that no one else even attempts to test that many CPUs all the time, so you can hide behind "techspot sucks" as your defence.
 
One last note about AMD's cache implementation. AMD uses a "victim" cache system, meaning the contents of a higher-level cache are data evicted from a lower-level cache. The system never copies data from system memory into L2 / L3; instead it copies from system memory into L1. If L1 is full, then the oldest data (by time of last access) is copied into L2; if that's full, then the oldest data from L2 is copied into L3; if L3 is full, then the oldest data from L3 is dumped. Thus the contents of L2 are always old, previously accessed data, while the contents of L3 are even older. This is the primary reason why you need such large amounts of L2 / L3 cache to make any performance impact: you're fighting against two sets of probabilities. The first is that the data wasn't predicted and read into L1, but also that the data was previously accessed and is still resident in L2 or even L3. No amount of cache can save you from a miss on data that wasn't previously accessed, as it simply won't be stored anywhere other than main memory.

So here are the levels of probability before L3 is ever accessed for a read

First, the data doesn't exist in L1 I/D (branch predictor miss)
Second, the data was previously accessed and thus might exist in cache
Third, the data was old enough to have been evicted from L2 cache
Fourth, the data was new enough to not have been evicted from L3 cache

Only if all of those are true will the data be present in L3 and thus provide a performance benefit. The benefit is much smaller than from L2, due to L3's significantly longer latency and lower clock speed, but it's still faster than main memory. Also, while L1 I/D is thread specific, L2 and L3 hold data from any thread that's been run on that core/CPU; excessive context switching can cause L2/L3 to be rapidly overrun by another thread's old data.
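To make that eviction chain concrete, here's a toy LRU model of a victim hierarchy in Python. The sizes and the strict-LRU policy are illustrative assumptions on my part, not AMD's actual replacement logic:

[code]
from collections import OrderedDict

class VictimHierarchy:
    """Toy exclusive ("victim") hierarchy: memory fills go straight to L1,
    and each level's evictions spill down to the next level."""
    def __init__(self, l1_size=4, l2_size=8, l3_size=16):
        # OrderedDict as an LRU list: the oldest entry is first.
        self.l1, self.l2, self.l3 = OrderedDict(), OrderedDict(), OrderedDict()
        self.l1_size, self.l2_size, self.l3_size = l1_size, l2_size, l3_size

    def access(self, addr):
        """Return the level that served the request."""
        if addr in self.l1:
            self.l1.move_to_end(addr)          # refresh LRU position
            return "L1"
        if addr in self.l2:
            del self.l2[addr]                  # promote back to L1 (exclusive)
            self._fill_l1(addr)
            return "L2"
        if addr in self.l3:
            del self.l3[addr]
            self._fill_l1(addr)
            return "L3"
        self._fill_l1(addr)                    # miss: memory -> L1 directly
        return "MEM"

    def _fill_l1(self, addr):
        if len(self.l1) >= self.l1_size:       # L1 full: oldest line spills to L2
            victim, _ = self.l1.popitem(last=False)
            self._fill_l2(victim)
        self.l1[addr] = True

    def _fill_l2(self, addr):
        if len(self.l2) >= self.l2_size:       # L2 full: oldest line spills to L3
            victim, _ = self.l2.popitem(last=False)
            self._fill_l3(victim)
        self.l2[addr] = True

    def _fill_l3(self, addr):
        if len(self.l3) >= self.l3_size:       # L3 full: oldest line is dropped
            self.l3.popitem(last=False)
        self.l3[addr] = True

h = VictimHierarchy()
for a in range(10):                            # stream 10 lines through L1
    h.access(a)
print(h.access(0))                             # -> "L2": 0 was evicted into L2
print(h.access(9))                             # -> "L1": 9 is still resident
[/code]

Streaming ten lines through a 4-entry L1 pushes the six oldest into L2, so touching line 0 again comes back as an L2 hit while line 9 is still in L1.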
 

Cazalan

Distinguished


Should be available by the end of 2015.

CPU + DDR4
dGPU + HBM

Unless you meant in a single APU in which case I'd say not until 2017/2018.
 


The branch predictor would detect that and the data would be kept relevant. If the data access pattern were obscure and random enough to screw with the branch predictor (very large databases do this), then we would have seen similarly poor results on the Intel CPUs. Intel CPUs utilize a different cache mechanism, whereby the contents of each lower-level cache are kept in the higher levels. The contents of L1 I/D are kept inside L2, and the contents of L2 are kept inside L3; this makes cache access and L1 context switching faster at the expense of a much smaller cache footprint (total amount of memory kept cached). Intel's predictor is so good that it rarely misses, and thus they opted for raw speed (low latencies) over a large footprint, while AMD took the exact opposite approach due to their weaker predictor.
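Same toy-model treatment for the inclusive scheme, again with made-up sizes (L2 omitted for brevity). Note that the unique footprint is only the L3's capacity, since L1 is duplicated inside it, versus L1 + L2 + L3 in the victim design above:

[code]
from collections import OrderedDict

class InclusiveHierarchy:
    """Toy inclusive hierarchy: anything in L1 must also be in L3, so
    evicting a line from L3 back-invalidates it from L1 as well."""
    def __init__(self, l1_size=4, l3_size=16):
        self.l1, self.l3 = OrderedDict(), OrderedDict()
        self.l1_size, self.l3_size = l1_size, l3_size

    def access(self, addr):
        if addr in self.l1:
            self.l1.move_to_end(addr)
            return "L1"
        if addr in self.l3:
            self.l3.move_to_end(addr)          # L3 keeps its copy (inclusive)
            self._fill_l1(addr)
            return "L3"
        if len(self.l3) >= self.l3_size:
            victim, _ = self.l3.popitem(last=False)
            self.l1.pop(victim, None)          # back-invalidate from L1 too
        self.l3[addr] = True                   # miss: install in every level
        self._fill_l1(addr)
        return "MEM"

    def _fill_l1(self, addr):
        if len(self.l1) >= self.l1_size:
            self.l1.popitem(last=False)        # L1 victim needs no spill:
                                               # its copy already lives in L3
        self.l1[addr] = True
[/code]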

Like I said, I would start looking elsewhere for answers. L3 (and L4 for IBM) does have a purpose in enterprise solutions where you're dealing with such large datasets that you need it. Think MS Exchange servers with thousands of users and hundreds of GB worth of mailbox data being accessed in completely random patterns. Or large Oracle / Sybase / Postgres databases with thousands of transactions per second. In a home environment, where you're playing one game with relatively small datasets, browsing facebook and doing content consumption, you just don't create the kinds of workloads where a large cache footprint is useful.
 

truegenius

Distinguished
BANNED


It's noob's quote.
 

Ah, my misquote.
 

truegenius

Distinguished
BANNED


* wrong
It is the amount of current which matters, because there is a certain amount of current needed to make a transistor switch its state.
Also, since heat is directly proportional to I^2, we see the heating effect, which means that "an electron strikes something and causes heating" is again wrong and plain stupidity in terms of physics, because they don't strike anything; an atom is almost empty space. So why do we experience heat? Because nothing is a perfect conductor, and that causes a loss of energy which we experience as heat or some other form of energy (more deeply, it is the amount of energy wasted, fighting the binding atomic forces and the current stable atomic state of the material/atom, to make an electron switch its place).
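To put rough numbers on the I^2 relation (the resistance and currents below are made-up illustration values, not figures for any real chip):

[code]
# Joule heating: P = I^2 * R, so dissipated power grows with the
# square of the current through a fixed resistance.
def joule_heat(current_a, resistance_ohm):
    return current_a ** 2 * resistance_ohm

R = 0.01                                  # ohms, assumed lumped path resistance
for i in (50, 75, 100):                   # amps of package current, illustrative
    print(f"I = {i:>3} A -> P = {joule_heat(i, R):.0f} W")
# 1.5x the current gives 2.25x the heat; 2x the current gives 4x.
[/code]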

So why do we need more voltage for more GHz? Because we need to provide more energy to overcome the resistance experienced by electrons during their movement through an imperfect conductor. This means we could run our CPU at an infinite clock with constant voltage (at the threshold voltage for a transistor switch) at 0 Kelvin, because at absolute zero any material becomes a superconductor.

Your statement can cause a misconception: people may think 1.5V will automatically make it run stably at a faster speed even if it is running at 100°C, compared to 1.3V at 50°C, but in reality we will see a less stable clock at increasing temps.

** Which is plain stupidity on the designers' part, because the material is also limited to a certain clock at ambient temps; designing a 10GHz chip is stupidity because the material can't do that without using tons of power or extraterrestrial cooling solutions.
It is as stupid as designing a 64-core CPU and expecting some single-core workload to automatically scale to 64 cores.
So why design a chip knowing that other things (material or workload) can't support that design? Because of marketing, and because they have no other ideas left.

And it does not explain binning, because binning happens within the same chip design.
It is not even related to the same core, because if it were, we would have had Llano at 3.7GHz stock instead of a 3GHz top.
Binning is related to defective chips or some defect on the chip,
like when they found the L3 cache was causing many problems and was not easily correctable, so they disabled the cache and shipped it as an Athlon II.



Not true, because if it were, we would be running IB at 7GHz on air compared to SB's 5GHz due to the smaller node, or Llano at 6GHz on air instead of Ph2 at 4GHz on air.
The flow of current happens almost instantly, you could say as fast as the speed of light.
We don't need to displace one electron from end A all the way to end B to get one electron out, because there are many other electrons waiting (between A and B) for energy to move forward.

So saying that a shorter distance means faster flow will create the misconception in people's minds that 22nm will automatically be faster than 32nm.



I was also thinking of talking about this, but then I decided not to engage in this discussion/topic.
 

Don't play this game, you will lose.

There is such a thing as simplifying something, especially complex subjects like classic electronics theory / QM. That's way outside the scope of this discussion and completely unrelated to the concept of fixed waiting times vs clock rates vs distance of circuit. The rest is just you making nonsensical statements because you're angry you got a PM while obviously trying to piss people off. I added a note tag just so anyone reading is clear that I've simplified an entire semester's worth of information.

#Note#
Electrons don't really move much inside a circuit; what's moving is the charge of the electrons as it's passed from one molecule to another down a conductor. And while this charge is moving at over 50% the speed of light, there is still considerable delay when it comes to generating and draining the circuit. I tacked that all into the same term "distance" because it's already a long discussion and it's an easy concept to grasp. I simplified it because, if people are more interested, they can go do the research and compare the engineering material to QM / classic theory. The abridged description suffices to explain why specific uArchs are designed for specific clock speeds, and arbitrarily adjusting them downward for comparison isn't very useful.
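To give a feel for the numbers (the 0.5c figure is from above; the 20mm path length is just an assumed worst case, not a measured value):

[code]
# Signal propagation vs clock period: the same physical distance costs
# a larger slice of each cycle as the clock rises.
C = 3.0e8                  # speed of light, m/s
signal_v = 0.5 * C         # charge propagation, roughly 50% of c as above
path_m = 0.02              # assumed 20 mm worst-case on-die path

t_prop = path_m / signal_v
for f_ghz in (3, 5, 8):
    period = 1.0 / (f_ghz * 1e9)
    print(f"{f_ghz} GHz: period {period * 1e12:.0f} ps, "
          f"propagation {t_prop * 1e12:.0f} ps = {t_prop / period:.2f} cycles")
[/code]

Which is why a uArch is laid out for a target clock range rather than an arbitrary one.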
 

truegenius

Distinguished
BANNED


And rendering it wrong or useless in the process.
there is a quote that says "Everything should be made as simple as possible, but not simpler"

And I listed what misconceptions these oversimplified things will create. Since you are a very knowledgeable person, people will believe you and may not do enough research to find the limitations (because the human mind chooses the simplest path it can find, and here your statement will work as a cached resource, so there is no need to go the long route to RAM (research)).
*Still waiting for an explanation of the limitations, or a "This is a/an [strike]very simplified[/strike] oversimplified version"*

And regarding that PM about how the 4100 was ahead of the 980/1100T: it was my question, because a few days ago I recommended that someone not go for BD, even the 8150, because it would be bad for gaming (or wouldn't be a considerable increase over his current Thuban).
I indeed suggested he either get an i5 + board, or overclock his current 955 as far as possible, or try the FX-8350 in that board, which people claim works with the latest beta BIOS without any issue.
So if the 4100 can beat the 980/1100T, I thought I had suggested the wrong thing and he should have got the 8150 or even the FX-4100, and this suggestion would make me feel guilty if I saw consistent wins for the 4100 over the 980/1100T.

This is the same as Juan saying the gaming dGPU will be dead by 2020,
which is wrong, especially with current techniques and in such a small time period, and I explained why: we can't make a 500mm^2 budget CPU/APU; we can't crossfire/combine multiple APUs for more performance in the budget segment, or even with a 1-2k budget (which is almost enough for max performance in current games at current resolutions); but we can use four Hawaii GPUs (4x438mm^2); we can upgrade per our requirements whenever we want; we won't be limited to whatever AMD's top APU will be; we won't need liquid nitrogen to keep it cool. Many limitations like that will stop this from happening.
And why did I tackle this statement of Juan's? Because when you visit the Systems section of the forum you will see "I want a future-proof rig", which means they may overclock, upgrade, crossfire/SLI, or add more RAM or storage with minimal or no switching of the components they buy now. So if you tell them that by 2020 there won't be dGPUs because APUs will be more powerful than dGPUs by then, what misconception will they develop from this statement of Juan's?
And your statements are also like this, creating misconceptions.
But since Juan provided a time period up to 2020, we can't argue much about this with Juan, because it's not 2020 yet and we can't predict the exact future.
 

juanrga

Distinguished
BANNED


Yes, this discussion was settled before the Kaveri launch; not sure why someone revives it again and ignores all that was said/demonstrated then.



No hard data about latency is available.

It is worth mentioning that HBM means High Bandwidth Memory and was mainly designed as a replacement for GDDR5 memory. Latency was not a priority for the designers. However, HBM modules would have lower latency than GDDR5 memory modules. The unknown here is the memory controller.

Note that HMC (Hybrid Memory Cube) has been designed to reduce latency compared to DDR. HMC has been designed for use with CPUs as well. But again, the HMC designers don't give us numbers.



This is AMD's plan for APUs:
[Image: 08.png]
 

Embra

Distinguished


Thank you Palladin, I appreciate the detailed explanation. :)

 

logainofhades

Titan
Moderator
I will just leave this here, regarding the L3 cache discussion.

[Image: AMD-Average-Gaming-Performance.png]


There is no other way to put it: the Athlon II X4 640 is simply outclassed. Worse, perhaps, is that, even cranked up to 3.6 GHz with a 2.4 GHz CPU-NB frequency, the Propus-based processor is not able to drive our games sufficiently well. We know that a lack of L3 cache stifled performance. After all, the stock 3.4 GHz Phenom II X4 fared significantly better.
 

juanrga

Distinguished
BANNED


Your ideas about electrons and heat are plain wrong. Especially the part where you say "wrong and plain stupidity in terms of physics, because they don't strike anything; an atom is almost empty space". But since this is off-topic I will stop here.

The part of your message about GPUs is, however, relevant, because it is about future AMD products, and I will address it.

You start by partially misquoting me. My claim was that discrete GPUs will be killed by about 2020. The article on APUsilicon gives some details on why this will happen.

You pretend that APUs are only for the budget segment, but they are not. In fact, our claim is that high-performance APUs will replace top discrete cards. You mention we cannot make 500mm^2 APUs, but the design that Nvidia engineers are working on occupies 650mm^2 on a 7nm node. AMD engineers don't give details about the size of their APU, but it would be similar.

Cross-firing will not help, because it will increase the problems described in the article.

It is funny that you claim we cannot combine different APUs to provide more performance, when this is exactly what will be done. The APUsilicon article gives a node representation with one central APU plus several assistant APUs.

This is about tech and the underlying physics. In former posts I discussed the economic reasons why discrete GPUs don't have a future and will disappear.
 

8350rocks

Distinguished

Interestingly enough... the 750K that so outclasses it also has no L3.
 

8350rocks

Distinguished


The issue with this speculation is two-fold:

1.) Whatever you can do with an APU die, you can get significantly more cores for GPGPU functions onto a dedicated GPGPU die. You can also decrease power consumption, if that is your design goal for this exascale push.

2.) Unless you are discussing using TSV interposers, or some other similarly absurd system for interconnects across all the APUs you claim will be used to make exascale computers, you are still not overcoming the issue of interconnects consuming the lion's share of the power. Even TSV interposers consume power, albeit less than normal interconnects.

The issue is, it would not be cost effective to try to mount multiple APUs + HBM/DRAM on interposers. Additionally, while you are reducing the power consumption of your devices, the interconnects still consume a few watts here and there for each link used to transfer data.

The biggest issue with bigger HPCs is not the power cost of the processing units themselves, but the interconnects to get all of them working together.

Horst Simon has given multiple presentations about this particular subject, and why it will not happen by 2020, or likely within a decade of that time frame:

http://www.top500.org/blog/no-exascale-for-you-an-interview-with-berkeley-labs-horst-simon/

There is a quasi-summary of one such presentation in interview-type form (Q&A).

Also, data movement will cost more than flops (even on the chip). Limited amounts of memory and low memory/flop ratios will make processing virtually free. In fact, the amount of memory is relatively decreasing, scaling far worse than computation. This is a challenge that’s not being addressed and it’s not going to get less expensive by 2018.

A relevant point above.
 


If we're going down that path, the FX-4350, which is based on the same architecture as the 750K, also shows a noticeable improvement when running at the same speed (4.3 GHz). The gap isn't as big, but it is still apparent.

Now I accept the theory that it *should* be unnecessary and shouldn't provide a benefit; however, it's also evident that it *does help*. Maybe it's a by-product of how games are being handled?
 

juanrga

Distinguished
BANNED


The relevant fight is FX-4350 vs Athlon X4 750k. Both are Piledriver; one has L3 and the other hasn't.

The standard 750k is clocked lower than the FX, but is very close in performance. More relevant is the overclocked 750k: at 4.3GHz the gaming performance of the Athlon is only 4% behind the FX despite lacking L3. The FX-4350 is clocked at 4.2/4.3GHz, thus the ~5% difference in performance that L3 brings is very far from the 40% that someone once dreamed of.

Similar benchmarks showing that the effect of L3 on gaming performance is almost unnoticeable (a 5% difference is within the margin of error of the benchmark and statistically insignificant) were provided in this same thread before the Kaveri launch, when a certain poster here 'predicted' that the Kaveri CPU was going to be 40% slower than Bulldozer in games because it lacked L3.
 

logainofhades

Titan
Moderator
Kaveri still isn't much of an upgrade for those still using Phenom IIs. IIRC, Kaveri finally managed to meet Ph II performance, even beats it at times. Sorry if I am unimpressed that it took AMD 5 years to finally compete with itself. Faildozer was worse, and to be honest, if not for the higher clock speeds and extra cores, Piledriver really isn't any better either. A 4.2GHz FX 4350 is only 4-5% better than a 4GHz Ph II X4 965. That is just pathetic. Anyone with a Ph II X4 or X6 would be wasting their money going to FX. Maybe if they have an AM3+ board that supports the FX 8320, but even then it depends on which Ph II they have. If they have a nicely clocked X6, it still wouldn't be worth it. I really hope AMD pulls out a winner with what they are working on now. I want to see some competition again.
 


For once, I agree with Juan. At similar clocks, the performance difference is negligible (<5%). Never mind that there are other functional differences between the FX and Athlon besides the L3 which would also tilt performance slightly.

For gaming, the L3 is mostly pointless. And again: APUs are memory bottlenecked anyway, making the L3 moot.
 


I never said it didn't give any improvement, I said it provided 10~15% in the best-case scenario. That's human binary thinking for ya: people immediately assume it must be 100% one way or 100% the other. The second set of benchmarks actually demonstrates exactly what I was saying. The 750K (4.3GHz) vs FX-4350 (4.2~4.3GHz) is 141/136 = a 3.67% difference, with the minimum being 6.99%. The minimum is where you're getting cache misses on L2, and that's where L3 would come in. The real problem is that L3 takes up nearly half the die, so you're sacrificing 5~15% performance for cutting the die size in half or adding a semi-powerful iGPU. So if you do want to make an SoC, then the first thing to get tossed is the L3 to make space.
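For anyone checking the arithmetic, using the 141 and 136 average-FPS figures quoted above:

[code]
fx_4350, athlon_750k = 141, 136               # average FPS from the charts above
gap = (fx_4350 - athlon_750k) / athlon_750k * 100
print(f"L3 advantage on the average: {gap:.1f}%")   # -> 3.7%
[/code]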
 

Cazalan

Distinguished
The reasons for no 20nm GPUs are more complex than the process itself.

Oracle apparently has the M7, a 20nm, 10B-transistor CPU currently in production: 32 cores, 8 threads each, for 256 threads/CPU, at ~3.6GHz.

https://www.semiwiki.com/forum/content/4300-tsmc-20nm-essentially-worthless.html

Which leads one to suspect the yields aren't so hot with that big a die. Oracle can eat that cost because their databases are licensed by the core.
 