AMD CPU speculation... and expert conjecture

Page 615
Status
Not open for further replies.

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


Nothing in that link contradicts what I said about games like BF4...
 

jed

Distinguished
May 21, 2004
314
0
18,780


These two most definitely say you are wrong; it didn't lose in either.

http://anandtech.com/print/8426/the-intel-haswell-e-cpu-review-core-i7-5960x-i7-5930k-i7-5820k-tested

http://www.tomshardware.com/reviews/intel-core-i7-5960x-haswell-e-cpu,3918-6.html
 

whyso

Distinguished
Jan 15, 2012
688
0
19,060


Did you actually look at that benchmark?

It's absolute garbage. Somehow a 12-thread CPU performs better with 12+ threads? Somehow going from 1 to 3 threads doesn't increase the score? The score jumps back and forth, with 4 threads scoring better than 6 threads.
 

truegenius

Distinguished
BANNED


I also noticed this memory benchmark inconsistency.
The latest version of AIDA64 shows 80-85% bandwidth efficiency (and a higher NB frequency yields higher bandwidth on my 1090T).
It also uses as many threads as there are cores and maxes out the CPU during the memory bandwidth benchmark (it uses only 1 core when measuring latency and L1 cache bandwidth and latency).
SiSoft Sandra and MaxxMem2 (multi-threaded version), on the other hand, use less CPU and give poorer results (60% efficiency) than AIDA64.
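
For reference, the efficiency figure is just measured bandwidth divided by theoretical peak. A minimal Java sketch of that arithmetic, assuming dual-channel DDR3-1600 and a made-up measured value (neither number is a reading from my system, and the class and method names are hypothetical):

```java
// Sketch: memory bandwidth efficiency = measured / theoretical peak.
// All inputs below are illustrative assumptions, not figures from this thread.
final class BandwidthEfficiency {

    // Theoretical peak in GB/s: channels * 8 bytes per 64-bit transfer * MT/s / 1000.
    static double peakGBps(int channels, int megaTransfersPerSec) {
        return channels * 8.0 * megaTransfersPerSec / 1000.0;
    }

    // Efficiency as a fraction of the theoretical peak.
    static double efficiency(double measuredGBps, double peakGBps) {
        return measuredGBps / peakGBps;
    }

    public static void main(String[] args) {
        double peak = peakGBps(2, 1600); // dual-channel DDR3-1600 (assumed)
        double measured = 21.0;          // hypothetical benchmark reading
        System.out.println("peak GB/s: " + peak);
        System.out.println("efficiency: " + efficiency(measured, peak));
    }
}
```

With numbers like these, a tool reporting around 60% would be leaving a large share of the theoretical bandwidth unmeasured, which is why the thread count each tool uses matters.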
 

bmacsys

Honorable
BANNED


Yuka, what is that screenshot from?
 

8350rocks

Distinguished


You are actually making assumptions there...

Your assumption is that a CPU can only run as many threads as it has cores/logical units. However, your CPU routinely runs more threads than that at any given time once you boot up Windows. What we are seeing in that benchmark is that the CPU continues to scale well because the threads used did not, in fact, stress a single core to 100% usage.

This is typical. Contrary to what many may think, your average modern CPU spends less than 10% of its entire life stressed above 80-90% usage, and even then, those are rare events.
 

whyso

Distinguished
Jan 15, 2012
688
0
19,060


Nope.

1) If your benchmark cannot stress one core to 100% under a load designed for 1 core, it's a pretty crappy benchmark. Every generally accepted benchmark aimed at the CPU (not cache, bandwidth, disk, etc. bound) should be able to stress the core to 100%. Even if I were to code something as ridiculously simple as 2345678!, the compiler will default to 1 core at 100% load. A perfectly threaded and written benchmark will have perfect scaling.

2) That doesn't explain why there is no gain from 1 to 3 threads.

3) There should be no significant thread regression which is seen between 4 and 6 thread loads.

The benchmark doesn't appear to be compiled correctly. Scores don't scale properly. That or the tester was doing something else CPU heavy at the same time.
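
To make points 2 and 3 concrete, here is a rough Java sketch that computes scaling efficiency per thread count and flags regressions. The scores are invented for illustration (they are not the numbers from the benchmark being discussed), and `ScalingCheck` is a hypothetical helper, not part of any benchmark suite.

```java
import java.util.ArrayList;
import java.util.List;

final class ScalingCheck {

    // Efficiency relative to perfect linear scaling from the first data point:
    // (score[i] / score[0]) / (threads[i] / threads[0]). 1.0 means perfect scaling.
    static double efficiency(int[] threads, double[] scores, int i) {
        return (scores[i] / scores[0]) / ((double) threads[i] / threads[0]);
    }

    // Returns the thread counts at which the score dropped below the previous run.
    static List<Integer> regressions(int[] threads, double[] scores) {
        List<Integer> out = new ArrayList<>();
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] < scores[i - 1]) {
                out.add(threads[i]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] threads = {1, 2, 4, 6};
        double[] scores = {100, 195, 380, 350}; // hypothetical scores
        for (int i = 0; i < threads.length; i++) {
            System.out.println(threads[i] + " threads: efficiency "
                    + efficiency(threads, scores, i));
        }
        System.out.println("regressions at: " + regressions(threads, scores));
    }
}
```

A well-behaved CPU-bound benchmark should keep the efficiency close to 1.0 up to the core count and return an empty regression list.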
 

Ags1

Honorable
Apr 26, 2012
255
0
10,790
Headline Benchmark is Java based, so the Java JIT compiles it on each tested system. I trust the Java compiler to be fairly optimised for each platform, so I think it is a reasonably fair test. With Java you have limited control of threads. Windows in particular has a tendency to schedule threads from the same app to the same core, which explains why the performance does not scale linearly with threads. To work around this problem, and also to squeeze CPU time away from background threads, I spam the CPU with more threads than available cores. On Windows performance peaks when the thread count exceeds the core count, while on Linux performance tends to max out when the thread count equals the core count.
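
A rough sketch of that kind of thread sweep, for anyone who wants to try it. This is not the actual Headline Benchmark code; the workload, the `ThreadSweep` class name, and the iteration counts are placeholders made up for illustration. It splits a fixed amount of CPU-bound work across N threads, for N from 1 up to twice the reported core count, and prints the throughput for each N.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ThreadSweep {

    // Placeholder CPU-bound workload: a simple integer mixing loop.
    static long spin(long iterations) {
        long x = 1;
        for (long i = 0; i < iterations; i++) {
            x = x * 6364136223846793005L + 1442695040888963407L;
        }
        return x;
    }

    // Splits totalIterations evenly over nThreads; returns iterations per millisecond.
    static double throughput(int nThreads, long totalIterations) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            long perThread = totalIterations / nThreads;
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int i = 0; i < nThreads; i++) {
                tasks.add(() -> spin(perThread));
            }
            long start = System.nanoTime();
            long sink = 0;
            for (Future<Long> f : pool.invokeAll(tasks)) {
                sink ^= f.get(); // consume results so the work cannot be optimised away
            }
            long elapsedNanos = Math.max(1, System.nanoTime() - start);
            if (sink == 42) System.out.print(""); // keeps sink observably used
            return (perThread * nThreads) / (elapsedNanos / 1e6);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        throughput(cores, 50_000_000L); // warm-up so the JIT compiles spin() first
        for (int n = 1; n <= 2 * cores; n++) {
            System.out.println(n + " threads: " + throughput(n, 200_000_000L) + " iterations/ms");
        }
    }
}
```

The JVM leaves thread placement to the OS scheduler, so where the curve peaks relative to the core count depends on the platform it runs on.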
 

8350rocks

Distinguished




As Ags pointed out, it is the scheduler that is the culprit :) I should have delved a bit further, but I did not realize we were not on the same page :)
 

whyso

Distinguished
Jan 15, 2012
688
0
19,060


Interesting and insightful. Thank you.

However the scores should not be bouncing around like that. Something is happening when it should not be.

Edit: 8350rocks beat me to it.
 


That is RAGE using 2 monitors: 1 with the game, the second with Process Explorer to monitor performance in real time so I could tweak further. In particular, that setup was: Phenom II 965BE C stepping @3.9GHz and a 4890 @950MHz IIRC. I'm using the same 4x4GB DDR3-1600 CL8 with the i7 2700K.

I haven't done that with the i7 yet, but I have some screenshots testing video decompression.

Cheers!
 

jed

Distinguished
May 21, 2004
314
0
18,780


That's pretty much no contest: for encoding and content creation the 5960X reigns supreme.

 

blackkstar

Honorable
Sep 30, 2012
468
0
10,780
@8350rocks, compiler tweaking can make ridiculous speed ups or it can do hardly anything. But I think in situations when I am playing with compiler flags and not getting much (like x264) there's a lot of inline assembly anyways. Blender is a best case scenario I've found. LAME was extremely good speed up too, which is fun because MP3 encoding is something AMD usually gets trashed in benchmarks in. I just don't care about ripping CDs so I don't make a fuss about it.

I am putting together a 4 socket 6 core K10 Opteron rig pretty soon. I just custom fabbed a case for it. While I was researching if it was worth it for me, I noticed 8350 and Phenom x6 are really close in CB r10 yet 8350 walks away in r11.5. It is probably because both systems are running basic x86 instructions. It's going to get Gentoo and probably Windows 7 64 bit too.



Along with x264, the Java runtime was the most disappointing recompile I ever did in Gentoo. In fact, I'd say the Java runtime was even more disappointing, because x264 was maybe 5% or 10% faster but Java didn't really go anywhere. But I did build the Java runtime with -O2 instead of -Ofast. It also took forever to compile so I didn't play with it too much. But from my experience, Java runtimes are pretty well optimized via the compiler. My theory is that Java JIT is slower than native code, so they compile Java runtime components as aggressively as possible to make up for it.

I realize people usually dislike Java quite a bit, but if something is going to be fair toward Piledriver when it's fairly new (I compared binary Linux, Gentoo custom, and Windows in the HWBOT java test that AMD destroys in and saw no difference) I have to give it credit.

I'd have to look into it a bit more (with my new 24 core system I can use distcc, which sends compile jobs to that computer over the network, compiles them, and sends them back, so I can compile things crazy fast on any computer that has Gentoo installed) but I actually have a lot of faith in Java to support CPU features much earlier than everyone else.
 
In any case, my point is that the eight-core processors from AMD, including the new FX models presented last month, are a more powerful choice over the FX-6000 series for well-threaded games. I cannot say the same of Intel's eight-core Haswell, which loses to the six-core Haswell.

And why does that occur? Because AMD's lack of single-core performance NECESSITATES that design choice. Intel's doesn't; it performs the same even if you go the heavy thread route, because its stronger cores are powerful enough to deal with the workload.

The argument isn't "AMD gains performance with more cores"; it should be "AMD loses performance with fewer cores". That's why, even in titles that go beyond 2-3 cores, Intel doesn't see any gains from adding more cores. That's why FX-8350 ~ i7 2600k.

Stuff that is compiled through MSVC and other compilers is really pretty poorly optimized because the target is a general group, and not a specific set of hardware. If we had something setup to run optimized for both architectures (SSE4, AVX, etc.) then things would be interesting to say the least.

By default, with standard optimizations, MSVC compiles against SSE2 in all cases. That being said, assuming O2 level optimizations (standard release config), it should plop in various SSE/AVX optimizations on its own. And last time I saw a comparison, unoptimised MSVC = Optimised GCC in terms of performance.

Did you actually look at that benchmark?

It's absolute garbage. Somehow a 12-thread CPU performs better with 12+ threads? Somehow going from 1 to 3 threads doesn't increase the score? The score jumps back and forth, with 4 threads scoring better than 6 threads.

As Ags pointed out, the problem is that in Java there's no easy way to lock threads to cores. So depending on how the OS scheduler is handling things, it's possible one core could get loaded with two threads, which is not what the intent is, obviously. It's outside developer control, and there's really no way to handle that problem in Java. A case like this is one of the few times it makes sense to bind threads to a particular core.

BTW Ags, here's a suggestion: You can't bind a thread to a core, but you should be able to check what core a thread gets loaded on. You might be able to do a loop where you create a thread, and if it gets put on a core that already has one of the CPU benchmark threads on it, delete the thread and try again. Once you get all the threads loaded properly, you start them at the same time. Either that, or you'd need a lot of OS-specific code.

Point is, it's a limitation of Java. The one-thread and max-thread cases should be valid; it's the others that should be ignored.

And yes, this highlights the problems you can run into when threading. You REALLY have to test a bunch of cases to figure out what the best-case/worst-case outcomes are.
 

I don't think that would work, because the NT scheduler is constantly task switching. NT doesn't have a thread homing feature, so whenever a thread gets task swapped, there is no guarantee where it's going to land. This is actually a problem with the NT scheduler in general: it will attempt to dynamically load balance all threads uniformly across all targets, which results in threads constantly playing musical chairs between cores.

I got to experience that this weekend playing XCOM: Enemy Unknown. Not a very intense game but extremely fun; I was playing it on the living room's 6800K. It got extremely hot, so I started doing performance analysis and noticed that NT was constantly shuffling the load, resulting in all four cores being pegged at 4.1GHz all the time instead of letting some idle down to 1.6 or 2.0. I've seen a similar effect on my FX-8350.

It's kind of a philosophical question. From the OS's point of view, it can't be sure what the projected processor utilization will be, so it tries to maintain maximum available capacity by evenly loading all cores, even at the expense of efficiency. From the user's perspective, I know what the projected processor utilization will be, so I would prefer to get maximum efficiency.

Also, for those that criticized my decision to purchase the A8-7600, the heat issue is exactly why. The low profile cooler is only rated for 65W, while the 6800K can output well beyond that when it's run at full load. There isn't enough vertical space to install a cooler capable of quietly removing more than 65W, and prior to the SR 7K APUs there wasn't a sufficiently powerful APU in the 65W range.
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


Did you even bother to read what I wrote?

I used Tom's review previously. It shows a configuration where the 8-core Haswell loses to the 6-core Haswell on BF4.

Anand used an SLI configuration where the 8-core Haswell ties with the 6-core Haswell.

67053.png


Thus, contrary to your claim, both reviews confirm what I wrote about BF4:



I will not insist further on this particular issue.
 

juanrga

Distinguished
BANNED
Mar 19, 2013
5,278
0
17,790


It goes in the other direction really: AMD decided to go for a moar cores approach, sacrificing single-thread performance. An FX-8350 competes with an i7-2600k/3770k in aggregate throughput because AMD aimed for the correspondence

[ 1 module ] <---> [ 1 core + HT ]

I.e.

FX-8000 <---> i7
FX-6000 <---> i5
FX-4000 <---> i3

Despite early mistakes when developing Bulldozer, the current strategy of forcing new game engines to go wide by using 8 weak cores in consoles has proven to be a fantastic plan. Old FX-6000/8000/9000 CPUs are receiving free performance gains with the new games, whereas the new 8-core Haswell loses to the six/quad-core Haswell. Anand shows the expensive 8-core Extreme losing to the ordinary 4770k even on BF4.

Intel is now stuck in a corner for desktop/gaming: engineers cannot improve IPC fast enough, DX12/Mantle/OGL-Next will reduce their CPU advantage and, at the same time, "moar cores" doesn't work for them except in a few niche cases: 16 threads is not average Joe's software.
 


Do you even read what you write anymore?
 

colinp

Honorable
Jun 27, 2012
217
0
10,680
Pointless. The 8370E was potentially the only interesting one, but performance is poor and the power consumption is nothing to write home about. So much for a supposedly maturing 32nm process leading to a lower-TDP chip outperforming the old flagship.
 