AMD's Future Chips & SoC's: News, Info & Rumours.

Status
Not open for further replies.


I know that bolded part. It just corresponds to this part of my earlier claim:

The reason why most submissions use ICC is because it is almost guaranteed to provide the highest scores due to extracting more performance from the hardware.

I will say it again. The problem is not that ICC provides higher scores than GCC; we have known this for many years. The problem is that the gap between GCC and ICC is not 43%, and the scores that AMD estimated for Xeons are all wrong.



Irrelevant. No one is objecting to the flags used during compilation. The problem is that the scores AMD assigns to the competition's chips don't correspond to reality.

NOTE:
On the first page of this PDF you can find a figure with corrected SPECint scores. You can see the gap is estimated to be 10%. In more recent times the gap is usually considered to be 15%. It is not the 43% that AMD claims.
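To make the size of the disagreement concrete, here is a minimal sketch of how a reported ICC score shrinks under each of the three gaps mentioned in this thread. It assumes the gap is applied as a simple fractional reduction (the wording of the Linley quote, "10% less than reported ICC performance"); the score of 100 and the function name `gcc_estimate` are hypothetical and not taken from any published result.

```python
def gcc_estimate(icc_score: float, gap: float) -> float:
    """Scale a reported ICC score down by a fractional gap (0.10 means 10% lower)."""
    if not 0.0 <= gap < 1.0:
        raise ValueError("gap must be in the range [0, 1)")
    return icc_score * (1.0 - gap)


# Hypothetical reported ICC score, used only for illustration.
icc_score = 100.0

# The three gaps discussed in the thread: 10%, 15% and AMD's 43%.
for gap in (0.10, 0.15, 0.43):
    estimate = gcc_estimate(icc_score, gap)
    print(f"gap {gap:.0%}: estimated GCC score {estimate:.1f}")
```

With these assumptions, the same published number ends up roughly a third lower under the 43% gap than under the 10% or 15% gaps, which is why the choice of gap dominates the comparison.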
 


This is the specific quote (bolded mine): "Intel GCC performance is generously estimated as 10% less than reported ICC performance. (Source: vendors, except *The Linley Group estimate)".

So they're also implying it's more than 10%, and the word "generously" implies it might be a lot more than 10%. Plus, I guess you aren't taking into account this piece of information from the previous links you sent (bolded mine):

"4.3 Compiler
The names and versions of all compilers, preprocessors, and performance libraries used to generate the result."

It's not just the main binaries being discussed; it's the whole benchmark ecosystem. Plus the kernel, I would imagine, plus the libc and performance libraries used. Also take into account that GCC still has to be tuned for the Zen uArch, and that the kernel's drivers are still immature. I'm pretty damn sure that 43% is reasonable when you take into account *everything*, rather than simplifying it to "it's just the binary that ran".

I won't deny that it is an over-speculative world, but that's how marketing works.

And compiler flags are *very* relevant. If you think they're not, you're oh so mistaken.

Cheers!
 


They could also be implying that they are generously estimating a 10% gap when it is actually less than 10%. But we are departing from the main topic: (i) AMD is giving estimates of Xeon performance instead of giving measured performance, and (ii) the 43% gap they are using in those estimates is not realistic.
 


libquantum alone is worth ~15% of blue-sky score on Intel CPUs, a gain you will never see in real-world use. Using ICC is easily another 10-15%, maybe more. 10% grossly undervalues the performance gap, and everyone knows it.

SPEC has even recently disallowed libquantum altogether because it skews their benchmarks so much it is unreal.
 


I know the problem with libquantum and how it inflates scores by about 10-20%. I also know ICC inflates scores on both AMD and Intel hardware. My predictions for SPECint on RyZen avoided libquantum:

http://www.realworldtech.com/forum/?threadid=165321&curpostid=165505

and you can check that my prediction was recently confirmed: "So the SPECint value for Ryzen without libquantum happens to resemble your prediction".

I also know other compilers have cracked libquantum, including the compiler AMD has developed:

http://developer.amd.com/tools-and-sdks/cpu-development/x86-open64-compiler-suite/

Thus, when looking at EPYC SPEC2006 scores that use the Open64 compiler, don't forget to subtract the libquantum subscore.
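Since a SPEC CPU2006 composite is a geometric mean of the per-benchmark ratios, "subtracting" libquantum in practice means recomputing that mean without the 462.libquantum ratio. A minimal sketch follows; the benchmark names are real SPECint2006 components, but every ratio below is a made-up placeholder and not a measured result for any chip.

```python
import math


def geomean(values):
    """Geometric mean of a collection of positive numbers."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-benchmark ratios; only a few SPECint2006 components are
# listed and the numbers are placeholders, not published scores.
ratios = {
    "400.perlbench": 40.0,
    "403.gcc": 45.0,
    "429.mcf": 50.0,
    "462.libquantum": 400.0,  # the "cracked" subtest, inflated by the compiler
    "471.omnetpp": 30.0,
}

with_libquantum = geomean(ratios.values())
without_libquantum = geomean(
    ratio for name, ratio in ratios.items() if name != "462.libquantum"
)

print(f"composite with libquantum:    {with_libquantum:.1f}")
print(f"composite without libquantum: {without_libquantum:.1f}")
```

Because the mean is geometric, one outsized ratio lifts the whole composite, and removing it pulls the score back toward what the remaining subtests show.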
 
Dell EMC Adds AMD EPYC™ Processors to the World’s Bestselling Server:

https://blog.dellemc.com/en-us/poweredge-servers-amd-epyc-processors/
 


ICC still gives AMD a larger performance boost than GCC does.
 


Right, it is also true for the Microsoft compiler.

[Image: MSVC-Intelcpp-f06d3d41b1061f65.png]


Both RyZen and Bristol Ridge get the highest performance with the Intel compiler.
 
Hey guys, check this out: Ring Architecture vs. Mesh Architecture vs. Infinity Fabric...

As you may well know, Intel is replacing the ring architecture, found up to Broadwell I think, with a new (loose term) Mesh Architecture.
Intel Introduces New Mesh Architecture For Xeon And Skylake-X Processors:
http://www.tomshardware.com/news/intel-mesh-architecture-skylake-x-hedt,34806.html

The Ring Architecture was getting far too complicated and adding latency, especially as the core count increases.
The main difference between Mesh and Ring is that Mesh can route info from any core to any core across a grid, as opposed to routing it around the ring of cores.

This all sounds great in theory, but in practice it does not appear to be faster at all. Maybe some fine tuning is required, I guess, but anyhow:
Why is a 10-core Skylake part losing to an 8-core Broadwell? It seems they are getting negative results with their new Mesh Architecture:
https://youtu.be/uznhbRXvNIY?t=346


Granted, Intel has better IPC, but it seems that AMD has a better interconnect with the Infinity Fabric.
Either way, as the core count increases, Intel's advantage decreases.

And now that the Core Wars have begun, the Fabric, which seemed to many at the beginning to be a major drawback, looks like it could possibly be a major advantage...

It doesn't seem like they had planned on changing to the Mesh Arch (anytime soon, or at all for that matter) until Zen came along!
Even though they already had the IP for it. But now, as the core count increases, it appears they have no choice.
It kind of looks like they pulled out this Mesh Interconnect, blew the dust off it and stuck it into Skylake a little too hastily, as it does not appear to be faster, well, not at this point in time anyway.

Will this allow them to implement a multi-die solution like AMD? I can't help wondering whether this is one of the reasons why it is being used.

 


Except it is faster... Latency increases linearly with N for rings and as sqrt(N) for the mesh. Manycore chips have been using meshes for years, including Intel's Xeon Phi. Rings and buses are only viable for small multicores because they don't scale up.
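The N versus sqrt(N) scaling can be checked with a small sketch that counts average hops between stops. It assumes an idealized bidirectional ring and a square 2D mesh with shortest-path routing, and it uses hop count as a stand-in for latency; real parts differ in layout, clocks and contention, so treat the numbers as a trend only. The function names are hypothetical.

```python
from itertools import product


def ring_avg_hops(n: int) -> float:
    """Average shortest-path hop count between distinct stops on a bidirectional ring."""
    # By symmetry it is enough to measure from stop 0 to every other stop.
    return sum(min(d, n - d) for d in range(1, n)) / (n - 1)


def mesh_avg_hops(side: int) -> float:
    """Average Manhattan distance between distinct nodes of a side x side mesh."""
    nodes = list(product(range(side), repeat=2))
    total = 0
    pairs = 0
    for a in nodes:
        for b in nodes:
            if a != b:
                total += abs(a[0] - b[0]) + abs(a[1] - b[1])
                pairs += 1
    return total / pairs


for side in (2, 4, 6, 8):
    n = side * side
    print(f"N={n:2d}  ring: {ring_avg_hops(n):5.2f} hops  mesh: {mesh_avg_hops(side):5.2f} hops")
```

Under these assumptions the two topologies tie at 4 stops, and the ring falls further behind as N grows, which matches the point that rings only suit small core counts.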



Not even close. The mesh approach is better because it is more distributed than IF. Quoting PCPER:

Negligible latency differences in accessing different cache banks allows software to treat the distributed cache banks as one large unified last level cache. As a result, application developers do not have to worry about variable latency in accessing different cache banks, nor do they need to optimize or recompile code to get a significant performance boosts out of their applications.

Anyone remember the 'optimizations' that have to be made in games to account for the huge inter-CCX latency penalties in AMD's approach? There are even higher penalties for die-to-die communication on ThreadRipper and EPYC. I guess inter-die cache access latencies must be about one order of magnitude higher, but don't take my word for it.



Intel research on this stuff began about 10 years ago

https://www.pcper.com/reviews/Processors/Intels-80-Core-Terascale-Chip-Explored-4GHz-clocks-and-more?aid=363

They developed several research chips and started using a mesh interconnect in commercial products with the Xeon Phi KNL line. We have known for a while that Xeon Phi and ordinary Xeon are converging with Skylake. It doesn't have anything to do with Zen.

It doesn't have anything to do with multi-die approaches. In fact all the designs from the original research chips to the current commercial implementations are single die.


 


If I understand things correctly, the Ring bus grows less efficient with added cores as the distances and "stops" along the way increase. So the decision to go with Mesh had more to do with the core count, which depended on process architecture for the size, and so on. A better question would be what sort of latencies and performance would that 10 core part have had if it continued to utilize a Ring bus. You have to transition somewhere, 10 might seem too soon, but 12 might be too late.
 


Most people doing that sort of work on a prosumer station are going to be using GCC or MSVC. ICC is not popular among open source programmers, or DIY web developers.
 


This is expected. On the OS side, more cores to manage makes the scheduler take slightly longer to schedule threads. For workloads that do not scale to all CPU cores, additional cores will incur a slight performance penalty.

On the hardware side, the ring bus offered good latency for adjacent cores, but becomes unmanageable as core count increases. The mesh does better in this case, though adjacent cores will have slightly increased latency.

Adding unused hardware has negative performance impacts, simple as that.
 


Linux devs tend to use GCC for everything, despite the fact I personally find it lacking as an IDE. To each his own I guess.

Everyone else develops within MSVC because, as far as IDEs go, nothing even remotely comes close. Most compile the final release with MSVC too unless they have to support a non-Windows architecture, in which case the code is just shoved into GCC, usually with little more than -O2 set.

ICC certainly offers the best performance by far, but it's not really used much unless there's an absolute need for performance, simply due to needing to rip out all the GCC/MSVC extensions which creep into the code during development. It's simply too much effort to support.
 
Here's another example of Skylake underperforming!
This backs up what we've already seen in the PCWorld bench.

It is also a multi-threaded benchmark.
It could be down to the reduced level 3 cache.
Check it out here:

https://youtu.be/3VV_It9kdrs?t=559

Also, the Ryzen 7 1800X and 1700X beat the 8-core 7820X hands down despite much weaker IPC. So what's going on here? Intel beaten on multi-threading, core for core!
 


It's taken from an AnandTech (controlled test) bench. I'm afraid I don't see what the problem is with it; people post YouTube reviews all the time on this site.

But here is a direct link to the AnandTech review benchmark. It's the same result.
Please scroll down to the DigiCortex 1.20 bench:
http://www.anandtech.com/show/11550/the-intel-skylakex-review-core-i9-7900x-i7-7820x-and-i7-7800x-tested/10

The Ryzen 1800X and 1700X beat the 7820X, and Ryzen 5 beats the 7800X.
Also, Broadwell-E beats Skylake in this benchmark.

 


OMG, Intel loses in one benchmark. Possibly due to HTT being weaker than AMD's SMT implementation, or due to cache layout differences.
 
It's interesting all the same, as it is a very unexpected result!
They lose, and they lose again against their previous generation too. It's the smaller cache, or maybe it's their multi-threading...

It should not be happening, as they have a significant IPC advantage over Ryzen. This is a surprising result, would you not agree?
And it shouldn't be happening against Broadwell either, for that matter.

I'm also sure it's not the only benchmark it is happening in.

It is what it is !
 


IPC isn't a flat constant; it varies by workload. And a LOT of little things, from Turbo Boost to inter-core communication, affect IPC. In this case, you possibly have a benchmark that is really sensitive to L3 cache access speeds, in which case older Intel CPUs could end up being slightly faster. Or maybe the benchmark suppresses Turbo Boost, and baseline CPU speed matters more.

Point being: You almost never have a case where a new CPU is universally faster across every possible benchmark scenario. Yes, Skylake is about 8-9% faster per clock on AVERAGE, but in certain specific workloads, I could easily see cases where Broadwell remains faster.

What we have in the review is a suite of benchmarks, and only ONE shows odd results, likely due to changes to CPU cache topology. I see nothing here; you're trying to make a mountain out of a molehill.
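To illustrate how an average per-clock gain and a single regression can coexist, here is a minimal sketch. The workload names and speedup figures are entirely hypothetical, chosen only so that the average lands in the same ballpark as the 8-9% mentioned above; they are not taken from the review.

```python
import math

# Hypothetical new-vs-old per-clock speedups for five workloads.
# A value above 1.0 means the newer core is faster on that workload.
speedups = {
    "workload_a": 1.12,
    "workload_b": 1.10,
    "workload_c": 1.15,
    "workload_d": 1.09,
    "cache_sensitive": 0.95,  # the one outlier that favours the older core
}

# Geometric mean is the usual way to average ratios.
mean_speedup = math.exp(
    sum(math.log(s) for s in speedups.values()) / len(speedups)
)
regressions = [name for name, s in speedups.items() if s < 1.0]

print(f"average speedup: {mean_speedup:.3f}x")
print(f"workloads where the older core wins: {regressions}")
```

The average stays comfortably above 1.0 even though one workload regresses, so a single odd result says little about the suite as a whole.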
 


Ah, then you're going to start saying Software will fix its issues! Blasphemy! We all know if you fail on day 1, you fail forever!

Cheers!

PS: Yes, that was sarcasm.
 

Tbf, it's new and different, and requires optimization much like Ryzen. AdoredTV has a video on why; 'review roundup', IIRC.

 




First, performance and IPC aren't constants. Both vary from workload to workload, because no microarchitecture can be good at everything. Some workloads have conflicting requirements, i.e., one requires one thing and another requires the opposite; when you optimize your microarchitecture to work better on one of those workloads, it starts working worse on the other.

Second, I don't see the issue here. Skylake-X is significantly faster on four of the five benches in that link. Overall it is a clear win. Moreover, analysis of the discrepancy shows it doesn't have anything to do with mesh latencies or multithreading or anything like that. This bench seems to be sensitive to L3. Skylake-X has less L3, and it is a victim cache instead of an inclusive one, and this affects performance here.

Third, similar behavior has been observed in other systems, where a new muarch wins in most cases but loses in a couple of them. Or did we forget those special situations where Excavator was slower than Steamroller, or Steamroller was slower than Piledriver?
 


Intel confirms to Tom's Hardware that Mesh is having a negative effect on Skylake results:
http://www.tomshardware.com/reviews/intel-core-i9-7900x-skylake-x,5092-2.html
 