Intel Silvermont Architecture: Does This Atom Change It All?


DjEaZy

Distinguished
Apr 3, 2008
[citation][nom]maddoctor[/nom]Yes, you are right that is why I like you because you are always using Intel's product and even you have not and vow to not buy any AMD based product in your lifetime. Even I'm using Intel Pentium based notebook because it's better than unreliable buggy crap like current AMD A* processors. That is why when I'm looking for notebook, I will look for Intel's blue sticker in the laptop, because it always superior.[/citation]
... how old are you?
 

Bloob

Distinguished
Feb 8, 2012
So, Java and C# apps would work fine on an Intel SoC, but apps using native languages would at least have to be recompiled and distributed as a separate bundle from the ARM ones. Too much hassle for not enough customers. I guess Ubuntu or Win 8 is Intel's best bet.
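
Roughly what I mean by the recompile-and-bundle hassle (the snippet below is only an illustrative sketch, not any particular app's code): the bytecode of a Java/C# app runs anywhere, but anything like this has to be built once per ABI and shipped in both the ARM and the x86 bundles.

[code]
/* Illustrative only: native code is compiled per-ISA, so an app that
 * contains something like this needs a separate build for ARM and for
 * an Intel (x86) SoC, while Java/C# bytecode does not. */
#include <stdio.h>

static const char *build_target(void)
{
#if defined(__arm__) || defined(__aarch64__)
    return "ARM build";              /* shipped in the ARM bundle */
#elif defined(__i386__) || defined(__x86_64__)
    return "x86 build";              /* shipped in the Intel SoC bundle */
#else
    return "unknown ISA";
#endif
}

int main(void)
{
    printf("%s\n", build_target());
    return 0;
}
[/code]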
 

Fokissed

Distinguished
Feb 10, 2010
[citation][nom]de5_Roy[/nom]bulldozer!.. is the first thing came to my mind when i started reading about the cores. but it's not exactly like bd, it's different. still.. it made me chuckle. amd deserves the credit.i wonder if future intel cpus ($330+ core i7) will have the same core system instead of htt.... edit2: rodney dangerfield FTW! \o/[/citation]
The shared L2 cache arrangement is exactly what Kentsfield did, and given that Conroe was released five years before Bulldozer, I think Intel deserves the credit.
 

InvalidError

Titan
Moderator
[citation][nom]de5_Roy[/nom]i wonder if future intel cpus ($330+ core i7) will have the same core system instead of htt....[/citation]
Why would Intel ever drop HyperThreading? It provides ~30% extra performance for ~5% more power and transistors by enabling more efficient use of existing computing resources in heavily threaded code. It is one of the most cost-effective and power-efficient tweaks in modern CPUs and GPUs.

My bet is that it will go the other way around: Intel putting HT in Atoms and increasing the number of threads per core on i7/Xeon. I bet Atom will get HT within the next two years.

Extracting more performance out of a given core's execution units per clock through simultaneous multi-threading is much simpler than extracting more performance out of a single instruction stream with deep out-of-order execution. AMD and ARM will likely get on-board the SMT train eventually.
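
If you want a rough feel for that number yourself, you can time one busy thread alone and then two busy threads pinned to the two logical CPUs of the same physical core. A minimal Linux sketch (it assumes logical CPUs 0 and 1 are SMT siblings, which is not true on every machine - check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list first - and the exact gain will vary a lot with the workload):

[code]
/* Rough HT-gain estimate: compare throughput of 1 thread vs 2 threads
 * pinned to sibling logical CPUs of one core. Build with: gcc -O2 -pthread */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>

#define ITERS 200000000UL

static void *spin(void *arg)
{
    long cpu = (long)arg;                 /* 0 or 1: assumed SMT siblings */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    volatile unsigned long x = 1;         /* simple integer ALU work */
    for (unsigned long i = 0; i < ITERS; i++)
        x = x * 2654435761UL + 1;
    return NULL;
}

static double run(int nthreads)
{
    struct timespec t0, t1;
    pthread_t tid[2];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, spin, (void *)i);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    double one = run(1), two = run(2);
    /* 0% = the sibling added nothing; 100% = it behaved like a full core. */
    printf("1 thread: %.2fs, 2 threads on one core: %.2fs, HT gain: ~%.0f%%\n",
           one, two, (2.0 * one / two - 1.0) * 100.0);
    return 0;
}
[/code]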
 

ojas

Distinguished
Feb 25, 2011
I had to use IE9's compatibility mode to finally log in to this page. Chrome and IE9's normal modes wouldn't work.

The site needs serious work; it's like Win 8 currently: change for change's sake, with core functionality completely disrupted.

Anyway, on topic, you might also want to read AnandTech for more detail on this. He's also covered graphics.

On Jaguar (my thoughts): its power isn't low enough for phones yet. Tablets, yes, but not phones. Also, having solid GPU performance is fine, but the CPU needs to be there too. Remember, this is mobile; efficiency is paramount. And you're not going to see 170 GB/s on a standard Jaguar-containing SoC.

With Merrifield/Bay Trail, I think 10-20 GB/s memory bandwidth might be a reality. With Broadwell, Intel might end up bringing Core to Atom more directly and integrating it with the main tick-tock cycle... ARM will be on 20nm when Intel's going to be doing 14nm... and when Skylake hits, ARM will still be on 16nm at most.

2014 marks the beginning of the end of ARM's domination in mobile. From 2015 onwards, x86 - both Intel and AMD - will have a solid lead. I think the only real competition will be from Nvidia, though I do expect Qualcomm to fight back. Samsung and Apple will probably exit the market and use Intel's stuff.

Anyway, AT link:
http://www.anandtech.com/show/6936/intels-silvermont-architecture-revealed-getting-serious-about-mobile

What we know about Baytrail is that it will be a quad-core implementation of Silvermont paired with Intel’s own Gen 7 graphics. Although we don’t know clock speeds, we do know that Baytrail’s GPU core will feature 4 EUs - 1/4 the number used in Ivy Bridge’s Gen7 implementation (Intel HD 4000). Ultimately we can’t know how fast the GPU will be until we know clock speeds, but I wouldn’t be too surprised to see something at or around where the iPad 4’s GPU is today.
Intel’s eDRAM approach to scaling Haswell graphics (and CPU) performance has huge implications in mobile. I wouldn’t expect eDRAM enabled mobile SoCs based on Silvermont, but I wouldn’t be too surprised to see something at 14nm.
On single threaded performance, you should expect a 2.4GHz Silvermont to perform like a 1.2GHz Penryn. To put it in perspective of actual systems, we’re talking about around the level of performance of an 11-inch Core 2 Duo MacBook Air from 2010. Keep in mind, I’m talking about single threaded performance here. In heavily threaded applications, a quad-core Silvermont should be able to bat even further up the Penryn line. Intel is able to do all of this with only a 2-wide machine (lower IPC, but much higher frequency thanks to 22nm).
That’s the end of the Intel data, but I have some thoughts to add. First of all, based on what I’ve seen and heard from third parties working on Baytrail designs - the performance claims of being 2x the speed of Clovertrail are valid. Compared to the two Cortex A15 designs I’ve tested (Exynos 5250, dual-core A15 @ 1.7GHz and Exynos 5410 quad-core A15 @ 1.6GHz), quad-core Silvermont also comes out way ahead. Intel’s claims of a 60% performance advantage, at minimum, compared to the quad-core competition seems spot on based on the numbers I’ve seen.

Also: Tech Report: http://techreport.com/review/24767/the-next-atom-intel-silvermont-architecture-revealed
 
Just so everybody knows, we do need AMD to continue to operate and produce their products, and maybe even improve the quality a bit, because if they don't and the only one left standing is Intel, then we are going to pay a lot more for their CPUs, which are already expensive. Right now a top-of-the-line CPU from Intel will cost you $1,000; I can't imagine what that price will jump to with no AMD in the picture.

Maddoctor, you need to stop with your negative comments about AMD or I will start to think that you're trolling. So far no AMD fans have responded to this post, and when one does there will be a flame war, and you will be the cause of it.
 
If I understand Chris' last article, didn't Intel say their research showed that some changes that made the chips perform better in real life actually make them worse in benchmarks, and vice-versa? If that's the case, Intel will have to make a big push to show the devices' responsiveness instead of benchmark numbers.
 

InvalidError

Titan
Moderator

That is the same as with any other form of benchmarking. Different benchmarks have different characteristics which may or may not be representative of real-world computing loads. Each architectural change may have positive effects on some real and synthetic workloads while having adverse effects on others.

Not every architectural tweak translates into a win across the board under all circumstances so the objective is to achieve the biggest gains where they matter most.
 


Of course. But to the mindless, uneducated masses, the numbers mean the most, even if they don't know what the numbers actually mean. A few years ago a lot of people panned Windows Phone 7 because those phones all ran single-core CPUs. However, since the OS was designed and optimized for single cores, it was every bit as fast and responsive as multi-core Android phones. If all the general public sees are lopsided benchmark numbers, or that the Atoms still aren't quad-core, it won't matter how much performance they really have; they won't sell. Intel will have a big marketing/educating challenge to overcome.

Personally, the idea of a full x86 phone is incredibly appealing to me.
 

4745454b

Titan
Moderator
didn't Intel say their research showed that some changes that made the chips perform better in real life actually make them worse in benchmarks, and vice-versa?

I don't know if that's true or not. If true, it's a problem with the synthetic benchie, as it obviously isn't doing what real programs do. I've said this before and I'll say it again: I don't even look at the synthetic benchmark page(s) on new parts. I don't care how many bungholio marks your new part can make. All that matters to me is speed. When the 2900XT came out and we all saw massive synthetic scores higher than the 8800 Ultra but gaming scores lower than the 8800GT, I stopped paying attention to synthetic scores. You don't play 3DMark or SuperPi. Why do you care what those results are?
 

ojas

Distinguished
Feb 25, 2011
> Since AMD already published the Software Optimization Guide, you can make the comparison
> yourself, or just assume that usually Jaguar have a bit more than Silvermont, for example:

Umm. You missed the most important number. L2 size/latency.

Intel tends to do a good memory subsystem, and people seem to always dismiss that. The numbers I've seen for Jaguar are a 24 cycle load-use latency of the 512kB L2. David claims (I think) 14 cycles for the 1MB shared L2 on Silvermont. That looks like a big advantage for Silvermont.

Things like ROB sizes are almost unimportant compared to the really fundamental things. I'm a huge proponent of OoO, but it doesn't even need to be all that deep to get the low-hanging fruit. You need to have good caches: low latency and reasonable associativity. If those are bad, no amount of core improvements will ever make up for it (see the majority of the ARM cores), and if the caches are really good, you can make do with shallower queues.

(Side note: I'd like to see actual benchmark numbers for the Silvermont cache accesses. Sometimes CPU people quote the latency after the L1 miss (rather than the full load-to-use latency), sometimes they quote the "n-1" number of cycles, sometimes it turns out that pointer following adds another few cycles that they don't mention, yadda yadda, just to make their numbers look better than they are).

So if the Silvermont 14 cycles are for "14 cycles after the L1 miss", then that is much worse than if it's a true "14 cycle pointer-to-pointer chasing" latency. But I'm assuming it's true load-to-use latency for now, since that's in the same ballpark that Intel did back in the Merom/Yonah days.

Linus Torvalds
http://www.realworldtech.com/forum/?threadid=133235&curpostid=133311
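
For reference, load-to-use latency of the kind Linus is talking about is usually measured with a dependent pointer chase, something like this rough sketch (the 256KB buffer size and step count are just illustrative - pick a size bigger than L1 but smaller than L2 to get the L2 figure):

[code]
/* Dependent pointer chase: each load's address comes from the previous
 * load, so time-per-step is the full pointer-to-pointer latency.
 * Build with: gcc -O2 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_BYTES (256 * 1024)
#define STEPS     100000000UL

int main(void)
{
    size_t n = BUF_BYTES / sizeof(void *);
    void **buf = malloc(n * sizeof(void *));
    size_t *order = malloc(n * sizeof(size_t));

    /* Shuffle the visit order so the hardware prefetcher cannot guess
     * the next address, then link every slot into one big cycle. */
    for (size_t i = 0; i < n; i++) order[i] = i;
    srand(1);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        buf[order[i]] = &buf[order[(i + 1) % n]];

    /* Chase the chain and report the average time per dependent load. */
    struct timespec t0, t1;
    void **p = &buf[order[0]];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < STEPS; i++)
        p = (void **)*p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("%.2f ns per dependent load (end: %p)\n", ns / STEPS, (void *)p);

    free(order);
    free(buf);
    return 0;
}
[/code]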
 

4745454b

Titan
Moderator
I'd have to agree. AMD's Athlon with the IMC was faster than the P4 in part because, by no longer using an FSB, it could get data from memory MUCH faster than Intel could. Along with the much shorter pipeline, their chips were quite a bit better.

Problem is, Intel learned a lot from tweaking the FSB, so by the time they started using an IMC theirs was a lot better than AMD's. And for some reason AMD hasn't really brought the latency down much at all, at least not the last time I checked.
 

bluekoala

Distinguished
Feb 8, 2008
I owned two netbooks with Atom CPUs. From my experience, they were really crappy, and if super crappy gets 50% better, it's still kinda crappy. Although any improvement is a good improvement in my mind, I'll just make sure to steer clear of Atom CPUs (and all the junk surrounding them) until they are PROVEN to be competitive not only in power consumption and performance, but in price as well.
 

mapesdhs

Distinguished
[citation][nom]InvalidError[/nom]Why would Intel ever drop HyperThreading? It provides ~30% extra performance for ~5% more power and transistors by enabling more efficient use of existing computing resources in heavily threaded code. It is one of the most cost-effective and power-efficient tweaks in modern CPUs and GPUs. ...[/citation]

It's amazing how often people criticise the modern HT implementation, which is strange because, as you say, in heavily threaded code it can be extremely effective (not always - AE is an example, so I've read - but usually). The best example I came across recently was my 4.3GHz i7 870 giving a much higher 3DMark11 Physics score (9752) than a 4.5GHz 3570K (9297); now that was a surprise, i.e. the HT made all the difference (compare any 870 to a 760 at the same clock, similar effect). Of course, if a task doesn't benefit from HT, then turning it off gives more thermal headroom and a higher OC (e.g. another 870 I have can do 4.1 with HT on vs. 4.44 with HT off).

If Intel does ever add HT to Atom, what would be good is if each core could exploit HT on the fly, i.e. use it if a task gains from its use and turn it off if not, thus saving power as well. Hmm, dynamic HT... bit too complex perhaps.

Ian.

 

InvalidError

Titan
Moderator

The way I understand it, most of the extra power use with HT simply comes from increased activity in decode and execution units. If HT is enabled but the extra thread is sleeping (not decoding or issuing any instructions), it does not cause extra CPU activity and does not use any more power than disabling it. So HT is already sort-of dynamic.

The main thing needed to make it more power-efficient would be a kernel driver or hack that analyzes each thread's typical CPU requirements to sort out which ones are worth waking a CPU core for and which ones should be scheduled on secondary hardware threads as much as possible.
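
In the meantime, an application that already knows which of its threads are light can approximate that in user space by pinning them to the HT sibling of an already-busy core instead of letting them wake another physical core. A crude Linux sketch (the CPU numbers are assumptions - sibling numbering differs between machines, so check thread_siblings_list):

[code]
/* User-space stand-in for the idea above: the heavy thread gets the primary
 * hardware thread, the light housekeeping thread rides the SMT sibling.
 * Build with: gcc -O2 -pthread */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

#define PRIMARY_CPU 0   /* first hardware thread of the core            */
#define SIBLING_CPU 1   /* assumed HT sibling -- verify on your machine */

static void pin_to(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *heavy_work(void *arg) { pin_to(PRIMARY_CPU); /* hot loop here */ return arg; }
static void *light_work(void *arg) { pin_to(SIBLING_CPU); /* housekeeping  */ return arg; }

int main(void)
{
    pthread_t heavy, light;
    pthread_create(&heavy, NULL, heavy_work, NULL);
    pthread_create(&light, NULL, light_work, NULL);
    pthread_join(heavy, NULL);
    pthread_join(light, NULL);
    return 0;
}
[/code]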
 


Read the whole thing I wrote. I didn't say the chip would suck because of benchmarks, or that I give extra weight to synthetic benchmarks. I said since so many other people do base their judgment on those numbers, Intel may have a harder time marketing these chips against Tegra and Snapdragon.
 

ojas

Distinguished
Feb 25, 2011
About HT: 1) Current gen Atom already has HT

2) Silvermont won't get HT because Intel figured moving to OoOE would use the same die space and power, but provide much better performance.

3) In Core, you have enough space and power to do both HT and OoOE.
 

InvalidError

Titan
Moderator

Not much of an argument considering how little extra power and space HT needs versus how much computing power it adds: ~30% more performance for ~5% more of either, making it ~6X more efficient than adding an equivalent amount of brute-force processing power through extra cores or higher clocks.

HT is more efficient than OoOE but requires a sufficiently threaded workload to actually leverage it. HT's real problem is that most software today is still fundamentally single-threaded, so most programs benefit a lot more from single-threaded performance optimizations than from extra cores or SMT.
 

ojas

Distinguished
Feb 25, 2011

Well...
The move to an OoO paradigm generally comes with penalties to die area and power consumption, which is one reason the earliest mobile CPU architectures were in-order designs. The ARM11, ARM’s Cortex A8, Intel’s original Atom (Bonnell) and Qualcomm’s Scorpion core were all in-order. As performance demands continued to go up and with new, smaller/lower power transistors, all of the players here started introducing OoO variants of their architectures. Although often referred to as out of order designs, ARM’s Cortex A9 and Qualcomm’s Krait 200/300 are mildly OoO compared to Cortex A15. Intel’s Silvermont joins the ranks of the Cortex A15 as a fully out of order design by modern day standards. The move to OoO alone should be good for around a 30% increase in single threaded performance vs. Bonnell.
Previous versions of Atom used Hyper Threading to get good utilization of execution resources. Hyper Threading had a power penalty associated with it, but the performance uplift was enough to justify it. At 22nm, Intel had enough die area (thanks to transistor scaling) to just add in more cores rather than rely on HT for better threaded performance so Hyper Threading was out. The power savings Intel got from getting rid of Hyper Threading were then allocated to making Silvermont an out-of-order design, which in turn helped drive up efficient use of the execution resources without HT. It turns out that at 22nm the die area Intel would’ve spent on enabling HT was roughly the same as Silvermont’s re-order buffer and OoO logic, so there wasn’t even an area penalty for the move.

That was AnandTech.

Intel claims that the core and system level microarchitectural improvements will yield 50% higher IPC for Silvermont versus the previous generation. Comparing the two microarchitectures in Figure 7, that is a highly plausible claim as out-of-order scheduling alone should be worth 30% and perhaps 5-10% for better branch prediction. Additionally, the 22nm process technology has reduced Vmin by around 100mV, and increased clock frequencies by roughly 20-30%. In total, this paints a very attractive picture for Silvermont.
One of the biggest advantages of Silvermont’s out-of-order scheduling is that the whole concept of ‘ports’ or ‘execution pipes’ becomes irrelevant. The microarchitecture will schedule instructions in a nearly optimal fashion, and easily tolerate poorly written code. Of course, the resources for out-of-order execution are not free from a power or area standpoint. However, Intel’s architects found that the area overhead for dynamic scheduling is comparable to the cost for multi-threading, with far better single threaded performance.
The microarchitecture of Silvermont is conceptually and practically quite different from Haswell and other high performance Intel cores. The latter decode x86 instructions into simpler µops, which are subsequently the basis for execution. Silvermont tracks and executes macro operations that correspond very closely to the original x86 instructions, a concept that is present in Saltwell. This different approach is driven by power consumption and efficiency concerns and manifests in a number of implementation details. Other divergences between Haswell-style out-of-order execution and Silvermont are dedicated architectural register files and the distributed approach to scheduling.

That was from Real World Tech
http://www.realworldtech.com/silvermont/

What I have to say is this: the HT you're looking at sits on top of an OoO pipeline, while the original Bonnell/Saltwell pipeline was in-order.
From whatever testing I've done of the Xolo X900, and the other benchmarks I've read for Clover Trail and Medfield, HT had little effect for Atom (this includes Linpack, which showed no improvement under ICS or Gingerbread for Medfield).
 

InvalidError

Titan
Moderator

The general principles behind HT/SMT remain the same regardless of the presence of OoOE or superscalar execution: provide one or more alternate independent instruction stream(s) to help fill execution slots when execution is about to stall from lack of eligible instructions.

The main problem with OoOE is that its complexity increases almost exponentially with look-ahead depth, while the chances of finding extra instructions to squeeze in decrease as valuable, expensive resources get tied up behind conditional branches, mispredicts, speculative execution, cache misses, memory fetches, etc., which OoOE cannot efficiently do anything about. So the cost-versus-benefit of pursuing deeper OoOE keeps getting worse as you attempt to extract more ILP per thread.

With SMT, on the other hand, the costs are almost directly proportional to the hardware thread count, and the chances of finding instructions to fill ports also increase linearly with the number of active threads loaded on each core.

SMT and OoOE are not mutually exclusive. They are synergistic: SMT drastically reduces the depth/complexity of OoOE required to keep most ports busy on every clock tick under sufficiently threaded workloads (may span multiple applications) and OoOE increases the likelihood of finding something to do in each individual hardware thread. Both contribute significantly to extracting the most work out of the least surface area and power, which would be a significant advantage on platforms with limited power supply if pervasive threading became more common.

Another nice benefit of HT on a limited power budget is that you do not need to rely as much on branch prediction, speculative execution and various other expensive (both power and area) tricks. You can simply execute instructions from other threads and hope dependencies will start resolving themselves before every thread gets stuck, which means less wasted work/power - that was the goal behind the original Atom and reality disagreed with Intel's vision of how much of a good idea that was.

HT becomes more effective than increased OoOE once you have enough execution resources to actually start worrying about OoOE failing to keep them busy most of the time - that's where the OoOE and support structures start ballooning out of proportion like they do on mainstream desktop/laptop CPUs. If you look at chips designed for massive parallelism (GP/GPUs, Xeon Phi, UltraSPARC T5, etc.), they tend to have much more primitive OoOE (sometimes none whatsoever) than chips designed with somewhat of an obsession for single-threaded performance like Haswell.

The optimal balance between multi-core, OoOE and HT/SMT is all about the availability of threaded system workload (not necessarily all from a single game/application) or lack thereof. That balance just happens to still be heavily weighted in single-threaded performance's favor.

The main problem with the old Atom: nobody cares how much more efficient an SMT design can be if it thoroughly sucks at extremely common single-threaded tasks like parsing HTML and calculating layouts to render web pages. Not having OoOE left the old Atom severely crippled in that department, which is clearly not acceptable if Intel wants to gain market share in mobile devices.

Give it maybe two years. Atom will likely grow to quad-port execution, and HT will come back to help keep those ports full without expanding the OoOE circuitry too much or sacrificing the all-important single-threaded performance.
 