Everything Zen: AMD Presents New Microarchitecture At HotChips

Status
Not open for further replies.
No offense, but this is old-world thinking. Consider that, in the majority of tech products, the OS can be compiled by the manufacturer for the specific SoC. And basically all of the apps that run on it are JIT-compiled. So, I'd argue that code portability & performance portability no longer really matter, especially in most of ARM's customers' markets.

So, for those markets, they can make cores which lack the overhead of an instruction decoder and have much simpler schedulers. Even the caches can be simplified. This saves power and die area, enabling more cores and higher clock speeds.

For a long time, we've seen GPUs becoming more like CPUs. I think we're about to see CPUs start becoming a bit more GPU-like.

Again, the best testament to this will be Nvidia's new 64-bit Denver core. They've already announced a SoC, built around it, for automotive applications. I'm hoping a tablet/laptop SoC is soon to follow, and we'll see how it stacks up. I believe their original Denver core was the fastest 32-bit ARM core, at least for a time.
 

"Old world thinking" or not, from an engineering point of view, it is much simpler in terms of total design effort to work with an architecture that allows local optimizations on its own than having to re-engineer and re-characterize both the software and hardware stack with every architecture change that has any sort of effect on scheduling, dependencies and resource availability. With "old world" thinking, compilers and CPUs can be improved independently.

If you think that lacking instruction decoders and complex scheduling magically enables higher clock frequencies or performance, look at the very GPUs you cited as an example: low clock frequencies compared to Skylake. The much simpler shaders fall drastically short of achieving higher clock frequencies and the work achieved per shader per clock is also substantially less, which would murder performance in mainstream software where single-threaded performance reigns supreme, and no amount of JIT recompiling is going to change that. In case you may have missed it, the multi-core mobile craze has been dying down lately as people have discovered that more weaker cores generally deliver a worse overall experience than fewer stronger cores.

Also, the K1 (the 64-bit dual-core variant) used the Denver core, and many people have complained about K1 devices such as Google's Nexus 9 having horrible battery life even during trivial use, which is why Nvidia decided to give up on mobile computing and chase automotive opportunities, which aren't as heavily battery-challenged.
 
I don't follow. Compilers are optimized for specific micro-architectures. They all have options to tell them which set of instruction set extensions to use and how to tune the code. And I think you over-emphasize the degree to which they'd be coupled. Compilers can & do continue to evolve high-level optimizations independent of their architecture-specific backends.

VLIW has already established this precedent. There have been numerous VLIW micro-architectures, for embedded applications, that lacked binary compatibility between generations.

Actually, it's a design choice. If your applications have enough concurrency and people are willing to pay for a larger die, then it's more energy efficient to have lots of lower-clocked units. However, someone could easily tune the simple, in-order cores of modern GPUs to achieve blistering clockspeeds. It's all about tradeoffs. In fact, if 14/16 nm die space were cheaper, I bet current gen GPUs would have even higher shader counts and even lower clock speeds.

I hate to link another site, but the best analysis of the matter you're likely to find doesn't really support your assertion:

http://www.anandtech.com/show/9518/the-mobile-cpu-corecount-debate/18

That said, I do see mobile core counts leveling off.

Are you sure? Google's Pixel C uses Tegra X1.

For my final exhibit, and what got me seriously thinking about the issue, consider this piece:

http://www.csm.ornl.gov/SOS20/documents/Sohmers.pptx

He makes a compelling case about how much power is wasted by all of CPUs' efforts to compensate for naive code. I'm not convinced these guys will succeed, but I think their compass is pointed in the right direction.

Think about it: why waste all the power & die area to continually decode the same instructions over-and-over again? The output is exactly the same. Why waste power managing cache, when compilers are smart enough to do it statically? Why redo all the work of scheduling the same instructions, when the ordering will be almost the same from one run to the next? Compilers can see more, be smarter, and persist optimizations from one run to the next. JIT-like runtimes can dynamically re-tune code, based on profiling, even in the midst of it running.

I don't know if it's true, but I've read that the VLIW philosophy started to take hold when people noticed that more power and die area were being spent on instruction scheduling, branch prediction, etc. than on actual computation. Yet true VLIW was constrained to embedded applications, where cross-generational binary compatibility wasn't a requirement. In the world of mobile computing, the hardware and software platforms are so highly-controlled and interpreted languages are so dominant that I think the yoke of binary compatibility has effectively been shed.

To chip away at the wall that CPUs are up against, more creativity and daring is needed. Make the hardware simpler and the software smarter. Makes sense to me, anyway.

Just to be clear, I don't see x86 CPUs going this route. Not for a while, at least.
 

Modern CPUs do not decode the same instructions over and over again. Intel got rid of that with the Trace Cache in the P4, Core 2 and later designs; AMD will be doing the same with Zen. Instructions in inner loops only need to be decoded on the first pass through. With most software spending the bulk of its time in performance-critical loops that execute the same 50-1000 instructions hundreds of times, decode is effectively eliminated from the performance-critical path.

BTW, having a powerful FP-centric CPU won't help you at all for heavily non-linear workloads such as parsing files like HTML or running things like game logic code. Those are applications where FP power and tons of weaker cores are of limited use, and where fewer stronger cores tend to dominate. That's the intrinsic nature of irregular algorithms.

Unless you can completely eliminate heavily conditional code from your algorithm (this will never happen since parsing and anything similar are heavily conditional by definition), there will always be a place for deep out-of-order cores to make that non-parallelizable code go faster. For heavy FP stuff, you can delegate that to GPGPU; there is no need to mess with the CPU for that.
 
That's the very first thing we discussed, in this thread. Did you forget?

No, I mean instruction decoding at the program level. The micro-op cache isn't that big, so you'll be decoding most of a program's instructions multiple times. And you'll be duplicating that same effort every time you run it. That is what I meant.

A lot of programs don't benefit from this, because they execute enough other instructions between visiting their various hotspots that the trace cache would have long forgotten the last time it was there. And some programs just aren't terribly hot-spot oriented. Memory is so big and programming languages are so high-level that programs needn't be coded as tightly as they once were.

How is that apropos, here?

First, I don't imagine completely eliminating instruction scheduling (and never said as much). Second, you're sort of contradicting yourself by claiming that deep out-of-order scheduling is needed to accelerate non-parallelizable code. If it's non-parallelizable, then data dependencies would defeat deep reordering attempts, meaning that only shallow reordering would be of much use.

I will further speculate that one of the main benefits of deep reordering is to pipeline loads (and possibly cacheline allocation, in preparation of upcoming writes - don't know if any CPUs are that sophisticated). But compilers can (and do) perform prefetching, though perhaps they're rather more timid than if they had to explicitly manage the cache.

The biggest problem with static scheduling is the unpredictability of memory latency. And for that, I imagine using a facility like SMT, to concurrently execute blocks of instructions bookended by loads/stores/jumps. That's the role I see for dynamic instruction scheduling - simply to dispatch the instructions from different blocks & threads. Again, rather GPU-like. As I'm obviously not a CPU designer, take this with a grain of salt.

What I'm trying to do is highlight the amount of die space & power that's being spent on optimizing naive code. And most of it is stuff a JIT compiler & dynamic optimizer could do as well or better. I think that's a huge opportunity, and some others clearly agree (Nvidia, Soft Machines, and Rex, to name a few). Maybe you don't, but neither of us is going to prove anything, here. We just have to wait and see.

BTW, even if you have no intention of agreeing, you'd probably find the above link an interesting read.
 

The most compute-intensive parts of common algorithms are extremely hot-spot-oriented. If you look at things like SHA, AES, STL constructs, sorting algorithms, geometry transforms, all manner of math libraries, etc., all the stuff typically doing the heavy lifting behind an application, the main loops in the bulk of them are in the hundreds of instructions at most. How many uOPs are in Intel's and Zen's trace caches? Over a thousand, more than enough to accommodate the innermost three or more performance-critical loops in most algorithms.

1000 instructions may not sound like much to work with but it is enough to do a surprising amount of work, especially when the compiler correctly identifies your intent and spews out the correct optimal code, give or take a few optimization quirks.
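
To put rough numbers on this disagreement, here is a minimal Python sketch that models the uOP cache as a fully associative LRU cache of decoded instructions. That model is an assumption for illustration only: real uOP caches are set-associative and line-based, and the 1024-entry capacity, the 200-instruction loop and the 5000-instruction working set are made-up values, not figures for any Intel or AMD part.

```python
from collections import OrderedDict

def decode_fraction(trace, capacity):
    """Fraction of executed instructions that miss the cache and must be decoded."""
    cache = OrderedDict()          # keys are instruction addresses, oldest first
    misses = 0
    for addr in trace:
        if addr in cache:
            cache.move_to_end(addr)        # hit: mark as most recently used
        else:
            misses += 1                    # miss: the decoder has to run
            cache[addr] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return misses / len(trace)

CAPACITY = 1024  # hypothetical uOP cache size, in instructions

# A 200-instruction hot loop executed 100 times (fits in the cache).
hot_loop = list(range(200)) * 100

# A program cycling through 5000 distinct instructions 100 times (does not fit).
sprawling = list(range(5000)) * 100

print("hot loop decode fraction: ", decode_fraction(hot_loop, CAPACITY))
print("sprawling decode fraction:", decode_fraction(sprawling, CAPACITY))
```

Under these assumptions the hot loop only pays the decode cost on its first pass, while the program that keeps cycling through a working set larger than the cache misses on every instruction. Which of the two patterns dominates real software is exactly what is being argued here.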


Nothing contradictory there: if you have a conditional branch, you may have 50 other things that you can compute before the conditional branch but the only dependency that matters where preventing a pipeline stall is concerned is the branch which is usually dependent on only a fraction of the instructions before it, such as incrementing or decrementing the loop counter. A deeper out-of-order queue has more freedom for speculative execution along the most likely path to keep itself busy until the branch is resolved. If the branch prediction is correct, then the deeper reorder queue can be that many instructions ahead of a shallower one. If the prediction is incorrect, the intermediate results are discarded and the CPU is no worse off apart from wasted power. That's why you see all CPU designers emphasizing any improvements to branch prediction.


The Neo link is about an architecture aimed at scalable FP-centric compute. Most user-interactive processes and parsing-like code are heavily dependent on conditional code, the stuff that nearly all other architectures are horrible at and that does not parallelize well.

 
So, what's your point? Did you notice that I agreed with you 3-4 times about micro-op caches being a good thing? And I said it at least once, before that. So, if you're arguing that they're good, then it's certainly not with me.

All I'm saying is that it'd be even better not to have to waste the power and die space on decoders or micro-op caches. Not to mention having a shorter pipeline, altogether.

So, that's not what most people would consider non-parallelizable code, or certainly not what I thought you meant by it. If there are "50 other things you can compute", then it's got a heck of a lot of ILP.

Compilers can certainly do speculative execution. The software-based alternative to branch prediction is profile-driven optimization (which you can even do concurrent with program execution).

And for branches that are difficult to statically predict (e.g. ones which exhibit a high degree of local consistency, but are globally inconsistent), go ahead and keep some simple branch prediction. But you could even make that a different instruction, so that the hardware doesn't waste precious entries, in its smaller table, on highly-consistent branches. Just a thought.
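
A toy Python sketch of that distinction, assuming a single branch, a profile-driven static guess that always picks the majority direction, and a textbook 2-bit saturating counter standing in for "simple" hardware prediction. The traces are invented for illustration and the profile is taken from the same trace it is scored on, which flatters the static approach.

```python
def static_mispredicts(outcomes):
    """Profile-driven static prediction: always guess the majority direction."""
    guess = sum(outcomes) * 2 >= len(outcomes)   # True means "predict taken"
    return sum(1 for taken in outcomes if taken != guess)

def two_bit_mispredicts(outcomes):
    """Dynamic prediction with one 2-bit saturating counter.

    States 0 and 1 predict not-taken, states 2 and 3 predict taken.
    """
    state = 2          # start in "weakly taken"
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        # Move toward the observed outcome, saturating at 0 and 3.
        state = min(3, state + 1) if taken else max(0, state - 1)
    return misses

# Locally consistent, globally inconsistent: long runs that flip direction.
phased = ([True] * 100 + [False] * 100) * 5

# Highly consistent: almost always taken.
biased = [True] * 999 + [False]

for name, trace in (("phased", phased), ("biased", biased)):
    print(name, "static:", static_mispredicts(trace),
          "2-bit:", two_bit_mispredicts(trace))
```

On the highly consistent branch the two approaches tie, so a hardware table entry buys nothing there. On the phased branch the static guess is wrong half the time while the counter only stumbles at each phase change, which is the kind of branch worth keeping hardware prediction for.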

But that's the point! What are we talking about, if not wasted power, die space, and lost performance?

You're missing the forest for the trees. The point of that link was their analysis of how much power is wasted by hardware trying to do things to "help" software. As I said, I'm not sure they'll succeed in their goal, but I think their compass is pointed in the right direction. By that, I meant that I felt they're on to something, by trying to streamline the hardware and get software to shoulder more of its own burden. The GPU guys get it, because they've been offering a co-designed hardware + software runtime environment since day 1.

One place where I think they're misguided is by trying to make memory access latency deterministic. That might be workable in some tightly-controlled HPC workloads, but not in most general-purpose computing scenarios. That's the main reason I think SMT and a similar sort of intra-thread scheduling are warranted. GPUs are ahead of the game on this, too.
 

And what I'm saying is that the overhead cost has become negligible in most performance-critical cases: every uOP cache hit is one less access to L1 I-cache and instruction decoder, no power penalty there, which is why full-blown x86 cores can slip below 5W.


Not parallelism that can be achieved by throwing extra cores at the problem. Without a reorder buffer, you need your compiler to have a cycle-accurate CPU model to avoid unnecessary stalls. Keeping a CPU model perfectly in sync with the actual implementation is more difficult than it sounds due to all the little things that can cause timings to be off by a cycle.

I have worked in ASIC validation and it was quite common to have simulation models disagree with silicon results in subtle ways simply because minor differences in signal timings caused priority encoders to pick events in a different order from simulations. The chips still produced the correct final output but the automated testbench had to be modified to account for the fact that some transactions may get scheduled in a different order when multiple function blocks request access to PCI, RAM or other shared resources on the same clock tick.


How much performance do you gain in performance-critical code by choosing to stall the pipeline to conserve power instead of taking a chance on a prediction? In the quest for the highest single-threaded performance practically achievable to meet the demands of typical user-interactive code, some sacrifices have to be made.

Consumers care more about having their interactive stuff being snappy than having the highest possible power efficiency, smallest die area and lowest cost.
 
I don't have any data to present, beyond what that presentation showed. Even though the micro-op cache avoids the decoding cost, the cache has its own costs.

I was reacting to the idea that wasted power is no big deal - not trying to say that branch prediction is bad. I thought that would be clear, as I also had comments on profile-driven static branch prediction and a nod to some amount of runtime branch prediction (possibly with hardware assist).

I'm not going to dig up any more data, to support my ideas. If you want to believe that the evolution of Von Neumann architectures has reached its logical conclusion, that's fine with me. I am simply trying to paint a picture of a promising avenue to push back the wall, a bit further. I find it interesting to think about.
 

I'm not saying that a definitive brick wall has been reached but the uphill slope is definitely getting steeper every year for all architectures aiming for increased throughput per core/thread.

The main reason for my pessimism is fundamental software design: no matter what instruction set and architecture you use, typical software will still average one conditional branch every 10-20 instructions. The mispredict rate and how far ahead the CPU is able to look for stuff to do place a fundamental limit on how high IPC/ILP can go. As long as the bulk of mainstream software continues to rely heavily on single-threaded performance for optimal user experience, all architectures will hit similar brick walls for typical consumer use regardless of how innovative they may be.
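
A back-of-the-envelope Python sketch of that limit. The branch spacing of 10 to 20 instructions comes from the paragraph above; the 5% and 1% mispredict rates are assumed values picked for illustration, not measurements of any CPU.

```python
def instructions_per_mispredict(instructions_per_branch, mispredict_rate):
    """Average number of instructions executed between two mispredicted branches."""
    return instructions_per_branch / mispredict_rate

for instructions_per_branch in (10, 20):
    for mispredict_rate in (0.05, 0.01):   # assumed rates, for illustration only
        run = instructions_per_mispredict(instructions_per_branch, mispredict_rate)
        print(f"1 branch per {instructions_per_branch} instructions, "
              f"{mispredict_rate:.0%} mispredicts: "
              f"~{run:.0f} instructions per mispredict")
```

With a 5% mispredict rate that works out to roughly 200 to 400 instructions between mispredicts on average, and roughly 1000 to 2000 at 1%. Looking further ahead than that mostly means working on instructions that end up being thrown away.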

Typical software patterns are the seemingly insurmountable bottleneck. Clever profiling, different architectures and deep out-of-order execution can only do so much to improve that.
 
Hmm... Apparently, if you are going to buy a Zen CPU, you have to have Win 10. I was checking today on Google and, according to PCGamer/PCGaming, the only Windows version that will support Zen and the new Intel CPUs is Win 10, while the previous versions won't support them!

It seems we will either be forced to buy Win 10, switch to Linux, or just buy an i7-6700K and run it with Win7.
 

You can run DOS, Win98SE, XP or anything else on Zen if you want to, you just won't have access to all of Zen's (or Skylake/Kaby Lake) features and may need to jump through extra hoops to make it work. (Ex.: having to dig up a sub-32GB HDD or equivalent with 512B clusters.)
 



I completely agree, InvalidError. This is the article IceMyth was referring to:
http://www.pcgamer.com/intels-kaby-lake-and-amds-zen-processors-will-only-support-windows-10/

Basically the writer took what Microsoft said and blew it out of proportion. It'd be like saying that Windows 98 won't work with a dual core because the OS wasn't written to handle it. The OS would work fine, but only 1 core would be used. It's the same with any new features added to a CPU vs any non-latest OS.

More detail about the support issue can be found on Microsoft's website:
https://support.microsoft.com/en-us/help/13853/windows-lifecycle-fact-sheet
 
I wouldn't buy a CPU with a plan to use an OS that didn't support it. Maybe you get lucky, but I typically find that my hardware works noticeably better, once I've installed all the chipset drivers for it, etc.

If you want to run Windows, and really don't like Windows 10 (and I say this as a non-Windows 10 user), there's always ReactOS:

https://www.reactos.org
 


Exactly, so in the end the OS will not allow you to use 100% of your CPU's power. You will also need updated drivers, etc., and don't forget that if there are OS files that need to be updated by MS, which it seems will not happen, that might cause crashes in the future with zero support!
 

Drivers for old OSes won't exist, and updates for them even less so. That's why most essential peripherals like SATA ports have a "legacy" mode for compatibility with old OSes.
 
Very interesting stuff from Team Red - I've been using AMD for almost 25 years now and I'm happy with my FX-8350 at the moment so have no plans to move to AM4 for some time yet, however if Gigabyte launch an AM4 board that's as good as my 990FXA-UD3 R5, I might build a third system based on the AM4 platform (I have a second system based on an Athlon II X2).

On a side note, I also notice that AMD's ZEN logo is almost the same as a ZEN logo I created in 2014 for my (back then) custom PC, and even the slogan "I Am Zen" is used - but I'm not worried, as great minds think alike!

You can see my original logo here: https://plus.google.com/+HuangLong8/posts/SFP1WXS6xnt
 


Your programs could actually crash. Your CPU will claim it supports AVX or whatever, but your OS won't know what CPU you have and may not actually save your AVX registers on context switch. Then bad things happen. Old OS + new CPU + new software = broken software.
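
For reference, software can guard against this by checking that the OS has enabled AVX state saving before using AVX, instead of trusting the CPU's AVX flag alone. Here is a minimal Python sketch of that decision logic. The bit positions are the documented x86 ones (CPUID leaf 1 ECX bit 27 for OSXSAVE and bit 28 for AVX, XCR0 bits 1 and 2 for SSE and AVX state); the function name and the idea of passing the register values in as plain integers are assumptions made for illustration. Real code would read them with the CPUID and XGETBV instructions.

```python
# CPUID.(EAX=1):ECX feature bits
OSXSAVE_BIT = 1 << 27   # OS has enabled XSAVE/XGETBV
AVX_BIT = 1 << 28       # CPU implements AVX

# XCR0 state-component bits (read with XGETBV once OSXSAVE is set)
XCR0_SSE = 1 << 1       # OS saves XMM registers
XCR0_AVX = 1 << 2       # OS saves the upper halves of YMM registers

def avx_usable(cpuid1_ecx, xcr0):
    """Return True only if both the CPU and the OS support AVX (hypothetical helper)."""
    if not cpuid1_ecx & OSXSAVE_BIT:
        return False    # OS never turned on XSAVE, so XCR0 cannot even be read
    if not cpuid1_ecx & AVX_BIT:
        return False    # CPU has no AVX
    needed = XCR0_SSE | XCR0_AVX
    return (xcr0 & needed) == needed   # OS must save both XMM and YMM state

# New CPU, old OS: the CPU advertises AVX but the OS never enabled XSAVE.
print(avx_usable(AVX_BIT, 0))

# New CPU, AVX-aware OS: x87, SSE and AVX state are all enabled in XCR0.
print(avx_usable(OSXSAVE_BIT | AVX_BIT, 0b111))
```

A program that checks only the AVX bit falls into the trap described above; one that also checks OSXSAVE and XCR0 falls back to its non-AVX path on an OS that does not save the wider registers.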
 
First of all, no disagreement on the risks of running an OS which doesn't support your CPU. I said you might get lucky, but that also implies maybe you don't.

Second, the specific case you cite seems unlikely. Take AVX-512, for instance. Given that no Win7-supported CPUs implement it, I'd wager that any software which uses it will require Windows 10. Also, some BIOSes have the option to limit CPUID, which can probably be used to work around such problems (program doesn't detect the instruction, so doesn't use it).

In my mind, the bigger issue is chipset compatibility & drivers. Even if you can get Windows 7 to run, it won't use the hardware optimally.
 

If the OS initialized the CPU correctly, it should have masked the CPUID capabilities fields to represent what both support to prevent these sorts of issues in properly written backward-compatible architecture/extension-specific multi-path code.
 


Must be a more recent thing. I remember when MMX, SSE, and others came out, programs would read the CPUID and try to use instructions the OS didn't support. Doing a quick google, sounds like Windows 8.1 and newer support CPUID masking.
 