AMD Piledriver rumours ... and expert conjecture

We have had several requests for a sticky on AMD's yet-to-be-released Piledriver architecture ... so here it is.

I want to make a few things clear though.

Post a question relevant to the topic, or information about the topic; anything else will be deleted.

Post any negative personal comments about another user ... and they will be deleted.

Post flame-baiting comments about the blue, red, or green team ... and they will be deleted.

Enjoy ...
 
It's less die space for the IMC, so it makes sense to keep it at two channels. The VLIW4 GPU takes like 50% of the die space, so they've gotta cut CPU goodies 😛

Besides, I'm sure Trinity's IMC is tweaked for lower latency than the CPU version of PD. Just like in Llano.

Cheers!
 
This sort of forum worship is usually reserved for the mighty MU ... who has not been around much lately.

I will light a small candle for him.

:)

Sorry guys, I have been really busy recently with work and just got around to reading some of the epic CPU threads. I guess you guys know how that one works: finally get done with all of the school and training, then get a real job and watch your free time evaporate. At least when I am on call at work and it's not overly busy (it usually is overly busy, though), they just installed unlocked WiFi and THG isn't blocked by the nanny filter. 😀

MU_Engineer...
he's a server freak and knows AMD Opterons like they were his children...

I am a devoted SMP fan, that's for sure 😀 There's just something about using big-iron parts that interests me, like how some of you get interested in a super-high-end GPU or a massively overclocked system.

Oh, and back on topic. Bulldozer as a design theory appears to be very solid. AMD knows it isn't going to be able to outrun Intel in per-core performance, and possibly not in total per-CPU performance either. Intel has too many dollars and people in its R&D department working on eking out a percent here and a percent there for AMD to seriously want to play that game. Intel would start an arms race of sorts and bankrupt AMD, since AMD's market cap is a single-digit percentage of Intel's. AMD seems to have been aiming for a space-efficient, dollar-efficient design that could scale well enough from notebook to server to handle the 99+% of the market that doesn't insist on the absolute highest performance at any price (e.g. Intel's top-bin CPUs like the ~$5000 Westmere-EX E7-88xx series).

Increasing throughput and performance is much easier to do by adding threads than by making each set of thread-execution hardware stronger, and AMD has been adding threads to increase performance for some time, as Thuban has shown. Software keeps getting more and more multithreaded, so this is not an unreasonable approach. AMD is thus looking to make a chip that handles as many threads as possible without an unmanufacturably large die. AMD's options were to simply add more Stars-type cores, come up with "slimmer" cores and add yet more of them, or adopt SMT. AMD apparently did a lot of research and found that it needed to keep discrete integer cores, but that some of the rest of the hardware could be shared without a big performance deficit when fully loaded, unlike SMT. So we got a shared frontend, discrete integer cores reminiscent of Bobcat (another "good enough performance but cheap to make" chip), a shared L2 cache, and a semi-shared FPU.

My guess is that the FPU isn't too hard to replace with a bunch of Radeon GPU hardware in the near future, as it isn't nearly as directly tied into the cores as it was with Stars. That would fit in with AMD's ultimate goal of Fusion: cheap, easy-to-make, space- and dollar-efficient integer-core modules paired up with GPU hardware acting as a super-strong FPU as well as a GPU.

The initial Bulldozer parts failed to meet their performance expectations for non-obvious and still not completely known (or at least not publicly known) reasons. The same thing happened to AMD the last time they made substantial changes to their uarch, with Barcelona. They essentially ran into teething problems with a significantly new design, and I feel that is what happened here. Best guesses as to Bulldozer's lower-than-anticipated performance usually involve the caches, namely that the L1 is too small and the L2/L3 are too slow. I wonder if the FPU scheduler isn't also a bit immature and isn't properly "splitting" the FPU for simultaneous 2x128-bit operations on the different cores. 128-bit non-AVX FPU-heavy tasks run about as fast on 4 threads as they do on 8, whereas you would think they would run considerably faster (a simple scaling probe like the sketch at the end of this post can show the pattern). The integer schedulers seem to be fine, as Bulldozer scales fine with non-FPU-heavy tasks out to its full 8 threads. So my guess as to what is going to be in Piledriver is essentially "fixing" the implementation issues in Bulldozer:

1. Increase in L1 size.
2. Better FPU scheduler.
3. Possibly an even beefier FPU, maybe a 2x256-bit unit?
4. L2 and L3 latency reduction.
5. Increased uncore speeds.
6. Other subtle tweaks that increase performance due to internally-identified issues that we are not aware of. This could be anything from increased internal cache/core bandwidth to scheduler tweaks.

And then the typical refresh tweaks:
7. Another module: we have heard this one before.
8. Support for even faster DDR3 memory, probably DDR3-2133 or maybe even DDR3-2400 in the desktop parts.
9. A third IMC channel for GPU-equipped parts; we have also heard this before.
10. Higher clock speeds at identical thread counts and TDPs due to a maturing 32 nm process.
11. Software tweaks, such as AMD's Linux kernel TLB bug fix (which did NOT need to shut off the L3 on Barcelona like the Windows BIOS fix did) and the Cool 'n Quiet individual-core-scaling tweak between Barcelona and Shanghai.

All in all, I fully expect Bulldozer --> Piledriver to be like Barcelona --> Shanghai. Major redesigns have teething problems, period. Even Intel has them: anybody who has tried to run a Sandy Bridge with its new, majorly redesigned on-die IGP on Linux can easily testify to that one.
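
For anyone who wants to poke at the FPU-sharing guess themselves, here is a rough C/OpenMP probe. It is purely illustrative and my own throwaway code (the file name and iteration count are arbitrary): it runs the same packed-double SSE loop on 1/2/4/8 threads. On an FX-8xxx, aggregate throughput that goes flat from 4 to 8 threads while integer workloads keep scaling would be consistent with the shared per-module FPU being the choke point.

```c
/* fpu_scale.c (hypothetical name): crude probe of FPU-module sharing.
 * Each thread runs an identical 128-bit packed-double SSE loop; if the
 * per-module FPU is the bottleneck, going from 4 to 8 threads on an
 * FX-8xxx should barely raise aggregate throughput.
 * Build: gcc -O2 -fopenmp -msse2 fpu_scale.c -o fpu_scale
 */
#include <emmintrin.h>
#include <omp.h>
#include <stdio.h>

#define ITERS 50000000L   /* arbitrary; just big enough to time reliably */

static double burn(void)
{
    /* Pure 128-bit FP math, essentially no memory traffic. */
    __m128d a = _mm_set_pd(1.0000001, 0.9999999);
    __m128d b = _mm_set_pd(1.0000002, 0.9999998);
    for (long i = 0; i < ITERS; i++) {
        a = _mm_mul_pd(a, b);
        a = _mm_add_pd(a, b);
    }
    double out[2];
    _mm_storeu_pd(out, a);
    return out[0] + out[1];   /* keep the result live so -O2 can't drop it */
}

int main(void)
{
    for (int nthr = 1; nthr <= 8; nthr *= 2) {
        double sink = 0.0;
        double t0 = omp_get_wtime();
        #pragma omp parallel num_threads(nthr) reduction(+:sink)
        sink += burn();
        double dt = omp_get_wtime() - t0;
        /* Aggregate throughput in "thread-workloads per second";
         * it should roughly double with thread count if nothing is shared. */
        printf("%d threads: %6.2f s, %5.2f workloads/s (sink=%g)\n",
               nthr, dt, nthr / dt, sink);
    }
    return 0;
}
```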
 
reason for Bulldozer performance....
all the AMD fanbois think the processor is just ahead of its time and everything else needs to catch up..
:na:

Some of that actually is/was true with AMD CPUs. The QuadFX is probably the best example. It performed notoriously poorly when it was released; often the second CPU being present actually *lowered* performance. Why? Because Windows XP doesn't support NUMA worth a crap, so the CPUs were busy bouncing data in RAM back and forth between each other over the HT link as Windows XP randomly moved threads from one CPU to the other. If Windows Vista or 7 had been out at the time, or if XP had actually supported NUMA, the QuadFX would have been a real challenger to the Core 2 Quads. (Linux did support NUMA well at the time, and the FX-7xs were competitive with the C2Qs on that OS.) Well, at least until you tried to overclock; the C2Qs had overclocking headroom whereas the FX-7xs had little to none, but you get what I am saying. Ditto with Barcelona: Windows XP/Vista didn't know how to handle per-core Cool 'n Quiet well and bounced threads from one core to another, landing them on a low-clocked idle core and tanking performance. Bulldozer is also noted to perform notably better on Linux than it does on Windows, because the Linux kernel and compilers have support for this CPU while the current version of Windows does not. It doesn't change the fact that Intel's products still outperform AMD's, but it definitely narrows the gap somewhat.
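
If you have never seen what NUMA awareness actually buys you, here is a minimal sketch, assuming Linux with libnuma installed (the node number and buffer size below are just illustrative). This is the discipline XP never applied: pin the thread and allocate its working set on the same node, so nothing has to bounce across the HT link.

```c
/* Minimal libnuma sketch of what a NUMA-aware OS (or app) does that
 * Windows XP did not: keep a thread and the memory it touches on the
 * same node.  Illustrative only.
 * Build: gcc numa_demo.c -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    int node = 0;

    /* Pin this thread to node 0's CPUs... */
    numa_run_on_node(node);

    /* ...and allocate its working set from node 0's local RAM.
     * An XP-era scheduler would happily migrate the thread to the
     * other socket, turning every access below into a remote one. */
    size_t sz = 64 * 1024 * 1024;
    char *buf = numa_alloc_onnode(sz, node);
    if (!buf) return 1;

    memset(buf, 0, sz);   /* all local accesses */
    printf("thread and %zu MB working set both on node %d\n",
           sz >> 20, node);

    numa_free(buf, sz);
    return 0;
}
```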

Why is AMD still using a ~9xx pin count? I think if AMD could increase the pin count, it could increase its CPUs' performance.

For example, compare the i7-39xx and SB: they have a huge difference in pin count and also in performance.

AMD is still using ~900 pins on their desktop CPUs because that is all they need. The majority of the pins on AMD desktop CPUs are for the IMC to connect to the RAM: 480 pins on AM2/AM2+/AM3/AM3+ are used for this purpose. Most of the rest are used for power delivery, and a fairly small number for miscellaneous things like the single HyperTransport link, temp monitoring, etc. The number of CPU power pins doesn't need to increase if the CPU TDP does not increase, and AMD's TDP has remained at a 110-140 W maximum on performance desktop sockets since 2003. They also haven't changed the memory controller's overall design a whole lot, so you don't need more pins there either.

The reason the i7-39xx series has a lot more contacts than AM3+ is that the i7-39xx series is based on the many-socket-capable LGA2011 interface and contains a lot of I/O connectivity that really only a server would find useful. Performance is NOT better for an identical chip type. The i7-3820 is essentially an i7-2700K with a 100 MHz higher clock speed and 2 MB more L3 cache, and its performance is nearly identical to the i7-2700K's, despite the i7-3820 having about 900 more contacts in the socket and two more memory channels. An increased number of pins/contacts only helps performance when you really need what they connect to. A quad-core Sandy Bridge doesn't need the extra stuff LGA2011 can connect to, and it does fine on LGA1155.

A daily-life example: consider a stadium with 10 gates and another with 20 gates. Which can allow more people to go in and out when both are completely filled?

The analogy would be more like this: Consider a stadium with 10 gates and another with 20 gates. 50 people show up to the game and trickle in over one hour. Sure, you can theoretically handle many more people at the same time with the extra 10 gates, but when people show up one or two at a time, the extra 10 gates aren't even used.

noob2222, I thought Linux was a very fast/efficient OS. Couldn't they program games just as visually appealing on a Linux-based OS while using less CPU/GPU 'power'?

Yes, and it has been done before. Enemy Territory: Quake Wars is the latest AAA game I can think of that has been released for Linux. It looks the same on Linux as it does on Windows. The reason more games don't show up on Linux is that you'd have to write the game in OpenGL rather than Direct3D, since Direct3D is a proprietary MS standard. MS has a stranglehold on the market, so OpenGL games are very rare. Gaming on Linux is a pretty small market, so few developers release even halfway-large titles for that OS.

Seeing PD going to a quad-channel memory controller for server boards, I wonder if we will see that update implemented on the 1090FX boards, with backward compatibility in dual-channel mode for 990FX.

I think the "quad-channel" bit is referring to the Opteron 6000 MCM model. Those are already quad-channel (really 2x dual-channel). The last rumors I have heard say that Piledriver will be released on existing sockets, namely AM3+, C32, and G34, so no new single-die quad-channel socket is apparently in the works. There was a previous rumor of a "Socket C2012" to replace C32 for the Piledriver generation that was supposed to have three memory channels, but that apparently was canceled along with Krishna and Wichita. AMD of course is keeping its cards very, very close to its chest, and we likely won't know what they are really working on until shortly before launch.
 
If they get it back to Phenom II levels they would be back at square one. Intel has IB coming out in April; it will have better performance than SB, that much we know. Then in 9 months to a year comes Haswell.

PD needs to do two things: drop power draw a lot (because man, BD is a hog) and get performance above Phenom II in every aspect. Then it may be competitive in the DT market.

Then again AMD isn't focusing on Intel anymore as they have moved past that.

Still, if PD performs better than IB, it will cost you plenty to have that, as we are seeing from the HD7970 that's priced way too high for my liking. And I do want an HD7970; I just don't want to spend that much for it. Of course, I will probably wait until Kepler is out, as that will probably help drive its price back down to normal levels.

If they get single-thread performance back to Phenom II levels they will be back up to square one. It will put them at what they were expecting from BD and will give the MOAR COARS!!! strategy some room to differentiate itself. I think meeting Phenom II's single-thread performance will catapult PD ahead in everything else, as BD already pulls ahead in most tests.

When you compare Phenom II to BD, calling it a side-grade is generous at best when you consider it as 4c vs 8c (as it is marketed). I don't believe that Bulldozer is what AMD wanted from their new arch, but you don't sell many products with "Well, it could be better, but you should buy it anyway."

As far as Intel is concerned, AMD has stated that it doesn't want to compete on pure x86 performance anymore, and I think this is one time we can believe the press release. I think they are going to keep pushing their Super-Duper-Multi-Core-MADNESS!!! and their remarkably successful APUs as ways to stand out from Intel.
 
The performance of Bulldozer could be increased significantly if they just improved the cache latencies and bumped up the clock, which would be expected in time.

The more-cores strategy just doesn't hold up right now, though. Very few apps use 8 threads and even fewer use them well; AMD can only hope to stay even on performance if they can increase single-thread performance.

The Thuban example shows that more threaded performance doesn't help AMD in the market. Intel still stays ahead for gamers, and eventually the market won't care if AMD offers threaded performance for less, since people will recognize Intel as the superior platform. That is what is happening right now: AMD offers its APUs as very good products for what they do, but people still buy Intel laptops for more because they recognize the brand. Atom netbooks sold even though Bobcat was a far superior product.

As for Trinity, AMD needs marketing to get its name out there. The performance increase is expected, but a product sells because of marketing in today's economy; iPads and iPhones offer nothing more than Android products but sell better in the US because of marketing and name alone.

Hopefully AMD can get everything sorted out for the Win8 launch so they actually get the performance they promised and don't have an excuse. The front end needs improvement, as does the FPU; better branch prediction and per-module optimizations would help on top of the better caching. They should also stay at a max of 8 cores and just try to improve them, as there's no way to use more cores effectively in general computing apps today. They can stay ahead of Intel in core count, but there's no reason to release 10-16 cores for the desktop any time soon.

 
^^ I would love to have a 16c desktop CPU, since Folding@Home is dropping BigAdv (high-reward work) support for any system with fewer than 16 cores fairly soon.

^ I don't disagree, but if AMD can keep within 5-10% of Intel on single thread I think they will do fine. Like MU said, Intel can keep throwing engineers at the problem to tweak out that last 1% and make up the costs within 2 weeks of sales.
 
MU outlines the issues better than I ever could.

BD is a good uarch; they screwed up the implementation, something that was to be expected, all things considered. Let's see how far they run with this and what they can tweak it into. And please fix that damn L2 latency issue.
 
Enough with all these cores that are just dreadful in single-core (per-core) performance.
I'd rather have a quad of superior quality and performance than eight cores of average performance.

And yet, when people like me argued that point in the BD thread, the AMD fanboys went insane.

Software, as currently written, compiled, and executed by the OS, typically cannot easily use multiple threads of execution AT THE SAME TIME without some sort of synchronization bottleneck somewhere. It's VERY hard to do, and only a few specific apps have the ability to do this easily. I've long argued that going multi-core was simply a way to get around the fact that per-core performance was not improving as fast as it once was and that clock speeds were limited by power draw. But beyond quads, I don't expect to see much improvement in the vast majority of apps.

BD will shine in servers, since multiple APPLICATIONS can be run in parallel. But for single applications, which is what 99% of the desktop world runs, BD is not a good architecture, due to per-core performance.

This is the same logic that explains why Intel is sticking with 2-4 cores on the vast majority of their products: the cost of adding two more cores is not justified by the performance increase it gives.
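
To put a rough number on that: a quick Amdahl's-law estimate (the 90% parallel fraction below is my assumption, and generous for most desktop apps) shows how fast the returns flatten past four cores.

```c
/* Back-of-the-envelope Amdahl's-law check on the "beyond quads"
 * claim: even a generously threaded app (90% parallel) gains
 * little past 4 cores once the serial/synchronization part
 * dominates.  Numbers are illustrative, not measurements.
 */
#include <stdio.h>

int main(void)
{
    double p = 0.90;   /* assumed parallel fraction of the workload */
    for (int n = 1; n <= 16; n *= 2) {
        double speedup = 1.0 / ((1.0 - p) + p / n);
        printf("%2d cores: %.2fx\n", n, speedup);
    }
    return 0;
}
/* Prints roughly: 1 -> 1.00x, 2 -> 1.82x, 4 -> 3.08x,
 * 8 -> 4.71x, 16 -> 6.40x: each doubling past 4 buys less. */
```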
 
I have to disagree with you somewhat. The problem the program is solving, or the process the program is going through, must be divided in the problem space by the programmer, and this may not be so difficult to do. I understand that the entire problem can't be split and synchronization is required. Synchronization is communication, and there will be some threads waiting on others, but that's OK; as long as it's done correctly, very few CPU cycles will be used. It's about handling events properly.

It's true that AMD CPU architectures are not as fast as the Intel architectures, but I like the many-core concept. I believe the Bulldozer architecture was not implemented properly, both in process (manufacturing) and in design. But I'm hoping it will get fixed, and when it does I will want one. I won't need the fastest cores; I just want them fast enough, and more of them.

My main point is that programs in many ways can be divided into threads or even processes. I have done this, btw. You might say there is an overhead penalty, but I would say the OS takes care of that if you use proper synchronization. That means correctness (it takes some practice, no cheating allowed), and you must put the thread or process to sleep while waiting; see the sketch below. This should be part of the communications protocol, but it certainly isn't difficult to do explicitly by hand. In fact, every development team should have a member who devotes himself/herself to just such a thing.
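
Here is a minimal sketch of that sleep-while-waiting discipline, assuming plain C with pthreads (my own illustrative example, not from any real codebase): the consumer thread blocks on a condition variable and costs essentially nothing until the producer signals it.

```c
/* Sleep-while-waiting with a pthread condition variable: the waiting
 * thread burns no CPU cycles until the producer signals it.
 * Build: gcc cond_demo.c -pthread
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
static int data_ready = 0;

static void *consumer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!data_ready)                   /* guard against spurious wakeups */
        pthread_cond_wait(&ready, &lock); /* sleeps; zero CPU while waiting */
    pthread_mutex_unlock(&lock);
    printf("consumer: woke up, data is ready\n");
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, consumer, NULL);

    /* ... producer does its share of the divided problem ... */
    pthread_mutex_lock(&lock);
    data_ready = 1;
    pthread_cond_signal(&ready);          /* wake exactly when needed */
    pthread_mutex_unlock(&lock);

    pthread_join(t, NULL);
    return 0;
}
```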

I do not so much believe the problem can be solved by the compiler or the OS. Please, someone tell me different.

In fact, I'm thinking that inadvertently the Bulldozer architecture is promoting thought in this area. Believe me, you aren't going to be able to cross the programmers off the list. As much as some people would like to...
 
Still, give me an i7-2600K over the FX Bulldozer any day. I do not care about specific tests (tailored in some way) that favor AMD's more-cores arch.

Your reply is short-sighted. Allow me to explain why.

Consider a drop of rain falling from the sky. How much computation using our best fluid dynamics theory and numerical methods would be required to determine how that drop of rain moves through the atmosphere to the ground? Can it be done, even? Not really. And yet the drop of rain has no problem falling just how it falls, all the way to the ground.

In the same kind of way, computers solve problems sequentially by executing instructions. How fast can one cpu core go? How many instructions can be executed in one second? Enough to play Crysis?

The answer is to not be linear and execute instructions sequentially. The answer is probably to not execute instructions at all, but until someone figures out how to make that leap into a non-linear evaluation process, using more cores is a good idea so we can do more things at the same time. This problem isn't going away, and I believe it is the real problem to solve at this point. You know, were this problem solved, instructions per clock might not be important to you at all.

So ease up in your linear thinking and help me to say: GO AMD!
 
Again, unless software gets a fundamental re-thinking, for 90% of all applications it is near impossible to design them in such a way that you get multiple threads running at the same time, without bottlenecks, that are doing enough of a workload to warrant using more cores.
 
I think we can use multiple cores by tweaking or making changes in Windows.

The way to do this is splitting one thread into multiple threads. Why am I saying this? Because I found it is possible when I was using VMs: running Prime95 in a VM, the VM split the workload of one Prime95 thread across both cores in the host OS.

You can try this at home.
 
I thought it would be "Yes, just enter your full name, SSN, and valid credit card information into this Google Docs form so we know where to send the hardware and can charge for 'Shipping and Handling'." :lol:
 
But again, even after the programmer divides up the problem space, you still need to be doing multiple events at the same time for threading to have any real impact. And very few work-heavy workloads can be easily broken up into smaller units running in parallel, because most jobs end up being a series of sequential events.

For example, when we first got quads, the thought was that you could write a game with the audio, rendering, physics, and AI engines each running on a different CPU. Sounds like a nice scheme, until you realise that the audio is almost totally dependent on the other three engines, and that AI, physics, and rendering all interact with each other in some fashion, which limits how much of their workloads can be done without needing some form of synchronization [which can take a VERY long time if one of those threads is pre-empted by the OS, in which case your application is basically stuck in a waiting state. Thankfully, Windows typically doesn't pre-empt high-workload foreground application threads...].

Further, with more heavy-workload threads running, you usually require a LOT more memory access at any one point in time, which, in 32-bit .exes or low-memory systems, can result in a lot of paging, bringing HDD I/O waits into the equation. [Moving to native 64-bit .exes would help solve this problem, though...] You also need to put a LOT more thought into what the CPU decides to cache in the L2, which can have a huge impact on performance [especially if there's a lot of paging going on].

Point being, the "simple" scheme of stuffing each game engine on a different core got very complex very quickly. And HEAVEN FORBID your AV decides now is a good time to start on the 4th core, in which case you've just tanked your application performance. [Using more cores has the negative side effect of increasing the chances of a total application bottleneck if another application starts to execute on that same core, forcing your app into a waiting state. If your other threads are stuck in a synchronization state, you are basically dead.]

And this is BEFORE we even begin to consider how the OS is actually managing all these resources.
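
To make the synchronization ceiling concrete, here is a toy two-thread sketch, assuming pthreads, with made-up per-frame timings: the physics and render threads meet at a barrier every frame, so the frame takes as long as the slower engine no matter how many cores are free.

```c
/* Toy version of the game-engine split described above: a physics
 * thread and a render thread that must meet at a barrier every frame.
 * The frame takes as long as the slower thread, so the sync point
 * (not the core count) sets the ceiling.  Illustrative sketch only.
 * Build: gcc frame_sync.c -pthread
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define FRAMES 5
static pthread_barrier_t frame_end;

static void *physics(void *arg)
{
    (void)arg;
    for (int f = 0; f < FRAMES; f++) {
        usleep(16000);                    /* pretend: 16 ms of physics */
        pthread_barrier_wait(&frame_end); /* render can't use state early */
    }
    return NULL;
}

static void *render(void *arg)
{
    (void)arg;
    for (int f = 0; f < FRAMES; f++) {
        usleep(4000);                     /* pretend: 4 ms of rendering */
        pthread_barrier_wait(&frame_end); /* ...so it idles ~12 ms here */
        printf("frame %d presented\n", f);
    }
    return NULL;
}

int main(void)
{
    pthread_t p, r;
    pthread_barrier_init(&frame_end, NULL, 2);
    pthread_create(&p, NULL, physics, NULL);
    pthread_create(&r, NULL, render, NULL);
    pthread_join(p, NULL);
    pthread_join(r, NULL);
    pthread_barrier_destroy(&frame_end);
    return 0;
}
```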
 