AMD Piledriver rumours ... and expert conjecture

We have had several requests for a sticky on AMD's yet to be released Piledriver architecture ... so here it is.

I want to make a few things clear though.

Post a question relevant to the topic, or information about the topic, or it will be deleted.

Post any negative personal comments about another user ... and they will be deleted.

Post flame baiting comments about the blue, red and green team and they will be deleted.

Enjoy ...
 
I think that while AMD has some work to do catching up with Intel on IPC, it is clear from the recent strategies (centred on power efficiency) that both have adopted that the point of diminishing returns has been reached in terms of performance for Intel (and therefore for AMD once they catch up), given the current operating system and software constraints placed on these systems.

Both will endeavour to offer slight (but not stellar) improvements in performance in the future. More cores isn't the answer given the software constraints on many applications: parallelism is more of a dream than a reality at present, particularly for the programmers of many games.

Efforts with process improvements (reduction in die size) will be spent improving graphics real estate, and power efficiency.

Though x86 has design constraints (quaint addressing modes / segment registers / variable-width macro-instructions, which require more decode work and therefore overhead, since some are implemented via microcode), both Intel and AMD have spent a lot of money developing microarchitectures for the x86 instruction set ... hence its efficiency and flexibility in the long term.

While ARM may have had a performance lead at the very low-power boundary, that advantage is being rapidly eroded. ARM is also just an instruction set, not a microarchitecture.

Each new iteration for x86 offerings from Intel and AMD is now more likely to produce similar performance with a considerable reduction in power.

INTEL and AMD need each other. They also NEED MS.

As our interest in hand held devices increases ... that will become a driver for manufacturing to suit.

That being said, I for one take my souped-up M405 to meetings while the others have their puny iPads ... pathetic relic, aren't I??
 
x86 is slowly reaching the ARM power envelopes.
But can ARM scale up to x86 performance levels? Unlikely.

Both will endeavour to offer slight (but not stellar) improvements in performance in the future. More cores isn't the answer given the software constraints on many applications: parallelism is more of a dream than a reality at present, particularly for the programmers of many games.

I used to work at an IT company some time back. The work was to develop server-side Java applications for database management and retrieval. The devs used threads all the time, but they all ran on one core only. Nobody even knew how to run threads in parallel across multiple cores.
Another job was to create macros in Excel and Access that fetch data, then process it and convert it into graphs, tables and charts. Again, this was all sequential, even though it could have been made multithreaded. But nobody knew how.

So yes, parallel programming is a dream right now.


We (finally) get dynamic physics engines that can handle multiple object collisions. The API uses the GPU, since this type of physics engine would scale to a reasonable degree (think PhysX).

AFAIK, PhysX is the only physics engine that uses the GPU. Bullet and Havok are still CPU-based.
 


iPads are cliché. Everybody owns one. Same with iPhones, MacBooks and Intel chips.
 


On x86 performance, AMD is playing catch-up, but each evolution of the current architecture will see gains on that front.

Parallelism, well, I think it is more than a dream. Perhaps in gaming it still is, but in content-creation applications it is becoming prevalent, even if it is still in its infancy.

As HSA evolves, so will the software written for it. It will be a mass effort from all parties; AMD is just making it possible to go that route.
 


http://bulletphysics.org/wordpress/

Bullet uses some OpenCL.

As for moving things from the CPU to the GPU and creating a bottleneck, you guys are missing the point. If you have a discrete card, the iGPU getting hammered with operations doesn't matter.

PhysX does this completely wrong. The optimal way is to have two pieces of silicon, one for graphics and one for floating-point work like physics. When Nvidia did away with the PPUs, they took away the entire benefit of PhysX. You can see how poorly they did by the fact that a 7970 with PhysX on High running on the CPU pretty much ties a GTX 680 with PhysX on the GPU.
http://www.techspot.com/review/577-borderlands-2-performance/page5.html

AMD will have some way to get around this. I don't think some of you see how much potential this has. It's the equivalent of having an on-die PPU that can do everything else float-related as well. Using things like OpenCL, Bullet physics won't bog down the GPU if you have an APU and a discrete card.
 


Right there is when I laughed. Let's see: if I was behind in something, I would say the same thing. It's not good to put all your eggs in one basket, which is what they're doing with HSA.
 


Intel are putting their eggs in the x86 basket; make of it what you will. It's not like AMD have said they are not going to improve IPC and x86 performance. Heck, look at that PD 7-10% gain in IPC; how much that translates into pure performance will be seen soon enough. They targeted 15% and said they were pleased with the outcome, so maybe it's more.
 
IPC will be around 7%-9% and the speed bump will take the rest a little over 11%, maybe.

They'll keep being somewhat power hungry I think, but within the same ballpark (when OCed to their limit).

That's my full estimation. Hope I'm wrong and it's more, hahaha.

Anyway, sarinaide is right. AMD is not abandoning x86 at all; they just won't prioritize it the way Intel does. HSA and APUs are a gamble, we all agree, but when you're hanging by a thread, you have to take the risk and go for it. They don't have the money/capital/market share to push into different markets and absorb a failure with no problems, like Intel can.

Cheers!
 
HSA on an APU works well. They can even have that plus an external GPU for the actual rendering during a game. The APU-style architecture can be very flexible; it just needs programmers to embrace it. The problem is AMD doesn't have the market share and money to actually make it happen fast.
 


I can't stress this enough: Developers will not spend a significant amount of time coding for a niche architecture that is not universally adopted and supported.

Seriously, Win64 has been mainstream for almost 5 years now, and how many programs even have 64-bit builds? And that should be a simple recompile for the most part...
 


I suspect within the next few CPU architectures, you will see Intel start to drop parts of the old 16-bit backend from its processor design. That would free up some space and allow some power savings right there. Would break legacy compatibility though...



Excel (and most of Office) frankly does not scale well, and there's nothing devs can do about that, even though Excel should be scalable for the most part. (Office is badly in need of a redesign...)

As for threads in general, I say this again: the CreateThread API call (http://msdn.microsoft.com/en-us/library/bb202727.aspx) has no facilities whatsoever for assigning a thread to a particular core. The ONLY way to do this on Windows is via the SetThreadAffinityMask call.

And even then, you have to be concerned about data integrity: when do I have to start locking/unlocking? Making sure that if you abort a thread, you release all of its resources. This kills performance. Then you have all the implicit locking Windows does as the result of certain API calls. It's actually easy to lock up the main GUI if you aren't careful (which I've seen plenty of people do over the years). The preferred method for most programs is to let the Windows scheduler handle core loading, unless you have a workload that is known to be parallel in nature and not likely to be blocked by external I/O.
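For what it's worth, a generic sketch of how scope-based locking keeps the lock/unlock pairing straight in C++ (standard-library mutexes rather than the Win32 calls below; the worker and counter are made up for illustration):

```cpp
#include <mutex>
#include <thread>
#include <vector>
#include <cstdio>

std::mutex g_lock;     // protects g_total
long long g_total = 0;

// Hypothetical worker: adds its chunk to a shared total.
static void AddChunk(int chunk)
{
    // Scope-based locking: the mutex is released automatically when the
    // guard goes out of scope, even on an early return or an exception,
    // so the unlock can't be forgotten.
    std::lock_guard<std::mutex> guard(g_lock);
    g_total += chunk;
}

int main()
{
    std::vector<std::thread> workers;
    for (int i = 1; i <= 4; ++i)
        workers.emplace_back(AddChunk, i);
    for (auto& t : workers)
        t.join();
    std::printf("total = %lld\n", g_total);
    return 0;
}
```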

(Protip: a 'finally' block is NOT guaranteed to run, so don't rely on one to clean up program state. Example: a thread that is killed from an external thread will never go through its finally block, bleeding resources. You clean up resources when you are done with them, period, to avoid this issue. Almost all Java devs are taught wrong in this regard, and write very poor programs as a result. /rant).

FYI, some of the relevant API calls:
CreateThread: http://msdn.microsoft.com/en-us/library/windows/desktop/ms682453(v=vs.85).aspx (Note: you'll normally call this via a different wrapper, such as _beginthreadex)
SetProcessAffinityMask: http://msdn.microsoft.com/en-us/library/windows/desktop/ms686223(v=vs.85).aspx
SetThreadAffinityMask: http://msdn.microsoft.com/en-us/library/windows/desktop/ms686247(v=vs.85).aspx
SetThreadIdealProcessor: http://msdn.microsoft.com/en-us/library/windows/desktop/ms686253(v=vs.85).aspx
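
To make those calls concrete, here's a minimal sketch (hypothetical worker, no error handling; per the advice above, you normally wouldn't hard-pin like this):

```cpp
#include <windows.h>
#include <process.h>   // _beginthreadex
#include <cstdint>
#include <cstdio>

// Hypothetical worker: burns a little CPU so there is something to schedule.
unsigned __stdcall Worker(void* arg)
{
    volatile unsigned long long sum = 0;
    for (unsigned long long i = 0; i < 100000000ULL; ++i)
        sum += i;
    std::printf("worker %d done\n", (int)(intptr_t)arg);
    return 0;
}

int main()
{
    // Create the thread through the CRT wrapper rather than raw CreateThread.
    HANDLE h = (HANDLE)_beginthreadex(nullptr, 0, Worker, (void*)1, 0, nullptr);

    // Hard pinning: restrict the thread to logical processor 1 only
    // (the trap discussed below; shown here only to illustrate the call).
    SetThreadAffinityMask(h, DWORD_PTR(1) << 1);

    // Softer alternative: a hint the scheduler is free to ignore.
    SetThreadIdealProcessor(h, 1);

    WaitForSingleObject(h, INFINITE);
    CloseHandle(h);
    return 0;
}
```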

Note the lack of an easy way to find an HTT core. Aside from looking at the raw CPUID data, I don't know how to determine whether a core is HTT or not (I'd be shocked if there wasn't an API call somewhere that exposes this, though...)

And BTW, there's a massive trap with all these threading options. Let's look at a simple problem: you have two reasonably independent threads that can be run in parallel. How do you do this?

Option 1: Leave it to the Windows scheduler.
Option 2: Hardcode the two threads to cores 0/1 (or any two cores)
Option 3: Hardcode the second thread to any core besides core 0

Option 1 is the only correct option.

Option 2 ignores the possibility of high workloads already existing on the first two cores. Unacceptable to make the assumption you are the only heavy-work process active.

Option 3 ignores the possibility that the other thread will NOT be placed on core 0 to start. Same problem as Option 2 also exists.

So if you want to start hardcoding your thread logic, then you have to start asking how heavy the load on each CPU core is, figure out which ones are doing the least work, and manually assign your threads to those cores. Never mind that in the time it takes to do this, those cores might be back at a heavy workload again. Or some other task will use those cores, and two other cores might be doing less work, but because you hardcoded the thread logic, your thread can't jump to the less overworked cores. Whoops.

See how threading very quickly becomes REALLY complicated?
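
For contrast, Option 1 in its entirety: create the threads and let the scheduler place them. A sketch with a made-up worker and no affinity calls at all:

```cpp
#include <thread>
#include <vector>
#include <cstdio>

// Hypothetical independent work item.
static void Worker(int id)
{
    std::printf("thread %d running wherever the OS scheduler put it\n", id);
}

int main()
{
    // Option 1: spawn the two independent threads and set no affinity.
    // The Windows scheduler (or any other OS scheduler) balances them
    // across whichever cores are least busy at the time.
    std::vector<std::thread> threads;
    for (int i = 0; i < 2; ++i)
        threads.emplace_back(Worker, i);

    for (auto& t : threads)
        t.join();
    return 0;
}
```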

AFAIK, PhysX is the only physics engine that uses the GPU. Bullet and Havok are still CPU-based.

My main point was that physics engines that can handle multiple-object interactions dynamically will HAVE to be GPU-based, because they involve massively parallel and complicated calculations that cannot be run on the CPU with any decent amount of speed. (It would be the equivalent of attempting to do rendering entirely on the CPU. You CAN, it will just be really, really slow.)
 



According to whom? AMD only made general statements regarding the production issues. This is common in the industry, where neither party wants to take full blame. AMD got a good deal with GloFo to only pay for good die. Most companies don't get those kinds of arrangements.

The only company shipping 5GHz chips right now is IBM, and those things are 1200+W massive MCM bricks (larger than an iPad) that require liquid cooling.

I'm not an advocate for GloFo, but AMD would have to be smoking something to think a low-pin-count desktop CPU package with AIR cooling would get higher than 4GHz easily. That was just wishful thinking from engineers who thought the process tech would save Bulldozer from its bottlenecks.

Look at what the motherboard manufacturers have had to go through to get higher overclocks. They throw on 10+ voltage regulators and 8-phase power. This is why Intel is pulling the VRMs into the chip: they need better control over the power delivered. The process tech is getting smaller, and the capacitors on the motherboard are sitting too far from the transistors.

You get problems with instantaneous voltage drop within the die itself, even between areas that are very close together on the die.

http://www.design-reuse.com/articles/4598/meeting-the-challenges-of-90nm-soc-design.html

This is an old article, but the problem gets worse as the gates get smaller.

The process tech can't fix these issues. They have to be designed into the layout.


 



That's true, but when true bottlenecks hit the industry, incredible things can happen. I was reminded of that recently watching a TED talk. I forget the name, but it quoted old comments from the vacuum-tube days about how they had shrunk the tubes as much as they could and they just weren't lasting. And then, out of the blue, the transistor was born, revolutionizing the industry as we know it today.

I think we're 10+ years from a true bottleneck. With 450mm wafers, 10/7/5nm tech and 3D-stacked silicon, there's a lot left to wring out of the transistor. At some point quantum computing will arrive, but that's very distant.
 


Vacuum tubes were amazing technology at one point... 14nm transistors are going to seem like ancient tech in 10 years.
 


There's one thing, though, and that is that x86 is already here with current software, unlike HSA, which needs help from the software side.



You read my mind but some software is supporting it a little.
 
What's the advantage of 64-bit for general applications? None. They will just use more memory.

Linux has a more efficient solution: the x32 ABI (NOT x86).
x32 uses 64-bit register mode but a 32-bit memory model, so applications that don't need the 64-bit address space, but do want 64-bit computation, can now have it.
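
A quick way to see what that means in practice, assuming a GCC toolchain and distro libraries built with x32 support (the -mx32 flag; the values in the comments are what you'd expect to see, nothing vendor-specific):

```cpp
#include <cstdint>
#include <cstdio>

int main()
{
    // Built with `g++ -mx32`, pointers are 4 bytes, so the process lives in a
    // 32-bit address space, yet the code still runs in long mode with the
    // full x86-64 register set and native 64-bit integer arithmetic.
    std::printf("sizeof(void*)    = %zu\n", sizeof(void*));    // 4 with -mx32, 8 with -m64
    std::printf("sizeof(uint64_t) = %zu\n", sizeof(uint64_t)); // 8 either way

    uint64_t big = 0x1234567890ABCDEFULL;                      // handled in one register under x32
    std::printf("64-bit value: %llu\n", (unsigned long long)big);
    return 0;
}
```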
 
IPC will be around 7%-9% and the speed bump will take the rest a little over 11%, maybe.
It's still confusing.
Let's change/re-define the rules:
First rule: do provide an IPC/relative IPC difference.
Second rule: do provide a comparison between CPU models (like the 8150 and 8350).
Third rule: play it clean and fair.
Last but not least: follow the above rules 😛


Option 1: Leave it to the Windows scheduler.
Option 2: Hardcode the two threads to cores 0/1 (or any two cores)
Option 3: Hardcode the second thread to any core besides core 0

Option 1 is the only correct option.

Option 2 ignores the possibility of high workloads already existing on the first two cores. Unacceptable to make the assumption you are the only heavy-work process active.

Option 3 ignores the possibility that the other thread will NOT be placed on core 0 to start. Same problem as Option 2 also exists.

So if you want to start hardcoding your thread logic, then you have to start asking how heavy the load on each CPU core is, figure out which ones are doing the least work, and manually assign your threads to those cores. Never mind that in the time it takes to do this, those cores might be back at a heavy workload again. Or some other task will use those cores, and two other cores might be doing less work, but because you hardcoded the thread logic, your thread can't jump to the less overworked cores. Whoops.
And throw in a Sempron or Celeron (single-core CPUs) and you've got... 😗
 


Which is exactly why they WILL NOT do that. Sheesh. :heink:
 
There is no issue for Intel to put more real estate on their chip (given their lead in that regard) ... the question is what, or more of what:

1. Power saving circuits ... already implemented and giving considerable savings in power.

2. Cache ... more of it gives diminishing returns and chews up lots of power when it's running. Having power circuits to shut down areas when not in use is handy, though.

3. Additional circuits for dedicated use ... AES and others ... yes ... working well.

4. More cores ... diminishing returns unless specific applications can use it ... servers / rendering / etc.

5. Using the graphics array for APU work - great ... still in its infancy. Quick Sync / Fusion.

6. More turbo, or dedicating more L2 cache to a faster core, or what?
 

Obviously no one will ever use a dedicated GPU ... because APUs are so powerful for all future games.

The problem is, if they do, then your argument is completely void, and in fact it would be that much faster.
 


If there is a need, then it makes sense to implement. Video transcoding, for instance, makes sense to do with OpenCL. Games, not so much. But if there isn't a performance need, then you won't see it adopted.



Larger address space, allowing more than 2GB of RAM to be accessed. Even on Win64, the old 2GB address space limit remains in effect for 32-bit applications (4GB, if the application is compiled as LAA). This limits what you can do, especially since textures take up so much of that space... You also get the benefit of all the extra registers that x64 offers, which can give a significant boost to performance, depending on how register-intensive the application in question is. Finally, you remove the need to go through WOW64, which carries a slight performance hit (2% or so).
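A crude illustration of that address-space ceiling (nothing game-specific; it just tries to grab one 3GB block and assumes the machine has the RAM/pagefile to back it):

```cpp
#include <cstddef>
#include <cstdio>
#include <new>

int main()
{
    // A default 32-bit build (2GB user address space) will fail this single
    // 3GB allocation; a 32-bit LAA build on Win64 may manage it; a 64-bit
    // build normally succeeds, memory permitting.
    const size_t three_gb = 3ull * 1024 * 1024 * 1024;
    char* p = new (std::nothrow) char[three_gb];
    std::printf("3GB allocation %s\n", p ? "succeeded" : "failed");
    delete[] p;
    return 0;
}
```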

Linux has a more efficient solution: the x32 ABI (NOT x86).
x32 uses 64-bit register mode but a 32-bit memory model, so applications that don't need the 64-bit address space, but do want 64-bit computation, can now have it.

Which is idiotic. It's basically a kludge within the kernel, essentially a reverse WOW64: you run the CPU in long mode but put an artificial limit on memory usage. An idiotic concept that was developed so the devs could drop support for 32-bit Linux.

And throw in a Sempron or Celeron (single-core CPUs) and you've got... 😗

I'm assuming the programmer was smart enough to identify how many CPUs the system can see at the start of the program; that goes without saying. A better example would be the Pentium 4 with HTT, since if you aren't careful, you'd offload to the HTT core... [hey, that kinda sounds familiar...]
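
For reference, a sketch of that up-front check (GetSystemInfo is the long-standing Win32 way; std::thread::hardware_concurrency is the portable C++11 equivalent, and may return 0 if the count is unknown):

```cpp
#include <windows.h>
#include <thread>
#include <cstdio>

int main()
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);   // Win32: number of logical processors

    unsigned portable = std::thread::hardware_concurrency(); // C++11 equivalent

    std::printf("logical processors (Win32): %lu\n", si.dwNumberOfProcessors);
    std::printf("logical processors (C++11): %u\n", portable);

    // A single-core Sempron/Celeron reports 1 here, so the program can simply
    // skip spawning extra worker threads instead of fighting the scheduler.
    return 0;
}
```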



Note you simply ignored the APU-only scenario... Guess you don't have a valid counter-argument then?
 