AMD Piledriver rumours ... and expert conjecture

We have had several requests for a sticky on AMD's yet-to-be-released Piledriver architecture ... so here it is.

I want to make a few things clear though.

Post a question relevant to the topic, or information about the topic, or it will be deleted.

Post any negative personal comments about another user ... and they will be deleted.

Post flame baiting comments about the blue, red and green team and they will be deleted.

Enjoy ...
 


And a quad is cheaper than dual? hmmmmmm

I still remember those mighty azz 3.8GHz Pentium 4s. Those were the good days for us PC junkies. None of these 3.4GHz Core i's.
 

Well, 2 cores were still better than 1 in some instances of running 2 programs at a time, because of the FSB and because the caches are mostly dedicated to each core.
 
My point being that even the so-called single fat cores were simulating parallel processing by timeslicing CPU time. Multicore processors are just a much more efficient way of supporting lots of threads - doing parallel processing with parallel hardware. (of course there are far fewer cores than threads, even with massively parallel GPUs).
 

Not possible, or not practical? There is a thin line, and most often everything is put into the not-possible category because it's not cost-effective.

From what I understand with locks, there are alternatives, but they are more difficult to implement, which means more cost.

The problem also is that the solution (from what I gathered) will actually slow down lower-core-count computers: instead of a lock-and-wait-your-turn, it's 3 simultaneous threads, each executing, plus one comparator that must always be running. However, in a high-core-count system this would be considerably faster, as you don't have threads sitting doing nothing because the data is locked; instead you will occasionally have a thread given new data to re-run if the data itself was changed.

How often is data locked, examined, the value left unchanged, and then unlocked again? The approach above eliminates threads waiting for nothing, but requires more cores.

Locks are extremely simple and require no overhead to maintain, but they do slow the program down by themselves.
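To make that concrete, the lock-and-wait-your-turn pattern looks something like this in C++11 (a minimal sketch; the shared counter and the 4 worker threads are made up for illustration):

#include <mutex>
#include <thread>
#include <vector>

std::mutex mtx;          // guards shared_total
long shared_total = 0;   // data shared between all threads

void worker() {
    for (int i = 0; i < 100000; ++i) {
        mtx.lock();      // any other thread arriving here waits its turn
        ++shared_total;  // read-modify-write is safe while we hold the lock
        mtx.unlock();    // the next waiting thread can proceed
    }
}

int main() {
    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
}

Every thread that hits mtx.lock() while another holds it just sits there, which is exactly the waiting the lock-free approach tries to eliminate.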
 

Game makers want games to run on as many machines as possible without doing too much work. Dual cores are still the dominant CPU spec for most people, so games will be built for the lowest common denominator.
 

Here's waiting for transactional memory to change things for multithreading in general programming.
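For the curious, GCC 4.7 already has an experimental take on this (compile with -fgnu-tm); a rough sketch of what the programming model looks like, with made-up account balances:

// Compile with: g++ -fgnu-tm tm_demo.cpp  (GCC 4.7+, experimental)
int balance_a = 100, balance_b = 0;  // made-up shared data

void transfer(int amount) {
    __transaction_atomic {            // commits atomically or retries;
        balance_a -= amount;          // no explicit lock is ever taken
        balance_b += amount;          // by the programmer
    }
}

int main() { transfer(25); return balance_b == 25 ? 0 : 1; }

The appeal is that you mark the critical section and let the runtime sort out conflicts, instead of choosing and ordering locks yourself.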
 



It's not that big of a deal. The reason they made dual-cores is that making one fat core was becoming more and more difficult. I'd say close to all big programs use 2+ cores anyway. Programs like Excel should use all the cores in a system.
 


It's not that the games demand a lot of raw power; they just don't use more than 2 cores.
 


I would think that statement should be qualified a bit. Some threads are fairly "sparse" in that they spend a lot of CPU cycles waiting around for input or for some other lightly loaded thread to complete. Intel's justification for SMT was that if some thread had maybe 30% wasted NOP clock cycles, you switch to another thread and execute it for a bit, then switch back to the first thread. That way you keep all that expensive hardware occupied to the maximum extent possible, seeing as it is already powered up and everything.

Of course nowadays with selective powering down of portions of a full core - front ends, cache portions, etc - maybe it doesn't make as much sense efficiency-wise as it did 4-5 years ago, when SMT got as much as 25-30% performance improvement over no SMT depending on the software being executed.
 

SMT also works much more effectively on longer pipelines, which is why it was implemented in the P4. Its effectiveness becomes a serious question when code could be optimized without it. The only place I've seen it be worth the effort nowadays is the Atom, where there is much less hardware utilization due to no OoO execution. I'm not sure how efficient it is in regards to die space, but the energy-efficiency gains aren't that great.
 


How much faster is a 5770 over an 8800GTX?

I'm not sure there is a big enough difference to discern meaningful results here.
 


Again, you can't speed up serial processes by adding more cores.




Semaphores, mutexes, and interlocks are all functionally the same at a really low level: they keep data locked for a period of time so only one thread can access it.

Any time you have two threads running at the same time that can both read/write a variable, you need to lock it every time you access it to ensure your changes aren't hammered by the other thread. [Note that in the case where one thread only writes and the other only reads, you don't need the extra overhead. In any other case, locks are a necessity.]
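To illustrate that bracketed exception: with exactly one writer and one reader you can get away with a C++11 atomic flag instead of a mutex. Strictly speaking you still need ordering guarantees; the atomics just provide them far more cheaply (a minimal sketch, names made up):

#include <atomic>
#include <cstdio>
#include <thread>

int payload = 0;                 // plain shared data
std::atomic<bool> ready{false};  // hand-off flag, no mutex needed

void writer() {                  // the only thread that writes
    payload = 42;
    ready.store(true, std::memory_order_release);
}

void reader() {                  // the only thread that reads
    while (!ready.load(std::memory_order_acquire)) {}  // spin until published
    std::printf("%d\n", payload);  // acquire/release pairing guarantees 42
}

int main() {
    std::thread w(writer), r(reader);
    w.join(); r.join();
}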

The problem also is that the solution (from what I gathered) will actually slow down lower-core-count computers: instead of a lock-and-wait-your-turn, it's 3 simultaneous threads, each executing, plus one comparator that must always be running. However, in a high-core-count system this would be considerably faster, as you don't have threads sitting doing nothing because the data is locked; instead you will occasionally have a thread given new data to re-run if the data itself was changed.

Uhhh... no. You don't give a thread "new data", because you, the developer, have no clue your thread is blocked. That's the domain of the OS.

It's the OS that determines if a thread can run or not. It's the OS that schedules threads to run. It's the OS that puts threads on a specific core. All developers can do is make the heavy-workload threads as parallel as possible and hope the OS allocates them in a semi-parallel way. [At this point, choice of compiler can be a HUGE performance factor.]

Within the application, I have no way of knowing if a thread is ever blocked, because if the thread WAS blocked, it would be unable to run and determine it is blocked! If a thread can't run, the OS preempts it, and some other thread (maybe for your program, maybe not) will run instead.

My point being, the OS scheduler plays a role.

How often is data locked, examined, the value left unchanged, and then unlocked again?

Explicitly, whenever you have two threads that can both access the same data object; implicitly, by the OS whenever you do a memory access. Depending on design, this can be a measurable performance impact or negligible; it depends a lot on how many threads need to access the same data structures. If you have a lot of threads that need access to the same object, there's not much you can do performance-wise.

-------------------------------------------------

Now let's look at games again. You have a LOT of data that ends up being shared (specifically the geometry matrix, which touches the rendering, physics, AI, and audio engines). Every time it's accessed, no one else can touch it until it's explicitly unlocked again. That by itself will limit scaling, because you have to design around that possibility (try to ensure by design that the threads won't need to access the structure at the same time). But by doing so, you limit how parallel you are (you now have serial processes).

Now, a lot of this is done under the hood by game engines these days, so developers don't notice. But THINK about how the different engine subsystems can interact: audio cues can affect AI processing, so audio must be done before AI is processed. Physics can affect geometry, which in turn can affect audio processing (assuming a 3D sound engine that accounts for terrain features). And so on.

You begin to realize that a lot of what has to happen must be done in a SPECIFIC ORDER. That in itself limits how parallel you can be, and thus limits the benefit of adding more cores. If only 20% of the program is actually parallel, adding more cores can never cut more than that 20% off the runtime.
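That's Amdahl's law; a quick back-of-the-envelope sketch (the 20% parallel fraction and the core counts are just example numbers):

#include <cstdio>

// Amdahl's law: speedup = 1 / ((1 - p) + p / n),
// p = parallel fraction of the program, n = number of cores.
double amdahl(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main() {
    // With only 20% of the work parallel, even infinite cores
    // top out at 1 / 0.8 = 1.25x overall.
    const int cores[] = {2, 4, 8, 64};
    for (int n : cores)
        std::printf("p = 0.20, %2d cores -> %.3fx speedup\n", n, amdahl(0.20, n));
}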
 


IIRC SMT was only about 5% effective on the P4 - just more opportunity for the weaker front end decoders to mispredict and stall out the longer pipes 😛.

And there was discussion about the relative die area costs in the old BD thread. From what I remember, SMT is <5% extra core area on Sandy Bridge or maybe Nehalem, vs. 12% for Bulldozer's CMT. A full extra core would presumably be 100% extra core area :).
 


True, I didn't want to spend a lot of $$ on a 5-yr-old system so I went with an equivalent. I still use that system occasionally for legacy gaming.

However, my point was that the Q6700 system itself was a serious bottleneck under such extreme gameplay conditions, not just the GPU, which IIRC is a little faster than the 8800GTX. The CPU has to spawn all those enemies, and I doubt the game (one of the expansion modules for Neverwinter Nights; I'd have to go look up the name) uses all 4 cores anyway, since it is basically NWN with a few tweaks plus a few new bugs.

I use XP 32-bit as the OS on that system, so perhaps there are memory leaks that rapidly fill up the ~3GB of address space left over after the 3/4GB of GPU memory gets mapped into the same 4GB space. I never really looked into the matter, as I can avoid the issue entirely by killing the level boss and ending the level before the bug takes over.
 


Key phrase being "you don't need the extra overhead", because that requires more cores?
As for performance...

http://woboq.com/blog/introduction-to-lockfree-programming.html
Results (on my 4-core machine):

              Push (ms)   Pop (ms)   Total (real / user / sys) (ms)
With QMutex   3592        3570       7287 / 4180 / 11649
Lock-free     185         237        420 / 547 / 297


Not bad: the lock-free stack is more than 100 times faster. As you can see, there is much less contention (the real is smaller than the user) in the lock-free case, while with the mutex, a lot of time is spent blocking.

100 times isn't much faster ...

Another intro to lock-free programming:

http://www.cs.cmu.edu/~410-s05/lectures/L31_LockFree.pdf
http://www.eetimes.com/design/embedded/4214763/Is-lock-free-programming-practical-for-multicore-

I think you're stuck on this aspect:

The caveat is that unfortunately the development of application software algorithms with these facilities is both different and more challenging than working in the traditional way. When using traditional locks, a software designer first identifies "critical sections" of code and then "protects" them by selecting one of the traditional locking mechanisms to "guard" the critical sections. In the past, avoiding locks seemed dangerous or tended to involve intricate, convoluted algorithms. For these reasons, lock-free programming has not been widely practiced.

BTW, earlier I called it a comparator thread; it's actually compare-and-swap loops.
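A compare-and-swap loop with C++11 atomics looks something like this; a minimal lock-free stack push in the spirit of the woboq example above (my own sketch, not their code):

#include <atomic>
#include <thread>

struct Node {
    int value;
    Node* next;
};

std::atomic<Node*> head{nullptr};

void push(int v) {
    Node* n = new Node{v, head.load()};
    // CAS loop: succeeds only if head is still what we read into n->next;
    // on failure, n->next is refreshed to the current head and we retry.
    while (!head.compare_exchange_weak(n->next, n)) {
        // another thread won the race; loop and try again
    }
}

int main() {
    std::thread a([]{ for (int i = 0; i < 1000; ++i) push(i); });
    std::thread b([]{ for (int i = 0; i < 1000; ++i) push(i); });
    a.join(); b.join();
}

No thread ever blocks here; a loser of the race just retries, which is the "re-run if the data changed" behavior I was describing.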

Also, EVEN AMD says lock-free programming is for multi-core systems. http://developer.amd.com/pages/125200689.aspx

But like you said, it's extra overhead, and programmers/devs don't like that.
 