hcl123 :
So this is a very old argument and experiment; it's more than proven by now that in REAL high-performance terms, you must talk multi-threading. Perhaps there is a reason why Sony and MSFT chose 8 cores for their gaming consoles, where single-thread performance is clearly not a big issue (otherwise they would have chosen another kind of core), while in the PC world the trend seems to go "contra-natura" and no one seems willing to go beyond 4 cores/threads. Is Intel's clout being felt?.. Is there any doubt that Sony or MSFT might end up developing 8-thread games? ... It's an old story in any case, one that ends up smashed against the REALITY of the very poor ILP scaling inherent in current ISA paradigms (be they RISC or CISC).
Different types of systems. On a console, or any other integrated system, where you have exactly ONE hardware profile, and you don't have a heavy OS to lug around, you can code VERY low level, directly to the hardware, and extract significantly improved performance. I can, for instance, guarantee the contents of every single memory address at just about any point in any program I choose to run. I can guarantee what threads are running on the system. And so on.
Look, I've worked on systems where you code directly to the HW. It's a different way of life. About 90% of my work is VERY optimized assembly (because when you measure code and memory space in the KB realm, you really care about code efficiency). Trust me when I say, PCs are probably never getting more than about 50% of their theoretical maximum performance, simply due to overhead.
Long story short: the consoles seem to signal multi-threading intent, the entire server world has been heavily multi-threaded for ages, common OSs can now also be heavily multi-threaded, and compilers can support it fine (they just can't write the code for the developer... yet)... so why isn't, or can't, the PC world follow the trend?
Again, PCs are a different world. Consoles are integrated machines; you can code to a very low level. Servers are designed around parallel workloads (multiple users, large datasets). PCs, for the most part, aren't.
Yes, "dark silicon" is a serious concern, and it is made worse by wider, fatter cores. In any case, that doesn't mean multi-threading can't scale; on the contrary, nothing can really scale for the foreseeable future BUT multi-threading. The problem is that PC software isn't there, and seems unwilling to get there. But I think reality will creep in eventually, as with IBM's 8-issue-wide CPU attempt. I think Haswell's less-than-expected performance increase for a "tock" is just a warning...
Haswell was focused on the iGPU. No shock there.
It's a question of good development tools. It's a caricature that AMD pushes TBB (Intel Threading Building Blocks) harder than Intel itself does... Intel seems to have forgotten about good multi-threading development tools altogether. In the server world that's no big deal, since it already employs the "ninjas" of coding, but in the DT world good tools could be essential to get multi-threaded programming off the ground.
There is little any compiler can do in this area. If the OS scheduler decides to run thread A on core 0, guess what? That's where that thread is going to run. If the OS scheduler decides it has a high-priority interrupt to handle while your application's thread is running, guess what? You get kicked off the core. This is entirely the domain of the OS, not the compiler/optimizer.
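To illustrate the division of labor: the most an *application* (not a compiler) can do is request placement from the OS. A minimal sketch, assuming Linux (where `os.sched_setaffinity` is available); even with the mask set, the kernel still decides when the thread actually runs:

```python
# Linux-only sketch: ask the kernel to restrict this process to a single CPU.
# This is a *request* to the OS scheduler -- the compiler has no say in any of
# this, and the kernel can still preempt us on that CPU at any time.
import os

allowed = os.sched_getaffinity(0)   # pid 0 = the calling process
one_cpu = min(allowed)              # pick one CPU we are currently allowed on

os.sched_setaffinity(0, {one_cpu})  # request: run only on that CPU
mask = os.sched_getaffinity(0)
print("allowed CPUs now:", mask)    # the mask is honored...
# ...but a higher-priority task or interrupt can still kick us off the core.
```

Note what this does and does not buy you: it narrows *where* the thread may run, never *whether* it keeps running.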
Just wonder why Intel is pushing HTM (hardware transactional memory) into its new designs... I guess it will be turned off in DT and only active in server SKUs, when it could be really useful to everyone. HTM with only 4 or even 8 threads is simply a bad joke!..
HTM has its own downsides. The concept is simple: perform the action in question without placing a lock, and when you are done, confirm the memory in question has not been changed. If this works, then you saved a very minor amount of processing (no need to place a lock). If, however, some other thread DID change the contents, guess what? The transaction is undone, a lock is put in place, and the operation is performed a SECOND time, this time with the lock in place. So the "fail" case is going to be at least 2x as slow as conventional processing, in exchange for a very minor speedup in the "pass" case.
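The try-then-fallback pattern described above can be emulated in software with a version counter (no real HTM involved; `Account` and `deposit` here are illustrative names, not any real API): attempt the update optimistically, and if another thread committed in the meantime, redo the whole operation conventionally under the lock — the "2x" fail case.

```python
# Software emulation of the HTM fast-path/fallback pattern described above.
import threading

class Account:
    def __init__(self, balance):
        self.balance = balance
        self.version = 0                    # bumped on every committed write
        self.lock = threading.Lock()

    def deposit(self, amount):
        # --- optimistic "transaction": compute without holding the lock ---
        seen = self.version
        new_balance = self.balance + amount
        with self.lock:
            if self.version == seen:        # nobody interfered: commit ("pass")
                self.balance = new_balance
                self.version += 1
                return "fast"
        # --- "fail" case: the work is redone, this time under the lock ---
        with self.lock:
            self.balance += amount
            self.version += 1
        return "slow"

# Usage: four threads hammering one account; no update is ever lost.
acct = Account(0)
workers = [threading.Thread(target=lambda: [acct.deposit(1) for _ in range(1000)])
           for _ in range(4)]
for w in workers: w.start()
for w in workers: w.join()
print(acct.balance)                         # 4000
```

Real HTM does the commit check in hardware instead of under a lock, but the cost structure is the same: the conflict path pays for the work twice.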
Hence most developers understand that if you have two threads that require potentially simultaneous access to the same data structure, you *probably* have a design issue that needs resolving. Basic threading principle: if you have to do a lot of reaching across the thread boundary, then your threading model is probably wrong.
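One common way to fix such a design, sketched below with the standard library (the `owner` worker and message shapes are made up for illustration): give the shared structure a single owner thread and have everyone else send it messages, so nothing ever reaches across the thread boundary.

```python
# Sketch: one owner thread holds the data structure; other threads only send
# messages via a queue, so the structure itself needs no locks at all.
import queue
import threading

def owner(inbox, results):
    totals = {}                          # touched ONLY by this thread
    while True:
        msg = inbox.get()
        if msg is None:                  # sentinel: shut down and report
            results.put(totals)
            return
        key, amount = msg
        totals[key] = totals.get(key, 0) + amount

inbox, results = queue.Queue(), queue.Queue()
t = threading.Thread(target=owner, args=(inbox, results))
t.start()
for i in range(100):                     # producers only send messages
    inbox.put(("a" if i % 2 else "b", 1))
inbox.put(None)
t.join()
totals = results.get()
print(totals)
```

The design choice here is ownership, not locking: contention is replaced by a queue hand-off, which is exactly the "don't reach across the boundary" principle above.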
Now, in a massively parallel situation, HTM would probably result in a significant speedup in system performance, simply because the overhead of the locks will likely be greater than the cumulative performance hit of all the retries. For minimally threaded workloads, I would expect a decline or no change in performance.