While I'm not disagreeing with your main point here, it's a bit disingenuous to put it that way, since SPECfp scores are single-threaded, so neither the Power5's multithreading nor its multicore abilities matter here. The L3 likely helps a lot, but it's off-die, hence quite a bit slower than IPF's large on-die caches.
Not quite. SpecFP is written without any multithreaded code; however, compilers are allowed to auto-parallelize and generate threads (as ICC 8.0 does). This has lifted the P4's SpecFP score with HT above its non-HT score, although the SMT implementation on the P4 is nowhere near as performance-oriented as the Power5's.
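For illustration, here's a sketch (mine, not from any benchmark source) of the loop shape auto-parallelizers target: independent iterations folded by a reduction. Java's parallel streams make explicit the threading that an auto-parallelizing compiler like ICC infers on its own:

```java
import java.util.stream.IntStream;

public class DotProduct {
    // An embarrassingly parallel FP reduction: every iteration is
    // independent, so the work can be split across threads without
    // changing the result. An auto-parallelizing compiler proves this
    // independence itself; .parallel() just states it explicitly.
    static double dot(double[] a, double[] b) {
        return IntStream.range(0, a.length)
                        .parallel()
                        .mapToDouble(i -> a[i] * b[i])
                        .sum();
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4};
        double[] b = {5, 6, 7, 8};
        System.out.println(dot(a, b)); // 1*5 + 2*6 + 3*7 + 4*8 = 70.0
    }
}
```

With SMT (HT), even one physical core can overlap two such worker threads, which is where the score bump comes from.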
As for the Power5's L3 cache, it has been improved significantly. Quite significantly, actually. The latencies of the old Power4's L3 were about twice as high as McKinley/Madison's, and word going around is that the Power5's L3 has close to one third the latency of the Power4's.
VLIW is not directly related to in-order execution or predication, which was the topic here. I don't know enough about modern GPUs to comment, though: are they in-order (I think so, but not sure), and do they support predication (I think not, but again, not sure)?
VLIW does explicit parallelization at code-generation time, which in turn requires that compilers use some form of predication/inlining. Of course, most modern GPUs use a JIT (in the drivers) to optimize for their VLIW architectures (something .NET will be able to do much more effectively with IA-64 than with scalar architectures).
"Long time" may be a bit misleading here; actually they have much less time to compile than a static compiler, but what you meant to say is exactly my point. In spite of the lack of time to optimize, the generated code is typically considerably faster, because at run time you can do optimizations and profiling you cannot do at compile time. This is my whole argument for saying that static scheduling (predication) will never be optimal, no matter how good the compiler, so relying purely on it for extracting ILP and omitting OoOE is not a silver bullet. I'd WAG that an Itanium without predicate registers but with OoO and decent branch prediction logic would be (considerably) faster than current Itaniums, at the expense of moderately higher complexity.
Erm, "long time" being the runtime. JITs don't have much time to compile, but they have all the time in the world to profile. If you're running the same loop over and over, the JIT may be slower for the first few iterations, but given a long enough runtime it can adapt its compilation to perform better. This requires, of course, that the application have a long runtime (so not something like an office application).
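A minimal sketch of that warm-up behavior, assuming a HotSpot-style JVM (the exact compile trigger varies by VM and flags; ~10,000 invocations is the classic `-XX:CompileThreshold` default):

```java
public class WarmUp {
    // A hot method: the VM interprets the first calls while profiling
    // them, then swaps in JIT-compiled code once the method proves hot.
    // The numeric result never changes; only the speed does.
    static long sumSquares(int n) {
        long s = 0;
        for (int i = 1; i <= n; i++) s += (long) i * i;
        return s;
    }

    public static void main(String[] args) {
        // Early calls run interpreted and slow; after enough calls the
        // compiled version takes over for the rest of the run. A short
        // run never amortizes that warm-up, a long one barely notices it.
        long last = 0;
        for (int call = 0; call < 20_000; call++) {
            last = sumSquares(1_000);
        }
        System.out.println(last);
    }
}
```

This is exactly why JITs shine on long-running servers and look bad on click-and-quit desktop apps.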
And as I've already mentioned, OoOE's benefits are becoming smaller, and they are by no means as powerful as the profiling done by JITs. The level of profiling and optimization done by JITs is simply beyond what hardware can do. Your window for optimizing with OoOE is about 80 or so instructions on the most advanced scheduler I know of out there (Prescott); a JIT can restructure the entire code, of arbitrary size. The more advanced you make your dynamic optimization in hardware, the more complexity and heat you put on-chip. Running it in software, however, costs you nothing except memory footprint (something there is plenty of) and possibly some processor resources (but considering just how much of a modern MPU's die sits idle, I'd say there's much to spare).
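One concrete example (my sketch, under the assumption of a HotSpot-style profiling JIT) of an optimization no 80-instruction hardware window can do: speculative devirtualization. If the profile shows a call site only ever sees one implementation, the JIT can inline through the interface call across the whole loop:

```java
public class Devirt {
    interface Op { int apply(int x); }

    // If the runtime profile shows this call site is monomorphic (only
    // one Op implementation ever observed), the JIT can speculatively
    // devirtualize and inline apply() into the loop body: a structural,
    // whole-program rewrite, with a deoptimization fallback if a second
    // implementation ever shows up. Hardware scheduling can only reorder
    // the instructions it is given; it cannot delete the calls.
    static int applyAll(Op op, int[] data) {
        int acc = 0;
        for (int v : data) acc += op.apply(v);
        return acc;
    }

    public static void main(String[] args) {
        Op doubler = x -> 2 * x;
        System.out.println(applyAll(doubler, new int[]{1, 2, 3})); // 12
    }
}
```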
Again, a JIT on top of a simple VLIW core is a proven approach for certain applications (3D graphics and rendering). All modern GPUs use this method.
I think you should give current JITs a second look. Any benchmark/review I've seen lately shows roughly equal, and often better, performance for Java/JIT (or .NET) versus compiled C++. Keep in mind that Java/JIT execution times include compilation AND garbage collection, which suggests the execution as such is faster with a JIT than with statically compiled code for most (if not all) code out there. That said, I'm not sure what percentage of its time a JIT spends on compilation and garbage collection, but I assume it's non-trivial.
Again, for what applications? For databases and web servers? Definitely. For office applications and/or scientific computing? Hardly. I say bring on the benchmarks if you truly have them. I'm a heavy Java programmer myself, and I'll admit most applications I write are more sluggish than their C++/ICC counterparts.
Yes, but Itanium is anything but deeply pipelined. It is quite wide, though. 128 registers is quite a bit, but I don't think making Itanium OoO would be more costly than for, say, the P4. Time will tell, but I expect IPF to implement OoO over the next few years; predication just isn't a substitute, and with each process shrink OoO gets even cheaper.
The P4 is nowhere near as wide as McKinley/Madison, and renaming 8 architectural registers onto 128 physical ones is a lot *lot* simpler than renaming 128 architectural registers. The logic required to replicate register files while still maintaining flat access grows *exponentially*. And looking at NetBurst, such things only serve to drive up power consumption.
Thing is, they are anything but mutually exclusive! I also disagree with your premise: any current CPU design goes to extreme lengths to avoid pipeline stalls, for good reason.
And achieves (relatively) very little. Optimization at compile time can speed your program up manyfold, by factors of ten perhaps, using a good compiler with the proper options. OoOE on even the most advanced scheduler out there extracts perhaps 2-3 IPC worth of ILP if it's lucky, and that's *with* a lot of instruction-stream tweaking that the compiler has to do. Compared to the in-order, scalar designs of the Pentium days, that's perhaps a 2-3x improvement in ILP, *with* proper static optimization at compile time.
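A toy illustration of the kind of rewrite a compiler does that an OoO core cannot: loop-invariant code motion. (My own sketch; real compilers do this automatically, the two versions here just make the before/after visible.)

```java
public class Hoist {
    // Before: recomputes the invariant Math.sqrt(k) on every iteration.
    static double naive(double[] a, double k) {
        double s = 0;
        for (double v : a) s += v * Math.sqrt(k); // redundant work each pass
        return s;
    }

    // After: the invariant is hoisted out once. A static optimizer (or
    // JIT) deletes the redundant work entirely; an OoO scheduler can at
    // best overlap it with other instructions, never remove it.
    static double hoisted(double[] a, double k) {
        double rk = Math.sqrt(k);
        double s = 0;
        for (double v : a) s += v * rk;
        return s;
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3};
        System.out.println(naive(a, 4.0));   // 12.0
        System.out.println(hoisted(a, 4.0)); // 12.0, same result, less work
    }
}
```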
You tell me which brought more improvement.
Single-threaded performance still matters, and OoOE is one way to improve it that has been exploited by every (modern) architecture except IPF. It's low-hanging fruit, IMHO.
The costs of OoOE keep growing for wider or longer designs, and the benefits keep getting relatively smaller, especially as compiler technology improves. It simply isn't worth it anymore, for the most part.
A good JIT in conjunction with a very open ISA (VLIW) will be much better for much of today's performance-demanding software (multimedia, scientific computing, 3D graphics, etc.). IA-64 offers the hardware for this; only time will tell whether Intel invests in runtimes to take advantage of it.
"We are Microsoft, resistance is futile." - Bill Gates, 2015.