Warning: Another long post. lol
Itanic's "efficiency" comes from it's huge die size
It's interesting that you point out Itanium's large die size. If you read the entire AnandTech article I posted earlier, they actually address this issue. While the die size might be large at 432mm^2, the majority of that is just the L3 cache. If you look at the core size, that is, the transistors that actually process data plus the L1 cache, it is relatively small: only 80mm^2, of which 15mm^2 is for x86 compatibility. The 130nm Opteron is 190mm^2, of which more than half is cache. So essentially, the fundamental size of the processing core is the same; only the extra cache makes the Itanium larger.
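To make that comparison concrete, here's the arithmetic using the AnandTech figures above. Note the Opteron cache split is only given as "more than half", so that number is approximate:

```python
# Die-size breakdown from the AnandTech figures quoted above.
itanium_die = 432          # mm^2 total
itanium_core = 80          # mm^2 of actual logic + L1
x86_compat = 15            # mm^2 of that core is x86 compatibility hardware
opteron_die = 190          # mm^2 total on 130nm
opteron_cache = 0.5        # "more than half", so treat as a lower bound

print(itanium_die - itanium_core)         # 352 mm^2 -- mostly L3 cache
print(opteron_die * (1 - opteron_cache))  # <= 95 mm^2 of non-cache logic
# Both cores land in the same ~80-95 mm^2 ballpark; the L3 is what
# makes Itanium's die look huge.
```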
Now you'll point out that regardless of comparable core sizes, a larger overall die is still larger, which of course leads to concerns about heat. On this issue AnandTech notes:
"It is clear that the Itanium core has a big advantage in the area of threading and power dissipation constraints. If you are not convinced, the dual core Itanium Montecito (90 nm process) has no less than 1.72 billion transistors, but it is still able to consume less than 130 W. Compare this with the 300 million transistor Power 5+, which consumes about 170 W on a 90 nm SOI process."
The other issue with increased die size and Itanium's need for more cache is, of course, cost. AnandTech notes:
"Time is on the side of the Itanium. As new process technology was introduced, cache sizes have been growing very quickly during the past years, without introducing extra cost or high latency. No competitor has the advantages that Itanium has:
1. As caches get bigger, Itanium benefits more than the x86 competition. X86 CPUs target higher clock speeds and, as such, it is more difficult to use large low latency caches.
2. Intel has mastered as no other the skill to produce very dense and fast cache structures."
"From what I have seen, the A64s still seem to be capable of higher speeds."
The point is that clock speed will only go so far in increasing performance. It's doubtful that the A64 will reach 4GHz in a production environment, and even at 4GHz the performance increase would not be linear, so it's unlikely you would see even a 20% gain. On top of that, power consumption and temperatures would be unacceptably high, 65nm process or not, at a time when the focus is on lowering both. This means we cannot expect large performance increases from clock speed increases.
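A quick way to see the non-linearity: assume some fixed fraction of runtime is stuck waiting on memory, which doesn't speed up with the core clock. The 40% fraction and the 2.8GHz baseline below are my own guesses, purely for illustration:

```python
# Amdahl-style sketch: only the compute-bound share of runtime scales
# with clock speed. The 40% memory-bound fraction is an assumption.
def clock_speedup(clock_ratio, mem_fraction=0.4):
    compute = (1 - mem_fraction) / clock_ratio
    return 1 / (compute + mem_fraction)

# e.g. a hypothetical jump from 2.8GHz to 4GHz (a 1.43x clock increase):
print(clock_speedup(4.0 / 2.8))  # ~1.22x -- nowhere near 1.43x
```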
Clock speed aside, there just aren't many other options for the superscalar architecture to produce large performance increases. Multiple integrated memory controllers don't offer much benefit to single-processor systems, since they just aren't bandwidth-limited, especially once DDR2 arrives. Adding more SSE or AMD64 extensions would help, but the performance benefit from extensions is generally minor. Similarly, prefetch, branch prediction, and loop detection routines are just about as efficient as they are going to get. Branch prediction on the Pentium 4 is 97% accurate, so a 1% or 2% improvement won't show major benefits.
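To put the branch-prediction point in numbers: only the 97% figure comes from above; the branch mix, penalty, and baseline CPI below are rough assumptions of mine:

```python
# Back-of-envelope CPI impact of better branch prediction.
branch_frac = 0.2    # assumed: ~1 in 5 instructions is a branch
penalty = 20         # assumed: mispredict penalty in cycles
base_cpi = 1.0       # assumed: baseline CPI with perfect prediction

def cpi(accuracy):
    return base_cpi + branch_frac * (1 - accuracy) * penalty

print(cpi(0.97))              # 1.12
print(cpi(0.98))              # 1.08
print(cpi(0.97) / cpi(0.98))  # ~1.04 -- one extra point buys only ~4%
```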
That leaves drastic measures for the superscalar architecture to continue. For instance, you could add more execution units, but the x86 architecture generally has more than enough already. Including more co-processors or Cell-style processors is nice, but that would increase die size and generate more heat, so it wouldn't be very efficient from a cost or power-consumption perspective.
I hope I'm not being misunderstood. I'm not making these points to support Itanium; I'm making them to explain why multi-core and multi-threading are necessary, since large performance increases in a single-core, plain superscalar environment are now hard to achieve.
"If M$ were to design a better scheduler, HT would be a total waste (cross your fingers for vista)."
The disconnect we have is that you are talking about HT in the Pentium 4, so naturally the perspective is coloured by the negative impression of Prescott. What I'm talking about is HT in Conroe's successor on a 45nm process. Whether Prescott runs hotter with HT enabled is irrelevant. Conroe on 65nm already looks cool enough, and its successor on 45nm, with leakage solved, will be cooler still. In that case there is plenty of thermal headroom for HT: yes, it will run hotter, but the additional heat won't matter since the starting temperature would be low to begin with.
Now, about the scheduler. While an improved scheduler will increase performance, that isn't the same thing as Hyper-Threading. Even with better scheduling, the processor can only process one thread at a time. What HT allows is for two threads that don't need the same execution units to be processed at the same time. Even if Vista has a better scheduler, it would not make HT obsolete. In fact, with a better scheduler the threads that HT receives would be further sorted and organized, yielding even greater HT benefits.
The fact that the 840EE performs primary tasks slower while speeding up lower-priority ones is mostly a scheduler problem. The scheduler occasionally lets through two threads that need some of the same execution units, which slows down the primary task. A smarter scheduler, like what Vista may have, would pair the low-priority task with something that has no execution-unit conflicts, ensuring the primary thread achieves full performance.
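Here's a toy illustration of that pairing idea. This is my own sketch, not how any real OS scheduler is implemented, and the thread names and unit sets are made up:

```python
# Pair the primary thread with a background thread whose execution-unit
# needs don't overlap, so HT doesn't steal the primary's resources.
def pick_partner(primary_units, candidates):
    for name, units in candidates.items():
        if not (primary_units & units):  # disjoint unit sets -> no conflict
            return name
    return None                          # nothing safe to pair

primary = {"fpu", "sse"}                 # hypothetical: a media encode
background = {
    "indexer": {"fpu", "sse"},           # conflicts -> would slow the encode
    "network": {"alu", "load_store"},    # disjoint -> a good HT partner
}
print(pick_partner(primary, background))  # -> "network"
```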
It's quite possible that HT will actually run better on a Conroe derivative than on Prescott. Both the Pentium 4 and the K8 are 3-issue designs, while Conroe will be 4-issue. Now, people argue that x86 code rarely saturates even a 3-issue design, so Conroe won't benefit from the wider issue rate. While that may be true for a standard x86 processor, it makes Conroe ideal for HT: if most of the time only 2 issues are used, then with HT enabled you get a fully parallel issue rate, 2 x 2 issues through a 4-issue design.
A wider issue rate, 6-wide, is why CMT (a less advanced form of HT) is so effective on Itanium, with a 30% performance benefit. Conroe with HT could see similar gains, since the EPIC architecture generally issues more instructions at a time than x86: probably something like 4 issues through a 6-wide design on Itanium, whereas x86 generally issues around 2 instructions. A 4-wide design is quite sufficient for HT in an x86 environment.
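Here's the issue-slot arithmetic behind that argument as a deliberately crude model. The widths and per-thread issue rates are the rough averages discussed above; everything else (stalls, inter-thread dependencies) is simplified away:

```python
# Fraction of issue slots filled per cycle, ignoring stalls and
# resource conflicts between the threads.
def slot_utilization(width, issues_per_thread, threads=1):
    return min(width, issues_per_thread * threads) / width

print(slot_utilization(4, 2))             # 0.5  -- x86 on a 4-wide Conroe
print(slot_utilization(4, 2, threads=2))  # 1.0  -- HT fills the idle half
print(slot_utilization(6, 4))             # ~0.67 -- EPIC on 6-wide Itanium
print(slot_utilization(6, 4, threads=2))  # 1.0  -- capped at the 6-wide limit
```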
A wider design also requires more execution units to process those instructions, and in this regard Conroe also looks promising. The Pentium 4 had 2 FPUs, but each was restricted to specific functions: one handled floating-point and SSE addition, subtraction, multiplication, and division, as well as MMX; the other only handled floating-point and SSE moves and stores. Conroe looks to have 2 or 3 full FPUs, each of which can perform all of these functions.
The Pentium 4 also had 1 slow ALU to handle complex operations like shift and rotate, and 2 fast ALUs. While the fast ALUs operate at twice the clock speed, they were limited in what they could process: one could do addition, subtraction, logical operations, branch evaluation, and store-data ops; the other could only do addition and subtraction. Conroe looks to have at least 3 full ALUs, each of which can do all of these functions.
So even though Conroe may have a similar number of execution units to the Pentium 4, each of Conroe's units can process the complete range of instructions. This is critical to HT, as it allows a much wider variety of threads to be paired together without worrying about whether the available execution units can handle the required operations. HT still won't offer the same performance as 2 cores, but with the operation constraints removed, the probability of 2 threads executing at the same time goes up; the only remaining limit is the raw number of execution units, and that should generally be fine since most code doesn't use that many at once anyway.
The only other concern is the idea that HT needs to be paired with a long pipeline. That doesn't appear to be correct: if CMT works fine on the 8-stage pipeline of Itanium, HT should work fine on the 14-stage pipeline of Conroe. A wide issue rate and the availability of execution units are far more important than pipeline length.
Overall, I still think HT is an ideal addition to a 45nm Conroe successor. The wider issue rate and the full-function execution units ensure HT performance, while the 45nm process and the Conroe architecture itself ensure that heat and power-consumption concerns are mitigated.