[citation][nom]rustyxshackleford[/nom]The problem I see here is that no matter how many cores you're putting in the package, it's still a serial processor. I am in agreement with Bill Dally that we need to concentrate more on parallelism in our processor designs. I have no doubt that this CPU can probably parallel task very well, but we're just delaying the inevitable here. This is not entirely semiconductor companies or hardware companies fault. Programmers are still writing in serial fashion, and the majority of programs cannot properly utilize all the abilities of the current hardware. That statement of course does not include those who are writing for Knoppix, beowulf, compute cluster and the like.[/citation]
Actually, you sound like someone who read something somewhere without really understanding what's going on.
Nehalem-based processors are very wide, and extract about as much ILP (instruction-level parallelism) as is feasible. Some even suggest they are too wide. That's why hyperthreading works: the processor has more execution resources than one thread can typically use. It's also one reason hyperthreading works better on Nehalem than it did on the Pentium 4, which was narrower.
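To put rough numbers on that, here is a toy slot-filling model in Python. It is only a sketch: the function names are made up, and the issue widths and per-thread demand are hypothetical figures chosen for illustration, not measurements of Nehalem or the Pentium 4.

```python
def slots_used(issue_width, per_thread_demand, threads):
    """Toy model: each thread can fill `per_thread_demand` issue slots per
    cycle on average, and the core caps the total at `issue_width`."""
    return min(issue_width, per_thread_demand * threads)


def smt_gain(issue_width, per_thread_demand):
    """Throughput with two threads relative to one thread on the same core."""
    one_thread = slots_used(issue_width, per_thread_demand, 1)
    two_threads = slots_used(issue_width, per_thread_demand, 2)
    return two_threads / one_thread


# Hypothetical figures: a thread that can fill 2 slots per cycle.
wide_core_gain = smt_gain(issue_width=4, per_thread_demand=2)    # wider core
narrow_core_gain = smt_gain(issue_width=3, per_thread_demand=2)  # narrower core

print(wide_core_gain, narrow_core_gain)
```

In this model the wider core has room for the whole second thread, while the narrower one runs out of slots. Real gains are far smaller than this, because the two threads also fight over caches and other shared resources, which the model ignores.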
Going wider would increase die size and power use, and could even limit clock speed slightly. At this point, it's very difficult to improve performance on a single thread, and it's much easier to add cores and improve TLP (thread-level parallelism).
Itanium was designed to be much wider and be able to handle more instructions per cycle, but in practice, it hasn't been as successful as Intel might like.
Intel gave up on maximum single-thread performance when it gave up on the Pentium 4 design. Very high clock speeds would be the best way to get it (although the Pentium 4 was a bad design by any measure), but high-clock-speed designs also take disproportionate amounts of energy and can be quite large, so Intel decided against continuing down that path. It would be very difficult to build a good-performing, high-clock-speed quad core with reasonable power use.
Now, before someone says the Pentium 4 was slow: that's not an indictment of high-clock-speed processors, it was just a bad implementation of the idea. IBM was very successful with high clock speeds. The Pentium 4, strangely, had only one decoder, and if the trace cache didn't have the instruction stream, it ran as a scalar processor. The trace cache miss rate was so high that this happened almost 50% of the time. So it wasn't the long pipeline that made the processor slow (it was double-pumped anyway, which effectively made it a lot 'shorter'); it was just a terrible design on Intel's part that happened to have a long pipeline.
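A back-of-the-envelope version of the trace cache point, as a crude Python sketch. The 50% figure is the one from this post; the hit width of 3 per cycle and the function name are assumptions for illustration, and the model ignores everything else that stalls a pipeline.

```python
def average_delivery_rate(miss_fraction, hit_width, miss_width=1):
    """Time-weighted average of instructions fed to the back end per cycle.

    miss_fraction: share of time spent running from the single decoder
    hit_width:     assumed per-cycle delivery when the trace cache hits
    miss_width:    per-cycle delivery from the lone decoder (scalar, so 1)
    """
    return (1 - miss_fraction) * hit_width + miss_fraction * miss_width


# Assumed hit width of 3; miss fraction of 0.5 taken from the post.
rate = average_delivery_rate(miss_fraction=0.5, hit_width=3)
fraction_of_peak = rate / 3

print(rate, fraction_of_peak)
```

Under those assumptions the front end averages 2 per cycle, about two thirds of what the rest of the core was built to consume, and that is before the long pipeline costs anything.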
Put another way, if you put the trace cache and single decoder on a Nehalem, it would run like a dog too. Actually, dogs aren't that slow, as natural as that expression sounds. How about a sloth?