To answer your question precisely, it was partly because branch prediction was not very good back then, and partly because very long pipelines were used in the expectation that clock speeds would keep ramping ever higher.
Northwood had a 20-stage pipeline and was expected to scale to 4GHz. Prescott had a 31-stage pipeline and was expected to cover 4-7GHz, while Tejas would've had a 40-50 stage pipeline and was targeted at 7-10GHz. Long pipelines are supposed to help improve clock speeds because each stage does less work--there are more, but smaller, steps. This is why Prescott was such a slow and hot disappointment: it never even reached the speed Northwood was predicted to top out at, so the longer pipeline made it perform worse than Northwood despite the doubled cache.
The problem with long pipelines is that if anything goes wrong, the entire pipeline must be flushed. Performance tanks because you have to start all over, wasting all of the time that went into filling that pipeline, and all of the power used to fill it is wasted too.
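If you want to see why the depth matters so much, here's a rough back-of-the-envelope model (all numbers are illustrative guesses, not measured figures for any real chip): every misprediction costs roughly a full pipeline's worth of cycles to refill, so the deeper the pipe, the more each wrong guess hurts.

```python
# Rough model of misprediction cost: each wrong guess costs about one
# pipeline's worth of cycles to refill.  Illustrative numbers only.

def effective_ipc(base_ipc, branch_fraction, miss_rate, pipeline_depth):
    """Average instructions per cycle once flush penalties are charged."""
    base_cpi = 1.0 / base_ipc
    # Extra cycles per instruction spent refilling the pipeline after misses
    flush_penalty = branch_fraction * miss_rate * pipeline_depth
    return 1.0 / (base_cpi + flush_penalty)

for depth in (20, 31):                       # Northwood-ish vs Prescott-ish depth
    ipc = effective_ipc(base_ipc=1.0,        # assume 1 instruction/cycle when all goes well
                        branch_fraction=0.2, # ~1 in 5 instructions is a branch (rule of thumb)
                        miss_rate=0.05,      # 5% of branches guessed wrong
                        pipeline_depth=depth)
    print(f"{depth}-stage pipeline: ~{ipc:.2f} instructions/cycle")
```

With those made-up numbers the 31-stage pipeline loses roughly 10% of its throughput compared to the 20-stage one from flushes alone, before you even get to the extra power.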
Speculative execution is a way to increase performance over in-order processing in much the same way as caching--you start working based on a prediction that the result or data will be needed shortly, and discard it if it wasn't. The problem is that predicting what will be needed in the cache, or which calculation result will be needed, is not easy, and the accuracy of early branch prediction (which tries to statistically predict which of two possible branches will be taken) was not good.
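For the curious, the classic "statistical" scheme being described is something like a 2-bit saturating counter per branch. Here's a minimal toy sketch of the idea (real predictors are far more elaborate than this):

```python
# Toy 2-bit saturating counter predictor: predict from recent history,
# and don't flip the prediction on a single odd outcome.

class TwoBitPredictor:
    # States 0 and 1 predict "not taken"; states 2 and 3 predict "taken".
    def __init__(self):
        self.state = 2  # start out weakly predicting "taken"

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Nudge the counter toward what actually happened, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A loop-closing branch: taken 9 times, then not taken once, repeated.
history = ([True] * 9 + [False]) * 10
predictor, hits = TwoBitPredictor(), 0
for taken in history:
    if predictor.predict() == taken:
        hits += 1
    predictor.update(taken)
print(f"predicted {hits}/{len(history)} branches correctly ({hits / len(history):.0%})")
```

Even this trivial scheme gets a steady loop branch right about 90% of the time, which tells you how hard the remaining few percent are to win.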
Well, if you do not care about power consumption, there is a perfectly obvious way to make up for a poor branch predictor--you simply execute both branches and discard the one that isn't needed, essentially doubling your power consumption or halving your efficiency. But the result will be ready in time, without any stalls from having guessed wrong.
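In software terms that trick looks something like computing both sides unconditionally and then selecting one. This is a sketch of the idea only, not how any particular CPU actually wires it up:

```python
# Software analogue of "execute both sides and throw one away":
# do both computations, then select with the condition.

def branchy(x):
    # Only one side is ever evaluated, but you must know the condition first.
    if x % 2 == 0:
        return x // 2
    return 3 * x + 1

def both_sides(x):
    even_result = x // 2    # do the "even" work regardless
    odd_result = 3 * x + 1  # do the "odd" work regardless
    # Discard whichever result turned out not to be needed.
    return even_result if x % 2 == 0 else odd_result

assert all(branchy(n) == both_sides(n) for n in range(1, 100))
```

You pay for both computations every time, but you never sit waiting to find out which one you actually needed.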
The problem with logic is that sometimes a branch cannot continue until the result of something else is known. There is no way to delay the progress of something partway through a pipeline (that is, there is no special parking area to pull it out of the pipeline until that dependency is available), so normally when this is encountered the pipeline gets flushed and you have to start over from the beginning. The Pentium 4 has a unique system that redirects such operations back to the beginning of the execution units in the pipeline, where they can loop over and over, wasting power, until the result is available and the branch can continue. That means the Pentium 4 can load its execution units to 100% while doing no work at all, just waiting for a result.
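A toy model of that replay behaviour might look like the sketch below--purely illustrative, with made-up latency numbers, since the real mechanism is undocumented:

```python
# Toy illustration of a dependent uop re-entering the execution units
# every few cycles until its input (say, a cache miss) finally arrives.
# The port looks 100% busy the whole time, yet nothing useful retires.

CACHE_MISS_LATENCY = 20  # cycles until the needed data arrives (made-up number)
REPLAY_LOOP_LENGTH = 7   # cycles for the uop to circle back to the execution units (made-up number)

cycle, executions = 0, 0
while cycle < CACHE_MISS_LATENCY:
    executions += 1              # the uop occupies an execution port,
    cycle += REPLAY_LOOP_LENGTH  # finds its input still missing, and loops around again
print(f"the uop executed {executions} times before its data arrived, "
      f"keeping an execution port busy for ~{cycle} cycles of no useful work")
```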
NetBurst was the first Intel architecture that went undocumented (to stymie competitor AMD), so we only know this from the Russians over at iXBT Labs.
As an aside, both speculative execution and caching are very power-consuming ways to improve performance, so the original Atom processor went back to in-order execution with very little cache (like a Pentium 1) to save power at the expense of performance. That's why those Atoms weren't affected by Spectre/Meltdown, which exploit the speculative execution and branch prediction methods used by the CPU manufacturers. By the time of Core 2, the branch predictor had improved enough to guess correctly an average of 96% of the time.
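To put that 96% in perspective, here's the same rough flush-cost arithmetic as the earlier sketch, assuming a hypothetical 31-stage pipeline and illustrative numbers only:

```python
# Same back-of-the-envelope flush-cost model as above, comparing a 90%
# and a 96% accurate predictor on a deep pipeline.  Illustrative only.

depth, branch_fraction = 31, 0.2   # deep pipeline, ~1 in 5 instructions is a branch
for accuracy in (0.90, 0.96):
    flush_penalty = branch_fraction * (1.0 - accuracy) * depth  # extra cycles per instruction
    ipc = 1.0 / (1.0 + flush_penalty)                           # assume 1 IPC when predictions hit
    print(f"{accuracy:.0%} accurate predictor: ~{ipc:.2f} instructions/cycle")
```

A few percentage points of accuracy buys back a large chunk of throughput when every miss costs a whole pipeline refill.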
Pentium 4 also ran at full speed all of the time, plus Windows 9x didn't even issue a halt instruction when idle, so you'd expect it to idle hot. The exception was the Pentium 4-M for laptops, which had a low default multiplier and would raise it under periods of high demand for performance. Core 2 is the same way--the default multiplier was 6x and it could increase to the maximum rated multiplier as needed, kind of like the Turbo system in today's CPUs (which is very different from the Turbo Button of yore, which could lock the CPU at 4.77MHz).