The last time Intel stretched the pipeline in pursuit of clocks, we got NetBurst, and that did not end well: the first-generation P4 got destroyed by the P3 in most benchmarks despite a 500-1000MHz clock advantage, and the P4 got its ass handed to it again when Core 2 beat the 3+GHz P4 while running at a 500-1000MHz handicap of its own. The last time AMD tried a deep pipeline in pursuit of high clocks, we got Faildozer, which got its ass handed to it by an i3 even when OC'd to 5GHz. Trading IPC for higher clocks via a deeper pipeline rarely works.
It makes no sense to lengthen the pipeline when the IPC penalty from higher execution latency (you can't pack dependent operations as tightly when results take one or two extra clocks to become available) exceeds the clock gain. Deeper pipelines also waste more power and silicon on clock distribution and on clocking data through extra latches. Splitting the pipeline into smaller stages means more of the total cycle time goes to setup-and-hold overhead between data latches instead of useful work.
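That setup-and-hold argument can be sketched with a toy model: total logic delay is roughly fixed, and every extra stage adds a fixed latch/skew overhead to the cycle, so frequency gains taper off as you slice the pipeline finer. The delay numbers below are purely illustrative assumptions, not figures for any real core.

```python
# Back-of-envelope model (assumed numbers, not any real chip's):
# cycle time = slice of fixed logic delay + fixed latch overhead per stage.

LOGIC_DELAY_PS = 4000   # hypothetical total combinational delay of the core
LATCH_OVERHEAD_PS = 50  # hypothetical setup/hold + clock-skew cost per stage

def max_freq_ghz(stages: int) -> float:
    """Highest clock the pipeline supports for a given stage count."""
    cycle_ps = LOGIC_DELAY_PS / stages + LATCH_OVERHEAD_PS
    return 1000.0 / cycle_ps  # convert ps cycle time to GHz

for stages in (14, 16, 20, 31):
    print(f"{stages} stages -> {max_freq_ghz(stages):.2f} GHz")
```

Doubling the stage count here buys well under double the frequency, because the latch overhead is an ever-larger share of each shorter cycle.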
Intel has been refining the Core architecture for 14 years; I'm pretty sure Willow Cove sits on a knife's edge between IPC and clocks, where adding an extra pipeline stage is practically guaranteed to do more harm than good.
The thing is, NetBurst didn't have a bunch of other stuff in place to help it remain viable. In 2000, when NetBurst came out, it went from the Pentium 3's 12-stage pipeline to a 20-stage pipeline ... and Prescott jumped the shark with 31 stages. The lengthy pipeline allowed for higher clockspeeds, which at 130 and 90 nm ran too hot to handle. At 10nm, 5GHz is a different story.
Plus, if you look at pipeline length, early NetBurst isn't necessarily that much longer than modern Skylake. There are all sorts of tradeoffs to make, and we don't know yet precisely what Intel is going to do. But two to four extra stages combined with better branch prediction isn't a huge deal. 10-15 extra stages? Yeah, that usually leads quickly to bad things.
Modern pipelines are usually around 15 stages, and Intel has been in the 14-16 range since Nehalem. But Ice Lake is already apparently 14-20 stages, depending on which pipeline and instruction are being executed. There's a lot of wiggle room, depending on what else is done, and no universal answer as to what's best.
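Here's why a few extra stages are tolerable while 10-15 extra aren't: the main cost of depth is that every mispredicted branch flushes roughly a pipeline's worth of work. A rough cost model, with the branch frequency and misprediction rate being my assumed round numbers, not measurements:

```python
# Rough misprediction cost model (assumed rates, not measured ones).
BRANCH_FREQ = 0.20      # assume ~1 in 5 instructions is a branch
MISPREDICT_RATE = 0.05  # assume the predictor misses 5% of branches

def flush_cpi_penalty(stages: int) -> float:
    """Extra cycles per instruction lost refilling the pipe after flushes."""
    return BRANCH_FREQ * MISPREDICT_RATE * stages

for stages in (14, 16, 20, 31):
    print(f"{stages} stages -> {flush_cpi_penalty(stages):.2f} CPI penalty")
```

Under these assumptions, going from 14 to 20 stages costs only a few hundredths of a cycle per instruction, and a better predictor (lower miss rate) shrinks it further; a Prescott-style 31 stages roughly doubles the damage.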
If four extra pipeline stages allow clocks to be 20-25% higher and only cause a 5% loss in performance due to branch mispredictions, it could be a net win. Power and efficiency also come into play, naturally. Bulldozer had many other issues that caused problems beyond the long pipeline -- like the unusual "2 partial cores" approach, and a lot of "edge cases" that ended up being more like the typical case and tanked performance.
Keep in mind, Willamette only had 42 million transistors -- you could make the equivalent of a quad-core variant for 168 million, or 336 million for 8-core. Northwood was 55 million (~220 million equivalent for 4-core, 440 million for 8-core). Even Prescott with its 31-stage pipeline was only 125 million, and a big chunk of that went to the L2 cache at the time. So maybe 1 billion transistors for an 8-core with the equivalent of 8MB of L2.
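The multi-core equivalents above are straight multiplication of the per-die counts quoted in the text (in millions of transistors):

```python
# NetBurst per-die transistor counts (millions), scaled to hypothetical
# multi-core chips by simple multiplication, as in the text above.
netburst = {"Willamette": 42, "Northwood": 55, "Prescott": 125}

for name, xtors in netburst.items():
    print(f"{name}: 4-core ~ {xtors * 4}M, 8-core ~ {xtors * 8}M")
# Prescott x8 lands right at 1 billion transistors, L2 cache included.
```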
With Comet Lake, Intel is already sitting at around ... well, it's not saying, but 2-core + GT2 Skylake was 1.75 billion, so probably at least 3-4 billion for the full 10-core chip seems likely. (SKL-X is 8.33 billion for 18-core with no GPU, for example. And a big chunk is the L3 cache, naturally.) Point being, with a budget of at least 3-4 times as many transistors per core, a lot can be done that makes a change in pipeline length not entirely out of the question.
Anyway, I'm not saying Intel is or even should have a longer pipeline than Skylake, but until it does a deep dive on Willow Cove or Golden Cove or whichever cove is inside Rocket Lake, we won't know what has changed.
Pipeline length is a lot like execution width. We can't really go much wider on designs -- 6-wide fetch and dispatch is already so wide that most of the execution slots often end up unused. I mean, what's the actual IPC for any given program on a modern AMD or Intel CPU? I've heard it ranges from about 0.7 to 2.0, with an average of maybe 1.4. That's out of a theoretical IPC of 6. If Intel went with an 8-wide design, it would probably only improve average IPC from 1.4 to 1.45 or something minuscule. And yet, we have to do something to get faster chips.
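To put that "minuscule" in perspective, here's the relative speedup implied by the 1.4-to-1.45 guess above (both figures are the estimates from the text, not measurements):

```python
# Marginal gain from a hypothetical 8-wide design, using the guessed
# average-IPC figures from the paragraph above.
ipc_6wide = 1.40
ipc_8wide = 1.45

gain = ipc_8wide / ipc_6wide - 1
print(f"speedup from going 8-wide: {gain:.1%}")
```

A ~33% increase in machine width for a low-single-digit percent average speedup is exactly why widening alone isn't the answer.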