ojas :
What I have to say is that the HT you're looking at is based on an OoO pipeline, while the original Bonnell/Saltwell pipeline was in-order.
The general principles behind HT/SMT remain the same regardless of the presence of OoOE or superscalar execution: provide one or more alternate independent instruction stream(s) to help fill execution slots when execution is about to stall from lack of eligible instructions.
The main problem with OoOE is that its complexity increases almost exponentially with look-ahead depth, while the chances of finding extra instructions to squeeze in decrease: valuable and expensive resources get tied up behind conditional branches, mispredicts, speculative execution, cache misses, memory fetches, etc., none of which OoOE can do anything about efficiently. As a result, the cost vs. benefit of pursuing deeper OoOE keeps getting worse as you attempt to extract more ILP per thread.
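To put a rough shape on the diminishing returns, here is a toy model in Python. The assumption is mine and purely illustrative: the instruction sitting k entries ahead in the window is eligible with probability `p_useful**k`, since every branch, miss or dependency in front of it has to be out of the way first. The function names and the 0.8 figure are made up for the sketch, not measurements of any real core.

```python
def expected_extra_instructions(depth, p_useful=0.8):
    """Expected number of eligible instructions found within `depth` entries.

    Illustrative assumption: the entry at distance k is eligible with
    probability p_useful**k, so the total is a truncated geometric series.
    """
    return sum(p_useful ** k for k in range(1, depth + 1))


def marginal_gain(depth, p_useful=0.8):
    """Extra expected instructions gained by growing the window from depth-1 to depth."""
    return (expected_extra_instructions(depth, p_useful)
            - expected_extra_instructions(depth - 1, p_useful))


if __name__ == "__main__":
    for d in (1, 2, 4, 8, 16, 32, 64):
        print(d, expected_extra_instructions(d), marginal_gain(d))
```

Under that assumption the total can never exceed `p_useful / (1 - p_useful)` no matter how deep the window gets, while each extra entry still costs hardware. That is the cost vs. benefit squeeze in a nutshell.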
With SMT, on the other hand, the costs are almost directly proportional to the hardware thread count, and the chances of finding instructions to fill ports with also increase roughly linearly with the number of active threads loaded on each core.
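A quick way to see the slot-filling effect is a toy simulation. Everything in it is an assumption for illustration only: a hypothetical core with 4 issue ports, each hardware thread stalled 40% of the time, and offering 1 or 2 ready instructions per cycle otherwise. It models no actual CPU.

```python
import random


def port_utilization(threads, ports=4, stall_prob=0.4, cycles=10_000, seed=0):
    """Toy model: fraction of issue slots filled, averaged over `cycles` cycles."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same result
    issued = 0
    for _ in range(cycles):
        ready = 0
        for _ in range(threads):
            # A stalled thread offers nothing this cycle;
            # otherwise it offers 1 or 2 ready instructions (assumed).
            if rng.random() >= stall_prob:
                ready += rng.randint(1, 2)
        # The core cannot issue more instructions than it has ports.
        issued += min(ready, ports)
    return issued / (ports * cycles)


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, "thread(s):", round(port_utilization(n), 3))
```

In this model utilization climbs with the thread count until the ports saturate, which is the whole pitch for SMT: each added thread is another independent chance of having something ready to issue.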
SMT and OoOE are not mutually exclusive. They are synergistic: SMT drastically reduces the depth/complexity of OoOE required to keep most ports busy on every clock tick under sufficiently threaded workloads (which may span multiple applications), and OoOE increases the likelihood of finding something to do in each individual hardware thread. Both contribute significantly to extracting the most work out of the least surface area and power, which would be a significant advantage on platforms with a limited power supply if pervasive threading became more common.
Another nice benefit of HT on a limited power budget is that you do not need to rely as much on branch prediction, speculative execution and various other tricks that are expensive in both power and area. You can simply execute instructions from other threads and hope dependencies will start resolving themselves before every thread gets stuck, which means less wasted work and power. That was the goal behind the original Atom, but reality disagreed with Intel's vision of how good an idea that was.
HT becomes more effective than increased OoOE once you have enough execution resources to actually start worrying about OoOE failing to keep them busy most of the time. That's where the OoOE and its support structures start ballooning out of proportion, like they do on mainstream desktop/laptop CPUs. If you look at chips designed for massive parallelism (GPUs/GPGPUs, Xeon Phi, SPARC T5, etc.), they tend to have much more primitive OoOE (sometimes none whatsoever) than chips designed with somewhat of an obsession for single-threaded performance, like Haswell.
The optimal balance between multi-core, OoOE and HT/SMT is all about the availability of threaded system workload (not necessarily all from a single game/application) or the lack thereof. The balance just happens to still be weighted heavily in favor of single-threaded performance.
The main problem with the old Atom: nobody cares how much more efficient an SMT design can be if it thoroughly sucks at extremely common single-threaded tasks like parsing HTML and calculating layouts to render web pages. Not having OoOE left the old Atom severely crippled in that department, which is clearly not acceptable if Intel wants to gain market share in mobile devices.
Give it maybe two years. Atom will likely grow to quad-port execution, and HT will come back to help keep those ports full without expanding the OoOE circuitry too much or sacrificing the all-important single-threaded performance.