No it does not; it is the total number of dispatch ports that determines how many instructions go to the execution units. A 3-issue core has a theoretical max of 3, and simple random dependencies dictate that a processor will never achieve this over time. The time-averaged IPC will always be less than the total dispatch capability of the processor unless special 'fusion' tricks are used.
You should read some of those links I post.
Here is another one.... this one looks at how well OoO execution efficiency changes IPC as a function of window size; what you will notice is that the actual IPC never exceeds the total issue width of the processor.
http://courses.ece.uiuc.edu/ece512/Papers/Michaud.2001.HPCA.pdf
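You can see the dependency-limit argument in a toy simulation (this is my own sketch, not the paper's model; every name and number in it is made up for illustration). Each instruction depends on one random earlier instruction with 1-cycle latency, and up to `width` instructions issue per cycle:

```python
import random

def simulate_ipc(n_instr=10_000, width=3, seed=42):
    """Toy in-order-issue model: each instruction depends on one random
    earlier instruction (1-cycle latency); each cycle we issue up to
    `width` instructions, stalling when a dependency isn't done yet."""
    rng = random.Random(seed)
    finish = [0] * n_instr          # cycle in which instruction i completes
    cycle, slots_used = 1, 0
    for i in range(n_instr):
        dep = rng.randrange(i) if i > 0 else -1
        earliest = max(cycle, (finish[dep] + 1) if dep >= 0 else 1)
        if earliest > cycle:        # dependency not ready: stall to that cycle
            cycle, slots_used = earliest, 0
        finish[i] = cycle
        slots_used += 1
        if slots_used == width:     # this cycle's issue slots are exhausted
            cycle, slots_used = cycle + 1, 0
    return n_instr / finish[-1]
```

Run it and the measured IPC sits below the issue width of 3, never above it, exactly the ceiling being argued here.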
Since I'm not working nights anymore, I will read and expound upon your purported proof that AMD can't get more "IPC", meaning successfully executed and retired instructions per second (per cycle is usually a relative term dependent upon instruction length, complexity, and code efficiency).
As noted in Part 2, the Alpha 21264 had to re-issue instructions in some cases, and it would still DESTROY x86 at half the MHz.
Also, if you refer to the two different approaches Intel's and AMD's patents took for the VAUNTED RHT, it shows that efficiency is the key (just like with the Carnot cycle). Theoretical analog tests can sometimes mimic actual digital OS response curves, but once checks and error handling are applied, the actual test algorithm matters as much as the results.
Section 3 describes the problem overcome by Core 2's load-before-store: a microcode-based algorithm that uses "probably" 2-4B tags for ordering in the instruction window (damn that RHT).
By expanding the fetch size to 32B (Barcelona), AMD can add extra instructions to both data and instruction fetches so that more instructions can be loaded before non-conflicting stores. Even Intel can't load a value dependent upon a store (unless they use an advanced OS cache that can update the OS stack without going through main memory), which will enable the retire mechanism to operate more times per cycle on average.
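The "loads before non-conflicting stores" idea is memory disambiguation: a load may hop over earlier stores only when it provably touches a different address. A minimal sketch (the op representation and function name are mine, invented for illustration, not any actual microcode):

```python
def hoist_loads(ops):
    """Toy memory-disambiguation pass. Each op is ("load"|"store", addr).
    A load moves ahead of earlier stores whose addresses provably differ;
    a store to the same address blocks the hoist (the load must wait or
    forward from the store)."""
    result = []
    for kind, addr in ops:
        if kind == "load":
            pos = len(result)
            # walk back over stores to *different* addresses only
            while (pos > 0 and result[pos - 1][0] == "store"
                   and result[pos - 1][1] != addr):
                pos -= 1
            result.insert(pos, (kind, addr))
        else:
            result.append((kind, addr))
    return result
```

A load of C hops over stores to A and B, but a load of A stays put behind a store to A, which is the conflict check the whole scheme hinges on.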
Section 4 shows that FIFO is a key to execution: if the prescheduler sorts according to an efficient microcode algorithm, 3 FIFOs can actually produce 6 instructions.
By adjusting latency (~resistance in a complex RC series/parallel circuit) with varying capacitance or resistance values, it is possible to optimize total cycle usage and increase superscalar capability.
(sidebar)
Going faster than the speed of sound is impossible.
(endsidebar)
Section 5 describes how increasing registers and parallelization can also increase efficiency and alleviate latencies, resulting in increased IPC (2.2 is STILL less than 2.8).
Section 6 clearly shows a theoretical 14 IPC by varying line length and buffer size. This assumes the same ideal 8-issue core. If we assume a 97% prediction rate for L1 hits and 3-issue, while also assuming HT with two loads per cycle @ 128b provides an "unlimited" supply of data, a quick extrapolation gives approximately (14/8) = (x/3), or 5.25 IPC.
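That extrapolation is just proportional scaling, which itself rests on the (strong, unstated-in-the-paper) assumption that ideal IPC scales linearly with issue width:

```python
# Proportional extrapolation from the 8-issue ideal to a 3-issue core.
# The linear-scaling assumption is mine, not the paper's.
ideal_ipc, ideal_width = 14, 8   # Section 6's theoretical 8-issue result
target_width = 3
extrapolated = ideal_ipc / ideal_width * target_width   # 14/8 * 3 = 5.25
print(extrapolated)
```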
The last section (thank god) explains that associativity in caches was not considered, which allows for a theoretical application of a few percentage points to the HW tally.
How close is a THEORETICAL 5.25 to 6?
Wow, you really didn't read the same paper JumpingJack linked. Nowhere does it state that the number of instructions can exceed the number of decoders. The ONLY way to do this is to use fusion techniques: fusing TWO instructions into a single instruction which is then executed. This is what the Core 2 Duo calls Macro-Op Fusion.
Technically the processor is still only able to decode 4 per cycle, but BECAUSE one of those 4 slots can be comprised of TWO instructions fused together, the Core 2 Duo CAN at times reach 5 instructions per cycle.
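The fusion arithmetic can be sketched like this (a toy model, all names mine; I'm assuming the documented Core 2 behavior of fusing a CMP/TEST with the following conditional jump, at most one fusion per decode cycle):

```python
def decode_cycle(instrs, slots=4):
    """Toy macro-op fusion model: a cmp immediately followed by a jcc
    shares one decode slot (at most one such fusion per cycle), so 4
    slots can swallow up to 5 x86 instructions in a single cycle."""
    used = 0            # decode slots consumed this cycle
    i = 0               # x86 instructions consumed from the stream
    fused_already = False
    while used < slots and i < len(instrs):
        if (not fused_already and instrs[i] == "cmp"
                and i + 1 < len(instrs) and instrs[i + 1] == "jcc"):
            i += 2      # cmp+jcc fuse into one macro-op: one slot, two instrs
            fused_already = True
        else:
            i += 1      # ordinary instruction: one slot, one instruction
        used += 1
    return i            # x86 instructions decoded this cycle
```

With a fusible pair in the stream the cycle retires 5 instructions through 4 slots; without one, the ceiling stays at 4, which is exactly the point above.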