This question is kind of a software question but I think the answer ultimately comes down to a CPU architecture question, so it seems like this is a good place to ask it.
I have written an arbitrary-precision multiplication function and have been using it on the Pentium IV for several years. It uses SSE2 instructions for some parts, and on the Pentium IV it is a little more than twice as fast as the best equivalent function I can write using only the general-purpose registers (GPRs). I recently bought a quad-core Core2 system, and I have noticed that the gap between the SSE2-based function and the GPR-based function on that machine is much smaller than a factor of two: the SSE2 function is only about 30% faster. I'm not comparing the Core2 to the P-IV directly, but rather the speedup of the SSE2 function over the GPR function on each machine.
The question is why this might be. Did Intel make the GPR instructions on the Core2 better/faster, or did they do something that resulted in a slow-down of the SSE2 instructions? Again, I'm not trying to compare the Core2 to the P-IV, but rather the SSE2/GPR speed on the Core2 relative to that ratio on the P-IV.
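For reference, the two kinds of inner loop being compared look roughly like the sketch below. This is only an illustration under assumptions, not the actual function: the names `mul_limb_gpr` and `mul_limb_sse2`, the 32-bit limb size, and the choice to multiply by a single limb are all hypothetical. The SSE2 variant uses `_mm_mul_epu32` (the PMULUDQ instruction) to form two 32x32->64 products at once and leaves carry propagation to scalar code; it falls back to the scalar loop where SSE2 is not available.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* GPR version: one 32x32->64 multiply per limb, carry held in a 64-bit value.
 * Writes n limbs of a*b to out and returns the final carry limb. */
static uint32_t mul_limb_gpr(uint32_t *out, const uint32_t *a, size_t n, uint32_t b)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)a[i] * b + carry;
        out[i] = (uint32_t)t;
        carry = t >> 32;
    }
    return (uint32_t)carry;
}

/* SSE2 version (hypothetical layout): two products per PMULUDQ,
 * then the same carry chain as above in scalar code. */
static uint32_t mul_limb_sse2(uint32_t *out, const uint32_t *a, size_t n, uint32_t b)
{
    uint64_t carry = 0;
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i vb = _mm_set1_epi32((int)b);
    for (; i + 2 <= n; i += 2) {
        /* PMULUDQ reads 32-bit lanes 0 and 2, so place the two limbs there. */
        __m128i va = _mm_set_epi32(0, (int)a[i + 1], 0, (int)a[i]);
        __m128i prod = _mm_mul_epu32(va, vb);
        uint64_t p[2];
        _mm_storeu_si128((__m128i *)p, prod);

        /* Each sum fits in 64 bits: (2^32-1)^2 + (2^32-1) < 2^64. */
        uint64_t t0 = p[0] + carry;
        out[i] = (uint32_t)t0;
        uint64_t t1 = p[1] + (t0 >> 32);
        out[i + 1] = (uint32_t)t1;
        carry = t1 >> 32;
    }
#endif
    /* Leftover limb, or the whole loop when SSE2 is unavailable. */
    for (; i < n; i++) {
        uint64_t t = (uint64_t)a[i] * b + carry;
        out[i] = (uint32_t)t;
        carry = t >> 32;
    }
    return (uint32_t)carry;
}
```

Both functions should produce identical limbs and the same returned carry for any input, so a timing comparison between them isolates the cost of the multiplies and the data movement between the SSE and general-purpose registers.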