You are focusing too much on the software aspect and not the hardware implementation. Sure, you get twice as much data per clock, but you also need twice as many transistors to accomplish it! For example (oversimplified to two vs. four registers),
to evaluate (1+2) * (4-3)
two registers (x86), times two pipelines running in parallel, separated by '|': r1 = register 1, r2 = register 2
r1 = 1 | r1 = 4
r2 = 2 | r2 = 3
r2 =r2+r1 | r1=r1-r2
(idle) | [mem location] = r1
r1=[mem location] | (idle)
r1 = r1 * r2 | (idle)
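A hypothetical Python sketch of the dual-pipeline schedule above (invented names, not real x86 semantics), with the subtraction oriented as r1 = r1 - r2 so the difference comes out as 4 - 3 = 1. It steps both pipelines in lockstep and counts the idle slots:

```python
# One instruction list per pipeline; None marks an (idle) slot.
# Each entry is (destination, function of registers and shared memory).
pipe_a = [
    ("r1", lambda r, m: 1),
    ("r2", lambda r, m: 2),
    ("r2", lambda r, m: r["r1"] + r["r2"]),
    None,                                   # (idle)
    ("r1", lambda r, m: m["slot"]),         # r1 = [mem location]
    ("r1", lambda r, m: r["r1"] * r["r2"]),
]
pipe_b = [
    ("r1", lambda r, m: 4),
    ("r2", lambda r, m: 3),
    ("r1", lambda r, m: r["r1"] - r["r2"]),
    ("mem", lambda r, m: r["r1"]),          # [mem location] = r1
    None,                                   # (idle)
    None,                                   # (idle)
]

def run(pipes):
    regs = [dict() for _ in pipes]   # private registers per pipeline
    mem = {}                         # the shared memory cell
    idle = 0
    for step in range(max(len(p) for p in pipes)):
        for p, r in zip(pipes, regs):
            op = p[step]
            if op is None:
                idle += 1
            elif op[0] == "mem":
                mem["slot"] = op[1](r, mem)
            else:
                r[op[0]] = op[1](r, mem)
    return regs[0]["r1"], idle

result, idle_slots = run([pipe_a, pipe_b])
print(result, idle_slots)   # result is (1+2)*(4-3) = 3, with 3 idle slots
```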
now, with 4 registers and one pipeline:
r1 = 1
r2 = 2
r3 = 4
r4 = 3
r1 = r1 + r2
r3 = r3 - r4
r3 = r1 * r3
this just as easily could have been done with three registers, though:
r1 = 1
r2 = 2
r1 = r1 + r2
r2 = 4
r3 = 3
r2 = r2 - r3
r1 = r1 * r2
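The two sequential schedules can be checked directly in Python (operand order arranged so each subtraction yields 4 - 3 = 1):

```python
def four_regs():
    r1, r2, r3, r4 = 1, 2, 4, 3
    r1 = r1 + r2      # 3
    r3 = r3 - r4      # 1
    r3 = r1 * r3      # 3
    return r3

def three_regs():
    r1 = 1
    r2 = 2
    r1 = r1 + r2      # 3
    r2 = 4
    r3 = 3
    r2 = r2 - r3      # 1
    r1 = r1 * r2      # 3
    return r1

print(four_regs(), three_regs())  # both give (1+2)*(4-3) = 3
```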
In the first two forms the same bandwidth is used, but the first is done in 6 steps while the latter two take 7. The last one, however, uses only three registers, so relatively speaking it really costs only (3/4) * 7 = 5.25 steps. Meanwhile, the first has 3 idle slots (1.5 clocks, if you count each thread as half a clock), so it effectively uses only 6 - 1.5 = 4.5 clocks (the scheduler will make sure those idle slots are used by other running programs). In this example, then, it is actually more efficient to use the old architecture!

Of course, this is very simplified and leans heavily on multithreading. The real question is whether it is more efficient in PRACTICE. Remember, in order to double the size of the architecture, you need to double the architecture itself! Saying that x64 is twice as large as x86, while true, is like saying there is a new pizza twice the size of the original: true, but they do not tell you that it costs twice as much. So if you have the same amount of money and are ordering a lot of pizza, you get the same amount of pizza either way. The real question is whether there is more crust (waste) ordering 10 new pizzas or 20 old ones, and that is very hard to say...
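Spelling out the paragraph's "relative cost" arithmetic (the scaling by register count and the half-clock-per-thread accounting are the post's own rough conventions, not a standard metric):

```python
steps_parallel = 6      # dual-pipeline schedule, 2 registers per pipe
steps_3reg = 7          # three-register sequential schedule

# Three registers instead of four: scale the step count by 3/4.
cost_3reg = (3 / 4) * steps_3reg          # 5.25

# The parallel version has 3 idle slots; at half a clock per thread-slot,
# that's 1.5 clocks the scheduler can hand to other running programs.
cost_parallel = steps_parallel - 3 * 0.5  # 4.5

print(cost_3reg, cost_parallel)
```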
mi1ez :
I think radnor had a pretty good summary of why it can potentially be faster.
A true 64-bit application can shift around and process up to twice the amount of data per clock (the aforementioned "processor word"), so something like Photoshop could, for example, send 2 pixels to be processed for every one in 32-bit.
This is an ideal situation and in real life this doesn't happen, but that's the theory.
The fact that you can address up to (something like) 128 GB of RAM in 64-bit, as opposed to the 32-bit limitation of approx. 3.2 GB, can also help speed up certain apps in certain situations (again, this will be things like HUGE HUGE Photoshop images).
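A rough sketch of the "two pixels per operation" idea: pack two 32-bit pixel values into one 64-bit word and adjust both with a single integer addition. This assumes neither 32-bit lane overflows, so no carry crosses the lane boundary; the pixel values and helper names are made up for illustration, and real 64-bit/SIMD code handles carries properly.

```python
MASK32 = 0xFFFFFFFF

def pack(hi, lo):
    # Two 32-bit lanes in one 64-bit word.
    return ((hi & MASK32) << 32) | (lo & MASK32)

def unpack(word):
    return (word >> 32) & MASK32, word & MASK32

pixel_a, pixel_b = 0x00101010, 0x00202020   # two dim gray pixels
word = pack(pixel_a, pixel_b)

# One 64-bit add brightens both pixels at once (lanes must not overflow).
word += pack(0x00050505, 0x00050505)

print([hex(p) for p in unpack(word)])  # ['0x151515', '0x252525']
```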