You're the troll here if you think that, even in single precision, the 670 should be 4-5 times faster than a 560 Ti just because of the difference in core count, especially since the cores aren't even comparable. Alright then, I'll do the math for you, since you haven't done it yourself and don't realize that what I've said is correct.
GTX 560 Ti reference specifications for the GPU cores and memory
GPU
Core count- 384
Frequency- 822MHz GPU/1644MHz core frequency
memory
interface- 256 bit GDDR5
frequency- 1002MHz (4008MHz effective)
GTX 670 reference specifications for the GPU cores and memory
GPU
Core count- 1344
Frequency- 915MHz GPU frequency and core frequency + small Turbo
memory
interface- 256 bit GDDR5
frequency- 1502MHz (6008MHz effective)
First off, Fermi's cores run at twice the GPU frequency (1644MHz against 822MHz on the 560 Ti), while Kepler's cores run at the GPU frequency itself. So each core in a Fermi card is worth approximately two Kepler cores at the same GPU frequency. To get a more accurate comparison against a Fermi GPU, treat the GTX 670 as effectively having half of its listed core count.
So, it's more like comparing 384 cores at 822MHz to 672 cores at 915MHz.
Simple math gets us 384 times 822 = 315648 and 672 times 915 = 614880, a ratio of about 1.95. So, at best, the GTX 670 could only be almost twice as fast as the GTX 560 Ti in single-precision math (it doesn't even come close to that in double precision) unless other changes were made to do something about that. However, sometimes simple math is too simple. Core count increases do not scale perfectly (they scale worse and worse as the core count increases), so the GTX 670 can't even be twice as fast as the GTX 560 Ti. Heck, GTX 560 Ti SLI is almost on par with the GTX 670 on average.
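For anyone who wants to check the arithmetic, here is a minimal sketch of it in Python. The core counts and clocks are the reference figures listed above; the function name is made up for illustration, and "cores times clock" is only a crude proxy for peak single-precision rate, not a benchmark.

```python
# Reference figures quoted in the spec lists above (Turbo ignored).
FERMI_CORES, FERMI_GPU_MHZ = 384, 822      # GTX 560 Ti
KEPLER_CORES, KEPLER_GPU_MHZ = 1344, 915   # GTX 670


def core_clock_product(cores: int, mhz: int) -> int:
    """Cores times clock: a crude proxy for peak single-precision rate."""
    return cores * mhz


# Halve the Kepler core count so both cards are compared at GPU frequency.
kepler_equivalent_cores = KEPLER_CORES // 2

gtx_560_ti = core_clock_product(FERMI_CORES, FERMI_GPU_MHZ)
gtx_670 = core_clock_product(kepler_equivalent_cores, KEPLER_GPU_MHZ)
ratio = gtx_670 / gtx_560_ti

print(f"GTX 560 Ti: {gtx_560_ti}")
print(f"GTX 670:    {gtx_670}")
print(f"ratio:      {ratio:.2f}x")
```

Using the full 1344 cores against the 560 Ti's 1644MHz core frequency gives the same ratio, since both sides are simply doubled.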
EDIT: I forgot to add that although GTX 560 Ti SLI is about on par with a single GTX 670, keep in mind that GTX 560 Tis don't have the best SLI scaling, although it is fairly good. I don't remember the exact average in games, but I think it's somewhere between 75% and 85%, kinda close to (I think slightly ahead of) the scaling of the somewhat improved VLIW5 GPUs in Radeon 6800 cards. Point is that the 670 is not even twice as fast as the 560 Ti, and if I had to name causes for this, the two major ones that come to mind are the 670's memory bandwidth being low relative to the GPU's performance and the fact that core count increases do not scale as linearly as clock frequency increases. This has been studied quite well (Amdahl's law, I believe) and has a decent wiki article, for what that's worth.
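A rough sketch of both points in Python, under stated assumptions. The 75% to 85% SLI scaling range is the guess from this post, not a measurement. The effective GDDR5 data rates (4008 and 6008 MT/s) are reference-spec figures assumed here, and the helper names are made up for illustration.

```python
BUS_BITS = 256  # both cards use a 256-bit interface, per the spec lists


def sli_speedup(scaling: float) -> float:
    """Speedup of two cards over one, if the second card adds `scaling`."""
    return 1.0 + scaling


def bandwidth_gb_s(effective_mt_s: float, bus_bits: int = BUS_BITS) -> float:
    """Peak memory bandwidth in GB/s from the effective data rate in MT/s."""
    return effective_mt_s * 1e6 * (bus_bits / 8) / 1e9


# If 560 Ti SLI is about level with one 670, the 670 lands in this range.
sli_low, sli_high = sli_speedup(0.75), sli_speedup(0.85)

# ASSUMPTION: effective data rates from the reference specs.
bw_560_ti = bandwidth_gb_s(4008)
bw_670 = bandwidth_gb_s(6008)
bandwidth_ratio = bw_670 / bw_560_ti

# Cores-times-clock ratio from the simple math earlier in the post.
compute_ratio = (672 * 915) / (384 * 822)

# Bandwidth available per unit of compute, 670 relative to 560 Ti.
bandwidth_per_compute = bandwidth_ratio / compute_ratio

print(f"670 vs one 560 Ti, via SLI: {sli_low:.2f}x to {sli_high:.2f}x")
print(f"bandwidth ratio: {bandwidth_ratio:.2f}x, compute ratio: {compute_ratio:.2f}x")
print(f"bandwidth per unit of compute: {bandwidth_per_compute:.2f}x")
```

On these figures the 670 has about 1.5 times the bandwidth feeding almost 2 times the compute, which is what I mean by its bandwidth being low for its performance.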
Furthermore, just because CUDA supports single-precision math doesn't mean that a program that uses CUDA is using single-precision math. Sometimes single precision just isn't good enough, and a CUDA-accelerated program must use double-precision math (or better, but that's not relevant to this conversation). In such workloads, the GTX 670 and 680 won't be able to beat a GTX 560 Ti like they do in single-precision math. Funny that someone who works with CUDA doesn't know something so simple. Beyond that, I never said that CUDA only has double-precision tools anyway. Heck, you didn't even specify whether your application uses double or single precision, but since the GTX 670 has the higher single-precision throughput, my guess is that it's double precision if a GTX 560 Ti is beating the GTX 670.
Regardless, even in single precision, there's no reasonable way for the GTX 670 to be much more than about double the performance of a GTX 560 Ti, and I'm including the GTX 670's Turbo in that number.
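To put rough numbers on the precision point, here is a back-of-the-envelope sketch in Python. Big assumption, not from anything stated above: double precision runs at 1/12 of the single-precision rate on the GTX 560 Ti and at 1/24 on the GTX 670, which are the ratios commonly cited for these two chips. It also assumes 2 FLOPs per core per clock and ignores Turbo, so treat the output as theoretical peaks only.

```python
# (cores, core frequency in MHz) from the spec lists above; Turbo ignored.
SPECS = {"GTX 560 Ti": (384, 1644), "GTX 670": (1344, 915)}

# ASSUMPTION (not from the post): commonly cited FP64-to-FP32 rate ratios.
FP64_FRACTION = {"GTX 560 Ti": 1 / 12, "GTX 670": 1 / 24}


def peak_gflops(card: str, double_precision: bool = False) -> float:
    """Theoretical peak GFLOPS, assuming 2 FLOPs per core per clock."""
    cores, mhz = SPECS[card]
    single = cores * mhz * 2 / 1000.0
    return single * FP64_FRACTION[card] if double_precision else single


sp_ratio = peak_gflops("GTX 670") / peak_gflops("GTX 560 Ti")
dp_ratio = peak_gflops("GTX 670", True) / peak_gflops("GTX 560 Ti", True)

for card in SPECS:
    sp, dp = peak_gflops(card), peak_gflops(card, True)
    print(f"{card}: {sp:.0f} GFLOPS single, {dp:.0f} GFLOPS double")
print(f"670 / 560 Ti: {sp_ratio:.2f}x single, {dp_ratio:.2f}x double")
```

Under those assumptions the two cards land within a few percent of each other on paper in double precision, so the 670's near-2x single-precision lead is gone before memory bandwidth or scaling even enter into it.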