That 2-4X performance gain sounds respectable on paper. In reality, though, if the CPU could run 2X faster using properly vectorized SSE code, the performance difference would shrink substantially, and in some cases disappear entirely. Unfortunately, it is hard to determine how much performance x87 costs. Without access to the PhysX source code, we cannot run an apples-to-apples comparison that pits PhysX using x87 against PhysX using vectorized SSE. The closest alternative would be to compare the three leading physics packages (Havok from Intel, PhysX from Nvidia and the open-source Bullet) on a given problem, running on the CPU. Havok is almost certainly highly tuned for SSE vectors, given Intel’s internal resources and its emphasis on instruction set extensions like SSE and the upcoming AVX. Bullet is probably not quite as highly optimized as Havok, but it is available in source form, so a true x87 vs. vectorized SSE experiment is possible.
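To illustrate why vectorization matters, here is a minimal sketch of the kind of inner loop a physics engine runs constantly: a simple position update, pos += vel * dt. This is not PhysX, Havok or Bullet code; the function names are hypothetical. The scalar version processes one float per operation, which is all the x87 unit can do. The SSE version uses packed instructions that operate on four single-precision values held in a 128-bit register, so each multiply and add handles four elements at once. In practice, memory bandwidth, data layout and loop overhead keep the real speedup below the ideal 4X.

```c
#include <stddef.h>
#include <xmmintrin.h> /* SSE intrinsics: __m128, _mm_*_ps */

/* Hypothetical scalar update: one float per iteration. When compiled for
   32-bit x86 with x87 math (e.g. GCC's -m32 -mfpmath=387), each multiply
   and add here becomes a single-value x87 operation. */
void integrate_scalar(float *pos, const float *vel, float dt, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        pos[i] += vel[i] * dt;
}

/* Hypothetical packed-SSE update: four floats per multiply and add. */
void integrate_sse(float *pos, const float *vel, float dt, size_t n)
{
    const __m128 vdt = _mm_set1_ps(dt);   /* broadcast dt into all 4 lanes */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 p = _mm_loadu_ps(pos + i);  /* unaligned load of 4 positions */
        __m128 v = _mm_loadu_ps(vel + i);  /* unaligned load of 4 velocities */
        p = _mm_add_ps(p, _mm_mul_ps(v, vdt));
        _mm_storeu_ps(pos + i, p);         /* store 4 updated positions */
    }
    for (; i < n; ++i)                     /* scalar tail for leftover elements */
        pos[i] += vel[i] * dt;
}
```

Both functions produce the same results; the difference is how many elements each instruction processes, which is the gap that an x87-only code path leaves on the table.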
...A review at The Tech Report has already demonstrated that in some cases (e.g. Sacred II), PhysX uses only one of the several cores available in a multi-core processor.