It doesn't really matter. At best, the Athlon gets twice the P4's peak FP throughput, and that's comparing the Athlon's best case (one whole FP instruction per clock) against the P4's worst case (half an FP instruction per clock). A workload that fits entirely in cache is the best case for the P4. A workload that doesn't fit in cache is a semi-worst case for the Athlon, since you have no room left for prefetch and not enough room to hold the SETI workload. Now, while the Athlon's total cache is 384KB, keep in mind that 64KB of that is for instructions only. SETI workloads are data, which means they get 320KB at best, and even that is optimistic because you need to reserve a little room for other processes such as OS functions.
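To put the cache arithmetic in one place, here's a rough sketch in Python. The 384KB and 64KB figures are the ones quoted above; the `os_reserve_kb` parameter and the 32KB value passed to it are purely hypothetical, just to show how the usable space shrinks once anything else needs cache.

```python
# Cache sizes in KB, as quoted in the post above.
TOTAL_CACHE_KB = 384
L1_INSTRUCTION_KB = 64  # instructions only, so unusable for SETI data


def data_cache_budget_kb(os_reserve_kb=0):
    """Cache left for SETI data after the instruction cache and an
    assumed (hypothetical) share taken by the OS and other processes."""
    return TOTAL_CACHE_KB - L1_INSTRUCTION_KB - os_reserve_kb


print(data_cache_budget_kb())    # upper bound: 320
print(data_cache_budget_kb(32))  # with a made-up 32KB reserve: 288
```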
You have the P4's FP unit filled to the brim with FP instructions, which lets it sustain almost a full 1 FP instruction per clock (not counting SSE/SSE2), versus an Athlon that has to wait countless clocks for the data it needs to be fetched from memory. Even if the Athlon could achieve more than 1 FP instruction per clock (which is very rare in any x86 code), it'd still have all those idle cycles.
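Here's a back-of-the-envelope model of that argument, again in Python. It just scales peak FP throughput by the fraction of cycles not spent stalled on memory. The stall fractions (5% and 60%) and the function name are made up for illustration, not measured on either chip, so treat the output as a picture of the reasoning and nothing more.

```python
def effective_fp_per_clock(peak_fp_per_clock, stall_fraction):
    """Average FP instructions per clock once idle (stalled) cycles
    are counted. stall_fraction is the share of cycles spent waiting
    on memory, between 0.0 and 1.0."""
    if not 0.0 <= stall_fraction <= 1.0:
        raise ValueError("stall_fraction must be between 0.0 and 1.0")
    return peak_fp_per_clock * (1.0 - stall_fraction)


# Hypothetical numbers: the P4 is fed from cache, the Athlon keeps
# going out to memory for data.
p4 = effective_fp_per_clock(1.0, 0.05)
athlon = effective_fp_per_clock(1.0, 0.60)

# Even at double the peak rate, the same stalls still cost the Athlon.
athlon_2x_peak = effective_fp_per_clock(2.0, 0.60)

print(p4, athlon, athlon_2x_peak)
```

With these made-up inputs, the Athlon stays behind even with its peak rate doubled, which is the point: idle cycles eat the peak advantage.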
"We are Microsoft, resistance is futile." - Bill Gates, 2015.