So many replies, not one with anything to do with the question... Now I'm no electronic engineer, but AFAIK cycles per second depends on:
* The number of pipeline stages, such as Fetch, Decode, Execute, Store. You can have a longer pipeline where each stage does a simpler job, and that can allow a faster clock, but the trade-off is that data dependencies or branches can force the pipeline to flush and start over. For example, if an instruction coming down a 20-stage pipe needs to know what X equals but the answer hasn't been computed yet, the pipeline drains of what it was doing and what it was about to do, and the instruction that calculates X gets sent down the pipe first. Back in the mid 2000s Apple was showing the world it can be faster overall to have fewer pipeline stages running at a slower clock rate than vice versa. (There's a small misprediction sketch after this list.)
* Parallelism, i.e. running code concurrently. Also instruction-level tricks like branch prediction and out-of-order execution.
* Cache design. Not only the hierarchy, e.g. L1, L2, L3 and sometimes L4, but the physical size.
The larger a cache is, the further away from the "core" its farthest part sits. Electrical signals don't propagate instantaneously, so the further they have to travel, the longer it takes and the fewer cycles per second you can run. That is why the last-level cache is the slowest: it is furthest from the "core". Double the size of the L1 and its latency goes up, it won't stay as fast. As the process shrinks, the transistors shrink, get closer together and can run faster. (There's a latency sketch after the numbers below.)
* A million other things I have no clue about
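To make the pipeline/branch point concrete, here's a minimal C++ sketch (my own illustration, not from any particular source): the same summing loop runs slower on random data than on sorted data because the branch inside it keeps getting mispredicted, and every misprediction throws partially executed work out of the pipeline. Exact numbers depend on the CPU, and an aggressive compiler may turn the branch into a conditional move and hide the effect.

```cpp
// Minimal sketch: the same work takes longer when the branch is unpredictable,
// because each misprediction discards partially executed instructions.
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

// Sum every element above 128; only the branch's predictability changes below.
static std::uint64_t sum_above_threshold(const std::vector<std::uint8_t>& data) {
    std::uint64_t sum = 0;
    for (std::uint8_t v : data) {
        if (v > 128) sum += v;  // ~random outcome on shuffled data, trivial on sorted data
    }
    return sum;
}

static double time_ms(const std::vector<std::uint8_t>& data) {
    auto start = std::chrono::steady_clock::now();
    volatile std::uint64_t sink = sum_above_threshold(data);  // keep the work observable
    (void)sink;
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
    std::vector<std::uint8_t> data(1 << 24);
    std::mt19937 rng(42);
    for (auto& v : data) v = static_cast<std::uint8_t>(rng());

    double unsorted_ms = time_ms(data);   // branch direction is effectively random
    std::sort(data.begin(), data.end());
    double sorted_ms = time_ms(data);     // branch direction is now predictable

    std::cout << "unsorted: " << unsorted_ms << " ms\n"
              << "sorted:   " << sorted_ms << " ms\n";
}
```

Same instructions, same data, same amount of arithmetic; the only difference is how often the branch predictor guesses right.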
On a 5950X, latency for L1 is about 0.8ns. Latency through L3 from thread 1 to thread 2 (i.e. the two SMT threads on one core) is 6.5ns, from one core to another in the same CCX is 20ns, and from CCX1 to CCX2 takes 84.5ns. DRAM measures about 79ns (not to be confused with the time it takes to actually send and receive the data). At 5GHz a full cycle is only 0.2ns, so an L1 hit costs roughly 4 cycles and a trip to DRAM roughly 400. The entire process can only be as fast as its slowest point, and there is a lot going on. Unfortunately, all of the data ultimately lives on a hard drive, which is basically the furthest physical component in a PC from the CPU.
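If you want to reproduce those latency cliffs yourself, the usual trick is pointer chasing: each load depends on the previous one, so the time per step approximates the raw load latency of whichever cache level the working set fits in. This is only a rough sketch; the sizes are ones I picked to land roughly in L1, L2, L3 and DRAM territory, not tuned to the 5950X or any particular chip.

```cpp
// Minimal pointer-chasing sketch: every load depends on the previous one, so
// time per step roughly tracks load latency for the current working-set size.
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

// Build one big random cycle so the hardware prefetcher can't guess the next address.
static std::vector<std::size_t> make_chain(std::size_t n) {
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::shuffle(order.begin(), order.end(), std::mt19937_64(1));
    std::vector<std::size_t> next(n);
    for (std::size_t i = 0; i + 1 < n; ++i) next[order[i]] = order[i + 1];
    next[order[n - 1]] = order[0];
    return next;
}

int main() {
    // Working sets meant to sit roughly in L1, L2, L3 and DRAM on a typical desktop CPU.
    for (std::size_t bytes : {32u << 10, 512u << 10, 16u << 20, 128u << 20}) {
        std::size_t n = bytes / sizeof(std::size_t);
        auto next = make_chain(n);
        std::size_t p = 0;
        const std::size_t steps = 10'000'000;
        auto start = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < steps; ++i) p = next[p];  // serial dependent loads
        auto stop = std::chrono::steady_clock::now();
        double ns = std::chrono::duration<double, std::nano>(stop - start).count() / steps;
        std::cout << (bytes >> 10) << " KiB working set: ~" << ns
                  << " ns per load (p=" << p << ")\n";  // print p so the loop isn't optimised away
    }
}
```

The bigger the working set, the further (physically and in the hierarchy) the data sits from the core, and the nanoseconds per load climb accordingly.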
My guess is that there's a much bigger benefit in enlarging the cache, taking fewer cache misses (and not hitting ultra-slow RAM) and accepting a slower operating frequency, than in speeding up the core, keeping the cache tiny and optimising for cycles per second alone. Hope this helps and gives you something to think about.
PS: nothing matters more than single-core work-per-second, and clock speed is a great way to go about it. The reason is that not many problems can be solved concurrently; you can't work out what comes next without knowing what just happened. Try solving 10,000 decimal places of Pi concurrently. Out-of-order execution exists for some of this, but it has limits and costs of its own; it is far from a cure-all. (See the sketch below for a loop that simply cannot be split across cores.)
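To illustrate that serial-dependence point, here's a toy recurrence where step n+1 literally cannot start until step n has finished, so extra cores buy you nothing; only making each step faster (better single-core work-per-second) helps. The constants are arbitrary, it's just a sketch.

```cpp
// Minimal sketch of a loop-carried dependency: each iteration needs the
// previous result, so the chain is inherently serial and can't be split
// across cores; only faster single-core execution shortens it.
#include <cstdint>
#include <iostream>

int main() {
    std::uint64_t x = 1;
    for (int i = 0; i < 100'000'000; ++i) {
        // x_{n+1} depends on x_n, so iteration i+1 can't begin before iteration i ends.
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
    }
    std::cout << x << "\n";  // print the result so the compiler can't delete the loop
    return 0;
}
```

Contrast that with summing a big array of independent values, which splits across cores trivially because no element depends on another.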