The distinction between RISC and CISC is largely moot these days: x86 CPUs put their instruction decoders at the top of the execution pipeline to turn x86 code into RISC-like micro-ops, and it is those micro-ops that end up in the reorder buffer. From this point of view, the execution pipeline of a modern x86 CPU can be considered closer to pure RISC than many RISC chips.
Now, as you said, a pure RISC CPU would need more instructions to perform the same task as an x86 CPU, and if a RISC CPU tried to achieve the same average throughput per clock as Skylake, I bet the RISC scheduler would need to be quite beefy to dredge up the extra instruction or two per clock needed to keep pace.
Intel makes tiny x86 schedulers too: look at Atom and Xeon Phi. Scheduler complexity is directly related to the amount of ILP you want to extract from the instruction stream. When the CPU lacks execution units to achieve high ILP (Atom) or does not care about ILP because it is optimized for TLP (Phi), scheduling becomes much simpler.
BTW, in modern pipelined RISC CPUs, NOP is probably the only true single-cycle instruction. Everything else typically needs at least three cycles from start to finish: one cycle to move data from the register file to the execution unit's inputs, one cycle through the ALU, and one more cycle to write the result back into a register and/or forward it to the input of another execution unit if a dependent instruction is scheduled there next.
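To put a rough number on that (just a toy model of my own, not any specific core): even with a three-cycle register-to-register latency, a pipelined core still retires close to one instruction per clock as long as the pipe stays full.

/* Toy model: every instruction spends 1 cycle reading registers,
 * 1 cycle in the ALU and 1 cycle writing back.  Latency is 3 cycles,
 * but independent instructions overlap, so throughput approaches
 * 1 instruction per clock once the pipeline is full. */
#include <stdio.h>

int main(void) {
    const int stages = 3;                 /* read, execute, write back  */
    const int n = 8;                      /* instructions in the stream */
    int total_cycles = stages + (n - 1);  /* fill the pipe, then 1/clk  */

    printf("latency per instruction: %d cycles\n", stages);
    printf("%d overlapped instructions: %d cycles (%.2f IPC)\n",
           n, total_cycles, (double)n / total_cycles);
    return 0;
}

So "single cycle" is really a statement about throughput once the pipe is full; the latency any one instruction sees is still those three stages.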
This doesn't make any sense. Because RISC CPUs do not need to convert macro-ops into micro-ops, with very few exceptions they don't need nearly as complex a front-end decoder. The decoder on a RISC CPU is primarily responsible for determining valid instructions, doing assignments, looking up references to the register windows and so forth; it's nowhere near as beefy as what Intel bolts onto their CPUs. This in turn means the scheduler can be less complex: it doesn't need to deal with an external x86 macro-op being five to seven micro-ops and track them as such. There are pros and cons to this approach, with the final performance outcome ending up about the same. Neither is inherently better than the other, but because of the different design philosophies you need to optimize processor resources differently. CISC designs require very strong front-end decoders and schedulers in order to achieve high throughput, while RISC designs require very powerful I/O subsystems to do the same thing. The link I provided from Oracle explains what each component is doing.
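A rough way to see why fixed-width decode is so much cheaper (my own toy sketch, not how any particular core is wired): with 4-byte instructions every decode slot knows exactly where its instruction starts, whereas an x86 decoder first has to work out instruction lengths before it even knows where the next instruction begins.

/* Toy sketch of parallel fixed-width decode.  With a fixed 4-byte
 * encoding, instruction i always starts at byte 4*i, so several
 * decode slots can all grab their bytes in the same cycle.  A
 * variable-length ISA has to determine each instruction's length
 * before the next slot even knows where to look. */
#include <stdint.h>
#include <stdio.h>

static uint32_t fetch_fixed(const uint8_t *code, int slot) {
    const uint8_t *p = code + 4 * slot;          /* start is just 4*i */
    return (uint32_t)p[0]       | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

int main(void) {
    uint8_t code[16] = { 0x13, 0x05, 0xa0, 0x02 };   /* dummy bytes  */
    for (int slot = 0; slot < 4; slot++)             /* conceptually */
        printf("slot %d gets word %08x\n",           /* in parallel  */
               slot, (unsigned)fetch_fixed(code, slot));
    return 0;
}

The x86 length-finding step is what the pre-decode/boundary-marking logic in Intel's front end is there for, and it's part of why that front end is so beefy.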
Virtually all RISC instructions take 1 clock cycle to execute. This is because they do not use complex addressing modes or, with very few exceptions, allow compound instructions. Each ISA strays from this a little bit, with SPARC being the most militant in its adherence. Once you start adding complex pipelining this number becomes a bit fuzzy. The RISC instruction still takes exactly one cycle to execute; there is just a lot of preparatory work done ahead of time to optimize the number of executions that can happen while avoiding stalls. CISC, on the other hand, would take many cycles depending on the instruction, division being the worst. The difference is that a single CISC instruction would have to be represented by three or more RISC instructions, so as I said it's not a one-to-one comparison. The one-cycle execution time is important for designing the predictor and scheduler; fixed execution times allow much simpler timing than variable execution times.
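As a back-of-the-envelope illustration of why that roughly evens out (the numbers here are made up, not measured): total time is instruction count × CPI × clock period, so three or four single-cycle RISC instructions and one multi-cycle CISC instruction can land in about the same place.

/* Made-up numbers, purely to show the shape of the trade-off:
 * time = instruction count x CPI x clock period. */
#include <stdio.h>

int main(void) {
    double clock_ns = 0.5;                      /* assume a 2 GHz clock */

    double cisc_insns = 1e6, cisc_cpi = 4.0;    /* fewer, slower insns  */
    double risc_insns = 4e6, risc_cpi = 1.0;    /* more, 1-cycle insns  */

    printf("CISC: %.2f ms\n", cisc_insns * cisc_cpi * clock_ns / 1e6);
    printf("RISC: %.2f ms\n", risc_insns * risc_cpi * clock_ns / 1e6);
    return 0;
}

Both come out to 2 ms with these numbers; shift the instruction ratio or the CPIs and one side pulls ahead, which is exactly where the decoder, scheduler and memory subsystem trade-offs come in.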
Take a simple MULT instruction that multiplies two numbers and stores the result in the location of the first variable.
In x86 pseudo-code it would look like this:
MULT Variable_A Variable_B
That's a single instruction in machine language.
In RISC pseudo-code it would look like this:
LOAD Register_A, Variable_A
LOAD Register_B, Variable_B
PROD Register_A, Register_B
STORE Variable_A, Register_A
And depending on the ISA there might be a requirement to first look up the memory location of the variables instead of referencing them directly. And remember this is machine language, so that's four to five separate instructions going into the instruction queue.
The top machine instruction would need to first be converted into something that looks much like the bottom, but entirely inside the CPU; then each of those micro-ops would be ordered and scheduled, executed, put back together and retired, and then another macro-op fetched (in reality several are fetched simultaneously). The bottom set would just be scheduled, possibly reordered, and executed individually, with no need to be reassembled since they are already native instructions within a stream of other instructions. The single CISC instruction takes multiple cycles to execute because it's complex; each of those RISC instructions takes one cycle to execute. You get about the same total execution time, but the scheduling is done very differently.
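For reference, the high-level statement both sequences implement is just a multiply with the result written back to the first variable (my framing, not from the link): a compiler for a load/store ISA has to spell out the loads and the store itself, while on x86 the multiply can read one operand straight from memory and the decoder cracks that into load and multiply micro-ops internally.

/* The statement the pseudo-code above stands in for.  On a
 * load/store machine the compiler emits explicit load, load,
 * multiply and store instructions; on x86 the multiply itself can
 * take a memory source operand and the decoder splits it into
 * micro-ops. */
#include <stdio.h>

int main(void) {
    int variable_a = 6;
    int variable_b = 7;

    variable_a = variable_a * variable_b;    /* "MULT Variable_A Variable_B" */

    printf("variable_a = %d\n", variable_a); /* prints 42 */
    return 0;
}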
Two different ways of accomplishing the same task, each with different pros and cons. RISC is heavily limited by memory I/O due to the raw number of instructions (not data) flying across; CISC is limited by prediction and caching. If you make a mistake predicting a RISC instruction it's not as big a performance penalty as mispredicting a CISC instruction. RISC relies a lot more on the compiler not screwing up and producing spaghetti code. This is why Intel invested so much engineering effort into their magic branch predictor and caching technology. They have, by far, the absolute best caching and prediction logic on the planet. It's responsible for their performance lead in single-threaded computing.
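To give a feel for why the predictor matters so much (made-up numbers again, not measurements of any real core): effective CPI degrades as branch frequency × mispredict rate × flush penalty, so on a deep pipeline even a few percent of mispredicts eats a big chunk of the throughput.

/* Illustrative numbers only: effective CPI =
 * base CPI + branch_fraction * mispredict_rate * flush_penalty. */
#include <stdio.h>

int main(void) {
    double base_cpi = 0.25;   /* a wide core running at ~4 IPC           */
    double branches = 0.20;   /* roughly 1 in 5 instructions is a branch */
    double penalty  = 15.0;   /* cycles flushed per mispredict           */

    for (int m = 1; m <= 5; m++) {
        double miss = m / 100.0;
        printf("miss rate %d%% -> effective CPI %.3f\n",
               m, base_cpi + branches * miss * penalty);
    }
    return 0;
}

Going from a 1% to a 5% miss rate drops this toy model from about 3.6 IPC to 2.5, which is the kind of gap all that predictor engineering is buying back.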