I think you are confusing work queues with multithreading.
Those are special registers that contain the starting address of each thread, and that is how it keeps track of so many threads at *the same time*.
In SPARC the *block-multithreading* model is a different story, because the TLB (translation lookaside buffer) can be kept warm, and the L1 caches, which may contain pre-fetched, pre-decoded instructions, are also kept warm for all of those threads (current versions are 8 threads per core). But the pipeline only executes one thread at a time, dictated by its internal PC (program counter) logic, not by the OS (operating system) scheduler. Basically it takes advantage of the many pipeline bubbles that always exist in common code to perform fast internal context switches... in the end *the pipeline* only executes one thread at a time, but it can be quite efficient: even if there were only 1 thread to execute, the pipeline would always be full of bubbles from one clock cycle to the next, and if you have 2 or more threads on 1 core targeting those bubbles, the original thread can execute in almost exactly the same time frame as if it were alone, yet the core gives "the illusion" that it is executing more than one thread at a time.
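The bubble-filling argument above can be sketched with a toy simulator (everything here is invented for illustration: the function name, the latency numbers): one pipeline issues from whichever thread is ready, so a stalled thread's wait cycles overlap another thread's useful work instead of being wasted.

```python
def pipeline_cycles(threads):
    # Toy model: each thread is a list of stall ("bubble") counts, one per
    # instruction, i.e. how many cycles the thread waits after issuing it.
    # A single pipeline issues at most one instruction per cycle, from the
    # first thread that is ready; stalled threads just wait in parallel.
    ready = [0] * len(threads)   # cycle at which each thread can issue again
    nxt = [0] * len(threads)     # next instruction index per thread
    cycle = 0
    while any(n < len(t) for n, t in zip(nxt, threads)):
        for i, t in enumerate(threads):
            if nxt[i] < len(t) and ready[i] <= cycle:
                ready[i] = cycle + 1 + t[nxt[i]]  # issue cycle + its bubbles
                nxt[i] += 1
                break                             # only one issue per cycle
        cycle += 1
    return cycle

a = [3, 3, 3, 3]   # 4 instructions, each followed by a 3-cycle bubble
b = [3, 3, 3, 3]
print(pipeline_cycles([a]))      # -> 13: one thread alone, bubbles wasted
print(pipeline_cycles([a, b]))   # -> 14: a second thread rides the bubbles
```

Two threads finish in 14 cycles where one alone needs 13: thread `a` issues on exactly the same cycles as when it runs alone, which is the "almost the same time frame" effect described above.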
SMT (simultaneous multithreading, a.k.a. HTT in Intel lingo) and AMD CMT (cluster multithreading) are evolutions of this logic in which 2 threads really are executed at the same time by the pipeline (it has advantages and some drawbacks). The difference between the 2 is that CMT has dedicated hardware, almost in a logic of co-processors (clusters); actually the FlexFPU *IS* a co-processor, meaning it can track the progress of a thread semi-independently. So 3 threads at the same time per module (I DON'T KNOW) could be possible, but only for a fraction of a split second upon an OS-dictated context switch, while the FPU finishes executing/writing back a few instructions left over from a previous thread. In practice and in reason it should be said that each module only executes 2 threads at a time, because the pipelines must be flushed (any register/cache state must be written back to memory to provide consistency) upon those OS context switches.
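The contrast between one-thread-per-cycle multithreading and SMT can be shown with another toy model (all names and numbers are made up for illustration): give each thread a dependent chain that can only fill one issue slot per cycle, and a 2-wide pipeline. Without SMT, one thread owns the whole cycle and the second slot goes empty; with SMT, the other thread fills it in the same cycle.

```python
def cycles_to_drain(threads, width, smt):
    # threads: instruction counts per thread; each thread is modeled as a
    # dependent chain, so it can issue at most ONE instruction per cycle.
    remaining = list(threads)
    cycle = 0
    while any(remaining):
        if smt:
            # SMT: the `width` slots of one cycle are shared among threads
            issued = 0
            for i in range(len(remaining)):
                if remaining[i] and issued < width:
                    remaining[i] -= 1
                    issued += 1
        else:
            # block/interleaved MT: one thread owns the whole cycle, but
            # its dependent chain still fills only one of the slots
            live = [i for i, r in enumerate(remaining) if r]
            remaining[live[cycle % len(live)]] -= 1
        cycle += 1
    return cycle

print(cycles_to_drain([4, 4], width=2, smt=False))  # -> 8 cycles
print(cycles_to_drain([4, 4], width=2, smt=True))   # -> 4 cycles
```

This is exactly the case SMT is built for: when one thread cannot fill the pipeline's width, the second thread's instructions occupy the spare slots in the *same* cycle.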
IMHO AMD CMT is clearly superior, not so much because of having quite a few additional dedicated resources for 2 threads (the integer cores of so much polemic), which could/should always boost 2 threads at the same time... but more because AMD uses multithreading logic throughout its pipeline, i.e. it is divided into thread <domains> in the sense explained above about block-multithreading... it is *Vertical Multithreading*...
In BD and PD the logic is one cycle per thread on each domain of the pipeline, that is, one domain deals with one thread's instructions on one cycle, then the next cycle it switches to the other thread, and on and on (very similar to the interleaved multithreading exercises of Cray). In Steamroller the logic is changed to 2 cycles per thread, making it a true vertical block-multithreading scheme.
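The two grant schedules described above can be sketched as follows (the function and its parameters are hypothetical, purely to visualize the pattern):

```python
def schedule(n_cycles, n_threads, cycles_per_thread):
    # Returns which thread "owns" a pipeline domain on each cycle under a
    # round-robin grant of `cycles_per_thread` consecutive cycles.
    return [(c // cycles_per_thread) % n_threads for c in range(n_cycles)]

# BD/PD style: the domain alternates threads every single cycle
print(schedule(8, 2, 1))  # -> [0, 1, 0, 1, 0, 1, 0, 1]
# Steamroller style as described above: 2-cycle blocks per thread
print(schedule(8, 2, 2))  # -> [0, 0, 1, 1, 0, 0, 1, 1]
```

Same total cycles per thread either way; the 2-cycle block just keeps one thread's state in a domain for longer before switching.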
In this context, there aren't exactly 2 decoders on Steamroller... the above is "as if", but it is an illusion. As revealed in an RWT thread (rumor or not, I don't know), SR will have the same 4 decode pipes, naturally considerably beefed up, but still 4, and *the way I see it* the difference is that it will have 2 dedicated decode-domain input buffers and 2 dedicated output buffers that share the 4 decode pipes in an SMT fashion (simultaneous multithreading, that is, decoding from 2 threads simultaneously, like the FlexFPU and the L2, but on a vertical logic) plus a block-multithreading fashion. It can be tremendously more effective for those same 4 pipes, courtesy of the vertical multithreading scheme.
This is just to show how versatile and superior this "vertical multithreading" is... as if each group of pipeline stages in a *domain bordered by input/output buffers* were in itself a "vertical co-processor" (well, a bit exaggerated lol)... and how much easier it will be to replace or improve those <domains> without having to re-design a whole chip, as in the traditional/synchronous pipelines of Intel cores. The AMD BD uarch is an asynchronous/semi-synchronous pipeline: it could much more easily gain efficiency by making parts run ahead... could much more easily change the resources/characteristics of each domain... could much more easily make each module with 3 or 4 integer thread cores/clusters, or a number of *heterogeneous cores/clusters*... it could even, without much difficulty, put ARM AArch64 integer cores where the x86 cores are now lol...
Yes, many opinions paint the BD uarch as a failure (propaganda is rampant in all competitive, lucrative businesses), but IMHO it is not a failure; its *POTENTIAL* is clearly superior to anything Intel has... this asynchronousness and modularity *potential* is very, very difficult to get right (no wonder it took, as rumored, 8 years to finish the first iteration), but it could provide an accelerated path for improvements in successive iterations.