Guess I should have been more clear. Using LM Studio to run a model using multiple GPUs.
I understood you well enough and that's what I did as well, for my last round of testing.
(I've been in HPC for four decades, and scale-out problems were being worked on before computers were even properly invented.)
LM Studio is a really nice UI, but it just sits on top of a very tall stack of software layers.
The layer splitting is done by llama.cpp, which is also used by a lot of other tools, or directly from the command line if that's how you roll.
And llama.cpp itself sits on many layers and supports different runtimes underneath, which you can also configure and switch in LM Studio: on my Nvidia machines I can go between CPU-only, Vulkan and CUDA, and on my systems with Intel and AMD GPUs between Vulkan and CPU.
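If you want to poke at those knobs without the UI, here is a minimal sketch assuming the llama-cpp-python bindings (the llama.cpp CLI exposes equivalent switches); the model path, split ratios and layer counts below are purely illustrative, and the runtime itself (CUDA, Vulkan, CPU) is baked in when the package is built:

    # Minimal sketch, assuming the llama-cpp-python bindings.
    # The compute backend (CUDA/Vulkan/CPU) is chosen when the package is built;
    # these parameters only steer where the layers land.
    from llama_cpp import Llama

    llm = Llama(
        model_path="model.gguf",    # hypothetical path
        n_gpu_layers=-1,            # try to offload every layer to the GPU backend
        tensor_split=[0.5, 0.5],    # spread the layers roughly evenly across two GPUs
        main_gpu=0,                 # device that keeps the small leftover work
    )

    out = llm("Layer splitting means", max_tokens=32)
    print(out["choices"][0]["text"])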
How that maps onto the potentially different physical pools of RAM that hold the weights and the compute units that work on them isn't that clear to me, but generally a preference for locality can be assumed: if a layer's data sits on xPU A, there is a good likelihood that it will also be executed by that xPU and not by xPU B. Technically, though, there are quite a few options, because several GPUs and CPUs can share a virtual memory space, and APUs might even have a physically unified pool... which might itself be NUMA...
So while CUDA as a framework isn't likely to execute on x86 CPUs or AMD GPUs, it could use CPU RAM and even (from a pure PCIe perspective) AMD/Intel GPU RAM. That becomes a little more interesting when you have unified memory and iGPUs.
More likely, llama.cpp switches runtimes along with where the layers are stored, so even if you select a CUDA runtime, as soon as the layers exceed the RAM capacity of one device, the overflow is likely moved entirely to whatever runtime the other device uses.
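That spill is also something you can steer explicitly rather than leave to the defaults; in the same hypothetical bindings as above, the number of offloaded layers is just a parameter and everything beyond it stays with the CPU backend:

    # Sketch of partial offload, same assumptions as above.
    from llama_cpp import Llama

    llm = Llama(
        model_path="model.gguf",  # hypothetical path
        n_gpu_layers=20,          # offload 20 layers to the GPU backend;
                                  # the remaining layers stay in CPU RAM
    )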
But that's all software; you could roll your own if you can afford the time.
All that doesn't really matter if PCIe is the best you have in the middle of an LLM's inner loop: at roughly 32 GB/s per direction on PCIe 4.0 x16 (64 GB/s aggregate, and less with bifurcation), that link is what limits their speed. The individual GPUs might have local bandwidth adding up to multiple TB/s, and even the CPU might do 256 GB/s on quad-channel RAM, but because of the high data dependency of LLMs that just changes how fast they can twiddle their thumbs. If they have to cross the bridge often enough for every token produced, it's going to be slow.
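To put a rough number on that ceiling: if a fixed amount of data has to cross the link for every token, the achievable token rate is at best the link bandwidth divided by the bytes per token. A back-of-envelope sketch with made-up figures:

    # Back-of-envelope ceiling: tokens/s <= link bandwidth / bytes crossing it per token.
    # All figures below are illustrative assumptions, not measurements.
    pcie4_x16_bps = 32e9      # ~32 GB/s per direction, PCIe 4.0 x16
    hbm_bps = 2e12            # ~2 TB/s of local GPU memory bandwidth
    bytes_per_token = 8e9     # hypothetical: 8 GB of weights streamed per token

    print(pcie4_x16_bps / bytes_per_token, "tokens/s if the data crosses PCIe")  # 4.0
    print(hbm_bps / bytes_per_token, "tokens/s if it stays in local GPU RAM")    # 250.0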
That's why scale-out hardware architectures for machine learning use NVLink fabrics, to come as close as possible to local RAM speeds.