Have you tried 2x RTX 3090 in NVLink?
That's the setup I have (with a Ryzen 5950X + 128GB RAM), but I haven't gotten Meta's Llama 2 running yet.
Gotta use some of the few two-slot-sized 3090s; I went with EVGA XC3 models. They are about $800 each on eBay these days.
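If the problem is getting the two cards to cooperate at all, the first sanity check I'd run (just a sketch, nothing authoritative) is whether PyTorch actually sees both 3090s and whether peer-to-peer access over the NVLink bridge is enabled:

```python
# Quick sanity check: does PyTorch see both 3090s, and can they
# access each other's memory directly (i.e. is NVLink/P2P usable)?
import torch

n = torch.cuda.device_count()
print(f"Visible CUDA devices: {n}")
for i in range(n):
    print(f"  cuda:{i} -> {torch.cuda.get_device_name(i)}")

if n >= 2:
    # True here usually means the NVLink bridge (or at least PCIe P2P) is working
    p2p = torch.cuda.can_device_access_peer(0, 1)
    print(f"Peer access cuda:0 <-> cuda:1: {p2p}")
```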
If I could start fresh, it sounds like a near-ideal match to try with what's already here: the 5950X with 128GB of ECC RAM, one RTX 4090, and another 3090 that I could temporarily wrangle back from one of my sons.
Of course I'd rather have Tom's Hardware try the permutations, hits and misses between a dual 4090 setup linked over PCIe and a dual 3090 setup with NVLink, because I can't burn that sort of money that easily on hardware (and no one sends me review units).
I had started with a GTX 1080 Ti, upgraded to an RTX 2080 Ti for the tensor cores and denser data types, then went for the 3090 relatively late for its 24GB of RAM and INT4 support. I was really happy when it worked so well with LLaMA 30B, and then upgraded to the 4090 for the extra tokens per second and because I could get a 3-slot design, which would allow me to go dual with a matching mainboard in the corp labs.
(BTW: I don't think any of these cards included NVLink bridges; the last bridges I remember seeing shipped back in the GTX 1080 Ti / SLI era, and I don't think those would work. No idea how hard the proper NVLink ones are to obtain.)
Dual-slot 3090s seemed a pretty theoretical option; those who managed to get them held on tight, and it's only now that you hear about dual 4090 builds being done in some Chinese back alleys.
So *hint*, *hint*: if Tom's Hardware is looking for a nice piece of non-ChatGPT-4 editorial content, some experimentation with running the 70B Llama 2 on dual/triple/quad consumer GPUs should make for interesting reading.
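To be concrete about what such a test would involve: with the usual Hugging Face stack, splitting a 70B Llama 2 across a couple of 24GB cards with 4-bit quantization looks roughly like this (a sketch of the common recipe, not something I've benchmarked on this exact setup; the per-card memory caps are just what I'd try first):

```python
# Rough sketch: load Llama-2-70B in 4-bit, sharded across whatever
# consumer GPUs are visible (device_map="auto" lets accelerate split the layers).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-70b-chat-hf"  # gated repo, needs Meta's approval

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                      # shard layers across the available GPUs
    max_memory={0: "22GiB", 1: "22GiB"},    # leave headroom on each 24GB card
)

inputs = tokenizer("The main bottleneck for local LLM inference is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```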
And I've heard that slot extension cables, e.g. the ones "RGB" enthusiasts use to change mounting direction, work even at PCIe 4.0 x16, so with a bit of crazy open-chassis testing, even oversized consumer GPUs could be used for some of this experimentation.
Actually, I'm pretty sure I could get paid by my employer for spending the time on this, and I'd only need Tom's Hardware to send me the hardware... but I'm based in Europe, so there are logistics to sort out, among other things.
Then again, LLMs are reported to really need memory bandwidth for inference, so the gap between ~1TB/s of GDDR6X on the RTX 4090 and ~1.6TB/s of HBM2 on the A100 is a hard limit on the token rate, even before the overhead of split models.
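The back-of-envelope version of that bandwidth argument, as I understand it: each generated token has to stream essentially all of the (quantized) weights through the GPU, so memory bandwidth divided by model size gives a rough ceiling on tokens per second, before any split-model overhead:

```python
# Rough ceiling: tokens/s ≈ memory bandwidth / bytes of weights read per token.
# Ignores KV cache, kernel efficiency and multi-GPU overhead, so real numbers land lower.
def max_tokens_per_s(bandwidth_gb_s: float, params_b: float, bytes_per_param: float) -> float:
    model_size_gb = params_b * bytes_per_param
    return bandwidth_gb_s / model_size_gb

# Llama-2-70B at ~4-bit (0.5 bytes/param) -> roughly 35 GB of weights
for name, bw in [("RTX 4090 (GDDR6X, ~1008 GB/s)", 1008),
                 ("A100 40GB (HBM2, ~1555 GB/s)", 1555)]:
    print(f"{name}: ~{max_tokens_per_s(bw, 70, 0.5):.0f} tokens/s upper bound")
```

That works out to roughly 29 vs. 44 tokens/s as theoretical ceilings, which is why the bandwidth gap matters more than raw compute here.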
For functional testing in the labs (which is my use case), the limited token rate may not be too big of an issue; for running inference at an acceptable mix of cost and performance, it might actually push you towards the data center GPUs. It's for good reason that Nvidia builds and prices them differently, and not everybody is rushing to consumer GPUs to run LLMs... or we could just be behind on the data.