Question: How to shop for GPUs (or other hardware) for LLM workloads?


Zork283

Hello,
I would appreciate some guidance on what hardware (GPU or otherwise) I should purchase to enable me to run LLMs locally on my machine.
Here are my system specs.

CPU: AMD Ryzen 9 7950X
Motherboard: ASRock X670E PG Lightning AM5 ATX
RAM: 64GB 5200 MHz (2× 32GB DIMMs, leaving 2 slots free for expansion)
Case: Fractal Design Define 7
PSU: 1000W
GPU: Arc A770 16GB

As you can see by the oversized PSU and case, I have plenty of room for expansion.

I will start by describing my use case and then go from there:

I installed Nous-Hermes-13B-GGML & WizardLM-30B-GGML using the instructions in this reddit post. The main limitation on being able to run a model on a GPU seems to be its VRAM. Nous-Hermes-13B-GGML requires 12.26 GB of RAM, and I am able to offload the entire model to my A770 GPU, which makes it run much faster than when even some of its layers are left on the CPU. I have absolutely no complaints about how Nous-Hermes-13B-GGML runs; however, the model itself clearly has limitations.

WizardLM-30B-GGML requires 27GB, so I can only send 40 of its 63 layers to the GPU. It runs very slowly, outputting only about 1 token per second, and the program crashes frequently, seemingly whenever any video demands are put on the GPU (like just generating basic screen output). As a result, WizardLM-30B-GGML takes several minutes to post a single reply.
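In case it helps anyone reading: the per-layer offload that koboldcpp exposes in its launcher is the same n_gpu_layers knob the underlying llama.cpp library exposes. Here is a minimal sketch of the idea using llama-cpp-python; the model filename and prompt are placeholders, and it assumes a build of the library compiled with GPU support (e.g. CLBlast for an Arc card):

```python
# Rough sketch of layer offloading with llama-cpp-python.
# Assumes a GPU-enabled build (e.g. CLBlast for an Arc A770);
# the model path below is a hypothetical filename.
from llama_cpp import Llama

llm = Llama(
    model_path="models/nous-hermes-13b.ggmlv3.q4_K_M.bin",  # placeholder
    n_gpu_layers=100,  # anything >= the model's layer count offloads every layer
    n_threads=16,      # one thread per physical core on a 7950X
    n_ctx=2048,        # context window; larger contexts use more VRAM
)

out = llm("Explain VRAM in one sentence.\n", max_tokens=64)
print(out["choices"][0]["text"])
```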

Additionally, I would like to be able to run even larger LLMs as they become available, so I may just hold off on buying anything now and wait until there is an open-source model that can rival GPT-4. Even then, I would still need to know how to go about choosing components to handle that workload.

When launching koboldcpp.exe to run the LLM, I choose 16 threads, since the Ryzen 9 7950X has 16 physical cores, and select my only GPU to offload the 40 layers to. When I offload 40 layers, my GPU's VRAM usage is 15.1GB. However, when the model is running, my GPU utilization is typically only around 48%-55%.
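Doing the arithmetic on those numbers gives a rough idea of what a full offload would take. This treats the 15.1GB as pure layer weights, so it slightly overstates the per-layer cost (some of it is context and scratch buffers), but it's a ballpark:

```python
# Back-of-the-envelope VRAM estimate based on what koboldcpp reports above.
vram_used_gb     = 15.1   # observed with 40 layers offloaded
layers_offloaded = 40
total_layers     = 63

gb_per_layer = vram_used_gb / layers_offloaded
full_offload_gb = gb_per_layer * total_layers

print(f"~{gb_per_layer:.2f} GB per layer -> "
      f"~{full_offload_gb:.0f} GB to offload all {total_layers} layers")
# -> ~0.38 GB per layer -> ~24 GB to offload all 63 layers
```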

At this point, I am going to just share my thoughts and musings on this subject, as I do not have any good answers.

I assume that I would need a GPU with more than 27GB of VRAM to run the whole model, but I have not seen any GPU with that much VRAM that isn't insanely expensive. If it is possible to divide the workload among multiple GPUs, then I could just get another A770 16GB, since it seems to offer very good VRAM per dollar.
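On the multi-GPU question: llama.cpp (which koboldcpp and llama-cpp-python are built on) does have a tensor_split setting for spreading a model's weights across several cards, though as far as I know that has mainly been implemented for the CUDA backend, so I am not sure it would apply to a pair of A770s. A hedged sketch of what it looks like in llama-cpp-python (the model path is a placeholder):

```python
# Illustrative sketch of splitting a model across two GPUs with llama-cpp-python.
# NOTE: tensor_split is, to my knowledge, mainly a CUDA-backend feature, so it
# may not work with two Arc A770s; treat this as an illustration only.
from llama_cpp import Llama

llm = Llama(
    model_path="models/wizardlm-30b.ggmlv3.q4_0.bin",  # placeholder
    n_gpu_layers=100,          # >= layer count: put every layer on the GPUs
    tensor_split=[0.5, 0.5],   # share the weights 50/50 between GPU 0 and GPU 1
    n_threads=16,
)
```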

Are GPUs even the way to go for this workload? I remember a Veritasium video featuring a company called Mythic AI that was making analogue components for running Neural Networks efficiently. I went to their website, but didn't see any way to buy any of their products.

Any advice or suggestions on the subject would be appreciated. Thank you in advance.

JRRT

I am somewhat familiar with analog computing for AI purposes. I still suspect that it has its place, but if I understand correctly, it will not work for this. There are a lot of different ways to go about trying to recreate intelligence with a machine, but the current Large Language Models, as far as I am aware, are all designed to run on digital hardware and accomplish what they do through vast amounts of number crunching. That is why they run quickly as long as they fit on your video card and slow to a crawl when they don't. GPUs are, fundamentally, highly specialized, highly parallel math chips.
I am also looking for information about ways to split these models across multiple GPUs, and if I find anything I will try to track down your post again and let you know. The very best open-source models are supposed to be comparable to ChatGPT once you take the time to train them on the relevant data. But we are talking about 65 billion parameters at 8 or 16 bits per parameter, so you would need at least an 80 GB video card, which I definitely can't afford.
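For what it's worth, the 80 GB figure falls straight out of the parameter count; rough arithmetic below (the extra few GB for context and activations is just an allowance, not a measured number):

```python
# Rough memory footprint of the weights of a 65B-parameter model.
params = 65e9
for bits in (16, 8):
    weights_gb = params * bits / 8 / 1e9  # bytes -> GB (decimal)
    print(f"{bits:>2}-bit weights: ~{weights_gb:.0f} GB "
          f"(plus a few GB for context/activations)")
# 16-bit: ~130 GB, 8-bit: ~65 GB -> hence "at least an 80 GB card" at 8-bit
```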