How to Run a ChatGPT Alternative on Your Local PC

Hello,

Does anyone have an idea what caused this error? I tried reinstalling everything, but I always get to the end and hit this error.

(llama4bit) PS C:\AIStuff\text-generation-webui> python server.py --gptq-bits 4 --model llama-7b
Loading llama-7b...
Traceback (most recent call last):
  File "C:\AIStuff\text-generation-webui\server.py", line 241, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "C:\AIStuff\text-generation-webui\modules\models.py", line 101, in load_model
    model = load_quantized(model_name)
  File "C:\AIStuff\text-generation-webui\modules\GPTQ_loader.py", line 56, in load_quantized
    model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits)
TypeError: load_quant() missing 1 required positional argument: 'groupsize'
(llama4bit) PS C:\AIStuff\text-generation-webui>

If you have any idea how I can fix this, please let me know.
Yeah, it's broken, and that after only a few days.

I tried editing line 56 in the Python code and just added another parameter, which is clearly an integer. It seems the parameter must be a power of 2, so you can enter 1, 2, 4, 8, 16, and so on. That makes the error disappear, but the model then has errors loading/copying values, so it seems it's not that easy to fix.
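For reference, my edit to line 56 of modules/GPTQ_loader.py looked roughly like this; the value is purely a guess on my part (128 seems to be a common groupsize, and I've seen -1 mentioned as "no grouping"), so treat it as a sketch rather than a fix:

# modules/GPTQ_loader.py, line 56: pass an explicit groupsize as a fourth argument
model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits, 128)  # 128 is a guess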
 
Yeah, it's broken, and that after only a few days.

I tried editing line 56 in the Python code and just added another parameter, which is clearly an integer. It seems the parameter must be a power of 2, so you can enter 1, 2, 4, 8, 16, and so on. That makes the error disappear, but the model then has errors loading/copying values, so it seems it's not that easy to fix.
Hello, and thank you for the information. I guess I'll take a break for now and try again next weekend; hopefully they will have fixed it by then. :)
 
Hello, thank you for your reply.
I tried your command from all the main folders, but it says "fatal: not a git repository (or any of the parent directories): .git". Could you please provide more information?

By the way, I'm trying to run it on an RTX 2070.
Someone fixed it. Do:

cd repositories/GPTQ-for-LLaMa
git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4
pip install -r requirements.txt
python setup_cuda.py install
 
Someone fixed it. Do:

cd repositories/GPTQ-for-LLaMa
git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4
pip install -r requirements.txt
python setup_cuda.py install
I can confirm it's working now. And the difference between the standard and the 4-bit 7b parameter model on my RTX 2070 is unbelievable. Speed is similar to 2.7b models.

Unfortunately, I haven't figured out yet what to do with it. If anyone has ideas, let me know.
 
The article says it uses CUDA 11.7.

And one of the linked instructions also says to use CUDA 11.7:

https://www.reddit.com/r/LocalLLaMA/comments/11o6o3f/how_to_install_llama_8bit_and_4bit/

This:

C:\ProgramData\miniconda3\envs\llama4bit\lib\site-packages\torch\include\pybind11\cast.h(624): error: too few arguments for template template parameter "Tuple" detected during instantiation of class "pybind11::detail::tuple_caster<Tuple, Ts...> [with Tuple=std::pair, Ts=<T1, T2>]" (721): here
...is a C++ error that implies there's probably a version mismatch between two packages. That supports the idea that your CUDA version is too low, which we can also see by the fact that it happens while compiling a CUDA kernel:
2 errors detected in the compilation of "e:/llmrunner/text-generation-webui/repositories/gptq-for-llama/quant_cuda_kernel.cu".
Thanks. Even though nvcc claimed I was using 11.7, I found that I had an old 11.4 install from another product. After wiping that install and installing the 11.7 dev kit from Nvidia's site, I was able to compile the library. The other fix was to use the 2022 build tools rather than the 2019 build tools, after totally uninstalling the 2019 tools and restarting the computer once the 2022 version was installed. I am continuing to work on the other steps now.
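For anyone else hitting this, a quick way to see which toolkit nvcc actually resolves to on Windows (assuming CUDA is on your PATH; note that nvidia-smi reports the driver's supported CUDA version, not necessarily the installed toolkit):

where nvcc
nvcc --version
nvidia-smi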
 
That could work. What GPU are you trying to run this on? Anyway, you might try this command (from the folder where you ran "git clone https://github.com/oobabooga/text-generation-webui.git"):

git checkout 'master@{2023-03-18 18:30:00}'

In theory, that will get you the versions of the files from last Saturday.
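When you're done testing the old snapshot, standard git commands should get you back onto the latest code (I'm assuming master here, same as above; use whichever branch the repo actually tracks):

git checkout master
git pull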
Could you perhaps benchmark some non-Tensor-core GPUs too? It would be interesting to find out how much Tensor cores matter. You could run Llama-7b on a GTX 1660 Ti, RTX 2060, and GTX 1080, measuring generation speed. It could lead to interesting results if GPTQ is benefiting from Tensor cores.
 
Could you perhaps benchmark some non-Tensor-core GPUs too? It would be interesting to find out how much Tensor cores matter.
Good question, but I wouldn't do that before platform-level performance tuning. If there are bottlenecks related to Windows or the host software stack, they would mask any differences between the GPUs themselves.
 
Someone fixed it. Do:

cd repositories/GPTQ-for-LLaMa
git reset --hard 468c47c01b4fe370616747b6d69a2d3f48bab5e4
pip install -r requirements.txt
python setup_cuda.py install

Thank you! This solved loading the model.

Unfortunately, I got a memory error just after:
(llama4bit) C:\AiStuff\text-generation-webui>python server.py --gptq-bits 4 --model llama-7b
Loading llama-7b...
Found models\llama-7b-4bit.pt
Loading model ...
Done.
Traceback (most recent call last):
  File "C:\AiStuff\text-generation-webui\server.py", line 234, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "C:\AiStuff\text-generation-webui\modules\models.py", line 101, in load_model
    model = load_quantized(model_name)
  File "C:\AiStuff\text-generation-webui\modules\GPTQ_loader.py", line 84, in load_quantized
    model = model.to(torch.device('cuda:0'))
  File "C:\Users\TD\miniconda3\envs\llama4bit\lib\site-packages\transformers\modeling_utils.py", line 1808, in to
    return super().to(*args, **kwargs)
  File "C:\Users\TD\miniconda3\envs\llama4bit\lib\site-packages\torch\nn\modules\module.py", line 1145, in to
    return self._apply(convert)
  File "C:\Users\TD\miniconda3\envs\llama4bit\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  File "C:\Users\TD\miniconda3\envs\llama4bit\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  File "C:\Users\TD\miniconda3\envs\llama4bit\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "C:\Users\TD\miniconda3\envs\llama4bit\lib\site-packages\torch\nn\modules\module.py", line 844, in _apply
    self._buffers[key] = fn(buf)
  File "C:\Users\TD\miniconda3\envs\llama4bit\lib\site-packages\torch\nn\modules\module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 22.00 MiB (GPU 0; 4.00 GiB total capacity; 2.97 GiB already allocated; 14.09 MiB free; 3.19 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Any idea how to solve this?
 
You guys really have to stop wasting time with Windows and gaming GPUs for this stuff. Just set up a second SSD with Ubuntu 22.04 and get a real graphics card like the RTX A4000/A4500/A5000, etc.

Also, get 128GB of memory; stop gimping yourselves and then complaining about how slow it is when you only have 32GB.

I'm going to have to try this one with a web UI, though; I was about to start making my own for llama-chat 😂

I had things running briefly, but I had hardware issues I'm currently trying to resolve. I'm also waiting to see how the new high-capacity DDR5 memory modules pan out; I'm not sure yet whether memory requirements will fall faster than 192GB becomes commonplace. It can definitely get tough to keep up with the dev advances, though: take a month off, come back and do a git pull, and your local repo is completely different. Make sure you export your currently working conda env just in case you need to roll back to an old working software stack.
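Something like this is all I mean by exporting the env (the file names are just examples):

conda env export -n llama4bit > llama4bit-working.yml
# later, if you need to roll back to the old stack:
conda env create -n llama4bit-restore -f llama4bit-working.yml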
 
Thank you! This solved loading the model.

Unfortunately, I got a memory error just after:
(llama4bit) C:\AiStuff\text-generation-webui>python server.py --gptq-bits 4 --model llama-7b
Loading llama-7b...
Found models\llama-7b-4bit.pt
Loading model ...
Done.
Traceback (most recent call last):
...
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 22.00 MiB (GPU 0; 4.00 GiB total capacity; 2.97 GiB already allocated; 14.09 MiB free; 3.19 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Any idea how to solve this?


Are you trying to run this on a GPU with only 4GB of memory? Because that's what it's saying in the last line of that traceback. Otherwise, start Googling for PYTORCH_CUDA_ALLOC_CONF errors and you'll find others' posts and possible solutions real quick.
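For example, something along these lines before launching might help with the fragmentation case (it won't conjure up VRAM you don't have, and max_split_size_mb:128 is just a commonly suggested starting value):

set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
python server.py --gptq-bits 4 --model llama-7b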
 
Are you trying to run this on a GPU with only 4GB of memory? Because that's what it's saying in the last line of that traceback. Otherwise, start Googling for PYTORCH_CUDA_ALLOC_CONF errors and you'll find others' posts and possible solutions real quick.
That's what it looks like. Given that Llama-13b requires 10GB or more of VRAM, I think 7b will need at least 6GB. There's no 3.5b Llama model, so you'd have to try OPT-2.7b (https://huggingface.co/facebook/opt-2.7b), but getting the 4-bit variant of that will also require additional effort.
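Rough back-of-the-envelope math, counting the weights only (activations, the KV cache, and CUDA overhead all come on top of this, which is why ~4GB isn't enough):

# weights-only estimate for a 4-bit 7B model
params = 7e9
bytes_per_param = 0.5   # 4 bits
print(f"{params * bytes_per_param / 2**30:.1f} GiB")   # ~3.3 GiB before any runtime overhead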
 
You guys really have to stop wasting time with Windows and gaming GPUs for this stuff. Just set up a second SSD with Ubuntu 22.04 and get a real graphics card like the RTX A4000/A4500/A5000, etc.
There's a lot more to it than just swapping the SSD, because swapping GPUs and drivers under Linux is also a bit more involved (in my experience). Plus, at least the last time I checked (which I hope has improved by now), the RX 7900 cards wouldn't even work under Linux. Every time I need to do something the Linux way, I generally end up spending a couple of extra hours chasing my tail. And I'm by no means a Linux neophyte; I used to run it as my OS for software development back in the day (the '90s and early aughts). But I haven't done proper software development in 20 years.

Since our readers are, like most people, probably 99% Windows users, or at least not-Unix users, I want to write up articles for how THEY can get this stuff working — without having to resort to "Step 1: Spend however long it takes to get to grips with Linux!"