Then what are the minimal specs to run ollama (llama3 8b or preferably 27b) at a decent speed?
I have an old PC that now runs my Plex and arr suite. I was thinking of upgrading it a bit and running ollama on it as well. It doesn’t have a GPU, so what else does it need? I don’t have a big budget, so no new Nvidia card for me.
“Decent speed” depends on your subjective opinion and what you want it to do. I think it’s fair to say that if it can generate text at around your slowest tolerable reading speed, that’s a bare minimum for real-time conversational things. If you want a task done and don’t mind stepping away to get a coffee, it can be much slower.
The dual-core 2.6 GHz i7 in my laptop was just barely passable for real-time conversation with an 8B model: at 1.2-1.7 T/s it could say a short word or half of a complex one per second. When it needed to process something or recalculate context, it took a hot minute or two, which is kind of annoying if you were getting into what it’s saying. Bumping it up to a desktop with an AMD Ryzen 5 2600 six-core CPU was a night-and-day difference. It spits out a sentence very quickly at 5-6 T/s. I’m still working on getting the 4GB RX 580 GPU used for offloading, so those numbers are just from the CPU bump. RAM also matters: DDR5 will beat DDR4 speed-wise.
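A rough way to see why RAM speed matters: CPU token generation is mostly limited by how fast the weights can be streamed from memory, so memory bandwidth divided by model size gives a ceiling on tokens per second. The sketch below assumes dual-channel memory and an 8B model quantized to roughly 4.7 GB. Both are illustrative assumptions rather than measurements from this thread, and real speeds land well below the ceiling.

```python
def peak_bandwidth_gbs(mega_transfers_per_s: float, channels: int = 2) -> float:
    """Theoretical peak bandwidth in GB/s for 64-bit (8-byte) memory channels."""
    bytes_per_transfer = 8  # one 64-bit channel moves 8 bytes per transfer
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


def max_tokens_per_s(bandwidth_gbs: float, model_size_gb: float) -> float:
    """Upper bound: every generated token reads (roughly) all weights once."""
    return bandwidth_gbs / model_size_gb


MODEL_SIZE_GB = 4.7  # assumed size of a 4-bit quantized 8B model

for name, speed in (("DDR4-3200", 3200), ("DDR5-5600", 5600)):
    bw = peak_bandwidth_gbs(speed)
    print(f"{name}: {bw:.1f} GB/s peak, at most {max_tokens_per_s(bw, MODEL_SIZE_GB):.1f} T/s")
```

The same arithmetic shows why a 27B model is so much slower on CPU: the weights are several times larger, so the ceiling drops by the same factor.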
Here’s a tip: most software has the model’s default context size set at 512, 2048, or 4096. Part of what makes Llama 3.1 so special is that it was trained with 128k context, so bump that up to 131072 in the settings so it isn’t recalculating context every few minutes…
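If you are talking to ollama over its local HTTP API rather than a GUI, the context size can be set per request through the `num_ctx` option. This is a minimal sketch, assuming ollama is listening on its default port and that a model tagged `llama3.1:8b` has been pulled; the helper names `build_payload` and `generate` are made up for illustration.

```python
import json
import urllib.request

# Assumption: ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str, num_ctx: int) -> dict:
    """Build a non-streaming generate request with an explicit context size."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def generate(model: str, prompt: str, num_ctx: int = 8192) -> str:
    """Send the request and return the generated text (needs a running server)."""
    body = json.dumps(build_payload(model, prompt, num_ctx)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling `generate("llama3.1:8b", "Hello", num_ctx=131072)` would request the full 128k window, memory permitting.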
Some caveats: a context that large massively increases memory usage (unless you quantize the cache with flash attention, FA), and it also massively slows down CPU generation once the context gets long.
TBH you just need to not keep a long chat history unless you need it.
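To put a number on the memory caveat, here is a back-of-the-envelope estimate of the KV cache size. It assumes the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) and a 16-bit cache; setting `bytes_per_elem=1` only approximates an 8-bit quantized cache, since it ignores the small per-block overhead.

```python
def kv_cache_bytes(
    n_ctx: int,
    n_layers: int = 32,      # Llama 3.1 8B
    n_kv_heads: int = 8,     # grouped-query attention
    head_dim: int = 128,
    bytes_per_elem: int = 2, # 16-bit cache; 1 approximates an 8-bit cache
) -> int:
    """Bytes needed to hold keys and values for n_ctx tokens."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return per_token * n_ctx


GIB = 1024 ** 3
for n_ctx in (4096, 8192, 32768, 131072):
    print(f"{n_ctx:>6} tokens: {kv_cache_bytes(n_ctx) / GIB:.1f} GiB")
```

Under these assumptions the cache is about 0.5 GiB at 4096 tokens and 16 GiB at 131072, on top of the model weights.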
Another thing I’d recommend is running kobold.cpp instead of ollama if you want to get into the nitty-gritty of LLMs. It’s more customizable and (ultimately) faster on more hardware.
That’s good info for low-spec laptops. Thanks for the software recommendation. I need to do some more research on the model you suggested. I think you confused me with the other guy, though. I’m currently working with a six-core Ryzen 2600 CPU and an RX 580 GPU. Edit: no worries, we’re good. It was still great info for the ThinkPad users!
Ollama on a ten-year-old laptop? Lol, maybe at 1 T/s for an 8B.
Thank you, that’s useful to know. In your opinion, what context size is the sweet spot for Llama 3.1 8B and similar models?
Honestly, as small as you can manage.
Again, you will get much better speeds out of “extreme” MoE models like deepseek chat lite: https://huggingface.co/YorkieOH10/DeepSeek-V2-Lite-Chat-Q4_K_M-GGUF/tree/main
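The reason MoE models generate faster on CPU is that only the active experts' weights are read for each token, while the whole model still has to fit in RAM. A rough comparison, assuming about 4.5 bits per weight for a Q4-style quant and treating the parameter counts as approximate (DeepSeek-V2-Lite is listed at roughly 15.7B total with about 2.4B active per token):

```python
def gb_per_billion_params(bits_per_weight: float = 4.5) -> float:
    """Approximate GB occupied by one billion quantized weights."""
    return bits_per_weight / 8


def gb_read_per_token(active_params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate GB of weights streamed from RAM to generate one token."""
    return active_params_billion * gb_per_billion_params(bits_per_weight)


dense_8b = gb_read_per_token(8.0)    # dense model: all weights are active
moe_active = gb_read_per_token(2.4)  # assumed active parameters per token
moe_total = gb_read_per_token(15.7)  # assumed total parameters (RAM footprint)

print(f"dense 8B reads ~{dense_8b:.1f} GB per token")
print(f"MoE reads ~{moe_active:.1f} GB per token but needs ~{moe_total:.1f} GB of RAM")
```

So the trade is more RAM for less memory traffic per token, which suits a box with plenty of cheap RAM and no GPU.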
Can you afford an Arc A770 or an old RTX 3060?
Used P100s are another good option. Even an RTX 2060 would help a ton.
27B is just really chunky on CPU, unfortunately. There’s no way around it. But you may have better luck with MoE models like deepseek chat or Mixtral.
TinyLlama would probably run reasonably fast. Dumb as a rock, sure, but it’s something!