The problem is simple: consumer motherboards don’t have that many PCIe slots, and consumer CPUs don’t have enough lanes to run 3+ GPUs at full PCIe gen 3 or gen 4 speeds.
My idea was to buy 3-4 cheap computers, slot a GPU into each of them, and use them in tandem. I imagine this will require some sort of agent running on each node, with the nodes connected through a 10GbE network. I can get a 10GbE network running for this project.
Does Ollama or any other local AI project support this? Getting a server motherboard with CPU is going to get expensive very quickly, but this would be a great alternative.
Thanks
This is false: Mistral Small 24B at q4_K_M quantization is 15GB; q8 is 26GB. A single 24GB card (3090/4090), a 32GB 5090, or two 16GB cards (I recommend the 4060 Ti 16GB) will run the q4_K_M model fine in a single computer. Like others have said, 10GbE will be a huge bottleneck, and it’s simply not necessary to distribute a 24B model across multiple machines.
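For anyone who wants to sanity-check those sizes, here is a rough back-of-the-envelope sketch. The bits-per-weight figures and the 2GB headroom for KV cache and buffers are my own approximations, not exact numbers from any tool, and treating two cards as one pooled amount of VRAM is a simplification.

```python
# Rough estimate of quantized model size and whether it fits in VRAM.
# ASSUMPTIONS (approximate, for illustration only):
#   - average bits per weight for each quantization type
#   - a flat 2 GB of headroom for KV cache and runtime buffers
#   - multiple cards are treated as one pooled amount of VRAM
BITS_PER_WEIGHT = {"q4_K_M": 4.85, "q8_0": 8.5}
HEADROOM_GB = 2.0


def model_size_gb(params_billion, quant):
    """Approximate size of the weights in GB (decimal)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def fits(params_billion, quant, total_vram_gb):
    """True if the weights plus headroom fit in the given total VRAM."""
    return model_size_gb(params_billion, quant) + HEADROOM_GB <= total_vram_gb


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        size = model_size_gb(24, quant)
        for label, vram in (("1 x 24GB", 24), ("2 x 16GB", 32)):
            verdict = "fits" if fits(24, quant, vram) else "does not fit"
            print(f"24B {quant}: ~{size:.1f} GB, {label}: {verdict}")
```

With these assumptions the q4_K_M weights come out a little under 15GB and the q8 weights a little under 26GB, which lines up with the figures above.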
Thank you, but which consumer motherboard + CPU combo is giving me 32 lanes of PCIe Gen 4 neatly divided into 2 x16 slots for me to put 2 GPUs in? I only asked this question because I was going to buy used computers and stuff a GPU in each.
Your point about networking is valid, and I’m hesitant to invest in 25GbE right now.
You don’t need the cards to have full bandwidth; the only time it will matter is when you’re loading the models onto the card. You need a motherboard with physical x16 slots, but even if they are only wired as x4, that would be good enough. Running the model doesn’t need a lot of bandwidth. Remember, you only load the model once and then reuse it.
An x4 PCIe gen 4 slot has a theoretical transfer rate of ~7.9 GB/s (after encoding overhead), and an x16 slot has ~31.5 GB/s, so disk I/O is likely your limit even for an x4 slot.
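If you want to check the arithmetic yourself, a minimal sketch follows. It uses the standard per-lane rates (8 GT/s for gen 3, 16 GT/s for gen 4) and 128b/130b encoding. The 15GB model size comes from earlier in the thread, and the 3.5 GB/s NVMe read speed is just an assumed figure for comparison.

```python
# Theoretical PCIe throughput and the time to push a model across the link.
GT_PER_S = {3: 8.0, 4: 16.0}  # raw per-lane signalling rate by generation
ENCODING = 128 / 130          # 128b/130b line encoding (gen 3 and gen 4)


def pcie_gb_per_s(gen, lanes):
    """One-direction throughput in GB/s (decimal), ignoring protocol overhead."""
    return GT_PER_S[gen] * ENCODING / 8 * lanes


def load_seconds(model_gb, link_gb_per_s):
    """Seconds needed to move model_gb over a link of the given speed."""
    return model_gb / link_gb_per_s


if __name__ == "__main__":
    MODEL_GB = 15.0        # q4_K_M size quoted earlier in the thread
    DISK_GB_PER_S = 3.5    # ASSUMPTION: sequential read speed of an NVMe SSD

    for lanes in (4, 16):
        bw = pcie_gb_per_s(4, lanes)
        print(f"gen 4 x{lanes}: {bw:.2f} GB/s, "
              f"{load_seconds(MODEL_GB, bw):.1f} s to load {MODEL_GB:.0f} GB")
    print(f"disk at {DISK_GB_PER_S} GB/s: "
          f"{load_seconds(MODEL_GB, DISK_GB_PER_S):.1f} s to read {MODEL_GB:.0f} GB")
```

Under these assumptions the disk read takes longer than the transfer over an x4 link, which is the point being made above: the slot width only affects a one-time load, and the disk is the slower part anyway.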
I see. That solves a lot of the headaches I imagined I would have. Thank you so much for clearing that up