Something almost nobody, researchers included, saw coming was how quickly small LLMs would close the capability gap with large closed-source models like Claude and ChatGPT. It's genuinely incredible that Qwen3 Coder today offers more coding ability at lower latency than OpenAI's leading-edge models from roughly a year ago.
Not only can you run these models locally1, you can run them on used hardware - on GPUs nearly half a decade old. Some point to the advent of "agentic workflows," which incentivized smaller models so that more "agents" could fit into a single GPU's VRAM, in some cases optimizing for parallelism as well.
As a result, small "budget" local AI builds2 are more popular than ever. And this isn't just something we've noticed on /r/localllama - it shows up in the data on Vast.AI and in the comments we receive on llamabuilds.ai. Curiously, with recent improvements to tools like ExLlama and vLLM, running multiple NVIDIA RTX 3060 12GB GPUs or modded 2080 Tis is now in high demand.
Most of what I want to cover is which of these GPUs is the best "budget" option for local AI builds. Let's get started!
As the flagship of NVIDIA's 20-series graphics cards, the RTX 2080 Ti delivered impressive performance at its release date - in particular, 616 GB/s of memory bandwidth across a 352-bit memory bus. Things got interesting once board-repair and GPU modders discovered that 2GB variants of the same memory chips could be re-soldered onto the PCB, doubling the GPU's memory from 11GB to 22GB.
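For the curious, that headline bandwidth figure falls straight out of the bus width and the memory's data rate. Here's a quick back-of-the-envelope check (a sketch assuming the 2080 Ti's stock 14 Gbps GDDR6 spec):

```python
# Peak memory bandwidth = (bus width in bits / 8) * per-pin data rate
bus_width_bits = 352   # the 2080 Ti's 352-bit memory bus
data_rate_gbps = 14    # stock GDDR6 effective data rate, Gbps per pin
bandwidth_gb_s = bus_width_bits / 8 * data_rate_gbps
print(bandwidth_gb_s)  # 616.0 -> the advertised 616 GB/s
```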
Curiously, you can actually purchase these from a GPU repair shop in Palo Alto, CA - seemingly one of the only shops providing this modification right under NVIDIA's nose and legal purview.
The scaling isn't perfect: the number of memory controllers doesn't change, so the same 616 GB/s across the 352-bit memory bus applies whether the card carries 11GB or 22GB. Even so, this made the card very desirable for local AI in 2022-2023, since even at the elevated price of $499 it offered incredible capability for the money. However, the reliability and quality of these modified GPUs was always questionable (some show signs of chips having been de-soldered with blowtorches) and never reached the level we now see from the Chinese companies modding the RTX 4090D.
Ironically, NVIDIA likely conceived of the 3060 12GB as nothing more than a mid-range 1080p gaming GPU; the extra VRAM at the time just meant it could barely render modern titles at 1440p without huge power requirements. In time it also became a winner for NVIDIA as a great GPU for Ethereum crypto mining, which brings us to the first huge win for this card: massive supply. With the end of GPU mining and thousands of mining rigs being parted out, there's never been a better time to buy NVIDIA 3060s - though for local AI, please ONLY FOCUS ON THE 12GB VARIANT!
As of September 2025, it's quite common to find these on eBay for under $200.
Even though this GPU is technically slower and has about half the VRAM of the modded NVIDIA RTX 2080 Ti, I fully endorse it as the better option - not only on price, but on availability, reliability, and driver support. I expect these GPUs to only get more plentiful and cheaper as 2025 comes to a close. Also, with improvements to vLLM and the FlashAttention framework making it ever easier to split inference across multiple GPUs, the need to rely on a single GPU with tons of VRAM simply isn't there anymore.
Four NVIDIA 3060 12GB cards using vLLM for inference can run Llama 3.3 70B (IQ4_XS) at around 15-25 tokens/s - not bad considering the entire setup costs only around $1,000 at current eBay prices. A sketch of that setup follows below.
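To make that concrete, here's a minimal sketch of serving a 4-bit 70B quant across four cards using vLLM's tensor parallelism. The model repo name is a placeholder, and an AWQ quant is shown here since vLLM's GGUF/IQ support is more experimental:

```python
from vllm import LLM, SamplingParams

# Hypothetical 4-bit AWQ quant of Llama 3.3 70B; substitute a real repo.
llm = LLM(
    model="example-org/Llama-3.3-70B-Instruct-AWQ",
    tensor_parallel_size=4,       # shard the model across the 4x RTX 3060s
    gpu_memory_utilization=0.90,  # leave a little headroom on each 12GB card
)

outputs = llm.generate(
    ["Explain why memory bandwidth matters for LLM inference."],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```

The same thing works from the command line with `vllm serve <model> --tensor-parallel-size 4` if you'd rather expose an OpenAI-compatible endpoint.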
So, in conclusion: go out and build as many 4x RTX 3060 12GB machines as you can - though if you can afford them right now, I'd still recommend a pair of NVIDIA RTX 3090s. If your budget is scraping the bottom of the barrel, NVIDIA RTX 3060s are without question the best option on earth right now.
(old gaming PC turned local AI server)
See full build on llamabuilds.ai