In 2026, ChatGPT+ and rivals like Claude Pro or Google Gemini can cost roughly $240-$300 per year. Free tiers exist, but if you want the pro features, the $20/month subscription can feel like the cable bill of the 2020s: expensive, restrictive, and lacking in privacy.
For the price of two years of renting a chatbot, you could buy an RTX 4070, or even an RTX 3060 12GB, and own the hardware forever. It might feel like a large upfront investment, but it pays for itself over time. Moving to local AI isn't just a privacy flex; it can also deliver a better user experience: no rate limits, and 100% uptime even when your internet goes out.
7 things I wish I knew when I started self-hosting LLMs
I've been self-hosting LLMs for quite a while now, and these are the things I learned along the way that I wish I'd known at the start.
Why 12GB of VRAM?
While it's not essential, it's the sweet spot for sure
If you're looking to invest in a GPU primarily for AI, VRAM is the key specification to consider. CUDA cores and memory bandwidth largely determine how fast tokens are generated, but VRAM determines which models you can load at all and how much room they have to breathe. Picking up a GPU with 12GB of VRAM means you can self-host AI tools with ease: no more worrying about the cloud, and no more worrying about a consistent internet connection.
12GB is the current enthusiast baseline. It means you can run 8B models like the Llama-based xLAM-2 or Mistral 7B with context windows of 16k-32k tokens. At 4-bit quantization, the model weights take up only about 5GB, leaving roughly 7GB of VRAM for the KV cache (the AI's working memory). That lets you feed the model entire books or codebases of up to 32,000 tokens while keeping the whole session on the GPU for near-instant responses. Just make sure the model actually supports a context window that large; Llama 2 7B, for example, officially tops out at 4,096 tokens.
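To see why those numbers add up, here's a back-of-the-envelope sketch. The architecture constants assume a Llama-3-8B-style model (32 layers, grouped-query attention with 8 KV heads, a head dimension of 128) and an FP16 KV cache; treat the output as an estimate, not a measurement.

```python
# Rough VRAM budget for a 4-bit 8B model with a 32k context on a 12GB card.
# Architecture constants assume a Llama-3-8B-style model; adjust for yours.

def weight_vram_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """~4.5 bits/weight covers 4-bit quantization plus scales and overhead."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_vram_gb(context_tokens: int, layers: int = 32, kv_heads: int = 8,
                     head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """FP16 KV cache: a K and a V tensor per layer, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens / 1e9

weights = weight_vram_gb(8)        # ~4.5 GB of weights
kv = kv_cache_vram_gb(32_768)      # ~4.3 GB of KV cache at 32k tokens
print(f"Weights {weights:.1f} GB + KV cache {kv:.1f} GB "
      f"= {weights + kv:.1f} GB of a 12 GB budget")
```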
If you want to run 14B to 20B models, 12GB of VRAM also works, but you'll likely be limited to short contexts and one-shot prompting. Models like Mistral Nemo (12B), Qwen 3 (14B), and Phi 4 (14B) are designed for users who need reasoning for coding and logic but don't have a data center sitting in their closet. A 14B model at 4-bit quantization takes up roughly 9-10GB, so it still fits entirely in VRAM on a 12GB card, with just enough headroom left for a context window of around 4K tokens.
Because these models don't have to spill over into your much slower system RAM, you'll get speeds of 30-50 tokens per second on an RTX 4070. If you're running them on an 8GB card, these same models will have to be split between your VRAM and your system RAM, causing speeds to plummet to a painful 3-5 tokens per second.
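A quick way to see why offloading hurts so much: during generation, roughly all of the model's weights have to be read for every new token, so memory bandwidth sets a hard ceiling on tokens per second. The bandwidth figures below are approximate published specs, and the results are upper bounds rather than benchmarks.

```python
# Upper bound on generation speed: each new token reads ~all weights once,
# so tokens/s is capped at roughly bandwidth / weight_size.

def token_rate_ceiling(weight_gb: float, bandwidth_gb_s: float) -> float:
    return bandwidth_gb_s / weight_gb

weights = 9.0  # ~9 GB for a 4-bit 14B model, as above
for name, bw in [("RTX 4070 (~504 GB/s)", 504),
                 ("RTX 3060 12GB (~360 GB/s)", 360),
                 ("Dual-channel DDR5 (~80 GB/s)", 80)]:
    print(f"{name}: ~{token_rate_ceiling(weights, bw):.0f} tok/s ceiling")
```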
It isn't the end of the world; you still get all the benefits of not relying on subscriptions or the cloud. But if you want the best performance, a 12GB GPU is the way to go.
Software has come just as far as hardware
You don't need coding skills to take advantage of these tools anymore
The software has kept pace with the hardware, and there are plenty of open-source options. Many self-hosted AI tools now offer a one-click experience; you don't even need a terminal. LM Studio and Ollama provide that "downloading an app" experience: search for a model, hit download, and you're chatting away. For those who aren't tech-savvy, or just don't want the headache, it's no different from installing and running a web browser.
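If you ever do want to script against Ollama rather than chat in a UI, it also exposes a small HTTP API on localhost (port 11434 by default). A minimal sketch, assuming you've already pulled a model such as llama3.1:8b:

```python
import requests

# Ollama's local API defaults to port 11434; "llama3.1:8b" is just an example
# tag, so swap in whatever model you pulled with `ollama pull`.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",
        "prompt": "Explain the KV cache in two sentences.",
        "stream": False,   # return one JSON object instead of a token stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```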
If you don't want to learn an entirely new UI, products like OpenWebUI let you run a local interface that looks and feels just like ChatGPT, complete with document uploads and image generation.
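Part of why switching feels so familiar is that most of these front ends talk to the model over an OpenAI-compatible API. As a sketch, the official openai Python client can point at a local server instead of the cloud (Ollama serves this API on port 11434, LM Studio on 1234 by default; the model tag below is an example):

```python
from openai import OpenAI

# Point the standard OpenAI client at a local, OpenAI-compatible server.
# Local servers ignore the API key, but the client requires some value.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="local")

reply = client.chat.completions.create(
    model="llama3.1:8b",   # example model tag pulled via Ollama
    messages=[{"role": "user", "content": "Give me three uses for a local LLM."}],
)
print(reply.choices[0].message.content)
```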
You also get the benefit of data sovereignty. With local AI, you can feed in your tax returns, private medical data, or unreleased source code without wondering whether it's being used to train the next version of a competitor's model. Nor do you have to worry about your data sitting in the hands of large brands you might not necessarily trust. Everything stays on your own device unless you configure it otherwise.
Actually using these self-hosted tools on an RTX 4070, I found that a local 8B model generated text faster than I could read, consistently at 80 or more tokens per second. This was with AWQ 4-bit quantization on a vLLM backend; you may be able to squeeze out slightly higher numbers with a TensorRT-LLM backend, thanks to its hardware-specific compiler. Note that on an RTX 3060, you'd likely see slower generation speeds because of its significantly lower memory bandwidth.
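If you want to reproduce that kind of measurement yourself, here's a rough sketch against a vLLM server's OpenAI-compatible endpoint (port 8000 by default). The model tag is an example AWQ build, and timing a single request with a wall clock is crude, but it's enough for a ballpark tokens-per-second figure.

```python
import time
from openai import OpenAI

# vLLM's OpenAI-compatible server defaults to http://localhost:8000/v1;
# the model name must match whatever you passed to `vllm serve`.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",   # example AWQ model
    messages=[{"role": "user", "content": "Write about 300 words on GPUs."}],
    max_tokens=400,
)
elapsed = time.perf_counter() - start

tokens = resp.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.1f}s = {tokens / elapsed:.0f} tok/s")
```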
Those who use ChatGPT+ frequently will know that the model can lag during peak hours. Locally, that's no longer something I have to worry about.
I also benefited from RAG (Retrieval Augmented Generation). My local model could stay loaded and scan 50 local PDFs in seconds without hitting a file-size limit, unlike when I upload documents to the web. You can, of course, use RAG with online AI tools too, thanks to newer embedding models, but that comes with a big privacy trade-off: you're handing over unrestricted access to your files.
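For the curious, a bare-bones local RAG loop only needs an embedding model and a bit of vector math. The sketch below assumes Ollama is running with an embedding model (nomic-embed-text here) and a chat model pulled, and that you've already extracted text from your PDFs with a library like pypdf; it retrieves just the single best-matching chunk for brevity.

```python
import requests
import numpy as np

OLLAMA = "http://localhost:11434"   # default Ollama endpoint

def embed(text: str) -> np.ndarray:
    # nomic-embed-text is one embedding model available through Ollama.
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    r.raise_for_status()
    return np.array(r.json()["embedding"])

# Stand-ins for chunks extracted from your PDFs.
chunks = [
    "Invoice #412 was paid in full on 2024-03-02.",
    "The warranty covers the GPU for three years from purchase.",
    "Quarterly revenue grew 12% year over year.",
]
chunk_vecs = np.stack([embed(c) for c in chunks])

question = "How long is the GPU warranty?"
q = embed(question)
scores = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
context = chunks[int(scores.argmax())]   # top-1 retrieval

answer = requests.post(f"{OLLAMA}/api/generate", json={
    "model": "llama3.1:8b",              # example chat model
    "stream": False,
    "prompt": f"Answer using only this context:\n{context}\n\nQuestion: {question}",
}).json()["response"]
print(answer)
```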
Self-hosting is an option for all
12GB VRAM or not, you can self-host
Even if you don't have a 12GB GPU, you can still take advantage of self-hosting. These tools run slower when the model has to work out of system RAM, so there's a latency trade-off: responses will take longer than they would from a cloud provider. But you keep every other benefit of self-hosting, and you might find the privacy is worth the extra wait.
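If you're in that position, most runtimes let you split the model between GPU and CPU rather than going all-or-nothing. As a sketch, Ollama exposes a num_gpu option that caps how many layers are offloaded to the GPU, with the rest running from system RAM; the model tag and layer count below are examples to tune for your card.

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",          # example model tag
        "prompt": "Hello from a partially offloaded model.",
        "stream": False,
        "options": {"num_gpu": 20},      # layers sent to the GPU; the rest stay on CPU
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])
```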
Having 12GB of VRAM on your GPU is the current sweet spot, and it's the hardware that connects you to the next era of computing. My PC isn't just a gaming machine or a workstation anymore; it's a silent, private, and permanent intellectual partner, and the $20 I save every month is a welcome bonus.