For the last few weeks, I have been working on FileChat — a read-only AI coding agent. As a proponent of privacy and independence from third-party services, I wanted FileChat to be fully local. But making FileChat’s AI components run locally turned out to be more difficult than I initially thought.
FileChat relies on AI models in two places: creating embeddings for quickly finding relevant files, and chat. Let’s start with the first one.
Local Embeddings
The default approach to quickly spinning up a local embedding model is SentenceTransformers. This is where my journey started, and the key question was choosing the right embedding model. It had to be small enough to run well on average consumer hardware, its context window had to be large enough to cover most source code files, and the embedding quality had to be at least satisfactory. After experimenting with several candidates, I settled on nomic-embed-text-v1.5.
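For context, the PoC setup looked roughly like this. This is a minimal sketch rather than FileChat’s actual code; the task prefixes follow the convention described on the model card.

```python
from sentence_transformers import SentenceTransformer

# nomic-embed-text-v1.5 requires trust_remote_code to load its custom architecture.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

# The Nomic models expect a task prefix: "search_document:" for indexed content,
# "search_query:" for the user's question.
file_embeddings = model.encode(["search_document: def main(): ..."])
query_embedding = model.encode(["search_query: where is the entry point?"])
```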
This setup worked well for a PoC. But when I wanted to release FileChat publicly, I ran into issues. My goal was to make installation as simple as running pip install filechat. At the same time, I wanted to support different hardware accelerators: for starters, x86-64 CPUs, NVIDIA, and Intel Arc. However, SentenceTransformers relies on PyTorch, and installing the correct PyTorch build for your hardware accelerator can be challenging. At the very least, it involves using one of PyTorch’s custom package indices.
I wanted to save my users from this pain. The first solution I implemented was a combination of optional dependencies and exact dependency URLs. This idea worked until I realized that you cannot use direct URLs in your dependencies if you want to publish your package on PyPI.
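To make the problem concrete, this is the kind of pyproject.toml fragment I have in mind; the extra names, version, and wheel URL are placeholders, not FileChat’s actual configuration. PyPI refuses to accept packages whose dependencies contain direct references like the second entry.

```toml
[project.optional-dependencies]
cpu = ["torch==2.5.1"]
# A "name @ URL" direct reference; this is what PyPI rejects at upload time:
cuda = ["torch @ https://download.pytorch.org/whl/cu124/torch-<version>-<platform>.whl"]
```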
After some research, I decided to switch to ONNX Runtime. It supports a wide range of hardware backends, most of which are available as a package on PyPI.
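The embedding path now looks roughly like the sketch below. It assumes the exported model takes the usual tokenizer inputs and returns token-level embeddings as its first output; the file paths and provider list are placeholders, and the real code differs in detail.

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer  # tokenizer only; no PyTorch required

# The provider list decides which hardware backend runs inference;
# ONNX Runtime falls back to the next entry if one is unavailable.
session = ort.InferenceSession(
    "nomic-embed-text-v1.5.onnx",  # placeholder path to the exported model
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v1.5")

def embed(texts: list[str]) -> np.ndarray:
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="np")
    # Only pass the inputs the exported graph actually declares.
    expected = {i.name for i in session.get_inputs()}
    token_embeddings = session.run(None, {k: v for k, v in enc.items() if k in expected})[0]
    # Mean-pool token embeddings into one vector per text, ignoring padding.
    mask = enc["attention_mask"][..., None]
    return (token_embeddings * mask).sum(axis=1) / mask.sum(axis=1)
```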
For now, I am sticking with this solution. There is still room for improvement, though. Since I want the embedding model to be reasonably fast and memory-efficient, I am using a quantized version. I noticed that different types of quantization offer different performance on different hardware. I am currently using a single type of quantization, offering a reasonable compromise across supported accelerators. In the future, FileChat might instead choose the correct type depending on what hardware you want to use to run the embedding model.
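A hypothetical sketch of that future selection logic might look like this; the file names and the provider-to-quantization mapping are assumptions for illustration, not benchmark results.

```python
# Map each execution provider to the quantization variant that tends to suit it.
QUANT_FOR_PROVIDER = {
    "CPUExecutionProvider": "model_int8.onnx",
    "CUDAExecutionProvider": "model_fp16.onnx",
    "OpenVINOExecutionProvider": "model_int8.onnx",
}

def model_file_for(provider: str) -> str:
    # Fall back to the single compromise quantization used today.
    return QUANT_FOR_PROVIDER.get(provider, "model_quantized.onnx")
```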
Local Chat
Making chat fully local was a much greater challenge. I wanted to save this problem for later, so the first few versions of FileChat relied on Mistral AI’s API. When I finally decided to experiment with local LLMs, the embeddings component was already running on top of the ONNX Runtime, so this library was a natural starting point.
But I soon ran into difficulties. The first is that ONNX Runtime’s GenAI toolkit, which is supposed to make running generative LLMs easier, still seems to have many rough edges. The second is that relatively few models are released in the ONNX format. ONNX Runtime offers a tool for converting other formats such as GGUF into ONNX, but I couldn’t get it to work.
The second idea I tried was letting FileChat download the appropriate binary for llama.cpp, start its OpenAI-compatible server, and just connect to it via the OpenAI SDK. This approach seemed promising — I was able to download the binary, fetch a model from HuggingFace, start the server, and connect it to FileChat’s chat interface.
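Wiring this together is fairly simple. Here is a simplified sketch of the experiment; the binary path, port, model file, and prompt are placeholders, and a real implementation would poll the server rather than sleep.

```python
import subprocess
import time

from openai import OpenAI

# Start the downloaded llama.cpp server with a locally cached GGUF model.
server = subprocess.Popen(["./llama-server", "-m", "model.gguf", "--port", "8080"])
time.sleep(5)  # in practice, poll the server's /health endpoint instead

# Any OpenAI-compatible client works; the API key is not checked by default.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
response = client.chat.completions.create(
    model="local",  # llama-server serves whatever model it loaded
    messages=[{"role": "user", "content": "Which file defines the CLI entry point?"}],
)
print(response.choices[0].message.content)

server.terminate()
```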
But the performance on my laptop, featuring a fairly standard Intel Core Ultra 7 155H CPU, was abysmal, even when I ran the model on the integrated Arc GPU. The long wait before the first generated token appeared made FileChat practically unusable. Even relatively small 4B models were slow. And you can’t really expect much in terms of quality from models that size either.
So I gave up. Well, sort of. I concluded that an integrated local LLM isn’t yet feasible if I want FileChat to be usable on average consumer hardware. I had to find a compromise.
The most recent version of FileChat lets you choose from three LLM providers: Mistral AI, OpenAI, and a self-hosted OpenAI-compatible server. In other words, I leave the local LLM setup up to the user. Advanced users with access to sufficient compute can make their setup fully local. Less experienced users, those limited by their hardware, or people who simply don’t care can keep their lives simple by pasting an API key.
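Since all three options speak, or can speak, the OpenAI chat-completions protocol, a single client can cover them. The sketch below shows one way to do that; the self-hosted URL is a placeholder, and this is an illustration of the approach rather than FileChat’s internals.

```python
from openai import OpenAI

# Base URLs for the three provider options; the self-hosted one is a placeholder.
PROVIDERS = {
    "openai": "https://api.openai.com/v1",
    "mistral": "https://api.mistral.ai/v1",
    "self-hosted": "http://localhost:8080/v1",
}

def make_client(provider: str, api_key: str = "unused") -> OpenAI:
    # Only the base URL and API key differ between providers.
    return OpenAI(base_url=PROVIDERS[provider], api_key=api_key)
```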
This article should serve as a cautionary tale. There is a reason why most of us still aren’t running LLMs locally and instead default to models hosted by big AI companies or to servers with expensive GPUs. Yes, it’s technically possible to run a local model on a consumer laptop, but the results, in terms of both quality and speed, are not yet there.
Have you also tried implementing a fully local LLM solution? Did you encounter similar challenges? Were you able to overcome them? I am curious to hear your story.