LocalLLM Advisor — Privacy-First AI Recommendations

A few months ago I bought a PC with an RTX 5060 Ti. I wanted to run a coding-oriented LLM locally, keep my data off third-party servers, and stop paying per-token for things I could handle on my own machine. Picking the hardware took me an afternoon. Figuring out which model to actually download took me the rest of the week.

The information was out there, but it lived in six or seven different places that don't talk to each other. I'd check a model card on HuggingFace for the parameter count, then cross-reference the GGUF quantizations available from bartowski or unsloth, then look up my GPU's memory bandwidth on TechPowerUp, then search r/LocalLLaMA for tok/s reports from someone with similar hardware, then check if the VRAM footprint left enough headroom for a reasonable context window. Halfway through, a new quantization variant would show up and I'd start over.

I kept a spreadsheet for a while. It got unwieldy after about twenty models.

Why this keeps getting harder

Running models locally has gone from a niche hobby to something a significant chunk of developers and researchers do routinely. The reasons are well-documented: privacy, latency, cost control over time, and not being dependent on an API that can change pricing or terms of service at any time. The tooling on the inference side has gotten remarkably good: llama.cpp, Ollama, vLLM, and others have made the actual “run the model” part mostly painless.

But the decision that comes before inference (which model to use, at which quantization level, on which hardware, and what performance to expect) has not been solved in a centralized way. New models appear on HuggingFace weekly. GPU product lines keep branching. Quantization methods evolve (GGUF alone has gone through multiple format revisions). The matrix of possible combinations grows faster than any single source can keep up with, and the existing resources each cover only one slice of it.

What I ended up building

LocalLLM Advisor is a web tool that answers two questions:

  • “Given my hardware, what is the best model I can run?”
  • “Given a model I want to run, what hardware do I need?”

It currently covers 1.4k+ models across dense and MoE architectures, 206 GPUs (NVIDIA, AMD, Intel Arc, Apple Silicon), and 78 CPUs.

The Model Finder takes your GPU (auto-detected via WebGPU, or selected manually) and a use case - chat, coding, reasoning, vision, roleplay, embedding - and returns a ranked list of models that fit. Each result shows the quantization level, estimated VRAM usage, estimated tok/s, and a ready-to-paste Ollama command. The ranking weighs model quality (from the Open LLM Leaderboard: MMLU-PRO, HumanEval, MATH, IFEval, and others), predicted speed, and quantization quality, with different weights depending on the use case. Coding leans harder on HumanEval and BigCodeBench scores; roleplay prioritizes instruction following and generation speed.
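
To make the idea of use-case-dependent weights concrete, here is a minimal sketch of a weighted ranking. It is an illustration, not the tool's actual scoring code: the weight values, the three normalized fields (`quality`, `speed`, `quant`), and the function names are all hypothetical, and the sketch collapses the benchmark mix into a single `quality` number rather than weighting individual benchmarks.

```python
# Hypothetical weights per use case; the real values are not given in this article.
WEIGHTS = {
    "coding": {"quality": 0.60, "speed": 0.25, "quant": 0.15},
    "roleplay": {"quality": 0.40, "speed": 0.45, "quant": 0.15},
}


def score(candidate: dict, use_case: str) -> float:
    """Weighted sum of scores that are assumed to be normalized to [0, 1]."""
    weights = WEIGHTS[use_case]
    return sum(weights[key] * candidate[key] for key in weights)


def rank(candidates: list[dict], use_case: str) -> list[dict]:
    """Return candidates sorted from best to worst for the given use case."""
    return sorted(candidates, key=lambda c: score(c, use_case), reverse=True)


# Two made-up candidates: one stronger on benchmarks, one faster.
candidates = [
    {"name": "strong-but-slow", "quality": 0.9, "speed": 0.3, "quant": 0.8},
    {"name": "fast-but-weaker", "quality": 0.6, "speed": 0.9, "quant": 0.8},
]

for use_case in WEIGHTS:
    print(use_case, [c["name"] for c in rank(candidates, use_case)])
```

With these placeholder weights, the benchmark-heavy candidate wins for coding and the faster one wins for roleplay, which is the kind of reordering the use-case selector is meant to produce.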

The Hardware Finder works the other direction. Pick a model, set a speed preference (usable, fast, or blazing) and a budget, and it shows which GPUs can handle it as single-card, multi-card with NVLink, or multi-card over PCIe, with current street prices pulled from Amazon and eBay. This is the mode I use most when a new model drops and I want to know if my current card can handle it or if I'm looking at an upgrade.

There's also a community benchmarks section where people submit real tok/s numbers from their own hardware, and a GPU price tracker with 30-day price trends and alerts. The benchmarks section is still in its early stages, and I'd like a lot more data points, especially on mid-range cards.

How the estimation actually works

The core insight behind the performance estimates is that LLM token generation (the decode phase) is memory-bandwidth bound, not compute bound. Each decode step reads the active model weights from VRAM. How many tokens per second you get is, to a first approximation, just how fast your GPU can read those weights:

tok/s ≈ memory_bandwidth_GBps / model_size_GB

where model_size_GB = parameters (in billions) × bits_per_weight / 8.

[Figure: VRAM feeding GPU compute at 320 GB/s, with a full weight read per token and a memory-bound stall on the compute side. Utilization bars: bandwidth 92%, compute 12%.]

Decode is bandwidth-bound: loading weights from VRAM dominates each token step. CUDA cores sit idle waiting for data, so raising TFLOPS does not improve throughput.

This is the same relationship the llama.cpp community uses as a rule of thumb, and it holds up well in practice. It's why the RTX 4090 and RTX 3090 end up with similar LLM performance despite the 4090 having far more compute; their memory bandwidth is in the same ballpark (1,008 GB/s vs. 936 GB/s).
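
As a quick worked example of the rule of thumb, the sketch below plugs the two bandwidth figures above into the formula for a 7B model at 4 bits per weight. The function names are mine, and the result is a first-approximation figure that ignores KV cache reads and runtime overhead.

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight size in GB: parameters (in billions) x bits per weight / 8."""
    return params_billion * bits_per_weight / 8


def decode_tok_per_s(bandwidth_gbps: float, size_gb: float) -> float:
    """First-approximation decode speed: memory bandwidth / bytes read per token."""
    return bandwidth_gbps / size_gb


size = model_size_gb(7, 4)  # 7B parameters at 4 bits per weight

for gpu, bandwidth in {"RTX 4090": 1008, "RTX 3090": 936}.items():
    print(f"{gpu}: {decode_tok_per_s(bandwidth, size):.1f} tok/s for {size} GB of weights")
```

The two cards land at roughly 288 and 267 tok/s, a gap of about 8% that matches the gap in memory bandwidth rather than the much larger gap in compute.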

[Figure: a dense 7B model (all weights read per token, VRAM 4.5 GB, 32 tok/s) next to an MoE 67B model with 7B active (subset of experts read per token, VRAM 40 GB, 28 tok/s). Legend: active weights; resident, inactive weights.]

Things get more interesting with MoE models. DeepSeek R1, for instance, has 671 billion total parameters but only activates about 37 billion per token. You need VRAM for all 671B (every expert must be loaded because any could be activated), but the per-token read is only 37B worth of weights. So the VRAM requirement is massive, while the per-token speed is comparable to a 37B dense model. Getting this distinction wrong, and a lot of resources do, gives you either wildly pessimistic speed estimates or wildly optimistic VRAM estimates.
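
The distinction can be sketched in a few lines: VRAM is sized from total parameters, while speed is estimated from active parameters. This is a simplified illustration that ignores KV cache and runtime overhead. The 4 bits per weight and the 1,000 GB/s bandwidth figure are placeholder assumptions chosen for round numbers, not measurements of any specific setup.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight size in GB: parameters (in billions) x bits per weight / 8."""
    return params_billion * bits_per_weight / 8


def moe_estimate(total_b: float, active_b: float, bits: float, bandwidth_gbps: float) -> dict:
    """Size VRAM from total parameters, but estimate speed from active parameters."""
    read_gb = weights_gb(active_b, bits)
    return {
        "vram_gb": weights_gb(total_b, bits),  # every expert must be resident
        "read_gb_per_token": read_gb,          # only the active experts are read
        "tok_per_s": bandwidth_gbps / read_gb,
    }


# DeepSeek R1: 671B total, about 37B active per token.
# bits=4 and bandwidth_gbps=1000 are placeholder assumptions.
estimate = moe_estimate(total_b=671, active_b=37, bits=4, bandwidth_gbps=1000)
print(estimate)
```

Swapping `active_b` for `total_b` in the speed term gives the overly pessimistic speed estimate described above; swapping `total_b` for `active_b` in the VRAM term gives the overly optimistic memory estimate.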

KV cache adds another layer. At short context lengths (4K tokens or less), the cache is small relative to the model weights and barely affects speed. But at 32K+ contexts, it can add tens of gigabytes. A 70B model that fits comfortably in VRAM at 4K context might not fit at 128K, because the KV cache alone approaches the size of the model weights. The tool models this with a power-law approximation calibrated against measured KV cache sizes across 7B to 70B models:

extra_kv_cache_mb = 128 × (params_B / 7)^0.4 × (context - 4096) / 1024

The 0.4 exponent reflects that larger models tend to widen their hidden dimensions rather than just stacking more layers, while Grouped-Query Attention keeps the KV head count fixed (typically 8) regardless of model size. This fits measured values to within about 5% across the common model sizes.
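
Here is the approximation written out as a function, mainly to make the scaling easy to play with. The formula is the one above; clamping to zero at or below 4,096 tokens is my own assumption for the sketch, since the formula describes the extra cache beyond that baseline.

```python
def extra_kv_cache_mb(params_billion: float, context_tokens: int) -> float:
    """Extra KV cache in MB beyond a 4,096-token baseline (power-law approximation)."""
    # Assumption: no extra cache is counted at or below the 4K baseline.
    extra_tokens = max(context_tokens - 4096, 0)
    return 128 * (params_billion / 7) ** 0.4 * extra_tokens / 1024


for params_b, context in [(7, 8192), (7, 32768), (70, 32768), (70, 131072)]:
    mb = extra_kv_cache_mb(params_b, context)
    print(f"{params_b}B at {context} tokens: about {mb / 1024:.1f} GB of extra KV cache")
```

For a 70B model at 128K context the function returns a little under 40,000 MB, which is the "approaches the size of the model weights" case described above for a 4-bit quantization.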

For GPU+RAM offload scenarios, the model is split across GPU and CPU layers that process sequentially. The total time per token is the sum of both, not an average. CPU layers read directly from system RAM (the bottleneck is RAM bandwidth, not PCIe), while PCIe only carries the small activation vectors between layer groups, adding roughly 0.1–0.2 ms of overhead per token. This sequential nature is why even offloading 20% of layers to RAM tanks your throughput: system RAM offers 40–80 GB/s of effective bandwidth versus hundreds of GB/s on the GPU side.
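
A rough sketch of the sequential model shows why a small offload fraction hurts so much. It assumes all layers are the same size (so the fraction of layers offloaded equals the fraction of bytes), uses 0.15 ms as the midpoint of the PCIe overhead range above, and takes 40 GB of weights, 1,008 GB/s of GPU bandwidth, and 60 GB/s of RAM bandwidth as illustrative inputs. The function name and parameters are hypothetical.

```python
def offload_tok_per_s(
    size_gb: float,
    cpu_fraction: float,
    gpu_bandwidth_gbps: float,
    ram_bandwidth_gbps: float,
    pcie_overhead_ms: float = 0.15,
) -> float:
    """Tokens per second when GPU and CPU layers run one after the other."""
    gpu_time = size_gb * (1 - cpu_fraction) / gpu_bandwidth_gbps
    cpu_time = size_gb * cpu_fraction / ram_bandwidth_gbps
    # Activations cross PCIe only when the model is actually split.
    overhead = pcie_overhead_ms / 1000 if 0 < cpu_fraction < 1 else 0.0
    # Times add up because the layer groups are processed sequentially.
    return 1 / (gpu_time + cpu_time + overhead)


for fraction in (0.0, 0.2, 0.5):
    speed = offload_tok_per_s(40, fraction, 1008, 60)
    print(f"{fraction:.0%} of layers in RAM: {speed:.1f} tok/s")
```

With these inputs, moving 20% of the layers to RAM drops the estimate from about 25 tok/s to about 6 tok/s, because the 8 GB read from system RAM takes roughly four times longer than the 32 GB read from VRAM.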

Every estimate in the tool has been validated against community benchmarks and carries an uncertainty band: ±15% for full GPU inference, ±25% for offload, and ±30% for CPU-only. Real-world variation from drivers, thermal throttling, inference runtime configuration, and background system load means any single-point estimate is going to be wrong for some percentage of users. I'd rather show the range.
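
Turning a point estimate into a range is simple enough to show directly. This sketch assumes the band is symmetric and applied as a percentage of the point estimate; the mode keys and function name are illustrative.

```python
# Uncertainty per inference mode, taken from the figures above.
UNCERTAINTY = {"gpu": 0.15, "offload": 0.25, "cpu": 0.30}


def tok_per_s_range(estimate: float, mode: str) -> tuple[float, float]:
    """Return (low, high) tok/s around a point estimate for the given mode."""
    band = UNCERTAINTY[mode]
    return estimate * (1 - band), estimate * (1 + band)


for mode in UNCERTAINTY:
    low, high = tok_per_s_range(30.0, mode)
    print(f"{mode}: {low:.1f} to {high:.1f} tok/s")
```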

What runs client-side and why

The entire recommendation engine runs in your browser. No server calls for the main flow, no account required, no telemetry. The model and GPU databases are bundled as static JSON files in the build (about 1.7 MB total). Supabase powers only the community features (benchmarks, reviews, price alerts) where a database is actually necessary.

I made this choice partly out of principle (a tool about running AI locally shouldn't phone home to do its job) and partly for practical reasons: it's simpler to deploy, cheaper to host, and faster for the user. The whole thing is a static Next.js export on GitHub Pages.

Where the estimates are weakest

I want to be upfront about limitations. The bandwidth formula works well for straightforward single-GPU, short-context inference, which is the most common case. But there are scenarios where I'm less confident:

Mid-range GPUs (8–12 GB VRAM cards like the RTX 4060 or RX 7600) running heavily quantized models (Q3, Q4) are where the gap between predicted and measured tok/s tends to be largest. At aggressive quantization levels, dequantization overhead and cache behavior start to matter more, and these are hard to model from spec sheets alone.

Multi-GPU setups over PCIe (without NVLink) have real overhead that depends on the specific model architecture and how the runtime partitions layers. The tool uses empirical scaling factors (0.85× for PCIe, 0.95× for NVLink) but these are averages, not guarantees.

CPU-only inference varies a lot depending on ISA extensions (AVX2 vs. AVX-512 vs. AMX), memory channel configuration, and NUMA topology. Two machines with the same nominal specs can show 2× differences in tok/s if one has its memory channels populated differently.

The community benchmarks feature exists specifically to close these gaps. Real-world data from real hardware configurations is more valuable than any formula, and I'm actively using submissions to validate and recalibrate the heuristics. If you have numbers from your own setup, I'd genuinely like to see them.

The methodology page

Every formula, assumption, and data source used in the tool is documented on the methodology page. It covers model size estimation, KV cache calculation, the bandwidth heuristic, inference modes, MoE handling, offload modeling, multi-GPU scaling, prefill speed, time-to-first-token, and confidence levels. I wrote it so that anyone can check the math, point out where it breaks, and help improve it.

That's essentially it. I built this because I needed it, I kept working on it because the problem turned out to be more interesting than I expected, and I'm sharing it because the people who'd benefit most are the same people who could help make the estimates better.

Try LocalLLM Advisor

Find the best model for your GPU, or the best GPU for your model.

localllm-advisor.com