Show HN: L88 – A Local RAG System on 8GB VRAM (Need Architecture Feedback)
Hey everyone,
I’ve been working on a project called L88, a local RAG system where I initially focused on UI/UX, so the retrieval and model architecture still need proper refinement.
Repo: https://github.com/Hundred-Trillion/L88-Full
I’m running this on 8GB VRAM and a strong CPU (128GB RAM). Embeddings and preprocessing run on CPU, and the main model runs on GPU. One limitation I ran into is that my evaluator and generator LLM ended up being the same model due to compute constraints, which defeats the purpose of evaluation.
I’d really appreciate feedback on:
Better architecture ideas for small-VRAM RAG
Splitting evaluator/generator roles effectively
Improving the LangGraph pipeline
Any bugs or design smells you notice
Ways to optimize the system for local hardware
I’m 18 and still learning a lot about proper LLM architecture, so any technical critique or suggestions would help me grow as a developer. If you check out the repo or leave feedback, it would mean a lot — I’m trying to build a solid foundation and reputation through real projects.
Thanks!

---

Nice project, especially given the VRAM constraints. A few things I've learned building production RAG that might help:

1. Separate your query analysis from retrieval. A single LLM call can classify the query type, decide whether to use hybrid search, and pick search parameters all at once. This saves a round trip versus doing them sequentially.

2. If you add BM25 alongside vector search, the blend ratio matters a lot by
query type. Exact-match queries need heavy keyword weighting, while conceptual
questions need more embedding weight. A static 50/50 split leaves performance
on the table.

3. For your evaluator and generator being the same model: one practical workaround
is to skip LLM-as-judge evaluation entirely and use a small cross-encoder
reranker between retrieval and generation instead. It catches the cases where
vector similarity returns semantically related but not actually useful chunks,
and it gives you a relevance score you can threshold on without needing a
separate evaluation model.

4. Consider a two-level cache: exact match (hash the query, short TTL) plus a
semantic cache (cosine similarity threshold on the query embedding, longer TTL).
The semantic layer catches "how do I X" vs "what's the way to X" without hitting
the retriever again.

What model are you using for generation on the 8GB card? That constraint probably shapes a lot of the architecture choices downstream.
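To make point 1 concrete, here is a minimal sketch of a single query-analysis call. `call_llm` stands in for whatever client L88 actually uses, and the JSON schema (`query_type`, `use_hybrid`, `top_k`) is an illustrative choice, not the repo's actual format:

```python
import json

# One prompt that classifies the query AND picks retrieval parameters,
# so the planner costs a single round trip instead of three.
ANALYSIS_PROMPT = """Analyze the user query and reply with JSON only:
{{"query_type": "exact" or "conceptual", "use_hybrid": true or false, "top_k": <int>}}
Query: {query}"""

def analyze_query(query: str, call_llm) -> dict:
    """Return a retrieval plan from a single LLM call."""
    raw = call_llm(ANALYSIS_PROMPT.format(query=query))
    plan = json.loads(raw)
    # Fall back to safe defaults if the model omits a field.
    return {
        "query_type": plan.get("query_type", "conceptual"),
        "use_hybrid": bool(plan.get("use_hybrid", True)),
        "top_k": int(plan.get("top_k", 5)),
    }

# Stubbed LLM response, just to show the shape of the result:
fake_llm = lambda prompt: '{"query_type": "exact", "use_hybrid": true, "top_k": 3}'
plan = analyze_query("error code 0x80070057 meaning", fake_llm)
```

In a LangGraph pipeline this would be the first node, with the plan dict stored in graph state and read by the retrieval node.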
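Point 2's query-type-dependent blend could be sketched as a simple weighted score fusion. The 0.8/0.3 weights below are illustrative, not tuned values, and the function assumes both score dicts were min-max normalized upstream:

```python
def fuse_scores(bm25: dict, vector: dict, query_type: str) -> list:
    """Blend BM25 and vector scores ({doc_id: score} dicts, normalized to [0, 1])."""
    # Exact-match queries lean on keywords; conceptual ones on embeddings.
    alpha = 0.8 if query_type == "exact" else 0.3  # BM25 weight
    docs = set(bm25) | set(vector)
    fused = {d: alpha * bm25.get(d, 0.0) + (1 - alpha) * vector.get(d, 0.0)
             for d in docs}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# An exact-match query: the keyword-heavy doc "a" should win.
ranked = fuse_scores({"a": 1.0, "b": 0.2}, {"b": 0.9, "c": 0.7}, "exact")
```

Reciprocal rank fusion is a common alternative that avoids the normalization requirement, at the cost of discarding score magnitudes.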
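Point 3's reranking step might look like the sketch below. In practice `score_pairs` would be something like `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict` from sentence-transformers (a small model that fits easily beside an 8GB-VRAM generator, or runs on CPU); a stub scorer is used here so the sketch stands alone:

```python
def rerank(query: str, chunks: list, score_pairs, threshold: float = 0.0, keep: int = 3) -> list:
    """Score (query, chunk) pairs, drop chunks below threshold, keep top-k."""
    scores = score_pairs([(query, c) for c in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [c for c, s in ranked if s >= threshold][:keep]

# Stub scorer for illustration only: pretends longer chunks are more relevant.
stub = lambda pairs: [len(c) / 100.0 for _, c in pairs]
kept = rerank("how do embeddings work?",
              ["short", "a much longer chunk about embeddings"],
              stub, threshold=0.05)
```

The threshold is what replaces the LLM-as-judge step: chunks that embed near the query but score poorly against it pairwise simply never reach the generator.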
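And point 4's two-level cache could be sketched as below. `embed` is a placeholder for the CPU-side embedding model, and the TTLs and 0.92 similarity threshold are illustrative assumptions:

```python
import hashlib
import time

class TwoLevelCache:
    """Exact-match cache keyed on a query hash, backed by a semantic cache
    over query embeddings with a longer TTL."""

    def __init__(self, embed, sim_threshold=0.92, exact_ttl=300, semantic_ttl=3600):
        self.embed = embed
        self.sim_threshold = sim_threshold
        self.exact_ttl, self.semantic_ttl = exact_ttl, semantic_ttl
        self.exact = {}     # sha256(query) -> (answer, expiry)
        self.semantic = []  # (embedding, answer, expiry)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    def get(self, query):
        now = time.time()
        key = hashlib.sha256(query.encode()).hexdigest()
        hit = self.exact.get(key)
        if hit and hit[1] > now:              # level 1: exact match
            return hit[0]
        q = self.embed(query)                 # level 2: semantic match
        for emb, answer, expiry in self.semantic:
            if expiry > now and self._cosine(q, emb) >= self.sim_threshold:
                return answer
        return None

    def put(self, query, answer):
        now = time.time()
        key = hashlib.sha256(query.encode()).hexdigest()
        self.exact[key] = (answer, now + self.exact_ttl)
        self.semantic.append((self.embed(query), answer, now + self.semantic_ttl))

# Toy embedder so the sketch runs: maps paraphrases about "X" to one vector.
toy = lambda q: [1.0, 0.0] if "x" in q.lower() else [0.0, 1.0]
cache = TwoLevelCache(toy)
cache.put("how do I X", "answer-1")
```

With a real embedding model, `cache.get("what's the way to X")` would hit the semantic layer via the paraphrase's embedding, while genuinely unrelated queries fall through to the retriever.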