Lessons We Learned Building a RAG Assistant Without a Separate Vector Database

How we used StarRocks, Gemini, and tool-based retrieval to power grounded Q&A in a developer community Slack.

Author: Billy Chang, Software Engineer at Phoenix AI

Press enter or click to view image in full size

StarRocks gives data teams a fast open-source analytical database with a unified execution engine, a flexible deployment model, and strong performance for real-world workloads. But as the StarRocks community grows, the support workload grows with it: maintainers repeatedly answer the same questions about docs, GitHub issues, release notes, and historical Slack conversations.

Rocky is the official Slack assistant we built to address that problem. Its job is simple: take repetitive Q&A work off community maintainers while keeping answers grounded in StarRocks documentation and related sources.

The architecture is the important part. Rocky itself runs on StarRocks: document chunks, keyword lookup, vector retrieval, and similarity scoring all live in a single OLAP table. The AI that answers questions about StarRocks also runs on StarRocks. The result is a compact AI application built from roughly 600 lines of Python, one StarRocks table, and a Gemini API key.

Press enter or click to view image in full size

How Rocky works in the Slack channel

The RAG Foundation in StarRocks

The conventional first step when building a RAG application is to introduce a purpose-built vector database. Each new component adds operational overhead: a separate deployment, backup strategy, and consistency model. For a lightweight community bot, that overhead is hard to justify.

Instead of adding a vector database, we kept everything in StarRocks. Document chunks — along with their 768-dimensional Gemini embeddings — live in a standard OLAP table using a PRIMARY KEY model. Retrieval is a SQL query. The architectural principle is simple: if your analytical database already supports array-type columns and cosine similarity functions, you do not need a second data system for vector search.

The table definition looks like any other StarRocks table:

CREATE TABLE docs (
    id     BIGINT NOT NULL,
    path   VARCHAR(512),
    `index` INT,
    `text`  STRING,
    vector ARRAY<FLOAT>   -- 768-dim Gemini embedding
) ENGINE = OLAP
PRIMARY KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1;

And retrieval is a single query using cosine_similarity:

SELECT path, `index`, `text`,
       approx_cosine_similarity([0.012, -0.034, ...], vector) AS similarity
FROM docs
ORDER BY similarity DESC
LIMIT 8;

No SDK, no client library, no second data system. From Rocky’s perspective, the “vector database” is just another StarRocks table it already knows how to query.

Architecture: From Slack Events to Grounded Answer

The end-to-end flow is deliberately minimal. A Slack @Rocky mention triggers the bot, which delegates reasoning to Gemini 3 Flash with native function calling. The model decides whether to search documentation or query Google, cycling through up to ten tool-call rounds per turn.

Press enter or click to view image in full size

The Rest of the Stack at a Glance

Press enter or click to view image in full size

Slack receives the mention: slack_bolt (Python) captures the event via Socket Mode and extracts the user query plus thread context.
Gemini reasons and calls tools: the LLM receives a system prompt with strict honesty rules and two available tools: search_starrocks_doc (client-side vector retrieval) and google_search (server-side grounding via Gemini’s built-in web search).
Vector retrieval executes in StarRocks: search_starrocks_doc embeds the query using gemini-embedding-001 with task_type=RETRIEVAL_QUERY, then runs the cosine similarity SQL above.
The model synthesizes an answer: Gemini assembles the retrieved chunks, generates a Markdown response, and Rocky converts it to Slack mrkdwn format before posting.
Telemetry flows to the observability stack: every LLM call, tool invocation, and token count is captured as an OpenTelemetry span, keyed by thread_ts as the session ID.

The entire bot is about 600 lines of Python in a single file. The rest of the toolchain, including document chunking, embedding generation, and index building, adds only a few thousand more. Storage, retrieval, and similarity scoring are handled entirely by StarRocks.

Why This Stack Works for a Lightweight RAG App

Three design choices keep Rocky’s operational footprint small while still delivering useful answers.

Primary Key Table + Stream Load for Hot-Swappable Docs

The document corpus is not a streaming workload. When the StarRocks docs update, we re-chunk the entire docs/en/ directory, regenerate embeddings, and reload the full table via Stream Load:

curl --location-trusted -u root: \
     -H "format: json" \
     -H "strip_outer_array: true" \
     -T docs_with_embedding.json \
     -XPUT http://host:8030/api/starsage/docs/_stream_load

One HTTP request loads ~5,000 rows, each carrying a 768-dimensional float array. Stream Load is the right fit for this scale: no Broker to maintain, no Flink pipeline to operate, and no ongoing ingestion infrastructure. For workloads in the thousands-to-tens-of-thousands-of-rows range with embedded vectors, it is the simplest path.

The Primary Key table model ensures atomic replacement. Queries hitting Rocky during the reload see a consistent snapshot, not a half-updated corpus. This is the core reason we chose a Primary Key table over a Duplicate Key table: not for row-level upserts, since this workload is full-table rebuilds, but because PK + Stream Load together make each reload atomic. Rocky never sees a mix of old and new chunks; it serves from either the previous corpus or the fully updated one.

Tool-Based Retrieval Over Context Stuffing

Rocky does not pre-load documents into the system prompt. Instead, search_starrocks_doc is exposed as a callable tool, and the model decides when and how often to invoke it.

In practice, this means a greeting like “hi Rocky” triggers zero retrieval calls and burns minimal tokens. A nuanced question like “how does GROUP_CONCAT behavior differ between 3.5 and 4.0?” triggers multiple retrieval calls — the model queries once per version and cross-references the results. The retrieval tool caps at eight invocations per turn; combined with Gemini’s server-side google_search for release notes and GitHub issues that fall outside the docs, Rocky covers a wide surface area without maintaining a separate web-scraping pipeline.

Embedding Cache for Predictable Cost

Community questions cluster heavily. “How to add a bitmap index” and “how to deploy on Kubernetes” appear multiple times per week. A local SQLite cache, keyed by query text and task_type (since RETRIEVAL_QUERY and RETRIEVAL_DOCUMENT produce different vectors), eliminates redundant embedding API calls. Hot queries cost exactly zero.

Get StarRocks Engineering’s stories in your inbox

Join Medium for free to get updates from this writer.

Remember me for faster sign in

On the LLM side, Gemini’s cached input tokens bill at 25% of the standard rate. Combined with the embedding cache, Rocky’s per-answer cost sits in the $0.001 range, sustainable for a community-funded project running indefinitely.

Six Lessons from Production RAG Traffic

These lessons came from real community usage, not from benchmarks. If you are building a domain-specific assistant on top of any IM platform, most of them transfer directly.

1. Prompt Engineering Beats Retrieval Tuning for Hallucination Control

Early versions of Rocky fabricated Red Hat mirror names, invented non-existent release notes URLs, and hallucinated version numbers. Community members clicked through to 404 pages. For a Q&A bot, confident-sounding wrong answers are worse than no answer at all.

The fix was not a retrieval improvement — it was a prompt change. We added an explicit honesty clause to the system prompt:

Do NOT fabricate facts. When you cannot give a reliable answer: state you are not certain, point to authoritative sources, suggest asking in the community.

The effect was immediate. RAG retrieval matters, but prompt engineering is often the cheaper and faster lever for suppressing hallucinations.

2. Fix Chunking Before Retrieval Tuning

StarRocks documentation contains a large number of MDX import … lines, Markdown tables that span multiple chunks, and extremely long SQL DDL statements. Feeding them directly into embeddings is garbage in, garbage out.

We invested heavily in the chunking pipeline:

Header-based hierarchical splitting: chunks split at Markdown headers, with the full header path preserved as metadata so the LLM knows where each chunk sits in the doc tree.
Noise filtering: MDX import lines are stripped; code blocks exceeding 4,000 characters are dropped entirely (they dilute semantic signal without contributing to answers).
Output table simplification: MySQL-style result tables in examples are compressed into structured descriptions instead of raw ASCII tables.

The engineering effort spent on chunking was roughly ten times the effort spent on integrating vector search. If your RAG recall quality is poor, examine your chunking before switching embedding models or tuning index parameters.

3. RAG Is a Tool, Not a Pipeline

Exposing retrieval as a function call — rather than baking it into every request — lets the model skip retrieval entirely when it is unnecessary. The model handles greetings, follow-up clarifications, and simple factual questions without touching the vector index.

Mixing client-side tools (search_starrocks_doc) with server-side tools (google_search), a capability fully supported in the Gemini SDK since version 1.70, gives Rocky a fallback path for content outside the documentation corpus without requiring a custom web-scraping service.

4. Follow the Platform’s Session Model

Slack provides threads with thread_ts identifiers. We use thread_ts directly as the session ID, persisting conversation history to data/slackthreads/{thread_ts}.json and including the most recent six turns in each request:

contents = []
for turn in history[-MAX_CONTEXT_MESSAGES:]:
    contents.append({"role": turn["role"], "parts": [{"text": turn["text"]}]})
contents.append({"role": "user", "parts": [{"text": query}]})

Twenty lines of code total, and far simpler than maintaining a session database.

One common misunderstanding from user feedback is worth clarifying: “Rocky has no memory.” In reality, this happens because in Slack, Rocky only responds when it is @-mentioned. It does not mean Rocky truly has no memory. This is a product interaction issue, not a technical one.

5. Instrument Early, Not Post-Launch

After Rocky went live, we wanted to answer questions like: How many questions did Rocky handle this week? Which question categories triggered the most tool calls? What did the average response cost? Which answers were challenged, corrected, or followed up on by users?

Without traces, the system was a black box. So we added a full OpenTelemetry instrumentation layer.

The lesson: Add tracing before launch. Without traces, token usage, costs, and tool-call patterns are invisible, which makes cost optimization, hallucination attribution, and weak-question analysis much harder.

6. Cache Aggressively, Track Every Dollar

Embedding APIs look cheap on a per-call basis, but high-frequency community questions compound fast. The SQLite embedding cache and Gemini’s cached-token pricing together keep Rocky’s unit economics in the $0.001-per-answer range. For a community-maintained project, that is the difference between “sustainable indefinitely” and “we need to discuss the budget.”

Honest Tradeoffs

Rocky is a lightweight RAG assistant, not a production support system. Several limitations are worth acknowledging:

Hallucinations are reduced, not eliminated: prompt-level guardrails help, but they do not guarantee correctness. Rocky works best for well-documented topics; edge cases and version-specific quirks still trip it up.
Verbose responses: like most RAG bots, Rocky tends to over-explain. A TL;DR mode toggle, letting users switch between long and short answers, is on the roadmap.
No cross-thread memory: context stays within a single Slack thread. A user who asked about materialized views yesterday starts from zero today.
No human escalation loop (yet): when Rocky cannot answer after multiple attempts, it should automatically tag a maintainer. That workflow is on the roadmap but not yet shipped.
No comprehensive troubleshooting capabilities: this will need to combine with the debugging skill we recently open-sourced.

These are active items on the roadmap, informed directly by community feedback.

What Rocky Demonstrates

Rocky is a small project — 600 lines of Python, one StarRocks table, and a Gemini API key — but it validates a larger point. For lightweight AI applications, an analytical database that supports array columns and similarity functions can eliminate the need for a separate vector database entirely. One table handles document storage, keyword search, and vector retrieval. The operational surface area stays minimal, and the team maintains one system instead of two.

The bot handles real community traffic daily, with an average user rating of 4.2 out of 5 and a per-answer cost under a tenth of a cent. If you are building a similar domain assistant and want to keep your infrastructure simple, the single-table RAG pattern is worth considering, especially if you already run StarRocks.

Try it yourself: mention @Rocky in the StarRocks community Slack, or explore Phoenix AI, the agentic AI database for building intelligent applications.