
NextPlaid & ColGREP

NextPlaid is a multi-vector search engine. ColGREP is semantic code search, built on it.

ColGREP · NextPlaid · Models · Docs

ColGREP demo


ColGREP

Semantic code search for your terminal and your coding agents. Searches combine regex filtering with semantic ranking. All local, your code never leaves your machine.

Quick start

Install:

curl --proto '=https' --tlsv1.2 -LsSf https://github.com/lightonai/next-plaid/releases/latest/download/colgrep-installer.sh | sh

Build the index:

colgrep init /path/to/project  # specific project
colgrep init                   # current directory

Search:

colgrep "database connection pooling"

That's it. No server, no API, no dependencies. ColGREP is a single Rust binary with everything baked in. colgrep init builds the index for the first time. After that, every search detects file changes and updates the index automatically before returning results.

Regex meets semantics:

colgrep -e "async.*await" "error handling"

Agent integrations

| Tool | Install |
|---|---|
| Claude Code | colgrep --install-claude-code |
| OpenCode | colgrep --install-opencode |
| Codex | colgrep --install-codex |

Restart your agent after installing. Claude Code has full hooks support. OpenCode and Codex integrations are basic for now; PRs welcome.

How it works

flowchart TD
    A["Your codebase"] --> B["Tree-sitter"]
    B --> C["Structured representation"]
    C --> D["LateOn-Code-edge · 17M"]
    D --> E["NextPlaid"]
    E --> F["Search"]

    B -.- B1["Parse functions, methods, classes"]
    C -.- C1["Signature, params, calls, docstring, code"]
    D -.- D1["Multi-vector embedding per code unit · runs on CPU"]
    E -.- E1["Rust index binary · quantized · memory-mapped · incremental"]
    F -.- F1["grep-compatible flags · SQLite filtering · semantic ranking<br/>100% local, your code never leaves your machine"]

    style A fill:#4a90d9,stroke:#357abd,color:#fff
    style B fill:#50b86c,stroke:#3d9956,color:#fff
    style C fill:#50b86c,stroke:#3d9956,color:#fff
    style D fill:#e8913a,stroke:#d07a2e,color:#fff
    style E fill:#e8913a,stroke:#d07a2e,color:#fff
    style F fill:#9b59b6,stroke:#8445a0,color:#fff
    style B1 fill:none,stroke:#888,stroke-dasharray:5 5,color:#888
    style C1 fill:none,stroke:#888,stroke-dasharray:5 5,color:#888
    style D1 fill:none,stroke:#888,stroke-dasharray:5 5,color:#888
    style E1 fill:none,stroke:#888,stroke-dasharray:5 5,color:#888
    style F1 fill:none,stroke:#888,stroke-dasharray:5 5,color:#888

What the model sees. Each code unit is converted to structured text before embedding:

# Function: fetch_with_retry
# Signature: def fetch_with_retry(url: str, max_retries: int = 3) -> Response
# Description: Fetches data from a URL with retry logic.
# Parameters: url, max_retries
# Returns: Response
# Calls: range, client.get
# Variables: i, e
# Uses: client, RequestError
# File: src/utils/http_client.py

def fetch_with_retry(url: str, max_retries: int = 3) -> Response:
    """Fetches data from a URL with retry logic."""
    for i in range(max_retries):
        try:
            return client.get(url)
        except RequestError as e:
            if i == max_retries - 1:
                raise e

This structured input gives the model richer signal than raw code alone.
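
As a rough illustration, a header like the one above could be assembled from a parsed code unit along the lines below. The CodeUnit fields are illustrative, not ColGREP's internal schema.

from dataclasses import dataclass

@dataclass
class CodeUnit:
    name: str
    signature: str
    description: str
    parameters: list[str]
    returns: str
    calls: list[str]
    variables: list[str]
    uses: list[str]
    file: str
    source: str

def to_structured_text(unit: CodeUnit) -> str:
    # Commented header first, then the raw source, mirroring the example above.
    header = "\n".join([
        f"# Function: {unit.name}",
        f"# Signature: {unit.signature}",
        f"# Description: {unit.description}",
        f"# Parameters: {', '.join(unit.parameters)}",
        f"# Returns: {unit.returns}",
        f"# Calls: {', '.join(unit.calls)}",
        f"# Variables: {', '.join(unit.variables)}",
        f"# Uses: {', '.join(unit.uses)}",
        f"# File: {unit.file}",
    ])
    return f"{header}\n\n{unit.source}"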

More: install variants, performance tuning, all flags and options → colgrep/README.md


Why multi-vector?

Standard vector search collapses an entire document into one embedding. That's a lossy summary. Fine for short text, bad for code where a single function has a name, parameters, a docstring, control flow, and dependencies.

Multi-vector search keeps ~300 embeddings of dimension 128 per document instead of one. At query time, each query token finds its best match across all document tokens (MaxSim). The trade-off is more storage upfront, which is what NextPlaid addresses with quantization and memory-mapped indexing.
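
A minimal NumPy sketch of MaxSim scoring, to make the idea concrete (illustrative only, not NextPlaid's optimized kernel):

import numpy as np

def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Sum, over query tokens, of the best similarity against any document token."""
    # With L2-normalized embeddings, cosine similarity is a plain dot product.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sims = q @ d.T                       # (n_query_tokens, n_doc_tokens)
    return float(sims.max(axis=1).sum())

# Toy shapes matching the numbers above: ~300 document tokens, 128-dim vectors.
rng = np.random.default_rng(0)
score = maxsim(rng.normal(size=(8, 128)), rng.normal(size=(300, 128)))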


NextPlaid

A local-first multi-vector database with a REST API. It's what powers ColGREP under the hood, but it's a general-purpose engine you can use for any retrieval workload.

  • Built-in encoding. Pass text, get results. Ships with ONNX Runtime for ColBERT models, no external inference server needed.
  • Memory-mapped indices. Low RAM footprint, indices live on disk and are paged in on demand.
  • Product quantization. 2-bit or 4-bit compression. A million documents fit in memory; a back-of-the-envelope sketch follows this list.
  • Incremental updates. Add and delete documents without rebuilding the index.
  • Metadata pre-filtering. SQL WHERE clauses on a built-in SQLite store. Filter before search so only matching documents are scored.
  • CPU-optimized. Designed to run fast on CPU. CUDA supported when you need it.
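
A back-of-the-envelope storage estimate for the quantization point above, using the ~300 × 128 figure from the previous section. This is raw arithmetic only; real indices also store centroid assignments and metadata.

tokens_per_doc, dim = 300, 128

fp32_bytes = tokens_per_doc * dim * 4            # 32-bit floats: ~150 KiB per document
for nbits in (4, 2):
    q_bytes = tokens_per_doc * dim * nbits // 8  # quantized codes only
    print(f"{nbits}-bit: {q_bytes / 1024:.1f} KiB/doc, "
          f"{q_bytes * 1_000_000 / 2**30:.1f} GiB for 1M docs")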

NextPlaid vs FastPlaid. FastPlaid is a GPU batch indexer built for large-scale, single-pass workloads. NextPlaid wraps the same FastPlaid algorithm into a production API that handles documents as they arrive: incremental updates, concurrent reads/writes, deletions, and built-in encoding. Use FastPlaid for bulk offline indexing and experiments, NextPlaid for serving and streaming ingestion.

Quick start

Run the server (Docker):

# CPU
docker pull ghcr.io/lightonai/next-plaid:cpu-1.0.6
docker run -p 8080:8080 -v ~/.local/share/next-plaid:/data/indices \
  ghcr.io/lightonai/next-plaid:cpu-1.0.6 \
  --host 0.0.0.0 --port 8080 --index-dir /data/indices \
  --model lightonai/answerai-colbert-small-v1-onnx --int8
# GPU
docker pull ghcr.io/lightonai/next-plaid:cuda-1.0.6
docker run --gpus all -p 8080:8080 -v ~/.local/share/next-plaid:/data/indices \
  ghcr.io/lightonai/next-plaid:cuda-1.0.6 \
  --host 0.0.0.0 --port 8080 --index-dir /data/indices \
  --model lightonai/GTE-ModernColBERT-v1 --cuda

Query from Python:

pip install next-plaid-client

from next_plaid_client import NextPlaidClient, IndexConfig

client = NextPlaidClient("http://localhost:8080")

# Create index
client.create_index("docs", IndexConfig(nbits=4))

# Add documents, text is encoded server-side
client.add(
    "docs",
    documents=[
        "next-plaid is a multi-vector database",
        "colgrep is a code search tool based on NextPlaid",
    ],
    metadata=[{"id": "doc_1"}, {"id": "doc_2"}],
)

# Search
results = client.search("docs", ["coding agent tool"])

# Search with metadata filtering
results = client.search(
    "docs",
    ["vector-database"],
    filter_condition="id = ?",
    filter_parameters=["doc_1"],
)

# Delete by predicate
client.delete("docs", "id = ?", ["doc_1"])

Once the server is running: Swagger UI · OpenAPI spec

More: REST API reference, Docker Compose, environment variables → next-plaid-api/README.md


API Benchmarks

End-to-end benchmarks against the NextPlaid API on BEIR datasets. Documents are uploaded as raw text in parallel batches of 64. Search queries are sent as raw text, one at a time, with 16 concurrent workers to simulate real user traffic. All throughput numbers (docs/s, QPS) include encoding time: the model runs inside the API, so every document and query is embedded on the fly.

Setup: lightonai/GTE-ModernColBERT-v1 on NVIDIA H100 80GB, top_k=100, n_ivf_probe=8, n_full_scores=4096. CPU search uses INT8-quantized ONNX encoding on the same machine.

| Dataset | Documents | MAP | NDCG@10 | NDCG@100 | Recall@10 | Recall@100 | Indexing (docs/s) | GPU QPS | GPU P95 (ms) | CPU QPS | CPU P95 (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| arguana | 8,674 | 0.2457 | 0.3499 | 0.3995 | 0.7126 | 0.9337 | 77.1 | 13.6 | 170.1 | 17.4 | 454.7 |
| fiqa | 57,638 | 0.3871 | 0.4506 | 0.5129 | 0.5184 | 0.7459 | 41.3 | 18.2 | 170.6 | 17.6 | 259.1 |
| nfcorpus | 3,633 | 0.1870 | 0.3828 | 0.3427 | 0.1828 | 0.3228 | 86.7 | 6.6 | 262.1 | 16.9 | 219.4 |
| quora | 522,931 | 0.8170 | 0.8519 | 0.8644 | 0.9309 | 0.9730 | 105.5 | 20.9 | 126.2 | 17.7 | 235.1 |
| scidocs | 25,657 | 0.1352 | 0.1914 | 0.2732 | 0.2020 | 0.4418 | 46.9 | 17.5 | 139.3 | 16.5 | 281.7 |
| scifact | 5,183 | 0.7186 | 0.7593 | 0.7775 | 0.8829 | 0.9633 | 53.1 | 7.9 | 169.5 | 16.9 | 305.4 |
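
The query pattern described above can be reproduced with the Python client along these lines. This is an illustrative sketch, not the actual benchmark harness; the index name and query list are placeholders.

from concurrent.futures import ThreadPoolExecutor

from next_plaid_client import NextPlaidClient

client = NextPlaidClient("http://localhost:8080")
queries = ["example query one", "example query two"]  # placeholder query set

# Raw-text queries, one per request, 16 concurrent workers as in the setup above.
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(lambda q: client.search("docs", [q]), queries))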

Models

Any HuggingFace ColBERT-style model can be exported to ONNX. By default, both FP32 and INT8 quantized versions are created. INT8 quantization reduces size (~4x smaller) and improves speed with minimal quality loss.

pip install pylate-onnx-export

# Export model (creates model.onnx and model_int8.onnx)
pylate-onnx-export lightonai/GTE-ModernColBERT-v1 -o ./my-models

# Export + push to HuggingFace Hub
pylate-onnx-export lightonai/GTE-ModernColBERT-v1 -o ./my-models --push-to-hub myorg/my-onnx-model

Ready-to-use models

These can be served with NextPlaid and used with ColGREP without export:

| Model | Use case |
|---|---|
| lightonai/LateOn-Code-edge | Code search, lightweight |
| lightonai/LateOn-Code | Code search, accurate |
| lightonai/mxbai-edge-colbert-v0-32m-onnx | Text retrieval, lightweight |
| lightonai/answerai-colbert-small-v1-onnx | Text retrieval, lightweight |
| lightonai/GTE-ModernColBERT-v1 | Text retrieval, accurate |

Any PyLate-compatible ColBERT model from HuggingFace can be used once exported to ONNX.


License

Apache-2.0

Citation

@software{next-plaid,
  title  = {NextPlaid, ColGREP: Multi-vector search, from database to coding agents.},
  url    = {https://github.com/lightonai/next-plaid},
  author = {Sourty, Raphaël},
  year   = {2026},
}

@misc{LateOn-Code,
  title  = {LateOn-Code: a Family of State-Of-The-Art Late Interaction Code Retrieval Models},
  author = {Chaffin, Antoine},
  url    = {https://huggingface.co/collections/lightonai/lateon-code},
  year   = {2026},
}