# NextPlaid & ColGREP
NextPlaid is a multi-vector search engine. ColGREP is semantic code search, built on it.
## ColGREP
Semantic code search for your terminal and your coding agents. Searches combine regex filtering with semantic ranking. All local, your code never leaves your machine.
### Quick start
Install:
```bash
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/lightonai/next-plaid/releases/latest/download/colgrep-installer.sh | sh
```
Build the index:
```bash
colgrep init /path/to/project   # specific project
colgrep init                    # current directory
```
Search:
colgrep "database connection pooling"That's it. No server, no API, no dependencies. ColGREP is a single Rust binary with everything baked in. colgrep init builds the index for the first time. After that, every search detects file changes and updates the index automatically before returning results.
Regex meets semantics:
```bash
colgrep -e "async.*await" "error handling"
```
### Agent integrations
| Tool | Install |
|---|---|
| Claude Code | `colgrep --install-claude-code` |
| OpenCode | `colgrep --install-opencode` |
| Codex | `colgrep --install-codex` |
Restart your agent after installing. Claude Code has full hooks support. OpenCode and Codex integrations are basic for now; PRs welcome.
### How it works
```mermaid
flowchart TD
    A["Your codebase"] --> B["Tree-sitter"]
    B --> C["Structured representation"]
    C --> D["LateOn-Code-edge · 17M"]
    D --> E["NextPlaid"]
    E --> F["Search"]
    B -.- B1["Parse functions, methods, classes"]
    C -.- C1["Signature, params, calls, docstring, code"]
    D -.- D1["Multi-vector embedding per code unit · runs on CPU"]
    E -.- E1["Rust index binary · quantized · memory-mapped · incremental"]
    F -.- F1["grep-compatible flags · SQLite filtering · semantic ranking<br/>100% local, your code never leaves your machine"]
    style A fill:#4a90d9,stroke:#357abd,color:#fff
    style B fill:#50b86c,stroke:#3d9956,color:#fff
    style C fill:#50b86c,stroke:#3d9956,color:#fff
    style D fill:#e8913a,stroke:#d07a2e,color:#fff
    style E fill:#e8913a,stroke:#d07a2e,color:#fff
    style F fill:#9b59b6,stroke:#8445a0,color:#fff
    style B1 fill:none,stroke:#888,stroke-dasharray:5 5,color:#888
    style C1 fill:none,stroke:#888,stroke-dasharray:5 5,color:#888
    style D1 fill:none,stroke:#888,stroke-dasharray:5 5,color:#888
    style E1 fill:none,stroke:#888,stroke-dasharray:5 5,color:#888
    style F1 fill:none,stroke:#888,stroke-dasharray:5 5,color:#888
```
What the model sees. Each code unit is converted to structured text before embedding:
```python
# Function: fetch_with_retry
# Signature: def fetch_with_retry(url: str, max_retries: int = 3) -> Response
# Description: Fetches data from a URL with retry logic.
# Parameters: url, max_retries
# Returns: Response
# Calls: range, client.get
# Variables: i, e
# Uses: client, RequestError
# File: src/utils/http_client.py

def fetch_with_retry(url: str, max_retries: int = 3) -> Response:
    """Fetches data from a URL with retry logic."""
    for i in range(max_retries):
        try:
            return client.get(url)
        except RequestError as e:
            if i == max_retries - 1:
                raise e
```
This structured input gives the model richer signal than raw code alone.
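As a rough illustration, a builder for this kind of structured text might look like the sketch below. The field names (`name`, `signature`, `docstring`, ...) are hypothetical; ColGREP's actual internal schema may differ.

```python
# Hypothetical sketch: assembling the structured text shown above from parsed
# fields. Field names are illustrative, not ColGREP's internal schema.
def to_structured_text(unit: dict) -> str:
    header = [
        f"# Function: {unit['name']}",
        f"# Signature: {unit['signature']}",
        f"# Description: {unit['docstring']}",
        f"# Parameters: {', '.join(unit['params'])}",
        f"# Returns: {unit['returns']}",
        f"# Calls: {', '.join(unit['calls'])}",
        f"# File: {unit['file']}",
    ]
    return "\n".join(header) + "\n" + unit["source"]
```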
More: install variants, performance tuning, all flags and options → colgrep/README.md
## Why multi-vector?
Standard vector search collapses an entire document into one embedding. That's a lossy summary. Fine for short text, bad for code where a single function has a name, parameters, a docstring, control flow, and dependencies.
Multi-vector search keeps ~300 embeddings of dimension 128 per document instead of one. At query time, each query token finds its best match across all document tokens (MaxSim), so fine-grained details stay searchable. The trade-off is more storage upfront, which is what NextPlaid tackles with quantization and memory-mapped indexing.
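To make MaxSim concrete, here is a minimal NumPy sketch of the scoring step, assuming L2-normalized embeddings. It illustrates the operation itself, not NextPlaid's internals.

```python
import numpy as np

# Minimal MaxSim sketch (illustrative only).
rng = np.random.default_rng(0)
Q = rng.standard_normal((32, 128))   # query: 32 token embeddings, dim 128
D = rng.standard_normal((300, 128))  # document: ~300 token embeddings, dim 128

# Normalize so dot products are cosine similarities.
Q /= np.linalg.norm(Q, axis=1, keepdims=True)
D /= np.linalg.norm(D, axis=1, keepdims=True)

# Each query token takes its best match over all document tokens;
# the per-token maxima are summed into one relevance score.
score = (Q @ D.T).max(axis=1).sum()
print(f"MaxSim score: {score:.3f}")
```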
## NextPlaid
A local-first multi-vector database with a REST API. It's what powers ColGREP under the hood, but it's a general-purpose engine you can use for any retrieval workload.
- Built-in encoding. Pass text, get results. Ships with ONNX Runtime for ColBERT models, no external inference server needed.
- Memory-mapped indices. Low RAM footprint, indices live on disk and are paged in on demand.
- Product quantization. 2-bit or 4-bit compression. A million documents fit in memory; see the storage sketch after this list.
- Incremental updates. Add and delete documents without rebuilding the index.
- Metadata pre-filtering. SQL WHERE clauses on a built-in SQLite store. Filter before search so only matching documents are scored.
- CPU-optimized. Designed to run fast on CPU. CUDA supported when you need it.
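For a sense of scale on the product-quantization bullet above, here is a back-of-the-envelope sketch. It assumes ~300 embeddings of dimension 128 per document (as in the multi-vector section) and a PLAID-style layout of one centroid id plus b-bit residual codes per dimension; the exact on-disk format is an assumption, not a specification.

```python
# Rough storage math under the assumptions stated above.
tokens, dim, docs = 300, 128, 1_000_000

fp32 = tokens * dim * 4             # raw float32 vectors
pq2  = tokens * (4 + dim * 2 // 8)  # centroid id (int32) + 2-bit residuals
pq4  = tokens * (4 + dim * 4 // 8)  # centroid id (int32) + 4-bit residuals

for label, per_doc in [("fp32", fp32), ("2-bit", pq2), ("4-bit", pq4)]:
    print(f"{label:5s}: {per_doc / 1024:6.1f} KB/doc, "
          f"{per_doc * docs / 1e9:6.1f} GB for 1M docs")
```

Under these assumptions, 2-bit codes bring a million 300-token documents from roughly 150 GB of raw float32 vectors down to about 11 GB, which is why quantization plus memory-mapping keeps the footprint manageable.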
NextPlaid vs FastPlaid. FastPlaid is a GPU batch indexer built for large-scale, single-pass workloads. NextPlaid wraps the same FastPlaid algorithm into a production API that handles documents as they arrive: incremental updates, concurrent reads/writes, deletions, and built-in encoding. Use FastPlaid for bulk offline indexing and experiments, NextPlaid for serving and streaming ingestion.
### Quick start
Run the server (Docker):
```bash
# CPU
docker pull ghcr.io/lightonai/next-plaid:cpu-1.0.6
docker run -p 8080:8080 -v ~/.local/share/next-plaid:/data/indices \
  ghcr.io/lightonai/next-plaid:cpu-1.0.6 \
  --host 0.0.0.0 --port 8080 --index-dir /data/indices \
  --model lightonai/answerai-colbert-small-v1-onnx --int8
```

```bash
# GPU
docker pull ghcr.io/lightonai/next-plaid:cuda-1.0.6
docker run --gpus all -p 8080:8080 -v ~/.local/share/next-plaid:/data/indices \
  ghcr.io/lightonai/next-plaid:cuda-1.0.6 \
  --host 0.0.0.0 --port 8080 --index-dir /data/indices \
  --model lightonai/GTE-ModernColBERT-v1 --cuda
```
Query from Python:
```bash
pip install next-plaid-client
```
```python
from next_plaid_client import NextPlaidClient, IndexConfig

client = NextPlaidClient("http://localhost:8080")

# Create index
client.create_index("docs", IndexConfig(nbits=4))

# Add documents, text is encoded server-side
client.add(
    "docs",
    documents=[
        "next-plaid is a multi-vector database",
        "colgrep is a code search tool based on NextPlaid",
    ],
    metadata=[{"id": "doc_1"}, {"id": "doc_2"}],
)

# Search
results = client.search("docs", ["coding agent tool"])

# Search with metadata filtering
results = client.search(
    "docs",
    ["vector-database"],
    filter_condition="id = ?",
    filter_parameters=["doc_1"],
)

# Delete by predicate
client.delete("docs", "id = ?", ["doc_1"])
```
Once the server is running: Swagger UI · OpenAPI spec
More: REST API reference, Docker Compose, environment variables → next-plaid-api/README.md
### API Benchmarks
End-to-end benchmarks against the NextPlaid API on BEIR datasets. Documents are uploaded as raw text in parallel batches of 64. Search queries are sent as raw text, one at a time, with 16 concurrent workers to simulate real user traffic. All throughput numbers (docs/s, QPS) include encoding time: the model runs inside the API, so every document and query is embedded on the fly.
Setup: lightonai/GTE-ModernColBERT-v1 on NVIDIA H100 80GB, top_k=100, n_ivf_probe=8, n_full_scores=4096. CPU search uses INT8-quantized ONNX encoding on the same machine.
| Dataset | Documents | MAP | NDCG@10 | NDCG@100 | Recall@10 | Recall@100 | Indexing (docs/s) | GPU QPS | GPU P95 (ms) | CPU QPS | CPU P95 (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| arguana | 8,674 | 0.2457 | 0.3499 | 0.3995 | 0.7126 | 0.9337 | 77.1 | 13.6 | 170.1 | 17.4 | 454.7 |
| fiqa | 57,638 | 0.3871 | 0.4506 | 0.5129 | 0.5184 | 0.7459 | 41.3 | 18.2 | 170.6 | 17.6 | 259.1 |
| nfcorpus | 3,633 | 0.1870 | 0.3828 | 0.3427 | 0.1828 | 0.3228 | 86.7 | 6.6 | 262.1 | 16.9 | 219.4 |
| quora | 522,931 | 0.8170 | 0.8519 | 0.8644 | 0.9309 | 0.9730 | 105.5 | 20.9 | 126.2 | 17.7 | 235.1 |
| scidocs | 25,657 | 0.1352 | 0.1914 | 0.2732 | 0.2020 | 0.4418 | 46.9 | 17.5 | 139.3 | 16.5 | 281.7 |
| scifact | 5,183 | 0.7186 | 0.7593 | 0.7775 | 0.8829 | 0.9633 | 53.1 | 7.9 | 169.5 | 16.9 | 305.4 |
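A rough sketch of the concurrent-query setup described above (raw-text queries, 16 workers), using the `next-plaid-client` API from the quick start; the index name and query strings are placeholders, not the benchmark harness itself.

```python
from concurrent.futures import ThreadPoolExecutor

from next_plaid_client import NextPlaidClient

client = NextPlaidClient("http://localhost:8080")
queries = ["example query 1", "example query 2"]  # placeholder raw-text queries

# 16 concurrent workers, one raw-text query per request, mirroring the setup above.
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(lambda q: client.search("docs", [q]), queries))
```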
## Models
Any HuggingFace ColBERT-style model can be exported to ONNX. By default, both FP32 and INT8 quantized versions are created. INT8 quantization shrinks the model roughly 4x and speeds up inference with minimal quality loss.
```bash
pip install pylate-onnx-export

# Export model (creates model.onnx and model_int8.onnx)
pylate-onnx-export lightonai/GTE-ModernColBERT-v1 -o ./my-models

# Export + push to HuggingFace Hub
pylate-onnx-export lightonai/GTE-ModernColBERT-v1 -o ./my-models --push-to-hub myorg/my-onnx-model
```
### Ready-to-use models
These can be served with NextPlaid and used with ColGREP without export:
| Model | Use case |
|---|---|
| `lightonai/LateOn-Code-edge` | Code search, lightweight |
| `lightonai/LateOn-Code` | Code search, accurate |
| `lightonai/mxbai-edge-colbert-v0-32m-onnx` | Text retrieval, lightweight |
| `lightonai/answerai-colbert-small-v1-onnx` | Text retrieval, lightweight |
| `lightonai/GTE-ModernColBERT-v1` | Text retrieval, accurate |
Any PyLate-compatible ColBERT model from HuggingFace can be used when converted to ONNX.
## License
Apache-2.0
## Citation
```bibtex
@software{next-plaid,
  title  = {NextPlaid, ColGREP: Multi-vector search, from database to coding agents.},
  url    = {https://github.com/lightonai/next-plaid},
  author = {Sourty, Raphaël},
  year   = {2026},
}

@misc{LateOn-Code,
  title  = {LateOn-Code: a Family of State-Of-The-Art Late Interaction Code Retrieval Models},
  author = {Chaffin, Antoine},
  url    = {https://huggingface.co/collections/lightonai/lateon-code},
  year   = {2026}
}
```
