GitHub - shebe-oss/shebe: Fast BM25 full-text search for code repositories with MCP integration for AI coding agents.

Shebe

Fast Code Search via BM25

Shebe is a fast, simple, local code-search tool powered by BM25. No embeddings, no GPU, no cloud.

Research shows 70-85% of developer code search value comes from keyword-based queries. Developers tend to search with exact terms they know: function names, API calls, error messages. BM25 excels at this.
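
To make the ranking idea concrete, here is a minimal sketch of Okapi BM25 scoring. It is illustrative only: Shebe itself is written in Rust, and the tokenizer, the parameter values `k1 = 1.2` and `b = 0.75`, and the function names are assumptions for this example rather than Shebe's implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on anything that is not a letter, digit or underscore.

    This is an assumed, deliberately simple tokenizer.
    """
    return [t for t in re.split(r"[^a-z0-9_]+", text.lower()) if t]


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: longer documents need more hits to score as high.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "fn parse_config(path: &str) -> Config",
    "fn load_index(path: &str) -> Index",
    "error: failed to parse_config at startup",
]
scores = bm25_scores("parse_config", docs)
```

The exact identifier `parse_config` scores above zero only in the documents that contain it, and the shorter of the two matching documents ranks first.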

Trade-offs:

  • Repositories must be cloned locally before indexing (no remote URL support)
  • No semantic similarity: "login" does not match "authenticate". However, BM25 handles multi-term queries without performance degradation, and agents quickly learn to include synonyms (e.g., login OR authenticate OR sign-in). For true semantic search, pair Shebe with vector tools. See detailed analysis.
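
The synonym workaround can be sketched as a small query-expansion helper. The `OR` syntax is taken from the example above. The synonym table and the `expand_query` function are hypothetical: in practice the agent supplies the alternative terms itself.

```python
# Hypothetical synonym table, for illustration only.
SYNONYMS = {
    "login": ["authenticate", "sign-in"],
}


def expand_query(term, synonyms=SYNONYMS):
    """Join a term and its known synonyms with OR, keeping order and dropping duplicates."""
    terms = [term] + synonyms.get(term, [])
    unique = list(dict.fromkeys(terms))
    return " OR ".join(unique)


query = expand_query("login")
print(query)
```

A term with no entry in the table is returned unchanged, so the helper is safe to apply to every query.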

Capabilities:

  • 2ms query latency
  • 2k-12k files/sec indexing (6k files in 0.5s)
  • 200-700 tokens/query
  • Full UTF-8 support (emoji, CJK, special characters)
  • 14 MCP tools for coding agents (Claude, Codex, etc.) (reference)

Size:

  • ~10k lines of Rust source code, plus another ~10k lines of test code.
  • Two binaries (CLI and MCP), each ~8 MB.

Positioning: Complements structural tools (Serena MCP) with content search. Coding agents learn tool selection quickly:

  • grep/ripgrep - Exact regex patterns, exhaustive matches, small codebases
  • Shebe - Ranked results, large codebases (1k+ files), polyglot search, boolean queries
  • Serena - Symbol refactoring, AST-aware edits, type-safe renaming

Alternatives: Cloud solutions like turbopuffer and nia come at a premium. Shebe is a free, local-only alternative. See WHY_SHEBE.md for benchmarks.


Quick Start

1. Install

Homebrew (macOS and Linux):

brew tap shebe-oss/tap
brew install shebe

See the homebrew-tap repository for supported platforms and troubleshooting.

Manual download (Linux x86_64):

export SHEBE_VERSION=v0.5.8
curl -LO "https://github.com/shebe-oss/shebe-releases/releases/download/${SHEBE_VERSION}/shebe-${SHEBE_VERSION}-linux-x86_64.tar.gz"
curl -LO "https://github.com/shebe-oss/shebe-releases/releases/download/${SHEBE_VERSION}/shebe-${SHEBE_VERSION}-linux-x86_64.tar.gz.sha256"

sha256sum -c "shebe-${SHEBE_VERSION}-linux-x86_64.tar.gz.sha256"
tar -xzf "shebe-${SHEBE_VERSION}-linux-x86_64.tar.gz"
sudo mv shebe shebe-mcp /usr/local/bin/
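
For environments without `sha256sum`, the checksum step can be approximated in Python. This is a sketch that assumes the `.sha256` file uses the usual `sha256sum` layout (hex digest first, then the file name). The function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB blocks so large archives are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_checksum(archive, checksum_file):
    """Return True if the archive's SHA-256 equals the first field of the checksum file."""
    expected = Path(checksum_file).read_text().split()[0].lower()
    return sha256_of(archive) == expected
```

Only extract and install the archive when `verify_checksum` returns `True`.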

Verify:

2. Index a Repository

# Clone a test repository
git clone --depth 1 https://github.com/envoyproxy/envoy.git ~/envoy

# Index it (creates session "envoy-v1")
shebe index-repository ~/envoy envoy-v1
# Output: Indexed 8,234 files (12,847 chunks) in 2.1s
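
When indexing is driven from a script rather than a shell, the same command can be built as an argument list. This sketch uses only the `shebe index-repository <path> <session>` form shown above. The helper names are hypothetical, and `run` assumes the `shebe` binary is on `PATH`.

```python
import os
import shlex
import subprocess


def index_command(repo_path, session):
    """Build the argv for `shebe index-repository <path> <session>`."""
    return ["shebe", "index-repository", os.path.expanduser(repo_path), session]


def run(argv):
    """Run a command and return its stdout; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(argv, check=True, capture_output=True, text=True).stdout


cmd = index_command("~/envoy", "envoy-v1")
print(shlex.join(cmd))  # shows the command without executing it
```

Passing a list instead of a shell string avoids quoting problems when a repository path contains spaces.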

3. Search Code

# Search for access log formatting
shebe search-code envoy-v1 "accesslog format"
Results for "accesslog format" in envoy-v1 (top 10):

1. source/extensions/access_loggers/common/access_log_base.h [0.847]
   class AccessLogBase : public AccessLog::Instance {
     void formatAccessLog(...);

2. source/common/formatter/substitution_formatter.cc [0.823]
   SubstitutionFormatter::format(const StreamInfo& info) {
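
An agent or script consuming this output might parse it into structured records. The sketch below assumes the layout of the sample above (a `N. path [score]` header followed by indented snippet lines). That layout is inferred from the example and is not a documented, stable format.

```python
import re

# Matches a result header such as "1. path/to/file.cc [0.847]".
HEADER = re.compile(r"^(\d+)\.\s+(\S+)\s+\[([0-9.]+)\]\s*$")


def parse_results(output):
    """Parse ranked results into dicts with rank, path, score and snippet lines."""
    results = []
    for line in output.splitlines():
        match = HEADER.match(line)
        if match:
            rank, path, score = match.groups()
            results.append(
                {"rank": int(rank), "path": path, "score": float(score), "snippet": []}
            )
        elif results and line.startswith(" ") and line.strip():
            # Indented, non-empty lines belong to the most recent result.
            results[-1]["snippet"].append(line.strip())
    return results


sample = """Results for "accesslog format" in envoy-v1 (top 10):

1. source/extensions/access_loggers/common/access_log_base.h [0.847]
   class AccessLogBase : public AccessLog::Instance {
     void formatAccessLog(...);

2. source/common/formatter/substitution_formatter.cc [0.823]
   SubstitutionFormatter::format(const StreamInfo& info) {
"""
results = parse_results(sample)
```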

4. Find References

# Find all references to SubstitutionFormatter
shebe find-references envoy-v1 SubstitutionFormatter --symbol-type type
References to "SubstitutionFormatter" (type) - 23 found:

HIGH CONFIDENCE (18):
  source/common/formatter/substitution_formatter.h:45
    class SubstitutionFormatter : public Formatter {

  source/extensions/access_loggers/file/file_access_log.cc:28
    std::unique_ptr<SubstitutionFormatter> formatter_;
  ...
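
For a rename, the useful part of this output is the list of `path:line` locations per confidence group. The following sketch extracts them, assuming the layout shown in the sample above. Group names other than HIGH are not shown in the example, so the pattern simply accepts any uppercase word.

```python
import re

# A group heading such as "HIGH CONFIDENCE (18):".
GROUP = re.compile(r"^([A-Z]+) CONFIDENCE \((\d+)\):\s*$")
# An indented location such as "  src/foo.cc:28".
LOCATION = re.compile(r"^\s+(\S+):(\d+)\s*$")


def parse_references(output):
    """Group (path, line) locations under their lowercase confidence heading."""
    groups = {}
    current = None
    for line in output.splitlines():
        group = GROUP.match(line)
        if group:
            current = group.group(1).lower()
            groups[current] = []
            continue
        location = LOCATION.match(line)
        if location and current is not None:
            groups[current].append((location.group(1), int(location.group(2))))
    return groups


sample = """References to "SubstitutionFormatter" (type) - 23 found:

HIGH CONFIDENCE (18):
  source/common/formatter/substitution_formatter.h:45
    class SubstitutionFormatter : public Formatter {

  source/extensions/access_loggers/file/file_access_log.cc:28
    std::unique_ptr<SubstitutionFormatter> formatter_;
  ...
"""
refs = parse_references(sample)
```

Source lines containing `::` are not mistaken for locations because a location must end in a colon followed by digits.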

For detailed setup, see INSTALLATION.md.


Common Tasks

Quick links to accomplish specific goals:

| Task | Tool | Guide |
|------|------|-------|
| Rename a symbol safely | find_references | Reference |
| Search polyglot codebase | search_code | Reference |
| Explore unfamiliar repo | index_repository + search_code | Quick Start |
| Find files by pattern | find_file | Reference |
| View file with context | read_file or preview_chunk | Reference |
| Update stale index | reindex_session | Reference |

Refactoring Workflow

Shebe's find_references and search_code tools work together to enumerate all code locations affected by a refactoring task. In this example, Claude Code uses Shebe to analyze a pagination work plan and identify every file that needs to change, completing the full impact analysis in about 1 minute.

Full workflow (6 screenshots):

Step 1: Index repository and run parallel find_references
Step 2: search_code locates CLI routing code
Step 3: Structured analysis: source files needing changes
Step 4: New modules, wiring and documentation
Step 5: File creation plan and exclusion list
Step 6: Impact summary (~11 files, 1m 14s)


Configuration

Quick Reference

| Variable | Default | Description |
|----------|---------|-------------|
| SHEBE_INDEX_DIR | ~/.local/state/shebe | Session storage location |
| SHEBE_CHUNK_SIZE | 512 | Characters per chunk (100-2000) |
| SHEBE_OVERLAP | 64 | Overlap between chunks |
| SHEBE_DEFAULT_K | 10 | Default search results count |
| SHEBE_MAX_K | 100 | Maximum search results allowed |
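
The sketch below shows one way these variables and defaults could be resolved. The variable names, defaults and the 100-2000 range come from the table above. The `load_settings` and `effective_k` helpers are hypothetical, and whether Shebe clamps or rejects an out-of-range `k` is an assumption of this example.

```python
import os

# Defaults copied from the table above.
DEFAULTS = {
    "SHEBE_INDEX_DIR": "~/.local/state/shebe",
    "SHEBE_CHUNK_SIZE": "512",
    "SHEBE_OVERLAP": "64",
    "SHEBE_DEFAULT_K": "10",
    "SHEBE_MAX_K": "100",
}


def load_settings(env=None):
    """Read settings from the environment, falling back to the documented defaults."""
    env = os.environ if env is None else env

    def get(name):
        return env.get(name, DEFAULTS[name])

    chunk_size = int(get("SHEBE_CHUNK_SIZE"))
    if not 100 <= chunk_size <= 2000:
        raise ValueError("SHEBE_CHUNK_SIZE must be between 100 and 2000")
    return {
        "index_dir": os.path.expanduser(get("SHEBE_INDEX_DIR")),
        "chunk_size": chunk_size,
        "overlap": int(get("SHEBE_OVERLAP")),
        "default_k": int(get("SHEBE_DEFAULT_K")),
        "max_k": int(get("SHEBE_MAX_K")),
    }


def effective_k(requested, settings):
    """Use default_k when no k is requested; clamp to the range 1..max_k (assumed behaviour)."""
    k = settings["default_k"] if requested is None else requested
    return max(1, min(k, settings["max_k"]))
```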

Configuration File

Create shebe.toml in your working directory or ~/.config/shebe/shebe.toml:

[indexing]
chunk_size = 512
overlap = 64
max_file_size = 10485760  # 10MB

[search]
default_k = 10
max_k = 100
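
To illustrate what `chunk_size` and `overlap` control, here is a minimal sliding-window chunker. It assumes that each chunk starts `chunk_size - overlap` characters after the previous one. Shebe's actual chunking strategy may differ, and `chunk_text` is a name invented for this example.

```python
def chunk_text(text, chunk_size=512, overlap=64):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, so a match that straddles
    a chunk boundary still appears whole in at least one chunk.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


text = "".join(chr(97 + i % 26) for i in range(1000))
chunks = chunk_text(text)
print([len(c) for c in chunks])
```

Python strings are indexed by Unicode code point, so slicing this way never splits a multi-byte UTF-8 character.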

See CONFIGURATION.md for complete reference.


Performance

Validated on Istio (5,605 files, Go-heavy) and OpenEMR (6,364 files, PHP polyglot):

| Metric | Result |
|--------|--------|
| Query latency | 2ms (consistent across all query types) |
| Indexing (Istio) | 11,210 files/sec (0.5s for 5,605 files) |
| Indexing (OpenEMR) | 1,928 files/sec (3.3s for 6,364 files) |
| Token usage | 210-650 tokens/query |
| Polyglot coverage | 11 file types in single query |
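
The indexing rates follow directly from the file counts and wall-clock times in the table, as this short check shows. The function name is illustrative.

```python
def files_per_second(files, seconds):
    """Indexing throughput: files processed divided by elapsed seconds."""
    return files / seconds


istio = files_per_second(5605, 0.5)
openemr = files_per_second(6364, 3.3)
print(f"Istio: {istio:,.0f} files/sec, OpenEMR: {openemr:,.0f} files/sec")
```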

See docs/Performance.md for detailed benchmarks.


Architecture

See ARCHITECTURE.md for developer guide.


Troubleshooting

| Issue | Cause | Solution |
|-------|-------|----------|
| "Session not found" | Session doesn't exist or typo | Run list_sessions to see available sessions |
| "Schema version mismatch" | Session from older Shebe version | Run upgrade_session to migrate |
| Slow indexing | Disk I/O or large files | Exclude node_modules/, target/, check disk |
| No search results | Empty session or wrong query | Verify with get_session_info, check query syntax |
| "File not found" in read_file | File deleted since indexing | Run reindex_session to update |
| High token usage | Too many results | Reduce k parameter (default: 10) |

For detailed troubleshooting, see docs/guides/mcp-setup-guide.md.


Project Status

Version: v0.5.X
Status: Release Candidate
Testing: 76% coverage
Next: Pagination for list_dir and read_file when more than 500 files match a search term

See CHANGELOG.md for version history.


License

See LICENSE.


Contributing

We welcome contributions! Please see CONTRIBUTING.md for detailed guidelines.