Show HN: Code retrieval findings from a real-world benchmark
While building our OSS chat-with-your-codebase system, we faced many choices: chunking strategies, embedding models, retrieval algorithms, rerankers, etc.
Since we didn't want to make decisions based on vibes, and because academic benchmarks are somewhat contrived, we built our own. Our dataset consists of 1,000 questions about Hugging Face's Transformers library, where each question requires 1-3 Python files to answer correctly.
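For anyone curious how a benchmark like this gets scored, here is a minimal, hypothetical sketch: each question carries a set of "gold" files needed to answer it, and we check how well the retriever's results cover that set. The record format and file names below are illustrative only, not our actual dataset schema.

```python
def file_recall(gold_files, retrieved_files):
    """Fraction of the required files present in the retrieved set."""
    gold = set(gold_files)
    return len(gold & set(retrieved_files)) / len(gold)

# Toy examples in the spirit of the dataset (not real records):
questions = [
    {"gold": ["modeling_bert.py"],
     "retrieved": ["modeling_bert.py", "configuration_bert.py"]},
    {"gold": ["trainer.py", "trainer_utils.py"],
     "retrieved": ["trainer.py", "modeling_utils.py"]},
]

mean_recall = sum(
    file_recall(q["gold"], q["retrieved"]) for q in questions
) / len(questions)
print(f"mean file recall: {mean_recall:.2f}")
```

File-level recall is a natural fit here because each question needs 1-3 whole files; chunk-level metrics can be layered on top of the same idea.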
We started by comparing proprietary APIs for the various sub-tasks involved in an AI copilot. Here are our initial learnings:

- OpenAI's text-embedding-3-small embeddings perform best.
- NVIDIA's reranker outperforms Cohere, Voyage, and Jina.
- Sparse retrieval (e.g. BM25) actively hurts code retrieval if your index contains natural-language files (e.g. Markdown).
- Chunks of size 800 are ideal; going smaller yields only marginal gains.
- Going beyond top_k=25 for retrieval has diminishing returns.
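To make the last two knobs concrete, here is a minimal sketch of fixed-size chunking at ~800 characters with a retrieval cap of top_k=25. This is an assumption-laden toy (a naive character window; the post doesn't specify the splitting strategy, and real systems often split on syntactic boundaries like function or class definitions instead):

```python
def chunk_text(text, size=800, overlap=100):
    """Split text into fixed-size character windows with overlap.

    Overlap keeps definitions that straddle a boundary visible in
    both neighboring chunks. The 800 figure matches the finding
    above; the overlap value is an illustrative choice.
    """
    step = size - overlap
    return [text[start:start + size]
            for start in range(0, max(len(text), 1), step)]

# ~3.6 KB of toy "source code" to chunk:
code = "def f():\n    pass\n" * 200
chunks = chunk_text(code)
print(len(chunks), "chunks, max length", max(len(c) for c in chunks))

TOP_K = 25  # retrieving more than this showed diminishing returns
```

Swapping in an AST- or line-aware splitter only changes `chunk_text`; the rest of the indexing pipeline stays the same.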
We're just getting started and plan on continuously sharing our findings with the community. Go OSS!