Show HN: Repo2vec – an open-source library for chatting with any codebase

93 points by nutellalover a year ago · 55 comments · 1 min read

Reader

Hi HN, We're excited to share repo2vec: a simple-to-use, modular library enabling you to chat with any public or private codebase. It's like Github Copilot but with the most up-to-date information about your repo.

We made this because sometimes you just want to learn how a codebase works and how to integrate it, without spending hours sifting through the code itself.

We tried to make it dead-simple to use. With two scripts, you can index and get a functional interface for your repo. Every generated response shows where in the code the context for the answer was pulled from.

We also made it plug-and-play where every component from the embeddings, to the vector store, to the LLM is completely customizable.

If you want to see a hosted version of the chat interface with its features, here's a link: https://www.youtube.com/watch?v=CNVzmqRXUCA

We would love your feedback!

- Mihail and Julia

resters a year ago

Very useful! I was just thinking this kind of thing should exist!

I would also like to be able to have the LLM know all of the documentation for any dependencies in the same way.

CuriousJ a year ago

OP's cofounder here. The nice thing is that a lot of repos include the documentation as well, so it comes for free by simply indexing the repo (like huggingface/transformers for instance).
nutellaloverOP a year ago

Thanks!
This is a great idea. Definitely something we plan to support.

cool-RR a year ago

I want to feed it not only the code but also a corpus of questions and answers, e.g. from the discussions page on GitHub. Is that possible?

nutellaloverOP a year ago

Thanks for the request! This is on our roadmap, as is supporting Github issues and eventually external documentation/code discussions from Slack, Jira/Linear, etc.
- nutellaloverOP a year ago
  
  Feel free to submit an issue on the repo and we'll get to it!
spaceship__sun a year ago

I just need to have gemini 1.5 pro in VS code dev environment and pass in the entire codebase in the context window. THEY STILL HAVEN'T DONE THIS.
- CuriousJ a year ago
  
  Depending on how large your codebase is, that could get pricey, at least for now. But it's probably just a matter of time until it all gets dirt cheap.
  - nutellaloverOP a year ago
    
    Definitely agree that the trend is toward lower cost where a lot of these use-cases are unlocked. Especially as all the major 3rd party LLM providers scramble to ship better models to retain mind-share.

peterldowns a year ago

Very cool project, I'm definitely going to try this out. One question — why use the OpenAI embeddings API instead of BGE (BERT) or other embeddings model that can be efficiently run client-side? Was there a quality difference or did you just default to using OpenAI embeddings?

CuriousJ a year ago

OP's cofounder here. For us, OpenAI embeddings worked best. When building a system that has many points of failure, I like to start with the highest quality ones (even if they're expensive / lack privacy) just to get an upper threshold of how good the system can be. Then start replacing pieces one by one and measure how much I'm losing in quality.
P.S. I worked on BERT at Google and have PTSD from how much we tried to make it work for retrieval, and it never really did well. Don't have much experience with BGE though.
- peterldowns a year ago
  
  Understood, thanks for the clear answer. Very cool that you worked on BERT at Google — thank you (and your team) for all of the open source releasing and publishing you've done over the years.
  I'm using OpenAI embeddings right now in my own project and I'm asking because I'd like to evaluate other embedding models that I can run in/adjacent-to my backend server, so that I don't have to wait 200ms to embed the user's search phrase/query. I'm very impressed by your project and I thought I might save myself some trouble if you had done some clear evals and decided OpenAI is far-and-away better :)
- xrd a year ago
  
  I wish you could tell the stories of how you eval'ed BERT at Google. Sounds meaty.
  - CuriousJ a year ago
    
    Retrieval is rarely ever evaluated in isolation. Academics would indirectly evaluate it by how much it improved question answering. The really cool thing at Google is that there were so many products and use cases (beyond the academic QA benchmarks) that would indirectly tell you if retrieval is useful. Much harder to do for smaller companies with a smaller suite of products and user bases.
nutellaloverOP a year ago

We ran some qualitative tests and there was a quality difference. In fact, benchmarks show that trend to generally hold: https://archersama.github.io/coir/
That being said, our goal was to make the library modular so you can easily add support for whatever embeddings you want. Definitely encourage experimenting for your use-case because even in our tests, we found that trends which hold true in research benchmarks don't always translate to custom use-cases.
- peterldowns a year ago
  
  > we found that trends which hold true in research benchmarks don't always translate to custom use-cases.
  Exactly why I asked! If you don't mind a followup question, how were you evaluating embeddings models — was it mostly just vibes on your own repos, or something more rigorous? Asking because I'm working on something similar and based on what you've shipped, I think I could learn a lot from you!
  - nutellaloverOP a year ago
    
    Happy to help!
    At the beginning, we started with qualitative "vibe" checks where we could iterate quickly and the delta in quality was still so significant that we could obviously see what was performing better.
    Once we stopped trusting our ability to discern differences, we actually bit the bullet and made a small eval benchmark set (~20 queries across 3 repos of different sizes) and then used that to guide algorithmic development.
    
    peterldowns a year ago
    
    Thank you, I appreciate the details.

zaptrem a year ago

We have LLMs with hundreds of thousands of tokens context windows and prompt caching that makes using them affordable. Why don’t we just stuff the whole code base in the context window?

CuriousJ a year ago

This paper shows that 200-800 is the ideal chunk size; if you go above, the model starts getting confused / distracted. https://arxiv.org/pdf/2406.14497
- zaptrem a year ago
  
  Makes sense. Thanks!
nutellaloverOP a year ago

The truth is we started there. But for any reasonably-sized, complex codebase this just isn't going to work as the context window isn't sufficient and moreover it becomes harder for the LLM to reason over arbitrary parts of the context.
For the time being, indexing and retrieving a good collection of 10-20 code chunks is more effective/performant in practice.
siamese_puff a year ago

Not an expert, but OP is right and this is generally a known issue with large windows and RAG. Small chunks are usually best. Also how you chunk is important. OP - what’s the most optimal way to parse/chunk code snippets?
- jshobrook a year ago
  
  You can use the AST to chunk the code: https://docs.sweep.dev/blogs/chunking-2m-files
  - CuriousJ a year ago
    
    We're using an improvement over this exact blogpost actually. We started from there, but weren't happy that some of the chunks were really small (and they would undeservedly get surfaced to the top). So we added some extra logic to merge the siblings if they're small.
    https://github.com/Storia-AI/repo2vec/blob/1864102949e720320...

erichi a year ago

Is it somehow different from Cursor codebase indexing/chat? I’m using this setup to analyse repos currently.

nutellaloverOP a year ago

Big fans of Cursor ourselves. One of the goals with this library is to make it easy for maintainers of OSS projects to expose chat support functionality to their users in a very streamlined, easy-to-setup fashion.
So yes you can certainly use to index and query your own repos for yourself, but it's also a way to get more of your OSS lib users onboarded.

adamtaylor_13 a year ago

Sorry for the dumb question but can I use this on private repositories or is it sending my code to OpenAI?

simonw a year ago

Out of interest, are you worried that OpenAI would go against their API license terms and train on your data anyway, or are you worried that they might log your data and then have a security breach that exposes it to malicious attackers?
- phantomathkg a year ago
  
  I think people simply worry calling Open AI on a lower price plan would cause the data to be scan for training purposes.
  - simonw a year ago
    
    Their API terms and conditions say they won't do that.
    I'm fascinated by how little people trust them!
    
    adamtaylor_13 a year ago
    
    Terms and conditions only mean something if you have the money and patience to hold someone’s feet to the fire.
    If I’m a CTO figuring out how to enable my team, I care a great deal about whether or not our private code is going to OpenAI.
    
    simonw a year ago
    
    I'm confident they don't want your code in their training data. The amount they have to lose if they're found to be using customer code as training data is enormous. Plus there are no guarantees that your code is good for training a model - model providers have been focusing much more heavily on quantity rather than quality of training data recently.
    (Worrying that they may log your data and then have a security breach is a different matter - that's a reasonable concern, they've had security bugs in the past.)
    I call this the AI trust crisis: people absolutely won't believe AI companies that say they won't train on their data: https://simonwillison.net/2023/Dec/14/ai-trust-crisis/
    
    thomaskang08 a year ago
    
    Quality over quantity, rather?
    
    simonw a year ago
    
    Yes, that's what I meant! Too late to edit now.
    
    nutellaloverOP a year ago
    
    This is a great read Simon.
- adamtaylor_13 a year ago
  
  All of the above. I’m not overly worried. But it’s surprising that they don’t mention it anywhere.
nutellaloverOP a year ago

You can certainly apply to a private repo. If you want to ensure data stays local, you would have to add support for an OSS embedding/LLM model (of which there are many good offerings to pick from).
- infocollector a year ago
  
  Please update TL;DR: repo2vec is a simple-to-use, modular library enabling you to chat with any public or private codebase "by sending data to OpenAI."
  - nutellaloverOP a year ago
    
    Thanks for the note! We welcome contributions!

kevshor a year ago

This looks super cool! Is there currently a limit to how big a repo can be for this to work efficiently?

CuriousJ a year ago

We noticed an interesting phenomenon related to the size of the repo. The bigger it is, the more its utility skews towards learning how to use the library as opposed to how to change it, i.e. for the big repos the chat is more useful for users than developers/maintainers.
nutellaloverOP a year ago

Great question. For most small repos (10-20 source files) this works incredibly well out-of-the-box.
We stress-tested with repos like langchain, llamaindex, kubernetes and there the retrieval still needs work to effectively return relevant chunks. This is still an open research question.

wiradikusuma a year ago

Is this for a specific language? Does it support polygot (multiple languages in 1 project)?

nutellaloverOP a year ago

Yup! We use tree-sitter and parse it at the file-level.

interestingsoup a year ago

Any plans on allowing the use of a local LLM like Ollama or LM Studio?

CuriousJ a year ago

OP's cofounder here. Yes, we started with what we perceived as highest quality (OpenAI embeddings + Claude autocompletions), but will definitely make our way to local/OSS. The code is super modular so hopefully the community will help as well.

ccgongie a year ago

Super easy to use! Thanks! What's powering this under the hood?

nutellaloverOP a year ago

The starter config is Openai embeddings + llm, pinecone vector store, gradio for the UI. But it's customizable so you can swap out whatever you want easily.
- leobg a year ago
  
  What is Pinecone used for? I would assume that an average repo yields only a few hundred or thousand chunks. Even with brite force similarity search that is just 2-digit milliseconds on CPU. Faster than any API call. And even if you got into the million chunk scale, there’s FAISS and HNSW. So wouldn’t outsourcing this to an external provider not only be unnecessary, but making things slower?

RicoElectrico a year ago

I wonder if it will work on https://github.com/organicmaps/organicmaps

So far two similar solutions I tested crapped out on non-ASCII characters. Because Python's UTF-8 decoder is quite strict about it.

CuriousJ a year ago

OP's cofounder here. Thanks for pointing out this test case. Surfaced that we weren't handling symlinks properly. With this fix, I was able to successfully embed and index most of the repo (though I stopped at 100 embedding jobs so that we don't burn through OpenAI credits).
P.S. You'll see a bunch of warnings for e.g. binary files that are ignored. https://github.com/Storia-AI/repo2vec/commit/1864102949e7203...
nutellaloverOP a year ago

OP here! I love this stress test. Will index and get back to you!

ranger_danger a year ago

is there a docker image?

Settings

Show HN: Repo2vec – an open-source library for chatting with any codebase

Keyboard Shortcuts