Launch HN: Chonkie (YC X25) – Open-Source Library for Advanced Chunking

143 points by snyy a day ago


Hey HN! We're Shreyash and Bhavnick. We're building Chonkie (https://chonkie.ai), an open-source library for chunking and embedding data.

Python: https://github.com/chonkie-inc/chonkie

TypeScript: https://github.com/chonkie-inc/chonkie-ts

Here's a video showing our code chunker: https://youtu.be/Xclkh6bU1P0.

Bhavnick and I have been building personal projects with LLMs for a few years. For much of this time, we found ourselves writing our own chunking logic to support RAG applications. We often hesitated to use existing libraries because they either had only basic features or felt too bloated (some are 80MB+).

We built Chonkie to be lightweight, fast, extensible, and easy. The space is evolving rapidly, and we wanted Chonkie to be able to quickly support the newest strategies. We currently support: Token Chunking, Sentence Chunking, Recursive Chunking, Semantic Chunking, plus:

- Semantic Double Pass Chunking: Chunks text semantically first, then merges closely related chunks.

- Code Chunking: Chunks code files by creating an AST and finding ideal split points.

- Late Chunking: Based on the paper (https://arxiv.org/abs/2409.04701), where chunk embeddings are derived from embedding a longer document.

- Slumber Chunking: Based on the "Lumber Chunking" paper (https://arxiv.org/abs/2406.17526). It uses recursive chunking, then an LLM verifies split points, aiming for high-quality chunks with reduced token usage and LLM costs.

You can see how Chonkie compares to LangChain and LlamaIndex in our benchmarks: https://github.com/chonkie-inc/chonkie/blob/main/BENCHMARKS....

Some technical details about the Chonkie package: - ~15MB default install vs. ~80-170MB for some alternatives. - Up to 33x faster token chunking compared to LangChain and LlamaIndex in our tests. - Works with major tokenizers (transformers, tokenizers, tiktoken). - Zero external dependencies for basic functionality. - Implements aggressive caching and precomputation. - Uses running mean pooling for efficient semantic chunking. - Modular dependency system (install only what you need).

In addition to chunking, Chonkie also provides an easy way to create embeddings. For supported providers (SentenceTransformer, Model2Vec, OpenAI), you just specify the model name as a string. You can also create custom embedding handlers for other providers.

RAG is still the most common use case currently. However, Chonkie makes chunks that are optimized for creating high quality embeddings and vector retrieval, so it is not really tied to the "generation" part of RAG. In fact, We're seeing more and more people use Chonkie for implementing semantic search and/or setting context for agents.

We are currently focused on building integrations to simplify the retrieval process. We've created "handshakes" – thin functions that interact with vector DBs like pgVector, Chroma, TurboPuffer, and Qdrant, allowing you to interact with storage easily. If there's an integration you'd like to see (vector DB or otherwise), please let us know.

We also offer hosted and on-premise versions with OCR, extra metadata, all embedding providers, and managed vector databases for teams that want a fully managed pipeline. If you're interested, reach out at shreyash@chonkie.ai or book a demo: https://cal.com/shreyashn/chonkie-demo.

We're eager to hear your feedback and comments! Thanks!

mritchie712 - a day ago

We (https://www.definite.app/) have a use case I'd imagine is common for people building agents.

When a user works with our agent, they may end up with a large conversation thread (e.g. 200k+ tokens) with many SQL snippets, query results and database metadata (e.g. table and column info).

For example, if they ask "show me any companies that were heavily engaged at one point, but I haven't talked to in the last 90 days". This will pull in their schema (e.g. Hubspot), run a bunch of SQL, show them results, etc.

I want to allow the agent to search previous threads for answers so they don't need to have the conversation again, but chunking up the existing thread is non-trivial (e.g. you don't want to separate the question and answer, you may want to remove errors while retaining the correction, etc.).

Do you have any plans to support "auto chunking" for AI message[0] threads?

0 - e.g. https://platform.openai.com/docs/api-reference/messages/crea...

yawnxyz - a day ago

I'm curious if chunking is different for embeddings vs. for "agentic retrieval" e.g. an AI or a person operates like a Librarian; they look up in an index at what resources to look up, get the relevant bits, then piece them together into a cohesive narrative whole — would we do any chunking at all for this, or does this purely rely on the way the DB is setup? I think for certain use cases, even a single DB record could be too large for context windows, so maybe chunking might need to be done to the record? (e.g. a db of research papers)

gazagoal - 3 hours ago

Is it easily extensible? For instance, when chunking PDF-converted-texts, is it possible to apply transformation or attach metadata to chunks?

zackify - 19 hours ago

You guys should steal the ideas I had in mind and partially implemented on https://github.com/zackify/revect

Similar to you I saw a lot of bloated projects out there. Mine is 90mb container.

I want to do what your project does but in addition have extensions for every day apps that index into a db.

Your private database for all ai interactions.

I also have a cloud version using the mcp auth spec, but it’s all for fun and probably not worth releasing.

Do you have any plans to do further use cases such as this?

elpalek - 17 hours ago

Do you have a benchmark for comparing different chunking methods? Your existing benchmark is to compare different libraries.

ChromaticPanic - 11 hours ago

When I was looking at your library last week, It didn't look like there was a direct way to use my own embedding model endpoints. For example, I run snowflake arctic embed in vllm and it would be good be able to use it with Chonkie's semantic chunkers.

whoaanni2 - 5 hours ago

Congrats on the launch and all the best. For PDF, does it convert directly to markdown using deterministic approaches of is compatible with reducto/unstructured/llamaparse? How does it fit with these players?

amir_karbasi - a day ago

Looks great! I had looked at Chonkie a few months back, but didn't need it in our pipelines. I was just writing a POC for an agentic chunker this week to handle various formatting and chunking requirements. I'll give Chonkie a shot!

pj_mukh - a day ago

Super cool!

It looks like size and speed is your major advantage. In our RAG pipeline we run the chunking process async as an onboarding type process. Is Chonkie primarily for people looking to process documents in some sort of real-time scenario?

greymalik - a day ago

You’re part of YC but this is open source - how do you plan to make money off of it?

ketzo - 20 hours ago

I’m building out a side project where I need to ingest + chunk a lot of HTML — wrote my own(terrible) hunker naively thinking that would be easy :’)

Definitely gonna give this a try!

Andugal - a day ago

Congratulations for the launch!

You said that Chonkie works with multiple vector stores. I was wondering what RAG database HN uses? Do you need a specialized one (like Chroma) or is Postgres just fine?

_epps_ - a day ago

Excited to try this out! Also +1 for Moo Deng-ish mascot.

hweller - 21 hours ago

Congratulations on the launch! would be awesome to see support for MongoDB Atlas as one of the vector stores and Voyage AI as an embedding provider if you are interested. I can imagine quite a few customers that would prefer a lightweight interface for chunking- lmk how I can help make that happen from the Mongo side!

blef - 18 hours ago

> - Code Chunking: Chunks code files by creating an AST and finding ideal split points.

I'd be interested to use it for SQL, did you try? Does it works well with it? I'm not familiar with the tree-sitter library

elliot07 - a day ago

Chonkie is great software. Congrats on the launch! Has been a pleasure to use so far.

olavfosse - 21 hours ago

Very cool!

What's the story for chunking PDFs?

We've been using Marker and handling markdown->chunks manually.

tevon - a day ago

Was just looking into chunking strategies today, this looks great! Will update with any feedback.

pzo - a day ago

Is this only for node (how about bun/deno)? Have it been tested to work with react native?

dbworku - 20 hours ago

Very cool. Dope maintainers and project!

petesergeant - 9 hours ago

It would be cool to have examples of what the distinct chunks each approach takes looks like. They should just be -- essentially -- paragraphs, right?

babuloseo - a day ago

I like the mascot.