Ask HN: Strategies or tools for embedding multiple file types?

3 points by spruce_tips 8 months ago · 3 comments · 1 min read


I've worked a good bit with embedding strategies for RAG, but only for documents that are identical in structure, i.e., interview transcripts.

I'm curious how others have thought about handling embeddings for multiple file types (txt, pdf, image, docx, ppt, etc.)? Obviously, I could handle each file type individually and then build a flexible search layer on top, but I'm concerned about the level of maintenance required.

One idea I had was to build a translation layer of sorts that would take some arbitrary file type in, map it onto a standardized text schema, and embed that. For images (which are much less common in my dataset), I would use an LLM to describe the image and cast that text into my standard format. The standard format would allow me to simplify the chunking and embedding logic for each file type, and make the vector search layer a lot easier to maintain.

I know this won't be perfect, but I think it could solve most of what I'm trying to achieve.
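To make the idea concrete, here is a rough sketch of what such a translation layer might look like. Everything in it is illustrative: `StandardDoc`, `register`, `to_standard`, and `describe_image` are hypothetical names, the schema fields are just a guess at what a standard format could hold, and the image path is left as an unimplemented stub where an LLM captioning call would go.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict, Union


@dataclass
class StandardDoc:
    """Hypothetical standardized text schema that every file type maps onto."""
    source: str                      # original file path
    file_type: str                   # e.g. "txt", "image"
    text: str                        # the text that will be chunked and embedded
    metadata: Dict[str, str] = field(default_factory=dict)


Converter = Callable[[Path], StandardDoc]
CONVERTERS: Dict[str, Converter] = {}  # file extension -> converter function


def register(*extensions: str):
    """Decorator that registers one converter for one or more extensions."""
    def wrap(fn: Converter) -> Converter:
        for ext in extensions:
            CONVERTERS[ext.lower()] = fn
        return fn
    return wrap


@register(".txt")
def convert_txt(path: Path) -> StandardDoc:
    return StandardDoc(str(path), "txt", path.read_text(encoding="utf-8"))


def describe_image(path: Path) -> str:
    """Placeholder (assumption): call an LLM here to caption the image."""
    raise NotImplementedError("plug in an image-description model")


@register(".png", ".jpg", ".jpeg")
def convert_image(path: Path) -> StandardDoc:
    caption = describe_image(path)
    return StandardDoc(str(path), "image", caption, {"derived_from": "llm_caption"})


def to_standard(path: Union[str, Path]) -> StandardDoc:
    """Dispatch on file extension; unknown types fail loudly."""
    path = Path(path)
    try:
        converter = CONVERTERS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"no converter registered for {path.suffix!r}") from None
    return converter(path)
```

With this shape, adding a file type means adding one converter, while the chunking and embedding code only ever sees `StandardDoc.text`.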

---

Curious what others think about this and what you have tried.

Cheers,

spruce_tips

chiccomagnus 8 months ago

If you don't want to reinvent the wheel, we have built exactly that; google "Preprocess".

skeptrune 8 months ago

Strongly recommend using Apache Tika[1] for this. It's the industry standard for extracting text from a wide range of document formats.

You can take the text output from Tika, chunk it with something like Chonkie[2], and embed it for your search index.

[1] https://tika.apache.org/

[2] https://chonkie.ai/
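A minimal sketch of that extract, chunk, embed flow, under some assumptions: text extraction goes through the `tika` Python package (`parser.from_file`, which needs Java available), the chunker is a plain word-window stand-in rather than Chonkie's actual API, and `embed` is a hypothetical caller-supplied function for whatever embedding model you use.

```python
from typing import Callable, List, Sequence, Tuple


def extract_text(path: str) -> str:
    """Extract plain text with Apache Tika via the tika-python package."""
    # Imported lazily so the chunker below works without Tika installed.
    from tika import parser
    parsed = parser.from_file(path)
    # "content" can be None for files with no extractable text.
    return (parsed.get("content") or "").strip()


def chunk_words(text: str, size: int = 200, overlap: int = 20) -> List[str]:
    """Naive fixed-size word-window chunker (a stand-in for a chunking library)."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def index_file(
    path: str,
    embed: Callable[[str], Sequence[float]],  # hypothetical embedding function
) -> List[Tuple[str, Sequence[float]]]:
    """Return (chunk, vector) pairs ready to write into a search index."""
    return [(chunk, embed(chunk)) for chunk in chunk_words(extract_text(path))]
```

Because Tika returns plain text regardless of the input format, the chunking and embedding steps stay the same for every file type.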

spruce_tipsOP 8 months ago

sweet, this looks great. hadn't heard of Chonkie but Tika was on my list. thanks!
