Ask HN: Way to extract relevant parts from a PDF based on a question?

1 point by madhatter999 2 years ago · 8 comments · 1 min read

Dear HN,

I am trying to do semantic search over a given corpus of PDF documents, using a question as input. My goal is to find the parts of the PDFs that best answer the input question. I am interested in concepts, frameworks, and methodologies that will help me with this task. If you have any pointers, I would greatly appreciate them!

liampulles 2 years ago

This is a key use case for text embeddings. Essentially, it is a process of converting sentences or paragraphs to vectors, where the closeness of two vectors represents the semantic similarity of their texts.

So you can convert all the paragraphs in your document into vectors, convert your question into a vector, and then find, e.g., the 10 closest vectors, or all that fall within a certain maximum distance, etc.

You can store the embeddings in a vector database, to search across multiple documents.
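
A minimal sketch of the steps above, under loud assumptions: `embed` here is a toy bag-of-words counter standing in for a real embedding model, so it only captures word overlap, not semantic similarity. The names `embed`, `cosine_distance`, and `search` and the sample paragraphs are hypothetical, chosen for illustration; in practice you would swap `embed` for a call to an actual embedding model and keep the rest of the logic.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a sparse word-count vector.

    ASSUMPTION: a real system would return a dense vector from an
    embedding model here; word counts only capture lexical overlap.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat empty text as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def search(question, paragraphs, top_k=10, max_distance=None):
    """Rank paragraphs by distance to the question.

    Returns up to top_k (distance, paragraph) pairs, closest first,
    optionally dropping anything farther than max_distance.
    """
    q = embed(question)
    scored = sorted((cosine_distance(q, embed(p)), p) for p in paragraphs)
    if max_distance is not None:
        scored = [(d, p) for d, p in scored if d <= max_distance]
    return scored[:top_k]


# Hypothetical paragraphs, as if already extracted from a PDF.
paragraphs = [
    "The invoice must be paid within 30 days of receipt.",
    "Our office is closed on public holidays.",
    "Late payment of an invoice incurs a 2% monthly fee.",
]

for distance, paragraph in search("When must the invoice be paid?", paragraphs, top_k=2):
    print(f"{distance:.3f}  {paragraph}")
```

The `top_k` and `max_distance` parameters correspond to the two cut-offs mentioned above: a fixed number of nearest neighbours, or everything within a distance threshold. A vector database does the same ranking, just with an index that avoids comparing the question against every stored vector.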

madhatter999OP 2 years ago

Thank you for the answer! Wouldn't it make sense to compare potential answers to the question against the parts of the documents?
- liampulles 2 years ago
  
  If the potential answers are given, sure

verdverm 2 years ago

LlamaIndex is my tool of choice right now

https://docs.llamaindex.ai/en/stable/

https://docs.llamaindex.ai/en/stable/examples/citation/pdf_p...

I'm using it with Qdrant and can also get the text sections and locations that are tied to the answer and citation.

madhatter999OP 2 years ago

Thank you for the pointer! In your opinion, what makes LlamaIndex different from other similar tools?
- verdverm 2 years ago
  
  They are focused on RAG / agent systems and have lots of docs / tutorials to that end

anoni2 2 years ago

Try: https://notebooklm.google/

madhatter999OP 2 years ago

I didn't know NotebookLM had that feature built in. Thanks for sharing!
