Ask HN: Way to extract relevant parts from a PDF based on a question?
Dear HN,
I am trying to do semantic search over a given corpus of PDF documents, with a question as input. My goal is to find the parts of the PDFs that best answer the input question. I am interested in learning about concepts, frameworks, and methodologies that will help me with this task. If you have any pointers, I would greatly appreciate it!

This is a key use case for text embeddings. Essentially, it is a process of converting sentences or paragraphs to vectors, where the closeness of two vectors represents their semantic similarity. So you can convert all the paragraphs in your document into vectors, convert your question into a vector, and then find, e.g., the 10 closest vectors, or all that fall under a certain maximum distance, etc. You can store the embeddings in a vector database to search across multiple documents.

Thank you for the answer! Wouldn't it make sense to compare the potential answers to the question against the parts of the documents?

If the potential answers are given, sure.

LlamaIndex is my tool of choice right now:
https://docs.llamaindex.ai/en/stable/
https://docs.llamaindex.ai/en/stable/examples/citation/pdf_p...
I'm using it with Qdrant and can get the text sections and locations that are tied to the answer, and the citation as well.

Thank you for the pointer! What makes LlamaIndex different from other similar tools, in your opinion?

They are focused on RAG / agent systems and have lots of docs and tutorials to that end.

I didn't know NotebookLM had that feature embedded. Thanks for sharing!
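For anyone who wants to see the embedding approach from the first reply in code, here is a minimal sketch of the "embed paragraphs, embed the question, take the closest vectors" loop. Big caveat: `embed` below is a toy bag-of-words stand-in that I made up so the snippet runs with only the standard library. It is not a real embedding model, and it only matches on shared words. In practice you would swap it for a call to an actual sentence-embedding model and probably keep the vectors in a vector database. The names `embed`, `top_k`, and `min_similarity` are illustrative, not from any library.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector.

    ASSUMPTION: a real setup would call a sentence-embedding model here.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(question, paragraphs, k=10, min_similarity=0.0):
    """Return up to k (score, index, paragraph) tuples, best match first.

    Paragraphs scoring at or below min_similarity are dropped, which mirrors
    the "all that fall under a certain maximum distance" variant.
    """
    q = embed(question)
    scored = [(cosine_similarity(q, embed(p)), i, p) for i, p in enumerate(paragraphs)]
    hits = [s for s in scored if s[0] > min_similarity]
    hits.sort(key=lambda s: s[0], reverse=True)
    return hits[:k]


# Hypothetical paragraphs, standing in for text extracted from a PDF.
paragraphs = [
    "The warranty covers battery defects for two years.",
    "Shipping takes five business days within Europe.",
    "To reset the device, hold the power button for ten seconds.",
]
results = top_k("How do I reset the device?", paragraphs, k=2)
for score, index, text in results:
    print(f"{score:.3f}  [{index}]  {text}")
```

The paragraph about resetting the device ranks first here, but only because it shares words with the question. A real embedding model is what gets you matches on meaning rather than on exact wording.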