Text extraction

1 point by theslay 12 years ago · 2 comments · 1 min read


Hi, I'm working on plagiarism detection and I need some help with text extraction from PDFs. I've tried PDFTextStream, which works really well for extracting the raw text. What I need now is to extract the text into a structured format that I can query for things like the title, chapters, etc. I'd appreciate any pointers on how to achieve this. Thanks

pedalpete 12 years ago

Have you tried posting this to http://stackoverflow.com ? That's a better forum for these kinds of questions.

If you were to write a blog post about how to structure the extracted text, that's more the HN thing.

mindcrime 12 years ago

I won't swear to it, but I suspect you're going to have to largely roll your own, and that it will be at least partly heuristic-driven. I use Apache Tika[1] to extract text from PDFs and then index it with Lucene, but we don't need to discriminate between chapters or anything like that. Still, I can picture how you could use OpenNLP[2] and some custom code to break the text down into chapters.

[1]: http://tika.apache.org

[2]: http://opennlp.apache.org
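
To give a rough idea of what "heuristic-driven" custom code might look like, here is a minimal sketch that takes text already extracted from a PDF (by whatever tool) and splits it into a title plus chapters. Everything in it is an assumption for illustration: the function name `split_into_chapters`, the rule that the first non-empty line is the title, and the rule that chapter headings look like "Chapter 1" or "Chapter II: Methods". Real documents will need different or additional rules.

```python
import re

# Assumed heading shape: "Chapter 1", "CHAPTER 2: Methods", "Chapter IV - Results".
# This pattern is a hypothetical heuristic, not something taken from any library.
CHAPTER_RE = re.compile(
    r"^\s*chapter\s+(\d+|[ivxlc]+)\b[\s:.\-]*(.*)$",
    re.IGNORECASE,
)


def split_into_chapters(text):
    """Split plain extracted text into a dict with a title and a chapter list."""
    lines = text.splitlines()

    # Heuristic: treat the first non-empty line as the document title.
    title = next((ln.strip() for ln in lines if ln.strip()), "")

    chapters = []
    current = None
    for ln in lines:
        match = CHAPTER_RE.match(ln)
        if match:
            # A heading line starts a new chapter record.
            current = {
                "number": match.group(1),
                "heading": match.group(2).strip(),
                "body": [],
            }
            chapters.append(current)
        elif current is not None:
            # Lines before the first heading (front matter) are skipped.
            current["body"].append(ln)

    # Collapse each chapter's collected lines into a single string.
    for chapter in chapters:
        chapter["body"] = "\n".join(chapter["body"]).strip()

    return {"title": title, "chapters": chapters}
```

The result is a plain dictionary, so "querying" is just indexing, e.g. `doc["title"]` or `doc["chapters"][0]["heading"]`. A line-based regex like this will misfire on body sentences that happen to start with "Chapter I ...", which is the kind of case where layout information (font size, position) or an NLP pass would have to take over.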
