Ask HN: How to extract information from multiple (unstructured text) documents?

2 points by gpa 4 years ago · 3 comments


I need to extract certain information from research publications, such as species, biomass, geographic location, and perhaps related environmental data. Assume that I will convert PDF to text and, if necessary, do OCR.

But here's the catch: other species with similar data can sit quite close to my target species on the same page, in the same paragraph or sentence, or in the same table. Moreover, indicator values can be quite close or even identical (e.g., biomass B = 1.2 kg/m^2), since the species are from the same genus. For example, Mytilus has three species (actually more): Mytilus edulis, Mytilus trossulus, and Mytilus galloprovincialis. How would an algorithm with no prior knowledge determine that a specific value relates to my target species rather than, say, the one adjacent to it in the same table or paragraph?

I'm a human, and I know what to look for because I have prior knowledge, but I cannot process hundreds or thousands of articles as quickly as a machine can. Does anyone have experience with a tool that can correctly parse such information after appropriate setup? I am aware of:

- HN search results (https://hn.algolia.com/?q=information+extraction)

- Apache Tika (https://tika.apache.org/)

- Apache OpenNLP (https://opennlp.apache.org/)

- Apache UIMA (https://uima.apache.org/external-resources.html)

- GATE (https://gate.ac.uk/)

But I am not sure whether any of these can do the job, as I haven't used them. I also know that there are companies that have developed similar solutions (https://www.ontotext.com/knowledgehub/case-studies/ai-content-generation-in-scientific-communication/), possibly using GraphDB. In addition, what is the best data storage solution? In one case you extract a whole table from a publication; in another, a single data point. It's not worth the effort to create a separate database table for a single data point. What would be the right approach, software (library), workflow, and data storage solution in this case?
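
To make the ambiguity concrete, here is a toy sketch (the sentence, the species pairing, and the values are all invented) of how a naive pattern match goes wrong when two congeneric species share a sentence:

```python
import re

# Hypothetical sentence; species pairing and values are made up.
text = ("Biomass was B = 1.2 kg/m^2 for Mytilus trossulus, "
        "while for Mytilus edulis it reached B = 0.9 kg/m^2.")

target = "Mytilus edulis"

# Naive rule: if the target species appears, take the first biomass
# value in the sentence -- a value that actually belongs to M. trossulus.
if target in text:
    value = re.search(r"B = ([\d.]+) kg/m\^2", text).group(1)
    print(target, value)  # prints "Mytilus edulis 1.2" -- wrong
```

Any tool I adopt needs some principled way around exactly this.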

PaulHoule 4 years ago

No off-the-shelf information extraction system is going to be useful for your task. In particular, most of the systems you list are notorious rabbit holes and dead ends. (Well, UIMA was developed by IBM to support projects that have 100+ coders and data-entry people; it's not a dead end if you have a budget that big...)

If getting the right answer matters to you, you need to start with a workflow system that will let you do the task manually. You will absolutely need it for two reasons: (1) editing cases that the extraction system gets wrong, and (2) creating a training/evaluation set for the extraction pipeline.
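
As a sketch of what I mean (the field names are my own assumptions, not any particular tool's schema), keep each candidate extraction together with its provenance and a human verdict, so the same records serve both for corrections and as training/evaluation data:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Extraction:
    """One candidate fact, traceable back to its source span."""
    doc_id: str               # which publication it came from
    char_start: int           # span offsets into the extracted text
    char_end: int
    species: str              # e.g. "Mytilus edulis"
    attribute: str            # e.g. "biomass"
    value: float
    unit: str                 # e.g. "kg/m^2"
    status: str = "pending"   # "pending" | "accepted" | "corrected"
    corrected_value: Optional[float] = None  # set by a human reviewer

    def gold(self) -> float:
        """Value to train/evaluate on: human-corrected when available."""
        return self.corrected_value if self.status == "corrected" else self.value
```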

When you've got a well-defined task that you can do manually, you can then think about automating some of the extraction (80% is realistic) with rules such as regexes, or with RNN/CNN/Transformer models.
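
For example (species and values invented), even a crude rule that attaches each value to the nearest species mention is the kind of thing that gets you part of the way, and the hand-built set then tells you exactly how often it is wrong:

```python
import re

text = ("Biomass was B = 1.2 kg/m^2 for Mytilus trossulus, "
        "while for Mytilus edulis it reached B = 0.9 kg/m^2.")

species_pat = re.compile(r"Mytilus \w+")
value_pat = re.compile(r"B = ([\d.]+) kg/m\^2")

mentions = [(m.start(), m.group()) for m in species_pat.finditer(text)]

# Rule: attach each value to the species mention nearest to it in the text.
# Crude, but its error rate is exactly what the training set lets you measure.
for v in value_pat.finditer(text):
    nearest = min(mentions, key=lambda s: abs(s[0] - v.start()))
    print(nearest[1], v.group(1))
# Mytilus trossulus 1.2
# Mytilus edulis 0.9
```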

My contacts in Argentina, who do projects like this all the time, say that it takes maybe 20,000 examples to train an extraction model, and that fits my experience. What separates the people who succeed at this kind of project from those who fail is that the ones who succeed build the training set, while the ones who fail exhaust themselves looking at projects like Tika, OpenNLP, UIMA, etc.

  • gpa (OP) 4 years ago

    Sounds reasonable. And yes, I am a one-man team. But then, after collecting all this information, how do you organize and store it for on-demand availability? For tables extracted from a publication, you can create a new table within a relational database. But how do you organize single data points extracted from the text? Where do you store them?

    • PaulHoule 4 years ago

      If I were writing a workflow system in a hurry, I'd use ArangoDB on the back end and a web server based on asyncio. I like the idea of something RDF-based, but I'd need to develop the right algebra for isolating individual "records" in an RDF database so they can be updated safely.
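
      Roughly what I mean, as a minimal sketch assuming a local ArangoDB instance and the python-arango client (the database, collection, and field names are invented): single data points and whole extracted tables can live in one schemaless collection, so you never need a new relational table per source table.

```python
from arango import ArangoClient  # assumes the python-arango client

# Assumed local setup; database and credentials are placeholders.
client = ArangoClient(hosts="http://localhost:8529")
db = client.db("extraction", username="root", password="secret")

if not db.has_collection("observations"):
    db.create_collection("observations")
obs = db.collection("observations")

# A single extracted data point...
obs.insert({
    "doc_id": "smith2018",
    "species": "Mytilus edulis",
    "attribute": "biomass",
    "value": 1.2,
    "unit": "kg/m^2",
    "source": {"page": 4, "kind": "sentence"},
})

# ...and a whole extracted table share the same collection as documents.
obs.insert({
    "doc_id": "jones2019",
    "kind": "table",
    "rows": [
        {"species": "Mytilus trossulus", "biomass": 1.1},
        {"species": "Mytilus galloprovincialis", "biomass": 1.3},
    ],
})
```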

      I worked at a startup that built a system with quite a few parts, including a system for annotating text to train an extraction pipeline. That last bit had a TypeScript/React front end and a Scala web server, and it kept data in an elaborate set of tables in Postgres. They were in a hurry to get it working for one particular customer, so it wound up pretty half-baked.

      I have a lot of ideas on this so look up my profile and send me an email!
