Ask HN: How to exhaustively search the scientific literature?
I need a comprehensive database of a certain type of event described in the scientific literature. For what it's worth, the event is a 'paleoearthquake', which is a historic or prehistoric earthquake that is found in the geologic record, usually by digging a trench across a fault line and identifying the disturbances in the geologic strata across or adjacent to the fault and, if possible, dating them via radiocarbon or other geochronological methods. However, I don't think the specifics are particularly important.
The issue is that these are generally reported in the literature from local investigations of one or two faults, yielding a few events. These studies are done wherever there are earthquakes on land, so we have a global scope and language issues. Even limiting the results to the English peer-reviewed literature, however, it's a huge distributed search.
I estimate that there are on the order of 10,000 published events, and a mean of 2-3 events per publication.
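As a rough sanity check on the size of the search, those two figures imply a few thousand publications. A minimal sketch of the arithmetic, using only the numbers quoted above (the function name is just illustrative):

```python
def estimated_publications(total_events, events_per_publication):
    """Rough number of publications needed to account for total_events."""
    return total_events / events_per_publication

# ~10,000 events at a mean of 2 to 3 events per publication
low = estimated_publications(10_000, 3)   # fewer papers if each reports more events
high = estimated_publications(10_000, 2)  # more papers if each reports fewer events
print(f"roughly {low:,.0f} to {high:,.0f} publications")
# prints "roughly 3,333 to 5,000 publications"
```

So the target is on the order of 3,000 to 5,000 papers.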
For my immediate use of the database, it is very important for the database to be as complete as possible--I'm not looking for a sort of statistically representative sample. The literature itself is quite incomplete of course, but we're limited to what exists for now.
Starting with the first step of collating publications, what tools would one use? I have access to most journals through various university affiliations. Are there particular APIs? Web scraping tools? LLMs?
Thanks!
One option that shouldn't be overlooked: get a temporary subscription to an OpenAI model that allows you to run what they originally called "deep research" (nowadays called "Extended Pro" mode.) This isn't available on the freebie chat page, it will require at least a $20/month subscription (and maybe more, not sure.) Then, basically paste your post into the prompt and let it crunch. It will take up to 30 minutes or so, and will often give you a reasonably comprehensive report in which most of the references actually exist. It is absolutely a better-Google-than-Google class of resource.
I'll do that and see if it comes up with anything meaningful, and also try it on Gemini 3.1. For a query like this I wouldn't expect it to return a list of thousands of individual reports, but it might give you some good leads that you can follow up with your existing journal access.
Edit: GPT results: https://chatgpt.com/share/699df5db-b3d4-800b-b737-224319593e...
Gemini 3.1 Pro results: https://gemini.google.com/share/bd22eb43c13b
Thanks. I've got an OpenAI subscription and tried this in the past, and got a handful of results, but nothing comprehensive. Perhaps it is better now, or I could change the way I ask.
No prob, see if there's anything useful in any of the links I added to the post. I'm always interested in good benchmarks and test cases, as I usually don't have enough of my own to justify my expensive pro subscriptions. (I did not review them myself as I don't know what I'm looking at.)
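If you end up with reference lists from several such runs, something like the following could merge them into one set of leads keyed by DOI. This is only a sketch under assumptions: the function names are made up for illustration, the regular expression is a loose approximation of DOI syntax, and the DOIs in the example are placeholders, not real papers. It deduplicates leads; it does not confirm that a reference actually exists.

```python
import re

# Loose approximation of a DOI: "10.", a 4-9 digit prefix, "/", then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+", re.IGNORECASE)

def normalize_doi(raw):
    """Return a lowercase bare DOI, or None if no DOI is found in raw."""
    match = DOI_PATTERN.search(raw.strip())
    if match is None:
        return None
    # Drop trailing punctuation that often clings to DOIs in prose.
    return match.group(0).rstrip(".,;)").lower()

def merge_leads(*lead_lists):
    """Merge lists of reference strings, keeping the first entry seen per DOI.

    Entries without a recognizable DOI are returned separately so they can
    be checked by hand.
    """
    by_doi = {}
    no_doi = []
    for leads in lead_lists:
        for lead in leads:
            doi = normalize_doi(lead)
            if doi is None:
                no_doi.append(lead)
            else:
                by_doi.setdefault(doi, lead)
    return by_doi, no_doi

# Placeholder inputs standing in for two LLM reports (not real references).
gpt_leads = ["https://doi.org/10.1234/PLACEHOLDER-A", "Some trench study, no DOI given"]
gemini_leads = ["doi:10.1234/placeholder-a.", "10.1234/placeholder-b"]

by_doi, no_doi = merge_leads(gpt_leads, gemini_leads)
print(sorted(by_doi))  # ['10.1234/placeholder-a', '10.1234/placeholder-b']
print(no_doi)          # ['Some trench study, no DOI given']
```

Keeping the no-DOI entries in a separate list matters here, since older trenching reports may not have DOIs at all and would otherwise be silently dropped.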