Building A Full-Text Index In Javascript
garysieling.com
This is pretty cool, but the fundamental problem is still that you (or someone else) have to load an entire PDF (or set of PDFs) before you can use the full-text index to search it.
If you're running a service (say like DocumentCloud) you're way better off precomputing a full text index on ingest and providing a search API than shunting over substantial parts of your stored documents.
Definitely cool as a piece of gear, but not terribly practical from a client-side perspective i'd think.
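To make the "precompute on ingest" idea concrete, here is a minimal sketch of an inverted index built once when documents arrive and then queried by a search function. The names (`tokenize`, `buildIndex`, `search`) and the sample documents are made up for illustration; a real service would add stemming, ranking, and persistent storage.

```javascript
// Hypothetical helper: split text into lowercase alphanumeric terms.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Build an inverted index at ingest time: term -> Set of document ids.
function buildIndex(docs) {
  const index = new Map();
  for (const [id, text] of Object.entries(docs)) {
    for (const term of tokenize(text)) {
      if (!index.has(term)) index.set(term, new Set());
      index.get(term).add(id);
    }
  }
  return index;
}

// AND query: return the sorted ids of documents containing every query term.
function search(index, query) {
  const terms = tokenize(query);
  if (terms.length === 0) return [];
  let result = null;
  for (const term of terms) {
    const ids = index.get(term) || new Set();
    result = result === null
      ? new Set(ids)
      : new Set([...result].filter((id) => ids.has(id)));
  }
  return [...result].sort();
}

// Sample data, invented for this sketch.
const docs = {
  "doc-1": "Full-text indexing of PDF documents",
  "doc-2": "Searching documents on the server",
};
const index = buildIndex(docs);
console.log(search(index, "documents"));     // ids of both documents
console.log(search(index, "pdf documents")); // only doc-1
```

With this shape, a search API only has to ship matching ids (or snippets) to the client rather than the documents themselves.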
Yes, that is certainly true. The other issue I see with the technique is that if I tried to scale this, I'd probably hit some maturity issues with these libraries.
For what it's worth, it looks like DocumentCloud uses Open Calais, which is a Thomson Reuters product - I used to work there in a different division, they have a bunch of interesting products in this space.
Oh neat, what'd you do at Thomson Reuters?
I notice your blog is filled with NLP-related goodies. I've been meaning to screw around with the Stanford NER lib, to see if I can train up some custom recognizers of any utility for particular document domains.
I worked on a bunch of products, but the longest-term one was a side product to WestLaw, Firm360, which was a market research tool for law firms that came from an acquisition (FindLaw). I worked on some of the data-warehousing stuff, and got to talk to a lot of people who worked on the content side. There were some teams near me that did similar things (People Map / KeyCite).
Thank you for posting this and for your hard work at TR. I'm developing something related to this stuff - http://youtu.be/3m194rui52Q (it's a really old video!)
Uses some sorts of social Open Calais-type activities
> Definitely cool as a piece of gear, but not terribly practical from a client-side perspective i'd think.
Perhaps for PDFs that are proprietary or sensitive. A related use case is transformation and extraction. I used this same technique recently for a client to turn VB6-generated PDF reports into HTML tables for preview, and to send the actual data to a service endpoint as JSON.
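A rough sketch of that transform step, assuming the report rows have already been extracted from the PDF as arrays of strings (the extraction itself is not shown). The function names and sample rows are invented for illustration and are not the code used for the client.

```javascript
// Escape text so it is safe to place inside HTML markup.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Render a header row plus data rows as an HTML table string for preview.
function rowsToHtmlTable(header, rows) {
  const head =
    "<tr>" + header.map((h) => `<th>${escapeHtml(h)}</th>`).join("") + "</tr>";
  const body = rows
    .map((r) => "<tr>" + r.map((c) => `<td>${escapeHtml(c)}</td>`).join("") + "</tr>")
    .join("");
  return `<table>${head}${body}</table>`;
}

// Turn the same rows into a JSON payload: one object per row, keyed by header.
function rowsToJson(header, rows) {
  return JSON.stringify(
    rows.map((r) => Object.fromEntries(header.map((h, i) => [h, r[i]])))
  );
}

// Sample rows, invented for this sketch.
const header = ["Item", "Total"];
const rows = [["Widgets", "12"], ["A&B <Ltd>", "7"]];
const html = rowsToHtmlTable(header, rows);
const payload = rowsToJson(header, rows);
console.log(html);
console.log(payload);
```

The HTML string goes to the preview pane, and the JSON string is what would be posted to the service endpoint.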
Lunr.js does provide a way to pre-compile the index on the server side. Check out the discussion and implementation in progress of using the pre-compilation for Jekyll: https://github.com/olivernn/lunr.js/issues/26 .
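The general shape of pre-compilation can be sketched in plain JavaScript: build the index in a server-side step, serialize it to JSON, and have the browser load the JSON instead of re-indexing. This is not Lunr's API; `precompileIndex`, `loadIndex`, and `lookup` are hypothetical names, and the sample posts are invented.

```javascript
// Build step (e.g. at site generation time): term -> array of document ids,
// serialized as a JSON string that can be written to a static file.
function precompileIndex(docs) {
  const index = new Map();
  for (const [id, text] of Object.entries(docs)) {
    // De-duplicate terms within one document so each id is listed once.
    const terms = new Set(text.toLowerCase().match(/[a-z0-9]+/g) || []);
    for (const term of terms) {
      if (!index.has(term)) index.set(term, []);
      index.get(term).push(id);
    }
  }
  return JSON.stringify(Object.fromEntries(index));
}

// Client step: parse the shipped JSON into a Map, with no re-indexing.
function loadIndex(json) {
  return new Map(Object.entries(JSON.parse(json)));
}

// Single-term lookup against the loaded index.
function lookup(index, term) {
  return index.get(term.toLowerCase()) || [];
}

// Sample posts, invented for this sketch.
const json = precompileIndex({
  "post-1": "Building a search index for Jekyll",
  "post-2": "Client-side search without a server",
});
const loaded = loadIndex(json);
console.log(lookup(loaded, "search")); // ids of both posts
console.log(lookup(loaded, "Jekyll")); // only post-1
```

The trade-off is download size: the client skips indexing work but still has to fetch the whole serialized index.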
Now all we need is for someone to port a LibreOffice editor to JavaScript :)
Yeah, the thought crossed my mind. There are enough people trying to make online products like Google Docs or places to post Powerpoint presentations that it may have already happened somewhere internally. Or, maybe everyone is just using LibreOffice/Muhimbi and doing it all on the server.
lunr.js looks pretty nice, and seems very useful for small browser-based stuff. For something a bit more heavyweight, I've used natural node[1], which is quite good, though not available in the browser.
That one looks neat - it has some interesting NLP features like WordNet integration and Bayes classification.