Building A Full-Text Index In Javascript

70 points by olivernn 13 years ago · 11 comments


This is pretty cool, but the fundamental problem is still that you (or someone else) have to load an entire PDF (or set of PDFs) before you can use the full-text index to search it.

If you're running a service (say, like DocumentCloud), you're way better off precomputing a full-text index on ingest and providing a search API than shunting substantial parts of your stored documents over to the client.

Definitely cool as a piece of gear, but not terribly practical from a client-side perspective, I'd think.
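
To make the "index on ingest, search over an API" idea concrete, here is a minimal sketch of an inverted index that a server could build once and then query. Everything in it (`tokenize`, `buildIndex`, `search`, and the sample documents) is hypothetical and made up for illustration; it is not how DocumentCloud or lunr.js actually does it, and it does no stemming or ranking.

```javascript
// Hypothetical sketch: build an inverted index once at ingest time,
// then answer queries from it without shipping documents to the client.

// Lowercase the text and split it into alphanumeric terms.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Map each term to the set of document ids that contain it.
// Document ids are assumed to be strings.
function buildIndex(docs) {
  const index = new Map();
  for (const doc of docs) {
    for (const term of tokenize(doc.text)) {
      if (!index.has(term)) index.set(term, new Set());
      index.get(term).add(doc.id);
    }
  }
  return index;
}

// Return the sorted ids of documents containing every query term (AND search).
function search(index, query) {
  const terms = tokenize(query);
  if (terms.length === 0) return [];
  let result = null;
  for (const term of terms) {
    const ids = index.get(term) || new Set();
    result = result === null
      ? new Set(ids)
      : new Set([...result].filter((id) => ids.has(id)));
  }
  return [...result].sort();
}
```

A search endpoint would then only need to return the matching ids (plus whatever snippet you want to show), not the documents themselves.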

garysieling 13 years ago

Yes, that is certainly true. The other issue I see with the technique is that if I tried to scale it, I'd probably hit some maturity issues with these libraries.
For what it's worth, it looks like DocumentCloud uses Open Calais, which is a Thomson Reuters product. I used to work there in a different division; they have a bunch of interesting products in this space.
- knowtheory 13 years ago
  
  Oh neat, what'd you do at Thomson Reuters?
  I notice your blog is filled with NLP-related goodies. I've been meaning to screw around with the Stanford NER lib to see if I can train up some custom recognizers of any utility for particular document domains.
  - garysieling 13 years ago
    
    I worked on a bunch of products, but the longest-running one was Firm360, a side product to WestLaw: a market research tool for law firms that came from an acquisition (FindLaw). I worked on some of the data-warehousing stuff, and got to talk to a lot of people who worked on the content side. There were some teams near me that did similar things (People Map / KeyCite).
    
    MWil 13 years ago
    
    Thank you for posting this and for your hard work at TR. I'm developing something related to this stuff - http://youtu.be/3m194rui52Q (it's a really old video!)
    It uses some sorts of social Open Calais-type activities.
xaritas 13 years ago

> Definitely cool as a piece of gear, but not terribly practical from a client-side perspective i'd think.
Perhaps for PDFs that are proprietary or sensitive. A related use case is transformation and extraction. I used this same technique recently for a client to turn VB6-generated PDF reports into HTML tables for preview, and to send the actual data to a service endpoint as JSON.
arafalov 13 years ago

Lunr.js does provide a way to pre-compile the index on the server side. Check out the discussion and implementation in progress of using the pre-compilation for Jekyll: https://github.com/olivernn/lunr.js/issues/26 .
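
As a rough illustration of what pre-compiling buys you, here is a dependency-free sketch: build the index on the server, dump it to JSON, and have the browser load that JSON instead of the raw documents. Note this is NOT lunr.js's API or its serialization format; `buildIndex`, `serializeIndex`, `loadIndex` and `lookup` are hypothetical names invented for this example.

```javascript
// Hypothetical sketch of a pre-compiled index. Not lunr.js's actual API.

// Lowercase the text and split it into alphanumeric terms.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Server side: map each term to the list of document ids containing it.
// A null-prototype object avoids clashes with names like "constructor".
function buildIndex(docs) {
  const postings = Object.create(null);
  for (const doc of docs) {
    // A Set ensures each document id is recorded once per term.
    for (const term of new Set(tokenize(doc.text))) {
      (postings[term] = postings[term] || []).push(doc.id);
    }
  }
  return postings;
}

// Server side: the string you would write to disk or serve as a static file.
function serializeIndex(postings) {
  return JSON.stringify(postings);
}

// Client side: parse the pre-built index; no document text is needed.
function loadIndex(json) {
  return JSON.parse(json);
}

// Client side: ids of documents containing a single term.
function lookup(postings, term) {
  const key = term.toLowerCase();
  return Object.prototype.hasOwnProperty.call(postings, key)
    ? postings[key]
    : [];
}
```

The trade-off is the same one discussed in the linked issue: the client skips the indexing work, but still has to download the whole serialized index.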

Ygg2 13 years ago

Now all we need is for someone to port a LibreOffice editor to JavaScript :)

garysieling 13 years ago

Yeah, the thought crossed my mind. There are enough people trying to make online products like Google Docs or places to post PowerPoint presentations that it may have already happened somewhere internally. Or maybe everyone is just using LibreOffice/Muhimbi and doing it all on the server.

binarymax 13 years ago

lunr.js looks pretty nice, and seems very useful for tiny browser-based stuff. For something a bit more heavyweight, I've used natural node[1], which is quite good, though not available in the browser.

[1] https://github.com/NaturalNode/natural

garysieling 13 years ago

That one looks neat. It has some interesting NLP features like WordNet integration and Bayes classification.
