Building A Full-Text Index In Javascript

garysieling.com

70 points by olivernn 13 years ago · 11 comments

knowtheory 13 years ago

This is pretty cool, but the fundamental problem is still that you (or someone else) have to load an entire PDF (or set of PDFs) before you can use the full-text indexing to search it.

If you're running a service (say like DocumentCloud) you're way better off precomputing a full text index on ingest and providing a search API than shunting over substantial parts of your stored documents.

Definitely cool as a piece of gear, but not terribly practical from a client-side perspective i'd think.

  • garysieling 13 years ago

    Yes, that is certainly true. The other issue I see with the technique is that if I tried to scale it, I'd probably hit some maturity issues with these libraries.

    For what it's worth, it looks like DocumentCloud uses Open Calais, which is a Thomson Reuters product - I used to work there in a different division, they have a bunch of interesting products in this space.

    • knowtheory 13 years ago

      Oh neat, what'd you do at Thomson Reuters?

      I notice your blog is filled with NLP related goodies. I've been meaning to screw around with Stanford NER lib, to see if i can train up some custom recognizers for particular document domains of any utility.

      • garysieling 13 years ago

        I worked on a bunch of products, but the longest-term one was a side product to WestLaw, Firm360, which was a market research tool for law firms that came from an acquisition (FindLaw). I worked on some of the data-warehousing stuff, and got to talk to a lot of people who worked on the content side. There were some teams near me that did similar things (People Map / KeyCite).

        • MWil 13 years ago

          Thank you for posting this and for your hard work at TR. I'm developing something related to this stuff - http://youtu.be/3m194rui52Q (it's a really old video!)

          Uses some sorts of social Open Calais-type activities

  • xaritas 13 years ago

    > Definitely cool as a piece of gear, but not terribly practical from a client-side perspective i'd think.

    Perhaps for PDFs that are proprietary or sensitive. A related use case is transformation and extraction. I used this same technique recently for a client to turn VB6-generated PDF reports into HTML tables for preview, and to send the actual data to a service endpoint as JSON.

  • arafalov 13 years ago

    Lunr.js does provide a way to pre-compile the index on the server side. Check out the discussion and implementation in progress of using the pre-compilation for Jekyll: https://github.com/olivernn/lunr.js/issues/26 .
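The pre-computation idea raised in this subthread (build the index once on the server, ship only the serialized index to the browser) can be sketched in plain JavaScript. This is a minimal illustration under stated assumptions, not lunr.js's API: the function names `tokenize`, `buildIndex`, `serializeIndex`, `loadIndex`, and `search`, and the two sample documents, are all hypothetical. It also skips stemming, stop words, and scoring, which a real library would handle.

```javascript
// Hypothetical sketch of a precomputed inverted index; NOT lunr.js's API.

// Split text into lowercase alphanumeric tokens.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// "Server side": map each token to the ids of the documents containing it.
function buildIndex(docs) {
  const postings = new Map();
  for (const doc of docs) {
    // A Set ensures each document id is recorded once per token.
    for (const token of new Set(tokenize(doc.body))) {
      if (!postings.has(token)) postings.set(token, []);
      postings.get(token).push(doc.id);
    }
  }
  return postings;
}

// Serialize the index so it can be shipped instead of the documents.
function serializeIndex(postings) {
  return JSON.stringify([...postings]);
}

// "Client side": rebuild the Map from the shipped JSON.
function loadIndex(json) {
  return new Map(JSON.parse(json));
}

// Return ids of documents that contain every query token (AND semantics).
function search(postings, query) {
  const tokens = tokenize(query);
  if (tokens.length === 0) return [];
  let result = (postings.get(tokens[0]) || []).slice();
  for (const token of tokens.slice(1)) {
    const ids = new Set(postings.get(token) || []);
    result = result.filter((id) => ids.has(id));
  }
  return result;
}

// Sample data (made up for illustration).
const docs = [
  { id: "a", body: "Building a full-text index in JavaScript" },
  { id: "b", body: "Precompute the index on ingest" },
];

const shipped = serializeIndex(buildIndex(docs));
const clientIndex = loadIndex(shipped);

console.log(search(clientIndex, "index"));
console.log(search(clientIndex, "full text index"));
```

The client only ever sees `shipped`, which holds tokens and document ids rather than the document bodies. That is the trade-off under discussion: the index can still be sizable, but the source PDFs stay on the server.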

Ygg2 13 years ago

Now all we need is for someone to port a LibreOffice editor to JavaScript :)

  • garysieling 13 years ago

    Yeah, the thought crossed my mind. There are enough people trying to make online products like Google Docs or places to post PowerPoint presentations that it may have already happened somewhere internally. Or, maybe everyone is just using LibreOffice/Muhimbi and doing it all on the server.

binarymax 13 years ago

lunr.js looks pretty nice, seems very useful for tiny browser-based stuff. For something a bit more heavyweight, I've used natural node[1], which is quite good, though not available in the browser.

[1] https://github.com/NaturalNode/natural

  • garysieling 13 years ago

    That one looks neat - it has some interesting NLP features like WordNet integration and Bayes classification.
