Show HN: Convert scanned documents into searchable PDFs

73 points by choogi 10 years ago · 21 comments

Is this based on an open-source OCR engine, a proprietary engine running on your own server(s), or a proprietary engine you're accessing as a service?

raphman 10 years ago

Given that the OCR'ed PDFs use the "GlyphLessFont" font, it seems that tesseract [1] is used.
[1] https://github.com/tesseract-ocr/tesseract
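
The font-name heuristic above can be sketched as a plain byte search. This is only a minimal sketch: it assumes the font name is stored uncompressed in the file, the helper name `looks_like_tesseract_pdf` is hypothetical, and a match is a hint about the engine rather than proof.

```python
from pathlib import Path

# Font name mentioned in the comment above; its presence is a hint, not proof.
MARKER = b"GlyphLessFont"


def looks_like_tesseract_pdf(path):
    """Return True if the raw PDF bytes mention the GlyphLessFont font name.

    Hypothetical helper: it does not parse the PDF, so it will miss a font
    name that is stored inside a compressed object stream.
    """
    return MARKER in Path(path).read_bytes()
```

Calling `looks_like_tesseract_pdf("converted.pdf")` on a downloaded result (the file name is a placeholder) would answer the question without opening a PDF viewer.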
- cfcef 10 years ago
  
  I hope not. Tesseract delivers bad results on high-quality scans, far below the OCR quality achieved by services like Google Books.
  What the OCR market needs is someone who will bring that level of OCR quality - or better - to the masses (perhaps some deep learning grad student with time to kill?), not yet another wrapper around Tesseract. We have those already!
  - jiaweihli 10 years ago
    
    Have you looked into ocropy[0]?
    Here's a nice intro[1] that later talks about how it achieves higher accuracy using an LSTM model[2].
    [0] https://github.com/tmbdev/ocropy
    [1] http://www.danvk.org/2015/01/09/extracting-text-from-an-imag...
    [2] http://www.danvk.org/2015/01/11/training-an-ocropus-ocr-mode...
    
    cfcef 10 years ago
    
    I have not. It sounds interesting but raw and unsuitable for end-users. I hope the quality improves and they can get it packaged up in a way that existing document scanners can plug into easily.
    
    jahewson 10 years ago
    
    Note that the primary author of ocropy (formerly ocropus) works at Google.
  - acdha 10 years ago
    
    Is the problem really Tesseract or the fact that it doesn't have a robust front-end performing segmentation, de-skewing, better binarization, etc? I've heard that Google Books is actually using the Tesseract engine but has seen better results in part from better training but mostly from a more advanced system breaking each page into the blocks of text which are actually OCRed.
  - mynewtb 10 years ago
    
    I have had great results using tesseract via gimagereader. Are you sure your configuration is good?
    
    random778 10 years ago
    
    Possible to upload an example image + result?

zurbi 10 years ago

Very clean UI. But how can one judge the OCR quality of this service? The service presents me a converted PDF, but how good was the conversion?

Is this better than https://ocr.space ?

For my private documents I would always use offline OCR software like http://blog.a9t9.com/p/free-ocr-software.html

bmh_ca 10 years ago

While this is interesting and looks to be a needed service, the page leaves many questions unanswered, such as:

What's the privacy model? While the PDFs are deleted, what happens to the searchable content? Is it also deleted?

What's the revenue model? How can we be sure it'll be around in a few months?

Is there an AJAX interface?

Is the quality or performance better than running Tesseract on a server?
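
For comparison, the "Tesseract on a server" baseline from the last question can be sketched in a few lines. This is a rough sketch, not the service's actual method: it assumes the `tesseract` command-line tool is installed and on the PATH, and the function names and file names are hypothetical.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the argv for a Tesseract run that writes <output_base>.pdf."""
    # The trailing "pdf" selects Tesseract's searchable-PDF output config.
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "pdf"]


def make_searchable_pdf(image_path, output_base, lang="eng"):
    """Run Tesseract and return the path of the PDF it should have written."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return f"{output_base}.pdf"
```

Running `make_searchable_pdf("scan.png", "scan")` on the same page uploaded to the service would give a side-by-side result for judging quality and speed.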

Cheyana 10 years ago

Also relevant: how do we know they're not injecting a PDF exploit into the final document? There is no real company information to hold anyone accountable. There could be a dozen websites like this for people to use for "free" that will inject a malicious script which most antivirus apps won't detect. Not saying it's not useful, and it's awesome if legit, but there should be more accountability. This is almost the internet equivalent of a stranger in a car waving free candy at a child walking down the street. My workplace was hit yet again today with a Cryptolocker variant (second time this month), which required us to restore thousands of files from backup. All from clicking on a link in an email.

jes 10 years ago

I would use this service, if I had scanned PDFs where I didn't care about confidentiality. As it stands, though, uploading them to an unknown web resource seems risky.

Thoughts?

hondo77 10 years ago

I use PDFScanner on my Mac. Works great at scanning time or post-scanning. No, it's not free but it's worth it. Pay the $15, ya cheap bastiches! :-)

BTW, how is this news?

theGimp 10 years ago

HN is not just for news. It's for whatever you deem worthy of sharing.
It comes down to how many people agree it's interesting by upvoting :)

rm_-rf_slash 10 years ago

I've had this idea for a while, but as an iPhone app. The case where I could have used it the most was when I would be studying and looking through textbooks for a particular word or phrase. It would be so convenient to just take a picture, input the text to look for, and see a highlight. If this were a mobile app and I were still in college, I would most certainly buy it.

callesgg 10 years ago

I just use the OCR function built in to Adobe Acrobat.

Don't know if the OCR function is available in the Reader version.

petemc_ 10 years ago

Same here. I have been using Acrobat X for this purpose for years and am very impressed by how good the OCR is.

patrickfl 10 years ago

It has been hanging here in Firefox for about 5-10 minutes now. It's a receipt for my insurance (no private info), about 2 pages in length.

Either way, super cool idea. My Dad will be stoked about this as he's been OCR'ing his way into oblivion for the past few years.

panglott 10 years ago

Is this Web site accessible (say, via screen-reader)? Scanned PDFs can be a huge problem for people who are visually impaired.

Omnipresent 10 years ago

Is this based on tesseract?