Show HN: An API to extract texts from images and PDF files

stamplin.com

40 points by smougel 13 years ago · 32 comments


zdw 13 years ago

What's the benefit to using this over `pdftotext` and/or `pdfimages | convert | tesseract`?

trez 13 years ago

We think it's easier to integrate, as there is nothing to install and nothing to maintain. We also do some pre- and post-processing, which allows orientation detection, for example. We think it might also be useful on mobile devices, which are limited in resources. Also, for PDFs, we take a hybrid approach that avoids running slow OCR on text that is easy to extract directly. It's still quite a young product, though, and new advanced features are coming.
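
A rough sketch of the hybrid idea described above: try the cheap embedded-text extraction first and fall back to OCR only when it yields almost nothing. This is not Stamplin's code; `hybrid_extract`, the `min_chars` threshold, and the two injected callables are hypothetical names chosen for illustration.

```ruby
# Hypothetical sketch of a "text layer first, OCR second" strategy.
# Both extractors are passed in as callables so the decision logic
# can be exercised without pdftotext or tesseract installed.
def hybrid_extract(pdf_path, text_extractor:, ocr_extractor:, min_chars: 20)
  text = text_extractor.call(pdf_path).to_s
  # Count non-whitespace characters; scanned PDFs usually yield few or none.
  if text.gsub(/\s+/, "").length >= min_chars
    return { method: :text_layer, text: text }
  end

  { method: :ocr, text: ocr_extractor.call(pdf_path).to_s }
end

# Example wiring with stand-in extractors; a real setup would shell out
# to something like pdftotext and tesseract here instead.
fast = ->(path) { path.end_with?("scan.pdf") ? "" : "Invoice 42: total due is 100 EUR" }
slow = ->(_path) { "text recovered by OCR" }

born_digital = hybrid_extract("report.pdf", text_extractor: fast, ocr_extractor: slow)
scanned      = hybrid_extract("scan.pdf", text_extractor: fast, ocr_extractor: slow)
```

The threshold is a guess; a page with only a header or page number in its text layer would need a smarter check than a character count.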

jingo 13 years ago

The benefit is stamplin.com gets insight on what people are viewing and reading. They get to see what the user sees. They can compile a database and use or sell that information to be used for marketing purposes.
Also, it's an "API" (looks more like a URL pointing to a CGI program to me, but whatever). APIs are "cool" and "fun", while running local programs that you have control over is old and boring and not the future of computing.
- trez 13 years ago
  
  Our API is quite new, and we understand it doesn't offer outstanding value for everybody, as it targets ease of use for the moment, but the next release is going to add more advanced features.
  As for us using your data, our privacy policy will clarify that.

angersock 13 years ago

I have sought to answer exactly that question above.
:)
I love the happy little Unix-style legos of productivity.

taf2 13 years ago

http://www.stamplin.com/api/ returns 403 when clicking on the API docs after confirming an account via email link

trez 13 years ago

sorry, the correct url is http://www.stamplin.com/api/docs/

angersock 13 years ago

I come bearing gifts, if anyone would like to host some of this themselves.

This follows the API documented by Stamplin (minus the throttling errors). It does not currently do the OCR, but as mentioned elsewhere by zdw you can probably get tesseract to get you like 80% of the way there. If you wanted to use that, you'd likely just replace the hacky `pdftotext` callout with your preferred toolchain.

You'll need Ruby, Sinatra, and the Xpdf tools, I believe.

Dual-licensed under the AGPL, BSD, and WTFPL licenses. idklol.

The code:

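  require 'sinatra'
  require 'json'

  use Rack::Logger

  post '/extracttext' do
    # No upload, nothing to extract.
    halt 204 if params["file"].nil?

    # Accepted for compatibility with the documented API; not used yet.
    type = params["type"] || "text"
    lang = params["lang"] || "en"

    begin
      tmpfilename = params["file"][:tempfile].path
      txtfilename = "#{tmpfilename}.txt"

      # Pass the arguments as a list (no shell involved) and name the
      # output file explicitly so we know where pdftotext wrote it.
      halt 500 unless system("pdftotext", tmpfilename, txtfilename)

      lines = File.read(txtfilename).split("\n")

      content_type "application/json"
      {"text" => lines}.to_json
    rescue StandardError
      halt 500
    ensure
      # Clean up both the upload and the converted text.
      [tmpfilename, txtfilename].compact.each do |f|
        File.delete(f) if File.exist?(f)
      end
    end
  end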

EDIT:

For God's sake run this in a jail and only on an internal network!
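
For anyone who wants to try the swap mentioned above, here is a hedged sketch of the `pdfimages | convert | tesseract` route zdw described. The helper names (`ocr_commands`, `ocr_pdf`) and the file naming are made up for illustration, and exact behaviour may differ between Xpdf/Poppler, ImageMagick, and Tesseract versions. Note that Tesseract expects language codes like `eng`, not the `en` default used in the server above.

```ruby
require 'tmpdir'

# Hypothetical helper: the argv lists for OCR-ing one extracted image.
# Assumes ImageMagick's `convert` and the `tesseract` CLI are on PATH.
def ocr_commands(image_path, lang: "eng")
  base = image_path.chomp(File.extname(image_path))
  tiff = "#{base}.tif"
  [
    ["convert", image_path, tiff],         # normalise the image to TIFF
    ["tesseract", tiff, base, "-l", lang]  # writes "#{base}.txt"
  ]
end

# Hypothetical driver: extract embedded images, OCR each, join the text.
def ocr_pdf(pdf_path, lang: "eng")
  Dir.mktmpdir do |dir|
    prefix = File.join(dir, "img")
    system("pdfimages", pdf_path, prefix) or raise "pdfimages failed"

    Dir.glob("#{prefix}-*").sort.map do |img|
      ocr_commands(img, lang: lang).each do |cmd|
        system(*cmd) or raise "#{cmd.first} failed"
      end
      File.read(img.chomp(File.extname(img)) + ".txt")
    end.join("\n")
  end
end
```

In the server, `ocr_pdf(tmpfilename)` could stand in for the `pdftotext` call when a PDF has no usable text layer. The same warning applies: only run this on input you trust, or inside a jail.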

rpedela 13 years ago

I like the concept and it is a good start. Pulling text from PDFs is especially painful. I think the output format needs improvement. It is just a large array of strings. It seems like the strings are sometimes a single line, and sometimes not. My particular use case is extracting raw data from a PDF. I would like to see more structure to the output. For example, knowing where new lines, tabs, etc are located would be very helpful for parsing raw data.

Here is the PDF I used to test: https://www.gov.uk/government/uploads/system/uploads/attachm...

Is there a technical reason for the 1-2MB limit or is it arbitrary?
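
One way to get the kind of structure asked for above, sketched under assumptions: `pdftotext` has a `-layout` option that tries to preserve the physical layout, and it separates pages with a form feed character, so the raw output can be split into pages and lines instead of one flat array. The function name and output shape below are hypothetical, not what the Stamplin API returns.

```ruby
require 'json'

# Hypothetical post-processing of raw `pdftotext -layout` output.
# Pages are separated by form feeds ("\f"); lines keep their leading
# whitespace so column alignment survives for later parsing.
def structure_text(raw)
  raw.split("\f").map.with_index(1) do |page, number|
    { "page" => number, "lines" => page.split("\n").map(&:rstrip) }
  end
end

sample = "Name      Qty\nWidget      3\n\fPage two\n"
puts JSON.pretty_generate(structure_text(sample))
```

Keeping the whitespace inside each line means a client can still split columns on runs of spaces, which is usually enough for simple tabular data.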

trez 13 years ago

Thanks for your comment!
That's something we can provide pretty easily, and we will try to include it in our next release. If you'd like help with your specific problem, please send us an email at info@stamplin.com.
The limit was set to keep our server from crashing, as we do not have, for the moment, the financial means to support a massive server farm. Again, if this limit prevents you from using our API, we may raise it if you ask by email.

mappum 13 years ago

The OCR is really useless. I tested it with some reddit "advice animal" memes (because there is a need for transcriptions). You would think that text is pretty simple and easy, but the output I got was like:

    /\n\nnmrs wn\ufb02qyi mm mm\nTlIIEI\ufb02|\ufb02llllM\u2018l co

trez 13 years ago

Sorry that didn't work properly for you. We are working on improving the quality of our OCR results. Could you please send the file that produced this useless result to info@stamplin.com?

gkoberger 13 years ago

The upgrade button doesn't work, and nobody is going to hover long enough to see the "Not Available Yet" title. And the current 10 requests isn't even enough to test with.

I'm excited to try this.. so figure out a way to take my money soon.

gnosis 13 years ago

This looks nice except for having to depend on your servers as a middle man.

Any chance you could release the code as Free or open source so that its users can use it standalone on their own machines?

trez 13 years ago

That's not planned at the moment, but if we don't find a way to monetize it, we will do it for sure.

RivieraKid 13 years ago

Why would someone want to use an API instead of a library?

trez 13 years ago

Some languages might not have an appropriate library, and some developers might not want heavy processes running on their devices (mobiles). We also think it's easier to use, as there is nothing to install. It mainly depends on your use case.
- RivieraKid 13 years ago
  
  I agree that there might be situations where it can be useful, but:
  1) Mobiles have pretty good CPUs. I think uploading and waiting for a response would be slower and less reliable.
  2) If the mobile user doesn't have an internet connection, the app won't work.
  3) As a developer, I would be dependent on an external service that could stop working someday.

rpedela 13 years ago

Can I assume API keys are on the roadmap? I don't particularly like using my username and password.

trez 13 years ago

Yes, we'd like to increase security with each release. It should be available in one of the next releases.
- rpedela 13 years ago
  
  Great!

antrover 13 years ago

Nice. Are you using the Tesseract OCR lib at the core of the extraction?

trez 13 years ago

Yes, we are.

it_learnses 13 years ago

Any custom requests? Let us *know.

trez 13 years ago

thx, I am gonna fix that
- trez 13 years ago
  
  fixed

smougel (OP) 13 years ago

Any Feedback Welcomed

sebg 13 years ago

Looks good - does it do data tables? That's a big issue and something I've heard about (run into) many times...
- trez 13 years ago
  
  Thanks for your comment! We would really appreciate it if you could explain in more detail the problems you faced. I'm sending you an email to discuss that, if that's OK with you.
  - sebg 13 years ago
    
    responded to your email. good luck.
