Ask HN: What OCR tool do you use in your project?
I am working on a project where I want to extract data from PDF documents. Sometimes these are scanned PDFs or forms.
I am looking for an OCR tool (paid or open source) that can effectively extract data from poorly scanned documents and forms. What do you use?

It depends on the amount, format, and quality of your input. There are free / open-source tools (like Tesseract), but to get results nearly comparable to commercial tools, some manual or (semi-)automatic preprocessing steps are essential (thresholding / binarization, deskewing, noise removal[1]). Some Tesseract-based solutions have automatic preprocessing better integrated; you could take a look at Papermerge or other self-hosted document management solutions[2]. There are also commercial SDKs built around Tesseract at a good price point, like Vintasoft OCR[5], which supports automatic preprocessing and delivers decent quality.

If you don't mind a (free) clicking adventure for small numbers of documents, you could also try the free version of PDF-XChange Editor[3], which has a small but pretty good "OCR to embedded PDF layer" option that makes PDFs searchable. The embedded OCR data cannot be easily extracted, though.

The best "no cloud" / offline solution I found was ABBYY FineReader[4], which also has a command-line tool. But if you really want a ready-to-use, easy, good-quality solution, I would go with Google Lens (if you don't mind Google).

[1] https://towardsdatascience.com/pre-processing-in-ocr-fc231c6...

[2] https://github.com/awesome-selfhosted/awesome-selfhosted#doc...

[3] https://www.tracker-software.com/product/pdf-xchange-editor

A bit off topic, but I've just started using Google Lens to extract whole pages from books with my phone. Near-perfect conversion to text is great for taking notes.

Google Lens works great for individual use cases; I wonder what they are using behind the scenes.

In my case I need to extract data server-side, so a library/API would be most suitable.

I still use Tesseract.
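The binarization step mentioned above is often the single biggest quality lever before OCR. As an illustrative sketch (not tied to any particular OCR library), Otsu's method picks a global threshold automatically by maximizing the between-class variance of the grayscale histogram; here it operates on a flat list of 0-255 pixel values:

```python
def otsu_threshold(pixels):
    """Pick a global threshold via Otsu's method for a flat list of 0-255 values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * hist[i] for i in range(256))
    sum_bg = 0      # running intensity sum of the background class
    weight_bg = 0   # running pixel count of the background class
    best_t, best_var = 0, -1.0
    for t in range(256):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance: larger means a cleaner split.
        var = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def binarize(pixels, threshold):
    """Map each pixel to pure black (0) or pure white (255)."""
    return [255 if p > threshold else 0 for p in pixels]
```

Real pipelines would do this per-tile (adaptive thresholding) on a 2-D image, e.g. with OpenCV, but the idea is the same: dark ink goes to 0, paper goes to 255, and the OCR engine sees a clean bitonal page.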
It's not the fastest or most accurate anymore, but it gets what I need off of PDF files.

Does it work well with scanned PDFs?
In my experiments it was not giving the correct output.

Explore the different page segmentation modes, and make sure you are using v4 (it's a massive step up).

We started using Tesseract for a project that needed to extract text from video frames, but in the end we moved to EasyOCR, as it needed less preprocessing for our use case.

What languages do you need to support? Off-the-shelf models don't work well on non-Latin languages; you may need to train your own.
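For reference, both the page segmentation mode and the engine version are selected on the Tesseract command line; a typical invocation for a scanned PDF might look like this (file names are hypothetical, and the PDF first has to be rasterized since Tesseract only reads images):

```shell
# Rasterize the scanned PDF at 300 DPI (requires poppler-utils);
# produces page-1.png, page-2.png, ...
pdftoppm -r 300 -png scan.pdf page

# OCR one page with Tesseract 4+:
#   --oem 1  selects the newer LSTM engine (the v4 "massive step up")
#   --psm 6  assumes a single uniform block of text
# Writes the recognized text to out.txt.
tesseract page-1.png out --oem 1 --psm 6
```

`tesseract --help-psm` lists all segmentation modes; forms and sparse layouts often do better with `--psm 4` or `--psm 11` than with the default.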