Show HN: OCR Search – We made scanned US Gov lease documents searchable

ocr-search.joyspace.ai

3 points by sagar-co 2 years ago · 2 comments · 2 min read

Hi HN, we launched OCR search as part of our document search engine. To demonstrate the OCR capability, we took the scanned lease documents available from the General Services Administration (GSA) and performed OCR on all lease documents available for Washington state. The documents are then indexed and loaded into our search engine.
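
As a rough illustration of the "OCR, then index, then search" flow described above, here is a minimal sketch of an inverted index over OCR'd text. It is not the actual Joyspace pipeline: the file names, the sample text, and the helper names (`tokenize`, `build_index`, `search`) are all hypothetical, and the OCR step itself is assumed to have already produced plain text.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in pages.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents containing every query token (AND semantics)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    # Copy the first posting set so the index itself is never mutated.
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical OCR output, keyed by an assumed file name.
ocr_pages = {
    "lease-001.pdf": "Lease No. 1234 Seattle WA annual rent $120,000",
    "lease-002.pdf": "Lease No. 5678 Spokane WA annual rent $85,500",
}

index = build_index(ocr_pages)
print(search(index, "seattle rent"))
```

A production engine would add ranking, phrase queries, and tolerance for OCR misreads, but the core idea of mapping tokens to documents is the same.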

We can perform both regular and zonal OCR on documents. This demo does not use zonal OCR. We can perform OCR on thousands of documents within one hour. The system is scalable: all it needs is more workers for OCR and indexing.

Check out the demo and the fully functional search engine at: https://ocr-search.joyspace.ai.

We have created a video that explains this in more detail: https://www.youtube.com/watch?v=7EG9TPysBpU

We see high OCR accuracy. You can search for any word, number, or obscure character, and also search table data within documents.

While this demo is for scanned documents, we support HTML, PDF (regular and scanned), RTF, DOC, DOCX, CSV, Excel, JSON, and other standard document formats for search.

You can get access to our APIs if you are interested in building a search experience into your application. Our search engine APIs are highly available.

You can visit https://www.joyspace.ai to get access to our search engine.

We manage indexing and search pipelines for you.

Happy to answer any questions about OCR and search.

kerupian 2 years ago

I watched the demo. This looks quite interesting to me.

I have done some work on PDFs before, and I know extracting information from PDFs is hard.

Kudos to you for building a search for scanned PDFs.

Do I have to manage chunking for the search engine?

You mentioned APIs. Do you support multiple clouds? For example, I have some data in Dropbox, S3, GDrive, and R2. Will I be able to connect all of these?

Can you tell me more about data security?

Either way, looks impressive for data engineering and ML pipelines.

sagar-co (OP) 2 years ago

Scanned documents contain vital information. More often than not, these documents are not searchable, and when they are, a lot of information is still missing.

We built OCR search to showcase how you can extract data from scanned documents and make the data within them searchable.

A lot of new possibilities get unlocked with this information. For example, you can build an AI copilot or RAG pipelines, or power search within your application.
