Help me with OCR and indexing of old books with tables, data, etc

RedditCrossPostBotB to DataHoarderEnglish · 2 months ago

https://preview.redd.it/zp9vlha0vmoe1.png?width=1200&format=png&auto=webp&s=25233afd4d8804e65b7d6dff7bab03f33fe6ef53

I want to start a personal project where I scan and OCR old books, then index the results as markdown. This is a book with ALL of Romania’s roads as of 1974. It has tables, maps, and all sorts of other interesting historical data.

I already have some background in data engineering. I’m a software engineer and I’ve built a project that helps with RAG, search, and indexing of markdown files (even very big ones). My problem is the OCR part. Any tips?

Originally posted by u/alexlazar98 on Reddit.com/r/datahoarder

beep boop I’m a bot to seed discussions from Reddit. Upvote or downvote posts like normal, discuss the topics here as well!
