For comparison, I recently ran an OCR evaluation as part of some work for a professor. For context, all documents were 1960s-era typed or handwritten documents in English, specifically from this archive: http://allenarchive.iac.gatech.edu/. I hand-transcribed 50 documents to use as a baseline and ran them through the various OCR engines, getting the results below.
Leven is Levenshtein distance. Overall is a weighted average of typed vs. handwritten, 90/10 if I recall correctly. All results were run on my personal machine with a 5950X, 128 GB RAM, and an RTX 3080.
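For anyone who hasn't worked with these metrics, here is a rough sketch of how the two scores could be computed. This is not the script I actually ran; the function names and the default 0.9 typed weight (from the 90/10 split I recalled above) are assumptions for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # distance is symmetric; keep the inner row short
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def overall(typed_score: float, handwritten_score: float,
            typed_weight: float = 0.9) -> float:
    """Weighted average of the typed and handwritten scores.
    The 0.9 default is an assumed weight, not a confirmed one."""
    return typed_weight * typed_score + (1 - typed_weight) * handwritten_score
```

You would call levenshtein(ocr_text, hand_transcription) per document, average within the typed and handwritten groups, and then combine the two averages with overall(). Lower is better, since the distance counts edits.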
From my analysis, Amazon Textract was excellent, the best of all the paid ones. TrOCR and PaddleOCR were the best FOSS ones, but the issue with them is that they require a GPU, while Tesseract I could use on CPU alone. For instance, to OCR all 50 documents.
TrOCR is great if you need to do a few documents or have GPUs to burn, but Tesseract is by far the better choice if you need good-enough results for a large volume of documents. For my project, the intent was to make a software plugin that could be sent to libraries and universities, so CPU is king.
I do mean to, for sure, but the project isn't done yet. There's some NLP work we're doing with the results of the OCR, and I really want to do a full series going over everything we've done rather than publish one post and then another two months later.
I second the first user's sentiment. Things are going to keep changing forever: better solutions come around and existing solutions keep improving. I would love to read a blog post on the current state of your research already.