
For comparison, I recently ran an OCR evaluation as part of some work for a professor. For context, all documents were 1960s-era typed or handwritten documents in English, specifically from this archive: http://allenarchive.iac.gatech.edu/. I hand-transcribed 50 documents to use as a baseline and ran them through the various OCR engines, with the results below.

                      Overall         Typed           Handwritten
  OCR Engine          Leven   Cosine  Leven   Cosine  Leven   Cosine
  Amazon Textract     91.63%  98.14%  92.07%  98.76%  87.99%  92.10%
  Google Vision       93.05%  97.97%  93.84%  98.99%  85.86%  88.11%
  Microsoft Azure     80.32%  95.61%  80.65%  96.20%  79.14%  90.21%
  TrOCR               78.66%  93.97%  80.64%  96.65%  59.96%  67.89%
  PaddleOCR           84.82%  90.73%  88.60%  96.28%  49.64%  37.58%
  Tesseract           86.67%  89.53%  91.14%  95.63%  44.54%  31.39%
  Easy OCR            81.79%  85.07%  85.50%  91.89%  46.87%  19.23%
  Keras OCR           58.03%  83.57%  59.32%  89.98%  46.08%  21.20%
Leven is Levenshtein Distance. Overall is a weighted average of typed vs handwritten, 90/10 if I recall correctly. All tests were run on my personal machine with a 5950X, 128 GB RAM, and an RTX 3080.
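
For anyone wanting to reproduce this kind of scoring, here is a minimal sketch of how the two metrics and the weighted overall figure could be computed. It is not the code used for the table above: the normalization of edit distance by the longer string's length, the word-count vectors for cosine similarity, and the function names are all assumptions made for illustration.

```python
from collections import Counter
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(truth: str, ocr: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(truth), len(ocr))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(truth, ocr) / longest


def cosine_similarity(truth: str, ocr: str) -> float:
    """Cosine similarity over lowercase word counts (an assumed tokenization)."""
    a = Counter(truth.lower().split())
    b = Counter(ocr.lower().split())
    dot = sum(a[word] * b[word] for word in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def weighted_overall(typed: float, handwritten: float,
                     typed_weight: float = 0.9) -> float:
    """Weighted average of typed and handwritten scores (90/10 by default)."""
    return typed_weight * typed + (1.0 - typed_weight) * handwritten
```

Each function takes the hand transcription and the OCR output as plain strings, so scoring a document is just levenshtein_similarity(truth, ocr) and cosine_similarity(truth, ocr), averaged over the set. A word-based cosine score forgives character-level noise inside otherwise matching text more than edit distance does, which is one plausible reason the two columns can diverge.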

From my analysis, Amazon Textract was excellent, the best of the paid ones. TrOCR and PaddleOCR were the best FOSS ones, but the issue with them is that they require a GPU, while I could run Tesseract on CPU alone. For instance, here is the time to OCR all 50 documents:

  Tesseract       1:19
  TrOCR (GPU)    27:33
  TrOCR (CPU)  3:04:22
TrOCR is great if you only need to do a few documents or have GPUs to burn, but Tesseract is far better if you need good-enough results for a large volume of documents. For my project, the intent was to make a software plugin that could be sent to libraries and universities, so CPU is king.
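
To put the timings above in per-document terms, a small sketch like the following converts the clock strings to seconds and compares each run against Tesseract. The helper name to_seconds and the dictionary layout are just illustrative choices; the figures are the ones posted above.

```python
def to_seconds(clock: str) -> int:
    """Convert 'M:SS' or 'H:MM:SS' into a number of seconds."""
    seconds = 0
    for part in clock.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


DOCUMENTS = 50  # size of the test set described above

# Wall-clock times as posted for OCRing all 50 documents.
timings = {
    "Tesseract": "1:19",
    "TrOCR (GPU)": "27:33",
    "TrOCR (CPU)": "3:04:22",
}

baseline = to_seconds(timings["Tesseract"])
for engine, clock in timings.items():
    total = to_seconds(clock)
    per_doc = total / DOCUMENTS
    slowdown = total / baseline
    print(f"{engine:12s} {per_doc:7.2f} s/doc  {slowdown:6.1f}x Tesseract")
```

On these numbers, Tesseract works out to about 1.6 seconds per document, while TrOCR is roughly 21 times slower on the GPU and roughly 140 times slower on CPU.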


This would probably make a good blog post!


I mean to for sure, but the project isn't done yet. There's some NLP work we're doing with the results of the OCR, and I really want to do a full series going over all we've done rather than one post and then another two months later.


I second the first user's sentiment. Things are going to keep changing forever: better solutions come around, and existing solutions keep getting better. I would love to read a blog post on the current state of your research already.



