SysAdmin Notepad: 2012

I just ran into a problem that I have not had to deal with for many years, since I can usually get documents quite easily as soft copies. This time I was given a printout of a document that I have to include in a newsletter, and I do not have time to track down the author's email address. So I decided to look for an OCR solution for Fedora.

There are, of course, a number of packages available in the standard repositories. After reading some reviews on the net I decided to try tesseract: although it is quite choosy about the input format, it generally does a high-accuracy job.

yum install tesseract

I used my Brother MFC-7420 to scan the document with Simple Scan on Fedora. The default setting I had was 150 dpi for text.

I saved the file as PNG (lossless) and then used GIMP to convert it to TIFF.
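The conversion step could also be done from the command line instead of GIMP. The following is only a sketch, assuming ImageMagick is installed (yum install ImageMagick) and provides the convert command; the helper names tiff_name and png_to_tiff are made up for illustration.

```shell
#!/bin/sh
# Derive the TIFF file name from a PNG name, e.g. test.png -> test.tif.
tiff_name() {
    printf '%s.tif\n' "${1%.png}"
}

# Convert one PNG scan to TIFF with ImageMagick's convert.
# Assumption: ImageMagick is installed; the original workflow used GIMP here.
png_to_tiff() {
    convert "$1" "$(tiff_name "$1")"
}

# Usage (not run automatically):
#   png_to_tiff test.png    # writes test.tif next to test.png
```

This keeps the lossless PNG from the scanner and writes the TIFF alongside it.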

tesseract test.tif test

The result was a file, test.txt. It was not the best conversion: there were a lot of errors where things like 'in' had become 'rn', apparently because the letters were running into each other a bit at that resolution.
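Some rough arithmetic suggests why the scan resolution matters. One point is 1/72 inch, so the nominal height of a line of text in pixels is roughly point size times dpi divided by 72. The sketch below assumes 10 pt body text, which is a guess and not something measured from the document; text_height_px is a made-up helper name.

```shell
#!/bin/sh
# Approximate nominal height in pixels of text of a given point size
# at a given scan resolution (1 pt = 1/72 inch).
# Uses integer arithmetic, so the result is rounded down.
text_height_px() {
    pt=$1
    dpi=$2
    echo $(( pt * dpi / 72 ))
}

# Assumption: 10 pt text; compare a few common scan resolutions.
for dpi in 150 300 600; do
    echo "$dpi dpi: $(text_height_px 10 "$dpi") px"
done
# prints:
#   150 dpi: 20 px
#   300 dpi: 41 px
#   600 dpi: 83 px
```

At about 20 pixels per line, the gap between neighbouring letters is only a pixel or two, so it is plausible that they merge in the scan.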

I then re-scanned the document at 600 dpi, converted it to TIFF with GIMP and ran tesseract again. WOW, what a difference. Apart from a few issues with spacing (not a problem, as I need to re-format all files to suit Scribus anyway), it has just saved me quite a lot of typing time.
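For a document with more than one page, the tesseract step could be wrapped in a small loop. This is a minimal sketch, assuming all the TIFF scans sit in one directory and end in .tif. The function name ocr_dir and the OCR_CMD variable are hypothetical; OCR_CMD is only there so the loop can be dry-run with echo.

```shell
#!/bin/sh
# Run OCR on every .tif file in a directory; page.tif yields page.txt,
# because tesseract appends .txt to the output base name it is given.
# OCR_CMD is a hypothetical override (defaults to tesseract).
ocr_dir() {
    dir=${1:-.}
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue    # skip when the glob matched nothing
        "${OCR_CMD:-tesseract}" "$img" "${img%.tif}"
    done
}

# Usage (not run automatically):
#   ocr_dir ~/scans                 # OCR every TIFF in ~/scans
#   OCR_CMD=echo ocr_dir ~/scans    # dry run: only print what would be called
```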

Yeah, another open source package that works!

References:
http://www.dedoimedo.com/computers/linux-ocr.html
http://www.mscs.dal.ca/~selinger/ocr-test/

Thursday, November 8, 2012

OCR on Fedora Linux