C#: OCR (Optical Character Recognition)

May 3, 2010 by C#  

For the past few weeks we've been looking for a suitable OCR solution to integrate into our document management system.

One option we came across is MODI (Microsoft Office Document Imaging), a component that ships with Microsoft Office 2003 and 2007 (it is not available in Microsoft Office 2010).

Simply add a reference to the MODI type library (COM Interop) and convert images to text like this:

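using MODI;
using System;

class Program
{
    static void Main(string[] args)
    {
        // Load the scanned image into a MODI document.
        DocumentClass doc = new DocumentClass();
        doc.Create(@"some.tiff");
        try
        {
            // Run OCR in English, with auto-orientation and page straightening enabled.
            doc.OCR(MiLANGUAGES.miLANG_ENGLISH, true, true);

            // Each page of the document is exposed as an Image with its recognized text.
            foreach (Image image in doc.Images)
            {
                Console.WriteLine(image.Layout.Text);
            }
        }
        finally
        {
            // Release the file without saving the OCR results back to it.
            doc.Close(false);
        }
    }
}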

It's quite a powerful OCR engine, but the engine behind MODI isn't Microsoft's own: it is licensed from ScanSoft Inc., now Nuance.

There is one part I do find a bit dodgy, though: we found quite a few rather expensive OCR tools out there (from $600) that integrate with MODI, which obviously requires Microsoft Office.

I almost feel that those applications belong in the freeware realm, since you have already bought a license to the core OCR functionality (via MS Office) and most of the non-OCR functionality (the part you will be paying for) seems rather mediocre.

My personal opinion though... ;)




OCR API September 17, 2011 by OCR API

Well, coding for all fonts and languages is not easy. I think using the OCR Cloud 2.0 platform is a good idea. It can convert virtually any image (TIF, JPG, PNG, BMP) or PDF to any standard text-based document type (TXT, DOC, RTF, XLS, PPT, XML, HTML) or searchable PDF. It also has auto-language detection and support for over 200 languages, including Latin-based languages, Cyrillic-based languages, Chinese, Japanese, Korean, Thai, and Hebrew. For a free developer account, sign up here: http://www.ocr-it.com/ocr-cloud-2-0-api

OCR for French Script MT April 25, 2011 by Anonymous

This is not useful if the text in the scanned image is set in the French Script MT font. If anyone has a solution, please reply as soon as possible.

October 8, 2010 by Anonymous

http://www.codeproject.com/KB/office/modi.aspx