Re: OCR Software..?!

becka@rz.uni-duesseldorf.de
Thu, 13 Nov 1997 12:55:15 +0100 (MET)

> my idea was that scanned data would wind up in a Tk
> text editing box, with possible errors (where the confidence value of
> the recognition is low) highlighted in red.

You might also need a "segmentation preview" which (optionally) lets you
intervene manually in the separation of text and graphics and in the
sequence in which the text boxes are processed.

Moreover, it would be nice if you could turn every manual step on and off.
That way you could run a "quick-and-dirty" mass conversion and correct the
errors the next morning, once the whole stack of sheets has been fed through
the scanner, or work interactively when you prefer.

> Recognition is the complicated part, of course. First you need to
> scan the image, then it's usually converted from grey-scale to 2-level
> black-and-white. Documents are often not perfectly aligned when
> they're scanned, so the angle at which they're tilted (called the
> "skew angle") has to be measured and compensated for.

Yeah. If you want to compensate on the image side, do so before converting
to b/w. Less quality loss.

Moreover a "de-noise" filter would be appropriate to remove speckles.
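
A rough sketch of such a speckle filter in Python, assuming the image is
already binary (1 = ink) and stored as a list of rows. The function name and
the min_size parameter are made up for illustration. Note that too large a
min_size would also eat periods and the dots on an i.

```python
from collections import deque

def remove_speckles(image, min_size=3):
    """Return a copy of a binary image (1 = ink) with small blobs erased.

    A blob is a 4-connected group of ink pixels; blobs with fewer than
    min_size pixels are treated as noise.
    """
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill to collect every pixel of this blob.
            blob, queue = [], deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                neighbours = ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1))
                for ny, nx in neighbours:
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Erase the blob if it is too small to be part of a letter.
            if len(blob) < min_size:
                for by, bx in blob:
                    result[by][bx] = 0
    return result
```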

At small text sizes, it might be nice to keep a grayscale image
(though this complicates the algorithms considerably). At the least you
should use a suitable combined sharpening/smoothing filter (one that
preserves edges but smooths areas) to get a good image of the letters.
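
As a simple stand-in for such a filter, here is a 3x3 median filter in
Python. This is only one possible choice, not a full sharpening/smoothing
combination: the median removes isolated outliers in flat areas while
leaving a clean step edge where it is. The grayscale image is assumed to be
a list of rows of integers, and border pixels are copied unchanged.

```python
def median_filter_3x3(gray):
    """Replace each interior pixel by the median of its 3x3 neighbourhood."""
    height, width = len(gray), len(gray[0])
    out = [row[:] for row in gray]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            window = sorted(
                gray[y + dy][x + dx]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            )
            out[y][x] = window[4]  # middle of the 9 sorted values
    return out
```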

> Then the image has to be segmented into words, and words into letters;
Or ligatures (letter pairs set as a single glyph). Many printed typefaces
use them. An example is the combination "fi": in print, the dot of the i
is often merged into the upper end of the f. Set a word containing this
combination with TeX to see what I mean.

> each letter is then recognized, and usually a confidence value is
> attached to each letter.
Yep. The same should happen on word level.
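
One possible way to get a word-level value, sketched in Python: take the
weakest letter as the confidence of the whole word and flag words below a
threshold. The min rule, the threshold, and the function names are
assumptions for illustration; a product or a dictionary-based score would
be just as plausible.

```python
def word_confidence(letter_confidences):
    """Assumed rule: a word is only as reliable as its weakest letter."""
    return min(letter_confidences)

def flag_uncertain_words(words, threshold=0.8):
    """words: list of (text, [confidence per letter]).

    Returns the texts whose word confidence falls below the threshold,
    i.e. the ones an editor front end might highlight.
    """
    return [text for text, confs in words if word_confidence(confs) < threshold]
```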

> Often there's a post-processing step which uses a language dictionary
> to correct errors; for example, if you're scanning English text, 'rn'
> might be a scanning error for "m".

Yes. The matching algorithm for the dictionary search needs to be
chosen in a way that takes typical scanning/matching errors into account.
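
A minimal sketch of what I mean, in Python. The confusion table is made up
for illustration (a real one would be measured from the recognizer's actual
errors), and only a single substitution per word is tried.

```python
# Hypothetical confusion table: (what the OCR produced, what was printed).
CONFUSIONS = [("rn", "m"), ("cl", "d"), ("1", "l"), ("0", "o"), ("vv", "w")]

def candidate_corrections(word, dictionary, confusions=CONFUSIONS):
    """Return dictionary words reachable by undoing one typical OCR confusion."""
    if word in dictionary:
        return [word]
    candidates = set()
    for seen, printed in confusions:
        # Try the substitution at every position where the pattern occurs.
        start = word.find(seen)
        while start != -1:
            fixed = word[:start] + printed + word[start + len(seen):]
            if fixed in dictionary:
                candidates.add(fixed)
            start = word.find(seen, start + 1)
    return sorted(candidates)
```

With a dictionary containing "modern", the misread "rnodern" comes back as
["modern"], while a plain edit-distance search would treat "rn" -> "m" as
two separate errors.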

At the letter level you could use language-specific hidden Markov models to
predict the probability of each possible next letter, which helps when
deciding between several candidates. E.g. if the last recognized
character was "q", the probability of the next one being "u" is orders of
magnitude higher than of it being "n".
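
To illustrate the idea, here is a sketch in Python, simplified to a plain
first-order letter chain (bigram counts) rather than a full hidden Markov
model. The function names, the add-one smoothing, and the alphabet size of
27 (26 letters plus space) are assumptions of the example.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count how often each letter follows each other letter in a sample text."""
    counts = defaultdict(Counter)
    letters = [c for c in text.lower() if c.isalpha() or c == " "]
    for prev, nxt in zip(letters, letters[1:]):
        counts[prev][nxt] += 1
    return counts

def transition_probability(counts, prev, nxt, smoothing=1.0, alphabet_size=27):
    """Smoothed estimate of P(nxt | prev), so unseen pairs are never zero."""
    total = sum(counts[prev].values())
    return (counts[prev][nxt] + smoothing) / (total + smoothing * alphabet_size)

def pick_letter(counts, prev, candidates):
    """candidates: list of (letter, recognizer confidence).

    Weighs each candidate's confidence by how likely it is to follow prev.
    """
    best = max(
        candidates,
        key=lambda c: c[1] * transition_probability(counts, prev, c[0]),
    )
    return best[0]
```

Trained on English text, this picks "u" after "q" even when the recognizer
itself slightly prefers "n".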

> The two major techniques for recognizing letters seems to be either
> neural networks, or making a vector from easily measured
> characteristics of the bitmap containing a letter; for example, xocr
> takes a histogram of the letter at 128 different angles. This
> technique dates back at least to the 1970s, but neural networks seem
> to be what all modern systems use.

The XOCR technique is not good. Unless it has changed since I last looked,
it _counts_pixels_ (!) along these angles. This doesn't even distinguish
an O from a dot. The number of black/white transitions is a better
measure.
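
A small Python illustration of the difference, using only one angle
(horizontal scan lines) and two made-up 5x5 glyphs: a ring like an O and a
solid blob with the same amount of ink in every row. This is not XOCR's
code, just a sketch of the two measures.

```python
def ink_per_row(glyph):
    """Pixel-count measure: number of ink pixels ('#') in each row."""
    return [row.count("#") for row in glyph]

def transitions_per_row(glyph):
    """Number of background/ink changes along each row.

    The area outside the glyph is treated as background.
    """
    result = []
    for row in glyph:
        padded = "." + row + "."
        result.append(sum(1 for a, b in zip(padded, padded[1:]) if a != b))
    return result

RING = [".###.", "#...#", "#...#", "#...#", ".###."]
BLOB = [".###.", ".##..", ".##..", ".##..", ".###."]
```

Here ink_per_row gives [3, 2, 2, 2, 3] for both glyphs, while
transitions_per_row gives [2, 4, 4, 4, 2] for the ring and [2, 2, 2, 2, 2]
for the blob.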

But do not make the standard OCR mistake of simply feeding the character
matrix to a neural net and then trying to train it like mad.

Feature recognition is still the most important part of a good OCR
program. Whether you classify the features using a neural net or something
simpler like weighted vector matching isn't too important. If your feature
recognition is not good, neither of them will work well.
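
By "weighted vector matching" I mean something as simple as the following
Python sketch: compare the feature vector against one stored prototype per
character, with a weight per feature. The features used here (number of
holes, mean transitions per row, width/height ratio), the prototype values,
and the weights are invented for the example.

```python
def weighted_distance(features, prototype, weights):
    """Weighted squared distance between two feature vectors."""
    return sum(w * (f - p) ** 2 for f, p, w in zip(features, prototype, weights))

def classify(features, prototypes, weights):
    """Return (label, distance) of the closest prototype."""
    scored = (
        (label, weighted_distance(features, proto, weights))
        for label, proto in prototypes.items()
    )
    return min(scored, key=lambda pair: pair[1])

# Hypothetical prototypes: (holes, mean transitions per row, width/height).
PROTOTYPES = {
    "O": (1, 4.0, 1.0),
    ".": (0, 2.0, 1.0),
    "l": (0, 2.0, 0.2),
}
WEIGHTS = (4.0, 1.0, 2.0)
```

The returned distance can double as a (reversed) confidence value: the
smaller it is, the surer the match.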

Neural nets can compensate a bit better for a bad recognizer, but
at the price of additional training time and possibly less predictable
behaviour.

> We should approach him, and get a freeware-OCR mailing list set up.
Definitely a good idea. It is one of the few things missing in freeware.

CU, Andy

-- 
Andreas Beck              |  Email :  <becka@sunserver1.rz.uni-duesseldorf.de>
