
Technology

How it works

Here is a broad outline of how the Quixate text location and reading system works. The description is divided into two main sections: first we address the problem of finding the text within an image and then we look at how to read the text that we have found.

[Image: Part of a typical photographic image, magnified to show individual pixels]

A. Finding text

Stage A.1: Preprocessing

We first subject the photograph to a number of image processing algorithms that help to eliminate the background along with many non-text features while preserving the characteristics of the interesting parts of the picture.
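The exact algorithms are Quixate's own, but as a rough illustration of the kind of operation involved, the following Python sketch uses a standard morphological top-hat/black-hat pair (via OpenCV) to flatten the slowly varying background while keeping small, high-contrast features of roughly the scale that text occupies. The function name and file name are purely illustrative.

# Illustrative only, not Quixate's actual preprocessing. A top-hat/black-hat
# pair picks out small bright-on-dark and dark-on-bright features while
# suppressing large uniform regions such as sky or road surface.
import cv2
import numpy as np

def preprocess(grey: np.ndarray, kernel_size: int = 15) -> np.ndarray:
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_size, kernel_size))
    tophat = cv2.morphologyEx(grey, cv2.MORPH_TOPHAT, kernel)      # light text on a dark background
    blackhat = cv2.morphologyEx(grey, cv2.MORPH_BLACKHAT, kernel)  # dark text on a light background
    # Keep whichever response is stronger at each pixel.
    return np.maximum(tophat, blackhat)

image = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)
candidate_map = preprocess(image)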

[Image: The baseline of a piece of text is not necessarily straight]

Stage A.2: Preliminary text finding

We have two independent approaches to finding candidate regions of an image that might contain text. Each approach is based on a different geometrical characteristic that distinguishes text from other objects that typically occur in photographs. Of course, these methods are not foolproof: for example, it is easy to imagine natural or man-made features that superficially resemble the letters ‘H’ or ‘I’. We therefore allow a number of false positives into the next stage of processing, which applies a more sophisticated idea of what constitutes text.

The output of this stage is a list of candidate baselines for rows of text, each with a very rough estimate (within a factor of about 3) of the character height of that text. The baselines are represented as general curves within the original image rather than as simple straight lines.
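As a purely hypothetical illustration of what this intermediate output might look like, the sketch below represents each candidate as a sampled baseline curve plus a rough height estimate; the field names and numbers are invented.

# Hypothetical representation of the data passed to the next stage: a baseline
# curve sampled as image points, plus a character height known only to within
# a factor of about 3.
from dataclasses import dataclass

@dataclass
class CandidateBaseline:
    points: list            # (x, y) samples along the baseline curve, in image coordinates
    rough_height_px: float  # estimated character height, accurate to roughly 3x either way

candidates = [
    CandidateBaseline(points=[(102.0, 311.5), (160.0, 308.2), (221.0, 306.9)],
                      rough_height_px=14.0),
]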

[Image: Estimating the height of the text]

Stage A.3: Refinement

In the next stage each candidate row of text is inspected more closely. Several statistical measures are computed in order to make a more accurate judgement as to whether the row actually contains text or whether it is some other feature in the photograph. For those rows that pass this test, we also obtain a better estimate of the character height.

Again, these tests are not perfect, and so we let some false positives through to the next stage. Some of the rows that are definitely rejected as text are nevertheless still judged to be useful in the next stage (where we analyse perspective), and so we retain them separately from the successful rows.
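A schematic and deliberately simplified version of that decision logic might look like the following; the scoring function, thresholds and three-way split are placeholders rather than the measures the real system uses.

# Sketch of the refinement step: score each candidate row, accept likely text,
# and keep a band of "rejected but geometrically useful" rows for the
# perspective analysis that follows. All names and thresholds are illustrative.
def refine(candidates, score_fn, accept_at=0.6, retain_at=0.3):
    accepted, retained_for_perspective = [], []
    for cand in candidates:
        score = score_fn(cand)   # e.g. stroke-width consistency, contrast, stroke spacing
        if score >= accept_at:
            accepted.append(cand)                  # goes forward as probable text
        elif score >= retain_at:
            retained_for_perspective.append(cand)  # not text, but still useful geometry
        # anything scoring lower is discarded outright
    return accepted, retained_for_perspective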

[Image: We build a three-dimensional model of the scene]

[Image: Rectified view of an object in the image]

Stage A.4: Perspective

Perspective is one of the main features that distinguish text in photographs from text in a scanned document. It is very common for the text to be in a plane that is not quite perpendicular to the axis of the camera, with the result that the characters in a line of text appear bigger at one end than at the other. This difference alone is enough to make many traditional OCR (optical character recognition) systems fail on photographic images.

Our system builds up a three-dimensional model of the scene in the photograph. Each row of text is assigned to a plane in three dimensions, which may lie at any orientation with respect to the axis of the camera. We take advantage of the fact that a single plane may contain several rows of text: consider a road sign, for example. Each plane can then be ‘rectified’: in other words, we can work out what the plane would look like if its photograph had been taken square-on. The images here show rectification in action.
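Rectifying a plane amounts to applying a homography. The sketch below shows the standard OpenCV way of doing this for a single quadrilateral; the corner coordinates are invented, and the real system derives the transformation from its three-dimensional scene model rather than from hand-picked points.

# Minimal homography rectification sketch using OpenCV.
import cv2
import numpy as np

# Four corners of a planar region (e.g. a road sign) as they appear in the photo...
src = np.float32([[412, 130], [695, 178], [688, 340], [405, 302]])
# ...and where those corners should land in the rectified, square-on view.
dst = np.float32([[0, 0], [300, 0], [300, 180], [0, 180]])

H = cv2.getPerspectiveTransform(src, dst)          # 3x3 homography
image = cv2.imread("photo.jpg")
rectified = cv2.warpPerspective(image, H, (300, 180))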

Once we have our rectified rows of text (plus a few false positives that we have let through) we can proceed to try to read them.

[Image: Naïve thresholding gives poor results]

B. Reading text

Stage B.1: Character segmentation

At this point in the process we have a collection of rectified rows of text. The first job is to separate, or ‘segment’, each row into its constituent characters. The natural approach of thresholding the image to separate the characters into connected components is unfortunately very poor at doing this. Traditional OCR systems that use thresholding usually only work satisfactorily with very large characters (typically many tens of pixels high) or employ a range of ad hoc techniques to try to undo the damage done to the image by the thresholding process. While these techniques may work for a scanned image, the various artefacts and distortions present in photographs make them unsuitable in our case. As the example here illustrates, there is often no threshold value that you can choose which will separate all the characters from one another without causing some of the characters to break up.
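The effect is easy to reproduce: the short experiment below sweeps a global threshold across a row image and counts connected components at each setting. In practice no single value gives one clean component per character; low thresholds merge neighbouring letters while high thresholds break strokes apart. The file name is illustrative.

# Threshold sweep over a single row image, counting connected components.
import cv2

row = cv2.imread("row.png", cv2.IMREAD_GRAYSCALE)
for t in range(40, 220, 20):
    _, binary = cv2.threshold(row, t, 255, cv2.THRESH_BINARY_INV)
    n_labels, _ = cv2.connectedComponents(binary)
    print(f"threshold {t:3d}: {n_labels - 1} components")  # minus the background label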

We use a more sophisticated segmentation algorithm that is intimately bound with the character recognition algorithms described below and which takes full advantage of the dynamic range of the original image. In this way we avoid making irrevocable segmentation decisions until the last possible moment, and as a result our system is capable of reading text down to just a few pixels in height.
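One conventional way to defer segmentation decisions, shown here purely as a sketch of the idea rather than of our actual algorithm, is to score every plausible (cut, character) pairing directly on the greyscale row and pick the best overall path by dynamic programming.

# Sketch of joint segmentation and reading by dynamic programming.
# score_span(start, end) -> (best_character, log_probability) for the span of
# columns [start, end), judged on the greyscale pixels; it stands in for the
# character recogniser described below.
def segment_and_read(row_width, score_span, max_char_width=40):
    best = [(-float("inf"), "")] * (row_width + 1)   # (score, text read so far) per cut position
    best[0] = (0.0, "")
    for end in range(1, row_width + 1):
        for start in range(max(0, end - max_char_width), end):
            char, logp = score_span(start, end)
            total = best[start][0] + logp
            if total > best[end][0]:
                best[end] = (total, best[start][1] + char)
    return best[row_width][1]    # the highest-scoring reading of the whole row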

[Image: An assortment of ‘a’s]

Stage B.2: Character recognition

We recognise characters using a flexible statistical model that is not specific to a given font. Among other things, it can represent bold, condensed, warped and blurred characters directly. The model often operates close to information-theoretic limits, especially where the text in the original image is only a few pixels high. It can also easily be adapted to non-Roman alphabets.
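As a much simplified stand-in for such a model, the sketch below scores a size-normalised greyscale patch against one Gaussian per character class. The real model is considerably richer, but the interface is the same: a patch in, a log-probability per class out.

# Toy font-independent character scorer: one Gaussian per class over
# flattened, size-normalised greyscale patches.
import numpy as np

class GaussianCharacterModel:
    def __init__(self, means, variances):
        self.means = means          # dict: character -> flattened mean patch (numpy array)
        self.variances = variances  # dict: character -> per-pixel variance (numpy array)

    def log_probs(self, patch):
        patch = patch.ravel().astype(float)
        scores = {}
        for ch, mu in self.means.items():
            var = self.variances[ch]
            scores[ch] = -0.5 * np.sum((patch - mu) ** 2 / var + np.log(2 * np.pi * var))
        return scores               # character -> log-likelihood under that class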

[Image: Word modelling based on state machines]

Stage B.3: From characters into words

The final step is to assemble the recognised characters into words. In general we find that text in photographs, unlike text in scanned images, tends to include a large number of comparatively rare words, and in particular proper names. It is therefore not adequate to take the top-ranking characters from the character recogniser and pass them through a spell-checker because no reasonably-sized dictionary will contain even 95% of the words we might want to read. We therefore have two statistical models, based on state machines, for the text we read: one for ‘in-vocabulary’ words (based on English language corpora) and one for ‘out-of-vocabulary’ (‘OOV’) words. Our OOV model also deals with numbers and punctuation. Again, the word models can easily be changed to suit languages other than English.
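A toy version of the two-model idea is shown below: a candidate word is scored under a small dictionary-backed model and under a character-level OOV model, and the better explanation wins. The dictionary, probabilities and penalty are invented for illustration.

# Score a word under an in-vocabulary model and an out-of-vocabulary model.
import math

DICTIONARY = {"high": -7.2, "street": -8.0, "london": -9.1}  # log P(word) from a corpus
OOV_LOG_P_CHAR = math.log(1 / 40)    # crude uniform model over letters, digits and punctuation
OOV_PENALTY = math.log(0.01)         # prior cost of leaving the vocabulary

def word_score(word: str) -> float:
    in_vocab = DICTIONARY.get(word.lower(), -float("inf"))
    out_of_vocab = OOV_PENALTY + len(word) * OOV_LOG_P_CHAR
    return max(in_vocab, out_of_vocab)

print(word_score("street"))   # explained by the in-vocabulary model
print(word_score("Quixate"))  # a proper name: the OOV model wins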

The result of this stage is a string (the text that has been read), accompanied by a score which represents how confident the system is that its answer is correct.

[Image: Sample of typical output]

Stage B.4: Filtering and output

Filtering involves first rejecting any rows that are determined by the reading process to be false positives that have slipped through the earlier stages of processing. Second, it can happen that the system finds two rows that cover the same region of the image. Normally in this case the reading process is much more confident of one of these results than it is of the other, and the poorer row is discarded.
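In outline, with a placeholder overlap measure and threshold, that second filtering step might be sketched as follows.

# Keep the more confident of any two rows that cover much the same region.
def filter_overlaps(rows, overlap_fn, min_overlap=0.5):
    # rows: list of (region, text, confidence); overlap_fn compares two regions
    kept = []
    for row in sorted(rows, key=lambda r: r[2], reverse=True):
        if all(overlap_fn(row[0], other[0]) < min_overlap for other in kept):
            kept.append(row)
    return kept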

Finally, the filtered results are output in the form of a plain text database containing details of the location and size of each row of text found, and the text it contains.
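The precise record layout is not described here, but a hypothetical writer for such a database, with one tab-separated line per row of text read, could look like this.

# Write one tab-separated record per row of text found.
def write_results(rows, path="results.txt"):
    with open(path, "w") as out:
        for r in rows:   # r: (x, y, width, height, confidence, text) - an assumed field order
            out.write("\t".join(str(field) for field in r) + "\n")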