TEXTUAL IMAGE COMPRESSION

Abstract
We describe a method for lossless compression of images that contain predominantly typed or typeset text--we call these textual images. They are commonly found in facsimile documents, where a typed page is scanned and transmitted as an image. Another increasingly popular application is document archiving, where documents are scanned and stored electronically for later retrieval. Our project was motivated by such an application: Trinity College in Dublin, Ireland, is archiving its 1872 printed library catalogues onto disk, and in order to preserve the exact form of the original document, pages are being stored as scanned images rather than being converted to text. Our test images are taken from this catalogue (one is shown in Figure 1). These beautifully typeset documents have a rather old-fashioned look and contain a wide variety of symbols from several different typefaces--the five test images we used contain text in English, Flemish, Latin and Greek, and include italics and small capitals as well as roman letters. The catalogue also contains Hebrew, Syriac, and Russian text.

The best lossless compression methods for both text and images base their coding on "contexts"--a symbol is coded with respect to adjacent ones. However, the contexts used for coding text usually extend over significantly more characters than those used in images. In text compression, the best methods make predictions based on up to three or four preceding characters, while with black-and-white images the most effective contexts tend to have a radius of just a few pixels.

One possibility for textual image compression is to perform optical character recognition (OCR) on the text, and transmit (or store) only the ASCII (or equivalent) codes for the characters, along with some information about their position on the page. There are several problems with this approach. Considerable computing power is required to recognize characters accurately, and even then recognition is not completely reliable, particularly if unusual fonts, foreign languages or mathematical expressions are being scanned. OCR systems can require "training" to learn a new font, and an operator may have to adjust parameters such as the contrast of the scan to ensure that errors are corrected and small marks are removed from the page. Ironically, although the resulting image may look better, it is actually noisier, because it no longer faithfully represents the original document. Smudged or badly printed characters are replaced by whatever the OCR system interpreted them as, rather than being left for human viewers to interpret for themselves. Dirt or ink-stains, which may have given valuable clues to a researcher, are lost. Even the typeface may not be reproduced accurately, affecting the look of the document. For typed business letters this sort of "noise" may be acceptable, even desirable, but for archives, where the interests of future readers are unknown, there is a strong motivation to record the document as faithfully as possible.

The compression methods investigated here are noiseless, so the original document can be reproduced exactly from its compressed form. This is achieved by attempting to separate the text and the noise in the document; the two components are then compressed independently, using a method appropriate to each.
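To make the contrast between text contexts and pixel contexts concrete, the sketch below (not taken from the paper) shows how a context model for a bilevel image can be built: each pixel is predicted from a small template of previously coded neighbours, and adaptive per-context counts supply the probabilities an arithmetic coder would use. The particular template and the Laplace-smoothed estimator are illustrative assumptions, not the scheme evaluated in this work.

    from collections import defaultdict
    import math

    # Causal template of (dy, dx) neighbour offsets: all lie above or to
    # the left of the current pixel, so in raster-scan order the decoder
    # already knows them when the pixel is coded.
    TEMPLATE = [(-2, -1), (-2, 0), (-2, 1),
                (-1, -2), (-1, -1), (-1, 0), (-1, 1), (-1, 2),
                (0, -2), (0, -1)]

    def context_id(image, y, x):
        """Pack the template pixels into one integer context id;
        pixels outside the image count as white (0)."""
        ctx = 0
        for dy, dx in TEMPLATE:
            ny, nx = y + dy, x + dx
            inside = 0 <= ny < len(image) and 0 <= nx < len(image[0])
            ctx = (ctx << 1) | (image[ny][nx] if inside else 0)
        return ctx

    def code_length_bits(image):
        """Ideal code length (in bits) if each pixel were arithmetic-coded
        with a probability from Laplace-smoothed per-context counts."""
        counts = defaultdict(lambda: [1, 1])   # [white, black] per context
        bits = 0.0
        for y, row in enumerate(image):
            for x, pixel in enumerate(row):
                ctx = counts[context_id(image, y, x)]
                p = ctx[pixel] / (ctx[0] + ctx[1])  # adaptive estimate
                bits -= math.log2(p)
                ctx[pixel] += 1                     # update after coding
        return bits

    # Tiny demonstration: a vertical black bar on a white page costs far
    # fewer than the 400 raw bits, because its contexts quickly become
    # near-certain predictors.
    page = [[1 if 3 <= x < 5 else 0 for x in range(20)] for _ in range(20)]
    print(f"raw: 400 bits  modelled: {code_length_bits(page):.1f} bits")

With ten binary template pixels there are only 2^10 = 1024 possible contexts, which reflects the observation above that effective image contexts span just a few pixels: enlarging the template enriches the model but dilutes the counts available to each context.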
Keywords
Computer Science