TEXTUAL IMAGE COMPRESSION

Witten, Ian H.; Bell, Timothy C.; Harrison, Mary-Ellen; James, Mark L.; Moffat, Alistair

dc.contributor.author	Witten, Ian H.	eng
dc.contributor.author	Bell, Timothy C.	eng
dc.contributor.author	Harrison, Mary-Ellen	eng
dc.contributor.author	James, Mark L.	eng
dc.contributor.author	Moffat, Alistair	eng
dc.date.accessioned	2008-02-27T22:29:48Z
dc.date.available	2008-02-27T22:29:48Z
dc.date.computerscience	1999-05-27	eng
dc.date.issued	1991-11-01	eng
dc.description.abstract	We describe a method for lossless compression of images that contain predominantly typed or typeset text--we call these textual images. They are commonly found in facsimile documents, where a typed page is scanned and transmitted as an image. Another increasingly popular application is document archiving, where documents are scanned by a computer and stored electronically for later retrieval. Our project was motivated by such an application: Trinity College in Dublin, Ireland, are archiving their 1872 printed library catalogues onto disk, and in order to preserve the exact form of the original document, pages are being stored as scanned images rather than being converted to text. Our test images are taken from this catalogue (one is shown in Figure 1). These beautifully typeset documents have a rather old-fashioned look, and contain a wide variety of symbols from several different typefaces--the five test images we used contain text in English, Flemish, Latin and Greek, and include italics and small capitals as well as roman letters. The catalogue also contains Hebrew, Syriac, and Russian text. The best lossless compression methods for both text and images base their coding on "contexts"--a symbol is coded with regard to adjacent ones. However, the contexts used for coding text usually extend over significantly more characters than those used in images. In text compression, the best methods make predictions based on up to three or four characters, while with black-white images, the most effective contexts tend to have a radius of just a few pixels. One possibility for textual image compression is to perform optical character recognition (OCR) on the text, and only transmit (or store) the ASCII (or equivalent) codes for the characters, along with some information about their position on the page. There are several problems with this. 
Considerable computing power is required to recognize characters accurately, and even then it is not completely reliable, particularly if unusual fonts, foreign languages or mathematical expressions are being scanned. OCR systems can require "training" to learn a new font, and an operator may have to adjust parameters such as the contrast of the scan to ensure that errors are corrected and small marks are removed from the page. Ironically, although the image may look better, it is actually noisier, because it does not faithfully represent the original image. Smudged or badly printed characters are replaced with what the OCR system has interpreted them as, rather than leaving human viewers to make their own interpretation. Dirt or ink-stains, which may have given valuable clues to a researcher, are lost. Even the typeface may not be reproduced accurately, affecting the look of the document. For typed business letters, this sort of "noise" may be acceptable, even desirable, but for archives where the interests of future readers are unknown, there is a strong motivation to record the document as faithfully as possible. The compression methods investigated here are noiseless, so the original document can be reproduced exactly from its compressed form. This is done by attempting to separate the text and noise in the document. The two components are then compressed independently using a method appropriate for each.	eng
dc.description.notes	We are currently acquiring citations for the work deposited into this collection. We recognize the distribution rights of this item may have been assigned to another entity, other than the author(s) of the work. If you can provide the citation for this work or you think you own the distribution rights to this work, please contact the Institutional Repository Administrator at digitize@ucalgary.ca	eng
dc.identifier.department	1991-450-34	eng
dc.identifier.doi	http://dx.doi.org/10.11575/PRISM/31199
dc.identifier.uri	http://hdl.handle.net/1880/46191
dc.language.iso	Eng	eng
dc.publisher.corporate	University of Calgary	eng
dc.publisher.faculty	Science	eng
dc.subject	Computer Science	eng
dc.title	TEXTUAL IMAGE COMPRESSION	eng
dc.type	unknown
thesis.degree.discipline	Computer Science	eng