TEXTUAL IMAGE COMPRESSION

dc.contributor.author: Witten, Ian H. (eng)
dc.contributor.author: Bell, Timothy C. (eng)
dc.contributor.author: Harrison, Mary-Ellen (eng)
dc.contributor.author: James, Mark L. (eng)
dc.contributor.author: Moffat, Alistair (eng)
dc.date.accessioned: 2008-02-27T22:29:48Z
dc.date.available: 2008-02-27T22:29:48Z
dc.date.computerscience: 1999-05-27 (eng)
dc.date.issued: 1991-11-01 (eng)
dc.description.abstract: We describe a method for lossless compression of images that contain predominantly typed or typeset text--we call these textual images. They are commonly found in facsimile documents, where a typed page is scanned and transmitted as an image. Another increasingly popular application is document archiving, where documents are scanned by a computer and stored electronically for later retrieval. Our project was motivated by such an application: Trinity College in Dublin, Ireland, are archiving their 1872 printed library catalogues onto disk, and in order to preserve the exact form of the original document, pages are being stored as scanned images rather than being converted to text. Our test images are taken from this catalogue (one is shown in Figure 1). These beautifully typeset documents have a rather old-fashioned look, and contain a wide variety of symbols from several different typefaces--the five test images we used contain text in English, Flemish, Latin and Greek, and include italics and small capitals as well as roman letters. The catalogue also contains Hebrew, Syriac, and Russian text.

The best lossless compression methods for both text and images base their coding on "contexts"--a symbol is coded with regard to adjacent ones. However, the contexts used for coding text usually extend over significantly more characters than those used in images. In text compression, the best methods make predictions based on up to three or four characters, while with black-white images, the most effective contexts tend to have a radius of just a few pixels.

One possibility for textual image compression is to perform optical character recognition (OCR) on the text, and only transmit (or store) the ASCII (or equivalent) codes for the characters, along with some information about their position on the page. There are several problems with this. Considerable computing power is required to recognize characters accurately, and even then it is not completely reliable, particularly if unusual fonts, foreign languages or mathematical expressions are being scanned. OCR systems can require "training" to learn a new font, and an operator may have to adjust parameters such as the contrast of the scan to ensure that errors are corrected and small marks are removed from the page.

Ironically, although the image may look better, it is actually noisier, because it does not faithfully represent the original image. Smudged or badly printed characters are replaced with what the OCR system has interpreted them as, rather than leaving human viewers to make their own interpretation. Dirt or ink-stains, which may have given valuable clues to a researcher, are lost. Even the typeface may not be reproduced accurately, affecting the look of the document. For typed business letters, this sort of "noise" may be acceptable, even desirable, but for archives where the interests of future readers are unknown, there is a strong motivation to record the document as faithfully as possible.

The compression methods investigated here are noiseless, so the original document can be reproduced exactly from its compressed form. This is done by attempting to separate the text and noise in the document. The two components are then compressed independently using a method appropriate for each. (eng)
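The abstract's remark that black-white images are best coded using contexts of just a few pixels can be illustrated with a small sketch. The code below is not the authors' method; it is a generic adaptive context model, written for illustration, that reports the ideal code length (in bits) an arithmetic coder could approach when each pixel is predicted from a few already-coded neighbours. The function name `context_code_length`, the default four-pixel template, and the count initialisation are all assumptions made for this example.

```python
import math


def context_code_length(image, template=((0, -1), (-1, -1), (-1, 0), (-1, 1))):
    """Ideal adaptive code length, in bits, of a bilevel image.

    image    -- list of equal-length rows containing 0 (white) or 1 (black)
    template -- (dy, dx) offsets of previously coded neighbours that form
                the context; an empty template gives a zero-order model
    """
    height, width = len(image), len(image[0])
    counts = {}  # context number -> [times a 0 followed, times a 1 followed]
    bits = 0.0
    for y in range(height):
        for x in range(width):
            # Pack the neighbouring pixels into one integer context.
            context = 0
            for dy, dx in template:
                ny, nx = y + dy, x + dx
                inside = 0 <= ny < height and 0 <= nx < width
                # Pixels outside the image are treated as white (assumption).
                context = (context << 1) | (image[ny][nx] if inside else 0)
            # Each context starts with one imaginary 0 and one imaginary 1.
            seen = counts.setdefault(context, [1, 1])
            pixel = image[y][x]
            bits -= math.log2(seen[pixel] / (seen[0] + seen[1]))
            seen[pixel] += 1  # adapt after coding, as a decoder also could
    return bits


if __name__ == "__main__":
    # Vertical stripes: highly predictable from the pixel to the left.
    stripes = [[x % 2 for x in range(8)] for _ in range(8)]
    print("with 4-pixel context:", round(context_code_length(stripes), 1), "bits")
    print("without context:     ", round(context_code_length(stripes, ()), 1), "bits")
```

On a regular pattern such as the striped test image, the context model needs far fewer bits than the zero-order model, because each context quickly learns which pixel value follows it. Because the counts are updated only after a pixel is coded, a decoder can maintain identical statistics, which is what keeps such a scheme lossless.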
dc.description.notes: We are currently acquiring citations for the work deposited into this collection. We recognize the distribution rights of this item may have been assigned to another entity, other than the author(s) of the work. If you can provide the citation for this work or you think you own the distribution rights to this work, please contact the Institutional Repository Administrator at digitize@ucalgary.ca (eng)
dc.identifier.department: 1991-450-34 (eng)
dc.identifier.doi: http://dx.doi.org/10.11575/PRISM/31199
dc.identifier.uri: http://hdl.handle.net/1880/46191
dc.language.iso: Eng (eng)
dc.publisher.corporate: University of Calgary (eng)
dc.publisher.faculty: Science (eng)
dc.subject: Computer Science (eng)
dc.title: TEXTUAL IMAGE COMPRESSION (eng)
dc.type: unknown
thesis.degree.discipline: Computer Science (eng)
Files
Original bundle
Name: 1991-450-34.pdf
Size: 1.84 MB
Format: Adobe Portable Document Format