GREEDY MACRO TEXT COMPRESSION

Date
1987-12-01
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
Text compression schemes can be divided into two classes: statistical, where symbols are assigned codes based on their probabilities, and macro, where groups of consecutive symbols (phrases) are replaced by indexes into some dictionary. A subset of macro coding called Greedy Macro (GM) accounts for the vast majority of macro schemes in the literature, including all variations of the popular Ziv-Lempel method. At each coding step, GM schemes encode as many symbols as possible with a single index to the dictionary. Although this parsing strategy is not optimal, no optimal macro scheme can be implemented with a bounded coding delay. This paper defines GM coding and establishes an algorithm which takes any such scheme and constructs a statistical coding method that achieves exactly the same compression rate. Thus GM schemes can never achieve better compression than statistical ones. The conclusion is that research aimed at increasing compression should concentrate on statistical methods, leaving macro schemes for applications in which compression efficiency can be sacrificed for speed and modest memory requirements.
Description
Keywords
Computer Science
Citation