MODELS FOR COMPRESSION IN FULL-TEXT RETRIEVAL SYSTEMS

dc.contributor.author: Witten, Ian H.
dc.contributor.author: Nevill, Craig G.
dc.contributor.author: Bell, Timothy C.
dc.date.accessioned: 2008-02-27T22:29:02Z
dc.date.available: 2008-02-27T22:29:02Z
dc.date.computerscience: 1999-05-27
dc.date.issued: 1990-08-01
dc.description.abstract: Text compression systems operate in a stream-oriented fashion which is inappropriate for databases that need to be accessed through a variety of retrieval mechanisms. This paper develops models for full-text retrieval systems which (a) compress the main text so that it can be randomly accessed via synchronization points; (b) store the text's lexicon in a compressed form that can be efficiently searched for concordancing and decoding purposes; (c) include a lexicon of word fragments that can be used to implement retrieval based on partial word matches; and (d) store the text's concordance in highly compressed form. All compression is based on the method of arithmetic coding, in conjunction with static models derived from the text itself. This contrasts with contemporary stream-oriented compression techniques that use adaptive models, and with database compression techniques that use ad hoc codes rather than principled models. A number of design trade-offs are identified and investigated on a 2.7 million word sample of English text. The paper is intended to assist designers of full-text retrieval systems by defining, documenting and evaluating pertinent design decisions.
dc.identifier.department: 1990-403-27
dc.identifier.doi: http://dx.doi.org/10.11575/PRISM/31172
dc.identifier.uri: http://hdl.handle.net/1880/46181
dc.language.iso: Eng
dc.publisher.corporate: University of Calgary
dc.publisher.faculty: Science
dc.subject: Computer Science
dc.title: MODELS FOR COMPRESSION IN FULL-TEXT RETRIEVAL SYSTEMS
dc.type: unknown
thesis.degree.discipline: Computer Science
Files
Name: 1990-403-27.pdf
Size: 3.67 MB
Format: Adobe Portable Document Format