A SYMMETRIC VERSION SPACE ALGORITHM FOR LEARNING DISJUNCTIVE STRING CONCEPTS

Date
1992-03-01
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
The paper describes an algorithm to learn disjunctive string patterns from examples. An implementation of the algorithm learns concepts such as all C source files (*.c or *.h) in the UNIX domain. The algorithm is designed for an interactive learning model in which the number of questions to the user are minimized. Furthermore, only simple questions to the user are admissible. The representation language is restricted to a subset of regular expressions in order to reduce the complexity. The patterns in the concept grammar are extensions of Nix's gap patterns and Mo's annotated gap patterns. To learn sequences of patterns, the strings are broken up into units. The algorithm assumes that the first example is a good representative of the concept, that is, it contains all optional units. Even for limited disjunctions, the G-set of Mitchell's Candidate Elimination algorithm is infinite for domains that contain an infinite number of elements. A symmetric version space (SVS) algorithm using cover sets of the positive and negative examples computes only the most specific description that matches a set of examples and can therefore be used to learn in version spaces that make it impossible or infeasible to compute the most general descriptions. Two extra cover sets are maintained to avoid asking the user about all possible examples, a problem that arises when the number of terms in the concept is less than the maximum number of terms in a disjunction. The correctness of the SVS algorithm in version spaces with minimal generalization hierarchies, so called k-disjunctive version spaces, is shown. Furthermore, it is argued that an extended version of the SVS algorithm solves the swapping and splitting problem. The last result of the paper establishes that, though the size of the version space grows exponentially, the sample size complexity increases only linearly with the maximum length of units.
Description
Keywords
Computer Science
Citation