A SYMMETRIC VERSION SPACE ALGORITHM FOR LEARNING DISJUNCTIVE STRING CONCEPTS
Date
1992-03-01
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
The paper describes an algorithm to
learn disjunctive string patterns from examples. An implementation of
the algorithm learns concepts such as all C source files (*.c or *.h) in
the UNIX domain. The algorithm is designed for an interactive
learning model in which the number of questions to the user are
minimized. Furthermore, only simple questions to the user are
admissible. The representation language is restricted to a subset of
regular expressions in order to reduce the complexity. The patterns in
the concept grammar are extensions of Nix's gap patterns
and Mo's annotated gap patterns. To learn sequences
of patterns, the strings are broken up into units. The algorithm
assumes that the first example is a good representative of the
concept, that is, it contains all optional units. Even for limited
disjunctions, the G-set of Mitchell's Candidate Elimination algorithm
is infinite for domains that contain an infinite number of elements.
A symmetric version space (SVS) algorithm using cover sets of the
positive and negative examples computes only the most specific
description that matches a set of examples and can therefore be used
to learn in version spaces that make it impossible or infeasible to
compute the most general descriptions. Two extra cover sets are
maintained to avoid asking the user about all possible examples, a
problem that arises when the number of terms in the concept is less
than the maximum number of terms in a disjunction. The correctness of
the SVS algorithm in version spaces with minimal generalization
hierarchies, so called k-disjunctive version spaces, is shown.
Furthermore, it is argued that an extended version of the SVS
algorithm solves the swapping and splitting problem. The last result
of the paper establishes that, though the size of the version space
grows exponentially, the sample size complexity increases only
linearly with the maximum length of units.
Description
Keywords
Computer Science