Using entropy to learn OT grammars from surface forms alone

The problem of ranking constraints in Optimality Theory (Prince & Smolensky 1993) in a fashion
that is consistent with a training sample composed of input-output pairs has been solved with a
variety of algorithms (Tesar & Smolensky 2002, Boersma & Hayes 1999). The real-world problem
of learning OT grammars from training samples that consist of output forms alone, however, still
presents many challenges. Chief among these is that there are often many pairs of possible inputs
and possible grammars consistent with a given training sample of surface forms. The fully
faithful identity grammar is always a possible hypothesis, and there can be several unfaithful ways
to generate an observed form from a range of different inputs. Knowledge of morphology can help
the learner choose grammars that map the same input to different surface forms of a morpheme.
But, even before any morphology is known, learners can make educated guesses about grammars
with principles like Prince & Tesar's (1999) selectional preference for ranking hypotheses that are
maximally 'restrictive' or Smolensky's (1996) default MARKEDNESS ≫ FAITHFULNESS ranking to
restrict the search through the space of possible grammars.

I propose another strategy for adjudicating among grammars without recourse to morphological
information that is based, not on the formal properties of the constraint rankings themselves, but
instead on information-theoretic properties of the set of inputs that each candidate grammar assigns
to the training sample. If learners choose grammars whose associated input sets have the highest
entropy (are least ordered), then they will select grammars that maximally characterize patterns
in the training sample as consequences of the grammar rather than as accidents of the lexicon.
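
The selection principle can be sketched in a few lines of Python. This is a minimal illustration and not the author's implementation: it assumes that each candidate grammar has already been paired with the inputs it posits for the training sample, that segments are single characters, and that a plain unigram Shannon entropy is an acceptable stand-in for the entropy measure. The names `segment_entropy`, `pick_grammar`, `grammar_A`, and `grammar_B`, and the toy input sets, are invented for the example.

```python
import math
from collections import Counter


def segment_entropy(inputs):
    """Shannon entropy (in bits) of the segment distribution in a set of inputs."""
    counts = Counter(seg for form in inputs for seg in form)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def pick_grammar(input_sets, entropy=segment_entropy):
    """Return the name of the candidate grammar whose input set has the highest entropy."""
    return max(input_sets, key=lambda name: entropy(input_sets[name]))


# Hypothetical input sets that two candidate grammars assign to the same sample.
candidates = {
    "grammar_A": ["ta", "ta", "ta", "ka"],  # skewed toward "t": more order in the inputs
    "grammar_B": ["ta", "ka", "ti", "ki"],  # all four segments equally frequent
}

best = pick_grammar(candidates)
```

Here `grammar_B` is preferred, because its inputs show no skew that the grammar has left unexplained. Any other entropy estimate can be passed in through the `entropy` parameter.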

To implement this strategy, learners assume that all segment types, pairs of segments, triplets,
patterns of comparable complexity, etc. are equiprobable as inputs. This property needn't hold of
the lexicon that the learner ends up with, but it represents a null hypothesis that places the onus on
the grammar to account for all patterns. This idea is central in Zellig Harris' work (1942 et seq.)
and encodes the same insight as Smolensky's (1996) Richness of the Base hypothesis in that the
grammars that ascribe the highest entropy to their inputs maximally allow Richness of the Base.
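
The link between equiprobability and entropy can be stated explicitly. Assuming that "entropy" here means Shannon entropy over a finite inventory of pattern types of a given complexity (the symbols T, p, and H are notation introduced for this note, not the author's), any distribution p over those types satisfies:

```latex
H(p) \;=\; -\sum_{x \in T} p(x)\,\log_2 p(x) \;\le\; \log_2 |T|,
\qquad \text{with equality iff } p(x) = \tfrac{1}{|T|} \text{ for all } x \in T .
```

On this reading, an input set in which all types are equiprobable is exactly one of maximal entropy, and any shortfall from the bound can be taken as pattern that the grammar has left in the lexicon.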

There are many methods that could be used to keep a running estimate of the entropy of the input
sets associated with candidate grammars that don't require the learner to keep an actual record of
the inputs in the sets. One such strategy (the one I employed in the case study mentioned below) is
to keep counts of bigrams (pairs of segments) and unigrams (single segments) in the inputs. This
only captures strictly local patterns but is adequate for a great many phonological phenomena. The
bigram/unigram model associated with each candidate grammar can, at any point, be turned into
an estimate of the entropy of the lexicon associated with that grammar. Other more sophisticated
measurements of the entropy of the input sets (like ones that capture non-local patterns) could be
plugged into this strategy but seem unnecessary for the early bootstrapping phase of learning.
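
One way such a running estimate might look is sketched below. The source does not specify how counts are converted into an entropy value, so the sketch assumes one common choice: the conditional entropy of a segment given the preceding segment, computed from the bigram and unigram counts. The class name `BigramModel`, the word-edge symbol `#`, and the treatment of each character as one segment are all assumptions made for illustration.

```python
import math
from collections import Counter


class BigramModel:
    """Running unigram and bigram counts over the inputs a candidate grammar posits."""

    def __init__(self, boundary="#"):
        self.boundary = boundary   # hypothetical word-edge symbol
        self.unigrams = Counter()  # counts of segments in the "previous" position
        self.bigrams = Counter()   # counts of (previous, current) segment pairs

    def add(self, form):
        """Update the counts with one input form; the form itself is not stored."""
        padded = self.boundary + form + self.boundary
        for prev, cur in zip(padded, padded[1:]):
            self.unigrams[prev] += 1
            self.bigrams[(prev, cur)] += 1

    def entropy(self):
        """Conditional entropy H(current | previous), in bits per transition."""
        total = sum(self.bigrams.values())
        if total == 0:
            return 0.0
        h = 0.0
        for (prev, cur), n in self.bigrams.items():
            # joint probability of the pair times log of its conditional probability
            h -= (n / total) * math.log2(n / self.unigrams[prev])
        return h
```

A learner in this sketch would keep one `BigramModel` per candidate grammar, call `add` on each input that grammar posits for an observed form, and compare the `entropy()` values whenever a choice among grammars is needed.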

Prince & Smolensky's (1993) syllable structure grammar with an alphabet of two consonants and
two vowels provides a test-case for the strategy. If one consonant is unmarked as an onset and one
vowel unmarked as a nucleus, the learner will be able to discern how markedness restrictions are
enforced as follows. Grammars enforcing ONSET via epenthesis will have disproportionately many
unmarked onsets, while grammars using deletion won't; grammars enforcing NOCODA via epenthesis
will have disproportionately many unmarked nuclei, whereas grammars using deletion won't;
and grammars using deletion in both cases will differ from the identity grammar in that the latter
projects surface disparities among bigrams onto the inputs. In this way, lexical entropy and universal
markedness constraints reveal the way unfaithful parses perturb the distribution of surface forms.
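
The ONSET half of this reasoning can be illustrated with a toy simulation. It is a deliberately simplified sketch and not the case study itself: it assumes the alphabet is {t, k, a, i} with "t" as the unmarked onset, that inputs are equiprobable sequences of (C)V syllables, and that a repair either inserts "t" before an onsetless vowel or deletes the onsetless syllable outright. The function names and the deletion simplification are invented for the example.

```python
from collections import Counter
from itertools import product

CONSONANTS = ("t", "k")  # "t" is assumed to be the unmarked onset
VOWELS = ("a", "i")
# All (C)V syllable types: onsetless "a", "i" plus "ta", "ti", "ka", "ki".
SYLLABLES = [c + v for c in ("",) + CONSONANTS for v in VOWELS]


def repair_onset(syllables, strategy):
    """Map a sequence of input syllables to surface syllables that all have onsets."""
    surface = []
    for syl in syllables:
        if syl[0] in CONSONANTS:
            surface.append(syl)        # already has an onset: parsed faithfully
        elif strategy == "epenthesis":
            surface.append("t" + syl)  # insert the unmarked consonant
        # under "deletion" the onsetless syllable is simply dropped
    return surface


def onset_counts(strategy, n_syllables=2):
    """Count surface onsets over every equiprobable input of n_syllables syllables."""
    counts = Counter()
    for word in product(SYLLABLES, repeat=n_syllables):
        for syl in repair_onset(word, strategy):
            counts[syl[0]] += 1
    return counts


epenthesis = onset_counts("epenthesis")
deletion = onset_counts("deletion")
```

With equiprobable inputs, the epenthesis grammar yields twice as many "t" onsets as "k" onsets on the surface, while the deletion grammar leaves the two balanced. A skew of the first kind in the training sample is therefore what the entropy criterion would attribute to the grammar and not to the lexicon.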