There are many interesting linguistic and phonetic aspects of conversational speech, such as the ambiguity of word boundaries. Training speech recognizers, particularly for conversational speech, requires large amounts of well-annotated data. Ideally, a recognizer should be trained on fully-labeled data, i.e., data in which all word segmentations are given. However, fully annotating conversational speech is very difficult (and sometimes impossible) because co-articulation effects make word boundaries ambiguous. In this talk, we show how to make use of partially-labeled data to train Dynamic Graphical Models (DGMs) for speech recognition. The approach is based on the notion of Virtual Evidence (VE), which lets us train models when labels are specified as arbitrary distributions (i.e., any unsigned measure) over the output domain (in this case, the vocabulary). In particular, we suggest approaches in which one needs to annotate only a single unit of time for every word in the training set. Our results show that models trained on such partially-labeled data can match or exceed the performance of models trained on fully-annotated sequence data. We apply the proposed approach to two standard speech recognition tasks: (a) TIMIT phone recognition, and (b) large-vocabulary speech recognition using Switchboard. Further, we suggest applications of this technique in the NLP domain, such as tagging transcriptions of spoken documents where sentence boundaries are not well defined, or training parsers using a distribution over features/aspects of parses of a given sentence.
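To illustrate the core idea, here is a minimal sketch of virtual evidence as soft supervision; the function and values below are illustrative assumptions, not taken from the talk. A hard label is just the special case where the VE measure is an indicator on one vocabulary entry, while a partial annotation spreads unnormalized mass over several candidates.

```python
import math

def ve_score(emission_probs, ve_weights):
    """Weight a model's emission distribution by a virtual-evidence
    measure (any unsigned measure over the label vocabulary)."""
    assert len(emission_probs) == len(ve_weights)
    return sum(p * w for p, w in zip(emission_probs, ve_weights))

# Model's emission distribution over a toy 3-word vocabulary for one frame.
emissions = [0.2, 0.7, 0.1]

# Fully labeled frame: VE is an indicator on the true label (index 1).
hard_ve = [0.0, 1.0, 0.0]

# Partially labeled frame: the annotator is unsure between words 1 and 2,
# so the VE places (unnormalized) mass on both.
soft_ve = [0.0, 1.0, 0.5]

print(ve_score(emissions, hard_ve))  # 0.7
print(ve_score(emissions, soft_ve))  # 0.75
```

In training, such VE-weighted scores replace the usual hard-label terms in the model's objective, so the same machinery covers fully-labeled, partially-labeled, and unlabeled frames.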