Mari Ostendorf

UW Electrical Engineering

Web-based Corpora: Modeling Language vs. Gathering Counts

UW/Microsoft Symposium, 3/12/04

Performance gains in language modeling in recent years have been driven more by data collection than by advances in the representation of linguistic structure. As vast text resources become increasingly available via the web, one might argue that this trend will continue. While there is no question that current models are still limited by the amount of available training data and can benefit substantially from additional resources, this talk will challenge the philosophy that "there's no data like more data". Because human language can vary substantially depending on topic and speaking/writing style, adding mismatched text to the training set can actually hurt language modeling performance when using simple n-gram models. This is particularly an issue for conversational speech, which differs substantially in style from most easily available text data. However, we show that by explicitly representing even simple aspects of topic and style, it is possible to make use of a wider variety of text resources in training language models. The potential is much greater with more sophisticated models of linguistic structure.
