Eric Ringger

NLP Group, Microsoft Research

The Role of Linguistics in Feature Engineering for Machine Learning in Natural Language Processing

UW/Microsoft Symposium, 4/22/05

According to various estimates, empirical work in NLP has grown from less than 10% of all published research papers in the field a decade and a half ago to nearly 90% today. For some computational linguists, this has been a shock to the system, as ideas from statistical physics, information theory, and machine learning have replaced theoretical linguistics as the tools of choice. This raises a question that we may not even have asked a few years ago: What is the role of linguistics in computational linguistics?

To address the question, I discuss two research problems in which linguistic insights and empirical approaches have met in hybrid systems with mixed results. The first is the sentence realization problem: given an abstract representation of a sentence, we aim to produce a fluent string. For a transfer-based MT system, this is the third leg of a three-step process (analysis, transfer, generation). In this talk, I focus on our empirical approach to an inherently linguistic problem, namely clausal extraposition in German. (This work is part of the Amalgam and MSR-MT systems.)
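As a rough illustration of where sentence realization sits in such a pipeline, here is a minimal sketch of the three-step transfer architecture. The function names and the toy predicate-argument representation are assumptions for illustration only; they are not the Amalgam or MSR-MT interfaces.

    # Sketch of a transfer-based MT pipeline (hypothetical names, toy data).
    # The generate() step is the sentence-realization problem discussed above.

    def analyze(source_sentence: str) -> dict:
        """Parse the source sentence into an abstract representation.
        Placeholder: a toy predicate-argument structure."""
        return {"pred": "read", "subj": "Hans", "obj": "the book"}

    def transfer(source_repr: dict) -> dict:
        """Map the source-language representation into a target-language one.
        Placeholder: hand-coded lexical substitutions."""
        return {"pred": "lesen", "subj": "Hans", "obj": "das Buch"}

    def generate(target_repr: dict) -> str:
        """Sentence realization: linearize the abstract representation into
        a fluent target string. Placeholder: a fixed template."""
        return f"{target_repr['subj']} liest {target_repr['obj']}."

    if __name__ == "__main__":
        print(generate(transfer(analyze("Hans reads the book."))))

The interesting cases are exactly the ones a template cannot handle, such as deciding when a German subordinate clause should be extraposed to the right of the verb-final position.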

The second problem is email classification. Our task is to identify email messages containing sentences of interest, in particular, tasks (action items) and promises (commitments). We extract a broad set of features, some superficial and some reflecting deeper linguistic phenomena, as raw material for the classification task. The product is a working solution that helps to address the email overload problem. In this case, we find that all features contribute to better accuracy, although the linguistic features do not help as much as one would hope. (This work is part of the SmartMail/TaskFlags system.)
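To make the feature-engineering idea concrete, the sketch below shows one plausible way to mix superficial cues with a stand-in for deeper linguistic features before handing the vectors to any standard classifier. The specific features and word lists are illustrative assumptions, not the actual SmartMail/TaskFlags feature set.

    # Sketch of per-sentence feature extraction for task/commitment detection.
    # Features shown here are hypothetical examples, not the SmartMail features.
    import re

    def superficial_features(sentence: str) -> dict:
        """Surface cues: word unigrams, question mark, length bucket."""
        feats = {f"word={w.lower()}": 1 for w in re.findall(r"\w+", sentence)}
        feats["has_question_mark"] = int("?" in sentence)
        feats["length_bucket"] = min(len(sentence.split()) // 5, 4)
        return feats

    def linguistic_features(sentence: str) -> dict:
        """Placeholder for deeper features (e.g., parser output): here we only
        crudely approximate an imperative or request by the leading word."""
        words = sentence.split()
        first = words[0].lower() if words else ""
        return {"request_like_opening": int(first in {"please", "send", "review", "call"})}

    def extract_features(sentence: str) -> dict:
        """Combine both feature sets into one sparse vector (as a dict)."""
        return {**superficial_features(sentence), **linguistic_features(sentence)}

    print(extract_features("Please send me the revised draft by Friday?"))

In the actual experiments, it is the relative contribution of the two feature families, measured by classification accuracy, that answers the question posed at the start of the talk.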
