William Lewis and Fei Xia

Microsoft (Lewis), UW Linguistics (Xia)

Applying NLP Technologies to the Collection and Analysis of Language Data to Aid Linguistic Research

As vast amounts of language data have become available electronically, linguistics is gradually transforming itself into a discipline in which science is often conducted using corpora. In this talk, we review the process of building ODIN, the Online Database of Interlinear Text, a multilingual repository of linguistically analyzed language data. ODIN is built from interlinear text harvested from scholarly linguistic documents posted to the Web. It currently holds nearly 190,000 instances of interlinear text, providing annotated data for more than 1,000 languages (more than 10% of the world's languages). Further, we have sought to enrich the collected data and to extract "knowledge" from the enriched content. This work demonstrates the benefits of using natural language processing technology to create resources and tools for linguistic research, giving linguists easy access not only to language data embedded in existing linguistic papers, but also to automatically generated language profiles for hundreds of languages.

