William Lewis

Visiting Assistant Professor, Dept. of Linguistics, UW

Locating, Recognizing, and Converting Interlinear Text on the Web

UW/Microsoft Symposium, 02/03/06

The linguistics community is confronted by a quandary: while languages are going extinct at an alarming rate, the digital revolution has provided technologies that make it possible to record, analyze, and disseminate language data more efficiently than ever before. The traditional recording media of notebooks and analog recorders have become things of the past, replaced by sophisticated digital recorders, laptops, and PDAs. However, despite cheap storage and the communicative efficiencies provided by the World Wide Web, much language data has been recorded and analyzed with little attention to its preservation or dissemination. Language data is often housed locally and recorded in proprietary data formats that may themselves go extinct.

The Open Language Archives Community (OLAC; Bird and Simons 2003) was formed to develop standards for data encoding, so that language data remains usable over the long term, and to promote metadata standards, so that the data can be located and used. OLAC's problem now is to find resource providers willing to make the extra effort to encode their resources and data in a way that makes them searchable. Critical mass cannot be achieved until a sufficient number of archives, institutions, and individuals make their data available to OLAC; yet many will not take the extra steps necessary to reformulate their data until they recognize that the effort will be worth it.

A number of efforts are currently underway to take existing language data as it is already available on the Web and make it searchable, whether that data is embedded in journal articles, posted as part of language learning materials for revitalization efforts, housed in language archives hidden behind idiosyncratic user interfaces, or provided in Word or plain-text formats. The ODIN (Online Database of INterlinear text) project was started as a pilot to test the potential for locating language resources and data by concentrating on commonly used semi-structured data types, particularly those used to encode language data. ODIN's focus thus far has been on interlinearized text, a format common in the field of linguistics. By scanning online resources and documents for instances of interlinear text and applying both novel and commonly used language identification methods to the data, ODIN has achieved a fairly high degree of success at both locating language resources (most specifically, linguistic resources) and identifying the languages encoded, becoming the first fully automated OLAC data provider. Beyond locating resources containing language data, the ODIN team has experimented with methods for automatically migrating data encoded in legacy formats to best-practice XML, from which more extensive metadata and significantly greater interoperability can be obtained.
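
To make the idea of scanning documents for interlinear text concrete, here is a minimal sketch in Python. It is not ODIN's actual method, which this abstract does not describe. It assumes a simple, hypothetical heuristic: an instance is three consecutive lines in which the first two have the same number of whitespace-separated tokens (a source line and its word-by-word gloss) and the third is a free translation wrapped in quotes. The function name `find_igt_candidates` and the sample text are invented for illustration.

```python
import re

# Assumption: a free-translation line is wrapped in single or double quotes.
QUOTED = re.compile(r"""^\s*['"].*['"]\.?\s*$""")


def find_igt_candidates(text):
    """Return (line_index, source, gloss, translation) tuples for every
    three-line window that looks like interlinear glossed text."""
    lines = text.splitlines()
    hits = []
    for i in range(len(lines) - 2):
        source, gloss, translation = lines[i], lines[i + 1], lines[i + 2]
        src_tokens = source.split()
        gloss_tokens = gloss.split()
        # Source and gloss lines must align token for token, and the
        # third line must look like a quoted translation.
        if (
            len(src_tokens) >= 2
            and len(src_tokens) == len(gloss_tokens)
            and QUOTED.match(translation)
        ):
            hits.append((i, source.strip(), gloss.strip(), translation.strip()))
    return hits


# Invented sample document containing one interlinear example.
sample = """Some prose introducing the example.
Der Hund schläft
the dog sleep.3SG
'The dog is sleeping.'
More prose follows here.
"""

for index, source, gloss, translation in find_igt_candidates(sample):
    print(index, source, "|", gloss, "|", translation)
```

A heuristic this simple would miss instances with numbered examples, multi-word glosses, or unquoted translations, and a real system would need to handle those cases along with language identification of the source line.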
