Chris Quirk

Microsoft Research

Extracting parallel sentences from Wikipedia

Wikipedia is a compelling resource for statistical machine translation due to its sheer volume of data and broad coverage of both concepts and languages. Unfortunately, its comparable articles, which cover the same topic in different languages but are not necessarily direct translations, are not easily digestible by standard statistical machine translation engines. We present supervised models for extracting parallel sentences from noisy data that improve on the state of the art and can leverage the structure of Wikipedia. The resulting sentence pairs can significantly improve translation quality.

