Jeremy Kahn

UW, Dept of Linguistics

Unlexicalizing the comma: an orthographic assist to statistical machine translation

Machine translation word-to-word alignment systems (Och and Ney [2000]) rely on simple (but computationally tractable) assumptions about the structure of the translation process, and are still quite noisy when applied to language pairs with relatively frequent long-distance reordering phenomena (e.g., Chinese/English). More sophisticated models of translation depend on this core word-to-word alignment, usually to guide the discovery of higher-order connections: for example, "phrases" in Koehn et al. [2003], or grammatical interactions with translation, as in Quirk et al.'s [2005] Dependency Treelet Translation (source side) and Galley et al.'s [2004] GHKM system (target side).

This research revisits the alignment question, looking at one of the simplest forms of source-language metadata available: the comma. Commas are inserted in the source language by writers (and editors), and they are frequent in text. We hypothesize that commas should constrain translation reorderings (at least sometimes) and explore methods to use this information. This presentation includes empirical evidence that commas are not respected by the usual statistical word-alignment algorithms, demonstrates a comma-based coherence feature that provides an avenue for improvement of a state-of-the-art MT system, and discusses further ideas for extending this sort of feature discovery beyond the comma.
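
As a rough illustration of what it could mean for an alignment to "respect" commas, the sketch below splits a source sentence into comma-delimited segments and counts pairs of segments whose aligned target spans interleave. This is an assumed, simplified measure and not necessarily the coherence feature used in this work; the function names (comma_segments, comma_violations) and the representation of an alignment as (source index, target index) pairs are hypothetical choices made for the example.

```python
from itertools import combinations

# Hypothetical helper: tokens treated as commas (ASCII and full-width).
COMMAS = {",", "，"}

def comma_segments(source_tokens):
    """Split source token indices into runs delimited by comma tokens."""
    segments, current = [], []
    for i, tok in enumerate(source_tokens):
        if tok in COMMAS:
            if current:
                segments.append(current)
            current = []
        else:
            current.append(i)
    if current:
        segments.append(current)
    return segments

def comma_violations(source_tokens, alignment):
    """Count pairs of comma-delimited source segments whose target spans overlap.

    `alignment` is an iterable of (source_index, target_index) links.
    Segments with no aligned words are skipped. Moving a whole segment
    is not counted as a violation; only interleaved spans are.
    """
    links = list(alignment)
    spans = []
    for seg in comma_segments(source_tokens):
        seg_set = set(seg)
        targets = [t for s, t in links if s in seg_set]
        if targets:
            spans.append((min(targets), max(targets)))
    return sum(
        1
        for (a_lo, a_hi), (b_lo, b_hi) in combinations(spans, 2)
        if a_lo <= b_hi and b_lo <= a_hi
    )
```

Under this assumed definition, an alignment that moves an entire comma-delimited segment to a different position in the target still counts as coherent, while an alignment that mixes words from two segments in the target does not.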
