The development of broad domain statistical machine translation systems is gated by the availability of parallel data. A promising strategy for mitigating data scarcity is to mine parallel data from comparable corpora. Although comparable corpora seldom contain parallel sentences, they often contain parallel words or phrases. Recent fragment extraction approaches have shown that including parallel fragments in SMT training data can significantly improve translation quality. We describe efficient and effective generative models for extracting fragments, and demonstrate that these algorithms produce competitive improvements on cross-domain test data without suffering in-domain degradation even at very large scale.
Back to symposium main page