Traditional text similarity measures treat each term separately: terms that differ but are semantically related are not matched and do not contribute to the final similarity score. This issue is more severe in the cross-lingual setting, where vocabularies in different languages have little overlap. In this work, we propose a novel discriminative training method that projects the raw term vectors into a common, low-dimensional vector space. Our approach finds the projection matrix that minimizes the loss of a preselected similarity function (e.g., cosine) applied to the projected vectors, and can efficiently handle a large number of training examples in the high-dimensional space. Evaluated on two very different tasks, cross-lingual document retrieval and ad relevance measure, our method not only outperforms existing state-of-the-art approaches, but also achieves high accuracy at low dimensions, which is a desirable property in practice.
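To make the setup concrete, the following is a minimal sketch (not the paper's actual training procedure) of learning such a projection: a matrix `A` maps raw term vectors into a shared low-dimensional space, and plain gradient descent on a squared-error loss pushes the cosine similarity of projected pairs toward their labels. The loss, optimizer, and all hyperparameters here are illustrative assumptions.

```python
import numpy as np


def cosine(p, q):
    """Cosine similarity with a small epsilon for numerical safety."""
    return p @ q / (np.linalg.norm(p) * np.linalg.norm(q) + 1e-12)


def train_projection(pairs, labels, in_dim, out_dim,
                     lr=0.05, epochs=300, seed=0):
    """Learn a projection matrix A (out_dim x in_dim) so that the cosine
    similarity of projected vectors matches pairwise labels.

    Squared-error loss and batch-size-1 gradient descent are assumptions
    made for brevity, not the method described in the abstract.
    """
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(out_dim, in_dim))
    for _ in range(epochs):
        for (x, y), t in zip(pairs, labels):
            p, q = A @ x, A @ y
            n_p, n_q = np.linalg.norm(p), np.linalg.norm(q)
            c = p @ q / (n_p * n_q + 1e-12)
            # Gradients of cosine similarity w.r.t. the projected vectors.
            dp = q / (n_p * n_q) - c * p / (n_p ** 2)
            dq = p / (n_p * n_q) - c * q / (n_q ** 2)
            g = 2.0 * (c - t)  # d(loss)/d(cosine) for squared error
            # Chain rule through p = A x and q = A y.
            A -= lr * g * (np.outer(dp, x) + np.outer(dq, y))
    return A
```

As a toy illustration of the cross-lingual case, one can use one-hot term vectors where terms 0-2 belong to one language and terms 3-5 to another: raw cosine between any cross-lingual pair is zero, but after training on labeled translation pairs, the projected vectors of related terms become similar.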