Douglas Downey

UW, Dept of Computer Science and Engineering

Autonomous Web-scale Information Extraction

UW/Microsoft Symposium, 6/06/08

Search engines are extremely useful tools for answering questions. However, a significant number of questions users might pose -- for example, "which actors have won an Oscar for playing a villain?" -- are difficult to answer using existing search engines, because the answers do not lie on a single page. To answer these kinds of queries, users must extract and synthesize information from multiple documents. Currently, this is a tedious and error-prone manual process. In this talk, I will describe research aimed at automating the extraction of this information from the Web. I begin by presenting a model of the redundancy inherent in the Web, and show that the model can be used to identify correct extractions autonomously, without the manually labeled examples typically assumed in previous information extraction research. However, the model has limited efficacy for the "long tail" of infrequently mentioned facts; my second investigation shows how unsupervised language models can be leveraged in concert with redundancy to overcome this limitation.
