# Kenneth W. Church+ and Ping Li*

### +Machine Learning and Applied Statistics Group MSR

*Statistics Department, Stanford University

## Using Sketches to estimate associations

### UW/Microsoft Symposium, 10/28/05

Suppose one wanted to use page hits to estimate associations or
mutual information, as Turney, Etzioni and others have done. We should
not have to look at the entire corpus (e.g., the Web) to know if two
(or more) words are associated or not. A powerful sampling technique
called Sketches was originally introduced to remove duplicate Web
pages. We generalize Sketches to estimate contingency tables and
associations, using a maximum likelihood estimator to find the most
likely contingency table given the sample, the margins (document
frequencies) and the size of the collection. The proposed method has
smaller (about 1/2) errors and more flexibility than the original
sketch method. Also, the proposed method can be extended naturally to
estimate multi-way associations, formulated as a convex programming
problem. As expected, computational work and statistical accuracy
(variance or errors) depend on sampling rate, as will be shown both
theoretically and empirically. Sampling methods become more and more
important with larger and larger collections. At Web scale, sampling
rates as low as 10-4 may suffice.

Back to symposium main page