UW Aquatic & Fishery Sciences Quantitative Seminar

# Don L. Stevens, Jr.

Oregon State University, Department of Statistics

# Integrating Data from a Probability Survey and a Non-Probability Survey

## Abstract

Increasingly, federal and state agencies are using probability-based survey methodology to monitor the state of the environment. Historically, many environmental studies have not used probability-based methods to select study sites. In many cases, this has resulted in a substantial amount of targeted or non-probability data covering the same population as a probability survey. Ideally, the data from both sources should be combined for a comprehensive assessment. However, naïve combination of probability and non-probability survey data can result in selection bias, and consequently estimation bias, if the design of one or both surveys is non-ignorable. In this talk, I’ll explore two methods for merging a non-probability sample of an environmental resource with a probability sample of the same resource.

One approach applies selection functions to the problem of estimating an appropriate weight for non-probability samples. Briefly, if *X* is an attribute of a population with pdf *f1(x)*, a selection function *w(x)* is a function such that if individuals with *X = x* are selected from the population with probability *w(x)*, then the pdf of the resulting population is *f2(x) = c w(x) f1(x)*, where *c* is a normalizing constant. We can view the non-probability sample as being filtered through a non-trivial selection function. The other approach uses a Dirichlet tessellation of the population domain to assign weights to sites. Briefly, the weight assigned to a site (probability or non-probability) is the population total within the Dirichlet polygon associated with the site. The two approaches are illustrated with data from a lake survey conducted by the USEPA and data from a stream survey conducted by the California Department of Fish and Game.
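The selection-function idea can be illustrated with a small simulation. The sketch below is not the method presented in the talk: it assumes *f1* is a standard normal density and that *w(x)* is a known logistic function, both chosen purely for illustration, and the function names are hypothetical. In practice *w(x)* would have to be estimated, for example by comparing the non-probability sample with the probability sample. The code filters draws from *f1* through *w(x)* and then reweights each retained value by *1 / w(x)* to offset the selection.

```python
import math
import random


def selection_weight(x):
    """Hypothetical selection function w(x): larger x is more likely to be kept."""
    return 1.0 / (1.0 + math.exp(-x))


def draw_selected_sample(n, rng):
    """Draw from f1 (assumed standard normal) and keep each x with probability w(x).

    The retained values follow f2(x) = c * w(x) * f1(x).
    """
    kept = []
    while len(kept) < n:
        x = rng.gauss(0.0, 1.0)
        if rng.random() < selection_weight(x):
            kept.append(x)
    return kept


def weighted_mean(sample):
    """Estimate the mean under f1 using weights proportional to 1 / w(x)."""
    weights = [1.0 / selection_weight(x) for x in sample]
    total = sum(wt * x for wt, x in zip(weights, sample))
    return total / sum(weights)


rng = random.Random(0)
sample = draw_selected_sample(20000, rng)

# Unweighted mean of the filtered sample: biased toward large x.
naive_mean = sum(sample) / len(sample)

# Reweighted mean: should sit close to the true f1 mean of 0.
adjusted_mean = weighted_mean(sample)

print(f"naive mean:    {naive_mean:.3f}")
print(f"adjusted mean: {adjusted_mean:.3f}")
```

Under these assumptions, the unweighted mean is pulled upward by the selection function, while the reweighted mean falls near the *f1* mean of zero. Normalizing by the sum of the weights means the constant *c* never needs to be computed.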