School of Public Health   University of Washington Department of Health Services

Potential Applications in Infectious Surveillance


INDEX

What is GIS?
Snow and a GIS
Calculating relative rates
Rate Ratios for Cholera Deaths
Polygon overlay
Overlay Results
GIS as an aid to Data Visualization
GIS generic questions, location
GIS generic questions I
Ask the map
Walla Walla
Generic trend Questions
GIS generic modeling questions
Points, lines and polygons
Geographically referenced data
Primary characteristics of spatial data
Layers
Asthma hospitalization rates
Two problems
Monte Carlo Simulation
Characteristics of people living near Super Fund Sites
Superfund sites in King and Pierce Counties
1 mile buffers for super fund sites
Control Sites
1 mile buffers around control sites
Income quadrat map
Minority race distribution
Income and SF sites
SF/TRI compared to control sites
Malaria in a Kenyan Village
Empirical distribution for malaria risks
Bayesian Smoothing
Cancer deaths by Zipcode
Empirical Bayes estimation
non/Bayesian estimates
more Bayes
Method of moments
Deaths from breast cancer
COPD deaths
Bayesian smoothed vs. unsmoothed rate ratios
Summary

Readings


  What is GIS?

Most cohort, case-control, and descriptive studies in epidemiology don't include geography. That is, a person's location (where they live and work, how they travel, where they physically spend their time in their community, their timeline in terms of latitude and longitude) is not included in epidemiologic studies. A person's or group's county or state may be part of the display of the data, but likely there is nothing about the actual spatial relationship of where they live to where others live.

Here we have the famous London cholera map by John Snow. The dots are the locations of cholera cases and the squares the location of the water pumps. Likely you recall that John Snow concluded that the Broad Street Pump had something to do with the cholera deaths in the London Soho neighborhood. He made this map and convinced the London authorities to put a lock on the pump. That was done and within a few days the number of deaths began to decline.

  Snow and a GIS

If Snow had a GIS back in 1854 he could have done something no one had ever been able to do before. He could have asked the map a question. Clicking on the regions of highest death rates (or any other) would reveal all the data associated with the regions: population, age distribution, income, and hundreds of other variables that characterize the "enumeration wards" of London.

The question is simple. The analyst sees the red areas on the map and wonders what these populations are all about, and are they different from the rest of London in some way? One of the variables in the GIS database might have been where they got their drinking water from. Of course we now know that that question would have been very important indeed.

  Calculating relative rates

GIS allows one to do the calculations that always appeal to epidemiologists: rates and rate ratios. GIS packages are not statistical packages, but they can do the simple things. Most important, when one does a sophisticated statistical analysis in another software package, the results can be easily connected to all the geographies. The key feature of GIS is connecting data to a map.

  Rate ratios for cholera deaths

Snow could have calculated rate ratios and, with a GIS, taken a close look at where they were high and low. Anyone looking at this map would see that the high rates cluster together, as do the low rates and all those in between.
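
As a minimal sketch of the arithmetic, the Python below computes crude rates and rate ratios for a few districts. The district names, the counts, and the choice of the whole study area as the reference population are assumptions made for illustration; they are not Snow's data.

```python
def rate_per_10000(deaths, population):
    """Crude death rate per 10,000 people."""
    return 10_000 * deaths / population


def rate_ratio(deaths, population, ref_deaths, ref_population):
    """Rate in one area divided by the rate in a reference population."""
    return (deaths / population) / (ref_deaths / ref_population)


# Hypothetical districts: name -> (deaths, population).
districts = {
    "A": (120, 8_000),
    "B": (15, 10_000),
    "C": (45, 12_000),
}

# Use the whole study area as the reference population (an assumption).
total_deaths = sum(deaths for deaths, _ in districts.values())
total_population = sum(pop for _, pop in districts.values())

ratios = {
    name: rate_ratio(deaths, pop, total_deaths, total_population)
    for name, (deaths, pop) in districts.items()
}

for name, (deaths, pop) in districts.items():
    print(name, round(rate_per_10000(deaths, pop), 1), round(ratios[name], 2))
```

A ratio above 1 marks a district whose rate exceeds the overall rate; these are the values a GIS would shade on the map.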

  Polygon overlay

The "overlay" process, a central feature of GIS, allows us to estimate the death rates with respect to the boundaries of the water companies rather than being stuck with the rates at the enumeration areas. One might call these "synthetic estimates", that is, estimates of rates made by areal interpolation. The overlay that we see here in the blue map shows the estimated rates of death with respect to the water company boundaries rather than the enumeration areas.

  Overlay results

Clearly the Southwark and Kent water districts show much higher death rates than any other area in London. GIS allows us to estimate death rates in areas for which we cannot do direct measurement. Snow could calculate death rates in the London enumeration areas because he could obtain the number of deaths and the population from the enrollment books at the church parishes, which were very good at counting their parishioners and recording when and from what they died. GIS would have allowed Snow to extend these data to death rates in the water districts, each of which covered many enumeration districts, many of them only fractionally.
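
The fractional coverage is what the interpolation works with. A minimal sketch in Python, assuming deaths and population are spread evenly within each enumeration area so that each area contributes to a water district in proportion to the share of its surface inside that district. The district names, counts, and overlap fractions are hypothetical; a GIS would compute the fractions from the polygon overlay.

```python
from collections import defaultdict

# Hypothetical enumeration areas. "overlap" gives the fraction of each
# area's surface that falls inside each water district.
enumeration_areas = [
    {"deaths": 40, "population": 2_000, "overlap": {"North": 1.0}},
    {"deaths": 10, "population": 4_000, "overlap": {"North": 0.25, "South": 0.75}},
    {"deaths": 90, "population": 3_000, "overlap": {"South": 1.0}},
]


def interpolate_rates(areas):
    """Area-weighted death rate for each target district."""
    deaths = defaultdict(float)
    population = defaultdict(float)
    for area in areas:
        for district, fraction in area["overlap"].items():
            # Allocate counts in proportion to the overlapping area.
            deaths[district] += fraction * area["deaths"]
            population[district] += fraction * area["population"]
    return {district: deaths[district] / population[district] for district in deaths}


district_rates = interpolate_rates(enumeration_areas)
for district, rate in district_rates.items():
    print(district, round(10_000 * rate, 1), "deaths per 10,000")
```

The even-spread assumption is the weak point: if people cluster in one corner of an enumeration area, the allocated counts will be off.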

What is spatial confounding, or spatial autocorrelation? Often, when you calculate rates of disease in places, the rates or their confidence intervals are biased because neighboring areas are much alike. The areas are not independent observations, so comparisons between them cannot be made very easily.
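
The lecture does not name a statistic for this. One commonly used measure is Moran's I, sketched below in Python as an illustration only. The four areas, their rates, and the simple neighbor matrix (1 where two areas share a border) are all hypothetical.

```python
def morans_i(values, weights):
    """Moran's I for a list of area values and a square neighbor-weight matrix."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    # Cross-products of deviations, counted only for neighboring pairs.
    cross = sum(
        weights[i][j] * deviations[i] * deviations[j]
        for i in range(n)
        for j in range(n)
    )
    spread = sum(d * d for d in deviations)
    return (n / total_weight) * (cross / spread)


# Four areas in a row; each is a neighbor of the one next to it.
neighbors = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = morans_i([10, 9, 2, 1], neighbors)    # similar rates side by side
alternating = morans_i([10, 1, 9, 2], neighbors)  # high and low rates interleaved
print(round(clustered, 3), round(alternating, 3))
```

Positive values indicate that neighbors resemble each other, which is the situation described above; negative values indicate that neighbors differ.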

  GIS as an aid to data visualization

GIS is right now mostly a data visualization tool. With mostly common sense it can lead us to assess the importance of spatial relationships and later to develop analytical models. Modern epidemiology started with maps (John Snow), and lately the technology and the development of the theoretical framework in geography finally allow epidemiologists to do spatial analyses that are far more sophisticated than those used by John Snow. By no means should we understate his powers of spatial analysis. Consider that the germ theory was not available; consider the century in which he lived.

This slide also illustrates one of the central, if not the most important, features of GIS: data is linked to a map or, maybe better, a map is linked to data.

  GIS generic questions, location

GIS allows one to ask the map questions like where something is. One can get a location like longitude and latitude or proximity. Where is Leavenworth? The hatched polygon, in this case a Zipcode, displays the answer.

  GIS generic questions I

One can get answers to generic questions about area relationships, distance, perimeters, and proximity: how is A related to B and C? - the answers to topological questions.

  Ask the Map

You can ask questions like, which census tracts have fewer than 100 families? Here we have census tract 59322. The population is 1,472. There are three families, three households, and 1,465 males. Where is this?

  Walla Walla

It's Walla Walla Prison. This is a very simple query for the map. SQL (Structured Query Language) allows one to make these queries as complicated as you want, extending over as many layers as you want. An example: "What Zipcode has < 100 families, has a median household income of < $15,000, and lies within .5 miles of Interstate Highway 5?"
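
A query like that one can be sketched with Python's built-in sqlite3 module. The table name, column names, and rows below are hypothetical, and the distance to the highway is assumed to have been computed beforehand by the GIS and stored as an ordinary column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE zipcodes ("
    "zipcode TEXT, families INTEGER, median_income INTEGER, miles_to_i5 REAL)"
)

# Made-up rows: (zipcode, families, median household income, miles to I-5).
rows = [
    ("00001", 80, 14_000, 0.3),
    ("00002", 80, 14_000, 2.0),   # too far from the highway
    ("00003", 500, 12_000, 0.1),  # too many families
    ("00004", 60, 40_000, 0.4),   # income too high
]
conn.executemany("INSERT INTO zipcodes VALUES (?, ?, ?, ?)", rows)

query = """
    SELECT zipcode
    FROM zipcodes
    WHERE families < 100
      AND median_income < 15000
      AND miles_to_i5 <= 0.5
"""
matches = [row[0] for row in conn.execute(query)]
conn.close()
print(matches)
```

In a GIS the same WHERE clause would drive the map display, highlighting the matching polygons.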

  Generic trend questions

GIS can deal with generic trend questions, that is, questions about how things change over time. This is a tough task for an ordinary database, but a GIS database has the capacity to handle changes in size, location, and shape, changes in topology, and new features that appear over a period of time.

The yellow region could represent the extent of malaria in a green country: in this case, its extent in 1980 and how far the disease had spread 10 years later.

  GIS generic modeling questions

GIS can help you model spatial relations and do "what if" analysis. Geographic modeling is like any other modeling activity. Reality is one thing and the model another but often it's the best we can do or need to do.

The "reality" is represented by ordinary geometric shapes in the "geographic model" and then a "what if" analysis can be done by changing some of the relationships, the shape or other characteristics.

  Points, lines and polygons

Data in a GIS are points, lines and polygons. Points may be towns or radio towers. Lines may be rivers or roads. Polygons may be Zipcodes or towns. Scale is important here. On a large-scale map (looking at a small area) a town can't be a point; it must be a polygon. On a small-scale map (looking at a large area) it may be adequate to represent a town by a point.

The same goes for lines and polygons. If your map is at a large enough scale, a river can't be adequately represented by a line and must be shown as a polygon. Rivers on the ground have width and length. Lines have only length.

  Geographically referenced data

This means that ordinary data about an area, a person, a town, a community, or anything else that can appear on the ground is linked to a location.

  Primary characteristics of spatial data

This shows that spatial data is about position, area, perimeter, distance, and proximity, connected to non-spatial attributes like age, race, sex and cause of death.

  Layers

The fundamental way a GIS represents data is by superimposing layers of data one on top of the other. Suppose you had some transparent plastic sheets: one with county boundaries, one with a thematic map, another with water, others with geographic features, and the last with toxic waste sites. Piling them one on another makes a composite map.

With a GIS one can display some layers and not others, change which are on top of another, and most importantly, connect them all together for spatial analysis.

  Asthma hospitalization rates

This is an example of a multilayered map. It has a layer of Toxic Release Inventory sites, a layer of asthma hospitalization rates by census tract, a layer of Super Fund sites, another of air monitoring stations, and one of Zipcodes.

The table connected to the data shows the Zipcode and how much the age-adjusted rate of hospitalization is above the state rate.

  Two problems

For so long epidemiologists have avoided the ecologic fallacy: attributing the characteristics of a group to an individual. Just because someone lives in a high income, very white, highly educated neighborhood does not mean that they have any, much less all, of these characteristics. However there might be another side of the coin. Suppose in a case control study we collect lots of personal information about all sorts of possible exposures but ignore the geographic, spatial, and social context in which a person lives? This is the "atomistic fallacy."

Dealing with small numbers is not the bane of descriptive epidemiologists anymore. Many techniques have been developed to deal with small areas - Empirical Bayes smoothing, headbanging, mixed models, sandwich estimators - all of them a step towards dealing with small numbers and arbitrary geographies.

  Monte Carlo Simulation

Monte Carlo simulation allows us to characterize exposure or outcome without relying on the distributional assumptions of ordinary statistics. The method is simple and easy to understand, its results are better than those of ordinary statistical methods, it can be applied to any geography, and spatial confounding can be accounted for on the fly.

  Characteristics of people living near Super Fund Sites

The idea is to determine if the neighborhoods near Superfund sites are any different than other neighborhoods.

This notion could be easily extended to an infectious disease problem. One might want to characterize the neighborhoods where immunization rates are low, where malaria has had a recent flare-up, or where a cluster of E. coli cases has appeared.

Here we looked at 257 Super Fund and Toxic Release sites in selected counties in western Washington. We constructed 1 mile buffers (also .25, .5, .75 mile buffers) over the census population data.

Using the GIS overlay process we estimated the characteristics of the neighborhoods in each buffer by weighted areal interpolation.

The same thing was done for 16000 random points taken as control points. Then an empirical distribution was developed by repeated sampling with replacement from the control points.

Once the empirical distribution was developed, the toxic waste site buffers could be compared directly against it to determine statistical significance.
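
The resampling step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the analysis actually run: the control values are synthetic, the observed site mean is hypothetical, and the statistic compared (the mean of one characteristic across the buffers) is a guess at a reasonable choice.

```python
import random


def empirical_distribution(control_values, sample_size, n_replicates, seed=0):
    """Means of repeated samples drawn with replacement from the controls."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_replicates):
        sample = rng.choices(control_values, k=sample_size)  # with replacement
        means.append(sum(sample) / sample_size)
    return means


def upper_tail_p_value(observed, distribution):
    """Share of means at least as large as the observed one.

    The observed value is counted as one more draw, so the result is never 0.
    """
    at_least = sum(1 for m in distribution if m >= observed)
    return (at_least + 1) / (len(distribution) + 1)


# Synthetic stand-in for one characteristic (say, percent minority) measured
# in the buffers around control points. Real values would come from the overlay.
data_rng = random.Random(1)
control_percent = [data_rng.uniform(0.0, 17.0) for _ in range(2_000)]

observed_site_mean = 20.0  # hypothetical mean across the site buffers
n_sites = 257              # one draw per site buffer in each replicate

distribution = empirical_distribution(control_percent, n_sites, 1_000)
p_value = upper_tail_p_value(observed_site_mean, distribution)
print(f"simulated means: {min(distribution):.1f} to {max(distribution):.1f}")
print(f"upper-tail p-value: {p_value:.4f}")
```

No distributional form is assumed anywhere; the controls themselves define what "ordinary" looks like.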

  Superfund sites in King and Pierce Counties

The black dots show the locations. Each has been "geocoded", which means each has been assigned a longitude and latitude to 6 decimal places. GIS packages can do this, as can high-end geocoders like Centrus. One can even use a Global Positioning System (GPS) receiver, a device that uses earth-orbiting satellites to determine location to within a few meters, or in some cases centimeters.

  1 mile buffers for super fund sites

Polygon overlays are used to characterize the neighborhoods around Super Fund and TRI sites.

We used other buffer sizes as well.
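
A circular buffer is just a distance test. The Python sketch below checks which points fall within a given radius of a site using the haversine great-circle distance. It is a simpler stand-in for the polygon overlay described above, not the overlay itself, and the coordinates, the point names, and the Earth radius constant are assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def in_buffer(site, point, radius_miles=1.0):
    """True if the point lies within radius_miles of the site."""
    return haversine_miles(site[0], site[1], point[0], point[1]) <= radius_miles


site = (47.60, -122.30)  # hypothetical site (latitude, longitude)
points = {
    "near": (47.61, -122.30),  # about 0.7 miles north of the site
    "far": (47.63, -122.30),   # about 2 miles north of the site
}

for radius in (0.5, 1.0):
    inside = [name for name, point in points.items() if in_buffer(site, point, radius)]
    print(radius, inside)
```

Changing `radius_miles` reproduces the other buffer sizes without touching the rest of the code.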

  Control Sites

This slide shows the control sites. The question for selecting control sites is how to assign a selection probability. It's not clear that it should be uniformly random. After all some sites could never have had an industrial site built on them because there was no shipping access, no high volume water access, or there was not the required zoning.

However in this case it is uniform. High income Mercer Island (in the red circle) is assumed just as likely to have been a Super Fund/TRI site as any other site.

  1 mile buffers around control sites

More of the same, but for control sites. Again, an infectious disease problem could be worked the same way.

John Snow could have used this technique.

  Income quadrat map

This map shows the relationship of per capita income to the location of the toxic waste sites. A quadrat is a grid and this one is overlaid over the census tracts to give estimates of the per capita income in each little square.

Clearly low-income areas tend to have more sites than high-income areas. Also in an area like Seattle lots of formerly low-income areas are now high-income. Likely the change in the other direction has not happened.

  Minority race distribution

This shows the empirical distribution made from the control sites and the location of the SF/TRI sites on that distribution. Minorities make up 20% of the population around the toxic sites compared to 8.5% for the rest of the area, and the result is statistically significant at well below the 5% level.

  Income and SF sites

Using the same graphical representation of an empirical distribution we see here the income distribution around the sites.

It's lower for the toxic sites and clearly quite different than the rest of the city.

  SF/TRI compared to control sites

This table shows comparisons for many other variables.

All were statistically significant at the 5% level or below except the percentage of college graduates.

  Malaria in a Kenyan Village

The same technique is used to characterize the areas around malaria cases in a Kenyan village.

The data is simulated. The scenario is real but the actual numbers and the geography are made up just to test this method.

A 50 m buffer is used around malaria cases, and the locations of water puddles, old tires, and houses that have been routinely sprayed for mosquitoes are noted. Just as before, controls were selected randomly but characterized the same way.

  Empirical distribution for malaria risks

Using what we know of malaria risks, the simulation was carried out to assess the importance of water puddles, distance from the river, being near a sprayed house (or living in one) and having old tires nearby. Puddles, the river, and old tires are breeding areas for mosquitoes.

  Bayesian Smoothing

Here we attempt to deal with "small" numbers. All this means is dealing with rates where the numerator is small and so is the denominator.

But just as important is "how do we account for location?"

The areas for which we calculate rates are arbitrary. What is so special about a census tract or a county boundary?

Will re-drawing the boundaries appear to change the disease rate at a particular area or point?

  Cancer deaths by Zipcode

The problem is dramatically illustrated here. See how areas with small numbers of deaths can have very large death rates. Does this reflect some special problem that public health practitioners should be aware of?

  Empirical Bayes estimation

Bayesian smoothing has a lot to do with adjusting an area's apparent rate of disease with input from a distribution made from the data for the whole state or, perhaps better, just from the data for the area's neighbors.

When the "prior distribution" is ground through the Bayesian machine which has a likelihood function, a "posterior distribution" results from which the usual parameters such as means and standard errors can be drawn.

Getting a firm hold on the Bayesian notion is not within the scope of the discussion here, but by looking over the following material you might be able to get some feel about what this is all about.

  non/Bayesian estimates

The usual rate is just the number of cases divided by the population. No surprise here.

The Bayesian rate is adjusted using the Bayesian estimate formula displayed. The adjusted rate is a weighted average of the local rate and the prior mean, with a weight W that is determined by the variance and mean of the prior distribution. As W gets closer to 1, this reflects higher confidence in the local rate as a function of its variance and population size.

  more Bayes

All of this is not exactly intuitively obvious, but in short it means that we have a formula that makes adjustments to a rate in a "small" area depending on how big the population is for that area and how big the variance is for the "prior" distribution - that is, for the distribution of rates for the whole state or maybe the area's nearest neighbors.
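
The slide's formula is not reproduced in these notes, so the Python below is a sketch of one common method-of-moments formulation rather than the exact formula shown. The prior mean is the overall rate, the prior variance is estimated from the spread of the local rates, and each area's weight W grows with its population. The case counts and populations are hypothetical.

```python
def empirical_bayes_rates(cases, populations):
    """Return (raw_rates, smoothed_rates), shrinking toward the overall rate."""
    total_population = sum(populations)
    prior_mean = sum(cases) / total_population
    raw = [c / n for c, n in zip(cases, populations)]

    # Method-of-moments estimate of the prior variance, floored at zero.
    mean_population = total_population / len(populations)
    weighted_var = sum(
        n * (r - prior_mean) ** 2 for r, n in zip(raw, populations)
    ) / total_population
    prior_var = max(weighted_var - prior_mean / mean_population, 0.0)

    smoothed = []
    for r, n in zip(raw, populations):
        denominator = prior_var + prior_mean / n
        # W near 1 trusts the local rate; W near 0 falls back on the prior mean.
        w = prior_var / denominator if denominator > 0 else 0.0
        smoothed.append(w * r + (1 - w) * prior_mean)
    return raw, smoothed


# Hypothetical areas: the first has very few people and a startling raw rate.
cases = [2, 30, 45, 300]
populations = [100, 10_000, 20_000, 100_000]

raw_rates, smoothed_rates = empirical_bayes_rates(cases, populations)
for n, r, s in zip(populations, raw_rates, smoothed_rates):
    print(n, round(1_000 * r, 2), round(1_000 * s, 2))
```

The smallest area is pulled most of the way back to the overall rate, while the largest area barely moves. Replacing the overall rate with the rate among an area's neighbors would give a spatial version of the same idea.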

  Method of moments

The method of getting all this Bayes business started is fun for statisticians and they are well acquainted with the "method of moments" for fitting a distribution.

The message here for the rest of us is in the red box. There are two ways to smooth rates, Aspatial and Spatial.

Aspatial says "let's use information to smooth this rate from what we know about this age/sex/race group from the whole state." Spatial says "let's use information to smooth this rate from what we know about this age/sex/race group from just the neighbors."

It makes a lot of sense to make adjustments with information from those closest rather than those on the other side of the state.

  Deaths from breast cancer

This slide shows the result of Bayesian smoothing. The top map shows some areas in red where breast cancer appears to be really elevated. But these are areas with small populations.

Are these rates "real?"

The lower map shows the Bayesian smoothed map. Things have changed. The hot spots have disappeared.

But some areas in the red circle still remain elevated.

  COPD deaths

The same calculation was done for COPD.

Most of the areas with elevated rates shrink back to the state level, but some, although reduced, are still elevated. Are these areas where there really is a problem?

  Bayesian smoothed vs. unsmoothed rate ratios

Here you can see that Zipcodes with small populations tend to have higher rate ratios than higher population Zipcodes. After the Bayes process is applied all of them shrink to more reasonable values.

  Summary

In this discussion we have shown how GIS can be used to assess the importance of geographic context in epidemiologic studies. Details vary, but in the main it is the same for infectious and chronic disease.

GIS allows one to determine if one area is having the same exposure or disease experience as another.

Also it allows one with the help of a little statistics to deal with small areas, that is, unstable rates in areas with small populations.

Finally, GIS allows public health data to be communicated to professionals and lay people alike.


Readings:

Emerging Infections 1992 Report (Textbook), Institute of Medicine, pp.34-17.

