Emerging Infections of International Public Health Importance



Module 3:  Public Health Response  
LECTURE 2 Readings


 
Describing Place:  Geographic Information Systems

Dick Hoskins,  PhD, MPH

 
Objectives:
  1. Know what GIS is and how it can be used to map infections
     
  2. Be able to utilize GIS and data to determine concentration of disease
     
  3. Use information from GIS to be able to strategize prevention and control measures

 

Introduction


GIS (geographic information system) refers to doing analytical work with maps, not just numbers. A geographic information system certainly has a lot to do with mapping and maps, but it's more than that. It is a way to organize information in terms of what's called a "spatial entity." This means that we are going to focus our look at epidemiologic data on "where" as well as "when" and "what."

That geographic component can be a point given by longitude and latitude, like the location of a cholera case; a line, like a road; or a polygon, like a country or census tract. GIS is an instrument of exploratory data analysis and descriptive epidemiology. Lately it's becoming a part of analytical epidemiology.

What is GIS?

Most cohort, case-control, or descriptive studies in epidemiology don't include geography. That is, a person's location, where they work, their transportation, where they go every day, and where they actually physically spend their time in their community (their timeline in terms of latitude and longitude) are not included in epidemiologic studies. A person's or group's county, state or country may be part of the display of the data, but likely there is nothing about the actual spatial relationship of where they live to where others live.

Snow and a GIS

Click Here for Figure

Here we have the famous London cholera map by John Snow. The dots are the locations of cholera cases and the squares the location of the water pumps. Likely you recall that John Snow concluded that the Broad Street Pump had something to do with the cholera deaths in the London Soho neighborhood. He made this map and convinced the London authorities to put a lock on the pump. That was done and within a few days the number of deaths began to decline.

If Snow had a GIS back in 1854 he could have done something no one had ever been able to do before. He could have asked the map a question. Clicking on the regions of highest death rates (or any other) would reveal all the data associated with the regions: population, age distribution, income, and hundreds of other variables that characterize the "enumeration wards" of London.

The question is simple. The analyst sees the red areas on the map and wonders what these populations are all about, and are they different from the rest of London in some way? One of the variables in the GIS database might have been where they got their drinking water from. Of course we now know that that question would have been very important indeed.

Calculating relative rates

Click Here for Figure

GIS allows one to do calculations that always appeal to epidemiologists: rates and rate ratios. GIS packages are not statistical packages but can do the simple things. Most important, when one does a sophisticated statistical analysis in another software package, the results can be easily connected to all the geographies. The key feature of GIS is connecting data to a map.

Rate ratios for cholera deaths

Click Here for Figure

Snow could have calculated rate ratios and with a GIS taken a close look at where they were high and low. Anyone looking at this map would see that the high rates seem to be all together; same with the low rates and all those in between.
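As a rough illustration of the arithmetic, here is a minimal Python sketch of rates and rate ratios by area. The area names and counts are hypothetical, not Snow's figures, and the choice of the pooled rate for all areas as the reference is an assumption made for this example.

```python
# Hypothetical deaths and populations for three areas (not Snow's data).
areas = {
    "Area A": {"deaths": 30, "population": 10_000},
    "Area B": {"deaths": 12, "population": 8_000},
    "Area C": {"deaths": 3, "population": 6_000},
}


def rate_per_10k(deaths, population):
    """Crude death rate per 10,000 population."""
    return 10_000 * deaths / population


# Reference rate: all areas pooled together (an assumption for this sketch).
total_deaths = sum(a["deaths"] for a in areas.values())
total_population = sum(a["population"] for a in areas.values())
reference_rate = rate_per_10k(total_deaths, total_population)

# Rate ratio: each area's rate divided by the reference rate.
rate_ratios = {
    name: rate_per_10k(a["deaths"], a["population"]) / reference_rate
    for name, a in areas.items()
}

for name, ratio in rate_ratios.items():
    print(f"{name}: rate ratio {ratio:.2f}")
```

A ratio above 1 marks an area doing worse than the pooled reference. In a GIS, each ratio would be joined to its polygon and shaded on the map.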

Polygon overlay

Click Here for Figure

The "overlay" process, a central feature of GIS, allows us to estimate the death rates with respect to the boundaries of the water companies rather than being stuck with the rates at the enumeration areas. One might call these "synthetic estimates," that is, estimates of rates made by areal interpolation. The "overlay" that we see here in the blue map shows the estimated rates of death with respect to the water company boundaries rather than the enumeration areas.

Overlay results

Click Here for Figure

Clearly Southwark and Kent water districts show much higher death rates than any other area in London. GIS allows us to estimate death rates in areas for which we cannot do direct measurement. Snow could calculate death rates in the London enumeration areas because he could obtain the number of deaths and the population from the enrollment books at the church parishes, which were very good at counting their parishioners, and when and from what they died. GIS would have allowed Snow to extend this data to death rates in the water districts, each of which covered many enumeration districts, many of them only fractionally.
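The fractional bookkeeping behind such an overlay can be sketched in a few lines of Python. The numbers, the district names, and the area fractions below are hypothetical. The sketch also assumes that deaths and population are spread evenly within each enumeration area, which is the usual simplifying assumption of area-weighted interpolation.

```python
# Hypothetical enumeration areas. "share" is the fraction of each area's
# surface that falls inside each water district, as a polygon overlay
# would report it.
enumeration_areas = [
    {"deaths": 40, "population": 5000, "share": {"District X": 1.0}},
    {"deaths": 10, "population": 4000,
     "share": {"District X": 0.25, "District Y": 0.75}},
    {"deaths": 2, "population": 2000, "share": {"District Y": 1.0}},
]


def interpolate_rates(areas, per=10_000):
    """Area-weighted death rates for the overlaid districts."""
    totals = {}
    for area in areas:
        for district, fraction in area["share"].items():
            t = totals.setdefault(district, {"deaths": 0.0, "population": 0.0})
            # Allocate deaths and population in proportion to overlapping area.
            t["deaths"] += fraction * area["deaths"]
            t["population"] += fraction * area["population"]
    return {d: per * t["deaths"] / t["population"] for d, t in totals.items()}


district_rates = interpolate_rates(enumeration_areas)
for district, rate in district_rates.items():
    print(f"{district}: {rate:.1f} deaths per 10,000")
```

The middle area is split 25/75 between the two districts, so a quarter of its deaths and population are credited to District X and the rest to District Y.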

What is spatial confounding, or spatial autocorrelation? Often when you calculate rates of disease in places, the rates or their confidence intervals are biased because neighboring areas tend to be much alike. The areas are not independent observations, so you can't make comparisons between them very easily.
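One common way to put a number on this is Moran's I. The lecture does not name a statistic, so its use here is an assumption, and the rates and neighbor lists are made up. The sketch uses simple binary weights: two areas either share a border or they do not.

```python
def morans_i(values, neighbors):
    """Moran's I with binary weights (w_ij = 1 when j is a neighbor of i)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # Sum of cross-products of deviations over all neighboring pairs.
    cross = sum(dev[i] * dev[j] for i, js in neighbors.items() for j in js)
    s0 = sum(len(js) for js in neighbors.values())  # total of all weights
    return (n / s0) * cross / sum(d * d for d in dev)


# Four hypothetical areas in a row: 0 - 1 - 2 - 3
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

clustered = morans_i([10, 9, 2, 1], chain)     # like rates sit side by side
alternating = morans_i([10, 1, 10, 1], chain)  # neighbors are unlike

print(f"clustered rates:   I = {clustered:.2f}")
print(f"alternating rates: I = {alternating:.2f}")
```

Positive values mean that similar rates cluster together, as on the cholera map. Values near zero suggest no spatial pattern, and negative values mean neighbors tend to differ.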

GIS as an aid to data visualization

Click Here for Figure

GIS is right now mostly a data visualization tool. With mostly common sense it can lead us to assess the importance of spatial relationships and later to develop analytical models. Modern epidemiology started with maps (John Snow), and lately technology and the development of the theoretical framework in geography finally allow epidemiologists to do spatial analyses far more sophisticated than those used by John Snow. By no means should we understate his powers of spatial analysis. Consider that the germ theory was not available; consider the century in which he lived.

This slide also illustrates one of the central, if not the most important, features of GIS: data is linked to a map or, maybe better, a map is linked to data.

GIS generic questions, location

Click Here for Figure

GIS allows one to ask the map questions like where something is. One can get a location like longitude and latitude or proximity. Where is Leavenworth? The hatched polygon, in this case a Zip code, displays the answer.

GIS generic questions

Click Here for Figure

One can get answers to generic questions about area relationships, distance, perimeters, and proximity. How is A related to B and C? These are the answers to topological questions.

Ask the Map

Click Here for Figure

You can ask questions like, which census tracts have fewer than 100 families? Here we have census tract 59322. The population is 1,472. There are three families, three households, and 1,465 males. Where is this?

Walla Walla

Click Here for Figure

It's Walla Walla Prison. This is a very simple query for the map. SQL (structured query language) allows one to make these queries as complicated as one wants, extending over as many layers as one wants. An example: "What Zip code has < 100 families, has a median household income of < $15,000, and lies within 0.5 miles of Interstate Highway 5?"
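To give a feel for what such a query looks like, here is a small sketch using Python's built-in sqlite3 module. The table name, column names, and rows are all hypothetical. In particular, the distance to the highway is assumed to be already stored as a column; a real GIS would compute it from the highway layer with a spatial operator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE zip_areas (zip TEXT, families INTEGER, "
    "median_income INTEGER, miles_to_i5 REAL)"
)
# Hypothetical rows: (Zip code, families, median household income, miles to I-5)
conn.executemany(
    "INSERT INTO zip_areas VALUES (?, ?, ?, ?)",
    [
        ("00001", 80, 12000, 0.3),   # meets all three conditions
        ("00002", 80, 12000, 2.0),   # too far from the highway
        ("00003", 500, 14000, 0.1),  # too many families
        ("00004", 60, 30000, 0.4),   # income too high
    ],
)

query = (
    "SELECT zip FROM zip_areas "
    "WHERE families < 100 AND median_income < 15000 AND miles_to_i5 <= 0.5 "
    "ORDER BY zip"
)
matches = [row[0] for row in conn.execute(query)]
conn.close()

print(matches)
```

Each extra condition in the WHERE clause plays the role of another layer in the map query.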

Generic trend questions

Click Here for Figure

GIS can deal with generic trend questions, that is, considering how things change over time. This is a tough task for an ordinary database, but for a GIS database, changes in size, location, shape, and topology, and what new features appear over a period of time, are part of its capacity.

The yellow region could represent the extent of malaria in a green country, in this case in 1980, and how the disease had spread 10 years later.

GIS generic modeling questions

Click Here for Figure

GIS can help you model spatial relations and do "what if" analysis. Geographic modeling is like any other modeling activity. Reality is one thing and the model another but often it's the best we can do or need to do.

The "reality" is represented by ordinary geometric shapes in the "geographic model" and then a "what if" analysis can be done by changing some of the relationships, the shape or other characteristics.

Points, lines and polygons

Click Here for Figure

Data in a GIS are points, lines and polygons. Points may be towns or radio towers. Lines may be rivers or roads. Polygons may be Zip codes or towns. Scale is important here. If a map is a large-scale map (looking at a small area) then a town can't be a point; it must be a polygon. On a small-scale map (looking at a large area) it may be adequate to represent a town by a point.

Same thing with lines and polygons. If your map is at a large enough scale then a river can't be adequately represented by a line and must be shown as a polygon. Rivers on the ground have width and length. Lines have only length.

Geographically referenced data

Click Here for Figure

This means that ordinary data about an area, a person, a town, a community, or anything that can appear on the ground is linked to a location.

Primary characteristics of spatial data

Click Here for Figure

This shows that spatial data is about position, area, perimeter, distance, and proximity, connected to non-spatial attributes like age, race, sex and cause of death.
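A minimal Python sketch of this idea follows: one polygon that carries both its geometry and its non-spatial attributes. The class name, the coordinates (taken to be planar, in meters), and the attribute values are hypothetical. The area comes from the standard shoelace formula.

```python
from dataclasses import dataclass, field
from math import hypot


@dataclass
class PolygonFeature:
    """A polygon plus the ordinary (non-spatial) data linked to it."""
    vertices: list                      # (x, y) pairs, planar units (meters)
    attributes: dict = field(default_factory=dict)

    def _edges(self):
        pts = self.vertices
        # Pair each vertex with the next one, closing the ring at the end.
        return zip(pts, pts[1:] + pts[:1])

    def area(self):
        # Shoelace formula for a simple (non-self-intersecting) polygon.
        twice = sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in self._edges())
        return abs(twice) / 2

    def perimeter(self):
        return sum(hypot(x2 - x1, y2 - y1)
                   for (x1, y1), (x2, y2) in self._edges())


# A hypothetical 400 m by 300 m tract with made-up attributes.
tract = PolygonFeature(
    vertices=[(0, 0), (400, 0), (400, 300), (0, 300)],
    attributes={"population": 1200, "deaths": 3},
)

# Spatial and non-spatial data combine into one measure: people per square meter.
density = tract.attributes["population"] / tract.area()
print(tract.area(), tract.perimeter(), density)
```

Because geometry and attributes travel together, a measure such as population density falls out directly.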

Layers

Click Here for Figure

The fundamental quality of GIS in representing data is the superposition of layers of data one on top of the other. Suppose you had some transparent plastic sheets: one with county boundaries, one with a thematic map, another with water, others with geographic features, and the last with toxic waste sites. Piling them one on another makes a composite map.

With a GIS one can display some layers and not others, change which are on top of another, and most importantly, connect them all together for spatial analysis.

Asthma hospitalization rates

Click Here for Figure

This is an example of a multilayered map. It has a layer of Toxic Release Inventory sites, a layer of asthma hospitalization rates by census tract, a layer of Super Fund sites, another of air monitoring stations and one of Zip codes.

The table connected to the data shows the Zip code and how much the age-adjusted rate of hospitalization is above the state rate.

Two problems
 
Two problems epidemiologists seldom really work...
1. Taking geography into account within the context of disease and exposure.

Ecological fallacy vs. the "atomistic* fallacy"

2. Dealing with small numbers.

Epidemiologists can't walk away from this one anymore.

*Ignoring the social & geographical context of an individual. (Can be as bad as the ecological fallacy, where characteristics of an individual are assumed from the characteristics of the group, such as someone's census tract.)

For so long epidemiologists have avoided the ecological fallacy: attributing the characteristics of a group to an individual. Just because someone lives in a high income, very white, highly educated neighborhood does not mean that they have any, much less all, of these characteristics. However, there might be another side of the coin. Suppose in a case-control study we collect lots of personal information about all sorts of possible exposures but ignore the geographic, spatial, and social context in which a person lives. This is the "atomistic fallacy."

Dealing with small numbers is not the bane of descriptive epidemiologists anymore. Many techniques have been developed to deal with small areas - Empirical Bayes smoothing, headbanging, mixed models, sandwich estimators - all of them a step towards dealing with small numbers and arbitrary geographies.

Monte Carlo Simulation
 
The first of two methods made possible by GIS:
1. Monte Carlo simulation

Characterizing the environment of exposure (or disease)

  • Distribution free
  • Non-parametric
  • Easier to understand
  • Results are as good (or better)
  • Can be applied to any geography

Monte Carlo simulation allows us to characterize exposure or outcome without dealing with distribution-dependent ordinary statistics. The method is simple and easy to understand, the results are as good as or better than those of ordinary statistical methods, it can be applied to any geography, and spatial confounding can be accounted for on the fly.

Characteristics of people living near Super Fund Sites
 
  • Geocode SF sites - 257 sites
  • Construct 1 mile buffer
  • Overlay buffers over census data
  • Assign random locations for control sites - 16000 points (eliminate water, parks, and other unlikely locations)
  • For control sites construct 1 mile buffer and overlay over census data
  • Construct empirical distribution by repeated random sampling of 257 sites
  • Compare SF sites with empirical distribution

The idea is to determine if the neighborhoods near Superfund sites are any different than other neighborhoods.

This notion could be easily extended to an infectious disease problem. One might want to characterize the neighborhoods where immunization rates are low, where malaria has had a recent flare-up, or where a cluster of E. coli cases has appeared.

Here we looked at 257 Super Fund and Toxic Release sites in selected counties in western Washington. We constructed 1 mile buffers (also 0.25, 0.5, and 0.75 mile buffers) over the census population data.

Using the GIS Overlay process we estimated the characteristics of the neighborhoods in the buffer by weighted areal interpolation.

The same thing was done for 16000 random points taken as control points. Then an empirical distribution was developed by repeated sampling with replacement from the control points.

Once the empirical distribution was developed, the toxic waste site buffers could be compared directly against it to determine statistical significance.
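The resampling step can be sketched as follows. This is only an outline under stated assumptions: the buffer characteristics are simulated numbers rather than the study's data, the number of replicates (1000) is an arbitrary choice, the mean is used as the summary statistic, and the one-sided p-value formula is a common convention that the lecture does not spell out.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Simulated stand-ins for one buffer characteristic (say, percent minority
# population) at 16000 control points and 257 sites. Not the study's data.
control_values = [rng.gauss(8.5, 6.0) for _ in range(16000)]
site_values = [rng.gauss(20.0, 6.0) for _ in range(257)]


def mean(xs):
    return sum(xs) / len(xs)


def empirical_distribution(controls, sample_size, replicates, rng):
    """Means of repeated samples, drawn with replacement, from the controls."""
    return [
        mean(rng.choices(controls, k=sample_size))  # choices() samples with replacement
        for _ in range(replicates)
    ]


null_means = empirical_distribution(control_values, len(site_values), 1000, rng)
observed = mean(site_values)

# One-sided empirical p-value: how often a random set of 257 control
# points looks at least as extreme as the real sites.
p_value = (1 + sum(m >= observed for m in null_means)) / (1 + len(null_means))

print(f"observed mean at sites: {observed:.1f}")
print(f"empirical p-value: {p_value:.4f}")
```

No distributional assumption is needed: the controls themselves define what "ordinary" neighborhoods look like, and the sites are judged against that.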

Superfund and TRI sites in King and Pierce Counties

Click Here for Figure

The black dots show the locations. Each has been "geocoded," which means each has been assigned a longitude and latitude to 6 decimal places. GIS packages can do this, as can high-end geocoders like Centrus. One can even use a Global Positioning System (GPS) receiver, a device that uses earth-orbiting satellites to determine location to within a few meters, in some cases centimeters.

1 mile buffers for super fund sites

Click Here for Figure

Polygon overlays are used to characterize the neighborhoods around Super Fund and TRI sites.

We used other buffer sizes as well.
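A simplified version of the buffer idea can be sketched in Python. Instead of a true polygon overlay, this sketch just asks which points (for example, census block centroids) fall within a given distance of a geocoded site. The coordinates are hypothetical, the great-circle (haversine) distance is an assumption made for this example, and centroid inclusion is cruder than the areal weighting used in the study.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lon1, lat1, lon2, lat2):
    """Great-circle distance in miles between two lon/lat points in degrees."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def within_buffer(site, points, radius_miles):
    """Points lying inside a circular buffer around the site."""
    return [
        p for p in points
        if haversine_miles(site[0], site[1], p[0], p[1]) <= radius_miles
    ]


site = (-122.335167, 47.608013)     # hypothetical geocoded site (lon, lat)
block_centroids = [
    (-122.335167, 47.615000),       # roughly half a mile north of the site
    (-122.335167, 47.640000),       # roughly two miles north of the site
]

for radius in (0.25, 0.5, 0.75, 1.0):
    inside = within_buffer(site, block_centroids, radius)
    print(f"{radius} mile buffer: {len(inside)} centroid(s) inside")
```

Looping over the radii mirrors the use of several buffer sizes around the same site.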

Control Sites

Click Here for Figure

This slide shows the control sites. The question in selecting control sites is how to assign a selection probability. It's not clear that it should be uniformly random. After all, some locations could never have had an industrial site built on them because there was no shipping access, no high-volume water access, or not the required zoning.

However in this case it is uniform. High income Mercer Island (in the red circle) is assumed just as likely to have been a Super Fund/TRI site as any other site.

1 mile buffers around control sites

Click Here for Figure

More of the same but for control sites. Again - an infectious disease problem could be worked the same way.

John Snow could have used this technique.

Income quadrat map

Click Here for Figure

This map shows the relationship of per capita income to the location of the toxic waste sites. A quadrat is a grid and this one is overlaid over the census tracts to give estimates of the per capita income in each little square.

Clearly low-income areas tend to have more sites than high-income areas. Also in an area like Seattle lots of formerly low-income areas are now high-income. Likely the change in the other direction has not happened.

Minority race distribution

Click Here for Figure

This shows the empirical distribution made from the control sites and the location of the SF/TRI sites on that distribution. The toxic sites have 20% of their population as minorities compared to 8.5% for the rest of the area, and the significance level is well below 5%.

Income and SF sites

Click Here for Figure

Using the same graphical representation of an empirical distribution we see here the income distribution around the sites.

It's lower for the toxic sites and clearly quite different than the rest of the city.

SF/TRI compared to control sites

Click Here for Figure

This table shows comparisons for many other variables.

All were statistically significant at 5% or less except the percentage of college graduates.

Malaria in a Kenyan Village

Click Here for Figure

The same technique is used to characterize the areas around malaria cases in a Kenyan village.

The data is simulated. The scenario is real but the actual numbers and the geography are made up just to test this method.

A 50 m buffer is used around malaria cases, and the locations of water puddles, old tires, and houses that have been routinely sprayed for mosquitoes are noted. Just as before, controls were selected randomly but characterized the same way.

Empirical distribution for malaria risks

Click Here for Figure

Using what we know of malaria risks, the simulation was carried out to assess the importance of water puddles, distance from the river, being near a sprayed house (or living in one) and having old tires nearby. Puddles, the river, and old tires are breeding areas for mosquitoes.

Bayesian Smoothing
 
The second of two methods made possible by GIS:
2. Bayesian Smoothing
  • How to adjust rates in a "small" area.
  • What to do about small number of events in a small population.
  • How to account for location. Is geography important?

Here we attempt to deal with "small" numbers. All this means is dealing with rates where the numerator is small and so is the denominator.

But just as important is "how do we account for location?"

The areas for which we calculate rates are arbitrary. What is so special about a census tract or a county boundary?

Will re-drawing the boundaries appear to change the disease rate at a particular area or point?

Cancer deaths by Zip code

Click Here for Figure

The problem is dramatically illustrated here. See how areas with small numbers of deaths can have very large death rates. Does this reflect some special problem that public health practitioners should be aware of?

Empirical Bayes estimation

Bayesian smoothing has a lot to do with adjusting an area's apparent rate of disease with input from a distribution made from the data for the whole state, or perhaps better, just from the area's neighbors.

When the "prior distribution" is ground through the Bayesian machine, which combines it with a likelihood function for the local data, a "posterior distribution" results, from which the usual parameters such as means and standard errors can be drawn.

Getting a firm hold on the Bayesian notion is not within the scope of the discussion here, but by looking over the following material you might be able to get some feel about what this is all about.

Non/Bayesian estimates

Click Here for Figure

The usual rate is just the number of cases divided by the population. No surprise here.

The Bayesian rate is adjusted using the Bayesian Estimate formula displayed. The adjusted rate is a weighted combination of the local rate and the mean of the prior distribution, with a weight W that is determined by the variance and mean of the prior distribution. As W gets closer to 1, this reflects higher confidence in the local rate as a function of its variance and population size.

More Bayes

All of this is not exactly intuitively obvious, but in short it means that we have a formula that makes adjustments to a rate in a "small" area depending on how big the population is for that area and how big the variance is for the "prior" distribution - that is, for the distribution of rates for the whole state or maybe the area's nearest neighbors.
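The formula on the slide is not reproduced in these notes, so the sketch below uses one common form of the empirical Bayes weight as an assumption: W = A / (A + m / n), where m and A are the mean and variance of the prior distribution and n is the area's population. The prior values and the counts are hypothetical.

```python
def smooth_rate(cases, population, prior_mean, prior_variance):
    """Shrink a local rate toward the prior mean.

    Returns (smoothed_rate, weight). The weight formula is an assumed,
    commonly used form, not necessarily the one on the lecture slide.
    """
    raw = cases / population
    weight = prior_variance / (prior_variance + prior_mean / population)
    smoothed = weight * raw + (1 - weight) * prior_mean
    return smoothed, weight


prior_mean = 0.002      # hypothetical statewide rate
prior_variance = 1e-6   # hypothetical variance of true rates across areas

# Two areas with the same raw rate (1%) but very different populations.
small_rate, small_w = smooth_rate(2, 200, prior_mean, prior_variance)
large_rate, large_w = smooth_rate(200, 20_000, prior_mean, prior_variance)

print(f"small area: W = {small_w:.2f}, smoothed rate = {small_rate:.4f}")
print(f"large area: W = {large_w:.2f}, smoothed rate = {large_rate:.4f}")
```

The small area has a low weight, so its rate is pulled most of the way back toward the statewide value. The large area keeps a rate close to what was observed.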

Method of moments

Click Here for Figure

The method of getting all this Bayes business started is fun for statisticians and they are well acquainted with the "method of moments" for fitting a distribution.

The message here for the rest of us is in the red box. There are two ways to smooth rates, Aspatial and Spatial.

Aspatial says "let's use information to smooth this rate from what we know about this age/sex/race group from the whole state." Spatial says "let's use information to smooth this rate from what we know about this age/sex/race group from just the neighbors."

It makes a lot of sense to make adjustments with information from those closest rather than those on the other side of the state.
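The difference between the two can be sketched in Python. The method-of-moments estimates of the prior mean and variance below follow a widely used textbook form, which is an assumption here since the slide's version is not shown. The counts and the neighbor lists are hypothetical.

```python
def moments(cases, pops):
    """Method-of-moments estimates of the prior mean and variance."""
    m = sum(cases) / sum(pops)                       # pooled rate
    s2 = sum(n * (c / n - m) ** 2 for c, n in zip(cases, pops)) / sum(pops)
    a = max(0.0, s2 - m / (sum(pops) / len(pops)))   # variance cannot be negative
    return m, a


def eb_smooth(cases, pops, neighbors=None):
    """Aspatial smoothing if neighbors is None, spatial smoothing otherwise."""
    smoothed = []
    for i, (c, n) in enumerate(zip(cases, pops)):
        # Aspatial: the prior comes from every area. Spatial: only from the
        # area itself plus its neighbors.
        idx = range(len(cases)) if neighbors is None else [i] + list(neighbors[i])
        m, a = moments([cases[j] for j in idx], [pops[j] for j in idx])
        w = a / (a + m / n) if m > 0 else 0.0
        smoothed.append(w * (c / n) + (1 - w) * m)
    return smoothed


# Hypothetical deaths and populations for five areas, with their neighbors.
cases = [3, 40, 45, 5, 4]
pops = [300, 20000, 25000, 5000, 4000]
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}

aspatial = eb_smooth(cases, pops)
spatial = eb_smooth(cases, pops, neighbors)

for i in range(len(cases)):
    print(f"area {i}: raw {cases[i] / pops[i]:.4f}, "
          f"aspatial {aspatial[i]:.4f}, spatial {spatial[i]:.4f}")
```

Area 0, with only 300 people, moves a long way from its raw rate under both versions. The spatial version pulls it toward its neighbors rather than toward the statewide figure.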

Deaths from breast cancer

Click Here for Figure

This slide shows the result of Bayesian smoothing. The top map shows some areas in red where breast cancer appears to be really elevated. But these are areas with small populations.

Are these rates "real?"

The lower map shows the Bayesian smoothed map. Things have changed. The hot spots have disappeared.

But some areas in the red circle still remain elevated.

COPD deaths

Click Here for Figure

The same calculation was done for COPD.

Most of the elevated rate areas shrink back to the state level, but some, although reduced, are still elevated. Are these areas where there really is a problem?

Bayesian smoothed vs. unsmoothed rate ratios

Click Here for Figure

Here you can see that Zip codes with small populations tend to have higher rate ratios than higher population Zip codes. After the Bayes process is applied all of them shrink to more reasonable values.

Summary
 
GIS can help with
  • Using the geographic context of exposures and disease outcome for assessment, surveillance, and modeling.
  • Dealing effectively with small number problems.

In this discussion we have shown how GIS can be used to bring geographic context into epidemiologic studies and to assess its importance. Details vary, but in the main it is the same for infectious and chronic disease.

GIS allows one to determine if one area is having the same exposure or disease experience as another.

Also it allows one with the help of a little statistics to deal with small areas, that is, unstable rates in areas with small populations.

Finally, GIS allows public health data to be communicated to professionals and lay people alike.

 

Study Questions:
  1. Describe how GIS can be used in public health for surveillance.

  2. Describe how GIS can be used in public health for control.

  3. Name the factors of emergence that are best identified and outlined with the use of GIS.

 



UW Home © 2002 University of Washington Department of Health Services
Box 357660, Seattle, WA 98195-7660
e-mail:
carrieho@u.washington.edu