Data and methods for "Upton Sinclair's 1934 EPIC Campaign: Anatomy of a Political Movement"

Vol. 12:4, December 2015
This page supports the article “Upton Sinclair’s 1934 EPIC Campaign: Anatomy of a Political Movement,” by James N. Gregory, published in LABOR: Studies in Working-Class History of the Americas (December 2015). The article is the first detailed exploration of the inner workings of Upton Sinclair’s pivotal campaign to become governor of California, including an analysis of who voted for Sinclair and why the campaign was so popular in some communities and so unpopular in others. Explanations of data and methods are presented here.

The page is organized in three parts:



Who voted for Sinclair? Analysis of voting patterns


Nancy Quam Wickham began this project with me and we are jointly responsible for much of the data and preliminary analysis. She has kindly allowed me to use these data. Thanks too to Margaret Miller for helping with research and to the UC Berkeley Institute for Labor and Employment for supporting data collection.

Previous attempts to analyze the vote have been mostly guesswork. This article uses ecological inference techniques and works with four datasets that combine demographic and electoral data: one consisting of 159 municipalities and unincorporated county areas, and three others based on 1940 census tracts. Los Angeles County had more than 500 tracts; San Francisco and Alameda counties each had more than 100. These demographic data were combined with election results, which were reported by precinct. Data from more than 5,000 precincts are used in this analysis.


  • General Election Returns November 6, 1934, for all California counties (microfilm copies from California Secretary of State). These are handwritten precinct reports submitted by county officials. Most provide totals for municipalities.
  • U.S. Bureau of the Census. Sixteenth Census of the United States: 1940. Population and Housing. Statistics by Census Tracts (Wash. D.C., 1941-42) volumes for Berkeley, Los Angeles, Oakland, San Francisco, Tables 1,2,3,5,6.
  • U.S. Bureau of the Census. Sixteenth Census of the United States: 1940. Population. Vol 2. California (Wash. D.C., 1943) Tables 21,22,23,24, 30, 31, 32, 33, pp.540-563.


Maps 1-3 were produced using ArcGIS with shapefiles from the 1940 Census supplied by the Minnesota Population Center. National Historical Geographic Information System: Version 2.0. Minneapolis, MN: University of Minnesota, 2011.

Data issues:

The analysis of voter demographics rests on datasets that combine 1934 election returns with 1940 census data at two geographic levels: census tracts and municipalities. Because census tract tabulations were never produced for the 1930 census, we used the 1940 tabulations. The five-and-a-half-year gap between the election and the demographic data is not ideal but could have been worse. The 1930s saw modest population mobility compared to most decades before and after. I am most concerned about Los Angeles County, which experienced a 26 percent population increase during the decade. Some of this growth occurred in new census tracts, which for that reason were dropped from the database. My assumption is that most existing neighborhoods retained their basic occupational and demographic profiles through the five-and-a-half-year interval. We tested this assumption in three census tracts using occupational information from the 1934 City Directory and the 1934 Great Register. The class composition of all three neighborhoods was similar in 1934 and 1940.

Census tract data were available for the three largest counties: Alameda, Los Angeles, and San Francisco. Three separate databases were constructed using the published census tables to record the number of individuals by occupational sector, age, sex, race, and nation of birth. Separately, election returns from more than 5,000 precincts in these counties were recorded for both primary and general elections. Using precinct maps from the Secretary of State’s office, we assigned precincts to census tracts. Where a precinct overlapped more than one tract, we estimated how to split it between the tracts. About 12 percent of the precincts could not be reliably assigned in this manner and were dropped.
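The assignment step described above can be illustrated with a minimal sketch. It is not the procedure actually used for the article, which relied on precinct maps and hand estimates; the precinct and tract identifiers, vote counts, and split shares below are all hypothetical, and the function name `aggregate_to_tracts` is invented for this example.

```python
from collections import defaultdict

# Hypothetical precinct returns: precinct id -> votes by candidate.
precinct_votes = {
    "P1": {"Sinclair": 120, "Merriam": 200, "Haight": 40},
    "P2": {"Sinclair": 300, "Merriam": 100, "Haight": 20},
    "P3": {"Sinclair": 80, "Merriam": 90, "Haight": 10},
}

# Hypothetical assignment: precinct id -> {tract id: estimated share of the precinct}.
# None marks a precinct that could not be reliably assigned to any tract.
precinct_to_tracts = {
    "P1": {"T1": 1.0},               # lies entirely inside one tract
    "P2": {"T1": 0.25, "T2": 0.75},  # overlaps two tracts; shares are estimates
    "P3": None,                      # unassignable, so it is dropped
}


def aggregate_to_tracts(votes, assignment):
    """Apportion precinct votes to tracts by estimated share; list dropped precincts."""
    tract_totals = defaultdict(lambda: defaultdict(float))
    dropped = []
    for precinct, tally in votes.items():
        shares = assignment.get(precinct)
        if not shares:
            dropped.append(precinct)
            continue
        for tract, share in shares.items():
            for candidate, n in tally.items():
                tract_totals[tract][candidate] += n * share
    return {t: dict(c) for t, c in tract_totals.items()}, dropped


tract_totals, dropped = aggregate_to_tracts(precinct_votes, precinct_to_tracts)
```

Because dropped precincts contribute nothing to any tract, the tract totals in a sketch like this undercount the actual vote, which is the limitation discussed in the text.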

Here are the numbers of tracts and precincts used in the databases:

                             Los Angeles    San Francisco
Total Census Tracts
Census Tracts in data set
Total Precincts
Precincts in data set


The decision to drop some of the precincts means that not all votes were counted in this procedure, and that in turn limits how we can interpret the data. For one thing, we do not know the number of nonvoters and thus cannot assess their characteristics. For that reason, I refrained from using regression coefficients to estimate the actual percentage of the vote associated with any of the social categories. This might have been possible if we had full vote counts and accurate population totals at the time of the vote.

A different dataset was created to analyze the vote outside the three largest counties. The units of analysis are municipalities and nonurban areas. Vote tallies and demographic information were derived from the sources listed above for 106 municipalities with populations of more than 2,500. By subtracting these urban totals from county-level data, nonurban vote and demographic numbers were obtained for 53 counties. Because of data problems, Sacramento and San Diego county figures were not available. The resulting dataset has 159 cases.
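The subtraction that yields the nonurban cases can be sketched as follows. This is only an illustration under stated assumptions: the field names and all figures are hypothetical, and `nonurban_residual` is a name invented for this example. The 159 cases are the 106 municipalities plus one nonurban residual for each of the 53 counties.

```python
def nonurban_residual(county_total, municipal_totals):
    """Subtract the municipal figures from the county figure, field by field."""
    residual = dict(county_total)
    for muni in municipal_totals:
        for field, value in muni.items():
            residual[field] = residual.get(field, 0) - value
    # A negative residual signals inconsistent source figures.
    if any(v < 0 for v in residual.values()):
        raise ValueError("municipal totals exceed the county total")
    return residual


# Hypothetical county with two municipalities above the 2,500 population threshold.
county = {"sinclair": 5000, "merriam": 6000, "adults": 20000}
municipalities = [
    {"sinclair": 2000, "merriam": 2500, "adults": 8000},
    {"sinclair": 1000, "merriam": 1500, "adults": 5000},
]

rural = nonurban_residual(county, municipalities)
```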

Ecological inference issues

Statisticians and methodologists have been arguing about the validity of various ecological inference methods for decades. Some of the newer approaches are either very complicated or very limited. Data limitations argued against trying most of them. For example, Gary King’s ecological inference (EI) method depends upon knowing the number of nonvoters and is difficult to use with more than one independent variable. I use multiple regression, realizing that some experts may be critical. I do so with appropriate caution, refraining from deriving probability estimates for the independent variables. If we had full counts of voters and nonvoters for each census tract and if the demographic information had been from 1934, it would have been reasonable to calculate estimates of the number and percentage of Sinclair’s voters who were blue-collar males, females, African Americans, etc.

Linear multiple regression involves testing equations with different mixes of variables, looking for an equation that most efficiently accounts for variations in the dependent variable, in this case the percentage of vote for Sinclair, Merriam, or Haight. After experimenting with various combinations in the three counties, I used seven independent variables in the equation.


  • % blue collar males among all persons over 21
  • % females among all persons over 21
  • % persons 55 and older among persons over 21
  • % African Americans
  • % born in Mexico
  • % born in Italy
  • % born in Russia
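For readers who want to see the mechanics, here is a minimal sketch of a seven-variable linear regression fit by ordinary least squares. It is not the software or the data used for the article: the tract values are synthetic random numbers, the variable names are shorthand invented here, and `fit_ols` is a hypothetical helper. It returns the unstandardized coefficients, the standardized coefficients (Beta), and R squared.

```python
import numpy as np

# Shorthand names for the seven independent variables listed above.
names = ["blue_collar_male", "female", "age_55_plus", "african_american",
         "born_mexico", "born_italy", "born_russia"]

# Synthetic data only: 200 made-up tracts, NOT the article's dataset.
rng = np.random.default_rng(0)
n_tracts = 200
X = rng.uniform(0, 50, size=(n_tracts, len(names)))
assumed_b = np.array([0.6, -0.2, -0.1, 0.1, 0.05, 0.0, 0.1])  # arbitrary
y = 30 + X @ assumed_b + rng.normal(0, 3, n_tracts)  # made-up percent of vote


def fit_ols(X, y):
    """Ordinary least squares with an intercept.

    Returns (intercept, unstandardized b, standardized beta, R squared).
    """
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    fitted = A @ coef
    ss_res = np.sum((y - fitted) ** 2)    # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)  # total variation
    r2 = 1 - ss_res / ss_tot
    b = coef[1:]
    # Standardized coefficient: b rescaled by the ratio of standard deviations.
    beta = b * X.std(axis=0, ddof=1) / y.std(ddof=1)
    return coef[0], b, beta, r2


intercept, b, beta, r2 = fit_ols(X, y)
for name, b_j, beta_j in zip(names, b, beta):
    print(f"{name:18s} b={b_j:7.3f}  beta={beta_j:7.3f}")
print(f"R squared = {r2:.2f}")
```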

Blue-collar includes all jobs listed in the census occupational categories as craft, skilled, semi-skilled, labor, farm labor, and also service. White collar includes owners, managers, professional, technical, sales, and clerical positions.

I experimented with dividing adult females into those in the labor force and those not in the labor force. In one county there were small differences associated with that distinction, but it wasn’t consistent enough to warrant adding an eighth variable to the equation.

The ethnic variables have to be used cautiously. The census recorded the number of African Americans in each census tract but not the number of Black adults and thus eligible voters. The use of the three foreign-born variables rests on the challengeable assumption that the residential distribution of voters from those ethnic populations followed the pattern of first-generation immigrants, whose numbers were recorded by census tract. The decision to use Russian birth as a proxy for Jews also introduces potential error. Finally, I should note that the San Francisco ethnicity data were incompletely recorded. For the other two counties, we recorded the numbers of foreign-born Mexicans, Italians, and Russians for each census tract. For San Francisco we recorded those numbers only if the particular ethnic group comprised at least 2 percent of the tract population. This affects the statistical variation, and thus the coefficients for ethnicity in that county should be viewed cautiously. As it happens, there were very few Mexicans living in San Francisco, and the regression equations showed little significant effect for the other two groups.

How to read the regression tables

Start at the bottom with R squared (R²). This statistic evaluates the goodness of fit of the entire equation, showing the fraction of the total variation in percent vote for the candidate explained by the seven independent variables working together. In table 2 (Sinclair's % of vote), R squared in the Los Angeles column is .68, which should be interpreted as showing that the equation is very effective. The .87 and .82 statistics for the other counties give us even more confidence in those equations. Now look at Haight’s % of vote. The R squared values are much lower, cautioning that the equations are less effective.
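For reference, the standard definition of this statistic is given below. The symbols are generic notation rather than anything taken from the article's tables: y_i is the candidate's percent of the vote in tract i, ŷ_i is the value the equation predicts for that tract, ȳ is the mean across all n tracts.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

Read this way, a value of .68 says that the equation accounts for 68 percent of the tract-to-tract variation in the candidate's vote share, leaving 32 percent unexplained.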

The Standardized Coefficients (Beta) tell us about the effects of each of the independent variables when the other six are held constant. Positive signs mean that an increase in that variable will be associated with an increase in the percent of the vote; negative signs mean that the effect will be reversed. Thus in Sinclair’s LA vote, the negative coefficient for % of female adults (-.18) means that, holding everything else constant, more women voted against him than for him. But -.18 is not as strong as the .67 coefficient for % blue-collar males. If we had complete data that accounted for nonvoters, we could create probability estimates from these statistics, reading the .67 as meaning that for every 1% increase in the number of blue-collar males, Sinclair's vote is likely to increase by 0.67%. For every 1% increase in the number of adult females, Sinclair's vote was likely to decrease by 0.18%, holding all other variables constant.
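A note on units, offered as a general statistical identity and not as a description of how the article's tables were computed: a standardized coefficient is measured in standard deviations, so a percentage-point reading strictly belongs to the corresponding unstandardized coefficient. In the generic notation below, b_j is the unstandardized coefficient for variable j, s_{x_j} is the standard deviation of that variable across tracts, and s_y is the standard deviation of the candidate's percent of the vote.

```latex
\beta_j = b_j \, \frac{s_{x_j}}{s_y}
\qquad\Longleftrightarrow\qquad
b_j = \beta_j \, \frac{s_y}{s_{x_j}}
```

The two readings coincide only when s_{x_j} and s_y happen to be equal. What standardization does make possible is the comparison of strength across variables, such as .67 against -.18, because every coefficient is put on the same scale.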

The Standard Errors (SE) and asterisks show whether the coefficients are statistically significant. A single asterisk means that the coefficient is significant at the .05 level. Still better is the .01 level, indicated by a double asterisk. Any value that does not have an asterisk should be regarded as unreliable.
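The link between a coefficient, its standard error, and the asterisks can be sketched as follows. This is a simplified illustration, not the test behind the article's tables: it assumes a two-tailed test and uses the large-sample normal cutoffs (1.960 and 2.576), whereas statistical packages use the exact t distribution. The function name and the example numbers are hypothetical.

```python
def significance_stars(coef, se):
    """Return '**', '*', or '' from the ratio of a coefficient to its standard error."""
    t = abs(coef / se)
    if t >= 2.576:   # two-tailed .01 level, large-sample approximation
        return "**"
    if t >= 1.960:   # two-tailed .05 level, large-sample approximation
        return "*"
    return ""


# Hypothetical coefficient and standard error pairs.
examples = [(0.50, 0.10), (0.20, 0.09), (0.10, 0.08)]
stars = [significance_stars(c, s) for c, s in examples]
```

The smaller the standard error relative to the coefficient, the less likely it is that the estimated effect arose by chance.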

For more on reading regression tables:

Leadership sample methods and data


  • Upton Sinclair’s End Poverty Paper, Dec/January 1933/34 through August, 1934
  • City Directories for many cities
  • Great Registers (voting registration) for Alameda, Kern, Los Angeles, Marin, Monterey, Orange, San Francisco, and San Luis Obispo counties

Data issues:

Two leadership samples were created, one consisting of EPIC club chairs and assembly district secretaries, the other of EPIC legislative candidates. The first includes 167 men and women who were identified as club chairs in the first two issues of Upton Sinclair’s End Poverty Paper (Dec-January 1933/34, February 1934) and 72 others who were listed by the time of the August primary as assembly district leaders or as leaders in charge of one of the eighteen campaign headquarters around the state. The second consists of 53 candidates for the state legislature who were identified in the same source.

We sought information about occupation and previous voting registrations in city directories and in the Great Registers of voters maintained by many California counties, finding occupational information for 159 (two-thirds) of the leadership sample and 44 out of 53 candidates. We found 1932 voter registration data for 113 of the leaders and 23 of the candidates.