An unbiased analysis of census data reveals not one but many maps of the United States. plot of chunk unnamed-chunk-2

The original inspiration for this post comes from a New York Times article. By combining 6 socio-economic observables at a the county level, the author puts together a map that in his view describes where life is hard in America. My goal here is not the question that map or the article’s conclusions, but to revisit its methods. The choice of the 6 variables – “education (percentage of residents with at least a bachelor’s degree), median household income, unemployment rate, disability rate, life expectancy and obesity” – may seem pretty uncontroversial at first. But right in the comments you can find reasonable complaints – and a few creative ones. Why obesity and not some other disease? And how can income be compared when cost of living is so different among different counties? Why not include natural beauty in the statistics? Then there is the issue of combining the 6 variables. One can’t average life expectancy and income, or weight and education. So the author decides to take the ranks of each variable and average those. That may sound neutral but it isn’t. It’s like saying that the difference in income between the richest and poorest county, about $85k, is equivalent to the difference in life expectancy, 11 years. If it were my life, I’d put a higher price on it. Only moderately reassuring is the claim that the ranks are highly correlated and slight variations in the method, such as excluding a variable, are of little consequence. The other problem I have with this method is that it aims straight for the conclusion, but one suspects there is more in the data. It’s easy to find counties that rank pretty high overall but where the median income is not that high or other discrepancies. Even if the main story is that those 6 variables are correlated, one would like to hear about the side plots.

The starting point of my investigation is the census data. I used unzip, xls2csv and split to get csv files – the latter step is because xls2csv concatenates multiple sheets during the conversion, but luckily they are all the same length. Once that’s done, we start creating a data frame

files = list.files("splits/")
read.one =
  function(fname)
    read.csv(file.path("splits", fname), nrows = 3198, fill = FALSE)
clean.one =
  function (df)
    df[,sapply(df, function(x) length(unique(x))) >50]
all = lapply(files, function(x) clean.one(read.one(x)))
census = do.call(cbind, all)

The cleaning starts removing columns that don’t really represent observables but metadata, like notes about how the data was collected. The heuristic of containing at least 50 distinct values to be an interesting variable was tested on a sampling of columns and seems to be good enough. The next steps are filtering steps to remove repeated columns, one odd non numeric column and rows representing states instead of counties.

census = census[, unique(names(census))]
census = census[, -1993]
census = census [grep(census$Areaname, pattern = ","), ]

We are also later going to need separate columns with county name and state, so let’s add them right away.

census$state = unlist(strsplit(as.character(census$Areaname), split = ", "))[c(F,T)]
census$county = tolower(unlist(strsplit(as.character(census$Areaname), split = ", "))[c(T,F)])

The last bit of data is the metadata file with explanations for each variable.

mastdata = read.csv("csv/Mastdata.csv")
row.names(mastdata) = mastdata$Item_Id

What we have in census are about 6000 variables for more than 3000 counties. Not 6 carefully hand picked ones. It’s a lot of variables to get to understand one by one, but they come in related groups: population, economy, society etc. There is also data for several decades, at least for some variables. The units are also variable: counts, dollars, years and they are summarized in various ways as absolute, average, median, rank, per unit and more. So you can have variables such as “Private nonfarm annual payroll”, measured in thousands of dollars or “Nonfederal physicians - inactive (Dec. 31) 2006” as an absolute count. My plan was to use pretty general methods without relying on domain knowledge at all. But my first pass analysis hit a roadblock, which is that counties have the most variable population size and that is a confounding factor for anything that happens to individuals. A large total income doesn’t make people rich in a county that includes a large town. The methods I describe in the following technically work even without the normalization step I will now describe, but the results were difficult to interpret. The normalization step was simply to divide any absolute count by the average of county population in two different years (because counties can merge or disappear). To find the absolute counts, since the metadata is ambiguous, I used a search pattern on the variable descriptions. It works for most variables, and for the rest we need to rely on the robustness of the methods we will employ.

already.normalized = 
  grep(
    "mean |median |rate |average | percent |per ", 
    mastdata$Item_Description, 
    ignore.case = TRUE)
unnormalized = 
  grep("used for", mastdata$Item_Description, ignore.case = TRUE)
normalize.mask = 
  c(1,2, 6429, 6430, 
    match(mastdata$Item_Id[setdiff(already.normalized, unnormalized)], names(census)))
normalize.mask = normalize.mask[!is.na(normalize.mask)]

censusn = census
censusn [, -normalize.mask] =
  census [, -normalize.mask]/(census$POP010210D + census$POP010130D + 1)

At this point the plan called for a principal component analysis of the data, but optimism is not enough to make it work. The sequence of sorted eigenvalues was decreasing way too slowly, which means the data doesn’t fit the linear assumptions behind PCA very well. You may say that was to be expected even from the cursory description of the variables above, but there’s nothing like trying and looking at the diagnostics.

What is a method to summarize this data that is the most assumption free? Well, I don’t know about “the most”, but surely random forests can handle variables measured in different units and related in non-linear ways. Unfortunately, I know applications of random forests to classification and regression, but there’s no response variable here. After a little research I found out that it is possible to use random forests in unsupervised mode. It’s based on creating a synthetic control data set and then training the random forest to classify real vs. synthetic data. The result is a proximity matrix among data points “based on the frequency that pairs of data points are in the same terminal nodes”. I am quoting from the documentation of package randomForest because in R, luckily, there’s a package for that.

censusn.urf <- randomForest(censusn[, 3:6428])

This takes several hours. The only columns excluded are county name, state and census code.

The proximity matrix can be embedded exactly in high dimension and then a number of techniques can be used to project it to a lower dimension while trying to preserve distances. randomForest comes with the function cmdscale to do just that, so why not use it?

mdsn = cmdscale(1 - censusn.urf$proximity, k = 10, eig = TRUE)

This function does compute a PCA and returns, among other things, the sorted eigenvalues, and we can see how they decrease rapidly. That’s a good sign.

qplot(1:length(mdsn$eig), mdsn$eig)

plot of chunk unnamed-chunk-10

We can also plot all the projected points in any two of the dimensions. It is easy to observe that there are dependencies between dimensions (while they should be uncorrelated, according to PCA theory).

pairs(mdsn$points, pch = ".") 

plot of chunk unnamed-chunk-11

To eliminate dependencies there is a technique, called Independent Component Analysis, and a package for that, fastICA. The result of the cmdscale function can be used as a starting point for this additional step.

ica = fastICA(mdsn$points, n.comp = 10)
pairs(ica$S, pch = ".") 

plot of chunk unnamed-chunk-12

That looks good enough for me. Let’s get to mapping! I am going to use the package ggplot2 and its very useful geom_map. A map with shaded areas is called a choropleth. The first thing is to link the ICA results with the mapping data, from package maps.

icadf = 
  data.frame(
    cbind(
      data.frame(
        county  = census$county, 
        state = census$state, 
        areaname = census$Areaname), 
      ica$S[,order(apply(ica$S, 2, sd), decreasing = TRUE)]))
choropleth = 
  unique(merge(county_df, icadf, by = c("state", "county")))
choropleth = 
  choropleth[order(choropleth$order), ]
choropleth = 
  fortify(choropleth)

I am not sure why the unique and fortify calls are necessary, but I adapted available examples. The process is imperfect, due to variation in the naming and spelling of counties between the two data sources. It’s possible to do better, but I didn’t want to go on a long diversion only for a handful of counties. Pull requests are welcome. And here’s our first map, based on the first independent component:

ggplot(choropleth, aes(map_id  = id )) + 
  scale_fill_gradient2(low = "dark blue",  mid = "grey", high = "dark red") + 
  geom_map(aes(fill = X1), map = choropleth) + 
  expand_limits(x = choropleth$long, y = choropleth$lat) +
  xlab("") + ylab("")

plot of chunk unnamed-chunk-14

If the experts see a way to simplify this somewhat involved code, please help. In particular, it doesn’t seem to do much without the expand_limits, so what is the point of geom_map alone? This map is obviosuly delineated geographically, but what does it mean? We need some assistance in interpreting these results. One approach is finding original variables that are correlated or anti-correlated with the summarized variables. Because of previous observations on linearity of the data or lack thereof, we shall use Spearman correlation. The other is to find “champions”, input data points that occupy the extremes of the scale defined by each summarized variable. I found the first enlightening, the second more entertaining or distracting (Wikipedia has an article about each county). I know there are other approaches to interpreting this kind of summaries, please chip in in the comments to suggest additional ones.

First let’s prepare some data to assist the interpretations:

cormat = cor(icadf[, 4:13], censusn[,3:6428], method = "spearman")
champions = apply(icadf[,4:13], 2, function(x) as.character(icadf$areaname[c(which.min(x), which.max(x))]))

Unfortunately going through long lists of highly correlated variables is not a particularly fun task. I will try to summarize a possible interpretation, which is very non rigorous, but you can find full tables of 100 most and least correlated variables in the appendix, linked from each figure. Package xtable will help with formatting a long-ish list of variables. In particular, if you notice that certain variables repeat from one component to another, it may very well be because of temporal changes that I glossed over. For instance, two components with a strong white presence may be independent because one is correlating better with white presence in recent years, the other in past years.

This is the function that generates maps, champions and links

eigenmap =
  function(i) {
    cat(
      paste0(
        "<br>Champions: ", 
        paste0(
          "<a href=\"https://en.wikipedia.org/wiki/",
          sub(",", "_County,", 
              sub(" ", "_", as.character(champions[2:1,i]))), 
          "\">",
          as.character(champions[2:1,i]),
          "</a>",
          collapse = " and "), 
        "(red, blue)<br>"))
    cat(paste0("[Descriptive variables](#dv", i, ")<br>"))
    dim = names(choropleth)[7 + i]
    print(
      ggplot(choropleth, aes(map_id  = id )) + 
        scale_fill_gradient2(low = "dark blue", mid = "grey", high ="dark red") + 
        geom_map(aes_string(fill = dim), map = choropleth) + 
        expand_limits(x = choropleth$long, y = choropleth$lat) +
        xlab("") + ylab(""))}

This is the one that creates tables of important variables.

important.variables =
  function(i) {
    cat(paste0("<a name=\"dv", i, "\"><h3>Component ", i, "</h3></a>"))
    lapply(list(TRUE, FALSE),
           function(x){
             cat(if(x) "<br><h3>Correlated: red</h3>" else "<h3>Anticorrelated: blue</h3>")
             print(
               xtable(
                 data.frame(
                   `Variable Description` =
                     sort(
                       mastdata$Item_Description[
                         which (
                           is.element(
                             mastdata$Item_Id, names(head(sort(cormat[i,], decreasing = x), 100))))]))),
               type = "html")})}

And now, without further ado, the maps.

The Great Plains

eigenmap(1) 


Champions: Hettinger, ND and Duplin, NC(red, blue)
Descriptive variables
plot of chunk unnamed-chunk-18

This map is very well delineated with few dark blue regions. The red areas are not lighter and spread out, so think of them as opposed to the Great Plains. They have more unemployment, people with disabilities, lower educational achievement, supplemental security income, carpooling, poverty and school enrollment. The blue areas have bigger farms, more employment, educational achievement, government spending not directed at individuals, government insurance, property taxes and personal income, an older and Republican leaning population.

Minnesota vs Michigan

eigenmap(2) 


Champions: Jackson, MO and Benzie, MI(red, blue)
Descriptive variables
plot of chunk unnamed-chunk-19

The red areas have stronger manifacturing, with higher median income. On the other hand, the blue areas have less health insurance, more poverty and vacant housing.

Appalachia and Ohio Valley vs Colorado and Texas.

eigenmap(3) 


Champions: Owen, IN and Montgomery, NY(red, blue)
Descriptive variables
plot of chunk unnamed-chunk-20

The red areas have an older, whiter population using Medicare and Social Security. The blue areas have large farms with a strong Latino presence and more renters.

The Mississippi

eigenmap(4) 


Champions: Erie, PA and Holmes, MS(red, blue)
Descriptive variables
plot of chunk unnamed-chunk-21

The red areas have an older population using Medicare, retirement benefits and with more veterans. The blue areas have higher unemployment, poverty, federal grants, children, declining population and lean Democratic.

Texas, Oklahoma and Kansas

eigenmap(5) 


Champions: Lincoln, OK and Grand Traverse, MI(red, blue)
Descriptive variables
plot of chunk unnamed-chunk-22

Here the red is mostly concentrated in those three states, with a sprinkling over Appalachia. The related variables tells us that red means older farmers, longer commutes, recent domestic immigration, a higher death rate until 1985, smaller farms, lots of construction, home ownership and whiter and older people. Blue areas instead have more population but only until the early 80s, a higher birth rate until the early 90s, a bigger labor force, local government and personal income, and more kids enrolled in school, also limited to years before 1990.

Growing suburbia

eigenmap(6) 


Champions: Newton, GA and Saginaw, MI(red, blue)
Descriptive variables
plot of chunk unnamed-chunk-23 This map shows a very interesting ring pattern around some cities, including Atlanta, Dallas an Minneapolis. The red areas show strong population increase, including migration, and increase in available housing and high median income. The blue areas have a higher death rate, Federal Government payments to individuals, more widows, single person households and older people receiving social security.

South and Midwest vs New England and West

eigenmap(7) 


Champions: Box Elder, UT and Covington, AL(red, blue)
Descriptive variables
plot of chunk unnamed-chunk-24

There’s a variety of minorities on the red side, foreign speakers and naturalized citizens, hotels and restaurants. The blue side has more black people, a higher mortality rate, lower educational achievement, employment in farms and manufacturing and fewer vehicles available. Fun fact: more people walk to work on the red side.

Pennsylvania vs California and Texas

eigenmap(8) 


Champions: Lubbock, TX and Lycoming, PA(red, blue)
Descriptive variables