An unbiased analysis of census data reveals not one but many maps of the United States.

The original inspiration for this post comes from a New York Times article. By combining 6 socio-economic observables at the county level, the author puts together a map that in his view describes where life is hard in America. My goal here is not to question that map or the article’s conclusions, but to revisit its methods. The choice of the 6 variables – “education (percentage of residents with at least a bachelor’s degree), median household income, unemployment rate, disability rate, life expectancy and obesity” – may seem pretty uncontroversial at first. But right in the comments you can find reasonable complaints – and a few creative ones. Why obesity and not some other disease? How can income be compared when the cost of living differs so much between counties? Why not include natural beauty in the statistics?

Then there is the issue of combining the 6 variables. One can’t average life expectancy and income, or weight and education. So the author decides to take the ranks of each variable and average those. That may sound neutral, but it isn’t. It’s like saying that the difference in income between the richest and poorest county, about $85k, is equivalent to the difference in life expectancy, 11 years. If it were my life, I’d put a higher price on it. Only moderately reassuring is the claim that the ranks are highly correlated and that slight variations in the method, such as excluding a variable, are of little consequence.

The other problem I have with this method is that it aims straight for the conclusion, while one suspects there is more in the data. It’s easy to find counties that rank pretty high overall but where the median income is not that high, and other such discrepancies. Even if the main story is that those 6 variables are correlated, one would like to hear about the side plots.
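To make the rank-averaging point concrete, here is a minimal sketch of the method on invented numbers – three hypothetical counties and two of the six variables; none of this is the article’s actual data:

```r
# Three imaginary counties described by two variables in
# incompatible units (dollars and years). All values made up.
counties = data.frame(
  income          = c(85000, 30000, 52000),
  life.expectancy = c(70, 81, 78),
  row.names       = c("A", "B", "C"))
# Rank each variable separately, then average the ranks per county:
# a one-position gap in income is implicitly treated as worth
# exactly a one-position gap in life expectancy.
score = rowMeans(sapply(counties, rank))
score  # all three counties tie, despite very different profiles
```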

The starting point of my investigation is the census data. I used unzip, xls2csv and split to get csv files – the last step is needed because xls2csv concatenates multiple sheets during the conversion, but luckily they all have the same length. Once that’s done, we can start building a data frame
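The splitting part of that pipeline can be sketched as follows; the archive and file names here are assumptions, and the demonstration uses generated data with sheets of 4 lines instead of the real ones:

```shell
# Hypothetical conversion steps (names are assumptions):
#   unzip DataSet.zip -d xls/
#   xls2csv xls/CenDataSet.xls > csv/CenDataSet.csv
# xls2csv concatenates all sheets into one csv, but every sheet has
# the same length, so a fixed-length split recovers them.
mkdir -p csv splits
seq 12 > csv/demo.csv                 # stand-in for a 3-sheet workbook
split -l 4 csv/demo.csv splits/demo-  # one file per "sheet"
ls splits/                            # three files: demo-aa, demo-ab, demo-ac
```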

files = list.files("splits/")
read.one =
  function(fname)  # each split file holds one sheet: 3198 county rows
    read.csv(file.path("splits", fname), nrows = 3198, fill = FALSE)
clean.one =
  function(df)  # keep only columns with more than 50 distinct values
    df[, sapply(df, function(x) length(unique(x))) > 50]
all = lapply(files, function(x) clean.one(read.one(x)))
census = do.call(cbind, all)

The cleaning starts by removing columns that represent not observables but metadata, like notes about how the data was collected. The heuristic that an interesting variable should contain more than 50 distinct values was tested on a sample of columns and seems to be good enough. The next steps are filtering steps to remove repeated columns, one odd non-numeric column and rows representing states instead of counties.

census = census[, unique(names(census))]  # drop repeated columns
census = census[, -1993]                  # the odd non-numeric column
census = census[grep(census$Areaname, pattern = ","), ]  # keep "county, state" rows

We are also later going to need separate columns with county name and state, so let’s add them right away.

# Areaname looks like "County Name, ST": split on ", " and take
# alternating elements (the recycled logical index picks every other one)
census$state = unlist(strsplit(as.character(census$Areaname), split = ", "))[c(FALSE, TRUE)]
census$county = tolower(unlist(strsplit(as.character(census$Areaname), split = ", "))[c(TRUE, FALSE)])

The last bit of data is the metadata file with explanations for each variable.

mastdata = read.csv("csv/Mastdata.csv")
row.names(mastdata) = mastdata$Item_Id

What we have in census are about 6000 variables for more than 3000 counties – not 6 carefully hand-picked ones. That’s a lot of variables to understand one by one, but they come in related groups: population, economy, society etc. There is also data for several decades, at least for some variables. The units vary too – counts, dollars, years – and they are summarized in various ways: as absolute, average, median, rank, per unit and more. So you can have variables such as “Private nonfarm annual payroll”, measured in thousands of dollars, or “Nonfederal physicians - inactive (Dec. 31) 2006” as an absolute count.

My plan was to use pretty general methods without relying on domain knowledge at all. But my first-pass analysis hit a roadblock: county population sizes vary enormously, and that is a confounding factor for anything that happens to individuals. A large total income doesn’t make people rich in a county that includes a large town. The methods I describe in the following technically work even without the normalization step I will now describe, but the results were difficult to interpret.

The normalization step was simply to divide any absolute count by the average of the county population in two different years (because counties can merge or disappear). To find the absolute counts, since the metadata is ambiguous, I used a search pattern on the variable descriptions. It works for most variables, and for the rest we need to rely on the robustness of the methods we will employ.

already.normalized = 
  grep(
    "mean |median |rate |average | percent |per ", 
    mastdata$Item_Description, 
    ignore.case = TRUE)
unnormalized = 
  grep("used for", mastdata$Item_Description, ignore.case = TRUE)
normalize.mask = 
  c(1, 2, 6429, 6430,  # identifier columns plus the added state and county columns
    match(mastdata$Item_Id[setdiff(already.normalized, unnormalized)], names(census)))
normalize.mask = normalize.mask[!is.na(normalize.mask)]

censusn = census
# Divide raw counts by the sum of two census-year populations
# (proportional to their average); the +1 avoids division by zero.
censusn[, -normalize.mask] =
  census[, -normalize.mask] / (census$POP010210D + census$POP010130D + 1)

At this point the plan called for a principal component analysis of the data, but optimism is not enough to make it work. The sequence of sorted eigenvalues was decreasing way too slowly, which means the data doesn’t fit the linear assumptions behind PCA very well. You may say that was to be expected even from the cursory description of the variables above, but there’s nothing like trying and looking at the diagnostics.
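The post doesn’t show the PCA code, so the following is a sketch of the diagnostic using prcomp, with uniform noise as an extreme stand-in for data with no dominant linear directions (not the census data):

```r
# With no linear structure, the sorted eigenvalues (component
# variances) decay slowly: no small subset of components
# captures most of the variance.
set.seed(42)
x = matrix(runif(2000), ncol = 20)
pc = prcomp(x, scale. = TRUE)
eig = pc$sdev^2                   # sorted eigenvalues of the correlation matrix
round(cumsum(eig) / sum(eig), 2)  # cumulative variance explained creeps up slowly
```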

Which method for summarizing this data makes the fewest assumptions? Well, I don’t know about “the fewest”, but surely random forests can handle variables measured in different units and related in non-linear ways. Unfortunately, the applications of random forests I knew about were classification and regression, and there’s no response variable here. After a little research I found out that it is possible to use random forests in unsupervised mode. The idea is to create a synthetic control data set and then train the random forest to classify real vs. synthetic data. The result is a proximity matrix among data points “based on the frequency that pairs of data points are in the same terminal nodes”. I am quoting from the documentation of package randomForest because in R, luckily, there’s a package for that.
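Before committing to a multi-hour run, the unsupervised mode can be tried at small scale – a sketch on the built-in iris measurements (proximity = TRUE is passed explicitly here just to be safe):

```r
library(randomForest)  # assumes the package is installed
set.seed(1)
# With no response given, randomForest runs unsupervised: it samples
# each column independently to build a synthetic data set, trains a
# classifier to separate real from synthetic, and counts how often
# pairs of real rows end up in the same terminal nodes.
iris.urf = randomForest(iris[, 1:4], proximity = TRUE)
dim(iris.urf$proximity)    # one entry per pair of rows: 150 x 150
range(iris.urf$proximity)  # proximities lie between 0 and 1
```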

censusn.urf <- randomForest(censusn[, 3:6428])

This takes several hours. The only columns excluded are county name, state and census code.

The proximity matrix can be embedded exactly in high dimension, and then a number of techniques can be used to project it to a lower dimension while trying to preserve distances. R comes with the function cmdscale (classical multidimensional scaling, in the stats package) to do just that, so why not use it?

mdsn = cmdscale(1 - censusn.urf$proximity, k = 10, eig = TRUE)  # dissimilarity = 1 - proximity

Internally, this function performs an eigendecomposition – classical scaling is equivalent to PCA when the distances are Euclidean – and returns, among other things, the sorted eigenvalues, and we can see that they decrease rapidly. That’s a good sign.

qplot(1:length(mdsn$eig), mdsn$eig)  # qplot from ggplot2

[Figure: the sorted eigenvalues, dropping off rapidly]

We can also plot the projected points against any two of the dimensions. It is easy to spot dependencies between dimensions, even though, according to PCA theory, they should be uncorrelated.

pairs(mdsn$points, pch = ".")