How plyrmr was ahead of the curve
I recently attended a talk by the always excellent Hadley Wickham about his latest work on creating and visualizing many models.
Here I combine two snippets from that tutorial, for your convenience and for further discussion:
gapminder %>%
  group_by(continent, country) %>%
  nest() %>%
  mutate(model = purrr::map(data, ~ lm(lifeExp ~ year, data = .)))
This code groups the data by the selected columns and then fits a linear model for each group using the specified variables. Very elegant indeed. You may notice I did not mention what nest does. It changes the layout of the data, but it has a single argument and it can be inverted with unnest. To speak somewhat figuratively, it doesn’t add or remove anything; it is like a format change. When I saw this example, it jogged my memory: my old project plyrmr allowed you to do pretty much the same, without any nest call. Let’s grab a similar snippet from plyrmr:
input("/tmp/mtcars") %|%
  group(carb) %|%
  transmute(model = list(lm(mpg ~ cyl + disp)))
Forget that this works on distributed data sets, and forget the other differences. At an abstract level, it takes a structured data set, groups it by some variables and then fits a model for each group. But it doesn’t require nesting or unnesting, and it doesn’t require the purrr::map call inside the mutate of the first snippet. The idea is: when a data set is grouped, each group should work like a separate little data set, which is roughly what nest helps with. In dplyr, grouped data sets are kind of grouped, but also still kind of flat; they don’t go all the way. If you run a mutate on them, the grouping is not very important, since the result still has one row per input row; if you run a summarize, it is, since the result has one row per group. In plyrmr, grouping is equivalent to grouping and nesting at the same time. The expressions provided to transmute as ... arguments are evaluated in a context where one group of data at a time is attached or otherwise available for evaluation. Hence the result is a data set with a list of models as a column.
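To make that evaluation rule concrete, here is a minimal base R sketch of the idea. It is not plyrmr's actual implementation, and it ignores distribution entirely. The helper name per_group and its arguments are hypothetical, made up for this illustration, and the example uses the built-in mtcars data grouped by cyl rather than the data from either snippet above.

```r
# Hypothetical helper for illustration only; not part of plyrmr or dplyr.
# It splits `data` by the column named in `by`, then evaluates `expr`
# once per group, with that group's columns visible as variables.
per_group <- function(data, by, expr) {
  expr <- substitute(expr)   # capture the expression unevaluated
  caller <- parent.frame()   # where to look up names not found in the group
  groups <- split(data, data[[by]])
  lapply(groups, function(g) eval(expr, g, caller))
}

# One linear model per value of cyl, with no nest() or map() at the call site.
models <- per_group(mtcars, "cyl", lm(mpg ~ wt))

names(models)          # one entry per group
lapply(models, coef)   # coefficients of each per-group model
```

The result here is a plain named list rather than a data set with a list column, so the sketch mirrors only the "one group at a time is available for evaluation" part of the design, not the output format.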
This is not to say that you should ditch dplyr and use plyrmr: there are several other differences and, unfortunately, development of the latter appears to have ceased. But as far as API design goes, I am very proud of what we were trying to do.