Yet Another Pipe Operator in R to unify interactive and programming use
Prologue
The pipe operator, %>%, in its latest incarnation, is all the rage in R circles. I first saw it in a less-well-known package called vadr. Then one was added to dplyr, but I preferred my own implementation when working on plyrmr. Then a dedicated package emerged, called magrittr, and it became the de facto standard among pipe lovers when dplyr switched to it. The pipe operator allows one to write
f(g(g.arg1, g.arg2, ...), f.arg2, ...)
as
g(g.arg1, g.arg2, ...) %>% f(f.arg2, ...)
for any functions f and g. The advantages of this style have been discussed in depth and are not the subject of this post.
Critique of Non-Standard Evaluation
It should be clear to anyone with a moderate knowledge of R that evaluating f(f.arg2, ...) while taking its first argument from somewhere else requires some form of non-standard evaluation (NSE). Standard evaluation would complain about a missing argument or use a default if available. NSE has a long tradition in R, going back to base functions such as transform and subset. In the case of those functions, columns of the first argument, always a data frame, can be mentioned by name in other arguments as if they were additional in-scope variables:
transform(mtcars, carb/cyl)
which is arguably better than
transform(mtcars, "carb/cyl")
or
transform(mtcars, mtcars$carb/mtcars$cyl)
The much more recent dplyr has picked up this idiom, improved it and applied it consistently to an organized set of primitives for manipulating data frames. Unfortunately, when one starts programming with these functions, some drawbacks emerge. The first and most obvious one is that parametrizing arguments is difficult. Imagine we are writing a function that does something on a column, any column, of a data frame: function(df, col). In the body of that function, we need to use transform to create a new column that depends on the column identified by col. You may think right off the bat of something like transform(df, newcol = col^2), but that would just look for a column named "col", with nothing to do with the value of the variable col. There are even more subtle problems when using transform in functions nested inside other functions. The documentation for transform is pretty clear about this: “For programming it is better to use the standard subsetting arithmetic functions, and in particular the non-standard evaluation of argument transform [sic, there is no such argument] can have unanticipated consequences”.
It seems to me that one of the great strengths of R is that it works both as a UI for people doing statistics and as a programming language, and creating separate jargons for the two use cases may offer some short-term benefits, but in the long run it weakens the dual nature of R and makes the transition to programming harder. It’s coding candy: attractive, but not good for your teeth. dplyr offers some relief from this by providing NSE-free versions of the most important functions and a more general NSE implementation. Still, the duality is there and the section of the API using NSE needs to be replicated. That’s a big price to pay. Add that, perplexingly, the names of the NSE and NSE-free functions differ only by a cryptic and pretty much invisible _, and my opinion is that we can do better than that.
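To make the parametrization problem concrete, here is a minimal sketch; square_col and square_col2 are throwaway functions of mine, not from any package:
square_col = function(df, col) transform(df, newcol = col^2)
square_col(mtcars, carb)   # Error: object 'carb' not found (assuming no variable carb in scope)
square_col(mtcars, "carb") # Error: non-numeric argument to binary operator
# a standard-evaluation version sidesteps NSE with [[ ]]:
square_col2 = function(df, col) transform(df, newcol = df[[col]]^2)
head(square_col2(mtcars, "carb")$newcol) # 16 16 1 1 4 1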
magrittr::`%>%` is not immune to the same type of criticism. For instance, one can write
library(magrittr)
mtcars %>% filter(mpg > 15)
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
7 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
....
but not
myfilter = filter(mpg > 15)
Error in filter_(.data, .dots = lazyeval::lazy_dots(...)): object 'mpg' not found
# aiming for:
# mtcars %>% myfilter
which means magrittr promotes the use of expressions that are not first class in R: they cannot be assigned to a variable, cannot be passed to a function and so forth, which hampers programmability. Moreover, if we enter
4 %>% sqrt(.)
[1] 2
where . is a special variable evaluating to the left-side argument of the %>% operator, all is well. Surprisingly, though,
4 %>% sqrt(sqrt(.))
Error in sqrt(., sqrt(.)): 2 arguments passed to 'sqrt' which requires 1
fails, showing a lack of composability, an important goal in API design.
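In fairness, magrittr documents an escape hatch: wrapping the right side in braces suppresses the insertion of the left side as first argument. But that is one more special evaluation rule to keep in mind, which rather proves the point:
4 %>% {sqrt(sqrt(.))}
[1] 1.414214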
Critique of purrr reason
Given these considerations, I wasn’t too surprised when I found that a new package by dplyr’s author, purrr, tries a different approach that avoids NSE. purrr is a package for processing lists inspired by JavaScript’s underscore.js. A typical function is map, which applies a function to every element of its first argument, for example map(mtcars, class). Besides a function, map also accepts a character or a numeric, which it transforms into an accessor function. Moreover, one can pass formulas that provide a quick notation for defining functions and pretty much replace NSE. It only takes a little ~ in front of an expression to explicitly suspend the normal evaluation mechanism and trigger a context-dependent one. It’s a kind of on-demand NSE, and it expands the use of formulas outside model fitting. Formulas are perfectly set up for this, as they carry with them their intended evaluation environment, making it relatively easy to provide correct implementations that work in any context, as opposed to, say, only at top level.
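A quick sketch of these idioms, using only the behaviors just described (the toy lists are mine):
library(purrr)
map(mtcars, class)                        # a function, applied to each column
map(list(list(a = 1), list(a = 2)), "a")  # a character, turned into an accessor
map(list(1:3, 4:6), 2)                    # a numeric accessor: second element of each
map(list(1, 4, 9), ~sqrt(.))              # a formula, standing in for a function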
A New Pipe Operator
This gave me an idea: define an NSE-free pipe operator that processes its second argument like purrr::map does with its own. Thus was conceived a new package, yapo, for “Yet Another Pipe Operator”, a name chosen in homage to yacc and to acknowledge the proliferation of pipe operators. Taking dplyr and replacing NSE with the same approach would be equally interesting, but it will have to wait.
So what does this pipe operator look like? First of all, very much compatible with the one in magrittr, which is the same as the one in dplyr.
mtcars %>% filter(mpg > 15)
becomes
suppressMessages(library(yapo))
mtcars %>% ~filter(mpg > 15)
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
7 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
....
The difference is just one additional ~. This is a small price to pay for seamless parametrizability. Imagine you need to use that filter several times in a program, or pass it as an argument. You can just use a variable:
myfilter = ~filter(mpg > 15)
mtcars %>% myfilter
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
7 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
....
It just works as expected. Please try that with magrittr and let me know. The best I could come up with was
myfilter = function(x) filter(x, mpg > 15)
which is OK, but different, and that’s the whole point: getting almost the same conciseness as with NSE while developing a jargon, or DSL, that works for interactive R as well as for programming in R. Another difference from magrittr is that yapo is meant to be simple in definition and implementation. Hence
4 %>% ~sqrt(sqrt(..))
[1] 1.414214
just works, no excuses. Please note the use of .. instead of . to avoid confusion with the . used in models.
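And, unlike magrittr’s right-hand sides, these formulas can be handed to other functions; apply_stage below is a toy function of mine, not part of yapo:
apply_stage = function(df, stage) df %>% stage
apply_stage(mtcars, myfilter)  # should match mtcars %>% myfilter above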
These are use cases suggested by dplyr, but there are others that come from purrr and are here unified in a single operator. What purrr can do on a list of elements, %>% does on a single element. For instance, purrr::map(a.list, a.string) accesses, in each element of the list a.list, the element named after the value of a.string; it is equivalent to
purrr::map(a.list, function(x) x[[a.string]])
It may be a small difference, but type the long version often enough and you will be grateful for the shorthand. In analogy with purrr, we can use integer and character vectors on the right side of %>%, implicitly creating an accessor function that is then applied to the left side, as in
mtcars %>% "carb"
[1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2
which is the same as mtcars[["carb"]].
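Integer accessors work the same way. Since carb is the eleventh column of mtcars, the following should be equivalent (my example, not from the package documentation):
mtcars %>% 11  # same as mtcars[[11]], that is, the carb column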
You may be protesting that that’s a very small difference, but bear with me a little longer. %>% unifies vector, list, data frame, matrix, S3 and S4 object access. Yes, no more errors from using [[]] on S4 objects, enough of that (a sketch at the end of this section). It also works on 2D data structures such as data frames and matrices, with the help of a couple of functions (credit to @ctbrown for this idea). The default is column access. If, instead, row access is desired, one only needs to use the function Row, as in
mtcars %>% Row(3)
$mpg
[1] 22.8
$cyl
[1] 4
$disp
[1] 108
....
One can also access multiple columns with the Range function, as in
mtcars %>% Range(c("carb", "cyl"))
carb cyl
Mazda RX4 4 6
Mazda RX4 Wag 4 6
Datsun 710 1 4
Hornet 4 Drive 1 6
Hornet Sportabout 2 8
Valiant 1 6
Duster 360 4 8
....
Range and Row can be composed to select a range of rows:
mtcars %>% Row(Range(1:4))
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
When selecting ranges, the result type is always the same as the input type, unlike with [,] and its ill-advised drop option (a sketch contrasting the two also appears at the end of this section). Of course, selecting ranges in S3 or S4 objects will fail in most cases, because it doesn’t make sense. The formula notation keeps working, and you can use it to cut down on the typing quite a bit. The evaluation environment of the formula is expanded, as we have seen, with a variable .., but also with a variable for each named element of the left argument of the pipe, in analogy with dplyr. Imagine you have a list of teams of people, each with personal information including a phone number, in a three-level nested list (named at all levels):
teams =
list(
Avengers =
list(
Annie =
list(
phone = "222-222-2222"),
Paula =
list(
phone = "333-333-3333")),
EmptyTeam = list())
You can access Annie’s phone in team “Avengers” with
teams %>% ~Avengers %>% ~Annie %>% ~phone
[1] "222-222-2222"
which, using the RStudio shortcut for %>%, is pretty convenient to type, as opposed to
teams[["Avengers"]][["Annie"]][["phone"]]
[1] "222-222-2222"
(6 vs. 18 additional keystrokes, excluding names). Whether it looks better, that’s subjective.
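As promised, a sketch of the type-stability point, assuming Range accepts a single name just as it accepts a pair above:
mtcars %>% Range("carb")  # stays a one-column data frame
mtcars[, "carb"]          # drop in action: a bare numeric vector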
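And the S4 claim, as a hedged sketch with a throwaway class of mine, assuming slot access by name behaves like list access:
setClass("Person", representation(name = "character"))
p = new("Person", name = "Annie")
p %>% "name"  # where p[["name"]] errors; expected: [1] "Annie"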
The making of yapo
While yapo is a fairly simple package, there were a couple of technical hurdles in implementing it. The first is that custom operators in R, the ones that start and end with a %, have higher priority than ~. That would have forced us to protect every formula but the last one in a complex pipe with (). To avoid that, yapo reverses the priority of %>% and ~ (a sketch of the difference follows below). It’s a testament to the flexibility of the language that this is at all possible.
The other hairy problem was guessing when the first argument of a function is missing, as in filter(mpg > 15). We settled for testing for missing arguments with no defaults. For instance, the .data argument to filter has no default and is not provided in filter(mpg > 15). Hence it is necessary to add the special argument .., and the convention is to add it as the first, unnamed argument, which works well with dplyr functions and many other reasonably designed APIs. It’s a heuristic, and if it doesn’t work in some cases you just have to add .. explicitly, as in sqrt(sqrt(..)).
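The explicit .. is not limited to the first position, either. A sketch of mine, assuming the heuristic stays quiet when all required arguments are supplied:
4 %>% ~log(16, base = ..)  # log's x is already given, so .. is placed by hand; expected: [1] 2, i.e. log base 4 of 16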
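As for the precedence reversal, here is the difference it makes. Under standard parsing the first ~ would swallow the rest of the pipeline, so a multi-stage pipe would need defensive parentheses; a sketch, with head's first argument passed explicitly:
mtcars %>% (~filter(mpg > 15)) %>% (~head(..))  # what every pipe would look like without the reversal
mtcars %>% ~filter(mpg > 15) %>% ~head(..)      # with yapo's reversed precedence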
Thou shalt code
And with that, please install yapo and let me know how you like it. Installation is as simple as
devtools::install_github("piccolbo/yapo/pkg")
Remember to load it after magrittr or dplyr, so that it shadows their own pipe operators.
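For instance, a session could start like this (a sketch; substitute whichever pipe-exporting package you use):
library(dplyr)                    # exports its own %>%
suppressMessages(library(yapo))   # loaded last, so its %>% shadows dplyr's
mtcars %>% ~filter(mpg > 15)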