R anti-tips

Not all R tips are equally good. Let's set the record straight.

Anti-tip #1: For loops are slower than functions in the apply family

Why should that be the case? Let's see what the R interpreter has to say about it. Let's get some numbers to chew on first:

z = rnorm(10^6)

For loop first:

> system.time({x = 0; for(y in z) x = x + y})
     user  system elapsed
    0.521   0.004   0.526

To avoid the explicit loop, a good match here is the Reduce function. It may not be exactly in the apply family, but it was faster than several attempts I made using those functions.

> system.time({x = Reduce('+', z)})
     user  system elapsed
    0.461   0.030   0.491

Faster, but not by much. The true tip: use C. Here that means the built-in sum, which is implemented in compiled code.

> system.time({x = sum(z)})
     user  system elapsed
    0.002   0.000   0.002

Now that's 250 times faster. That's worth talking about. The reason is that the interpreter does almost nothing here; compiled code does all the work. No black magic. Let's see whether this is limited to sums or whether we can see the same effect again. First with the for loop:

> system.time({x = z;  for (i in 1:length(z)) x[i] = x[i]^2})
   user  system elapsed
  2.110   0.030   2.139

Then an *apply type function:

> system.time({x = sapply(z, function(x) x^2)})
   user  system elapsed
  2.476   0.021   2.496

A tad slower, though I am not sure the difference is significant. Now the vectorized version:

> system.time({x = z^2})
   user  system elapsed
  0.003   0.003   0.006

400X faster. Now you have my attention. To do this right, one would have to put some confidence intervals around these numbers, but from experience using R, and knowing a little about R internals and compiler and interpreter technology, I am confident the final answer will be that it doesn't really matter whether you use for or apply. As a matter of programming style, I believe apply functions to be far superior. I wrote a whole package using, I think, only two for loops, which seemed absolutely necessary. Speed is not the argument, though.
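As a rough sketch of how one might put intervals around such timings, the snippet below repeats each measurement a few times and reports the mean elapsed time with a t-based 95% interval. The helper name time_ci, the number of repetitions, and the smaller vector size are assumptions made for illustration; they are not part of the measurements above, and the numbers will differ from machine to machine.

```r
# Hypothetical helper: repeat a timing and summarize the elapsed seconds.
# `f` is a zero-argument function wrapping the code to be timed.
time_ci = function(f, reps = 10) {
  elapsed = replicate(reps, system.time(f())[["elapsed"]])
  # Half-width of a 95% interval based on the t distribution
  half = qt(0.975, df = reps - 1) * sd(elapsed) / sqrt(reps)
  c(mean = mean(elapsed), lower = mean(elapsed) - half, upper = mean(elapsed) + half)
}

z = rnorm(10^5)  # smaller than in the text so the repetitions finish quickly

square_for    = function() { x = z; for (i in seq_along(z)) x[i] = x[i]^2; x }
square_sapply = function() sapply(z, function(v) v^2)
square_vec    = function() z^2

# All three compute the same thing, so only the timings differ.
stopifnot(isTRUE(all.equal(square_for(), square_vec())),
          isTRUE(all.equal(square_sapply(), square_vec())))

print(rbind(`for`  = time_ci(square_for),
            sapply = time_ci(square_sapply),
            vector = time_ci(square_vec)))
```

If the intervals for the for and sapply rows overlap while the vector row sits far below both, that would support the claim that the choice between for and apply matters little compared with moving the work into compiled code.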

Anti-tip #2: Don't use nested loops

This is a particularly pernicious anti-tip. The previous one would have resulted in people wasting time removing loops just to find out that their program was about as slow, but likely shorter and easier to understand. In this case the anti-tip discourages a very useful R optimization technique: optimizing only the innermost loop to reap most of the speed benefits. Let's see this in two steps. First, there is absolutely nothing wrong with nested loops. They are as slow as single loops with the same number of iterations:

> system.time({x = rnorm(10^6); I = 1:10^6;
  for (i in 1:10^6) {k = sample(I, 1); x[k] = x[k]^2}})
   user  system elapsed 
  7.589   0.248   7.837 

> system.time({x = rnorm(10^6); I = 1:10^6;
  for (i in 1:1000) 
    for(j in 1:1000) {k = sample(I, 1); x[k] = x[k]^2}})
   user  system elapsed 
  7.486   0.233   7.770

With that notion put to rest, let's see the fast inner loop approach in action. This is with two loops:

> M = matrix(rnorm(10^6), ncol = 1000)
> system.time({for (i in 1:1000) for (j in 1:1000) M[i,j] = -M[i,j]})
   user  system elapsed 
  2.369   0.041   2.410

And this is with the inner loop replaced by a vectorized operation:

> system.time({for (i in 1:1000) M[i,] = -M[i,]})
   user  system elapsed 
  0.028   0.001   0.030

80X faster! You may say: yes, but you don't have nested loops any more. That is not the reason why it is faster, as the previous pair of examples showed. The reason is that the interpreter is going through only thousands of steps, while millions of steps take place in compiled code. Once you have given a 1000X "break" to the interpreter, that's enough to approach C speeds. Not completely, though:

> system.time({M = -M})
   user  system elapsed 
  0.003   0.000   0.004

If you had 10 nested loops and the innermost required a large enough amount of work, say 1000 operations as a rule of thumb, then optimizing away that innermost loop would be enough to give a considerable boost. You would still have 9 nested loops, and the program would approach C speeds. Nesting is not the problem; the problem is compiled vs. interpreted code. The important message is that, depending on the algorithm, you may have to replace only a small fraction of your code with a fast library function or, at worst, rewrite that fraction in C.
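To make the rule of thumb concrete, here is a minimal sketch with three levels of nesting over a 3-dimensional array, where only the innermost loop is replaced by a vectorized operation. The array dimensions and the function names negate_loops and negate_inner_vectorized are made up for illustration, and no timings are claimed here; run it to see how it behaves on your machine.

```r
# Hypothetical example: negate every element of a 3-d array.
A = array(rnorm(20 * 20 * 1000), dim = c(20, 20, 1000))

# Three interpreted loops: the interpreter handles every single element.
negate_loops = function(A) {
  d = dim(A)
  for (i in 1:d[1]) for (j in 1:d[2]) for (k in 1:d[3]) A[i, j, k] = -A[i, j, k]
  A
}

# Two interpreted loops remain; the innermost one runs in compiled code,
# handling all elements along the third dimension in one interpreter step.
negate_inner_vectorized = function(A) {
  d = dim(A)
  for (i in 1:d[1]) for (j in 1:d[2]) A[i, j, ] = -A[i, j, ]
  A
}

# Both versions must agree with the fully vectorized result.
stopifnot(identical(negate_loops(A), -A),
          identical(negate_inner_vectorized(A), -A))

print(system.time(negate_loops(A)))
print(system.time(negate_inner_vectorized(A)))
print(system.time(-A))
```

The second function still contains nested loops. What changes is how many steps the interpreter has to take: one per element in the first version, one per 1000 elements in the second.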