The essential R packages
Much has been said about the richness of the system of packages for R, but where is one supposed to start?
The availability of a wide variety of packages has long been highlighted as one of the strengths of the R language. But the number is overwhelming (5000 is the last I've heard, and the growth is exponential) and the quality is variable. When I talk about quality, I don't only mean that some packages are difficult to use, buggy or slow, although that happens too. I also mean that some packages offer fundamental abstractions that you are likely to want in your toolset for one reason or another, whereas others have more specific goals: they implement a specialized class of models, say, or are companions for books and classes. Like other developers, I could just list and praise the ones I use, or one could go for the crowdsourced solution of crantastic.

Here I would like to suggest a data-driven approach based on the dependencies between packages and graph analysis. A package listed by another as a dependency can be seen as receiving an endorsement of sorts from the developers of the dependent package. After all, they have decided that using that package is better than the alternatives. Also, an endorsement from the authors of a very important package can be seen as carrying more weight than one from the authors of a lesser package. You can guess here a recursive definition whereby being an important package means being a dependency of other important packages. If one considers the graph with packages as vertices and dependencies as directed edges, one can recognize the familiar notion of PageRank made popular by Google, whereby important sites are linked to by other important sites.

So after some CRAN scraping (the data set is a little old, from around December 2011) and using the package igraph, specifically its page.rank function, here are the top 100 dependency-ranked packages. I entered a brief description by hand for about the first half, then ran out of steam. Maybe we need a data-driven solution for that task too. Enjoy.
Rank | Package | PageRank score | Description |
1 | stats | 0.0962312835109951 | Distributions and other basic statistical stuff |
2 | methods | 0.0732606540057392 | Object oriented programming |
3 | graphics | 0.0536687309266182 | Of course, graphics |
4 | MASS | 0.0283011225469996 | Supporting material for Modern Applied Statistics with S |
5 | grDevices | 0.0281639967024237 | Graphical devices |
6 | utils | 0.0224799288855229 | In a snub to modularity, a little bit of everything, but very useful |
7 | lattice | 0.0163861320305732 | Graphics |
8 | grid | 0.0126373607888249 | More graphics |
9 | Matrix | 0.0115594712568376 | Matrices |
10 | mvtnorm | 0.0108335460953897 | Multivariate Normal and t Distributions |
11 | sp | 0.00916721059561437 | Spatial data |
12 | tcltk | 0.00885654936181036 | GUI development |
13 | splines | 0.00871777304117854 | Needless to say, splines |
14 | nlme | 0.00603233299532761 | Mixed effects models |
15 | survival | 0.00590245542213706 | Survival analysis |
16 | cluster | 0.00569050414061241 | Clustering |
17 | R.methodsS3 | 0.00536103360510169 | Object oriented programming |
18 | coda | 0.00525607637692928 | MCMC |
19 | igraph | 0.00510936911063866 | Graphs (the combinatorial objects) |
20 | akima | 0.00448891508477221 | Interpolation of irregularly spaced data |
21 | rgl | 0.00448697035750645 | 3D graphics (openGL) |
22 | rJava | 0.00419658010963776 | Interface with Java |
23 | RColorBrewer | 0.00405898916813389 | Color palettes |
24 | ape | 0.00401423956752348 | Phylogenetics |
25 | gtools | 0.00390068663688166 | Functions that didn't fit anywhere else, including macros |
26 | nnet | 0.00372527822413159 | Neural networks |
27 | quadprog | 0.00346928434614538 | Quadratic programming |
28 | boot | 0.00339455733075856 | Bootstrap |
29 | Hmisc | 0.00321230956674779 | Yet another miscellaneous package |
30 | car | 0.00306687776780923 | Companion to the Applied Regression book |
31 | lme4 | 0.00299902494303813 | Linear mixed-effects models |
32 | foreign | 0.00299020969373986 | Data compatibility |
33 | Rcpp | 0.00294488173058946 | R C++ integration |
34 | robustbase | 0.00292512759045668 | Robust statistics |
35 | zoo | 0.00291360656774946 | Regular and irregular Time Series |
36 | ggplot2 | 0.00280061452368686 | Graphics |
37 | iterators | 0.00271022721728954 | Iterators |
38 | XML | 0.00268297000192895 | XML |
39 | plyr | 0.00260013798376819 | In-memory data transformations |
40 | statmod | 0.00255576796128438 | Statistical modeling |
41 | tkrplot | 0.00253629634469558 | Plots as tk widgets |
42 | timeDate | 0.00241854401215965 | Time and date |
43 | fields | 0.00229020477891645 | Spatial data fitting |
44 | R.oo | 0.00224897565304714 | Object oriented programming |
45 | futile.paradigm | 0.00208727007738248 | Functional programming |
46 | abind | 0.00203562002853031 | Multidimensional array manipulation |
47 | rscproxy | 0.00199899977662843 | Interface to third party applications |
48 | scatterplot3d | 0.00194982279122935 | 3D scatter plot |
49 | distr | 0.00193739059491831 | Object oriented distributions |
50 | codetools | 0.00190284811878283 | Code analysis |
51 | corpcor | 0.00187713924111935 | Efficient Estimation of Covariance and (Partial) Correlation |
52 | numDeriv | 0.00186866167837909 | Numerical derivatives |
53 | gdata | 0.00186445901204259 | Data manipulation |
54 | emulator | 0.00186390193431536 | Bayesian emulation of computer programs |
55 | KernSmooth | 0.00183629272694307 | Kernel smoothing |
56 | mgcv | 0.00182832116584045 | Generalized additive models and other generalized ridge regression |
57 | ade4 | 0.00182738399748524 | Analysis of ecological data |
58 | foreach | 0.00182632366989875 | Alternative looping construct |
59 | e1071 | 0.00178029575562234 | Support material for a class |
60 | splus2R | 0.00176824350296979 | Support for porting from Splus |
61 | plotrix | 0.00174576155295491 | More graphics |
62 | RGtk2 | 0.00172084829088438 | GUI building with GTK |
63 | mclust | 0.00171720012190246 | Model-based clustering |
64 | colorspace | 0.00170618665568823 | Color Space manipulation |
65 | rgdal | 0.00169086925766161 | Geospatial data processing |
66 | gWidgets | 0.00167347646713519 | GUI building |
67 | tools | 0.00166343776456814 | Tools for package development |
68 | DBI | 0.00165189537436299 | |
69 | class | 0.00163669316246539 | |
70 | snow | 0.00163581475562725 | |
71 | tframe | 0.00162026150727402 | |
72 | pcaPP | 0.00161552199090754 | |
73 | stats4 | 0.00158184928979309 | |
74 | vegan | 0.00157719980281494 | |
75 | timeSeries | 0.00155718601562939 | |
76 | rgenoud | 0.00155684112512074 | |
77 | reshape | 0.00155396309497494 | |
78 | RCurl | 0.00151307683694413 | |
79 | rpart | 0.00150199881687968 | |
80 | Rcmdr | 0.00149432071343987 | |
81 | locfit | 0.00146482502191925 | |
82 | RJSONIO | 0.00146060707726276 | |
83 | maxLik | 0.00145055642526326 | |
84 | startupmsg | 0.0014445515325449 | |
85 | deSolve | 0.00143101879661299 | |
86 | tseries | 0.00140336389124161 | |
87 | gamlss | 0.00139669657806558 | |
88 | lars | 0.00139142435757209 | |
89 | caTools | 0.00137676796617264 | |
90 | R.utils | 0.00134070208104741 | |
91 | genetics | 0.00133801968423769 | |
92 | proto | 0.00132588926315005 | |
93 | np | 0.00132017944858541 | |
94 | spatstat | 0.00131066700412731 | |
95 | MCMCpack | 0.00127549927255682 | |
96 | maptools | 0.00127277095638128 | |
97 | rrcov | 0.00126919936569582 | |
98 | lpSolve | 0.00125502811609384 | |
99 | RcppArmadillo | 0.00125049110788447 | |
100 | copula | 0.00122860896379617 | |
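
For readers who want to try this kind of ranking themselves, here is a minimal sketch of the idea in R. It is not the script used to produce the table above. The dependency list is a made-up toy example with hypothetical package names, and it uses `page_rank()`, which is the name current igraph versions give to the `page.rank` function mentioned earlier.

```r
library(igraph)

# Toy dependency list (hypothetical packages, for illustration only):
# each row says that package `from` depends on package `to`.
deps <- data.frame(
  from = c("pkgA", "pkgB", "pkgC", "pkgC", "pkgD"),
  to   = c("core", "core", "pkgA", "core", "pkgC"),
  stringsAsFactors = FALSE
)

# Edges point from the dependent package to its dependency,
# so rank flows toward the packages that others rely on.
g <- graph_from_data_frame(deps, directed = TRUE)

# Compute PageRank with igraph's default damping factor (0.85).
pr <- page_rank(g, directed = TRUE)$vector

# Sort packages from most to least "endorsed".
ranking <- sort(pr, decreasing = TRUE)
print(ranking)
```

On real data, the edge list could probably be built from `available.packages()` together with `tools::package_dependencies()` rather than by scraping, although which dependency fields to count (Depends, Imports, and so on) is a choice that will change the ranking.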