Looking for a map reduce language
On a quest for an elegant and effective map reduce language, I went through a number of options and put together some considerations. And the winner is …
Update: since writing this post, I was approached by Revolution Analytics to write yet another map reduce library, this time for R, and it is, of course, my new favorite. Check it out!
In a couple of entries on my personal blog I described some map-reduce algorithms for statistical and graph problems and sketched their implementation in pseudo-code. Pseudo-code has two problems: not everybody agrees on what a statement means, and it doesn't run, so you can't test it or use it. Real programming languages, on the other hand, tend to obscure the logic of a program with unnecessary detail and have other issues that hinder readability, which is why people resort to pseudo-code in the first place. But there is more to it than aesthetics. Conciseness of code is related to programming abstractions, constructs that achieve higher generality and remove unnecessary detail; to reuse, whereby the same code is used in different contexts, reducing total program size; and even to testing, in that concise programs can be tested more easily. In short, short is better. The elegance of less is hardly my own discovery, or even a software engineering one. As Antoine de Saint-Exupery, French writer and aviator, so eloquently put it:
Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.

Unfortunately, in some circles, dull, predictable, repetitive code is considered simpler than short and to-the-point code, or at least tolerable. From java.util.Arrays:
/*
 * The code for each of the seven primitive types is largely identical.
 * C'est la vie.
 */

In this case, repetition gets a free pass in exchange for efficiency. Very expressive languages tend to exact a higher toll on resources, and the different map-reduce environments we will look into are no exception.
I will present for each language or library the implementation of a word count program, lifted from its documentation, since this has become sort of the "Hello World" for map reduce. I don't think such a simple program is the ultimate test of the quality of a language, so this is just to give a taste of the language. What I am most interested in is:
- Can I write reasonably concise, abstract programs in this language or library?
- Can I write the "inside" of map reduce, that is the code for the mapper and the reducer, as well as the "outside", the logic that decides which map reduce jobs to run?
- Is it general? Can I write any map-reduce program, including programs that require multiple map-reduce jobs, even when the number and type of jobs depend on the data? (A sketch of what I mean by this last point follows the list.)
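To make that last criterion concrete, here is a minimal sketch of the "outside" logic I have in mind: one job is run repeatedly and the data decides when to stop. The function run_job and the "changed" counter are hypothetical stand-ins for whatever a given language or library offers; nothing here is tied to a specific API.

# Sketch only: iterate a map-reduce job until its output stops changing.
# run_job is a hypothetical callable that launches one job and returns a
# dict of counters; "changed" is an assumed counter name.
def iterate_until_converged(run_job, input_path, max_rounds=10):
    current = input_path
    for i in range(max_rounds):
        output = "%s.round%d" % (input_path, i)
        counters = run_job(input=current, output=output)  # one map-reduce job
        if counters.get("changed", 0) == 0:               # data-dependent stop
            return output
        current = output
    return current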
Java + Hadoop

This is the original, the real thing, the current performance champion and what "real men" write in. It is also the most mature of the different options. But take a look:
public class WordCount {
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
conf.setInputPath(new Path(args[0]));
conf.setOutputPath(new Path(args[1]));
JobClient.runJob(conf);
}
}

48 lines to write a word count program (and I stripped the import statements at the top out of mercy)! My favorite line is number 5, a line devoted to redefining the number one. This makes sense in a world where programmer productivity is measured by the number of lines of code written, or for a production job that runs on a 1,000-node cluster for 5 hours every night, in which case efficiency may trump other considerations. But for a blog, for discussing and enjoying code, anything remotely more interesting than a word count program would not fit the size of an entry and would have to be an attachment, as John Mount did with his painstaking Java/Hadoop implementation of logistic regression. I wonder how many people opened that tar file and read it through and through.
Hadoop was started by Doug Cutting and is developed by a large community with a significant group employed at Yahoo.
Cascading
Cascading is a Java library written on top of Hadoop. It enables programming in a dataflow style, with some primitives inspired by SQL (like GroupBy). But according to a person closely related to the project, "it's still Java, it's still boilerplate code". My favorite line is number 18. Remarkably, it trims the word count program down to half as many lines as plain Hadoop. I don't have first-hand experience with Cascading, but since there is little or no performance penalty compared to the real thing — depending on programmer skill, it could actually be faster — it's worth a try for production work.
Cascading is developed by Chris Wensel at Concurrent.
Scheme sourceScheme = new TextLine( new Fields( "line" ) );
Tap source = new Hfs( sourceScheme, inputPath );
Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) );
Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );
Pipe assembly = new Pipe( "wordcount" );
String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
Function function = new RegexGenerator( new Fields( "word" ), regex );
assembly = new Each( assembly, new Fields( "line" ), function );
assembly = new GroupBy( assembly, new Fields( "word" ) );
Aggregator count = new Count( new Fields( "count" ) );
assembly = new Every( assembly, count );
Properties properties = new Properties();
FlowConnector.setApplicationJarClass( properties, Main.class );
FlowConnector flowConnector = new FlowConnector( properties );
Flow flow = flowConnector.connect( "word-count", source, sink, assembly );
flow.complete();
Pipes — C++

C++ fits into the Java environment not without some effort, which is encapsulated in a library called Pipes. The word count program looks more compact than in Java/Hadoop. I've read opposite claims about the efficiency of Pipes/C++ vs Hadoop/Java, and I suspect it may depend on the specific problem being tackled. Even if I used to be quite proficient in C++, I do not remember fondly the 8000-character template-induced error messages, and I don't think it is the type of language I would want to use to discuss algorithms or for prototyping.
Pipes is developed as part of the Hadoop project.
class WordCountMap: public HadoopPipes::Mapper {
public:
WordCountMap(HadoopPipes::TaskContext& context){}
void map(HadoopPipes::MapContext& context) {
std::vector<std::string> words =
HadoopUtils::splitString(context.getInputValue(), " ");
for(unsigned int i=0; i < words.size(); ++i) {
context.emit(words[i], "1");
}
}
};
class WordCountReduce: public HadoopPipes::Reducer {
public:
WordCountReduce(HadoopPipes::TaskContext& context){}
void reduce(HadoopPipes::ReduceContext& context) {
int sum = 0;
while (context.nextValue()) {
sum += HadoopUtils::toInt(context.getInputValue());
}
context.emit(context.getInputKey(), HadoopUtils::toString(sum));
}
};
int main(int argc, char *argv[]) {
return HadoopPipes::runTask(HadoopPipes::TemplateFactory<WordCountMap,
WordCountReduce>());
}
Hive

Hive is a SQL-like language that is interpreted on top of Hadoop. It can also be combined with small programs written in a variety of languages, to make up for the fact that the language itself is not general purpose. For what it does, it is very concise and expressive, but outside of that you need to supplement it with other languages. Case in point: in the word count example, two additional scripts are left as an exercise for the reader (a rough sketch of what they might look like follows the query).
Hive started as part of the Hadoop project.
FROM
(MAP docs.contents USING 'tokenizer_script' AS word, cnt
FROM docs
CLUSTER BY word) map_output
REDUCE map_output.word, map_output.cnt USING 'count_script' AS word, cnt;
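Those two scripts are not part of Hive itself; what follows is only a rough sketch of what tokenizer_script and count_script might look like, assuming the tab-separated stdin/stdout convention Hive uses for MAP and REDUCE transforms. The column names match the query above, but everything else is made up.

# --- tokenizer_script (one file, sketch): read document contents on stdin,
# --- emit "word <TAB> 1" pairs, one per line.
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t1" % word)

# --- count_script (a second file, sketch): input is clustered by word,
# --- so identical words arrive on consecutive lines.
import sys
current, total = None, 0
for line in sys.stdin:
    word, cnt = line.rstrip("\n").split("\t")
    if word != current and current is not None:
        print("%s\t%d" % (current, total))
        total = 0
    current = word
    total += int(cnt)
if current is not None:
    print("%s\t%d" % (current, total))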
Pig

Pig adds to the limitations of Hive the hubris of creating a brand new language, as if creating a new programming language were easy. As you can see, it is inspired by SQL to a degree. It is not a general purpose language, as clearly explained here. It interfaces with any JVM-based language for custom extensions.

Pig development was started at Yahoo.
A = load '/tmp/bible+shakes.nopunc';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = group C by word;
E = foreach D generate COUNT(C) as count, group as word;
F = order E by count desc;
store F into '/tmp/wc';
Rhipe
Rhipe is an R package to describe and execute map-reduce jobs. It is reasonably high level and satisfies all the criteria I listed above. It's not a speed demon, largely because of R itself, there are some quirks in the API, and it's still at an early stage of development, but it is interesting.

Rhipe is developed by Saptarshi Guha at Mozilla.
rhinit()
m <- expression({
y <- strsplit(unlist(map.values)," ")
lapply(y,function(r) rhcollect(r,T))
})
r <- expression(
pre={
count=0
},
reduce={
count <- sum(as.numeric(unlist(reduce.values)),count)
},post={
rhcollect(reduce.key,count)
})
z=rhmr(map=m,reduce=r,comb=T,inout=c("text","sequence"),ifolder="/tmp/50mil",ofolder='/tmp/tof')
rhex(z)
Dumbo
Dumbo is a Hadoop library for python, but it also comes with a set of tools required to run dumbo programs. If you look at the word count program in Dumbo, below, it almost looks like pseudo-code! Finally! But there is a serious catch. There can be only one run statement per dumbo-powered program — I asked the author himself after seeing some outlandish-looking errors. To coordinate two runs, for instance a second one that starts from the output of the first, one has to run separate python programs and go through the unix shell (see the driver sketch after the word count program). This is different from static composition of jobs, which is well supported but not general enough for my purposes. Other options for python include MR Job and pydoop, but I haven't had time to look into these yet.

Dumbo is developed by Klaas Bosteels at last.fm.
def mapper(key,value):
for word in value.split(): yield word,1
def reducer(key,values):
yield key,sum(values)
if __name__ == "__main__":
import dumbo
dumbo.run(mapper,reducer)
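For completeness, here is a minimal sketch of the shell-level coordination I mean: a plain Python driver that launches two dumbo programs in sequence, the second reading the first's output. The "dumbo start" invocation is the usual command line form; the script names, paths and -hadoop location are all made up.

# Sketch only: chain two dumbo programs from the outside, since a single
# dumbo program can contain only one run statement.
import subprocess

def run_dumbo(script, input_path, output_path):
    # launch one dumbo program on the cluster; -hadoop points at the
    # local Hadoop installation (an assumption)
    subprocess.check_call([
        "dumbo", "start", script,
        "-hadoop", "/usr/lib/hadoop",
        "-input", input_path,
        "-output", output_path,
    ])

run_dumbo("first_pass.py", "/data/in", "/data/intermediate")
run_dumbo("second_pass.py", "/data/intermediate", "/data/out")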
Cascalog
Built on top of the already powerful Cascading as a domain-specific language within Clojure, Cascalog wins the word count conciseness contest with a one-liner. Indeed, word counting is simple enough that a line is all it should take. But what a line it is:
(?<- (stdout) [?word ?count] (sentence ?s) (split ?s :> ?word) (c/count ?count))

It probably looks cryptic to anybody who isn't familiar with it. Conciseness can become terseness, but once some domain-specific concepts have been grasped, a terse program such as this might become perfectly clear. It was to me at some point. My misgivings here are more about the JVM-powered revival of LISP in the form of Clojure. LISP has been around some 50-odd years without taking off, despite several attempts at its revival (Common LISP, Scheme, Arc and now Clojure). I suspect something is wrong with it, even if popularity is not an accurate gauge of language quality, as BASIC has long proved. Personally, I dislike LISP's odd syntax, the widespread use of side effects in a functional language, and the poor abstraction that lists represent over RAM from a performance point of view — indeed, LISP variants often add additional data structures, somehow negating the "LIS" part of the language. In the specific case of Clojure, the fact that a compiled language is compiled into an interpreted one, JVM bytecode, combining a slow dev cycle with suboptimal performance, makes me think Clojure users must be gluttons for punishment.
Cascalog is developed by Nathan Marz at Backtype.
Final thoughts
At the end of this necessarily incomplete and unscientific language and library comparison, there is a winner and there isn't. There isn't, because language comparison is always multidimensional and subjective, but also because the intended applications are very different. On the other hand, looking for a general purpose, moderately elegant, not necessarily most efficient, not necessarily mature language for exploration purposes, Rhipe seems to fit the bill pretty nicely.

First, it is just a library, which means that one can continue to use the tools one is familiar with. I found it particularly useful to run map-reduce jobs in the interpreter, inspecting the inputs and outputs of each, an invaluable debugging help — but no, you cannot step into a mapper or reducer; I use counters instead to trace what's going on in there. I also like that one can read and write sequence files with one call, to examine the output of previous jobs and decide what to do next. Additionally, since R is a statistical language and Hadoop is the tool of choice for big data analytics, this seems like a natural fit. Personally, I am familiar with both, which helps, and I have used R, in combination with Hive or Hadoop, to do analytics in the past, but not at this level of integration.

Since there is nothing like trying a more substantial example than word count to figure out a language's pros and cons, stay tuned for a fairly complex example. After that is published, I plan to pose a friendly challenge to experts in the languages and libraries above, or other Hadoop-related languages, and see what an implementation of the same algorithm would look like in their language of choice, and learn something from the comparison. Maybe among my "25 readers" there is someone who will take it up.
Brian, good point, and add scrunch https://github.com/cloudera/crunch/tree/master/scrunch to the mix. This post is getting obsolete pretty quickly, a sign of the frantic activity around Hadoop.
val lines = fromTextFile("hdfs://in/...")
val counts = lines.flatMap(_.split(" "))
.map(word => (word, 1))
.groupByKey
.combine(_+_)
persist(toTextFile(counts, "hdfs://out/..."))
From an end user perspective there is no difference between invoking SENTENCES and any other keyword in the Hive syntax. Under the covers it's all Java anyway, but the point is to abstract away the details for people that want to be abstracted while still providing a powerful tool.
So, yes, I consider it part of the core language
SELECT count(*), word
FROM
(SELECT explode(sentences(lower(contents))) as word
FROM docs) A
group by word