Mapreduce Everywhere

Mapreduce could extend its reach beyond — or inside — the data center. Coming soon to a computer near you?

The local Hadoop SF meetings cover a variety of topics, mostly practical. But on one occasion the discussion took a speculative turn: does Hadoop have legs or is it a stop-gap measure? People supporting the stop-gap thesis presented a hardware argument. SSDs and distributed DRAM-based storage are replacing hard drives in many applications and their performance characteristics are different enough to make Hadoop obsolete. In particular they undermine one of the basic premises of the mapreduce model, that is that sequential access is much faster than random access. While I was trying to mount a defense based on capacity and price, somebody chimed in saying that the latency issue exists at all levels of the memory hierarchy. Wait a minute, flash memory and DRAM don’t spin around and don’t have read/write heads to reposition. You send in an address to read and you get a value out, right? Wrong. I was kindly pointed to this very readable paper, ”The Pathologies of big data”, which shows, in Figure 3, how sequential access is faster for spinning disks as well SSDs and DRAM. The gap with random access is staggering for HDs at more than 5 orders of magnitude, slightly less for SSDs and goes down to one order of magnitude for DRAM, but it is still there and getting worse over time. Therefore, one of the assumptions behind mapreduce seems likely to be with us for several years to come. Following this reasoning further, mapreduce may prove useful to program on architectures other than clusters. In fact if you Google [GPU and mapreduce] or [FPGA mapreduce] you’ll find quite a few interesting reads, for instance:
Another example of a new use of map-reduce is Spark. Yes, it still runs on clusters but the role of hard disks is replaced by a ‘Resilient Distributed Dataset’, an abstraction over distributed DRAM, striking a completely different price/performance trade-off than Hadoop.
These are mostly academic forays, but some are more practical, like the Word Count example above. FPGAs may be a bit more exotic, but using GPUs for non graphics related tasks is a major trend in HPC, and Amazon itself offers GPU-capable instances as part of their EC2 service.

What are the major assumptions behind the mapreduce model? Let me give it a try:
  1. Needless to say, parallel processing.
  2. A two-layer memory model: one is fast but low-capacity, used within the map and reduce functions, the other is slow but high capacity.
  3. Sequential access is better than random. Sort, don’t index. It doesn’t mean one has to read all the data, all the time, but the trend is toward bigger chunks.
  4. Locality: resources local to a processing element are faster than remote ones, all else being equal.
It’s a brave new world. Long standing abstractions such as a flat memory model are reversed. Writing programs under these new assumptions is a cultural shift.

Where else could the mapreduce model prove useful? Look no further than your very own laptop or server. Let’s check the four assumptions:
  1. The number of cores is ever increasing, with AMD packing 16 in their server CPUs and HP 1152 in their ARM/Calxeda based Redstone platform, a 4U server (goodness starts at 0:55). In the HPC market, Intel is squeezing more than 50 cores on their x86 compatible Knight’s Corner and embedded processors manufacturers are outdoing each other, with GreenArray at 144 cores and Adapteva promising a staggering 4096. With nowhere to go as far as clock speed, running out of ideas to make each cycle accomplish more and with power limits the main concern for designers of mobile devices and data centers alike, chip makers are likely to make core count follow the next variant of Moore’s law. The free lunch is over. We need to master parallel programming, now, and mapreduce is the gentle way.
  2. Memory layers below the DRAM abound, they are called registers, L1, L2 and L3 cache, from the fastest to the slowest. Size is inversely proportional to speed. Currently, most programmers are blissfully unaware of these subtleties and let optimizing compilers do the best they can. If programs were rewritten to follow the mapreduce model, they would be naturally suited to maximize locality of reference (p. 12:3, bottom), improving cache hits and prefetch effectiveness. Maybe a compiler could be tailored for mapreduce programs and squeeze out even more speed.
  3. Sequential access is faster, even for DRAM. That’s the realization that inspired this post in the first place.
  4. L1 and L2 are core-local, whereas L3 and DRAM are shared in recent Intel designs. This may change one day as the effort to provide the convenience of a unified memory space takes an ever increasing toll on performance and power consumption.
I thought I’d back this with a little experiment. I copied a big vector into another one, the first time in order from the first element to the last, the second in a random order (specifically a random permutation of the index range). Each vector takes more than 70 MB of memory vs. an L3 cache size of 6MB in a 2011 sandy bridge i7 quad processor. This is from an R session:

big = 10000000
A = rnorm(big)#fill with some random numbers
B = A
sequential = 1:big
random = sample(sequential, big, replace=FALSE)
system.time({B = A[sequential]})
user system elapsed
0.097 0.023 0.120
system.time({B = A[random]})
user system elapsed
0.246 0.027 0.273
system.time({B = A[sequential]})
user system elapsed
0.074 0.000 0.074
system.time({B = A[random]})
user system elapsed
0.232 0.000 0.233
[...]

We can observe a roughly 3-fold difference if we copy elements in order as opposed according to a random permutation of positions. See this other example for a similar effect observed by modifying a matrix multiplication algorithm to enhance locality, this time in C.

In summary, the mapreduce model seems to be a good fit for a variety of promising architectures. The basic assumptions on which it is based are ubiquitous.  Maybe it is a unifying programming model that will help us in the transition to parallel computing.

Comments

Unknown
http://research.microsoft.com/pubs/163083/hotcbp12%20final.pdf