A map reduce algorithm for connected components

In a recently published book about algorithms for the map reduce model of computation, a simple connected components algorithm based on lablel propagation is proposed, but its complexity depends on the diameter of the graph, which can be very large. It turns out we can get rid of that dependency with a completely different algorithm, ported from the PRAM model.

The authors observe the many if not most practically occurring web-sized graphs have small diameters (the "small world" phenomenon) and therefore their algorithm is of practical importance. At the same time, new types of large graphs become available to researchers and practitioners who want to data mine them and do not know a priori what the diameter of these graphs is. And if you don't buy this motivation for an algorithm with a diameter-independent upper bound, it's just "because it's there".

Researching the literature for parallel algorithms for connected components, I found one called Random Mate (I am not sure if the correct attribution is to Hillel Gazit or John H. Reif, my access to older literature is limited, but I found a nice write up ) that seems amenable to a map reduce implementation. The basic idea is that, instead of extending components one or few nodes at a time, we should merge them and enjoy an exponential increase in component size for each iteration, on average. More specifically, components are represented with directed forests overlaid on top of the input graph, that is using the same nodes but a separate set of edges. The trees in the forest are kept very shallow, essentially stars or very close to stars, so that eventually each node will have as parent the root of a tree, providing a convenient representation for each component. The most technical bit is that at each step, nodes are randomly assigned to two sets, let's call them "high" and "low" (the above write up uses "male" and "female"), but what counts really is the type of the parent in the forest. Edges that hit two nodes with parents of the same type are out of the game for that specific iteration. Edges that hit nodes with one high and one low parent can only be used in one direction, that is to attach the forest of the low parent under the forest of the high parent. If we didn't do this, working in parallel we could end up creating loops, thus turning trees into general graphs (specifically, DAGs) and violating a key invariant of the algorithm. It is still possible that a low node could have multiple neighboring high nodes belonging to different trees, but only one wins based on the properties of the PRAM CRCW model of parallel computation which specifies a way for concurrent writes to be sorted out.

The following is a sketch of a map-reduce version of this PRAM algorithm. Its correctness and performance bounds follow from the fact that it closely emulates the original, step by step and therefore the original proof is still valid. In my pseudo code, we will consider tables as on disk data structures and consider join operations as primitive operations. Tables are implemented as distributed file system files. See the aforementioned book for the details on how to implement joins, or the Hive software. We will build a directed forest on top of the graph, so we will devote a table $F$ to the edge list of the forest and on to $E$, the input graph $G=(V,E)$ (undirected edges are represented as pairs of directed ones). $F$ is initialized as $(v,v) \;\forall v \in V$ that is all the self-loops (this is a slight departure from the definition of a forest, but let me still call it a forest, it makes for simpler code with no special handling of the root nodes). The first operation is a join described in SQL-like language:

select E.u as u , E.v as v, F1.v as p1, F2.v as p2 from F F1 join E on E.u = F1.u join F F2 on E.v = F2.u

That is, we "annotate" each edge with the parents in the forest of each vertex hit by the edge. Now we describe a map phase with input $(u, v, p_1, p_2)$. $r$ is the map reduce round number.

def map(u, v, p1, p2):
  if p1 == p2: #same component
    return [] #do nothing
  else: 
    h1 = hash(p1, r)%2 #randomly assign "high" and "low"
    h2 = hash(p2, r)%2
    if h1 == h2:
      return [] #if same give up
    else:
      return [(p1,p2) if h1 else (p2,p1)] #otherwise merge one into the other

This phase, in the original formulation for the PRAM, requires a CRCW model, without preference for a specific variant, so we will use the arbitrary CRCW model and implement it in the reduce phase, the first node of the edge being designated as the key.

def reduce(list):
  return list[0]

(this is amenable to be used as a combiner as well)
The output of this reducer becomes F', a table of updates to the forest that needs to be merged with the existing F.

select coalesce(F'.u, F.u), coalesce(F'.v, F.v) from F left outer join F' on F.u = F'.u

which is a way of saying: keep all the edges in F unless there is one in F' with the same starting node, then replace it.

Finally, there is a path shortening phase that turns trees in the forest into stars. This is implemented with a self join on the forest.

select F1.u, F2.v from F F1 join F F2 on F1.v = F2.u

Again this replaces F. The termination criterion is that if there are no edges in G with different parents, we are done.

As in the original algorithm, the number of iteration is $O(\log(N))$, each of which requires sorting of all the edges, the graph's and the forest's. There are highly optimized sorting implementations for the map reduce model, but I haven't found a discussion of their asymptotic complexity. From a quick look at Terasort, I think it requires $O(N\log(N))$ work and takes $O(\log(N))$ time with $\Omega(N)$ available processors. These are the dominating computational costs, as the mappers and reducers above all execute in constant time, for an overall $O(N(\log(N))^2)$ work and $O((\log(N))^2)$ time. There is one detail to take into account for the reducer that implements the effects of the arbitrary CRCW PRAM model: for a high degree node, one reducer might get a very large proportion of the input. This is not a problem, as the reducer can just return after reading the first line of input and we can use the reducer as a combiner for and additional optimization.

Another potential optimization could be to use, instead of a random bipartition of the nodes, a random priority. This way more mergers would happen at each step. The central part of the algorithm becomes:

def map(u, v, p1, p2):
  if p1 == p2: #same component
    return [] #do nothing
  else:
    h1 = hash(p1, r) #randomly assign priority
    h2 = hash(p2, r)
    if h1 == h2:
      return [] #if same give up
    else:
      return [(p1,p2) if h1 > h2 else (p2,p1)] #otherwise merge one into the other

That would require though a more powerful path shortening phase, since the correctness proof relies on the trees in the forest having depth one at the end of each iteration, and the current path shortening can only halve the depth. I expect that this modified algorithm would be faster for dense graphs for which the forest has many fewer edges.

Even without this optimization, you might have noticed that the diameter of the graph was notably absent from the discussion, and in fact this algorithm works well even for paths, which was our original goal. It could require more iterations though than the label propagation algorithm when the diameter of the graph is $\Omega(\log(N))$ as far as I can tell from these bounds, but I don't have an example where it does. It might be possible to show that this algorithm works faster than $O(\log(N))$ when the diameter is small.