Parallel PageRank

David Gleich, Matt Rasmussen, and Leonid Zhukov.

The goal of this project was to investigate solutions methods for the PageRank problem on a distributed parallel architecture. We implemented a parallel PageRank computation and experimental system using the MPI and PETSc libraries. This webpage documents some of our results on this project.

Presentations and Publications

  • Fast Parallel PageRank: A Linear System Approach. Yahoo! Technical Report, 2004. [pdf]
  • Fast Parallel PageRank: A Linear System Apparoch. Short presentation. SCREAM 2005, Stanford University.
  • Fast Parallel PageRank: Methods and Evaluations. Poster. BASCD 2005, University of San Franciso. [pdf]
  • New! Scalable Computing for Power Law Graphs: Experience with Parallel PageRank. Yahoo! Technical Report, 2005. [pdf]

Our System and Results

For the PageRank problem, there are many computational approaches. The following flowchart summarizes our options for computing PageRank in parallel.

The following picture describes our system. At the base we have a 120-node Beowulf cluster of RLX nodes. Each node has dual Intel Xeon 2.8 GHz processors and 4 GB RAM. These nodes are connected via a 120-port Gigabit Ethernet switch.

Above the hardware layer, we use the MPI standard to control the operation of the cluster and use the PETSc library for its efficient parallel numerical routines. At the top, we have our Parallel PageRank code which is a modular, experimental framework to rapidly prototype and evaluate parallel methods for PageRank computation.

Our results are summarized in the following table show the benefit of this approach.

For the 6 graphs presented here, we present runtimes on the number of processors listed underneath the graph. The size of the graph is given in number of pages (nodes) first and number of links (edges) second. For each method, we list the number of iterations taken, the time per iteration, and the total time. The shaded boxes are the best method in a row. A star denotes the use of a preconditioner. The residual used for convergence was 10^-7.

A natural question with a parallel system is the scaling of the problem. The following plot indicates that on a fully-connected network topology, PageRank scales well.