CS525: Parallel Computing
Spring 2013.

Ananth Grama, ayg@cs.purdue.edu, 494 6964
  • Assignment 1 (Due Jan 25): You are required to write a parallel PageRank program in MPI. Web graphs of various sizes are available from the Stanford Network Repository.

    Your program must take one of these files as input. For the purposes of this assignment, one process reads the file and communicates an equal number of nodes to each of the other processes. The processes initialize their PageRank values to the (normalized) identity vector and compute PageRank through iterative multiplication with the matrix (power iterations). The process continues until the PageRanks no longer change significantly (say, until the two-norm of the change in the PageRank vector is less than 10^{-5}).

    The key challenge in this assignment is to write a highly optimized matrix-vector product. As discussed in class, you do not want to broadcast the entire intermediate PageRank vector to all processors. This is unlikely to give you good performance on larger numbers of processors. For this reason, you only want to communicate the necessary elements of the vector to the processors that need them.

    To get started on the mc cluster, here are some instructions.



    Grading: Your solutions will be graded by running the program on the test inputs. You will be graded on two counts -- first, your serial runtime should not be very high (i.e., you must have a meaningfully efficient serial sparse matrix-vector product). Second, you should get reasonable parallel speedup. The grading scale is as follows: 30% of the points for programs that execute correctly, and 70% of the points for performance. All programs will be run on the mc (mc01-18) cluster in the department (this is also the cluster you should use to benchmark your programs) on up to 16 cores. A speedup of more than 12 earns the full 70%; between 8 and 12, 50%; between 4 and 8, 30%; between 1 and 4, 15%; and no points for a speedup under 1 (!). These speedups are measured on the largest problem instance from the Stanford web repository that your program can execute.

    Here are the submission instructions.

  • Assignment 2 (Due Feb 8): Write a POSIX threads (pthreads) program for computing an approximate ranking of nodes using discrete-time random walks. In this approach, a random walker starts at each node of a given network. The walker makes a transition to one of its neighbors (or to itself) with equal probability, and leaves a mark at the visited node. Each walker advances by one step in this manner. After all walkers have advanced, the process is repeated. There are theoretical results showing that after a small number of such steps, highly ranked nodes can be identified as those with the highest number of marks.

    A naive pthreads implementation would associate one thread with each random walker. However, for large networks, this would result in a large number of threads (and unacceptably high overhead). Therefore, your program should ask the user for the desired number of threads. The nodes in the graph (rows in the sparse matrix) should be partitioned equally across these threads. Each thread advances the random walk by one step from all of its nodes, while recording marks at the destination nodes. All threads then synchronize and repeat the process. Please note that walkers may cross the boundaries of the node sets assigned to individual threads.

    Your program should also take as a parameter the number of desired steps. At the end of these steps, the program should print the 100 highest-ranked nodes in the network (i.e., the 100 nodes with the largest number of marks). Please use the networks from your previous assignment.



    Grading: Your solutions will be graded by running the program on the test inputs. You will be graded on two counts -- first, your serial runtime should not be very high (i.e., you must have a meaningfully efficient serial implementation). Second, you should get reasonable parallel speedup. The grading scale is as follows: 30% of the points for programs that execute correctly, and 70% of the points for performance. All programs will be run on the mc machines in the department (this is also the cluster you should use to benchmark your programs) on up to 16 cores. A speedup of more than 12 earns the full 70%; between 8 and 12, 50%; between 4 and 8, 30%; between 1 and 4, 15%; and no points for a speedup under 1 (!). These speedups are measured on the largest problem instance from the Stanford web repository that your program can execute.

  • Assignment 3: (Due Feb 20) Problems 2.6, 2.12, 2.13, 2.20, 2.23, 2.24, and 2.25 of the text `Introduction to Parallel Computing', by Grama et al.

  • Assignment 4: (Due Mar 6). Problem 1: Estimate the constants t_s and t_w for your platform. This is done by ping-ponging a message a large number of times to compute the round-trip time, and repeating this for increasing message sizes so that you can fit a line through the times you observe.

    Problem 2: Estimate the time for MPI_Alltoall, MPI_Allgather, and MPI_Allreduce. Carefully design your experiment to evaluate these times. Plot the times as a function of message size. On the same plot, show the theoretical estimate for the algorithm from your text.

    Problems 5.1, 5.2, 5.4, 5.5, 5.10, and 5.13 of the text `Introduction to Parallel Computing', by Grama et al.

  • Assignment 5: (Due Mar 29). This homework requires you to program a reducer (from MapReduce) in MPI. In this problem, each processor has a table of key-value pairs. You may assume that this table is the same size at every processor and that keys and values are both integers. The output of your program should be another (partitioned/distributed) table in which each key from the input appears exactly once, and the value associated with that key is the sum of all values associated with it in the input (note that since a key may appear in multiple input pairs, those pairs may be distributed across processors).

    The way to implement this is to first do a local reduce operation. This can be done either using a sort on the keys followed by a reduction pass, or using a hash table. You are free to choose your implementation. After this, you can be sure that each key appears only once at each processor (this reduces communication in the second step). In the second step, each key is mapped to a processor (this is done simply using a function such as key modulo p). Make sure you first collect all the key-value pairs in per-destination buckets and communicate them using a single group communication operation. In the final step, each processor (now acting as a reducer) does a second local reduction.

    Implement this algorithm in MPI and time your program for the weak scaling case (table size per processor stays constant as you increase the number of processors).

  • Assignment 6: (Due April 15). This homework requires you to implement sparse single-source shortest paths. Use one of the graphs from your first assignment. The user is prompted to enter a source (the id of a node). The program is then required to compute the single-source shortest paths. The graphs are always treated as sparse -- i.e., do not use full adjacency matrices, and do not use any quadratic-complexity data structures. The grading scale is as follows: 30% of the points for programs that execute correctly, and 70% of the points for performance. All programs will be run on the mc (mc01-18) cluster in the department (this is also the cluster you should use to benchmark your programs) on up to 16 cores. A speedup of more than 12 earns the full 70%; between 8 and 12, 50%; between 4 and 8, 30%; between 1 and 4, 15%; and no points for a speedup under 1 (!). These speedups are measured on the largest problem instance from the Stanford web repository that your program can execute.