CS525: Parallel Computing
Spring 2013.
Ananth Grama, ayg@cs.purdue.edu, 494 6964
-
Assignment 1 (Due Jan 25): You are required to write a parallel PageRank
program in MPI. Web graphs of various sizes are available from the
Stanford Network Repository.
Your program must take one of these files as input. For the purposes of
this assignment, one processor reads the file and communicates equal
number of nodes to all other processes. The processes initialize their
pagerank values to the identity vector (normalized) and compute page
rank through iterative multiplication with the matrix (power iterations).
The process continues until the pageranks do not change significantly
(the two-norm of the pagerank vector changes by less than 10^{-5}, say.
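For reference, here is a minimal serial sketch of the power iteration,
assuming CSR (compressed sparse row) storage for the web matrix; the type
and function names are illustrative, not prescribed, and PageRank's
damping/teleportation term is omitted for brevity:

    #include <math.h>
    #include <stdlib.h>

    /* CSR storage for the (column-stochastic) web matrix. */
    typedef struct {
        int n;          /* number of nodes                                  */
        int *rowptr;    /* rowptr[i] .. rowptr[i+1]-1 index row i's entries */
        int *colidx;    /* column index of each nonzero                     */
        double *val;    /* value of each nonzero                            */
    } csr_t;

    /* y = A * x, one pass over the nonzeros. */
    static void csr_matvec(const csr_t *A, const double *x, double *y)
    {
        for (int i = 0; i < A->n; i++) {
            double s = 0.0;
            for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; k++)
                s += A->val[k] * x[A->colidx[k]];
            y[i] = s;
        }
    }

    /* Power iteration until the 2-norm of the change drops below tol. */
    void pagerank(const csr_t *A, double *x, double tol)
    {
        double *y = malloc(A->n * sizeof *y);
        for (int i = 0; i < A->n; i++)
            x[i] = 1.0 / A->n;            /* normalized uniform start vector */
        double delta;
        do {
            csr_matvec(A, x, y);
            delta = 0.0;
            for (int i = 0; i < A->n; i++) {
                double d = y[i] - x[i];
                delta += d * d;
                x[i] = y[i];
            }
        } while (sqrt(delta) > tol);      /* e.g., tol = 1e-5 */
        free(y);
    }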
The key challenge in this assignment is to write a highly optimized
matrix-vector product. As discussed in class, you do not want to
broadcast the entire intermediate pagerank vector to all processors; this
is unlikely to give you good performance on larger numbers of processors.
For this reason, you only want to communicate the necessary elements of
the vector to processors that need them.
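One way to do this (a sketch, not the required design) is to precompute,
once from the sparsity pattern, which entries of the local vector each
other process references, and then exchange only those entries with a
single MPI_Alltoallv per iteration. All index arrays below are assumed to
be precomputed:

    #include <mpi.h>

    /* Exchange only the needed vector entries. Assumed precomputed:
     *   send_idx[], send_counts[p], send_displs[p] -- which of our entries
     *       each process needs, grouped by destination;
     *   recv_counts[p], recv_displs[p]             -- how many entries we
     *       receive from each process, and where they land in recv_buf. */
    void exchange_needed_entries(const double *x_local, int *send_idx,
                                 int *send_counts, int *send_displs,
                                 int *recv_counts, int *recv_displs,
                                 double *send_buf, double *recv_buf,
                                 int nsend, MPI_Comm comm)
    {
        /* Pack only the entries some other process actually references. */
        for (int k = 0; k < nsend; k++)
            send_buf[k] = x_local[send_idx[k]];

        /* One personalized exchange instead of a full broadcast. */
        MPI_Alltoallv(send_buf, send_counts, send_displs, MPI_DOUBLE,
                      recv_buf, recv_counts, recv_displs, MPI_DOUBLE, comm);
    }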
To get started on the mc cluster,
here are some instructions.
Grading: Your solutions will be graded by running the program on
the test inputs. You will be graded on two counts -- first, your serial
runtime should not be very high (i.e., you must have a meaningfully
efficient serial sparse matrix-vector product). Second, you should get
reasonable parallel speedup. The grading scale is as follows:
30% of the points for programs that execute correctly and 70% of the
points for performance. All programs will be run on the mc (mc01-18)
cluster in the department (this is the cluster you should use to benchmark
your programs as well) on up to 16 cores. If you get a speedup of more
than 12, you get all 70%; a speedup between 8 and 12 gets 50%; between
4 and 8, 30%; and between 1 and 4, 15%. No points for a speedup under
1 (!). These speedups are measured on the largest problem instance in the
Stanford web repository that your program can execute on.
Here are the submission instructions.
-
Assignment 2 (Due Feb 8): Write a POSIX threads (pthreads) program for
computing an approximate ranking of nodes using discrete-time random
walks. In this approach, a random walker starts at each node of a given
network. The walker makes a transition to one of its neighbors (or to
itself) with equal probability, and leaves a mark at the node it visits.
Each walker advances by a step in this manner. After all walkers have
advanced, the process is repeated. Theoretical results show that after a
small number of such steps, highly ranked nodes can be identified as those
with the highest number of marks.
A naive pthreads implementation would associate one thread with a single
random walker. However, for large networks, this would result in a large
number of threads (and unacceptably high overhead). Therefore, your
program should ask the user for the desired number of threads. The nodes in
the graph (rows in the sparse matrix) should be partitioned equally across
these threads. Each thread advances the random walk by one step from all of
these nodes, while collecting the marks at the destination nodes. All
threads then synchronize and repeat the process. Note that walkers often
step onto nodes assigned to other threads, so updates to the mark counts
must be properly synchronized; a sketch appears after the next paragraph.
Your program should also take the desired number of steps as a parameter.
At the end of these steps, the program should print out the 100 highest
ranked nodes in the network (i.e., the top 100 nodes with the largest
number of marks). Please use the networks from your previous assignment.
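A minimal sketch of the per-thread loop under these assumptions (CSR
adjacency, a pthread barrier between steps, and an atomic increment for
the shared mark counts; all names are illustrative):

    #include <pthread.h>
    #include <stdlib.h>

    /* CSR adjacency: the neighbors of node v are adj[xadj[v] .. xadj[v+1]-1].
     * pos[v] holds the current node of the walker that started at v
     * (initialized to v); marks[u] counts visits to node u. */
    typedef struct {
        int n, nthreads, nsteps;
        int *xadj, *adj, *pos, *marks;
        pthread_barrier_t bar;
    } walk_t;

    typedef struct { walk_t *w; int tid; } arg_t;

    static void *walker_thread(void *p)
    {
        arg_t *a = p;
        walk_t *w = a->w;
        /* This thread owns the walkers that started in [lo, hi). */
        int lo = (int)((long long)w->n *  a->tid      / w->nthreads);
        int hi = (int)((long long)w->n * (a->tid + 1) / w->nthreads);
        unsigned seed = 1234567u + (unsigned)a->tid;   /* per-thread RNG state */

        for (int step = 0; step < w->nsteps; step++) {
            for (int v = lo; v < hi; v++) {
                int cur = w->pos[v];
                int deg = w->xadj[cur + 1] - w->xadj[cur];
                int r = rand_r(&seed) % (deg + 1);     /* deg neighbors + self */
                int nxt = (r == deg) ? cur : w->adj[w->xadj[cur] + r];
                w->pos[v] = nxt;
                /* A walker may land on a node owned by another thread,
                 * so the mark update must be atomic. */
                __sync_fetch_and_add(&w->marks[nxt], 1);
            }
            pthread_barrier_wait(&w->bar);             /* advance in lockstep */
        }
        return NULL;
    }

If the atomic increments become a bottleneck, keeping per-thread mark
arrays and merging them at the end is a reasonable alternative.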
Grading: Your solutions will be graded by running the program on
the test inputs. You will be graded on two counts -- first, your serial
runtime should not be very high (i.e., you must have a meaningfully
efficient serial implementation). Second, you should get
reasonable parallel speedup. The grading scale is as follows:
30% of the points for programs that execute correctly and 70% of the
points for performance. All programs will be run on the mc machines in the
department (this is the cluster you should use to benchmark your programs
as well) on up to 16 cores. If you get a speedup of more than 12, you get
all 70%; a speedup between 8 and 12 gets 50%; between 4 and 8, 30%; and
between 1 and 4, 15%. No points for a speedup under 1 (!). These speedups
are measured on the largest problem instance in the Stanford web
repository that your program can execute on.
-
Assignment 3: (Due Feb 20) Problems 2.6, 2.12, 2.13, 2.20, 2.23,
2.24, and 2.25 of the text `Introduction to Parallel Computing', by Grama et al.
-
Assignment 4: (Due Mar 6).
Problem 1: Estimate the constants t_s and t_w for your platform. This is
done by ping-ponging a message a large number of times to compute the
average round-trip time, and repeating this for increasing message sizes
so that you can fit a line through the times you observe.
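A sketch of such a ping-pong benchmark (the repetition count and message
sizes are illustrative):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NREPS 1000   /* repetitions per message size */

    /* Run with exactly two processes, e.g.: mpirun -np 2 ./pingpong */
    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int m = 1; m <= (1 << 20); m *= 2) {   /* 1 B .. 1 MB */
            char *buf = malloc(m);
            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            for (int i = 0; i < NREPS; i++) {
                if (rank == 0) {
                    MPI_Send(buf, m, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, m, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else if (rank == 1) {
                    MPI_Recv(buf, m, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, m, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            /* Half the round-trip time approximates t_s + t_w * m. */
            if (rank == 0)
                printf("%8d bytes: %g s one way\n", m,
                       (MPI_Wtime() - t0) / (2.0 * NREPS));
            free(buf);
        }
        MPI_Finalize();
        return 0;
    }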
Problem 2: Estimate the time for MPI_Alltoall, MPI_Allgather, and MPI_Allreduce.
Carefully design your experiment to evaluate these times. Plot the times as a function
of message size. On the same plot, show the theoretical estimate for the algorithm
from your text.
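For instance, the following sketch times MPI_Alltoall (MPI_Allgather and
MPI_Allreduce are timed analogously); the function name and buffer
contents are illustrative:

    #include <mpi.h>
    #include <stdlib.h>

    /* Average time per MPI_Alltoall call with m ints sent to each process,
     * reduced to the maximum over processes (the slowest process is what
     * the collective actually costs). */
    double time_alltoall(int m, int nreps, MPI_Comm comm)
    {
        int p;
        MPI_Comm_size(comm, &p);
        int *sendbuf = malloc((size_t)p * m * sizeof *sendbuf);
        int *recvbuf = malloc((size_t)p * m * sizeof *recvbuf);
        for (int i = 0; i < p * m; i++)
            sendbuf[i] = i;

        MPI_Barrier(comm);                 /* line everyone up before timing */
        double t0 = MPI_Wtime();
        for (int i = 0; i < nreps; i++)
            MPI_Alltoall(sendbuf, m, MPI_INT, recvbuf, m, MPI_INT, comm);
        double t = (MPI_Wtime() - t0) / nreps;

        double tmax;
        MPI_Allreduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, comm);
        free(sendbuf);
        free(recvbuf);
        return tmax;
    }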
Problems 5.1, 5.2, 5.4, 5.5, 5.10, and 5.13 of the text `Introduction to
Parallel Computing', by Grama et al.
-
Assignment 5: (Due Mar 29).
This homework requires you to program a reducer (from MapReduce) in MPI. In this
problem, each processor has a table of key-value pairs. You may assume
that this table has the same size at every processor and that keys and
values are both integers. The output of your program should be another
(partitioned/distributed) table in which each key from the input appears
exactly once, and the value associated with that key is the sum of all
values associated with it in the input (note that a key may appear in
multiple input tables, so the pairs contributing to a given output key may
be distributed across processors).
The way to implement this is to first do a local reduce operation. This can be done
either using a sort on the keys followed by reduce, or using a hash. You are free to
choose your implementation. After this, you can be sure that each key appears only
once at each processor (this reduces communication in the second step). In the
second step, each key is mapped to a processor (this is done simply using a function
such as key modulo p). Make sure you first collect all the key-value pairs in
buckets and communicate them using a single group communication operation.
In the final step, each processor (now called a reducer) does a second local
reduction.
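A sketch of this three-step algorithm, assuming keys are non-negative ints
and key-value pairs are packed as pairs of MPI_INTs; the sort-based local
reduce and all names are illustrative choices, and a hash table would work
equally well:

    #include <mpi.h>
    #include <stdlib.h>

    typedef struct { int key, val; } kv_t;

    static int cmp_kv(const void *a, const void *b)
    { return ((const kv_t *)a)->key - ((const kv_t *)b)->key; }

    /* Sort-based local reduce: combine duplicate keys, return new count. */
    static int local_reduce(kv_t *t, int n)
    {
        qsort(t, n, sizeof *t, cmp_kv);
        int m = 0;
        for (int i = 0; i < n; i++) {
            if (m > 0 && t[m - 1].key == t[i].key)
                t[m - 1].val += t[i].val;       /* duplicate: accumulate */
            else
                t[m++] = t[i];
        }
        return m;
    }

    /* Local reduce, bucket by key % p, one MPI_Alltoallv, final reduce. */
    int mpi_reduce_table(kv_t *t, int n, kv_t **out, MPI_Comm comm)
    {
        int p;
        MPI_Comm_size(comm, &p);

        n = local_reduce(t, n);                 /* each key once locally */

        int *scnt = calloc(p, sizeof *scnt), *rcnt = malloc(p * sizeof *rcnt);
        int *sdsp = malloc(p * sizeof *sdsp),  *rdsp = malloc(p * sizeof *rdsp);
        for (int i = 0; i < n; i++)
            scnt[t[i].key % p] += 2;            /* 2 ints per pair */

        /* Repack the pairs into contiguous per-destination buckets. */
        sdsp[0] = 0;
        for (int q = 1; q < p; q++) sdsp[q] = sdsp[q - 1] + scnt[q - 1];
        int *sbuf = malloc(2 * n * sizeof *sbuf), *fill = malloc(p * sizeof *fill);
        for (int q = 0; q < p; q++) fill[q] = sdsp[q];
        for (int i = 0; i < n; i++) {
            int q = t[i].key % p;
            sbuf[fill[q]++] = t[i].key;
            sbuf[fill[q]++] = t[i].val;
        }

        MPI_Alltoall(scnt, 1, MPI_INT, rcnt, 1, MPI_INT, comm); /* sizes */
        rdsp[0] = 0;
        for (int q = 1; q < p; q++) rdsp[q] = rdsp[q - 1] + rcnt[q - 1];
        int total = rdsp[p - 1] + rcnt[p - 1];
        int *rbuf = malloc(total * sizeof *rbuf);
        MPI_Alltoallv(sbuf, scnt, sdsp, MPI_INT,
                      rbuf, rcnt, rdsp, MPI_INT, comm);

        kv_t *r = malloc((total / 2) * sizeof *r);
        for (int i = 0; i < total / 2; i++)
            r[i] = (kv_t){ rbuf[2 * i], rbuf[2 * i + 1] };
        int m = local_reduce(r, total / 2);     /* reducer-side reduce */

        free(scnt); free(rcnt); free(sdsp); free(rdsp);
        free(sbuf); free(fill); free(rbuf);
        *out = r;
        return m;
    }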
Implement this algorithm in MPI and time your program for the weak scaling
case (table size per processor stays constant as you increase the number of
processors).
-
Assignment 6: (Due April 15).
This homework requires you to code sparse single-source shortest paths.
Use one of the graphs from your first assignment. The user is prompted to
enter a source (an id for a node). The program is then required to compute
the single-source shortest paths from that node. The graphs must always be
treated as sparse -- i.e., do not use full adjacency matrices or any other
quadratic-complexity data structures.
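For example, a Bellman-Ford-style relaxation over CSR adjacency uses only
O(n + m) memory and satisfies this constraint (Dijkstra with a heap is
another option); a minimal serial sketch with illustrative names:

    #include <limits.h>

    /* Bellman-Ford-style relaxation over CSR adjacency (neighbors of u are
     * adj[xadj[u] .. xadj[u+1]-1], with edge weights wt[]; use wt[k] = 1
     * for unweighted web graphs).  Memory is O(n + m); no adjacency matrix. */
    void sssp(int n, const int *xadj, const int *adj, const int *wt,
              int src, int *dist)
    {
        for (int i = 0; i < n; i++)
            dist[i] = INT_MAX;
        dist[src] = 0;

        int changed = 1;
        while (changed) {                  /* terminates after <= n-1 sweeps */
            changed = 0;
            for (int u = 0; u < n; u++) {
                if (dist[u] == INT_MAX) continue;
                for (int k = xadj[u]; k < xadj[u + 1]; k++) {
                    int v = adj[k];
                    if (dist[u] + wt[k] < dist[v]) {
                        dist[v] = dist[u] + wt[k];
                        changed = 1;
                    }
                }
            }
        }
    }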
The grading scale is as follows:
30% of the points for programs that execute correctly and 70% of the
points for performance. All programs will be run on the mc (mc01-18)
cluster in the department (this is the cluster you should use to benchmark
your programs as well) on up to 16 cores. If you get a speedup of more
than 12, you get all 70%; a speedup between 8 and 12 gets 50%; between
4 and 8, 30%; and between 1 and 4, 15%. No points for a speedup under
1 (!). These speedups are measured on the largest problem instance in the
Stanford web repository that your program can execute on.