Fast matrix computations for pair-wise and column-wise Katz scores and commute times

Francesco Bonchi, Pooya Esfandiar, David F. Gleich, Chen Greif, and Laks V. S. Lakshmanan

J. Internet Mathematics, Accepted.

These codes are research prototypes and may not work for you. No promises, but do contact us if you run into problems. We will be reasonably helpful.

Download

fast-katz-20110405.tar.gz

(www.cs.purdue.edu) : last updated 2011-04-05 * fast-katz-commute-data.tar.xz commute-time data (www.cs.purdue.edu) : last updated 2011-04-xx {: .nobullets}

Overview

The package is organized by directory

matlab: The main algorithms
data: data files
experiments: the codes to generate the figures
tools: some extra codes we used

Usage

Here is a usage synopsis. Please read the codes for more information or send questions.

cd('../matlab');
load('../data/arxiv_lcc.mat');
% compute Katz and pairwise approximation using alg 1 
% with a=1e-4, tol=1e-4, and b=||A||_1
[Kij bounds time] = katz_pairwise(A,i,j,1e-4,alpha);
[Cij bounds time] = commute_pairwise(A,i,j,1e-4,alpha);
% compute a sparse approximation of the top-k results for Katz
x = katz_topk(A,alpha,i,1e-4);

Experiments

|Experiment|Description|Figure| |:------------------|:------------------------------------|:------------------| |katz_nverts/katz_localization.m | Compute localization data for exact Katz scores | Tab.2 | |katz_nverts/katz_localization_study.m | Write out table 2 from localization data | Tab.2 | |pairwise/commute_pairwise_graphs.m | Compute data for Commute pairwise experiment and plot figs | Fig.1 | |pairwise/katz_pairwise_graphs.m | Compute data for Katz pairwise experiment and plot figs | Fig.2 | |pairwise/commute_pairwise_matvecs_experiments.m | Generate matrix-vector performance data | Fig.3 | |pairwise/katz_pairwise_matvecs_experiments.m | Generate matrix-vector performance data | Fig.3 | |pairwise/matvec_plots.m | Generate the pairwise matvec performance figures | Fig.3 | |commute_topk/commute_topk_experiment.m | Generate figure 4 based on our current codes and precomputed commute time data | Fig.4 | |katz_topk/katz_topk.m | Compute data for Katz top-k experiment and plot figs | Fig.5 | |pairwise/runtime_tables.m | Read data from matvecs experiment and produce runtime tables | Tab.3 | |katz_topk/katz_topk_runtime_all.m | Compute and output runtime data on column-wise Katz scores | Tab.4 |

Data

DBLP coauthor graph

We extracted the DBLP coauthors graph from a recent snapshot (2005-2008) of the DBLP database. We considered only nodes (authors) that have at least three publications in the snapshot. There is an undirected edge between two authors if they have coauthored a paper. From the resulting set of nodes, we randomly chose a sample of 100,000 nodes, extracted the largest connected component, and discarded any weights on the edges.

arXiv coauthor graph

This dataset contains another coauthorship graph extracted by a snapshot (1990-2000) of arXiv, which is an e-print service owned, operated and funded by Cornell University, and which contains bibliographies in many fields including computer science and physics. This graph is much denser than DBLP. Again, we extracted the largest connected component of this graph and only work with that subset.

Flickr contacts

Flickr is a popular online-community for sharing photos, with millions of users. The first graph we construct is representative of its social network, in which the node set V represents users, and the edge set E is such that (u, v) is in E if and only if a user u has added user v as his/her contact. We start with a crawl extracted from Flickr in May 2006. This crawl began with a single user and continued until the total personalized PageRank on the set of uncrawled nodes was less than 0.0001. The result of the crawl was a graph with 820,878 nodes and 9,837,214 edges. In order to create a sub-graph suitable for our experimentation we performed the following steps. First, we created a graph from Flickr by taking all the contact relationships that were reciprocal, and second, we again took the largest connected component.