# Preprints

Manuscripts in preparation or under review

Take ACTION to identify high-resolution cell types and associated
transcriptional pathways.
*Journal paper*.
Shahin Mohammadi, Vikram Ravindra, David Gleich, and Ananth Grama.
*bioRxiv*, Bioinformatics:081273, 2016.
[ bib |
DOI ]

A randomized algorithm for enumerating zonotope vertices.
*Preprint on arXiv*.
Kerrek Stinson, David F. Gleich, and Paul G. Constantine.
*arXiv*, math.NA:1602.06620, 2016.
[ bib |
http ]

Multi-way Monte Carlo method for linear systems.
*Preprint on arXiv*.
Tao Wu and David F. Gleich.
*arXiv*, cs.NA:1608.04361, 2016.
[ bib |
software |
http ]

Computing active subspaces.
*Preprint on arXiv*.
Paul G. Constantine and David F. Gleich.
*arXiv*, math.NA:1408.0545, 2014.
[ bib |
http ]

# Scholarly publications

AptRank: an adaptive PageRank model for protein function prediction on
bi-relational graphs.
*Journal paper*.
Biaobin Jiang, Kyle Kloster, David F. Gleich, and Michael Gribskov.
*Bioinformatics*, 33(12):1829–1836, June 2017.
[ bib |
DOI |
local |
software ]

The spacey random walk: a stochastic process for higher-order data.
*Journal paper*.
Austin Benson, David F. Gleich, and Lek-Heng Lim.
*SIAM Review*, 59(2):321–345, May 2017.
[ bib |
DOI |
software |
http ]

An optimization approach to locally-biased graph algorithms.
*Journal paper*.
Kimon Fountoulakis, David F. Gleich, and Michael W. Mahoney.
*Proceedings of the IEEE*, 105(2):256–272, February 2017.
[ bib |
DOI |
local ]

Locally-biased graph algorithms are algorithms that attempt to find local or small-scale structure in a large data graph. In some cases, this can be accomplished by adding some sort of locality constraint and calling a traditional graph algorithm; but more interesting are locally-biased graph algorithms that compute answers by running a procedure that does not even look at most of the input graph. This corresponds more closely to what practitioners from various data science domains do, but it does not correspond well with the way that algorithmic and statistical theory is typically formulated. Recent work from several research communities has focused on developing locally-biased graph algorithms that come with strong complementary algorithmic and statistical theory and that are useful in practice in downstream data science applications. We provide a review and overview of this work, highlighting commonalities between seemingly different approaches, and highlighting promising directions for future work.

Revisiting power-law distributions in spectra of real world networks.
*Conference proceedings*.
Nicole Eikmeier and David F. Gleich.
In *Proceedings of the 23rd ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining*, KDD '17, pages 817–826, New York,
NY, USA, 2017. ACM.
[ bib |
DOI |
local |
software |
http ]

Localization in seeded PageRank.
*Journal paper*.
David F. Gleich, Kyle Kloster, and Huda Nassar.
*Internet Mathematics*, page Online, 2017.
[ bib |
DOI |
local |
software ]

Distributed fault tolerant linear system solvers based on erasure coding.
*Conference proceedings*.
Xuejiao Kang, David F. Gleich, Ahmed Sameh, and Ananth Grama.
In *2017 IEEE 37th International Conference on Distributed
Computing Systems (ICDCS)*, pages 2478–2485, 2017.
[ bib |
DOI |
local ]

We present efficient coding schemes and distributed implementations of erasure coded linear system solvers. Erasure coded computations belong to the class of algorithmic fault tolerance schemes. They are based on augmenting an input dataset, executing the algorithm on the augmented dataset, and in the event of a fault, recovering the solution from the corresponding augmented solution. This process can be viewed as the computational analog of erasure coded storage schemes. The proposed technique has a number of important benefits: (i) as the hardware platform scales in size and number of faults, our scheme yields increasing improvement in resource utilization, compared to traditional schemes; (ii) the proposed scheme is easy to code - the core algorithms remain the same; and (iii) the general scheme is flexible - accommodating a range of computation and communication tradeoffs. We present new coding schemes for augmenting the input matrix that satisfy the recovery equations of erasure coding with high probability in the event of random failures. These coding schemes also minimize fill (non-zero elements introduced by the coding block), while being amenable to efficient partitioning across processing nodes. We demonstrate experimentally that our scheme adds minimal overhead for fault tolerance, yields excellent parallel efficiency and scalability, and is robust to different fault arrival models.

Multimodal network alignment.
*Conference proceedings*.
Huda Nassar and David F. Gleich.
In *Proceedings of the 2017 SIAM International Conference on Data
Mining*, pages 615–623, 2017.
[ bib |
DOI |
local |
software ]

Correlation clustering with low-rank matrices.
*Conference proceedings*.
Nate Veldt, Anthony I. Wirth, and David F. Gleich.
In *Proceedings of the 26th International Conference on World
Wide Web*, WWW '17, pages 1025–1034, 2017.
[ bib |
DOI |
local |
software |
http ]

Retrospective higher-order Markov processes for user trails.
*Conference proceedings*.
Tao Wu and David F. Gleich.
In *Proceedings of the 23rd ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining*, KDD '17, pages 1185–1194, New York,
NY, USA, 2017. ACM.
[ bib |
DOI |
local |
software ]

Local higher-order graph clustering.
*Conference proceedings*.
Hao Yin, Austin R. Benson, Jure Leskovec, and David F. Gleich.
In *Proceedings of the 23rd ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining*, KDD '17, pages 555–564, New York,
NY, USA, 2017. ACM.
[ bib |
DOI |
local |
software ]

Erasure coding for fault oblivious linear system solvers.
*Journal paper*.
Yao Zhu, Ananth Grama, and David F. Gleich.
*SIAM Journal on Scientific Computing*, 39(1):C48–C64, 2017.
[ bib |
DOI |
local |
software |
http ]

A parallel min-cut algorithm using iteratively reweighted least squares.
*Journal paper*.
Yao Zhu and David F. Gleich.
*Parallel Computing*, 59:43–59, November 2016.
[ bib |
DOI |
local ]

Triangular alignment (TAME): A tensor-based approach for higher-order network
alignment.
*Journal paper*.
Shahin Mohammadi, David F. Gleich, Tamara G. Kolda, and Ananth Grama.
*Transactions on Computational Biology and Bioinformatics*,
Online:1–14, July 2016.
[ bib |
DOI |
local |
software ]

Overlapping community detection using neighborhood-inflated seed expansion.
*Journal paper*.
Joyce Jiyoung Whang, David F. Gleich, and Inderjit S. Dhillon.
*Transactions on Knowledge and Data Engineering*,
28(5):1272–1284, May 2016.
[ bib |
DOI |
local |
http ]

Community detection is an important task in network analysis. A community (also referred to as a cluster) is a set of cohesive vertices that have more connections inside the set than outside. In many social and information networks, these communities naturally overlap. For instance, in a social network, each vertex in a graph corresponds to an individual who usually participates in multiple communities. In this paper, we propose an efficient overlapping community detection algorithm using a seed expansion approach. The key idea of our algorithm is to find good seeds, and then greedily expand these seeds based on a community metric. Within this seed expansion method, we investigate the problem of how to determine good seed nodes in a graph. In particular, we develop new seeding strategies for a personalized PageRank clustering scheme that optimizes the conductance community score. An important step in our method is the neighborhood inflation step where seeds are modified to represent their entire vertex neighborhood. Experimental results show that our seed expansion algorithm outperforms other state-of-the-art overlapping community detection methods in terms of producing cohesive clusters and identifying ground-truth communities. We also show that our new seeding strategies are better than existing strategies, and are thus effective in finding good overlapping communities in real-world networks.
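
The seed expansion idea in this abstract can be illustrated with a small sketch. It assumes an undirected graph stored as a dict of neighbor lists and a precomputed score vector (for example, a personalized PageRank vector); the function names `inflate_seed`, `conductance`, and `sweep_cut` are hypothetical and are not taken from the paper's software.

```python
def inflate_seed(adj, seed):
    """Neighborhood inflation: replace a seed by the seed plus its neighbors."""
    return {seed} | set(adj[seed])


def conductance(adj, nodes):
    """cut(S) / min(vol(S), vol(complement)) for an undirected graph."""
    S = set(nodes)
    vol_S = sum(len(adj[u]) for u in S)
    vol_total = sum(len(nbrs) for nbrs in adj.values())
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    denom = min(vol_S, vol_total - vol_S)
    return cut / denom if denom > 0 else 1.0


def sweep_cut(adj, scores):
    """Scan prefixes of the degree-normalized ranking; keep the best conductance."""
    order = sorted(scores, key=lambda u: scores[u] / len(adj[u]), reverse=True)
    best_set, best_phi = set(), float("inf")
    for i in range(1, len(order) + 1):
        phi = conductance(adj, order[:i])
        if phi < best_phi:
            best_set, best_phi = set(order[:i]), phi
    return best_set, best_phi
```

Recomputing the conductance for every prefix keeps the sketch short; a practical implementation would update the cut and volume incrementally.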

Higher-order organization of complex networks.
*Journal paper*.
Austin Benson, David F. Gleich, and Jure Leskovec.
*Science*, 353(6295):163–166, 2016.
[ bib |
DOI |
local |
software ]

Mining and modeling character networks.
*Conference proceedings*.
Anthony Bonato, David Ryan D'Angelo, Ethan R. Elenberg, David F.
Gleich, and Yangyang Hou.
In Anthony Bonato, Fan Chung Graham, and Pawel Pralat, editors,
*International Workshop on Algorithms and Models for the Web-Graph*, WAW,
pages 100–114. Springer International Publishing, 2016.
[ bib |
DOI |
local ]

Mining large graphs.
David F. Gleich and Michael W. Mahoney.
In Peter Bühlmann, Petros Drineas, Michael Kane, and Mark van de
Laan, editors, *Handbook of Big Data*, Handbooks of modern statistical
methods, pages 191–220. CRC Press, 2016.
[ bib |
DOI |
local ]

Fast multiplier methods to optimize non-exhaustive, overlapping clustering.
*Conference proceedings*.
Yangyang Hou, Joyce Jiyoung Whang, David F. Gleich, and Inderjit
Dhillon.
In *SIAM Data Mining*, 2016.
Accepted.
[ bib |
http ]

Seeded PageRank solution paths.
*Journal paper*.
Kyle Kloster and David F. Gleich.
*European Journal of Applied Mathematics*, 27(6):812–845, 2016.
[ bib |
DOI |
local |
software ]

We study the behaviour of network diffusions based on the PageRank random walk from a set of seed nodes. These diffusions are known to reveal small, localized clusters (or communities), and also large macro-scale clusters by varying a parameter that has a dual-interpretation as an accuracy bound and as a regularization level. We propose a new method that quickly approximates the result of the diffusion for all values of this parameter. Our method efficiently generates an approximate solution path or regularization path associated with a PageRank diffusion, and it reveals cluster structures at multiple size-scales between small and large. We formally prove a runtime bound on this method that is independent of the size of the network, and we investigate multiple optimizations to our method that can be more practical in some settings. We demonstrate that these methods identify refined clustering structure on a number of real-world networks with up to 2 billion edges.

Massive graph processing on nanocomputers.
*Conference proceedings*.
Bryan P. Rainey and David F. Gleich.
In *IEEE International Conference on Big Data*, pages 3326–3335,
2016.
Third Workshop on High Performance Big Graph Data Management,
Analysis, and Mining.
[ bib |
DOI |
local |
software ]

Deconvolving feedback loops in recommender systems.
*Conference proceedings*.
Ayan Sinha, David F. Gleich, and Karthik Ramani.
In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett,
editors, *Neural Information Processing Systems (NIPS)*, pages
3243–3251. Curran Associates, Inc., 2016.
[ bib |
local |
software |
http ]

A simple and strongly-local flow-based method for cut improvement.
*Conference proceedings*.
Luke N. Veldt, David F. Gleich, and Michael W. Mahoney.
In *International Conference on Machine Learning*, pages
1938–1947, 2016.
[ bib |
local |
software |
.html ]

General tensor spectral co-clustering for higher-order data.
*Conference proceedings*.
Tao Wu, Austin Benson, and David F. Gleich.
In *Advances in Neural Information Processing Systems 29*, pages
2559–2567, 2016.
http://arxiv.org/abs/1603.00395.
[ bib |
local |
software |
http ]

PageRank beyond the web.
*Journal paper*.
David F. Gleich.
*SIAM Review*, 57(3):321–363, August 2015.
[ bib |
DOI |
local ]

Google's PageRank method was developed to evaluate the importance of web-pages via their link structure. The mathematics of PageRank, however, are entirely general and apply to any graph or network in any domain. Thus, PageRank is now regularly used in bibliometrics, social and information network analysis, and for link prediction and recommendation. It's even used for systems analysis of road networks, as well as biology, chemistry, neuroscience, and physics. We'll see the mathematics and ideas that unite these diverse applications.
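
As a minimal illustration of the mathematics this abstract refers to, the sketch below runs a power iteration for PageRank on any directed graph. The function name `pagerank`, the dict-of-successors input, and the uniform handling of dangling nodes are assumptions made for this example, not details from the paper.

```python
def pagerank(out_links, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power iteration for PageRank on a dict {node: [successors]}."""
    nodes = list(set(out_links) | {v for succ in out_links.values() for v in succ})
    n = len(nodes)
    x = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        # Mass sitting on dangling nodes (no out-links) is spread uniformly.
        dangling = sum(x[u] for u in nodes if not out_links.get(u))
        base = (1.0 - alpha) / n + alpha * dangling / n
        nxt = {u: base for u in nodes}
        for u in nodes:
            succ = out_links.get(u, [])
            for v in succ:
                nxt[v] += alpha * x[u] / len(succ)
        delta = sum(abs(nxt[u] - x[u]) for u in nodes)
        x = nxt
        if delta < tol:
            break
    return x
```

Nothing in the routine is specific to web pages: the nodes can be papers, proteins, or road intersections, which is the point the abstract makes.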

Tensor spectral clustering for partitioning higher-order network structures.
*Conference proceedings*.
Austin R. Benson, David F. Gleich, and Jure Leskovec.
In *Proceedings of the 2015 SIAM International Conference on Data
Mining*, pages 118–126, 2015.
[ bib |
DOI |
local ]

Spectral graph theory-based methods represent an important class of tools for studying the structure of networks. Spectral methods are based on a first-order Markov chain derived from a random walk on the graph and thus they cannot take advantage of important higher-order network substructures such as triangles, cycles, and feed-forward loops. Here we propose a Tensor Spectral Clustering (TSC) algorithm that allows for modeling higher-order network structures in a graph partitioning framework. Our TSC algorithm allows the user to specify which higher-order network structures (cycles, feed-forward loops, etc.) should be preserved by the network clustering. Higher-order network structures of interest are represented using a tensor, which we then partition by developing a multilinear spectral method. Our framework can be applied to discovering layered flows in networks as well as graph anomaly detection, which we illustrate on synthetic networks. In directed networks, a higher-order structure of particular interest is the directed 3-cycle, which captures feedback loops in networks. We demonstrate that our TSC algorithm produces large partitions that cut fewer directed 3-cycles than standard spectral clustering algorithms.

Sublinear column-wise actions of the matrix exponential on social networks.
*Journal paper*.
David F. Gleich and Kyle Kloster.
*Internet Mathematics*, 11(4–5):352–384, 2015.
[ bib |
DOI |
local ]

We consider stochastic transition matrices from large social and information networks. For these matrices, we describe and evaluate three fast methods to estimate one column of the matrix exponential. The methods are designed to exploit the properties inherent in social networks, such as a power-law degree distribution. Using only this property, we prove that one of our three algorithms has a sublinear runtime. We present further experimental evidence showing that all three of them run quickly on social networks with billions of edges, and they accurately identify the largest elements of the column.
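
For orientation, the quantity being estimated is one column of exp(P), where P is the column-stochastic random-walk matrix. The sketch below is a plain truncated Taylor series on sparse dictionaries. It is not one of the three methods from the paper, and the name `expm_column` and the `terms` parameter are hypothetical. It assumes an undirected graph in which every reachable node has at least one neighbor.

```python
def expm_column(adj, c, terms=20):
    """Approximate exp(P) e_c with sum_{k=0}^{terms} P^k e_c / k!."""
    term = {c: 1.0}    # current term P^k e_c / k!, stored sparsely
    total = {c: 1.0}   # running sum of the series
    for k in range(1, terms + 1):
        nxt = {}
        for u, mass in term.items():
            # One random-walk step from u, combined with the 1/k factor.
            share = mass / (k * len(adj[u]))
            for v in adj[u]:
                nxt[v] = nxt.get(v, 0.0) + share
        term = nxt
        for v, mass in term.items():
            total[v] = total.get(v, 0.0) + mass
    return total
```

Because each P^k e_c sums to 1, the entries of the result sum to roughly e, which gives a quick sanity check.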

Multilinear PageRank.
*Journal paper*.
David F. Gleich, Lek-Heng Lim, and Yongyang Yu.
*SIAM Journal on Matrix Analysis and Applications*,
36(4):1507–1541, 2015.
[ bib |
DOI |
local |
http ]

In this paper, we first extend the celebrated PageRank modification to a higher-order Markov chain. Although this system has attractive theoretical properties, it is computationally intractable for many interesting problems. We next study a computationally tractable approximation to the higher-order PageRank vector that involves a system of polynomial equations called multilinear PageRank. This is motivated by a novel “spacey random surfer” model, where the surfer remembers bits and pieces of history and is influenced by this information. The underlying stochastic process is an instance of a vertex-reinforced random walk. We develop convergence theory for a simple fixed-point method, a shifted fixed-point method, and a Newton iteration in a particular parameter regime. In marked contrast to the case of the PageRank vector of a Markov chain where the solution is always unique and easy to compute, there are parameter regimes of multilinear PageRank where solutions are not unique and simple algorithms do not converge. We provide a repository of these nonconvergent cases that we encountered through exhaustive enumeration and random sampling that we believe is useful for future study of the problem.
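
The simple fixed-point method mentioned above can be sketched for a third-order tensor. The sketch assumes the multilinear PageRank equation takes the form x_i = alpha * sum_{j,k} P[i][j][k] x_j x_k + (1 - alpha) v_i, with P stochastic in its first index. The function name and the default `alpha=0.4` are assumptions, chosen so that 2 * alpha < 1 and the iteration is a contraction in the 1-norm.

```python
def multilinear_pagerank(P, v, alpha=0.4, tol=1e-12, max_iter=1000):
    """Fixed-point iteration for x = alpha * P(x, x) + (1 - alpha) * v."""
    n = len(v)
    x = list(v)
    for _ in range(max_iter):
        nxt = [
            alpha * sum(P[i][j][k] * x[j] * x[k]
                        for j in range(n) for k in range(n))
            + (1.0 - alpha) * v[i]
            for i in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(nxt, x))
        x = nxt
        if delta < tol:
            break
    return x
```

For larger values of alpha this loop may fail to converge, which matches the abstract's warning about parameter regimes where simple algorithms break down.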

Using local spectral methods to robustify graph-based learning algorithms.
*Conference proceedings*.
David F. Gleich and Michael W. Mahoney.
In *Proceedings of the 21th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining*, KDD '15, pages 359–368, New York,
NY, USA, 2015. ACM.
[ bib |
DOI |
local ]

Graph-based learning methods have a variety of names including semi-supervised and transductive learning. They typically use a diffusion to propagate labels from a small set of nodes with known class labels to the remaining nodes of the graph. While popular, these algorithms, when implemented in a straightforward fashion, are extremely sensitive to the details of the graph construction. Here, we provide four procedures to help make them more robust: recognizing implicit regularization in the diffusion, using a scalable push method to evaluate the diffusion, using rank-based rounding, and densifying the graph through a matrix polynomial. We study robustness with respect to the details of graph constructions, errors in node labeling, degree variability, and a variety of other real-world heterogeneities, studying these methods through a precise relationship with mincut problems. For instance, the densification strategy explicitly adds new weighted edges to a sparse graph. We find that this simple densification creates a graph where multiple diffusion methods are robust to several types of errors. This is demonstrated by a study with predicting product categories from an Amazon co-purchasing network.

Non-exhaustive, overlapping clustering via low-rank semidefinite programming.
*Conference proceedings*.
Yangyang Hou, Joyce Jiyoung Whang, David F. Gleich, and Inderjit S.
Dhillon.
In *Proceedings of the 21th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining*, KDD '15, pages 427–436, New York,
NY, USA, 2015. ACM.
[ bib |
DOI |
local ]

Differential flux balance analysis of quantitative proteomic data on protein
interaction networks.
*Conference proceedings*.
Biaobin Jiang, David F. Gleich, and Michael Gribskov.
In *Symposium on Signal Processing and Mathematical Modeling of
Biological Processes with Applications to Cyber-Physical Systems for Precise
Medicine*, GlobalSIP, pages 977–981. IEEE, 2015.
[ bib |
DOI |
local ]

Strong localization in personalized PageRank.
*Conference proceedings*.
Huda Nassar, Kyle Kloster, and David F. Gleich.
In *Proceedings of the 2015 Workshop on Algorithms for the
Webgraph*, number 9479 in LNCS, pages 190–202, 2015.
[ bib |
DOI |
local |
software |
http ]

The personalized PageRank diffusion is a fundamental tool in network analysis tasks like community detection and link prediction. It models the spread of a quantity from a set of seed nodes, and it has been observed to stay localized near this seed set. We derive an upper-bound on the number of entries necessary to approximate a personalized PageRank vector in graphs with skewed degree sequences. This bound shows localization under mild assumptions on the maximum and minimum degrees. Experimental results on random graphs with these degree sequences show the bound is loose and support a conjectured bound.
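
Localization can be seen directly in a push-style approximation, which only touches nodes near the seed. The sketch below is one common variant of the push procedure, written under the assumption of an undirected graph and the linear system (I - alpha P) x = (1 - alpha) e_seed. The name `ppr_push` and the tolerance rule `r[u] < eps * deg(u)` are choices made for this example and are not taken from the paper.

```python
from collections import deque


def ppr_push(adj, seed, alpha=0.85, eps=1e-4):
    """Approximate personalized PageRank; returns (solution x, residual r)."""
    x, r = {}, {seed: 1.0 - alpha}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        rho = r.get(u, 0.0)
        deg = len(adj[u])
        if rho < eps * deg:
            continue
        # Move the residual at u into the solution, then spread alpha * rho
        # evenly over the neighbors of u.
        x[u] = x.get(u, 0.0) + rho
        r[u] = 0.0
        for v in adj[u]:
            before = r.get(v, 0.0)
            r[v] = before + alpha * rho / deg
            # Enqueue v only when its residual first crosses the threshold.
            if before < eps * len(adj[v]) <= r[v]:
                queue.append(v)
    return x, r
```

The number of nonzero entries in `x` is the quantity a localization bound controls: on graphs with skewed degree sequences it can stay far below the number of nodes.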

Parallel maximum clique algorithms with applications to network analysis.
*Journal paper*.
Ryan A. Rossi, David F. Gleich, and Assefaw H. Gebremedhin.
*SIAM Journal on Scientific Computing*, 37(5):C589–C616, 2015.
[ bib |
DOI |
local ]

We present a fast, parallel maximum clique algorithm for large sparse graphs that is designed to exploit characteristics of social and information networks. The method exhibits a roughly linear runtime scaling over real-world networks ranging from a thousand to a hundred million nodes. In a test on a social network with 1.8 billion edges, the algorithm finds the largest clique in about 20 minutes. At its heart the algorithm employs a branch-and-bound strategy with novel and aggressive pruning techniques. The pruning techniques include the combined use of core numbers of vertices along with a good initial heuristic solution to remove the vast majority of the search space. In addition, the exploration of the search tree is parallelized. During the search, processes immediately communicate changes to upper and lower bounds on the size of the maximum clique. This exchange of information occasionally results in a superlinear speedup because tasks with large search spaces can be pruned by other processes. We demonstrate the impact of the algorithm on applications using two different network analysis problems: computation of temporal strong components in dynamic networks and determination of compression-friendly ordering of nodes of massive networks.
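
The core-number pruning described above rests on a simple fact: every vertex of a clique of size s has core number at least s - 1. The sketch below computes core numbers by repeated minimum-degree peeling and drops vertices that cannot belong to a clique larger than a known lower bound. The function names are hypothetical, and the quadratic-time peeling loop is a simplification for readability, not the paper's implementation.

```python
def core_numbers(adj):
    """Core number of each vertex via minimum-degree peeling."""
    deg = {u: len(nbrs) for u, nbrs in adj.items()}
    core, remaining, k = {}, set(adj), 0
    while remaining:
        u = min(remaining, key=deg.get)
        k = max(k, deg[u])      # core numbers never decrease during peeling
        core[u] = k
        remaining.remove(u)
        for v in adj[u]:
            if v in remaining:
                deg[v] -= 1
    return core


def prune_for_clique(adj, lower_bound):
    """Keep only vertices that could lie in a clique larger than lower_bound."""
    core = core_numbers(adj)
    keep = {u for u in adj if core[u] >= lower_bound}
    return {u: [v for v in adj[u] if v in keep] for u in keep}
```

A good heuristic clique raises `lower_bound`, and the largest core number plus one caps the clique size from above, so the two bounds together shrink the search space before branch-and-bound starts.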

Non-exhaustive, overlapping k-means.
*Conference proceedings*.
Joyce Jiyoung Whang, Inderjit S. Dhillon, and David F. Gleich.
In *Proceedings of the 2015 SIAM International Conference on Data
Mining*, pages 936–944, 2015.
[ bib |
DOI |
local ]

Traditional clustering algorithms, such as k-means, output a clustering that is disjoint and exhaustive, that is, every single data point is assigned to exactly one cluster. However, in real datasets, clusters can overlap and there are often outliers that do not belong to any cluster. This is a well recognized problem that has received much attention in the past, and several algorithms, such as fuzzy k-means, have been proposed for overlapping clustering. However, most existing algorithms address either overlap or outlier detection and do not tackle the problem in a unified way. In this paper, we propose a simple and intuitive objective function that captures the issues of overlap and non-exhaustiveness in a unified manner. Our objective function can be viewed as a reformulation of the traditional k-means objective, with easy-to-understand parameters that capture the degrees of overlap and non-exhaustiveness. By studying the objective, we are able to obtain a simple iterative algorithm which we call NEO-K-Means (Non-Exhaustive Overlapping K-Means). Furthermore, by considering an extension to weighted kernel k-means, we can tackle the case of non-exhaustive and overlapping graph clustering. This extension allows us to apply our NEO-K-Means algorithm to the community detection problem, which is an important task in network analysis. Our experimental results show that the new objective and algorithm are effective in finding ground-truth clusterings that have varied overlap and non-exhaustiveness; for the case of graphs, we show that our algorithm outperforms state-of-the-art overlapping community detection methods.

Model reduction with MapReduce-enabled tall and skinny singular value
decomposition.
*Journal paper*.
Paul G. Constantine, David F. Gleich, Yangyang Hou, and Jeremy
Templeton.
*SIAM Journal on Scientific Computing*, 36(5):S166–S191, November 2014.
[ bib |
DOI |
local |
http ]

Dimensionality of social networks using motifs and eigenvalues.
*Journal paper*.
Anthony Bonato, David F. Gleich, Myunghwan Kim, Dieter Mitsche,
Pawel Pralat, Amanda Tian, and Stephen J. Young.
*PLoS ONE*, 9(9):e106052, September 2014.
[ bib |
DOI |
local ]

We consider the dimensionality of social networks, and develop experiments aimed at predicting that dimension. We find that a social network model with nodes and links sampled from an *m*-dimensional metric space with power-law distributed influence regions best fits samples from real-world networks when *m* scales logarithmically with the number of nodes of the network. This supports a logarithmic dimension hypothesis, and we provide evidence with two different social networks, Facebook and LinkedIn. Further, we employ two different methods for confirming the hypothesis: the first uses the distribution of motif counts, and the second exploits the eigenvalue distribution.

A dynamical system for PageRank with time-dependent teleportation.
*Journal paper*.
David F. Gleich and Ryan A. Rossi.
*Internet Mathematics*, 10(1–2):188–217, June 2014.
[ bib |
DOI |
local ]

We propose a dynamical system that captures changes to the network centrality of nodes as external interest in those nodes varies. We derive this system by adding time-dependent teleportation to the PageRank score. The result is not a single set of importance scores, but rather a time-dependent set. These can be converted into ranked lists in a variety of ways, for instance, by taking the largest change in the importance score. For an interesting class of dynamic teleportation functions, we derive closed-form solutions for the dynamic PageRank vector. The magnitude of the deviation from a static PageRank vector is given by a PageRank problem with complex-valued teleportation parameters. Moreover, these dynamical systems are easy to evaluate. We demonstrate the utility of dynamic teleportation on both the article graph of Wikipedia, where the external interest information is given by the number of hourly visitors to each page, and the Twitter social network, where external interest is the number of tweets per month. For these problems, we show that using information from the dynamical system helps improve a prediction task and identify trends in the data.
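
A time-stepping sketch makes the idea concrete. It assumes the dynamical system has the form x'(t) = (1 - alpha) v(t) - (I - alpha P) x(t), where v(t) is the time-dependent teleportation vector, and integrates it with forward Euler. The function name, the step size, and the requirement that every node has at least one out-link are assumptions made for this example.

```python
def dynamic_pagerank(out_links, teleport, x0, alpha=0.85, step=0.1, n_steps=100):
    """Forward Euler on x' = (1 - alpha) v(t) - (I - alpha P) x.

    out_links: {node: [successors]}, every node needs at least one successor.
    teleport:  function t -> {node: weight}, weights summing to 1.
    x0:        {node: initial score} covering every node.
    """
    x = dict(x0)
    history = [dict(x)]
    for s in range(n_steps):
        v = teleport(s * step)
        # px holds P x: each node passes its score evenly to its successors.
        px = {u: 0.0 for u in x}
        for u, succ in out_links.items():
            for w in succ:
                px[w] += x[u] / len(succ)
        x = {u: x[u] + step * ((1.0 - alpha) * v.get(u, 0.0) + alpha * px[u] - x[u])
             for u in x}
        history.append(dict(x))
    return history
```

With a constant teleportation vector the trajectory settles at the ordinary PageRank vector; a time-varying `teleport`, such as normalized hourly page views, yields the time-dependent scores the abstract describes.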

Scalable methods for nonnegative matrix factorizations of near-separable
tall-and-skinny matrices.
*Conference proceedings*.
Austin R. Benson, Jason D. Lee, Bartek Rajwa, and David F. Gleich.
In *Proceedings of Neural Information Processing Systems*, pages
945–953, 2014.
Selected for Spotlight Presentation.
[ bib |
http ]

Anti-differentiating approximation algorithms: A case study with min-cuts,
spectral, and flow.
*Conference proceedings*.
David F. Gleich and Michael W. Mahoney.
In *Proceedings of the International Conference on Machine
Learning (ICML)*, pages 1018–1025, 2014.
[ bib |
local |
http ]

Heat kernel based community detection.
*Conference proceedings*.
Kyle Kloster and David F. Gleich.
In *Proceedings of the 20th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining*, KDD '14, pages 1386–1395, New York,
NY, USA, 2014. ACM.
[ bib |
DOI |
local ]

Using triangles to improve community detection in directed networks.
*Conference proceedings*.
Christine Klymko, David F. Gleich, and Tamara G. Kolda.
In *Proceedings of the ASE BigData Conference*, Stanford, CA,
2014.
Full version on arXiv http://arxiv.org/abs/1404.5874.
[ bib |
DOI |
http ]

Fast maximum clique algorithms for large graphs.
*Conference proceedings*.
Ryan A. Rossi, David F. Gleich, Assefaw H. Gebremedhin, and Md.
Mostofa Ali Patwary.
In *Poster Proceedings of WWW2014*, pages 365–366, 2014.
[ bib |
DOI |
local ]

A nearly-sublinear method for approximating a column of the matrix exponential
for matrices from large, sparse networks.
*Conference proceedings*.
Kyle Kloster and David F. Gleich.
In Anthony Bonato, Michael Mitzenmacher, and Pawel Pralat,
editors, *Algorithms and Models for the Web Graph*, volume 8305 of
*Lecture Notes in Computer Science*, pages 68–79. Springer International
Publishing, December 2013.
[ bib |
DOI |
local ]

We consider random-walk transition matrices from large social and information networks. For these matrices, we describe and evaluate a fast method to estimate one column of the matrix exponential. Our method runs in sublinear time on networks where the maximum degree grows doubly logarithmically with respect to the number of nodes. For collaboration networks with over 5 million edges, we find it runs in less than a second on a standard desktop machine.

Direct tall-and-skinny QR factorizations in MapReduce architectures.
*Conference proceedings*.
A.R. Benson, D.F. Gleich, and J. Demmel.
In *Big Data, 2013 IEEE International Conference on*, pages
264–272, October 2013.
[ bib |
DOI |
local ]

The QR factorization and the SVD are two fundamental matrix decompositions with applications throughout scientific computing and data analysis. For matrices with many more rows than columns, so-called “tall-and-skinny matrices,” there is a numerically stable, efficient, communication-avoiding algorithm for computing the QR factorization. It has been used in traditional high performance computing and grid computing environments. For MapReduce environments, existing methods to compute the QR decomposition use a numerically unstable approach that relies on indirectly computing the Q factor. In the best case, these methods require only two passes over the data. In this paper, we describe how to compute a stable tall-and-skinny QR factorization on a MapReduce architecture in only slightly more than 2 passes over the data. We can compute the SVD with only a small change and no difference in performance. We present a performance comparison between our new direct TSQR method, indirect TSQR methods that use the communication-avoiding TSQR algorithm, and a standard unstable implementation for MapReduce (Cholesky QR). We find that our new stable method is competitive with unstable methods for matrices with a modest number of columns. This holds both in a theoretical performance model as well as in an actual implementation.

The power and Arnoldi methods in an algebra of circulants.
*Journal paper*.
David F. Gleich, Chen Greif, and James M. Varah.
*Numerical Linear Algebra with Applications*, 20:809–831,
October 2013.
[ bib |
DOI |
local ]

Circulant matrices play a central role in a recently proposed formulation of three-way data computations. In this setting, a three-way table corresponds to a matrix where each "scalar" is a vector of parameters defining a circulant. This interpretation provides many generalizations of results from matrix or vector-space algebra. We derive the power and Arnoldi methods in this algebra. In the course of our derivation, we define inner products, norms, and other notions. These extensions are straightforward in an algebraic sense, but the implications are dramatically different from the standard matrix case. For example, a matrix of circulants has a polynomial number of eigenvalues in its dimension, although these can all be represented by a carefully chosen canonical set of eigenvalues and vectors. These results and algorithms are closely related to standard decoupling techniques on block-circulant matrices using the fast Fourier transform.

Overlapping community detection using seed set expansion.
*Conference proceedings*.
Joyce Jiyoung Whang, David F. Gleich, and Inderjit S. Dhillon.
In *Proceedings of the 22nd ACM International Conference on
Information and Knowledge Management*, CIKM '13, pages
2099–2108, New York, NY, USA, October 2013. ACM.
[ bib |
DOI |
local ]

Community detection is an important task in network analysis. A community (also referred to as a cluster) is a set of cohesive vertices that have more connections inside the set than outside. In many social and information networks, these communities naturally overlap. For instance, in a social network, each vertex in a graph corresponds to an individual who usually participates in multiple communities. One of the most successful techniques for finding overlapping communities is based on local optimization and expansion of a community metric around a seed set of vertices. In this paper, we propose an efficient overlapping community detection algorithm using a seed set expansion approach. In particular, we develop new seeding strategies for a personalized PageRank scheme that optimizes the conductance community score. The key idea of our algorithm is to find good seeds, and then expand these seed sets using the personalized PageRank clustering procedure. Experimental results show that this seed set expansion approach outperforms other state-of-the-art overlapping community detection methods. We also show that our new seeding strategies are better than previous strategies, and are thus effective in finding good overlapping clusters in a graph.

Message-passing algorithms for sparse network alignment.
*Journal paper*.
Mohsen Bayati, David F. Gleich, Amin Saberi, and Ying Wang.
*ACM Trans. Knowl. Discov. Data*, 7(1):3:1–3:31, March 2013.
[ bib |
DOI |
local ]

Network alignment generalizes and unifies several approaches for forming a matching or alignment between the vertices of two graphs. We study a mathematical programming framework for the network alignment problem and a sparse variation of it where only a small number of matches between the vertices of the two graphs are possible. We propose a new message passing algorithm that allows us to compute, very efficiently, approximate solutions to the sparse network alignment problems with graph sizes as large as hundreds of thousands of vertices. We also provide extensive simulations comparing our algorithms with two of the best solvers for network alignment problems on two synthetic matching problems, two bioinformatics problems, and three large ontology alignment problems including a multilingual problem with a known labeled alignment.

Expanders, tropical semi-rings, and nuclear norms: oh my!
*Journal paper*.
David F. Gleich.
*XRDS*, 19(3):32–36, March 2013.
[ bib |
DOI |
local ]

A multicore algorithm for network alignment via approximate matching.
*Conference proceedings*.
Arif Khan, David F. Gleich, Mahantesh Halappanavar, and Alex Pothen.
In *Proceedings of the 2012 ACM/IEEE International Conference for
High Performance Computing, Networking, Storage and Analysis*, SC '12, pages
64:1–64:11, Los Alamitos, CA, USA, November 2012. IEEE Computer Society
Press.
[ bib |
local |
.pdf ]

Moment based estimation of stochastic Kronecker graph parameters.
*Journal paper*.
David F. Gleich and Art B. Owen.
*Internet Mathematics*, 8(3):232–256, August 2012.
[ bib |
DOI |
local ]

Stochastic Kronecker graphs supply a parsimonious model for large sparse real-world graphs. They can specify the distribution of a large random graph using only three or four parameters. Those parameters have, however, proved difficult to choose in specific applications. This article looks at method-of-moments estimators that are computationally much simpler than maximum likelihood. The estimators are fast, and in our examples, they typically yield Kronecker parameters with expected feature counts closer to a given graph than we get from KronFit. The improvement is especially prominent for the number of triangles in the graph.

Vertex neighborhoods, low conductance cuts, and good seeds for local community
methods.
*Conference proceedings*.
David F. Gleich and C. Seshadhri.
In *KDD2012*, pages 597–605, August 2012.
[ bib |
DOI |
local ]

Overlapping clusters for distributed computation.
*Conference proceedings*.
Reid Andersen, David F. Gleich, and Vahab Mirrokni.
In *Proceedings of the fifth ACM international conference on Web
search and data mining*, WSDM '12, pages 273–282, New York, NY, USA,
February 2012. ACM.
[ bib |
DOI |
local |
http ]

Fast matrix computations for pairwise and columnwise commute times and Katz
scores.
*Journal paper*.
Francesco Bonchi, Pooya Esfandiar, David F. Gleich, Chen Greif, and
Laks V.S. Lakshmanan.
*Internet Mathematics*, 8(1-2):73–112, 2012.
[ bib |
DOI |
local ]

We explore methods for approximating the commute time and Katz score between a pair of nodes. These methods are based on the approach of matrices, moments, and quadrature developed in the numerical linear algebra community. They rely on the Lanczos process and provide upper and lower bounds on an estimate of the pairwise scores. We also explore methods to approximate the commute times and Katz scores from a node to all other nodes in the graph. Here, our approach for the commute times is based on a variation of the conjugate gradient algorithm, and it provides an estimate of all the diagonals of the inverse of a matrix. Our technique for the Katz scores is based on exploiting an empirical localization property of the Katz matrix. We adapt algorithms used for personalized PageRank computing to these Katz scores and theoretically show that this approach is convergent. We evaluate these methods on 17 real-world graphs ranging in size from 1000 to 1,000,000 nodes. Our results show that our pairwise commute-time method and columnwise Katz algorithm both have attractive theoretical properties and empirical performance.
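
For readers unfamiliar with the Katz score, the quantity being approximated is the weighted count of walks between two nodes, i.e. the sum over k >= 1 of alpha^k (A^k)[i][j]. The sketch below computes one column of Katz scores with a plain fixed-point iteration on a dense adjacency matrix. It is a baseline for illustration only, not the Lanczos/quadrature bounds or the localized algorithm from the paper, and it assumes `alpha` is smaller than the reciprocal of the largest eigenvalue of `A` so that the series converges. The function name `katz_column` is hypothetical.

```python
def katz_column(A, j, alpha, tol=1e-14, max_iter=10000):
    """Katz scores from every node to node j for a dense adjacency matrix A.

    Iterates x <- alpha * A x + e_j, which converges to (I - alpha A)^{-1} e_j,
    then subtracts e_j to drop the length-zero walk.
    """
    n = len(A)
    e_j = [1.0 if i == j else 0.0 for i in range(n)]
    x = list(e_j)
    for _ in range(max_iter):
        x_new = [alpha * sum(a * xi for a, xi in zip(row, x)) + e
                 for row, e in zip(A, e_j)]
        converged = max(abs(a - b) for a, b in zip(x_new, x)) < tol
        x = x_new
        if converged:
            break
    return [xi - e for xi, e in zip(x, e_j)]


# Hypothetical example: the path graph 0 - 1 - 2.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
scores_to_1 = katz_column(path, 1, alpha=0.1)
```

On this path graph the number of walks of length 2m+1 from node 0 to node 1 is 2^m, so the score between them has the closed form alpha / (1 - 2 alpha^2), which gives a convenient check.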

Distinguishing signal from noise in an SVD of simulation data.
*Conference proceedings*.
Paul G. Constantine and David F. Gleich.
In *Proceedings of the IEEE Conference on Acoustics, Speech, and
Signal Processing*, pages 5333–5336, 2012.
[ bib |
DOI |
local ]

Dynamic PageRank using evolving teleportation.
*Conference proceedings*.
Ryan A. Rossi and David F. Gleich.
In Anthony Bonato and Jeannette Janssen, editors, *Algorithms and
Models for the Web Graph*, volume 7323 of *Lecture Notes in Computer
Science*, pages 126–137. Springer Berlin Heidelberg, 2012.
[ bib |
DOI |
local ]

The importance of nodes in a network constantly fluctuates based on changes in the network structure as well as changes in external interest. We propose an evolving teleportation adaptation of the PageRank method to capture how changes in external interest influence the importance of a node. This framework seamlessly generalizes PageRank because the importance of a node will converge to the PageRank values if the external influence stops changing. We demonstrate the effectiveness of the evolving teleportation on the Wikipedia graph and the Twitter social network. The external interest is given by the number of hourly visitors to each page and the number of monthly tweets for each user.

Tall and skinny QR factorizations in MapReduce architectures.
*Conference proceedings*.
Paul G. Constantine and David F. Gleich.
In *Proceedings of the second international workshop on MapReduce
and its applications*, MapReduce '11, pages 43–50, New York, NY, USA, June
2011. ACM.
[ bib |
DOI |
local ]

Overlapping clusters for distributed computation.
*Conference proceedings*.
Reid Andersen, David F. Gleich, and Vahab S Mirrokni.
In *Poster proceedings of the SIAM Workshop on Combinatorial and
Scientific Computing (CSC)*, 2011.
Poster.
[ bib |
local ]

A factorization of the spectral Galerkin system for parameterized matrix
equations: derivation and applications.
*Journal paper*.
Paul G. Constantine, David F. Gleich, and Gianluca Iaccarino.
*SIAM Journal on Scientific Computing*, 33(5):2995–3009, 2011.
[ bib |
DOI |
local ]

Recent work has explored solver strategies for the linear system of equations arising from a spectral Galerkin approximation of the solution of PDEs with parameterized (or stochastic) inputs. We consider the related problem of a matrix equation whose matrix and right-hand side depend on a set of parameters (e.g., a PDE with stochastic inputs semidiscretized in space) and examine the linear system arising from a similar Galerkin approximation of the solution. We derive a useful factorization of this system of equations, which yields bounds on the eigenvalues, clues to preconditioning, and a flexible implementation method for a wide array of problems. We complement this analysis with (i) a numerical study of preconditioners on a standard elliptic PDE test problem and (ii) a fluids application using existing CFD codes; the MATLAB codes used in the numerical studies are available online.

Rank aggregation via nuclear norm minimization.
*Conference proceedings*.
David F. Gleich and Lek-Heng Lim.
In *Proceedings of the 17th ACM SIGKDD international conference
on Knowledge discovery and data mining*, KDD '11, pages 60–68, New York, NY,
USA, 2011. ACM.
[ bib |
DOI |
local ]

The process of rank aggregation is intimately intertwined with the structure of skew-symmetric matrices. We apply recent advances in the theory and algorithms of matrix completion to skew-symmetric matrices. This combination of ideas produces a new method for ranking a set of items. The essence of our idea is that a rank aggregation describes a partially filled skew-symmetric matrix. We extend an algorithm for matrix completion to handle skew-symmetric data and use that to extract ranks for each item. Our algorithm applies to both pairwise comparison and rating data. Because it is based on matrix completion, it is robust to both noise and incomplete data. We show a formal recovery result for the noiseless case and present a detailed study of the algorithm on synthetic data and Netflix ratings.

Some computational tools for digital archive and metadata maintenance.
*Journal paper*.
David F. Gleich, Ying Wang, Xiangrui Meng, Farnaz Ronaghi, Margot
Gerritsen, and Amin Saberi.
*BIT Numerical Mathematics*, 51:127–154, 2011.
[ bib |
DOI |
local ]

Computational tools are a mainstay of current search and recommendation technology. But modern digital archives are astonishingly diverse collections of older digitized material and newer born digital content. Finding interesting material in these archives is still challenging. The material often lacks appropriate annotation—or metadata—so that people can find the most interesting material. We describe four computational tools we developed to aid in the processing and maintenance of large digital archives. The first is an improvement to a graph layout algorithm for graphs with hundreds of thousands of nodes. The second is a new algorithm for matching databases with links among the objects, also known as a network alignment problem. The third is an optimization heuristic to disambiguate a set of geographic references in a book. And the fourth is a technique to automatically generate a title from a description.

Random alpha PageRank.
*Journal paper*.
Paul G. Constantine and David F. Gleich.
*Internet Mathematics*, 6(2):189–236, September 2010.
[ bib |
DOI |
local |
http ]

We suggest a revision to the PageRank random surfer model that considers the influence of a population of random surfers on the PageRank vector. In the revised model, each member of the population has its own teleportation parameter chosen from a probability distribution, and consequently, the ranking vector is random. We propose three algorithms for computing the statistics of the random ranking vector based respectively on (i) random sampling, (ii) paths along the links of the underlying graph, and (iii) quadrature formulas. We find that the expectation of the random ranking vector produces similar rankings to its deterministic analogue, but the standard deviation gives uncorrelated information (under a Kendall-tau metric) with myriad potential uses. We examine applications of this model to web spam.
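
The random-sampling approach, item (i) above, can be sketched as a simple Monte Carlo loop: draw a teleportation parameter, solve an ordinary PageRank problem, and accumulate statistics. This is a minimal illustration under stated assumptions: `P` is a small dense column-stochastic matrix, the Beta(2, 2) distribution and the sample count are chosen purely for illustration, and the helper names are hypothetical rather than taken from the paper's software.

```python
import random
import statistics


def pagerank(P, v, alpha, tol=1e-12, max_iter=100000):
    """Power iteration for x = alpha * P x + (1 - alpha) * v, P column-stochastic."""
    x = list(v)
    for _ in range(max_iter):
        x_new = [alpha * sum(p * xj for p, xj in zip(row, x)) + (1 - alpha) * vi
                 for row, vi in zip(P, v)]
        if sum(abs(a - b) for a, b in zip(x_new, x)) < tol:
            return x_new
        x = x_new
    return x


def random_alpha_pagerank(P, v, num_samples=200, a=2.0, b=2.0, seed=0):
    """Estimate the mean and standard deviation of PageRank when alpha ~ Beta(a, b)."""
    rng = random.Random(seed)
    samples = [pagerank(P, v, rng.betavariate(a, b)) for _ in range(num_samples)]
    per_node = list(zip(*samples))  # regroup the samples node by node
    mean = [statistics.mean(vals) for vals in per_node]
    std = [statistics.stdev(vals) for vals in per_node]
    return mean, std


# Hypothetical 3-node example; every column of P sums to 1.
P = [[0.0, 0.5, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 0.5, 0.0]]
mean, std = random_alpha_pagerank(P, [1 / 3, 1 / 3, 1 / 3])
```

Each sampled vector is a probability distribution, so the estimated mean is one as well; the per-node standard deviation is the additional information the abstract refers to.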

Tracking the random surfer: empirically measured teleportation parameters in
PageRank.
*Conference proceedings*.
David F. Gleich, Paul G. Constantine, Abraham Flaxman, and Asela
Gunawardana.
In *WWW '10: Proceedings of the 19th international conference on
World wide web*, pages 381–390, April 2010.
[ bib |
DOI |
local ]

PageRank computes the importance of each node in a directed graph under a random surfer model governed by a teleportation parameter. Commonly denoted alpha, this parameter models the probability of following an edge inside the graph or, when the graph comes from a network of web pages and links, clicking a link on a web page. We empirically measure the teleportation parameter based on browser toolbar logs and a click trail analysis. For a particular user or machine, such analysis produces a value of alpha. We find that these values nicely fit a Beta distribution with mean edge-following probability between 0.3 and 0.7, depending on the site. Using these distributions, we compute PageRank scores where PageRank is computed with respect to a distribution as the teleportation parameter, rather than a constant teleportation parameter. These new metrics are evaluated on the graph of pages in Wikipedia.

An inner-outer iteration for PageRank.
*Journal paper*.
David F. Gleich, Andrew P. Gray, Chen Greif, and Tracy Lau.
*SIAM Journal on Scientific Computing*, 32(1):349–371, February
2010.
[ bib |
DOI |
local ]

We present a new iterative scheme for PageRank computation. The algorithm is applied to the linear system formulation of the problem, using inner-outer stationary iterations. It is simple, can be easily implemented and parallelized, and requires minimal storage overhead. Our convergence analysis shows that the algorithm is effective for a crude inner tolerance and is not sensitive to the choice of the parameters involved. The same idea can be used as a preconditioning technique for nonstationary schemes. Numerical examples featuring matrices of dimensions exceeding 100,000,000 in sequential and parallel environments demonstrate the merits of our technique. Our code is available online for viewing and testing, along with several large scale examples.
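
The inner-outer idea can be sketched on the linear system (I - alpha P) x = (1 - alpha) v. An outer iteration solves (I - beta P) x_new = (alpha - beta) P x_old + (1 - alpha) v for some beta < alpha, and each of those easier systems is solved only crudely by inner stationary iterations. The code below is a small dense illustration under assumptions, not the authors' released code: `P` is column-stochastic, and `beta`, the tolerances, and the function names are illustrative choices.

```python
def matvec(P, x):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(p * xj for p, xj in zip(row, x)) for row in P]


def inner_outer_pagerank(P, v, alpha=0.85, beta=0.5, tol=1e-10,
                         inner_tol=1e-2, max_iter=10000):
    """Inner-outer stationary iteration for (I - alpha P) x = (1 - alpha) v."""
    x = list(v)
    Px = matvec(P, x)
    for _outer in range(max_iter):
        # Residual of the original PageRank system at the current iterate.
        residual = sum(abs(alpha * px + (1 - alpha) * vi - xi)
                       for px, vi, xi in zip(Px, v, x))
        if residual < tol:
            break
        # Right-hand side of the outer system, frozen during the inner loop.
        f = [(alpha - beta) * px + (1 - alpha) * vi for px, vi in zip(Px, v)]
        for _inner in range(max_iter):
            x = [beta * px + fi for px, fi in zip(Px, f)]
            Px = matvec(P, x)
            inner_residual = sum(abs(beta * px + fi - xi)
                                 for px, fi, xi in zip(Px, f, x))
            if inner_residual < inner_tol:  # crude tolerance is enough
                break
    return x


# Hypothetical 3-node example; every column of P sums to 1.
P = [[0.0, 0.5, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 0.5, 0.0]]
v = [1 / 3, 1 / 3, 1 / 3]
x = inner_outer_pagerank(P, v)
```

If every inner loop stops after a single step, the update reduces to the ordinary power method, which is one way to see why a crude inner tolerance does not break convergence.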

Spectral methods for parameterized matrix equations.
*Journal paper*.
Paul G. Constantine, David F. Gleich, and Gianluca Iaccarino.
*SIAM Journal on Matrix Analysis and Applications*,
31(5):2681–2699, 2010.
[ bib |
DOI |
local ]

We apply polynomial approximation methods—known in the numerical PDEs context as spectral methods—to approximate the vector-valued function that satisfies a linear system of equations where the matrix and the right-hand side depend on a parameter. We derive both an interpolatory pseudospectral method and a residual-minimizing Galerkin method, and we show how each can be interpreted as solving a truncated infinite system of equations; the difference between the two methods lies in where the truncation occurs. Using classical theory, we derive asymptotic error estimates related to the region of analyticity of the solution, and we present a practical residual error estimate. We verify the results with two numerical examples.

Fast Katz and commuters: Efficient approximation of social relatedness over
large networks.
*Conference proceedings*.
Pooya Esfandiar, Francesco Bonchi, David F. Gleich, Chen Greif, Laks
V. S. Lakshmanan, and Byung-Won On.
In *Algorithms and Models for the Web Graph*, 2010.
[ bib |
DOI |
local ]

Motivated by social network data mining problems such as link prediction and collaborative filtering, significant research effort has been devoted to computing topological measures including the Katz score and the commute time. Existing approaches typically approximate all pairwise relationships simultaneously. In this paper, we are interested in computing: the score for a single pair of nodes, and the top-k nodes with the best scores from a given source node. For the pairwise problem, we apply an iterative algorithm that computes upper and lower bounds for the measures we seek. This algorithm exploits a relationship between the Lanczos process and a quadrature rule. For the top-k problem, we propose an algorithm that only accesses a small portion of the graph and is related to techniques used in personalized PageRank computing. To test the scalability and accuracy of our algorithms we experiment with three real-world networks and find that these algorithms run in milliseconds to seconds without any preprocessing.

Algorithms for large, sparse network alignment problems.
*Conference proceedings*.
Mohsen Bayati, Margot Gerritsen, David F. Gleich, Amin Saberi, and
Ying Wang.
In *Proceedings of the 9th IEEE International Conference on Data
Mining*, pages 705–710, December 2009.
[ bib |
DOI |
arXiv |
local ]

We propose a new distributed algorithm for sparse variants of the network alignment problem that occurs in a variety of data mining areas including systems biology, database matching, and computer vision. Our algorithm uses a belief propagation heuristic and provides near optimal solutions for an NP-hard combinatorial optimization problem. We show that our algorithm is faster and outperforms or nearly ties existing algorithms on synthetic problems, a problem in bioinformatics, and a problem in ontology matching. We also provide a unified framework for studying and comparing all network alignment solvers.

A Monte Carlo method for solving unsteady adjoint equations.
*Journal paper*.
Qiqi Wang, David F. Gleich, Amin Saberi, Nasrollah Etemadi, and
Parviz Moin.
*Journal of Computational Physics*, 227(12):6184–6205, June
2008.
[ bib |
DOI |
local ]

Traditionally, solving the adjoint equation for unsteady problems involves solving a large, structured linear system. This paper presents a variation on this technique and uses a Monte Carlo linear solver. The Monte Carlo solver yields a forward-time algorithm for solving unsteady adjoint equations. When applied to computing the adjoint associated with Burgers' equation, the Monte Carlo approach is faster for a large class of problems while preserving sufficient accuracy.

Approximating personalized PageRank with minimal use of webgraph data.
*Journal paper*.
David F. Gleich and Marzia Polito.
*Internet Mathematics*, 3(3):257–294, December 2007.
[ bib |
DOI |
local ]

In this paper, we consider the problem of calculating fast and accurate approximations to the personalized PageRank score of a webpage. We focus on techniques to improve speed by limiting the amount of web graph data we need to access. Our algorithms provide both the approximation to the personalized PageRank score as well as guidance in using only the necessary information—and therefore sensibly reduce not only the computational cost of the algorithm but also the memory and memory bandwidth requirements. We report experiments with these algorithms on web graphs of up to 118 million pages and prove a theoretical approximation bound for all. Finally, we propose a local, personalized web-search system for a future client system using our algorithms.

Using polynomial chaos to compute the influence of multiple random surfers in
the PageRank model.
*Conference proceedings*.
Paul G. Constantine and David F. Gleich.
In Anthony Bonato and Fan Chung Graham, editors, *Proceedings of
the 5th Workshop on Algorithms and Models for the Web Graph (WAW2007)*,
volume 4863 of *Lecture Notes in Computer Science*, pages 82–95.
Springer, 2007.
[ bib |
DOI |
local ]

The PageRank equation computes the importance of pages in a web graph relative to a single random surfer with a constant teleportation coefficient. To be globally relevant, the teleportation coefficient should account for the influence of all users. Therefore, we correct the PageRank formulation by modeling the teleportation coefficient as a random variable distributed according to user behavior. With this correction, the PageRank values themselves become random. We present two methods to quantify the uncertainty in the random PageRank: a Monte Carlo sampling algorithm and an algorithm based on the truncated polynomial chaos expansion of the random quantities. With each of these methods, we compute the expectation and standard deviation of the PageRanks. Our statistical analysis shows that the standard deviations of the PageRanks are uncorrelated with the PageRank vector.

Scalable computing with power-law graphs: Experience with parallel PageRank.
*Conference proceedings*.
David F. Gleich and Leonid Zhukov.
In *SuperComputing 2005*, November 2005.
Poster.
[ bib |
local |
.pdf ]

Recommender systems research at Yahoo! Research Labs.
*Conference proceedings*.
Dennis Decoste, David F. Gleich, Tejaswi Kasturi, Sathiya Keerthi,
Omid Madani, Seung-Taek Park, David M. Pennock, Corey Porter, Sumit Sanghai,
Farial Shahnaz, and Leonid Zhukov.
In *Beyond Personalization*, San Diego, CA, January 2005.
Position Statement.
[ bib |
local ]

We describe some of the ongoing projects at Yahoo! Research Labs that involve recommender systems. We discuss recommender systems related problems and solutions relevant to Yahoo!’s business.

The World of Music: SDP embedding of high dimensional data.
*Conference proceedings*.
David F. Gleich, Leonid Zhukov, Matthew Rasmussen, and Kevin Lang.
In *Information Visualization 2005*, 2005.
Interactive Poster.
[ bib |
local |
.pdf ]

In this paper we investigate the use of Semidefinite Programming (SDP) optimization for high dimensional data layout and graph visualization. We developed a set of interactive visualization tools and used them on music artist ratings data from Yahoo!. The computed layout preserves a natural grouping of the artists and provides visual assistance for browsing large music collections.

An SVD based term suggestion and ranking system.
*Conference proceedings*.
David F. Gleich and Leonid Zhukov.
In *ICDM '04: Proceedings of the Fourth IEEE International
Conference on Data Mining (ICDM'04)*, pages 391–394, Brighton, UK, November
2004. IEEE Computer Society.
[ bib |
DOI |
local ]

In this paper, we consider the application of the singular value decomposition (SVD) to a search term suggestion system in a pay-for-performance search market. We propose a novel positive and negative refinement method based on orthogonal subspace projections. We demonstrate that SVD subspace-based methods: 1) expand coverage by reordering the results, and 2) enhance the clustered structure of the data. The numerical experiments reported in this paper were performed on Overture's pay-per-performance search market data.

# Technical reports

Three results on the PageRank vector: eigenstructure, sensitivity, and the
derivative.
*Conference proceedings*.
David F. Gleich, Peter Glynn, Gene H. Golub, and Chen Greif.
In Andreas Frommer, Michael W. Mahoney, and Daniel B. Szyld, editors,
*Web Information Retrieval and Linear Algebra Algorithms*, number 07071
in Dagstuhl Seminar Proceedings. Internationales Begegnungs- und
Forschungszentrum fuer Informatik (IBFI), Schloss Dagstuhl, Germany, 2007.
[ bib |
local |
http ]

The three results on the PageRank vector are preliminary but shed light on the eigenstructure of a PageRank modified Markov chain and what happens when changing the teleportation parameter in the PageRank model. Computations with the derivative of the PageRank vector with respect to the teleportation parameter show predictive ability and identify an interesting set of pages from Wikipedia.

Hierarchical directed spectral graph partitioning, David F. Gleich. Information Networks, Stanford University, Final Project, 2005, 2006. Cited over 6 times. [ bib | local | .pdf ]

In this report, we examine the generalization of the Laplacian of a graph due to Fan Chung. We show that Fan Chung’s generalization reduces to examining one particular symmetrization of the adjacency matrix for a directed graph. From this result, the directed Cheeger bounds trivially follow. Additionally, we implement and examine the benefits of directed hierarchical spectral clustering empirically on a dataset from Wikipedia. Finally, we examine a set of competing heuristic methods on the same dataset.

The world of music: User ratings; spectral and spherical embeddings; map projections, David F. Gleich, Matthew Rasmussen, Kevin Lang, and Leonid Zhukov. Online report, 2006. [ bib | local | .pdf ]

In this paper we present an algorithm for layout and visualization of music collections based on similarities between musical artists. The core of the algorithm consists of a non-linear low dimensional embedding of a similarity graph constrained to the surface of a hyper-sphere. This approach effectively uses additional dimensions in the embedding. We derive the algorithm using a simple energy minimization procedure and show the relationships to several well known eigenvector based methods. We also describe a method for constructing a similarity graph from user ratings, as well as procedures for mapping the layout from the hyper-sphere to a 2d display. We demonstrate our techniques on Yahoo! Music user ratings data and a MusicMatch artist similarity graph.

Finite calculus: A tutorial for solving nasty sums, David F. Gleich. Combinatorics, Final Paper, Stanford University, 2004, 2005. [ bib | local ]

MTF, BIT, and COMB: A guide to deterministic and randomized online algorithms for the list access problem, Kevin Andrew and David F. Gleich. Advanced Algorithms, Harvey Mudd College, Final Project, 2004. [ bib | local ]

In this survey, we discuss two randomized online algorithms for the list access problem. First, we review competitive analysis and show that the MTF algorithm is 2-competitive using a potential function. Then, we introduce randomized competitive analysis and the associated adversary models. We show that the randomized BIT algorithm is 7/4-competitive using a potential function argument. We then introduce the pairwise property and the TIMESTAMP algorithm to show that the COMB algorithm, a COMBination of the BIT and TIMESTAMP algorithms, is 8/5-competitive. COMB is the best known randomized algorithm for the list access problem.
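
As a concrete reference point for the deterministic MTF (move-to-front) rule discussed in this survey, the sketch below serves a request sequence and totals the access cost. It assumes the standard full cost model, in which accessing the item at position i (counting from 1) costs i and moving the accessed item to the front is free; the function name `mtf_cost` is hypothetical.

```python
def mtf_cost(initial_list, requests):
    """Total access cost of move-to-front on a request sequence.

    Accessing the item at 1-based position i costs i; the accessed item is
    then moved to the front of the list at no extra charge.
    """
    items = list(initial_list)
    total = 0
    for request in requests:
        position = items.index(request)  # 0-based position of the requested item
        total += position + 1
        items.insert(0, items.pop(position))
    return total


# Hypothetical example: list [a, b, c] serving the requests c, c, b.
cost = mtf_cost(["a", "b", "c"], ["c", "c", "b"])
```

In this example the first request for `c` costs 3, the repeat costs 1 because `c` is now at the front, and `b` has been pushed back to position 3, for a total of 7.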

David F. Gleich, Leonid Zhukov, and Pavel Berkhin. Fast parallel PageRank: A linear system approach. Technical Report YRL-2004-038, Yahoo! Research Labs, 2004. [ bib | local | .pdf ]

In this paper we investigate the convergence of iterative stationary and Krylov subspace methods for the PageRank linear system, including the convergence dependency on teleportation. We demonstrate that linear system iterations converge faster than the simple power method and are less sensitive to the changes in teleportation. In order to perform this study we developed a framework for parallel PageRank computing. We describe the details of the parallel implementation and provide experimental results obtained on a 70-node Beowulf cluster.

Topic identification in soft clustering using PCA and ICA, Leonid Zhukov and David F. Gleich. Online report, Yahoo! Research Labs, 2004. [ bib | local ]

Many applications can benefit from soft clustering, where each datum is assigned to multiple clusters with membership weights that sum to one. In this paper we present a comparison of principal component analysis (PCA) and independent component analysis (ICA) when used for soft clustering. We provide a short mathematical background for these methods and demonstrate their application to a sponsored links search listings dataset. We present examples of the soft clusters generated by both methods and compare the results.

Three methods for improving relevance in web search, Erin Bodine, David F. Gleich, Cathy Kurata, Jordan Kwan, Lesley Ward, and Daniel Fain. Clinic Report, Harvey Mudd College, May 9 2003. 102 pages. Includes fully documented program code on accompanying CD. [ bib | local ]

The 2002–2003 Overture clinic project evaluated and implemented three different methods for improving relevance ordering in web search. The three methods were bottom up micro information unit (MIU) analysis, top down MIU analysis, and proximity scoring. We ran these three methods on the top 200 web pages returned for each of 58 queries by an already existing algorithmic search engine. We used two metrics, precision and relevance ordering, to evaluate the results. Precision deals with how relevant the web page is for a given query, while relevance ordering is how well-ordered the returned results are. We evaluated the precision of each method and of the algorithmic search engine by hand. For relevance ordering, we recruited other humans to compare pages and used their decisions to generate an ideal ranking for each query. The results of each of our methods and of the algorithmic search engine are then compared to this ideal ranking vector using Kendall’s Tau. Our bottom up MIU analysis method achieved the highest precision score of 0.78 out of 1.00. In addition, bottom up MIU analysis received the second highest correlation coefficient (or relevance ordering score) of 0.107 while the algorithmic search engine received the highest correlation coefficient of 0.121. Interestingly, our proximity scoring method received high relevance ordering scores when the algorithmic search engine received low relevance ordering scores.

Machine learning in computer chess: Genetic programming and KRK, David F. Gleich. Independent Study Report, Harvey Mudd College, 2003. [ bib | local ]

In this paper, I describe genetic programming as a machine learning paradigm and evaluate its results in attempting to learn basic chess rules. Genetic programming exploits a simulation of Darwinian evolution to construct programs. When applied to the King-Rook-King (KRK) chess endgame problem, genetic programming shows promising results in spite of a lack of significant chess knowledge.

# Other reviewed publications

These usually underwent some form of review

David F. Gleich and Paul G. Constantine.
Ranking web pages.
In Nicholas J. Higham, Mark R. Dennis, Paul Glendinning, Paul A.
Martin, Fadil Santosa, and Jared Tanner, editors, *The Princeton
Companion to Applied Mathematics*, pages 755–757. Princeton University
Press, Princeton, NJ, USA, 2015.
[ bib ]

Review of: Numerical algorithms for personalized search in self-organizing
information networks by Sep Kamvar, Princeton Univ. Press, 2010, 160pp.,
ISBN13: 978-0-691-14503-7.
*Journal paper*.
David F. Gleich.
*Linear Algebra and its Applications*, 435(4):908–909, 2011.
[ bib |
DOI |
local ]

# Ph.D. Theses

Just one, thankfully

David F. Gleich.
*Models and Algorithms for PageRank Sensitivity*.
PhD thesis, Stanford University, September 2009.
[ bib |
local |
.pdf ]

The PageRank model helps evaluate the relative importance of nodes in a large graph, such as the graph of links on the world wide web. An important piece of the PageRank model is the teleportation parameter α. We explore the interaction between α and PageRank through the lens of sensitivity analysis. Writing the PageRank vector as a function of α allows us to take a derivative, which is a simple sensitivity measure. As an alternative approach, we apply techniques from the field of uncertainty quantification. Regarding α as a random variable produces a new PageRank model in which each PageRank value is a random variable. We explore the standard deviation of these variables to get another measure of PageRank sensitivity. One interpretation of this new model shows that it corrects a small oversight in the original PageRank formulation. Both of the above techniques require solving multiple PageRank problems, and thus a robust PageRank solver is needed. We discuss an inner-outer iteration for this purpose. The method is low-memory, simple to implement, and has excellent performance for a range of teleportation parameters. We show empirical results with these techniques on graphs with over 2 billion edges.
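
The derivative mentioned above follows from differentiating the PageRank equation. Writing x(α) = α P x(α) + (1 - α) v and differentiating with respect to α gives (I - α P) x'(α) = P x(α) - v, so the derivative costs one additional PageRank-like solve. The sketch below illustrates this on a small dense example; it assumes a column-stochastic `P`, uses a plain fixed-point solver rather than the inner-outer iteration from the thesis, and all function names are hypothetical.

```python
def matvec(P, x):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(p * xj for p, xj in zip(row, x)) for row in P]


def solve_fixed_point(P, alpha, rhs, tol=1e-14, max_iter=100000):
    """Solve (I - alpha P) y = rhs with the iteration y <- alpha P y + rhs."""
    y = list(rhs)
    for _ in range(max_iter):
        y_new = [alpha * py + r for py, r in zip(matvec(P, y), rhs)]
        if max(abs(a - b) for a, b in zip(y_new, y)) < tol:
            return y_new
        y = y_new
    return y


def pagerank(P, v, alpha):
    """PageRank vector x(alpha) solving (I - alpha P) x = (1 - alpha) v."""
    return solve_fixed_point(P, alpha, [(1 - alpha) * vi for vi in v])


def pagerank_derivative(P, v, alpha):
    """Derivative x'(alpha) from the system (I - alpha P) x' = P x - v."""
    x = pagerank(P, v, alpha)
    rhs = [px - vi for px, vi in zip(matvec(P, x), v)]
    return solve_fixed_point(P, alpha, rhs)


# Hypothetical 3-node example; every column of P sums to 1.
P = [[0.0, 0.5, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 0.5, 0.0]]
v = [1 / 3, 1 / 3, 1 / 3]
dx = pagerank_derivative(P, v, 0.85)
```

Because every x(α) sums to one, the entries of the derivative sum to zero, and the result can be cross-checked against a central finite difference of `pagerank`.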