# Preprints

Manuscripts in preparation or under review.

Flow-based algorithms for improving clusters: A unifying framework, software,
and performance.
*Preprint on arXiv*.
K. Fountoulakis, M. Liu, D. F. Gleich, and M. W. Mahoney.
*arXiv*, cs.LG:2004.09608, 2020.
[ bib |
arXiv |
software ]

Local hypergraph clustering using capacity releasing diffusion.
*Preprint on arXiv*.
Rania Ibrahim and David F. Gleich.
*arXiv*, cs.SI:2003.04213, 2020.
[ bib |
arXiv |
software ]

Low rank methods for multiple network alignment.
*Preprint on arXiv*.
Huda Nassar, Georgios Kollias, Ananth Grama, and David F. Gleich.
*arXiv*, cs.SI:1809.08198, 2018.
[ bib |
software |
http ]

A randomized algorithm for enumerating zonotope vertices.
*Preprint on arXiv*.
Kerrek Stinson, David F. Gleich, and Paul G. Constantine.
*arXiv*, math.NA:1602.06620, 2016.
[ bib |
http ]

Computing active subspaces.
*Preprint on arXiv*.
Paul G. Constantine and David F. Gleich.
*arXiv*, math.NA:1408.0545, 2014.
[ bib |
http ]

# In press

Just waiting for details on these ones now.

Neighborhood and PageRank methods for pairwise link prediction.
*Journal paper*.
Huda Nassar, Austin R. Benson, and David F. Gleich.
*Social Network Analysis and Mining*, 10:63, 2020.
[ bib |
DOI |
arXiv |
local ]

Strongly local p-norm-cut algorithms for semi-supervised learning and local
graph clustering.
*Conference proceedings*.
Meng Liu and David F. Gleich.
In *Proceedings of NeurIPS*, 2020.
Accepted.
[ bib |
arXiv |
software ]

# Scholarly publications

Parameterized objectives and algorithms for clustering bipartite graphs and
hypergraphs.
*Conference proceedings*.
Nate Veldt, Anthony Wirth, and David F. Gleich.
In *Proceedings of KDD 2020*, 2020.
Accepted.
[ bib |
DOI |
arXiv |
software ]

A parallel projection method for metric constrained optimization.
*Conference proceedings*.
Cameron Ruggles, Nate Veldt, and David F. Gleich.
In *Proceedings of the SIAM Workshop on Combinatorial Scientific
Computing 2020 (CSC20)*, pages 43–53. SIAM, 2020.
[ bib |
DOI |
arXiv |
local |
software ]

Yanfei Ren and David F. Gleich.
A simple study of pleasing parallelism on multicore computers.
In A. Grama and A. H. Sameh, editors, *Parallel Algorithms in
Computational Science and Engineering*, pages 325–346. Birkhäuser, 2020.
Report version on this website differs slightly from final published
chapter.
[ bib |
DOI |
local |
software ]

Using cliques with higher-order spectral embeddings improves graph
visualizations.
*Conference proceedings*.
Huda Nassar, Caitlin Kennedy, Shweta Jain, Austin R. Benson, and
David F. Gleich.
In *Proceedings of The Web Conference 2020*, WWW '20, pages
2927–2933, New York, NY, USA, 2020. Association for Computing Machinery.
[ bib |
DOI |
local |
software ]

Graph clustering in all parameter regimes.
*Conference proceedings*.
Junhao Gan, David F. Gleich, Nate Veldt, Anthony Wirth, and Xin
Zhang.
In Javier Esparza and Daniel Král, editors, *45th
International Symposium on Mathematical Foundations of Computer Science (MFCS
2020)*, volume 170 of *Leibniz International Proceedings in Informatics
(LIPIcs)*, pages 39:1–39:15, Dagstuhl, Germany, 2020. Schloss
Dagstuhl–Leibniz-Zentrum für Informatik.
[ bib |
DOI |
arXiv |
local |
http ]

Non-exhaustive, overlapping clustering.
*Journal paper*.
Joyce Jiyoung Whang, Yangyang Hou, David F. Gleich, and Inderjit
Dhillon.
*Transactions on Pattern Analysis and Machine Intelligence*,
41(11), November 2019.
[ bib |
DOI ]

Traditional clustering algorithms, such as K-Means, output a clustering that is disjoint and exhaustive, i.e., every single data point is assigned to exactly one cluster. However, in many real-world datasets, clusters can overlap and there are often outliers that do not belong to any cluster. While this is a well-recognized problem, most existing algorithms address either overlap or outlier detection and do not tackle the problem in a unified way. In this paper, we propose an intuitive objective function, which we call the NEO-K-Means (Non-Exhaustive, Overlapping K-Means) objective, that captures the issues of overlap and non-exhaustiveness in a unified manner. Our objective function can be viewed as a reformulation of the traditional K-Means objective, with easy-to-understand parameters that capture the degrees of overlap and non-exhaustiveness. By considering an extension to weighted kernel K-Means, we show that we can also apply our NEO-K-Means idea to overlapping community detection, which is an important task in network analysis. To optimize the NEO-K-Means objective, we develop not only fast iterative algorithms but also more sophisticated algorithms using low-rank semidefinite programming techniques. Our experimental results show that the new objective and algorithms are effective in finding ground-truth clusterings that have varied overlap and non-exhaustiveness.

Multi-way Monte Carlo method for linear systems.
*Journal paper*.
Tao Wu and David F. Gleich.
*SIAM Journal on Scientific Computing*, 41(6):A3449–A3475,
January 2019.
[ bib |
DOI |
arXiv |
local |
software ]

Learning resolution parameters for graph clustering.
*Conference proceedings*.
Nate Veldt, Anthony Wirth, and David F. Gleich.
In *The World Wide Web Conference*, WWW '19, pages 1909–1919,
New York, NY, USA, 2019. ACM.
[ bib |
DOI |
arXiv |
local |
software ]

Finding clusters of well-connected nodes in a graph is an extensively studied problem in graph-based data analysis. Because of its many applications, a large number of distinct graph clustering objective functions and algorithms have already been proposed and analyzed. To aid practitioners in determining the best clustering approach to use in different applications, we present new techniques for automatically learning how to set clustering resolution parameters. These parameters control the size and structure of communities that are formed by optimizing a generalized objective function. We begin by formalizing the notion of a parameter fitness function, which measures how well a fixed input clustering approximately solves a generalized clustering objective for a specific resolution parameter value. Under reasonable assumptions, which suit two key graph clustering applications, such a parameter fitness function can be efficiently minimized using a bisection-like method, yielding a resolution parameter that fits well with the example clustering. We view our framework as a type of single-shot hyperparameter tuning, as we are able to learn a good resolution parameter with just a single example. Our general approach can be applied to learn resolution parameters for both local and global graph clustering objectives. We demonstrate its utility in several experiments on real-world data where it is helpful to learn resolution parameters from a given example clustering.

Flow-based local graph clustering with better seed set inclusion.
*Conference proceedings*.
Nate Veldt, Christine Klymko, and David F. Gleich.
In *Proceedings of the SIAM International Conference on Data
Mining*, pages 378–386, 2019.
[ bib |
DOI |
arXiv |
local |
software ]

Flow-based methods for local graph clustering have received significant recent attention for their theoretical cut improvement and runtime guarantees. In this work we present two improvements for using flow-based methods in real-world semi-supervised clustering problems. Our first contribution is a generalized objective function that allows practitioners to place strict and soft penalties on excluding specific seed nodes from the output set. This feature allows us to avoid the tendency, often exhibited by previous flow-based methods, to contract a large seed set into a small set of nodes that does not contain all or even most of the seed nodes. Our second contribution is a fast algorithm for minimizing our generalized objective function, based on a variant of the push-relabel algorithm for computing preflows. We make our approach very fast in practice by implementing a global relabeling heuristic and employing a warm-start procedure to quickly solve related cut problems. In practice our algorithm is faster than previous related flow-based methods, and is also more robust in detecting ground truth target regions in a graph thanks to its ability to better incorporate semi-supervised information about target clusters.

Metric-constrained optimization for graph clustering algorithms.
*Journal paper*.
Nate Veldt, David F. Gleich, Anthony Wirth, and James Saunderson.
*SIAM Journal on Mathematics of Data Science*, 1(2):333–355, 2019.
[ bib |
DOI |
arXiv |
local |
software ]

We outline a new approach for solving linear programming relaxations of NP-hard graph clustering problems that enforce triangle inequality constraints on output variables. Extensive previous research has shown that solutions to these relaxations can be used to obtain good approximation algorithms for clustering objectives. However, these are rarely solved in practice due to their high memory requirements. We first prove that the linear programming relaxation of the correlation clustering objective is equivalent to a special case of a well-known problem in machine learning called metric nearness. We then develop a general solver for metric-constrained linear and quadratic programs by generalizing and improving a simple projection algorithm, originally developed for metric nearness. We give several novel approximation guarantees for using our approach to find lower bounds for challenging graph clustering tasks such as sparsest cut, maximum modularity, and correlation clustering. We demonstrate the power of our framework by solving relaxations of these problems involving up to $10^{7}$ variables and $10^{11}$ constraints.

Rigid graph alignment.
*Conference proceedings*.
Vikram Ravindra, Huda Nassar, David F. Gleich, and Ananth Grama.
In *Complex Networks and Their Applications VIII*, pages
621–632. Springer International Publishing, 2019.
[ bib |
DOI |
arXiv |
local ]

An increasingly important class of networks is derived from physical systems that have a spatial basis. Specifically, nodes in the network have spatial coordinates associated with them, and conserved edges in two networks being aligned have correlated distance measures. An example of such a network is the human brain connectome – a network of co-activity of different regions of the brain, as observed in a functional MRI (fMRI). Here, the problem of identifying conserved patterns corresponds to the alignment of connectomes. In this context, one may structurally align the brains through co-registration to a common coordinate system. Alternately, one may align the networks, ignoring the structural basis of co-activity. In this paper, we formulate a novel problem – rigid graph alignment, which simultaneously aligns the network, as well as the underlying structure. We formally specify the problem and present a method based on expectation maximization, which alternately aligns the network and the structure via rigid body transformations. We demonstrate that our method significantly improves the quality of network alignment in synthetic graphs. We also apply rigid graph alignment to functional brain networks derived from 20 subjects drawn from the Human Connectome Project (HCP), and show over a two-fold increase in quality of alignment. Our results are broadly applicable to other applications and abstracted networks that can be embedded in metric spaces – e.g., through spectral embeddings.

Coin-flipping, ball-dropping, and grass-hopping for generating random graphs
from matrices of probabilities.
*Journal paper*.
Arjun S. Ramani, Nicole Eikmeier, and David F. Gleich.
*SIAM Review*, 61(3):549–595, 2019.
[ bib |
DOI |
arXiv |
local |
software ]

Pairwise link prediction.
*Conference proceedings*.
Huda Nassar, Austin R. Benson, and David F. Gleich.
In *Proceedings of the 2019 IEEE/ACM International Conference on
Advances in Social Networks Analysis and Mining*, ASONAM 19, pages 386–393,
2019.
[ bib |
DOI |
arXiv |
local |
software ]

Nonlinear diffusion for community detection and semi-supervised learning.
*Conference proceedings*.
Rania Ibrahim and David F. Gleich.
In *The World Wide Web Conference*, WWW '19, pages 739–750, New
York, NY, USA, 2019. ACM.
[ bib |
DOI |
local |
software ]

Diffusions, such as the heat kernel diffusion and the PageRank vector, and their relatives are widely used graph mining primitives that have been successful in a variety of contexts including community detection and semi-supervised learning. The majority of existing methods and methodology involves linear diffusions, which then yield simple algorithms involving repeated matrix-vector operations. Recent work, however, has shown that sophisticated and complicated techniques based on network embeddings and neural networks can give empirical results superior to those based on linear diffusions. In this paper, we illustrate a class of nonlinear graph diffusions that are competitive with state of the art embedding techniques and outperform classic diffusions. Our new methods enjoy much of the simplicity underlying classic diffusion methods as well. Formally, they are based on nonlinear dynamical systems that can be realized with an implementation akin to applying a nonlinear function after each matrix-vector product in a classic diffusion. This framework also enables us to easily integrate results from multiple data representations in a principled fashion. Furthermore, we have some theoretical relationships that suggest choices of the nonlinear term. We demonstrate the benefits of these techniques on a variety of synthetic and real-world data.

Classes of preferential attachment and triangle preferential attachment models
with power-law spectra.
*Journal paper*.
Nicole Eikmeier and David F. Gleich.
*Journal of Complex Networks*, page cnz040, 2019.
Available online ahead of print.
[ bib |
DOI |
arXiv |
software ]

Centrality in dynamic competition networks.
*Conference proceedings*.
Anthony Bonato, Nicole Eikmeier, David F. Gleich, and Rehan Malik.
In *Complex Networks and Their Applications VIII*, pages
105–116. Springer International Publishing, 2019.
[ bib |
DOI |
arXiv |
local ]

Computing tensor Z-eigenvectors with dynamical systems.
*Journal paper*.
Austin Benson and David F. Gleich.
*SIAM Journal on Matrix Analysis and Applications*,
40(4):1311–1324, January 2019.
[ bib |
DOI |
arXiv |
local |
software ]

The hyperkron graph model for higher-order features.
*Conference proceedings*.
Nicole Eikmeier, Arjun S. Ramani, and David F. Gleich.
In *Proceedings of the International Conference on Data Mining
(ICDM)*, pages 941–946, November 2018.
[ bib |
DOI |
arXiv |
local |
software ]

In this manuscript we present the HyperKron Graph model: an extension of the Kronecker Model, but with a distribution over hyperedges. We prove that we can efficiently generate graphs from this model in time proportional to the number of edges times a small log-factor, and find that in practice the runtime is linear with respect to the number of edges. We illustrate a number of useful features of the HyperKron model including non-trivial clustering and highly skewed degree distributions. Finally, we fit the HyperKron model to real-world networks, and demonstrate the model's flexibility with a complex application of the HyperKron model to networks with coherent feed-forward loops.

Multimodal network diffusion predicts future disease-gene-chemical
associations.
*Journal paper*.
Chih-Hsu Lin, Daniel M Konecki, Meng Liu, Stephen J Wilson, Huda
Nassar, Angela D Wilkins, David F Gleich, and Olivier Lichtarge.
*Bioinformatics*, page bty858, October 2018.
[ bib |
DOI |
local ]

Gauss's law for networks directly reveals community boundaries.
*Journal paper*.
Ayan Sinha, David F. Gleich, and Karthik Ramani.
*Scientific Reports*, 8(1):11909, August 2018.
[ bib |
DOI |
supplementary info |
local ]

A geometric approach to characterize the functional identity of single cells.
*Journal paper*.
Shahin Mohammadi, Vikram Ravindra, David F. Gleich, and Ananth Grama.
*Nature Communications*, 9(1):1516, April 2018.
[ bib |
DOI |
local |
software ]

A correlation clustering framework for community detection.
*Conference proceedings*.
Nate Veldt, David F. Gleich, and Anthony Wirth.
In *Proceedings of the 2018 World Wide Web Conference*, WWW '18,
pages 439–448, Republic and Canton of Geneva, Switzerland, 2018.
International World Wide Web Conferences Steering Committee.
[ bib |
DOI |
arXiv |
local |
software ]

Low rank spectral network alignment.
*Conference proceedings*.
Huda Nassar, Nate Veldt, Shahin Mohammadi, Ananth Grama, and David F.
Gleich.
In *Proceedings of the 2018 World Wide Web Conference*, WWW '18,
pages 619–628. International World Wide Web Conferences Steering Committee,
2018.
[ bib |
DOI |
local |
software ]

Correlation clustering generalized.
*Conference proceedings*.
David F. Gleich, Nate Veldt, and Anthony Wirth.
In *Proceedings of 29th International Symposium on Algorithms and
Computation*, pages 44:1–44:13, 2018.
Authors in Alphabetical Order following CS Theory Convention.
[ bib |
DOI |
arXiv |
local |
http ]

Dynamic competition networks: Detecting alliances and leaders.
*Conference proceedings*.
Anthony Bonato, Nicole Eikmeier, David F. Gleich, and Rehan Malik.
In Anthony Bonato, Pawel Pralat, and Andrei Raigorodskii,
editors, *Algorithms and Models for the Web Graph*, pages 115–144, 2018.
[ bib |
DOI |
arXiv |
local ]

We consider social networks of competing agents that evolve dynamically over time. Such dynamic competition networks are directed, where a directed edge from node u to node v corresponds to a negative social interaction. We present a novel hypothesis that serves as a predictive tool to uncover alliances and leaders within dynamic competition networks. Our focus in the present study is to validate it on competitive networks arising from social game shows such as Survivor and Big Brother.

Triangular alignment (TAME): A tensor-based approach for higher-order network
alignment.
*Journal paper*.
Shahin Mohammadi, David F. Gleich, Tamara G. Kolda, and Ananth Grama.
*Transactions on Computational Biology and Bioinformatics*,
14(6):1446–1458, November 2017.
Published online (July 2016) ahead of print.
[ bib |
DOI |
arXiv |
local |
software ]

AptRank: an adaptive PageRank model for protein function prediction on
bi-relational graphs.
*Journal paper*.
Biaobin Jiang, Kyle Kloster, David F. Gleich, and Michael Gribskov.
*Bioinformatics*, 33(12):1829–1836, June 2017.
[ bib |
DOI |
arXiv |
local |
software ]

The spacey random walk: a stochastic process for higher-order data.
*Journal paper*.
Austin Benson, David F. Gleich, and Lek-Heng Lim.
*SIAM Review*, 59(2):321–345, May 2017.
[ bib |
DOI |
arXiv |
local |
software |
http ]

An optimization approach to locally-biased graph algorithms.
*Journal paper*.
Kimon Fountoulakis, David F. Gleich, and Michael W. Mahoney.
*Proceedings of the IEEE*, 105(2):256–272, February 2017.
[ bib |
DOI |
arXiv |
local ]

Locally-biased graph algorithms are algorithms that attempt to find local or small-scale structure in a large data graph. In some cases, this can be accomplished by adding some sort of locality constraint and calling a traditional graph algorithm; but more interesting are locally-biased graph algorithms that compute answers by running a procedure that does not even look at most of the input graph. This corresponds more closely to what practitioners from various data science domains do, but it does not correspond well with the way that algorithmic and statistical theory is typically formulated. Recent work from several research communities has focused on developing locally-biased graph algorithms that come with strong complementary algorithmic and statistical theory and that are useful in practice in downstream data science applications. We provide a review and overview of this work, highlighting commonalities between seemingly different approaches, and highlighting promising directions for future work.

Erasure coding for fault oblivious linear system solvers.
*Journal paper*.
Yao Zhu, Ananth Grama, and David F. Gleich.
*SIAM Journal on Scientific Computing*, 39(1):C48–C64, 2017.
[ bib |
DOI |
local |
software |
http ]

Local higher-order graph clustering.
*Conference proceedings*.
Hao Yin, Austin R. Benson, Jure Leskovec, and David F. Gleich.
In *Proceedings of the 23rd ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining*, KDD '17, pages 555–564, New York,
NY, USA, 2017. ACM.
[ bib |
DOI |
local |
software ]

Retrospective higher-order Markov processes for user trails.
*Conference proceedings*.
Tao Wu and David F. Gleich.
In *Proceedings of the 23rd ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining*, KDD '17, pages 1185–1194, New York,
NY, USA, 2017. ACM.
[ bib |
DOI |
arXiv |
local |
software ]

Correlation clustering with low-rank matrices.
*Conference proceedings*.
Nate Veldt, Anthony I. Wirth, and David F. Gleich.
In *Proceedings of the 26th International Conference on World
Wide Web*, WWW '17, pages 1025–1034, 2017.
[ bib |
DOI |
arXiv |
local |
software |
http ]

Multimodal network alignment.
*Conference proceedings*.
Huda Nassar and David F. Gleich.
In *Proceedings of the 2017 SIAM International Conference on Data
Mining*, pages 615–623, 2017.
[ bib |
DOI |
arXiv |
local |
software ]

Distributed fault tolerant linear system solvers based on erasure coding.
*Conference proceedings*.
Xuejiao Kang, David F. Gleich, Ahmed Sameh, and Ananth Grama.
In *2017 IEEE 37th International Conference on Distributed
Computing Systems (ICDCS)*, pages 2478–2485, 2017.
[ bib |
DOI |
local ]

We present efficient coding schemes and distributed implementations of erasure coded linear system solvers. Erasure coded computations belong to the class of algorithmic fault tolerance schemes. They are based on augmenting an input dataset, executing the algorithm on the augmented dataset, and in the event of a fault, recovering the solution from the corresponding augmented solution. This process can be viewed as the computational analog of erasure coded storage schemes. The proposed technique has a number of important benefits: (i) as the hardware platform scales in size and number of faults, our scheme yields increasing improvement in resource utilization, compared to traditional schemes; (ii) the proposed scheme is easy to code - the core algorithms remain the same; and (iii) the general scheme is flexible - accommodating a range of computation and communication tradeoffs. We present new coding schemes for augmenting the input matrix that satisfy the recovery equations of erasure coding with high probability in the event of random failures. These coding schemes also minimize fill (non-zero elements introduced by the coding block), while being amenable to efficient partitioning across processing nodes. We demonstrate experimentally that our scheme adds minimal overhead for fault tolerance, yields excellent parallel efficiency and scalability, and is robust to different fault arrival models.

Localization in seeded PageRank.
*Journal paper*.
David F. Gleich, Kyle Kloster, and Huda Nassar.
*Internet Mathematics*, page Online, 2017.
[ bib |
DOI |
arXiv |
local |
software ]

Revisiting power-law distributions in spectra of real world networks.
*Conference proceedings*.
Nicole Eikmeier and David F. Gleich.
In *Proceedings of the 23rd ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining*, KDD '17, pages 817–826, New York,
NY, USA, 2017. ACM.
[ bib |
DOI |
local |
software |
http ]

A parallel min-cut algorithm using iteratively reweighted least squares.
*Journal paper*.
Yao Zhu and David F. Gleich.
*Parallel Computing*, 59:43–59, November 2016.
[ bib |
DOI |
arXiv |
local |
software ]

Overlapping community detection using neighborhood-inflated seed expansion.
*Journal paper*.
Joyce Jiyoung Whang, David F. Gleich, and Inderjit S. Dhillon.
*Transactions on Knowledge and Data Engineering*,
28(5):1272–1284, May 2016.
[ bib |
DOI |
local |
http ]

Community detection is an important task in network analysis. A community (also referred to as a cluster) is a set of cohesive vertices that have more connections inside the set than outside. In many social and information networks, these communities naturally overlap. For instance, in a social network, each vertex in a graph corresponds to an individual who usually participates in multiple communities. In this paper, we propose an efficient overlapping community detection algorithm using a seed expansion approach. The key idea of our algorithm is to find good seeds, and then greedily expand these seeds based on a community metric. Within this seed expansion method, we investigate the problem of how to determine good seed nodes in a graph. In particular, we develop new seeding strategies for a personalized PageRank clustering scheme that optimizes the conductance community score. An important step in our method is the neighborhood inflation step where seeds are modified to represent their entire vertex neighborhood. Experimental results show that our seed expansion algorithm outperforms other state-of-the-art overlapping community detection methods in terms of producing cohesive clusters and identifying ground-truth communities. We also show that our new seeding strategies are better than existing strategies, and are thus effective in finding good overlapping communities in real-world networks.
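
The conductance score and the neighborhood inflation step mentioned in this abstract can be illustrated with a small sketch. This is not the authors' code: the function names `conductance` and `inflate_seed`, the adjacency-list input format, and the example graph are all assumptions made for illustration, and returning infinity when one side has zero volume is just one possible convention.

```python
def conductance(adj, S):
    """Cut edges leaving S divided by the smaller of vol(S) and vol(rest).

    adj maps each vertex to a list of neighbors in an undirected graph.
    """
    S = set(S)
    vol_S = sum(len(adj[u]) for u in S)
    vol_total = sum(len(adj[u]) for u in adj)
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    denom = min(vol_S, vol_total - vol_S)
    # Assumed convention: a set with an empty side gets an infinite score.
    return cut / denom if denom > 0 else float("inf")


def inflate_seed(adj, seed):
    """Replace a single seed vertex by the seed plus all of its neighbors."""
    return {seed} | set(adj[seed])


# Hypothetical example: two triangles joined by the edge (2, 3).
two_triangles = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
left_score = conductance(two_triangles, {0, 1, 2})
inflated = inflate_seed(two_triangles, 2)
```

Here one triangle is a low-conductance set because only a single edge leaves it, which is the kind of set a seed expansion method would try to grow toward.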

General tensor spectral co-clustering for higher-order data.
*Conference proceedings*.
Tao Wu, Austin Benson, and David F. Gleich.
In *Advances in Neural Information Processing Systems 29*, pages
2559–2567, 2016.
http://arxiv.org/abs/1603.00395.
[ bib |
local |
software |
http ]

A simple and strongly-local flow-based method for cut improvement.
*Conference proceedings*.
Nate Veldt, David F. Gleich, and Michael W. Mahoney.
In *International Conference on Machine Learning*, pages
1938–1947, 2016.
[ bib |
local |
software |
.html ]

Deconvolving feedback loops in recommender systems.
*Conference proceedings*.
Ayan Sinha, David F. Gleich, and Karthik Ramani.
In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett,
editors, *Neural Information Processing Systems (NIPS)*, pages
3243–3251. Curran Associates, Inc., 2016.
[ bib |
local |
software |
http ]

Massive graph processing on nanocomputers.
*Conference proceedings*.
Bryan P. Rainey and David F. Gleich.
In *IEEE International Conference on Big Data*, pages 3326–3335,
2016.
Third Workshop on High Performance Big Graph Data Management,
Analysis, and Mining.
[ bib |
DOI |
local |
software ]

Seeded PageRank solution paths.
*Journal paper*.
Kyle Kloster and David F. Gleich.
*European Journal of Applied Mathematics*, 27(6):812–845, 2016.
[ bib |
DOI |
arXiv |
local |
software ]

We study the behaviour of network diffusions based on the PageRank random walk from a set of seed nodes. These diffusions are known to reveal small, localized clusters (or communities), and also large macro-scale clusters by varying a parameter that has a dual-interpretation as an accuracy bound and as a regularization level. We propose a new method that quickly approximates the result of the diffusion for all values of this parameter. Our method efficiently generates an approximate solution path or regularization path associated with a PageRank diffusion, and it reveals cluster structures at multiple size-scales between small and large. We formally prove a runtime bound on this method that is independent of the size of the network, and we investigate multiple optimizations to our method that can be more practical in some settings. We demonstrate that these methods identify refined clustering structure on a number of real-world networks with up to 2 billion edges.

Fast multiplier methods to optimize non-exhaustive, overlapping clustering.
*Conference proceedings*.
Yangyang Hou, Joyce Jiyoung Whang, David F. Gleich, and Inderjit
Dhillon.
In *Proceedings of the 2016 SIAM International Conference on Data
Mining*, pages 297–305, 2016.
[ bib |
DOI |
http ]

David F. Gleich and Michael W. Mahoney.
Mining large graphs.
In Peter Bühlmann, Petros Drineas, Michael Kane, and Mark van de
Laan, editors, *Handbook of Big Data*, Handbooks of modern statistical
methods, pages 191–220. CRC Press, 2016.
[ bib |
DOI |
local ]

Mining and modeling character networks.
*Conference proceedings*.
Anthony Bonato, David Ryan D'Angelo, Ethan R. Elenberg, David F.
Gleich, and Yangyang Hou.
In Anthony Bonato, Fan Chung Graham, and Pawel Pralat, editors,
*International Workshop on Algorithms and Models for the Web-Graph*, WAW,
pages 100–114. Springer International Publishing, 2016.
[ bib |
DOI |
arXiv |
local ]

Higher-order organization of complex networks.
*Journal paper*.
Austin Benson, David F. Gleich, and Jure Leskovec.
*Science*, 353(6295):163–166, 2016.
[ bib |
DOI |
supplementary info |
local |
software ]

PageRank beyond the web.
*Journal paper*.
David F. Gleich.
*SIAM Review*, 57(3):321–363, August 2015.
[ bib |
DOI |
local ]

Google's PageRank method was developed to evaluate the importance of web-pages via their link structure. The mathematics of PageRank, however, are entirely general and apply to any graph or network in any domain. Thus, PageRank is now regularly used in bibliometrics, social and information network analysis, and for link prediction and recommendation. It's even used for systems analysis of road networks, as well as biology, chemistry, neuroscience, and physics. We'll see the mathematics and ideas that unite these diverse applications.
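
As a minimal illustration of computing PageRank from link structure, the sketch below runs a plain power iteration on a tiny graph. The damping value 0.85, the uniform teleportation, the uniform handling of nodes without out-links, the function name `pagerank`, and the example edges are assumptions chosen for illustration and are not taken from the paper.

```python
def pagerank(out_links, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power iteration for PageRank with uniform teleportation.

    out_links maps each node to the list of nodes it links to. Nodes with
    no out-links spread their score uniformly over all nodes.
    """
    nodes = sorted(out_links)
    n = len(nodes)
    x = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        dangling = sum(x[u] for u in nodes if not out_links[u])
        # Teleportation plus the redistributed dangling mass.
        nxt = {u: (1.0 - alpha) / n + alpha * dangling / n for u in nodes}
        for u in nodes:
            for v in out_links[u]:
                nxt[v] += alpha * x[u] / len(out_links[u])
        change = sum(abs(nxt[u] - x[u]) for u in nodes)
        x = nxt
        if change < tol:
            break
    return x


# Hypothetical four-page web: "d" links out but nothing links to it.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(graph)
```

Nothing in the iteration is specific to web pages, so the same routine applies to any directed graph given as adjacency lists.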

Non-exhaustive, overlapping k-means.
*Conference proceedings*.
Joyce Jiyoung Whang, Inderjit S. Dhillon, and David F. Gleich.
In *Proceedings of the 2015 SIAM International Conference on Data
Mining*, pages 936–944, 2015.
[ bib |
DOI |
local ]

Traditional clustering algorithms, such as k-means, output a clustering that is disjoint and exhaustive, that is, every single data point is assigned to exactly one cluster. However, in real datasets, clusters can overlap and there are often outliers that do not belong to any cluster. This is a well-recognized problem that has received much attention in the past, and several algorithms, such as fuzzy k-means, have been proposed for overlapping clustering. However, most existing algorithms address either overlap or outlier detection and do not tackle the problem in a unified way. In this paper, we propose a simple and intuitive objective function that captures the issues of overlap and non-exhaustiveness in a unified manner. Our objective function can be viewed as a reformulation of the traditional k-means objective, with easy-to-understand parameters that capture the degrees of overlap and non-exhaustiveness. By studying the objective, we are able to obtain a simple iterative algorithm which we call NEO-K-Means (Non-Exhaustive Overlapping K-Means). Furthermore, by considering an extension to weighted kernel k-means, we can tackle the case of non-exhaustive and overlapping graph clustering. This extension allows us to apply our NEO-K-Means algorithm to the community detection problem, which is an important task in network analysis. Our experimental results show that the new objective and algorithm are effective in finding ground-truth clusterings that have varied overlap and non-exhaustiveness; for the case of graphs, we show that our algorithm outperforms state-of-the-art overlapping community detection methods.
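
To make the idea of a k-means-style objective with overlap and outliers concrete, the sketch below evaluates such an objective for a fixed 0/1 assignment matrix. It is an illustration under assumptions, not the paper's algorithm: the function names, the list-of-lists inputs, and the reading of the two levels (total assignments equal to (1 + alpha) times the number of points, and beta as the fraction of unassigned points) are stated here as assumptions, and no optimization is performed.

```python
def neo_kmeans_objective(X, U):
    """Sum of squared distances from each point to the mean of every
    cluster it belongs to.

    X is a list of n points (lists of floats). U is an n-by-k 0/1 matrix;
    a row may hold several ones (overlap) or none (an outlier).
    """
    n, k = len(X), len(U[0])
    total = 0.0
    for j in range(k):
        members = [X[i] for i in range(n) if U[i][j]]
        if not members:
            continue
        dim = len(members[0])
        center = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - center[d]) ** 2 for p in members for d in range(dim))
    return total


def overlap_and_outlier_levels(U):
    """Return (alpha, beta) for a given assignment: extra assignments per
    point, and the fraction of points assigned to no cluster."""
    n = len(U)
    assignments = sum(sum(row) for row in U)
    unassigned = sum(1 for row in U if sum(row) == 0)
    return assignments / n - 1.0, unassigned / n


# Hypothetical 1-D data: two overlapping groups and one far outlier.
X = [[0.0], [2.0], [4.0], [6.0], [100.0]]
U = [[1, 0], [1, 1], [1, 1], [0, 1], [0, 0]]
value = neo_kmeans_objective(X, U)
alpha, beta = overlap_and_outlier_levels(U)
```

Leaving the point at 100 unassigned keeps it from dragging either cluster mean, which is the effect the non-exhaustiveness parameter is meant to allow.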

Parallel maximum clique algorithms with applications to network analysis.
*Journal paper*.
Ryan A. Rossi, David F. Gleich, and Assefaw H. Gebremedhin.
*SIAM Journal on Scientific Computing*, 37(5):C589–C616, 2015.
[ bib |
DOI |
arXiv |
local ]

We present a fast, parallel maximum clique algorithm for large sparse graphs that is designed to exploit characteristics of social and information networks. The method exhibits a roughly linear runtime scaling over real-world networks ranging from a thousand to a hundred million nodes. In a test on a social network with 1.8 billion edges, the algorithm finds the largest clique in about 20 minutes. At its heart the algorithm employs a branch-and-bound strategy with novel and aggressive pruning techniques. The pruning techniques include the combined use of core numbers of vertices along with a good initial heuristic solution to remove the vast majority of the search space. In addition, the exploration of the search tree is parallelized. During the search, processes immediately communicate changes to upper and lower bounds on the size of the maximum clique. This exchange of information occasionally results in a superlinear speedup because tasks with large search spaces can be pruned by other processes. We demonstrate the impact of the algorithm on applications using two different network analysis problems: computation of temporal strong components in dynamic networks and determination of compression-friendly ordering of nodes of massive networks.
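
The pruning idea based on core numbers and an initial heuristic solution can be sketched as follows. This is a simplified illustration and not the authors' parallel implementation: the function names, the adjacency-list format (simple undirected graph without self-loops), and the example graph are assumptions. It relies on the standard fact that every vertex of a clique on k vertices has core number at least k - 1.

```python
def core_numbers(adj):
    """Core number of each vertex, found by repeatedly peeling a
    minimum-degree vertex from the remaining graph."""
    degree = {u: len(adj[u]) for u in adj}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        u = min(remaining, key=degree.get)
        k = max(k, degree[u])
        core[u] = k
        remaining.remove(u)
        for v in adj[u]:
            if v in remaining:
                degree[v] -= 1
    return core


def prune_for_clique_search(adj, lower_bound):
    """Keep only vertices that could lie in a clique larger than lower_bound.

    A clique on lower_bound + 1 vertices needs every member to have core
    number at least lower_bound.
    """
    core = core_numbers(adj)
    return {u for u in adj if core[u] >= lower_bound}


# Hypothetical graph: a 4-clique on {0, 1, 2, 3} with a tail 3 - 4 - 5.
g = {
    0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3],
    3: [0, 1, 2, 4], 4: [3, 5], 5: [4],
}
core = core_numbers(g)
# Suppose a heuristic already found a triangle, so lower_bound is 3.
candidates = prune_for_clique_search(g, 3)
```

The largest core number plus one also bounds the maximum clique size from above, so the two bounds together can shrink the search space before any branching starts.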

Strong localization in personalized PageRank.
*Conference proceedings*.
Huda Nassar, Kyle Kloster, and David F. Gleich.
In *Proceedings of the 2015 Workshop on Algorithms for the
Webgraph*, number 9479 in LNCS, pages 190–202, 2015.
[ bib |
DOI |
local |
software |
http ]

The personalized PageRank diffusion is a fundamental tool in network analysis tasks like community detection and link prediction. It models the spread of a quantity from a set of seed nodes, and it has been observed to stay localized near this seed set. We derive an upper-bound on the number of entries necessary to approximate a personalized PageRank vector in graphs with skewed degree sequences. This bound shows localization under mild assumptions on the maximum and minimum degrees. Experimental results on random graphs with these degree sequences show the bound is loose and support a conjectured bound.

Differential flux balance analysis of quantitative proteomic data on protein
interaction networks.
*Conference proceedings*.
Biaobin Jiang, David F. Gleich, and Michael Gribskov.
In *Symposium on Signal Processing and Mathematical Modeling of
Biological Processes with Applications to Cyber-Physical Systems for Precise
Medicine*, GlobalSIP, pages 977–981. IEEE, 2015.
[ bib |
DOI |
local ]

Non-exhaustive, overlapping clustering via low-rank semidefinite programming.
*Conference proceedings*.
Yangyang Hou, Joyce Jiyoung Whang, David F. Gleich, and Inderjit S.
Dhillon.
In *Proceedings of the 21st ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining*, KDD '15, pages 427–436, New York,
NY, USA, 2015. ACM.
[ bib |
DOI |
local ]

Using local spectral methods to robustify graph-based learning algorithms.
*Conference proceedings*.
David F. Gleich and Michael W. Mahoney.
In *Proceedings of the 21st ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining*, KDD '15, pages 359–368, New York,
NY, USA, 2015. ACM.
[ bib |
DOI |
local |
software ]

Graph-based learning methods have a variety of names including semi-supervised and transductive learning. They typically use a diffusion to propagate labels from a small set of nodes with known class labels to the remaining nodes of the graph. While popular, these algorithms, when implemented in a straightforward fashion, are extremely sensitive to the details of the graph construction. Here, we provide four procedures to help make them more robust: recognizing implicit regularization in the diffusion, using a scalable push method to evaluate the diffusion, using rank-based rounding, and densifying the graph through a matrix polynomial. We study robustness with respect to the details of graph constructions, errors in node labeling, degree variability, and a variety of other real-world heterogeneities, studying these methods through a precise relationship with mincut problems. For instance, the densification strategy explicitly adds new weighted edges to a sparse graph. We find that this simple densification creates a graph where multiple diffusion methods are robust to several types of errors. This is demonstrated by a study with predicting product categories from an Amazon co-purchasing network.

Multilinear PageRank.
*Journal paper*.
David F. Gleich, Lek-Heng Lim, and Yongyang Yu.
*SIAM Journal on Matrix Analysis and Applications*,
36(4):1507–1541, 2015.
[ bib |
DOI |
local |
software |
http ]

In this paper, we first extend the celebrated PageRank modification to a higher-order Markov chain. Although this system has attractive theoretical properties, it is computationally intractable for many interesting problems. We next study a computationally tractable approximation to the higher-order PageRank vector that involves a system of polynomial equations called multilinear PageRank. This is motivated by a novel “spacey random surfer” model, where the surfer remembers bits and pieces of history and is influenced by this information. The underlying stochastic process is an instance of a vertex-reinforced random walk. We develop convergence theory for a simple fixed-point method, a shifted fixed-point method, and a Newton iteration in a particular parameter regime. In marked contrast to the case of the PageRank vector of a Markov chain where the solution is always unique and easy to compute, there are parameter regimes of multilinear PageRank where solutions are not unique and simple algorithms do not converge. We provide a repository of these nonconvergent cases that we encountered through exhaustive enumeration and random sampling, which we believe is useful for future study of the problem.

Sublinear column-wise actions of the matrix exponential on social networks.
*Journal paper*.
David F. Gleich and Kyle Kloster.
*Internet Mathematics*, 11(4–5):352–384, 2015.
[ bib |
DOI |
local ]

We consider stochastic transition matrices from large social and information networks. For these matrices, we describe and evaluate three fast methods to estimate one column of the matrix exponential. The methods are designed to exploit the properties inherent in social networks, such as a power-law degree distribution. Using only this property, we prove that one of our three algorithms has a sublinear runtime. We present further experimental evidence showing that all three of them run quickly on social networks with billions of edges, and they accurately identify the largest elements of the column.

Tensor spectral clustering for partitioning higher-order network structures.
*Conference proceedings*.
Austin R. Benson, David F. Gleich, and Jure Leskovec.
In *Proceedings of the 2015 SIAM International Conference on Data
Mining*, pages 118–126, 2015.
[ bib |
DOI |
local ]

Spectral graph theory-based methods represent an important class of tools for studying the structure of networks. Spectral methods are based on a first-order Markov chain derived from a random walk on the graph and thus they cannot take advantage of important higher-order network substructures such as triangles, cycles, and feed-forward loops. Here we propose a Tensor Spectral Clustering (TSC) algorithm that allows for modeling higher-order network structures in a graph partitioning framework. Our TSC algorithm allows the user to specify which higher-order network structures (cycles, feed-forward loops, etc.) should be preserved by the network clustering. Higher-order network structures of interest are represented using a tensor, which we then partition by developing a multilinear spectral method. Our framework can be applied to discovering layered flows in networks as well as graph anomaly detection, which we illustrate on synthetic networks. In directed networks, a higher-order structure of particular interest is the directed 3-cycle, which captures feedback loops in networks. We demonstrate that our TSC algorithm produces large partitions that cut fewer directed 3-cycles than standard spectral clustering algorithms.

Model reduction with MapReduce-enabled tall and skinny singular value
decomposition.
*Journal paper*.
Paul G. Constantine, David F. Gleich, Yangyang Hou, and Jeremy
Templeton.
*SIAM Journal on Scientific Computing*, 36(5):S166–S191, November 2014.
[ bib |
DOI |
local |
http ]

Dimensionality of social networks using motifs and eigenvalues.
*Journal paper*.
Anthony Bonato, David F. Gleich, Myunghwan Kim, Dieter Mitsche,
Pawel Pralat, Amanda Tian, and Stephen J. Young.
*PLoS ONE*, 9(9):e106052, September 2014.
[ bib |
DOI |
local |
software ]

We consider the dimensionality of social networks, and develop experiments aimed at predicting that dimension. We find that a social network model with nodes and links sampled from an *m*-dimensional metric space with power-law distributed influence regions best fits samples from real-world networks when *m* scales logarithmically with the number of nodes of the network. This supports a logarithmic dimension hypothesis, and we provide evidence with two different social networks, Facebook and LinkedIn. Further, we employ two different methods for confirming the hypothesis: the first uses the distribution of motif counts, and the second exploits the eigenvalue distribution.

A dynamical system for PageRank with time-dependent teleportation.
*Journal paper*.
David F. Gleich and Ryan A. Rossi.
*Internet Mathematics*, 10(1–2):188–217, June 2014.
[ bib |
DOI |
local ]

We propose a dynamical system that captures changes to the network centrality of nodes as external interest in those nodes varies. We derive this system by adding time-dependent teleportation to the PageRank score. The result is not a single set of importance scores, but rather a time-dependent set. These can be converted into ranked lists in a variety of ways, for instance, by taking the largest change in the importance score. For an interesting class of dynamic teleportation functions, we derive closed-form solutions for the dynamic PageRank vector. The magnitude of the deviation from a static PageRank vector is given by a PageRank problem with complex-valued teleportation parameters. Moreover, these dynamical systems are easy to evaluate. We demonstrate the utility of dynamic teleportation on both the article graph of Wikipedia, where the external interest information is given by the number of hourly visitors to each page, and the Twitter social network, where external interest is the number of tweets per month. For these problems, we show that using information from the dynamical system helps improve a prediction task and identify trends in the data.

Fast maximum clique algorithms for large graphs.
*Conference proceedings*.
Ryan A. Rossi, David F. Gleich, Assefaw H. Gebremedhin, and Md.
Mostofa Ali Patwary.
In *Poster Proceedings of WWW2014*, pages 365–366, 2014.
[ bib |
DOI |
local ]

Using triangles to improve community detection in directed networks.
*Conference proceedings*.
Christine Klymko, David F. Gleich, and Tamara G. Kolda.
In *Proceedings of the ASE BigData Conference*, Stanford, CA,
2014.
Full version on arXiv http://arxiv.org/abs/1404.5874.
[ bib |
DOI |
http ]

Heat kernel based community detection.
*Conference proceedings*.
Kyle Kloster and David F. Gleich.
In *Proceedings of the 20th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining*, KDD '14, pages 1386–1395, New York,
NY, USA, 2014. ACM.
[ bib |
DOI |
local |
software ]

Anti-differentiating approximation algorithms: A case study with min-cuts,
spectral, and flow.
*Conference proceedings*.
David F. Gleich and Michael W. Mahoney.
In *Proceedings of the International Conference on Machine
Learning (ICML)*, pages 1018–1025, 2014.
[ bib |
local |
http ]

Scalable methods for nonnegative matrix factorizations of near-separable
tall-and-skinny matrices.
*Conference proceedings*.
Austin R. Benson, Jason D. Lee, Bartek Rajwa, and David F. Gleich.
In *Proceedings of Neural Information Processing Systems*, pages
945–953, 2014.
Selected for Spotlight Presentation.
[ bib |
http ]

A nearly-sublinear method for approximating a column of the matrix exponential
for matrices from large, sparse networks.
*Conference proceedings*.
Kyle Kloster and David F. Gleich.
In Anthony Bonato, Michael Mitzenmacher, and Pawel Pralat,
editors, *Algorithms and Models for the Web Graph*, volume 8305 of
*Lecture Notes in Computer Science*, pages 68–79. Springer International
Publishing, December 2013.
[ bib |
DOI |
local ]

We consider random-walk transition matrices from large social and information networks. For these matrices, we describe and evaluate a fast method to estimate one column of the matrix exponential. Our method runs in sublinear time on networks where the maximum degree grows doubly logarithmically with respect to the number of nodes. For collaboration networks with over 5 million edges, we find it runs in less than a second on a standard desktop machine.

Overlapping community detection using seed set expansion.
*Conference proceedings*.
Joyce Jiyoung Whang, David F. Gleich, and Inderjit S. Dhillon.
In *Proceedings of the 22nd ACM International Conference on
Information and Knowledge Management*, CIKM '13, pages
2099–2108, New York, NY, USA, October 2013. ACM.
[ bib |
DOI |
local |
software ]

Community detection is an important task in network analysis. A community (also referred to as a cluster) is a set of cohesive vertices that have more connections inside the set than outside. In many social and information networks, these communities naturally overlap. For instance, in a social network, each vertex in a graph corresponds to an individual who usually participates in multiple communities. One of the most successful techniques for finding overlapping communities is based on local optimization and expansion of a community metric around a seed set of vertices. In this paper, we propose an efficient overlapping community detection algorithm using a seed set expansion approach. In particular, we develop new seeding strategies for a personalized PageRank scheme that optimizes the conductance community score. The key idea of our algorithm is to find good seeds, and then expand these seed sets using the personalized PageRank clustering procedure. Experimental results show that this seed set expansion approach outperforms other state-of-the-art overlapping community detection methods. We also show that our new seeding strategies are better than previous strategies, and are thus effective in finding good overlapping clusters in a graph.

The power and Arnoldi methods in an algebra of circulants.
*Journal paper*.
David F. Gleich, Chen Greif, and James M. Varah.
*Numerical Linear Algebra with Applications*, 20:809–831,
October 2013.
[ bib |
DOI |
local ]

Circulant matrices play a central role in a recently proposed formulation of three-way data computations. In this setting, a three-way table corresponds to a matrix where each "scalar" is a vector of parameters defining a circulant. This interpretation provides many generalizations of results from matrix or vector-space algebra. We derive the power and Arnoldi methods in this algebra. In the course of our derivation, we define inner products, norms, and other notions. These extensions are straightforward in an algebraic sense, but the implications are dramatically different from the standard matrix case. For example, a matrix of circulants has a polynomial number of eigenvalues in its dimension, although these can all be represented by a carefully chosen canonical set of eigenvalues and vectors. These results and algorithms are closely related to standard decoupling techniques on block-circulant matrices using the fast Fourier transform.

Direct tall-and-skinny QR factorizations in MapReduce architectures.
*Conference proceedings*.
Austin R. Benson, David F. Gleich, and James Demmel.
In *Big Data, 2013 IEEE International Conference on*, pages
264–272, October 2013.
[ bib |
DOI |
local |
software ]

The QR factorization and the SVD are two fundamental matrix decompositions with applications throughout scientific computing and data analysis. For matrices with many more rows than columns, so-called “tall-and-skinny matrices,” there is a numerically stable, efficient, communication-avoiding algorithm for computing the QR factorization. It has been used in traditional high performance computing and grid computing environments. For MapReduce environments, existing methods to compute the QR decomposition use a numerically unstable approach that relies on indirectly computing the Q factor. In the best case, these methods require only two passes over the data. In this paper, we describe how to compute a stable tall-and-skinny QR factorization on a MapReduce architecture in only slightly more than 2 passes over the data. We can compute the SVD with only a small change and no difference in performance. We present a performance comparison between our new direct TSQR method, indirect TSQR methods that use the communication-avoiding TSQR algorithm, and a standard unstable implementation for MapReduce (Cholesky QR). We find that our new stable method is competitive with unstable methods for matrices with a modest number of columns. This holds both in a theoretical performance model as well as in an actual implementation.

Expanders, tropical semi-rings, and nuclear norms: oh my!
*Journal paper*.
David F. Gleich.
*XRDS*, 19(3):32–36, March 2013.
[ bib |
DOI |
local ]

Message-passing algorithms for sparse network alignment.
*Journal paper*.
Mohsen Bayati, David F. Gleich, Amin Saberi, and Ying Wang.
*ACM Trans. Knowl. Discov. Data*, 7(1):3:1–3:31, March 2013.
[ bib |
DOI |
local |
software ]

Network alignment generalizes and unifies several approaches for forming a matching or alignment between the vertices of two graphs. We study a mathematical programming framework for the network alignment problem and a sparse variation of it where only a small number of matches between the vertices of the two graphs are possible. We propose a new message passing algorithm that allows us to compute, very efficiently, approximate solutions to the sparse network alignment problems with graph sizes as large as hundreds of thousands of vertices. We also provide extensive simulations comparing our algorithms with two of the best solvers for network alignment problems on two synthetic matching problems, two bioinformatics problems, and three large ontology alignment problems including a multilingual problem with a known labeled alignment.

A multicore algorithm for network alignment via approximate matching.
*Conference proceedings*.
Arif Khan, David F. Gleich, Mahantesh Halappanavar, and Alex Pothen.
In *Proceedings of the 2012 ACM/IEEE International Conference for
High Performance Computing, Networking, Storage and Analysis*, SC '12, pages
64:1–64:11, Los Alamitos, CA, USA, November 2012. IEEE Computer Society
Press.
[ bib |
local |
.pdf ]

Vertex neighborhoods, low conductance cuts, and good seeds for local community
methods.
*Conference proceedings*.
David F. Gleich and C. Seshadhri.
In *KDD2012*, pages 597–605, August 2012.
[ bib |
DOI |
local ]

Moment based estimation of stochastic Kronecker graph parameters.
*Journal paper*.
David F. Gleich and Art B. Owen.
*Internet Mathematics*, 8(3):232–256, August 2012.
[ bib |
DOI |
local ]

Stochastic Kronecker graphs supply a parsimonious model for large sparse real-world graphs. They can specify the distribution of a large random graph using only three or four parameters. Those parameters have, however, proved difficult to choose in specific applications. This article looks at method-of-moments estimators that are computationally much simpler than maximum likelihood. The estimators are fast, and in our examples, they typically yield Kronecker parameters with expected feature counts closer to a given graph than we get from KronFit. The improvement is especially prominent for the number of triangles in the graph.

Overlapping clusters for distributed computation.
*Conference proceedings*.
Reid Andersen, David F. Gleich, and Vahab Mirrokni.
In *Proceedings of the fifth ACM international conference on Web
search and data mining*, WSDM '12, pages 273–282, New York, NY, USA,
February 2012. ACM.
[ bib |
DOI |
local |
http ]

Dynamic PageRank using evolving teleportation.
*Conference proceedings*.
Ryan A. Rossi and David F. Gleich.
In Anthony Bonato and Jeannette Janssen, editors, *Algorithms and
Models for the Web Graph*, volume 7323 of *Lecture Notes in Computer
Science*, pages 126–137. Springer Berlin Heidelberg, 2012.
[ bib |
DOI |
local ]

The importance of nodes in a network constantly fluctuates based on changes in the network structure as well as changes in external interest. We propose an evolving teleportation adaptation of the PageRank method to capture how changes in external interest influence the importance of a node. This framework seamlessly generalizes PageRank because the importance of a node will converge to the PageRank values if the external influence stops changing. We demonstrate the effectiveness of the evolving teleportation on the Wikipedia graph and the Twitter social network. The external interest is given by the number of hourly visitors to each page and the number of monthly tweets for each user.

Distinguishing signal from noise in an SVD of simulation data.
*Conference proceedings*.
Paul G. Constantine and David F. Gleich.
In *Proceedings of the IEEE Conference on Acoustics, Speech, and
Signal Processing*, pages 5333–5336, 2012.
[ bib |
DOI |
local ]

Fast matrix computations for pairwise and columnwise commute times and Katz
scores.
*Journal paper*.
Francesco Bonchi, Pooya Esfandiar, David F. Gleich, Chen Greif, and
Laks V.S. Lakshmanan.
*Internet Mathematics*, 8(1–2):73–112, 2012.
[ bib |
DOI |
local ]

We explore methods for approximating the commute time and Katz score between a pair of nodes. These methods are based on the approach of matrices, moments, and quadrature developed in the numerical linear algebra community. They rely on the Lanczos process and provide upper and lower bounds on an estimate of the pairwise scores. We also explore methods to approximate the commute times and Katz scores from a node to all other nodes in the graph. Here, our approach for the commute times is based on a variation of the conjugate gradient algorithm, and it provides an estimate of all the diagonals of the inverse of a matrix. Our technique for the Katz scores is based on exploiting an empirical localization property of the Katz matrix. We adapt algorithms used for personalized PageRank computing to these Katz scores and theoretically show that this approach is convergent. We evaluate these methods on 17 real-world graphs ranging in size from 1000 to 1,000,000 nodes. Our results show that our pairwise commute-time method and columnwise Katz algorithm both have attractive theoretical properties and empirical performance.

Tall and skinny QR factorizations in MapReduce architectures.
*Conference proceedings*.
Paul G. Constantine and David F. Gleich.
In *Proceedings of the second international workshop on MapReduce
and its applications*, MapReduce '11, pages 43–50, New York, NY, USA, June
2011. ACM.
[ bib |
DOI |
local ]

Some computational tools for digital archive and metadata maintenance.
*Journal paper*.
David F. Gleich, Ying Wang, Xiangrui Meng, Farnaz Ronaghi, Margot
Gerritsen, and Amin Saberi.
*BIT Numerical Mathematics*, 51:127–154, 2011.
[ bib |
DOI |
local ]

Computational tools are a mainstay of current search and recommendation technology. But modern digital archives are astonishingly diverse collections of older digitized material and newer born digital content. Finding interesting material in these archives is still challenging. The material often lacks appropriate annotation—or metadata—so that people can find the most interesting material. We describe four computational tools we developed to aid in the processing and maintenance of large digital archives. The first is an improvement to a graph layout algorithm for graphs with hundreds of thousands of nodes. The second is a new algorithm for matching databases with links among the objects, also known as a network alignment problem. The third is an optimization heuristic to disambiguate a set of geographic references in a book. And the fourth is a technique to automatically generate a title from a description.

Rank aggregation via nuclear norm minimization.
*Conference proceedings*.
David F. Gleich and Lek-Heng Lim.
In *Proceedings of the 17th ACM SIGKDD international conference
on Knowledge discovery and data mining*, KDD '11, pages 60–68, New York, NY,
USA, 2011. ACM.
[ bib |
DOI |
local ]

The process of rank aggregation is intimately intertwined with the structure of skew-symmetric matrices. We apply recent advances in the theory and algorithms of matrix completion to skew-symmetric matrices. This combination of ideas produces a new method for ranking a set of items. The essence of our idea is that a rank aggregation describes a partially filled skew-symmetric matrix. We extend an algorithm for matrix completion to handle skew-symmetric data and use that to extract ranks for each item. Our algorithm applies to both pairwise comparison and rating data. Because it is based on matrix completion, it is robust to both noise and incomplete data. We show a formal recovery result for the noiseless case and present a detailed study of the algorithm on synthetic data and Netflix ratings.

A factorization of the spectral Galerkin system for parameterized matrix
equations: derivation and applications.
*Journal paper*.
Paul G. Constantine, David F. Gleich, and Gianluca Iaccarino.
*SIAM Journal on Scientific Computing*, 33(5):2995–3009, 2011.
[ bib |
DOI |
local ]

Recent work has explored solver strategies for the linear system of equations arising from a spectral Galerkin approximation of the solution of PDEs with parameterized (or stochastic) inputs. We consider the related problem of a matrix equation whose matrix and right-hand side depend on a set of parameters (e.g., a PDE with stochastic inputs semidiscretized in space) and examine the linear system arising from a similar Galerkin approximation of the solution. We derive a useful factorization of this system of equations, which yields bounds on the eigenvalues, clues to preconditioning, and a flexible implementation method for a wide array of problems. We complement this analysis with (i) a numerical study of preconditioners on a standard elliptic PDE test problem and (ii) a fluids application using existing CFD codes; the MATLAB codes used in the numerical studies are available online.

Overlapping clusters for distributed computation.
*Conference proceedings*.
Reid Andersen, David F. Gleich, and Vahab S. Mirrokni.
In *Poster proceedings of the SIAM Workshop on Combinatorial and
Scientific Computing (CSC)*, 2011.
Poster.
[ bib |
local ]

Random alpha PageRank.
*Journal paper*.
Paul G. Constantine and David F. Gleich.
*Internet Mathematics*, 6(2):189–236, September 2010.
[ bib |
DOI |
local |
http ]

We suggest a revision to the PageRank random surfer model that considers the influence of a population of random surfers on the PageRank vector. In the revised model, each member of the population has its own teleportation parameter chosen from a probability distribution, and consequently, the ranking vector is random. We propose three algorithms for computing the statistics of the random ranking vector based respectively on (i) random sampling, (ii) paths along the links of the underlying graph, and (iii) quadrature formulas. We find that the expectation of the random ranking vector produces similar rankings to its deterministic analogue, but the standard deviation gives uncorrelated information (under a Kendall-tau metric) with myriad potential uses. We examine applications of this model to web spam.

Tracking the random surfer: empirically measured teleportation parameters in
PageRank.
*Conference proceedings*.
David F. Gleich, Paul G. Constantine, Abraham Flaxman, and Asela
Gunawardana.
In *WWW '10: Proceedings of the 19th international conference on
World wide web*, pages 381–390, April 2010.
[ bib |
DOI |
local ]

PageRank computes the importance of each node in a directed graph under a random surfer model governed by a teleportation parameter. Commonly denoted alpha, this parameter models the probability of following an edge inside the graph or, when the graph comes from a network of web pages and links, clicking a link on a web page. We empirically measure the teleportation parameter based on browser toolbar logs and a click trail analysis. For a particular user or machine, such analysis produces a value of alpha. We find that these values nicely fit a Beta distribution with mean edge-following probability between 0.3 and 0.7, depending on the site. Using these distributions, we compute PageRank scores where PageRank is computed with respect to a distribution as the teleportation parameter, rather than a constant teleportation parameter. These new metrics are evaluated on the graph of pages in Wikipedia.

An inner-outer iteration for PageRank.
*Journal paper*.
David F. Gleich, Andrew P. Gray, Chen Greif, and Tracy Lau.
*SIAM Journal on Scientific Computing*, 32(1):349–371, February
2010.
[ bib |
DOI |
local ]

We present a new iterative scheme for PageRank computation. The algorithm is applied to the linear system formulation of the problem, using inner-outer stationary iterations. It is simple, can be easily implemented and parallelized, and requires minimal storage overhead. Our convergence analysis shows that the algorithm is effective for a crude inner tolerance and is not sensitive to the choice of the parameters involved. The same idea can be used as a preconditioning technique for nonstationary schemes. Numerical examples featuring matrices of dimensions exceeding 100,000,000 in sequential and parallel environments demonstrate the merits of our technique. Our code is available online for viewing and testing, along with several large scale examples.

Fast Katz and commuters: Efficient approximation of social relatedness over
large networks.
*Conference proceedings*.
Pooya Esfandiar, Francesco Bonchi, David F. Gleich, Chen Greif, Laks
V. S. Lakshmanan, and Byung-Won On.
In *Algorithms and Models for the Web Graph*, 2010.
[ bib |
DOI |
local ]

Motivated by social network data mining problems such as link prediction and collaborative filtering, significant research effort has been devoted to computing topological measures including the Katz score and the commute time. Existing approaches typically approximate all pairwise relationships simultaneously. In this paper, we are interested in computing: the score for a single pair of nodes, and the top-k nodes with the best scores from a given source node. For the pairwise problem, we apply an iterative algorithm that computes upper and lower bounds for the measures we seek. This algorithm exploits a relationship between the Lanczos process and a quadrature rule. For the top-k problem, we propose an algorithm that only accesses a small portion of the graph and is related to techniques used in personalized PageRank computing. To test the scalability and accuracy of our algorithms we experiment with three real-world networks and find that these algorithms run in milliseconds to seconds without any preprocessing.

Spectral methods for parameterized matrix equations.
*Journal paper*.
Paul G. Constantine, David F. Gleich, and Gianluca Iaccarino.
*SIAM Journal on Matrix Analysis and Applications*,
31(5):2681–2699, 2010.
[ bib |
DOI |
local ]

We apply polynomial approximation methods—known in the numerical PDEs context as spectral methods—to approximate the vector-valued function that satisfies a linear system of equations where the matrix and the right-hand side depend on a parameter. We derive both an interpolatory pseudospectral method and a residual-minimizing Galerkin method, and we show how each can be interpreted as solving a truncated infinite system of equations; the difference between the two methods lies in where the truncation occurs. Using classical theory, we derive asymptotic error estimates related to the region of analyticity of the solution, and we present a practical residual error estimate. We verify the results with two numerical examples.

Algorithms for large, sparse network alignment problems.
*Conference proceedings*.
Mohsen Bayati, Margot Gerritsen, David F. Gleich, Amin Saberi, and
Ying Wang.
In *Proceedings of the 9th IEEE International Conference on Data
Mining*, pages 705–710, December 2009.
[ bib |
DOI |
arXiv |
local ]

We propose a new distributed algorithm for sparse variants of the network alignment problem that occurs in a variety of data mining areas including systems biology, database matching, and computer vision. Our algorithm uses a belief propagation heuristic and provides near optimal solutions for an NP-hard combinatorial optimization problem. We show that our algorithm is faster and outperforms or nearly ties existing algorithms on synthetic problems, a problem in bioinformatics, and a problem in ontology matching. We also provide a unified framework for studying and comparing all network alignment solvers.

A Monte Carlo method for solving unsteady adjoint equations.
*Journal paper*.
Qiqi Wang, David F. Gleich, Amin Saberi, Nasrollah Etemadi, and
Parviz Moin.
*Journal of Computational Physics*, 227(12):6184–6205, June
2008.
[ bib |
DOI |
local ]

Traditionally, solving the adjoint equation for unsteady problems involves solving a large, structured linear system. This paper presents a variation on this technique and uses a Monte Carlo linear solver. The Monte Carlo solver yields a forward-time algorithm for solving unsteady adjoint equations. When applied to computing the adjoint associated with Burgers' equation, the Monte Carlo approach is faster for a large class of problems while preserving sufficient accuracy.

Approximating personalized PageRank with minimal use of webgraph data.
*Journal paper*.
David F. Gleich and Marzia Polito.
*Internet Mathematics*, 3(3):257–294, December 2007.
[ bib |
DOI |
local ]

In this paper, we consider the problem of calculating fast and accurate approximations to the personalized PageRank score of a webpage. We focus on techniques to improve speed by limiting the amount of web graph data we need to access. Our algorithms provide both the approximation to the personalized PageRank score as well as guidance in using only the necessary information—and therefore sensibly reduce not only the computational cost of the algorithm but also the memory and memory bandwidth requirements. We report experiments with these algorithms on web graphs of up to 118 million pages and prove a theoretical approximation bound for all. Finally, we propose a local, personalized web-search system for a future client system using our algorithms.

Using polynomial chaos to compute the influence of multiple random surfers in
the PageRank model.
*Conference proceedings*.
Paul G. Constantine and David F. Gleich.
In Anthony Bonato and Fan Chung Graham, editors, *Proceedings of
the 5th Workshop on Algorithms and Models for the Web Graph (WAW2007)*,
volume 4863 of *Lecture Notes in Computer Science*, pages 82–95.
Springer, 2007.
[ bib |
DOI |
local ]

The PageRank equation computes the importance of pages in a web graph relative to a single random surfer with a constant teleportation coefficient. To be globally relevant, the teleportation coefficient should account for the influence of all users. Therefore, we correct the PageRank formulation by modeling the teleportation coefficient as a random variable distributed according to user behavior. With this correction, the PageRank values themselves become random. We present two methods to quantify the uncertainty in the random PageRank: a Monte Carlo sampling algorithm and an algorithm based on the truncated polynomial chaos expansion of the random quantities. With each of these methods, we compute the expectation and standard deviation of the PageRanks. Our statistical analysis shows that the standard deviations of the PageRanks are uncorrelated with the PageRank vector.
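
As a rough illustration of the Monte Carlo sampling idea described above, the following sketch draws the teleportation coefficient at random, solves PageRank for each draw, and reports the per-page mean and standard deviation. It is not the paper's implementation: the function names, the tiny example graph, and the uniform distribution on [0.6, 0.9] are all assumptions made for illustration, and the graph is assumed to have no dangling nodes.

```python
import random
from statistics import mean, pstdev


def pagerank(out_links, alpha, tol=1e-12, max_iter=10000):
    """Power iteration for x = alpha * P x + (1 - alpha) * v, uniform v.

    out_links[j] lists the pages that page j links to. Every page is
    assumed to have at least one out-link (no dangling nodes).
    """
    n = len(out_links)
    v = 1.0 / n
    x = [v] * n
    for _ in range(max_iter):
        y = [(1.0 - alpha) * v] * n
        for j, targets in enumerate(out_links):
            share = alpha * x[j] / len(targets)
            for i in targets:
                y[i] += share
        delta = sum(abs(a - b) for a, b in zip(x, y))
        x = y
        if delta < tol:
            break
    return x


def random_alpha_pagerank(out_links, n_samples=200, seed=0):
    """Monte Carlo estimate of the mean and standard deviation of PageRank
    when alpha is random. The uniform(0.6, 0.9) law is an arbitrary
    stand-in, not a distribution taken from the paper."""
    rng = random.Random(seed)
    samples = [pagerank(out_links, rng.uniform(0.6, 0.9))
               for _ in range(n_samples)]
    per_page = list(zip(*samples))  # one tuple of sampled values per page
    return [mean(c) for c in per_page], [pstdev(c) for c in per_page]


# Hypothetical 4-page graph used only for this example.
example_graph = [[1, 2], [2], [0], [0, 2]]
expectation, std_dev = random_alpha_pagerank(example_graph)
```

Each sampled PageRank vector sums to one, so the estimated expectation does as well; the standard deviations indicate how strongly each page's score depends on the choice of teleportation coefficient.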

Scalable computing with power-law graphs: Experience with parallel PageRank.
*Conference proceedings*.
David F. Gleich and Leonid Zhukov.
In *SuperComputing 2005*, November 2005.
Poster.
[ bib |
local |
.pdf ]

The World of Music: SDP embedding of high dimensional data.
*Conference proceedings*.
David F. Gleich, Leonid Zhukov, Matthew Rasmussen, and Kevin Lang.
In *Information Visualization 2005*, 2005.
Interactive Poster.
[ bib |
local |
.pdf ]

In this paper we investigate the use of Semidefinite Programming (SDP) optimization for high dimensional data layout and graph visualization. We developed a set of interactive visualization tools and used them on music artist ratings data from Yahoo!. The computed layout preserves a natural grouping of the artists and provides visual assistance for browsing large music collections.

Recommender systems research at Yahoo! Research Labs.
*Conference proceedings*.
Dennis Decoste, David F. Gleich, Tejaswi Kasturi, Sathiya Keerthi,
Omid Madani, Seung-Taek Park, David M. Pennock, Corey Porter, Sumit Sanghai,
Farial Shahnaz, and Leonid Zhukov.
In *Beyond Personalization*, San Diego, CA, January 2005.
Position Statement.
[ bib |
local ]

We describe some of the ongoing projects at Yahoo! Research Labs that involve recommender systems. We discuss problems and solutions related to recommender systems that are relevant to Yahoo!’s business.

An SVD based term suggestion and ranking system.
*Conference proceedings*.
David F. Gleich and Leonid Zhukov.
In *ICDM '04: Proceedings of the Fourth IEEE International
Conference on Data Mining (ICDM'04)*, pages 391–394, Brighton, UK, November
2004. IEEE Computer Society.
[ bib |
DOI |
local ]

In this paper, we consider the application of the singular value decomposition (SVD) to a search term suggestion system in a pay-for-performance search market. We propose a novel positive and negative refinement method based on orthogonal subspace projections. We demonstrate that SVD subspace-based methods: 1) expand coverage by reordering the results, and 2) enhance the clustered structure of the data. The numerical experiments reported in this paper were performed on Overture's pay-per-performance search market data.

# Technical reports

Three results on the PageRank vector: eigenstructure, sensitivity, and the
derivative.
*Conference proceedings*.
David F. Gleich, Peter Glynn, Gene H. Golub, and Chen Greif.
In Andreas Frommer, Michael W. Mahoney, and Daniel B. Szyld, editors,
*Web Information Retrieval and Linear Algebra Algorithms*, number 07071
in Dagstuhl Seminar Proceedings. Internationales Begegnungs- und
Forschungszentrum fuer Informatik (IBFI), Schloss Dagstuhl, Germany, 2007.
[ bib |
local |
http ]

The three results on the PageRank vector are preliminary but shed light on the eigenstructure of a PageRank modified Markov chain and what happens when changing the teleportation parameter in the PageRank model. Computations with the derivative of the PageRank vector with respect to the teleportation parameter show predictive ability and identify an interesting set of pages from Wikipedia.

The world of music: User ratings; spectral and spherical embeddings; map projections, David F. Gleich, Matthew Rasmussen, Kevin Lang, and Leonid Zhukov. Online report, 2006. [ bib | local | .pdf ]

In this paper we present an algorithm for layout and visualization of music collections based on similarities between musical artists. The core of the algorithm consists of a non-linear low dimensional embedding of a similarity graph constrained to the surface of a hyper-sphere. This approach effectively uses additional dimensions in the embedding. We derive the algorithm using a simple energy minimization procedure and show the relationships to several well known eigenvector based methods. We also describe a method for constructing a similarity graph from user ratings, as well as procedures for mapping the layout from the hyper-sphere to a 2d display. We demonstrate our techniques on Yahoo! Music user ratings data and a MusicMatch artist similarity graph.

Hierarchical directed spectral graph partitioning, David F. Gleich. Information Networks, Stanford University, Final Project, 2005, 2006. [ bib | local | .pdf ]

In this report, we examine the generalization of the Laplacian of a graph due to Fan Chung. We show that Fan Chung’s generalization reduces to examining one particular symmetrization of the adjacency matrix for a directed graph. From this result, the directed Cheeger bounds trivially follow. Additionally, we implement and examine the benefits of directed hierarchical spectral clustering empirically on a dataset from Wikipedia. Finally, we examine a set of competing heuristic methods on the same dataset.

Finite calculus: A tutorial for solving nasty sums, David F. Gleich. Combinatorics, Final Paper, Stanford University, 2004, 2005. [ bib | local ]

Topic identification in soft clustering using PCA and ICA, Leonid Zhukov and David F. Gleich. Online report, Yahoo! Research Labs, 2004. [ bib | local ]

Many applications can benefit from soft clustering, where each datum is assigned to multiple clusters with membership weights that sum to one. In this paper we present a comparison of principal component analysis (PCA) and independent component analysis (ICA) when used for soft clustering. We provide a short mathematical background for these methods and demonstrate their application to a sponsored links search listings dataset. We present examples of the soft clusters generated by both methods and compare the results.

David F. Gleich, Leonid Zhukov, and Pavel Berkhin. Fast parallel PageRank: A linear system approach. Technical Report YRL-2004-038, Yahoo! Research Labs, 2004. [ bib | local | software | .pdf ]

In this paper we investigate the convergence of iterative stationary and Krylov subspace methods for the PageRank linear system, including the convergence dependency on teleportation. We demonstrate that linear system iterations converge faster than the simple power method and are less sensitive to the changes in teleportation. In order to perform this study we developed a framework for parallel PageRank computing. We describe the details of the parallel implementation and provide experimental results obtained on a 70-node Beowulf cluster.
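
To make the "linear system" view concrete, the sketch below solves the same small PageRank problem two ways: with the power method, and with a Gauss-Seidel sweep on (I - αP)x = (1 - α)v. This is a minimal serial illustration, not the report's parallel framework; the function names and example graph are hypothetical, and the graph is assumed to have no dangling nodes and no self-loops.

```python
def pagerank_power(out_links, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power method: x <- alpha * P x + (1 - alpha) * v with uniform v."""
    n = len(out_links)
    v = 1.0 / n
    x = [v] * n
    for it in range(1, max_iter + 1):
        y = [(1.0 - alpha) * v] * n
        for j, targets in enumerate(out_links):
            share = alpha * x[j] / len(targets)
            for i in targets:
                y[i] += share
        delta = sum(abs(a - b) for a, b in zip(x, y))
        x = y
        if delta < tol:
            return x, it
    return x, max_iter


def pagerank_gauss_seidel(out_links, alpha=0.85, tol=1e-10, max_iter=1000):
    """Gauss-Seidel on (I - alpha * P) x = (1 - alpha) * v.

    Assumes no self-loops, so the diagonal of I - alpha * P is all ones.
    Each update uses the newest available values of the other entries.
    """
    n = len(out_links)
    v = 1.0 / n
    # in_links[i] holds (j, P_ij) for every page j that links to page i.
    in_links = [[] for _ in range(n)]
    for j, targets in enumerate(out_links):
        for i in targets:
            in_links[i].append((j, 1.0 / len(targets)))
    x = [v] * n
    for it in range(1, max_iter + 1):
        delta = 0.0
        for i in range(n):
            new = (1.0 - alpha) * v + alpha * sum(w * x[j] for j, w in in_links[i])
            delta += abs(new - x[i])
            x[i] = new
        if delta < tol:
            return x, it
    return x, max_iter


# Hypothetical 4-page graph used only for this example.
example_graph = [[1, 2], [2], [0], [0, 2]]
x_power, iters_power = pagerank_power(example_graph)
x_gs, iters_gs = pagerank_gauss_seidel(example_graph)
```

Both routines return the iteration count alongside the solution, so the two counts can be compared for different values of `alpha` on a graph of interest.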

MTF, BIT, and COMB: A guide to deterministic and randomized online algorithms for the list access problem, Kevin Andrew and David F. Gleich. Advanced Algorithms, Harvey Mudd College, Final Project, 2004. [ bib | local ]

In this survey, we discuss two randomized online algorithms for the list access problem. First, we review competitive analysis and show that the MTF algorithm is 2-competitive using a potential function. Then, we introduce randomized competitive analysis and the associated adversary models. We show that the randomized BIT algorithm is 7/4-competitive using a potential function argument. We then introduce the pairwise property and the TIMESTAMP algorithm to show that the COMB algorithm, a COMBination of the BIT and TIMESTAMP algorithms, is 8/5-competitive. COMB is the best known randomized algorithm for the list access problem.
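
For readers unfamiliar with the list access problem, the following sketch simulates the deterministic MTF (move-to-front) rule and totals its access cost. It assumes the common cost model in which accessing the item at position i costs i and moving the accessed item to the front is free; the function name and example are hypothetical, and the randomized BIT and COMB algorithms are not shown.

```python
def mtf_cost(initial_list, requests):
    """Simulate Move-To-Front on a request sequence.

    Accessing the item at 1-based position i costs i. After each access
    the requested item is moved to the front at no extra charge.
    Returns the total cost and the final list order.
    """
    items = list(initial_list)
    total = 0
    for r in requests:
        pos = items.index(r)             # 0-based position of the request
        total += pos + 1                 # access cost is the 1-based position
        items.insert(0, items.pop(pos))  # move the accessed item to the front
    return total, items


cost, final_order = mtf_cost(["a", "b", "c"], ["c", "c", "a", "b"])
# cost == 9 (3 + 1 + 2 + 3), final_order == ["b", "a", "c"]
```

The repeated request for `c` costs only 1 the second time, which is the locality that MTF exploits.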

Three methods for improving relevance in web search, Erin Bodine, David F. Gleich, Cathy Kurata, Jordan Kwan, Lesley Ward, and Daniel Fain. Clinic Report, Harvey Mudd College, May 9 2003. 102 pages. Includes fully documented program code on accompanying CD. [ bib | local ]

The 2002–2003 Overture clinic project evaluated and implemented three different methods for improving relevance ordering in web search. The three methods were bottom up micro information unit (MIU) analysis, top down MIU analysis, and proximity scoring. We ran these three methods on the top 200 web pages returned for each of 58 queries by an already existing algorithmic search engine. We used two metrics, precision and relevance ordering, to evaluate the results. Precision deals with how relevant the web page is for a given query, while relevance ordering is how well-ordered the returned results are. We evaluated the precision of each method and of the algorithmic search engine by hand. For relevance ordering, we recruited other humans to compare pages and used their decisions to generate an ideal ranking for each query. The results of each of our methods and of the algorithmic search engine are then compared to this ideal ranking vector using Kendall’s Tau. Our bottom up MIU analysis method achieved the highest precision score of 0.78 out of 1.00. In addition, bottom up MIU analysis received the second highest correlation coefficient (or relevance ordering score) of 0.107 while the algorithmic search engine received the highest correlation coefficient of 0.121. Interestingly, our proximity scoring method received high relevance ordering scores when the algorithmic search engine received low relevance ordering scores.
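
The relevance ordering score above is a Kendall's Tau correlation between a method's ranking and the ideal ranking. A minimal sketch of that statistic for two tie-free rankings of the same items follows; it is not the clinic project's code, and the function name and example pages are hypothetical.

```python
from itertools import combinations


def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau between two rankings of the same distinct items.

    Returns (concordant - discordant) / total pairs, which is 1.0 for
    identical orderings and -1.0 for exactly reversed orderings.
    Assumes no ties and at least two items.
    """
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    if set(pos_a) != set(pos_b) or len(pos_a) != len(ranking_a):
        raise ValueError("rankings must contain the same distinct items")
    if len(pos_a) < 2:
        raise ValueError("need at least two items")
    concordant = discordant = 0
    for x, y in combinations(ranking_a, 2):
        # The pair is concordant when both rankings order x and y the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


ideal = ["page1", "page2", "page3", "page4"]
method = ["page1", "page3", "page2", "page4"]
tau = kendall_tau(ideal, method)  # one swapped pair out of six: (5 - 1) / 6
```

Values near zero, such as the 0.107 and 0.121 reported above, indicate that a ranking agrees with the ideal ordering only slightly more often than it disagrees.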

Machine learning in computer chess: Genetic programming and KRK, David F. Gleich. Independent Study Report, Harvey Mudd College, 2003. [ bib | local ]

In this paper, I describe genetic programming as a machine learning paradigm and evaluate its results in attempting to learn basic chess rules. Genetic programming exploits a simulation of Darwinian evolution to construct programs. When applied to the King-Rook-King (KRK) chess endgame problem, genetic programming shows promising results in spite of a lack of significant chess knowledge.

# Other reviewed publications

These usually underwent some form of review

David F. Gleich and Paul G. Constantine.
Ranking web pages.
In Nicholas J. Higham, Mark R. Dennis, Paul Glendinning, Paul A.
Martin, Fadil Santosa, and Jared Tanner, editors, *The Princeton
Companion to Applied Mathematics*, pages 755–757. Princeton University
Press, Princeton, NJ, USA, 2015.
[ bib ]

Review of: Numerical algorithms for personalized search in self-organizing
information networks by Sep Kamvar, Princeton Univ. Press, 2010, 160pp.,
ISBN13: 978-0-691-14503-7.
*Journal paper*.
David F. Gleich.
*Linear Algebra and its Applications*, 435(4):908–909, 2011.
[ bib |
DOI |
local ]

# Ph.D. Theses

Just one, thankfully

David F. Gleich.
*Models and Algorithms for PageRank Sensitivity*.
PhD thesis, Stanford University, September 2009.
[ bib |
local |
.pdf ]

The PageRank model helps evaluate the relative importance of nodes in a large graph, such as the graph of links on the world wide web. An important piece of the PageRank model is the teleportation parameter α. We explore the interaction between α and PageRank through the lens of sensitivity analysis. Writing the PageRank vector as a function of α allows us to take a derivative, which is a simple sensitivity measure. As an alternative approach, we apply techniques from the field of uncertainty quantification. Regarding α as a random variable produces a new PageRank model in which each PageRank value is a random variable. We explore the standard deviation of these variables to get another measure of PageRank sensitivity. One interpretation of this new model shows that it corrects a small oversight in the original PageRank formulation. Both of the above techniques require solving multiple PageRank problems, and thus a robust PageRank solver is needed. We discuss an inner-outer iteration for this purpose. The method is low-memory, simple to implement, and has excellent performance for a range of teleportation parameters. We show empirical results with these techniques on graphs with over 2 billion edges.