Networks spill the beans
Lada Adamic, University of Michigan
Slides: pdf (8.5mb)
Network structure is shaped by underlying dynamical processes. In this talk I will illustrate how network topology can be tied to other co-evolving system properties in three different information sharing settings. In online question and answer forums, different network topologies correspond to varying alignment in expertise between askers and repliers, guiding the selection of algorithms for identifying expertise. In financial trading networks, network structure reflects the arrival of information in the market. And in information exchange networks, e.g. email and Twitter, network structure can be used to infer the diversity and novelty of the information being exchanged, without needing to look at the content. All three examples suggest that network structure can be rather revealing, once probed with suitable tools.
Lada A. Adamic is an associate professor in the School of Information at the University of Michigan. She is also affiliated with the Center for the Study of Complex Systems and EECS. Her research interests center on information dynamics in networks: how information diffuses, how it can be found, and how it influences the evolution of a network's structure. She worked previously in Hewlett-Packard's Information Dynamics Lab. Her projects have included identifying expertise in online question and answer forums, studying the dynamics of viral marketing, and characterizing the structure in blogs and other online communities.
Efficient graph mining
Karsten Borgwardt, Max Planck Institute
Slides: pdf (1mb)
Two fundamental algorithmic problems recur in almost all graph mining tasks: comparing graphs and searching over all pairs of nodes in a graph.
Graph comparison is a key step in detecting patterns in large networks, while searching pairs of nodes is fundamental to link prediction and related tasks.
The former is expensive because it often involves repeated (sub)graph isomorphism checks; the latter tends to be extremely expensive in large networks
because it scales quadratically with the number of nodes.
My research group has developed efficient solutions to these two problems: in Shervashidze and Borgwardt (NIPS 2009) and Shervashidze et al. (JMLR 2011), we present a method for graph comparison that scales only linearly in the number of edges of the graphs. In Achlioptas et al. (KDD 2011), we show how to search pairs of nodes in a graph in subquadratic time. I will present both algorithms in my talk, and I will explain why the latter may solve one of the central open problems in statistical genetics.
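The linear-time graph comparison method referenced here is the Weisfeiler-Lehman subtree kernel. A minimal Python sketch of the core idea follows; the toy graphs and the dict-based representation are illustrative, and real implementations compress the tuple signatures to integers for speed:

```python
from collections import Counter

def wl_relabel(adj, labels, iterations=2):
    """Weisfeiler-Lehman relabeling: each node's new label is its old label
    together with the sorted multiset of its neighbors' labels. Returns the
    counts of all labels seen across iterations (the graph's feature vector).
    adj: dict node -> list of neighbors; labels: dict node -> initial label."""
    feature_counts = Counter(labels.values())
    for _ in range(iterations):
        new_labels = {}
        for node, nbrs in adj.items():
            new_labels[node] = (labels[node],
                                tuple(sorted(labels[n] for n in nbrs)))
        labels = new_labels
        feature_counts.update(labels.values())
    return feature_counts

def wl_kernel(adj1, lab1, adj2, lab2, iterations=2):
    """Kernel value = dot product of the two graphs' label-count vectors.
    Each iteration touches every edge once, hence linear in the edge count."""
    c1 = wl_relabel(adj1, lab1, iterations)
    c2 = wl_relabel(adj2, lab2, iterations)
    return sum(c1[f] * c2[f] for f in c1 if f in c2)

# Two small graphs with identical node labels: a triangle and a path.
tri = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
labels = {0: "a", 1: "a", 2: "a"}
print(wl_kernel(tri, labels, path, dict(labels)))
```

The self-similarity wl_kernel(tri, ..., tri, ...) exceeds the cross-graph value, since the path's degree-1 endpoints produce signatures the triangle never generates.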
Karsten Borgwardt studied Computer Science and Biology at LMU Munich, the University of Oxford and NICTA Canberra. After his PhD at LMU Munich under the supervision of Prof. Dr. Hans-Peter Kriegel and Prof. Dr. Alex Smola, he spent a year as a postdoc with Prof. Dr. Zoubin Ghahramani at the University of Cambridge, before moving to the Max Planck Institutes in Tübingen, where he now heads a group of 10 members as a W2 research group leader (~associate professor). His awards include the Heinz-Schwärtzel-Award for Foundations of Computer Science in 2007, and the NIPS 2009 Outstanding Student Paper Award as supervisor and co-author.
Defined by Linear Combinations of Constrained Random Walks
William Cohen, Carnegie Mellon University
Slides: pdf (6.7mb)
Many datasets can be naturally encoded as a labeled directed graph:
for instance, scientific literature can be represented as a graph of
documents, terms, and meta-data, with edges corresponding to
containment of a term in a document, authorship of a document by a
person, and so on. One popular way of querying such a graph is via
proximity measures, such as Random Walk with Restart (RWR).
In this talk, we describe a novel learnable proximity measure based on
RWR. Instead of introducing one weight per edge label, as in most
prior work, we introduce one weight for each edge label sequence. In
this model proximity is defined by a weighted combination of simple
"random walk experts", each corresponding to conducting a random walk
constrained to follow a particular sequence of labeled edges.
Experiments on eight tasks using graphs based on literature from two
subdomains of biology show that the new learning method significantly
outperforms the prior methods. We extend the method to support two
additional types of experts to model intrinsic properties of entities:
"query-independent experts", which generalize the PageRank measure,
and "popular entity experts" which allow rankings to be adjusted for
particular entities that are especially important.
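The "random walk expert" idea above can be sketched concretely: each expert is a walk constrained to one sequence of edge labels, and proximity is a weighted sum over experts. The toy graph, edge labels, and path below are illustrative, not from the talk:

```python
from collections import defaultdict

def path_constrained_walk(edges, start, path):
    """Distribution over nodes reached from `start` by a random walk that
    must follow edges whose labels match `path` in order.
    edges: dict label -> dict node -> list of successor nodes."""
    dist = {start: 1.0}
    for label in path:
        nxt = defaultdict(float)
        adj = edges.get(label, {})
        for node, p in dist.items():
            nbrs = adj.get(node, [])
            for n in nbrs:
                nxt[n] += p / len(nbrs)  # split mass uniformly over matches
        dist = dict(nxt)
    return dist

def proximity(edges, start, target, experts):
    """Proximity = weighted combination of path experts (path, weight);
    the weights are what the learning method fits."""
    return sum(w * path_constrained_walk(edges, start, path).get(target, 0.0)
               for path, w in experts)

# Toy literature graph: papers cite papers, papers are written by people.
edges = {
    "cites": {"p1": ["p2", "p3"]},
    "written_by": {"p2": ["alice"], "p3": ["bob"]},
}
experts = [(("cites", "written_by"), 1.0)]
print(proximity(edges, "p1", "alice", experts))  # 0.5: half the walk mass reaches alice
```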
Finally, we present experiments in which we use this approach to learn
relationships in the ontology of NELL, a wide-coverage, large-scale
information extraction system for web data. We show that these types
of learnable "proximity measures" are general enough to accurately
model a significant number of real-world relations, and that they
outperform an alternative technique that learns to model relations
based on more traditional logical rules.
This is joint work with Ni Lao and Tom Mitchell.
William Cohen received his bachelor's degree in Computer Science
from Duke University in 1984, and a PhD in Computer Science from
Rutgers University in 1990. From 1990 to 2000 Dr. Cohen worked at AT&T
Bell Labs and later AT&T Labs-Research, and from April 2000 to May
2002 Dr. Cohen worked at Whizbang Labs, a company specializing in
extracting information from the web. Dr. Cohen is President of the
International Machine Learning Society, an Action Editor for the
Journal of Machine Learning Research, and an Action Editor for the
journal ACM Transactions on Knowledge Discovery from Data. He is also
an editor, with Ron Brachman, of the AI and Machine Learning series of
books published by Morgan Claypool. In the past he has also served as
an action editor for the journal Machine Learning, the journal
Artificial Intelligence, and the Journal of Artificial Intelligence
Research. He was General Chair for the 2008 International Machine
Learning Conference, held July 6-9 at the University of Helsinki, in
Finland; Program Co-Chair of the 2006 International Machine Learning
Conference; and Co-Chair of the 1994 International Machine Learning
Conference. Dr. Cohen was also the co-Chair for the 3rd Int'l AAAI
Conference on Weblogs and Social Media, which was held May 17-20, 2009
in San Jose, and is the co-Program Chair for the 4th Int'l AAAI
Conference on Weblogs and Social Media, which will be held May 23-26
at George Washington University in Washington, D.C. He is an AAAI
Fellow, and in 2008, he won the SIGMOD "Test of Time" Award for the
most influential SIGMOD paper of 1998. Dr. Cohen's research interests
include information integration and machine learning, particularly
information extraction, text categorization and learning from large
datasets. He holds seven patents related to learning, discovery,
information retrieval, and data integration, and is the author of more
than 100 publications.
Beyond link prediction and collaborative filtering: Learning to
Charles Elkan, University of California, San Diego
Slides: pdf (1.8mb)
Predicting links between people in a social network is an example of a
task that requires representing and learning degrees of affinity
between pairs of items. Predicting the rating that a person will give
to a movie is another task in the same class. As a third example, in
reinforcement learning the long-term reward Q(s,a) can be viewed as
the degree of affinity between the action a and the state s.
In this talk I will present a general solution to the problem of
learning to predict degrees of affinity. The new method induces
latent features to represent items, and combines the latent features
with any available explicit features to make accurate predictions.
Experiments with a dataset of pairs of members of the eHarmony dating
service show that learning affinity in this way is much faster and
more accurate than learning a distance metric, the best previous
approach. For link prediction, the new method achieves state of the
art accuracy over a wider range of datasets than any previous method.
(Joint work with Aditya Menon.)
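A minimal sketch of the general idea, not the authors' exact model: learn latent item vectors jointly with weights on explicit pair features, by stochastic gradient descent on squared error. All data, names, and hyperparameters below are illustrative:

```python
import random

def train_affinity(pairs, n_items, n_feats, dim=4, lr=0.05, epochs=200, seed=0):
    """Fit affinity(i, j) = U[i].U[j] + w.x_ij by SGD on squared error.
    pairs: list of (i, j, x_ij, y) with x_ij an explicit feature list
    and y the observed affinity (link, rating, etc.)."""
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_items)]
    w = [0.0] * n_feats
    for _ in range(epochs):
        for i, j, x, y in pairs:
            pred = (sum(a * b for a, b in zip(U[i], U[j]))
                    + sum(wk * xk for wk, xk in zip(w, x)))
            err = pred - y
            for k in range(dim):          # update both latent vectors
                ui, uj = U[i][k], U[j][k]
                U[i][k] -= lr * err * uj
                U[j][k] -= lr * err * ui
            for k in range(n_feats):      # update explicit-feature weights
                w[k] -= lr * err * x[k]
    def predict(i, j, x):
        return (sum(a * b for a, b in zip(U[i], U[j]))
                + sum(wk * xk for wk, xk in zip(w, x)))
    return predict

# Toy data: items 0-3, one explicit pair feature ("shared interest").
data = [(0, 1, [1.0], 1.0), (0, 2, [0.0], 0.0),
        (1, 3, [1.0], 1.0), (2, 3, [0.0], 0.0)]
predict = train_affinity(data, n_items=4, n_feats=1)
```

The point of the abstract is that the latent features pick up structure the explicit features miss, while the explicit term exploits side information when it is available.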
Charles Elkan is a professor in the Department
of Computer Science and Engineering at the University of California,
San Diego. In 1998/99 he was Visiting Associate Professor at Harvard
University. Dr. Elkan is known for his research in machine learning,
data mining, and computational biology. In particular, the MEME
algorithm he developed with his Ph.D. student Tim Bailey has been used
in over 2000 published research projects in biology. Dr. Elkan has
won several best paper awards and data mining contests, and some of
his graduate students have become leaders at companies including
Google, Yahoo, Zillow, and IBM, while others have held faculty
positions at Columbia University, the University of Washington, and
other universities inside and outside the U.S.
The Physicists' Contribution to Clustering: Modularity Maximization and Beyond
Michelle Girvan, University of Maryland
Slides: pdf (1.8mb)
In the last several years, the problem of network clustering or community finding has received considerable attention from both computer scientists and physicists. In this talk, I will review approaches introduced in the physics literature and also discuss two new studies that extend these approaches to more nuanced clustering problems. The first of these studies introduces an alternative measure of community structure that is functionally motivated (in contrast to the structurally motivated modularity measure). The second study focuses on the problem of finding near-optimal network partitions. These studies suggest a need to approach different classes of network clustering problems with different tools. In addition, they highlight the utility of collaborative efforts between physicists and computer scientists.
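For reference, the structurally motivated modularity measure that the talk contrasts with can be computed directly from its definition, Q = (1/2m) Σ_ij (A_ij - k_i k_j / 2m) δ(c_i, c_j). The toy graph below is illustrative:

```python
def modularity(adj, communities):
    """Newman-Girvan modularity of a partition.
    adj: dict node -> set of neighbors (undirected);
    communities: dict node -> community id."""
    degrees = {v: len(nbrs) for v, nbrs in adj.items()}
    two_m = sum(degrees.values())  # 2m: each edge counted from both ends
    q = 0.0
    for i in adj:
        for j in adj:
            if communities[i] == communities[j]:
                a_ij = 1.0 if j in adj[i] else 0.0
                q += a_ij - degrees[i] * degrees[j] / two_m
    return q / two_m

# Two triangles joined by a single edge: the natural two-community split.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
comms = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
print(modularity(adj, comms))  # 5/14: well above 0, as expected for this split
```

Putting every node in a single community gives Q = 0, which is the baseline the measure is built around.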
Michelle Girvan is an assistant professor in the Department of Physics and the Institute for Physical Science and Technology at the University of Maryland. Her research interests focus on the theory and applications of complex networks. In terms of theory, she has worked extensively on defining and identifying modularity in networks. She currently works on applications of network theory to gene regulation and the spread of ideas. Girvan received her Ph.D. in physics from Cornell University and received undergraduate degrees in physics and math at MIT. Before joining the faculty at the University of Maryland, she was a postdoctoral fellow at the Santa Fe Institute.
Mining Structured Data on the Web
Alon Halevy, Google Research
Slides: pdf (5.7mb)
The World-Wide Web is the biggest repository of factual information available to mankind. Facts on the Web are typically embedded in text and sometimes presented in more structured form such as tables, lists, or cards. As a result, since the WWW emerged, it has been mined for knowledge bases that can be used for query answering and higher-level inference. One of the key challenges in mining structured data from the Web is to separate the raw data from the surrounding presentation and to recover the semantics of the data which may not be explicit in the presentation.
I will describe some of the challenges we face in the WebTables Project, where we are trying to extract high-quality data tables from HTML tables on the Web. In particular, I will focus on the opportunities that arise when we consider the relationships between disparate tables on the Web and aggregate features of data across the entire Web.
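The first step, separating raw cell data from the surrounding presentation, can be sketched with Python's standard HTML parser. This toy extractor is mine, not WebTables code, and it does not attempt the semantic recovery the talk discusses:

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Pull the raw cell grid out of HTML <table> markup."""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th") and self.row is not None:
            self.cell = []

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            self.row.append(" ".join(c for c in self.cell if c))
            self.cell = None
        elif tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None

html = ("<table><tr><th>City</th><th>Pop.</th></tr>"
        "<tr><td>Oslo</td><td>0.7M</td></tr></table>")
p = TableExtractor()
p.feed(html)
print(p.rows)  # [['City', 'Pop.'], ['Oslo', '0.7M']]
```

Deciding whether such a grid is actually a relational table (rather than page layout), and what its columns mean, is the hard problem the abstract describes.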
Alon Halevy heads the Structured Data Management Research group at Google. Prior to that, he was a professor of Computer Science at the University of Washington in Seattle, where he founded the database group. In 1999, Dr. Halevy co-founded Nimble Technology, one of the first companies in the Enterprise Information Integration space, and in 2004 he founded Transformic Inc., a company that created search engines for the deep web and was acquired by Google. Dr. Halevy is a Fellow of the Association for Computing Machinery, received the Presidential Early Career Award for Scientists and Engineers (PECASE) in 2000, and was a Sloan Fellow (1999-2000). He received his Ph.D. in Computer Science from Stanford University in 1993. He is also a bit of a coffee nut.
Latent Factor Models for Networks and Relational Arrays
Peter Hoff, University of Washington
Slides: pdf (2.5mb)
Latent factor models are a useful tool for describing network and relational patterns. These models are based on well-known matrix decomposition methods, and thus have a rich mathematical framework upon which to build. Additionally, the parameters in these models are easy to interpret: Roughly speaking, a latent factor model posits that the relationship between two nodes is a function of observed and unobserved (latent) characteristics, potentially in addition to contextual factors.
In this talk I will give a brief review of latent factor models and show ways in which the basic models can be extended. In particular, I will show how to extend the basic model to accommodate multiway relational arrays, such as networks measured over time or the measurement of several relational variables on a common nodeset. I will also show how nodal clustering can be incorporated into the latent factor model, allowing for flexible notions of what it means for a set of nodes to constitute a group or cluster.
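One generic way to write the basic symmetric latent factor model (the notation here is a common rendering, not necessarily the speaker's) is:

```latex
y_{i,j} = f\left( \beta^{\top} x_{i,j} + u_i^{\top} \Lambda u_j + \varepsilon_{i,j} \right)
```

where $y_{i,j}$ is the observed relation between nodes $i$ and $j$, $x_{i,j}$ collects observed pair covariates with coefficients $\beta$, $u_i$ is node $i$'s vector of latent characteristics, $\Lambda$ is a diagonal matrix of factor weights, and $\varepsilon_{i,j}$ is noise. The low-rank term $u_i^{\top} \Lambda u_j$ is where the matrix decomposition machinery enters, and it is the piece the talk extends to multiway arrays and to clustered nodes.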
Peter Hoff is an Associate Professor of Statistics and Biostatistics at
the University of Washington. He has developed a variety of Bayesian
methods for multivariate data, including covariance and copula
estimation, cluster analysis, mixture modeling and social network
analysis. He is on the editorial board of the Annals of Applied
Statistics, JRSSB and Statistical Science.
Extracting insight from large networks: implications of small-scale and large-scale structure
Michael Mahoney, Stanford University
Slides: pdf (16.3mb)
Recent empirical work has demonstrated that, although there often exists
meaningful "small scale" structure (e.g., clustering structure around a
single individual at the size-scale of roughly 100 individuals) in large
social and information networks, analogous "large scale" structure (e.g.,
meaningful or statistically significant properties of tens or hundreds of
thousands of individuals) either is lacking entirely or is of a form that
is extremely difficult for traditional machine learning and data analysis
tools to identify reliably. For example, there are often small clusters
which provide a "bottleneck" to diffusions (e.g., the diffusive dynamic
processes that arise in viral marketing applications and tipping-point
models of network dynamics); on the other hand, there are
typically no large clusters that have analogous bottlenecks, and thus
diffusion-based metrics (and the associated machine learning and data
analysis tools) are simply much less meaningful (or discriminative or
useful) if one is interested in analyzing the network at large sizes.
This empirical work will be briefly reviewed, and its implications for
extracting insight from large networks with popular machine learning and
data analysis tools will be discussed.
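The "bottleneck" quality of a cluster is typically quantified by its conductance: the fraction of edge endpoints in the set that cross to the rest of the graph. A minimal sketch, with an illustrative toy graph:

```python
def conductance(adj, node_set):
    """Conductance of a node set S: cut edges divided by min(vol(S), vol(V-S)),
    where vol is the sum of degrees. Low conductance means a random walk (or
    diffusion) entering S tends to get stuck there, i.e. a good cluster.
    adj: dict node -> set of neighbors (undirected)."""
    s = set(node_set)
    cut = sum(1 for v in s for n in adj[v] if n not in s)
    vol_s = sum(len(adj[v]) for v in s)
    vol_rest = sum(len(adj[v]) for v in adj) - vol_s
    denom = min(vol_s, vol_rest)
    return cut / denom if denom else 0.0

# Two triangles joined by a single edge: {0, 1, 2} is a low-conductance set.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
print(conductance(adj, {0, 1, 2}))  # 1/7: one cut edge over volume 7
```

The empirical finding sketched in the abstract is that sets with conductance this low exist at the ~100-node scale in real social networks but, relative to their size, not at the scale of tens of thousands of nodes.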
Michael Mahoney is at Stanford University. His research interests center around algorithms for very large-scale statistical data analysis, including both theoretical and applied aspects of problems in scientific and Internet domains. His current research interests include geometric network analysis; developing approximate computation and regularization methods for large informatics graphs; and applications to community detection, clustering, and information dynamics in large social and information networks. He has also worked on randomized matrix algorithms and their applications to genetics, medical imaging, and Internet problems. He has been a faculty member at Yale University and a researcher at Yahoo, and his PhD was in computational statistical mechanics at Yale University.