# Jennifer Neville

## NSF Career Grant

### Machine Learning Methods and Statistical Analysis Tools for Single Network Domains

Machine learning researchers focus on two distinct learning scenarios for structured network data (i.e., where there are statistical dependencies among the attributes of linked nodes). In the first scenario, the domain consists of a population of structured examples (e.g., chemical compounds) and we can reason about learning algorithms asymptotically, as the number of structured examples increases. In the second scenario, the domain consists of a single, potentially infinite-sized network (e.g., the World Wide Web). In these "single network" domains, an increase in data corresponds to acquiring a larger portion of the underlying network. Even when a set of network samples is available for learning and prediction, the samples correspond to subnetworks drawn from the same underlying network and thus may be dependent.

Although estimation and inference methods from the field of *statistical relational learning* have been successfully applied in single-network domains, the algorithms were initially developed for populations of networks, and thus the theoretical foundation for learning and inference in single networks is scant. This work focuses on the development of robust statistical methods for single network domains---since many large network datasets about complex systems rarely have more than a few subnetworks available for model estimation and evaluation. Specifically, the aims of the project include (1) strengthening the theoretical foundation for learning in single network domains, (2) creating accurate methods for determining the significance of discovered patterns and features, (3) formulating novel model selection and evaluation methods, and (4) developing improved approaches for network learning and prediction based on the unique characteristics of single network domains.

The research will enhance our understanding of the mechanisms that influence the performance of network analysis methods and drive the development of novel methods for complex network domains. Expanding the applicability of machine learning techniques for single network domains could have a transformational impact across a broad range of areas (e.g., psychology, communications, education, political science) where current methods limit research to the investigation of processes in dyad or small-group settings. Also, the project results will serve as an example application of computer science in the broader network science context, which will attract and retain students who might not otherwise be engaged by conventional CS topics.

NSF Career Grant IIS-1149789: $496,640; PI; Jan 2012

*Activities*

**Predictive Modeling in Networks:**
Ensemble methods have been widely studied as a means of reducing classification error by combining multiple models for prediction. Much of the earlier work on ensemble techniques has focused on i.i.d. domains (where objects are independent and identically distributed, and models use exact inference techniques). While there has been some recent investigation of ensembles for relational domains, these previous works have a number of limitations: (1) there is no theoretical analysis showing the mechanism by which ensembles reduce model error in relational domains, (2) they focus on reducing only one type of error (due to either learning or inference), and (3) they assume the datasets contain multiple relational graphs.
We have formulated a theoretical framework to compare the errors made by different relational ensembles and to show why some methods do better than others. Moreover, we have developed a relational ensemble framework that combines a relational ensemble learning approach with a relational ensemble inference approach. The first method is used to learn the ensemble and focuses on reducing error due to variance in learning; the second is used to apply the ensemble for inference and focuses on reducing error due to variance in inference. We show that the unique combination of these methods offers the largest classification accuracy improvement compared to various other approaches, on both synthetic and real-world datasets. Furthermore, our ensemble model is applicable in domains that do not necessarily have multiple relational graphs, thereby overcoming that limitation of existing relational ensembles; in fact, our framework is applicable in both single-graph and multi-graph settings.

H. Eldardiry and J. Neville. **An Analysis of How Ensembles of Collective Classifiers Improve Predictions in Graphs**. In *Proceedings of the 21st ACM International Conference on Information and Knowledge Management*, 2012. [PDF]
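As a toy illustration of the two variance-reduction mechanisms described above, the sketch below (our own construction for exposition, not the paper's algorithm) bags a simple relational-neighbor classifier over bootstrap resamples of the observed labels, and averages the members' predictions at every collective-inference step. All names here are illustrative.

```python
import random

def relational_neighbor_predict(adj, labels, node):
    """Score a node as the mean of its neighbors' known labels/estimates
    (a simple relational-neighbor rule)."""
    votes = [labels[n] for n in adj[node] if labels.get(n) is not None]
    return sum(votes) / len(votes) if votes else 0.5

def ensemble_collective_inference(adj, observed, n_models=5, n_iters=10, seed=0):
    """Toy ensemble of collective classifiers: each member starts from a
    bootstrap resample of the observed labels (diversity against LEARNING
    variance); at every collective-inference step the members' predictions
    for the unlabeled nodes are averaged and propagated back to all members
    (reducing INFERENCE variance)."""
    rng = random.Random(seed)
    unlabeled = [v for v in adj if v not in observed]
    members = []
    for _ in range(n_models):
        keys = list(observed)
        # Bootstrap the observed labels to inject diversity across members.
        sample = {k: observed[k] for k in rng.choices(keys, k=len(keys))}
        labels = {v: None for v in unlabeled}
        labels.update(sample)
        members.append(labels)
    avg = {v: 0.5 for v in unlabeled}
    for _ in range(n_iters):
        # Average the members' current predictions, then propagate the
        # averaged estimates back to every member before the next step.
        avg = {v: sum(relational_neighbor_predict(adj, m, v) for m in members) / n_models
               for v in unlabeled}
        for m in members:
            m.update(avg)
    return avg
```

On a small graph with two labeled positives, the averaged predictions for the unlabeled nodes reflect both the bootstrap diversity and the per-step averaging.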

Relational and Gaussian Markov networks have both been applied successfully for classification in relational settings. However, since Gaussian Markov networks model joint distributions over a *continuous* label space, it is difficult to use them to reason about uncertainty in discrete labels. On the other hand, relational Markov networks model probability distributions over a *discrete* label space, but since they condition on the graph structure, the marginal probability for an instance will vary based on the structure of the subnetwork observed around the instance. This implies that the marginals will not be *identical* across instances, which can sometimes result in poor prediction performance. To combine the strengths of Gaussian Markov networks' continuous representation with the strengths of relational Markov networks for learning the dependencies among relational attributes, we propose a novel latent relational model based on *copulas*, which allows us to make predictions in a discrete label space while ensuring identical marginals and at the same time incorporating desirable properties of modeling relational dependencies in a continuous space. The copula approach makes the intuitively reasonable assumption of identical marginals an explicit constraint in the model. This provides a mechanism to remove the "independent" but not the "identical" from the conventional IID assumption when we model dependencies in relational datasets. While such a model is conceptually desirable for network data, to the best of our knowledge, copulas have not previously been applied to predictive tasks in real-world network datasets.

R. Xiang and J. Neville. **Collective Inference for Network Data with Copula Latent Markov Networks**. In *Proceedings of the 6th ACM International Conference on Web Search and Data Mining*, 2013. [PDF]
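A minimal numerical sketch of the core copula property (our toy construction, not the model in the paper): two linked nodes share correlated Gaussian latents, and thresholding both at the same quantile yields dependent discrete labels whose marginals are identical by construction.

```python
import math
import random
from statistics import NormalDist

def sample_copula_pair(rho, p_pos, n, seed=0):
    """Gaussian-copula sketch: draw correlated standard-normal latents
    (z1, z2) with correlation rho, then threshold both at the SAME quantile
    so each label has an identical marginal P(y = 1) = p_pos while the pair
    stays dependent through rho."""
    rng = random.Random(seed)
    t = NormalDist().inv_cdf(1 - p_pos)  # shared threshold => identical marginals
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        # Construct z2 with correlation rho to z1.
        z2 = rho * z1 + math.sqrt(1 - rho * rho) * rng.gauss(0, 1)
        pairs.append((int(z1 > t), int(z2 > t)))
    return pairs
```

With `rho=0.8` and `p_pos=0.3`, both empirical marginals come out near 0.3 while label agreement substantially exceeds the 0.58 rate that independence would give.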

In network classification models, a typical assumption is that all edges are known when computing the joint distribution of the attributes in the network. Specifically, for each entity in the network, learning algorithms assume that the neighbors of the entity (and their attributes) are known. Such settings include social networks (e.g., Facebook) where a person's friends are known, enabling relational prediction of a person's attributes based on characteristics of their friends. However, in many network domains, relationship information may not be readily available for all nodes in the network due to privacy or legal restrictions, or because a cost is associated with acquiring the connections of a node. For example, it is unreasonable to expect to be able to access the phone records of an entire population when attempting to identify a handful of individuals involved in illegal or fraudulent activities. To address the issue of classification in domains where network structure can be sampled at a cost, we formulated the "Active Exploration" (AE) task, which aims to identify all items in a network with a desired trait (i.e., positive labels) given only partial information about the network. The AE process iteratively queries for labels or network structure within a limited budget; thus, making accurate predictions prior to each query is critical to maximizing the number of positives gathered. However, the targeted AE query process produces partially observed networks that can create difficulties for predictive modeling. In particular, we have shown that these partial networks can exhibit extreme label correlation bias, which makes it difficult for conventional relational learning methods to accurately estimate relational parameters. To overcome this issue, we model the joint distribution of possible edges and labels to improve learning and inference.

J. Pfeiffer III, J. Neville, and P. Bennett. **Active Exploration in Networks: Using Probabilistic Relationships for Learning and Inference**. In *Proceedings of the 23rd ACM International Conference on Information and Knowledge Management*, 2014. [PDF]
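The AE query loop itself can be sketched as follows. This is a simplified stand-in (our construction): a myopic fraction-of-positive-neighbors scorer replaces the paper's probabilistic relational model, and querying a node reveals both its label and its edges.

```python
def active_exploration(graph, labels, seeds, budget):
    """Sketch of the Active Exploration loop: starting from seed nodes,
    repeatedly (1) score every known-but-unqueried node by the fraction of
    positive queried neighbors, and (2) query the top-scoring node's label
    and edges, until the budget is spent. Returns the positives found.
    `labels` plays the role of the costly label oracle."""
    known = set(seeds)
    queried = {}
    positives = []
    for _ in range(budget):
        frontier = [v for v in known if v not in queried]
        if not frontier:
            break
        def score(v):
            nbrs = [u for u in graph.get(v, []) if u in queried]
            return (sum(queried[u] for u in nbrs) / len(nbrs)) if nbrs else 0.0
        best = max(frontier, key=score)
        queried[best] = labels[best]       # costly label query
        known.update(graph.get(best, []))  # structure revealed on query
        if labels[best] == 1:
            positives.append(best)
    return positives
```

On a small chain where positives cluster together, the myopic policy quickly walks the positive region while spending little budget on negatives.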

In addition, we investigated the performance of more general relational machine learning (RML) methods. Due to the statistical correlations between linked nodes in the network, many RML methods focus on predicting node features (i.e., labels) using the network relationships. However, many domains are comprised of a single, partially-labeled network. Thus, relational versions of Expectation Maximization (i.e., R-EM), which jointly learn parameters and infer the missing labels, can outperform methods that learn parameters from the labeled data and apply them for inference on the unlabeled nodes. Although R-EM methods can significantly improve predictive performance in networks that are densely labeled, they do not achieve the same gains in sparsely labeled networks and can perform worse than RML methods. We have shown that the fixed-point methods R-EM uses for approximate learning and inference result in errors that prevent convergence in sparsely labeled networks. To address this, we propose two methods that do not experience this problem. First, we develop a Relational Stochastic EM (R-SEM) method, which uses stochastic parameters that are not as susceptible to approximation errors. Then we develop a Relational Data Augmentation (R-DA) method, which integrates over a range of stochastic parameter values for inference. R-SEM and R-DA can use any collective RML algorithm for learning and inference in partially labeled networks.

J. Pfeiffer III, J. Neville, and P. Bennett. **Composite Likelihood Data Augmentation for Within-Network Statistical Relational Learning**. In *Proceedings of the 14th IEEE International Conference on Data Mining*, 2014. [PDF]
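The stochastic-EM idea can be illustrated with a deliberately simplified sketch (a single homophily parameter stands in for the paper's composite-likelihood model; all names are ours): the E-step samples the missing labels rather than propagating fixed-point expectations, and predictions average the samples across rounds.

```python
import random

def relational_stochastic_em(adj, observed, n_rounds=30, seed=0):
    """Sketch of Relational Stochastic EM: each E-step SAMPLES the missing
    labels from the current conditional p(y_v = 1 | neighbors) instead of
    propagating fixed-point expectations (which can accumulate approximation
    error in sparsely labeled networks); the M-step re-estimates a single
    homophily parameter from the completed network. Final predictions
    average the sampled labels across rounds."""
    rng = random.Random(seed)
    unlabeled = [v for v in adj if v not in observed]
    labels = dict(observed)
    for v in unlabeled:
        labels[v] = rng.choice([0, 1])   # random initialization
    counts = {v: 0 for v in unlabeled}
    homophily = 0.7                      # initial parameter guess
    for _ in range(n_rounds):
        # E-step: sample each missing label given its neighbors.
        for v in unlabeled:
            nbrs = adj[v]
            pos = sum(labels[u] for u in nbrs) / len(nbrs) if nbrs else 0.5
            p = homophily * pos + (1 - homophily) * (1 - pos)
            labels[v] = 1 if rng.random() < p else 0
        # M-step: re-estimate homophily as the fraction of label-matching edges.
        match = total = 0
        for v in adj:
            for u in adj[v]:
                total += 1
                match += labels[v] == labels[u]
        homophily = match / total
        for v in unlabeled:
            counts[v] += labels[v]
    return {v: counts[v] / n_rounds for v in unlabeled}
```

Because the parameters are resampled each round rather than driven to a fixed point, a single bad approximation step cannot lock the procedure into a degenerate labeling.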

We note that existing relational machine learning (RML) approaches have limitations that prevent their application in large scale domains. First, semi-supervised methods for RML do not fully utilize all the unlabeled instances in the network. Second, the collective inference procedures necessary to jointly infer the missing labels are generally viewed as too expensive to apply in large scale domains. In our recent work, we address each of these limitations. We analyze the effect of full semi-supervised RML and find that collective inference methods can introduce considerable bias into predictions. We correct this by implementing a maximum entropy constraint on the inference step, forcing the predictions to have the same distribution as the observed labels. Next, we outline a massively scalable variational inference algorithm for large scale relational network domains. We extend this inference algorithm to incorporate the maximum entropy constraint, proving that it only requires a constant amount of overhead while remaining massively parallel.

J. Pfeiffer III, J. Neville, and P. Bennett. **Overcoming Relational Learning Biases to Accurately Predict Preferences in Large Scale Networks**. In *Proceedings of the 24th International World Wide Web Conference (WWW)*, 2015. [PDF]
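The maximum-entropy correction can be sketched as a one-parameter calibration (our simplification of the constrained-inference idea, not the paper's variational algorithm): shift all predicted logits by a common offset so the mean prediction over the unlabeled nodes matches the positive rate observed in the labeled data.

```python
import math

def max_entropy_calibrate(probs, target_rate, iters=50):
    """Sketch of the maximum-entropy correction: find a single logit offset
    b such that the mean of sigmoid(logit(p) + b) over the unlabeled nodes
    equals the positive rate observed in the labeled data, counteracting
    the bias collective inference can introduce. Solved by bisection on b;
    assumes every p lies strictly in (0, 1)."""
    def shifted(p, b):
        return 1 / (1 + math.exp(-(math.log(p / (1 - p)) + b)))
    def mean_rate(b):
        return sum(shifted(p, b) for p in probs) / len(probs)
    lo, hi = -20.0, 20.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean_rate(mid) < target_rate:
            lo = mid      # need a larger offset
        else:
            hi = mid
    b = (lo + hi) / 2
    return [shifted(p, b) for p in probs]
```

The shift is monotone, so the ranking of the nodes is preserved; only the distribution of the predictions is pulled back toward the observed label proportion.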

**Generative Models of Networks:**
Research in statistical relational learning focuses on methods to exploit correlations among the attributes of linked nodes to predict user characteristics with greater accuracy. Concurrently, research on generative graph models has primarily focused on modeling network structure without attributes, producing several models that are able to replicate structural characteristics of networks such as power law degree distributions or community structure. However, there has been little work on how to generate networks with real-world structural properties and correlated attributes. We have developed the Attributed Graph Model (AGM) framework to jointly model network structure and vertex attributes. Our framework learns the attribute correlations in the observed network and exploits a generative graph model, such as the Kronecker Product Graph Model (KPGM) or the Chung-Lu Graph Model (CL), to compute structural edge probabilities. AGM then combines the attribute correlations with the structural probabilities to sample networks conditioned on attribute values, while preserving the expected edge probabilities and degrees of the input graph model.

J. Pfeiffer III, S. Moreno, T. La Fond, J. Neville, and B. Gallagher. **Attributed Graph Models: Modeling network structure with correlated attributes**. In *Proceedings of the 23rd International World Wide Web Conference (WWW)*, 2014. [PDF]
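A heavily simplified AGM-style sketch (our construction): propose edges from a Chung-Lu structural model and accept each proposal in proportion to an attribute-affinity table, which here is a toy stand-in for correlations that would be estimated from an observed network.

```python
import random

def agm_sample(degrees, attrs, affinity, n_edges, seed=0):
    """Simplified AGM-style sampler: propose edges from a Chung-Lu model
    (P(i, j) proportional to d_i * d_j) and accept each proposal with
    probability proportional to affinity[(x_i, x_j)], so sampled networks
    reflect both the expected degree structure and the attribute
    correlation. `affinity` is a toy stand-in for learned correlations."""
    rng = random.Random(seed)
    nodes = list(degrees)
    total = sum(degrees.values())
    weights = [degrees[v] / total for v in nodes]
    max_aff = max(affinity.values())
    edges = set()
    while len(edges) < n_edges:
        i, j = rng.choices(nodes, weights=weights, k=2)  # Chung-Lu proposal
        if i == j or (i, j) in edges or (j, i) in edges:
            continue
        # Accept-reject on the attribute affinity of the endpoint pair.
        if rng.random() < affinity[(attrs[i], attrs[j])] / max_aff:
            edges.add((i, j))
    return edges
```

With a homophilous affinity table (same-attribute pairs weighted 1.0, cross pairs 0.2), the sampler preferentially emits same-attribute edges while endpoint selection still follows the degree weights.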

In addition, we have investigated scalable generative graph models that focus on modeling distributions of (unattributed) graphs that match real-world network properties and scale to large datasets. In network analysis, the "assortativity" statistic, defined as the correlation between the degrees of linked nodes in the network, is often used to summarize the joint degree distribution. The measure can distinguish between types of networks---social networks commonly exhibit positive assortativity, in contrast to biological or technological networks that are typically disassortative. Despite this, little work has focused on scalable graph models that capture assortativity in networks. In our recent work, extending the ideas above, we developed a generative graph model that explicitly estimates and models the joint degree distribution. Our Binned Chung-Lu method accurately captures both the joint degree distribution and assortativity, while still matching characteristics such as the degree distribution and clustering coefficients. Further, our approach has subquadratic learning and sampling procedures that enable scaling to large, real-world networks.

J. Moore, S. Mussmann, J. Pfeiffer III, and J. Neville. **Incorporating Assortativity and Degree Dependence into Scalable Network Models**. In *Proceedings of the 29th AAAI Conference on Artificial Intelligence*, 2015. [PDF]
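The binning idea can be sketched in a few lines (an illustrative reduction, not the paper's estimator): estimate the joint distribution of degree-bin pairs over the observed edges, then sample new edges by drawing a bin pair from that joint and picking endpoints within each bin proportional to degree.

```python
import random
from collections import Counter

def binned_chung_lu(observed_edges, degrees, bins, n_edges, seed=0):
    """Sketch of the Binned Chung-Lu idea: learn the joint distribution of
    (degree-bin, degree-bin) pairs over the observed edges, then sample new
    edges by drawing a bin pair from that joint and choosing endpoints
    within each bin proportional to degree. This preserves the degree
    correlation (assortativity) that plain Chung-Lu sampling loses."""
    rng = random.Random(seed)
    # Assign each node to the smallest bin boundary covering its degree.
    bin_of = {v: min(b for b in bins if degrees[v] <= b) for v in degrees}
    joint = Counter((bin_of[u], bin_of[v]) for u, v in observed_edges)
    pairs, weights = zip(*joint.items())
    members = {}
    for v in degrees:
        members.setdefault(bin_of[v], []).append(v)
    edges = []
    while len(edges) < n_edges:
        bu, bv = rng.choices(pairs, weights=weights)[0]   # draw a bin pair
        u = rng.choices(members[bu], weights=[degrees[x] for x in members[bu]])[0]
        v = rng.choices(members[bv], weights=[degrees[x] for x in members[bv]])[0]
        if u != v:
            edges.append((u, v))
    return edges
```

If the observed graph is perfectly assortative (low-degree nodes only link to low-degree nodes), every sampled edge stays within a degree bin, which plain degree-weighted sampling would not guarantee.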

**Hypothesis Testing Across Networks:**
The recent interest in networks has fueled a great deal of research on the analysis and modeling of graphs. However, much of this work has focused on analyzing the structure of a single large network drawn from a specific domain (e.g., Facebook). Although some of the work has compared the structure of networks from various domains (e.g., comparing biological networks to social networks), and these efforts have empirically explored the similarities (or differences) among their network statistics, there has been little effort to assess whether the observed differences are statistically significant. The lack of statistical hypothesis testing methods for comparing across networks stems from previous work having focused on assessing the significance of patterns within a network (e.g., reciprocity) rather than comparing across networks. Here we exploit recent work on mixed Kronecker Product Graph Models (mKPGMs)---which accurately capture the structural characteristics of real-world networks and their natural variation---to develop a principled approach for hypothesis testing across networks. Specifically, we developed an algorithm to test the hypothesis that two graphs are drawn from the same distribution by learning a statistical model for P(G) and accepting/rejecting the hypothesis based on likelihood.

S. Moreno and J. Neville. **Network Hypothesis Testing Using Mixed Kronecker Product Graph Models**. In *Proceedings of the 13th IEEE International Conference on Data Mining*, 2013. [PDF]
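The testing recipe can be demonstrated with a deliberately crude stand-in for the mKPGM (an Erdős–Rényi model with a single edge probability; this is our illustration of the procedure, not the paper's model): fit P(G) to the first graph, bootstrap the null distribution of the log-likelihood by sampling from the fitted model, and reject if the second graph's log-likelihood falls in the tails.

```python
import math
import random

def graph_hypothesis_test(g1_edges, g2_edges, n_nodes, n_boot=200, alpha=0.05, seed=0):
    """Sketch of likelihood-based hypothesis testing across networks, with
    an Erdos-Renyi stand-in for the mKPGM: (1) fit P(G) to graph 1 (here a
    single edge probability), (2) sample bootstrap graphs from the fitted
    model to obtain the null distribution of the log-likelihood, (3) reject
    'same distribution' if graph 2's log-likelihood falls outside the
    (alpha/2, 1 - alpha/2) quantiles. Returns True when we fail to reject."""
    rng = random.Random(seed)
    n_pairs = n_nodes * (n_nodes - 1) // 2
    p = len(g1_edges) / n_pairs                  # fitted edge probability
    def loglik(m):                               # m = number of edges
        return m * math.log(p) + (n_pairs - m) * math.log(1 - p)
    null = sorted(loglik(sum(rng.random() < p for _ in range(n_pairs)))
                  for _ in range(n_boot))
    lo = null[int(alpha / 2 * n_boot)]
    hi = null[int((1 - alpha / 2) * n_boot) - 1]
    return lo <= loglik(len(g2_edges)) <= hi
```

A graph with an edge count close to the fitted model's expectation is accepted, while one with triple the density is rejected; the mKPGM replaces the single-parameter model with one that also captures natural structural variation.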

**Network Sampling:**
We have explored a natural progression of computational models for sampling networks, from static to streaming. The majority of previous work has focused on sampling from static graphs, which is the simplest and least restrictive problem setup. In addition to the static setting, we also investigated the more challenging issues of sampling from disk-resident graphs and graph streams. This led us to propose a spectrum of computational models for network sampling, including: (1) static graphs, (2) large graphs, and (3) streaming graphs. This spectrum provides insights not only into the complexity of the computational models (i.e., static vs. streaming), but also into the complexity of the algorithms designed for each scenario. A subtle but important consequence is that any algorithm designed to work over graph streams is also applicable in the simpler computational models (i.e., static graphs). However, the converse is not true: algorithms designed for sampling a static graph that fits in memory may not be generally applicable to graph streams. Within this framework, we investigated the task of sampling representative subgraphs from static and streaming graphs. We proposed a family of sampling methods based on the concept of graph induction, which are space-efficient and run in a single pass over the edges. Overall, our family of proposed sampling methods generalizes across the full spectrum of computational models (from static to streaming) while efficiently preserving many graph properties for both streaming and static graphs.

N. Ahmed, J. Neville, and R. Kompella. **Network Sampling: From Static to Streaming Graphs**. *ACM Transactions on Knowledge Discovery from Data*, 8, 2014. [PDF]

N. Ahmed, N. Duffield, J. Neville, and R. Kompella. **Graph Sample and Hold: A Framework for Big-Graph Analytics**. In *Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining*, 2014. [PDF]
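A one-pass sample-and-induce sketch in the spirit of the graph-induction idea (our simplified construction, not the published algorithms): maintain a bounded node reservoir and keep every streamed edge whose endpoints are both currently in the sample.

```python
import random

def stream_sample_induce(edge_stream, n_sample, seed=0):
    """One-pass sketch of sampling with graph induction over an edge
    stream: a reservoir holds up to n_sample nodes; unseen endpoints enter
    via reservoir sampling, and an edge is KEPT whenever both endpoints are
    currently resident (the induction step). Space is proportional to the
    sample size and each edge is processed exactly once."""
    rng = random.Random(seed)
    reservoir = {}        # node -> slot index
    slots = []            # slot index -> node
    seen = 0
    kept_edges = []
    def add_node(v):
        nonlocal seen
        seen += 1
        if len(slots) < n_sample:
            reservoir[v] = len(slots)
            slots.append(v)
        else:
            j = rng.randrange(seen)
            if j < n_sample:  # evict a uniform-random resident node
                old = slots[j]
                del reservoir[old]
                kept_edges[:] = [e for e in kept_edges if old not in e]
                reservoir[v] = j
                slots[j] = v
    for u, v in edge_stream:
        for x in (u, v):
            if x not in reservoir:
                add_node(x)
        if u in reservoir and v in reservoir:
            kept_edges.append((u, v))  # induce the edge on the node sample
    return slots[:], kept_edges
```

When the reservoir is large enough to hold every node, the procedure degenerates to returning the full graph, confirming the streaming algorithm also covers the static setting.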

We exploited our understanding of sampling from graph streams to develop efficient methods for counting graphlets (subgraph motifs) exactly in static networks. The use of graphlet counts as network signatures has recently seen tremendous success and impact in a variety of domains, but there are few methods to assess the frequencies of these subgraph patterns in large networks. Specifically, existing methods do not scale to large networks with millions of nodes and edges, which impedes the application of graphlets to new problems that require large-scale network analysis. To address this problem, we developed a fast, efficient, and parallel algorithm for counting graphlets of size k ∈ {3, 4} nodes that takes only a fraction of the time required by current methods. Our proposed graphlet counting algorithms leverage a number of proven combinatorial arguments for different graphlets. For each edge, we count a few graphlets, and from these counts, along with the combinatorial arguments, we obtain the exact counts of the others in constant time.

N. Ahmed, J. Neville, R. Rossi, and N. Duffield. **Efficient Graphlet Counting for Large Networks**. In *Proceedings of the 15th IEEE International Conference on Data Mining*, 2015. [PDF]
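The per-edge combinatorial trick is easiest to see for 3-node graphlets. The sketch below (a minimal illustration of the counting style, covering only k = 3) counts the triangles on each edge directly via a neighborhood intersection, then derives the open-wedge count in constant time from the endpoint degrees.

```python
def count_3_graphlets(adj):
    """Per-edge combinatorial counting of 3-node graphlets, given an
    adjacency dict of neighbor SETS (no self-loops). For each edge (u, v),
    the triangle count is |N(u) & N(v)|; the open-wedge (2-path) count then
    follows in constant time as (d_u - 1) + (d_v - 1) - 2 * tri. Totals are
    corrected for multiplicity: each triangle is seen from 3 edges, each
    wedge from 2."""
    triangles = wedges = 0
    seen = set()
    for u in adj:
        for v in adj[u]:
            if (v, u) in seen:      # process each undirected edge once
                continue
            seen.add((u, v))
            tri = len(adj[u] & adj[v])                        # closed: triangles on (u, v)
            wed = (len(adj[u]) - 1) + (len(adj[v]) - 1) - 2 * tri  # open wedges through (u, v)
            triangles += tri
            wedges += wed
    return triangles // 3, wedges // 2
```

On a triangle with a pendant vertex attached, the counts come out to one triangle and two open wedges, matching direct enumeration.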

Relational classification has been extensively studied recently due to its applications in social, biological, technological, and information networks. Much of the work in relational learning has focused on analyzing input data that comprise a *single* network. Although machine learning researchers have considered the issue of how to sample training and test sets from the input network (for evaluation), the mechanisms which are used to *construct* the input networks have largely been ignored. In most cases, the input network has itself been sampled from a larger target network (e.g., Facebook) and often the researcher is unaware of *how* the input network was constructed or what impact that may have on evaluation of the relational models. Since the goal in evaluating relational classification algorithms is to accurately assess their performance on the larger target network, it is critical to understand what impact the initial sampling method may have on our estimates of classification accuracy.
We considered different sampling methods and systematically studied their impact on evaluation of relational classification.

N. Ahmed, J. Neville, and R. Kompella. **Network Sampling Designs for Relational Classification**. In *Proceedings of the 6th International AAAI Conference on Weblogs and Social Media*, 2012. [PDF]
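A toy demonstration of why the construction method matters (our illustration, not one of the paper's experiments): on a star graph whose high-degree hub is the only positive node, sampling nodes via random edges always captures the hub and therefore skews the class distribution the classifier is evaluated on, while uniform node sampling does not.

```python
import random

def node_sample(nodes, k, rng):
    """Uniform random node sampling: every node is equally likely."""
    return set(rng.sample(nodes, k))

def edge_sample_nodes(edges, k, rng):
    """Node set induced by uniform random edge sampling: the endpoints of
    the sampled edges, which over-represents high-degree nodes."""
    return {v for e in rng.sample(edges, k) for v in e}
```

On the star graph, the edge-induced sample's positive rate (1/6 here) exceeds the population rate (1/21), illustrating how the sampling design alone shifts the evaluated label distribution.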