## Parametric Models to Support Statistical Hypothesis Testing over Graphs

Graphs provide a natural representation of real-world networks e.g. world-wide web, biological networks, social networks. Robust statistical models, which can accurately represent distributions over collections of graphs, are critical for principled quantitative investigation of networks and their properties. Specifically, since sampling distributions (either analytical or empirical) can be used to determine the likelihood of a given sample, statistical models facilitate hypothesis testing and anomaly detection (e.g., graphs with low likelihood can be flagged as anomalous). However, unlike metric spaces the space of graphs exhibits a combinatorial structure which poses significant theoretical and practical challenges which need to be overcome for accurate estimation and efficient inference.

This project investigates the interplay between choice of model representation, parameter estimation, and sampling/inference to develop statistical models that can accurately estimate (parametric) probability distributions over the space of graphs (i.e., graph populations). Specifically, the project focuses on a novel probabilistic graph model that generates graphs by quilting together a set of subgraphs sampled from simpler basis graph models and the application of the resulting model to (i) explore and define graph classes, (ii) detect anomalies and assess their significance, and (iii) investigate graph dynamics and formally characterize notions of temporal stationarity and dynamic evolution.

NSF Grant IIS-1219015: $245,920; co-PI; September 2012

## Algorithms for Sampling Similar Graphs Using Subgraph Signatures

Graphs and networks are a natural representation across a wide range of disciplines and domains. Statistical tools have recently been brought to bear on the analysis of graphs, yielding rich dividends in various application areas. The aim of this project is to use tools from statistics and graph theory to develop algorithms that generate similar graphs efficiently. Since graph data is often expensive to collect, it is desirable to synthetically generate graphs. To be widely applicable however, the generated graphs need to both preserve the semantics of the original data (i.e., be drawn from the same distribution) and be efficient to compute.

Two key questions form the core emphasis of the current project. First, how does one measure similarity between two graphs? Second, how can this notion of similarity be used to generate new graphs? On the topic of similarity, the project will investigate representations to preserve global properties, propose new, efficient, representations for signatures, and explore sampling techniques and their convergence behavior. On the topic of generation of new graphs, the project will develop an exponential random graph model using signatures, investigate feature selection via regularization, propose novel methods to sample from the exponential random graph model and novel techniques to produce proposal graphs, and provide rigorous empirical validation across a range of application areas.

NSF Grant IIS-0916686: $164,846; co-PI; Sept 2009

## Machine Learning Methods and Statistical Analysis Tools for Single Network Domains

In many real-world network domains, the data consists of a single, potentially infinite-sized network (e.g., the World Wide Web, Facebook). In these "single network" domains, an increase in data corresponds to acquiring a larger portion of the underlying network. Although estimation and inference methods from the field of *statistical relational learning* have been successfully applied in single-network domains, the algorithms were initially developed for populations of networks, and thus the theoretical foundation for learning and inference in single networks is scant. This work focuses on the development of robust statistical methods for single network domains---since many large network datasets about complex systems rarely have more than a few subnetworks available for model estimation and evaluation. Specifically, the aims of the project include (1) strengthening the theoretical foundation for learning in single network domains, (2) creating accurate methods for determining the significance of discovered patterns and features, (3) formulating novel model selection and evaluation methods, and (4) developing improved approaches for network learning and prediction based on the unique characteristics of single network domains.

NSF CAREER IIS-1149789: $496,638; PI; January 2012

## Emerging Frontiers of Science of Information

The overarching vision of the Center for Science of Information is to develop rigorous principles guiding the extraction, manipulation, and exchange of information, integrating elements of space, time, structure, semantics and context. Our mission is to advance science and technology through a new quantitative understanding of the representation, communication, and processing of information in biological, physical, social, and engineered systems.

NSF Science & Technology Center Grant CCF-0939370: $200,000; Senior Personnel; Aug 2010

## Towards Better Modeling of Communication Activity Dynamics in Large-Scale Online Social Networks

Online social networks (OSNs) have witnessed tremendous growth and popularity over the recent years. The huge success and increasing popularity of social networks makes it important to characterize and study their behavior in detail. Recent work in analyzing online social network data has focused primarily on either static social network structure or evolving social networks. However, popular OSNs sites provide mechanisms to form and maintain community over time by facilitating communication, content sharing, and other forms of activities.

This research will develop a suite of algorithmic and analytic methods to support the characterization and modeling of activity networks. In particular, we will conduct static and temporal characterization studies of social network activity, study sampling techniques that can preserve graph properties for different communication activity graphs, investigate the fundamental theoretical trade-offs between preserving different properties of the graph, and develop procedural modeling techniques to generate social network activity graphs to better represent the temporal dynamics and burstiness of activity patterns.

</p>

NSF Grant IIS-1017898: $248,226; PI; Sept 2010

## Toward Intrusion Tolerant Clouds

The goal of this project is to invent, develop and transition the replication and overlay messaging engines necessary to make public and private clouds resilient to sophisticated intrusion attacks. In addition to facilitating cloud infrastructure builders, the same technologies will also enable application builders to make their cloud applications more resilient to intrusions. The four components of our approach are: Intrusion- tolerant replication, intrusion-tolerant overlay messaging, ability to protect against massive intrusions through diversity, and the detection and prediction of coming attacks that, in turn, allow our replication and overlay messaging engines to automatically react and prepare in time.

DARPA Grant N66001124014: $367,028; co-PI; November 2011

## MAASCOM: Modeling, Analysis, and Algorithms for Stochastic Control of Multi-Scale Networks

Current networks security monitoring is often driven by reliance on threshold counts for traffic volumes or alerts. While useful, these do not exploit relationships among network flows. In this project, we are developing methods to learn statistical models for evolving network flows, exploiting not only the relational structure but also the temporal evolution of the network. Although several techniques for learning relational models from network data have been developed recently, research has focused primarily on the task of attribute and link prediction in static domains. In flow domains, the modeling task will need to take the temporal evolution of the data in account, while predicting not only the structure of the network (e.g., which flows are related) but also the occurrence of complex events throughout the network (e.g., a coordinated intrusion across multiple hosts). This necessitates that we move beyond previous assumptions of a static, well-structured domain and develop more elaborate relational models and algorithms.

More specifically, we are developing automated methods to infer the structure of the underlying flow network from flow streams, capitalizing on our experience modeling both static and dynamic relational domains. We aim to move significantly beyond current data mining and machine learning approaches, which would model the flows independently. Instead, we will exploit the relational dependencies hidden in the flow structure, which are induced by the shared underlying network topology and application goals. In addition, we are developing methods to identify the occurrence of complex events that are temporally and topologically distributed in the network. For example, a denial-of-service attack may consist of a single low-capacity flow from one source that diffuses slowly throughout the network and then results a set of distributed high- capacity flows to a target node. Although there are some current efforts to incorporate temporal dynamics into relational models, these approaches make strict Markov assumptions to make the models tractable. We conjecture that more elaborate dependency structures will be necessary to effectively identify distributed patterns in streaming flow data. Thus, instead of using Markov assumptions in our models, we will define and explore a limited model space of temporal-relational motifs.