Nesreen Ahmed, Purdue University

Jennifer Neville, Purdue University

**DESCRIPTION**

Network data appears in various domains, including social, communication, and information sciences. Analysis of such data is crucial for making inferences and predictions about these networks, and moreover, for understanding the different processes that drive their evolution. However, a major bottleneck to perform such analysis is the massive size of real-life networks, which makes modeling and analyzing these networks simply infeasible. Further, many networks such as social and communication networks are not even visible to the public due to privacy concerns, and other networks such as the Web, are only accessible via crawling. Therefore, to overcome the above challenges, researchers use network sampling overwhelmingly as a key statistical approach to select a sub-population of interest that can be studied thoroughly.

In this tutorial, we aim to explore the diverse methodologies and applications of network sampling. We will begin with a discussion of the problem setting in terms of objectives (such as, sampling a representative subgraph, sampling graphlets, etc.), population of interest (vertices, edges, motifs), and sampling methodologies (such as Metropolis-Hastings, random walk, and snowball sampling). We will then present a number of applications of these methods, and outline both the resulting opportunities and possible biases of the different methods in each application. Finally, we will outline other challenges such as sampling from network streams.

This tutorial will explore the unique opportunities and challenges for
predictive modeling with social network data. We will begin with a
description of the problem setting, including examples of various
applications of social network mining (e.g., targeted marketing,
on-line advertising, fraud detection). We will then present a number
of characteristics of social network data that differentiate it from
the traditional settings for inference and learning, and outline the
resulting opportunities for significantly improved inference and
learning. We will discuss specific techniques for capitalizing on each
of the opportunities in statistical models, and outline both
methodological issues and potential modeling pathologies that are
unique to network data.

**OUTLINE** (preliminary, subject to change)

This tutorial will begin by illustrating the problem setting with various examples of applications in data collection, structure analysis, parameter estimation, classification, A/B testing, and infor- mation diffusion. It will then present the different objectives, population of interest, and sampling methods.

- The tutorial will begin by an overview of network sampling and the various objectives.
- The tutorial will present an introduction to different sampling methodologies (direct sampling, snowball sampling, random walk sampling, importance sampling, MCMC sampling, Gibbs sampling), and will discuss the relation between uniform sampling and counting.
- Sampling to estimate network parameters--the tutorial will discuss the state-of-the-art net- work sampling methods that are used to estimate global properties and parameters of the network, or the node attributes.
- Sampling small subgraphs for network structure analysis--the tutorial will discuss the state- of-the-art network sampling methods that are used to sample frequent subgraphs, count triangles and graphlets.
- Sampling a representative subgraph--the tutorial will discuss the state-of-the-art network sampling methods used to sample a subgraph that preserves various topological properties of the network, such as degree distribution, community structure, and personalized pagerank. The tutorial will also discuss the use of MCMC techniques to reduce the sample bias.
- Sampling from network streams.
- Application of network sampling in machine learning and network mining tasks.

**ABOUT THE INSTRUCTORS**

Mohammad A. Hasan is an Assistant Professor of Computer Science at Indiana University-Purdue University, Indianapolis (IUPUI). Before that, he was a Senior Research Scientist at eBay Research Labs, San Jose, CA. He received a Ph.D. degree in Computer Science from Rensselaer Polytechnic Institute (RPI) in 2009, and an MS degree in Computer Science from the University of Minnesota, Twin Cities in 2002. His research interest focuses on developing novel algorithms in data mining, data management, information retrieval, machine learning, social network analysis, and bioinformatics. One of his particular interests is to develop algorithms for sampling small substructures from large networks. He developed methods for: (1) sampling frequent subgraphs from a graph database, (2) sampling triangles and graphlets from a large network, and (3) Sampling interesting subgraph patterns using interactive feedbacks, all using Markov Chain Monte Carlo (MCMC) sampling algorithm. His doctoral dissertation won the ACM SIGKDD doctoral dissertation award in 2010. He is also a recepient of NSF CAREER award in 2012.

Nesreen Ahmed is a 5th year Ph.D. student working with Jennifer Neville in the Computer Science Department at Purdue University. Her Ph.D research is focused on statistical network sampling and network stream sampling. She has worked on research developing machine learning algorithms for time series forecasting, statistical predictive analysis of social media, and digital marketing. She has worked as a research intern at Adobe ATL labs, a research assistant at the data mining and computer modeling center of excellence in Egypt, and a teaching assistant at Cairo University.

Jennifer Neville is an associate professor at Purdue University with a joint appointment in the Departments of Computer Science and Statistics. She received her PhD from the University of Massachusetts Amherst in 2006. In 2012, she was awarded an NSF Career Award, in 2008 she was chosen by IEEE as one of "AI's 10 to watch", and in 2007 was selected as a member of the DARPA Computer Science Study Group. Her research focuses on developing data mining and machine learning techniques for relational domains, including citation analysis, fraud detection, and social network analysis.