Current Research Projects

Algorithms for Sampling Similar Graphs Using Subgraph Signatures

Graphs and networks are a natural representation across a wide range of disciplines and domains. Statistical tools have recently been brought to bear on the analysis of graphs, yielding rich dividends in various application areas. The aim of this project is to use tools from statistics and graph theory to develop algorithms that generate similar graphs efficiently. Since graph data is often expensive to collect, it is desirable to generate graphs synthetically. To be widely applicable, however, the generated graphs need to preserve the semantics of the original data (i.e., be drawn from the same distribution) and be efficient to compute. Two key questions form the core emphasis of the current project. First, how does one measure similarity between two graphs? Second, how can this notion of similarity be used to generate new graphs? On the topic of similarity, the project will investigate representations to preserve global properties, propose new, efficient representations for signatures, and explore sampling techniques and their convergence behavior. On the topic of generating new graphs, the project will develop an exponential random graph model using signatures, investigate feature selection via regularization, propose novel methods to sample from the exponential random graph model and novel techniques to produce proposal graphs, and provide rigorous empirical validation across a range of application areas.

NSF Grant IIS-0916686: $164,846; co-PI; Sept 2009

Machine Learning Techniques to Model the Impact of Relational Communication on Distributed Team Effectiveness

Although social science research has examined some relational aspects of distributed teams, the approaches that examine the impact of interpersonal communication on team effectiveness are limited by the statistical techniques used for analysis.
The goal of this project is to exploit recent advances in the field of machine learning to study relational communication flow in distributed group settings, analyze its impact on effectiveness, and understand the complex interdependencies among group members. We posit that it is the dependencies among team members that hold the key to understanding the organizational and relational processes that impact the success (or failure) of distributed teams. The project will take an iterative approach to theory formation and refinement based on a combined analysis of large-scale observational data and experimental studies, exploring both the local and global characteristics of communication and its impact on team properties. We will extract and analyze publicly available data from open-source software development projects (e.g., project mailing lists and bug reports) to develop joint models of effectiveness based on relational patterns of communication among members. We will then use the results of this analysis to develop targeted hypotheses for empirical evaluation in laboratory experiments.

NSF Grant SES-0823313: $205,311; PI; Aug 2008

Fusion and Analysis of Multi-Source Relational Data

Although there has been significant progress in recent years developing analysis tools for relational datasets, two critical technical barriers limit the applicability of the models. First, current techniques assume a single-source environment where the relationships among objects are completely observed in the training set. However, in many real-world domains, data is collected from and/or stored in separate sources: each source may record observations for only a subset of the entities or a subset of the relations. Second, current techniques assume unambiguous, precise relational information, but in many real-world domains the recorded link structure has uncertainty due to measurement error and/or source reliability.
Since relational models improve performance by propagating information throughout the relational graph structure, it is critical to incorporate these uncertainties into the analysis process to limit the influence of erroneous observations. The main elements of our project focus on three aspects of multi-source analysis: (1) data fusion, (2) learning from data with uncertainty and source information, and (3) evaluation of data and source quality. We will build on past work in entity resolution to develop algorithms to align data from multiple sources and infer the underlying network structure by exploiting the relational links in each data set. We will also extend the models and algorithms we have developed for relational domains to represent and reason with additional source information and confidence levels. Finally, we will develop a framework to reason about the quality of individual sources and the impact of noisy and/or biased link observations on model performance, exploiting our recent work on evaluation techniques that assess errors unique to relational domains. The primary contribution of the work will be the development of efficient and accurate methods to combine data from a variety of sources for knowledge discovery and decision-making support in relational domains.

DARPA/DSO Grant NBCH1080005: $499,877; PI; June 2008

MAASCOM: Modeling, Analysis, and Algorithms for Stochastic Control of Multi-Scale Networks

Current network security monitoring often relies on threshold counts for traffic volumes or alerts. While useful, these counts do not exploit relationships among network flows. In this project, we are developing methods to learn statistical models for evolving network flows, exploiting not only the relational structure but also the temporal evolution of the network.
Although several techniques for learning relational models from network data have been developed recently, research has focused primarily on the task of attribute and link prediction in static domains. In flow domains, the modeling task will need to take the temporal evolution of the data into account, while predicting not only the structure of the network (e.g., which flows are related) but also the occurrence of complex events throughout the network (e.g., a coordinated intrusion across multiple hosts). This necessitates that we move beyond previous assumptions of a static, well-structured domain and develop more elaborate relational models and algorithms. More specifically, we are developing automated methods to infer the structure of the underlying flow network from flow streams, capitalizing on our experience modeling both static and dynamic relational domains. We aim to move significantly beyond current data mining and machine learning approaches, which would model the flows independently. Instead, we will exploit the relational dependencies hidden in the flow structure, which are induced by the shared underlying network topology and application goals. In addition, we are developing methods to identify the occurrence of complex events that are temporally and topologically distributed in the network. For example, a denial-of-service attack may consist of a single low-capacity flow from one source that diffuses slowly throughout the network and then results in a set of distributed high-capacity flows to a target node. Although there are some current efforts to incorporate temporal dynamics into relational models, these approaches make strict Markov assumptions to make the models tractable. We conjecture that more elaborate dependency structures will be necessary to effectively identify distributed patterns in streaming flow data. Thus, instead of using Markov assumptions in our models, we will define and explore a limited model space of temporal-relational motifs.
MURI/ARO Grant W911NF-08-1-0238: $250,000; co-PI; May 2008

Mining Transaction Streams to Infer Semantic Relations

This project focuses on understanding and exploiting the information in large-scale, dynamic relational networks. In a growing number of relational domains, the data record temporal sequences of interactions among entities. For example, in online community sites such as facebook.com, members continuously visit other members' pages, accessing content, posting comments, and transferring files. These transactional use patterns could be analyzed to infer the nature and strength of relationships among members, which may in turn be exploited to improve personalization efforts, marketing strategies, and system design. The goal of the proposed work is to develop automated methods to infer semantic relationships among entities from streams of relational transactions. We conjecture that low-level interactions among entities provide evidence of latent high-level relationships between individuals, and that the patterns of interactions over time can be accurately and efficiently mined to identify relationships that confer homophily. For example, we may have communication events (e.g., phone calls, emails), data access/transfer events (e.g., web browsing, file access), or localization events (e.g., meetings, conferences). These low-level transactions among individuals are easy to observe and record, and although a single event does not (necessarily) indicate a meaningful relationship between the participating parties, repeated interactions over time do suggest a strong relationship.

Microsoft Research Gift: $50,000; PI; June 2007
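
The intuition in the transaction-stream project, that a single event says little while repeated interactions suggest a strong relationship, can be illustrated with a minimal Python sketch. It is not the project's method: the function name `tie_strengths`, the `(timestamp, a, b)` event format, and the `min_count` threshold are all assumptions made here for illustration, and a raw interaction count stands in for whatever relationship model the project actually develops.

```python
from collections import Counter

def tie_strengths(events, min_count=2):
    """Count interactions per unordered pair and keep the repeated ones.

    events: iterable of (timestamp, a, b) tuples (hypothetical format).
    min_count: assumed cutoff below which a pair is treated as incidental.
    """
    counts = Counter()
    for _timestamp, a, b in events:
        pair = tuple(sorted((a, b)))  # treat interactions as undirected
        counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Toy transaction stream: three ann/bob events, one ann/cat event.
events = [
    (1, "ann", "bob"),
    (2, "bob", "ann"),
    (3, "ann", "cat"),
    (4, "ann", "bob"),
]
strong = tie_strengths(events)
```

Here `strong` keeps only the pair that interacted repeatedly. A fuller treatment would also use the timestamps (for example, recency or regularity of contact), which this sketch ignores.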