III: Small: Towards Scalable and Comprehensive Uncertain Data Management

Sponsor: National Science Foundation

This material is based upon work supported by the National Science Foundation under Grant No: IIS-09168724

Principal Investigators:
Sunil Prabhakar
Xiangyu Zhang

Graduate Students:

Project Summary

Due to the importance of uncertain data for a large number of applications, there has been significant recent interest in database support for uncertain data. Existing work in this area includes new models for uncertain data, prototype implementations, and efficient query processing algorithms for specific types of queries. Despite the recent efforts, several important aspects of uncertain data management remain unexplored. This project addresses two of these areas: Query Optimization and Support for Non-Relational Operators. The first goal is about efficient execution of uncertain data queries. As with traditional data, efficient execution is necessary for ensuring the viability of uncertain data management systems. However, due to the complications of ensuring correct results, and the need for CPU-intensive operations over probability distributions, the goal is critical and challenging. In this project, automatic query optimizations are developed, through query rewriting rules that involve probability threshold operators, corresponding access methods, and cost estimation functions. The difficulty of handling uncertainty when dealing with non-relational operators has been expressed in many domains. The project aims to advance the capability of tracking the exact impact of uncertain inputs as data is processed by arbitrary programs, leveraging advanced techniques from the area of program analysis. A key problem with traditional Monte Carlo based solutions lies in correctly identifying independence in the output of Monte Carlo simulations. Data lineage tracing, which identifies the set of inputs used to compute an output value, is used to address the challenge. Furthermore, a program dependence tracing based approach is devised to trace the propagation of uncertainty during execution of arbitrary binary code. The technique does not rely on Monte Carlo simulations, and does not require access to source code or domain knowledge.

Goals, Objectives, and Targeted Activities

The goals of the project are to enhance uncertain data management beyond relational operators and provide efficient evaluation of relational queries over uncertain data.

Publications

W. N. Sumner, T. Bao, X. Zhang, and S. Prabhakar, Coalescing Executions for Fast Uncertainty Analysis ,the International Conference of Software Engineering (ICSE), Hawaii, 2011 PDF
T. Bao, Y. Zheng, and X. Zhang, White Box Sampling in Uncertain Data Processing Enabled by Program Analysis, Object Oriented Programming, Systems, Languages and Applications (OOPSLA), Tucson, AZ, 2012. PDF
T. Bao and X. Zhang, On-the-fly Detection of Instability Problems in Floating-Point Program Execution, Object Oriented Programming, Systems, Languages and Applications (OOPSLA), Indianapolis, IN, 2013. PDF
Tao Bao, Yunhui Zheng, Zhiqiang Lin, Xiangyu Zhang, Dongyan Xu, "Strict Control Dependence and Its Effect on Dynamic Information Flow Analyses", 2010. In proceedings of the International Symposium on Software Testing and Analysis Bibliography: (ISSTA)
Chris Mayfield Jennifer Neville Sunil Prabhakar, "ERACER: A Database Approach for Statistical Inference and Data Cleaning" ,Indianapolis, USA, 2010. Proc. of the ACM International Conference on Management of Data (SIGMOD)
Yinian Qi Rohit Jain Sarvjeet Singh Sunil Prabhakar, "Threshold Query Optimization for Uncertain Data", 2010. Proc. of the ACM International Conference on Management of Data (SIGMOD)

Disclaimer

Any opinions, findings and conclusions or recomendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

Last Modified by Sunil Prabhakar on 23rd February, 2014.