III: Small: Towards Scalable and Comprehensive Uncertain Data Management
Sponsor: National Science Foundation
This material is based upon work supported by the National Science Foundation
under Grant No. IIS-09168724.
Due to the importance of uncertain data for a large number of applications, there has been significant recent interest in database support for uncertain data. Existing work in this area includes new models for uncertain data, prototype implementations, and efficient query processing algorithms for specific types of queries. Despite the recent efforts, several important aspects of uncertain data management remain
unexplored. This project addresses two of these areas: Query Optimization and Support for Non-Relational Operators.
The first goal is efficient execution of queries over uncertain data. As with traditional data, efficient execution is necessary for uncertain data management systems to be viable. However, because correct results are harder to guarantee and operations over probability distributions are CPU-intensive, this goal is both critical and challenging. In this project, automatic query optimizations are developed through query rewriting rules that involve probability threshold operators, together with corresponding access methods and cost estimation functions.
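As a rough illustration of what a rewriting rule around a probability threshold operator can look like, the sketch below filters rows whose uncertain attribute satisfies a predicate with probability at least tau, and then applies one simple pruning rule to a conjunction. Everything here is an assumption made for illustration: the discrete-distribution representation, the names `threshold_select`, `conj_threshold_naive`, and `conj_threshold_pushed`, the sample `readings`, and the assumption that the two attributes are independent within a row. It is not the project's actual rule set.

```python
from typing import Callable, Dict, List

# Hypothetical representation: an uncertain value is a discrete distribution
# mapping each possible value to its probability.
Dist = Dict[float, float]
Row = Dict[str, object]

def prob(dist: Dist, pred: Callable[[float], bool]) -> float:
    """Probability that a value drawn from dist satisfies pred."""
    return sum(p for v, p in dist.items() if pred(v))

def threshold_select(rows: List[Row], attr: str,
                     pred: Callable[[float], bool], tau: float) -> List[Row]:
    """Keep rows whose attribute satisfies pred with probability >= tau."""
    return [r for r in rows if prob(r[attr], pred) >= tau]

def conj_threshold_naive(rows, a, pred_a, b, pred_b, tau):
    """Evaluate both predicates on every row.
    Assumes attributes a and b are independent within a row."""
    return [r for r in rows
            if prob(r[a], pred_a) * prob(r[b], pred_b) >= tau]

def conj_threshold_pushed(rows, a, pred_a, b, pred_b, tau):
    """Rewritten plan: since P(A and B) <= P(A), any row with P(A) < tau
    cannot qualify, so the single-predicate filter is applied first."""
    survivors = threshold_select(rows, a, pred_a, tau)
    return [r for r in survivors
            if prob(r[a], pred_a) * prob(r[b], pred_b) >= tau]

def is_hot(v: float) -> bool:
    return v >= 25.0

def is_humid(v: float) -> bool:
    return v >= 50.0

# Illustrative sample data (not from the project).
readings = [
    {"id": 1, "temp": {20.0: 0.5, 30.0: 0.5}, "hum": {40.0: 0.9, 60.0: 0.1}},
    {"id": 2, "temp": {20.0: 0.9, 30.0: 0.1}, "hum": {40.0: 0.5, 60.0: 0.5}},
    {"id": 3, "temp": {30.0: 1.0}, "hum": {60.0: 1.0}},
]
```

Both plans return the same rows. The rewritten one evaluates the second predicate only on rows that survive the first filter, which is the kind of saving a cost estimation function would need to weigh against the selectivity of each predicate.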
The second goal concerns non-relational operators, where the difficulty of handling uncertainty has been noted in many domains. The project aims to advance the capability of tracking the exact impact of uncertain inputs as data is processed by arbitrary programs, leveraging advanced techniques from the area of program analysis. A key problem with traditional Monte Carlo-based solutions lies in correctly identifying independence in the output of Monte Carlo simulations. Data lineage tracing, which identifies the set of inputs used to compute an output value, is used to address this challenge. Furthermore, an approach based on program dependence tracing is devised to trace the propagation of uncertainty during the execution of arbitrary binary code. The technique does not rely on Monte Carlo simulations, and does not require access to source code or domain knowledge.
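The idea of using data lineage to reason about independence can be sketched in a few lines. In this minimal sketch, each value carries the set of uncertain inputs it was computed from, and two outputs are treated as possibly dependent only if their lineage sets overlap. The `Traced` class and the helpers `source` and `may_depend` are hypothetical names introduced here; the sketch assumes the inputs themselves are mutually independent and tracks only data flow through arithmetic, not control dependences or binary-level execution as described above.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet

@dataclass(frozen=True)
class Traced:
    """A value together with the names of the uncertain inputs it depends on."""
    value: float
    lineage: FrozenSet[str]

    def _combine(self, other, op: Callable[[float, float], float]) -> "Traced":
        # Plain constants carry no uncertainty, so their lineage is empty.
        if not isinstance(other, Traced):
            other = Traced(float(other), frozenset())
        # The result depends on every input that either operand depends on.
        return Traced(op(self.value, other.value), self.lineage | other.lineage)

    def __add__(self, other) -> "Traced":
        return self._combine(other, lambda x, y: x + y)

    def __mul__(self, other) -> "Traced":
        return self._combine(other, lambda x, y: x * y)

def source(name: str, value: float) -> Traced:
    """Wrap an uncertain input so that its name seeds the lineage."""
    return Traced(value, frozenset({name}))

def may_depend(a: Traced, b: Traced) -> bool:
    """True if a and b share at least one uncertain input."""
    return not a.lineage.isdisjoint(b.lineage)

# Illustrative use with three hypothetical inputs.
x, y, z = source("x", 1.0), source("y", 2.0), source("z", 3.0)
a = x + y   # depends on x and y
b = z * 2   # depends on z only
c = y * z   # depends on y and z
```

Here `a` and `b` share no inputs, so under the stated assumption they can be treated as independent, while `a` and `c` both depend on `y` and cannot.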
Goals, Objectives, and Targeted Activities
The goals of the project are to enhance uncertain data management
beyond relational operators and provide efficient evaluation of relational queries over uncertain data.
Publications
W. N. Sumner, T. Bao, X. Zhang, and S. Prabhakar,
Coalescing Executions for Fast Uncertainty Analysis, International Conference on Software Engineering (ICSE),
Hawaii, 2011.
T. Bao, Y. Zheng, and X. Zhang,
White Box Sampling in Uncertain Data Processing Enabled by Program Analysis, Object Oriented Programming, Systems, Languages and Applications (OOPSLA),
Tucson, AZ, 2012.
T. Bao and X. Zhang,
On-the-fly Detection of Instability Problems in Floating-Point Program Execution, Object Oriented Programming, Systems, Languages and Applications (OOPSLA),
Indianapolis, IN, 2013.
Tao Bao, Yunhui Zheng, Zhiqiang Lin, Xiangyu Zhang, and Dongyan Xu, "Strict Control Dependence and Its Effect on Dynamic Information Flow Analyses", Proc. of the International Symposium on Software Testing and Analysis (ISSTA), 2010.
Chris Mayfield, Jennifer Neville, and Sunil Prabhakar, "ERACER: A Database Approach for Statistical Inference and Data Cleaning", Proc. of the ACM International Conference on Management of Data (SIGMOD), Indianapolis, USA, 2010.
Yinian Qi, Rohit Jain, Sarvjeet Singh, and Sunil Prabhakar, "Threshold Query Optimization for Uncertain Data", Proc. of the ACM International Conference on Management of Data (SIGMOD), 2010.
Any opinions, findings, and conclusions or recommendations expressed in this
material are those of the author(s) and do not necessarily reflect
the views of the National Science Foundation.
Last Modified by Sunil Prabhakar
on 23rd February, 2014.