Xiangyu's Projects - Uncertain Data Processing

Better Uncertain Data Processing via Program Analysis

Uncertain data processing is becoming increasingly important. In scientific computation, data are collected through instruments or sensors that may be exposed to harsh environmental conditions, which introduces errors. Computations over these data may therefore draw faulty conclusions. For example, a protein may be mistakenly classified as a cancer indicator when a parameter of the program used to process the experimental data is altered only slightly. Such parameters are uncertain because biologists set them based on their experience. These mistakes can be very costly because the faulty results may guide expensive follow-up wet-bench experiments.

Traditionally, uncertainty analysis is conducted on the underlying mathematical models. However, modern data processing uses more complex models and relies on computers and programs. In this project, we aim to address the uncertain data processing problem from the program analysis perspective.

Recently, we have made the following progress.

Monte Carlo (MC) methods provide a simple and effective means of studying uncertainty. They randomly select input samples from predefined distributions and aggregate the computed outputs to yield statistical insights into the output space. We proposed a program analysis technique to improve the cost-effectiveness of MC methods. When only part of the input is uncertain, the certain part leads to the same execution in every sample run. We remove this redundancy by coalescing multiple sample runs into a single run. In the coalesced run, the program operates on a vector of values where uncertainty is present and on a single value otherwise. We also handle cases where control flow and pointers are uncertain [ICSE'11].
We have also developed a lineage tracing technique that instruments stored procedures (in databases) to track, by following program dependences, the set of input data used to compute each individual output. These relevant inputs are called the lineage of the output. Users can then focus their attention on the lineage sets of important outputs when analyzing uncertainty. The technique has helped biochemistry experts process their data [VLDB'07].
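
The two ideas above can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the implementation described in [ICSE'11] or [VLDB'07]: every name in it (expensive, classify, classify_coalesced, Traced, the row ids) is hypothetical, the "program" is a single threshold test, and lineage is propagated by operator overloading rather than by instrumenting stored procedures.

```python
import random
from statistics import mean

# --- Part 1: coalescing Monte Carlo sample runs (toy illustration) ---
calls = {"expensive": 0}  # counts how often the certain-only work runs

def expensive(certain):
    """Stand-in for costly work that depends only on the certain input."""
    calls["expensive"] += 1
    return sum(x * x for x in certain)

def classify(certain, threshold):
    """Plain run: one value of the uncertain parameter per execution."""
    return 1 if expensive(certain) > threshold else 0

def classify_coalesced(certain, thresholds):
    """Coalesced run: the uncertain parameter is a vector, one value per sample."""
    score = expensive(certain)  # certain part: computed once for all samples
    # Uncertain control flow: the branch outcome is decided per sample.
    return [1 if score > t else 0 for t in thresholds]

rng = random.Random(0)
certain = [1.0, 2.0, 3.0]
samples = [rng.gauss(14.0, 1.0) for _ in range(1000)]  # assumed distribution

naive = [classify(certain, t) for t in samples]
naive_calls = calls["expensive"]

calls["expensive"] = 0
coalesced = classify_coalesced(certain, samples)
coalesced_calls = calls["expensive"]

print("fraction classified positive:", mean(coalesced))
print("expensive() calls, separate vs. coalesced:", naive_calls, coalesced_calls)

# --- Part 2: lineage tracing by propagating input ids (toy illustration) ---
class Traced:
    """A value paired with the ids of the inputs it was computed from."""
    def __init__(self, value, lineage):
        self.value = value
        self.lineage = frozenset(lineage)

    def __add__(self, other):
        return Traced(self.value + other.value, self.lineage | other.lineage)

    def __mul__(self, other):
        return Traced(self.value * other.value, self.lineage | other.lineage)

# Each input row starts with a lineage containing only its own id.
rows = {f"r{i}": Traced(v, {f"r{i}"}) for i, v in enumerate([2.0, 3.0, 5.0])}
total = rows["r0"] * rows["r1"] + rows["r0"]  # row r2 is never used

print("lineage of total:", sorted(total.lineage))
```

In the first part, both strategies produce the same per-sample outputs, but the coalesced run executes the certain-only computation once instead of once per sample. In the second part, the lineage of total contains only the rows that actually contributed to it, so an analyst could restrict uncertainty analysis to those rows.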

Funding

Towards Scalable and Comprehensive Uncertain Data Management, NSF-III-0916874, 2009-2012.

Students

Publications

ICSE	W. N. Sumner, T. Bao, X. Zhang, and S. Prabhakar. Coalescing Executions for Fast Uncertainty Analysis, IEEE/ACM International Conference on Software Engineering, 2011.
VLDB	Mingwu Zhang, Xiangyu Zhang, Xiang Zhang, and Sunil Prabhakar. Tracing Lineage Beyond Relational Operators, the 33rd International Conference on Very Large Data Bases, 2007.