Data Quality

Principal Investigator: Ahmed Elmagarmid

Research Assistant: V. Verykios

Sponsors: Bellcore, SERC

Data quality is a pervasive problem in information and database systems. If the quality of the data is low, answers to even simple queries are incorrect, high-level decisions based on the data are biased, and entire systems risk losing the trust of their users.

Various attempts have been made to conceptualize this problem, resulting in a set of orthogonal properties that data must satisfy in order to be of acceptable quality. Among these properties are completeness, accuracy, currency, and believability.

Three kinds of action can be taken regarding the quality of data. First, a system can be built in a way that prevents its data from deteriorating or becoming obsolete. Second, the quality of the data can be measured to estimate the extent of the problem. Third, the quality of the data can be improved.

The objective of this project is to develop a methodology for improving the quality of data stored in information systems that are distributed and share some or all of their data resources. In such a setting, data at one site can change without the mirror sites being notified of the change, for many reasons (poor design, broken communication links, etc.). The end result is that a large number of distributed records refer to the same real-world entity while holding different values for their discriminating keys.

Identifying such records is a hard problem, especially when designing a data warehouse. Our goal is to build a framework for the semi-automatic identification of approximate duplicate records. Our approach is based on training a decision tree to learn the mapping from the differences between the feature values of two records to a label that indicates whether the records match. The decision tree built this way is annotated with fuzzy membership functions over the value ranges produced by the tree's feature splits, and is then converted into an "optimal" set of production rules. This rule set becomes the knowledge base of an inference engine, which decides whether two new, unseen records are similar based on certainty factors derived from the tree.
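The core of the approach can be illustrated with a minimal sketch: each record pair is represented by per-field similarity scores, and a decision tree is trained on labeled pairs to separate matches from non-matches. The sketch below uses Python with scikit-learn; the field names, the string-similarity measure, and the toy data are illustrative assumptions rather than the project's actual implementation, and the fuzzification and rule-generation steps are only indicated in comments.

```python
# Minimal sketch (illustrative assumptions throughout): represent each record pair
# by per-field similarity scores, then train a decision tree to label pairs as
# approximate duplicates (1) or non-duplicates (0).

from difflib import SequenceMatcher
from sklearn.tree import DecisionTreeClassifier, export_text

FIELDS = ["name", "street", "city"]  # assumed record schema

def field_similarities(rec_a, rec_b):
    """One feature per field: a string-similarity score in [0, 1]."""
    return [SequenceMatcher(None, rec_a[f], rec_b[f]).ratio() for f in FIELDS]

# Toy labeled pairs: 1 = same real-world entity, 0 = different entities.
pairs = [
    (({"name": "John Smith", "street": "12 Oak St", "city": "Lafayette"},
      {"name": "Jon Smith",  "street": "12 Oak Str", "city": "Lafayette"}), 1),
    (({"name": "John Smith", "street": "12 Oak St", "city": "Lafayette"},
      {"name": "Mary Jones", "street": "5 Elm Ave",  "city": "Chicago"}),   0),
    (({"name": "J. Doe",     "street": "CS Dept",    "city": "W. Lafayette"},
      {"name": "Jane Doe",   "street": "Dept. of CS", "city": "West Lafayette"}), 1),
    (({"name": "Jane Doe",   "street": "CS Dept",    "city": "W. Lafayette"},
      {"name": "Mary Jones", "street": "5 Elm Ave",  "city": "Chicago"}),   0),
]

X = [field_similarities(a, b) for (a, b), _ in pairs]
y = [label for _, label in pairs]

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

# The learned splits can be read off as rules; the project additionally smooths
# the split thresholds with fuzzy membership functions before building its
# production-rule knowledge base.
print(export_text(tree, feature_names=[f + "_sim" for f in FIELDS]))

# Classify a new, unseen pair.
new_pair = ({"name": "Jhon Smith", "street": "12 Oak Street", "city": "Lafayette"},
            {"name": "John Smith", "street": "12 Oak St",     "city": "Lafayette"})
print(tree.predict([field_similarities(*new_pair)]))  # likely [1], i.e. a duplicate
```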
