Data Quality

Principal Investigators: Ahmed Elmagarmid, Richard Wang (MIT)

Sponsors: Bellcore, Inc. and SERC

The objective of this project is to develop a methodology for Total Data Quality Management (TDQM). The TDQM methodology aims to facilitate the implementation of an organization’s overall data quality policy, as formally expressed by top management. To benefit fully from the TDQM methodology in practice, however, tools, methods, and techniques must be developed, refined, and tested in various organizational settings. In so doing, many fundamental research issues will be addressed. In this project, we are investigating these research issues through the development of an experimental software testbed. Anchoring our investigation in such a testbed will enable us to resolve research issues and evaluate research results concretely.

Very little formal work exists in the area of data quality, and very little can be done without extensive experimentation on the data. The problem is so pervasive that help is urgently needed. Software tools are needed to sample, measure, analyze, and correct problems in databases. In addition, proper experiments must be designed to measure databases for various aspects of data quality. For example, experiments must be designed and implemented to test data for completeness, accuracy, timeliness, believability, and so on. An experiment that identifies the source of a data value must attach tags to the data so that the source of each update can be tracked. Furthermore, the metrics of particular interest must be identified in advance to ensure that at least some of the experiments capture them. The need for experimentation is therefore dictated by the nature of the problem, the need for measurements, and impact.
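
As an illustration only, the kind of measurement described above might be sketched in Python as follows. The record layout, the names TaggedValue, completeness, timeliness, and missing_by_source, the sample data, and the 30-day freshness threshold are all assumptions made for this sketch; they are not part of the TDQM methodology or of the testbed itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class TaggedValue:
    """Hypothetical data cell carrying a source tag and an update time."""
    value: Optional[str]   # None or "" is treated as a missing value
    source: str            # tag identifying where the update came from
    updated_at: datetime   # when the value was last updated


def is_missing(cell: TaggedValue) -> bool:
    return cell.value is None or cell.value == ""


def completeness(cells: List[TaggedValue]) -> float:
    """Fraction of cells that hold a value (0.0 for an empty sample)."""
    if not cells:
        return 0.0
    return sum(1 for c in cells if not is_missing(c)) / len(cells)


def timeliness(cells: List[TaggedValue], now: datetime,
               max_age: timedelta) -> float:
    """Fraction of cells updated within max_age of now."""
    if not cells:
        return 0.0
    return sum(1 for c in cells if now - c.updated_at <= max_age) / len(cells)


def missing_by_source(cells: List[TaggedValue]) -> Dict[str, int]:
    """Count missing values per source tag, to trace where gaps originate."""
    counts: Dict[str, int] = {}
    for c in cells:
        if is_missing(c):
            counts[c.source] = counts.get(c.source, 0) + 1
    return counts


# Made-up sample: four values of one attribute from two hypothetical sources.
now = datetime(2000, 1, 31)
cells = [
    TaggedValue("A", "billing", now - timedelta(days=2)),
    TaggedValue(None, "billing", now - timedelta(days=40)),
    TaggedValue("B", "field_survey", now - timedelta(days=10)),
    TaggedValue("", "field_survey", now - timedelta(days=5)),
]

print(completeness(cells))                        # 0.5
print(timeliness(cells, now, timedelta(days=30)))  # 0.75
print(missing_by_source(cells))
```

Accuracy and believability are left out of the sketch because they cannot be computed from the data alone: accuracy needs a reference value to compare against, and believability needs a judgment about each source.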