Purdue Receives $1.5 Million Award from the National Science Foundation

Pictured (from left to right): Victor Raskin, Karen Chang, Chris Clifton, and Luo Si

A team of Purdue faculty members from the Department of Computer Science, the Department of English, and the School of Nursing has received a $1.5 million award from the National Science Foundation (NSF) to enhance the anonymization and de-identification of sensitive data.

The team includes Chris Clifton, Victor Raskin, Chyi-Kong Chang, and Luo Si. The overall $2.5 million project also involves Raquel Hill (Indiana University), Wei Jiang (Missouri University of Science and Technology), Stephanie Sanders (The Kinsey Institute), and Erick Janssen (The Kinsey Institute).

The funding will support a better understanding of how current de-identification and anonymization techniques control risks to privacy and confidentiality. The usefulness of anonymized data for real-world applications is not currently well understood; this project will build that understanding by studying anonymization on three fronts:

1. Textual data, even when explicit identifiers (names, dates, locations) are removed, can contain highly identifiable information. For example, a sample of chief complaint fields from the Indiana Network for Patient Care (INPC) found several instances of "phantom limb pain". Amputees can be visually identifiable, yet the HIPAA Safe Harbor rules do not list this as "identifying information". Any policy that tries to explicitly enumerate all types of identifying data is likely to fail. Through a joint effort between computer science and linguistics, the project is developing new methods to remove specific details from text while preserving meaning, eliminating such highly identifiable information without a priori knowledge of what would be identifying.

2. Current anonymization research is based on unproven measures of identifiability. Through a re-identification challenge on synthetic data (modeled on real healthcare data), the project is evaluating the efficacy of these measures. Interdisciplinary teams of students are given challenge problems, consisting of anonymized datasets containing hypothetical healthcare data, and asked to make (hypothetical) inferences about the health information of individuals. The results can be used to calibrate the effectiveness of different anonymization measures.

3. By partnering with healthcare studies at the Kinsey Institute and Purdue University School of Nursing, the project is comparing analyses on original data with analyses on anonymized data, and evaluating the impact of types of anonymization on research results. A related issue is determining the impact on data collection: Are individuals more candid in their responses if they know data will be anonymized? Outcomes are broadening the scope of research that can be performed on anonymized data, while ensuring that researchers know when access to individually identifiable data (with attendant restrictions and safeguards) is needed.
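The "phantom limb pain" example in point 1 above can be made concrete with a small sketch. The following is not the project's method, merely a naive Safe Harbor-style redactor (the patterns, field text, and `redact` helper are all hypothetical) that strips explicit identifiers from free text; it illustrates why enumerating identifier types is insufficient, since the highly identifiable condition survives redaction:

```python
import re

# Hypothetical Safe Harbor-style patterns for explicit identifiers:
# names (with a title prefix), dates, and a few example locations.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "NAME": re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+[A-Z][a-z]+\b"),
    "LOCATION": re.compile(r"\b(?:Indianapolis|Lafayette|Chicago)\b"),
}

def redact(text: str) -> str:
    """Replace each matched explicit identifier with a bracketed tag."""
    for tag, pattern in PATTERNS.items():
        text = pattern.sub(f"[{tag}]", text)
    return text

note = "Dr. Smith saw patient on 3/14/2012 in Indianapolis for phantom limb pain."
print(redact(note))
# prints: [NAME] saw patient on [DATE] in [LOCATION] for phantom limb pain.
```

The explicit identifiers are removed, but "phantom limb pain", which can identify a visually recognizable amputee, is untouched: no fixed list of identifier types catches it.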

Through these tasks, the project is advancing our ability to utilize the wealth of data we now collect for the benefit of society, while ensuring individual privacy is protected.


Source: Project Website and NSF Award Abstract