
Possible Solutions

We do not believe this problem is unsolvable. A number of measures can hinder unwanted data mining. We divide these into:

Limit access: If we control access to the data so that users cannot obtain a sufficiently large and varied sample of the database, we lower the confidence in the results of any mining that is attempted: does the mined ``fact'' represent the database as a whole, or is it just an artifact of the small sample? This is the approach taken by the secure DBMS community.
``Fuzz'' the data: If we alter the data, for example by forcing aggregation into daily records instead of individual transactions or slightly altering data values, we may prevent useful mining while still enabling the planned use of the data. This is an approach used by the U.S. Census Bureau.
Eliminate unnecessary groupings: Often data values contain unneeded information. For example, U.S. social security numbers are assigned by office (the first three digits identify the office where the social security number was obtained). In addition, the offices often assign these sequentially. Therefore, by grouping social security numbers by the high-order digits, we can group people by location (or location and age) with reasonable reliability. This can provide additional information for use in data mining.
This can be useful to a miner who does not know how social security numbers are assigned. Simply clustering on the high-order digits of a ``unique identifier'' is likely to group similar data elements, even when the nature of the similarity is unknown. The similarity may be chronological, if the identifiers are assigned sequentially, or it may reflect the source of the data element, if individual data sources are each given a ``batch'' of identifiers. The problem is that this grouping exposes similarities that would not otherwise be available (e.g. a high cancer rate among people with similar social security numbers), which in turn lead us to look for further information (what else do those people have in common?).
This can have security implications. For example, an organization that assigns telephone numbers sequentially based on location within a building could find its ``phone book'' mined to find out who is working on the same projects. Knowledge of the identifier assignment process is not necessary. Once a miner finds that the people working on one known project can be identified by grouping on their telephone numbers, it is a short step to guess who is working together on unknown projects by grouping on telephone numbers in the same way.
The solution is to assign unique identifiers randomly, so that they serve only as unique identifiers. This prevents meaningful grouping based on the identifiers, yet does not detract from their intended purpose.
Augment the data: Sometimes we can add to the data without reducing its usefulness. If we know precisely how the data should be used, we may be able to add misleading data that will only be retrieved by inappropriate queries. For example, suppose that a phone book were populated with extra, fictitious people. Asking for an individual's phone number would return the correct information, but a query to find all individuals in a department would also return people who are not even in the company.
This requires that we augment the data in non-obvious ways (otherwise it would be simple to reconstruct the original database).
Audit: While the previous methods address the problems of releasing data ``to the world'', auditing can be a very effective deterrent against misuse by legitimate users inside an organization. Auditing does not enforce controls, but it may detect misuse so that administrative disciplinary action or criminal proceedings may be initiated. Of course, this raises its own questions: what should be saved, how can data mining be identified, and how can we tell what inferences have been formed? Should the audit trail itself be mined in order to answer these questions?
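
As an illustration of the ``fuzz'' approach listed above, the following is a minimal sketch in Python, not a description of any agency's actual procedure. It collapses individual transactions into daily totals and then shifts each total by a small random fraction. The function names, the sample records, and the 2% error bound are all assumptions made for this example.

```python
import random
from collections import defaultdict
from datetime import datetime

def aggregate_daily(transactions):
    """Collapse (timestamp, amount) records into one total per calendar day."""
    totals = defaultdict(float)
    for timestamp, amount in transactions:
        totals[timestamp.date()] += amount
    return dict(totals)

def perturb(totals, max_relative_error=0.02, seed=None):
    """Shift each total by a random fraction of at most max_relative_error."""
    rng = random.Random(seed)
    return {
        day: total * (1 + rng.uniform(-max_relative_error, max_relative_error))
        for day, total in totals.items()
    }

# Hypothetical individual transactions (illustrative values only).
transactions = [
    (datetime(1996, 8, 1, 9, 15), 20.0),
    (datetime(1996, 8, 1, 17, 40), 5.0),
    (datetime(1996, 8, 2, 11, 5), 12.5),
]

daily = aggregate_daily(transactions)       # individual purchases are no longer visible
released = perturb(daily, 0.02, seed=0)     # values are close to, but not exactly, the truth
```

Whether the released figures remain adequate for the planned use depends on how much error that use can tolerate, which has to be decided case by case.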
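
The point about unnecessary groupings can be sketched in the same way. The example below assumes a hypothetical scheme in which each data source hands out identifiers from its own batch of 1000. It shows that grouping on the high-order digits recovers the source, while randomly drawn identifiers carry no such structure. The names, offices, and batch size are invented for illustration.

```python
import random
from collections import defaultdict

def assign_sequential(people, batch_size=1000):
    """Give each source its own batch of consecutive identifiers (the leaky scheme)."""
    next_id = {}
    ids = {}
    for name, source in people:
        if source not in next_id:
            next_id[source] = len(next_id) * batch_size
        ids[name] = next_id[source]
        next_id[source] += 1
    return ids

def assign_random(people, id_space=10**6, seed=None):
    """Draw distinct identifiers uniformly at random, ignoring the source."""
    rng = random.Random(seed)
    values = rng.sample(range(id_space), len(people))
    return {name: value for (name, _), value in zip(people, values)}

def group_by_prefix(ids, divisor=1000):
    """Cluster names by the high-order digits of their identifier."""
    groups = defaultdict(set)
    for name, value in ids.items():
        groups[value // divisor].add(name)
    return dict(groups)

# Hypothetical records: (name, issuing office).
people = [("ann", "office A"), ("bob", "office B"),
          ("cat", "office A"), ("dan", "office B")]

sequential_groups = group_by_prefix(assign_sequential(people))
# {0: {'ann', 'cat'}, 1: {'bob', 'dan'}}: the prefix reveals the office

random_groups = group_by_prefix(assign_random(people, seed=1))
# any grouping here is accidental and says nothing about the office
```

Note that the miner in this sketch never needs to know the batch size in advance. Trying a few divisors and looking for groups that correlate with other attributes would be enough.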

The difficulty with all of these approaches is knowing when we have developed a ``public'' version of the database that is not amenable to mining. Doing so requires an understanding of the mining algorithms:

How do they decide if a given rule or output is ``interesting''? Knowing this allows us to alter or limit the data so as to prevent ``interesting'' rules (or plant false ``interesting'' rules to mislead the miner).
What are the performance characteristics? Some mining algorithms have exponential run times under certain conditions. For example, one algorithm for mining association rules [AIS93] requires an exponential number of passes over the data if the number of items found in each ``transaction'' is large. We can exploit this to make our data computationally infeasible to mine.
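
To make the two questions above concrete, the following sketch computes support and confidence, two measures commonly used to judge whether an association rule is ``interesting'', and counts the itemsets contained in a single transaction. It is an illustration of the underlying combinatorics under assumed toy data, not an implementation of the algorithm in [AIS93]. All function names and transactions are hypothetical.

```python
from itertools import combinations

def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent); assumes the antecedent occurs at least once."""
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / support(transactions, antecedent)

def itemsets_in(transaction):
    """Every non-empty subset of one transaction: 2**n - 1 of them for n items."""
    items = sorted(transaction)
    return [frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)]

# Hypothetical market-basket data.
transactions = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
)]

rule_support = support(transactions, {"bread", "milk"})          # 2 of 4 transactions
rule_confidence = confidence(transactions, {"bread"}, {"milk"})  # 2 of the 3 with bread

small = len(itemsets_in({"a", "b", "c"}))   # 7 itemsets in a 3-item transaction
wide = 2**40 - 1                            # itemsets in a 40-item transaction
```

A data owner who knows the miner's thresholds could, in principle, alter the data until unwanted rules fall below them, while very wide transactions push the number of itemsets a miner may have to consider beyond what is practical to enumerate.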

This is just a start. Further work is needed to determine exactly what we can do to prevent unwanted data mining.


Christopher W Clifton
Fri Aug 23 13:26:29 EDT 1996