Approximate Pattern Matching with q-Grams

Principal Investigator: Erkki Sutinen

Approximate pattern matching methods are needed in diverse application areas, like biocomputing and text databases. Due to the data sizes involved, efficient techniques for fast retrieval are required. One way to speed up the search is to apply filtration methods, which work in two phases: a fast scanning phase searches for potential matches which are verified by a slower but accurate checking phase.

The use of q-grams, i.e., substrings of length q, as a filter is based on the fact that each approximate match must resemble the searched pattern. This resemblance can be observed in the filtration phase as preserved q-grams of the searched pattern.

The use of the q-gram approach in approximate pattern matching results in various filtration methods. For example, one can utilize the fact that the preserved q-grams retain their original order. The variations differ from each other for example in the frequency of false matches they yield. A false match is a potential match which is falsified in the checking phase.

Besides practical algorithms, the q-gram approach provides researchers with a particular challenge: the algorithm analysis problems due to overlapping q-grams.

1998
Annual Research Report

Department of
Computer Sciences