Aref NSF CAREER 2005

Proposal Information

Proposal number : 0093116

PI : Walid G. Aref

Institution : Purdue University

Title : Research and Teaching of Database Technologies For

Emerging Applications

Requested duration : 15 September 2001 – 15 September 2006

Activities

9/01/2004 – 9/01/2005

Keywords

Spatio-temporal Database Systems, Continuous Query Processing, Ranking, Top-k Queries, Stream Data Systems, Extensible Spatial Indexing, Bioinformatics.

This project supports an integrated research and education effort in the area of database technologies for emerging applications. Its goal is to develop database technologies that address the demands of data-intensive applications that handle distributed, spatial, and multimedia data sources. We continue to make progress toward our targeted objectives in both research and education.

I. Research activities have focused on:

GPAC: Generic and Progressive Processing of Mobile Queries over Mobile Data. This research introduces a new family of Generic and Progressive algorithms (GPAC, for short) for continuous mobile queries over mobile objects. GPAC provides a general skeleton that can be tuned through a set of methods to behave as various continuous queries (e.g., continuous range queries and continuous k-nearest-neighbor queries). GPAC algorithms aim to provide three goals: (1) Online evaluation through an in-memory processing of the incoming mobile data. (2) Progressive evaluation through employing an incremental evaluation paradigm. (3) Fast query response through employing an anticipation paradigm. Query answer is anticipated and is cached in memory to allow for fast evaluation. GPAC algorithms are encapsulated in physical pipelined query operators. GPAC pipelined operators can be combined with traditional query operators in a query execution plan to support a wide variety of continuous queries. Experimental results based on a real implementation inside a prototype streaming database engine show the efficiency of GPAC operators in providing incremental and fast response for continuous queries.

SOLE: Scalable On-Line Execution of Continuous Queries on Spatio-temporal Data Streams. In this research, we introduce the Scalable On-Line Execution algorithm (SOLE, for short) for continuous and on-line evaluation of concurrent continuous spatio-temporal queries over data streams. Incoming spatio-temporal data streams are processed in-memory against a set of outstanding continuous queries. The SOLE algorithm utilizes the scarce memory resource efficiently by keeping track of only the significant objects. In-memory stored objects are expired (i.e., dropped) from memory once they become insignificant. SOLE is a scalable algorithm where all the continuous outstanding queries share the same buffer pool. In addition, SOLE is presented as a spatial join between two input streams, a stream of spatio-temporal objects and a stream of spatio-temporal queries. To cope with intervals of high arrival rates of objects and/or queries, SOLE utilizes a self-tuning approach based on load-shedding where some of the stored objects are dropped from memory. SOLE is implemented as a pipelined query operator that can be combined with traditional query operators in a query execution plan to support a wide variety of continuous queries. Performance experiments based on a real implementation of SOLE inside a prototype of a data stream management system show the scalability and efficiency of SOLE in highly dynamic environments.

Spatio-Temporal Histograms. In this research we introduce a framework for building and continuously maintaining spatio-temporal histograms (ST-Histograms, for short). ST-Histograms are used for selectivity estimation of continuous pipelined query operators. Unlike traditional histograms that examine and/or sample all incoming data tuples, ST-Histograms are built by monitoring the actual selectivities of the outstanding continuous queries. ST-Histograms have three main features: (1) The ST-Histograms are built with (almost) no overhead to the system. We use only feedback (i.e., the actual selectivity) from the existing continuous queries. (2) Rather than wasting system resources in maintaining accurate histograms for the whole spatial space, we only maintain accurate histograms for that part of the space that is relevant to the current existing queries. The rest of the space has less accurate histograms. (3) The ST-Histograms are equipped with a periodicity detection procedure that predicts the future execution of the continuous

queries. Hence, the query processing engine can continuously adapt the continuous query pipeline to reflect this prediction. Experimental results based on a real implementation inside a data stream management system show a superior performance of ST-Histograms in terms of providing accurate operator selectivity estimations with no extra overhead.

LUGrid: Update-tolerant Grid-based Indexing over Moving Objects. Indexing the locations of moving objects is one of the most important issues in spatio-temporal databases. Former proposed indexing approaches suffer from frequent updates of objects and thus do not scale well. In this reserach, we introduce LUGrid, an adaptive Lazy-Update Grid-based index that minimizes the cost of processing object updates. LUGrid has two important features, namely, lazy-insertion and lazy-deletion. Lazy-insertion reduces the cost of updates by flushing a batch of updates to disk at one time, so that the cost of multiple updates is amortized. Lazy-deletion further reduces updating costs by avoiding the deletion of obsolete entries from other disk pages and eliminating the need to maintain any auxiliary index. LUGrid adapts to object distributions through cell splitting and merging. In the paper, we extensively discuss the structure of LUGrid and the algorithms for update and query processing. Moreover, we provide theoretical analysis for estimating the update costs of LUGrid. Comprehensive experimental results indicate that while keeping query processing as efficient as former work, LUGrid is several times superior than former work in processing object updates.

R-trees with Update Memos: An Efficient Solution to the Frequent Updates Problem. The problem of frequently updating multi-dimensional indexes has arisen in many location-dependent applications. While the R-tree and its variants are one of the dominant choices for indexing multi-dimensional objects, the R-trees exhibit inferior performance in the presence of frequent updates. In this research, we introduce an R-tree variation, termed RUM-tree (stands for R-tree with Update Memo) that minimizes the cost of object updates. The RUM-tree processes updates in a memo-based approach that avoids disk accesses for purging old entries during an update process. Therefore, the cost of an update operation in the RUM-tree is reduced to the cost of only an insert operation. The removal of old object entries is carried out by the garbage cleaner inside the RUM-tree. In this research, we give the details of the RUM-tree and study the properties of the RUM-tree. Our theoretical analysis and experimental evaluation demonstrate that the RUM-tree outperforms other R-tree variations in scenarios with frequent updates.

SEA-CNN: Scalable Processing of Continuous K-Nearest Neighbor Queries in Spatio-temporal Databases. Location-aware environments are characterized by a large number of objects and a large number of continuous queries. Both the objects and continuous queries may change their locations over time. In this work, we focus on continuous k-nearest neighbor queries (CKNN, for short). We present a new algorithm, termed SEA-CNN, for answering continuously a collection of concurrent CKNN queries. SEA-CNN has two important features: incremental evaluation and shared execution. SEA-CNN achieves both efficiency and scalability in the presence of a set of concurrent queries. Furthermore, SEA-CNN does not make any assumptions about the movement of objects, e.g., the objects’ velocities and shapes of trajectories, or about the mutability of the objects and/or the queries, i.e., moving or stationary queries issued on moving or stationary objects. We provide theoretical analysis of SEA-CNN with respect to the execution costs, memory requirements and effects of tunable parameters. Comprehensive experimentation shows that SEA-CNN is highly scalable and is more efficient in terms of both I/O and CPU costs in comparison to other R-tree-based CKNN

techniques.

To Trie or Not to Trie? Realizing Space-partitioning Trees inside PostgreSQL: Experiences and Performance. Many evolving database applications warrant the use of non-traditional indexing mechanisms beyond B+-trees and hash tables. SP-GiST is an extensible indexing framework that broadens the class of supported indexes to include disk-based versions of a wide variety of space-partitioning trees, e.g., disk-based trie variants, quadtree variants, and kd-trees. This research makes a serious attempt at implementing and realizing SP-GiST-based indexes inside PostgreSQL. Several index types are realized inside PostgreSQL facilitated by rapid SP-GiST instantiations. Challenges, experiences, and performance issues are addressed. Performance comparisons are conducted from within PostgreSQL to compare update and search performances of SP-GiST-based indexes against the B+-tree and the R-tree for text string and point data sets. Interesting performance results are produced by this research. Results highlight the potential performance gains of SP-GiST-based indexes as well as several strengths and weaknesses of SP-GiST-based indexes that need to be addressed in future research.

Discovering Consensus Patterns using Progressive Hierarchical Clustering with Applications in Biological Databases. A consensus pattern is a highly conserved pattern with very few substitutions where no gaps are allowed within the pattern. Discovering consensus patterns has many useful applications, e.g., discovering motifs and tandem repeats in biological databases. The length of a consensus pattern is usually within a certain range that is determined by the domain knowledge. In this research, we introduce a disk-based progressive hierarchical clustering technique for discovering consensus patterns of a certain length range. The proposed technique has the ability to discover the consensus patterns with various requirements by applying a post-processing phase to refine the discovered patterns. The progressive nature of the proposed hierarchical clustering algorithm makes it scalable and efficient. Our experiments are conducted in the context of biological databases for discovering motif and tandem repeat consensus patterns.

II. Education activities are focused in the following activities:

Prototype Software Systems: Students involved in my research lab, both at the undergraduate and graduate levels, have had the opportunity to develop significant experience in systems-oriented database research.

SP-GiST (http://www.cs.purdue.edu/homes/aref/dbsystems_files/SP-GiST/): Since the release of SP-GiST last year, and with over thirty users world-wide, new effort by my group has been made to realize the SP-GiST framework inside PostgreSQL (an open-source extensible database developed at Berkeley). Many non-traditional indexes are now realizable in an industrial-strength database management system, and hence enabling efficient processing of non-traditional and emerging database applications.

Bioinformatics Undergraduate Research Course: I continued to offer this undergraduate research course. The course initiates a new collaborative research effort with professors in the Biology department at Purdue. This collaboration is in the form of a research course at the undergraduate level, where both CS and Biology undergraduate students register into the course. With the success in this course, I am currently proposing to the CS department of make this a regular CS undergraduate course (Titled: Bioinformatics Project) that is to be taken at the Junior/Senior level with a database systems course as a pre-requisite.

Graduating Students. The following students graduated during this past year. All were supported in part by this grant:

1. M.F. Mokbel (Ph.D. June 2005 – accepted an assistant professor job at University of Minnesota, Twin Cities).

2. M.G. Elfeky (Ph.D. March 2005 – accepted a job at Google, CA).

Tutorial (at the IEEE International Conference on Data Engineering - ICDE): Rank-aware Query Processing. With the success of this tutorial last year in the EDBT conference in Crete, Aref and his former student Ilyas gave this tutorial again at ICDE this year in Tokyo. The main goal of this tutorial was to give an in-depth look at rank-aware query processing as an increasingly interesting area of research to the database community and to give the necessary background and the state-of-the-art techniques in research prototypes and industrial-strength database engines. We give an inclusive background on ranking, voting and rank-aggregation algorithms in theory. We then give a detailed coverage of ranking query models--covering top-k selection and top-k join queries, and the various approaches recently taken by researchers in academia and industry to support these queries inside database systems.

Proposal Information

Proposal number : 0093116

PI : Walid G. Aref

Institution : Purdue University

Title : Research and Teaching of Database Technologies For

Emerging Applications

Requested duration : 15 September 2001 – 15 September 2006

Findings

9/01/2004 – 9/01/2005

Keywords

Spatio-temporal Database Systems, Continuous Query Processing, Ranking, Top-k Queries, Stream Data Systems, Extensible Spatial Indexing, Bioinformatics.

This section highlights our specific research findings in three major research areas of the project: Scalable Processing of Continuous Queries in Spatio-temporal Databases, Performance of Space-Partitioning Trees, Incremental Evaluation for Sliding-Window Continuous Queries in Stream Query Processing Servers, and Bio-informatics-related Research.

Spatio-temporal Database Systems. The wide spread of location-detection devices (e.g., GPS-like devices, RFID's, handheld devices, and cellular phones) has given rise to a wide variety of vital spatio-temporal database applications including location-aware services, traffic monitoring, and enhanced 911 service. These applications pose serious challenges to existing database engines.

Aref and his collaborators at Purdue developed the first prototype data stream management system (the PLACE server – PLACE stands for Pervasive Location-Aware Computing Environment) that handles massive spatio-temporal data streams in real-time. The PLACE server is the first data stream management system that provides support for both spatio-temporal queries and spatio-temporal data streams. Unlike previous systems and approaches for spatio-temporal databases that support spatio-temporal algorithms as high level applications, the PLACE server encapsulates continuous query algorithms into physical pipelined query operators that can be part of a larger query pipeline inside a data stream management system. The PLACE server addresses new challenges of streaming environments that include managing the scarce memory resource, dealing with uncertainty areas of moving queries, and adapting the system behavior to intervals of high workload via load shedding. The PLACE server was demonstrated at the 2004 International Conference on Very Large Data Bases. Aref and his collaborators developed incremental spatio-temporal query processing algorithms and indexing techniques. In order to handle the vast amounts of updates due to the large numbers of moving objects and moving queries, Aref and his collaborators developed tolerant indexes on both the objects and on the queries that can accommodate for the massive numbers of updates whenever the objects or the queries move. PLACE utilizes the notions of “positive” and “negative” tuples to incrementally update and send the answers of a query to a mobile client.

Other researchers in spatio-temporal databases consider instances of this problem, e.g., developing indexing structures to efficiently answer static range queries on moving objects, or moving queries on static objects, etc. However, these techniques do not scale from a systems point of view. In their SIGMOD 2004 paper, Aref and colleagues presented the SINA family of algorithms that scale to all object/query mutability combinations in a very efficient manner. Although relatively recent, the SINA paper is referenced widely in database systems venues.

In contrast to snapshot queries that are answered against one state of the database, continuous queries are queries that last over a long duration of time and are reevaluated every time a new change to the database or to the query occurs (e.g., when the location of the k-nearest-neighbor query changes or when a new tuple gets inserted into the database). To target scalability, Aref and his colleagues proposed to incrementally update the answer to “continuous” queries in contrast to reevaluating them from scratch. In the IEEE Mobile Data Management 2005 and IEEE ICDE 2005 conference papers, Aref and his colleagues introduced the notion of update tuples to incrementally update the answers to continuous k-nearest-neighbor and range queries for all object/query mutability combinations. They developed a shared execution paradigm among the concurrently outstanding continuous queries. The main idea is to abstract the problem of executing multiple continuous spatio-temporal queries into a spatio-temporal join operation. The inputs to the join operation are two streams; a stream of continuously moving objects and a stream of continuously moving queries. Furthermore, concurrently outstanding queries share the same in-memory buffers. In the PLACE continuous query processor, Aref and his colleagues go beyond the idea of implementing high level algorithms for continuous spatio-temporal queries. Instead, they encapsulate the spatio-temporal query algorithms into a set of primitive spatio-temporal pipelined operators that can be part of a larger query plan. Having a set of primitive spatio-temporal operators results in greatly enhancing the query performance by pushing the spatio-temporal operators into the query pipeline, and hence giving the query optimizer more flexibility in producing more efficient query plans. Both theoretical analysis and comprehensive experimentation demonstrate that the techniques proposed by Aref and his colleagues are highly scalable and are more efficient by an order of magnitude than competitive spatio-temporal approaches.

Currently, Aref and his students are studying spatio-temporal query optimization issues in PLACE. As a first step towards query optimization in spatio-temporal databases, in [SSTD’05], Aref and colleagues presented a framework for building and continuously maintaining spatio-temporal histograms (ST-Histograms, for short). ST-Histograms are used for selectivity estimation of continuous pipelined query operators. Unlike traditional histograms that examine and/or sample all incoming data tuples, ST-Histograms are built by monitoring the actual selectivities of the outstanding continuous queries. ST-Histograms have three main features: (1) The ST-Histograms are built with (almost) no overhead to the system. Feedback (i.e., the actual selectivity) from the existing continuous queries is used. (2) Rather than wasting system resources in maintaining accurate histograms for the whole spatial space, accurate histograms are maintained for only that part of the space that is relevant to the current existing queries. The rest of the space has less accurate histograms. (3) The ST-Histograms are equipped with a periodicity detection procedure that predicts the future execution of the continuous queries. Hence, the query processing engine can continuously adapt the continuous query pipeline to reflect this prediction. In fact, this is an instance of where data mining can fit evenly in query optimization. Experimental results based on a real implementation inside PLACE show a superior performance of ST-Histograms in terms of providing accurate operator selectivity estimations with no extra overhead (only 8.5% error for the existing queries of size 1% and an average error of 25% for new queries when the existing query coverage is 60%).

GPAC: Generic and Progressive Processing of Mobile Queries over Mobile Data. We introduced a new family of Generic and Progressive Algorithms (GPAC, for short) for continuous query evaluation over spatio-temporal data streams. GPAC is a general skeleton that can be tuned through a set of methods to behave as different continuous spatio-temporal queries. GPAC provides online, progressive, and fast response to continuous spatio-temporal queries. We described two versions of GPAC. The first version (with no caching) is simple to maintain, however, produces inaccurate answers. In the second version (with caching), we introduce the concept of anticipation, where the query answer can be anticipated beforehand and cached in a cache structure. We show how to realize two types of continuous spatio-temporal queries form GPAC, namely, continuous range queries and k-nearest-neighbor queries. GPAC algorithms are encapsulated into physical pipelined query operators. Pipelined operators are combined with traditional operators (e.g., selection and join) to support a wide variety of continuous spatio-temporal queries. Experimental results show that encapsulating GPAC into pipelined query operators is ten orders of magnitude better than implementing GPAC as a high level application.

SOLE: Scalable On-Line Execution of Continuous Queries on Spatio-temporal Data Streams. We presented the Scalable On-Line Execution algorithm (SOLE, for short) for continuous and on-line evaluation of concurrent continuous spatio-temporal queries over spatio-temporal data streams. SOLE is an in-memory algorithm that utilizes the scarce memory resources efficiently by keeping track of only those objects that are considered significant (i.e., the objects that are needed by at least one outstanding continuous query). SOLE utilizes a shared memory structure that is accessible by all outstanding continuous queries. SOLE is a unified framework that supports both stationary and moving queries. k-nearest-neighbor queries are modeled and treated as range queries. Uncertainty areas in continuous spatio-temporal queries are avoided in SOLE by using a caching technique. SOLE is encapsulated into a physical pipelined query operator that interacts with traditional query operators to support various query types. To cope with intervals of high arrival rates of objects and/or queries, SOLE utilizes load shedding techniques that aim to support more continuous queries, yet with an approximate answer. Two load shedding techniques are proposed, namely, eager load shedding and lazy load shedding. Experimental results based on a real implementation of SOLE inside a prototype data stream management system show that SOLE can support up to 20 times continuous queries more than the case of dealing with each query separately. Also, implementing SOLE as a pipelined operator results in up to 20 times performance advantage over a table function implementation. With the lazy load shedding, SOLE can support up to 13 times more queries than the case with no load shedding.

LUGrid: Update-tolerant Grid-based Indexing over Moving Objects. We proposed LUGrid; An adaptive Lazy-Update Grid-based indexing structure. LUGrid efficiently handles object updates by its unique lazy-update features, namely, lazy-insertion and lazy-deletion. Lazy-deletion reduces the update cost from traditional ``insertion cost plus deletion cost" to ``insertion cost only". The lazy-deletion functionality is provided by maintaining a memo structure to identify obsolete entries. Further, lazy-insertion groups updates and flushes multiple updates at one time, so that the cost for single update is amortized. The proposed lazy-update techniques in LUGrid can be applied to other index families. We give theoretical analysis for LUGrid and present an extensive set of experiment. Depending on different experiment settings, LUGrid is 2 to 8 times superior than former work in terms of the I/O cost for processing object updates.

R-trees with Update Memos: An Efficient Solution to the Frequent Updates Problem. Supporting frequent updates in indexes is receiving increasing attentions in the context of the spatio-temporal databases and moving object databases. As the R-tree and its variants have been regarded as one of the primary choices for spatial indexing, improving the R-tree's update performance with frequent updates is a challenging yet important issue. For R-tree update, the most costly part lies in searching the objects to be updated. In contrast to the former update approaches, we presented a memo-based approach to avoid the deletion I/O costs. In the proposed RUM-tree, object updates are ordered temporally according to the processing time. By maintaining the update memo, more than one entries of an object may coexist in the RUM-tree. The obsolete entries are deleted lazily and in batches. Garbage cleaner is employed to limit the garbage ratio in the RUM-tree and confine the size of the Update Memo. The RUM-tree along with the garbage cleaners outperforms significantly other R-tree variants in the update performance, while yielding similar search performance. We believe that the memo-based update approach has potential to support frequent updates in many other indexing structures, for instances, B-trees, quadtrees and Grid files.

SEA-CNN: Scalable Processing of Continuous K-Nearest Neighbor Queries in Spatio-temporal Databases. We investigated the problem of evaluating a large set of continuous k-nearest neighbor (CKNN) queries in spatio-temporal databases. We introduce the Shared Execution Algorithm (SEA-CNN, for short) to efficiently maintain the answer results of CKNN queries. SEA-CNN combines incremental evaluation and shared execution to minimize the costs when updating the query answers. With incremental evaluation, only queries affected by the motion of objects are reevaluated. To minimize the evaluation time, each affected query is associated with a searching region based on its former query answer. Under shared execution paradigm, concurrent queries are grouped in a common query table. Thus the problem of evaluating multiple queries is solved by a spatial join between the query table and the object table. Furthermore, SEA-CNN is a generally applicable framework. First, SEA-CNN does not make any assumptions about the movement of objects (e.g., velocities, trajectories). Second, SEA-CNN is suitable for processing moving/stationary queries issued on moving/stationary objects. We provide theoretical analysis of SEA-CNN in terms of the execution costs, the memory requirements and the effects of tunable parameters. Comprehensive experiments illustrate that SEA-CNN is highly scalable and is more efficient to R-tree-based CKNN techniques in terms of both the number of I/Os and the CPU costs.

Spatio-Temporal Histograms. We explored the usage of spatio-temporal histograms for selectivity estimation of spatio-temporal operators. We presented a general framework for building and continuously maintaining spatio-temporal histograms. The main idea of our proposed spatio-temporal histograms is to use a continuous feedback from the outstanding continuous queries to maintain a spatio-temporal histogram for only those parts of the spatial space that are of interest to at least one outstanding continuous query. Parts of the spatial space that are not of interest to any of the outstanding queries do not participate in maintaining the spatio-temporal histograms, thus the overhead of continuously maintaining our spatio-temporal histograms is reduced. Our proposed spatio-temporal histograms utilize periodicity detection techniques to discover temporal periodic patterns. Discovering temporal patterns provides pre-computation of the optimal query plan over the course of execution of continuous queries. Experimental results show that our spatio-temporal histograms provide only 8.5% error for the existing queries of size 1%. An average error of 25% for new queries when the existing query coverage is 60%.

To Trie or Not to Trie? Realizing Space-partitioning Trees inside PostgreSQL: Experiences and Performance. We presented a serious attempt at implementing and realizing SP-GiST-based indexes inside PostgreSQL. We realized three index structures; namely one variant of a forest trie, a kd-tree and a suffix tree. Several implementation challenges, experiences, and performance issues are addressed. Our experiments demonstrate the potential gain of the SP-GiST indexes. For example, the trie has more than 150\% search performance improvement over the B+-tree in the case of the exact match search, and it has more than 2 orders of magnitude search performance gain over the B+-tree in the case of the regular expression match search. The kd-tree also has more than 300% search performance improvement over the R-tree in the case of the point match search. Our experiments demonstrate several weaknesses of SP-GiST indexes that need to be addressed in future research. For example, the insertion time and the index size of the SP-GiST indexes involve higher overhead than those of the B+-tree, and the R-tree indexes.

Discovering Consensus Patterns using Progressive Hierarchical Clustering with Applications in Biological Databases. We proposed Bio-CP, a disk-based progressive hierarchical clustering technique for discovering consensus patterns. We demonstrated the applicability of Bio-CP in the context of biological databases for discovering motifs and tandem repeats. Bio-CP is applicable to a wide range of applications because any domain-specific requirements are applied in a post-processing phase. We proposed several scalability techniques to enhance the performance of Bio-CP and to allow efficient disk access. Our experiments illustrated that Bio-CP scales very well with respect to the processing time, the clustering validation degrees and the disk access. Bio-CP has more than 500\% processing time improvement, and more than 350% disk access improvement over the non-progressive clustering techniques.