Proposal Information
Proposal number :
0093116
PI :
Walid G. Aref
Institution :
Title :
Research and Teaching of Database Technologies For
Emerging Applications
Requested duration : 15
September 2001 – 15 September 2006
Activities
9/01/2004 – 9/01/2005
Keywords
Spatio-temporal
Database Systems, Continuous Query Processing, Ranking, Top-k Queries, Stream
Data Systems, Extensible Spatial Indexing, Bioinformatics.
This project supports an integrated research and education
effort in the area of database technologies for emerging applications. Its goal
is to develop database technologies that address the demands of data-intensive
applications that handle distributed, spatial, and multimedia data sources. We
continue to make progress toward our targeted objectives in both research and
education.
I.
Research activities have focused on:
GPAC:
Generic and Progressive Processing of Mobile Queries over
SOLE:
Scalable On-Line Execution of Continuous Queries on Spatio-temporal
Data Streams. In this research, we
introduce the Scalable On-Line Execution algorithm (SOLE, for short) for
continuous and on-line evaluation of concurrent continuous spatio-temporal
queries over data streams. Incoming spatio-temporal
data streams are processed in-memory against a set of outstanding continuous
queries. The SOLE algorithm utilizes the scarce memory resource efficiently by
keeping track of only the significant objects. In-memory stored objects
are expired (i.e., dropped) from memory once they become insignificant.
SOLE is a scalable algorithm where all the continuous outstanding queries share
the same buffer pool. In addition, SOLE is presented as a spatial join between
two input streams, a stream of spatio-temporal
objects and a stream of spatio-temporal queries. To
cope with intervals of high arrival rates of objects and/or queries, SOLE
utilizes a self-tuning approach based on load-shedding where some
of the stored objects are dropped from memory. SOLE is implemented as a pipelined
query operator that can be combined with traditional query operators in a query
execution plan to support a wide variety of continuous queries. Performance
experiments based on a real implementation of SOLE inside a prototype of a data
stream management system show the scalability and efficiency of SOLE in highly
dynamic environments.
Spatio-Temporal Histograms. In this research we introduce a framework for building and
continuously maintaining spatio-temporal histograms
(ST-Histograms, for short). ST-Histograms are used for selectivity estimation
of continuous pipelined query operators. Unlike traditional histograms that examine
and/or sample all incoming data tuples, ST-Histograms
are built by monitoring the actual selectivities
of the outstanding continuous queries. ST-Histograms have three main features:
(1) The ST-Histograms are built with (almost) no overhead to the system. We use
only feedback (i.e., the actual selectivity) from the existing continuous
queries. (2) Rather than wasting system resources in maintaining accurate
histograms for the whole spatial space, we only maintain accurate histograms
for that part of the space that is relevant to the current existing queries.
The rest of the space has less accurate histograms. (3) The ST-Histograms are
equipped with a periodicity detection procedure that predicts the future
execution of the continuous
queries. Hence, the query
processing engine can continuously adapt the continuous query pipeline to
reflect this prediction. Experimental results based on a real implementation
inside a data stream management system show a superior performance of ST-Histograms
in terms of providing accurate operator selectivity estimations with no extra
overhead.
LUGrid:
Update-tolerant Grid-based Indexing over Moving Objects. Indexing the locations of moving objects is one of the most
important issues in spatio-temporal databases. Former
proposed indexing approaches suffer from frequent updates of objects and thus
do not scale well. In this reserach, we introduce LUGrid, an adaptive Lazy-Update Grid-based index
that minimizes the cost of processing object updates. LUGrid
has two important features, namely, lazy-insertion and lazy-deletion.
Lazy-insertion reduces the cost of updates by flushing a batch of
updates to disk at one time, so that the cost of multiple updates is amortized.
Lazy-deletion further reduces updating costs by avoiding the deletion of
obsolete entries from other disk pages and eliminating the need to maintain any
auxiliary index. LUGrid adapts to object
distributions through cell splitting and merging. In the paper, we extensively
discuss the structure of LUGrid and the algorithms
for update and query processing. Moreover, we provide theoretical analysis for
estimating the update costs of LUGrid. Comprehensive
experimental results indicate that while keeping query processing as efficient
as former work, LUGrid is several times superior than former work in processing object updates.
R-trees with Update Memos: An Efficient Solution to the
Frequent Updates Problem. The problem of frequently updating
multi-dimensional indexes has arisen in many location-dependent applications. While
the R-tree and its variants are one of the dominant choices for indexing
multi-dimensional objects, the R-trees exhibit inferior performance in the
presence of frequent updates. In this research, we introduce an R-tree
variation, termed RUM-tree (stands for R-tree with
Update Memo) that minimizes the cost of object updates. The RUM-tree processes
updates in a memo-based approach that avoids disk
accesses for purging old entries during an update process. Therefore, the cost
of an update operation in the RUM-tree is reduced to the cost of only an insert
operation. The removal of old object entries is carried out by the garbage cleaner inside the RUM-tree. In this research,
we give the details of the RUM-tree and study the properties of the RUM-tree.
Our theoretical analysis and experimental evaluation demonstrate that the
RUM-tree outperforms other R-tree variations in scenarios with frequent updates.
SEA-CNN:
Scalable Processing of Continuous K-Nearest Neighbor Queries in Spatio-temporal Databases. Location-aware environments are characterized by a large number
of objects and a large number of continuous queries. Both the objects and
continuous queries may change their locations over time. In this work, we focus
on continuous k-nearest neighbor queries (CKNN, for short). We present a new
algorithm, termed SEA-CNN, for answering continuously a collection of
concurrent CKNN queries. SEA-CNN has two important features: incremental evaluation
and shared execution. SEA-CNN achieves both efficiency and scalability in the
presence of a set of concurrent queries. Furthermore, SEA-CNN does not make any
assumptions about the movement of objects, e.g., the objects’ velocities and
shapes of trajectories, or about the mutability of the objects and/or the queries,
i.e., moving or stationary queries issued on moving or stationary objects. We
provide theoretical analysis of SEA-CNN with respect to the execution costs, memory requirements
and effects of tunable parameters. Comprehensive experimentation shows that
SEA-CNN is highly scalable and is more efficient in terms of both I/O and CPU
costs in comparison to other R-tree-based CKNN
techniques.
To Trie or Not to Trie? Realizing Space-partitioning Trees inside PostgreSQL: Experiences and Performance. Many evolving database applications warrant the use of
non-traditional indexing mechanisms beyond B+-trees and hash tables. SP-GiST is an extensible indexing framework that broadens the
class of supported indexes to include disk-based versions of a wide variety of
space-partitioning trees, e.g., disk-based trie
variants, quadtree variants, and kd-trees.
This research makes a serious attempt at implementing and realizing SP-GiST-based
indexes inside PostgreSQL. Several index types are
realized inside PostgreSQL facilitated by rapid SP-GiST instantiations. Challenges, experiences, and performance
issues are addressed. Performance comparisons are conducted from within PostgreSQL to compare update and search performances of SP-GiST-based indexes against the B+-tree and the R-tree for
text string and point data sets. Interesting
performance results are produced by this research. Results highlight the
potential performance gains of SP-GiST-based indexes
as well as several strengths and weaknesses of SP-GiST-based
indexes that need to be addressed in future research.
Discovering Consensus Patterns using Progressive Hierarchical
Clustering with Applications in Biological Databases. A consensus pattern is a highly conserved pattern with very few
substitutions where no gaps are allowed within the pattern. Discovering
consensus patterns has many useful applications, e.g., discovering motifs and
tandem repeats in biological databases. The length of a consensus pattern is
usually within a certain range that is determined by the domain knowledge. In
this research, we introduce a disk-based progressive hierarchical clustering technique
for discovering consensus patterns of a certain length range. The proposed
technique has the ability to discover the consensus patterns with various
requirements by applying a post-processing phase to refine the discovered
patterns. The progressive nature of the proposed hierarchical clustering
algorithm makes it scalable and efficient. Our experiments are conducted in the
context of biological databases for discovering motif and tandem repeat
consensus patterns.
II.
Education activities are focused in the following activities:
Prototype Software Systems: Students involved in my research
lab, both at the undergraduate and graduate levels, have had the opportunity to
develop significant experience in systems-oriented database research.
SP-GiST (http://www.cs.purdue.edu/homes/aref/dbsystems_files/SP-GiST/):
Since the release of SP-GiST last year, and with over
thirty users world-wide, new effort by my group has been made to realize the
SP-GiST framework inside PostgreSQL
(an open-source extensible database developed at Berkeley). Many
non-traditional indexes are now realizable in an industrial-strength database
management system, and hence enabling efficient processing of non-traditional
and emerging database applications.
Bioinformatics Undergraduate Research Course: I continued to offer this
undergraduate research course. The course initiates a new collaborative
research effort with professors in the Biology department at Purdue. This
collaboration is in the form of a research course at the undergraduate level,
where both CS and Biology undergraduate students register into the course. With
the success in this course, I am currently proposing to the CS department of
make this a regular CS undergraduate course (Titled: Bioinformatics Project)
that is to be taken at the Junior/Senior level with a database systems course
as a pre-requisite.
Graduating
Students. The following students graduated during this past
year. All were supported in part by this grant:
1.
M.F. Mokbel (Ph.D. June
2005 – accepted an assistant professor job at
2.
M.G. Elfeky (Ph.D. March
2005 – accepted a job at
Tutorial
(at the IEEE International Conference on Data Engineering - ICDE): Rank-aware
Query Processing. With the success
of this tutorial last year in the EDBT conference in Crete, Aref
and his former student Ilyas gave this tutorial again
at ICDE this year in Tokyo. The main goal of this tutorial was to give an
in-depth look at rank-aware query processing as an increasingly interesting
area of research to the database community and to give the necessary background
and the state-of-the-art techniques in research prototypes and
industrial-strength database engines. We give an inclusive background on
ranking, voting and rank-aggregation algorithms in theory. We then give a
detailed coverage of ranking query models--covering top-k selection and top-k
join queries, and the various approaches recently taken by researchers in
academia and industry to support these queries inside database systems.
Proposal Information
Proposal number : 0093116
PI : Walid G. Aref
Institution :
Title : Research and Teaching of Database
Technologies For
Emerging
Applications
Requested duration : 15 September 2001 – 15 September 2006
Findings
9/01/2004 – 9/01/2005
Keywords
Spatio-temporal
Database Systems, Continuous Query Processing, Ranking, Top-k Queries, Stream
Data Systems, Extensible Spatial Indexing, Bioinformatics.
This section highlights our specific research findings in
three major research areas of the project: Scalable
Processing of Continuous Queries in Spatio-temporal
Databases, Performance of Space-Partitioning Trees, Incremental
Evaluation for Sliding-Window Continuous Queries in Stream Query Processing
Servers, and Bio-informatics-related Research.
Spatio-temporal Database Systems. The
wide spread of location-detection devices (e.g., GPS-like devices, RFID's,
handheld devices, and cellular phones) has given rise to a wide variety of
vital spatio-temporal database applications including
location-aware services, traffic monitoring, and enhanced 911 service. These
applications pose serious challenges to existing database engines.
Aref and his collaborators at Purdue developed the first
prototype data stream management system (the PLACE server – PLACE stands for
Pervasive Location-Aware Computing Environment) that handles massive spatio-temporal data streams in real-time. The PLACE server is the first data stream management system that provides support for both spatio-temporal queries and spatio-temporal
data streams. Unlike previous systems and approaches for spatio-temporal
databases that support spatio-temporal algorithms as
high level applications, the PLACE server encapsulates continuous query
algorithms into physical pipelined query operators that can be part of a larger
query pipeline inside a data stream management system. The PLACE server
addresses new challenges of streaming environments that include managing the
scarce memory resource, dealing with uncertainty areas of moving queries, and adapting the system
behavior to intervals of high workload via load shedding. The PLACE server was demonstrated at the 2004 International
Conference on Very Large Data Bases. Aref and his
collaborators developed incremental spatio-temporal
query processing algorithms and indexing techniques. In order to handle the
vast amounts of updates due to the large numbers of moving objects and moving
queries, Aref and his collaborators developed
tolerant indexes on both the objects and on the queries that can accommodate
for the massive numbers of updates whenever the objects or the queries move.
PLACE utilizes the notions of “positive” and “negative” tuples
to incrementally update and send the answers of a query to a mobile client.
Other
researchers in spatio-temporal databases consider
instances of this problem, e.g., developing indexing structures to efficiently
answer static range queries on moving objects, or moving queries on static
objects, etc. However, these techniques do not scale from a systems point of
view. In their SIGMOD 2004 paper, Aref and colleagues
presented the SINA family of algorithms that scale to all object/query
mutability combinations in a very efficient manner. Although relatively recent,
the SINA paper is referenced widely in database systems venues.
In contrast to
snapshot queries that are answered against one state of the database,
continuous queries are queries that last over a long duration of time and are
reevaluated every time a new change to the database or to the query occurs
(e.g., when the location of the k-nearest-neighbor query changes or when a new tuple gets inserted into the database). To target
scalability, Aref and his colleagues proposed to
incrementally update the answer to “continuous” queries in contrast to
reevaluating them from scratch. In the
IEEE Mobile Data Management 2005 and IEEE ICDE 2005 conference papers, Aref and his colleagues introduced the notion of update tuples to incrementally update the answers to continuous
k-nearest-neighbor and range queries for all object/query mutability
combinations. They developed a shared execution paradigm among the
concurrently outstanding continuous queries. The main idea is to abstract the
problem of executing multiple continuous spatio-temporal
queries into a spatio-temporal join operation. The
inputs to the join operation are two streams; a stream of continuously moving
objects and a stream of continuously moving queries. Furthermore, concurrently
outstanding queries share the same in-memory buffers. In the PLACE continuous
query processor, Aref and his colleagues go beyond
the idea of implementing high level algorithms for continuous spatio-temporal queries. Instead, they encapsulate the spatio-temporal query algorithms into a set of primitive spatio-temporal pipelined operators that can be part of a
larger query plan. Having a set of primitive spatio-temporal
operators results in greatly enhancing the query
performance by pushing the spatio-temporal operators
into the query pipeline, and hence giving the query optimizer more flexibility
in producing more efficient query plans. Both
theoretical analysis and comprehensive experimentation demonstrate that the
techniques proposed by Aref and his colleagues are
highly scalable and are more efficient by an order of magnitude than
competitive spatio-temporal approaches.
Currently, Aref and his students are studying
spatio-temporal query optimization issues in PLACE. As a first step towards query optimization in spatio-temporal
databases, in [SSTD’05], Aref and colleagues presented a framework for building and
continuously maintaining spatio-temporal histograms
(ST-Histograms, for short).
ST-Histograms are used for selectivity estimation of continuous pipelined query
operators. Unlike traditional histograms that examine and/or sample all
incoming data tuples, ST-Histograms are built by
monitoring the actual selectivities of the
outstanding continuous queries. ST-Histograms have three main features: (1) The
ST-Histograms are built with (almost) no overhead to the system. Feedback
(i.e., the actual selectivity) from the existing continuous queries is used.
(2) Rather than wasting system resources in maintaining accurate histograms for
the whole spatial space, accurate histograms are maintained for only that part
of the space that is relevant to the current existing queries. The rest of the
space has less accurate histograms. (3) The ST-Histograms are equipped with a
periodicity detection procedure that predicts the future execution of the
continuous queries. Hence, the query processing engine can continuously adapt
the continuous query pipeline to reflect this prediction. In fact, this is an
instance of where data mining can fit evenly in query optimization.
Experimental results based on a real implementation inside PLACE show a
superior performance of ST-Histograms in terms of providing accurate operator
selectivity estimations with no extra overhead (only 8.5% error for the
existing queries of size 1% and an average error of 25% for new queries when the
existing query coverage is 60%).
GPAC:
Generic and Progressive Processing of Mobile Queries over
SOLE:
Scalable On-Line Execution of Continuous Queries on Spatio-temporal
Data Streams. We presented the Scalable
On-Line Execution algorithm (SOLE, for short) for continuous and on-line
evaluation of concurrent continuous spatio-temporal
queries over spatio-temporal data streams. SOLE is an
in-memory algorithm that utilizes the scarce memory resources efficiently by
keeping track of only those objects that are considered significant
(i.e., the objects that are needed by at least one outstanding continuous
query). SOLE utilizes a shared memory structure that is accessible by all outstanding
continuous queries. SOLE is a unified framework that supports both stationary
and moving queries. k-nearest-neighbor queries
are modeled and treated as range queries. Uncertainty areas in
continuous spatio-temporal queries are avoided in
SOLE by using a caching technique. SOLE is encapsulated into a physical
pipelined query operator that interacts with traditional query operators to
support various query types. To cope with intervals of high arrival rates of objects
and/or queries, SOLE utilizes load shedding techniques that aim to
support more continuous queries, yet with an approximate answer. Two load
shedding techniques are proposed, namely, eager load shedding and lazy
load shedding. Experimental results based on a real implementation of SOLE
inside a prototype data stream management system show that SOLE can support up
to 20 times continuous queries more than the case of dealing with each query
separately. Also, implementing SOLE as a pipelined operator
results in up to 20 times performance advantage over a table function
implementation. With the lazy load shedding, SOLE can support up
to 13 times more queries than the case with no load shedding.
LUGrid:
Update-tolerant Grid-based Indexing over Moving Objects. We proposed LUGrid; An adaptive Lazy-Update Grid-based indexing
structure. LUGrid efficiently handles object updates
by its unique lazy-update features, namely, lazy-insertion and lazy-deletion.
Lazy-deletion reduces the update cost from traditional ``insertion cost plus
deletion cost" to ``insertion cost only". The lazy-deletion
functionality is provided by maintaining a memo structure to identify
obsolete entries. Further, lazy-insertion groups updates
and flushes multiple updates at one time, so that the cost for single update is
amortized. The proposed lazy-update techniques in LUGrid
can be applied to other index families. We give theoretical analysis for LUGrid and present an extensive set of experiment.
Depending on different experiment settings, LUGrid is
2 to 8 times superior than former work in terms of the
I/O cost for processing object updates.
R-trees with
Update Memos: An Efficient Solution to the Frequent Updates Problem. Supporting frequent updates in indexes is receiving
increasing attentions in the context of the spatio-temporal
databases and moving object databases. As the R-tree and its variants have been
regarded as one of the primary choices for spatial indexing, improving the
R-tree's update performance with frequent updates is a challenging yet
important issue. For R-tree update, the most costly part lies in searching the
objects to be updated. In contrast to the former update approaches, we
presented a memo-based approach to avoid the deletion I/O costs. In the
proposed RUM-tree, object updates are ordered temporally according to the
processing time. By maintaining the update memo, more than one entries of an
object may coexist in the RUM-tree. The obsolete entries are deleted lazily and
in batches. Garbage cleaner is employed to limit the garbage ratio in the
RUM-tree and confine the size of the Update Memo. The RUM-tree along with the
garbage cleaners outperforms significantly other R-tree variants in the update
performance, while yielding similar search performance. We believe that the
memo-based update approach has potential to support frequent updates in many
other indexing structures, for instances, B-trees, quadtrees
and Grid files.
SEA-CNN:
Scalable Processing of Continuous K-Nearest Neighbor Queries in Spatio-temporal Databases. We investigated the problem of evaluating a large set of
continuous k-nearest neighbor (CKNN) queries in spatio-temporal
databases. We introduce the Shared Execution Algorithm (SEA-CNN, for
short) to efficiently maintain the answer results of CKNN queries. SEA-CNN
combines incremental evaluation and shared execution to minimize the costs when
updating the query answers. With incremental evaluation, only queries affected
by the motion of objects are reevaluated. To minimize the evaluation time, each
affected query is associated with a searching region based on its former query
answer. Under shared execution paradigm, concurrent queries are grouped in a
common query table. Thus the problem of evaluating multiple queries is solved
by a spatial join between the query table and the object table. Furthermore,
SEA-CNN is a generally applicable framework. First, SEA-CNN does not make any
assumptions about the movement of objects (e.g., velocities, trajectories).
Second, SEA-CNN is suitable for processing moving/stationary queries issued on
moving/stationary objects. We provide theoretical analysis of SEA-CNN in terms
of the execution costs, the memory requirements and the effects of tunable parameters.
Comprehensive experiments illustrate that SEA-CNN is highly scalable and is
more efficient to R-tree-based CKNN techniques in terms of both the number of
I/Os and the CPU costs.
Spatio-Temporal Histograms. We explored the
usage of spatio-temporal histograms for selectivity
estimation of spatio-temporal operators. We presented
a general framework for building and continuously maintaining spatio-temporal histograms. The main idea of our proposed spatio-temporal histograms is to use a continuous
feedback from the outstanding continuous queries to maintain a spatio-temporal histogram for only those parts of the
spatial space that are of interest to at least one outstanding continuous
query. Parts of the spatial space that are not of interest to any of the
outstanding queries do not participate in maintaining the spatio-temporal
histograms, thus the overhead of continuously maintaining our spatio-temporal histograms is reduced. Our proposed spatio-temporal histograms utilize periodicity detection
techniques to discover temporal periodic patterns. Discovering temporal
patterns provides pre-computation of the optimal query plan over the course of execution
of continuous queries. Experimental results show that our spatio-temporal
histograms provide only 8.5% error for the existing queries of size 1%. An average error of 25% for new queries when the existing query
coverage is 60%.
To Trie or Not to Trie? Realizing Space-partitioning Trees inside PostgreSQL: Experiences and Performance. We presented a serious attempt at implementing and realizing SP-GiST-based indexes inside PostgreSQL.
We realized three index structures; namely one variant of a forest trie, a kd-tree and a suffix
tree. Several implementation challenges,
experiences, and performance issues are addressed. Our experiments demonstrate
the potential gain of the SP-GiST indexes. For
example, the trie has more than 150\% search
performance improvement over the B+-tree in the case of the exact match
search, and it has more than 2 orders of magnitude search performance gain over
the B+-tree in the case of the regular
expression match search. The kd-tree also has
more than 300% search performance improvement over the R-tree in the case of the
point match search. Our experiments demonstrate several weaknesses of
SP-GiST indexes that need to be addressed in future
research. For example, the insertion
time and the index size of the SP-GiST indexes
involve higher overhead than those of the B+-tree, and the R-tree indexes.
Discovering Consensus Patterns using Progressive Hierarchical
Clustering with Applications in Biological Databases. We proposed Bio-CP, a
disk-based progressive hierarchical clustering technique for discovering
consensus patterns. We demonstrated the applicability of Bio-CP in the context
of biological databases for discovering motifs and tandem repeats. Bio-CP is applicable to a wide range of
applications because any domain-specific requirements are applied in a
post-processing phase. We proposed several scalability techniques to enhance
the performance of Bio-CP and to allow efficient disk access. Our experiments illustrated that Bio-CP scales
very well with respect to the processing time, the clustering validation
degrees and the disk access. Bio-CP has more than 500\% processing time
improvement, and more than 350% disk access improvement over the
non-progressive clustering techniques.