Proposal Information
Proposal number :
0093116
PI :
Walid G. Aref
Institution :
Title :
Research and Teaching of Database Technologies For
Emerging Applications
Requested duration :
Activities
Keywords
Rank join, Spatial indexing, Adaptive
spatial query processing, Disk scheduling, Quality of service, High-dimensional
indexing, Multimedia disk scheduling, Data streaming, Window join, Time-series
data mining.
Project Summary
This project supports an integrated research and education
effort in the area of database technologies for emerging applications. Its goal
is to develop database technologies that address the demands of data-intensive
applications that handle distributed, spatial, and multimedia data sources. We
continue to make progress toward our targeted objectives in both research and
education.
I.
Research activities have focused on:
QoS-Aware Disk
Scheduling. We have developed a new
multimedia disk scheduling algorithm, termed Cascaded-SFC. The Cascaded-SFC
multimedia disk scheduler is applicable in environments where multimedia data
requests arrive with different quality of service (QoS)
requirements such as real-time deadline and user priority. Previous work on disk scheduling has focused
on optimizing the seek times and/or meeting the real-time deadlines. The
Cascaded-SFC disk scheduler provides a unified framework for multimedia disk
scheduling that scales with the number of scheduling parameters. The general
idea is based on modeling the multimedia disk requests as points in multiple
multi-dimensional sub-spaces, where each of the dimensions represents one of
the parameters (e.g., one dimension represents the request deadline, another
represents the disk cylinder number, and a third dimension represents the
priority of the request, etc.). Each multi-dimensional sub-space represents a
set of QoS parameters. Then the multimedia disk
scheduling problem reduces to the problem of finding a linear order to traverse
the multi-dimensional points in each sub-space. Multiple space-filling curves
are integrated in a cascaded way to provide a total order for the whole space. Comprehensive
experiments demonstrate the efficiency and scalability of the Cascaded-SFC disk
scheduling algorithm over other disk schedulers.
Adaptive Spatial
Query Processing. Traditional join algorithms aim to
optimize the query performance for producing the join result using minimal
cost. In the case of an index, this requires the ability to completely read and
process all the relations before producing the result. This approach fails to
cope with new applications that include: 1) Massive data size where the process
of reading the whole relations before producing the result suffers from a
considerable delay. 2) Data that comes from distributed sources over different
network connections. One of the connections may suffer from a networking delay,
which subsequently delays the query result. In this research effort, we develop
a hash-merge join algorithm, a non-blocking join algorithm that produces fast
and early join results without the need to process the whole data. The non-blocking
Hash-merge join algorithm (HMJ, for short) has two phases: the hashing phase,
and the merging phase. The hashing phase produces early join results by employing
an in-memory hash-based join algorithm. When memory is exhausted, part of the
data is flushed to disk using an elegant flushing policy. The merging phase continuously
produces join results by employing a disk-based sort-merge-like join algorithm.
Experimental results show that HMJ combines the advantages of state-of-the-art
non-blocking join algorithms while avoiding their shortcomings.
Scheduling
for Shared Window Joins Over Data Streams. This research is jointly conducted with Professor Mike Franklin
at the
Tracking Moving Objects in Sensor-Network
Databases. The
widespread use of sensor networks presents revolutionary opportunities for life
and environmental science applications. Many of these applications involve
continuous queries that require the tracking, monitoring, and correlation of
multi-sensor data that represent moving objects. In this research, we propose
to answer these queries using a multi-way stream window join operator. This
form of join over multi-sensor data must cope with the infinite nature of
sensor data streams and the delays in network transmission. This research introduces
a class of join algorithms, termed W-join, for joining multiple infinite data
streams. W-join addresses the infinite nature of the data streams by joining
stream data items that lie within a sliding window and
that match a certain join condition. W-join can be used to track the motion of
a moving object or detect the propagation of clouds of hazardous material or
pollution spills over time in a sensor network environment. We describe two new
algorithms for W-join, and address variations and local/global optimizations
related to specifying the nature of the window constraints to fulfill the posed
queries. The performance of the proposed algorithms is studied experimentally
in a prototype stream database system, using synthetic data streams and real
time-series data. Tradeoffs of the proposed algorithms and their advantages and
disadvantages are highlighted, given variations in the aggregate arrival rates
of the input data streams and the desired response times per query.
Performance of
multi-dimensional space-filling curves. A
space-filling curve is a way of mapping the multi-dimensional space into the
one-dimensional space. It acts like a thread that passes through every cell
element (or pixel) in the D-dimensional space so that every cell is visited
exactly once. There are numerous kinds of space-filling curves. The difference
between such curves is in their way of mapping to the one dimensional space.
Selecting the appropriate curve for any application requires knowledge of the
mapping scheme provided by each space-filling curve. A space-filling curve
consists of a set of segments. Each segment connects two consecutive
multi-dimensional points. Five different types of segments are distinguished,
namely, Jump, Contiguity, Reverse, Forward, and Still.
A description vector V=(J,C,R,F,S), where J,C,R,F, and S, are the percentages
of Jump, Contiguity, Reverse, Forward, and Still
segments in the space-filling curve, encapsulates all the properties of a
space-filling curve. The knowledge of V facilitates the process of selecting
the appropriate space-filling curve for different applications. We developed closed
formulas to compute the description vector V for any D-dimensional space and
grid size N for different space-filling curves. In this research, we conducted
a comparative study of different space filling curves with respect to the description
vector.
Partial
Periodic Patterns in Time-Series Databases.
Mining of periodic patterns in
time-series databases is an interesting data mining problem. It can be
envisioned as a tool for forecasting and prediction of the future behavior of
time-series data. Incremental mining refers to the issue of maintaining the
discovered patterns over time in the presence of more items being added into
the database. Because of the mostly append-only nature of updating time-series
data, incremental mining would be very effective and efficient. In this
research, we proposed several algorithms for incremental mining of partial periodic
patterns in time-series databases and analyzed their performance empirically.
The new algorithms allow for online adaptation of the thresholds in order to
produce interactive mining of partial periodic patterns. The storage overhead
of the incremental online mining algorithms is analyzed. Results show that the
storage overhead for storing the intermediate data structures pays off as the
incremental online mining of partial periodic patterns proves to be
significantly more efficient than the non-incremental non-online versions.
Moreover, we introduced a new problem, termed Merge Mining, as a generalization
of incremental mining Merge mining can be defined as merging the discovered
patterns of two or more databases that are mined independently of each other. We
proposed and analyzed a new algorithm for merge mining of partial periodic patterns
in time-series databases.
Periodicity
Detection in Time Series Databases. Discovering
the rate at which the time series is periodic has always been an obstacle for
fully automated periodicity mining. In existing periodic patterns mining
algorithms, the period length is user-specified. This is a considerable
disadvantage, especially in time series datasets where the period length is not
known a priori. In this research, we address the problem of detecting the periodicity
rate of a time series database. Two types of periodicities are defined, and a
scalable and computationally efficient algorithm is proposed for each type. The
algorithms require only one scan over a time series database of size n, with O(n log n) time complexity.
Joining
Ranked Inputs in Practice. Joining
ranked inputs is an essential requirement for many database applications, such
as ranking search results from multiple search engines and answering
multi-feature queries for multimedia retrieval systems. We introduce a new
practical pipelined query operator, termed NRA-RJ, that
produces a global rank from input ranked streams based on a score function. The
output of NRA-RJ can serve as a valid input to other NRA-RJ operators in the
query pipeline. Hence, the NRA-RJ operator can support a hierarchy of join
operations and can be easily integrated in query processing engines of
commercial database systems. The NRA-RJ operator bridges Fagin's optimal aggregation
algorithm into a practical implementation and contains several optimizations
that address performance issues. We compare the performance of NRA-RJ against
recent rank join algorithms. Experimental results demonstrate the performance trade-offs
among these algorithms. The experimental results are based on an empirical
study applied to a medical video application on top of a prototype database
system. The study reveals important design options and shows that the NRA-RJ
operator outperforms other pipelined rank join operators when the join
condition is an equi-join on key attributes.
Supporting
Top-k Join Queries in Relational Databases. Ranking
queries produce results that are ordered on some computed score. Typically,
these queries involve joins, where users are usually interested only in the
top-k join results. Current relational query processors do not handle ranking
queries efficiently, especially when joins are involved. In this research, we
address supporting top-k join queries in relational query processors. We
introduce a new rank-join algorithm that makes use of the individual orders of
its inputs to produce join results ordered on a user-specified scoring
function. The idea is to rank the join results progressively during the join
operation. We introduce two physical query operators based on variants of
ripple join that implement the rank join algorithm. The operators are
non-blocking and can be integrated into pipelined execution plans. We address
several practical issues and optimization heuristics to integrate the new join
operators in practical query processors. We implement the new operations inside
a prototype database engine based on PREDATOR. The experimental evaluation of
our approach compares recent algorithms for joining ranked inputs and shows
superior performance.
Medical
Video Mining for Efficient Database Indexing, Management and Access. To
achieve more efficient video indexing and access, we introduce a video database
management framework and strategies for video content structure and events
mining. The video shot segmentation and key-frame selection strategy are first
utilized to parse the continuous video stream into physical units. Video shot
grouping, group merging, and scene clustering schemes are then proposed to
organize the video shots into a hierarchical structure using clustered scenes,
scenes, groups, and shots, in increasing granularity from top to bottom. Then,
audio and video processing techniques are integrated to mine event information,
such as dialog, presentation and clinical operation, from the detected scenes.
Finally, the acquired video content structure and events are integrated to
construct a scalable video skimming tool which can be used to visualize the
video content hierarchy and event information for efficient access.
Experimental results are also presented to evaluate the performance of the
proposed framework and algorithms.
II.
Education and Research Tools:
I have involved both undergraduate and graduate students in
hands-on research activities and experimentation through the development of a
lab environment to facilitate the integration of research results and
education. I have developed and involved several tools that I now use in our
classes. These tools include SP-GiST (http://www.cs.purdue.edu/spgist/):
a spatial indexing
framework for space-partitioning trees, and vdbms (http://www.cs.purdue.edu/vdbms/
): a video database management system for advancing teaching and research in
multimedia database systems. We use both systems in our classes and in our
research.
III. Recent Publications:
1. S. Prabhakar, Y. Xia, D.V. Kalashnikov, W.G. Aref, S.E.
Hambrusch: Query Indexing and Velocity Constrained Indexing: Scalable
Techniques for Continuous Queries on Moving Objects. IEEE Trans. on Computers
51(10): 1124-1140 (2002).
2. I.F. Ilyas, W.G. Aref, A.K. Elmagarmid: Joining Ranked
Inputs in Practice. VLDB 2002: 950-961.
3. M.A. Hammad, M.J. Franklin, W.G.
Aref, A.K. Elmagarmid:
Scheduling for shared window joins over data streams. VLDB 2003.
4. I.F. Ilyas, W.G. Aref, A.K. Elmagarmid: Supporting Top-k
Join Queries in Relational Databases. VLDB 2003.
5. M.F. Mokbel, W.G. Aref, Ibrahim
Kamel: Performance of Multi-dimensional space-filling curves. ACM-GIS 2002:
149-154.
6. D.V. Kalashnikov, S. Prabhakar, S.E.
Hambrusch, W.G. Aref: Efficient Evaluation of Continuous Range Queries on
Moving Objects. DEXA 2002: 731-740.
7. M.F. Mokbel, T.M. Ghanem, and W.G. Aref , "Spatio-temporal Access Methods", IEEE Data Engineering
Bulletin, 26(2), 40-49, Jun., 2003.
8. X.Q. Zhu, J. Fan, W. Aref, A.C. Catlin,
A. Elmagarmid, "Medical Video Mining for Efficient Database Indexing, Management and
Access", ICDE'03 Proc. of the19th Intl. Conference on Data Engineering.
March 2003.
9. M. Mokbel and
10. M.A. Hammad, W.G. Aref and A.K. Elmagarmid, "Stream Window Join: Tracking Moving
Objects in Sensor-Network Databases", Proc. of 15th International
Conference on Scientific and Statistical Database Management, SSDBM 2003.