Proposal Information

Proposal Information

Proposal number : 0093116

PI : Walid G. Aref

Institution : Purdue University

Title : Research and Teaching of Database Technologies For

Emerging Applications

Requested duration : 15 September 2001 – 15 September 2006

Activities

9/01/2002 – 9/01/2003

Keywords

Rank join, Spatial indexing, Adaptive spatial query processing, Disk scheduling, Quality of service, High-dimensional indexing, Multimedia disk scheduling, Data streaming, Window join, Time-series data mining.

Project Summary

This project supports an integrated research and education effort in the area of database technologies for emerging applications. Its goal is to develop database technologies that address the demands of data-intensive applications that handle distributed, spatial, and multimedia data sources. We continue to make progress toward our targeted objectives in both research and education.

I. Research activities have focused on:

QoS-Aware Disk Scheduling. We have developed a new multimedia disk scheduling algorithm, termed Cascaded-SFC. The Cascaded-SFC multimedia disk scheduler is applicable in environments where multimedia data requests arrive with different quality of service (QoS) requirements such as real-time deadline and user priority. Previous work on disk scheduling has focused on optimizing the seek times and/or meeting the real-time deadlines. The Cascaded-SFC disk scheduler provides a unified framework for multimedia disk scheduling that scales with the number of scheduling parameters. The general idea is based on modeling the multimedia disk requests as points in multiple multi-dimensional sub-spaces, where each of the dimensions represents one of the parameters (e.g., one dimension represents the request deadline, another represents the disk cylinder number, and a third dimension represents the priority of the request, etc.). Each multi-dimensional sub-space represents a set of QoS parameters. Then the multimedia disk scheduling problem reduces to the problem of finding a linear order to traverse the multi-dimensional points in each sub-space. Multiple space-filling curves are integrated in a cascaded way to provide a total order for the whole space. Comprehensive experiments demonstrate the efficiency and scalability of the Cascaded-SFC disk scheduling algorithm over other disk schedulers.

Optimal Range Query Processing Using Spectral Locality-preserving Mappings. A locality-preserving mapping from the multi-dimensional space into the one-dimensional space is beneficial for many applications (e.g., range queries, nearest-neighbor queries, clustering, and declustering) when multi-dimensional data is placed into one-dimensional storage (e.g., the disk). The idea behind a locality-preserving mapping is to map points that are nearby in the multi-dimensional space into points that are nearby in the one-dimensional space. For the past two decades, fractals (e.g., the Hilbert and Peano space-filling curves) have been considered the natural method for providing a locality-preserving mapping to support efficient answer for range queries and similarity search queries. In this research, we go beyond the idea of fractals. Instead, we use a locality-preserving mapping algorithm (The Spectral LPM) that makes use of the spectrum of the multi-dimensional space. This research provably demonstrates how Spectral LPM provides a globally optimal mapping from the multi-dimensional space to the one-dimensional space, and hence outperforms fractals. The performance of Spectral LPM was contrasted with several fractal mappings in the context of range queries. Experimental results verified the analytical results that Spectral LPM gives superior performance over fractal-based mapping algorithms. The research gives rise to interesting open problems and challenges that relate to Spectral LPM.

Adaptive Spatial Query Processing. Traditional join algorithms aim to optimize the query performance for producing the join result using minimal cost. In the case of an index, this requires the ability to completely read and process all the relations before producing the result. This approach fails to cope with new applications that include: 1) Massive data size where the process of reading the whole relations before producing the result suffers from a considerable delay. 2) Data that comes from distributed sources over different network connections. One of the connections may suffer from a networking delay, which subsequently delays the query result. In this research effort, we develop a hash-merge join algorithm, a non-blocking join algorithm that produces fast and early join results without the need to process the whole data. The non-blocking Hash-merge join algorithm (HMJ, for short) has two phases: the hashing phase, and the merging phase. The hashing phase produces early join results by employing an in-memory hash-based join algorithm. When memory is exhausted, part of the data is flushed to disk using an elegant flushing policy. The merging phase continuously produces join results by employing a disk-based sort-merge-like join algorithm. Experimental results show that HMJ combines the advantages of state-of-the-art non-blocking join algorithms while avoiding their shortcomings.

Scheduling for Shared Window Joins Over Data Streams. This research is jointly conducted with Professor Mike Franklin at the University of California, Berkeley. The research addresses the problem of scheduling the shared execution of multiple window-join queries over data streams. Each join has its own sliding window, which can be expressed in terms of time units or tuple counts. Joining tuples from the underlying data streams may serve the purpose of multiple window join queries. Sharing the execution of these queries will maximize the utilization of system resources. One way to share the execution of multiple window joins is to use the largest window among all queries. This approach, the Largest Window Only(LWO), would penalize the response time of queries with small windows to serve the query with the largest window size. Two new algorithms for scheduling the execution of the shared window-join are developed; the Smallest Window First(SWF) and the Maximum Query Throughput (MQT) algorithms. An analytical study of the tradeoffs between the LWO and SWF algorithms leads to the development of the MQT algorithm. The performance study of the three algorithms shows that the MQT algorithm provides the best performance in terms of response time for all queries.

Tracking Moving Objects in Sensor-Network Databases. The widespread use of sensor networks presents revolutionary opportunities for life and environmental science applications. Many of these applications involve continuous queries that require the tracking, monitoring, and correlation of multi-sensor data that represent moving objects. In this research, we propose to answer these queries using a multi-way stream window join operator. This form of join over multi-sensor data must cope with the infinite nature of sensor data streams and the delays in network transmission. This research introduces a class of join algorithms, termed W-join, for joining multiple infinite data streams. W-join addresses the infinite nature of the data streams by joining stream data items that lie within a sliding window and that match a certain join condition. W-join can be used to track the motion of a moving object or detect the propagation of clouds of hazardous material or pollution spills over time in a sensor network environment. We describe two new algorithms for W-join, and address variations and local/global optimizations related to specifying the nature of the window constraints to fulfill the posed queries. The performance of the proposed algorithms is studied experimentally in a prototype stream database system, using synthetic data streams and real time-series data. Tradeoffs of the proposed algorithms and their advantages and disadvantages are highlighted, given variations in the aggregate arrival rates of the input data streams and the desired response times per query.

Performance of multi-dimensional space-filling curves. A space-filling curve is a way of mapping the multi-dimensional space into the one-dimensional space. It acts like a thread that passes through every cell element (or pixel) in the D-dimensional space so that every cell is visited exactly once. There are numerous kinds of space-filling curves. The difference between such curves is in their way of mapping to the one dimensional space. Selecting the appropriate curve for any application requires knowledge of the mapping scheme provided by each space-filling curve. A space-filling curve consists of a set of segments. Each segment connects two consecutive multi-dimensional points. Five different types of segments are distinguished, namely, Jump, Contiguity, Reverse, Forward, and Still. A description vector V=(J,C,R,F,S), where J,C,R,F, and S, are the percentages of Jump, Contiguity, Reverse, Forward, and Still segments in the space-filling curve, encapsulates all the properties of a space-filling curve. The knowledge of V facilitates the process of selecting the appropriate space-filling curve for different applications. We developed closed formulas to compute the description vector V for any D-dimensional space and grid size N for different space-filling curves. In this research, we conducted a comparative study of different space filling curves with respect to the description vector.

Partial Periodic Patterns in Time-Series Databases. Mining of periodic patterns in time-series databases is an interesting data mining problem. It can be envisioned as a tool for forecasting and prediction of the future behavior of time-series data. Incremental mining refers to the issue of maintaining the discovered patterns over time in the presence of more items being added into the database. Because of the mostly append-only nature of updating time-series data, incremental mining would be very effective and efficient. In this research, we proposed several algorithms for incremental mining of partial periodic patterns in time-series databases and analyzed their performance empirically. The new algorithms allow for online adaptation of the thresholds in order to produce interactive mining of partial periodic patterns. The storage overhead of the incremental online mining algorithms is analyzed. Results show that the storage overhead for storing the intermediate data structures pays off as the incremental online mining of partial periodic patterns proves to be significantly more efficient than the non-incremental non-online versions. Moreover, we introduced a new problem, termed Merge Mining, as a generalization of incremental mining Merge mining can be defined as merging the discovered patterns of two or more databases that are mined independently of each other. We proposed and analyzed a new algorithm for merge mining of partial periodic patterns in time-series databases.

Periodicity Detection in Time Series Databases. Discovering the rate at which the time series is periodic has always been an obstacle for fully automated periodicity mining. In existing periodic patterns mining algorithms, the period length is user-specified. This is a considerable disadvantage, especially in time series datasets where the period length is not known a priori. In this research, we address the problem of detecting the periodicity rate of a time series database. Two types of periodicities are defined, and a scalable and computationally efficient algorithm is proposed for each type. The algorithms require only one scan over a time series database of size n, with O(n log n) time complexity.

Joining Ranked Inputs in Practice. Joining ranked inputs is an essential requirement for many database applications, such as ranking search results from multiple search engines and answering multi-feature queries for multimedia retrieval systems. We introduce a new practical pipelined query operator, termed NRA-RJ, that produces a global rank from input ranked streams based on a score function. The output of NRA-RJ can serve as a valid input to other NRA-RJ operators in the query pipeline. Hence, the NRA-RJ operator can support a hierarchy of join operations and can be easily integrated in query processing engines of commercial database systems. The NRA-RJ operator bridges Fagin's optimal aggregation algorithm into a practical implementation and contains several optimizations that address performance issues. We compare the performance of NRA-RJ against recent rank join algorithms. Experimental results demonstrate the performance trade-offs among these algorithms. The experimental results are based on an empirical study applied to a medical video application on top of a prototype database system. The study reveals important design options and shows that the NRA-RJ operator outperforms other pipelined rank join operators when the join condition is an equi-join on key attributes.

Supporting Top-k Join Queries in Relational Databases. Ranking queries produce results that are ordered on some computed score. Typically, these queries involve joins, where users are usually interested only in the top-k join results. Current relational query processors do not handle ranking queries efficiently, especially when joins are involved. In this research, we address supporting top-k join queries in relational query processors. We introduce a new rank-join algorithm that makes use of the individual orders of its inputs to produce join results ordered on a user-specified scoring function. The idea is to rank the join results progressively during the join operation. We introduce two physical query operators based on variants of ripple join that implement the rank join algorithm. The operators are non-blocking and can be integrated into pipelined execution plans. We address several practical issues and optimization heuristics to integrate the new join operators in practical query processors. We implement the new operations inside a prototype database engine based on PREDATOR. The experimental evaluation of our approach compares recent algorithms for joining ranked inputs and shows superior performance.

Medical Video Mining for Efficient Database Indexing, Management and Access. To achieve more efficient video indexing and access, we introduce a video database management framework and strategies for video content structure and events mining. The video shot segmentation and key-frame selection strategy are first utilized to parse the continuous video stream into physical units. Video shot grouping, group merging, and scene clustering schemes are then proposed to organize the video shots into a hierarchical structure using clustered scenes, scenes, groups, and shots, in increasing granularity from top to bottom. Then, audio and video processing techniques are integrated to mine event information, such as dialog, presentation and clinical operation, from the detected scenes. Finally, the acquired video content structure and events are integrated to construct a scalable video skimming tool which can be used to visualize the video content hierarchy and event information for efficient access. Experimental results are also presented to evaluate the performance of the proposed framework and algorithms.

II. Education and Research Tools:

I have involved both undergraduate and graduate students in hands-on research activities and experimentation through the development of a lab environment to facilitate the integration of research results and education. I have developed and involved several tools that I now use in our classes. These tools include SP-GiST (http://www.cs.purdue.edu/spgist/): a spatial indexing framework for space-partitioning trees, and vdbms (http://www.cs.purdue.edu/vdbms/ ): a video database management system for advancing teaching and research in multimedia database systems. We use both systems in our classes and in our research.

III. Recent Publications:

1. S. Prabhakar, Y. Xia, D.V. Kalashnikov, W.G. Aref, S.E. Hambrusch: Query Indexing and Velocity Constrained Indexing: Scalable Techniques for Continuous Queries on Moving Objects. IEEE Trans. on Computers 51(10): 1124-1140 (2002).

2. I.F. Ilyas, W.G. Aref, A.K. Elmagarmid: Joining Ranked Inputs in Practice. VLDB 2002: 950-961.

3. M.A. Hammad, M.J. Franklin, W.G. Aref, A.K. Elmagarmid: Scheduling for shared window joins over data streams. VLDB 2003.

4. I.F. Ilyas, W.G. Aref, A.K. Elmagarmid: Supporting Top-k Join Queries in Relational Databases. VLDB 2003.

5. M.F. Mokbel, W.G. Aref, Ibrahim Kamel: Performance of Multi-dimensional space-filling curves. ACM-GIS 2002: 149-154.

6. D.V. Kalashnikov, S. Prabhakar, S.E. Hambrusch, W.G. Aref: Efficient Evaluation of Continuous Range Queries on Moving Objects. DEXA 2002: 731-740.

7. M.F. Mokbel, T.M. Ghanem, and W.G. Aref , "Spatio-temporal Access Methods", IEEE Data Engineering Bulletin, 26(2), 40-49, Jun., 2003.

8. X.Q. Zhu, J. Fan, W. Aref, A.C. Catlin, A. Elmagarmid, "Medical Video Mining for Efficient Database Indexing, Management and Access", ICDE'03 Proc. of the19th Intl. Conference on Data Engineering. March 2003. Bangalore, India.

9. M. Mokbel and W. Aref, "Spectral LPM: An Optimal Locality-Preserving Mapping using the Spectral (not Fractal) Order", ICDE'03 Proceedings of the 19th Intl. Conference on Data Engineering. March 2003. Bangalore, India.

10. M.A. Hammad, W.G. Aref and A.K. Elmagarmid, "Stream Window Join: Tracking Moving Objects in Sensor-Network Databases", Proc. of 15th International Conference on Scientific and Statistical Database Management, SSDBM 2003.