CAREER: Research and Teaching of Database Technologies for Emerging Applications

 

Walid G. Aref (PI)

Department of Computer Science

Purdue University

West Lafayette, Indiana 47907

Phone: 765 494 1997, Fax: 765 494 0739

E-mail: aref@cs.purdue.edu

URL: www.cs.purdue.edu/faculty/aref.html

 

List of Supported Students and Staff

·   Prof. Walid G. Aref (1 summer month per year).

·   Mohamed Mokbel (Ph.D. Student)

·   Elian Haliman (Undergraduate Student)

 

Project Award Information

·   Award Number: 0093116

·   Duration: 15 September 200115 September 2006

·   Title: Research and Teaching of Database Technologies for Emerging Applications

·   Keywords: Spatio-temporal Databases, Continuous Query Processing, Ranking, Top-k Queries, Stream Data Systems, Disk Scheduling, Quality of service, Adaptive spatial query processing, Spatial Indexing, High-dimensional indexing, Space-filling Curves.

 

Project Summary

This project supports an integrated research and education effort in the area of database technologies for emerging applications. Its goal is to develop database technologies that address the demands of data-intensive applications that handle distributed, spatial, and multimedia data sources. We continue to make progress toward our targeted objectives in both research and education. The research activities have focused on:

 

SINA: Scalable Incremental Processing of Continuous Queries in Spatio-temporal Databases. This research introduces the Scalable INcremental hash-based Algorithm (SINA, for short); a new algorithm for evaluating a set of concurrent continuous spatio-temporal queries. SINA is designed with two goals in mind: (1) Scalability in terms of the number of concurrent continuous spatio-temporal queries, and (2) Incremental evaluation of continuous spatio-temporal queries. SINA achieves scalability by employing a shared execution paradigm where the execution of continuous spatio-temporal queries is abstracted as a spatial join between a set of moving objects and a set of moving queries. Incremental evaluation is achieved by computing only the updates of the previously reported answer. We introduce two types of updates, namely positive and negative updates. Positive or negative updates indicate that a certain object should be added to or removed from the previously reported answer, respectively. SINA manages the computation of positive and negative updates via three phases: the hashing phase, the invalidation phase, and the joining phase. The hashing phase employs an in-memory hash-based join algorithm that results in a set of positive updates. The invalidation phase is triggered every T seconds or when the memory is fully occupied to produce a set of negative updates. Finally, the joining phase is triggered by the end of the invalidation phase to produce a set of both positive and negative updates that result from joining in-memory data with in-disk data. Experimental results show that SINA is scalable and is more efficient than other index-based spatio-temporal algorithms.

 

Rank-aware Query Optimization. Ranking is an important property that needs to be fully supported by current relational query engines. Recently, several rank-join query operators have been proposed based on rank-aggregation algorithms. Rank-join operators progressively rank the join results while performing the join operation. The new operators have a direct impact on traditional query processing and optimization. We introduce a rank-aware query optimization framework that fully integrates rank-join operators into relational query engines. The framework is based on extending the System R dynamic programming algorithm in both enumeration and pruning. We define ranking as an interesting property that triggers the generation of rank-aware query plans. Unlike traditional join operators, optimizing for rank-join operators depends on estimating the input cardinality to these operators. We introduce a probabilistic model for estimating the input cardinality, and hence the cost of a rank-join operator. To our knowledge, this paper is the first effort in estimating the needed input size for optimal rank-aggregation algorithms.  Costing ranking plans, although challenging, is key to the full integration of rank-join operators in real-world query processing engines. We experimentally evaluate our framework by modifying the query optimizer of an open-source database management system. The experiments show the validity of our framework and the accuracy of the proposed estimation model.

 

Time Management for Sliding-Window Continuous Queries. Emerging data stream processing systems rely on windowing to enable on-the-fly processing of continuous queries over unbounded streams. As a result, several recent efforts have developed window-aware implementations of query operators such as joins and aggregates.  This focus on individual operators, however, ignores the larger issue of how to coordinate the pipelined execution of such operators when combined into a full windowed query plan. In this research, we first show how the straightforward application of traditional pipelined query processing techniques to sliding window queries can result in inefficient and incorrect behavior. We then present alternative execution techniques that guarantee correct behavior for pipelined sliding window queries and develop new algorithms for correctly evaluating window-based duplicate-elimination and Set operators in this context. A detailed performance study has been conducted using a prototype stream database system and both real and synthetic data streams. On average, our proposed approaches provide an order of magnitude reduction in delays of the query answers when compared to conventional pipelined execution.

 

Towards Scalable Location-aware Services: Requirements and Research Issues. The emergence of location-aware services calls for new real time spatio-temporal query processing algorithms that deal with large numbers of mobile objects and queries. Online query response is an important characterization of location-aware services. A delay in the answer to a query gives invalid and obsolete results, simply because moving objects can change their locations before the query responds. To handle large numbers of spatio-temporal queries efficiently, we propose the idea of sharing as a means to achieve scalability. In this research, we introduce several types of sharing in the context of continuous spatio-temporal queries. Examples of sharing in the context of real-time spatio-temporal database systems include sharing the execution, sharing the underlying space, sharing the sliding time windows, and sharing the objects of interest. We demonstrate how sharing can be integrated into query predicates, e.g., selection and spatial join processing. The goal of this research is to outline research directions and approaches that will lead to scalable and efficient location-aware services.

 

PLACE: A Query Processor for Handling Real-time Spatio-temporal Data Streams. The emergence of location-aware services calls for new real-time spatio-temporal query processing algorithms that deal with large numbers of mobile objects and queries. In this research, we prototype the PLACE server (Pervasive Location-Aware Computing Environments); a scalable location-aware database server that we are developing at Purdue University. The PLACE server addresses scalability by adopting an incremental evaluation mechanism for answering concurrently executing continuous spatio-temporal queries. The PLACE server supports a wide variety of stationary and moving continuous spatio-temporal queries through a set of pipelined spatio-temporal operators. The large numbers of moving objects generate real-time spatio-temporal data streams.

 

Nile: A Query Processing Engine for Data Streams. The Nile software prototype is a query processing engine for stream database systems. Nile extends the query processor engine of an object-relational database management system, to process continuous queries over data streams. The Nile design approach is motivated by the fact that industrial strength DBMSs provide greater flexibility in defining and executing queries over static data. On the other hand, many emerging applications, such as sensor-based environments, real-time retail transactions, and video surveillance, continuously report up-to-the-minute readings of sensor values, locations, status updates, etc. Therefore, extending DBMS functionality to support data streams plays a central role for these emerging applications. Nile uses traditional SQL operators and adopts the notion of long-running continuous queries that are also used by many emerging stream data systems. Nile considers windowed execution as an approach to restrict the size of the stored state in operators such as join. Continuous queries that use such operators are termed sliding window queries (SWQs). Nile supports (1) Flexible interface to integrate and communicate with different varieties of stream data sources, (2) Efficient and correct pipelined execution of SWQs over multiple data streams. The correct execution is enforced by two novel pipelined scheduling approaches: Time-probing approach and Negative tuple approach, and (3) Scalable design through the sharing (query clustering) of multiple SWQs.

 

Using Convolution to Mine Obscure Periodic Patterns in One Pass. The mining of periodic patterns in time series databases is an interesting data mining problem that can be envisioned as a tool for forecasting and predicting the future behavior of time series data. Existing periodic patterns mining algorithms either assume that the periodic rate (or simply the period) is user-specified, or try to detect potential values for the period in a separate phase. The former assumption is a considerable disadvantage, especially in time series databases where the period is not known a priori. The latter approach results in a multi-pass algorithm, which on the other hand is to be avoided in online environments (e.g., data streams). In this research, we develop an algorithm that mines periodic patterns in time series databases with unknown or obscure periods such that discovering the period is part of the mining process. Based on convolution, our algorithm requires only one pass over a time series of length n, with O(n log n) time complexity.

 

Bulk Operations for Space-Partitioning Trees. The emergence of extensible index structures, e.g., GiST (Generalized Search Tree) and SP-GiST (Space-Partitioning Generalized Search Tree), calls for a set of extensible algorithms to support different operations (e.g., insertion, deletion, and search). Extensible bulk operations (e.g., bulk loading and bulk insertion) are of the same importance and need to be supported in these index engines. In this research, we propose two extensible buffer-based algorithms for bulk operations in the class of space-partitioning trees; a class of hierarchical data structures that recursively decompose the space into disjoint partitions. The main idea of these algorithms is to build an in-memory tree of the target space-partitioning index. Then, data items are recursively partitioned into disk-based buffers using the in-memory tree. Although the second algorithm is designed for bulk insertion, it can be used in bulk loading as well. The proposed extensible algorithms are implemented inside SP-GiST; a framework for supporting the class of space-partitioning trees. Both algorithms have I/O bound O(NH/B), where N is the number of data items to be bulk loaded/inserted, B is the number of tree nodes that can fit in one disk page, H is the tree height in terms of pages after applying a clustering algorithm. Experimental results are provided to show the scalability and applicability of the proposed algorithms for the class of space-partitioning trees. A comparison of the two proposed algorithms shows that the first algorithm performs better in case of bulk loading. However the second algorithm is more general and can be used for efficient bulk insertion.

 

Scalable Spatio-temporal Continuous Query Processing for Location-aware Services. The emergence of location-aware services calls for new real-time spatio-temporal query processing algorithms that deal with large numbers of moving objects and large numbers of continuous spatio-temporal queries. In this project, we use shared execution as a mechanism to support scalability in location-aware servers. The main idea is to maintain a query table that stores information about continuous spatio-temporal queries. Then, answering spatio-temporal queries is abstracted as a spatial join among the moving objects and queries. Three query join polices are proposed aiming to minimize the cost of the join operation under the shared execution paradigm, namely the Clock-triggered Join Policy, the Incremental Join Policy, and the Hot Join Policy. We introduced the concept of a No-Action Region that is used in conjunction with the hot join policy. We proposed algorithms that calculate the No-Action region for objects and queries. Experimental performance demonstrates that the No-Action region is more efficient than other approaches when used along with the hot join policy. Experiments also demonstrate that the hot join policy outperforms the clock-triggered join policy and the incremental join policy in terms of both I/O and CPU costs.

Prototype Software Systems:

1. SP-GiST (http://www.cs.purdue.edu/spgist/): a spatial indexing framework for space-partitioning trees has been publicly released last year and since then has had many users world-wide.

2. vdbms (http://www.cs.purdue.edu/vdbms/ ): a video database management system for advancing teaching and research in multimedia database systems.

3. Nile and PLACE (see descriptions of both prototypes above). Both systems have been demonstrated successfully in two major database conferences: ICDE 2004 and VLDB 2004, respectively. We plan to release these two software systems publicly in the following year.

Publications and Products

Year 1  9/01/2001 – 9/01/2002

[1]     S. Prabhakar, Y. Xia, D.V. Kalashnikov, W.G. Aref, S.E. Hambrusch: Query Indexing and Velocity Constrained Indexing: Scalable Techniques for Continuous Queries on Moving Objects. IEEE Transactions on Computers 51(10): 1124-1140 (2002).

[2]     F.A. Kokkoras, H. Jiang, I.P. Vlahavas, A.K. Elmagarmid, E.N. Houstis, W.G. Aref. Smart VideoText: a video data model based on conceptual graphs. ACM Multimedia Systems Journal 8(4): 328-338 (2002).

[3]     X. Zhu, J. Fan, A. Elmagarmid, W.G. Aref, Hierarchical Video Summary For Medical Data, Proc. SPIE: Storage and Retrieval for Media Databases 2002, Vol. 4676, pp.395-406, San Jose, Jan. 2002.

[4]     W.G. Aref, A.K. Elmagarmid, J. Fan, J. Guo, M. Hammad, I.F. Ilyas, M.S. Marzouk, S. Prabhakar, A. Rezgui, S. Teoh, E. Terzi, Y. Tu, A. Vakali, X.Q. Zhu, A Distributed Database Server for Continuous Media. (Demo Paper). The IEEE Intl. Conf. on Data Eng. (ICDE), pp. 490-491, San Jose, CA, March 2002.

[5]     X. Zhu, J. Fan, W.G. Aref, A.K. Elmagarmid, ClassMiner: Mining Medical Video Content Structure and Events Towards Efficient Access and Scalable Skimming, In Proceedings of ACM SIGMOD workshop on Data Mining and Knowledge Discovery, pp. 9-16, Madison, June, 2002.

[6]     W.G. Aref, K. El-Bassyouni, I. Kamel, M. Mokbel: Scalable QoS-Aware Disk-Scheduling. IEEE Intl. Database Eng. & Appl. Symp. (IDEAS), pp. 256-265, Edmonton, Canada, July 2002.

[7]     I.F. Ilyas, W.G. Aref, A.K. Elmagarmid, Joining Ranked Inputs in Practice. The Intl. Conference on Very Large Data Bases (VLDB), pp. 950-961, Hong Kong, August 2002.

[8]     M.A. Hammad, W.G. Aref, and A.K. Elmagarmid, Search-based Buffer Management Policies for Streaming in Continuous Media Servers, IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, August 2002.

[9]     C. Berberidis, I.P. Vlahavas, W.G. Aref, M.J. Atallah, A.K. Elmagarmid, On the Discovery of Weak Periodicities in Large Time Series. In T. Elomaa, H. Mannila, H. Toivonen (Eds.): 6th European Conf. Principles of Data Mining and Knowledge Discovery (PKDD), Helsinki, Finland, August 2002. Lecture Notes in Computer Science 2431, pp. 51-61. Springer 2002.

[10]  C. Berberidis, W.G. Aref, M.J. Atallah, I.P. Vlahavas, A.K. Elmagarmid: Multiple and Partial Periodicity Mining in Time Series Databases. Proceedings of the 15th European Conference on Artificial Intelligence (ECAI), pp. 370-374, Lyon, France, July 2002.

Year 2  9/01/2002 – 9/01/2003

[11]  M.F. Mokbel, W.G. Aref, I. Kamel: Analysis of Multi-dimensional Space-filling Curves. GeoInformatica Intl. Journal, Springer Verlag, 7(3):179-209 (2003).

[12]  M.F. Mokbel, T.M. Ghanem, and W.G. Aref, "Spatio-temporal Access Methods", IEEE Data Engineering Bulletin, 26(2), 40-49, Jun., 2003.

[13]  D.V. K.alashnikov, S. Prabhakar, S.E. Hambrusch, W.G. Aref: Efficient Evaluation of Continuous Range Queries on Moving Objects. 13th Intl. Conf. Database and Expert Systems Applications (DEXA), Aix-en-Provence, France, September 2002. Lecture Notes in Computer Science 2453 Springer, pp. 731-740.

[14]  W.G. Aref, A.C. Catlin, J. Fan, A.K. Elmagarmid, M.A. Hammad, I.F. Ilyas, M.S. Marzouk, X. Zhu, A Video Database Management System for Advancing Video Database Research, In Proc. of the Intl. Workshop on Multimedia Information Systems, pp. 817, Tempe, Arizona, Oct. 2002.

[15]  W.G. Aref, M. Mokbel, and I. Kamel: Performance of Multi-dimensional Space-filling Curves. ACM Symp. on Adv. in Geog. Info. Sys. (ACM-GIS), pp. 149-154, McLean VA, Nov. 2002.

[16]  X.Q. Zhu, J. Fan, W.G. Aref, A.C. Catlin, A.K. Elmagarmid. Medical Video Mining for Efficient Database Indexing, Management and Access. In Proceedings of the IEEE 19th Intl. Conference on Data Engineering (ICDE), pp. 569-580, Bangalore, India, Mar. 2003.

[17]  M.F. Mokbel, W.G. Aref, and A. Grama, "Spectral LPM: An Optimal Locality-Preserving Mapping using the Spectral (not Fractal) Order in Proceedings of the IEEE 19th Intl. Conference on Data Engineering (ICDE), pp. 699-701, Bangalore, India, Mar. 2003.

[18]  M.A. Hammad, W.G. Aref and A.K. Elmagarmid. Stream Window Join: Tracking Moving Objects in Sensor-Network Databases. In Proceedings of 15th Intl. Conference on Scientific and Statistical Database Management (SSDBM), pp. 75-84, Boston, June 2003.

[19]  M.F. Mokbel and W.G. Aref, On Query Processing and Optimality Using Spectral Locality-Preserving Mappings. 8th Intl. Symp. on Spatial and Temporal Databases (SSTD), pp. 102-121, Santorini Island, Greece, July 2003.

Year 3  9/01/2003 – 9/01/2004

[20]  M. Hammad, M. Franklin, W.G. Aref and A. Elmagarmid. Scheduling for Shared Window Joins over Data Streams. Intl. Conf. on Very Large Data Bases, pp. 297-308, Berlin, Sept. 2003.

[21]  I. Ilyas, W.G. Aref, A. Elmagarmid. Supporting Top-k Join Queries in Relational Databases. The Intl. Conf. on Very Large Data Bases (VLDB), pp. 754-765, Berlin, Sept. 2003.

[22]  W.G. Aref, A. Catlin, A. Elmagarmid, J. Fan, M. Hammad, I. Ilyas, M. Marzouk, S. Prabhakar, Y. Tu, and X. Zhu. VDBMS: A Testbed Facility for Research in Video Database Benchmarking. In Proc. of the 9th Intl. Conf. on Dist. Multimedia Sys. (DMS), pp. 160-166, Miami, Sept. 2003.

[23]  M. Mokbel, W.G. Aref, S. Hambrusch, S. Prabhakar: Towards scalable location-aware services: requirements and research issues. ACM Intl. Symp. on Adv. in Geog. Info. Sys. (ACM-GIS), pp. 110-117, New Orleans, Nov. 2003.

[24]  W.G. Aref, A.C. Catlin, A.K. Elmagarmid J. Fan, M.A. Hammad, I. Ilyas, M. Marzouk and T. Ghanem. Video Query Processing in the VDBMS Testbed for Video Database Research. In Proc. of the 1st ACM Intl. Workshop on Multimedia Databases, pp. 25-32, New Orleans, Nov. 2003.

[25]  T.M. Ghanem and W.G. Aref. Databases Deepen the Web, IEEE Computer Magazine, Vol. 37, No. 1, pp. 116 -- 117, January 2004. (Overview Article)  

[26]  M.G. Elfeky, W.G. Aref, and A.K. Elmagarmid. Using Convolution to Mine Obscure Periodic Patterns in One Pass. In Procs. of the 9th International Conference on Extending Database Technology (EDBT), Heraklion - Crete, Greece, Mar. 2004.

[27]  T. Ghanem, R. Shah, M. Mokbel, W.G. Aref, J.S. Vitter. Bulk Operations for Space-Partitioning Trees. In Proc. of the 20th IEEE Intl. Conf. on Data Eng. (ICDE), pp. 29 -- 40, Boston, Mar. 2004.

[28]  M. Mokbel, M. Lu, W.G. Aref. Hash-Merge Join: A Non-blocking Join Algorithm for Producing Fast and Early Join Results. In Proc. of the 20th IEEE Intl Conf. on Data Eng. (ICDE), pp. 251 -- 262, Boston, Mar. 2004.

[29]  M.F. Mokbel, W.G. Aref, K. El-Bassyouni, I. Kamel. Scalable Multimedia Disk Scheduling. In Proceedings of the 20th IEEE Intl. Conf. on Data Eng. (ICDE), pp. 498 -- 509, Boston, Mar. 2004.

[30]  M. Hammad, M. Mokbel, M. Ali, W.G. Aref, A. Catlin, A.K. Elmagarmid, M. Eltabakh, M. Elfeky, T. Ghanem, R. Gwadera, I. Ilyas, M. Marzouk, X. Xiong. Nile: A Query Processing Engine for Data Streams. (Demo Paper). In Proc. of the 20th IEEE Intl. Conf. on Data Eng. (ICDE), pp. 851, Boston, Mar. 2004.

[31]  I.F. Ilyas, R. Shah, W.G. Aref, J.S. Vitter and A.K. Elmagarmid. Rank-aware Query Optimization. In Proceedings of the 2004 ACM SIGMOD Conf. on Mgmt. of Data, pp. 203 -- 214, Paris, France, June 2004.

[32]  M.F. Mokbel, X. Xiong, and W.G. Aref. SINA: Scalable Incremental Processing of Continuous Queries in Spatio-temporal Databases. In Proceedings of the 2004 ACM SIGMOD Conference on Management of Data, pp. 623 -- 634, Paris, France, June 2004.

[33]  X. Xiong, M.F. Mokbel, W.G. Aref, S.E. Hambrusch, S. Prabhakar. Scalable Spatio-temporal Continuous Query Processing for Location-aware Services. The 16th International Conference on Scientific and Statistical Databases, pp. 317 -- 326, Santorini Island, Greece, June 2004.

[34]  M. F. Mokbel, X. Xiong, M. Hammad, and W. G. Aref. Continuous Query Processing of Spatio-temporal Data Streams in PLACE. In Proc. of the 2nd Intl. Workshop on Spatio-temporal Databases Mgmt (STDBM), pp. 65 -- 72, co-located with VLDB, Toronto, Canada, August 2004.

[35]  M. F. Mokbel, X. Xiong, W. G. Aref, S. E. Hambrusch, S. Prabhakar, M. A. Hammad. PLACE: A Query Processor for Handling Real-time Spatio-temporal Data Streams. (Demo Paper) In Proc. of Intl. Conference on Very Large Data Bases (VLDB), pp. 1377 -- 1380, Toronto, Sept. 2004.

[36]  W.G. Aref, M.G. Elfeky, and A.K. Elmagarmid: Incremental, Online, and Merge Mining of Partial Periodic Patterns in Time-series Databases. IEEE Transactions of Knowledge and Data Engineering. Vol. 16, No. 3, pp. 332-342, March 2004.

[37]  W.G. Aref, A.C. Catlin, A.K. Elmagarmid, J. Fan, M.A. Hammad, I.F. Ilyas, M.S. Marzouk, S. Prabhakar, X. Zhu. VDBMS: A testbed facility for research in video database benchmarking. ACM Multimedia Systems Journal, Special Issue on Multimedia Document Mgmt. Sys., Vol. 9, No. 6, pp. 575 -- 585, 2004.

[38]  I. Ilyas, W.G. Aref and A.K. Elmagarmid. "Supporting Top-k Queries in Relational Databases". The VLDB Journal (Special Issue on Best Papers of the VLDB Conf., 2003), Vol. 13, No. 3, pp. 207 – 221, Sept.  2004.