CAREER:
Efficient I/O for Modern Database Applications
Sunil Prabhakar
Department of Computer Sciences
Purdue University
Contact Information
Sunil Prabhakar
1398 Computer Sciences Building
West Lafayette, IN 47907-1398.
Phone: (765) 494-6008
Fax: (765) 494-0739
Email: sunil@cs.purdue.edu
WWW PAGE
http://www.cs.purdue.edu/homes/sunil
List of Supported Students and Staff (optional)
· Sunil Prabhakar, PI
· Dmitri Kalashnikov, Graduate Research Assistant, Doctoral Student
Project Award Information
Keywords
Parallel I/O, Data Placement, Multimedia data, Multi-dimensional data, Tertiary Storage, Moving Object Databases, Sensor Databases.
Recent trends in hardware development and data-intensive applications have resulted in a
performance bottleneck for storing and retrieving data. The original goal
of this project is to develop a broad class of innovative techniques to alleviate
the I/O bottleneck for modern database applications. The project focuses on
data-intensive applications that handle multi-dimensional and multimedia data.
The research has three major directions. The first is the development of
declustering schemes for the efficient execution of range and nearest-neighbor
queries over large multi-dimensional datasets under realistic assumptions
such as non-constant disk I/O times, and non-uniform data and query distributions.
The second addresses the storage and content-based retrieval of multiple-quality,
multimedia documents. This component of the project investigates integrated
techniques for placement, scheduling, migration, and reliability of continuous
media data on secondary and tertiary storage. The approach is to design, develop,
implement, and test the schemes on real datasets. In this manner the effects
of the simplifying assumptions typically made to ease analysis or development
can be identified and addressed. The third addresses the emerging applications
that require the management of constantly evolving data, such as sensor and
moving object databases. The project will result in a collection of new techniques
as well as a prototype implementation and test results on real applications.
These will be made available for public access over the world-wide-web. The
expected impact is improved performance for a broad class of applications.
The education component aims to integrate I/O related issues for modern systems
into the graduate curriculum.
This involves the development of new web-based tools and projects
that will enable students to understand and experiment with I/O issues and solutions,
and that will facilitate distance learning.
Publications and Products
Project Impact
Human Resources. The project supported a Ph.D. student (D. Kalashnikov) who received his Ph.D. in May 2003. Other graduate students funded by different sources are addressing research issues identified by the project. These include Deepak Bobbarjung, Reynold Cheng, Jiangtao Li, Yicheng Tu and Yuni Xia. A project funded by IBM that addresses related issues of I/O management for large data sets was recently completed. D. Bobbarjung was funded through this project and addressed the problem of data placement for multi-resolution video. He defended his Master's thesis on this topic in May 2003. Microsoft Corp. has provided equipment and gift support for the development and evaluation of a prototype spatio-temporal testbed for moving objects. Several undergraduate students are involved in the development of the prototype system.
Goals, Objectives, and Targeted Activities
The goals of the project are to develop novel I/O management techniques for large-scale multimedia, multi-dimensional, and evolving data. The activities of the project have centered on the development of data placement, migration, and indexing techniques for video and multi-dimensional spatio-temporal data. In particular:
1) Data management schemes for very large amounts of video on hierarchical storage. We have developed a novel caching scheme for secondary storage that, when coupled with replication on tertiary storage, yields significant reductions in start-up latency for continuous multimedia objects such as video. We have also developed new placement schemes for tertiary storage that take into account relationships between objects to reduce expensive swapping of media. Evaluation to date has been based upon simulation. Two CD jukeboxes have been acquired through a separate grant for experimental purposes. A tertiary storage level (based upon these jukeboxes) has been integrated into the video database prototype (based upon Predator and SHORE).
2) Efficient retrieval of multi-resolution video. We are currently investigating data placement schemes for the efficient retrieval of multi-resolution video from disks. Alternative schemes have been developed and are being tested using a simulation setup built on top of available disk simulators.
3) Indexing for moving objects. We have developed several new indexing techniques for spatio-temporal data to efficiently process large numbers of concurrent, ongoing queries over moving objects. A study addressing the issue of main-memory evaluation has been conducted. We have shown the superiority of a grid index over the queries in this setting. We have also developed efficient main-memory algorithms for spatial joins.
4) We are currently addressing the use of replication for improving disk performance for multidimensional data. Preliminary results have been published.
5) We are investigating the impact of uncertainty in moving object and sensor databases and have developed probabilistic queries over uncertain data.
In the upcoming months we will build upon our earlier work and investigate more efficient I/O management techniques for moving object environments, including novel indexing schemes, imprecision and approximation, and further prototype development.
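To make the idea in item 3 concrete, the following is a minimal sketch of a grid index built over continuous range queries, so that each position update from a moving object only has to be checked against the queries registered in one cell. It is an illustration under stated assumptions, not the project's implementation: the class name QueryGridIndex, the unit-square domain, and the cell count are all hypothetical choices made for the example.

```python
from collections import defaultdict


class QueryGridIndex:
    """Hypothetical uniform grid over the unit square.

    Each cell stores the rectangular range queries that overlap it.
    """

    def __init__(self, cells_per_side=10):
        self.n = cells_per_side
        self.cells = defaultdict(list)  # (col, row) -> [(query_id, rect), ...]

    def _cell(self, v):
        # Map a coordinate in [0, 1] to a cell index; 1.0 falls in the last cell.
        return min(int(v * self.n), self.n - 1)

    def add_query(self, query_id, xmin, ymin, xmax, ymax):
        # Register the query in every cell its rectangle overlaps.
        rect = (xmin, ymin, xmax, ymax)
        for col in range(self._cell(xmin), self._cell(xmax) + 1):
            for row in range(self._cell(ymin), self._cell(ymax) + 1):
                self.cells[(col, row)].append((query_id, rect))

    def matching_queries(self, x, y):
        # Look up the single cell containing the point, then filter exactly.
        hits = []
        candidates = self.cells.get((self._cell(x), self._cell(y)), [])
        for query_id, (xmin, ymin, xmax, ymax) in candidates:
            if xmin <= x <= xmax and ymin <= y <= ymax:
                hits.append(query_id)
        return hits


index = QueryGridIndex(cells_per_side=4)
index.add_query("q1", 0.0, 0.0, 0.5, 0.5)
index.add_query("q2", 0.4, 0.4, 0.9, 0.9)
print(index.matching_queries(0.45, 0.45))  # ['q1', 'q2']
print(index.matching_queries(0.8, 0.8))    # ['q2']
```

In this sketch the index is built once over the queries and probed with each new object position, which keeps the per-update cost at one cell lookup plus a short exact check.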
Area Background
Fueled by improvements in technology, evolving user requirements, and the availability of ever-increasing amounts of data in digital format, modern applications are handling very large amounts of data. Although rapid improvements are being made in virtually all aspects of computer hardware and networking, the rate of improvement in certain areas is not commensurate with others. Limited by the need for physical motion, I/O technologies, chiefly magnetic disks and tertiary robotic libraries, are currently the performance bottleneck for data-intensive applications. Moreover, due to slower rates of improvement in I/O technologies, the bottleneck will become even more severe in the future. Techniques for the efficient management of I/O are therefore crucial. The I/O bottleneck can be significantly relaxed through techniques such as declustering, placement, and scheduling that exploit the structure of the data and access patterns. Many modern applications are characterized by very large storage requirements, in terms of both individual object size and total storage volume. Two very important and common classes of data in such applications are: 1) multidimensional data such as multi-attribute relations, OLAP cubes, GIS or spatial data, and feature vectors that capture multimedia object content; and 2) multimedia data such as images, video, and audio. Based upon the general patterns of access for these applications and the structure of the data, it is possible to provide improved performance through tailored I/O management techniques. More recently, the need to manage constantly evolving data from sensors and moving objects has arisen. Novel solutions are needed to achieve the desired goals of scalability and near real-time query evaluation. This project is a step in these directions.
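As a small illustration of what declustering means here, the sketch below uses Disk Modulo, a classic scheme from the literature that assigns grid cell (i, j) to disk (i + j) mod M. It is chosen only because it is simple; it is not one of the schemes developed in this project, and the function names are hypothetical. Assuming each cell costs one I/O, the response time of a range query is roughly the largest number of its cells that land on any single disk.

```python
from collections import Counter


def disk_modulo(cell, num_disks):
    # Disk Modulo: sum of the cell's grid coordinates, modulo the disk count.
    return sum(cell) % num_disks


def disks_touched(x_range, y_range, num_disks):
    # Count how many cells of a 2-D range query fall on each disk.
    # x_range and y_range are half-open (start, stop) pairs of cell indices.
    counts = Counter()
    for i in range(*x_range):
        for j in range(*y_range):
            counts[disk_modulo((i, j), num_disks)] += 1
    return counts


# A 4 x 4 query over 4 disks: 16 cells, 4 per disk, which is the ideal spread.
print(max(disks_touched((0, 4), (0, 4), num_disks=4).values()))  # 4

# A 2 x 2 query over 4 disks: two cells share a disk, so the ideal of 1 is missed.
print(max(disks_touched((0, 2), (0, 2), num_disks=4).values()))  # 2
```

The second query shows why the choice of scheme matters: a placement that is perfectly balanced for one query shape can still leave disks idle for another.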
Project and Area References
*All award information can be found on the NSF on-line Awards Abstracts system,
http://www.fastlane.nsf.gov/a6/A6Start.htm.