Tanu Malik


Email:    tmalik at cs dot purdue dot edu

              tmalik at purdue dot edu


Phone:    765-494-2235

Fax:       765-496-2275          


Address:  Ernest C. Young Hall 1017

               155 S Grant St

               West Lafayette, IN 47907

            

CV:        pdf

Tanu Malik is a Research Assistant Professor with the Cyber Center in Discovery Park at Purdue University and with the Indiana Center for Database Systems. Her research interests include data federations, database caching, query execution and optimization, self-organizing database systems, and summary structures for cardinality estimation. A recurrent theme in her research is re-examining the core principles of database technology in light of new requirements emerging from scientific data. This research has produced database technology for handling large, distributed scientific data.


Tanu earned her Ph.D. and M.S. in 2007 from the Department of Computer Science at Johns Hopkins University. She earned her B.Tech. in 1999 from the Department of Civil Engineering at the Indian Institute of Technology, Kanpur. In between, she also spent time at the Department of Computer Science and Engineering at IIT Bombay and at Citibank.




Currently, I am interested in three new (though unrelated) requirements of scientific data: satisfying currency requirements when scientific data is updated in real time, introducing self-management modules in curated archives, and efficiently optimizing queries on graph data. Please see the following project pages for more details.


  •  Dynamic Data Caching for Network Bound Applications

  •  Adaptive Physical Design

  •  Structure Similarity in Graphs


I have also started work in two exciting new areas: data provenance and approximate data caching (check back for more details). I am also continuing to expand on my graduate work to devise improved methods for scheduling queries for higher throughput in scientific workloads.




   

Previously, I focused on two important problems in scientific data management: combining a large number of diverse data sources for the execution of scientific queries, and executing data-intensive scientific queries on these data sources efficiently in terms of both network and I/O.


The work led to SkyQuery, a system that federates data from several petabyte-scale, autonomous, and heterogeneous astronomy databases scattered worldwide. Using SkyQuery, scientists can write declarative queries that compare and merge multiple astronomical datasets.


For efficient query execution, we proposed Bypass-Yield Caching, a novel caching framework that dramatically reduces the network bandwidth requirements of data-intensive federations such as SkyQuery. Within this work we also looked at query cardinality estimation techniques in distributed applications as well as physical design layout of proxy caches.


The success of SkyQuery and BYCaching, and their adoption by the National Virtual Observatory, is an example of data management systems enabling scientific endeavors.

   

  •  Mar 2009: “Adaptive Physical Design for Curated Archives” to appear in SSDBM 2009.

  •  Dec 2008: Dynamic data caching has a brand new project page.

  •  Nov 2008: NEES pre-proposal got through NSF!

  •  Oct 2008: Our paper on Batch Query Processing Workloads was accepted at CIDR!

  •  T. Malik, X. Wang, D. Dash, A. Chaudhary, R. Burns, and A. Ailamaki. Adaptive Physical Design for Curated Archives. To appear in Scientific and Statistical Database Management Systems (SSDBM), 2009.

  •  X. Wang, R. Burns, and T. Malik. LifeRaft: Data-driven, Batch Processing for Exploration of Scientific Databases. In: Proc. of 5th Biennial Conference on Innovative Database Systems Research (CIDR), 2009.

  •  T. Malik and R. Burns. Workload Aware Histograms for Remote Applications. In: Proc. of 10th Conference on Data Warehousing and Knowledge Discovery (DaWaK), 2008.

  •  T. Malik, X. Wang, D. Dash, R. Burns, and A. Ailamaki. Automated Physical Design in Database Caches. In: ICDE Workshop on Self-Managing Database Systems (SMDB), 2008.

  •  X. Wang, T. Malik, R. Burns, S. Papadomanolakis, and A. Ailamaki. A Workload-driven Unit of Cache Replacement for Mid-Tier Database Caching. In: Proc. of 12th Conference on Database Systems for Advanced Applications (DASFAA), 2007.


More...

Sloan Digital Sky Survey (SDSS) Workload

We are currently using workloads extracted from query logs on the SDSS database. A representative query trace from the DR4 version of the database is included.


Trace:    DR4 (1.4 million queries)


Open SkyQuery Portal

A modified version of the Open SkyQuery portal is available, with a semi-functional proxy cache that operates on SDSS at table granularity. The caching module is included under "CacheTools" and is designed as a proof of concept. It demonstrates how bypass decisions are made in the SkyQuery framework but does not actually load and evict multi-gigabyte tables.


Portal:   Source, Readme