Previously, I focused on two important problems in scientific data management: combining a large number of diverse data sources to answer scientific queries, and executing data-intensive scientific queries on these sources efficiently in terms of both network and I/O cost.
This work led to SkyQuery, a system that federates data from several petabyte-size, autonomous, and heterogeneous astronomy databases scattered worldwide. Using SkyQuery, scientists can write declarative queries that compare and merge multiple astronomical datasets.
For efficient query execution, we proposed Bypass-Yield Caching, a novel caching framework that dramatically reduces the network bandwidth requirements of data-intensive federations such as SkyQuery. As part of this work, we also studied query cardinality estimation techniques for distributed applications and the physical design and layout of proxy caches.
The success of SkyQuery and Bypass-Yield Caching, together with adoption by the National Virtual Observatory, is an example of data management systems enabling scientific endeavors.