The Database Systems Seminar


November 5, 2019 (Samuel Conte Distinguished Lecture Series):

Michael Stonebraker, Adjunct Professor of Computer Science, MIT

We Are Often Working on the Wrong Problem (10 misconceptions about what is important)


In the DBMS/Data Systems area, many of us seem to have lost our way.  This talk discusses 10 different problem areas in which there is considerable current research.  Then, I present why I believe much of the work is misguided, either because our assumptions about these problems are incorrect or because we are not paying attention to real users.  Topics considered include machine learning (deep and conventional), public blockchain, data warehouses, schema evolution, and the cloud.


October 29, 2019 (Samuel Conte Distinguished Lecture Series):

Raghu Ramakrishnan, CTO for Data and Technical Fellow at Microsoft

Data in the Cloud


The cloud has forced a rethinking of database architectures.  Does this offer an opportunity to address the siloed nature of data management systems?  The question is especially important given the rise of machine learning and data governance. In this talk, I'll discuss these issues through the lens of the Microsoft data journey, both internal and external.


October 25, 2019: 

Cyrus Shahabi, Professor and Chair, Department of Computer Science, Univ. of Southern California

Transportation Data, Applications & Research for Smart Cities


In this talk, I first introduce the Integrated Media Systems Center (IMSC), a data science research center at USC that focuses on data-driven solutions for real-world applications. IMSC is motivated by the need to address fundamental Data Science problems related to applications with major societal impact. Toward this end, I delve into one specific application domain, Transportation, and discuss the design and development of a large-scale transportation data platform and its application to real-world problems in Smart Cities.  I will then cover some of our fundamental research in this area, in particular: 1) traffic forecasting and 2) ride matching.


February 27, 2019:

Spyros Blanas, Professor, Department of Computer Science and Engineering, Ohio State University

Scaling Database Systems to High-performance Computers 


We are witnessing the increasing use of warehouse-scale computers to analyze massive datasets quickly. This poses two challenges for database systems. The first challenge is interoperability with established analytics libraries and tools. Massive datasets often consist of images (arrays) in file formats like FITS and HDF5. We will first present ArrayBridge, an open-source I/O library that allows SciDB, TensorFlow and HDF5-based programs to co-exist in a pipeline without converting between file formats. The second challenge is scalability, as warehouse-scale computers expose communication bottlenecks in foundational data processing operations. We will present GRASP, a parallel aggregation algorithm for high-cardinality aggregation that avoids unscalable all-to-all communication and leverages similarity to complete the aggregation faster than repartitioning. Finally, we will present an RDMA-aware data shuffling algorithm that transmits data up to 4X faster than MPI. We conclude by highlighting additional challenges that need to be overcome to scale database systems to massive computers.
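As context for why high-cardinality aggregation is hard (this is a toy model, not the GRASP algorithm itself): combiner-style local pre-aggregation ships one partial result per group per machine, so when the number of distinct groups approaches the number of rows, it degenerates into shipping nearly every row, i.e., the all-to-all repartitioning regime the abstract describes as unscalable. A minimal Python sketch:

```python
def shipped_partials(machines):
    """machines: list of per-machine row lists, each row a (group_key, value) pair.
    Models combiner-style pre-aggregation: each machine collapses its rows to
    one partial per distinct group before sending, so network traffic equals
    the sum over machines of the distinct groups seen locally."""
    return sum(len({k for k, _ in rows}) for rows in machines)

# Low cardinality: 2 groups shared by both machines; 6 rows collapse to 4 partials.
low = [[("a", 1), ("a", 1), ("b", 1)], [("a", 1), ("b", 1), ("b", 1)]]

# High cardinality: every key distinct; 6 rows still require 6 partials,
# i.e., pre-aggregation saves nothing and communication is all-to-all.
high = [[(0, 1), (1, 1), (2, 1)], [(3, 1), (4, 1), (5, 1)]]

print(shipped_partials(low))   # 4
print(shipped_partials(high))  # 6
```

GRASP improves on this baseline by exploiting similarity between partitions rather than blindly repartitioning; the sketch only shows the communication cliff it targets.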


December 3, 2018: 

Semih Salihoglu, Professor, Department of Computer Science, University of Waterloo

How can (worst-case optimal) joins be so interesting?


Worst-case optimality is perhaps the weakest notion of optimality for algorithms. A recent surprising theoretical development in databases has been the realization that the traditional join algorithms, which are based on binary joins, are not even worst-case optimal. Upon this realization, several surprisingly simple join algorithms have been developed that are provably worst-case optimal. Unlike traditional algorithms, which join subsets of tables at a time, worst-case optimal join algorithms perform the join one attribute (or column) at a time. This talk gives an overview of several lines of work that my colleagues and I have been doing on worst-case optimal join algorithms, focusing on their application to subgraph queries. I will cover work from both distributed and serial settings. In the distributed setting, worst-case optimality is a yardstick for two costs of an algorithm: (i) the load, i.e., the amount of data per machine; and (ii) the total communication. Both load and communication trade off against the number of rounds an algorithm runs. I will describe how to achieve worst-case optimality in total communication and the performance of this algorithm on subgraph queries. It is an open theoretical problem to design constant-round algorithms with worst-case optimal load. In the serial setting, I will describe the optimizer of a prototype graph database called Graphflow that we are building at the University of Waterloo. Graphflow's optimizer for subgraph queries seamlessly mixes worst-case optimal join-style column-at-a-time processing with traditional binary joins.
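The attribute-at-a-time idea can be sketched for the simplest subgraph query, the triangle query Q(a,b,c) = R(a,b) ⋈ S(b,c) ⋈ T(a,c). This is an illustrative toy, not Graphflow's implementation: production worst-case optimal joins (e.g., Leapfrog Triejoin) use sorted trie indexes, but the binding order below, one attribute at a time with set intersections, is the core of the technique.

```python
from collections import defaultdict

def triangles(R, S, T):
    """Enumerate Q(a,b,c) = R(a,b) join S(b,c) join T(a,c), binding one
    attribute at a time instead of joining whole tables pairwise."""
    # Index each relation by its first attribute for fast lookup.
    R_idx = defaultdict(set)   # a -> {b}
    S_idx = defaultdict(set)   # b -> {c}
    T_idx = defaultdict(set)   # a -> {c}
    for a, b in R: R_idx[a].add(b)
    for b, c in S: S_idx[b].add(c)
    for a, c in T: T_idx[a].add(c)

    out = []
    for a in sorted(R_idx.keys() & T_idx.keys()):      # bind a
        for b in sorted(R_idx[a] & S_idx.keys()):      # bind b given a
            for c in sorted(S_idx[b] & T_idx[a]):      # bind c given a, b
                out.append((a, b, c))
    return out

# One triangle on vertices 1, 2, 3:
print(triangles([(1, 2), (2, 3)], [(2, 3)], [(1, 3)]))  # [(1, 2, 3)]
```

Each intersection prunes candidate bindings against every relation that mentions the attribute, which is what lets such algorithms meet the worst-case output-size bound that binary join plans can exceed.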