CS 44800: Introduction to Relational Database Management Systems

Project 2: Query Processing and Optimization

Team selection due 4:00pmEDT on Thursday, October 28; Individual portion due 11:59pmEDT Thursday, 4 November; Team portion due 11:59pmEST Monday, 8 November 2021

Updated 10/28 11:55EDT. (Numbering changes to report to more easily correspond to Gradescope, no changes to what is to be done)

Please turn in code on lab machines through turnin (turnin -c cs448 -p project2 <submission folder>) and the report through Gradescope, as with Project 1. Please do not turn in class files and cruft created by your development environment, just turn in the source needed to compile and run your project and the tests you use, along with a README.txt showing how to run your tests. As with project 1, things should run on the lab machines (amber01 - amber30.cs.purdue.edu). Make sure that you mark the start/end of each part in Gradescope. Please typeset your report; handwritten figures/drawings accepted where needed.

This assignment has three options for the individual portion, and will be done in teams of two or three. As with project 1, the teammates should be from your PSO, but need not be the same as project 1. If a two person team, one person will do task 1, one will do task 2 - you don't need to do task 3. Please send your PSO instructor your team selection and which team member will be doing which task by 4pmEDT on Thursday, October 28.

We encourage you to discuss your individual portions with your teammates. Even though you are primarily responsible for one task, understanding what your teammate(s) are doing will make the team integration portion much easier (since you'll think more about what you need to do for integration when doing your individual task.) Furthermore, it will give you an opportunity to learn about parts of the query processor that you don't need to modify for your task. Finally, explaining what you are doing to your teammate(s) will help you solidify your understanding of the parts of the system you are working with.

SimpleDB code base

We are providing a SimpleDB code base that has indexing, index join, and merge join included, although only index join will be used. While you can start with the default code base downloaded from Boston College, we recommend you start with the provided code base. This will make Tasks 2 and 3 a bit easier. It will make Task 1 more difficult, as you'll have to ensure that you don't try to use index join or merge join where they won't work (joining with an inequality). But it will make the team task (combining your code) considerably easier.

The code is available in the lab machines (amber01 - amber30.cs.purdue.edu) at /homes/cs448/SimpleDB_Proj2.zip, or can be downloaded using https. It includes a pre-loaded database with an index; this works on the lab machines, but on some platforms you may need to delete the studentdb directory and recreate using simpleclient/network/CreateProj2DB.java.

We recommend you use the client/server (network) interface, as getting used to this will make things easier when we get into concurrency. To start up and run from the command line the following will get you started (and includes queries that should use the index as well as not using an index.)

javac -cp . simpledb/server/StartServer.java javac -cp .:./simpleclient simpleclient/SimpleIJ.java java -cp . simpledb/server/StartServer & Running the server and client separate windows is a good idea, so you can keep the output message from the server separated from the output messages from the client
java -cp .:./simpleclient SimpleIJ Connect>
jdbc:simpledb://localhost
SQL> select EId, StudentID, SectionID, Grade from ENROLL where SectionID = 43
SQL> select EId, StudentID, SectionID, Grade from ENROLL where StudentID = 4
SQL> select StudentID, Grade from ENROLL,SECTION where SectionID = SectID and Prof = 'einstein'
SQL> select StudentID, Grade from ENROLL,STUDENT where StudentID=SId and MajorID=20

Note that there is an index on ENROLL, and the 2nd and 4th queries do use that index - although there isn't anything in the provided code that tells you they do.

Individual Task 1. Query processing and evaluation

SimpleDB supports only equality in select; this task is to add inequality. This will require updating the parser to support < as well as = (you don't need to do >, <=, >=, or !=, although you may if you wish. Better still, do OR, and you'll have the ability to do all of the operations while implementing only <), and carrying this through to support query processing. This will also necessitate changing cost estimation for selection. You should also ensure that join is handled correctly.

For full credit, you should use the provided code that has Index Join enabled. You will have to avoid using index join or merge join if you have a join criteria involving <, but make sure you do use it when the join is on equality.

Your individual report should include:

Instructions for running your code (should also be in README.txt file in the code)
A discussion of the changes made to implement inequality expressions.
How you have tested for correctness (and short test results). Note that the above queries are one test (e.g., query 2 should still use the index), but are not sufficient to show you've succeeded.
How you have tested to ensure that the join algorithm (as opposed to cross product) is being used when appropriate (and short test results). Again, the above queries should do the right thing (e.g., query 4 should still use the index), but are not sufficient.
Discussion of performance impacts (e.g., do you expect differences in performance between = and < queries? Can you demonstrate this?)

Individual Task 2. Query optimization: Choosing the proper join algorithm

SimpleDB contains code for both index join and merge join, although the default is to break a join into a cross product followed by select. The existing code supports using either index join or merge join, but not both. Task 2 is to enable both joins, and intelligently select the proper join to use based on cost estimates. Index join is useful when there are a small number of items to look up, but because of the non-sequential access, can be very slow if there are a lot of items to look up. You should be able to determine when this is appropriate based on the cost estimation statistics.

You'll also need tests to demonstrate that you are making appropriate choices, and the performance differences that result. You may find the opportunity to improve the existing Heuristic Query planner as part of doing this task.

Your individual report should include:

Instructions for running your code (should also be in README.txt file in the code)
A discussion of how you choose which algorithm to use (based on performance estimates of the algorithms.), and changes made to the code to implement this choice.
How you have tested for correctness (and short test results). Note that queries 3 and 4 are examples that can use the join algorithms you give, but are not sufficient for testing.
How you have tested to ensure that the join algorithm (as opposed to cross product) is being used when appropriate (and short test results). Note that the provided queries are probably not enough to test this.
How you have tested to see if performance was impacted (and short test results). The provided database is probably too small to do this, you'll likely need to either create your own database, or insert additional records into this one.

Individual Task 3. Block Nested Loop Join

The default join option in SimpleDB is cross product followed by select. Block nested loop join is much more efficient, and unlike merge join and hash join, can be used for any join. This task is to implement block nested loop join and provideappropriate cost estimates, as well as selecting when to use block nested loop.

Your individual report should include:

Instructions for running your code (should also be in README.txt file in the code)
A discussion of changes needed and overview of code you have written.
How you have tested for correctness (and short test results).
How you have tested to ensure that the join algorithm (as opposed to cross product) is being used (and short test results). The above queries mith be sufficient to show this (you'll need to figure out if they should be block nested loop or cross product). You might think if there is a case where the existing cross product should be used, and if so show that it is still being used.
A performance comparison between your block nested loop, the default cross product, and index join (and short test results). The provided database is probably too small to do this, you'll likely need to either create your own database, or insert additional records into this one.

Note that the team portion of Task 3 is much more heavily integrated with Tasks 1 and 2 than Tasks 1 and 2 are with each other (e.g., selecting when to use Block Nested Loop for an inequality join, or properly selecting Block Nested Loop based on cost estimates.) As such, if one of your team members expects to use late days for the individual portion, they should do task 3, as the team portion of Tasks 1 and 2 can be completed without worrying about Task 3.

Team Task: Integrate the Parts

The team portion, due after the individual portion, is to integrate the various individual pieces. You will need to ensure that the correct join algorithm is used for inequality (cross product, or block nested loop if you have task 3), and that cross product or block nested loop is chosen over index join and merge join when appropriate based on cost measures (and demonstrate that this works.)

Your team report should include:

Instructions for running your code (should also be in README.txt file in the code)
An overview of how you made appropriate choices of algorithms to use, and a discussion of changes made to the code (from what you turned in for your individual tasks).
How you have tested for correctness (and short test results). You may find this requires test cases beyond what you've done for the individual portions, but if you've thought about this in advance, you may find you've already created sufficient tests.
How you have tested to ensure that the correct join algorithm is being used (including choosing cross product/select or block nested loop if appropriate), and short test results. This will require additional queries, and probably more data, than is provided in the sample database.
A performance comparison between the the various algorithms (and short test results.) This will largely repeat what was done in Tasks 2 and 3, but should show how it is impacted by including inequality.

We have enabled a team submission feature in Gradescope, but how this works doesn't show up in the instructor view. Your report should include the Names and CAREER ID (email address, not the PUID number) of all teammates, and which one of you is turning in the code and full report. If the team feature seems to work (e.g., when you submit in gradescope, you can name multiple people as working on the project), then just turn in once as a group. If you don't see this option, then only one person should turn in the full report, others should just list the Names and CAREER ID of the team, and who is turning in the full report.

The team portion is due four days after the last individual portion is submitted. If one of your team members is late (and uses late days or is penalized for late work) on their individual portion, there will be no late penalty (or late days used on the team portion) until more than four days after the last individual portion is submitted.

The team code should be turned in by one team memberon lab machines using turnin -c cs448 -p team2 <submission folder> and the report through Gradescope using the team submission feature.