Please turn in code on lab machines through turnin (turnin -c cs448 -p project2 <submission folder>) and the report through Gradescope, as with Project 1. Please do not turn in class files and cruft created by your development environment, just turn in the source needed to compile and run your project and the tests you use, along with a README.txt showing how to run your tests. As with project 1, things should run on the lab machines (amber01 - amber30.cs.purdue.edu). Make sure that you mark the start/end of each part in Gradescope. Please typeset your report; handwritten figures/drawings accepted where needed.
This assignment has three options for the individual portion, and will be done in teams of two or three. As with project 1, the teammates should be from your PSO, but need not be the same as project 1. If a two person team, one person will do task 1, one will do task 2 - you don't need to do task 3. Please send your PSO instructor your team selection and which team member will be doing which task by 4pmEDT on Thursday, October 28.
We encourage you to discuss your individual portions with your teammates. Even though you are primarily responsible for one task, understanding what your teammate(s) are doing will make the team integration portion much easier (since you'll think more about what you need to do for integration when doing your individual task.) Furthermore, it will give you an opportunity to learn about parts of the query processor that you don't need to modify for your task. Finally, explaining what you are doing to your teammate(s) will help you solidify your understanding of the parts of the system you are working with.
We are providing a SimpleDB code base that has indexing, index join, and merge join included, although only index join will be used. While you can start with the default code base downloaded from Boston College, we recommend you start with the provided code base. This will make Tasks 2 and 3 a bit easier. It will make Task 1 more difficult, as you'll have to ensure that you don't try to use index join or merge join where they won't work (joining with an inequality). But it will make the team task (combining your code) considerably easier.
The code is available in the lab machines (amber01 - amber30.cs.purdue.edu) at /homes/cs448/SimpleDB_Proj2.zip
, or can be downloaded using https. It includes a pre-loaded database with an index; this works on the lab machines, but on some platforms you may need to delete the studentdb directory and recreate using simpleclient/network/CreateProj2DB.java.
We recommend you use the client/server (network) interface, as getting used to this will make things easier when we get into concurrency. To start up and run from the command line the following will get you started (and includes queries that should use the index as well as not using an index.)
javac -cp . simpledb/server/StartServer.java
javac -cp .:./simpleclient simpleclient/SimpleIJ.java
java -cp . simpledb/server/StartServer &
Running the server and client separate windows is a good idea, so you can keep the output message from the server separated from the output messages from the client
java -cp .:./simpleclient SimpleIJ
Connect>
jdbc:simpledb://localhost
SQL> select EId, StudentID, SectionID, Grade from ENROLL where SectionID = 43
SQL> select EId, StudentID, SectionID, Grade from ENROLL where StudentID = 4
SQL> select StudentID, Grade from ENROLL,SECTION where SectionID = SectID and Prof = 'einstein'
SQL> select StudentID, Grade from ENROLL,STUDENT where StudentID=SId and MajorID=20
Note that there is an index on ENROLL, and the 2nd and 4th queries do use that index - although there isn't anything in the provided code that tells you they do.
SimpleDB supports only equality in select; this task is to add inequality. This will require updating the parser to support <
as well as =
(you don't need to do >, <=, >=, or !=, although you may if you wish. Better still, do
, and you'll have the ability to do all of the operations while implementing only <), and carrying this through to support query processing. This will also necessitate changing cost estimation for selection. You should also ensure that join is handled correctly.OR
For full credit, you should use the provided code that has Index Join enabled. You will have to avoid using index join or merge join if you have a join criteria involving <, but make sure you do use it when the join is on equality.
Your individual report should include:
SimpleDB contains code for both index join and merge join, although the default is to break a join into a cross product followed by select. The existing code supports using either index join or merge join, but not both. Task 2 is to enable both joins, and intelligently select the proper join to use based on cost estimates. Index join is useful when there are a small number of items to look up, but because of the non-sequential access, can be very slow if there are a lot of items to look up. You should be able to determine when this is appropriate based on the cost estimation statistics.
You'll also need tests to demonstrate that you are making appropriate choices, and the performance differences that result. You may find the opportunity to improve the existing Heuristic Query planner as part of doing this task.
Your individual report should include:
The default join option in SimpleDB is cross product followed by select. Block nested loop join is much more efficient, and unlike merge join and hash join, can be used for any join. This task is to implement block nested loop join and provideappropriate cost estimates, as well as selecting when to use block nested loop.
Your individual report should include:
Note that the team portion of Task 3 is much more heavily integrated with Tasks 1 and 2 than Tasks 1 and 2 are with each other (e.g., selecting when to use Block Nested Loop for an inequality join, or properly selecting Block Nested Loop based on cost estimates.) As such, if one of your team members expects to use late days for the individual portion, they should do task 3, as the team portion of Tasks 1 and 2 can be completed without worrying about Task 3.
The team portion, due after the individual portion, is to integrate the various individual pieces. You will need to ensure that the correct join algorithm is used for inequality (cross product, or block nested loop if you have task 3), and that cross product or block nested loop is chosen over index join and merge join when appropriate based on cost measures (and demonstrate that this works.)
Your team report should include:
We have enabled a team submission
feature in Gradescope, but how this works doesn't show up in the instructor view. Your report should include the Names and CAREER ID (email address, not the PUID number) of all teammates, and which one of you is turning in the code and full report. If the team feature seems to work (e.g., when you submit in gradescope, you can name multiple people as working on the project), then just turn in once as a group. If you don't see this option, then only one person should turn in the full report, others should just list the Names and CAREER ID of the team, and who is turning in the full report.
The team portion is due four days after the last individual portion is submitted. If one of your team members is late (and uses late days or is penalized for late work) on their individual portion, there will be no late penalty (or late days used on the team portion) until more than four days after the last individual portion is submitted.
The team code should be turned in by one team memberon lab machines using turnin -c cs448 -p team2
<submission folder> and the report through Gradescope using the team submission feature.