Please turn in code on lab machines through turnin (turnin -c cs448 -p project3 <submission folder>) and the report through Gradescope, as with Project 2. Please do not turn in class files or other cruft created by your development environment; turn in only the source needed to compile and run your project and the tests you use, along with a README.txt showing how to run your tests. As with Project 2, things should run on the lab machines (amber01 - amber30.cs.purdue.edu). Make sure that you mark the start/end of each part in Gradescope. Please typeset your report; handwritten figures/drawings are accepted where needed.
This assignment has three options for the individual portion and will be done in teams of two or three. As with Project 1, your teammates should be from your PSO, but need not be the same as in Projects 1 or 2 (although they can be). The tasks are reasonably independent; you can each choose to do one of Tasks 1, 2, or 3 (as long as you each do something different). Note that Tasks 2 and 3 have some overlap, that Task 2 isn't very interesting if Task 1 hasn't been done (think about why), and that Task 3 is a bit harder to do nicely if Task 2 isn't done, so it is probably best to just do Tasks 1 and 2 if you have a two-person team. Please send your PSO instructor your team selection and which team member will be doing which task by 4pm EDT on Wednesday, December 1.
We encourage you to discuss your individual portions with your teammates. Even though you are primarily responsible for one task, understanding what your teammate(s) are doing will make the team integration portion much easier (since you'll think more about what you need to do for integration when doing your individual task.) Furthermore, it will give you an opportunity to learn about parts of the query processor that you don't need to modify for your task. Finally, explaining what you are doing to your teammate(s) will help you solidify your understanding of the parts of the system you are working with.
We recommend you start with the default SimpleDB 3.4 code base, which does implement basic logging and recovery.
The code is available on the lab machines (amber01 - amber30.cs.purdue.edu) at /homes/cs448/SimpleDB.zip, or can be downloaded using https.
SimpleDB already includes support for checkpointing; however, it only does a checkpoint after recovering and before accepting any new queries. This is a quiescent checkpoint: all transactions must have completed, and nothing may be running, when the checkpoint is taken. SimpleDB also supports only undo logging, which means that to commit a transaction, all pages modified by that transaction must be flushed to disk, and then the commit record flushed to the log.
The current implementation of SimpleDB requires that all pages modified by a transaction be written to disk before the transaction can commit (and write the commit record to the log). As a result, SimpleDB implements only undo logging, since all committed (or aborted) transactions are already reflected in the data on disk.
Task 1 is to implement Undo/Redo logging, so that you don't need to write all modified pages before a transaction commits. Undo is already implemented; all you need to do is:
Add a redo function that takes a log record and, if the transaction is in the completed transaction list, writes the new value in the appropriate place. This is almost identical to the undo function, except that it uses the new value.
Remove the forced write of pages when a transaction commits, so that you see the performance improvement from having an undo/redo log.
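As a rough illustration of how undo and redo mirror each other, here is a minimal sketch. It does not use SimpleDB's classes: UpdateRecord, UndoRedoSketch, and the map standing in for the pages on disk are all hypothetical, and the sketch assumes the completed transaction list contains exactly the committed transactions.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical update record (not SimpleDB's LogRecord): it carries both
// the old value (needed for undo) and the new value (needed for redo).
final class UpdateRecord {
    final int txNum;
    final String location;
    final int oldValue;
    final int newValue;

    UpdateRecord(int txNum, String location, int oldValue, int newValue) {
        this.txNum = txNum;
        this.location = location;
        this.oldValue = oldValue;
        this.newValue = newValue;
    }
}

final class UndoRedoSketch {
    // Undo: restore the old value if the transaction did NOT commit.
    static void undo(UpdateRecord rec, Set<Integer> committed, Map<String, Integer> store) {
        if (!committed.contains(rec.txNum)) {
            store.put(rec.location, rec.oldValue);
        }
    }

    // Redo: the mirror image. Write the new value if the transaction DID commit.
    static void redo(UpdateRecord rec, Set<Integer> committed, Map<String, Integer> store) {
        if (committed.contains(rec.txNum)) {
            store.put(rec.location, rec.newValue);
        }
    }

    // Recovery: a backward pass of undos followed by a forward pass of redos.
    static void recover(List<UpdateRecord> log, Set<Integer> committed, Map<String, Integer> store) {
        for (int i = log.size() - 1; i >= 0; i--) {
            undo(log.get(i), committed, store);
        }
        for (UpdateRecord rec : log) {
            redo(rec, committed, store);
        }
    }
}
```

In this sketch, a committed transaction whose page never reached disk has its new value reapplied by the forward pass, which is what makes it safe to drop the forced write at commit.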
In class, we discussed a non-quiescent checkpoint, where there is a start checkpoint log record that lists all transactions running at the time of the start checkpoint, and an end checkpoint log record once all pages modified at the start have been written out. Task 2 is to implement this capability. This will require:
Creating an end checkpoint log record, and writing it to the log.
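The start/end structure described above can be sketched as follows. This is a single-threaded illustration only: FuzzyCheckpointSketch, its fields, and the textual record format are hypothetical stand-ins, not SimpleDB's log or buffer classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of a non-quiescent checkpoint.
final class FuzzyCheckpointSketch {
    final List<String> log = new ArrayList<>();      // stand-in for the log file
    final Set<Integer> activeTxs = new TreeSet<>();  // transactions currently running
    final Set<String> dirtyPages = new TreeSet<>();  // pages modified but not yet on disk
    final List<String> flushed = new ArrayList<>();  // pages written out, in order

    // Start: log the running transactions and remember which pages are dirty now.
    Set<String> startCheckpoint() {
        log.add("<START CKPT " + activeTxs + ">");
        return new TreeSet<>(dirtyPages);
    }

    // End: write out the pages that were dirty at the start, then log the end record.
    // Pages dirtied after the start are deliberately left alone.
    void endCheckpoint(Set<String> pagesAtStart) {
        for (String page : pagesAtStart) {
            if (dirtyPages.remove(page)) {
                flushed.add(page);
            }
        }
        log.add("<END CKPT>");
    }
}
```

Transactions may keep running, and keep dirtying other pages, between the two calls; only the pages captured at the start must reach disk before the end record is written.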
The undo logging needs to go back to the previous checkpoint to undo any transactions that may still be running. If the database stays up a long time, this could be expensive. A better approach is to do checkpoints either periodically, or on demand. Task 3 is to implement one of these. You can implement a timer that causes a checkpoint to occur (the buffer manager has an example of a timer; if a transaction waits too long for a buffer to be available, it times out). Alternatively, you can implement a new "checkpoint" command that will cause a checkpoint to occur.
The current checkpoint is part of the recovery process; when it recovers it undoes all in-process transactions, flushes all buffers modified as part of the undo, then writes a checkpoint record. You'll need to find a different way to flush all modified buffers. The page and buffer manager currently has code to flush a modified page when it is replaced, so you'll be able to use that as an example. You could either keep a list of all modified buffers, or you could go through all buffers and see if they are modified or not (since either is an in-memory operation, it should be fast.)
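The second option, going through all buffers and flushing the modified ones, might look roughly like the sketch below. SketchBuffer and SketchBufferPool are hypothetical stand-ins, not SimpleDB's buffer classes, and the "write" is simulated by a counter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical buffer that tracks whether its page has unwritten changes.
final class SketchBuffer {
    final String block;
    private boolean modified = false;
    int writes = 0; // counts simulated disk writes

    SketchBuffer(String block) {
        this.block = block;
    }

    void setModified() {
        modified = true;
    }

    boolean isModified() {
        return modified;
    }

    // Simulated flush: "write" the page and clear the modified flag.
    void flush() {
        if (modified) {
            writes++;
            modified = false;
        }
    }
}

// Hypothetical buffer pool with a scan-and-flush operation for checkpoints.
final class SketchBufferPool {
    private final List<SketchBuffer> buffers = new ArrayList<>();

    synchronized SketchBuffer add(String block) {
        SketchBuffer b = new SketchBuffer(block);
        buffers.add(b);
        return b;
    }

    // Scan every buffer in memory and flush the modified ones.
    // Returns how many buffers were flushed.
    synchronized int flushAllModified() {
        int flushed = 0;
        for (SketchBuffer b : buffers) {
            if (b.isModified()) {
                b.flush();
                flushed++;
            }
        }
        return flushed;
    }
}
```

Scanning the whole pool avoids maintaining a separate list of modified buffers, at the cost of touching clean buffers too; with a small in-memory pool either choice should be cheap.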
Perhaps the hardest part of this task is that a checkpoint can only happen if no transactions are running, unless Task 2 (Fuzzy Checkpoint) is done. You can get nearly full credit if you just assume that no transactions are running (in other words, it is okay if your code silently results in a corrupt database when other transactions are running and the database crashes/recovers), provided you note in your report that this could occur with your code. For full credit you should deal with this possibility, either by implementing a fuzzy checkpoint or by waiting for other transactions to finish before doing the checkpoint.
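One possible shape for the "wait for other transactions to finish" option is a small gate that counts running transactions and holds back new ones while a checkpoint is pending. This is only a sketch under assumptions: QuiescentGate and its method names are hypothetical, and where such hooks would be called from inside SimpleDB is left open.

```java
// Hypothetical gate that lets a checkpoint run only when no transaction is active.
final class QuiescentGate {
    private int active = 0;                // number of running transactions
    private boolean checkpointing = false; // true while a checkpoint is pending or running

    // Wait on this object's monitor, converting interruption to an unchecked exception.
    private void waitQuietly() {
        try {
            wait();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    // New transactions are held back while a checkpoint is pending.
    synchronized void beginTransaction() {
        while (checkpointing) {
            waitQuietly();
        }
        active++;
    }

    // Called on commit or rollback; wakes a waiting checkpoint.
    synchronized void endTransaction() {
        active--;
        notifyAll();
    }

    // Block new transactions, wait for running ones to finish, then run the checkpoint.
    synchronized void checkpoint(Runnable action) {
        while (checkpointing) {
            waitQuietly();
        }
        checkpointing = true;
        try {
            while (active > 0) {
                waitQuietly();
            }
            action.run();
        } finally {
            checkpointing = false;
            notifyAll();
        }
    }

    synchronized int activeCount() {
        return active;
    }
}
```

Holding back new transactions keeps a steady stream of short transactions from starving the checkpoint, but a single long-running transaction will still delay it.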
The team portion is simply to put your pieces together and make them run together. This may involve turning off some features, such as waiting for other transactions to stop before checkpointing if you do Task 3. You may also find that you have multiple tasks making changes to the same modules, so integration will be easiest if you communicate well from the beginning.
Your team report should include:
We have enabled a team submission feature in Gradescope, but how this works doesn't show up in the instructor view. Your report should include the Names and CAREER ID (email address, not the PUID number) of all teammates, and which one of you is turning in the code and full report. If the team feature seems to work (e.g., when you submit in Gradescope, you can name multiple people as working on the project), then just turn in once as a group. If you don't see this option, then only one person should turn in the full report; others should just list the Names and CAREER ID of the team, and who is turning in the full report.
The team portion is due four days after the last individual portion is submitted. If one of your team members is late (and uses late days or is penalized for late work) on their individual portion, there will be no late penalty (or late days used on the team portion) until more than four days after the last individual portion is submitted.
The team code should be turned in by one team member on lab machines using turnin -c cs448 -p team3 <submission folder> and the report through Gradescope using the team submission feature.