CS 44800: Introduction to Relational Database Management Systems

Assignment 3: Indexing

Due 11:59pmEDT Monday, 25 October 2021

Update 10/19 12:23 - 1: Corrected error in I2, clarified dealing with non-unique search keys.

Update 10/22 13:45 - 2: Corrected error in I1.

Please turn in a PDF through Gradescope. Make sure that you mark the start/end of each question in Gradescope. Assignments will be graded based on what you mark as the start/end of each question. Please typeset your answers where feasible, handwritten answers will be acceptable for drawings (e.g., trees).

1. B+ Tree

Given the following B+ Tree, an index by residence hall, and database file (consisting of students, their classes, and their residence hall):
B+ Tree, contact instructor if difficulty viewing
(Note that some of the data blocks are "off the page", but won't be needed for what you are doing.)

Note that search keys are not unique, which is a little difficult for B+ trees. One solution (discussed in Section 14.3.5 in the book) is to make each search key unique by appending a value (e.g., the primary key of the record, or a counter as shown here). All searches then become range searches (e.g., search for EHRT by looking for everything between EHRT and EHRT.999999).

Explain the steps involved in finding all tuples with the search key value PKRF: What blocks are read?
Show or describe the blocks read and the changes made to the tree and file by the following delete query: DELETE FROM students WHERE name = 'John Smith';
Note that this isn't deleting by search key - you'll have to think about how the index updates work in this case.
Show or describe the blocks read and the changes made to the tree and file by the following insert query: INSERT INTO students VALUES(314,'Jane Doe','CS44800','DUHM');

2. Extensible Hashing

Repeat Question 1, but for the hash index below:
Extensible Hash Table, contact instructor if difficulty viewing

3. Review question: Database Design

Is the relation in questions 1 and 2 a good design? Briefly explain why or why not.

4. B+ Tree

Create a B+ tree from the following sequence of key insertions and deletions. Assume a fan-out of 3. Show the final tree. Note that there are many programs online that will do this for you. Using one of these won't really help you understand what is going on (and will be poor preparation for an exam.) Furthermore, some have bugs and won't give you the correct result.

Note that the search key is not a key, it may occur multiple times. Deleting (as with "delete 18" and "delete 4") should delete ALL occurrences, as it does with SQL.

insert 28
insert 18
insert 2
insert 4
insert 18
insert 9
delete 18
insert 21
insert 24
insert 20
insert 19
insert 6
insert 5
delete 4

5. Extensible Hashing

Repeat question 4, but for extensible hashing. Assume a bucket size of 4.

6. Index Choices

Consider the following relation Employee (id: integer, name: String, salary: integer) where we have a hash index on id and a B+ tree index on salary. Now, consider the following three queries:

Select name From Employee Where salary >= 5000
Select name From Employee Where id = 10
Select name From Employee Where id = 10 AND salary >= 5000

For each of the above queries, decide which index/indexes you will use and explain why. Would any of these possible choices run into blocking operations that might impact pipelined evaluation, and if so, what would the potential impact be?

7. Index Scaling

Let us consider a relation R (s, t, u) containing 1 billion (10^9) records. Each page of the relation holds 10 records. R is organized as a heap file with unclustered indexes and the records in R are randomly ordered. Assume that atribute s is a candidate key for R, with value lying in range 0 to 999,999,999. Now, for each of the following queries, name the approach that would most likely require the fewest I/Os for processing the query. Justify your answer by approximately calculating the cost (in terms of disk accesses) for each approach to answer each query.

The approaches to consider are as follows:

Scanning through the whole heap file for R.
Using a B+ tree index on attribute R.s
Using a hash index on attribute R.s

The queries are:

Find all R tuples
Find all R tuples where R.s < 100
Find all R tuples where R.s = 100
Find all R tuples where R.s >100 and R.s<150

You'll need to start by estimating the size of the indexes. Assume you can fit 30 key/pointer pairs in a disk block (either a B+ tree index block, or a hash index block.) Additionally, assume that after accesing the data entry it takes one more disk access to get the actual record.