CS 47300: Web Information Search and Management

Assignment 2: Retrieval Models

Due 11:59pm EDT, Friday, 11 September 2020

Please turn in a PDF through Gradescope. If you didn't do Assignment 1, you'll need to access Gradescope through Brightspace the first time (this registers you for the course in Gradescope). Gradescope is pretty self-explanatory, but if you'd like, you can watch the ITaP video instructions. Make sure that you mark the start/end of each question in Gradescope; assignments will be graded based on what you mark as the start/end of each question. Please typeset your answers. (You won't be writing handwritten responses during a Google phone interview, so why would you do so for class?)

For this assignment, use the following (short) document corpus. Note that we've already done stemming and stopword removal.

D1: higher term frequenc longer document
D2: repeat term give higher term frequenc
D3: term frequenc import term document
D4: invers document frequenc import term corpu

1. Indexing

  1. Give a vector space model representation of documents D1 and D2. For ease of search, the vectors should be alphabetized.
  2. Build an inverted list index for the corpus. Again, this should be alphabetized. For the weights in the index, assume you are using this index for retrieval with a TFIDF retrieval model, with term frequency (TF) weighted as log2(tf_kd) + 1 and IDF as log2(N/n_k), where tf_kd is the term frequency (the number of occurrences of term k in document d), N is the number of documents in the corpus, and n_k is the number of documents containing term k. (Treat log2(0) as -1, so that a term that does not appear in a document gets a weight of 0. A short Python sketch illustrating these computations appears after this list.)
  3. Compute the cosine similarity between each document and the query invers
  4. Compute the cosine similarity between each document and the query repeat higher cours
  5. Why would we want the index alphabetized? What becomes easier to do with the alphabetized index?
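
If you want to sanity-check your hand computations for parts 2-4, here is a minimal Python sketch (my own illustration; the variable names and structure are assumptions, not a required format, and weighting the query with the same TFIDF scheme is one common choice rather than the only one). It builds the weighted index from part 2 and scores the query from part 3; the query from part 4 works the same way.

    # Minimal sketch: TFIDF-weighted inverted index and cosine scoring for the
    # toy corpus, using TF = log2(tf_kd) + 1 and IDF = log2(N / n_k) from part 2.
    import math
    from collections import Counter

    docs = {
        "D1": "higher term frequenc longer document".split(),
        "D2": "repeat term give higher term frequenc".split(),
        "D3": "term frequenc import term document".split(),
        "D4": "invers document frequenc import term corpu".split(),
    }
    N = len(docs)
    # n_k: number of documents containing term k
    df = Counter(t for terms in docs.values() for t in set(terms))

    def weight(tf, n_k):
        # log2(0) is treated as -1, so a term with tf = 0 gets weight 0
        return (math.log2(tf) + 1) * math.log2(N / n_k) if tf > 0 else 0.0

    # inverted list index: term -> [(doc, weight)], alphabetized by term (part 2)
    index = {t: [(d, weight(terms.count(t), df[t]))
                 for d, terms in docs.items() if t in terms]
             for t in sorted(df)}

    def cosine(query, d):
        tf_d = Counter(docs[d])
        dvec = {t: weight(tf_d[t], df[t]) for t in tf_d}
        qtf = Counter(t for t in query if t in df)   # terms outside the corpus score 0
        qvec = {t: weight(qtf[t], df[t]) for t in qtf}
        dot = sum(qvec[t] * dvec.get(t, 0.0) for t in qvec)
        norm = (math.sqrt(sum(v * v for v in dvec.values()))
                * math.sqrt(sum(v * v for v in qvec.values())))
        return dot / norm if norm else 0.0

    for d in sorted(docs):
        print(d, round(cosine("invers".split(), d), 3))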

2. Stemming

Stemming in the above corpus can be both good and bad. Assume document D4 was originally "inverse document frequency is importance of a term in a corpus."

  1. Give an example of a query that you hope would return D4, and would do so if the query and document were stemmed, but would not return D4 if the query and document were not stemmed.
  2. Give an example of a query that you would not want to return D4, but because of stemming, would likely return D4.
  3. Prove that stemming may help, and will never hurt, recall. This should be a mathematical proof.
  4. Would you be able to prove that stemming always helps or hurts the F1 score? Either give a proof, or a counterexample showing that such a statement can't be made.

You can find a discussion and implementations of the Porter stemmer here. Unfortunately, the online version I've used in the past doesn't seem to be working, and the online versions I can find seem to hide cryptominers. For this question, you can provide a description of why you think something would stem in a particular way (e.g., a reasonable rule that would cause the behavior you expect). Your answers to parts 1 and 2 should not depend on the same words being stemmed differently because of different stemming rules; rather, they should be different queries that have unexpected outcomes under similar stemming rules. Your answers to parts 3 and 4 should not depend on any particular stemming algorithm or rules.
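
If you'd like to experiment locally instead of using an online demo, here is a minimal sketch assuming NLTK is installed; it should reproduce (roughly) the stems used in the corpus above, though exact outputs depend on the stemmer version.

    # A small check of Porter stemming using NLTK (assumes: pip install nltk).
    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()
    original_d4 = "inverse document frequency is importance of a term in a corpus"
    print([stemmer.stem(w) for w in original_d4.split()])
    # roughly: invers, document, frequenc, is, import, of, a, term, in, a, corpu
    # (stopword removal would then drop "is", "of", "a", "in")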

3. Boolean Retrieval

As noted in class, Boolean retrieval refers to the idea that a document either matches the query or not; it is not limited to Boolean expressions. For example, a Boolean retrieval model may support proximity search, e.g., (term frequenc /3) would mean that term and frequenc occur within 3 words of each other.
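
As a concrete (hypothetical) illustration of how an index could support such an operator, the sketch below keeps term positions in the postings and checks the /k condition from them. It is one reasonable design, not the only one, and may be a useful reference for the questions that follow.

    # Positional inverted index sketch: term -> {doc -> [positions]}, plus a
    # proximity check for queries of the form (a b /k).
    from collections import defaultdict

    docs = {
        "D1": "higher term frequenc longer document".split(),
        "D2": "repeat term give higher term frequenc".split(),
        "D3": "term frequenc import term document".split(),
        "D4": "invers document frequenc import term corpu".split(),
    }

    pos_index = defaultdict(lambda: defaultdict(list))
    for d, terms in docs.items():
        for i, t in enumerate(terms):
            pos_index[t][d].append(i)

    def within(a, b, k):
        """Documents in which terms a and b occur within k words of each other."""
        hits = []
        for d in set(pos_index[a]) & set(pos_index[b]):
            if any(abs(i - j) <= k
                   for i in pos_index[a][d]
                   for j in pos_index[b][d]):
                hits.append(d)
        return sorted(hits)

    # e.g. within("term", "frequenc", 3) evaluates (term frequenc /3)
    print(within("term", "frequenc", 3))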

  1. Which documents in the above corpus would satisfy the query (term frequenc /3)?
  2. Can you answer this query using only the inverted list index? Explain how, or why you can't.
  3. Is there a better way to answer this query than exhaustively searching the entire corpus? Keep your answer brief - one to two paragraphs is plenty to give the idea.

4. TF/IDF

This question uses the documents given at the beginning, but you may or may not be able to reuse the index you created in Question 1; you'll want to think about why.

Given the query inverse document frequency:

  1. Compute the TFIDF scores between the above query and documents D1 and D4, using natural (raw) term frequency counts, no inverse document frequency weighting, and cosine similarity.
  2. Repeat the above TFIDF scores using logarithmic term frequency (1 + log(tf_kd)) and inverse document frequency (log(N/n_k)). (A short sketch contrasting these two weighting schemes appears after this list.)
  3. Give the ranking for the above query over documents D1, D2, D3, and D4. Show your calculations. You may be able to do this without calculating the full TFIDF scores; if so, please describe how you did so.
  4. What is a problem with using natural (raw) frequency? Explain with an example.
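
Here is a minimal sketch contrasting the two weighting schemes from parts 1 and 2 (my own illustration; it reuses the same toy corpus as the earlier sketch, and base-2 logarithms are an assumption since any base gives the same ranking).

    # Raw-TF (no IDF) vs. log-TF * IDF weighting, scored with cosine similarity.
    import math
    from collections import Counter

    docs = {
        "D1": "higher term frequenc longer document".split(),
        "D2": "repeat term give higher term frequenc".split(),
        "D3": "term frequenc import term document".split(),
        "D4": "invers document frequenc import term corpu".split(),
    }
    N = len(docs)
    df = Counter(t for terms in docs.values() for t in set(terms))

    def vector(terms, scheme):
        tf = Counter(terms)
        if scheme == "raw":          # part 1: raw term frequency, no IDF
            return dict(tf)
        # part 2: (1 + log tf) * log(N / n_k); terms absent from the corpus get 0
        return {t: (1 + math.log2(tf[t])) * math.log2(N / df[t])
                for t in tf if df.get(t, 0) > 0}

    def cosine(u, v):
        dot = sum(u[t] * v.get(t, 0.0) for t in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    query = "inverse document frequency".split()
    for scheme in ("raw", "logtf_idf"):
        q = vector(query, scheme)
        scores = {d: cosine(q, vector(terms, scheme)) for d, terms in docs.items()}
        print(scheme, {d: round(s, 3) for d, s in scores.items()})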

5. Evaluation

Consider the evaluation metrics from class: Precision, Recall, F1 Score, MAP (Mean Average Precision), and MRR (Mean Reciprocal Rank).

  1. Is it possible to build a search system which always achieves high (perfect or near-perfect) recall? Explain why or why not.
  2. In what situation will a system's MAP performance be equal to its MRR performance?
  3. What kind of metric is useful for evaluating a question-answering system? Explain your answer.
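
If it helps to make the definitions concrete, here is a minimal sketch of these metrics for a single query (the rankings and relevance sets in the example calls are made up, not taken from the corpus above).

    # Precision, recall, F1, average precision, and reciprocal rank for one query.
    def precision_recall_f1(retrieved, relevant):
        tp = len(set(retrieved) & set(relevant))
        p = tp / len(retrieved) if retrieved else 0.0
        r = tp / len(relevant) if relevant else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    def average_precision(ranking, relevant):
        hits, total = 0, 0.0
        for i, d in enumerate(ranking, start=1):
            if d in relevant:
                hits += 1
                total += hits / i          # precision at each relevant hit
        return total / len(relevant) if relevant else 0.0

    def reciprocal_rank(ranking, relevant):
        for i, d in enumerate(ranking, start=1):
            if d in relevant:
                return 1 / i
        return 0.0

    # MAP and MRR are these per-query values averaged over a set of queries.
    print(precision_recall_f1(["D4", "D1"], {"D4"}))
    print(average_precision(["D4", "D1", "D3"], {"D4", "D3"}))
    print(reciprocal_rank(["D1", "D4"], {"D4"}))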
