CS 47300: Web Information Search and Management

Assignment 3: Retrieval Models: LSI, Probabilistic; Web Search

Due 11:59 pm EDT, Wednesday, 23 September 2020

Please turn in a PDF through Gradescope. If you didn't do Assignment 1, you'll need to access Gradescope through Brightspace the first time (this registers you for the course in Gradescope). Gradescope is pretty self-explanatory, but if you'd like, you can watch the ITaP video instructions. Make sure that you mark the start/end of each question in Gradescope; assignments will be graded based on what you mark as the start/end of each question. Please typeset your answers. (You won't be handwriting responses during a Google phone interview, so why would you do so for class?)

1. LSI

  1. Given a term-document matrix and after applying SVD on it, explain in your own words the semantics of the U, S and V matrices.
  2. Given the following U, S, and VT matrices, and the query term frequenc, show how to compute the similarity between the query and the documents. What is the resulting ranking of the documents? (A sketch of one possible computation appears after this list.)
         high      0 1                  D1  D2  D3
     U:  term      1 0    S:  2 0   VT:  1   1   0
         frequenc  2 1        0 1        0   1   1
    
    Bonus: Could this actually be a real set of matrices under LSI? Briefly explain why or why not. Hint: there is a way you can answer this mathematically.
  3. Explain why zeroing out the low singular values works when we are reducing the number of dimensions. (This should be a mathematical explanation.)
  4. Give some pros and cons (at least one of each) of using Latent Semantic Indexing for web search, other than computational cost/complexity/space. (In other words, assume that we have a cost- and space-efficient way of doing SVD and the appropriate matrix operations at web scale.) (Explaining each in one or two sentences is enough.)
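
For part 2, here is a minimal sketch of one common way to carry out the computation: fold the query into the latent space as q_hat = S^{-1} U^T q and rank the documents (the columns of VT) by cosine similarity with q_hat. The fold-in formula and the query vector below are assumptions for illustration; use whatever method was given in lecture.

    import numpy as np

    # Matrices from the question (rows of U: high, term, frequenc; columns of VT: D1, D2, D3).
    U = np.array([[0, 1],
                  [1, 0],
                  [2, 1]])
    S = np.diag([2, 1])
    VT = np.array([[1, 1, 0],
                   [0, 1, 1]])

    # Placeholder query vector of term counts over (high, term, frequenc);
    # replace it with your reading of the query in the question.
    q = np.array([0, 1, 1])

    # Fold the query into the latent space (assumed fold-in formula q_hat = S^-1 U^T q).
    q_hat = np.linalg.inv(S) @ U.T @ q

    # Each document is a column of VT; rank documents by cosine similarity with q_hat.
    docs = VT.T
    scores = docs @ q_hat / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat))
    for name, score in sorted(zip(["D1", "D2", "D3"], scores), key=lambda x: -x[1]):
        print(name, round(float(score), 3))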

2. Binary Independence Model

Given the following corpus:

D1: I love cats cats cats
D2: I love cats and I love dogs
D3: I love dogs
D4: I don't like cats
D5: I love puppies
D6: I love kittens

And the following queries:

Q1: cats
Q2: dogs
  1. First, start with the standard assumptions: a large corpus (N=1000), few relevant documents (S=3), and each term occurring only in the documents above. Because these numbers are small relative to N, you may treat N-n and N-S as approximately N. Also assume p_i=0.5; i.e., a query term has a 0.5 probability of occurring in a relevant document. Calculate the Retrieval Status Value (RSV) for each query against the documents above, and rank the documents. (A sketch of one form of the RSV computation appears after this list.)
  2. Now assume you are told that the relevant documents for Q1 are D1, D2, and D6, and the relevant documents for Q2 are D2, D3, and D5. Use these relevance judgments to estimate values for p_cat, r_cat, p_dog, and r_dog. Again, calculate the RSV between each query and the six documents and rank the results for each query.
  3. Using some known example queries and relevant documents to calculate the term relevance probabilities, as with the second part of this question, can cause a few interesting things to happen. Pick one difference between the results for Part 1 and Part 2, and briefly describe what has changed and why.
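
For reference, here is a minimal sketch of one common form of the RSV: the sum, over query terms that appear in the document, of log[ p_t(1 - r_t) / (r_t(1 - p_t)) ]. This sketch assumes r_t denotes the probability that term t occurs in a non-relevant document; adjust it if your notation from lecture differs, and note that the p and r values in the example call are placeholders, not the estimates the question asks for.

    import math

    def bim_rsv(query_terms, doc_terms, p, r):
        # p[t]: probability that term t occurs in a relevant document.
        # r[t]: probability that term t occurs in a non-relevant document (assumed meaning).
        score = 0.0
        for t in query_terms:
            if t in doc_terms and t in p and t in r:
                score += math.log(p[t] * (1 - r[t]) / (r[t] * (1 - p[t])))
        return score

    # Illustrative call with placeholder probabilities (not the values to use in your answer).
    print(bim_rsv({"cats"}, {"i", "love", "cats"}, p={"cats": 0.5}, r={"cats": 0.01}))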

3. OKAPI (BM25)

The BM25 model builds on the Binary Independence Model, but incorporates term frequency:

  tf_{t,d}: Frequency of the term t in the document d (f_i in the Croft et al. book, Section 7.2.2).
  tf_{t,q}: Frequency of the term t in the query (qf_i in the book).

The other change is to normalize by document length, so that documents with many terms or more frequent terms don't win simply because they are longer. There are also three parameters, k_1, k_3 (k_2 in the book), and b, that govern how much effect document term frequency, query term frequency, and length normalization have.
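
As an illustration only (the exact formula and parameter values to use are the ones in Croft et al. Section 7.2.2), here is a sketch of a common BM25 variant with an IDF-style term weight and the length normalizer K = k_1((1 - b) + b * dl / avgdl):

    import math

    def bm25_score(query, doc, N, df, avgdl, k1=1.2, k3=100.0, b=0.75):
        # query, doc: lists of terms; N: number of documents; df[t]: documents containing t;
        # avgdl: average document length. Parameter defaults here are illustrative only.
        dl = len(doc)
        K = k1 * ((1 - b) + b * dl / avgdl)  # document-length normalization
        score = 0.0
        for t in set(query):
            f = doc.count(t)     # tf_{t,d}
            qf = query.count(t)  # tf_{t,q}
            if f == 0 or t not in df:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5))  # BIM-style weight, no relevance info
            score += idf * ((k1 + 1) * f) / (K + f) * ((k3 + 1) * qf) / (k3 + qf)
        return score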

  1. The Okapi / BM25 model is based on the Binary Independence Model. What values of the k/b parameters would cause BM25 to most closely approximate the BIM model, and how would it differ?
  2. We've already shown that the BIM is very close to Inverse Document Frequency. With the k and b values set to 1, BM25 also incorporates term frequency. How does this differ from tf*idf with cosine similarity?

For the above questions, you should justify your answers mathematically. For example, for the second question you could show how a change in term frequency between two or three documents leads to different rankings in each model.
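
For the comparison in part 2, the following tf*idf cosine-similarity sketch (one common weighting, raw term frequency times log(N/df); this is an assumption, as many tf*idf variants exist) can be contrasted with the BM25 sketch above, for example by varying a term's frequency across two or three short documents and observing how the rankings diverge.

    import math
    from collections import Counter

    def tfidf_cosine(query, doc, N, df):
        # query, doc: lists of terms; N: number of documents; df[t]: documents containing t.
        def weights(terms):
            counts = Counter(terms)
            return {t: c * math.log(N / df[t]) for t, c in counts.items() if t in df}
        wq, wd = weights(query), weights(doc)
        dot = sum(w * wd.get(t, 0.0) for t, w in wq.items())
        norm = math.sqrt(sum(w * w for w in wq.values())) * math.sqrt(sum(w * w for w in wd.values()))
        return dot / norm if norm else 0.0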

4. Web search

  1. In class, we discussed three bad assumptions about relevance made by classic IR: context is ignored, individuals are ignored, and the corpus is predetermined. For each of these, give one idea for how we could incorporate it into the relevance calculation. (2-3 sentences each.)
  2. The goal of a search system is to fulfill the information needs of users, which they typically specify with a query. However, the query text can be ambiguous and thus may have more than one meaning. For example, consider the query java: it might mean the Java programming language or an island in Indonesia. What would be a good strategy for a search engine to retrieve documents in such a scenario? Try searching for java in Google and Bing and see how they perform! Do you see a winner? (Let's hope they don't change their retrieval models too frequently!)
  3. How does the size of the index impact search performance? Does a larger index guarantee better performance? Why or why not? (Remember, search performance refers to how relevant the returned documents are, not how fast the system runs.)

5. Web Crawling: Freshness

Assume we have two servers we want to crawl, A and B. We also have two classes of documents: F, which change frequently (30% of the documents on server A, 20% on server B), and the rest, S, which change slowly. Assume that documents are put into the system for crawling at the same rate that they are crawled, and that 50% of them are on each server.

To manage freshness, we have a separate queue for F (frequently changing) and S (slowly changing) documents. The crawler alternates between queues, taking one document from F, then one from S, and so on. (If a queue is empty, the next document is taken from the other queue.) Queues are FIFO.

Determine what happens in such a system, particularly:

  1. Do the F documents get crawled more often than S documents?
  2. Does one of the queues fill up?
  3. Assume the mean change frequency of the crawled web pages on server A is once every 2 days and those pages are crawled once every 4 days, while the mean change frequency of the crawled web pages on server B is once every 10 days and those pages are crawled once every 10 days. What is the average age of the pages crawled from A and from B? (One way to set up this computation is sketched below.)
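
A minimal sketch of one way to set up part 3, assuming the Poisson change model commonly used for crawler freshness, where the expected age t days after the last crawl is Age(lambda, t) = the integral from 0 to t of lambda * e^(-lambda*x) * (t - x) dx (lambda is the mean change rate). Check this against the definition given in lecture before using it.

    import math

    def expected_age(change_rate, days_since_crawl, steps=100_000):
        # Numerically integrate Age(lambda, t) = integral_0^t lambda * e^(-lambda*x) * (t - x) dx.
        lam, t = change_rate, days_since_crawl
        dx = t / steps
        return sum(lam * math.exp(-lam * i * dx) * (t - i * dx) * dx for i in range(steps))

    # Illustrative input: a page that changes on average once every 2 days,
    # evaluated 4 days after its last crawl (server A's numbers above).
    print(round(expected_age(1 / 2, 4), 3))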
