CS 54701: Information Retrieval

Assignment 3: Categorization and Clustering

Due 6:00amEDT Tuesday, 19 April, 2016

Note: Exercises with numbers are adapted from Manning, Raghavan, and Schütze; while you are encouraged to look there for further information, please answer the question as asked below (they may not be exactly the same.)

Text Categorization

Book Exercise 13.6:
Assume a situation where every document in the test collection has been assigned exactly one class, and that a classifier also assigns exactly one class to each document. This setup is called one-of classification (Section 14.5, page 306). Show that in one-of classification (i) the total number of false positive decisions equals the total number of false negative decisions and (ii) microaveraged F1 and accuracy are identical.
In text categorization, it is generally the presence of words/features that is viewed as important, rather than their absence.
1. Give an example where the absence of a word would be useful in determining if a document belongs to a particular class.
2. Does k-Nearest Neighbor give more weight to the presence or absence of a feature?
Give an example situation where a k-NN classifier would do poorly, but it should be possible to create a classifier that does well.
Book Exercise 15.5:
A strategy often used by purveyors of email spam is to follow the message they wish to send (such as buying a cheap stock or whatever) with a paragraph of text from another innocuous source (such as a news article). Why might this strategy be effective? How might it be addressed by a text classifier?

Text Clustering

Explain how a cluster containing documents about automobiles night end up containing a document that never uses the word automobile, instead using the word car. Assume a basic vector space model (e.g., no thesaurus that would show these words as synonyms.)

Adapted from Book Exercise 16.17:
Perform a K-means clustering for the documents in the table below. After how many iterations does K-means converge?

docID	document text
1	hot chocolate cocoa beans
2	cocoa ghana africa
3	beans harvest ghana
4	cocoa butter
5	butter truffles
6	sweet chocolate
7	sweet sugar
8	sugar cane brazil
9	sweet sugar beet
10	sweet cake icing
11	cake black forest