CS 54701: Information Retrieval

Assignment 3: Categorization and Clustering

Due 6:00amEDT Tuesday, 19 April, 2016

Note: Exercises with numbers are adapted from Manning, Raghavan, and Schütze; while you are encouraged to look there for further information, please answer the question as asked below (they may not be exactly the same.)

Text Categorization

  1. Book Exercise 13.6:
    Assume a situation where every document in the test collection has been assigned exactly one class, and that a classifier also assigns exactly one class to each document. This setup is called one-of classification (Section 14.5, page 306). Show that in one-of classification (i) the total number of false positive decisions equals the total number of false negative decisions and (ii) microaveraged F1 and accuracy are identical.
  2. In text categorization, it is generally the presence of words/features that is viewed as important, rather than their absence.
    1. Give an example where the absence of a word would be useful in determining if a document belongs to a particular class.
    2. Does k-Nearest Neighbor give more weight to the presence or absence of a feature?
  3. Give an example situation where a k-NN classifier would do poorly, but it should be possible to create a classifier that does well.
  4. Book Exercise 15.5:
    A strategy often used by purveyors of email spam is to follow the message they wish to send (such as buying a cheap stock or whatever) with a paragraph of text from another innocuous source (such as a news article). Why might this strategy be effective? How might it be addressed by a text classifier?

Text Clustering

  1. Explain how a cluster containing documents about automobiles night end up containing a document that never uses the word automobile, instead using the word car. Assume a basic vector space model (e.g., no thesaurus that would show these words as synonyms.)
  2. Adapted from Book Exercise 16.17:
    Perform a K-means clustering for the documents in the table below. After how many iterations does K-means converge?
    docIDdocument text
    1hot chocolate cocoa beans
    2cocoa ghana africa
    3beans harvest ghana
    4cocoa butter
    5butter truffles
    6sweet chocolate
    7sweet sugar
    8sugar cane brazil
    9sweet sugar beet
    10sweet cake icing
    11cake black forest

Valid XHTML 1.1