Assignment 3:  Categorization and Clustering
Due 6:00amEDT Tuesday, 19 April, 2016
Note:
Exercises with numbers are adapted from
Manning,
Raghavan, and Schütze; while you are encouraged
to look there for further information, please answer
the question as asked below (they may not be exactly
the same.)
Text Categorization
- 
Book Exercise 13.6:
Assume a situation where every document in the test collection has been assigned
exactly one class, and that a classifier also assigns exactly one class to each document.
This setup is called one-of classification (Section 14.5, page 306). Show that in one-of
classification (i) the total number of false positive decisions equals the total number
of false negative decisions and (ii) microaveraged F1 and accuracy are identical.
 
- 
In text categorization, it is generally the presence
of words/features that is viewed as important, rather than their
absence.
- 
Give an example where the absence of a word would be useful
in determining if a document belongs to a particular class.
 
- 
Does k-Nearest Neighbor give more weight to the presence
or absence of a feature?
 
 
- 
Give an example situation where a k-NN classifier would do poorly, but it should be possible to create a classifier that does well.
 
- 
Book
Exercise 15.5:
A strategy often used by purveyors of email spam is to follow the message
they wish to send (such as buying a cheap stock or whatever) with a paragraph of
text from another innocuous source (such as a news article). Why might this strategy
be effective? How might it be addressed by a text classifier?
 
Text Clustering
- 
Explain how a cluster containing documents about automobiles
night end up containing a document that never uses the word
automobile
, instead using the word car
.  Assume
a basic vector space model (e.g., no thesaurus that would
show these words as synonyms.)
 - 
Adapted from Book
Exercise 16.17:
Perform a K-means clustering for the documents in the
table below.  After how many
iterations does K-means converge?
| docID | document text | 
| 1 | hot chocolate cocoa beans | 
| 2 | cocoa ghana africa | 
| 3 | beans harvest ghana | 
| 4 | cocoa butter | 
| 5 | butter truffles | 
| 6 | sweet chocolate | 
| 7 | sweet sugar | 
| 8 | sugar cane brazil | 
| 9 | sweet sugar beet | 
| 10 | sweet cake icing | 
| 11 | cake black forest |