Assignment 3: Categorization and Clustering
Due 6:00am EDT Tuesday, 19 April, 2016
Note: Exercises with numbers are adapted from Manning, Raghavan, and Schütze. While you are encouraged
to look there for further information, please answer
the questions as asked below (they may not be exactly
the same).
Text Categorization
-
Book Exercise 13.6:
Assume a situation where every document in the test collection has been assigned
exactly one class, and that a classifier also assigns exactly one class to each document.
This setup is called one-of classification (Section 14.5, page 306). Show that in one-of
classification (i) the total number of false positive decisions equals the total number
of false negative decisions and (ii) microaveraged F1 and accuracy are identical.
-
In text categorization, it is generally the presence
of words/features that is viewed as important, rather than their
absence.
-
Give an example where the absence of a word would be useful
in determining if a document belongs to a particular class.
-
Does k-Nearest Neighbor give more weight to the presence
or absence of a feature?
-
Give an example of a situation where a k-NN classifier would do poorly, but where it should be possible to create a classifier that does well.
-
Book Exercise 15.5:
A strategy often used by purveyors of email spam is to follow the message
they wish to send (such as buying a cheap stock or whatever) with a paragraph of
text from another innocuous source (such as a news article). Why might this strategy
be effective? How might it be addressed by a text classifier?
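For Book Exercise 13.6 above, it can help to check the two claims numerically on a small example before attempting the proof. The sketch below is an illustration only, not a solution: the class labels and the gold/predicted lists are made-up toy data, the function name is hypothetical, and accuracy is assumed to mean the fraction of documents assigned their correct class.

```python
def micro_f1_and_accuracy(gold, predicted):
    """Pool per-class decisions and return (micro F1, accuracy, FP, FN).

    Assumes one-of classification: exactly one gold class and one
    predicted class per document.
    """
    classes = set(gold) | set(predicted)
    tp = fp = fn = 0
    # Treat each class as its own two-way decision and pool the counts.
    for c in classes:
        for g, p in zip(gold, predicted):
            if p == c and g == c:
                tp += 1
            elif p == c and g != c:
                fp += 1
            elif p != c and g == c:
                fn += 1
    if tp == 0:
        f1 = 0.0
    else:
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
    accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
    return f1, accuracy, fp, fn


# Toy data (hypothetical): six documents, three classes.
gold = ["a", "a", "b", "c", "c", "c"]
predicted = ["a", "b", "b", "c", "a", "c"]

f1, accuracy, fp, fn = micro_f1_and_accuracy(gold, predicted)
print("false positives:", fp, "false negatives:", fn)
print("micro F1:", f1, "accuracy:", accuracy)
```

Trying other label lists shows whether the pattern persists; the exercise still asks for an argument that covers every case.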
Text Clustering
-
Explain how a cluster containing documents about automobiles
might end up containing a document that never uses the word
automobile, instead using the word car. Assume
a basic vector space model (e.g., no thesaurus that would
show these words as synonyms).
-
Adapted from Book Exercise 16.17:
Perform a K-means clustering for the documents in the
table below. After how many
iterations does K-means converge?
docID | document text |
1 | hot chocolate cocoa beans |
2 | cocoa ghana africa |
3 | beans harvest ghana |
4 | cocoa butter |
5 | butter truffles |
6 | sweet chocolate |
7 | sweet sugar |
8 | sugar cane brazil |
9 | sweet sugar beet |
10 | sweet cake icing |
11 | cake black forest |
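One way to check a hand computation for the K-means exercise is a small script like the one below. It is a sketch under several assumptions that the exercise text does not fix: K = 2, documents 6 and 7 as the initial seeds, length-normalized binary term vectors, squared Euclidean distance, ties broken in favor of the lower-numbered cluster, and an iteration count that includes the final pass in which no assignment changes. All function names are hypothetical, and a different choice for any of these can change the answer.

```python
import math

DOCS = {
    1: "hot chocolate cocoa beans",
    2: "cocoa ghana africa",
    3: "beans harvest ghana",
    4: "cocoa butter",
    5: "butter truffles",
    6: "sweet chocolate",
    7: "sweet sugar",
    8: "sugar cane brazil",
    9: "sweet sugar beet",
    10: "sweet cake icing",
    11: "cake black forest",
}


def vectorize(docs):
    """Map each docID to a unit-length binary term vector."""
    vocab = sorted({w for text in docs.values() for w in text.split()})
    vectors = {}
    for doc_id, text in docs.items():
        words = set(text.split())
        norm = math.sqrt(len(words))
        vectors[doc_id] = [1.0 / norm if w in words else 0.0 for w in vocab]
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, seed_ids, max_iter=100):
    """Return (assignment, passes); passes counts the final no-change pass."""
    centroids = [list(vectors[s]) for s in seed_ids]
    assignment = {}
    for passes in range(1, max_iter + 1):
        # Reassignment step: nearest centroid, lowest index wins ties.
        new_assignment = {
            doc_id: min(range(len(centroids)),
                        key=lambda k: sq_dist(vec, centroids[k]))
            for doc_id, vec in vectors.items()
        }
        if new_assignment == assignment:
            return assignment, passes
        assignment = new_assignment
        # Recomputation step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [vectors[d] for d, c in assignment.items() if c == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, max_iter


# Assumed setup: K = 2 with documents 6 and 7 as seeds.
assignment, passes = kmeans(vectorize(DOCS), seed_ids=(6, 7))
for doc_id in sorted(assignment):
    print(doc_id, "-> cluster", assignment[doc_id])
print("passes:", passes)
```

Changing seed_ids or the number of seeds shows how sensitive the result and the iteration count are to initialization.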