Introduction
In this assignment, you will develop a simple text categorization (TC) system for detecting email spam.
Particularly, you will work on an incomplete version of the Naive Bayes text categorization algorithm. You are expected to complete the implementation and study the empirical results with a real world email collection.
Both the email collection and the incomplete implementation of the text categorization system are packaged in a "light" version of the Lemur toolkit (download instructions below). The light version only runs on Unix/Linux. The email collection is a subset of an email collection originally created by Ion Androutsopoulos.
You can follow the instructions below to install the light Lemur toolkit. The basic procedure is:
0. Go to your scratch directory. Use "wget http://www.cs.purdue.edu/homes/lsi/CS547_2013_Spring/assignments/HW2/lemur-light2.tar.gz" to download the toolkit.
1. Use "tar -xzvf lemur-light2.tar.gz" to uncompress and unpack the package.
2. Enter the lemur-light2 root directory and use "./configure" to configure the Lemur toolkit.
3. Use "gmake" in the lemur-light2 root directory to compile the package.
4. Every time you change the code, rebuild it by running "gmake" in the lemur-light2 root directory.
There are two directories under the lemur-light2 root directory that are particularly useful. The "eval_data/" directory contains the email collection (i.e., spam_data.sgml) and several parameter files for running the text categorization system. The "app/src/" directory contains the application code: "BuildIndex.cpp" builds the indexes (it is complete), while "TCEval.cpp" is incomplete and is the file you will work on. "TC_eval.pl" is a Perl script that you will use to evaluate the performance of the spam detection algorithm.
Implementation of text categorization system for detecting email spam
First, please build the inverted index files for the email collection. Enter the "eval_data" directory under the light Lemur root directory. Use the command "../app/obj/BuildIndex build_stemmed_nostopw_param spam_data.sgml" to build the index, with the text data processed by stemming and stop-word removal.
Please complete the following experiments:
1. Complete the Implementation of the Naive Bayes Text Categorization Algorithm (50 points)
Please check out the source file TCEval.cpp under the "app/src" directory of the light Lemur root directory. First read through the code and understand its structure. Two functions within TCEval.cpp are unfinished.
First, you need to complete the function "estTrainModel", which estimates a multinomial model for the spam (i.e., relevant) class and one for the non-spam (i.e., irrelevant) class from the training data. (25 points)
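As a toolkit-independent sketch of the estimation step (the real "estTrainModel" works against Lemur's index API, so the function name, data structures, and add-one smoothing choice below are all illustrative assumptions, not Lemur code), a multinomial class model can be estimated like this:

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch: estimate a smoothed multinomial model
// P(w | class) = (count(w, class) + 1) / (totalCount(class) + V)
// from the training documents of one class, using add-one (Laplace)
// smoothing over a vocabulary of size V. The real estTrainModel reads
// counts from Lemur's inverted index instead of raw token vectors.
std::map<std::string, double> estMultinomial(
    const std::vector<std::vector<std::string>>& docs, int vocabSize) {
  std::map<std::string, int> counts;
  int total = 0;
  for (const auto& doc : docs)
    for (const auto& w : doc) { ++counts[w]; ++total; }

  std::map<std::string, double> probs;
  for (const auto& [w, c] : counts)
    probs[w] = (c + 1.0) / (total + vocabSize);  // add-one smoothing
  return probs;
}
```

You would estimate one such model from the spam training emails and one from the non-spam training emails; smoothing matters because unseen words would otherwise get probability zero and wipe out the whole product at test time.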
Second, you need to complete the function "getTestRst", which computes the spam detection result for each test email. (25 points)
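A similarly hedged sketch of the per-email computation that "getTestRst" needs: apply Bayes' rule in log space (summing log probabilities avoids floating-point underflow on long emails), then convert back to P(spam | email). The names, the prior parameter, and the unseen-word floor are illustrative assumptions, not Lemur's API:

```cpp
#include <cmath>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch: P(spam | email) via Naive Bayes in log space.
// pSpamW / pHamW map each word to its smoothed class probability;
// words absent from a model fall back to a small floor (an assumption —
// with proper smoothing over the full vocabulary this case vanishes).
double probSpam(const std::vector<std::string>& words,
                const std::map<std::string, double>& pSpamW,
                const std::map<std::string, double>& pHamW,
                double priorSpam, double floor = 1e-9) {
  double logSpam = std::log(priorSpam);
  double logHam  = std::log(1.0 - priorSpam);
  for (const auto& w : words) {
    auto s = pSpamW.find(w);
    auto h = pHamW.find(w);
    logSpam += std::log(s != pSpamW.end() ? s->second : floor);
    logHam  += std::log(h != pHamW.end() ? h->second : floor);
  }
  // Normalize the two joint log-likelihoods back to a probability:
  // P(spam | d) = 1 / (1 + exp(logHam - logSpam))
  return 1.0 / (1.0 + std::exp(logHam - logSpam));
}
```

This per-email probability is exactly the second column that "testOutput" (described below) is expected to contain.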
2. Evaluation (30 points)
Please recompile the completed source code. Then enter the "eval_data" directory under the light Lemur root directory to test your code. Use the command "../app/obj/TCEval eval_TC_param" to generate the spam detection result file for the test emails.
"testOutput" is the spam detection result file for the test emails. Each line of the file corresponds to one test email, in the following format:
"FileName Probability of being Relevant (i.e., spam)"
First, evaluate the result in "testOutput" by running the command "perl TC_eval.pl testOutput 0.5". (5 points)
Second, read through the Perl script "TC_eval.pl" and explain the meaning of the output from the previous step (e.g., what does "true positive" mean, and what is the 0.5 in the command?). (5 points)
Third, calculate two additional measures, precision and recall: precision = TP / (TP + FP) and recall = TP / (TP + FN). (5 points)
Fourth, try different thresholds (e.g., 0.99, 0.9, 0.5, 0.1, 0.01, 0.001, ...); calculate the corresponding precision and recall at each threshold, and plot a figure showing these values (x-axis: precision, y-axis: recall). (15 points)
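For the last two steps, precision and recall at a threshold t can be computed directly from the (probability, true-label) pairs in "testOutput". This is an illustrative sketch, not the code of "TC_eval.pl" (the function name and input representation are assumptions):

```cpp
#include <utility>
#include <vector>

// Hypothetical sketch: precision and recall at threshold t from
// (spamProbability, isActuallySpam) pairs. An email is predicted spam
// when its probability is >= t; then
//   precision = TP / (TP + FP),  recall = TP / (TP + FN).
std::pair<double, double> precRecall(
    const std::vector<std::pair<double, bool>>& results, double t) {
  int tp = 0, fp = 0, fn = 0;
  for (const auto& [score, isSpam] : results) {
    bool predictedSpam = score >= t;
    if (predictedSpam && isSpam)        ++tp;  // true positive
    else if (predictedSpam && !isSpam)  ++fp;  // false positive
    else if (!predictedSpam && isSpam)  ++fn;  // false negative
  }
  double prec = (tp + fp) ? double(tp) / (tp + fp) : 0.0;
  double rec  = (tp + fn) ? double(tp) / (tp + fn) : 0.0;
  return {prec, rec};
}
```

Sweeping t from high to low and recording each (precision, recall) pair gives you the points for the requested plot: high thresholds tend toward high precision and low recall, and low thresholds the reverse.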
3. Model Analysis (15 points)
The TCEval application prints out two lists of words with the highest word probabilities under the relevant (i.e., spam) and irrelevant (i.e., non-spam) models. Please analyze these words.
First, are the words in the two lists the same or different? Give some examples of words that are good for detecting spam and some that are bad. Can you explain why there are bad words?
4. Algorithm Revisit (5 points)
Try your own ideas for improving the Naive Bayes algorithm. Some possibilities are: (1) try different smoothing methods (e.g., MAP, Jelinek-Mercer, ...); (2) try feature selection (keyword selection), for example, only considering the features (keywords) that are helpful for detecting spam. Full credit (i.e., 5 pts) will be given if you try an interesting algorithm (e.g., feature selection) rather than only simple smoothing methods.
1. Please write a report of your experiments.
2. Please attach a softcopy of your code to the report.
Please contact Dzung Hong if you have any questions.