In this assignment, you will implement the retrieval based k-Nearest Neighbor (k-NN) text categorization algorithm with the Lemur toolkit (in C/C++).
You have been given access to the linux lab (LWSN B158): sslab01.cs.purdue.edu through sslab20.cs.purdue.edu. You can either do the homework in the lab or access the computers remotely with ssh or putty. One directory has been created for you under "/scratch". Please be noted that the files under /scratch is NOT backed up. You should back up the source code by yourself. Also, please do not put any irrelevant file under the directory as the space is limited. (Please note that you have to use the same computer in order to access the files under /scratch)
Both the text document collection and the incomplete implementation of a retrieval system can be found in a "light" version of the Lemur toolkit. The online version of the light lemur toolkit from here. The light version only runs on Unix/Linux.
You can refer to the instruction to install the light Lemur toolkit. An easy way is to follow the basic procedures:
0. Please go to your scratch directory. Use "wget http://www.cs.purdue.edu/~lsi/CS490W_Fall_2011/assignments/HW2/lemur-light2-new.tar.gz" to get the toolkit.
1. Use "gzip -d lemur-light2-new.tar.gz" and "tar -xvf lemur-light2-new.tar.gz" to uncompress and unpack the package.
2. Enter the lemur-light2 root directory and use "./configure" to configure the Lemur toolkit
3. Use "gmake" in the lemur-light root directory to compile the package
4. Every time you change the code go to the lemur-light root directory to rebuild the code by "gmake"
There are two directories under the lemur-light2 root directory that are particular useful. The "eval_data/" directory contains the text document collection (i.e., training_docs.data) to analyze and several parameter files to run the categorization system. The "app/src/" directory contains the code of the applications. "BuildIndex.cpp" is the application for building indexes (complete). "RetrievalEval.cpp" is incomplete, which you will work on.
Implementation of retrieval based k-Nearest Neighbor (k-NN) categorization
In this task, you are going to use the lemur toolkit for categorization. Particularly you are going to implement and re-use some of your implementation from Assignment 1 to finish the task by yourself. First, please build the inverted index files. Enter the directory of "eval_data" under the light Lemur root directory. Use your knowledge from Assignment1 to build the index files with text data processed by stemming and removing stop words. The training data is "training_docs.data" and the parameter file for indexing is "build_stemmed_nostopw_param_training_docs".
Now it's time to eliminate the stopwords and do stemming on the raw-query file, i.e. "raw_queries.data". Use your knowledge from Assigment1 (the command) & the parameter file "parse_test_docs_query_stemmed_nostopw_param" to produce your new (stemmed & stopwords eliminated) query file "test_docs_queries.data" which is ready to use for retrieval.
The source file "RetrievalEval.cpp" under "app/src/" contains several retrieval algorithms for the retrieval based k-Nearest Neighbor categorization as well as the categorization code itself. Please read through the annotated code to understand the framework. Different retrieval algorithms have different implementation of functions as computeMETHODWeight and/or computeMETHODAdjustedScore (METHOD should be replaced by the names of different retrieval method like RawTF). The retrieval score of a document (D) with respect to a query (Q) is calculated as follows:
S(Q, D) = scoreAdjust(Weight(q1,d1,Q,D) + ... + Weight(qN,dN,Q,D), Q, D)
Where computeMETHODWeight calculates the value of the Weight function for a matched term of a document and a query; computeMETHODAdjustedScore is used to calculate the value of scoreAdjust function (e.g., for score normalization).
One simple retrieval algorithm "RawTF" has been implemented. You can refer to the code for implementing other retrieval algorithms. For all these algorithms you can reuse your code from Assignment1.
Apart from these, you will also work on the classification implementation which is provided in function "performKNNCategorization".
To play with Lemur toolkit and implement your retrieval algorithms & categorization, please complete the following experiments:
1. Implement the retrieval-based k Nearest Neighbor (k-NN) text categorization algorithm (60 points)
In this part, you will first finish the implementation for categorization, then run the categorization for "RawTF" retrieval algorithm and see its categorization results; and then complete the "RawTFIDF" and "Okapi" retrieval algorithm implementations and see their categorization results.
First, before running the categorization experiments with the "RawTF" algorithm, you should complete the categorization implementation provided in the function "performKNNCategorization" in the "RetrievalEval.cpp" source file. After you are done with this, you can run the categorization algorithm for "RawTF".
Now if you are done with the categorization implementation, it's time to try the categorization algorithm with "RawTF" algorithm. Use your knowledge from Assigment1 to do the categorization with "RawTF". The parameter file for retrieval & categorization is "eval_stemmed_nostopw_param". The fields you will deal with are "weightScheme" for the retrieval algorithm, "kNN" for setting the k value of your k-NN, "result" which receives the name of the result file that contains the retrieval results and "classResultFile" which receives the name of the result file that contains the categorization results.
After running the "RawTF" based categorization, it's time now to see how well are the categorization results. Evaluate the categorization results by using "perl TC_eval.pl class_Result_File_RawTF_kNN1" which will produce (print on the screen) the contingency table for all categories (regular emails and spam emails). So that's all for "RawTF".
Now, you are asked to implement two more sophisticated categorization algorithms. You only need to change the computeMETHODWeight functions of the two algorithms. You can reuse your code from Assignment1 for this task. The weight functions of the two algorithms and the RawTF method (implemented) are:
In order to obtain statistics like the average document length, please check the reference of Lemur::api::index class. More general information about Lemur classes can be found at here.
Please modify the parameters of "weightingScheme", "resultFile", "classResultFile" and "kNN" within the parameter file, i.e. "eval_stemmed_nostopw_param", for different retrieval algorithms & different "kNN" configurations. Use similar scripts to generate the evaluation results.
2. Analysis of the classification results (40 points).
In this section you are asked to do analyze the categorization results.
2.1. Compare the classification performance based on three retrieval algorithms (20 points).
Compare the results of your experiments with respect to different retrieval algorithms for the retrieval based k-NN categorization. Please include your results and discuss which algorithm performs better and why your think it's like that.
2.2. Compare the classification performance based different values of K (i.e., 1, 3, 5, 10, 20, 80) (20 points).
Compare the results of your experiments with respect to different k values (i.e., 1, 3, 5, 10, 20, 80) for the retrieval based k-NN categorization. Please include your results and discuss which k values perform better and why your think it's like that.
1. Please write a report of your experiments.
2. Please attach the hardcopy of your code with the report.