CS547 Assignment 1: Retrieval Algorithms
(due Feb 19th, Before Class)
Introduction
In this assignment, you will implement several information retrieval algorithms with the Lemur toolkit (in C/C++). If you have any questions, please email Ravi Bukka at rbukka@cs.purdue.edu.
You have been given access to the Linux lab (LWSN B158): sslab01.cs.purdue.edu through sslab25.cs.purdue.edu. You can either do the homework in the lab or access the computers remotely with ssh or PuTTY. One directory has been created for you under "/scratch". Please note that the files under /scratch are NOT backed up; you should back up your source code yourself. Also, please do not put irrelevant files under the directory, as the space is limited. (Please note that you have to use the same computer in order to access the files under /scratch.)
Both the text document collection and the incomplete implementation of a retrieval system can be found in a "light" version of the Lemur toolkit. The light version only runs on Unix/Linux.
You can refer to the installation instructions for the light Lemur toolkit. An easy way is to follow these basic steps:
0. Please go to your scratch directory. Use "wget http://www.cs.purdue.edu/homes/lsi/CS547_2013_Spring/assignments/HW1/lemur-light.tar.gz" to get the toolkit.
1. Use "gzip -d lemur-light.tar.gz" and "tar -xvf lemur-light.tar" to uncompress and unpack the package.
2. Enter the lemur-light root directory and use "./configure" to configure the Lemur toolkit.
3. Use "gmake" in the lemur-light root directory to compile the package.
4. Every time you change the code, go to the lemur-light root directory and rebuild the code with "gmake".
There are two directories under the lemur-light root directory that are particularly useful. The "eval_data/" directory contains the text document collection to analyze (i.e., database.sgml) and several parameter files for running the retrieval system. The "app/src/" directory contains the code of the applications: "BuildIndex.cpp" is the (complete) application for building indexes, and "RetrievalEval.cpp", which you will work on, is incomplete.
Implementation of several retrieval algorithms
In this task, you are going to use the Lemur toolkit for retrieval and implement several retrieval algorithms yourself. First, please build the inverted index files. Enter the "eval_data" directory under the light Lemur root directory. Use the command "../app/obj/BuildIndex build_param database.sgml" to build the index files from the raw text data. Use the command "../app/obj/BuildIndex build_stemmed_nostopw_param database.sgml" to build the index files from the text data processed by stemming and removing stop words.
The source file "RetrievalEval.cpp" under "app/src/" contains several retrieval algorithms. Please read through the annotated code to understand the framework. Different retrieval algorithms have different implementations of the functions computeMETHODWeight and/or computeMETHODAdjustedScore (METHOD should be replaced by the name of the retrieval method, e.g., RawTF). The retrieval score of a document (D) with respect to a query (Q) is calculated as follows:
S(Q, D) = scoreAdjust(Weight(q1,d1,Q,D) + ... + Weight(qN,dN,Q,D), Q, D)
Here, computeMETHODWeight calculates the value of the Weight function for a term matched between the document and the query, and computeMETHODAdjustedScore calculates the value of the scoreAdjust function (e.g., for score normalization).
One simple retrieval algorithm, "RawTF", has already been implemented. You can refer to its code when implementing the other retrieval algorithms.
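As an illustration of the general pattern, here is a minimal standalone sketch of a RawTF-style weight function. It is only an assumption about the shape of such a function (the name and signature below are made up for the example); the actual computeRawTFWeight in RetrievalEval.cpp is the version to imitate.

#include <iostream>

// Illustrative only: a RawTF-style Weight(q_i, d_i, Q, D) that multiplies the
// term's raw frequency in the document by its frequency (weight) in the query.
// The real computeRawTFWeight in RetrievalEval.cpp may use a different signature.
double rawTFWeight(double docTermFreq, double qryTermWeight) {
    return docTermFreq * qryTermWeight;
}

int main() {
    // A term that occurs 3 times in the document and twice in the query.
    std::cout << rawTFWeight(3.0, 2.0) << std::endl;  // prints 6
    return 0;
}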
To get familiar with the Lemur toolkit and implement your own retrieval algorithms, please complete the following experiments:
1. Test the performance of the "RawTF" retrieval algorithm with two types of text data (i.e., raw text data and text data processed by stemming and removing stop words). (15 points)
Enter the "eval_data" directory under the light Lemur root directory. Use the command "../app/obj/RetrievalEval eval_param query" to generate the result file (i.e., result_rawtf) of the RawTF retrieval algorithm for the raw text data. Use the command "../app/obj/RetrievalEval eval_stemmed_nostopw_param query_stemmed_nostopw" to generate the result file (i.e., result_rawtf_stemmed_nostopw) of the RawTF retrieval algorithm for the text data processed by stemming and removing stop words.
Evaluate the results by using "../trec_eval qrel result_rawtf" and "../trec_eval qrel result_rawtf_stemmed_nostopw". Please include the results in your report. Can you tell which result is better? If one is better than the other, please provide a short analysis.
2. Implement three different retrieval algorithms and evaluate their performance. (45 points)
You only need to change the computeMETHODWeight functions of the three algorithms. The weight functions of the three algorithms and the RawTF method (implemented) are:
RawTF, RawTFIDF, LogTFIDF, and Okapi.
(The weight function formulas for these four methods are given as figures in the handout.)
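For reference, commonly used textbook formulations of these four weighting schemes are sketched below in LaTeX notation. Treat them as an assumption only; the authoritative formulas (including log bases and constants) are the ones in the handout figures. Here $c(t,D)$ and $c(t,Q)$ are the counts of term $t$ in document $D$ and query $Q$, $N$ is the total number of documents, $df(t)$ is the number of documents containing $t$, $|D|$ is the length of $D$, and $avdl$ is the average document length.

\begin{align*}
\text{RawTF:}    \quad & w(t) = c(t,Q)\, c(t,D) \\
\text{RawTFIDF:} \quad & w(t) = c(t,Q)\, c(t,D)\, \log\frac{N}{df(t)} \\
\text{LogTFIDF:} \quad & w(t) = c(t,Q)\, \bigl(\ln c(t,D) + 1\bigr)\, \log\frac{N}{df(t)} \\
\text{Okapi:}    \quad & w(t) = \frac{c(t,D)}{c(t,D) + 0.5 + 1.5\,|D|/avdl}\;
                         \log\frac{N - df(t) + 0.5}{df(t) + 0.5}\;
                         \frac{8 + 7\,c(t,Q)}{8 + c(t,Q)}
\end{align*}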
In order to obtain statistics like the average document length, please check the reference for the lemur::api::Index class. More general information about Lemur classes can be found here.
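For example, here is a minimal sketch of reading such statistics from the index, assuming the lemur::api::Index interface (the method names are taken from the Index class reference; the helper function itself is made up for illustration):

#include <iostream>
#include "Index.hpp"   // Lemur index interface

// Illustrative helper: collection statistics commonly needed by
// TF-IDF / Okapi style weighting. "ind" is the index object that
// RetrievalEval.cpp already opens.
void printCollectionStats(lemur::api::Index *ind,
                          lemur::api::TERMID_T termID,
                          lemur::api::DOCID_T docID) {
    std::cout << "N    = " << ind->docCount()       << "\n"   // total number of documents
              << "df   = " << ind->docCount(termID) << "\n"   // documents containing the term
              << "avdl = " << ind->docLengthAvg()   << "\n"   // average document length
              << "|D|  = " << ind->docLength(docID) << "\n";  // length of this document
}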
Please set the weightingScheme and resultFile parameters within "eval_stemmed_nostopw_param" appropriately for each retrieval algorithm. Follow the steps in task 1 to generate the evaluation results.
Please compare the results of the different retrieval algorithms. Provide a short discussion of the advantages/disadvantages of the algorithms.
3. Run the retrieval algorithms on a second dataset: a subset of Purdue Computer Science web pages. (20 points)
In this task, you will try out your retrieval algorithms on a second dataset: a subset of the web pages in the Purdue Computer Science (cs.purdue.edu) domain. You will run your retrieval algorithms on a query set that is provided to you and analyze the results.
First, enter the "eval_data" directory under the light Lemur root directory. Build the inverted index files for the "cs.purdue.data" file, which contains the subset of pages from the Purdue CS domain. Make sure the data is pre-processed by stemming and removing stop words.
The query file provided to you is a raw query file in natural language (i.e., "purdue_cs_query_file.raw_query"), which means you should eliminate stop words and stem it as well. Use the command "../app/obj/ParseToFile parse_purdue_cs_raw_query_stemmed_nostopw_param purdue_cs_query_file.raw_query" to produce your processed query file "purdue_cs_query_file.query" (stemmed, stop words removed), which is ready to use for retrieval.
Now it is time to do the retrieval. Use the appropriate command to generate the result file (i.e., result_purdue_cs_rawtf_stemmed_nostopw) of the RawTF retrieval algorithm for the Purdue CS data processed by stemming and removing stop words. Please modify the weightingScheme and resultFile parameters within "eval_purdue_cs_stemmed_nostopw_param" for the different retrieval algorithms. Use similar commands to generate the evaluation results.
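(By analogy with task 1, the retrieval command is presumably of the form "../app/obj/RetrievalEval eval_purdue_cs_stemmed_nostopw_param purdue_cs_query_file.query"; please confirm the parameter and query file names against the files in eval_data.)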
The last step is to evaluate the results yourself. You will see which documents are ranked higher in the result file that you generate. The format of the result file is as follows:
Result File Format:
(Query_Number) (Ignore) (Doc_Name/URL) (Doc_Rank) (Ignore) (Ignore)

Example:
101 Q0 http://www.cs.purdue.edu/phds/ 30 13.8994 Exp
which means "for query 101, the page http://www.cs.purdue.edu/phds/ ranked 30th".
The document name/number in this collection corresponds to its public URL, so you can easily view the document to assess its relevance to a query.
Please analyze your results with respect to the different retrieval algorithms. Choose a sufficient number of top documents (between Top 5 and Top 20) to compare the different retrieval algorithms properly.
Please indicate in your report which algorithm is better and why. Summarize your results (how many documents you evaluated and the precision of each retrieval algorithm) and, based on those results, discuss the advantages/disadvantages of each retrieval algorithm.
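For example, if you judge the top 10 documents returned for a query and find that 6 of them are relevant, the precision at 10 for that query is 6/10 = 0.6; averaging this value over all queries gives one number per retrieval algorithm that you can compare.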
4. Implement your own retrieval algorithm. (20 points)
It is time to show your creativity. We have tried several retrieval algorithms; now please implement your own retrieval algorithm in the two functions computeCustomWeight and computeCustomAdjustedScore. Test the algorithm on the "database.sgml" dataset, and provide a discussion of the advantages/disadvantages of the algorithm.
Some ideas: use normalization in computeCustomAdjustedScore for different retrieval algorithms, use a different weighting scheme, or implement the unigram language modeling retrieval algorithm that will be covered in class.
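For instance, one possible direction is sketched below: a log-TF-IDF weight combined with document-length normalization in the adjusted score. This is only a sketch under assumed signatures, not a required design; adapt it to the actual function definitions in RetrievalEval.cpp.

#include <cmath>

// Illustrative sketch of one possible custom scheme (not a required design):
// a log-TF * IDF weight, with the summed score normalized by document length.
// The signatures are assumptions; match them to RetrievalEval.cpp.
double computeCustomWeight(double docTermFreq, double docFreq,
                           double numDocs, double qryTermWeight) {
    double idf = log(numDocs / docFreq);                   // rarer terms weigh more
    return (log(docTermFreq) + 1.0) * idf * qryTermWeight; // log-scaled TF * IDF * query weight
}

double computeCustomAdjustedScore(double summedWeights, double docLength) {
    // Divide by document length so long documents are not unfairly favored.
    return summedWeights / docLength;
}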
What to turn in
1. Please write a report of your experiments.
2. Please attach a hardcopy of your code to the report.