CS 54701: Information Retrieval

Project 1: Latent Semantic Indexing

Due 07:00 EST Monday, 22 February 2016

Begin now. We estimate this will take 12-16 hours, assuming good programming skills but perhaps some rustiness in C++, so if you put this off until the last weekend, you are not likely to finish. Don't expect much response from the instructors in the last eight hours before the deadline, either.
Late Policy: Late work will be penalized 10% per day (24-hour period). This penalty applies except in case of a documented emergency (e.g., a medical emergency) or by prior arrangement.

In this project you will use the Indri information retrieval toolkit (the current version of what was known as the Lemur toolkit) to index and run queries on a set of web pages. The retrieval model you'll implement is Latent Semantic Indexing. You are welcome to use an existing Singular Value Decomposition package if you don't want to implement your own; we suggest SVDLIBC or Matlab.

For those of you who really don't like C++, we can discuss using Lucene instead (Python and Java implementations). But Lucene is a much heavier-weight system, designed for production use, and will probably take you a lot longer. If you do choose to go this route, expect only high-level help from us; we won't be answering any detailed questions on Lucene.

Installation Instructions

Instructions are given for building under Linux, using the CS department facilities sslabnn.cs.purdue.edu (where nn ranges from 00 to 24). You'll need to acknowledge the Acceptable Use Policy at the MyCS Portal. There are also instructions on remote access; other documentation can be found at the Support Wiki.

Note that your home directory has limited space. We recommend that you store indexes and data (anything large and easily recreated) in the /scratch partition; you have space on each machine under /scratch/username . The /scratch space is local to each machine, which means it is faster, but if you log in to a different machine, you won't see what you had before. Also, the /scratch partition is not backed up, so store code and anything else you produce in your own home directory. (There is also a scratch directory in your home directory. This is un-backed-up space that is shared across all machines, so you don't get the performance advantages of running locally.)

To save web traffic and storage space, all of the downloaded files are available on the sslab machines in the directory /homes/clifton/scratch/CS54701/ . You can link to these files (ln -s ~clifton/scratch/CS54701/* .) to avoid filling up your quota with local copies.

Indexing

We have provided a set of web pages for you to index (an old snapshot of the Computer Science department web site).

Compute the Term-Document Matrix with TF-IDF weighting
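
To make this step concrete, here is a minimal sketch of building a TF-IDF weighted term-document matrix. It is written in Python purely for illustration and does not use the Indri API; your own implementation is expected to work from the Indri index. The function name tfidf_matrix, the toy documents, and the particular weighting (raw term frequency times ln(N/df)) are assumptions made for this example; other TF-IDF variants are equally reasonable.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Build a term-document matrix with TF-IDF weights.

    docs: list of token lists, one per document.
    Returns (vocab, matrix), where matrix[i][j] is the weight of
    term vocab[i] in document j.
    Weighting (an assumption for this sketch): tf * ln(N / df).
    """
    n_docs = len(docs)
    counts = [Counter(tokens) for tokens in docs]
    vocab = sorted(set().union(*docs)) if docs else []
    # Document frequency: number of documents containing each term.
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    matrix = []
    for t in vocab:
        idf = math.log(n_docs / df[t])
        # Counter returns 0 for terms absent from a document.
        matrix.append([c[t] * idf for c in counts])
    return vocab, matrix


# Toy documents, made up for illustration only.
docs = [
    ["latent", "semantic", "indexing"],
    ["semantic", "retrieval", "retrieval"],
    ["indexing", "retrieval"],
]
vocab, matrix = tfidf_matrix(docs)
for term, row in zip(vocab, matrix):
    print(f"{term:10s}", [round(w, 3) for w in row])
```

Rows correspond to terms and columns to documents. With this weighting, a term that occurs in every document gets weight zero, since ln(N/N) = 0.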

Singular Value Decomposition

To factorize the matrix, you can use the SVDLIBC toolkit. Another possibility would be Matlab (available on the sslab machines as /usr/local/bin/matlab). You will need to choose a value of k for which the rank-k approximation is an adequate approximation of the matrix. Compute a reduced version of the SVD based on this value.
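
One possible way to pick k, sketched below in Python for illustration, is to keep the smallest number of singular values that retains a chosen fraction of the total squared singular-value mass, and then truncate the factors to that rank. This is only one heuristic, not a required method; the function names choose_k and truncate_svd, the 0.9 threshold, and the example singular values are all assumptions. The sketch assumes the factors U, s, and Vt have already been computed by some SVD package, with s sorted in decreasing order.

```python
def choose_k(singular_values, energy=0.9):
    """Return the smallest k whose top-k singular values retain at
    least `energy` of the total squared singular-value mass.

    Assumes singular_values is sorted in decreasing order.
    """
    total = sum(s * s for s in singular_values)
    running = 0.0
    for k, s in enumerate(singular_values, start=1):
        running += s * s
        if running >= energy * total:
            return k
    return len(singular_values)


def truncate_svd(U, s, Vt, k):
    """Keep only the first k latent dimensions of A = U * diag(s) * Vt.

    U:  list of rows, one per term (columns are latent dimensions).
    s:  list of singular values.
    Vt: list of rows, one per latent dimension (columns are documents).
    """
    U_k = [row[:k] for row in U]
    s_k = s[:k]
    Vt_k = Vt[:k]
    return U_k, s_k, Vt_k


# Made-up singular values, for illustration only.
s = [5.0, 3.0, 1.0, 0.5]
k = choose_k(s, energy=0.9)
print("chosen k:", k)
```

Whatever criterion you use, it is worth reporting how retrieval quality changes as k varies rather than relying on a single threshold.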

Perform an evaluation

You will need to evaluate the quality of the resulting retrieval engine. Your report should contain a description of how you do this evaluation, the metrics used, and the baseline that you compare with.
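
As an illustration of what such an evaluation might compute, the following Python sketch implements precision at k and mean average precision (MAP) over a set of queries with relevance judgments. The choice of these metrics, the function names, and the toy judgments and rankings are assumptions for this example only; the data is made up and does not represent results from any system.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    top = ranked[:k]
    return sum(1 for d in top if d in relevant) / k


def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant
    document appears, divided by the number of relevant documents."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs, qrels):
    """Mean of average precision over all judged queries.

    runs:  dict mapping query id to a ranked list of document ids.
    qrels: dict mapping query id to the set of relevant document ids.
    """
    aps = [average_precision(runs[q], qrels[q]) for q in qrels]
    return sum(aps) / len(aps) if aps else 0.0


# Hypothetical judgments and rankings, made up for illustration.
qrels = {"q1": {"d1", "d3"}, "q2": {"d2"}}
system_run = {"q1": ["d1", "d3", "d2"], "q2": ["d2", "d1", "d3"]}
baseline_run = {"q1": ["d2", "d1", "d3"], "q2": ["d1", "d2", "d3"]}

print("system MAP:  ", round(mean_average_precision(system_run, qrels), 4))
print("baseline MAP:", round(mean_average_precision(baseline_run, qrels), 4))
```

Running the same queries and judgments through both the LSI engine and the baseline, then comparing the resulting scores, is one straightforward way to structure the comparison.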

What to submit

You will need to turn in a report on your evaluation (2-4 pages). In addition, assuming that you are using Indri/SVDLIBC and that your code runs on the SSLab machines, you need to turn in the specific C++ code you have written and any other changed files (this is likely to be only the file TFIDFMatrix.cpp). If using Matlab, you will also need to turn in the Matlab files. If you are using a different environment, please discuss with us what you need to turn in.

How to turn in your project

SSH to a Purdue CS machine and run the following command to turn in your project.

turnin -v -c cs547 -p lsi name_of_directory

where name_of_directory is the directory that you want to submit.

