CS 47300: Web Information Search and Management

Project 1 Part 2: Implementing a Retrieval Model

Due 11 October 11:59pm

In this project, you will use the Galago toolkit to implement a retrieval model and compare it with one (or more) of the models already included. While you are welcome to install and run Galago on machines of your choice, please make sure that your final submission runs on the CS department's mc-cluster servers (mc17.cs.purdue.edu, mc18.cs.purdue.edu, and mc19.cs.purdue.edu; we will note others as we identify machines that are set up appropriately).

You can use any language that runs on those machines, but you must provide a command-line script that will run the test. If you use Java, you can use the Galago API; however, because the API is very poorly documented, we recommend that even a Java program use exec calls to run command-line Galago and collect the output (this is what you will need to do if you use any other language).
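For example, the sketch below shows one way to shell out to command-line Galago from Python and capture its output. It is only a sketch: it assumes the galago script is on your PATH (see the installation instructions below) and that batch-search is driven by a JSON parameter file as in Part 1; adapt the arguments to your own setup.

import subprocess

def run_galago(args):
    """Run the command-line galago tool with the given argument list and
    return whatever it prints to standard output as a string."""
    result = subprocess.run(
        ["galago"] + args,   # assumes galago is on your PATH
        capture_output=True,
        text=True,
        check=True,          # raise an error if galago exits with a failure
    )
    return result.stdout

# Hypothetical usage: run batch-search with a JSON parameter file (as in
# Part 1) and keep the output lines for later processing.
output_lines = run_galago(["batch-search", "project1-query.json"]).splitlines()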

Instructions for accessing the Galago installation:

An installation of Galago is available at /homes/cs473/project1/galago-3.16/. Your final submission must use this installation. You'll want to add it to your command path using the following command:

export PATH=/homes/cs473/project1/galago-3.16/bin/:$PATH

You may also want to add this command to your ~/.bashrc file so you don't have to rerun it every time you log in. Make sure you run Galago from your home directory (or a subdirectory you create); Galago writes some files to the current working directory, and if you try to run it from /homes/cs473/ you will get some strange errors.

Task: Code up a retrieval model

Your task is to implement a TF-IDF retrieval model. Essentially, you are producing the same kind of output as galago batch-search, but using the following TF-IDF scoring:

Query term weight: 1
Query collection term weight: 1
Query length normalization: cosine similarity
Document term weight: tf
Collection term weighting: log2(N/df)
Document length normalization: cosine similarity

In other words, query terms are unweighted (weight 1), document terms are weighted as term count * log2(N/df), and both the query and document vectors are cosine-normalized.
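To make the weighting concrete, here is a minimal sketch of the cosine-normalized TF-IDF score described above. The data structures (a per-document term-count dictionary and a document-frequency dictionary) are illustrative assumptions; your program has to obtain those numbers from the Galago index.

import math

def tfidf_score(query_terms, doc_term_counts, df, num_docs):
    """Cosine similarity between an unweighted query vector and a
    tf * log2(N/df) weighted document vector (illustrative sketch).

    query_terms     -- list of terms in the query (weight 1 each)
    doc_term_counts -- dict: term -> raw count of the term in the document
    df              -- dict: term -> number of documents containing the term
    num_docs        -- N, the total number of documents in the collection
    """
    # Document vector components: tf * log2(N / df)
    doc_weights = {
        t: tf * math.log2(num_docs / df[t])
        for t, tf in doc_term_counts.items()
        if df.get(t, 0) > 0
    }

    # Dot product: every query weight is 1, so just sum the matching
    # document weights (repeated query terms are treated as one occurrence).
    dot = sum(doc_weights.get(t, 0.0) for t in set(query_terms))

    # Cosine length normalization of both vectors. The query norm is the
    # same for every document, so it does not change the ranking, but it is
    # part of the cosine similarity.
    doc_norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    query_norm = math.sqrt(len(set(query_terms)))

    if doc_norm == 0.0 or query_norm == 0.0:
        return 0.0
    return dot / (doc_norm * query_norm)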

Your program takes three inputs; to make things easy, these are specified as files that will be in the same directory the command is run from. These are the query, relevance, and Galago index files. The program should be named TFIDF and should execute simply by calling ./TFIDF in the current directory (note that this could be a script that runs your code).

We'll be using different files to test with, but only the query terms, query numbers, and number of queries will change; the syntax will be as given in the examples.

Your program should generate output in the same format as the galago batch-search command (suitable as input to galago eval). The output consists of your TF-IDF cosine similarity score for each document, for each query in the project1-query.json file. (You only need to include documents with a non-zero score.)
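For reference, batch-search emits TREC-style run output, one line per retrieved document: the query number, the literal Q0, the document identifier, the rank, the score, and a run tag. The lines below are purely illustrative (made-up document identifiers and scores):

9 Q0 WSJ880906-0185 1 0.4312 TFIDF
9 Q0 WSJ880905-0126 2 0.3977 TFIDF
10 Q0 AP890112-0211 1 0.5104 TFIDF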

You should already know how to get the document frequency; you gathered the information you need in Part 1. You may need to find a different Galago command to get the term frequency within a document; some possibilities are dump_key_value and dump_doc_terms.

Report Q1: Evaluation

Once you have this working, perform an evaluation comparing the model you implemented with the BM25 (Okapi) model, one of the built-in Galago models. Include this evaluation in report.pdf. Also include a paragraph on which method you think is better and why.

Report Q2: Other Things to Consider

While doing this project, let your curiosity guide you. Some interesting questions to explore:

I'm sure you can think of others. Include in your report a paragraph titled Interesting Discoveries with a brief discussion of one of the above questions, or (better still) something you've come up with on your own.

Instructions for submission

Submit a single folder named project1/ using turnin. The folder must contain a file TFIDF that, when executed, runs your code, and the file report.pdf (which is 60% of your grade). Other files may be needed as well, but the folder should NOT contain the query, relevance, or Galago index files.

Note that we will provide your program with different inputs (query, relevance, and Galago index files) than those listed above; the values listed above are what should be included in the report, but your program must work for other inputs as well.

Use the following command to submit the project:
turnin -c cs47300 -p project1 <path-to-submission-folder>
You may submit any number of times before the deadline. Only your most recent submission will be recorded.

To view submitted files, use:
turnin -c cs47300 -v -p project1
Don't forget the -v option; otherwise it will overwrite your submission with an empty folder.

Please also submit report.pdf using Gradescope (this makes grading easier, faster, and more consistent).

