CS 47300: Web Information Search and Management

Project 1 Part 1: Index Use

Due 4 October 11:59pm

In this project, you will use the Galago toolkit to explore a text corpus. While you are welcome to install and run Galago on machines of your choice, please make sure that your final submission runs on the mc cluster CS servers (mc17.cs.purdue.edu, mc18.cs.purdue.edu, and mc19.cs.purdue.edu; we will note others as we identify ones that are set up appropriately). You can access the lab remotely. Using X11, you can even run graphical user interface software remotely. For PCs, you'll need to install X11 software; I use Xming. Older macOS versions support X11 directly, but for newer ones you'll need XQuartz, which works much like Xming. Linux machines support X11 natively. You will be graded on the mc cluster only, so make sure that what you turn in runs as expected there. (Warning: Galago doesn't seem to compile or run under OpenJDK 11.)

How to configure Putty & Xming

Download & install Xming
Configure Xming (Click on XLaunch.exe in your install directory)
Select Multiple Windows and type the number 0 for Display number. Click Next.
Select Start no client then click Next.
Click Next. Do not change anything.
Click on Save Configuration.
Save configuration to the same directory as config.xlaunch.

Installation & Configuration of Putty (for Windows; Mac and Linux machines include OpenSSH, which allows the same tunnelling)

Download Putty
Configure Putty (Open putty.exe and choose ssh).
In the box under Saved Sessions type the name of a CS lab Linux server (e.g., mc18.cs.purdue.edu) and save
Click on the saved session that you just created. Then click on the button Load.
Configure X11 Forwarding. (On the left-hand side, find the X11 configuration category by double-clicking on SSH and then clicking on X11. Make sure the box labeled Enable X11 forwarding is checked and MIT-Magic-Cookie-1 is selected.)
Login. (You will now see a command line screen that will prompt you to enter your username.)
Enter a password. (Your password is the same as your BoilerKey. You will notice that no characters appear on the screen as you type your password. Once you have finished typing it, press Enter.)

Under Mac and Linux (and Windows 10), you can use ssh (part of openssh) rather than installing putty.

There are many things you'll have to figure out from looking through documentation, searching the web, or simply trying things out. You'll find that in an R&D environment, you will often be involved with things that are not well documented. After all, if everything were clean and laid out for you, where is the R&D? This is an intentional part of the assignment, to give you experience with such an environment before you hit the real world.

You can use any language that runs on those machines, but you must provide a command-line script that will run the test in Part 3. If you use Java, you can use the Galago API, but because it is very poorly documented, we recommend that even Java programs use exec calls to run command-line Galago and collect the output (this is what you'll need to do if using other languages).
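The exec-call approach described above might look like the following in Python. This is only a sketch under stated assumptions: the helper name run_command is hypothetical, and the example launches the Python interpreter itself rather than Galago so that it runs anywhere. On the mc cluster you would pass a galago command line as the argument list instead.

```python
import subprocess
import sys


def run_command(args):
    """Run a command given as a list of arguments and return its stdout as text.

    Raises subprocess.CalledProcessError if the command exits with a
    non-zero status.
    """
    completed = subprocess.run(
        args,
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode bytes to str
        check=True,           # fail loudly on a non-zero exit code
    )
    return completed.stdout


if __name__ == "__main__":
    # Stand-in command so the sketch runs without Galago installed.
    # With Galago on your PATH, the list would start with "galago" instead.
    output = run_command([sys.executable, "-c", "print('hello from a subprocess')"])
    print(output.strip())
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems when a query contains spaces.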

Instructions to access the galago installation:

An installation of galago is available at /home/u3s/cs473/project1/galago-3.16/. Your final submission must use this installation. You'll want to add this to your command path, using the following command:

export PATH=/homes/cs473/project1/galago-3.16/bin/:$PATH

You may also want to add the command to your ~/.bashrc file, so you don't have to rerun it every time you log in. Make sure you run Galago from your home directory (or a subdirectory you create); Galago writes some files to the current working directory, and if you try to run it from /homes/cs473/ it will give some strange errors.

To get a list of all available galago commands run: galago help

You will perform the following steps in order:

Build an index for the corpus
Access and use the built index
Write a program that uses the index to compute the Retrieval Status Value for all documents for a query (or at least those where you can determine the RSV is greater than 0). This will be computed in two ways, both based on the Binary Independence Model.
Evaluate the two ranking mechanisms using given queries with known results.

You will use the CACM corpus (3204 documents) for this project, found at http://www.search-engines-book.com/collections/. There is already a copy of the documents, relevance judgements, and queries in /homes/cs473/project1/, so you don't need to download it yourself.

Please follow the naming convention very strictly; we need to be able to run everyone's program using exactly the same call. Your grade, and timely grading, will depend on it.

Part 1: Build the index

The index can be built using the command:
galago build --indexPath=./project1-index --inputPath=/homes/cs473/project1/cacm
Note that you can create your own input directory if you want to create specific documents to test things out. You don't need to turn anything in for this part.

Part 2: Accessing the index (35%)

Use galago commands to answer the following questions. We suggest galago batch-search, galago xcount, and galago doccount; you may find others useful. There is some documentation at SourceForge. The documentation is rather weak, and you'll find a few things that are buried or misleading. For example, in doccount, the query is specified with `--x+searchterm', not `--x=searchterm' (but --x=searchterm works for xcount). You'll also need to try some experiments to understand the semantics, since it isn't always obvious what each command is telling you (for example, try galago doccount --x+"how many" --index=./project1-index, then try with just "how" and just "many"). Note that you'll need to specify the index you created (e.g., --index=./project1-index). You can use /homes/cs473/project1/allqueries.json as an example of how to create a query file in JSON format. The outcome from this step should be included in the file report.pdf.

Determine the total number of documents in the corpus. You may find dump-index-manifest useful. Beware, this takes as an argument not the index directory, but the corpus subdirectory within the index directory.
Determine the number of documents containing the word `retrieval' and the word `algorithm'.
List the documents containing the word `Rice'.
List the top 5 documents returned for the queries `information retrieval' and `computer'. Report the documents for the given queries. Note that you'll need to transform queries into json format to use batch-search. You may find the galago query-transform command useful.
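As a rough sketch of producing a query file for batch-search: the layout used below (a top-level "queries" list whose entries have "number" and "text" fields) is an assumption, not something this handout confirms, so compare it with /homes/cs473/project1/allqueries.json before relying on it. The function name is hypothetical.

```python
import json


def build_query_file(queries):
    """Return a JSON string for a list of (number, text) query pairs.

    ASSUMPTION: the "queries"/"number"/"text" layout is not confirmed by the
    assignment text; check it against allqueries.json before using it.
    """
    payload = {
        "queries": [
            {"number": str(number), "text": text} for number, text in queries
        ]
    }
    return json.dumps(payload, indent=2)


if __name__ == "__main__":
    # Print the JSON; redirect it to a file of your choice to use it as a
    # query file.
    print(build_query_file([(1, "information retrieval"), (2, "computer")]))
```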

Part 3: Evaluation (15%)

You will use the galago eval command to evaluate the default Galago retrieval model. The outcome from this step should be included in the file report.pdf. You can use --judgments=/homes/cs473/project1/cacm.rel, or cacm_fullpath.rel, for the relevance file and /homes/cs473/project1/allqueries.json for the query files to construct the --baseline= query results.

Determine the precision and recall for the query `information retrieval vector space inverted files'. Use query 61 as the relevant documents in cacm.rel. Can we easily get some sort of rank-based precision measure for this?
If you've already done this for `parallel algorithms' (query 50), that is okay, you can submit that answer. But it isn't very interesting.
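For reference, set-based precision and recall, plus precision at a rank cutoff (one simple rank-based measure), can be computed by hand as sketched below. The document identifiers are made up for illustration and are not actual results for query 61; galago eval remains the tool this part asks you to use.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved list and a set of relevant ids."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & set(relevant))
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def precision_at_k(ranked, relevant, k):
    """Return the fraction of the top-k ranked documents that are relevant.

    If fewer than k documents were ranked, the fraction is taken over the
    documents that are present.
    """
    top_k = ranked[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc in top_k if doc in relevant) / len(top_k)


if __name__ == "__main__":
    # Hypothetical ranking and judgments, NOT real CACM results.
    ranked = ["D3", "D7", "D1", "D9", "D4"]
    relevant = {"D1", "D3", "D8", "D10"}
    print(precision_recall(ranked, relevant))   # (0.4, 0.5)
    print(precision_at_k(ranked, relevant, 2))  # 0.5
```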

Part 4: Estimation (30%)

We sampled 100 documents from the corpus. In this part, you will estimate the statistics of the corpus with the information you get from the 100 sample documents. The sampled documents are in /homes/cs473/project1/cacm100

Build an index for cacm100
Determine the number of documents containing the word `algorithm'.
Determine the number of relevant documents for the query `algorithm'.
Estimate the number of documents containing the word `algorithm', and the number of relevant documents for the query `algorithm', in the cacm corpus. How does this compare with the true values? (If it isn't feasible to compare with the true values, briefly explain why.)
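One common way to make such an estimate is to scale the sample count by the ratio of corpus size to sample size, assuming the 100 documents are a uniform random sample of the corpus. The sketch below uses the corpus size of 3204 stated earlier and a made-up sample count; the function name is hypothetical.

```python
def scale_up(sample_count, sample_size, corpus_size):
    """Estimate a corpus-wide count from a count observed in a random sample."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return sample_count * corpus_size / sample_size


if __name__ == "__main__":
    # 20 is a made-up count for illustration, not a measured value.
    print(scale_up(sample_count=20, sample_size=100, corpus_size=3204))  # 640.8
```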

Part 5: BIM-based RSV (20%)

Please use the information that you get from Part 2 and Part 3 to calculate the RSV of the query `retrieval algorithm'. Use p_i = 0.5 and, for r_i, the standard assumption that relevant documents are very few relative to the size of the corpus.
Since we have queries and relevance files (allqueries.json and cacm.rel), we can compute p_i and r_i that are more precise than the above estimates. Describe how this would be done (3-4 sentences is all it should take), and do so. Use only a single query and relevance judgements to estimate the p_i and r_i for each term (state which query number you have used for which term).
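A sketch of the two kinds of term-weight estimate, using the usual BIM weight c_i = log[p_i(1 - r_i) / (r_i(1 - p_i))] and summing c_i over the query terms present in a document. Here r_i is taken to be the probability that term i occurs in a non-relevant document, as the text above implies. The 0.5 smoothing in the second estimate is a common textbook choice rather than something this handout specifies, and all counts in the example are invented.

```python
import math


def bim_weight(p, r):
    """Term weight c_i = log( p(1 - r) / (r(1 - p)) ), for 0 < p < 1 and 0 < r < 1."""
    return math.log(p * (1.0 - r) / (r * (1.0 - p)))


def weight_without_relevance(doc_freq, num_docs):
    """Use p_i = 0.5 and approximate r_i by doc_freq / num_docs.

    Requires 0 < doc_freq < num_docs.
    """
    return bim_weight(0.5, doc_freq / num_docs)


def weight_with_relevance(doc_freq, num_docs, rel_with_term, num_rel):
    """Estimate p_i and r_i from the relevance judgments for one query.

    ASSUMPTION: 0.5 is added to each count to avoid zero probabilities.
    """
    p = (rel_with_term + 0.5) / (num_rel + 1.0)
    r = (doc_freq - rel_with_term + 0.5) / (num_docs - num_rel + 1.0)
    return bim_weight(p, r)


def rsv(doc_terms, weights):
    """Sum the weights of the query terms that occur in the document."""
    return sum(w for term, w in weights.items() if term in doc_terms)


if __name__ == "__main__":
    # Invented document frequencies; replace with counts from your own index.
    num_docs = 3204
    weights = {
        "retrieval": weight_without_relevance(100, num_docs),
        "algorithm": weight_without_relevance(1000, num_docs),
    }
    print(rsv({"retrieval", "algorithm", "sorting"}, weights))
    print(rsv({"algorithm"}, weights))
    print(weight_with_relevance(100, num_docs, rel_with_term=8, num_rel=10))
```

A document containing none of the query terms gets an RSV of 0 under this scheme, which is why only documents matching at least one term need to be scored.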

Submission

Submit a PDF file as your report to Gradescope. The commands you used and the result for each question must be included in your report.