In this project, you will use the Galago toolkit to explore a text corpus. While you are welcome to install and run Galago on machines of your choice, please make sure that your final submission runs on the mc cluster CS servers (mc17.cs.purdue.edu, mc18.cs.purdue.edu, and mc19.cs.purdue.edu; we will note others as we identify ones set up appropriately). You can access the lab remotely. Using X11, you can even run graphical user interface software remotely: on Windows you'll need to install X11 software (I use Xming); older versions of macOS support X11 directly, but newer ones need XQuartz, which works much like Xming; Linux machines support X11 natively. While you can install Galago and develop on your own platform, you will be graded on those machines only, so make sure that what you turn in runs as expected on the mc cluster. (Warning: Galago doesn't seem to compile or run under OpenJDK 11.)
How to configure PuTTY & Xming
Installation & Configuration of PuTTY (for Windows; Mac and Linux machines include OpenSSH, which allows the same tunnelling.)
Under macOS and Linux (and Windows 10), you can use ssh (part of OpenSSH) rather than installing PuTTY.
There are many things you'll have to figure out by looking through documentation, searching the web, or simply trying things out. You'll find that in an R&D environment, you will often be involved with things that are not well documented. After all, if everything were clean and laid out for you, where would the R&D be? This is an intentional part of the assignment, to give you experience with such an environment before you hit the real world.
You can use any language that runs on those machines, but you must provide a command-line script that will run the test in Part 3. If you use Java, you can use the Galago API - but as it is very poorly documented, we recommend that even if using Java, your program use exec calls to run command-line Galago and collect the output (this is what you'll need to do if using other languages).
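To make the exec-and-collect pattern concrete, here is a minimal Python sketch. The helper itself is generic; the commented-out galago invocation is illustrative (it assumes galago is on your PATH and that ./project1-index exists):

```python
import subprocess

def run_cmd(args):
    """Run a command-line program and return its stdout as a string.
    Raises CalledProcessError if the program exits with a nonzero status."""
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout

# Collecting Galago output (assumes galago is on your PATH and the
# index has been built):
# out = run_cmd(["galago", "doccount", "--x+retrieval", "--index=./project1-index"])
# print(out)
```

The same pattern works for batch-search, xcount, and the other subcommands; you just parse the returned text.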
An installation of Galago is available in /homes/cs473/project1/galago-3.16/. Your final submission must use this installation. You'll want to add it to your command path using the following command:
export PATH=/homes/cs473/project1/galago-3.16/bin/:$PATH
You may also want to add the command to your ~/.bashrc file, so you don't have to rerun it every time you log in. Make sure you run Galago from your home directory (or a subdirectory you create); Galago does put some things in the current working directory, and if you try to run it from /homes/cs473/ it will give some strange errors.
To get a list of all available galago commands, run: galago help
You will perform the following steps in order:
You will use the CACM corpus (3204 documents) for this project, found at http://www.search-engines-book.com/collections/ There is already a copy of the documents, relevance judgements, and queries in /homes/cs473/project1/, so you don’t need to download it yourself.
Please, follow the naming convention very strictly - we need to be able to run everyone's program using exactly the same call. Your grade, and timely grading, will depend on it.
The index can be built using the command:
galago build --indexPath=./project1-index --inputPath=/homes/cs473/project1/cacm
Note that you can create your own input directory if you want to create specific documents to test things out. You don't need to turn anything in for this part.
Use galago commands to answer the following questions. We suggest galago batch-search, galago xcount, and galago doccount; you may find others useful. There is some documentation at SourceForge. The documentation is rather weak - you'll find a few things that are buried or misleading. For example, in doccount, the query is specified with --x+searchterm, not --x=searchterm (but --x=searchterm works for xcount). You'll also need to try some experiments to understand the semantics - it isn't always obvious what each command is telling you (for example, try galago doccount --x+"how many" --index=./project1-index, then try with just "how" and just "many"). Note that you'll need to specify the index you created (e.g., --index=./project1-index). You can use /homes/cs473/project1/allqueries.json as an example of how to create a JSON query file. The outcome from this step should be included in the file report.pdf.
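For orientation, a batch-search query file generally follows the shape sketched below. The query numbers and text here are made up for illustration; check /homes/cs473/project1/allqueries.json for the exact fields and operators used in this course:

```json
{
  "queries": [
    {"number": "1", "text": "parallel algorithms"},
    {"number": "2", "text": "#combine(information retrieval)"}
  ]
}
```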
You will use the galago eval command to evaluate the default Galago retrieval model. You can use --judgments=/homes/cs473/project1/cacm.rel (or cacm_fullpath.rel) for the relevance file, and /homes/cs473/project1/allqueries.json as the query file when constructing the --baseline= query results. The outcome from this step should be included in the file report.pdf.
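galago eval reports standard TREC-style measures, such as average precision over the ranked results. As a sanity check on what those numbers mean, here is a small Python sketch of average precision for a single query, assuming binary relevance judgments. This re-implements the measure for illustration only; it is not the Galago code:

```python
def average_precision(ranked_docs, relevant):
    """Average precision for one query: the mean of precision@k taken at
    each rank k where a relevant document appears, normalized by the
    total number of relevant documents for the query."""
    hits = 0
    precision_sum = 0.0
    for k, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant) if relevant else 0.0

# Example: relevant docs {A, C}; the system ranks A, B, C.
# AP = (1/1 + 2/3) / 2 = 0.8333...
```

Hand-checking a query or two this way is a good guard against misreading the eval output.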
We sampled 100 documents from the corpus. In this part, you will estimate the statistics of the corpus using the information you get from the 100 sampled documents. The sampled documents are in /homes/cs473/project1/cacm100.
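One straightforward approach (a sketch, under the usual assumption that the 100 documents are a uniform random sample): count the statistic of interest in the sample, then scale by the ratio of corpus size to sample size, here 3204/100.

```python
def estimate_corpus_count(sample_count, sample_size=100, corpus_size=3204):
    """Scale a count observed in the sample up to a corpus-level estimate,
    assuming the sample is a uniform random draw from the corpus."""
    return sample_count * corpus_size / sample_size

# e.g., if a term occurs in 7 of the 100 sampled documents, we estimate
# it occurs in about 7 * 3204 / 100 = 224.28, i.e. roughly 224 documents.
```

You can get the sample counts with the same galago doccount/xcount commands, run against an index built over cacm100 instead of the full corpus.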
Submit a PDF file as your report to Gradescope. The commands you used and the result of each question must be included in your report.