CS 47300: Web Information Search and Management

Project 2: Text Clustering

Due 11:59pmEDT Tuesday, 12 November, 2019

In this project, you will parse a corpus and then perform agglomerative clustering on the corpus to generate a dendrogram. Please make sure that your final submission runs on the mc cluster CS servers (mc01.cs.purdue.edu - mc18.cs.purdue.edu). You can access the lab remotely. Using X11, you can even run graphical user interface software remotely (for PCs, you'll need to install X11 software, I use Xming.) While you can install whatever tools you want and develop on your own platform, you will be graded on those machines only, so make sure that what you turn in runs as expected on the mc cluster.

You can use any language that runs on those machines, but you must provide a command-line script using bash that will run the specified tests. The data we assign you to cluster on will remain in the same location on the server, so that full path can simply be hard-coded. We do not guarantee it will be exactly the same data, but it will follow the same format (we may use a different subset of the corpus.)

You will perform the following steps in order:

  1. Parse the Corpus, applying stemming and stopword removal. It is up to you to determine appropriate stemming and stopword removal techniques to apply, although we would suggest removing both very high and very low frequency words as stopwords.
  2. Use TF-IDF with log(tf)+1 for the term frequency weight, and log(N/df) as the inverse document frequency weight. You may also experiment with other weightings, but use these for the results you report. Use cosine similarity between articles to get a similarity matrix.
  3. Apply agglomerative clustering twice, once using single-link clustering, and once using complete-link clustering.
  4. Evaluate both models using a metric of your design, and discuss the effectiveness of the two procedures in light of your metric, and if you think your metric does a good job of selecting the best clustering approach.
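As a concrete reference for step 2, here is a minimal pure-Python sketch of the specified weighting, (log(tf)+1)*log(N/df), and of cosine similarity over the resulting sparse vectors. The function names and data layout are our own; any equivalent implementation is fine.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors with weight (log(tf) + 1) * log(N / df).

    `docs` is a list of token lists; returns one {term: weight} dict
    per document. Note that a term appearing in every document gets
    weight 0, since log(N/df) = log(1) = 0.
    """
    N = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency counts each doc once
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: (math.log(c) + 1) * math.log(N / df[t]) for t, c in tf.items()}
        vectors.append(vec)
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Computing all pairwise cosines then gives the similarity matrix used by the clustering in step 3.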

You will use a subset of the Reuters-21578 corpus for this project. The subset (~3000 documents) to use will be found in department lab Linux machines in /homes/cs473/project2/reut2-subset.sgm . Your code should run on the mc cluster machines, using the above corpus file (which will be a different subset of the Reuters-21578 corpus when we test, but will be similar), since that is how we will grade it. You'll want to read the corpus README file to make sure you understand the corpus format. Your clustering should only make use of the <BODY> tagged section of the article.

The provided file contains many articles. There are many sgml/xml parsing tools you can use; you'll probably find things that work easily for this task. Note that there will be articles with no topic, and articles with no body text. You should ignore the articles with no body text. You'll need to figure out an appropriate way to deal with articles without a topic listed when doing your evaluation, as well as deciding what to do with some likely uninteresting articles (e.g., those marked as BRIEF.)
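For illustration, a minimal stdlib-only extraction might look like the sketch below. It assumes the standard Reuters-21578 markup (a NEWID attribute on each <REUTERS> tag, topics inside <TOPICS><D>...</D></TOPICS>, and body text inside <BODY>); a proper SGML/XML parser is more robust than these regexes and is probably what you want for the real corpus.

```python
import re

def parse_reuters(sgm_text):
    """Pull (NEWID, topics, body) triples out of Reuters-21578 SGML,
    skipping articles that have no <BODY> section. Regex sketch only;
    a real SGML/XML parsing library handles edge cases better."""
    articles = []
    for m in re.finditer(r'<REUTERS[^>]*NEWID="(\d+)"[^>]*>(.*?)</REUTERS>',
                         sgm_text, re.S):
        newid, block = int(m.group(1)), m.group(2)
        body = re.search(r'<BODY>(.*?)</BODY>', block, re.S)
        if body is None:          # ignore articles with no body text
            continue
        topics_block = re.search(r'<TOPICS>(.*?)</TOPICS>', block, re.S)
        topics = re.findall(r'<D>(.*?)</D>', topics_block.group(1)) if topics_block else []
        articles.append((newid, topics, body.group(1)))
    return articles
```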

Please, follow the naming convention very strictly - we need to be able to run everyone's program using exactly the same call. Your grade, and timely grading, will depend on it.

Part 1: Parse the Corpus

Process the corpus at /homes/cs473/project2/reut2-subset.sgm . This involves stemming, stopword removal, and computing cosine similarity using TF*IDF as described above on the <BODY> section of each article. You may choose to instantiate the similarity matrix (figure out how much memory this would need first), or to do this processing on-the-fly as part of the clustering.

You will need to determine appropriate means for stemming and stopword removal, as well as potentially other decisions that need to be made in computing similarity scores.

You may use publicly available packages (e.g., Python's nltk) to help with this process.
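For the frequency-based stopword removal suggested above, a simple document-frequency filter is one option. The sketch below is stdlib-only (nltk's PorterStemmer and stopword list are a reasonable alternative for the stemming/stopword side); the thresholds shown are illustrative, and tuning them is part of the assignment.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase alphabetic tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def frequency_filter(tokenized_docs, min_df=2, max_df_ratio=0.8):
    """Drop very rare and very common terms, one simple stopword
    strategy. Keeps only terms whose document frequency is at least
    min_df and at most max_df_ratio of the collection."""
    N = len(tokenized_docs)
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))
    keep = {t for t, c in df.items() if c >= min_df and c / N <= max_df_ratio}
    return [[t for t in doc if t in keep] for doc in tokenized_docs]
```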

Please include in your report:

  1. A discussion of the processing steps you used, and how you determined the parameters (e.g., experimentation, reasoning about the task, etc.)
  2. A brief overview of your code (packages used, etc.)

Part 2: Run Complete and Single-Linkage Clustering

You will process the articles and perform agglomerative clustering using both Single-Linkage and Complete-Linkage, in separate instances. The output of your code should be two separate output files titled "single.txt" and "complete.txt" which will contain a list of all the articles in lexical order by the article ID (NEWID) and their assigned clusters, produced by single-linkage and complete-linkage agglomerative clustering respectively. This should be one line per article, with whitespace separating the article ID from the clusters, and between each cluster.

Remember that with hierarchical clustering, each article will belong to a hierarchy of clusters. The root should be labeled 1, and lower numbers should be closer to the root. Each article will thus be labelled with multiple clusters (log N on average, but it could be more or less.)
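The core merge loop can be sketched as below. This is a naive O(n^3) pure-Python version working directly from a similarity matrix; the only difference between the two linkages is whether inter-cluster similarity is the max (single-link) or min (complete-link) of the pairwise similarities. The merge history it returns is what you would then number top-down (root = 1) to produce the single.txt / complete.txt lines. Names are our own; a library implementation (e.g., scipy.cluster.hierarchy) is also acceptable if it runs on the mc cluster.

```python
def agglomerative(sim, linkage="single"):
    """Naive agglomerative clustering over a similarity matrix `sim`
    (list of lists). Returns the merge history as (cluster_a, cluster_b)
    pairs of frozensets of document indices, in merge order."""
    clusters = [frozenset([i]) for i in range(len(sim))]
    merges = []
    # single-link: cluster similarity = max pairwise similarity
    # complete-link: cluster similarity = min pairwise similarity
    agg = max if linkage == "single" else min
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = agg(sim[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges
```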

This code portion can be submitted using whatever language you're comfortable with. This code will be run using a bash script titled "CLUSTERING" that, when run, will execute both single-linkage and complete-linkage clustering, and output the results to their respective files.
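A CLUSTERING wrapper could be as simple as the following; the interpreter, script name, and flags here are placeholders for whatever your own code uses.

```shell
#!/bin/bash
# Hypothetical CLUSTERING wrapper: substitute your own interpreter,
# script name, and arguments. Each run must write its required file.
python3 cluster.py --linkage single   --out single.txt
python3 cluster.py --linkage complete --out complete.txt
```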

Samples of the .txt output files are given in /homes/cs473/project2/

Include in your report a paragraph discussion of the clusters found, and the top levels of the dendrogram.

Part 3: Evaluation

Start by defining a measure to evaluate the quality of clustering. Note that the corpus has topics defined for each article, so we recommend you use a gold-standard type of measure. (A measure that measures how effective the clustering is at finding known topics.) You'll find this challenging, as articles may have multiple topics, whereas many of the approaches we discussed assume a single topic for each article.

The dataset provided has articles with text bodies and category labels assigned to the articles, amongst other fields. The goal of the agglomerative clustering is to see whether the clustering did a good job matching the natural categories of the articles. What exactly is a "good job"? This is for you the student to decide. Come up with a good evaluation metric for deciding whether the clustering does a good job of matching with the natural (given) categories of the articles.

When devising this metric, consider:

  1. Hierarchical clustering produces O(N) clusters - many of which would be uninteresting to include in the evaluation. (Quick mental exercise - can you figure out exactly how many clusters you get? Or does it vary depending on how things cluster?) For example, single document clusters, or the cluster of all documents, are probably uninteresting (but a single document cluster might be interesting, if very different from every other document.) You'll need to somehow define what are interesting clusters to include in the evaluation.
  2. Some articles have multiple categories,
  3. Clustering does not tell you which cluster correlates to which category,
  4. What could be a possible best score for an arbitrary case with the vast majority metrics you could come up with? Can this be avoided? Should this be avoided? (Hint: what sort of score should you expect if the number of clusters is 1? And is that a "good" clustering?)
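To make the considerations above concrete, here is one possible gold-standard-style measure: average purity over "interesting" clusters only, excluding singletons and near-global clusters that score trivially well. This is a sketch of one option, not the metric you are expected to use; the thresholds, the choice of purity itself, and the handling of multi-topic articles are exactly the design decisions left to you.

```python
from collections import Counter

def purity(cluster, labels):
    """Purity of one cluster: the fraction of its members carrying the
    cluster's most common gold topic. `labels` maps doc id -> topic;
    handling multi-topic articles is left to your own design."""
    counts = Counter(labels[d] for d in cluster)
    return counts.most_common(1)[0][1] / len(cluster)

def avg_purity(clusters, labels, min_size=2, max_frac=0.9):
    """Average purity over 'interesting' clusters only: singletons and
    near-global clusters are excluded, since a singleton is always pure
    and the all-document cluster says little about the clustering."""
    n = len(labels)
    interesting = [c for c in clusters if min_size <= len(c) <= max_frac * n]
    return sum(purity(c, labels) for c in interesting) / len(interesting)
```

Note how this sketch responds to hint 4: without the size filter, a clustering of all-singleton clusters would score a perfect 1.0, which is clearly not a "good" clustering.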

Please include the following in your report:

  1. A brief (3-5 sentence) explanation of your measure.
  2. A mathematical definition of the measure.
  3. An explanation of why you think this measure is appropriate for this problem.
  4. Put the best score attained for each form of linkage in the report.
  5. Discuss the difference in performance between single-linkage and complete-linkage.

We recommend that you do the first three questions before actually running your clustering. Think about why this should be done first.

Instructions for submission

Submit a single folder named project2/ using turnin. The folder must contain a bash script "CLUSTERING" that, when executed as ./CLUSTERING, runs your code for BOTH SINGLE-LINKAGE AND COMPLETE-LINKAGE clustering, and the file report.pdf (which is the majority of your grade.) There may be other files needed as well, but it should NOT contain the corpus or things derived from the corpus (e.g., Galago index files.)

Note that we will be providing different input to your program than what you have to work with, but the path will stay the same. The values on the corpus in place now are what should be included in the report, but your program will have to work for different values.

Use the following command to submit the project:
turnin -c cs47300 -p project2 <path-to-submission-folder>
You may submit any number of times before the deadline. Only your most recent submission will be recorded.

To view submitted files, use:
turnin -c cs47300 -v -p project2
Don't forget the -v option, else it may overwrite your submission with an empty folder.

Please also submit report.pdf using gradescope (this makes grading it easier, faster, and more consistent.)
