In this project, you will parse a corpus and then perform agglomerative clustering on it to generate a dendrogram. Please make sure that your final submission runs on the mc cluster CS servers (mc01.cs.purdue.edu - mc18.cs.purdue.edu). You can access the lab remotely. Using X11, you can even run graphical user interface software remotely (for PCs, you'll need to install X11 software; I use Xming). While you can install whatever tools you want and develop on your own platform, you will be graded on those machines only, so make sure that what you turn in runs as expected on the mc cluster.
You can use any language that runs on those machines, but you must provide a command-line script using bash that will run the specified tests. The data we assign you to cluster on will remain in the same location on the server, so that full path can simply be hard-coded. We do not guarantee it will be exactly the same data, but it will follow the same format (we may use a different subset of the corpus.)
You will perform the following steps in order:
parse the corpus, perform single-linkage and complete-linkage agglomerative clustering, and evaluate which gives the best clustering approach.
You will use a subset of the Reuters-21578 corpus for this project. The subset (~3000 documents) can be found on the department lab Linux machines at /homes/cs473/project2/reut2-subset.sgm . Your code should run on the mc cluster machines, using the above corpus file (which will be a different subset of the Reuters-21578 corpus when we test, but will be similar), since that is how we will grade it. You'll want to read the corpus README file to make sure you understand the corpus format. Your clustering should only make use of the <BODY> tagged section of the article.
The provided file contains many articles. There are many SGML/XML parsing tools you can use, and you'll probably find one that works easily for this task. Note that there will be articles with no topic, and articles with no body text. You should ignore the articles with no body text. You'll need to figure out an appropriate way to deal with articles without a topic listed when doing your evaluation, as well as decide what to do with some likely uninteresting articles (e.g., those marked as BRIEF).
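For example, a regex-based sketch of pulling NEWID, the topic list, and the <BODY> text out of the .sgm file might look like the following. The regexes here are assumptions about the file layout; a real SGML parser (e.g., BeautifulSoup) will be more robust:

```python
# Sketch: extract (NEWID, topics, body) from the Reuters .sgm text.
# Articles with no <BODY> section are skipped, per the assignment.
import re

def parse_articles(sgm_text):
    """Yield (newid, topics, body) for each article that has body text."""
    for art in re.finditer(r'<REUTERS[^>]*NEWID="(\d+)".*?</REUTERS>',
                           sgm_text, re.DOTALL):
        newid = int(art.group(1))
        block = art.group(0)
        body = re.search(r'<BODY>(.*?)</BODY>', block, re.DOTALL)
        if body is None:          # ignore articles with no body text
            continue
        tm = re.search(r'<TOPICS>(.*?)</TOPICS>', block, re.DOTALL)
        topics = re.findall(r'<D>(.*?)</D>', tm.group(1)) if tm else []
        yield newid, topics, body.group(1).strip()
```

Note that `topics` may come back empty; that is exactly the no-topic case you must decide how to handle in your evaluation.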
Please follow the naming convention very strictly - we need to be able to run everyone's program using exactly the same call. Your grade, and timely grading, will depend on it.
Process the corpus at /homes/cs473/project2/reut2-subset.sgm . This involves stemming, stopword removal, and computing cosine similarity using TF*IDF as described above on the <BODY> section of each article. You may choose to instantiate the similarity matrix (figure out how much memory this would need first), or to do this processing on-the-fly as part of the clustering.
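A minimal sketch of TF*IDF weighting and cosine similarity over preprocessed token lists, assuming raw term frequency and a log-scaled IDF (use whichever variants were covered in class):

```python
# Build sparse {term: weight} vectors with tf * log(N/df), then
# compare them with cosine similarity. Terms appearing in every
# document get weight 0 under this IDF choice.
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists -> list of {term: weight} dicts."""
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d))                 # document frequency per term
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

On the memory question: a full 3000 x 3000 similarity matrix of 8-byte floats is 3000 * 3000 * 8 bytes, roughly 72 MB before exploiting symmetry, so instantiating it is one reasonable option at this corpus size.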
You will need to determine appropriate means for stemming and stopword removal, as well as potentially other decisions that need to be made in computing similarity scores.
You may use publicly available packages (e.g., Python's nltk) to help with this process.
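For illustration only, here is a stdlib-only sketch of tokenization, stopword removal, and stemming. The stopword list and suffix rules below are toy placeholders, not real linguistic resources; NLTK's stopwords corpus and PorterStemmer are a much better choice in practice:

```python
# Toy preprocessing pipeline: lowercase, tokenize on letter runs,
# drop stopwords, strip a few common suffixes.
import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for"}

def stem(word):
    # Crude suffix stripper -- NOT a real Porter stemmer.
    for suf in ("ing", "ed", "es", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]
```

Whatever tools you settle on, document the choices (stemmer, stopword list, token definition) in your report, since they affect the similarity scores.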
Please include in your report:
You will process the articles and perform agglomerative clustering using both single-linkage and complete-linkage, in separate instances. The output of your code should be two separate output files titled "single.txt" and "complete.txt", which will contain a list of all the articles in lexical order by article ID (NEWID) and their assigned clusters under single-linkage and complete-linkage agglomerative clustering, respectively. This should be one line per article, with whitespace separating the article ID from the clusters, and between each cluster.
Remember that with hierarchical clustering, each article will belong to a hierarchy of clusters. The root should be labeled 1, and lower numbers should be closer to the root. Each article will thus be labeled with multiple clusters (log N on average, but it could be more or less).
This code portion can be submitted using whatever language you're comfortable with. The code will be run using a bash script titled "CLUSTERING" that, when run, will execute both single-linkage and complete-linkage clustering and output the results to their respective files.
Samples of the .txt output files are given in /homes/cs473/project2/
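One possible shape for the clustering step is the naive O(n^3) sketch below, operating on a precomputed similarity matrix. This is not the required implementation: it only records the merge order, and converting the merge history into the numbered hierarchical labels described above is still up to you.

```python
# Naive agglomerative clustering over a similarity matrix `sim`.
# linkage "single": cluster similarity = max pairwise similarity;
# linkage "complete": cluster similarity = min pairwise similarity.
# Returns the list of merges, most-similar pair first.

def agglomerate(sim, linkage="single"):
    clusters = {i: [i] for i in range(len(sim))}
    merges = []                      # (members_a, members_b) per step
    agg = max if linkage == "single" else min

    def cluster_sim(a, b):
        return agg(sim[i][j] for i in clusters[a] for j in clusters[b])

    while len(clusters) > 1:
        ids = list(clusters)
        # pick the most similar pair of current clusters
        a, b = max(((x, y) for i, x in enumerate(ids) for y in ids[i + 1:]),
                   key=lambda p: cluster_sim(*p))
        merges.append((clusters[a][:], clusters[b][:]))
        clusters[a] += clusters.pop(b)
    return merges
```

At ~3000 documents this brute-force approach is workable but slow; if it becomes a bottleneck, look into nearest-neighbor-chain algorithms or an existing library implementation.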
Include in your report a paragraph discussing the clusters found and the top levels of the dendrogram.
Start by defining a measure to evaluate the quality of the clustering. Note that the corpus has topics defined for each article, so we recommend you use a gold-standard type of measure (a measure of how effective the clustering is at finding known topics). You'll find this challenging, as articles may have multiple topics, whereas many of the approaches we discussed assume a single topic for each article.
The dataset provided has articles with text bodies and category labels assigned to the articles, amongst other fields. The goal of the agglomerative clustering is to see whether the clustering did a good job of matching the natural categories of the articles. What exactly is a "good job"? That is for you, the student, to decide. Come up with a good evaluation metric for deciding whether the clustering matches the natural (given) categories of the articles.
When devising this metric, consider:
Please include the following in your report:
We recommend that you do the first three questions before actually running your clustering. Think about why this should be done first.
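As one example of a gold-standard style measure (not the required metric), cluster purity can be adapted to multi-topic articles by counting an article as matched when the cluster's majority topic is any of its topics. That adaptation is an assumption you would need to justify in your report, and it is one of several reasonable choices:

```python
# Purity adapted for multi-topic articles: each cluster is scored by
# its most common gold topic; an article counts as matched if that
# majority topic appears anywhere in its own topic set.
from collections import Counter

def purity(clusters, gold_topics):
    """clusters: list of lists of article ids.
       gold_topics: {article_id: set of topic strings}."""
    matched, total = 0, 0
    for members in clusters:
        counts = Counter(t for m in members for t in gold_topics[m])
        total += len(members)
        if not counts:            # cluster of only topic-less articles
            continue
        majority, _ = counts.most_common(1)[0]
        matched += sum(1 for m in members if majority in gold_topics[m])
    return matched / total if total else 0.0
```

Whatever metric you choose, be explicit about how it handles topic-less articles and articles with multiple topics.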
Submit a single folder named project2/ using turnin. The folder must contain a bash script "CLUSTERING" that, when executed as ./CLUSTERING, runs BOTH SINGLE-LINKAGE AND COMPLETE-LINKAGE clustering, and the file report.pdf (which is the majority of your grade). There may be other files needed as well, but it should NOT contain the corpus or things derived from the corpus (e.g., Galago index files).
Note that we will be providing different input to your program than what you have to work with now, but the path will stay the same. The values on the corpus currently in place are what should be included in the report, but your program will have to work for different values.
Use the following command to submit the project:
turnin -c cs47300 -p project2 <path-to-submission-folder>
You may submit any number of times before the deadline. Only your most recent submission will be recorded.
To view submitted files, use:
turnin -c cs47300 -v -p project2
Don't forget the -v option; otherwise it may overwrite your submission with an empty folder.
Please also submit report.pdf using gradescope (this makes grading it easier, faster, and more consistent.)