This is a very short project, and will be scored as an assignment rather than a project; it is listed as a project only because you do have to run things. Part 0 should take 10 minutes or so, and Part 1 only a couple of minutes - both are just cut-and-paste commands. Part 3 could take a bit longer, but if it takes you more than an hour, you probably don't have a sufficient understanding of collaborative filtering to be prepared for the final (and doing this project is a good way to learn it).
There is nothing to turn in for Part 0. If you've used Hadoop before (in particular, those taking CS348 have already done this), you can skip this part. Otherwise, we suggest you run through the following to gain some familiarity. This will create a simple file and run a very basic program over it.
$ ssh jacobi00.cs.purdue.edu
$ printf "aaa\nbbb\nccc\nddd\naaa\nbbb\nccc\nddd\n" > tmp.txt
$ hdfs dfs -mkdir /user/$USER
$ hdfs dfs -mkdir /user/$USER/in
$ hdfs dfs -put ./tmp.txt /user/$USER/in
$ hdfs dfs -ls /user/$USER/in
$ scp Select.java jacobi00.cs.purdue.edu:
$ mkdir select
$ CP=$(hadoop classpath)
$ javac -classpath $CP -d select/ Select.java
$ jar -cvf select.jar -C select .
$ hadoop jar select.jar org.myorg.Select /user/$USER/in /user/$USER/out
Notice the last two arguments to the final command. The MapReduce job will take these
arguments to be the source directory (which already exists) and the destination directory
(which will be created by the MapReduce job).
$ hdfs dfs -ls /user/$USER/out
$ hdfs dfs -cat /user/$USER/out/*
$ hdfs dfs -cat "/user/$USER/out/*"
$ hdfs dfs -getmerge /user/$USER/out ./output.txt
$ cat output.txt
$ hdfs dfs -rm -r /user/$USER/out
$ hdfs dfs -rm -r /user/$USER/*
You will now run a PySpark job that performs memory-based collaborative filtering on 100,000 movie ratings from a few hundred users. This should take you at most a couple of minutes, most of which will be waiting for the program to complete.
First, copy the file /homes/cs473/project3/u.data into your input directory in the Hadoop file system. If you don't already have a personal directory, see Step 2 in Part 0.
$ hdfs dfs -put /homes/cs473/project3/u.data /user/$USER/in
$ spark-submit --master yarn /homes/cs473/project3/mbcf.py /user/$USER/in/u.data
$ hdfs dfs -rm -r /user/$USER/*
Remember, for this part you only need to turn in the predicted rating output from the program.
For an introduction to PySpark, take a look at the following links.
NOTE: Be mindful of the version number whenever you are searching online for documentation. The version of Spark installed on the CS department's Spark cluster is version 2.4.3. Spark is under heavy development, so if you are looking at documentation for another version, there is a good chance that features will be changed or missing.
Spark provides a Python shell where you can submit commands
interactively. This shell can be very useful when trying to learn
the Spark interface. To start the shell, run:
$ pyspark
If you like, you can develop your queries in
the shell first, and then copy your shell commands to a Python
script to create a Python job. To distribute a Python job with
Spark, use:
$ spark-submit --master yarn [job.py] [command_line_args]
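To get a feel for the interface, here is a minimal sketch of the kind of snippet you might try, computing each user's average rating from u.data. This is purely illustrative and not part of the assignment; the HDFS path is a placeholder for whatever you used in Part 1. In the pyspark shell, spark and sc are already defined for you, so the first three lines matter only if you run this via spark-submit.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("udata-demo").getOrCreate()
sc = spark.sparkContext

# u.data lines are tab-separated: user, movie, rating, timestamp
lines = sc.textFile("/user/yourname/in/u.data")   # placeholder path
pairs = lines.map(lambda l: l.split("\t")) \
             .map(lambda f: (int(f[0]), (float(f[2]), 1)))

# sum and count ratings per user, then divide for the average
totals = pairs.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
print(totals.mapValues(lambda t: t[0] / t[1]).take(5))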
Now comes the part you have to think about: figuring out whether mbcf.py is working correctly. Devise a way to test whether mbcf.py gives the correct answer. This need not be a full test suite; it could be a single test that gives you some confidence that the program's result is correct or incorrect, an inspection/analysis of the code, or even an augmentation of the code with some tests.
You may want to look over the full MovieLens 100k dataset and associated documentation, which can be found in /homes/cs473/project3/ml-100k/. Note that each line of u.data consists of four tab-separated fields: a user number, a movie number, a rating (1-5), and a timestamp (which we don't use).
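If you want a quick sanity check of the format, plain Python over the copy on the regular file system (not HDFS) is enough. A minimal sketch, assuming the dataset is still at the path above; it just counts distinct users and total ratings:

from collections import Counter

# u.data: one rating per line, tab-separated: user, movie, rating, timestamp
counts = Counter()
with open("/homes/cs473/project3/u.data") as f:
    for line in f:
        user, movie, rating, timestamp = line.rstrip("\n").split("\t")
        counts[user] += 1

print(len(counts), "users,", sum(counts.values()), "ratings")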
For this part, include in your report a brief (1-2 paragraph) description of how you validated the correctness (or incorrectness) of the code. If you modified the code, include the modified code. If you constructed a test or a set of tests, include the test data if it is under a page; otherwise, just give a description. If your validation is based on running the program and examining its results, include a sample of the output, as well as the expected results and how you came up with them.
The program may well be incorrect - if your test says it is wrong, don't assume you've done something wrong. The goal is to increase your understanding of collaborative filtering to the point where you are able to gain some confidence that the program is either correct or incorrect.
Don't feel you need to spend a lot of time on this. If you understand memory-based collaborative filtering, you should be able to devise a very simple test where you can manually calculate the answer, or a couple of tests where you can estimate how the answers should change, in a few minutes. Even a simple answer, as long as it demonstrates understanding of collaborative filtering, will be good for full credit.
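As one illustrative sketch of such a manual test (not necessarily the method mbcf.py uses - it may employ a different similarity or normalization, so inspect its code before comparing numbers): the toy program below predicts a missing rating with user-based collaborative filtering, weighting mean-centered neighbor ratings by Pearson similarity.

# Toy user-based collaborative filtering, small enough to verify by hand.
# Prediction: p(u,i) = rbar_u + sum_v sim(u,v)*(r(v,i) - rbar_v) / sum_v |sim(u,v)|
# where v ranges over the other users who rated item i.

ratings = {                       # user -> {movie: rating}
    1: {10: 4.0, 20: 5.0, 30: 1.0},
    2: {10: 5.0, 20: 4.0, 30: 2.0, 40: 4.0},
    3: {10: 1.0, 20: 2.0, 40: 5.0},
}

def mean(u):
    rs = ratings[u].values()
    return sum(rs) / len(rs)

def pearson(u, v):
    # similarity over the items both users rated
    common = set(ratings[u]) & set(ratings[v])
    if len(common) < 2:
        return 0.0
    mu, mv = mean(u), mean(v)
    num = sum((ratings[u][i] - mu) * (ratings[v][i] - mv) for i in common)
    du = sum((ratings[u][i] - mu) ** 2 for i in common) ** 0.5
    dv = sum((ratings[v][i] - mv) ** 2 for i in common) ** 0.5
    return num / (du * dv) if du and dv else 0.0

def predict(u, i):
    neighbors = [v for v in ratings if v != u and i in ratings[v]]
    num = sum(pearson(u, v) * (ratings[v][i] - mean(v)) for v in neighbors)
    den = sum(abs(pearson(u, v)) for v in neighbors)
    return (mean(u) + num / den) if den else mean(u)

print(predict(1, 40))   # user 1 never rated movie 40; prints about 2.42

Feeding the same handful of ratings to mbcf.py and comparing its prediction against a hand calculation like this is exactly the kind of simple test described above.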
And finally, when done please remember to clean up after yourself:
$ hdfs dfs -rm -r /user/$USER/*
Below this point, there is nothing that you will be graded on, although you are welcome to turn in what you've done and discovered, if you wish. These are suggestions for some things to try out to better understand Map/Reduce and Spark, as well as collaborative filtering. In particular, thinking about questions 1 and 2 is a good way to study up on collaborative filtering in preparation for the final. If you understand collaborative filtering well enough that you understand the questions and how to answer them, you have a good knowledge of what we've covered on collaborative filtering.
The only thing to turn in is your project report, submitted as a PDF through Gradescope (this link works if you are logged in to Blackboard.) It helps if the answer to each part begins on a new page.