CS 47300: Web Information Search and Management

Project 3: Collaborative Filtering using Spark

Due 11:59 pm EDT, Friday, 6 December 2019

This is a very short project and will be scored as an assignment rather than a project, but it is listed as a project since you do have to run things. Part 0 should take 10 minutes or so, and Part 1 only a couple of minutes; both are just cut-and-paste commands. Part 2 could take a bit longer, but if it takes you more than an hour, then you probably don't have a sufficient understanding of collaborative filtering to be prepared for the final (and doing this project is a good way to learn it).

Part 0: Connecting to the Cluster, executing a simple program (optional)

There is nothing to turn in for Part 0. If you've used Hadoop before (in particular, those who are taking CS348 have already done this), you can skip this part. Otherwise, we suggest you do the following to gain some familiarity. This will create a simple file and run a very basic program over it.

  1. Log in to jacobi00.cs.purdue.edu (password is the same as your Purdue career account),
    $ ssh jacobi00.cs.purdue.edu
    If you are connecting from off campus, you may need to set up a VPN connection first. Instructions for setting up a VPN to Purdue's network are at https://www.itap.purdue.edu/connections/vpn/.
  2. Move files to the Hadoop cluster. To do this, execute the following steps. After logging on to the cluster master node with SSH, populate a simple text file:
    $ printf "aaa\nbbb\nccc\nddd\naaa\nbbb\nccc\nddd\n" > tmp.txt
    Important: Create a personal HDFS directory:
    $ hdfs dfs -mkdir /user/$USER
    Create a directory in your personal HDFS directory to store the file:
    $ hdfs dfs -mkdir /user/$USER/in
    Copy the text file from the master node to the new directory:
    $ hdfs dfs -put ./tmp.txt /user/$USER/in
    List the directory contents:
    $ hdfs dfs -ls /user/$USER/in
  3. Now that we have loaded a file into HDFS, we would like to run a MapReduce job over it. Copy the file /homes/cs473/Project3/Select.java to the cluster master node; the master node mounts the same file system as your normal CS lab machines. (You'll want to take a look to see what the program does...) A rough PySpark sketch of a job with the same input/output shape appears after this list. If you were creating your own program, you could use
    $ scp Select.java jacobi00.cs.purdue.edu:
    to copy your program to the master node. On the master node, navigate to the directory which holds Select.java and run the following commands. Be mindful of spaces and / and * characters.
    $ mkdir select
    $ CP=$(hadoop classpath)
    $ javac -classpath $CP -d select/ Select.java
    $ jar -cvf select.jar -C select .
    $ hadoop jar select.jar org.myorg.Select /user/$USER/in /user/$USER/out
    Notice the last two arguments to the final command. The MapReduce job will take these arguments to be the source directory (which already exists) and the destination directory (which will be created by the MapReduce job).
  4. Check and retrieve the output. To list files,
    $ hdfs dfs -ls /user/$USER/out
    You should see output like the following, with a _SUCCESS marker file:
    Found 2 items
    -rw-r--r-- 1 $USER supergroup 0 2019-12-02 07:10 /user/$USER/out/_SUCCESS
    -rw-r--r-- 1 $USER supergroup 16 2019-12-02 07:10 /user/$USER/out/part-00000
    To see the contents of the output, run
    $ hdfs dfs -cat /user/$USER/out/*
    Sometimes it may not display anything. In that case, quote the pattern so that the * is expanded by HDFS rather than your local shell (use double quotes so that $USER is still expanded):
    $ hdfs dfs -cat "/user/$USER/out/*"
    To move the output from HDFS back to your directory on the master node, use:
    $ hdfs dfs -getmerge /user/$USER/out ./output.txt
    $ cat output.txt

    Finally, remove the output directory so that you can run your job again. (Hadoop will complain if you try to write to an HDFS directory that already exists.)
    $ hdfs dfs -rm -r /user/$USER/out
    (-rm -r is used for removing directories in HDFS.)
  5. When done, please remember to clean up after yourself:
    $ hdfs dfs -rm -r /user/$USER/*
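
As a point of comparison with Parts 1 and 2, the same read-transform-write shape can be written in a few lines of PySpark. The job below is only a hypothetical sketch: it counts how many times each line of the input appears, which may or may not be what Select.java does (read Select.java to find out), and the file name linecount.py is made up.

    # linecount.py -- hypothetical PySpark job with the same input/output shape
    # as the MapReduce example above: read a text file from HDFS, transform it,
    # and write to an HDFS output directory that must not already exist.
    import sys
    from pyspark import SparkContext

    sc = SparkContext(appName="LineCount")
    lines = sc.textFile(sys.argv[1])                   # e.g. /user/<yourlogin>/in
    counts = lines.map(lambda l: (l, 1)).reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile(sys.argv[2])                 # e.g. /user/<yourlogin>/out
    sc.stop()

It would be submitted the same way as mbcf.py in Part 1:
    $ spark-submit --master yarn linecount.py /user/$USER/in /user/$USER/out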

Part 1: Collaborative Filtering using pyspark

You will now run a pyspark job that runs memory-based collaborative filtering on 100,000 movie ratings from a few hundred users. This should take you at most a couple of minutes, most of which will be spent waiting for the program to complete.

  1. Log in to the master node jacobi00.cs.purdue.edu. Load the movie ratings database from /homes/cs473/project3/u.data into your input directory in the Hadoop file system. If you don't already have a personal directory, see Step 2 in Part 0.
    $ hdfs dfs -put /homes/cs473/project3/u.data /user/$USER/in
  2. You are given a program in /homes/cs473/project3/mbcf.py. Run it with:
    $ spark-submit --master yarn /homes/cs473/project3/mbcf.py /user/$USER/in/u.data
    Copy the results (the predicted rating for user 22 on movie 377) into your report.
  3. When done, please remember to clean up after yourself:
    $ hdfs dfs -rm -r /user/$USER/*

Remember, for this part you only need to turn in the predicted rating output from the program.

Some documentation and notes that may help

For an introduction to PySpark, take a look at the following links.

NOTE: Be mindful of the version number whenever you are searching online for documentation. The version of Spark installed on the CS department's Spark cluster is version 2.4.3. Spark is under heavy development, so if you are looking at documentation for another version, there is a good chance that features will be changed or missing.

Spark provides a Python shell where you can submit commands interactively. This shell can be very useful when trying to learn the Spark interface. To start the shell, run:
$ pyspark
If you like, you can develop your queries in the shell first, and then copy your shell commands to a Python script to create a Python job. To distribute a Python job with Spark, use:
$ spark-submit --master yarn [job.py] [command_line_args]
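
For example, a short interactive session that pokes at the ratings file might look like the following. This is only a sketch: it assumes you have already loaded u.data into your HDFS in directory as in Part 1, and the variable names are made up. In the pyspark shell the SparkContext is already available as sc (replace <yourlogin> with your username; Python does not expand $USER).

    # Each line of u.data is: user <tab> movie <tab> rating <tab> timestamp
    lines = sc.textFile("/user/<yourlogin>/in/u.data")
    ratings = lines.map(lambda l: l.split("\t")) \
                   .map(lambda f: (int(f[0]), int(f[1]), float(f[2])))

    ratings.count()                                  # should be 100000
    ratings.filter(lambda r: r[0] == 22).count()     # number of ratings by user 22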

Part 2: Testing the code

Now comes the part you have to think about: figuring out whether mbcf.py is working correctly. Devise a way to test whether mbcf.py is giving the correct answer. This need not be a full test suite; it could be a single test that would give you some confidence that the program is correct (or incorrect), inspection/analysis of the code, or even augmenting the code with some tests.

You may want to look over the full MovieLens 100k dataset and associated documentation, which can be found in /homes/cs473/project3/ml-100k/. Note that each line of the file u.data consists of a user number, movie number, rating (1-5), and timestamp (which we don't use), tab-separated.

For this part, include in your report a brief (1-2 paragraph) description of how you validated the correctness (or incorrectness) of the code. If you modified the code, include the modified code. If you constructed a test or a set of tests, include the test data (if it is under a page; otherwise, just give a description). If your validation is based on running the program and examining results, include a sample of the output, as well as the expected results and how you came up with them.

The program may well be incorrect; if your test says it is wrong, don't assume you've done something wrong. The goal is to increase your understanding of collaborative filtering to the point where you are able to gain some confidence that the program is either correct or incorrect.

Don't feel you need to spend a lot of time on this. If you understand memory-based collaborative filtering, you should be able, in a few minutes, to devise a very simple test where you can manually calculate the answer, or a couple of tests where you can estimate how the answers should change. Even a simple answer, as long as it demonstrates understanding of collaborative filtering, will be good for full credit.
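
For example, one possible sanity check is to build a tiny data set where the prediction can be worked out by hand, run mbcf.py on the same data, and compare. The sketch below is only that, a sketch: it assumes user-based collaborative filtering with vector-space (cosine) similarity computed over commonly rated movies and a similarity-weighted average of the other users' ratings, and the toy ratings are made up. Verify those assumptions against the actual code in mbcf.py before trusting the comparison.

    # Hand-checkable memory-based CF on a made-up toy data set.
    from math import sqrt

    ratings = {                      # user -> {movie: rating}
        1: {10: 5.0, 20: 3.0, 30: 4.0},
        2: {10: 4.0, 20: 2.0, 30: 3.0},
        3: {10: 1.0, 20: 5.0},
    }

    def cosine(u, v):
        # cosine similarity over the movies both users rated
        common = set(ratings[u]) & set(ratings[v])
        if not common:
            return 0.0
        num = sum(ratings[u][m] * ratings[v][m] for m in common)
        den = sqrt(sum(ratings[u][m] ** 2 for m in common)) * \
              sqrt(sum(ratings[v][m] ** 2 for m in common))
        return num / den

    def predict(user, movie):
        # similarity-weighted average of the other users' ratings for this movie
        num = den = 0.0
        for other, seen in ratings.items():
            if other == user or movie not in seen:
                continue
            s = cosine(user, other)
            num += s * seen[movie]
            den += abs(s)
        return num / den if den else None

    print(predict(3, 30))            # only users 1 and 2 rated movie 30

The same toy ratings can be written out in the u.data format (user, movie, rating, dummy timestamp, tab-separated) and fed to mbcf.py; if the two predictions disagree, either the assumptions above or the program is wrong.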

And finally, when done please remember to clean up after yourself:
$ hdfs dfs -rm -r /user/$USER/*

Part 3: Some other ideas (optional, this is for your own edification and will not be graded)

Below this point, there is nothing that you will be graded on, although you are welcome to turn in what you've done and discovered, if you wish. These are suggestions for some things to try out to better understand Map/Reduce and Spark, as well as collaborative filtering. In particular, thinking about questions 1 and 2 is a good way to study up on collaborative filtering in preparation for the final. If you understand collaborative filtering well enough that you understand the questions and how to answer them, you have a good knowledge of what we've covered on collaborative filtering.

  1. What would it take to do this with Pearson Correlation Coefficient Similarity instead of Vector Space Similarity? Would it change the answers? Under what conditions would it change the answers? (Hint: Vector Space Similarity makes an assumption that Pearson Correlation Coefficient Similarity does not; if that assumption holds, the results should be similar. The sketch after this list compares the two on a toy example.)
  2. Would the same approach work for model-based collaborative filtering? Would it change the performance bottlenecks?
  3. Is Spark really scaling well? How would you test this? (Hint: You may have already tested this in Parts 1 and 2, but just weren't thinking of it.)
  4. There are likely some substantial performance flaws in the way Spark is being used in mbcf.py. Find such a flaw, and think about how to fix it. You might even try fixing it and see if it makes a difference.
  5. What about changing the program to give the top 10 recommendations for any single user? How much slower would you expect this to be? If you are bored, feel free to try it.
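
For question 1, a quick way to build intuition is to compute both similarities on a pair of toy rating vectors using the standard definitions. This sketch does not use mbcf.py or Spark at all, and the vectors are made up:

    from math import sqrt

    # Cosine (vector-space) similarity of two rating vectors.
    def cosine(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
        return num / den if den else 0.0

    # Pearson correlation is cosine similarity applied to mean-centered vectors.
    def pearson(a, b):
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        return cosine([x - ma for x in a], [y - mb for y in b])

    u = [4, 5, 3, 4]
    v = [2, 3, 1, 2]     # same preference pattern as u, but a much harsher rater
    print(cosine(u, v), pearson(u, v))

Pearson treats these two users as perfectly correlated, while cosine similarity is pulled below 1 by the difference in the users' overall rating levels; thinking about when that difference matters is one way to approach the hint in question 1.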

Instructions for submission

The only thing to turn in is your project report, submitted as a PDF through Gradescope (the Gradescope link works if you are logged in to Blackboard). It helps if the answer to each part begins on a new page.

