CS 57300: Data Mining

Assignment 2: Predictive Modeling

Due 11:59pm EDT, Monday, 14 February 2022

Note: You need to have a campus IP address (e.g., by using a VPN) or authenticate using BoilerKey to access some parts of the assignment.

Please turn in a PDF through Gradescope. You'll need to access Gradescope through Brightspace the first time (this registers you for the course in Gradescope). Gradescope is fairly self-explanatory, and ITaP provides these instructions. Make sure that you mark the start/end of each question in Gradescope; assignments will be graded based on what you mark as the start/end of each question. Please typeset your answers (LaTeX/Word/OpenOffice/etc.).

1. Classification Metrics

Suppose you have a movie dataset with 1000 entries. Your job is to build a classifier that predicts whether a movie is worth watching or not, i.e., binary classification. However, the dataset has 958 yes and 42 no entries.

  1. Can you spot any issue here? Explain.
  2. For the purpose of classification, it is important to have a yardstick against which to compare your implementation. Define a baseline for the purpose of evaluation.
  3. Is accuracy a good metric in this case? If not, suggest an alternative better suited to the scenario.
  4. Can you suggest a remedy to the issue you noticed in 1?

2. Categorization / Multi-Class Classifiers

Consider the same dataset as in Question 1, but now your job is to classify the movies into 5 categories (Definitely yes, yes, not sure, avoid, definitely avoid). Devise a plan to classify each instance, given that you have a method that can learn a good binary classifier. (You should treat this as a black-box method - you know it learns a good binary classifier, but know nothing of how it works.)

3. Multiple Predictions

Consider the situation in Question 2. Now you wish to also classify whether the movie would be a success at the box office or not. That is, you now have to predict 2 labels: one binary, and one with multiple categories. Describe how you would go about that. Would you expect the performance to be better compared to the case when just a single prediction is to be made?

4. Linear Regression

Chapter 11 of PDM is a good resource.

Consider these two Simple Linear Regression equations:

Y = β0 + β1X + ε
Ŷ = b0 + b1X
  1. Identify the explanatory and response variables.
  2. What is the difference between the two equations?
  3. What are residuals (ei)? What kind of a distribution do they follow? Justify.
  4. Derive the formula for b0 and b1. Use SSE (Sum of Squared Errors) as your cost function J.
  5. Here is the SLR model Y = 400 + 307.5X, modelling Accumulated Savings (Y) against Time (X). Do you notice anything wrong here? What happens when X = 0? Comment.
    [Figure: the model above plotted with its data points. If you don't see it, see the note at the top of the page.]

5. Lab: Naive Bayes and k-Nearest Neighbors Classifier

In this programming assignment, you are given a dataset of Wine Quality (adapted from the UCI Machine Learning Repository, but use the one we provide), and your task is to build classifiers that use wine attributes to predict the quality score of a wine. You will implement Naive Bayes classification (NBC) models and k-Nearest Neighbors classifiers to make such predictions. You can use your preferred programming language (e.g., Python, R, etc.) to answer the questions. Note that you can also use relevant packages to build classifiers and provide predictions (e.g., pandas, numpy, scipy, sklearn, naivebayes). However, your code should be your own original work – DO NOT use any publicly available code that is not part of standard libraries. In addition, we will not provide separate testing data to you. You are asked to design your own tests to ensure that your code runs correctly and meets the specifications below.

  1. Data Preprocessing.
    1. The given csv file includes 13 columns, including the column ‘quality’, which is the quality score of the wine. Determine which columns, if any, are unnecessary for predicting ‘quality’ from the given data, and remove them. In the report, describe which columns you chose (not) to remove.
    2. The given csv file includes instances that are not useful for training the classifiers. Remove instances that have any None/NaN values. In the report, describe the number of instances you removed from the data, and the reasons you removed instances (other than missing values). If you feel that removing instances with missing values is the wrong approach, please explain why and what you would do instead (but for the results you turn in, base them on data without missing values).
    3. Some values in the file are corrupted and we want to remove those instances as well. For the wine attributes, 'fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', the value should be non-negative. For ‘pH’ values, specifically, it should also be less than 14. Identify instances with corrupted values and remove them from the data. In the report, describe the number of instances you removed from the data. If you see anything interesting about the corrupted values, describe it.
    4. As a final step, we want to remove outliers from the data. We consider the instances with their ‘pH’ values greater than 4.4, and values less than 2.6 as outliers. In the report, describe the number of instances you removed from the data.
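The cleaning steps in parts 2–4 can be sketched with pandas. This is a minimal sketch, not the required implementation: the function name `clean` is an assumption, the attribute names follow the spec, and it assumes the data has already been loaded into a DataFrame with those columns.

```python
import pandas as pd

# Wine attribute columns named in the spec; all must be non-negative.
NUMERIC_COLS = ['fixed acidity', 'volatile acidity', 'citric acid',
                'residual sugar', 'chlorides', 'free sulfur dioxide',
                'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']

def clean(df):
    # Part 2: drop instances with any None/NaN values.
    df = df.dropna()
    # Part 3: keep only non-negative attribute values; pH must also be < 14.
    valid = (df[NUMERIC_COLS] >= 0).all(axis=1) & (df['pH'] < 14)
    df = df[valid]
    # Part 4: remove pH outliers, i.e., keep 2.6 <= pH <= 4.4.
    df = df[(df['pH'] >= 2.6) & (df['pH'] <= 4.4)]
    return df
```

Counting the rows dropped at each stage (e.g., by comparing `len(df)` before and after each step) gives the numbers the report asks for.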
  2. Training-Test Split
    Write a function to split the given data into a training set and a test set. The function takes two parameters: the fraction of data used for the training set, and the input data. For instance, with a fraction of 0.8, the function takes 80% of the sample as the training set and the rest as the test set, and returns the split datasets. Set your random seed to 22 to ensure reproducibility. You don’t need a separate development set in this assignment. Note that we don’t allow sampling the same row more than once.
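One way to sketch such a split function, assuming the data is a pandas DataFrame and using a numpy `Generator` for the seeded shuffle (the function name and signature here are illustrative, not prescribed):

```python
import numpy as np

def train_test_split(df, frac, seed=22):
    """Sample `frac` of the rows (without replacement) as the training
    set; the remaining rows form the test set."""
    rng = np.random.default_rng(seed)
    n_train = int(round(frac * len(df)))
    idx = rng.permutation(len(df))          # shuffle row positions once
    train = df.iloc[idx[:n_train]]          # first frac of the shuffle
    test = df.iloc[idx[n_train:]]           # everything else
    return train, test
```

Because each row position appears exactly once in the permutation, no row is sampled more than once, and the same seed reproduces the same split.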
  3. Naive Bayes Classifier
    Implement a Naive Bayes Classifier. Observe the class values and determine which distribution of P(xi|y) makes sense to you. Train a Naive Bayes Classifier with the training set, and evaluate the accuracy on the test set. Try four different fractions, 0.3, 0.5, 0.7, 0.9, and compare the accuracy on the test set. In the report:
    1. Describe which distribution you used for the likelihood.
    2. Report the test set accuracies w.r.t. different fractions.
    3. Describe your observation on the accuracies. What is the advantage / disadvantage of using more training data?
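As a sanity check for your own implementation, the train-then-evaluate loop can be sketched with sklearn. Note the choice of a Gaussian likelihood here is only one option; picking and justifying the distribution is part of the exercise. The function name `evaluate_nbc` is an assumption.

```python
from sklearn.naive_bayes import GaussianNB

def evaluate_nbc(train, test, label='quality'):
    # Separate attributes from the class label.
    X_tr, y_tr = train.drop(columns=[label]), train[label]
    X_te, y_te = test.drop(columns=[label]), test[label]
    # Fit NB with Gaussian likelihoods P(x_i|y) on the training set.
    model = GaussianNB().fit(X_tr, y_tr)
    # score() returns accuracy on the held-out test set.
    return model.score(X_te, y_te)
```

Running this for each training fraction (0.3, 0.5, 0.7, 0.9) on splits from your split function yields the accuracies to report.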
  4. k-NN Classifier
    Implement a k-Nearest Neighbors Classifier. Train a k-NN Classifier with the training set fraction as 0.7. Try different k values, 2, 4, 6, and compare the accuracy on the test set. In the report:
    1. Report the test set accuracies w.r.t. different k values.
    2. Describe your observation on the accuracies. What is the advantage / disadvantage of using different k values?
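The analogous check for k-NN, again as a hedged sketch (the function name `evaluate_knn` is an assumption; your own implementation should not simply wrap sklearn):

```python
from sklearn.neighbors import KNeighborsClassifier

def evaluate_knn(train, test, k, label='quality'):
    X_tr, y_tr = train.drop(columns=[label]), train[label]
    X_te, y_te = test.drop(columns=[label]), test[label]
    # Predict each test instance from its k nearest training neighbors.
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    return model.score(X_te, y_te)
```

Since k-NN is distance-based, the scale of the attributes matters; standardizing features (e.g., with `sklearn.preprocessing.StandardScaler`) is one change worth trying in the Discussion part below.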
  5. Discussion
    Observe the classifier results and give some thought to the training environment (e.g. features, hyperparameters, pre-processing, data distribution, etc.). What would you change to improve the results while using the same classifiers? Describe at least one method and report the results.
