CS 57300: Data Mining

Assignment 3: Predictive Modeling II

Due 11:59pm EDT, Friday, 11 March 2022

Note: You need to have a campus IP address (e.g., using VPN) or authenticate using Boilerkey to access some parts of the assignment.

Please turn in a PDF through Gradescope. You'll need to access Gradescope through Brightspace the first time (to register you for the course in Gradescope). Gradescope is pretty self-explanatory; ITaP provides these instructions. Make sure that you mark the start/end of each question in Gradescope. Assignments will be graded based on what you mark as the start/end of each question. Please typeset your answers (LaTeX/Word/OpenOffice/etc.).

1: Optimization

  1. Consider a state space that we want to search over. Suppose there are 5 possible actions (i.e., number of features) at each state, and the length of the state sequence is 10. Derive the number of sequences that must be evaluated for exhaustive search and for greedy search.
  2. Suppose you have enough computational resources to keep 200 sequences in memory. How would you improve greedy search given this resource? That is, by saving some state sequences, you can get either faster results or better results. Describe your method in enough detail to show that it satisfies the resource limitation.
  3. Solve the following optimization problem using Lagrange multipliers. Find the x and y values that optimize the given function, and the optimal objective function value. Minimize 2x² + 3y² subject to x + y ≥ 5.
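As a refresher on the mechanics (not a solution to the graded problem), here is a small sketch of the Lagrange-multiplier recipe for minimizing ax² + by² subject to x + y = c. When the inequality constraint x + y ≥ c is active at the optimum of a convex objective, it can be treated as the equality x + y = c; the function name and the illustrative instance below are our own.

```python
def lagrange_quadratic(a, b, c):
    """Minimize a*x^2 + b*y^2 subject to x + y = c (constraint active),
    via stationarity of the Lagrangian
        L(x, y, lam) = a*x^2 + b*y^2 - lam*(x + y - c):
      dL/dx = 2*a*x - lam = 0  ->  x = lam / (2*a)
      dL/dy = 2*b*y - lam = 0  ->  y = lam / (2*b)
    Substituting into x + y = c gives lam, hence x, y."""
    lam = c / (1.0 / (2 * a) + 1.0 / (2 * b))
    x, y = lam / (2 * a), lam / (2 * b)
    return x, y, a * x ** 2 + b * y ** 2

# Illustrative instance (not the graded one): minimize x^2 + y^2 s.t. x + y = 2.
x, y, f = lagrange_quadratic(1, 1, 2)
print(x, y, f)  # 1.0 1.0 2.0
```

Working the graded instance by hand follows the same three stationarity conditions.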

2: Neural Networks

Consider the following simple neural network. Here, x1 and x2 are inputs, h1 is a hidden unit, and o is the output. A sigmoid activation function is applied after the hidden unit h1 to generate its output, and the same activation is applied after the output layer o to generate a prediction ŷ. A mean squared error loss is used to calculate the error (L = 1/2(y - ŷ)², where y is the true label). Derive the full formula for ŷ with respect to the inputs. Then derive the formula for the gradient of the loss L w.r.t. the weight w2, using backpropagation.

You must be on campus, using VPN, or logged in using Boilerkey to see the image.
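Because the figure is behind the campus login, the wiring below is an assumption (h1 = sigmoid(w1·x1 + w2·x2), o = w3·h1, ŷ = sigmoid(o)); adjust it to match the actual diagram. The sketch shows how a hand-derived chain-rule gradient for w2 can be spot-checked against a finite-difference approximation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, w1, w2, w3):
    # Assumed wiring (check against the actual figure):
    # h1 = sigmoid(w1*x1 + w2*x2), o = w3*h1, yhat = sigmoid(o)
    a = w1 * x1 + w2 * x2
    h1 = sigmoid(a)
    o = w3 * h1
    yhat = sigmoid(o)
    return a, h1, o, yhat

def loss(x1, x2, w1, w2, w3, y):
    _, _, _, yhat = forward(x1, x2, w1, w2, w3)
    return 0.5 * (y - yhat) ** 2

def grad_w2(x1, x2, w1, w2, w3, y):
    # Chain rule: dL/dw2 = (yhat - y) * yhat*(1-yhat) * w3 * h1*(1-h1) * x2
    a, h1, o, yhat = forward(x1, x2, w1, w2, w3)
    return (yhat - y) * yhat * (1 - yhat) * w3 * h1 * (1 - h1) * x2

# Spot-check against a central finite difference.
args = (0.5, -1.2, 0.3, 0.7, -0.4, 1.0)  # x1, x2, w1, w2, w3, y (arbitrary)
x1, x2, w1, w2, w3, y = args
eps = 1e-6
num = (loss(x1, x2, w1, w2 + eps, w3, y)
       - loss(x1, x2, w1, w2 - eps, w3, y)) / (2 * eps)
print(abs(grad_w2(*args) - num) < 1e-8)  # True
```

The same numerical check works for any weight, which is a handy way to validate your derivation before writing it up.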

3: Support Vector Machine

Consider the following 2-dimensional data with two classes, o and x. Use a linear SVM to find a hard-margin solution. Derive the formula for the hyperplane, the support vectors, and the margin for the optimal case. Report a non-optimal hyperplane and compare its margin with that of the optimal hyperplane.

You must be on campus, using VPN, or logged in using Boilerkey to see the image.

Lab: SVM and Neural Net Parameterization

In this programming assignment, you are given a dataset of Wine Quality (adapted from the UCI Machine Learning Repository, but use the one we provide), and your task is to build classifiers that use wine attributes to predict whether the wine is good or not. You will implement a Linear SVM with Stochastic Gradient Descent and an Artificial Neural Network to make these predictions. You can use your preferred programming language (e.g., Python, R, etc.) to answer the questions. Note that you can also use relevant packages to build classifiers and produce predictions (e.g., pandas, numpy, scipy, sklearn). However, your code should be your own work - DO NOT use any publicly available code not contained in standard libraries. In addition, we will not provide separate testing data to you. You are asked to design your own tests to ensure that your code runs correctly and meets the specifications below.

Data Preprocessing and Splitting:

  1. Before preprocessing, drop the Id column. Perform data preprocessing similar to Assignment 2, Question 5.1. Your task is to predict the quality variable. Map every quality value <= 5 to -1 and the rest to 1.
  2. Split your data into 3 sets: training (70%), testing (20%), and validation (10%).
  3. Once that is done, standardize your data. Learn the standardization parameters from the training data, then apply them to the training, testing, and validation data.
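A minimal sketch of the three steps above in plain Python (the helper names are our own; with pandas/sklearn you would typically reach for train_test_split and StandardScaler instead). The key point is that the mean and standard deviation are learned from the training rows only:

```python
import random

def to_label(quality):
    """Map quality <= 5 to -1, everything else to 1."""
    return -1 if quality <= 5 else 1

def split_indices(n, seed=0):
    """Shuffle row indices and split 70% / 20% / 10% into train/test/validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train, n_test = int(0.7 * n), int(0.2 * n)
    return idx[:n_train], idx[n_train:n_train + n_test], idx[n_train + n_test:]

def fit_standardizer(train_column):
    """Learn mean and standard deviation from the training data only."""
    m = sum(train_column) / len(train_column)
    s = (sum((v - m) ** 2 for v in train_column) / len(train_column)) ** 0.5
    return m, s

def standardize(column, m, s):
    """Apply the training-set parameters to any split (train, test, or validation)."""
    return [(v - m) / s for v in column]

train_idx, test_idx, val_idx = split_indices(100)
print(len(train_idx), len(test_idx), len(val_idx))  # 70 20 10
```

Standardizing the test and validation splits with training-set parameters avoids leaking information about held-out data into the model.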

4: Linear SVM with SGD minimizing over Hinge Loss

Model: Use a linear, soft margin SVM. Use Stochastic Gradient Descent (SGD) to minimize over Hinge Loss.

  1. Hyperparameter tuning: You will tune three hyperparameters. One is the soft-margin parameter C, which controls how much impact failing to be within the margin has. The other two are parameters for SGD: the number of epochs and the learning rate. Train your model on the training data; if you are using scikit-learn, you may find it helpful to use GridSearchCV (with 5 folds) to exhaust the parameter space. Keep in mind that your model should be evaluated on the validation data. You should try 3 values of your choice for each parameter (more if you feel the need). You can vary these independently (e.g., 9 trials, not 27). Consider what you should use for a scoring function. This step may take time; if it does, try reducing the maximum number of epochs. Report how varying each hyperparameter affects the results.
  2. Evaluation: Once your hyperparameters are tuned, evaluate the model on the training data and the testing data. Report the accuracy and F1 score. Also report the Matthews correlation coefficient (you'll need to look this up in the literature). Describe what each metric tells you. Also plot the ROC curve and describe what you see.
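A bare-bones sketch of the model itself, assuming a per-example objective of ½‖w‖² + C·max(0, 1 − y(w·x + b)) with plain SGD on its subgradient (function name and toy data are ours; your graded version should train on the wine data with the tuned C, learning rate, and epoch count):

```python
import random

def svm_sgd(X, y, C=1.0, lr=0.01, epochs=100, seed=0):
    """Linear soft-margin SVM trained with SGD on the subgradient of
    0.5*||w||^2 + C * max(0, 1 - y_i*(w.x_i + b)) per example.
    The hyperparameters (C, lr, epochs) match the tuning step above."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # stochastic: visit one example at a time
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            if margin < 1:  # hinge active: subgradient includes -C*y_i*x_i
                w = [wj - lr * (wj - C * y[i] * xj) for wj, xj in zip(w, X[i])]
                b += lr * C * y[i]
            else:           # only the regularizer contributes
                w = [wj - lr * wj for wj in w]
    return w, b

# Tiny separable sanity check (replace with the wine data in practice).
X = [[2.0, 2.0], [1.5, 2.5], [-2.0, -2.0], [-2.5, -1.5]]
y = [1, 1, -1, -1]
w, b = svm_sgd(X, y, C=1.0, lr=0.05, epochs=200)
preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0 else -1 for xi in X]
print(preds == y)  # True
```

An equivalent scikit-learn baseline would be SGDClassifier with loss="hinge"; writing the update yourself makes the role of each hyperparameter explicit.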

5: ANN (Artificial Neural Networks) with Backpropagation

We expect a 3-layer perceptron classifier with an input layer, one hidden layer with 3 nodes, and one output layer. The activation function is the sigmoid function. No regularization is needed for this assignment, i.e., λ = 0. When using backpropagation (optimized by SGD), you will probably find that packages minimize logistic loss rather than squared loss. This is fine - think about the difference, and why this might be more appropriate (hint: think about what the sigmoid function does).

  1. Hyperparameter tuning: The 2 hyperparameters you will tune are the number of epochs and the learning rate. Train your model on the training data; if you are using scikit-learn, you may find it helpful to use GridSearchCV (with 5 folds) to exhaust the parameter space. Keep in mind that your model should be evaluated on the validation data. You should try 3 values of your choice for each parameter (more if you feel the need). You can vary these independently (e.g., 9 trials, not 27). Use accuracy as the scoring function here. This step may take time; if it does, try reducing the maximum number of epochs.
  2. Evaluation: Once your hyperparameters are tuned, evaluate the model on the training data and the testing data. Report the accuracy, F1 score, and Matthews correlation coefficient. Describe what each metric tells you (in particular, look at the difference between the testing and training results). Also plot the ROC curve and describe what you see.
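For intuition, here is a minimal pure-Python version of the specified architecture (3 sigmoid hidden units, sigmoid output, SGD on logistic loss, λ = 0). The class name and the tiny sanity-check dataset are our own; in practice you would likely use a package such as scikit-learn's MLPClassifier. Note how the sigmoid output plus logistic loss makes the output-layer delta simply (ŷ − t):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class TinyMLP:
    """3-layer perceptron sketch: input -> 3 sigmoid hidden units ->
    1 sigmoid output, trained by SGD on logistic loss (no regularization)."""
    def __init__(self, d, seed=0):
        rng = random.Random(seed)
        self.W1 = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(3)]
        self.b1 = [0.0] * 3
        self.W2 = [rng.uniform(-0.5, 0.5) for _ in range(3)]
        self.b2 = 0.0

    def forward(self, x):
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.W1, self.b1)]
        yhat = sigmoid(sum(w * hi for w, hi in zip(self.W2, h)) + self.b2)
        return h, yhat

    def fit(self, X, y, lr=0.5, epochs=2000):
        for _ in range(epochs):
            for x, t in zip(X, y):  # t encoded as 0/1 for logistic loss
                h, yhat = self.forward(x)
                d_out = yhat - t    # sigmoid + logistic loss: delta = yhat - t
                for j in range(3):
                    d_h = d_out * self.W2[j] * h[j] * (1 - h[j])  # backprop to hidden
                    self.W2[j] -= lr * d_out * h[j]
                    for i in range(len(x)):
                        self.W1[j][i] -= lr * d_h * x[i]
                    self.b1[j] -= lr * d_h
                self.b2 -= lr * d_out

# Sanity check on an AND gate (replace with the wine data in practice).
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]
net = TinyMLP(d=2, seed=1)
net.fit(X, y)
preds = [1 if net.forward(x)[1] >= 0.5 else 0 for x in X]
print(preds == y)  # True
```

Had we used squared loss instead, the output delta would pick up an extra ŷ(1 − ŷ) factor, which vanishes when the sigmoid saturates - one reason logistic loss pairs better with a sigmoid output.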

6: Additional variations in models

For each of the models you implemented, suggest something new that could improve the performance. Experimentally try it out and explain why it does or doesn't align with your initial intuition.

Answer the following questions:

You may find it useful to try some of these out on your experimental setup, but it is not strictly necessary. For some questions, you may also find it helpful to do a literature search (you are unlikely to find the exact answer, but you may find things that give you insight.)

7: Why do we perform hyperparameter tuning on validation data and not testing data?

8: Write the pseudocode for Linear SVM with SGD over Hinge Loss.

9: Does the initialization of the weight vectors impact convergence?

Explain. What happens when all weight vectors are initialized to e3 at the beginning?

10: Activation Function

An activation function can have a great impact on a neural network's prediction ability. Do you notice something disturbing about the sigmoid function? Explain what can go wrong with the sigmoid function as an activation function choice. Also suggest what can be done to avoid this.

11: Do you think scaling the data before applying GD helps? Explain.

Optional contest

We will be providing a separate test dataset when the assignment is due (available after midnight). If you wish, you may submit the results (provide a confusion matrix) that your approaches for the SVM and NN produce on this test dataset, along with a 1-2 sentence summary of what you did to improve the results (Question 6); we will post the top three of those returned within 24 hours. Email your results to the instructor. This is ungraded, but if your code is well done it should take you very little time/effort, and it gives you a chance to see how your ideas stack up against others'.