CS 57300: Data Mining

Assignment 4: Descriptive Modeling

Due 11:59pm EDT, Tuesday, 5 April 2022

Note: You need to have a campus IP address (e.g., using VPN) or authenticate using Boilerkey to get some parts of the assignment.

Please turn in a PDF through Gradescope. You'll need to access Gradescope through Brightspace the first time (this registers you for the course in Gradescope). Gradescope is fairly self-explanatory; ITaP provides these instructions. Make sure that you mark the start/end of each question in Gradescope, as assignments will be graded based on what you mark as the start/end of each question. Please typeset your answers (LaTeX/Word/OpenOffice/etc.).

1: Using Ensembles

  1. Consider a scenario where you have learned three regression decision tree models (e.g., CART) on subsets of data, and on four test instances you obtain the following predictions:
                Instance 1          Instance 2          Instance 3          Instance 4
                True    Predicted   True    Predicted   True    Predicted   True    Predicted
    Model 1     8.976   23.658      4.568   67.889      3.889   45.891      9.993   89.447
    Model 2     8.976   24.765      4.568   65.231      3.889   47.897      9.993   90.000
    Model 3     8.976   22.488      4.568   65.442      3.889   45.998      9.993   89.515

    There is a problem with these results. What can you say about it? What form of ensemble would help here? Why would it help?
  2. Generally, we have high bias when a model is too simple to represent the concept to be learned, and high variance when the model is too complex and able to learn many different representations that match the data. Is it possible for a model to have both high variance and high bias? How might this relate to ensembles?
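As a concrete illustration of how a simple averaging ensemble combines regression outputs, the sketch below (plain numpy, with the table's values hard-coded) averages the three models' predictions per instance and compares each model's squared error against the ensemble's:

```python
import numpy as np

# True values and predictions for the four test instances, from the table above
true_vals = np.array([8.976, 4.568, 3.889, 9.993])
preds = np.array([
    [23.658, 67.889, 45.891, 89.447],  # Model 1
    [24.765, 65.231, 47.897, 90.000],  # Model 2
    [22.488, 65.442, 45.998, 89.515],  # Model 3
])

avg_pred = preds.mean(axis=0)                       # averaging ensemble
per_model_mse = ((preds - true_vals) ** 2).mean(axis=1)
ensemble_mse = ((avg_pred - true_vals) ** 2).mean()
print(per_model_mse, ensemble_mse)
```

Note that averaging can only reduce error coming from variance across the models, which is worth keeping in mind when answering part 1.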

2: Bagging

If we have a dataset of size n, and we build Bootstrap Aggregation (bagging) models using n datapoints for each sample, it would seem that they would all be the same - so why would the ensemble work? The key is that we are sampling with replacement. Prove that this means that each sample contains, on average, only about 63% of the unique items in the dataset.
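The claim can also be checked numerically before proving it. Assuming the standard bootstrap (draw n indices uniformly with replacement), this sketch estimates the expected fraction of unique items and compares it with the closed form 1 − (1 − 1/n)ⁿ, which approaches 1 − 1/e ≈ 0.632:

```python
import random

random.seed(42)
n = 10_000       # dataset size
trials = 20      # number of bootstrap samples to average over

fracs = []
for _ in range(trials):
    # Draw n indices with replacement; the set keeps only unique ones
    sample = {random.randrange(n) for _ in range(n)}
    fracs.append(len(sample) / n)

sim_frac = sum(fracs) / trials
exact_frac = 1 - (1 - 1 / n) ** n   # -> 1 - 1/e ~ 0.632 as n grows
print(sim_frac, exact_frac)
```

The proof itself should derive the probability that a given item is never drawn in n tries and take the limit as n grows.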

3: Adaboost

  1. Adaboost weights models in the ensemble differently. Does the model generated in the final iteration of Adaboost always have the highest weight? Prove why or why not.
  2. Say we use perceptron as the base model for Adaboost, and there are k input variables. How many parameters would we need to estimate after q rounds?
  3. Given that the model learned at the i-th iteration of Adaboost used the following instance weights, what are the weights of the instances in the next round?
     Instance   Initial Weight   Prediction   Final Weight
     1          0.250            Right        ?
     2          0.400            Wrong        ?
     3          0.200            Wrong        ?
     4          0.150            Wrong        ?
  4. What is the contribution of this model in the final decision making? I.e., what is α?
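For reference when working through parts 3 and 4, here is a sketch of one common AdaBoost formulation (AdaBoost.M1): α = ½ ln((1−ε)/ε), with each instance weight scaled by e^−α if correctly classified and e^α if not, then renormalized. The numbers at the bottom are a toy example, not the table above:

```python
import math

def adaboost_update(weights, correct):
    """One round of the AdaBoost.M1 weight update.

    weights: current instance weights (assumed already normalized)
    correct: per-instance booleans, True if the model classified it correctly
    Returns (new normalized weights, alpha).
    """
    # Weighted error = total weight on misclassified instances
    eps = sum(w for w, c in zip(weights, correct) if not c)
    alpha = 0.5 * math.log((1 - eps) / eps)
    # Up-weight mistakes, down-weight correct instances, then renormalize
    new_w = [w * math.exp(alpha if not c else -alpha)
             for w, c in zip(weights, correct)]
    z = sum(new_w)
    return [w / z for w in new_w], alpha

# Toy illustration: three instances, one misclassified (eps = 0.3)
w, a = adaboost_update([0.5, 0.3, 0.2], [True, False, True])
```

A useful sanity check on this formulation: after the update, the misclassified instances always carry exactly half the total weight.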

4: K-Means termination/optimality

  1. Either prove that k-means always terminates, or show that there may be situations where it doesn't terminate.
  2. Either prove that k-means always arrives at an optimal solution, or show that there may be situations where the outcome is not optimal.
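A small experiment can make the optimality question concrete. The sketch below (all names are my own, illustrative choices) runs a plain Lloyd's-style k-means from two different initializations on four points at the corners of a wide rectangle; both runs converge, but to different stable solutions with different sum-of-squared-errors:

```python
import numpy as np

def lloyd(X, centers, iters=100):
    """Plain Lloyd's algorithm; returns final centers and SSE."""
    C = centers.astype(float).copy()
    for _ in range(iters):
        # Assign each point to its nearest center
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of its assigned points
        new_C = np.array([X[labels == k].mean(axis=0) for k in range(len(C))])
        if np.allclose(new_C, C):
            break
        C = new_C
    sse = ((X - C[labels]) ** 2).sum()
    return C, sse

X = np.array([[0, 0], [0, 1], [4, 0], [4, 1]], dtype=float)
_, sse_good = lloyd(X, np.array([[0, 0.5], [4, 0.5]]))  # split left/right
_, sse_bad = lloyd(X, np.array([[2, 0], [2, 1]]))       # split bottom/top
print(sse_good, sse_bad)
```

The second initialization is a fixed point of the update (no point changes cluster), so the algorithm stops there even though a much better partition exists - a useful example to reason from in part 2.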

5: Hierarchical clustering

       X   Y
    a  1   2
    b  2   4
    c  3   6
    d  6   7
    e  6   8
  1. Given the above five points in two dimensions, perform two hierarchical clusterings using Manhattan distance (which makes the calculations easy): one with single-link and one with complete-link clustering. Show the two dendrograms.
  2. You are studying evolutionary paths in animals, and you have multi-variate descriptions of each animal. Do you think it would be best to use single-link, complete-link, or average-link measures to perform a dendrogram, where the goal is to capture evolutionary paths? Explain why.
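If you want to check your hand-computed dendrograms for part 1, scipy can compute the same linkages (this assumes scipy is available; `cityblock` is scipy's name for Manhattan distance):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# The five points from the table above, in order a..e
pts = np.array([[1, 2], [2, 4], [3, 6], [6, 7], [6, 8]])

D = pdist(pts, metric='cityblock')          # pairwise Manhattan distances
Z_single = linkage(D, method='single')      # single-link merges
Z_complete = linkage(D, method='complete')  # complete-link merges
print(Z_single)
print(Z_complete)
```

Each row of a linkage matrix records one merge (the two cluster indices, the merge height, and the resulting cluster size); `scipy.cluster.hierarchy.dendrogram(Z_single)` would draw the corresponding dendrogram.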

6: Density-based clustering

Read Schubert, Erich; Sander, Jörg; Ester, Martin; Kriegel, Hans Peter; Xu, Xiaowei DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN. ACM Trans. Database Syst. 42 (3): 19:1–19:21. Nothing to turn in for this question.

7: Density estimation

       X   Y
    g  2   4
    h  3   6
    i  6   5

Given the above data and two initial clusters C1(μ=(3,5), σ=(0.3,0.4)) and C2(μ=(6,3), σ=(0.4,0.3)), with initial weights w1=0.4 and w2=0.6. (To simplify, assume x and y are independent - i.e., you don't need to worry about covariance.)

  1. Perform the next E (expectation) step in E-M clustering and give the appropriate outputs (show how you've done this).
  2. Given your above E step results, perform the next M (maximization) step in E-M clustering and give the appropriate outputs (show how you've done this).
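As a sketch of the E-step mechanics (you still need to show the calculations yourself), the code below computes each cluster's responsibility for each point using axis-aligned Gaussians with the parameters given above; because x and y are assumed independent, the 2-D density is just the product of two 1-D densities:

```python
import math

def diag_gauss(x, mu, sigma):
    """Density of an axis-aligned (diagonal-covariance) Gaussian at point x."""
    p = 1.0
    for xi, mi, si in zip(x, mu, sigma):
        p *= math.exp(-((xi - mi) ** 2) / (2 * si ** 2)) / (si * math.sqrt(2 * math.pi))
    return p

points = [(2, 4), (3, 6), (6, 5)]          # g, h, i from the table above
clusters = [((3, 5), (0.3, 0.4), 0.4),     # C1: mu, sigma, weight w1
            ((6, 3), (0.4, 0.3), 0.6)]     # C2: mu, sigma, weight w2

# E step: responsibility of each cluster for each point,
# r_k(x) = w_k * N(x | mu_k, sigma_k) / sum_j w_j * N(x | mu_j, sigma_j)
resp = []
for x in points:
    num = [w * diag_gauss(x, mu, sg) for mu, sg, w in clusters]
    z = sum(num)
    resp.append([v / z for v in num])
```

The M step would then re-estimate each cluster's weight, mean, and standard deviations from these responsibilities.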

8: Lab - Clustering

In this programming assignment, you will work with a wine dataset again, but this time we provide a larger dataset which contains instances of both red wines and white wines (refer to this link to access the dataset). Your task is to implement clustering algorithms using wine attributes to help cluster different wine products. You can use your preferred programming language (e.g., Python, R, etc.) to answer the questions. You can also use relevant packages to build the models and produce cluster assignments (e.g., pandas, numpy, scipy, sklearn). However, your code should be your own original work - DO NOT use any publicly available code not contained in standard libraries.

  1. Implement clustering algorithms using wine attributes. Since we are not classifying wines based on their quality scores, discard the column 'quality'. Implement all models listed below, and show visualizations of the clusters and centroids:
  2. For each method, describe:
    1. Is the method deterministic or non-deterministic? If non-deterministic, discuss differences between runs for the following questions as well as what is asked. You may find it helpful to use some sort of visualization in answering these questions - but this may be challenging, as you can't easily visualize all the dimensions.
    2. Do the clusters seem to be well-separated? Explain how you arrived at this answer.
    3. Can you describe this clustering?
    4. Can you give a meaningful description of any of the clusters? I.e., something that might be useful for choosing a wine to drink.
    5. Try n_clusters = 2, for methods that require pre-defining the number of clusters. Suppose the first 1599 instances of the data are red wines and all others are white wines. Do you think your clusters make sense? Explain. You may also want to try other numbers of clusters and see if something else makes more sense.
  3. Compare the methods: Which do you think provides the most useful information and why? Is there something about the data that would tell you which is most appropriate to use?
  4. Now, choose one of your implemented models, and try to align cluster results with the wine quality scores. Explain why the cluster results (do/don't) align well with the quality scores. Comparing classification models with clustering methods, describe which you would choose to build a model for wine recommendation.
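One possible starting point for question 1, with heavy caveats: the file name `wine.csv` and the preprocessing line are assumptions about how you saved the dataset, and the demo below substitutes two synthetic Gaussian blobs so it runs without the data file. Standardizing the attributes before k-means is a common (not mandated) choice, since the wine features are on very different scales:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# With the real data (assumed saved as 'wine.csv', semicolon-separated):
# X = pd.read_csv('wine.csv', sep=';').drop(columns=['quality']).to_numpy()
# Self-contained stand-in: two well-separated 11-dimensional blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 11)),
               rng.normal(5, 1, (100, 11))])

Xs = StandardScaler().fit_transform(X)                    # zero mean, unit variance
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(Xs)
labels, centroids = km.labels_, km.cluster_centers_
```

For the visualizations the assignment asks for, one option is to project `Xs` and `centroids` onto two principal components (e.g., `sklearn.decomposition.PCA`) and scatter-plot them colored by `labels`.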
