For the following assignments, please provide as much evidence of the results as possible, including the code, screenshots (only of plots, not text or code), and documentation. Submit only one PDF file plus the .ipynb / .py files containing the documented code.
Choose any cleaned dataset such as the ones here: https://www.kaggle.com/search?q=cleaned+datasets+datasetFileTypes%3Acsv
1.a. [10 points]
Ignore the label column and apply the AgglomerativeClustering method from sklearn.cluster on this dataset. Use the min (single), average, and ward linkage methods explained in class to perform the hierarchical clustering. Please feel free to refer to https://scikit-learn.org/stable/auto_examples/cluster/plot_digits_linkage.html#sphx-glr-auto-examples-cluster-plot-digits-linkage-py
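A minimal sketch of the three linkage criteria, using the Iris data as a stand-in for whichever cleaned dataset you choose (note that sklearn calls the "min" method `single`):

```python
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering

# Iris stands in for your chosen dataset; y is kept only for part 1.c
X, y = load_iris(return_X_y=True)

# sklearn's name for the "min" method is 'single' linkage
for linkage in ("single", "average", "ward"):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    labels = model.fit_predict(X)  # cluster assignments; label column ignored
    print(linkage, labels[:10])
```

The choice of `n_clusters=3` matches Iris; set it to the number of classes in your own dataset.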
1.b. [10 points]
Generate visualizations like those in the above tutorial, and dendrograms (please feel free to refer to https://scikit-learn.org/stable/search.html?q=dendrogram) for each of the methods.
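One common route, which the scikit-learn search above also points to, is SciPy's `linkage`/`dendrogram` pair; a sketch with Iris again standing in for your dataset:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

for method in ("single", "average", "ward"):
    Z = linkage(X, method=method)  # (n-1) x 4 merge history
    plt.figure(figsize=(8, 3))
    dendrogram(Z, no_labels=True)
    plt.title(f"{method} linkage")
    plt.savefig(f"dendrogram_{method}.png")
    plt.close()
```

Each saved figure is one dendrogram; the filenames here are just placeholders.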
1.c. [10 points]
Which method produces clusters that are most closely aligned with the labels in the dataset? Explain.
1.d. [20 points]
Using the k-means algorithm with k=2 and the corresponding visualizations, explain whether it fares better than the agglomerative approaches in terms of alignment with the labels.
Hint:
(a) Choose a smaller dataset for easier and better visualization and analysis
(b) Cut the dendrogram at an appropriate level to result in just two clusters, in order to see how aligned these two clusters are with the assigned labels.
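The comparison in 1.d, including hint (b)'s dendrogram cut, can be sketched as follows; the breast-cancer data stands in as a small two-class dataset, and the adjusted Rand index is one reasonable way to quantify "alignment with the labels":

```python
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from scipy.cluster.hierarchy import linkage, fcluster

# Stand-in two-class dataset
X, y = load_breast_cancer(return_X_y=True)

# k-means with k=2
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("k-means ARI:", adjusted_rand_score(y, km.labels_))

# Hint (b): cut the ward dendrogram so at most two clusters remain
Z = linkage(X, method="ward")
cut = fcluster(Z, t=2, criterion="maxclust")
print("ward-cut ARI:", adjusted_rand_score(y, cut))
```

An ARI near 1 means the two clusters closely match the assigned labels; near 0 means no better than chance.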
2. [25 points]
The wine data set at https://archive.ics.uci.edu/ml/datasets/wine has 13 features. Develop in Python and apply your own version of the PCA algorithm to this data set, to visualize how PCA helps with dimensionality reduction. Explain how many Principal Components you will choose and why. What percent of the variance in the data do the selected Principal Components cover?
For the implementation, you may use any objects, modules, and functions in NumPy, SciPy, and other Python libraries for various operations, such as computing the eigenvalues and eigenvectors or performing any other math / linear algebra operation, but do not use the PCA function available in scikit-learn directly.
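A from-scratch PCA along these lines might look like the sketch below, using only NumPy for the linear algebra; `load_wine` in scikit-learn mirrors the UCI wine data, so it stands in here for downloading the file. In practice you would likely standardize the 13 features first, since they have very different scales:

```python
import numpy as np
from sklearn.datasets import load_wine

X = load_wine().data                    # shape (178, 13), mirrors UCI wine

Xc = X - X.mean(axis=0)                 # center each feature
cov = np.cov(Xc, rowvar=False)          # 13 x 13 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric matrix -> eigh
order = np.argsort(eigvals)[::-1]       # sort components by variance, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()     # variance ratio per component
print(np.cumsum(explained)[:3])         # cumulative variance of leading PCs

k = 2                                   # e.g. keep the top-2 components
X_proj = Xc @ eigvecs[:, :k]            # project data onto the chosen PCs
```

The cumulative `explained` values are what answer the "what percent of the variance" part of the question.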
3.a. [20 points]
Refer to online tutorials on regularization such as
https://medium.com/coinmonks/regularization-of-linear-models-with-sklearn-f88633a93a2
and
https://towardsdatascience.com/ridge-and-lasso-regression-a-complete-guide-with-python-scikit-learn-e20e34bcbf0
Apply the techniques from the above tutorials to the student dataset at https://archive.ics.uci.edu/ml/datasets/student+performance
Does regularization help improve the accuracy of predicting the final Math grade of the students?
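The comparison those tutorials walk through can be sketched as below. For the actual assignment you would load the UCI student file (student-mat.csv) and predict the G3 column, the final Math grade; a synthetic regression problem stands in here so the sketch is self-contained:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split

# Synthetic stand-in; replace with features/G3 from student-mat.csv
X, y = make_regression(n_samples=300, n_features=30, noise=10.0,
                       random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, model in [("ols", LinearRegression()),
                    ("ridge", Ridge(alpha=1.0)),
                    ("lasso", Lasso(alpha=1.0))]:
    model.fit(X_tr, y_tr)
    print(name, round(model.score(X_te, y_te), 3))  # held-out R^2
```

Whether ridge/lasso beat plain least squares depends on the data and on the `alpha` strength, which is exactly what the question asks you to investigate.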
3.b. [5 points]
For regularization, we added the regularizer term to the loss function. Would it make sense to multiply by or subtract the term instead? Explain.