
ISYE 6740 Homework 7 (Last Homework)
Total 100 points.
As usual, please submit a report with sufficient explanation of your answers to each of the questions, together with your code, in a zip folder.
1 Random forest for email spam classifier (30 points)
Your task for this question is to build a spam classifier using the UCI email spam dataset (https://archive.ics.uci.edu/ml/datasets/Spambase). The collection of spam e-mails came from the postmaster and individuals who had filed spam. The collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general-purpose spam filter. Load the data.
1. (5 points) How many instances of spam versus regular emails are there in the data? How many data points are there? How many features are there?
Note: there may be some missing values; you can just fill them in with zero.
2. (10 points) Build a classification tree model (also known as the CART model). In Python, this can be done using sklearn.tree.DecisionTreeClassifier. In your answer, you should report the fitted tree model, similar to the tree plot shown on Page 16 of the "Random forest" lecture. In Python, this plot can be produced using the sklearn.tree.plot_tree function.
3. (15 points) Also build a random forest model. In Python, this can be done using sklearn.ensemble.RandomForestClassifier.
Now partition the data to use the first 80% for training and the remaining 20% for testing. Your task is to compare and report the AUC for your classification tree and random forest models on the testing data, respectively. To report your results, please try different tree sizes. Plot the curve of AUC versus tree size, similar to Page 15 of the lecture slides on "Random Forest".
Background information: In classification problems, we use AUC (Area Under the Curve) as a performance measure. It is one of the most important evaluation metrics for checking any classification model's performance. The ROC (Receiver Operating Characteristic) curve measures classification accuracy at various threshold settings; AUC measures the total area under the ROC curve. The higher the AUC, the better the model is at distinguishing the two classes. If you want to read a bit more about the ROC/AUC curve, check out this link: https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
For instance, in Python, this can be done using sklearn.metrics.roc_auc_score, and you will have to figure out the details.
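The whole pipeline for this question can be sketched as below. This is a minimal sketch, not the graded solution: the helper name `auc_by_forest_size`, the tree sizes, and the synthetic stand-in data at the bottom are all placeholders; with the real data you would load spambase.data instead, as shown in the commented-out usage.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def auc_by_forest_size(df, sizes=(1, 10, 50)):
    """First 80% of rows for training, last 20% for testing.
    Returns the single-tree test AUC and a dict of forest-size -> test AUC."""
    df = df.fillna(0)                      # fill missing values with zero
    X, y = df.iloc[:, :-1], df.iloc[:, -1]  # last column is the spam label
    split = int(0.8 * len(df))
    X_tr, X_te = X.iloc[:split], X.iloc[split:]
    y_tr, y_te = y.iloc[:split], y.iloc[split:]

    tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    tree_auc = roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1])

    forest_aucs = {}
    for n in sizes:  # AUC as a function of the number of trees
        rf = RandomForestClassifier(n_estimators=n, random_state=0).fit(X_tr, y_tr)
        forest_aucs[n] = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
    return tree_auc, forest_aucs

# With the real data this would be:
# df = pd.read_csv("spambase.data", header=None)
# tree_auc, forest_aucs = auc_by_forest_size(df, sizes=(1, 5, 10, 50, 100, 200))

# Small synthetic stand-in so the sketch runs end to end
rng = np.random.default_rng(0)
Xs = rng.normal(size=(300, 5))
ys = (Xs[:, 0] + 0.3 * rng.normal(size=300) > 0).astype(int)
demo = pd.DataFrame(np.column_stack([Xs, ys]))
tree_auc, forest_aucs = auc_by_forest_size(demo)
print(tree_auc, forest_aucs)
```

Plotting `forest_aucs` (tree size on the x-axis, AUC on the y-axis) then gives the AUC-versus-tree-size curve the question asks for.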
2 Nonlinear regression and cross-validation (30 points)
The coefficient of thermal expansion y changes with temperature x. An experiment to relate y to x was done. Temperature was measured in degrees Kelvin. (The Kelvin temperature is the Celsius temperature plus 273.15.) The raw data file is copper-new.txt.
[Figure: scatter plot of coefficient of thermal expansion (roughly 0-25) versus temperature (roughly 0-1000 K).]
1. (10 points) Perform linear regression on the data. Report the fitted model and the fitting error.
2. (10 points) Perform nonlinear regression with a polynomial regression function up to degree n = 10 and use ridge regression (see Lecture Slides for "Bias-Variance Tradeoff"). Write down your formulation and strategy for doing this, including the form of the ridge regression.
3. (5 points) Use 5-fold cross-validation to select the optimal regularization parameter λ. Plot the cross-validation curve and report the optimal λ.
4. (5 points) Predict the coefficient at 400 degrees Kelvin using both models. Comment on how you would compare the accuracy of the predictions.
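The four parts above can be sketched as follows. This is only a sketch: since copper-new.txt is not reproduced here, the data below is a hypothetical stand-in, and the λ grid and pipeline choices (polynomial features followed by standardization and ridge) are one reasonable setup, not the required one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical stand-in for copper-new.txt (x = temperature in K,
# y = coefficient of thermal expansion); load the real file in practice.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(20, 1000, 60)).reshape(-1, 1)
y = 20 * (1 - np.exp(-x.ravel() / 200)) + rng.normal(0, 0.5, 60)

# Part 1: plain linear regression and its fitting (training) error
lin = LinearRegression().fit(x, y)
lin_mse = np.mean((lin.predict(x) - y) ** 2)

# Parts 2-3: degree-10 polynomial + ridge penalty; pick lambda by 5-fold CV
lambdas = np.logspace(-6, 2, 30)
cv_mse = []
for lam in lambdas:
    model = make_pipeline(PolynomialFeatures(10), StandardScaler(), Ridge(alpha=lam))
    scores = cross_val_score(model, x, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse.append(-scores.mean())  # plotting cv_mse vs lambdas gives the CV curve
best_lam = lambdas[int(np.argmin(cv_mse))]

# Part 4: predict the coefficient at 400 K with both models
ridge_best = make_pipeline(
    PolynomialFeatures(10), StandardScaler(), Ridge(alpha=best_lam)
).fit(x, y)
pred_linear = lin.predict([[400.0]])[0]
pred_ridge = ridge_best.predict([[400.0]])[0]
print(best_lam, pred_linear, pred_ridge)
```

Standardizing after expanding the polynomial features keeps the ridge penalty from being dominated by the huge high-degree terms (x^10 is enormous at 1000 K), which is why the scaler sits between the two steps.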
3 Regression, bias-variance tradeoff (40 points)
Consider a dataset with n data points (xi, yi), xi ∈ Rp, drawn from the following linear model:
y = xᵀβ* + ε,
where ε is Gaussian noise and the star sign is used to differentiate the true parameter from the estimators
that will be introduced later. Consider the regularized linear regression as follows:
β̂(λ) = arg min_β { (1/n) Σᵢ₌₁ⁿ (yᵢ − xᵢᵀβ)² + λ‖β‖₂² },
where λ ≥ 0 is the regularization parameter. Let X ∈ ℝⁿˣᵖ denote the matrix obtained by stacking xᵢᵀ in each row.
1. (10 points) Find the closed form solution for β̂(λ) and its distribution.
2. (10 points) Calculate the bias E[xT β̂(λ)]− xTβ∗ as a function of λ and some fixed test point x.
3. (10 points) Calculate the variance term E[(xᵀβ̂(λ) − E[xᵀβ̂(λ)])²].
4. (10 points) Use the results from parts (b) and (c) and the bias-variance decomposition to analyze the impact of λ on the squared error. Specifically, which term dominates when λ is small, and when λ is large, respectively?
(Hint.) Properties of an affine transformation of a Gaussian random variable will be useful throughout
this problem.
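For part 1, the usual route is to set the gradient of the objective to zero. A sketch (assuming ε ∼ N(0, σ²Iₙ), which the problem statement leaves unspecified; I below is the p×p identity):

```latex
% Stationarity: -(2/n) X^\top (y - X\beta) + 2\lambda\beta = 0, so
\left( \tfrac{1}{n} X^\top X + \lambda I \right) \hat{\beta}(\lambda)
  = \tfrac{1}{n} X^\top y
\quad\Longrightarrow\quad
\hat{\beta}(\lambda) = \left( X^\top X + n\lambda I \right)^{-1} X^\top y .
```

Since y = Xβ* + ε is Gaussian and β̂(λ) is an affine function of y, β̂(λ) is itself Gaussian; its mean and covariance follow by substituting y = Xβ* + ε into the closed form, which is exactly where the hint about affine transformations comes in.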
3
Answered Same Day Apr 10, 2021

Solution

Ximi answered on Apr 13 2021
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# Question 1.1\n",
"df = pd.read_csv('spambase/spambase.data', names=list(range(57)) + ['class'])\n",
"df.head()\n",
"df = df.fillna(0)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"total count 4601\n",
"spam count 1813\n",
"regular count 2788\n"
]
}
],
"source": [
"print (\"total count\", df.shape[0])\n",
"print (\"spam count\", df[df['class'] == 1].shape[0])\n",
"print (\"regular count\", df[df['class']==0].shape[0])"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split, cross_val_score, TimeSeriesSplit\n",
"from sklearn.tree import DecisionTreeClassifier"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',\n",
" max_depth=None, max_features=None, max_leaf_nodes=None,\n",
" min_impurity_decrease=0.0, min_impurity_split=None,\n",
" min_samples_leaf=1, min_samples_split=2,\n",
" min_weight_fraction_leaf=0.0, presort='deprecated',\n",
" random_state=None, splitter='best')"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Question 1.2\n",
"X = df.iloc[:, :-1]\n",
"y = df.iloc[:, -1]\n",
"clf = DecisionTreeClassifier()\n",
"clf.fit(X, y)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.tree import plot_tree\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"image/png":...