
ISYE 6740 Homework 7 (Last Homework)
Total 100 points.
As usual, please submit a report with sufficient explanation of your answers to each of the questions, together with your code, in a zip folder.
1 Random forest for email spam classifier (30 points)
Your task for this question is to build a spam classifier using the UCI email spam dataset (https://archive.ics.uci.edu/ml/datasets/Spambase). The collection of spam e-mails came from the postmaster and individuals who had filed spam. The collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general-purpose spam filter. Load the data.
1. (5 points) How many instances of spam versus regular emails are there in the data? How many data points are there? How many features are there?
Note: there may be some missing values; you can just fill them in with zero.
2. (10 points) Build a classification tree model (also known as the CART model). In Python, this can be done using sklearn.tree.DecisionTreeClassifier. In your answer, you should report the fitted tree model, similar to the tree plot shown on Page 16 of the "Random forest" lecture. In Python, this plot can be produced using the sklearn.tree.plot_tree function.
3. (15 points) Also build a random forest model. In Python, this can be done using sklearn.ensemble.RandomForestClassifier.
Now partition the data to use the first 80% for training and the remaining 20% for testing. Your task is to compare and report the AUC for your classification tree and random forest models on the testing data, respectively. To report your results, please try different tree sizes. Plot the curve of AUC versus tree size, similar to Page 15 of the lecture slides on "Random Forest".
Background information: In classification problems, we use AUC (Area Under the Curve) as a performance measure. It is one of the most important evaluation metrics for checking any classification model's performance. The ROC (Receiver Operating Characteristic) curve measures classification accuracy at various threshold settings; AUC measures the total area under the ROC curve. The higher the AUC, the better the model is at distinguishing the two classes. If you want to read a bit more about the ROC/AUC curve, check out this link: https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
For instance, in Python, this can be done using sklearn.metrics.roc_auc_score, and you will have to figure out the details.
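The whole pipeline for this question can be sketched as below. This is a minimal sketch, not the graded solution: the helper name `auc_by_forest_size`, the tree sizes, and the synthetic stand-in data at the bottom are all placeholders; with the real data you would load spambase.data instead, as shown in the commented-out usage.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def auc_by_forest_size(df, sizes=(1, 10, 50)):
    """First 80% of rows for training, last 20% for testing.
    Returns the single-tree test AUC and a dict of forest-size -> test AUC."""
    df = df.fillna(0)                      # fill missing values with zero
    X, y = df.iloc[:, :-1], df.iloc[:, -1]  # last column is the spam label
    split = int(0.8 * len(df))
    X_tr, X_te = X.iloc[:split], X.iloc[split:]
    y_tr, y_te = y.iloc[:split], y.iloc[split:]

    tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    tree_auc = roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1])

    forest_aucs = {}
    for n in sizes:  # AUC as a function of the number of trees
        rf = RandomForestClassifier(n_estimators=n, random_state=0).fit(X_tr, y_tr)
        forest_aucs[n] = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
    return tree_auc, forest_aucs

# With the real data this would be:
# df = pd.read_csv("spambase.data", header=None)
# tree_auc, forest_aucs = auc_by_forest_size(df, sizes=(1, 5, 10, 50, 100, 200))

# Small synthetic stand-in so the sketch runs end to end
rng = np.random.default_rng(0)
Xs = rng.normal(size=(300, 5))
ys = (Xs[:, 0] + 0.3 * rng.normal(size=300) > 0).astype(int)
demo = pd.DataFrame(np.column_stack([Xs, ys]))
tree_auc, forest_aucs = auc_by_forest_size(demo)
print(tree_auc, forest_aucs)
```

Plotting `forest_aucs` (tree size on the x-axis, AUC on the y-axis) then gives the AUC-versus-tree-size curve the question asks for.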
2 Nonlinear regression and cross-validation (30 points)
The coefficient of thermal expansion y changes with temperature x. An experiment to relate y to x was done. Temperature was measured in degrees Kelvin. (The Kelvin temperature is the Celsius temperature plus 273.15.) The raw data file is copper-new.txt.
[Figure: scatter plot of coefficient of thermal expansion (roughly 0-25) versus temperature (roughly 0-1000 K).]
1. (10 points) Perform linear regression on the data. Report the fitted model and the fitting error.
2. (10 points) Perform nonlinear regression with a polynomial regression function up to degree n = 10 and use ridge regression (see Lecture Slides for "Bias-Variance Tradeoff"). Write down your formulation and strategy for doing this, including the form of the ridge regression.
3. (5 points) Use 5-fold cross-validation to select the optimal regularization parameter λ. Plot the cross-validation curve and report the optimal λ.
4. (5 points) Predict the coefficient at 400 degrees Kelvin using both models. Comment on how you would compare the accuracy of the predictions.
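The four parts above can be sketched as follows. This is only a sketch: since copper-new.txt is not reproduced here, the data below is a hypothetical stand-in, and the λ grid and pipeline choices (polynomial features followed by standardization and ridge) are one reasonable setup, not the required one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical stand-in for copper-new.txt (x = temperature in K,
# y = coefficient of thermal expansion); load the real file in practice.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(20, 1000, 60)).reshape(-1, 1)
y = 20 * (1 - np.exp(-x.ravel() / 200)) + rng.normal(0, 0.5, 60)

# Part 1: plain linear regression and its fitting (training) error
lin = LinearRegression().fit(x, y)
lin_mse = np.mean((lin.predict(x) - y) ** 2)

# Parts 2-3: degree-10 polynomial + ridge penalty; pick lambda by 5-fold CV
lambdas = np.logspace(-6, 2, 30)
cv_mse = []
for lam in lambdas:
    model = make_pipeline(PolynomialFeatures(10), StandardScaler(), Ridge(alpha=lam))
    scores = cross_val_score(model, x, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse.append(-scores.mean())  # plotting cv_mse vs lambdas gives the CV curve
best_lam = lambdas[int(np.argmin(cv_mse))]

# Part 4: predict the coefficient at 400 K with both models
ridge_best = make_pipeline(
    PolynomialFeatures(10), StandardScaler(), Ridge(alpha=best_lam)
).fit(x, y)
pred_linear = lin.predict([[400.0]])[0]
pred_ridge = ridge_best.predict([[400.0]])[0]
print(best_lam, pred_linear, pred_ridge)
```

Standardizing after expanding the polynomial features keeps the ridge penalty from being dominated by the huge high-degree terms (x^10 is enormous at 1000 K), which is why the scaler sits between the two steps.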
3 Regression, bias-variance tradeoff (40 points)
Consider a dataset with n data points (xi, yi), xi ∈ Rp, drawn from the following linear model:
y = xᵀβ* + ε,
where ε is Gaussian noise and the star sign is used to differentiate the true parameter from the estimators
that will be introduced later. Consider the regularized linear regression as follows:
β̂(λ) = arg min_β { (1/n) Σᵢ₌₁ⁿ (yᵢ − xᵢᵀβ)² + λ‖β‖₂² },
where λ ≥ 0 is the regularization parameter. Let X ∈ ℝⁿˣᵖ denote the matrix obtained by stacking xᵢᵀ in each row.
1. (10 points) Find the closed form solution for β̂(λ) and its distribution.
2. (10 points) Calculate the bias E[xT β̂(λ)]− xTβ∗ as a function of λ and some fixed test point x.
3. (10 points) Calculate the variance term E[(xᵀβ̂(λ) − E[xᵀβ̂(λ)])²].
4. (10 points) Use the results from parts (b) and (c) and the bias-variance decomposition to analyze the impact of λ on the squared error. Specifically, which term dominates when λ is small, and when λ is large, respectively?
(Hint.) Properties of an affine transformation of a Gaussian random variable will be useful throughout
this problem.
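For part 1, the usual route is to set the gradient of the objective to zero. A sketch (assuming ε ∼ N(0, σ²Iₙ), which the problem statement leaves unspecified; I below is the p×p identity):

```latex
% Stationarity: -(2/n) X^\top (y - X\beta) + 2\lambda\beta = 0, so
\left( \tfrac{1}{n} X^\top X + \lambda I \right) \hat{\beta}(\lambda)
  = \tfrac{1}{n} X^\top y
\quad\Longrightarrow\quad
\hat{\beta}(\lambda) = \left( X^\top X + n\lambda I \right)^{-1} X^\top y .
```

Since y = Xβ* + ε is Gaussian and β̂(λ) is an affine function of y, β̂(λ) is itself Gaussian; its mean and covariance follow by substituting y = Xβ* + ε into the closed form, which is exactly where the hint about affine transformations comes in.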
3
Answered Same Day Apr 10, 2021

Solution

Ximi answered on Apr 13 2021
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# Question 1.1\n",
"df = pd.read_csv('spambase/spambase.data', names=list(range(57)) + ['class'])\n",
"df.head()\n",
"df = df.fillna(0)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"total count 4601\n",
"spam count 1813\n",
"regular count 2788\n"
]
}
],
"source": [
"print (\"total count\", df.shape[0])\n",
"print (\"spam count\", df[df['class'] == 1].shape[0])\n",
"print (\"regular count\", df[df['class']==0].shape[0])"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split, cross_val_score, TimeSeriesSplit\n",
"from sklearn.tree import DecisionTreeClassifier"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',\n",
" max_depth=None, max_features=None, max_leaf_nodes=None,\n",
" min_impurity_decrease=0.0, min_impurity_split=None,\n",
" min_samples_leaf=1, min_samples_split=2,\n",
" min_weight_fraction_leaf=0.0, presort='deprecated',\n",
" random_state=None, splitter='best')"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Question 1.2\n",
"X = df.iloc[:, :-1]\n",
"y = df.iloc[:, -1]\n",
"clf = DecisionTreeClassifier()\n",
"clf.fit(X, y)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.tree import plot_tree\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"image/png":...