Assignment: Decision Trees

Question

Learning outcomes
· Understand how to use decision trees on a dataset to make a prediction
· Learn hyper-parameter tuning for decision trees by using RandomGrid
· Learn the effectiveness of ensemble algorithms (Random Forest, AdaBoost, Extra Trees classifier, Gradient Boosted Trees)

In the first part of this assignment, you will use classification trees to predict whether a user has a default payment option active or not. You can find the necessary data for performing this assignment here: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

This dataset is aimed at the case of customer default payments in Taiwan. From the perspective of risk management, the predictive accuracy of the estimated probability of default is more valuable than the binary classification result (credible or not credible clients). Because the real probability of default is unknown, this study presented the novel Sorting Smoothing Method to estimate the real probability of default.

Required imports for this project are given below. Make sure you have all libraries required for this project installed. You may use conda or pip based on your setup.

NOTE: Since the data is in Excel format, you need to install xlrd in order to read the Excel file into your pandas dataframe. You can run pip install xlrd to install it.

Questions (15 points total)

Question 1 (2 pts)
Build a classifier using a decision tree and calculate the confusion matrix. Try different hyper-parameters (at least two) and discuss the result.

Question 2 (4 pts)
Build the decision tree from the previous question again, but this time by RandomGrid search over hyper-parameters. Compare the results.

Question 3 (6 pts)
Build the same classifier using the following ensemble models. For each of these models calculate accuracy, and for at least two of the models in the list below, plot the learning curves.
· Random Forest
· AdaBoost
· Extra Trees Classifier
· Gradient Boosted Trees

Question 4 (3 pts)
Discuss and compare the results of the past three questions.
· How does changing hyperparameters affect model performance?
· Why do you think certain models performed better/worse?
· How does this performance line up with known strengths/weaknesses of these models?

Suraj · Accepted Answer
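
Question 2 asks for a randomized search, whereas the notebook below runs an exhaustive `GridSearchCV`. As a rough sketch only, a randomized variant using scikit-learn's `RandomizedSearchCV` might look like the following. The synthetic data from `make_classification` is a stand-in (an assumption, so the snippet runs without downloading the Excel file), and the sampling ranges, `n_iter=20`, the 25% test split, and the fixed `random_state` values are illustrative choices that are not part of the original answer.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): with the real dataset, X would be the 23
# feature columns X1..X23 and y the integer-valued Y column.
X, y = make_classification(n_samples=2000, n_features=23, n_informative=8,
                           weights=[0.78, 0.22], random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Distributions to sample from, instead of a fixed grid of values.
param_distributions = {
    "criterion": ["gini", "entropy"],
    "max_depth": randint(2, 9),         # integers 2..8 (upper bound is exclusive)
    "min_samples_leaf": randint(2, 9),  # integers 2..8
}

# Try 20 randomly sampled settings, each scored with 3-fold cross-validation.
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=20, cv=3, scoring="accuracy", random_state=0)
search.fit(x_train, y_train)

# best_estimator_ has already been refit on the whole training split.
best_tree = search.best_estimator_
pred = best_tree.predict(x_test)
cm = confusion_matrix(y_test, pred)
print(search.best_params_)
print(cm)
```

With the real data, the resulting confusion matrix can be compared directly against the ones from the hand-picked hyper-parameters and from the exhaustive grid search.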

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "9OBvBOCkPrga"
   },
   "source": [
    "## Assignment 4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "bEmSTWZSPrgb"
   },
   "source": [
    "This assignment is based on content discussed in module 8 and uses Decision Trees and Ensemble Models in classification and regression problems."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "1cUoTzQLPrgc"
   },
   "source": [
    "## Learning outcomes "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Q1ygYVo_Prgc"
   },
   "source": [
    "- Understand how to use decision trees on a dataset to make a prediction
",
    "- Learn hyper-parameter tuning for decision trees by using RandomGrid 
",
    "- Learn the effectiveness of ensemble algorithms (Random Forest, AdaBoost, Extra Trees classifier, Gradient Boosted Trees)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "9hjVbQlVPrgd"
   },
   "source": [
    "In the first part of this assignment, you will use Classification Trees to predict whether a user has a default payment option active or not. You can find the necessary data for performing this assignment [here](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients) 
",
    "
",
    "This dataset is aimed at the case of customer default payments in Taiwan. From the perspective of risk management, the predictive accuracy of the estimated probability of default is more valuable than the binary classification result (credible or not credible clients). Because the real probability of default is unknown, this study presented the novel Sorting Smoothing Method to estimate the real probability of default.
",
    "
",
    "Required imports for this project are given below. Make sure you have all libraries required for this project installed. You may use conda or pip based on your setup.
",
    "
",
    "__NOTE:__ Since the data is in Excel format, you need to install `xlrd` in order to read the Excel file into your pandas dataframe. You can run `pip install xlrd` to install it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "R376ZBnBPrge"
   },
   "outputs": [],
   "source": [
    "#required imports
",
    "import numpy as np
",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "ddF9R5pdPrgi"
   },
   "source": [
    "After installing the necessary libraries, proceed to download the data. Since reading the excel file won't create headers by default, we added two more operations to substitute the columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "CtNCjjr7Prgj"
   },
   "outputs": [],
   "source": [
    "#loading the data
",
    "dataset = pd.read_excel("https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls")
",
    "#dataset.columns = dataset.iloc[0]
",
    "#dataset.drop(['ID'], inplace=True)
",
    "# drop any 'Unnamed' column picked up from the spreadsheet
",
    "dataset.drop(dataset.columns[dataset.columns.str.contains('unnamed',case = False)],axis = 1, inplace = True)
",
    "# drop row 0 in place; drop(..., inplace=True) returns None, so there is nothing to print
",
    "dataset.drop(0,inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "cMh-sEIdPrgl"
   },
   "source": [
    "In the following, you can take a look into the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "E0lAPOXQPrgl",
    "outputId": "ea66ba57-f32c-4b39-c60a-e52402acbca1"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "        X1 X2 X3 X4  X5  X6  X7  X8  X9 X10  ...     X15     X16     X17  \
",
       "1    20000  2  2  1  24   2   2  -1  -1  -2  ...       0       0       0   
",
       "2   120000  2  2  2  26  -1   2   0   0   0  ...    3272    3455    3261   
",
       "3    90000  2  2  2  34   0   0   0   0   0  ...   14331   14948   15549   
",
       "4    50000  2  2  1  37   0   0   0   0   0  ...   28314   28959   29547   
",
       "5    50000  1  2  1  57  -1   0  -1   0   0  ...   20940   19146   19131   
",
       "6    50000  1  1  2  37   0   0   0   0   0  ...   19394   19619   20024   
",
       "7   500000  1  1  2  29   0   0   0   0   0  ...  542653  483003  473944   
",
       "8   100000  2  2  2  23   0  -1  -1   0   0  ...     221    -159     567   
",
       "9   140000  2  3  1  28   0   0   2   0   0  ...   12211   11793    3719   
",
       "10   20000  1  3  2  35  -2  -2  -2  -2  -1  ...       0   13007   13912   
",
       "
",
       "      X18    X19    X20    X21    X22    X23  Y  
",
       "1       0    689      0      0      0      0  1  
",
       "2       0   1000   1000   1000      0   2000  1  
",
       "3    1518   1500   1000   1000   1000   5000  0  
",
       "4    2000   2019   1200   1100   1069   1000  0  
",
       "5    2000  36681  10000   9000    689    679  0  
",
       "6    2500   1815    657   1000   1000    800  0  
",
       "7   55000  40000  38000  20239  13750  13770  0  
",
       "8     380    601      0    581   1687   1542  0  
",
       "9    3329      0    432   1000   1000   1000  0  
",
       "10      0      0      0  13007   1122      0  0  
",
       "
",
       "[10 rows x 24 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "r4jchSRoPrgr"
   },
   "source": [
    "## Questions (15 points total)
",
    "
",
    "#### Question 1 (2 pts)
",
    "Build a classifier using a decision tree and calculate the confusion matrix. Try different hyper-parameters (at least two) and discuss the result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "1Qr1SPGlPrgr"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "
",
      "Int64Index: 30000 entries, 1 to 30000
",
      "Data columns (total 24 columns):
",
      "X1     30000 non-null object
",
      "X2     30000 non-null object
",
      "X3     30000 non-null object
",
      "X4     30000 non-null object
",
      "X5     30000 non-null object
",
      "X6     30000 non-null object
",
      "X7     30000 non-null object
",
      "X8     30000 non-null object
",
      "X9     30000 non-null object
",
      "X10    30000 non-null object
",
      "X11    30000 non-null object
",
      "X12    30000 non-null object
",
      "X13    30000 non-null object
",
      "X14    30000 non-null object
",
      "X15    30000 non-null object
",
      "X16    30000 non-null object
",
      "X17    30000 non-null object
",
      "X18    30000 non-null object
",
      "X19    30000 non-null object
",
      "X20    30000 non-null object
",
      "X21    30000 non-null object
",
      "X22    30000 non-null object
",
      "X23    30000 non-null object
",
      "Y      30000 non-null object
",
      "dtypes: object(24)
",
      "memory usage: 5.7+ MB
",
      "[[14306  3261]
",
      " [ 2883  2050]]
",
      "
",
      "
",
      "[[16868   699]
",
      " [ 3316  1617]]
",
      "[[16669   898]
",
      " [ 3122  1811]]
"
     ]
    }
   ],
   "source": [
    "# Question 1: decision tree classifier and confusion matrices
",
    "import matplotlib.pyplot as plt
",
    "from sklearn.tree import DecisionTreeClassifier
",
    "from sklearn.model_selection import train_test_split
",
    "from sklearn.metrics import accuracy_score,confusion_matrix
",
    "dataset.info()
",
    "dataset.describe()
",
    "# dividing data into independent (X1..X23) and dependent (Y) variables
",
    "ind=dataset.iloc[:,0:23].values
",
    "dep=dataset.iloc[:,23:24].values
",
    "dep=dep.astype('int')
",
    "# splitting data into train and test sets (test_size=0.75 holds out 75% of the rows for testing)
",
    "x_train,x_test,y_train,y_test=train_test_split(ind,dep,test_size=0.75,random_state=0)
",
    "# baseline model with default hyper-parameters
",
    "tree=DecisionTreeClassifier()
",
    "tree.fit(x_train,y_train)
",
    "pred=tree.predict(x_test)
",
    "print(confusion_matrix(y_test,pred))
",
    "# first variation: entropy criterion with a shallow tree (max_depth=2)
",
    "tree=DecisionTreeClassifier(criterion='entropy',max_depth=2,min_samples_leaf=1,min_samples_split=2)
",
    "tree.fit(x_train,y_train)
",
    "pred=tree.predict(x_test)
",
    "print(type(pred))
",
    "print(type(y_test))
",
    "print(confusion_matrix(y_test,pred))
",
    "# second variation: gini criterion with max_depth=4 and larger leaf/split minimums
",
    "tree=DecisionTreeClassifier(criterion='gini',max_depth=4,min_samples_leaf=2,min_samples_split=3)
",
    "tree.fit(x_train,y_train)
",
    "pred=tree.predict(x_test)
",
    "print(confusion_matrix(y_test,pred))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "QwcecRukPrgw"
   },
   "source": [
    "#### Question 2 (4 pts)
",
    "
",
    "Try to build the decision tree which you built for the previous question, but this time by RandomGrid search over hyper-parameters. Compare the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "4XHRmsWOPrgx"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "
",
      "
",
      "[[16654   913]
",
      " [ 3112  1821]]
"
     ]
    }
   ],
   "source": [
    "# Question 2: search over criterion, max_depth and min_samples_leaf
",
    "# (GridSearchCV tries every combination; it is an exhaustive search, not a randomized one)
",
    "from sklearn.model_selection import GridSearchCV
",
    "parameters = {'criterion':('gini','entropy'),'max_depth':(2,3,4,5,6,7,8),'min_samples_leaf':(2,3,4,5,6,7,8)}
",
    "grid=GridSearchCV(DecisionTreeClassifier(),param_grid=parameters,cv=3)
",
    "grid_model=grid.fit(x_train,y_train)
",
    "grid_model.best_estimator_
",
    "# Rebuild the tree with the non-default settings this cell hard-coded (criterion='gini', max_depth=3, min_samples_leaf=4);
",
    "# the other arguments were defaults, and presort/min_impurity_split are omitted because newer scikit-learn releases removed them
",
    "tree=DecisionTreeClassifier(criterion='gini', max_depth=3, min_samples_leaf=4)
",
    "tree.fit(x_train,y_train)
",
    "pred=tree.predict(x_test)
",
    "print(type(pred))
",
    "print(type(y_test))
",
    "print(confusion_matrix(y_test,pred))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "dEvsYwiXPrg3"
   },
   "source": [
    "#### Question 3 (6 pts)
",
    "
",
    "Build the same classifier using the following ensemble models. For each of these models calculate accuracy, and for at least two of the models in the list below, plot the learning curves.
",
    "
",
    "* Random Forest 
",
    "* AdaBoost
",
    "* Extra Trees Classifier 
",
    "* Gradient Boosted Trees 
"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "J8S4UaKdPrg3"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "H:\Anaconda\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
",
      "  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
",
      "H:\Anaconda\lib\site-packages\ipykernel_launcher.py:5: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
",
      "  """
"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.8035111111111111
"
     ]
    },
    {
     "data": {
      "image/png":

Solution
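
For Question 3, one possible way to compare the four ensemble models and to collect learning-curve data is sketched below. This is not the original answer's code: the synthetic data from `make_classification` is a stand-in (an assumption, so the snippet runs without the Excel file), and `n_estimators=50`, the 25% test split, the five training sizes, and the fixed `random_state` values are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import learning_curve, train_test_split

# Stand-in data (assumption): replace with the 23 feature columns and the
# 1-d integer Y column of the credit-card dataset.
X, y = make_classification(n_samples=1500, n_features=23, n_informative=8,
                           random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "Extra Trees": ExtraTreesClassifier(n_estimators=50, random_state=0),
    "Gradient Boosted Trees": GradientBoostingClassifier(n_estimators=50,
                                                         random_state=0),
}

# Fit each ensemble on the training split and record test-set accuracy.
accuracies = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(x_test))
    print(f"{name}: {accuracies[name]:.3f}")

# Learning-curve data for one model: cross-validated accuracy at five
# increasing training-set sizes (20% to 100% of the training split).
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=0),
    x_train, y_train,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=3, scoring="accuracy")
train_mean = train_scores.mean(axis=1)  # mean training accuracy per size
val_mean = val_scores.mean(axis=1)      # mean validation accuracy per size
```

Plotting `train_mean` and `val_mean` against `train_sizes` with matplotlib gives the learning curve; repeating the `learning_curve` call with a second model from `models` covers the "at least two" requirement.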