{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "9OBvBOCkPrga"
},
"source": [
"## Assignment 4"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "bEmSTWZSPrgb"
},
"source": [
"This assignment is based on the content of module 8 and covers the use of Decision Trees and Ensemble Models in classification and regression problems."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "1cUoTzQLPrgc"
},
"source": [
"## Learning outcomes "
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Q1ygYVo_Prgc"
},
"source": [
"- Understand how to use decision trees on a dataset to make a prediction\n",
"- Learn hyper-parameter tuning for decision trees using a randomized grid search\n",
"- Learn the effectiveness of ensemble algorithms (Random Forest, AdaBoost, Extra Trees classifier, Gradient Boosted Trees)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "9hjVbQlVPrgd"
},
"source": [
"In the first part of this assignment, you will use Classification Trees to predict whether a client will default on their credit card payment. You can find the data needed for this assignment [here](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients) \n",
"\n",
"This dataset covers customer default payments in Taiwan. From a risk-management perspective, the accuracy of the estimated probability of default is more valuable than the binary classification result (credible or not credible clients). Because the real probability of default is unknown, the study behind this dataset presented the novel Sorting Smoothing Method to estimate it.\n",
"\n",
"Required imports for this project are given below. Make sure you have all the libraries required for this project installed. You may use conda or pip, depending on your setup.\n",
"\n",
"__NOTE:__ Since the data is in Excel format, you need to install `xlrd` in order to read the Excel file into a pandas dataframe. You can run `pip install xlrd` to install it."
]
},
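{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check on the note above, the following is a minimal sketch (not part of the graded work) that reports which of the libraries named in this notebook can be found in the current environment. The helper name `has_module` is a hypothetical one introduced only for this illustration.\n",
"\n",
"```python\n",
"import importlib.util\n",
"\n",
"\n",
"def has_module(name: str) -> bool:\n",
"    # find_spec returns None when a top-level module cannot be located\n",
"    return importlib.util.find_spec(name) is not None\n",
"\n",
"\n",
"# Libraries mentioned in this notebook; extend the tuple as needed\n",
"for module_name in ('numpy', 'pandas', 'xlrd'):\n",
"    status = 'installed' if has_module(module_name) else 'missing'\n",
"    print(module_name, status)\n",
"```\n",
"\n",
"Any library reported as missing can then be installed with conda or pip before running the imports."
]
},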
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "R376ZBnBPrge"
},
"outputs": [],
"source": [
"# Required imports\n",
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "ddF9R5pdPrgi"
},
"source": [
"After installing the necessary libraries, proceed to download the data. Since reading the Excel file won't create the proper headers by default, two extra operations are added to clean up the columns."
]
},
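{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to picture the header problem is a frame whose column labels are generic while the descriptive names sit in the first row. The sketch below shows, under that assumption, how the first row could be promoted to column headers. The toy values and the helper name `promote_first_row` are made up for illustration and are not taken from the real file.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy frame: generic labels as columns, descriptive names in row 0 (illustrative values)\n",
"raw = pd.DataFrame(\n",
"    [['ID', 'LIMIT_BAL', 'AGE'], [1, 20000, 24], [2, 120000, 26]],\n",
"    columns=['Unnamed: 0', 'X1', 'X5'],\n",
")\n",
"\n",
"\n",
"def promote_first_row(df: pd.DataFrame) -> pd.DataFrame:\n",
"    # Keep every row after the first and relabel the columns with row 0\n",
"    out = df.iloc[1:].copy()\n",
"    out.columns = df.iloc[0].tolist()\n",
"    return out.reset_index(drop=True)\n",
"\n",
"\n",
"clean = promote_first_row(raw)\n",
"print(clean.columns.tolist())\n",
"```\n",
"\n",
"The notebook itself takes a simpler route and drops the extra column and row, keeping the generic column labels."
]
},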
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "CtNCjjr7Prgj"
},
"outputs": [],
"source": [
"# Load the data directly from the UCI repository\n",
"dataset = pd.read_excel(\"https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls\")\n",
"# Drop every column whose label contains 'unnamed'\n",
"dataset.drop(dataset.columns[dataset.columns.str.contains('unnamed', case=False)], axis=1, inplace=True)\n",
"# Drop row 0, which is not a data record; drop(..., inplace=True) returns None, so there is nothing to print\n",
"dataset.drop(0, inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "cMh-sEIdPrgl"
},
"source": [
"The following cell gives a first look at the dataset."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "E0lAPOXQPrgl",
"outputId": "ea66ba57-f32c-4b39-c60a-e52402acbca1"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"