Introduction

For this week’s take-home lab, you will work on the same dataset used in the Week 4/5 Take-Home Labs. You will solve the same problem studied in this week’s in-class lab, but on a much larger and more interesting dataset. The file UCI_Credit_Card.csv contains 30,000 consumer records with 24 variables. A detailed description of the fields is available at: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

Several fields contain levels that the UCI description does not document. Treat every such unknown level as NA:

  • Marriage: the UCI description lists marital status as 1 = married; 2 = single; 3 = others. However, the data contain the levels 0, 1, 2, 3. Treat 0 as unknown.

  • Education: the UCI description lists 1 = graduate school; 2 = university; 3 = high school; 4 = others. However, the data contain levels 1 to 6. Treat 5 and 6 as unknown.

  • X6-X11 (repayment status): the measurement scale is -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above. However, many records have the value -2. Treat -2 as unknown as well.
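The recoding rules described here can be sketched in base R as follows. This is a minimal illustration on a small made-up data frame, not the assignment's reference solution; the helper name `recode_unknown` and the toy values are assumptions introduced for the example, while the column names follow the dataset.

```r
# Toy data frame with made-up values; column names follow the dataset.
credit <- data.frame(
  MARRIAGE  = c(0, 1, 2, 3),
  EDUCATION = c(1, 5, 6, 4),
  PAY_0     = c(-2, -1, 0, 2),
  PAY_2     = c(1, -2, -1, 3)
)

# Hypothetical helper: set every value listed in `codes` to NA.
recode_unknown <- function(x, codes) {
  x[x %in% codes] <- NA
  x
}

credit$MARRIAGE  <- recode_unknown(credit$MARRIAGE, 0)
credit$EDUCATION <- recode_unknown(credit$EDUCATION, c(5, 6))

# Apply the -2 rule to every repayment-status column in one pass.
# The pattern matches PAY_0, PAY_2, ... but not PAY_AMT1, PAY_AMT2, ...
pay_cols <- grep("^PAY_[0-9]+$", names(credit), value = TRUE)
credit[pay_cols] <- lapply(credit[pay_cols], recode_unknown, codes = -2)

# Number of NA values introduced per column.
na_counts <- colSums(is.na(credit))
na_counts
```

Looping over the repayment columns by name pattern avoids repeating one `replace()` call per column and keeps the rule in a single place.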

Your task is to build the best possible model for predicting whether or not a consumer will default on their credit card payment for the next month (the last column in the dataset).

Assignment

Perform the following tasks:

  • Conduct a training/test split of the data, building a 20% held-out test dataset.

  • Fit the best KNN model and CART model you can (consider feature selection etc.) to the data to predict consumer default.

  • Then plot ROC curves for the logistic regression, SVM, KNN, and CART models, and compare their performance.

  • Compute the AUC for the logistic regression, SVM, KNN, and CART models, and compare their performance.

  • Provide a summary and discussion of your work in written form (.docx or .pdf) that includes the following:

    • Q1 Summarize the model/feature selection process you used to fit your KNN and CART model

    • Q2 Provide a summary of the fitted KNN/CART models (i.e., model summary)

    • Q3 Provide a performance evaluation of the fitted KNN/CART models using a confusion matrix.

    • Q4 How well do you think the fitted KNN/CART models work on this dataset?

    • Q5 Using ROC curves and AUC, which of the logistic regression, SVM, KNN, and CART models works best with the dataset so far?
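The ROC and AUC tasks listed above apply to any model that produces a numeric score per record. The following is a minimal base-R sketch on simulated scores, so it is not tied to any of the four models; the helper names `roc_points` and `auc_rank` are hypothetical, and a dedicated ROC package could be used instead.

```r
# Hypothetical helper: ROC points from numeric scores and 0/1 labels.
# Each distinct score is used as a threshold ("predict 1 if score >= t").
roc_points <- function(scores, labels) {
  thresholds <- sort(unique(scores), decreasing = TRUE)
  tpr <- sapply(thresholds, function(t) mean(scores[labels == 1] >= t))
  fpr <- sapply(thresholds, function(t) mean(scores[labels == 0] >= t))
  data.frame(fpr = c(0, fpr), tpr = c(0, tpr))  # start the curve at (0, 0)
}

# Hypothetical helper: AUC via the rank (Mann-Whitney) formulation.
# rank() assigns average ranks to tied scores.
auc_rank <- function(scores, labels) {
  r  <- rank(scores)
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Simulated labels and scores (assumption: higher score means more likely 1).
set.seed(1)
labels <- rbinom(200, 1, 0.3)
scores <- labels + rnorm(200)

roc <- roc_points(scores, labels)
plot(roc$fpr, roc$tpr, type = "s",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)  # reference line for a random classifier

auc_value <- auc_rank(scores, labels)
auc_value
```

To compare several models, compute one score vector per model on the same test set and overlay the curves with `lines()`; the model with the larger AUC ranks defaulters above non-defaulters more often.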

Submission Instructions

For this weekly lab assignment, you should submit:

  • An R script file (or Rmd file)

  • A written summary/discussion of your work (as discussed above) in .docx

Answered 3 days After Feb 16, 2022

Solution

Mohd answered on Feb 19 2022
2/18/2022
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(caret)
## Warning: package 'caret' was built under R version 4.1.1
## Loading required package: ggplot2
## Loading required package: lattice
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(e1071)
## Warning: package 'e1071' was built under R version 4.1.1
library(ggplot2)
library(magrittr)
library(rmarkdown)
library(readxl)
ucicreditcard <- read_excel("~/New folder (2)/ucicreditcard.xlsx")
#View(ucicreditcard)
# Copy the response into a column with a syntactic name, then drop the original (column 25)
ucicreditcard$default_payment<-ucicreditcard$`default payment next month`
ucicreditcard<-ucicreditcard[,-25]
# Recode unknown levels as NA
ucicreditcard$MARRIAGE<-replace(ucicreditcard$MARRIAGE,ucicreditcard$MARRIAGE==0,NA)
ucicreditcard%>%
  count(EDUCATION)
## # A tibble: 7 x 2
##   EDUCATION     n
## 1         0    14
## 2         1 10585
## 3         2 14030
## 4         3  4917
## 5         4   123
## 6         5   280
## 7         6    51
ucicreditcard$EDUCATION<-replace(ucicreditcard$EDUCATION,ucicreditcard$EDUCATION==6,NA)
ucicreditcard$EDUCATION<-replace(ucicreditcard$EDUCATION,ucicreditcard$EDUCATION==5,NA)
ucicreditcard%>%
  count(PAY_0)
## # A tibble: 11 x 2
##    PAY_0     n
##  1    -2  2759
##  2    -1  5686
##  3     0 14737
##  4     1  3688
##  5     2  2667
##  6     3   322
##  7     4    76
##  8     5    26
##  9     6    11
## 10     7     9
## 11     8    19
sum(is.na(ucicreditcard$PAY_0))
## [1] 0
ucicreditcard$PAY_0<-replace(ucicreditcard$PAY_0,ucicreditcard$PAY_0==-2,NA)
ucicreditcard$PAY_2<-replace(ucicreditcard$PAY_2,ucicreditcard$PAY_2==-2,NA)
ucicreditcard$PAY_3<-replace(ucicreditcard$PAY_3,ucicreditcard$PAY_3==-2,NA)
ucicreditcard$PAY_4<-replace(ucicreditcard$PAY_4,ucicreditcard$PAY_4==-2,NA)
ucicreditcard$PAY_5<-replace(ucicreditcard$PAY_5,ucicreditcard$PAY_5==-2,NA)
ucicreditcard$PAY_6<-replace(ucicreditcard$PAY_6,ucicreditcard$PAY_6==-2,NA)
ucicreditcard%>%
  count(PAY_0)
## # A tibble: 11 x 2
##    PAY_0     n
##  1    -1  5686
##  2     0 14737
##  3     1  3688
##  4     2  2667
##  5     3   322
##  6     4    76
##  7     5    26
##  8     6    11
##  9     7     9
## 10     8    19
## 11    NA  2759
sum(is.na(ucicreditcard$PAY_0))
## [1] 2759
Training/test split of the data, building a 20% held out test dataset
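# Remove every row that contains at least one NA
ucicreditcard<-na.omit(ucicreditcard)
set.seed(2223)
# Drop the first column (ID)
ucicreditcard<-ucicreditcard[,2:25]
# Assign each row to group 1 (training) with probability 0.8 or group 2 (test) with probability 0.2
inp <- sample(2, nrow(ucicreditcard), replace = TRUE, prob = c(0.8, 0.2))
training_data <- ucicreditcard[inp==1, ]
test_data <- ucicreditcard[inp==2, ]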
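Because `sample(2, ..., replace = TRUE, prob = c(0.8, 0.2))` assigns each row independently, the test set above holds roughly, not exactly, 20% of the rows. If an exact 20% hold-out is wanted, one alternative is to sample row indices without replacement. The sketch below uses a simulated data frame; the names `toy`, `n_test`, `test_idx`, `toy_train`, and `toy_test` are introduced only for illustration.

```r
set.seed(2223)

# Simulated stand-in for the cleaned dataset (made-up values).
toy <- data.frame(id = 1:1000, y = rbinom(1000, 1, 0.22))

# Fix the test-set size, then draw that many distinct row indices.
n_test   <- round(0.2 * nrow(toy))
test_idx <- sample(nrow(toy), n_test)

toy_test  <- toy[test_idx, ]
toy_train <- toy[-test_idx, ]

# Row counts of the two partitions.
c(train = nrow(toy_train), test = nrow(toy_test))
```

Either approach gives a valid random split; the index-based version only makes the partition sizes deterministic.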
Fit the best KNN model and CART model you can (consider feature selection etc.) to the data to predict consumer default.
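train.respo<-training_data$default_payment
test.respo<-test_data$default_payment
# Note: assuming the standard UCI column order, default_payment is column 24 at this point,
# so 2:24 keeps the response among the predictors and leaves out LIMIT_BAL (column 1);
# 1:23 would select the 23 predictors only.
train.explano<-training_data[,2:24]
test.explano<-test_data[,2:24]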
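Selecting predictor columns by position is easy to get wrong once columns have been added or dropped earlier in a script. A minimal sketch of selecting by name instead, so that the response cannot end up among the predictors; the toy data frame and the names `response`, `predictors`, `x`, and `y` are assumptions made for the example.

```r
# Toy data frame with made-up values; column names follow the dataset.
toy <- data.frame(
  LIMIT_BAL       = c(20000, 120000, 90000),
  SEX             = c(2, 2, 1),
  PAY_AMT6        = c(0, 2000, 5000),
  default_payment = c(1, 1, 0)
)

response   <- "default_payment"
predictors <- setdiff(names(toy), response)  # every column except the response

x <- toy[, predictors]        # predictor columns only
y <- factor(toy[[response]])  # class labels as a factor

# Guard against accidental target leakage.
stopifnot(!response %in% names(x))
names(x)
```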
KNN Model with Summary
library(class)
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.1.2
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.1.2
knn.1<-knn(train.explano,test.explano,train.respo,k=1)
knn.5<-knn(train.explano,test.explano,train.respo,k=5)
knn.10<-knn(train.explano,test.explano,train.respo,k=10)
knn.30<-knn(train.explano,test.explano,train.respo,k=30)
knn.15<-knn(train.explano,test.explano,train.respo,k=15)
sum(test.respo==knn.1)/length(test.respo)
## [1] 0.6909478
sum(test.respo==knn.5)/length(test.respo)
## [1] 0.7512247
sum(test.respo==knn.10)/length(test.respo)
## [1] 0.7635783
sum(test.respo==knn.30)/length(test.respo)
## [1] 0.770394
sum(test.respo==knn.15)/length(test.respo)
## [1] 0.7682641
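KNN classifies by distance, so predictors measured on large scales (such as bill amounts) can dominate predictors on small scales unless the data are standardised first. The sketch below, on simulated data, standardises with training-set statistics, compares the same values of k as above, and prints a confusion matrix for the chosen k. All object names and the simulated data are assumptions for illustration; choosing k on the test set, as done here for brevity, is optimistic, and cross-validation on the training data would be the cleaner choice.

```r
library(class)

set.seed(42)
n <- 400

# Simulated predictors on very different scales (made-up data).
x <- data.frame(small = rnorm(n),
                large = rnorm(n, mean = 50000, sd = 20000))
# Only `small` carries signal about the class.
y <- factor(ifelse(x$small + rnorm(n, sd = 0.5) > 0, 1, 0))

test_idx <- sample(n, 80)
x_train <- x[-test_idx, ]
x_test  <- x[test_idx, ]
y_train <- y[-test_idx]
y_test  <- y[test_idx]

# Standardise both sets with the training means and standard deviations.
mu  <- colMeans(x_train)
sdv <- apply(x_train, 2, sd)
x_train_s <- scale(x_train, center = mu, scale = sdv)
x_test_s  <- scale(x_test,  center = mu, scale = sdv)

# Accuracy for each candidate k.
ks  <- c(1, 5, 10, 15, 30)
acc <- sapply(ks, function(k) {
  pred <- knn(x_train_s, x_test_s, cl = y_train, k = k)
  mean(pred == y_test)
})
names(acc) <- ks
acc

# Confusion matrix for the k with the highest accuracy.
best_k <- ks[which.max(acc)]
pred   <- knn(x_train_s, x_test_s, cl = y_train, k = best_k)
cm     <- table(Predicted = pred, Actual = y_test)
cm
```

Accuracy alone can look good on an imbalanced target, which is why the confusion matrix is worth reading alongside it: it shows how many actual defaults the model misses.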
class(train.respo)
## [1]...
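The assignment also asks for a CART model, and the `rpart` package is loaded above. The following is a minimal sketch of fitting and evaluating a classification tree on simulated data; it is not the answer's own CART code, which is not shown in this excerpt. The toy variables `pay_status`, `limit_bal`, and `default` and the complexity parameter value are assumptions.

```r
library(rpart)

set.seed(7)
n <- 500

# Simulated data (made-up): default is more likely when pay_status >= 2.
toy <- data.frame(
  pay_status = sample(-1:3, n, replace = TRUE),
  limit_bal  = round(runif(n, 10000, 500000))
)
toy$default <- factor(ifelse(toy$pay_status >= 2 & runif(n) < 0.8, "yes", "no"))

test_idx <- sample(n, 100)

# Classification tree; cp controls how much a split must improve the fit.
fit <- rpart(default ~ pay_status + limit_bal,
             data = toy[-test_idx, ],
             method = "class",
             control = rpart.control(cp = 0.01))

printcp(fit)  # complexity table, useful when deciding how far to prune

# Class predictions for the confusion matrix, probabilities for ROC/AUC.
pred_class <- predict(fit, toy[test_idx, ], type = "class")
pred_prob  <- predict(fit, toy[test_idx, ], type = "prob")[, "yes"]

cm_tree <- table(Predicted = pred_class, Actual = toy$default[test_idx])
cm_tree
```

The predicted probabilities in `pred_prob` are the kind of score vector needed for an ROC curve, while `pred_class` feeds the confusion matrix.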