MIS 545 Lab 6: Model Evaluation
1 Overview
In this lab, we will evaluate prediction performance on two datasets, which can be found under the Lab 6 module on D2L. Save them in your working directory.
1. adult.csv: This dataset contains census data about more than 48,000 individuals. Try to predict whether an individual's income exceeds $50K/yr based on census attributes such as age, work class, education, race, sex, marital status, and native country. You can find details about the dataset at: https://archive.ics.uci.edu/ml/datasets/Adult
2. titanic.csv: This dataset uses variables such as class, age, and sex to predict whether a passenger survived the wreck of the Titanic. It has been used in previous lectures.
2 Packages
For lab 6, we will use two packages.
C50: This package extends the C4.5 classification algorithm described in Quinlan (1993). The details of the extensions are largely undocumented. The model can take the form of a full decision tree.
pROC: Tools for visualizing, smoothing, and comparing receiver operating characteristic (ROC) curves.
# Install packages
install.packages("C50")
install.packages("pROC")
library(C50)
library(pROC)
3 Precision and Recall
First, use setwd() to set your working directory and save adult.csv there. Then load the adult dataset into memory; a question mark stands for a missing value. Because the C50 package handles missing values internally, we do not need to preprocess them.
# Read in adult.csv; '?' marks missing values
adult <- read.csv("adult.csv", na.strings = '?')
Split the data into training and testing sets.
# Partition dataset for training (80%) and testing (20%)
size <- floor(0.8 * nrow(adult))
### Randomly decide which ones for training
training_index <- sample(nrow(adult), size = size, replace = FALSE)
train <- adult[training_index,]
test <- adult[-training_index,]
### Names of the variables used for prediction
var_names <- names(adult)[-15]
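Because sample() draws a random subset, each run produces a different split. To make the partition reproducible, you can fix the RNG seed before sampling (the value 545 below is arbitrary):

```r
# Fix the seed so sample() returns the same training rows on every run
set.seed(545)
```

Call it once, immediately before the sample() call above.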
Fit the decision tree model. You can find a ranked list of attributes by usage via summary(dt).
# Fit the model
dt <- C5.0(x = train[, var_names], y = train$if_above_50K)
# See the summary of model
summary(dt)
### Now validate on the test set
## predict() returns a vector of predicted classes
dt_pred <- predict(dt, newdata = test)
### Merge predictions into the test dataset
dt_evaluation <- cbind(test, dt_pred)
Get a quick sense of the prediction accuracy.
### Compare predictions to actual values
dt_evaluation$correct <- ifelse(dt_evaluation$if_above_50K == dt_evaluation$dt_pred, 1, 0)
### Accuracy rate
sum(dt_evaluation$correct) / nrow(dt_evaluation)
### Confusion matrix
table(dt_evaluation$if_above_50K, dt_evaluation$dt_pred)
## (a 2 x 2 table of no/yes counts; exact values depend on the random split)
In general, we use four rates to evaluate a binary prediction: TPR, TNR, FPR, and FNR.
### True Positive Rate (Sensitivity): TPR = TP / P
### = count of true positive predictions divided by total actual positives
TPR <- sum(dt_evaluation$dt_pred == 'yes' & dt_evaluation$if_above_50K == 'yes') / sum(dt_evaluation$if_above_50K == 'yes')
### True Negative Rate (Specificity): TNR = TN / N
### = count of true negative predictions divided by total actual negatives
TNR <- sum(dt_evaluation$dt_pred == 'no' & dt_evaluation$if_above_50K == 'no') / sum(dt_evaluation$if_above_50K == 'no')
### False Positive Rate (1 - Specificity): FPR = FP / N
### = sum(dt_evaluation$dt_pred == 'yes' & dt_evaluation$if_above_50K == 'no') /
###   sum(dt_evaluation$if_above_50K == 'no')
FPR <- 1 - TNR
### False Negative Rate (1 - Sensitivity): FNR = FN / P
### = sum(dt_evaluation$dt_pred == 'no' & dt_evaluation$if_above_50K == 'yes') /
###   sum(dt_evaluation$if_above_50K == 'yes')
FNR <- 1 - TPR
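As a sanity check, the four rates can be computed on a tiny hand-made truth/prediction pair (hypothetical vectors, not the adult data):

```r
# Hypothetical labels: 3 actual positives, 5 actual negatives
truth <- factor(c('yes', 'yes', 'yes', 'no', 'no', 'no', 'no', 'no'))
pred  <- factor(c('yes', 'yes', 'no',  'no', 'no', 'no', 'yes', 'no'))
TPR <- sum(pred == 'yes' & truth == 'yes') / sum(truth == 'yes')  # 2/3
TNR <- sum(pred == 'no'  & truth == 'no')  / sum(truth == 'no')   # 4/5
FPR <- 1 - TNR                                                    # 1/5
FNR <- 1 - TPR                                                    # 1/3
```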
Precision and Recall are widely used to evaluate prediction performance.
### Precision
### = count of true positive predictions / total positive predictions
dt_precision <- sum(dt_evaluation$if_above_50K == 'yes' & dt_evaluation$dt_pred == 'yes') / sum(dt_evaluation$dt_pred == 'yes')
### Recall = TPR
### = count of true positive predictions / total actual positives
dt_recall <- sum(dt_evaluation$if_above_50K == 'yes' & dt_evaluation$dt_pred == 'yes') / sum(dt_evaluation$if_above_50K == 'yes')
The F score combines precision and recall: it is their harmonic mean. In some cases, we have to weight precision or recall differently based on domain knowledge.
### F measure
F <- 2 * dt_precision * dt_recall / (dt_precision + dt_recall)
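For instance, with a hypothetical precision of 0.8 and recall of 0.6, the harmonic mean works out to about 0.686:

```r
# Worked F-measure example with made-up precision/recall values
precision <- 0.8
recall <- 0.6
F <- 2 * precision * recall / (precision + recall)
F  # 0.96 / 1.4, roughly 0.686
```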
4 ROC Curve: Receiver Operating Characteristic Curve
Load the second dataset, titanic.csv. Partition data into training and testing as we did above.
titanic <- read.csv("titanic.csv")
### Partition dataset for training (80%) and testing (20%)
size <- floor(0.8 * nrow(titanic))
### Randomly decide which ones for training
training_index <- sample(nrow(titanic), size = size, replace = FALSE)
train <- titanic[training_index,]
test <- titanic[-training_index,]
Fit the logistic regression. Note the parameter type = "response" in the predict() method: it returns predicted probabilities rather than class labels.
### Fitting regression model
reg <- glm(survive ~ . , data = train, family = binomial() )
### Model detail
summary(reg)
### Validate test dataset
evaluation <- test
evaluation$prob <- predict(reg, newdata = evaluation, type = "response")
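Because type = "response" yields probabilities, turning them into "yes"/"no" labels requires a cutoff. A minimal sketch with a hypothetical probability vector (in the lab, evaluation$prob plays this role):

```r
prob <- c(0.12, 0.57, 0.83, 0.49)              # hypothetical predicted probabilities
pred_class <- ifelse(prob > 0.5, "yes", "no")  # 0.5 is a common default cutoff
pred_class                                     # "no" "yes" "yes" "no"
```

The cutoff can be moved away from 0.5 when false positives and false negatives carry different costs.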
Compare against the baseline survival rate in the dataset.
# Baseline = 32%
count_survive <- nrow(subset(titanic, titanic$survive == "yes") )
baseline <- count_survive / nrow(titanic)
baseline
## (about 0.32)
Plot the ROC curve. Note the AUC is 0.7686, well above the 0.5 of a random classifier. Since the training and testing sets are randomly sampled, this number may differ on your computer.
# roc() computes sensitivity and specificity across all thresholds
g <- roc(evaluation$survive ~ evaluation$prob, data = evaluation)
# ROC curve
plot(g)
## Area under the curve: 0.7686
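pROC can also report the AUC directly and locate the threshold that best balances sensitivity and specificity. A toy sketch with made-up responses and probabilities (not the titanic data):

```r
library(pROC)
resp <- c(0, 0, 0, 1, 1, 1)                    # made-up ground truth
prob <- c(0.10, 0.40, 0.35, 0.80, 0.70, 0.90)  # made-up predicted probabilities
r <- roc(resp, prob, quiet = TRUE)
auc(r)   # the toy classes are perfectly separated, so AUC = 1
coords(r, "best", ret = c("threshold", "sensitivity", "specificity"))
```

On the lab's fitted object, auc(g) and coords(g, "best") work the same way.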
adult.csv (first rows shown; the full file is on D2L)
age,workclass,fnlwgt,education,education-num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,if_above_50K
39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,no
50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,no
38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,no
...
Answered Same Day Aug 22, 2021

Solution

Mohd answered on Aug 23 2021
Assignment
Walker Kirk
8/23/2021
knitr::opts_chunk$set(echo = TRUE,cache = TRUE,warning = FALSE,message = FALSE,dpi = 180,fig.width = 8,fig.height = 5)
library(dplyr)
library(ggplot2)
library(magrittr)
library(rmarkdown)
library(C50)
library(pROC)
Please finish the questions below using R:
1. Fit Decision Tree and Logistic Regression to predict affairs (attribute if_affair is the dependent/target variable).
2. Based on the result of the Decision Tree:
   a. Find the most useful attribute in prediction. (Hint: use summary(your model))
   b. What are the Precision and Recall? (Define "Yes" as the positive outcome)
library(readr)
affairs <- read_csv("New folder (2)/affairs.csv")
affairs$if_affair <- factor(affairs$if_affair)
str(affairs)
## spec_tbl_df [601 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ age : num [1:601] 37 27 32 57 22 32 22 57 32 22 ...
## $ yearsmarried : num [1:601] 10 4 15 15 0.75 1.5 0.75 15 15 1.5 ...
## $ religiousness: num [1:601] 3 4 1 5 2 2 2 2 4 4 ...
## $ rating : num [1:601] 4 4 4 5 3 5 3 4 2 5 ...
## $ if_affair : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. age = col_double(),
## .. yearsmarried = col_double(),
## .. religiousness = col_double(),
## .. rating = col_double(),
## .. if_affair = col_character()
## .. )
## - attr(*, "problems")=
affairs <- affairs %>%
  mutate(if_affair = ifelse(if_affair == "no", 0, 1))
summary(affairs)
##       age         yearsmarried    religiousness      rating
## Min. :17.50 Min. : 0.125 Min. :1.000 Min. :1.000
## 1st Qu.:27.00 1st Qu.: 4.000 1st Qu.:2.000 1st Qu.:3.000
## Median :32.00 Median : 7.000 Median :3.000 Median :4.000
## Mean :32.49 Mean : 8.178 Mean :3.116 Mean :3.932
## 3rd Qu.:37.00 3rd Qu.:15.000 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :57.00 Max. :15.000 Max. :5.000 Max. :5.000
## if_affair
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2496
## 3rd Qu.:0.0000
## Max. :1.0000
affairs$if_affair <- factor(affairs$if_affair)
set.seed(333)
size <- floor(0.8 * nrow(affairs))

### randomly decide which ones for training
training_index <- sample(nrow(affairs), size = size, replace = FALSE)

train <- affairs[training_index,]
test <- affairs[-training_index,]

### Names of the variables used for prediction
...