Lab 3_Naive Bayes1.docx
MIS 545 Lab 3: Naive Bayes Classifier
Predicting Mushroom Types
1 Overview
In this lab, we will apply Naive Bayes to a Mushroom dataset. You can find a Mushroom dataset under D2L > Labs > Lab 3, called Mushroom.csv. Save it in your working directory.
The Mushroom dataset contains 8123 observations drawn from 23 species of gilled mushrooms in the Agaricus and Lepiota families. With respect to edibility, there are two types of mushroom: if classes = e, the mushroom is edible; if classes = p, it is poisonous. We want to distinguish the edible mushrooms from the poisonous ones by looking at some of their characteristics.
Note: The original dataset was downloaded from the University of California, Irvine machine learning data repository. For more details, please go to https://archive.ics.uci.edu/ml/datasets/Mushroom.
2 Data Packages
We will need to install the e1071 package for this lab, which is a well-developed public package on CRAN.
# install package "e1071"
install.packages("e1071")
# to use the package in an R session, we need to load it via library()
library(e1071)
3 Preprocessing
Save Mushroom.csv under your working directory. Unlike a clean dataset, null values in Mushroom.csv are denoted by a question mark, so we will slightly adjust our read.csv() call.
# read in the csv file Mushroom.csv; note the question mark represents a null value
mushroom <- read.csv('Mushroom.csv', na.strings = '?')
Call summary() to see what the data looks like in general. Look at the column stalk_root: it is the only column that includes NAs.
summary(mushroom)
# count the observations that are not complete (contain NA values)
nrow(mushroom[!complete.cases(mushroom),])
## [1] 2480
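To confirm that stalk_root is the only column with missing values, a quick optional check (not part of the original lab steps) is to count the NAs in each column:
# count missing values per column; only stalk_root should be non-zero
colSums(is.na(mushroom))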
Recall that Naive Bayes is a probability-based algorithm: to predict the class, it combines the prior probability of each class with the conditional probability of each predictive variable. Therefore, null values in the dataset pose a risk for our prediction.
# we can retain only the observations that do not contain NA (null) values
mushroom = mushroom[complete.cases(mushroom),]
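As a quick sanity check (optional), the number of rows should now have dropped by exactly the number of incomplete observations found above:
# remaining rows after removing incomplete observations
nrow(mushroom)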
4 Training and testing sets
Next, we will create training and test sets of the data. We will fit the model with the training set, and use the test set to evaluate the model. We will do a 70/30 split (70% will be training data).
# 70% of original data will be used for training
sample_size <- floor(0.7 * nrow(mushroom))
# randomly select index of observations for training
training_index <- sample(nrow(mushroom), size = sample_size, replace = FALSE)
train <- mushroom[training_index,]
test <- mushroom[-training_index,]
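Note that sample() draws a different random split every time the code runs. If you want a reproducible split (optional; the seed value 545 below is arbitrary), call set.seed() immediately before the sample() line above:
# fix the random number generator so the same 70/30 split is drawn each run
set.seed(545)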
5 Fitting and model performance
There is a Naive Bayes classifier in the e1071 package, which was already loaded into our current session via library(e1071). Fit the model to the training data.
# note the period after the tilde: it means all the other variables in the dataset will be used as predictive variables
mushroom.model <- naiveBayes(classes ~ . , data = train)
# we can explore the detailed conditional probabilities for each variable by printing the object mushroom.model
mushroom.model
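Internally, the fitted object stores the class frequencies used for the priors and one conditional probability table per predictor. If you only want to inspect a single predictor (for example odor, one of the columns in Mushroom.csv), you can index into the object; this is an optional exploration, not a required lab step:
# class frequency table used to form the prior probabilities
mushroom.model$apriori
# conditional probability table for a single predictor, e.g. odor
mushroom.model$tables$odor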
After fitting, run the test data through the model to get the predicted class for each observation.
# predict() returns a vector containing the predicted type (class) for each mushroom in the test set
mushroom.predict <- predict(mushroom.model, test, type = 'class')
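predict() can also return posterior probabilities instead of hard class labels, which is useful if you want to see how confident the model is about each test observation. This is optional; type = 'raw' is the other option supported by e1071's predict method for naiveBayes objects:
# posterior probability of each class for every test observation
mushroom.posterior <- predict(mushroom.model, test, type = 'raw')
head(mushroom.posterior)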
Show the performance metrics of the model:
# put the predicted and actual values together in a data frame called results
results <- data.frame(predicted = mushroom.predict, actual = test[,'classes'])
# we can build a confusion matrix via the table() function to evaluate the performance of our prediction
table(results)
# columns indicate the actual type of each mushroom; rows indicate the predicted type
# for example, we correctly predicted 1067 mushrooms as edible and 580 as poisonous; however, we mistook 46 poisonous mushrooms for edible
          actual
predicted          e          p
        e XXXXXXXXXX XXXXXXXXXX
        p XXXXXXXXXX XXXXXXXXXX
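From the confusion matrix we can also compute a simple overall accuracy. The sketch below stores the table first; the variable name cm is ours, not part of the lab:
# overall accuracy: correct predictions (the diagonal) divided by all predictions
cm <- table(results)
sum(diag(cm)) / sum(cm)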
Lab 3 Source.R
# Set working directory, please change the code below
# according to your own situation
setwd("Input your working directory")
# load (and install if needed) e1071 and caret
if(!require(e1071)){
  install.packages("e1071")
  library(e1071)
}
if(!require(caret)){
  install.packages("caret")
  library(caret)
}
# read in dataset
mushroom <- read.csv('./data/Mushroom.csv'
, na.strings = '?'
)
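# note (not part of the original lab): in R 4.0 and later, read.csv() no longer
# converts strings to factors by default, so the categorical columns arrive as
# character vectors and levels() below would return NULL; one option is to
# convert every column to a factor (or pass stringsAsFactors = TRUE above)
mushroom[] <- lapply(mushroom, factor)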
# total number of mushroom
nrow(mushroom)
### 8123
# number of mushroom with na value
nrow(mushroom[!complete.cases(mushroom),])
### 2480
### we can delete observations with missing value
mushroom = mushroom[complete.cases(mushroom),]
# data should be clean now
summary(mushroom)
########################
### mushroom type ###
########################
# types of mushrooms
levels(mushroom$classes)
# distribution of types
summary(mushroom$classes)
### take 70% as training set
sample_size <- floor(0.7 * nrow(mushroom))
### randomly decide which ones are training data
training_index <- sample(nrow(mushroom), size = sample_size, replace = FALSE)
train <- mushroom[training_index,]
test <- mushroom[-training_index,]
# take all explanatory variables to predict
mushroom.model <- naiveBayes(classes ~ .
, data = train
)
# printing the model shows the conditional probabilities for each predictor
mushroom.model
# run the test data
# the result of prediction is a vector with the predicted class
# for each observation in the test set
mushroom.predict <- predict(mushroom.model
, test
, type = 'class'
)
# put the predicted and actual values together in a data frame called results
results <- data.frame(predicted = mushroom.predict, actual = test[,'classes'])
# build a confusion matrix via the table() function
# to evaluate the performance of our prediction
table(results)
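# optional extension (not part of the original lab): caret's confusionMatrix()
# reports accuracy and per-class statistics in one call; both arguments must be
# factors with the same levels
confusionMatrix(mushroom.predict, test$classes)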
Mushroom.csv
classes,cap_shape,cap_surface,cap_color,if_bruises,odor,gill_attachment,gill_spacing,gill_size,gill_color,stalk_shape,stalk_root,stalk_surface_above_ring,stalk_surface_below_ring,stalk_color_above_ring,stalk_color_below_ring,veil_type,veil_color,ring_number,ring_type,spore_print_color,population,habitat
e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g
e,x,y,y,t,a,f,c,b,n,e,c,s,s,w,w,p,w,o,p,k,n,g
e,b,s,w,t,a,f,c,b,g,e,c,s,s,w,w,p,w,o,p,k,n,m
e,b,y,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,s,m
p,x,y,w,t,p,f,c,n,p,e,e,s,s,w,w,p,w,o,p,k,v,g
e,b,s,y,t,a,f,c,b,g,e,c,s,s,w,w,p,w,o,p,k,s,m
e,x,y,y,t,l,f,c,b,g,e,c,s,s,w,w,p,w,o,p,n,n,g
e,x,y,y,t,a,f,c,b,n,e,c,s,s,w,w,p,w,o,p,k,s,m
e,b,s,y,t,a,f,c,b,w,e,c,s,s,w,w,p,w,o,p,n,s,g
p,x,y,w,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,n,v,u
e,x,f,n,f,n,f,w,b,n,t,e,s,f,w,w,p,w,o,e,k,a,g
e,s,f,g,f,n,f,c,n,k,e,e,s,s,w,w,p,w,o,p,n,y,u
e,f,f,w,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g
p,x,s,n,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,g
p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,n,s,u
p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,n,s,u
e,b,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,s,m
p,x,y,n,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,n,v,g
e,b,y,y,t,l,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,s,m
e,b,y,w,t,a,f,c,b,w,e,c,s,s,w,w,p,w,o,p,n,n,m
e,b,s,w,t,l,f,c,b,g,e,c,s,s,w,w,p,w,o,p,k,s,m
p,f,s,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,n,v,g
e,x,y,y,t,a,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
e,x,y,w,t,l,f,c
Solution

Subhanbasha answered on Jul 26 2021
# Installing the required package
install.packages("e1071")
# Calling the package
library(e1071)
# Question 1
# Reading data into R
Balance_Scale <- read.csv('Balance_Scale.csv', na.strings = '?')
# Summary of the data
summary(Balance_Scale)
# Checking completion
nrow(Balance_Scale[!complete.cases(Balance_Scale),])
# We are taking 70% of original data as training data set
sample_size <- floor(0.7 * nrow(Balance_Scale))
#Randomly select index of observations for training
training_index <- sample(nrow(Balance_Scale), size = sample_size, replace = FALSE)
train <- Balance_Scale[training_index,]
test <-...