Lab 11. Logistic Regression
Introduction
Logistic regression uses one or more numeric variables to predict the probability of a binomial y variable - e.g. does the price of a shirt predict if people will (1) or won’t (0) buy a shirt, does the amount of water you give a parrot every day determine if it will (1) or will not (0) curse at you, etc. In this lab you will walk through interpreting regression with a fake data set, and then you will use a real dataset to solve the mystery of why so many people in Florida let their pet reptiles go loose in the wild.
Well, okay, the United States as a whole but like… it’s mostly a Florida problem TBH.
Learning Outcomes
By the end of this lab you should be able to:
· Use glm() to make a binomial regression
· Use predict() and round() to predict on new data
· Use the confusionMatrix() function from the caret package to determine accuracy of your predictions
· Determine if a continuous variable can predict a binomial variable
· Interpret the meaning of a positive or negative slope in a logistic regression
· Find good statistical talking points to yell at the pet trade industry?
Part 1: Fake Data
Part 1.1. Make The Data
For a logistic regression we need to make two groups - one that is a “positive” result (1) and one that is a “negative” result (0). We also need some sort of predictor x variable.
# simulate a "positive" group (y = 1) and a "negative" group (y = 0)
positive <- data.frame(y = 1, x = rnorm(n = 50, mean = 50, sd = 3))
negative <- data.frame(y = 0, x = rnorm(n = 50, mean = 42, sd = 3))
together <- rbind(positive, negative)
Let’s take a look at them first, using a density diagram. Use the as.character() function to remind R that 0/1 is a category, and the alpha=.7 argument to make things see-through.
library(ggplot2)
ggplot(together, aes(x = x, fill = as.character(y)))+
geom_density(alpha = .7)
This data could be anything, but you can see pretty clearly that the 0 and 1 categories are different. Some examples of what this data could be:
· People who get paid more are more likely to be happy (1) than unhappy (0).
· Reptiles that are bigger are more likely to be released to the wild (1) than kept forever (0).
· People with longer hair are more likely to be hippies (1) than not hippies (0).
· Greater vitamin D intake during the winter is more likely to make you happy (1) than unhappy (0).
In this sense, a logistic regression is very much like a t-test, but instead of saying “these are different” you’re asking “can I use x to predict these categories?”
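To make that connection concrete, here is a minimal sketch (assuming the together data frame built above) that runs the matching t-test on the same fake data - the t-test asks "do the two groups have different mean x?", while the regression below asks "can x predict which group an observation is in?":
# a Welch t-test comparing mean x between the y = 0 and y = 1 groups
t.test(x ~ y, data = together)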
Part 1.2 Plot the Data as A Bivariate Distribution
You can also use ggplot to view this data as a scatterplot just like you would otherwise, but to use the geom_smooth() function you will need to do a little bit of manipulation. Specifically, you are using the glm() function to build this model. This is a general-purpose function that is similar to lm() but more “general” - hence the name generalized linear model.
Because glm() can take more arguments, you have to specify that this is a binomial function, where you only have two options.
ggplot(together, aes(x = x, y = y))+
geom_point()+
stat_smooth(method="glm", method.args=list(family="binomial"))
Part 1.3 Model Building
Now you can see that there is a relationship! Let’s build a model to test that. You can use the glm() function for real here, making sure to specify that this is a binomial model. Just like before, you can use the summary() function to get more information.
model <- glm(y~x, data = together, family=binomial)
summary(model)
##
## Call:
## glm(formula = y ~ x, family = binomial, data = together)
##
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max
## XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX 0.000140 ***
## x           XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: XXXXXXXXXX on 99 degrees of freedom
## Residual deviance: XXXXXXXXXX on 98 degrees of freedom
## AIC: 27.478
##
## Number of Fisher Scoring iterations: 8
Now we can see that there is a statistically significant relationship - the p value for x is very small. Additionally, the slope for x is positive - that means that as x increased, so did the probability of y being 1.
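To make that positive slope easier to interpret, a small sketch (assuming the model object fit above): the slope is on the log-odds scale, and exponentiating it gives an odds ratio.
# exp() turns log-odds slopes into odds ratios - a value above 1 means
# each 1-unit increase in x multiplies the odds of y = 1 by that amount
exp(coef(model))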
Part 1.4 How Accurate?
You’ll notice that R doesn’t give you an R² value for logistic regressions. Instead, it is helpful to look at the actual accuracy of the model. To start with, use the predict() function to create a new column of what the model would predict for each x value. You have to add type = "response" here to get probabilities rather than log-odds.
together$predict <- predict.glm(model, newdata=together, type = "response")
You’ll notice it hasn’t rounded the values. That’s fine - each prediction is a probability, essentially how certain the model is that the observation is a 1. You can use the round() function to make a column of rounded values.
together$predict2 <- round(together$predict)
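Before reaching for the caret package in the next step, you can tally the rounded predictions against the truth with base R’s table() - a quick sketch using the columns just created:
# rows are predicted classes, columns are actual classes
table(predicted = together$predict2, actual = together$y)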
Now, we’re going to make a confusion matrix - that’s a fancy term for tallying up how many of these predictions were wrong in either direction. To do this, you will need to use the install.packages() function to install caret and e1071. Once installed, load them using library():
library(caret)
library(e1071)
For the confusion matrix, we have to do a little bit of wiggling. The function gets fussy when given what it thinks is the wrong kind of data - it wants factors, not numbers. Use the as.factor() function to get it to stop being silly.
confusionMatrix(data=as.factor(together$predict2), reference=as.factor(together$y))
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  0  1
##          0 47  2
##          1  3 48
##
##                Accuracy : 0.95
##                  95% CI : (0.887, 0.9836)
##     No Information Rate : 0.5
##     P-Value [Acc > NIR] : <2e-16
##
##                   Kappa : 0.9
##
##  Mcnemar's Test P-Value : 1
##
##             Sensitivity : 0.9400
##             Specificity : 0.9600
##          Pos Pred Value : 0.9592
##          Neg Pred Value : 0.9412
##              Prevalence : 0.5000
##          Detection Rate : 0.4700
##    Detection Prevalence : 0.4900
##       Balanced Accuracy : 0.9500
##
##        'Positive' Class : 0
##
First off, overall accuracy here is pretty high! It has calculated the confidence intervals for you, and the accuracy is between 0.887 and 0.9836 (your results will differ slightly).
As far as which groups were wrong, look at the little square of values at the top: 2 values were predicted as 0 when they were actually 1, and 3 values were predicted as 1 when they were actually 0.
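You can also confirm the headline accuracy number by hand - a one-line sketch using the columns created above, since accuracy is just the fraction of predictions that match the truth:
# proportion of rounded predictions that equal the true class
mean(together$predict2 == together$y)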
Part 1.5 Summarize
If you were asked to summarize this model, you might say something like: the x value was a strong predictor of y, and was statistically significant (p < XXXXXXXXXX). The overall model accuracy was high (0.95), and it seems that this continuous variable has...
Solution

QUESTION 1: For any logistic regression to determine if a particular x variable influenced reptile release, which column would be your y axis?
· kraus_c_edd_1999
QUESTION 2: You will not use col_sp, col_sp_id, col_class, col_order, clade, col_family, col_genus, or lht_source as variables in any of your logistic regressions. Why?
· Because they are categorical data, not numeric variables.
QUESTION 3: Which numeric variable do you think is the most likely to increase the likelihood someone might release a reptile? Explain your answer.
· I chose the two variables below as the most likely to increase the likelihood that someone might release a reptile: imports of a species are a big deal in the trade market, and an adult of a gigantic species has a higher likelihood of release.
1. adult_svl_cm: adult snout-vent length (nose to tail tip) in centimeters
2. sum_qty: total number of imports for a given species from the LEMIS database from 1999 to 2012 (see manuscript for details)
QUESTION 4: Which variable do you think is the most likely to decrease the likelihood someone might release a reptile? Explain your answer.
· I chose age_maturity_d. The number of days a species takes to reach sexual maturity does not seem like something that would push an owner toward release, so I feel this variable is the most likely to decrease the likelihood of someone releasing a reptile.
QUESTION 5: Copy-paste your glm code for each of the variables you listed in the previous question.
R code:
# load the packages needed for confusionMatrix()
library(caret)
library(e1071)
data1 <- read.csv("reptilerelease-ud3chnci.csv")
# model 1: does adult body size or import volume predict release?
model_1 <- glm(kraus_c_edd_1999 ~ adult_svl_cm + sum_qty, data = data1, family = binomial)
summary(model_1)
data1$predict <- predict.glm(model_1, newdata = data1, type = "response")
data1$predict2 <- round(data1$predict)
confusionMatrix(data = as.factor(data1$predict2), reference = as.factor(data1$kraus_c_edd_1999))
# model 2: does age at sexual maturity predict release?
model_2 <- glm(kraus_c_edd_1999 ~ age_maturity_d, data = data1, family = binomial)
summary(model_2)
data1$predict_M2 <- predict.glm(model_2, newdata = data1, type = "response")
data1$predict_M2a <- round(data1$predict_M2)
confusionMatrix(data = as.factor(data1$predict_M2a), reference = as.factor(data1$kraus_c_edd_1999))
QUESTION 6: Is there a statistically significant relationship between your x variables and the likelihood of reptile release?
· According to model_1, yes: a statistically significant relationship was found between adult_svl_cm, sum_qty, and kraus_c_edd_1999.
· According to model_2, no significant relationship was found between age_maturity_d and kraus_c_edd_1999.
QUESTION 7: Look at your answers for questions 3 & 4. Do your analyses support your answers? Use your slope estimates to answer your question.
Now we can see that there is no significant relationship - the p value for x is...
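To tie the slope estimates back to the predictions in questions 3 and 4, a short sketch (assuming the model_1 and model_2 objects fit above) - the sign of each coefficient says whether the variable increases or decreases the likelihood of release, and exp() gives the odds-ratio version:
# positive coefficients increase the odds of release; negative ones decrease it
coef(model_1)
exp(coef(model_1))
coef(model_2)
exp(coef(model_2))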