Lab 11. Logistic Regression
Introduction
Logistic regression uses one or more numeric variables to predict the probability of a binomial y variable - e.g. does the price of a shirt predict if people will (1) or won’t (0) buy a shirt, does the amount of water you give a parrot every day determine if it will (1) or will not (0) curse at you, etc. In this lab you will walk through interpreting regression with a fake data set, and then you will use a real dataset to solve the mystery of why so many people in Florida let their pet reptiles go loose in the wild.
Well, okay, the United States as a whole but like… it’s mostly a Florida problem TBH.
Learning Outcomes
By the end of this lab you should be able to:
· Use glm() to make a binomial regression
· Use predict() and round() to predict on new data
· Use the confusionMatrix() function from the caret package to determine accuracy of your predictions
· Determine if a continuous variable can predict a binomial variable
· Interpret the meaning of a positive or negative slope in a logistic regression
· Find good statistical talking points to yell at the pet trade industry?
Part 1: Fake Data
Part 1.1. Make The Data
For a logistic regression we need to make two groups - one that is a “positive” result (1) and one that is a “negative” result (0). We also need some sort of predictor x variable.
# simulate a "positive" group (y = 1) and a "negative" group (y = 0)
positive <- data.frame(y = 1, x = rnorm(n = 50, mean = 50, sd = 3))
negative <- data.frame(y = 0, x = rnorm(n = 50, mean = 42, sd = 3))
together <- rbind(positive, negative)
Let’s take a look at them first, using a density diagram. Use the as.character() function to remind R that 0/1 is a category, and the alpha=.7 argument to make things see-through.
library(ggplot2)
ggplot(together, aes(x = x, fill = as.character(y)))+
geom_density(alpha = .7)
This data could be anything, but you can see pretty clearly that the 0 and 1 categories are different. Some examples of what this data could be:
· People who get paid more are more likely to be happy (1) than unhappy (0).
· Reptiles that are bigger are more likely to be released to the wild (1) than kept forever (0).
· People with longer hair are more likely to be hippies (1) than not hippies (0).
· Greater vitamin D intake during the winter is more likely to make you happy (1) than unhappy (0).
In this sense, a logistic regression is very much like a t-test, but instead of saying “these are different” you’re asking “can I use x to predict these categories?”
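To make that connection concrete, here is a minimal sketch (assuming the together data frame built above) that runs the matching t-test on the same fake data - the t-test asks "do the two groups have different mean x?", while the regression below asks "can x predict which group an observation is in?":
# a Welch t-test comparing mean x between the y = 0 and y = 1 groups
t.test(x ~ y, data = together)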
Part 1.2 Plot the Data as A Bivariate Distribution
You can also use ggplot to view this data as a scatterplot just like you would otherwise, but to use the geom_smooth() function you will need to do a little bit of manipulation. Specifically, you are using the glm() function to build this model. This is a general-purpose function that is similar to lm() but more “general” - hence the name generalized linear model.
Because glm() can take more arguments, you have to specify that this is a binomial function, where you only have two options.
ggplot(together, aes(x = x, y = y))+
geom_point()+
stat_smooth(method="glm", method.args=list(family="binomial"))
Part 1.3 Model Building
Now you can see that there is a relationship! Let’s build a model to test that. You can use the glm() function for real here, making sure to specify that this is a binomial model. Just like before, you can use the summary() function to get more information.
model <- glm(y~x, data = together, family=binomial)
summary(model)
##
## Call:
## glm(formula = y ~ x, family = binomial, data = together)
##
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max
## XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX 0.000140 ***
## x           XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: XXXXXXXXXX on 99 degrees of freedom
## Residual deviance: XXXXXXXXXX on 98 degrees of freedom
## AIC: 27.478
##
## Number of Fisher Scoring iterations: 8
Now we can see that there is a statistically significant relationship - the p value for x is very small. Additionally, the slope for x is positive - that means that as x increased, so did the probability of y being 1.
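To make that positive slope easier to interpret, a small sketch (assuming the model object fit above): the slope is on the log-odds scale, and exponentiating it gives an odds ratio.
# exp() turns log-odds slopes into odds ratios - a value above 1 means
# each 1-unit increase in x multiplies the odds of y = 1 by that amount
exp(coef(model))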
Part 1.4 How Accurate?
You’ll notice that R doesn’t give you an R² value for logistic regressions. Instead, it is helpful to look at the actual accuracy of the model. To start with, use the predict() function to create a new column of what the model would predict for each x value. You have to add type = "response" here to get probabilities rather than log-odds.
together$predict <- predict.glm(model, newdata=together, type = "response")
You’ll notice it hasn’t rounded the values. That’s fine - each prediction is a probability, essentially how certain the model is that the observation is a 1. You can use the round() function to make a column of rounded values.
together$predict2 <- round(together$predict)
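Before reaching for the caret package in the next step, you can tally the rounded predictions against the truth with base R’s table() - a quick sketch using the columns just created:
# rows are predicted classes, columns are actual classes
table(predicted = together$predict2, actual = together$y)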
Now, we’re going to make a confusion matrix - that’s a fancy term for tallying up how many of these predictions were wrong in either direction. To do this, you will need to use the install.packages() function to install caret and e1071. Once installed, load them using library():
library(caret)
library(e1071)
For the confusion matrix, we have to do a little bit of wiggling. The function gets fussy when given what it thinks is the wrong kind of data - it wants factors, not numbers. Use the as.factor() function to get it to stop being silly.
confusionMatrix(data=as.factor(together$predict2), reference=as.factor(together$y))
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  0  1
##          0 47  2
##          1  3 48
##
##                Accuracy : 0.95
##                  95% CI : (0.887, 0.9836)
##     No Information Rate : 0.5
##     P-Value [Acc > NIR] : <2e-16
##
##                   Kappa : 0.9
##
##  Mcnemar's Test P-Value : 1
##
##             Sensitivity : 0.9400
##             Specificity : 0.9600
##          Pos Pred Value : 0.9592
##          Neg Pred Value : 0.9412
##              Prevalence : 0.5000
##          Detection Rate : 0.4700
##    Detection Prevalence : 0.4900
##       Balanced Accuracy : 0.9500
##
##        'Positive' Class : 0
##
First off, overall accuracy here is pretty high! It has calculated the confidence intervals for you, and the accuracy is between 0.887 and 0.9836 (your results will differ slightly).
As far as which groups were wrong, look at the little square of values at the top: 2 values were predicted as 0 when they were actually 1, and 3 values were predicted as 1 when they were actually 0.
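You can also confirm the headline accuracy number by hand - a one-line sketch using the columns created above, since accuracy is just the fraction of predictions that match the truth:
# proportion of rounded predictions that equal the true class
mean(together$predict2 == together$y)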
Part 1.5 Summarize
If you were asked to summarize this model, you might say something like: the x value was a strong predictor of y, and was statistically significant (p < XXXXXXXXXX). The overall model accuracy was high (0.95), and it seems that this continuous variable has...
Solution

QUESTION 1: For any logistic regression to determine if a particular x variable influenced reptile release, which column would be your y axis?
· kraus_c_edd_1999
QUESTION 2: You will not use col_sp, col_sp_id, col_class, col_order, clade, col_family, col_genus, or lht_source as variables in any of your logistic regressions. Why?
· Because they are categorical data, not numeric variables.
QUESTION 3: Which numeric variable do you think is the most likely to increase the likelihood someone might release a reptile? Explain your answer.
· I chose the two variables below as the most likely to increase the likelihood that someone might release a reptile: imports of a species are a big deal in the trade market, and an adult of a gigantic species has a higher likelihood of release.
1. adult_svl_cm: adult snout-vent length (nose to tail tip) in centimeters
2. sum_qty: total number of imports for a given species from the LEMIS database from 1999 to 2012 (see manuscript for details)
QUESTION 4: Which variable do you think is the most likely to decrease the likelihood someone might release a reptile? Explain your answer.
· I chose age_maturity_d. The number of days a species takes to reach sexual maturity does not seem like something that would push an owner toward release, so I feel this variable is the most likely to decrease the likelihood of someone releasing a reptile.
QUESTION 5: Copy-paste your glm code for each of the variables you listed in the previous question.
R code:
# load the packages needed for confusionMatrix()
library(caret)
library(e1071)
data1 <- read.csv("reptilerelease-ud3chnci.csv")
# model 1: does adult body size or import volume predict release?
model_1 <- glm(kraus_c_edd_1999 ~ adult_svl_cm + sum_qty, data = data1, family = binomial)
summary(model_1)
data1$predict <- predict.glm(model_1, newdata = data1, type = "response")
data1$predict2 <- round(data1$predict)
confusionMatrix(data = as.factor(data1$predict2), reference = as.factor(data1$kraus_c_edd_1999))
# model 2: does age at sexual maturity predict release?
model_2 <- glm(kraus_c_edd_1999 ~ age_maturity_d, data = data1, family = binomial)
summary(model_2)
data1$predict_M2 <- predict.glm(model_2, newdata = data1, type = "response")
data1$predict_M2a <- round(data1$predict_M2)
confusionMatrix(data = as.factor(data1$predict_M2a), reference = as.factor(data1$kraus_c_edd_1999))
QUESTION 6: Is there a statistically significant relationship between your x variables and the likelihood of reptile release?
· According to model_1, yes: a statistically significant relationship was found between adult_svl_cm, sum_qty, and kraus_c_edd_1999.
· According to model_2, no significant relationship was found between age_maturity_d and kraus_c_edd_1999.
QUESTION 7: Look at your answers for questions 3 & 4. Do your analyses support your answers? Use your slope estimates to answer your question.
Now we can see that there is no significant relationship - the p value for x is...
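To tie the slope estimates back to the predictions in questions 3 and 4, a short sketch (assuming the model_1 and model_2 objects fit above) - the sign of each coefficient says whether the variable increases or decreases the likelihood of release, and exp() gives the odds-ratio version:
# positive coefficients increase the odds of release; negative ones decrease it
coef(model_1)
exp(coef(model_1))
coef(model_2)
exp(coef(model_2))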