5/19/2019 Math 270: Lab # 5...

Question

5/19/2019 Math 270: Lab # 5 Solution file:C:/Users/Nesi/Downloads/Lab5_DataExploration&SimpleLinearRegression XXXXXXXXXXhtml 1/13

Math 270: Lab # 5 Solution
YOUR NAME HERE
Due May 23, 2018

Task #1: Exploratory data analysis – Honey production in the US from XXXXXXXXXX
Task 2: Simple linear regression.
  2a. Briefly explore the data
  2b. Correlation
  2c. Build the linear model
  2d. Summarize and evaluate the linear regression
  2e. Analyze residuals.
  2f. Make predictions and evaluate the model on a new set of data.
Task 3: Understanding Linear Regression
Task 4: how is goodness of fit related to residuals?
  4a. Zero error
  4b. Non-zero error
  4c. Create your own

Task #1: Exploratory data analysis – Honey production in the US from XXXXXXXXXX

For this task you will want to import the honeyproduction.csv file available through the Canvas page.

It is very important that you understand the contents of this file and the meaning of each variable. This information is available through Kaggle (a site that hosts machine learning/data science competitions) at https://www.kaggle.com/jessicali9530/honey-production

The easiest method to import a dataset in RStudio is to use the “Environment” panel: choose Import Dataset and select the “text” option (as this is a csv file). An interface for you to select the file and fine tune settings will pop up. For cases where you’d want to automate this process, there are (as with most programming languages) also read() file commands.

There are several questions the Kaggle site posed for exploring this data. They are:
1. How has honey production yield changed from 1998 to 2012?
2. Over time, which states produce the most honey? Which produce the least? Which have experienced the most change in honey yield?
3. Does the data show any trends in terms of the number of honey producing colonies and yield per colony before vs. after 2006, which was when concern over Colony Collapse Disorder spread nationwide?
4. Are there any patterns that can be observed between total honey production and value of production over this time period?
5. How has value of production, which in some sense could be tied to demand, changed over this time period?

For this task, please write a report such as one you would turn in to a manager at a company. This means that you should have (1) succinct but meaningful writing, (2) nice visuals that are easily understood and whose relevance is summarized clearly, (3) no code visible (set echo=FALSE), (4) professional language and attention to layout and design. This task is meant to simulate (to a limited extent, with fairly clean data) the experience you’d have if you were working as a data scientist.

**Please note: I know about the code on Kaggle. If you refer to it and get ideas for analysis or visuals please refer to the author of the code within your report.

Task 2: Simple linear regression.

You don’t need to write this as an official report; just like a normal lab.

Consider the cars dataset built in to R.

2a. Briefly explore the data

Write a short preliminary analysis of the cars data set. This should include the following:
- A short paragraph describing the contents of this data set. [Use ?cars]
- A scatterplot of the data, side-by-side boxplots of the variables, and a description of visual trends evident from both plots.

2b. Correlation

Find the covariance and correlation of the speed to the stopping distance. What do these provide evidence for?

2c. Build the linear model

Build a linear model using the following command: CarsModel=lm(dist~speed,data=cars). This automatically creates the least-squares (best fit) linear model predicting the cars$dist variable from the cars$speed data. Then create a plot of the line on top of the scatterplot.
To do this, first generate the scatterplot, and then use the command abline(CarsModel).

To make plots prettier, let’s use a plotting library: ggplot. You’ll first have to install it. In your R console, enter the code install.packages("ggplot2"). We’ve avoided using it until now because the arguments for the function are a bit more opaque. The following “cheat sheet” may be helpful: https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf

    library(ggplot2)
    data("cars")

The following plot includes a 45% confidence interval around the predicted line. This accounts for some uncertainty regarding the slope and intercept by shading in a region that, given your data, has a 45% chance of containing the “true” regression line you’d find from the full population. Change the code to build a 90% confidence interval: what happens to the shaded region, and why does that make sense? Also, why does the confidence interval widen at the extremities?

    ggplot2::ggplot(cars,aes(x=speed,y=dist))+geom_point(color='#2980B9',size=4)+geom_smooth(method=lm,se=TRUE, color='#2C3E50',level=0.45)

    CarsModel=lm(dist~speed,data=cars)
    ggplot2::ggplot(cars,aes(CarsModel$residuals))+geom_histogram(binwidth=5)

2d. Summarize and evaluate the linear regression

Use the command summary(CarsModel) to get a numeric summary of the linear model you generated. Use it to write out answers to the following questions. As you do this, you may find the description of the linear model summary (midway down the page at http://blog.yhat.com/posts/r-lm-summary.html) to be helpful. It’s not mathematically rigorous, but gives some good rules of thumb for interpreting model fits. If you have questions about where those rules of thumb come from, please ask.

i. The middle 50% of residuals (model errors) are between what two values?
ii. The equation of the line (in $y = mx + b$ format) is…
iii. Interpret the Pr(>|t|) column.
iv. Give and fully interpret (as a proportion of variance explained) the R-squared value.
v. What is the residual standard error, and what does it mean?

2e. Analyze residuals.

Residuals are the error in estimating an output as a function of the input. This is often denoted as $\epsilon$: if $Y \approx f(X)$, then $Y - f(X) = \epsilon$.

In an “ideal” situation the $\epsilon$ follow a normal distribution and have a constant standard deviation. The reasoning is below:

In order to reasonably bound the error of your predictions based on a model, it’s useful to be able to rely on the normal distribution (so that you can be 95% sure the true value is within 2 standard deviations of the predicted value). We like the property of the normal distribution where errors are most frequently closest to 0 (meaning the points are close to the predicted value).

To test this assumption of normality, let’s start by examining the distribution of residuals by looking at a histogram. Use hist(CarsModel$residuals) and explain your findings: do the residuals look roughly normally distributed?

In order to be able to easily evaluate the accuracy of a prediction, most analyses assume the residuals are “homoscedastic” (their standard deviation does not rely on $X$, and so the error of prediction is constant across all inputs). To see if the residuals here are homoscedastic, plot the residuals vs. the speed variable. The residuals should form a (relatively) constant band around 0; they should not balloon out as speed gets higher, for example.
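
A minimal sketch of the residuals-vs-speed plot described above, assuming the model is built from the built-in cars data as in 2c. The split at the median speed and the names res, lowHalf and spread are illustrative choices and are not part of the lab; the comparison is only a rough aid to reading the plot, not a formal test.

```r
# Sketch: residuals vs. speed for the built-in cars data.
CarsModel <- lm(dist ~ speed, data = cars)
res <- residuals(CarsModel)

# Scatter of residuals against the predictor, with a reference line at 0
plot(cars$speed, res,
     xlab = "speed (mph)", ylab = "residual (ft)",
     main = "Residuals vs. speed")
abline(h = 0, lty = 2)

# Rough numeric aid (an assumption of this sketch, not part of the lab):
# compare the residual spread in the lower and upper halves of the speeds.
lowHalf <- cars$speed <= median(cars$speed)
spread <- c(low = sd(res[lowHalf]), high = sd(res[!lowHalf]))
print(spread)
```

If the two spreads differ a lot, or the band in the plot visibly fans out, that is a hint the constant-variance assumption deserves a closer look.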
Decide: are the residuals homoscedastic?

There are other useful plots and statistics for testing that residuals (or any data) are normally distributed. One common one is a Quantile-Quantile plot, which plots the ordered data (in this case residuals) against the theoretical quantiles for a normal distribution with the same mean and variance. Generally speaking, the 25% mark in the ordered data should match the 25% mark in the normal distribution: so if the residuals are normally distributed the QQ-plot should be a straight line. The QQ-plot is the second (of four useful plots, the rest of which we won’t discuss here) available from calling plot(CarsModel). Call this function and examine the QQ-plot: if there are any causes for concern, how do they match up with your histogram?

2f. Make predictions and evaluate the model on a new set of data.

Suppose you want to test the model against another set of data. I’ll have you use this set of data:

    testData=data.frame(speed=c(10,15,22,31,44,45,78,82,90),dist=c(12,50,40,90,154,139,286,284,345))

Use the command

    predictedDist=predict.lm(CarsModel,newdata=testData,interval="prediction",level=0.9)

and also output the predicted values and 90% intervals of prediction. These intervals are built using the normal distribution centered on the regression line with standard deviation equal to that of the residuals. Just as we did earlier in class, the normal distribution allows you to predict a range that contains the middle X% of values (e.g. 68% of values are within one standard deviation). So these intervals should contain approximately 9 of 10 predictions. How do the fit values compare to the actual distance values of the testData set? Are the actual values at least contained in the prediction intervals?

To visualize this, run the following code by setting eval=TRUE. I’m providing it to you because it’s unfortunately messy to include error bars in R. There are some packages you can install to make it easier, but alas… it’s never as easy as it should be.

    ## First we'll just plot the x values vs. their predicted values:
    plot(testData$speed,predictedDist[,1])  # predictedDist[,1] is the `fit' column of the prediction output.

    ## Next we will do the messy part: adding `error bars' based on the 90% prediction interval.
    ## I'll use the "arrows" function, which draws an arrow from (x0,y0) to (x1,y1)-- here we just
    ## make the y0 and y1 the lwr and upr bounds of the prediction interval for each x value.
    arrows(x0=testData$speed, y0=predictedDist[,2], x1=testData$speed, y1=predictedDist[,3], length=0.05, angle=90, code=3)  # length, angle, code are all just hacky ways of getting the "arrow" to instead have a flat head.

    ## Finally, let's add the `actual' stopping distance of the testData in red to see how the predictions match up:
    points(testData,col="red")

What conclusions do you draw? Does the model reasonably fit the new (test) data?

Task 3: Understanding Linear Regression

One way of thinking about models (like linear regression) is that they are functions that estimate the output ($Y$) values based on input data $X$. Hence we are estimating $Y \approx f(X)$. More precisely, $y_i = f(x_i) + \epsilon_i$ for each data point $(x_i, y_i)$, where $i$ is the index of the data point and $\epsilon_i$ is the (hopefully small) error, also called the residual, for that data point.

With linear regression, $f$ is linear, but the accuracy of the model all has to do with the
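
Reading a residual as the observed value minus the fitted value, the idea can be checked by hand in R. This is a minimal sketch under the assumption that the model is the cars regression from Task 2; the names b, m, fitted_by_hand, eps and rse are illustrative and not part of the lab.

```r
# Sketch: residuals as y_i - f(x_i) for the cars model.
CarsModel <- lm(dist ~ speed, data = cars)

# f(x_i): the fitted line evaluated at each observed speed
b <- coef(CarsModel)[["(Intercept)"]]
m <- coef(CarsModel)[["speed"]]
fitted_by_hand <- m * cars$speed + b

# epsilon_i = y_i - f(x_i)
eps <- cars$dist - fitted_by_hand

# These should agree with the residuals stored in the model object
print(all.equal(unname(residuals(CarsModel)), eps))

# Size of a typical error: residual standard error on n - 2 degrees of freedom
rse <- sqrt(sum(eps^2) / (nrow(cars) - 2))
print(rse)
```

The value of rse should match the "Residual standard error" line reported by summary(CarsModel), which ties the per-point errors back to the overall fit.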

Pooja · Accepted Answer

library(readxl)
data1 0)
Cholesterol_0 z) = ?#
#4#
n
