Great Deal! Get Instant $10 FREE in Account on First Order + 10% Cashback on Every Order Order Now

5/19/2019 Math 270: Lab # 5 Solution file:///C:/Users/Nesi/Downloads/Lab5_DataExploration&SimpleLinearRegression XXXXXXXXXXhtml 1/13 Math 270: Lab # 5 Solution YOUR NAME HERE Due May 23, 2018 Task #1:...

1 answer below »
5/19/2019 Math 270: Lab # 5 Solution
file:
C:/Users/Nesi/Downloads/Lab5_DataExploration&SimpleLinearRegression XXXXXXXXXXhtml 1/13
Math 270: Lab # 5 Solution
YOUR NAME HERE
Due May 23, 2018
Task #1: Exploratory data analysis – Honey production in the US from XXXXXXXXXX
Task 2: Simple linear regression.
2a. Briefly explore the data
2b. Co
elation
2c . Build the linear model
2d. Summarize and evaluate the linear regression
2e. Analyze residuals.
2f. Make predictions and evaluate the model on a new set of data.
Task 3: Understanding Linear Regression
Task 4: how is goodness of fit related to residuals?
4a. Zero e
o
4b. Non-zero e
o
4c. Create your own
Task #1: Exploratory data analysis – Honey
production in the US from XXXXXXXXXX
For this task you will want to import the honeyproduction.csv file available through the Canvas page.
It is very important that you understand the contents of this file and the meaning of each variable. This
information is available through Kaggle (a site that hosts machine learning/data science competitions) at
https:
www.kaggle.com/jessicali9530/honey-production (https:
www.kaggle.com/jessicali9530/honey-production)
The easiest method to import a dataset in RStudio is to use the “Environment” panel – Import Dataset and choose
the “text” option (as this is a csv file). An interface for you to select the file and fine tune settings will pop up. Fo
cases where you’d want to automate this process, there are (as with most programming languages) also read()
file commands.
There are several questions the Kaggle site posed for exploring this data. They are:
1. How has honey production yield changed from 1998 to 2012?
2. Over time, which states produce the most honey? Which produce the least? Which have experienced the
most change in honey yield?
3. Does the data show any trends in terms of the number of honey producing colonies and yield per colony
efore vs. after 2006, which was when concern over Colony Collapse Disorder spread nationwide?
4. Are there any patterns that can be observed between total honey production and value of production ove
this time period?
5. How has value of production, which in some sense could be tied to demand, changed over this time period?
https:
www.kaggle.com/jessicali9530/honey-production
5/19/2019 Math 270: Lab # 5 Solution
file:
C:/Users/Nesi/Downloads/Lab5_DataExploration&SimpleLinearRegression XXXXXXXXXXhtml 2/13
For this task, please write a report such as one you would turn in to a manager at a company. This means
that you should have (1) succinct but meaningful writing, (2) nice visuals that are easily understood and
whose relevance is summarized clearly, (3) no code visible (set echo=FALSE ) (4) professional language and
attention to layout and design. This task is meant to simulate (to a limited extent, with fairly clean data) the
experience you’d have if you were working as a data scientist.**
Please note: I know about the code on Kaggle. If you refer to it and get ideas for analysis or visuals please refer to
the author of the code within your report.
Task 2: Simple linear regression.
You don’t need to write this as an official report– just like a normal lab.
Consider the cars dataset built in to R.
2a. Briefly explore the data
Write a short preliminary analysis of the cars data set. This should include the following: - A short paragraph
describing the contents of this data set. [Use ?cars ] - A scatterplot of the data, side-by-side boxplots of the
variables, and a description of visual trends evident from both plots.
2b. Co
elation
Find the covariance and co
elation of the speed to the stopping distance. What do these provide evidence for?
2c . Build the linear model
Build a linear model using the following command: CarsModel=lm(dist~speed,data=cars) . This automatically
creates the least-squares (best fit) linear model predicting the cars$dist variable from the cars$speed data.
Then create a plot of the line on top of the scatterplot. To do this, first generate the scatterplot, and then use the
command abline(CarsModel) .
To make plots prettier, let’s use a plotting li
ary: ggplot. You’ll first have to install it. In your R console,
enter the code install.packages("ggplot2") We’ve avoided using it until now because the arguments fo
the function are a bit more opaque. The following “cheat sheet” may be helpful:
https:
www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
(https:
www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf)
li
ary(ggplot2)
data("cars")
The following plot includes a 45% confidence interval around the predicted line– this accounts for some
uncertainty regarding the slope and intercept by shading in a region that, given your data, has a 45% chance of
containing the “true” regression line you’d find from the full population. Change the code to build a 90%
confidence interval– what happens to the shaded region, and why does that make sense? Also, why does
the confidence interval widen at the extremities?
ggplot2::ggplot(cars,aes(x=speed,y=dist))+geom_point(color='#2980B9',size=4)+geom_smooth(method=
lm,se=TRUE, color='#2C3E50',level=0.45)
https:
www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
5/19/2019 Math 270: Lab # 5 Solution
file:
C:/Users/Nesi/Downloads/Lab5_DataExploration&SimpleLinearRegression XXXXXXXXXXhtml 3/13
CarsModel=lm(dist~speed,data=cars)

ggplot2::ggplot(cars,aes(CarsModel$residuals))+geom_histogram(binwidth=5)
5/19/2019 Math 270: Lab # 5 Solution
file:
C:/Users/Nesi/Downloads/Lab5_DataExploration&SimpleLinearRegression XXXXXXXXXXhtml 4/13
2d. Summarize and evaluate the linear regression
Use the command summary(CarsModel) to get a numeric summary of the linear model you generated. Use it to
write out answers to the following questions. As you do this, you may find the description of the linear model
summary (midway down the page at http:
log.yhat.com/posts
-lm-summary.html (http:
log.yhat.com/posts
-lm-
summary.html)) to be helpful. It’s not mathematically rigorous, but gives some good rules of thumb for interpretting
model fits– if you have questions aobut where those rules of thumb come from, please ask.
i. The middle 50% of residuals (model e
ors) are between what two values?
ii. The equation of the line (in format) is…
iii. Interpret the Pr(>|t|) column.
iv. Give and fully interpret (as a proportion of variance explained) the R-squared value.
v. What is the residual standard e
or, and what does it mean?
2e. Analyze residuals.
Residuals are the e
or in estimating an output as a function of the input. This is often denoted as : if ,
then .
In an “ideal” situation the follow a normal distribution and have a constant standard deviation. The reasoning is
elow:
y = mx +
ϵ Y ≈ f(X)
Y − f(X) = ϵ
ϵ
http:
log.yhat.com/posts
-lm-summary.html
5/19/2019 Math 270: Lab # 5 Solution
file:
C:/Users/Nesi/Downloads/Lab5_DataExploration&SimpleLinearRegression XXXXXXXXXXhtml 5/13
In order to reasonably bound the e
or of your predictions based on a model, it’s useful to be able to rely on the
normal distribution (so that you can be 95% sure the true value is within 2 standard deviations of the predicted
value). We like the property of the normal distribution where e
ors are most frequently closest to 0 (meaning the
points are close to the predicted value).
To test this assumption of normality, let’s start by examining the distribution of residuals– we’ll do this by looking at
a histogram. Use hist(CarsModel$residuals) and explain your findings: do the residuals look roughly normally
distributed?
In order to be able to easily evaluate the accuracy of a prediction, most analyses assume the residuals are
“homoscedastic” (their standard deviation does not rely on , and so the e
or of prediction is constant across all
inputs). To see if the residuals here are homoscedastic, plot the residuals vs. the speed variable. The
esiduals should form a (relatively) constant band around 0– they should not balloon out as speed gets higher, fo
example. Decide: are the residuals homoscedastic?
There are other useful plots and statistics for testing that residuals (or any data) are normally distributed. One
common one is a Quantile-Quantile plot, which plots the ordered data (in this case residuals) against the
theoretical quantiles for a normal distribution with the same mean and variance. Generally speaking, the 25% mark
in the ordered data should match the 25% mark in the normal distribution: so if the residuals are normally
distributed the QQ-plot should be a straight line. The QQ-plot is the second (of four useful plots the rest of which
we won’t discuss here) available from calling plot(CarsModel) . Call this function and examine the QQ-plot: If
there are any causes for concern, how do they match up with your histogram?
2f. Make predictions and evaluate the model on a
new set of data.
Suppose you want to test the model against another set of data.
I’ll have you use this set of data:
testData=data.frame(speed=c(10,15,22,31,44,45,78,82,90),dist=c(12,50,40,90,154,139,286,284,345))
Use the command
predictedDist=predict.lm(CarsModel,newdata=testData,interval="prediction",level=0.9) and also output
the predicted values and 90% intervals of prediction. These intervals are built using the normal
distribution centered on the regression line with standard deviation equal to that of the residuals. Just as
we did earlier in class, the normal distribution allows you to predict a range that contains the middle X% of
values (e.g 68% of values are within one standard deviation). So these intervals should contain
approximately 9 of 10 predictions. How do the fit values compare to the actual distance values of the
testData set? Are the actual values at least contained in the prediction intervals?
To visualize this, run the following code by setting eval=TRUE . I’m providing it to you because it’s unfortunately
messy to include e
or bars in R. There are some packages you can install to make it easier, but alas… it’s neve
as easy as it should be.
x
5/19/2019 Math 270: Lab # 5 Solution
file:
C:/Users/Nesi/Downloads/Lab5_DataExploration&SimpleLinearRegression XXXXXXXXXXhtml 6/13
##First we'll just plot the x values vs. their predicted
values:

plot(testData$speed,predictedDist[,1]) #predictedDist[,1] is the `fit' column
of the prediction output.

##Next we will do the messy part: adding `e
or bars' based on the 90% prediction interval. I'l
l use the "a
ow" function which draws an a
ow from (x0,y0) to (x1,y1)-- here we just make the
y0 and y1 the lwr and upr bounds of the prediction interval for each x value.


a
ows(x0=testData$speed, y0=predictedDist[,2], x1=testData$speed,
y1=predictedDist[,3], length=0.05, angle=90, code=3)

#length, angle, code are all just hacky ways of getting the "a
ow" to instead have a flat head.


## Finally, let's add the `actual' stopping distance of the testData in red to see how the predi
ctions match up:

points(testData,col="red")
What conclusions do you draw? Does the model reasonably fit the new (test) data?
Task 3: Understanding Linear Regression
One way of thinking about models (like linear regression), is that they are functions that estimate the output ( )
values based on input data . Hence we are estimating . More precisely for each data point
where is the index of the data point and is the (hopefully small) e
or– also called residual
for that data point.
With linear regression, is linear– but the accuracy of the model all has to do with the
Answered Same Day May 19, 2021

Solution

Pooja answered on May 28 2021
139 Votes
li
ary(readxl)
data1<-read_excel(path="C:/Users/HP/Desktop/Book2.xlsx")
summary(data1)
#task 2#
HD = as.data.frame(data1)
DiagnosedHD_0 <- subset(data1, DiagnosedHD <= 0)
DiagnosedHD_1 <- subset(data1, DiagnosedHD > 0)
Cholesterol_0 <- mean(DiagnosedHD_0$Cholesterol)
Cholesterol_1 <- mean(DiagnosedHD_1$Cholesterol)
Cholesterol_0
Cholesterol_1
thalamus_table <- table(data1$Thalamus)
prop.table(thalamus_table)
#Task 3#
#2#
n <- length(data1)
B <- 1000
R <- 100
#Doing the...
SOLUTION.PDF

Answer To This Question Is Available To Download

Related Questions & Answers

More Questions »

Submit New Assignment

Copy and Paste Your Assignment Here