
# MATH211 Spring 2020

NAME:
Favorite Celebrity? Least Favorite Celebrity?:
Complete this exam on your own paper, and submit your work as a PDF into the Exam01 assignment
in Canvas. In case this doesn’t work, email me a copy at XXXXXXXXXX. In order to compute
integrals, use only techniques that we have addressed in this class.
Include this Honor Code statement in your submission:
The work attached represents my own efforts to respond to the writing prompts. I did not use any
resources other than worksheets, lecture notes, the textbook, or tools posted in Canvas. I did not look
up answers on the internet nor did I get the answer from another person.
Name: Signature: Date:
PART 1 Do questions 1 - 3.
1A Use dataset #1 posted in Exam 03 Send-Out Survey in Canvas. Be sure to read the description of the
dataset and what we are trying to predict.
(a) Choose the variable that you think is the most significant and use R to generate a linear regression
model for that variable.
(b) Construct a confidence interval for the slope and explain what this means.
1B Still using dataset #1 from Exam 03 Send-Out Survey, and still predicting the same variable:
(a) Using R, create a regression model using the Backwards Elimination strategy. Eliminate variables
until you are left with three significant variables. Show your results as you eliminate variables
one-by-one.
(b) Write the final equation for your regression model and explain what each slope means.
(c) Check the conditions necessary for this regression model to be valid. Are you convinced that a
linear model is appropriate?
(d) Find the residual for the Data Point shown in the Canvas survey. Explain what this residual
means.
2 (a) Using R, continue to create a multiple regression model using the Forward Selection Strategy.
Your final model should involve three predictor variables. Choose the variables one at a time. At
each step, show the summary results from R. If you find that a variable is not significant, explain
why, then do not use that variable, but choose a different one instead.
(b) Write the final equation for your regression model and explain whether each variable increases or
decreases the probability of the variable we are trying to predict.
(c) Find the residual for the Data Point shown in the Canvas survey. Explain what this residual
means in this context.
PART 2 You must do one question from each section that will count towards Exam 3. You may
choose to do more than one. Any credit you earn on these questions will count as extra credit towards
Exam01 Do at least one of these questions.
Tree diagrams? Independence vs Mutually exclusive (record shop with information and computation?)
Sampling techniques?
Q1 You are in charge of keeping the catalogue for a large Jazz Record store, and you collect the following information:
• 10% of the albums have only one solo musician.
• 42% of the albums have a saxophone.
• 82% of albums have a piano.
• 34% of albums have both saxophone and piano.
• Of the albums with only one solo musician, 77% of them have a piano.
You choose a record at random from the store.
1. What is the probability that you choose an album that has either saxophone or piano?
2. What is the probability that you choose an album with only one solo musician playing piano?
3. Given that you chose an album with piano, what is the probability that it also has saxophone?
Q2 For each of the following pairs of hypothetical data sets, decide which mean is larger and which standard
deviation is larger. Explain your reasoning.
• Price of menu items at a fancy restaurant in Seattle vs. price of menu items at a fast food restaurant in Seattle
• Weight of pet cats vs. weight of pet dogs
• Salary of teachers in Seattle vs. salary of tech workers in Seattle
Exam02 Do at least one of these questions. For both, refer to the data sets in Exam 03 Send-Out Survey in
Canvas. Be sure to
Q3 Invent a possible research question that can be answered using a hypothesis test on difference of
proportions using Data Set #3. Clearly state that question, then use your sample to carry out a
formal hypothesis test. Analyze the types of errors and their consequences.
Q4 Invent a possible research question that can be answered using a hypothesis test on difference of means
using Data Set #4. Clearly state that question, then use your sample to carry out a formal hypothesis
test. Analyze the types of errors and their consequences.

Data Set #1
Here is a dataset about the performance of Professional Baseball Teams from XXXXXXXXXX.
[...]raw.githubusercontent.com/trevorpelletier/[...]2020Spring/master/baseball_lin.csv")
Build a model that predicts the number of wins the team gets.
Use your model to predict the number of wins for a team with the following stats:
League = AL
runs = 513
on_base = 0.298
batting_average = 0.339
opp_runs = 698
opp_on_base = 0.313
opp_slugging = 0.401
%======================%
Data Set #2
Here is a data set showing Seattle information about Officer Involved Shootings.
[...]raw.githubusercontent.com/trevorpelletier/[...]2020Spring/master/spd_ois.csv")
Build a model that predicts the probability that the subject was killed.
Use your model to predict the probability of the subject being killed in the following situation:
Officer: White Male, 8 years of experience, not injured.
Subject: NonWhite Male, 25 years old, no weapon.
%======================%
Data Set #3
A "Terry Stop" is a rule in the US that allows police officers to briefly detain a person based on "reasonable suspicion" of involvement in criminal activity. This is commonly known as "stop and frisk." Here is information about Terry Stops in Seattle.
terry[...]raw.githubusercontent.com/trevorpelletier/[...]2020Spring/master/terry_stops.csv")
This is a very large data set, so before you use it, generate a sample with the code terry_sample <- terry[sample(nrow(terry),N),]. Choose a suitable sample size N.
Use your sample to build a model that predicts the probability that the subject was arrested.
%======================%
Data Set #4
Here is a data set showing education, crime, and political information about each US state.
[...]raw.githubusercontent.com/trevorpelletier/[...]2020Spring/master/state_info.csv")

Data Documentation
Churches
  id variables
    church_id                church identification
  model variables
    volume                   volume in cubic meters
    length                   length in meters
    width                    width in meters
    avg_height               average height in meters
    surface_area             inside total surface area in square meters
    ground_surface_area      ground surface area in square meters
    reverb_time              reverberation (echo) time in seconds

Cereal
  model variables
    Shelf                    recommended grocery store display shelf
    Calories                 calories per serving
    Protien                  grams of protein per serving
    Fat                      grams of fat per serving
    Sodium                   milligrams of sodium per serving
    Fiber                    grams of fiber per serving
    Carbohydrates            grams of carbohydrates per serving
    Sugars                   grams of sugar per serving
    Potassium                milligrams of potassium per serving
    Serving_size             number of cups per serving

Baseball
  id variables
    Team                     which team
    Year                     which year
  model variables
    League                   either National League or American League
    Playoffs                 indicates if the team made the playoffs
    wins                     games won
    runs                     total runs (points)
    on_base                  how often players get on base
    slugging                 how often players get a good hit
    batting_avg              how often players get any hit
    opp_runs                 runs (points) scored by opponents
    opp_on_base              how often opponent gets on base
    opp_slugging             how often opponent gets a good hit

State Info (data from 2014~2016)
  id variables
    State                    which state
  model variables
    median_household_income  median household income
    avg_teacher_salary       average teacher salary
    pct_hs_deg               percent with high school degree
    pct_unemployed           percent unemployed
    pct_cities               percent living in cities
    pct_nonwhite             percent non-white
    pct_trump                percent voted for Trump in 2016 election
    crime_rate               crimes per XXXXXXXXXX people
    vcrime_rate              violent crimes per XXXXXXXXXX people
    hcrime_rate              hate crimes per XXXXXXXXXX people

Terry Stops
  model variables
    officer_gender           gender of officer
    officer_race             race of officer (reported as white or non-white)
    subject_gender           subject perceived gender
    subject_race             subject perceived race
    subject_age              subject age range
    weapon                   subject weapon
    frisk_flag               was the subject frisked?
    arrest_flag              was the subject arrested?

Seattle PD Officer Involved in Shooting
  model variables
    officer_gender           gender of officer
    officer_race             officer race
    spd_years                officer years of experience
    officer_injured          was the officer injured?
    subject_gender           subject gender
    subject_race             subject race
    subject_age              subject age
    subject_weapon           did the subject have a weapon?
    subject_fatal            was the subject killed?

Useful R Commands
Here is a list of R commands that will be useful for you on this test. Text in all capital letters is text that you
will edit for your specific problem.
Generate and Count Subsets and Samples
subset(DATA, CONDITION) #creates a subset of a data set based on a condition
#For example: april <- subset(sea, month == "4")
nrow(DATA) #counts the total number of rows in a data set
DATA[sample(nrow(DATA), SIZE), ] #creates a random sample of a data set of a determined size
#For example: rsample <- april[sample(nrow(april), 50), ]
sum(CONDITION) #counts the number of data points that meet a given condition
#For example: sum(april$TMAX > 70) counts the number of April days whose high temperature was higher than 70.
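
As a rough illustration of the subset, sample, and count commands above, here is a small sketch that can be run without the course files. It assumes R's built-in `airquality` data set as a stand-in for the `sea` data used in the examples; the names `may`, `may_sample`, and `warm_days` are made up for this sketch.

```r
# airquality ships with R; Month is numeric (5 = May) and Temp is the daily high in Fahrenheit.
may <- subset(airquality, Month == 5)        # keep only the May rows
n_may <- nrow(may)                           # number of May days in the data set

set.seed(211)                                # fix the seed so the same sample is drawn each run
may_sample <- may[sample(nrow(may), 10), ]   # random sample of 10 May days

warm_days <- sum(may$Temp > 70)              # count May days with a high above 70

print(n_may)
print(nrow(may_sample))
print(warm_days)
```

Calling `set.seed()` is optional, but it makes the random sample reproducible when the work has to be shown.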

Measure Statistics
mean(DATA$VARIABLE) #computes the mean of a numeric column
sd(DATA$VARIABLE) #computes the standard deviation of a numeric column
table(DATA$VARIABLE) #creates a table showing levels in a column along with counts for those levels
prop.table(table(DATA$VARIABLE)) #creates a table showing levels in a column along with proportions for those levels.
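
The summary commands above could be tried out as follows. This is only a sketch on R's built-in `mtcars` data set, which is an assumption made here so the code runs without the Canvas files; the variable names are illustrative.

```r
# mtcars ships with R: mpg is fuel economy, cyl is the number of cylinders.
avg_mpg <- mean(mtcars$mpg)             # mean of a numeric column
sd_mpg  <- sd(mtcars$mpg)               # standard deviation of the same column

cyl_counts <- table(mtcars$cyl)         # how many cars have 4, 6, or 8 cylinders
cyl_props  <- prop.table(cyl_counts)    # the same table as proportions

print(round(avg_mpg, 2))
print(round(sd_mpg, 2))
print(cyl_counts)
print(round(cyl_props, 3))
```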

Compute with Distributions
pnorm(ZSCORE, 0, 1) #computes the area to the left of a given ZSCORE under the standard normal distribution
qnorm(AREA, 0, 1) #gives the z-score that has a given AREA to its left
pt(TSCORE, DF) #computes the area to the left of a given TSCORE under the t-distribution with given Degrees of Freedom
qt(AREA, DF) #computes the t-score that has a given AREA to its left under the t-distribution with given Degrees of Freedom
pchisq(CHISQUARE, DF) #computes the area to the left of a chi-square test statistic with a given number of Degrees of Freedom
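
Because all of these functions return the area to the left, tail areas and two-sided p-values need a small extra step. The sketch below shows one way to do that; the test statistics (`t_score`, `chi_sq`) and degrees of freedom are invented numbers for illustration only.

```r
# Left-tail area and critical value for the standard normal distribution.
p_left <- pnorm(1.96, 0, 1)     # area to the left of z = 1.96
z_crit <- qnorm(0.975, 0, 1)    # z-score with 97.5% of the area to its left

# Two-sided p-value for a hypothetical t statistic.
t_score <- 2.1
df <- 20
p_two_sided <- 2 * pt(-abs(t_score), df)   # double the smaller tail

# Right-tail p-value for a hypothetical chi-square statistic.
chi_sq <- 3.84
p_right <- 1 - pchisq(chi_sq, 1)           # area to the RIGHT of the statistic

print(round(c(p_left, z_crit, p_two_sided, p_right), 4))
```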

Generate Regression Models
lm(DATA$y ~ DATA$x1 + DATA$x2 + ...) #generates a linear regression model
glm(DATA$y ~ DATA$x1 + DATA$x2 + ..., family = binomial) #generates a logistic regression model (a generalized linear model)
summary(MODEL) #prints coefficient information and model measurements
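
For reference, a minimal sketch of both model types is shown here on R's built-in `mtcars` data, which is an assumption made so the code runs without the exam data. It uses the `data =` argument rather than `DATA$` prefixes, which is an alternative to the form listed above and lets `predict()` accept new rows.

```r
# Linear regression: predict fuel economy (mpg) from weight (wt, in 1000 lbs).
linear_fit <- lm(mpg ~ wt, data = mtcars)
print(summary(linear_fit))

# Logistic regression: predict the probability of a manual transmission (am = 1) from weight.
logistic_fit <- glm(am ~ wt, data = mtcars, family = binomial)
print(summary(logistic_fit))

# Predictions for a hypothetical new car weighing 3000 lbs.
new_car <- data.frame(wt = 3)
predicted_mpg  <- predict(linear_fit, newdata = new_car)
predicted_prob <- predict(logistic_fit, newdata = new_car, type = "response")  # probability scale

print(predicted_mpg)
print(predicted_prob)
```

With `type = "response"` the logistic model returns a probability between 0 and 1; without it, `predict()` returns log-odds.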

Plot Scatterplots
plot(DATA$y ~ DATA$x) #generates a scatter plot of y against x. Use MODEL$residuals for the first term for residual plots.
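
A residual plot built this way might look like the following sketch. It again assumes the built-in `mtcars` data as a stand-in, and the axis labels are illustrative.

```r
# Fit a simple linear model, then plot its residuals against the predictor.
fit <- lm(mpg ~ wt, data = mtcars)

plot(fit$residuals ~ mtcars$wt,
     xlab = "Weight (1000 lbs)",
     ylab = "Residual (observed - predicted mpg)")
abline(h = 0, lty = 2)   # dashed reference line at zero

# Residuals from a least-squares fit with an intercept sum to (numerically) zero.
print(sum(fit$residuals))
```

A roughly even scatter around the dashed line, with no curve or funnel shape, is what the linearity and constant-variance conditions call for.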


## Solution

Biswajit answered on Jun 19 2021
Part 1 :
1A. (a)
summary(Model2)
Call:
lm(formula = wins ~ runs, data = baseball)
Residuals:
     Min       1Q   Median       3Q      Max
-24.9326  -7.4241   0.6818   7.1930  24.6393
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.260427   4.335213   5.135 4.34e-07 ***
runs         0.077279   0.005674  13.621  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.7 on 418 degrees of freedom
Multiple R-squared: 0.3074,    Adjusted R-squared: 0.3058
F-statistic: 185.5 on 1 and 418 DF, p-value: < 2.2e-16
I have chosen runs as the most significant variable here, so the model is
wins = 22.26 + 0.0773 * runs
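(b)
confint(Model2)
                 2.5 %      97.5 %
(Intercept) 13.7388923 30.78196253
runs         0.0661272  0.08843178
We are 95% confident that the true slope lies in (0.0661, 0.0884): each additional run is associated with an average increase of between about 0.066 and 0.088 wins. Equivalently, about 95% of the intervals built this way from repeated samples of the same size would contain the true slope.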
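
The interval reported by confint() can be checked by hand as estimate ± t* × standard error. The sketch below uses the rounded slope estimate, standard error, and degrees of freedom copied from the summary(Model2) output above, so it only approximately reproduces the confint() values.

```r
# Values copied from the summary(Model2) output for the runs coefficient.
estimate  <- 0.077279   # slope estimate
std_error <- 0.005674   # standard error of the slope
df        <- 418        # residual degrees of freedom

t_star <- qt(0.975, df)                            # critical value for a 95% interval
ci <- estimate + c(-1, 1) * t_star * std_error     # lower and upper bounds

print(round(t_star, 4))
print(round(ci, 4))
```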
1B.(a)
Model3 <- step(lm(wins~League+runs+on_base+batting_avg+opp_runs+opp_on_base+opp_slugging,data=baseball),direction = "backward")
Start: AIC=1153.79
wins ~ League + runs + on_base + batting_avg + opp_runs + opp_on_base +
    opp_slugging
               Df Sum of Sq     RSS    AIC
- opp_slugging  1       0.0  6306.1 1151.8
- batting_avg   1       0.7  6306.7 1151.8
- League        1       2.7  6308.8 1152.0
- on_base       1      29.9  6335.9 1153.8
<none>                       6306.1 1153.8
- opp_on_base   1     199.2  6505.2 1164.8
- opp_runs      1    1963.4  8269.5 1265.6
- runs          1    4125.2 10431.3 1363.2
Step: AIC=1151.79
wins ~ League + runs + on_base + batting_avg + opp_runs + opp_on_base
               Df Sum of Sq     RSS    AIC
- batting_avg   1       0.7  6306.8 1149.8
- League        1       2.7  6308.8 1150.0
- on_base       1      30.0  6336.1 1151.8
<none>                       6306.1 1151.8
- opp_on_base   1     199.1  6505.2 1162.8
- opp_runs      1    3444.2  9750.3 1332.8
- runs          1    4129.5 10435.6 1361.3
Step: AIC=1149.83
wins ~ League + runs + on_base + opp_runs + opp_on_base
               Df Sum of Sq     RSS    AIC
- League        1       3.5  6310.3 1148.1
<none>                       6306.8 1149.8
- on_base       1      32.7  6339.5 1150.0
- opp_on_base   1     201.3  6508.1 1161.0
- opp_runs      1    3444.1  9750.9 1330.8
- runs          1    4238.7 10545.5 1363.7
Step: AIC=1148.07
wins ~ runs + on_base + opp_runs + opp_on_base
               Df Sum of Sq     RSS    AIC
<none>                       6310.3 1148.1
- on_base       1      39.1  6349.4 1148.7
- opp_on_base   1     200.4  6510.7 1159.2
- opp_runs      1    3783.4 10093.7 1343.4
- runs          1    4512.8 10823.1 1372.7
Model4 <- lm(wins ~ runs + on_base + opp_runs + opp_on_base, data = baseball)
Model4
Call:
lm(formula = wins ~ runs + on_base + opp_runs + opp_on_base,
    data = baseball)
Coefficients:
(Intercept)         runs      on_base     opp_runs  opp_on_base
   97.31711      0.09031     49.46939     -0.08505   -110.75203
summary(Model4)
Call:
lm(formula = wins ~ runs + on_base + opp_runs + opp_on_base,
    data = baseball)
Residuals:
     Min       1Q   Median       3Q      Max
-11.5461  -2.8158  -0.1009   2.5720  12.3188
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.732e+01  9.189e+00  10.590  < 2e-16 ***
runs         9.031e-02  5.242e-03  17.228  < 2e-16 ***
on_base      4.947e+01  3.085e+01   1.604 0.109571
opp_runs    -8.504e-02  5.391e-03 -15.774  < 2e-16 ***
opp_on_base -1.108e+02  3.051e+01  -3.630 0.000318 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.899 on 415 degrees of freedom
Multiple R-squared: 0.8889,    Adjusted R-squared: 0.8878
F-statistic: 830 on 4 and 415 DF, p-value: < 2.2e-16
Because the p-value of on_base (0.1096) is greater than 0.05, it is not significant at the 5% level, so we drop it.
summary(Final_model)
Call:
lm(formula = wins ~ runs + opp_runs + opp_on_base, data = baseball)
Residuals:
     Min       1Q   Median       3Q      Max
-11.5414  -2.7914  -0.0932   2.5331  12.1477
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.070e+02  6.948e+00   15.40  < 2e-16 ***
runs         9.783e-02  2.347e-03   41.68  < 2e-16 ***
opp_runs    -8.615e-02  5.357e-03  -16.08  < 2e-16 ***
opp_on_base -1.050e+02  3.036e+01   -3.46 0.000595 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.907 on 416 degrees of freedom
Multiple R-squared: 0.8882,    Adjusted R-squared: 0.8874
F-statistic: 1102 on 3 and 416 DF, p-value: < 2.2e-16
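
Note that step() removes variables by AIC, while the last removal above was made by p-value. A purely p-value-based backward elimination could be sketched as below. This is not the code used in this answer: `backward_eliminate` is a hypothetical helper, it assumes numeric predictors only (so coefficient names match column names), and it is demonstrated on the built-in `mtcars` data because the baseball file is not available here.

```r
# Repeatedly drop the predictor with the largest p-value until `keep` predictors remain.
backward_eliminate <- function(data, response, predictors, keep = 3) {
  while (length(predictors) > keep) {
    fit <- lm(reformulate(predictors, response = response), data = data)
    coef_table <- summary(fit)$coefficients[-1, , drop = FALSE]   # drop the intercept row
    p_values <- setNames(coef_table[, "Pr(>|t|)"], rownames(coef_table))
    worst <- names(which.max(p_values))                           # least significant predictor
    cat("Dropping", worst, "with p-value", signif(p_values[[worst]], 3), "\n")
    predictors <- setdiff(predictors, worst)
  }
  lm(reformulate(predictors, response = response), data = data)
}

# Stand-in example: start from five numeric predictors of mpg and keep three.
final_fit <- backward_eliminate(mtcars, "mpg", c("wt", "hp", "disp", "drat", "qsec"), keep = 3)
print(summary(final_fit))
```

Printing each dropped variable gives the one-by-one record that the question asks for.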
1B (b)
Final equation
wins = 107 + 0.098 runs - 0.086 opp_runs - 105 opp_on_base
Keeping all other variables constant, each additional run scored is associated with an average increase of about 0.098 wins.
Keeping all other variables constant, each additional run allowed (opp_runs) is associated with an average decrease of about 0.086 wins.
Keeping all other variables constant, a 1-unit increase in opp_on_base is associated with an average decrease of 105 wins. Because opp_on_base is a proportion, a more realistic reading is that a 0.01 increase is associated with about 1.05 fewer wins.
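
To show how the final equation is used, here is a sketch that plugs in the team stats listed under Data Set #1. The coefficients are the rounded values from the summary(Final_model) output, and the runs value of 513 is read from a garbled line in the prompt, so the result is approximate. The observed win total is only given in the Canvas survey, so `observed_wins` below is a made-up placeholder.

```r
# Rounded coefficients copied from the summary(Final_model) output.
coefs <- c(intercept = 107.0, runs = 0.09783, opp_runs = -0.08615, opp_on_base = -105.0)

# Team stats from the Data Set #1 prompt (only the predictors kept in the final model).
team <- c(runs = 513, opp_runs = 698, opp_on_base = 0.313)

predicted_wins <- coefs[["intercept"]] + sum(coefs[names(team)] * team)
print(round(predicted_wins, 1))   # about 64.2 wins

# Residual = observed - predicted. observed_wins is a HYPOTHETICAL value, not real data.
observed_wins <- 70
residual <- observed_wins - predicted_wins
print(round(residual, 1))
```

A positive residual means the team won more games than the model predicts; a negative one means it won fewer.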
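1B (c)
Diagnostics plot
vif(Model5)
       runs    opp_runs opp_on_base
   1.055384    6.051335    5.918250
From the diagnostic plot, summary statistics, and VIF, the model appears to meet the main conditions and can be used as the regression equation. One caveat: the VIFs of about 6 for opp_runs and opp_on_base show that these two predictors are correlated with each other, so their individual slopes should be interpreted with some care.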
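
For a linear model with numeric predictors, the VIF of a predictor equals 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. The sketch below computes it in base R. `vif_manual` is a hypothetical helper written for this illustration, and the built-in `mtcars` data stands in for the baseball data.

```r
# VIF for each predictor: regress it on the remaining predictors and use 1 / (1 - R^2).
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r_squared <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r_squared)
  })
}

# Stand-in example with three numeric predictors.
vifs <- vif_manual(mtcars, c("wt", "hp", "disp"))
print(round(vifs, 2))
```

Values near 1 indicate little overlap with the other predictors; common rules of thumb treat values above 5 or 10 as a sign of notable collinearity.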

(d)
Predicted values
round(predict1, 2)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
86.23 91.64 82.82 74.59 66.38 88.42 89.74 64.15 66.97 87.31 60.22 73.24 89.24 85.31 69.84 85.51 68.91
18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
76.41 95.43 91.70 83.18 79.63 76.00 79.11 88.44 93.12 94.58 92.67 74.30 95.50 88.28 85.01 66.34 95.27
35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
70.65 76.85 82.73 76.46 77.44 89.01 74.46 63.18 77.36 84.83 84.70 90.63 62.57 78.43 101.38 78.29 100.07
52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
69.82 79.38 70.67 80.50 88.41 91.43 100.06 79.66 79.13 69.00 92.01 64.04 88.57 72.84 85.77 91.33 69.48
69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85
86.30 81.57 80.79 69.29 64.40 78.78 79.43 75.59 92.29 80.33 97.50 85.04 94.71 53.25 89.64 64.16 92.07
86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102
90.26 97.18 91.29 83.89 72.48 75.39 89.74 66.93 93.70 84.23 80.62 75.56 71.22 89.59 80.20 81.54 67.42
103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
65.53 92.31 97.88 77.07 86.24 71.49 97.28 81.12 91.57 66.70 68.17 76.79 85.64 89.76 86.55 85.17 83.02
120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136
64.18 83.20 78.33 70.71 95.94 99.63 89.91 70.70 85.52 73.12 76.40 81.26 78.18 72.85 87.85 86.55 87.07
137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153
89.87 88.79 86.99 76.29 92.01 64.71 68.61 65.89 68.39 85.86 91.92 73.80 91.83 62.59 78.50 88.73 68.98
154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170
102.22 87.29 67.00 ...