
DATA 303 Assignment 4
Due: 5:00 PM Friday 26 June 2020
Instructions
• Prepare your assignment using Rmarkdown
• Submit your solutions in two files: an Rmarkdown file named assignment4.Rmd and the PDF file
named assignment4.pdf that results from knitting the Rmd file.
• The YAML header of your Rmarkdown file must contain your name and ID number in the author field,
and should have the output format set to pdf_document. For example:
---
title: "DATA 303 Assignment 4"
author: "Ryan Admiraal, XXXXXXXXXX"
date: "26 June 2020"
output: pdf_document
---
• While you are developing your code you may find it easiest to have the output set to html_document,
but change it to pdf_document when you submit.
• In your submission, embed any executable R code in code chunks, and make sure both the R code and
the output are displayed correctly when you knit the document.
• If there are any R code errors, then the Rmarkdown file will not knit, and no output will be created at
all. If you cannot get your code to work but want to show your attempted code, then put error = TRUE
in the header of the R code chunk that is failing.
```{r, error = TRUE}
your imperfect R code
```
• Where appropriate, make sure you include your comments in the output within the Rmarkdown
document.
• You will receive an email confirming your submission. Check the email to be sure it
shows that both the Rmd file and the PDF file have been submitted.
Background and Data
Heart disease is the leading cause of death worldwide each year, accounting for more than 25% of deaths in
2016 (World Health Organization XXXXXXXXXX). It is also a significant economic burden for the healthcare system,
with Nichols et al. XXXXXXXXXX estimating that heart disease and other cardiovascular diseases cost an average of
roughly USD $19,000 per patient, according to a study in the United States over the period of XXXXXXXXXX.
Early detection of heart disease (along with many other diseases) is important in terms of reducing both
mortality and costs to the healthcare system.
We will examine data on 4,240 participants in the Framingham Heart Study (Boston University and the
National Heart, Lung, & Blood Institute 2020), an ongoing study that began in 1948 and has been instrumental
in the identification of a number of risk factors for heart disease and other cardiovascular diseases. The data
are available in the file Heart Disease.xlsx, which can be read into R using the code below but with the
path changed to point to the location of the file on your computer. A full list of variables contained in the
dataset and descriptions of these variables is also provided, both here and in the Excel file.
# Load the "readxl" package to read in data from an Excel file.
library(readxl)
# Read in the heart disease dataset.
hd <- read_xlsx("~/Documents/Dropbox/Courses/DATA303/Data/Heart Disease.xlsx", sheet = "Data", na = "NA")
Table 1: Variables and their descriptions for data contained in the file Heart Disease.xlsx.
Variable   Description
SEX        Sex of the individual (0 = “Female”, 1 = “Male”).
AGE        Age (in years) of the individual at the time of the health exam.
EDUC       Highest level of education of the individual (1 = “Some high school”, 2 = “High school or Graduate Equivalency Diploma”, 3 = “Some university or vocational school”, 4 = “University”).
SMOKER     Indicator of whether or not the individual is a current smoker (0 = “No”, 1 = “Yes”).
CIG        Average number of cigarettes that the individual smokes each day.
BP_MED     Indicator of whether or not the individual is on blood pressure medication (0 = “No”, 1 = “Yes”).
STROKE     Indicator of whether or not the individual previously had a stroke (0 = “No”, 1 = “Yes”).
HYPER      Indicator of whether or not the individual was hypertensive (0 = “No”, 1 = “Yes”).
DIAB       Indicator of whether or not the individual is diabetic (0 = “No”, 1 = “Yes”).
CHOL       Total cholesterol level (in mg/dL).
SBP        Systolic blood pressure (in mmHg).
DBP        Diastolic blood pressure (in mmHg).
BMI        Body mass index.
HR         Resting heart rate (in beats per minute).
GLUC       Glucose level (in mg/dL).
HD_RISK    Indicator of whether the individual has 10-year risk of future coronary heart disease (0 = “No”, 1 = “Yes”).
Our focus will be on 10-year risk of coronary heart disease (CHD). Ten-year risk of CHD is a predicted risk
(i.e., a probability ranging between 0 and 1) of developing CHD within the next 10 years. Although this is
not an observed outcome but rather an estimated value, 10-year risk of CHD is a well-established measure in
the medical community. We will consider a binary version of this variable which indicates whether or not a
person would be considered as at risk of developing CHD within the next 10 years.
Assignment Questions
1. Missing data and variable recode: (10 marks)
Although our objective will be to consider inferential and predictive models for 10-year risk of CHD, we
will first ensure that we understand aspects of the underlying data as well as create a new variable
that may prove useful in producing comparisons of 10-year risk of CHD for medically-meaningful blood
pressure ranges. (In practice, we would want to examine each relevant variable to identify extreme
observations and be sure that there are not any erroneous values. As this dataset has already been
cleaned, we will not do so for this assignment.)
a. (2 marks) First, perform an analysis of the level of missing data for each variable. For only
those variables for which there are missing data, produce a table of the form shown below, where
VARIABLE_i is the name of the variable with missing data, ni is the count for number of missing
observations for that variable, and pi is the proportion (to 5dp) of missing observations for that
variable. Which variable has the highest level of missing data?
Table 2: Frequency and proportion of missing values for variables with missing data.
Variable        VARIABLE_1  VARIABLE_2  ...  VARIABLE_k
Frequency (n)   n1          n2          ...  nk
Proportion (p)  p1          p2          ...  pk
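One way to build a table of this shape is sketched below. It is a minimal illustration on a made-up data frame called toy (a stand-in for hd, with invented values), not the required solution; the object names n_missing, p_missing and missing_table are assumptions introduced here.

```r
# Toy data frame standing in for `hd` (values are invented for illustration).
toy <- data.frame(
  AGE  = c(39, 46, NA, 61, 46),
  CIG  = c(0, NA, 20, 30, NA),
  CHOL = c(195, 250, 245, 225, 285)
)

n_missing <- colSums(is.na(toy))             # count of NAs per variable
p_missing <- round(colMeans(is.na(toy)), 5)  # proportion of NAs, to 5dp

# Keep only variables with at least one missing value; one column per variable.
keep <- n_missing > 0
missing_table <- rbind(
  "Frequency (n)"  = n_missing[keep],
  "Proportion (p)" = p_missing[keep]
)
missing_table
```

The same calls should work on hd directly, since is.na() applied to a data frame returns a logical matrix with one column per variable.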
b. (3 marks) Create a new data frame called hd.complete, which only keeps people/observations
that have no missing data. In total, what proportion (to 5dp) of people have been removed from
the original dataset to produce this final data frame?
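As a hedged sketch of the idea, the proportion removed is one minus the ratio of complete rows to all rows. The data frame toy below is invented for illustration; with the real data, hd would take its place.

```r
# Toy data with missing values in some rows (invented values).
toy <- data.frame(
  AGE  = c(39, 46, NA, 61, 46, 52, 43, 63),
  CIG  = c(0, NA, 20, 30, NA, 0, 15, 0),
  CHOL = c(195, 250, 245, 225, 285, 228, 205, 313)
)

# na.omit() drops every row that contains at least one NA.
toy.complete <- na.omit(toy)

# Proportion of rows removed, to 5dp.
prop_removed <- round(1 - nrow(toy.complete) / nrow(toy), 5)
prop_removed
```

Here three of the eight rows contain an NA, so prop_removed is 0.375.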
c. (3 marks) Add a variable to the data frame hd.complete called SBP_CAT, which converts systolic
blood pressure (SBP) from a numeric variable to a categorical variable according to the blood
pressure ranges specified by Madell and Cherney (2018). (See references listed at the end of the
assignment.) For the purposes of coding SBP_CAT, you can assume that the values for each blood
pressure category go to just below that of the next category, as our dataset does not consist of
blood pressures that are rounded to the nearest whole number. This means that, for instance, the
systolic blood pressure range of 120 – 129 should in fact be interpreted as 120 – < 130. This should
produce five levels (i.e., blood pressure ranges) for SBP_CAT. (Note that the final level corresponds
to systolic blood pressure above 180 mmHg.) Produce a table for SBP_CAT which shows how many
observations fall into each blood pressure range.
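The left-closed, right-open interpretation described above can be sketched with cut() and right = FALSE. The cut points and labels below are assumptions chosen for illustration (the ranges from the reference are not reproduced in this text), and placing exactly 180 in the top level is one possible choice, so check them against Madell and Cherney before relying on them.

```r
# Invented systolic blood pressure values, including boundary cases.
sbp <- c(106, 119.5, 120, 129.5, 130, 139.9, 140, 179, 180, 195)

# ASSUMED cut points and labels; verify against the reference.
breaks <- c(-Inf, 120, 130, 140, 180, Inf)
labels <- c("<120", "120-<130", "130-<140", "140-<180", ">=180")

# right = FALSE makes each interval [lower, upper), so 129.5 stays in 120-<130.
sbp_cat <- cut(sbp, breaks = breaks, labels = labels, right = FALSE)
table(sbp_cat)
```

With these ten values, each of the five levels receives two observations.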
d. (2 marks) Explain when we would expect that using the categorical variable SBP_CAT rather
than the numeric variable SBP would lead to a better fit for a regression model (whether logistic
regression, linear regression, or Poisson regression).
2. Inferential analysis: (25 marks)
Now we will focus on 10-year risk of CHD and look at the role that blood pressure may play in whether
or not someone is considered to be at risk of developing CHD within the next 10 years.
a. (3 marks) We will first consider a logistic regression model of 10-year risk of CHD (HD_RISK) on
systolic blood pressure (SBP) and diastolic blood pressure (DBP). Previous research suggests that
the following variables are potential confounders for the true relationship between blood pressure
and 10-year risk of CHD and should also be included in the logistic regression model:
• sex of the individual (SEX)
• age of the individual (AGE)
• highest level of education of the individual (EDUC)
• average number of cigarettes smoked per day (CIG)
• total cholesterol level (CHOL)
• body mass index (BMI)
• glucose level (GLUC)
For this logistic regression model, calculate the variance inflation factors for predictors (to 3dp) to
determine whether or not there is evidence of significant multicollinearity among the predictors
in the model. If so, comment on which predictor(s) should be removed, and use this model for
subsequent parts of this question.
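For intuition, the classical variance inflation factor for predictor j is 1 / (1 - R²_j), where R²_j comes from regressing that predictor on the remaining predictors. The sketch below computes it by hand on simulated data (the variables and their distributions are invented). Packages such as car provide a vif() function that works on a fitted model, including glm fits and factor predictors such as EDUC; that package is not named in the assignment and its results for a logistic model may differ slightly from this hand calculation.

```r
set.seed(303)
n <- 500

# Simulated predictors; DBP is deliberately built to correlate with SBP.
sbp <- rnorm(n, mean = 130, sd = 20)
dbp <- 0.5 * sbp + rnorm(n, mean = 18, sd = 8)
age <- rnorm(n, mean = 50, sd = 9)
toy <- data.frame(SBP = sbp, DBP = dbp, AGE = age)

# VIF_j = 1 / (1 - R^2_j): regress each predictor on all the others.
vif_manual <- sapply(names(toy), function(v) {
  f  <- reformulate(setdiff(names(toy), v), response = v)
  r2 <- summary(lm(f, data = toy))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 3)
```

Because SBP and DBP are constructed to be correlated, their values come out clearly above that of AGE, which stays close to 1.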
b. (3 marks) Using your model from part (a), produce a table of logistic regression model output
and write out the estimated logistic regression equation using the form
$$\log\left(\frac{\hat{p}}{1 - \hat{p}}\right) = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k,$$
where you clearly define the variables $X_1, X_2, \ldots, X_k$ and replace $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ with their
estimated values (to 4dp).
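A minimal sketch of extracting rounded coefficients and assembling the fitted log-odds equation as text follows. The data are simulated and the two-predictor model is a simplified stand-in for the assignment's model; the names toy, fit, b and rhs are assumptions introduced for this illustration.

```r
set.seed(303)
n <- 400

# Simulated data: a binary response generated from a known log-odds model.
toy <- data.frame(SBP = rnorm(n, 130, 20), AGE = rnorm(n, 50, 9))
eta <- -8 + 0.02 * toy$SBP + 0.08 * toy$AGE
toy$HD_RISK <- rbinom(n, size = 1, prob = plogis(eta))

fit <- glm(HD_RISK ~ SBP + AGE, data = toy, family = binomial)

# Estimated coefficients, rounded to 4dp.
b <- round(coef(fit), 4)

# Write out the estimated equation with signed slope terms.
rhs <- paste0(sprintf("%+.4f", b[-1]), " * ", names(b)[-1], collapse = " ")
cat("log(p / (1 - p)) =", sprintf("%.4f", b[1]), rhs, "\n")
```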
c. (6 marks) Carry out Wald tests for the coefficients for
• systolic blood pressure and
• diastolic blood pressure.
For each coefficient, clearly state
i. the hypotheses you are testing,
ii. the value of the test statistic,
iii. the p-value, and
iv. your conclusion in terms of whether the “effect” of the predictor on the response is statistically
significant.
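The Wald statistic for a single coefficient is the estimate divided by its standard error, compared against a standard normal distribution. The sketch below reproduces this by hand on simulated data (an invented two-predictor model, not the assignment's model) so that each quantity in the list above can be traced to a line of code.

```r
set.seed(303)
n <- 400
toy <- data.frame(SBP = rnorm(n, 130, 20), AGE = rnorm(n, 50, 9))
toy$HD_RISK <- rbinom(n, 1, plogis(-8 + 0.02 * toy$SBP + 0.08 * toy$AGE))

fit <- glm(HD_RISK ~ SBP + AGE, data = toy, family = binomial)
coefs <- coef(summary(fit))

# H0: beta_j = 0 versus H1: beta_j != 0, for each coefficient j.
est <- coefs[, "Estimate"]
se  <- coefs[, "Std. Error"]
z   <- est / se              # Wald test statistic
p   <- 2 * pnorm(-abs(z))    # two-sided p-value from the standard normal

wald <- cbind(z = z, p = p)
wald[c("SBP", "AGE"), ]
```

These values match the "z value" and "Pr(>|z|)" columns that summary() reports for a binomial glm.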
d. (3 marks) For any significant Wald tests in part (c), provide a precise interpretation of what the
estimated coefficient suggests about the “effect” of the predictor on the response, and calculate a
corresponding 95% confidence interval (to 3dp) for the estimated “effect”.
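One common reading of the "effect" in logistic regression is the odds ratio, exp(β̂), with a Wald interval obtained by exponentiating β̂ ± z × SE. The sketch below illustrates that calculation on simulated data; treating the interval as a Wald interval is an assumption made here, and profile-likelihood intervals are an alternative that can give slightly different limits.

```r
set.seed(303)
n <- 400
toy <- data.frame(SBP = rnorm(n, 130, 20), AGE = rnorm(n, 50, 9))
toy$HD_RISK <- rbinom(n, 1, plogis(-8 + 0.02 * toy$SBP + 0.08 * toy$AGE))

fit <- glm(HD_RISK ~ SBP + AGE, data = toy, family = binomial)

est <- unname(coef(fit)["SBP"])
se  <- unname(sqrt(diag(vcov(fit)))["SBP"])

# 95% Wald interval on the log-odds scale, then exponentiated to odds ratios.
ci_log_odds <- est + c(-1, 1) * qnorm(0.975) * se
odds_ratio  <- c(OR = exp(est), lower = exp(ci_log_odds[1]), upper = exp(ci_log_odds[2]))
round(odds_ratio, 3)
```

The odds ratio is read as the multiplicative change in the odds of the response for a one-unit increase in the predictor, holding the other predictors fixed.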
e. (4 marks) A 2015 study by Wu et al. (2015) found that
“cardiovascular and expanded-cardiovascular mortality risks were lowest when systolic
blood pressures were 120 to 129 mm Hg, and increased significantly when systolic blood
pressures (SBPs) were ≥ 160 mm Hg. . . .”
Although Wu et al. (2015) considered different ranges of systolic blood pressures (< 120, 120–129,
130–139, 140–149, 150–159, ≥ 160 mmHg) than Madell and Cherney (2018), we will use those
specified by Madell and Cherney (2018) in investigating whether ranges of blood pressures may
differ in terms of associated 10-year risk of CHD.
Fit the same model as before, but replace SBP with SBP_CAT.
i. Produce a table of logistic regression model output for this model.
ii. Based strictly on p-values, comment on what conclusions you would make for Wald tests
based on coefficients for SBP_CAT. (Note that you do not need to state hypotheses or values
for test statistics. You simply need to use the p-values to explain what these results mean
about comparisons of systolic blood pressure ranges.)
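When a factor enters a logistic regression, each reported coefficient compares one level with the reference level, so the p-values speak only to those pairwise comparisons. The sketch below uses simulated data with three hypothetical levels ("low", "mid", "high") rather than the five ranges in the assignment, and shows how changing the reference level changes which comparisons are reported.

```r
set.seed(303)
n <- 600
sbp <- rnorm(n, 135, 22)

# Three ILLUSTRATIVE levels only; these are not the assignment's categories.
sbp_cat <- cut(sbp, breaks = c(-Inf, 120, 140, Inf),
               labels = c("low", "mid", "high"), right = FALSE)
y <- rbinom(n, 1, plogis(-1.5 + 0.8 * (sbp_cat == "high")))
toy <- data.frame(HD_RISK = y, SBP_CAT = sbp_cat)

# Reference level is "low": rows compare "mid" vs "low" and "high" vs "low".
fit_low <- glm(HD_RISK ~ SBP_CAT, data = toy, family = binomial)
coef(summary(fit_low))

# Changing the reference to "mid" reports "low" vs "mid" and "high" vs "mid".
toy$SBP_CAT <- relevel(toy$SBP_CAT, ref = "mid")
fit_mid <- glm(HD_RISK ~ SBP_CAT, data = toy, family = binomial)
coef(summary(fit_mid))
```

Both fits describe the same model and have the same deviance; only the comparisons shown in the coefficient table change.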
Answered Same Day Jun 21, 2021 Victoria University

Solution

Sudharsan.J answered on Jun 23 2021
---
title: "DATA 303 Assignment 4"
author: "Sai Badrinarayan Swain"
date: "22 June 2020"
output: word_document
---
# Load the "readxl" package to read in data from an Excel file.
library(readxl)
# Read the heart disease dataset.
hd <- read_xlsx("C:\\Users\\Monika\\Desktop\\Greynodes\\DATA 303\\Data\\heart-disease.xlsx", sheet = "Data", na = "NA")
str(hd) # 4240 obs. of 16 variables, with NAs
data=na.omit(hd)
str(data) # 3658 obs. of 16 variables, without NAs
#Question-1 [a]
library(dplyr)
d=colSums(is.na(hd)) # Frequency of NA per variable
d1=round(colMeans(is.na(hd)), 5) # Proportion of NA per variable (to 5dp, as the question asks)
d2=as.data.frame(rbind(d, d1))
rownames(d2)=c("Frequency", "Proportion") # changing the row names
d2=d2[, d > 0, drop = FALSE] # keep only the variables that have missing data
d2
#Question-1 [b]
hd.complete=na.omit(hd) # Omitting rows with NAs
str(hd.complete)
#Question-1 [c]
#creating a new variable named SBP_CAT and converting it into a categorical variable:
hd.complete$SBP_CAT=cut(hd.complete$SBP,...