Question

DATA 303 Assignment 4
Due: 5:00 PM Friday 26 June 2020

Instructions
• Prepare your assignment using Rmarkdown.
• Submit your solutions in two files: an Rmarkdown file named assignment4.Rmd and the PDF file named assignment4.pdf that results from knitting the Rmd file.
• The YAML header of your Rmarkdown file must contain your name and ID number in the author field, and should have the output format set to pdf_document. For example:
---
title: "DATA 303 Assignment 4"
author: "Ryan Admiraal, XXXXXXXXXX"
date: "26 June 2020"
output: pdf_document
---
• While you are developing your code you may find it easiest to have the output set to html_document, but change it to pdf_document when you submit.
• In your submission, embed any executable R code in code chunks, and make sure both the R code and the output are displayed correctly when you knit the document.
• If there are any R code errors, then the Rmarkdown file will not knit, and no output will be created at all. If you cannot get your code to work but want to show your attempted code, then put error = TRUE in the header of the R code chunk that is failing.
```{r, error = TRUE}
your imperfect R code
```
• Where appropriate, make sure you include your comments in the output within the Rmarkdown document.
• You will receive an email confirming your submission.
Check the email to be sure it shows that both the Rmd file and the PDF file have been submitted.

Background and Data

Heart disease is the leading annual cause of death worldwide, accounting for more than 25% of deaths in 2016 (World Health Organization XXXXXXXXXX). It is also a significant economic burden for the healthcare system, with Nichols et al. XXXXXXXXXX estimating that heart disease and other cardiovascular diseases cost an average of roughly USD $19,000 per patient, according to a study in the United States over the period of XXXXXXXXXX. Early detection of heart disease (along with many other diseases) is important in terms of reducing both mortality and costs to the healthcare system.

We will examine data on 4,240 participants in the Framingham Heart Study (Boston University and the National Heart, Lung, & Blood Institute 2020), an ongoing study that began in 1948 and has been instrumental in the identification of a number of risk factors for heart disease and other cardiovascular diseases. The data are available in the file Heart Disease.xlsx, which can be read into R using the code below but with the path changed to point to the location of the file on your computer.
A full list of variables contained in the dataset and descriptions of these variables is also provided, both here and in the Excel file.

# Load the "readxl" package to read in data from an Excel file.
library(readxl)
# Read in the heart disease dataset.
hd <- read_excel("Heart Disease.xlsx", sheet = "Data", na = "NA")

Table 1: Variables and their descriptions for data contained in the file Heart Disease.xlsx.

| Variable | Description |
|----------|-------------|
| SEX | Sex of the individual (0 = “Female”, 1 = “Male”). |
| AGE | Age (in years) of the individual at the time of the health exam. |
| EDUC | Highest level of education of the individual (1 = “Some high school”, 2 = “High school or Graduate Equivalency Diploma”, 3 = “Some university or vocational school”, 4 = “University”). |
| SMOKER | Indicator of whether or not the individual is a current smoker (0 = “No”, 1 = “Yes”). |
| CIG | Average number of cigarettes that the individual smokes each day. |
| BP_MED | Indicator of whether or not the individual is on blood pressure medication (0 = “No”, 1 = “Yes”). |
| STROKE | Indicator of whether or not the individual previously had a stroke (0 = “No”, 1 = “Yes”). |
| HYPER | Indicator of whether or not the individual was hypertensive (0 = “No”, 1 = “Yes”). |
| DIAB | Indicator of whether or not the individual is diabetic (0 = “No”, 1 = “Yes”). |
| CHOL | Total cholesterol level (in mg/dL). |
| SBP | Systolic blood pressure (in mmHg). |
| DBP | Diastolic blood pressure (in mmHg). |
| BMI | Body mass index. |
| HR | Resting heart rate (in beats per minute). |
| GLUC | Glucose level (in mg/dL). |
| HD_RISK | Indicator of whether the individual has 10-year risk of future coronary heart disease (0 = “No”, 1 = “Yes”). |

Our focus will be on 10-year risk of coronary heart disease (CHD). Ten-year risk of CHD is a predicted risk (i.e., a probability ranging between 0 and 1) of developing CHD within the next 10 years. Although this is not an observed outcome but rather an estimated value, 10-year risk of CHD is a well-established measure in the medical community.
We will consider a binary version of this variable which indicates whether or not a person would be considered as at risk of developing CHD within the next 10 years.

Assignment Questions

1. Missing data and variable recode: (10 marks)
Although our objective will be to consider inferential and predictive models for 10-year risk of CHD, we will first ensure that we understand aspects of the underlying data as well as create a new variable that may prove useful in producing comparisons of 10-year risk of CHD for medically-meaningful blood pressure ranges. (In practice, we would want to examine each relevant variable to identify extreme observations and be sure that there are not any erroneous values. As this dataset has already been cleaned, we will not do so for this assignment.)

a. (2 marks) First, perform an analysis of the level of missing data for each variable. For only those variables for which there are missing data, produce a table of the form shown below, where VARIABLE_i is the name of the variable with missing data, n_i is the count of missing observations for that variable, and p_i is the proportion (to 5dp) of missing observations for that variable. Which variable has the highest level of missing data?

Table 2: Frequency and proportion of missing values for variables with missing data.

| Variable | VARIABLE_1 | VARIABLE_ XXXXXXXXXX | VARIABLE_k |
|----------|------------|----------------------|------------|
| Frequency (n) | n_1 | n XXXXXXXXXX | n_k |
| Proportion (p) | p_1 | p XXXXXXXXXX | p_k |

b. (3 marks) Create a new data frame called hd.complete, which only keeps people/observations that have no missing data. In total, what proportion (to 5dp) of people have been removed from the original dataset to produce this final data frame?

c. (3 marks) Add a variable to the data frame hd.complete called SBP_CAT, which converts systolic blood pressure (SBP) from a numeric variable to a categorical variable according to the blood pressure ranges specified by Madell and Cherney XXXXXXXXXX. (See references listed at the end of the assignment.)
For the purposes of coding SBP_CAT, you can assume that the values for each blood pressure category go to just below that of the next category, as our dataset does not consist of blood pressures that are rounded to the nearest whole number. This means that, for instance, the systolic blood pressure range of 120 – 129 should in fact be interpreted as 120 – <130. Produce five levels (i.e., blood pressure ranges) for SBP_CAT. (Note that the final level corresponds to systolic blood pressure above 180 mmHg.) Produce a table for SBP_CAT which shows how many observations fall into each blood pressure range.

d. (2 marks) Explain when we would expect that using the categorical variable SBP_CAT rather than the numeric variable SBP would lead to a better fit for a regression model (whether logistic regression, linear regression, or Poisson regression).

2. Inferential analysis: (25 marks)
Now we will focus on 10-year risk of CHD and look at the role that blood pressure may play in whether or not someone is considered to be at risk of developing CHD within the next 10 years.

a. (3 marks) We will first consider a logistic regression model of 10-year risk of CHD (HD_RISK) on systolic blood pressure (SBP) and diastolic blood pressure (DBP). Previous research suggests that the following variables are potential confounders for the true relationship between blood pressure and 10-year risk of CHD and should also be included in the logistic regression model:
• sex of the individual (SEX)
• age of the individual (AGE)
• highest level of education of the individual (EDUC)
• average number of cigarettes smoked per day (CIG)
• total cholesterol level (CHOL)
• body mass index (BMI)
• glucose level (GLUC)
For this logistic regression model, calculate the variance inflation factors for predictors (to 3dp) to determine whether or not there is evidence of significant multicollinearity among the predictors in the model.
If so, comment on which predictor(s) should be removed, and use this model for subsequent parts of this question.

b. (3 marks) Using your model from part (a), produce a table of logistic regression model output and write out the estimated logistic regression equation using the form

$$\log\left(\frac{\hat{p}}{1 - \hat{p}}\right) = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k,$$

where you clearly define the variables $X_1, X_2, \ldots, X_k$ and replace $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ with their estimated values (to 4dp).

c. (6 marks) Carry out Wald tests for the coefficients for
• systolic blood pressure and
• diastolic blood pressure.
For each coefficient, clearly state
i. the hypotheses you are testing,
ii. the value of the test statistic,
iii. the p-value, and
iv. your conclusion in terms of whether the “effect” of the predictor on the response is statistically significant.

d. (3 marks) For any significant Wald tests in part (c), provide a precise interpretation of what the estimated coefficient suggests about the “effect” of the predictor on the response, and calculate a corresponding 95% confidence interval (to 3dp) for the estimated “effect”.

e. (4 marks) A 2015 study by Wu et al. XXXXXXXXXX found that
“cardiovascular and expanded-cardiovascular mortality risks were lowest when systolic blood pressures were 120 to 129 mm Hg, and increased significantly when systolic blood pressures (SBPs) were ≥ 160 mm Hg. . . .”
Although Wu et al. XXXXXXXXXX considered different ranges of systolic blood pressures (130–139, 140–149, 150–159, ≥ 160 mmHg) than Madell and Cherney (2018), we will use those specified by Madell and Cherney XXXXXXXXXX in investigating whether ranges of blood pressures may differ in terms of associated 10-year risk of CHD.
Fit the same model as before, but replace SBP with SBP_CAT.
i. Produce a table of logistic regression model output for this model.
ii. Based strictly on p-values, comment on what conclusions you would make for Wald tests based on coefficients for SBP_CAT. (Note that you do not need to state hypotheses or values for test statistics.
You simply need to use the p-values to explain what these results mean about comparisons of systolic blood pressure ranges.)

Sudharsan.J · Accepted Answer

---
title: "DATA 303 Assignment 4"
author: "Sai Badrinarayan Swain"
date: "22 June 2020"
output: word_document
---
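# Load the "readxl" package to read in data from an Excel file.
library(readxl)
library(dplyr)   # provides %>%, summarise() and across()
# Read the heart disease dataset (file name taken from the assignment; adjust the path as needed).
hd <- read_excel("Heart Disease.xlsx", sheet = "Data", na = "NA")
d <- hd %>% summarise(across(everything(), ~ sum(is.na(.x))))     # Frequency of NA
d1 <- hd %>% summarise(across(everything(), ~ mean(is.na(.x))))   # Proportion of NA
d3 <- round(d1, 5)                                           # the question asks for 5dp
d2 <- as.data.frame(rbind(d, d3))
rownames(d2) <- c("Frequency", "Proportion")                 # changing the row names
d2[, unlist(d2["Frequency", ]) > 0, drop = FALSE]            # only variables with missing data
#Question-1  [b]
hd.complete <- na.omit(hd)                                   # Omitting NA's
str(hd.complete)
round(1 - nrow(hd.complete) / nrow(hd), 5)                   # proportion of people removed
#Question-1  [c]
#creating a new variable named SBP_CAT and converting it into categorical: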
hd.
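The answer is cut off at the SBP_CAT step. A minimal sketch of one way to do the recode in base R follows. The helper name `categorise_sbp`, the level labels, and the 130 and 140 cut points are assumptions: the assignment text only confirms the 120 – 129 range, the "above 180" final level, and that there are five levels, so the cut points should be checked against Madell and Cherney before use.

```r
# Hypothetical helper: recode systolic blood pressure (mmHg) into five ranges.
# The cut points 130 and 140 and the labels are assumptions, not taken from the assignment text.
categorise_sbp <- function(sbp) {
  # findInterval() gives 0 for sbp < 120, 1 for [120, 130), 2 for [130, 140), 3 for >= 140.
  idx <- findInterval(sbp, c(120, 130, 140)) + 1
  # The final level is strictly above 180, so 180 itself stays in level 4.
  idx[which(sbp > 180)] <- 5
  factor(idx, levels = 1:5,
         labels = c("<120", "120-<130", "130-<140", "140-180", ">180"))
}

# Small made-up vector to show the behaviour at the boundaries.
example_sbp <- c(119.9, 120, 129.5, 130, 140, 180, 180.5, NA)
example_cat <- categorise_sbp(example_sbp)
table(example_cat, useNA = "ifany")
```

With the data frame from part (b), this could be applied as `hd.complete$SBP_CAT <- categorise_sbp(hd.complete$SBP)`, followed by `table(hd.complete$SBP_CAT)` to count the observations in each range.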

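Question 2(a) asks for variance inflation factors. The sketch below shows the classical definition, VIF_j = 1 / (1 - R²_j), on simulated numeric predictors; the data, the seed, and the name `vif_manual` are illustrative assumptions and not part of the assignment or the answer. Functions such as `vif()` in the car package work from the fitted model instead and may give somewhat different numbers for a logistic regression, and a factor such as EDUC needs a generalised VIF, which this sketch does not cover.

```r
set.seed(303)
n <- 500
# Simulated stand-ins for the real predictors; DBP is built to correlate with SBP.
SBP <- rnorm(n, mean = 130, sd = 20)
DBP <- 0.5 * SBP + rnorm(n, mean = 18, sd = 8)
AGE <- rnorm(n, mean = 50, sd = 9)
X <- data.frame(SBP, DBP, AGE)

# VIF for predictor j is 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all of the other predictors.
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 3)
```

Because SBP and DBP are simulated to be correlated, their VIFs come out larger than the one for AGE; common rules of thumb treat values above roughly 5 or 10 as a sign of problematic multicollinearity.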
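For question 2(c) and (d), the Wald statistic, its p-value, and a confidence interval for the odds ratio can all be read off the coefficient table of a fitted `glm`. The sketch below uses simulated data with a single predictor; the simulation settings and object names are assumptions for illustration, so the numbers say nothing about the Framingham data.

```r
set.seed(303)
n <- 1000
SBP <- rnorm(n, mean = 130, sd = 20)            # simulated, not the Framingham data
prob <- plogis(-6 + 0.04 * SBP)                 # assumed true model for the simulation
HD_RISK <- rbinom(n, size = 1, prob = prob)

fit <- glm(HD_RISK ~ SBP, family = binomial)
est <- coef(summary(fit))                       # Estimate, Std. Error, z value, Pr(>|z|)

b  <- est["SBP", "Estimate"]
se <- est["SBP", "Std. Error"]
z  <- b / se                                    # Wald statistic for H0: beta_SBP = 0
p_value <- 2 * pnorm(-abs(z))                   # two-sided p-value from the standard normal

# Odds ratio for a 1 mmHg increase in SBP, with a Wald 95% confidence interval.
odds_ratio <- exp(b)
or_ci <- exp(b + c(-1, 1) * qnorm(0.975) * se)
round(c(z = z, p = p_value, OR = odds_ratio,
        lower = or_ci[1], upper = or_ci[2]), 4)
```

The interpretation in part (d) is usually phrased on the odds scale: holding the other predictors fixed, each one-unit increase in the predictor multiplies the odds of being at risk by `exp(b)`.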
