Linear Models (LMR) 2020: Assignment 2
Due: Monday 1st June 2020, 11:59pm
There are 3 parts in this assignment. Indications for the structure of responses are
given in the 3 parts. Please read these carefully. Please proof-read your
submission to ensure it is as polished and professionally presented as possible.
You can include any relevant computer code used to generate results for your
answers. We expect that a reasonable effort on the assignment should amount to
not much more than 12 pages including both text and graphics. Allowing some
tolerance, the maximum number of pages for this assignment is set to 16 pages.
Please also indicate the part and number your answers in the same
way the parts and questions are numbered.
Instructions
• Answer the questions in an essay-style approach when appropriate. Make
sure to include all relevant computer output (and exclude irrelevant output),
that is presented neatly and integrated through the discussion and
interpretation.
• Place the code in an appendix (if the syntax you want to present is long).
• Do not repeat the assignment questions
• Do not include an assignment cover page
• Note that there are not necessarily unique correct answers for these
questions, and marks will be awarded for appropriate analysis using
regression models, with corresponding justifications and explanations. Marks
may be subtracted for unfocussed or disorganised presentation of material.
• Where equations or formulae are to be presented, please attempt to do this
electronically using a word processor rather than including images of scanned
handwritten work. When this is not possible, scanned written work must
be extremely neat and legible. Writing mathematics electronically is an
important skill to learn and practice.
Part A
The dataset “vitD.dta” contains data on measurements made on a group of newborn babies in a
study on the possible associations between maternal vitamin D levels and fetal growth, as measured
by size at birth. The study was motivated by earlier studies on animals and some conflicting
epidemiological evidence that low levels of vitamin D may be associated with reduced growth of the
fetus. The particular measure of size at birth that we will focus on here was a measure of the baby’s
knee-to-heel length (performed by a very accurate device called a knemometer!), recorded in the
dataset with the variable name “kneeheel”. Vitamin D level was measured as the concentration of
25-hydroxyvitamin D (25-OHD), in nmol/L, at a first trimester antenatal visit. The dataset also
records a number of other variables potentially associated with birth size, in particular the sex of the
baby, the mother’s height, whether or not she smoked during pregnancy, whether this was her first
baby, and the gestational age at birth.
[Note that the usual caveats apply: these data have been sampled and modified from an original
study (conducted by Dr Ruth Morley at the Royal Children’s Hospital, Melbourne) and no substantive
conclusions should be drawn from these analyses.]
Question 1
In this question we ask you to examine the association between birth size (kneeheel) and
maternal vitamin D levels during pregnancy (vitd) using two different regression models. For this
part of the analysis you should ignore all other variables in the dataset.
The literature in the area suggests the following two possible relationships (if any):
a) Birth size may increase smoothly with vitamin D level across the whole range of
vitamin D levels. This will be referred to as model (1).
b) There might be a threshold effect, whereby growth is adversely affected only below a
certain minimum “normal range” of vitamin D levels. In other words, there may be a
smooth association between the two variables that is particularly strong below the “normal
range” of vitamin D levels, and less strong once within the normal range. This model will be
referred to as model (2). It is often called a “hockey stick” or “bent stick” or “changepoint”
model. The threshold value of interest is taken from the relevant research literature, and is
28 nmol/L of 25-OHD.
(Note: Although it would be possible to explore for other threshold values, we require that you take
this value as given in your analysis.)
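One way of writing the two hypothesised relationships is sketched below. The symbols $\beta_0$, $\beta_1$, $\beta_2$ and the notation $(x)_+$ are introduced here for illustration only and are not part of the assignment text; the threshold of 28 is the value given above.

```latex
\begin{align*}
\text{Model (1):}\quad & E[\text{kneeheel} \mid \text{vitd}] = \beta_0 + \beta_1\,\text{vitd} \\
\text{Model (2):}\quad & E[\text{kneeheel} \mid \text{vitd}] = \beta_0 + \beta_1\,\text{vitd}
  + \beta_2\,(\text{vitd} - 28)_+ , \qquad (x)_+ = \max(x, 0)
\end{align*}
```

Under this parameterisation the slope is $\beta_1$ below the threshold and $\beta_1 + \beta_2$ above it, and setting $\beta_2 = 0$ reduces model (2) to model (1).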
Your task is to investigate the evidence for each of the two hypothesised relationships in the above
order and to determine which model results you would present to the clinical investigator.
Provide your response in the form of a technical report documenting your analyses for each of the
models, including Stata (or other software) output where appropriate, in a form that would allow
another person to check over the work that you’ve done. It should include an appropriate
explanation of steps taken, including a clear statement of the models, and some checking of the
underlying assumptions for the analyses performed.
At the end of your report, you should include a separate and stand-alone paragraph where you
interpret the important results of your analysis suitable for a clinical investigator.
[Hint: For the second model, you will need to define a second vitamin D covariate in order to
estimate separately the association between birth size and vitamin D level for mothers with vitamin
D levels above and below 28 nmol/L. This is done most easily by creating a variable that takes the
value zero if vitd<28, and then measures the “excess” vitamin D beyond the threshold level, i.e. by
typing the following commands in Stata:
    gen vitd_beyond=vitd-28
    replace vitd_beyond=0 if vitd < 28 ]
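The commands in the hint can be extended into a short Stata sketch that fits both models. This is illustrative only: it assumes that vitD.dta sits in the current working directory, and the use of lincom to obtain the slope above the threshold is one possible choice rather than a required step.

```stata
* Assumption: vitD.dta is in the current working directory
use "vitD.dta", clear

* Covariate from the hint: "excess" vitamin D beyond the threshold of 28
gen vitd_beyond = vitd - 28
replace vitd_beyond = 0 if vitd < 28

* Model (1): a single straight line across the whole range of vitd
regress kneeheel vitd

* Model (2): the slope is allowed to change at the threshold
regress kneeheel vitd vitd_beyond

* Slope above the threshold = sum of the two coefficients from model (2)
lincom vitd + vitd_beyond
```

In model (2) the coefficient of vitd is the slope below the threshold, and the coefficient of vitd_beyond is the change in slope once the threshold is passed.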
Question 2
Now consider the additional variables: maternal height, smoking and pregnancy history, the baby’s
sex and gestational age (length in weeks) at birth.
(i) The clinical investigator asks you to comment on the possible confounding effects of
gestational age and requests you adjust for gestational age in your chosen regression
model in question 1. WITHOUT performing any analyses, explain to the clinician how the
estimated vitamin D effect would be interpreted in such a model, and whether you have
any reservations about this proposed regression model to address the confounding
question.
(ii) Would you have similar reservations about adjusting for any of the other four variables
(maternal height, smoking, pregnancy history and sex of child)?
NOTE: No analyses are required for question 2
Question 3
The hint for Question 1 regarding model (2) states the required form of the covariate that needs to
be added to model (1) to form the “threshold” effect model. Your task here is to provide an
algebraic justification for the form of this covariate. You might wish to start by writing out separate
regression models for the relationship with vitamin D before and after the threshold, and then
including the constraint that the regression line must be continuous, that is, the lines before and
after the threshold join together at the threshold. Or you may choose another approach.
Part B
A sexual health researcher has asked you for some statistical help in interpreting the results of their
study. In this study, the researcher randomised 96 people into 4 different education interventions,
and measured their knowledge on sexually transmitted infections (STIs) one month later. The
knowledge score is measured on a scale from 0 to 25 and the education groups are as follows:
Group A: An email containing links to web resources
Group B: A one on one discussion with a nurse about STIs
Group C: A fact sheet / brochure
Group D: An interactive group presentation
The data are provided in the dataset “knowledge.dta”.
The researcher has previously completed an introductory statistics course, and analysed the scores
across groups using the Stata code below, where variables B, C, and D represent indicator variables
for education groups B, C and D respectively, and ‘score’ represents the knowledge score.
. regr score B C D
Source | SS df MS Number of obs = 96
-------------+----------------- XXXXXXXXXXF(3, 92) = 2.19
Model | XXXXXXXXXX XXXXXXXXXXProb > F = 0.0941
Residual | XXXXXXXXXX XXXXXXXXXXR-squared = 0.0667
-------------+----------------- XXXXXXXXXXAdj R-squared = 0.0363
Total | XXXXXXXXXX XXXXXXXXXXRoot MSE = 5.2036
------------------------------------------------------------------------------
score | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_cons | XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX
B | XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX
C | XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX
D | XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX
------------------------------------------------------------------------------
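As a sketch of what the model fitted above estimates, the regression can be written out as follows. It assumes that group A is the reference category, which the absence of an indicator for group A implies. The symbols $\beta$ and $\mu$ are notation introduced here and are not taken from the output.

```latex
\begin{align*}
E[\text{score}] &= \beta_0 + \beta_B B + \beta_C C + \beta_D D, \\
\beta_0 &= \mu_A, \quad \beta_B = \mu_B - \mu_A, \quad
\beta_C = \mu_C - \mu_A, \quad \beta_D = \mu_D - \mu_A .
\end{align*}
```

Here $\mu_g$ denotes the mean score in group $g$. Each coefficient and its P-value therefore describe a comparison with group A only, while the overall F-test on 3 and 92 degrees of freedom compares all four groups at once.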
The researcher interprets the results as telling him that the group presentation (D) is definitely the
best education intervention because it is the only one with a “significant” p-value. Following this
conclusion, the researcher decided to leave the “non-significant” indicator variables out of the
regression model and obtained the following results:
. regr score D
Source | SS df MS Number of obs = 96
-------------+----------------- XXXXXXXXXXF(1, 94) = 3.76
Model | XXXXXXXXXX XXXXXXXXXXProb > F = 0.0554
Residual | XXXXXXXXXX XXXXXXXXXXR-squared = 0.0385
-------------+----------------- XXXXXXXXXXAdj R-squared = 0.0283
Total | XXXXXXXXXX XXXXXXXXXXRoot MSE = 5.2254
------------------------------------------------------------------------------
score | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_cons | XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX
D | XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX834317
------------------------------------------------------------------------------
He suggests that this provides the simplest summary result and would like to report this in his
research paper. However, he is confused as to why the estimated regression coefficient for group D
has reduced compared with the previous model, with the P-value now being greater than 0.05.
Question 1
Provide some advice to the researcher on their interpretation of the data analysis. Do you agree
that the second regression results with the indicator for group D alone should be