Linear Models (LMR) 2020: Assignment 2
Total marks available: 40 (the value of individual questions is indicated).
Due: Monday 8 June 2020, 11:59pm
Penalty for late submission: 5% per day to a maximum of 50% of marks earned.
There are 3 parts in this assignment. Indications for the structure of responses are
given in the 3 parts. Please read these carefully. Deductions may be made for
unclear or irrelevant material that is included in your solutions. Please proof-read
your submission to ensure it is as polished and professionally presented as possible.
You can include any relevant computer code used to generate results for your
answers. We expect that a reasonable effort on the assignment should amount to
not much more than 12 pages including both text and graphics. Allowing some
tolerance, the maximum number of pages for this assignment is set to 16 pages.
Please note that any page beyond page 16 will not be marked. Please also label your
answers with the part and question number, in the same way the parts are numbered
here; this makes marking easier.
Please submit your work online via Canvas. If you are having trouble submitting your
assignment, email me a copy of your submission before the due date and time as
proof of completion by the due date. Extensions can be granted if need be and the
difficulties we are all facing during this pandemic will be taken into account. Please,
contact me by email, prior to the due date, if you need an extension.
Note that by submitting online you are agreeing to the plagiarism declaration
included on the submission web page. Good luck!
• Answer the questions in an essay style where appropriate. Make
sure to include all relevant computer output (and exclude irrelevant output)
that is presented neatly and integrated through the discussion.
• Place the code in an appendix (if the syntax you want to present is long).
• Do not repeat the assignment questions
• Do not include an assignment cover page
• Note that there are not necessarily unique correct answers for these
questions, and marks will be awarded for appropriate analysis using
regression models, with corresponding justifications and explanations. Marks
may be subtracted for unfocussed or disorganised presentation of material.
• Where equations or formulae are to be presented, please attempt to do this
electronically using a word processor rather than including images of scanned
handwritten work. When this is not possible, scanned handwritten work must
be extremely neat and legible. Writing mathematics electronically is an
important skill to learn and practice.
• The size of your assignment can be quite variable and may depend on things
such as table formatting and size of plots. However, concise presentation and
discussion within the limit of the maximum number of pages is encouraged.
Part A (16 marks)
The dataset “vitD.dta” contains data on measurements made on a group of newborn babies in a
study on the possible associations between maternal vitamin D levels and fetal growth, as measured
by size at birth. The study was motivated by earlier studies on animals and some conflicting
epidemiological evidence that low levels of vitamin D may be associated with reduced growth of the
fetus. The particular measure of size at birth that we will focus on here was a measure of the baby’s
knee-to-heel length (performed by a very accurate device called a knemometer!), recorded in the
dataset with the variable name “kneeheel”. Vitamin D level was measured as the concentration of
25-hydroxyvitamin D (25-OHD), in nmol/L, at a first trimester antenatal visit. The dataset also
records a number of other variables potentially associated with birth size, in particular the sex of the
baby, the mother’s height, whether or not she smoked during pregnancy, whether this was her first
baby, and the gestational age at birth.
[Note that the usual caveats apply: these data have been sampled and modified from an original
study (conducted by Dr Ruth Morley at the Royal Children’s Hospital, Melbourne) and no substantive
conclusions should be drawn from these analyses.]
Question 1 [9 marks]
In this question we ask you to examine the association between birth size (kneeheel) and
maternal vitamin D levels during pregnancy (vitd) using two different regression models. For this
part of the analysis you should ignore all other variables in the dataset.
The literature in the area suggests the following two possible relationships (if any):
a) There may be a smoothly increasing birth size with vitamin D level across the whole range of
vitamin D levels, referred to as model (1).
b) There might be a threshold effect, whereby growth is adversely affected only below a
certain minimum “normal range” of vitamin D levels. In other words, there may be a
smooth association between the two variables that is particularly strong below the “normal
range” of vitamin D levels, and less strong once within the normal range. This model will be
referred to as model (2). It is often called a “hockey stick” or “bent stick” or “changepoint”
model. The threshold value of interest is taken from the relevant research literature, and is
28 nmol/L of 25-OHD.
(Note: Although it would be possible to explore for other threshold values, we require that you take
this value as given in your analysis.)
Your task is to investigate the evidence for each of the two hypothesised relationships in the above
order and to determine which model results you would present to the clinical investigator.
Provide your response in the form of a technical report documenting your analyses for each of the
models, including Stata (or other software) output where appropriate, in a form that would allow
another person to check over the work that you’ve done. It should include an appropriate
explanation of steps taken, including a clear statement of the models, and some checking of the
underlying assumptions for the analyses performed.
At the end of your report, you should include a separate and stand-alone paragraph where you
interpret the important results of your analysis suitable for a clinical investigator.
[Hint: For the second model, you will need to define a second vitamin D covariate in order to
estimate separately the association between birth size and vitamin D level for mothers with vitamin
D levels above and below 28 nmol/L. This is done most easily by creating a variable that takes the
value zero if vitd<28, and otherwise measures the “excess” vitamin D beyond the threshold level, i.e. by
typing the following commands in Stata:
    gen vitd_beyond=vitd-28
    replace vitd_beyond=0 if vitd < 28 ]
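For those working in software other than Stata, the same covariate can be sketched in Python. This is an illustration only, under stated assumptions: the function name excess_beyond and the sample values are hypothetical and are not taken from vitD.dta. The logic mirrors the two Stata commands in the hint: subtract the threshold, then set the result to zero for values below it.

```python
THRESHOLD = 28  # threshold for 25-OHD, taken as given in the hint


def excess_beyond(vitd, threshold=THRESHOLD):
    """Return 0 at or below the threshold, otherwise the excess above it."""
    return max(vitd - threshold, 0)


# Hypothetical vitamin D values (NOT from the dataset), for illustration only.
sample_vitd = [15, 28, 40.5]
vitd_beyond = [excess_beyond(v) for v in sample_vitd]
print(vitd_beyond)  # prints [0, 0, 12.5]
```

Note that this sketch does not handle missing values; how missing vitd values should be treated is left to the analyst.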
Question 2 [5 marks]
Now consider the additional variables: maternal height, smoking and pregnancy history, the baby’s
sex and gestational age (length in weeks) at birth.
(i) The clinical investigator asks you to comment on the possible confounding effects of
gestational age and requests you adjust for gestational age in your chosen regression
model in question 1. WITHOUT performing any analyses, explain to the clinician how the
estimated vitamin D effect would be interpreted in such a model, and whether you have
any reservations about this proposed regression model to address the confounding.
(ii) Would you have similar reservations about adjusting for any of the other four variables
(maternal height, smoking, pregnancy history and sex of child)?
NOTE: No analyses are required for question 2
Question 3 [2 marks]
The hint for Question 1 regarding model (2) states the required form of the covariate that needs to
be added to model (1) to form the “threshold” effect model. Your task here is to provide an
algebraic justification for the form of this covariate. You might wish to start by writing out separate
regression models for the relationship with vitamin D before and after the threshold, and then
including the constraint that the regression line must be continuous, that is, the lines before and
after the threshold join together at the threshold. Or you may choose another approach.
Part B (13 marks)
A sexual health researcher has asked you for some statistical help in interpreting the results of their
study. In this study, the researcher randomised 96 people into 4 different education interventions,
and measured their knowledge on sexually transmitted infections (STIs) one month later. The
knowledge score is measured on a scale from 0 to 25 and the education groups are as follows:
Group A: An email containing links to web resources
Group B: A one-on-one discussion with a nurse about STIs
Group C: A fact sheet /
Group D: An interactive group presentation
The data are provided in the dataset “knowledge.dta”.
The researcher has previously completed an introductory statistics course, and analysed the scores
across groups using the Stata code below, where variables B, C, and D represent indicator variables
for education groups B, C and D respectively, and ‘score’ represents the knowledge score.
. regr score B C D
Source | SS XXXXXXXXXXdf MS Number of obs = XXXXXXXXXX
-------------+----------------- XXXXXXXXXXF(3, 92) = XXXXXXXXXX
Model | XXXXXXXXXX XXXXXXXXXX XXXXXXXXXXProb > F = XXXXXXXXXX
Residual | XXXXXXXXXX XXXXXXXXXXR-squared = XXXXXXXXXX
-------------+----------------- XXXXXXXXXXAdj R-squared = XXXXXXXXXX
Total | XXXXXXXXXX XXXXXXXXXXRoot MSE = XXXXXXXXXX
score | Coef. Std. E