STA2300 Data Analysis S1, 18
Assignment 2
Due Date: 1 May, 2018
Weighting: 20%
Full Marks: 100
Answering the questions in this assignment should not be your first attempt at these types of
questions. It is essential that you work through practice exercises from the tutorial sheets in
the Study Book and/or Text Book first.
This assignment is important in checking your knowledge, providing feedback and helping to
establish competency in essential skills.
Answer all the questions. The questions are not of equal weight; some questions are worth
much more than the others.
The questions relate to materials in Modules 1 to 6.
Before starting this assignment read Notes Concerning Assignments under the Introductory
Material link in the ‘Getting started’ tab on the StudyDesk.
When you are asked to comment on a finding, usually a short paragraph is all that is required.
Do not copy/paste SPSS output into your assignment unless specifically asked to do so. In many
cases the SPSS output contains much more information than is required for a correct and
complete answer. In those cases just reproducing the output may not attract any marks. Make
sure you report only the information from the SPSS output relevant to your answer.
In order to obtain full marks for any question you must show all working.
Convert your Word document to PDF before submitting your assignment via the link on the
StudyDesk. See the Introductory Material (Section 5, Assignments) for information about how
to do this properly.
This assignment consists of 6 questions.
Question 1 (14 marks)
This question uses information from the data file DHS18.sav found under the Assessment tab on the
StudyDesk (also see DHS18.txt for more details about the survey and the variables measured). Make
sure the Variable View in SPSS is set up properly with all ‘labels’ correctly defined (with units), all
‘values’ assigned correctly for categorical variables and the correct ‘measure’ selected for all variables.
A researcher is interested to know if wealth of women is associated with the level of their educational
qualifications.
(a) (4 marks) Use a contingency table to display the relationship between ‘Wealth status’ and
‘Education level’ for the women in this survey (you should use SPSS to produce this
contingency table). The title for this table should reflect the context of the study. (Note that,
by convention, a table title should appear above the table.)
(b) (2 marks) What proportion of women are ‘Poorest’ and have ‘No Education’?
(c) (2 marks) Of those who are ‘Richest’, what proportion have ‘Higher’ education?
(d) (6 marks) Does there appear to be an association between ‘Wealth status’ and the ‘Education
level’ for women in this developing country? Explain in fewer than 100 words, using one or more
numerical examples from a conditional distribution table to support your conclusion.
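Parts (b) and (c) ask for two different kinds of proportion: (b) is a joint proportion (one cell divided by the grand total), while (c) is a conditional proportion (one cell divided by its row total). The sketch below illustrates the difference in Python rather than SPSS. All counts are made up for illustration and are NOT taken from DHS18.sav; the category names ‘Primary’ and ‘Secondary’ are also assumptions.

```python
# Hypothetical counts, NOT from DHS18.sav.
# Rows are wealth status, columns are education level.
counts = {
    "Poorest": {"No education": 40, "Primary": 30, "Secondary": 25, "Higher": 5},
    "Richest": {"No education": 5, "Primary": 15, "Secondary": 40, "Higher": 40},
}

# Total number of women across every cell of the table.
grand_total = sum(sum(row.values()) for row in counts.values())


def joint_proportion(wealth, education):
    """Share of ALL women who fall in this single cell."""
    return counts[wealth][education] / grand_total


def conditional_proportion(wealth, education):
    """Share of women WITHIN one wealth group who have this education level."""
    row_total = sum(counts[wealth].values())
    return counts[wealth][education] / row_total


joint = joint_proportion("Poorest", "No education")
conditional = conditional_proportion("Richest", "Higher")

print(f"Joint proportion (Poorest and No education): {joint:.3f}")
print(f"Conditional proportion (Higher, given Richest): {conditional:.3f}")
```

Comparing conditional proportions across wealth groups (for example, the share with ‘Higher’ education among the ‘Poorest’ versus among the ‘Richest’) is the kind of numerical evidence part (d) asks for.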
Question 2 (20 marks)
Consider the data in the file DHS18.sav again. Use SPSS to find the answers to the following questions,
but do not copy and paste SPSS output into your answer for parts (c) and (d) (make sure you always
include units where appropriate).
(a) (5 marks) Display the distribution of ‘Weight’ of the women in 2011 from this survey using an
appropriate graph. Label the axes correctly, include units of measure and provide an
appropriate title.
(b) (4 marks) Using the graph produced in part (a) only (don’t refer to SPSS summary statistics),
describe in no more than 60 words, the distribution of ‘Weight’ of the women, from this
survey. Include comments on shape, centre and spread of the distribution and the existence
of outliers and/or gaps, if any. Do not perform any calculations; use the graph only.
(c) (3 marks) What are the sample size, mean and standard deviation of the distribution of ‘Weight’
of the women in 2011, from this survey? (You can use SPSS to calculate them but do not
copy/paste SPSS output).
(d) (4 marks) Using SPSS find the median, first quartile, third quartile and IQR of the distribution
of ‘Weight’ of the women in 2011, from this survey. (Do not copy/paste SPSS output).
(e) (4 marks) For the distribution of ‘Weight’ of the women in 2011, which statistics are
appropriate to measure its centre and spread? Give a reasonable explanation for your choice.
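The statistics requested in parts (c) to (e) can be illustrated with a short Python sketch. The weights below are made up for illustration and are NOT from DHS18.sav. Quartile conventions differ between packages, so the quartiles printed here are not guaranteed to match SPSS output for the same data.

```python
from statistics import mean, median, quantiles, stdev

# Hypothetical weights in kg, NOT from DHS18.sav.
# The last value is deliberately extreme to mimic a high outlier.
weights_kg = [42.0, 45.5, 47.0, 48.5, 50.0, 51.5, 53.0, 55.5, 58.0, 61.0, 88.0]

n = len(weights_kg)
sample_mean = mean(weights_kg)
sample_sd = stdev(weights_kg)  # sample standard deviation (divides by n - 1)
sample_median = median(weights_kg)

# quantiles(..., n=4) returns the three cut points Q1, Q2 and Q3.
q1, _q2, q3 = quantiles(weights_kg, n=4)
iqr = q3 - q1

print(f"n = {n}")
print(f"mean = {sample_mean:.2f} kg, SD = {sample_sd:.2f} kg")
print(f"median = {sample_median:.2f} kg, Q1 = {q1:.2f} kg, Q3 = {q3:.2f} kg")
print(f"IQR = {iqr:.2f} kg")
```

In this made-up sample the single large value pulls the mean above the median, which is why the median and IQR are usually preferred for skewed distributions or distributions with outliers, and the mean and standard deviation for roughly symmetric ones.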
Question 3 (12 marks)
Use this extract taken from the article, “Garlic juice and moderate physical activity increase the level
of kidney function of CKD patients,” (appeared in Kidney Research on December 31, 2017) to answer
the questions that follow:
Kidney disease is called a ‘silent disease’ as there are often few or no symptoms. In fact, 90% of
kidneys can be damaged without observing any symptoms. Nowadays, Chronic Kidney Disease
(CKD) is considered as one of the major public health problems worldwide. It is a chronic disorder
in which a person has a low glomerular filtration rate (GFR). A GFR level of 44 to 30 is considered as
moderate to severe loss of kidney function. A recent study by researchers at the US Kidney Research
Centre and Oklahoma University School of Public Health investigated the effects of garlic juice and
moderate physical activity on moderate to severe kidney disease in patients in the US.
A double-blinded, randomized, placebo-controlled trial was conducted with 120 moderate-to-
severely-affected kidney patients. Randomization was stratified by gender. Patients were randomly
assigned to one of four groups. Each group consisted of 15 men and 15 women. The first group was
assigned to receive 50 grams of garlic juice and required to participate in moderate physical activity
(30 minute walk) daily, the second group was given 50 grams of garlic juice daily, the third group
was required to undertake a 30 minute walk daily, and the last group was not given any intervention.
After fifteen weeks of intervention, it was found that the GFR of the combined garlic juice and
physical activity group was significantly lower compared with the other groups.
The researchers also found a greater reduction in BMI, systolic and diastolic blood pressure in the
mixed group compared with garlic-only, physical-activity-only and control groups.
(a) (2 marks) Is this an experimental or observational study? In fewer than 50 words clearly explain
your choice based on the extract given above.
(b) (3 marks) For the above study identify, if appropriate, the
i) response variable(s).
ii) factor and its levels.
iii) sample size.
(c) (4 marks) Are the four principles of experimental design used in this study? Explain in the
context of the study.
(d) (3 marks) Explain explicitly what a confounding variable is. Identify one plausible confounding
variable in this study and explain why it is a confounding variable.
Question 4 (12 marks)
Recent research shows that the distance from home to the nearest health service facility is an
important factor in the control of a number of diseases in developing countries. Based on historical
data (not the sample data in DHS18.sav) in Bangladesh, the distance from home to the nearest health
service facility in rural areas is approximately normally distributed with a mean of 9.5 km and a
standard deviation of 1.5 km.
(a) (2 marks) Identify the variable of interest and the unit of measurement of this variable.
(b) (3 marks) Based on the above normal distribution, for what proportion of rural dwellers in
Bangladesh is the distance from home to the nearest health service facility 11 km or more?
(c) (4 marks) Based on the above normal distribution, for what proportion of Bangladeshi rural
dwellers is the nearest health service facility between 7 and 10 km from home?
(d) (3 marks) Based on the above normal distribution, below what distance are the closest 5% of
Bangladeshi rural dwellers to their nearest health service facility?
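Parts (b) to (d) each reduce to a standard normal calculation using z = (x − mean) / standard deviation. As a cross-check only (it does not replace the working you are required to show), the three quantities can be sketched in Python with the standard library's NormalDist, using the mean and standard deviation stated above. The variable names are illustrative.

```python
from statistics import NormalDist

# Model stated in the question: mean 9.5 km, standard deviation 1.5 km.
distance = NormalDist(mu=9.5, sigma=1.5)

# (b) Upper tail: P(X >= 11) = 1 - P(X < 11), where z = (11 - 9.5) / 1.5.
p_at_least_11 = 1 - distance.cdf(11)

# (c) Interval: P(7 < X < 10) = P(X < 10) - P(X < 7).
p_between_7_and_10 = distance.cdf(10) - distance.cdf(7)

# (d) The distance below which the closest 5% live is the 5th percentile.
fifth_percentile = distance.inv_cdf(0.05)

print(f"P(X >= 11 km)     = {p_at_least_11:.4f}")
print(f"P(7 < X < 10 km)  = {p_between_7_and_10:.4f}")
print(f"5th percentile    = {fifth_percentile:.2f} km")
```

Small differences from hand working are expected when z-scores are rounded to two decimal places for use with printed normal tables.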
Question 5 (24 marks)
Consider the data in the file DHS18.sav again. Given that it is believed that ready access to health
services can have a marked impact on the wellbeing of people in developing countries, a researcher is
interested in identifying if distance from home to the nearest health service facility of women can be
used to predict the weight of women based on information collected in this survey in 2011.
(a) (2 marks) What are the two variables the researcher will need to include in the analysis? What
type of variables are they?
(b) (4 marks) Use an appropriate graph to display the relationship between the two variables
identified in part (a). Label the axes correctly, include units of measure and provide an
appropriate title.
(c) (4 marks) From the graph in part (b), describe (in no more than 30 words) the form, direction
and scatter of this relationship, and identify any outliers.
(d) (4 marks) Calculate an appropriate statistic to measure the strength and direction of the
relationship between the two variables for these women. Interpret this statistic.
(e) (6 marks) Use SPSS output to write the equation of the regression line which could be used to
make this prediction and then plot the regression line on the graph in part (b).
(f) (3 marks) Using the regression equation from part (e), predict the expected weight of women
whose nearest health service facility is 10 km from their home. Would you consider this to be
an accurate prediction? Why?
(g) (1 mark) What proportion of the variability in weight of women can be explained by the
model, i.e. the relationship between weight and distance from home to the nearest health
service facility