Solution
Harshal answered on
May 17 2023
**ALERT: PLEASE, READ THE “CW2_Brief.docx” file CAREFULLY BEFORE YOU START**
**Data**
For this exam, you will need a range of datasets named 1. "heart", 2. "kidney", 3. "ovarian", and 4. the data files from the genomics practical.
Data 1-3 are provided in two formats, 1) as a combined ".RDS" file and 2) as individual files in ".csv" files (provided as a ".zip"). You will find that using the ".RDS" is the easiest way to get all the data in one go. However, if you want to try JASP for some of the questions, you will find the ".csv" files inside the ".zip" file the easiest way to go.
The details of these datasets are provided with the corresponding questions.
Data 4 is provided as part of the practical sessions, and you should use those.
ANALYSIS: Only three questions from the exam require analysis inside RStudio. You are provided with template notebook “CW2_analysis_template.Rmd” with the relevant chunks already inserted to help you.
Good Luck
Please, attempt all questions below.
(IMPORTANT NOTE: There are no mistakes in the question text, so answer the questions as you see them)
Question 1 (10 marks)
You are an intern in a neurology research laboratory tasked with studying the effect of lifestyle on the risk of dementia. You have been given a data table containing three columns as follows:
Column 1) weekly use of bleach (value range = 1 – infinite),
Column 2) length of fingernails (value range = short, medium, and long), and
Column 3) the risk of dementia (value range = Low and high).
Please, answer the below questions using the above data:
1a - What type of variables are 1) weekly use of bleach, 2) waist size, and 3) the risk of dementia? (3 marks)
Ans:
1) Weekly use of bleach (value range = 1 – infinite): continuous (numeric) variable
2) Length of fingernails (value range = short, medium, and long): categorical (ordinal) variable
3) Risk of dementia (value range = Low and High): categorical (ordinal) variable
1b - What is the appropriate plot for visualising the relationship between “Column 2” and "Column 3"? (0.5 mark)
Ans: Grouped Bar chart or stacked bar chart
1c - What is the suitable plot for visualising the relationship between "Column 3" and "Column 1"? (0.5 mark)
Ans: Box plot or violin plot
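As a rough illustration of the plot choices in 1b and 1c, the base R sketch below draws a grouped bar chart and a box plot. The data frame `lifestyle_demo`, its column names, and all values are hypothetical stand-ins, not the exam data.

```r
# Hypothetical stand-in data (NOT the exam table): 12 made-up participants
lifestyle_demo <- data.frame(
  bleach_per_week = c(1, 2, 2, 3, 4, 5, 6, 6, 8, 9, 3, 7),
  nail_length = factor(
    c("short", "short", "medium", "long", "medium", "long",
      "long", "short", "long", "medium", "short", "long"),
    levels = c("short", "medium", "long"), ordered = TRUE
  ),
  dementia_risk = factor(
    c("Low", "Low", "Low", "Low", "High", "High",
      "High", "Low", "High", "High", "Low", "High"),
    levels = c("Low", "High")
  )
)

# 1b: two categorical variables -> grouped bar chart of the cross-tabulated counts
counts <- table(lifestyle_demo$dementia_risk, lifestyle_demo$nail_length)
barplot(counts, beside = TRUE, legend.text = TRUE,
        xlab = "Length of fingernails", ylab = "Number of participants")

# 1c: numeric variable split by a categorical variable -> box plot
boxplot(bleach_per_week ~ dementia_risk, data = lifestyle_demo,
        xlab = "Risk of dementia", ylab = "Weekly use of bleach")
```

Setting `beside = FALSE` in `barplot()` would give the stacked variant instead of the grouped one.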
1d - Name the statistical tests to confirm observations made in 1b and 1c above and list all their assumptions? (2 marks)
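Ans: Chi-square test of independence for 1b
Assumptions: 1. Independence of observations
2. Sufficient sample size (expected cell counts large enough, commonly at least 5)
Independent-samples t-test or Mann-Whitney U test for 1c
Assumptions: 1. Independent observations
2. Normality within each group (t-test only)
3. Homogeneity of variance (t-test only)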
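The tests named in 1d can be run in base R roughly as follows. This is a minimal sketch under stated assumptions: the count table and the two bleach-use vectors are made-up values, and the object names are hypothetical.

```r
# Hypothetical counts (NOT the exam data): nail length (rows) by dementia risk (columns)
nail_by_risk <- matrix(
  c(20, 10,
    15, 15,
     8, 22),
  nrow = 3, byrow = TRUE,
  dimnames = list(nails = c("short", "medium", "long"), risk = c("Low", "High"))
)

# 1b: chi-square test of independence; the expected counts check the sample-size assumption
chi <- chisq.test(nail_by_risk)
chi$expected
chi

# Hypothetical weekly bleach use for the two risk groups
bleach_low  <- c(1, 2, 2, 3, 3, 4, 4, 5)
bleach_high <- c(4, 5, 6, 6, 7, 8, 9, 10)

# 1c, parametric: independent-samples t-test (var.equal = TRUE assumes homogeneous variances)
t_res <- t.test(bleach_high, bleach_low, var.equal = TRUE)
t_res

# 1c, non-parametric: Mann-Whitney U test (exact = FALSE because the made-up data contain ties)
mw_res <- wilcox.test(bleach_high, bleach_low, exact = FALSE)
mw_res
```

The Mann-Whitney test is the usual fallback when the normality assumption of the t-test is doubtful.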
1e – What type of plot is shown below (1 mark)? And interpret the plot in 50 words (2 marks). What is the median survival time for the “Low Score” Group? (1 mark)
Ans:
Figure 1e:
The plot above is a Kaplan-Meier plot.
A Kaplan-Meier plot displays the estimated probability of remaining event-free (here, survival probability) over time. It gives insight into the survival pattern of a population and allows comparison between groups, in this case the low score and high score groups. Each step down in a curve marks the time of one or more events, and the vertical gap between the curves shows how survival differs between the groups. Here the p-value associated with the plot is less than 0.05, which suggests a statistically significant difference in survival between the low score and high score groups: a difference this large would be unlikely to arise by chance alone. This supports the hypothesis that the two groups have different survival outcomes.
Median survival time for the “Low Score” group:
The survival curve for the "Low Score" group does not cross the 0.5 probability mark, so the median survival time for this group cannot be determined directly from the plot.
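The point about the median can be illustrated with the `survival` package. The sketch below uses made-up follow-up times (not the data behind Figure 1e) in which fewer than half of the subjects experience the event, so the estimated curve stays above 0.5 and the reported median is `NA`. The data frame `low_score` and its columns are hypothetical.

```r
library(survival)  # recommended package that ships with standard R installations

# Hypothetical follow-up data (NOT the data behind Figure 1e):
# time in months; status 1 = event observed, 0 = censored
low_score <- data.frame(
  time   = c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20),
  status = c(1, 0, 1, 0,  1,  0,  0,  0,  0,  0)
)

# Kaplan-Meier estimate for a single group
fit <- survfit(Surv(time, status) ~ 1, data = low_score)

# Lowest estimated survival probability reached by the curve
min(fit$surv)

# Median survival: NA when the curve never falls to 0.5
median_survival <- summary(fit)$table[["median"]]
median_survival
```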
Question 2 (20 marks):
The kidney data file has columns called "region" and "KRT19", representing the kidney region of the sample and the expression of the gene KRT19 as TPM, respectively.
Before attempting the question under this section, use online search to understand how the kidney regions in the data map to the anatomy of the kidney. Also, use online search to understand the full name of the KRT19 genes.
Please, answer the below questions using the above data:
2a - Using the correct statistical test, is the expression of KRT19 different across the kidney regions and between pairs of kidney regions? (10 marks)
(You MUST perform and comment on all the assumption checks and pre-tests to get any mark for question 2a)
Ans: We use a one-way ANOVA to test whether the expression of KRT19 differs across kidney regions, followed by a post-hoc test for differences between pairs of kidney regions.
library(readxl)
kidney <- read_excel("C:/Users/LENOVO/Downloads/cw2-exam-20230516-ieawfm/kidney.xlsx")
View(kidney)
length(kidney)
[1] 3
region = kidney$region
KRT10 = kidney$KRT19  # note: the variable is named KRT10 but holds the KRT19 expression values
GGT1 = kidney$GGT1
library(stats)
# perform one-way ANOVA
result = aov(KRT10 ~ region, data = kidney)
result
Call:
aov(formula = KRT10 ~ region, data = kidney)
Terms:
                   region Residuals
Sum of Squares  187.38708  60.43356
Deg. of Freedom         3        36
Residual standard error: 1.29565
Estimated effects may be unbalanced
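The printout of `result` lists the sums of squares but not the F statistic or its p-value; `summary(result)` would report both. As a cross-check, they can also be derived by hand from the numbers printed above. The variable names below are illustrative only.

```r
# Values copied from the aov() printout above
ss_region   <- 187.38708; df_region   <- 3
ss_residual <- 60.43356;  df_residual <- 36

ms_region   <- ss_region / df_region       # mean square between regions
ms_residual <- ss_residual / df_residual   # mean square within regions (residual)

f_stat  <- ms_region / ms_residual
p_value <- pf(f_stat, df_region, df_residual, lower.tail = FALSE)

sqrt(ms_residual)  # reproduces the residual standard error, about 1.29565
f_stat             # about 37.2
p_value            # far below 0.05
```

An F statistic this large on 3 and 36 degrees of freedom indicates that mean expression differs across the regions, subject to the assumption checks that follow.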
# Check ANOVA assumptions
# 1. Normality assumption
shapiro.test(residuals(result))
Shapiro-Wilk normality test
data: residuals(result)
W = 0.94001, p-value = 0.03461
# here the p-value (0.035) is less than 0.05, so the residuals deviate from normality; the ANOVA results should therefore be interpreted with caution
# 2. Homogeneity of variances assumption (Levene's test)
library(car)
leveneTest(KRT19 ~ region, data = kidney)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  3  2.1461 0.1114
      36
Warning message:
In leveneTest.default(y = y, group = group, ...) : group coerced to factor.
# here the p-value (0.111) is greater than 0.05, so the homogeneity-of-variance assumption is satisfied
# Perform Tukey's HSD test
tukey_result <- TukeyHSD(result)
# View the pairwise comparison results
tukey_result
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = KRT10 ~ region, data = kidney)
$region
                    diff        lwr       upr     p adj
glomeruli-cortex  -0.572 -2.1325433 0.9885433 0.7575930
medulla-cortex     2.710  1.1494567 4.2705433 0.0002266
pelvis-cortex      4.810  3.2494567 6.3705433 0.0000000
medulla-glomeruli  3.282  1.7214567 4.8425433 0.0000113
pelvis-glomeruli   5.382  3.8214567 6.9425433 0.0000000
pelvis-medulla     2.100  0.5394567 3.6605433 0.0047128
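In the Tukey output, every pair of regions differs significantly (adjusted p < 0.05) except glomeruli versus cortex (adjusted p = 0.758). Because the Shapiro-Wilk test rejected normality of the residuals, a rank-based cross-check may be worth running alongside the ANOVA. The sketch below is one way that could look and is not part of the original answer: `kidney_demo` holds made-up values, and Kruskal-Wallis with Holm-adjusted pairwise Wilcoxon tests is a swapped-in alternative to ANOVA with Tukey's HSD.

```r
# Hypothetical stand-in data (NOT the real kidney file): 4 regions x 10 samples
steps <- 0.2 * (0:9)
kidney_demo <- data.frame(
  region = factor(rep(c("cortex", "glomeruli", "medulla", "pelvis"), each = 10)),
  KRT19  = c(2.00 + steps, 1.55 + steps, 4.71 + steps, 6.82 + steps)
)

# Overall test: does the distribution of KRT19 differ across regions?
kw <- kruskal.test(KRT19 ~ region, data = kidney_demo)
kw

# Pairwise follow-up with Holm correction for multiple comparisons
pw <- pairwise.wilcox.test(kidney_demo$KRT19, kidney_demo$region,
                           p.adjust.method = "holm")
pw
```

With the real data, the same two calls would be applied to `kidney$KRT19` and `kidney$region`.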
2b - Produce and interpret a publication-ready plot decorated with the correct statistical test for each pair. (5 marks)
(You MUST use meaningful axis labels for question 2b)
Ans:
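One possible way to build such a plot in base R is sketched below: a box plot per region with a bracket and the Tukey-adjusted p-value for each pair. It is a sketch under stated assumptions, not the original author's figure. `kidney_demo` contains made-up values, and the bracket spacing `step` is an arbitrary choice.

```r
# Hypothetical stand-in data (NOT the real kidney file): 4 regions x 10 samples
steps <- 0.2 * (0:9)
kidney_demo <- data.frame(
  region = factor(rep(c("cortex", "glomeruli", "medulla", "pelvis"), each = 10)),
  KRT19  = c(2.00 + steps, 1.55 + steps, 4.71 + steps, 6.82 + steps)
)

# One-way ANOVA followed by Tukey's HSD for all pairs of regions
fit <- aov(KRT19 ~ region, data = kidney_demo)
tuk <- TukeyHSD(fit)$region   # matrix with columns diff, lwr, upr, p adj

lv    <- levels(kidney_demo$region)
y_top <- max(kidney_demo$KRT19)
step  <- 0.6                  # vertical spacing between brackets (arbitrary)

# Box plot with meaningful axis labels and head-room for the brackets
boxplot(KRT19 ~ region, data = kidney_demo,
        ylim = c(min(kidney_demo$KRT19), y_top + step * (nrow(tuk) + 1)),
        xlab = "Kidney region", ylab = "KRT19 expression (TPM)",
        main = "KRT19 expression by kidney region")

# One bracket per pair, labelled with the Tukey-adjusted p-value
for (i in seq_len(nrow(tuk))) {
  pair <- match(strsplit(rownames(tuk)[i], "-", fixed = TRUE)[[1]], lv)
  y <- y_top + step * i
  segments(pair[1], y, pair[2], y)
  text(mean(pair), y, labels = format.pval(tuk[i, "p adj"], digits = 2),
       pos = 3, cex = 0.7)
}
```

The bracket loop splits Tukey row names on "-", so it assumes the region names themselves contain no hyphen.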
2c - Given what is known in the literature about the different regions of the kidney and the expression of KRT19 observed above in 2b, explain how a “loss of function” mutation of KRT19 may affect...