Module 4 case study – 25 marks
The data used for this case study have been simulated, but the problem considered here closely resembles a study conducted by researchers at Sydney University.
Subject matter background:
Steatohepatitis is a liver disease in which harmful fat accumulates in liver cells and can cause long-term damage and scarring of the liver. Steatohepatitis is measured with a liver biopsy that estimates the percentage of unhealthy liver cells. This study aims to identify predictors of steatohepatitis outcomes. One such predictor is the severity of hepatitis C infection, which can be measured in three ways: the amount of hepatitis C virus RNA in the blood; T-cell levels in the blood; and the number of cells in the blood infected with hepatitis C virus. All three measures are relatively unobtrusive as they require only a blood sample; however, there are significant laboratory costs associated with measuring the number of cells infected with hepatitis C virus.
Data have been collected on 238 patients with an active hepatitis C infection and are recorded in the dataset HCV.dta. The variables in this dataset are:
- idnum: Patient ID number
- unhealthy: Percentage of unhealthy liver cells estimated from a biopsy (outcome)
- bmi: Body mass index (kg/m²), weight divided by the square of height
- age: Age of the patient in years
- alcohol: Alcohol consumption (estimated standard drinks per week)
- diabetes: Presence of diabetes (0 – absent, 1 – present), binary indicator
- rna: Level of hepatitis C RNA in the blood (copies per mL)
- tcell: Level of hepatitis C specific T-cells in the blood (per million T-cells)
- infected: Number of cells infected with hepatitis C in the blood (per million cells)
Exercise:
Your task is to build a prediction model for the outcome, the percentage of unhealthy liver cells, in patients with hepatitis C infection. Your clinical investigator is seeking your advice on the best way to use the various hepatitis C measures and on other possible predictors of steatohepatitis. To this end, you should first follow the model building steps below (in this order):
1. Investigate the individual association between each variable and the percentage of unhealthy liver cells in order to identify which variables should be included in a multivariable model, whether any transformations are necessary, and whether there are any issues of non-linearity.
2. Create an initial multivariable regression model with the percentage of unhealthy liver cells as the outcome, including all possible predictors identified in step 1.
3. Investigate possible collinearity in this model and deal with it appropriately.
4. Refine the multivariable model as necessary to exclude terms not associated with the outcome.
5. Check the assumptions of your final model and make any adjustments as necessary. As part of the validation process, provide at least three plots that you consider the most useful.
Note that, for the purpose of this exercise, there is no need to investigate interactions.
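As an illustration of the collinearity check in step 3, one common diagnostic is the variance inflation factor (VIF): each predictor is regressed on all the others, and VIF = 1 / (1 - R²). The sketch below is a minimal standard-library Python version, not a prescribed method for this assignment. The helper names (solve, r_squared, vif) and the variables x1, x2 and x3 are hypothetical, and the data are randomly generated for the example; they are not taken from HCV.dta.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(y, predictors):
    """R^2 from a least squares fit of y on the predictors plus an intercept."""
    n = len(y)
    rows = [[1.0] + [p[i] for p in predictors] for i in range(n)]
    k = len(rows[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * v for bj, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif(data):
    """VIF for each predictor in a dict mapping name -> list of values."""
    result = {}
    for name, values in data.items():
        others = [v for other, v in data.items() if other != name]
        result[name] = 1.0 / (1.0 - r_squared(values, others))
    return result


# Made-up example data (NOT the HCV dataset): x2 is built to track x1 closely,
# while x3 is generated independently of both.
rng = random.Random(1)
n_obs = 238
x1 = [rng.gauss(0, 1) for _ in range(n_obs)]
x2 = [v + rng.gauss(0, 0.2) for v in x1]
x3 = [rng.gauss(0, 1) for _ in range(n_obs)]

vifs = vif({"x1": x1, "x2": x2, "x3": x3})
for name, value in vifs.items():
    print(f"{name}: VIF = {value:.2f}")
```

A frequently quoted rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity, although the threshold is a judgment call. Standard statistical packages report the same quantity after fitting a regression model, so a hand-rolled version like this is only needed to see what the diagnostic computes.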
Once you have completed this analysis, write a summary of your findings for the clinical collaborator that includes the following:
- A description and explanation of any issues that arose during the model building process
- A summary of the relevant findings, including P-values, interpretation of regression coefficients, confidence intervals, and an equation that could be used to predict the percentage of unhealthy liver cells in other patients (only for the final model)
- Some specific advice on which measure of hepatitis C infection (RNA, T-cells, or cells infected with hepatitis C) is most useful and why.
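As a reminder of the general form only, a linear regression fitted to the untransformed outcome yields a prediction equation and 95% confidence intervals for the coefficients as sketched below. The predictors x_1, ..., x_p and their number p are placeholders, not a statement about which variables belong in the final model; n denotes the number of observations used in the fit.

```latex
\widehat{\text{unhealthy}} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \dots + \hat{\beta}_p x_p,
\qquad
\hat{\beta}_j \pm t_{0.975,\; n-p-1} \, \operatorname{SE}(\hat{\beta}_j)
```

If the outcome or any predictor has been transformed during model building, the equation holds on the transformed scale, and the coefficients should be interpreted accordingly.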