PROG8430 – Data Analysis, Modeling and Algorithms
Assignment 3
Multivariate Linear Regression
DUE BEFORE MAR 28, 2021; 10PM
1. Submission Guidelines
All assignments must be submitted to the assignment folder on the eConestoga course website
before the due date.
You may make multiple submissions, but only the most current submission will be graded.
SUBMISSIONS
In the Assignment 3 Folder submit:
1. Your R Code
2. Your report in Word, following the template from our MLR lecture and in the Assignment
folder.
DO NOT PUT THE DOCUMENTS INTO A ZIP FILE!
All variables in your code must abide by the naming convention [variable_name]_[initials]. For
example, my variable for State would be State_DM.
You may only use base R (i.e. no additional packages may be used).
THIS IS AN INDIVIDUAL ASSIGNMENT. UNAUTHORIZED COLLABORATION IS AN ACADEMIC
OFFENSE. Please see the Conestoga College Academic Integrity Policy for details.
2. Grading
This assignment is worth 12.5% of your total grade in the course and you can expect it to take five
to eight hours. It is out of 25 marks overall.
Assignments submitted after 10pm on the due date will be reduced by 20%. Assignments received
after 8:00am the morning after the due date will receive a mark of 0%.
Assignments which do not follow the submission instructions may have marks deducted.
3. Data
Each student will be using the study dataset:
STUDY DATASET:
PROG8430_Assign_MLR.Rdata
Appendix one contains a data dictionary for the study file.
4. Background
A survey of 2120 residents of Canada was conducted to determine the key factors associated with
political engagement. A variety of variables were measured and recorded including some tests
they were asked to complete. Appendix 1 contains the data dictionary for the data set. One group
of respondents (“Treat”) were given additional education on political matters while the other
(“Control”) were not.
Your task will be to use multiple linear regression to determine the factors which contribute
to Political Awareness (variable: Pol).
All of the tasks have been completed using the examples presented in class. A careful review of
your notes from the lectures should give you everything you need to complete these tasks.
5. Assignment Tasks
N  Description  Marks

1  Data Transformation  [2 marks]
   As demonstrated in class, transform any variables that are required to
   conduct the regression analysis.

2  Reduce Dimensionality  [3 marks]
   1. Apply the Missing Value Filter to remove appropriate columns of data.
   2. Apply the Low Variance Filter to remove appropriate columns of data.
   3. Apply the High Correlation Filter to remove appropriate columns of data.

3  Outliers  [2 marks]
   1. Create boxplots of all relevant variables (i.e. non-binary) to determine
      outliers.
   2. Comment on any outliers you see and deal with them appropriately.

4  Exploratory Analysis  [2 marks]
   1. Correlations: Create both numeric and graphical correlations (as
      demonstrated in class) and comment on noteworthy correlations you
      observe. Are these surprising? Do they make sense?

5  Simple Linear Regression  [4 marks]
   1. Create a simple linear regression model using Pol as the dependent
      variable and Score as the independent. Create a scatter plot of the
      two variables and overlay the regression line.
   2. Create a simple linear regression model using Pol as the dependent
      variable and income as the independent. Create a scatter plot of the
      two variables and overlay the regression line.
   3. Compare the models. Which model is superior? Why?

6  Model Development  [4 marks]
   As demonstrated in class, create two models using two automatic
   variable selection techniques discussed in class (Full, Backward). For
   each model interpret and comment on the five main measures we
   discussed in class:
   1. F-Stat
   2. R-Squared value
   3. Residuals
   4. Significant variables
   5. Variable Co-Efficients

7  Model Evaluation – Verifying Assumptions  [4 marks]
   1. For all three models (as discussed and demonstrated in class) evaluate
      the main assumptions of regression: Error terms mean of zero,
      constant variance and normally distributed.

   Final Recommendation  [1 mark]
   1. Based on your preceding analysis, recommend which of the three
      models should be used.
   NOTE – Even if none of the models meet all the assumptions of
   regression, choose the best of the three. In subsequent classes we will
   learn how to deal with these issues.

   Professionalism, Clarity and Proper Citations  [3 marks]
APPENDIX ONE: STUDY FILE DATA
Variable Description
id UserID (unique to each respondent)
group Treatment or Control group
hs.grad Graduated High School (Y or N)
nation Nationality (Region)
gender M/F
age Age in Years
m.status Marital Status
political Political Affiliation
n.child Number of Children
income Annual Household Income
food Pct of Income to Food
housing Pct of Income to Housing
other Pct of Income to Other Expenses
score Score on Political Awareness Test
scr Standardized Score Test
time1 Pct of Time Taken on Test
time2 Time Taken on Section 1 (Standardized)
time3 Time Taken on Section 2 (Standardized)
Pol Measure of Political Involvement
PROG8430 – Data Analysis, Modeling and Algorithms
LECTURE 8 – REGRESSION ANALYSIS
Introduction to Simple Linear Regression (SLR)
From inference to prediction.
[Flowchart: identifying the type of data analysis question]
• Did you summarize the data? NO: STOP! Not a data analysis. YES: continue.
• Did you report the summaries without interpretation? YES: Descriptive. NO: continue.
• Did you quantify whether your discoveries will hold in a new sample? NO: Exploratory. YES: continue.
• Are you trying to determine how changing the average of one measurement affects another?
  • NO: Are you trying to predict measurements for individuals? NO: Inferential. YES: Predictive.
  • YES: Is the effect you are seeking average or deterministic? Average: Causal. Deterministic: Mechanistic.
(The slide marks part of this chart as the FOCUS FOR THIS LECTURE.)
Prediction vs. Explanation vs. Anomaly Detection
Prediction
• Predictive modeling is the process of applying a statistical model or data mining algorithm to
data for the purpose of predicting new or future observations (e.g. the output value (Y) for
new observations given their input values (X)).
Explanation
• Causal or non-causal explanation and explanatory modeling is the use of statistical models for
testing causal explanations.
Anomaly Detection
• Identifies unusual or atypical patterns (outliers). E.g.
• Fraud detection in various operating environments
• Intrusion detection (unusual patterns in network traffic – potential hack?)
• Identifying tumors in health imaging (e.g. MRI scans)
Adapted from Shmueli, Galit, "To Explain or Predict?", Statistical Science, 2010, Vol. 25, No. 3
Simple Linear Regression
Models the relationship between the magnitude of one variable and another.
Correlation
• Measures the strength of the relationship.
• Y and X are interchangeable.
Regression
• Quantifies the relationship.
• Y is predicted using the value of X.
$Y = \beta_0 + \beta_1 X + \varepsilon$
Here $\beta_0$ is the intercept, $\beta_1$ is the slope and $\varepsilon$ is the error term.
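The contrast between correlation and regression can be sketched in base R. This is a minimal illustration on simulated data, not on the lecture datasets: the variables `x` and `y`, the seed, and the "true" intercept and slope are all assumptions chosen for the example.

```r
# Simulated data (hypothetical values, not from the lecture datasets)
set.seed(8430)
x <- runif(50, min = 20, max = 70)
y <- 100 + 0.9 * x + rnorm(50, sd = 5)  # assumed intercept 100, slope 0.9

# Correlation measures strength and is symmetric in its arguments
r_xy <- cor(x, y)
r_yx <- cor(y, x)

# Regression predicts Y from X, so the direction matters
fit_yx <- lm(y ~ x)  # Y predicted using X
fit_xy <- lm(x ~ y)  # X predicted using Y: a different line

coef(fit_yx)  # estimated intercept and slope for Y on X
coef(fit_xy)  # estimated intercept and slope for X on Y
```

Swapping the arguments leaves `cor()` unchanged, while the two `lm()` fits give different coefficients. For simple linear regression the product of the two slopes equals the squared correlation, which ties the two measures together.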
Examples in R
# Read "comma separated value" (".csv") files
# Systolic Blood Pressure Dataset
Systolic <- read.csv("C:/Users/David/Documents/Data/Systolic.csv",
                     header = TRUE, sep = ",")
# Thunder Basin Dataset
Thunder <- read.csv("C:/Users/David/Documents/Data/ThunderBasin.csv",
                    header = TRUE, sep = ",")
• PROG8430-SLR_Demo.R is posted on the course website and we will be using it for this lecture.
• Also, download the ThunderBasin1.csv and Systolic1.csv files.
Systolic Blood Pressure Data
The data (X1, X2, X3) are for each patient.
X1 = systolic blood pressure
X2 = age in years
X3 = weight in pounds
Thunder Basin Antelope Study
The data (X1, X2, X3, X4) are for each year.
X1 = spring fawn count/100
X2 = size of adult antelope population/100
X3 = annual precipitation (inches)
X4 = winter severity index (1 = mild, 5 = severe)
Rename variables to make them more convenient
# Rename variables to something meaningful
names(Systolic) <- c("BP", "Age", "Wgt")
str(Systolic)
'data.frame': 11 obs. of 3 variables:
$ BP : int XXXXXXXXXX XXXXXXXXXX128 ...
$ Age: int XXXXXXXXXX XXXXXXXXXX ...
$ Wgt: int XXXXXXXXXX XXXXXXXXXX167 ...
names(Thunder) <- c("Fwn", "Adt", "Prc", "Sev")
str(Thunder)
'data.frame': 8 obs. of 4 variables:
$ Fwn: num XXXXXXXXXX ...
$ Adt: num XXXXXXXXXX6 ...
$ Prc: num XXXXXXXXXX3 12.6 ...
$ Sev: int XXXXXXXXXX
NOTE – These are very small datasets used simply for demonstration purposes!
Two Research Questions for Each Dataset
1.1 Is there a relationship between age and blood pressure?
1.2 Can we quantify it?
2.1 Is there a relationship between spring fawn count and adult population?
2.2 Can we quantify it?
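One way to approach this pair of questions in base R is sketched below: a correlation test for "is there a relationship?" and a fitted line for "can we quantify it?". The data frame `demo` holds hypothetical illustrative values, not the lecture's Systolic data, and the choice of `cor.test()` is an assumption about how the first question might be answered.

```r
# Hypothetical illustrative values, NOT the lecture's Systolic data
demo <- data.frame(
  Age = c(35, 42, 48, 51, 56, 60, 63, 67, 70, 74),
  BP  = c(118, 121, 127, 125, 133, 136, 140, 139, 147, 151)
)

# Is there a relationship? Test H0: the correlation is zero
rel <- cor.test(demo$Age, demo$BP)
rel$estimate  # sample correlation r
rel$p.value   # a small p-value suggests a linear relationship

# Can we quantify it? Fit BP = b0 + b1 * Age
fit <- lm(BP ~ Age, data = demo)
coef(fit)                # intercept and slope (change in BP per year of age)
summary(fit)$r.squared   # share of the variance in BP explained by Age
```

The same two steps would apply to fawn count and adult population by swapping in the `Fwn` and `Adt` columns.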
Examine Summary Statistics
• As always, let’s look at statistical measures as well as graphical representations.
# stat.desc() comes from the pastecs package, not base R
library(pastecs)
TdrSum <- stat.desc(Thunder)
format(TdrSum, digits = 2)
Fwn Adt Prc Sev
nbr.val XXXXXXXXXX
nbr.null XXXXXXXXXX
nbr.na XXXXXXXXXX
min XXXXXXXXXX00
max XXXXXXXXXX00
range XXXXXXXXXX
sum XXXXXXXXXX23.00
median XXXXXXXXXX00
mean XXXXXXXXXX88
SE.mean XXXXXXXXXX
CI.mean.
XXXXXXXXXX1.04
var XXXXXXXXXX
std.dev XXXXXXXXXX
coef.var XXXXXXXXXX
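Because `stat.desc()` belongs to the pastecs package and Assignment 3 permits only base R, a comparable table can be assembled from base functions. The following is a minimal sketch under that assumption: the helper `describe()` and the data frame `demo` are hypothetical names, the values are illustrative rather than the Thunder Basin data, and only a subset of the `stat.desc()` rows is reproduced.

```r
# Hypothetical illustrative values, NOT the Thunder Basin data
demo <- data.frame(
  Fwn = c(2.1, 2.4, 2.0, 2.9, 3.1, 2.6),
  Adt = c(8.5, 9.0, 8.1, 9.8, 10.2, 9.3)
)

# Hypothetical helper: a subset of the statistics stat.desc() reports
describe <- function(x) {
  c(nbr.val = sum(!is.na(x)),
    nbr.na  = sum(is.na(x)),
    min     = min(x, na.rm = TRUE),
    max     = max(x, na.rm = TRUE),
    median  = median(x, na.rm = TRUE),
    mean    = mean(x, na.rm = TRUE),
    var     = var(x, na.rm = TRUE),
    std.dev = sd(x, na.rm = TRUE))
}

# One column of statistics per variable, as in the stat.desc() layout
demo_sum <- sapply(demo, describe)
format(demo_sum, digits = 2)
```

`sapply()` returns a matrix with one row per statistic and one column per variable, so the result can be read the same way as the output above.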
SysSum <- stat.desc(Systolic)
format(SysSum, digits = 2)
BP Age Wgt
nbr.val XXXXXXXXXX
n