1
University at Buffalo, Industrial and Systems Engineering
IE322 Analytics and Computing for Industrial Engineers
Lab#2 Fall 2022
Data Preprocessing and Linear Regression
(This is an individual lab)
XXXXXXXXXXName: ________________
Due 23:59 October 23th, 2022
Description:
The dataset for this lab is SalesData.csv (the same one as used in Lab 1), and it is available on
UBlearns. The dataset is used to record the sales information of a company.
Requirements:
Draft a report to document your R code and results (or partial results if there are too many) in
each step. Note that your report will be graded on both technical content (70%) and report
quality (30%). Submit two files to UBLearns: 1) your report, and 2) your R script.
1. Data preprocessing (30 pts)
Read the SalesData.csv data into R console as D0.
(a) Identify the variable(s) that have missing data, and determine the number of missing
values for each variable identified in your report.
(b) Install R package “imputeTS” into your R environment, and import it using
li
ary(imputeTS). Note: this package is used for data imputation.
(c) Perform data imputation by following methods:
i. Using D0, delete all rows with missing data, and name the new dataset as S1.
ii. Using D0, replace the missing values with the mean value of each numerical
variable, and name the new dataset as S2. (Hint: you may use na_mean() in
“imputeTS” package, or any other method for this data imputation).
(d) Calculate the co
elation between “Profit” and “Sales” for both S1 and S2. To compare
how those two co
elations are affected by different data imputation methods. Document
your findings in the report.
(e) Use any method to identify any outliers in “Shipping.Cost” column. Count how many
outliers identified. Document your findings in the report.
(f) Delete those outliers you identified in 2(e), and name the new dataset as S3.
Abdullah Fahad
Abdullah Fahad
Abdullah Fahad
Abdullah Fahad
2
2. EDA (20 pts)
Using D0 for the following tasks.
(a) Draw the histogram for the “Shipping.Cost” column, and change the y-axis as frequency.
Then, add a density curve on it. Present the plot into your report.
(b) Draw a single scatter plot to show the relationship between “Profit” (as in y-axis) and
“Sales” (as in x-axis). Based on this plot:
a. add a color of your choice for each “Ship.Mode” (i.e., Delivery Truck, Express
Air, Regular Air). Make sure you choose different colors;
. add a legend. Present this plot into your report.
(c) Draw a scatter plot in a matrix form (see slides Lecture 5, page 16) for "Order.Quantity",
"Profit", and "Shipping.Cost". Present this plot into your plot.
3. Linear regression (50 pts)
For the following tasks, use the “Profit” as dependent variable, and four independent variables:
“Order.Quantity”, “Sales”, “Unit.Price”, “Shipping.Cost”.
(a) Build four linear regression models (name as m0, m1, m2, m3) on dataset D0, S1, S2, and
S3 respectively, using lm() function.
(b) Fill out the following table in your report using summary().
Models Dataset & description
Model performance
RMSE R2
m0 D0: raw data
m1 S1: excluded NAs
m2 S2: imputed Nas with mean
m3 S3: excluded outliers
Your proposed
model in (e)
(c) Compare the model performance among m0, m1 and m2, and discuss how the data
imputation methods affects the model performance.
(d) Compare the model performance between m0 and m3, and discuss how the exclusion of
outliers affects the model performance.
(e) Propose another way to further improve the model performance in your report. Justify
your method by building such a linear regression model. You may consider add
additional variables (e.g., Store, company), or interaction/polynomial terms, based on the
dataset S3. Fill out the table in (b) with your model results.