CEIE 474/574 | Assignment 6
CEIE 474/574 Construction Computer Application and Informatics
Assignment 6 – Data Mining
Data mining is the process of discovering patterns in large data sets involving methods at the intersection
of machine learning, statistics, and database systems. Generally, data mining tasks can be categorized into
two groups, namely predictive analysis and descriptive analysis, also known as supervised learning and
unsupervised learning. In this assignment, one predictive problem and one descriptive problem are
provided to help you get insights into data mining applications in the construction industry. For the
predictive problem, the regression approach is used to predict construction equipment maintenance cost.
For the descriptive problem, the clustering approach is used to enhance product quality analysis and
product quality management.
1 Problem 1 – Regression
1.1 Background
Accurate cost prediction benefits construction professionals in two ways: it helps secure reasonable
profits, and it helps ensure that projects are delivered within budget. Hour meters log the running time
of equipment so that expensive machines and systems receive proper maintenance. This maintenance
typically involves replacing, changing, or checking parts, belts, filters, oil, lubrication, or running
condition in engines, motors, blowers, and fans, to name a few.
Equipment maintenance cost prediction is one significant element of the prediction of overall
construction cost. In this assignment, you will practice preprocessing data, training linear regression
models, selecting the model with good prediction performance through K-fold cross-validation, and
predicting equipment maintenance cost.
1.2 Data Description
A “.xlsx” file is provided for this assignment. Each row in the file represents a specific equipment
maintenance occurrence. Detailed explanations of variables are listed in Table 1:
Table 1: Variables in the maintenance cost file (maintenance cost.xlsx)
Variable name        Variable description
Unit ID              Equipment unit ID
Hour Meter Reading   Reading of hour meter
Labor Cost           Labor cost for this specific maintenance
Parts Cost           Parts cost for this specific maintenance
Total Cost           Sum of labor cost and parts cost for this specific maintenance
1.3 Maintenance Cost Prediction Steps
Step 1:
Real-world data is often incomplete, inconsistent, and likely to contain errors. Data preprocessing
addresses these issues by transforming raw data into a clean and tidy format. In this step, you will
preprocess the data to define the independent variable (input) and the dependent variable (output),
which will then be used to train the maintenance cost prediction model.
1) Delete rows whose “Total Cost” is zero.
2) For each equipment unit, sort the data in ascending order of “Hour Meter Reading.”
3) For each equipment unit, add a new column called “Cumulative Cost” by accumulating the “Total
Cost” based on the sorted data.
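The three preprocessing tasks above can be sketched as follows. This is an illustration in Python with pandas on invented rows, not the required solution: the graded script is expected in R, and the helper name add_cumulative_cost and the toy values are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for "maintenance cost.xlsx"; the values are invented for illustration.
records = pd.DataFrame({
    "Unit ID": ["A", "A", "A", "B", "B", "B"],
    "Hour Meter Reading": [300, 100, 200, 50, 150, 100],
    "Total Cost": [40.0, 10.0, 0.0, 5.0, 20.0, 15.0],
})

def add_cumulative_cost(df):
    """Apply the three preprocessing tasks and return a new DataFrame."""
    # 1) Drop maintenance occurrences whose total cost is zero.
    out = df[df["Total Cost"] != 0].copy()
    # 2) Sort by hour meter reading within each equipment unit.
    out = out.sort_values(["Unit ID", "Hour Meter Reading"]).reset_index(drop=True)
    # 3) Running sum of total cost within each unit, following the sorted order.
    out["Cumulative Cost"] = out.groupby("Unit ID")["Total Cost"].cumsum()
    return out

prepared = add_cumulative_cost(records)
print(prepared)
```

Sorting before accumulating matters: the running sum is only meaningful when the rows of each unit are ordered by hour meter reading.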
Questions:
(a) For each equipment unit, visualize the relationships between “Hour Meter Reading” and
“Cumulative Cost”.
(b) Interpret these relationships. Discuss the usage and maintenance cost of these equipment units.
Step 2:
Different equipment units are generally in different operating phases, so it is difficult to make a
comprehensive prediction when the units are treated separately. In this step, all equipment data is
combined to enrich the dataset, under the assumption that all equipment units share the same
characteristics.
1) Combine all equipment “Hour Meter Reading” and “Cumulative Cost” into one dataset, named
“combined dataset”.
2) Sort the “combined dataset” in ascending order of the column “Hour Meter Reading.”
Questions:
(a) Visualize the relationship between “Hour Meter Reading” and “Cumulative Cost” in the
“combined dataset”.
(b) Compare the relationship in “combined dataset” to the relationships in step 1 (b).
Step 3:
In statistics, linear regression is the simplest approach to model the relationship between a dependent
variable (output) and one or more independent variables (inputs). In this maintenance cost prediction
task, linear and quadratic relationships between the “Hour Meter Reading” and “Cumulative Cost” are
compared. The linear relationship model can be expressed as:
$Y = \beta_0 + \beta_1 X$
The quadratic relationship model can be expressed as:
$Y = \beta_0 + \beta_1 X + \beta_2 X^2$
where X represents “Hour Meter Reading” and Y represents “Cumulative Cost”.
In this step, you will use K-fold cross-validation to select the model that generalizes better. Here, K is
set to 5.
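A minimal sketch of this comparison is given below, assuming Python with NumPy and synthetic data; the graded script is expected in R, so treat it only as an outline of the logic. The function name kfold_mse, the random seeds, and the use of held-out MSE as the selection metric are assumptions made for the example.

```python
import numpy as np

def kfold_mse(x, y, degree, k=5, seed=0):
    """Mean held-out MSE of a degree-`degree` polynomial fit over k folds."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    # Shuffle the row indices once, then cut them into k nearly equal folds.
    folds = np.array_split(rng.permutation(len(x)), k)
    fold_mse = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the other k-1 folds, then score on the held-out fold.
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        pred = np.polyval(coeffs, x[test_idx])
        fold_mse.append(np.mean((y[test_idx] - pred) ** 2))
    return float(np.mean(fold_mse))

# Synthetic data with a quadratic trend (not the assignment data).
rng = np.random.default_rng(1)
hours = np.linspace(0.0, 10.0, 50)  # toy scale, e.g. thousands of hours
cost = 2.0 + 1.5 * hours + 0.8 * hours**2 + rng.normal(0.0, 1.0, hours.size)

mse_linear = kfold_mse(hours, cost, degree=1)
mse_quadratic = kfold_mse(hours, cost, degree=2)
print("linear:", mse_linear, "quadratic:", mse_quadratic)
```

Both models are scored on the same folds because the same seed is used, so the comparison reflects the model form rather than a lucky split. The model with the lower average held-out error is the one expected to generalize better.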
Questions:
(a) Select a model with a better generalization ability based on the K-fold cross-validation approach.
(b) List the selection metrics used for both models.
Step 4:
You have now selected the model with the better generalization ability, which characterizes the
relationship between “Hour Meter Reading” and “Cumulative Cost” more accurately. However, each round
of the previous cross-validation process trained on only part of the data (4 of the 5 folds), so the
model parameters obtained are not fully optimized. In this step, the model selected in Step 3(a) will be
trained on the entire “combined dataset,” and its predictive performance will be evaluated
quantitatively using R² and MSE.
1) Train the selected model with the whole “combined dataset.”
2) Quantitatively evaluate the model performance using R² and MSE.
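The two metrics can be computed directly from their definitions, as in the sketch below. It assumes Python with NumPy and uses invented numbers; the graded script is expected in R, and the function name mse_and_r2 is hypothetical.

```python
import numpy as np

def mse_and_r2(y_true, y_pred):
    """Return (MSE, R^2) for observed values and model predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual_ss = np.sum((y_true - y_pred) ** 2)       # unexplained variation
    total_ss = np.sum((y_true - y_true.mean()) ** 2)   # variation around the mean
    mse = residual_ss / y_true.size
    r2 = 1.0 - residual_ss / total_ss
    return float(mse), float(r2)

# Toy data that is exactly quadratic, fitted with a quadratic model on all rows.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 4.0, 9.0, 16.0, 25.0])
coeffs = np.polyfit(x, y, 2)
mse, r2 = mse_and_r2(y, np.polyval(coeffs, x))
print("MSE:", mse, "R^2:", r2)
```

MSE is expressed in squared cost units, so it depends on the scale of “Cumulative Cost”, while R² is unitless. Because both are computed here on the training data, they describe goodness of fit rather than out-of-sample error.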
Questions:
(a) Calculate and list the values of R² and Mean Squared Error (MSE) between predictions and real
observations of “Cumulative Cost.”
(b) Evaluate the model’s predictability using two metrics, namely R² and MSE.
Step 5:
Once the equipment maintenance cost prediction model is built, practitioners can get a better
understanding and control of future equipment maintenance cost. In this step, you will predict the
cumulative cost for specific hour meter readings.
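Prediction then amounts to evaluating the fitted polynomial at the readings of interest. The sketch below assumes Python with NumPy, an exactly linear toy dataset, and a degree-1 fit chosen purely for illustration; the graded script is expected in R, and the degree should be whichever model was selected in Step 3.

```python
import numpy as np

# Invented readings and cumulative costs with an exactly linear trend.
hours = np.array([1000.0, 2000.0, 3000.0, 5000.0, 6000.0])
cumulative_cost = 2.0 * hours + 500.0

# Fit the chosen model form on all rows (degree 1 here only as an example).
coeffs = np.polyfit(hours, cumulative_cost, 1)

# Evaluate the fitted polynomial at the two requested hour meter readings.
targets = np.array([4000.0, 8000.0])
predicted = np.polyval(coeffs, targets)
for reading, value in zip(targets, predicted):
    print(f"Hour Meter Reading {reading:.0f}: predicted Cumulative Cost {value:.2f}")
```

In this toy data, 8000 lies beyond the largest observed reading, so that prediction is an extrapolation. It is worth checking whether the same holds for the assignment data before interpreting the result.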
Questions:
(a) Predict the “Cumulative Cost” when “Hour Meter Reading” is 4000 and 8000, respectively.
Bonus questions:
(b) Calculate the 95% confidence intervals for predicted “Cumulative Cost” given 4000 and 8000 for
“Hour Meter Reading.”
(c) Discuss the differences between the obtained confidence intervals and state which prediction is more reliable.
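For reference, the standard ordinary least squares interval formulas are sketched below. The notation is an assumption made for this note and does not come from the handout: $\mathbf{D}$ is the design matrix of the fitted model, $\mathbf{d}_0$ is the corresponding row for a new reading $X_0$ (for example $(1, X_0, X_0^2)$ for the quadratic model), $n$ is the number of observations, and $p$ is the number of coefficients.

```latex
% Point prediction and residual variance estimate
\hat{Y}_0 = \mathbf{d}_0^{\top} \hat{\boldsymbol{\beta}}, \qquad
s^2 = \frac{1}{n - p} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2

% 95% confidence interval for the mean response at X_0
\hat{Y}_0 \pm t_{0.975,\, n-p} \; s \,
\sqrt{ \mathbf{d}_0^{\top} \left( \mathbf{D}^{\top} \mathbf{D} \right)^{-1} \mathbf{d}_0 }

% 95% prediction interval for a single new observation at X_0
\hat{Y}_0 \pm t_{0.975,\, n-p} \; s \,
\sqrt{ 1 + \mathbf{d}_0^{\top} \left( \mathbf{D}^{\top} \mathbf{D} \right)^{-1} \mathbf{d}_0 }
```

The term under the square root grows as $X_0$ moves away from the bulk of the observed readings, which is why the interval widths at 4000 and 8000 generally differ. In R, predict() on an lm fit accepts interval = "confidence" or interval = "prediction" with level = 0.95.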
1.4 Marking Scheme
Question | Mark | Report | R Script
Step 1 | 10 | Reasonable visualization of the relationships between “Cumulative Cost” and “Hour Meter Reading” (2); reasonable interpretation and explanation (3) | Correct presentation using R script (5)
Step 2 | 10 | Reasonable visualization of the relationship between “Cumulative Cost” and “Hour Meter Reading” in the “combined dataset” (2); reasonable comparison (3) | Correct presentation using R script (5)
Step 3 | 15 | Correct model selection (5); correct metrics (5) | Correct presentation using R script (5)
Step 4 | 15 | Correct MSE and R² (5); a reasonable interpretation of model predictability (5) | Correct presentation using R script (5)
Step 5 | 10 | Correct predictions of “Cumulative Cost” when “Hour Meter Reading” is 4000 and 8000 (5) | Correct presentation using R script (5)
Bonus | 10 | Correct 95% confidence intervals (4); a reasonable explanation of prediction differences (4) | Correct presentation using R script (2)
Total | 60+10 | |
2 Problem 2 – Clustering
2.1 Background
In data mining, cluster analysis (clustering) is the process of partitioning a set of objects so that
objects in the same cluster are more similar to one another than to objects in other clusters. An
advantage of clustering is that it can automatically reveal previously unknown groups within data.
Product quality performance clustering therefore groups products with similar quality performance into
one cluster, which can improve product quality analysis and product quality management, especially
when a vast number of product types is involved. In this assignment, you will practice clustering
products based on their quality performance, selecting the best cluster number, and visualizing the
clustering result.
2.2 Data Description
A “.CSV” file is provided for this assignment, and explanations of the variables are listed in Table 2.
The quality performance is defined by the repair rate (q), whose distribution is a beta distribution
parameterized by alpha (α) and beta (β):
$f(q) = \mathrm{Beta}(\alpha, \beta)$
Table 2: Variables in the product quality performance file (product_quality.csv)
Variable name   Variable description
Weld type ID    Weld type ID
Weld type       Weld type, which is composed of the pipe size, schedule, and material
alpha           The first shape parameter of the beta distribution
beta            The second shape parameter of the beta distribution
2.3 Product Quality Performance Clustering Steps
Step 1:
In data mining, an object is typically represented by multiple features, which are used to train the model
to mine hidden patterns; for example, a point in 3D space can be represented by its coordinates (x, y, z).
The median is the value separating the higher half from the lower half of a population or probability
distribution. In this step, the median of each distribution is selected as the only feature to represent
the product quality performance.
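The median of a beta distribution has no simple closed form in general, so it is usually obtained numerically as the 0.5 quantile. The sketch below assumes Python with SciPy and invented shape parameters; the graded script is expected in R, where qbeta(0.5, alpha, beta) plays the same role. The helper name beta_median and the weld type labels are hypothetical.

```python
from scipy.stats import beta

def beta_median(alpha, beta_param):
    """Median of Beta(alpha, beta_param): the q at which the CDF equals 0.5."""
    return float(beta.ppf(0.5, alpha, beta_param))

# Hypothetical shape parameters, not taken from product_quality.csv.
weld_types = {"W1": (2.0, 2.0), "W2": (1.0, 2.0), "W3": (1.0, 1.0)}

# One median per weld type; this single number is the clustering feature.
medians = {name: beta_median(a, b) for name, (a, b) in weld_types.items()}
for name, value in medians.items():
    print(name, round(value, 4))
```

A symmetric distribution such as Beta(2, 2) has median 0.5, which gives a quick sanity check on the computation.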
Questions:
(a) Calculate and list the median values of quality performance for all product types.
Step 2:
For the K-means clustering algorithm, determining the hyperparameter K is a common problem. The
correct choice of K is often ambiguous, with interpretations depending on the shape and scale of the
distribution of points in a dataset and on the clustering resolution desired by the user. In this step,
you are asked to select the best cluster number using the elbow method and then use it to perform
K-means clustering.
The error is defined as the distance between an object and the mean of the cluster to which it belongs.
In this step, the sum of squared errors (SSE) is used as the objective function value.
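The SSE curve used by the elbow method can be sketched as follows, assuming Python with NumPy, a single feature per product, and a hand-written Lloyd's algorithm with a deterministic start. The function name kmeans_sse_1d, the initialization scheme, and the toy medians are assumptions made for the example; the graded script is expected in R, where kmeans() reports the same quantity as tot.withinss.

```python
import numpy as np

def kmeans_sse_1d(values, k, n_iter=100):
    """Run Lloyd's algorithm on 1-D data and return the final SSE."""
    values = np.sort(np.asarray(values, dtype=float))
    # Deterministic start: pick k data points evenly spaced by rank.
    start = np.linspace(0, len(values) - 1, k).round().astype(int)
    centers = values[start]
    for _ in range(n_iter):
        # Assign each value to its nearest center.
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        # Move each center to the mean of its members (keep it if it has none).
        new_centers = np.array([
            values[labels == j].mean() if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
    # SSE: squared distance from each value to the mean of its own cluster.
    return float(np.sum((values - centers[labels]) ** 2))

# Invented medians forming three visible groups (not the assignment data).
medians = np.array([0.02, 0.03, 0.04, 0.20, 0.21, 0.22, 0.50, 0.51, 0.52])
sse_by_k = {k: kmeans_sse_1d(medians, k) for k in range(1, 6)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 5))
```

On this toy data the SSE falls sharply up to K = 3 and only marginally afterwards, so the elbow sits at K = 3. With real data the bend is often less distinct, and K-means results can also vary with the initial centers.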
Questions:
(a) Visualize the relationship between objective function value (SSE) and cluster number K.
(b) Select the best cluster number based on the elbow method.
Step 3:
A good clustering of product quality performance can group products that have similar quality
performance into one cluster