University at Buffalo, Industrial and Systems Engineering
IE322 Analytics and Computing for Industrial Engineers
Lab#3 Fall 2022
Machine Learning Practices
(This is an individual lab)
Due 23:59 November 13th, 2022
Description:
The dataset for this lab is tuition.csv, and it is available on UBlearns. The dataset has information
about school tuition. The description of each variable is displayed in Table 1.
Requirements:
Draft a report that documents your R code and results (or partial results if there are too many)
for each step. Note that your report will be graded on both technical content (70%) and report
quality (30%). Submit two files to UBlearns: 1) your report, and 2) your R script.
Table 1
VARIABLE | DESCRIPTION | DATA TYPE
tuition | College tuition ("out-of-state" rate). | continuous
pcttop25 | Percent of new students from the top 25% of high school class. | continuous
sf_ratio | Student to faculty ratio. | continuous
fac_comp | Average faculty compensation. | continuous
accrate | Fraction of applicants accepted for admission. | continuous
graduat | Percent of students who graduate. | continuous
pct_phd | Percent of faculty with Ph.D.'s. | continuous
fulltime | Percent of undergraduates who are full time students. | continuous
alumni | Percent of alumni who donate. | continuous
num_enrl | Number of new students enrolled. | continuous
public.private | Is the college a public or private institution? public=0, private=1 | discrete
Abdullah Fahad
1. Basic plotting (20 pts)
Read the tuition.csv data into R as D0. Use D0 for the following questions.
a) Change the data type of “public.private” into a factor.
b) Use ggplot to draw a scatter plot with “num_enrl” on the x-axis and “fac_comp” on the y-axis,
where each data point is distinguished by “public.private”.
c) Based on b), add linear regression lines for public institutions and private institutions. Copy
and paste the final plot to your report.
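As a rough illustration of steps a) to c), the sketch below builds the plot in stages. It assumes the ggplot2 package is installed. The data frame here is made up so that the example runs on its own; in the lab, D0 would come from tuition.csv (for example via read.csv), and the object name p is only illustrative.

```r
library(ggplot2)

# Made-up stand-in for tuition.csv so the sketch is self-contained.
# In the lab, D0 would be read from the real file instead.
set.seed(1)
D0 <- data.frame(
  num_enrl = round(runif(40, 100, 5000)),
  public.private = rep(c(0, 1), each = 20)
)
D0$fac_comp <- 40000 + 5 * D0$num_enrl + rnorm(40, sd = 5000)

# a) Convert the 0/1 indicator into a factor
D0$public.private <- factor(D0$public.private)

# b) Scatter plot, with color distinguishing public (0) from private (1)
p <- ggplot(D0, aes(x = num_enrl, y = fac_comp, color = public.private)) +
  geom_point()

# c) One fitted line per group; the color aesthetic defines the groups
p <- p + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
print(p)
```

Because the color aesthetic is set in the top-level aes(), geom_smooth() inherits it and fits a separate line for each factor level. Mapping shape instead of color would be an equally reasonable way to distinguish the two groups.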
2. Feature selection (30 pts)
Use D0 for the following questions.
a) Build a full linear regression model named full_model, where “tuition” is the dependent
variable and all remaining variables are independent variables. Include the summary of this
full model in your report.
b) Based on the full model, perform forward feature selection to select the top 3 key features.
This selection is based on the p-value of inclusion (i.e., penter). Include the results in your
report.
c) Based on the full model, perform backward feature selection to select the top 3 key features.
This selection is based on the p-value of exclusion (i.e., prem). Include the results in your
report.
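One possible way to approach steps a) to c) is sketched below using only base R (lm, add1, drop1). The names penter and prem resemble arguments used by some versions of the olsrr package, but argument names can differ between versions, so this sketch avoids that dependency. It rests on several assumptions: the data are made up and contain only a few of the Table 1 variables, the stopping rule "stop at exactly 3 terms" is one interpretation of "top 3", and the names fwd, bwd, and n_terms are illustrative. With the real data, rows with missing values would likely need to be removed first, since add1 and drop1 require every candidate model to use the same rows.

```r
# Made-up stand-in for tuition.csv (no missing values) so the sketch runs alone.
set.seed(2)
n <- 200
D0 <- data.frame(
  pcttop25 = runif(n, 20, 100),
  sf_ratio = runif(n, 5, 25),
  fac_comp = rnorm(n, 60000, 8000),
  graduat  = runif(n, 30, 95),
  alumni   = runif(n, 5, 50)
)
D0$tuition <- 2000 + 60 * D0$pcttop25 - 150 * D0$sf_ratio +
  0.1 * D0$fac_comp + rnorm(n, sd = 1500)

# a) Full model: tuition against every other column
full_model <- lm(tuition ~ ., data = D0)
print(summary(full_model))

# Helper (hypothetical name): number of predictors currently in a model
n_terms <- function(m) length(attr(terms(m), "term.labels"))

# b) Forward selection: start from the intercept-only model and, at each
#    step, add the candidate with the smallest partial F-test p-value.
fwd <- lm(tuition ~ 1, data = D0)
while (n_terms(fwd) < 3) {
  cand <- add1(fwd, scope = formula(full_model), test = "F")
  cand <- cand[rownames(cand) != "<none>", ]
  best <- rownames(cand)[which.min(cand[["Pr(>F)"]])]
  fwd  <- update(fwd, as.formula(paste(". ~ . +", best)))
}

# c) Backward selection: start from the full model and, at each step,
#    drop the term with the largest partial F-test p-value.
bwd <- full_model
while (n_terms(bwd) > 3) {
  cand  <- drop1(bwd, test = "F")
  cand  <- cand[rownames(cand) != "<none>", ]
  worst <- rownames(cand)[which.max(cand[["Pr(>F)"]])]
  bwd   <- update(bwd, as.formula(paste(". ~ . -", worst)))
}

print(attr(terms(fwd), "term.labels"))
print(attr(terms(bwd), "term.labels"))
```

A threshold-based variant would instead stop the forward loop once the smallest candidate p-value exceeds penter, and stop the backward loop once the largest remaining p-value falls below prem. The two procedures need not select the same three features.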
3. KNN (50 pts)
Use D0 to create a subset named D1 that includes only three features: “accrate”,
“graduat”, and “public.private”. Then delete all rows with missing values from D1 and
overwrite D1. Hint: D1 <- na.omit(D1).
Of the three features in D1, “accrate” and “graduat” are the independent variables, and
“public.private” is the target variable. Use D1 for the following questions.
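The subsetting and missing-value step can be sketched as follows. The small data frame is made up purely to show the effect of na.omit; the real D0 comes from tuition.csv.

```r
# Made-up stand-in for D0 with a few missing values.
D0 <- data.frame(
  tuition        = c(9000, 12000, 15000, 7000, 20000),
  accrate        = c(0.80, NA, 0.55, 0.90, 0.30),
  graduat        = c(60, 75, NA, 50, 92),
  public.private = c(0, 1, 1, 0, 1)
)

# Keep only the three columns of interest
D1 <- D0[, c("accrate", "graduat", "public.private")]

# Drop every row that contains at least one NA, overwriting D1
D1 <- na.omit(D1)
print(D1)
```

Note that na.omit removes a row if any of its columns is missing, so subsetting to the three columns first avoids discarding rows that are only missing values in unrelated variables.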
a) Use min-max normalization to normalize the two independent variables, “accrate” and “graduat”.
This step eliminates the effect of different value ranges on the model.
b) Set the seed number as XXXXXXXXXX. Hint: set.seed XXXXXXXXXX. This step makes sure that you
will get the same model results every time you run the code.
c) Split D1 into a training set with 70% of the data and a test set with the remaining 30% of
the data.
d) Build a KNN model using the training set, and evaluate the model performance using the test
set. Include the confusion matrix in your report.
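Steps a) to d) could be put together roughly as in the sketch below. It uses knn() from the class package, which is one common choice and not necessarily the one intended by the lab. Several things here are assumptions: the data are made up, the seed 123 is an arbitrary placeholder and not the value the lab specifies, k = 5 is chosen only for illustration, and the names min_max, train_idx, features, pred, and conf_mat are illustrative.

```r
library(class)  # provides knn()

# Made-up stand-in for the cleaned D1 so the sketch runs on its own.
set.seed(1)
n <- 100
D1 <- data.frame(
  accrate = runif(n, 0.2, 1.0),
  graduat = runif(n, 30, 95)
)
D1$public.private <- factor(ifelse(D1$graduat + rnorm(n, sd = 10) > 62, 1, 0))

# a) Min-max normalization: (x - min) / (max - min) maps a column to [0, 1]
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
D1$accrate <- min_max(D1$accrate)
D1$graduat <- min_max(D1$graduat)

# b) Fix the random number generator state (placeholder seed, see above)
set.seed(123)

# c) Random 70/30 split by row index
train_idx <- sample(nrow(D1), size = floor(0.7 * nrow(D1)))
train <- D1[train_idx, ]
test  <- D1[-train_idx, ]

# d) Classify each test row by majority vote among its k nearest training rows
features <- c("accrate", "graduat")
pred <- knn(train = train[, features],
            test  = test[, features],
            cl    = train$public.private,
            k     = 5)  # k is an assumption; the lab does not state a value

# Confusion matrix: predicted class against actual class on the test set
conf_mat <- table(Predicted = pred, Actual = test$public.private)
print(conf_mat)
```

The diagonal of the confusion matrix counts correct predictions, so sum(diag(conf_mat)) / sum(conf_mat) gives the test accuracy. Since knn() breaks ties at random, setting the seed before the split also keeps the predictions reproducible.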