IE 332 - Homework #3
Due: April 20, 11:59pm
Read Carefully. Important!
As outlined in the course syllabus this homework is worth 7% of your final grade. The maximum attainable
mark on this homework is 140. As was also outlined in the syllabus, there is a zero tolerance policy for
any form of academic misconduct. The assignment can be done individually or in pairs.
By electronically uploading this assignment to Blackboard you acknowledge these statements and accept
any repercussions if in any violation of ANY Purdue Academic Misconduct policies. You must upload your
homework on time for it to be graded. No late assignments will be accepted. Only the last uploaded
version of your assignment will be graded.
NOTE: You should aim to submit no later than 30 minutes before the deadline, as there could be last minute
network traffic that would cause your assignment to be late.
When submitting your assignment it is assumed that every student considers the below checklist, as there are
grading consequences otherwise (e.g., not submitting a cover sheet is an automatic grade of ZERO).
• Attach a cover sheet (see Blackboard) as the first page of your submission.
• Submit page i of this assignment as the second page of your submission.
• Your solutions have style (see Q1 in the assignment).
• All of your solutions (program code, etc.) are included in the submission.
• All of your source code is included as requested (.R, etc.).
• You have not included any screen shots, photos, etc. (plots from R should be intermediately saved as
.png files).
• All math notation and expressions are created using an equation editor (no pictures, handwritten
solutions, etc.).
• If using Word or other text processor, convert the output to .pdf and ensure that it does not contain
any aesthetic or other errors. Submit only the pdf version.
• If using LaTeX, the source code is separately submitted in a .zip file. If using LaTeX correctly, there
is a 7 point bonus. Failure to submit the source code voids potential bonus.
• If submitting with a partner, BOTH must have uploaded the SAME assignment by the due date.
• Watch videos on creating pseudocode if you need a refresher or quick reference to the idea. These are
good starter videos:
www.youtube.com/watch?v=4jLO0vXPktU
www.youtube.com/watch?v=yGvfltxHKUU
Page i of i
1. (0 points) These are style points meant to enforce the skill of communicating technical information in a
precise, concise and easily interpretable way. You are penalized for (a) using poor grammar/spelling, (b)
disorganized presentation of solutions (including organizing your code into functions, as appropriate), (c)
not commenting your source code well, (d) not using meaningful variable names in your code. Penalties are
at the discretion of the TA (who should be grading this “hard”). The presumption is that you do not do any
of these things, and so doing them will cost additional points (up to -10). Your goal is to get 0/0 on this
question.
The assignment should have margins between 0.5 and 0.75 inches wide, with font size no larger than 12pt
(11pt for code) and no smaller than 10pt (sub/superscripts and figure and plot captions excluded, but they
should be clearly legible). Clearly label each question.
If a question requires more than 40% of the page to answer, then that is the only answer on that page. If
multiple pages are required, this rule applies to the last of those pages.
2. As discussed in lecture, k-means clustering can group samples into a user-defined number of clusters. In this
question, you will perform a k-means analysis for the Youtube Videos data set (posted on Blackboard).
This data set includes 40,949 rows (representing individual videos), and each row has attributes/columns
indicating values such as the video id, trending date, title, channel title, category id, publish time, tags,
views, likes, dislikes, comment count, etc. As usual, submit your R code, as well as R console output and
plots as appropriate.
(a) (5 points) Perform a k-means clustering by varying k = {2, . . . , 6}. Consider only attributes of
views, likes, dislikes, comment count data! Provide 1-2 sentences to explain your results as to the
usefulness of the resulting clusters, and anything else you may notice concerning cluster quality. No
more than 7 lines of R code!
(b) (8 points) Perform cluster validation analysis. To do this use the "clValid" package for k =
{2, . . . , 6}, and considering three clustering methods {k-means, hierarchical and partitioning around
medoids (pam)}. Only use the first 1000 rows of the data, after sorting by decreasing number of
views. Provide the best cluster value for each clustering method and how you found it (note: use
internal validation). No more than 5 lines of R code!
(c) (6 points) Create a biplot using the autoplot function in the ggfortify package using the best
number of clusters for each clustering method you obtained in the previous step (so one plot per
clustering method). Provide 1-2 sentences to explain and compare/contrast the results. No more
than 3 lines of R code!
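As a rough illustration of varying k, the sketch below runs base R's kmeans for k = 2, ..., 6 on synthetic data and records the total within-cluster sum of squares for each k. The data frame toy and its values are hypothetical stand-ins for the four count columns, and the log transform and standardization are assumptions made here (the assignment does not ask for them).

```r
# Synthetic stand-in for the four count columns (hypothetical values, not the course data).
set.seed(332)
toy <- data.frame(
  views         = rlnorm(300, meanlog = 12, sdlog = 1.5),
  likes         = rlnorm(300, meanlog = 8,  sdlog = 1.5),
  dislikes      = rlnorm(300, meanlog = 5,  sdlog = 1.5),
  comment_count = rlnorm(300, meanlog = 6,  sdlog = 1.5)
)

# Assumption: log-transform and standardize so that views does not dominate the distances.
toy_scaled <- scale(log1p(toy))

# Fit k-means for each k and record the total within-cluster sum of squares.
ks  <- 2:6
wss <- sapply(ks, function(k) kmeans(toy_scaled, centers = k, nstart = 10)$tot.withinss)
names(wss) <- ks
print(round(wss, 1))
```

Comparing how quickly the within-cluster sum of squares drops as k grows is one informal way to judge whether extra clusters are useful; on unscaled counts the clusters would mostly reflect the views column.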
3. We also discussed classification in lecture. To this end, the goal of this question is twofold:
1. Learn how to use the function prcomp to perform Principal Component Analysis (PCA) for
dimensionality reduction. This will be accomplished by selecting the principal components that
correspond to 90% of the variance in the data set.
2. Compare the performance of the Naive Bayes classifier (using the e1071 package), and the Decision
Tree classifier (using the tree package). For that purpose, you will report the confusion matrix
and the ROC curve (with the ROCR package).
The dataset:
• The data set (‘2017 Financial Data.csv’ available on Blackboard) is composed of more than 200
financial indicators such as revenue, gross profit, operating expenses, etc., for more than 4000 publicly
listed companies in 2017.
• The value of the column Class is “1” if the stock price increased (in percentage points) from the
beginning of the year to the end of the year, or “0” if it decreased. From a trading perspective, the
“1” identifies those stocks that a hypothetical trader should have bought at the start of the year and
sold at the end of the year for a profit.
Before you start, make sure you have installed the e1071, tree, and ROCR packages.
STEP 1: Cleaning the dataset.
(a) (2 points) Use the function read.csv to read the data into a data frame in R called df. Set
header=TRUE. You only need one line of R code.
(b) (5 points) Remove from df all of the columns that have more than 1000 NA entries. This will remove
columns that have “too many” NA entries. Then, apply the function na.omit() to omit all the rows
that contain NA values. Name the resulting data frame as df_cleaned. No more than 3 lines of
R code.
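The two cleaning steps in STEP 1 can be sketched on a toy data frame. Everything here is hypothetical: the data frame toy, its columns, and a threshold of 2 NA entries standing in for the assignment's 1000.

```r
# Toy data frame (hypothetical); the threshold of 2 NAs stands in for the assignment's 1000.
toy <- data.frame(
  revenue = c(10, NA, 30, 40, 50),
  profit  = c(NA, NA, NA, 4, 5),
  Class   = c(1, 0, 1, 0, 1)
)
na_threshold <- 2

# Keep only the columns whose NA count does not exceed the threshold.
kept <- toy[, colSums(is.na(toy)) <= na_threshold, drop = FALSE]

# Drop every remaining row that still contains an NA.
toy_cleaned <- na.omit(kept)
print(toy_cleaned)
```

Filtering columns first matters: applying na.omit before dropping the sparse columns would discard far more rows.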
STEP 2: Apply PCA to select the minimum number of principal components that can explain ≥90%
of the overall variance.
(c) (5 points) Since PCA is a linear algebra based method, we can only apply it to numeric columns.
Create a data frame called df_numeric, where you remove the columns X, class, sector, that are
the non-numeric columns from df_cleaned. Apply PCA with the prcomp function to df_numeric
(use arguments center and scale to make each column have mean zero and variance 1) and save
the result as a variable principal_components. Two lines of R code only.
(d) (2 points) Show a summary() of your PCA results. Read the summary result and indicate the
minimum number of principal components necessary to account for 90% of the cumulative variance.
From the summary, you can note that the principal components are ordered by decreasing variance.
Only 1 line of R code needed.
(e) (4 points) In this and the following item, we want to illustrate the variance and cumulative variance
accounted for by each of the principal components by creating some plots. Plot the principal_components
variable with screeplot, where the y-axis shows the variance accounted for by each of the principal
components, whose number is shown on the x-axis. Only 1 line of R code needed.
(f) (6 points) Plot the cumulative sum of the variances where the x-axis is the principal component
number and the y-axis the cumulative variance. Add a horizontal line to the plot at 0.9, so that we can
visualize the number of principal components needed to account for 90% of the cumulative variance.
(Hint: Use the column sdev from the principal_components variable to calculate the variance of
each principal component, then apply the cumsum function to find the cumulative variance). Only
4 lines of R code.
(g) (5 points) Create a data frame df_pca by binding the Class column (set as factor, in order to
properly work with the Naive Bayes and tree models) and the first 54 principal components, that
account for more than 90% of the variance. This data frame should have the format:
Class PC1 PC2 ... PC54
(Hint: Use the column x from the result of the prcomp function to obtain a matrix with each principal
component as column and the observations as rows.) Only 3 lines of R code needed.
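The 90% rule in STEP 2 can be illustrated with a minimal sketch on synthetic data. The names toy_numeric, toy_pca and n_components are hypothetical, the data are random, and the number of components found here says nothing about the financial data set.

```r
set.seed(332)
# Synthetic numeric data (hypothetical): 200 observations, 10 columns driven by 3 latent factors plus noise.
latent      <- matrix(rnorm(200 * 3), ncol = 3)
loadings    <- matrix(rnorm(3 * 10), nrow = 3)
toy_numeric <- as.data.frame(latent %*% loadings + matrix(rnorm(200 * 10, sd = 0.3), ncol = 10))

# Center and scale each column before PCA (the prcomp argument is spelled scale. with a trailing dot).
pca <- prcomp(toy_numeric, center = TRUE, scale. = TRUE)

# Variance of each component is sdev^2; normalize the running total to get the cumulative proportion.
variance     <- pca$sdev^2
cum_var      <- cumsum(variance) / sum(variance)
n_components <- which(cum_var >= 0.9)[1]

# Cumulative variance plot with a horizontal reference line at 0.9.
plot(cum_var, type = "b", xlab = "Principal component", ylab = "Cumulative proportion of variance")
abline(h = 0.9, lty = 2)

# Bind a factor label (random here, for illustration only) to the retained component scores.
toy_pca <- data.frame(
  Class = factor(sample(0:1, 200, replace = TRUE)),
  pca$x[, 1:n_components, drop = FALSE]
)
print(n_components)
```

The scores in pca$x already carry the column names PC1, PC2, and so on, so the resulting data frame has the Class, PC1, PC2, ... layout described in item (g).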
STEP 3: Partition the data set into training and testing data.
(h) (6 points) For the below classifiers your training set will contain a random selection of 75% of the data
and the testing set will be composed of the remaining 25%. You can use the createDataPartition
function in the caret package, setting the attribute p to the appropriate proportion, and retrieve
the column Resample1 from the result to get the randomly selected indices. No more than 3 lines
of R code.
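For intuition about the 75%/25% split, here is a minimal sketch that uses base R's sample() instead of createDataPartition. This is a deliberate swap: sample() draws rows uniformly at random, whereas createDataPartition also stratifies by the outcome. The data frame toy_pca is hypothetical.

```r
set.seed(332)
# Hypothetical stand-in for the PCA data frame: a factor label plus two score columns.
toy_pca <- data.frame(
  Class = factor(rep(c(0, 1), each = 50)),
  PC1   = rnorm(100),
  PC2   = rnorm(100)
)

# Draw 75% of the row indices without replacement for training.
train_idx <- sample(seq_len(nrow(toy_pca)), size = floor(0.75 * nrow(toy_pca)))

# Training rows are the sampled indices; testing rows are everything else.
train_set <- toy_pca[train_idx, ]
test_set  <- toy_pca[-train_idx, ]
print(c(train = nrow(train_set), test = nrow(test_set)))
```

Fixing the seed makes the split reproducible, which helps when comparing the two classifiers on the same testing set.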
STEP 4: Learn the models using the training data.
(i) (5 points) Use the naiveBayes function in the e1071 package to learn a classifier that determines
if a stock is “1” or “0” (meaning explained above). You only need one line of R code.
(j) (5 points) Use the tree function in the tree package, and learn a classifier to determine the stock
class. You only need one line of R code.
STEP 5: Evaluate the performance of the models on unseen data (testing data). No more than 7
lines of R code.
(k) (8 points) Report the Confusion matrix for both models from above using no more than 7 lines
of R code. What percentage of the test data was correctly classified for each model?
(l) (10 points) Use the ROCR package in R to create a single ROC plot showing
Answered Same Day Apr 07, 2021

Solution

Abr Writing answered on Apr 12 2021
127 Votes
q2.R
library(clValid)
library(GGally)
library(ggfortify)
# Question 2
videos <- read.csv("USvideos.csv")
videos <- videos[,c(
"views",
"likes",
"dislikes",
"comment_count"
)]
cols <- c(
"forestgreen",
"gold",
"dodgerblue",
"red",
"yellow",
"brown"
)
## Part (a)
for (k in 2:6) {
  # Fit k-means with k clusters on the four count columns
  fit <- kmeans(videos, k)
  print(paste0("For k = ", k, ", the cluster centers are:"))
  print(fit$centers)
  cluster <- as.factor(fit$cluster)
  # Pairwise scatter plots colored by cluster assignment
  print(
    ggpairs(data = videos, mapping = aes(color = cluster), legend = c(1, 1), progress = FALSE) +
      theme(axis.line = element_blank(), axis.text = element_blank(), axis.ticks = element_blank())
  )
}
## Part (b)
videos.1000 <- videos[order(-videos$views),][1:1000,]
fit <- clValid(videos.1000, 2:6, clMethods=c("hierarchical","kmeans","pam"), validation="internal",maxitems=nrow(videos.1000))
summary(fit)
## Part (c)
plot(fit)
q2.docx
IE 332 - Homework #3
Kanish
12/04/2020
Question 2
videos <- read.csv("USvideos.csv")
videos <- videos[,c(
"views",
"likes",
"dislikes",
"comment_count"
)]
cols <- c(
"forestgreen",
"gold",
"dodgerblue",
"red",
"yellow",
"brown"
)
Part (a)
for (k in 2:6) {
fit <- kmeans(videos, k)
print(paste("For K = ",
k,
", The centre of the clusters are: "))
print(fit$centers)
cluster <- as.factor(fit$cluster)
print(ggpairs(data = videos,
mapping = aes(color=cluster),
legend = c(1,1),
progress = F) +
theme(axis.line=element_blank(),
axis.text=element_blank(),
...