IE 332 - Homework #3, Due: April 20, 11:59pm

Question

IE 332 - Homework #3
Due: April 20, 11:59pm

Read Carefully. Important!

As outlined in the course syllabus, this homework is worth 7% of your final grade. The maximum attainable mark on this homework is 140. As was also outlined in the syllabus, there is a zero tolerance policy for any form of academic misconduct. The assignment can be done individually or in pairs.

By electronically uploading this assignment to Blackboard you acknowledge these statements and accept any repercussions if in any violation of ANY Purdue Academic Misconduct policies. You must upload your homework on time for it to be graded. No late assignments will be accepted. Only the last uploaded version of your assignment will be graded.

NOTE: You should aim to submit no later than 30 minutes before the deadline, as there could be last minute network traffic that would cause your assignment to be late.

When submitting your assignment it is assumed that every student considers the below checklist, as there are grading consequences otherwise (e.g., not submitting a cover sheet is an automatic grade of ZERO).

• Attach a cover sheet (see Blackboard) as the first page of your submission.
• Submit page i of this assignment as the second page of your submission.
• Your solutions have style (see Q1 in the assignment).
• All of your solutions (program code, etc.) are included in the submission.
• All of your source code is included as requested (.R, etc.).
• You have not included any screen shots, photos, etc. (plots from R should be intermediately saved as .png files).
• All math notation and expressions are created using an equation editor (no pictures, handwritten solutions, etc.).
• If using Word or other text processor, convert the output to .pdf and ensure that it does not contain any aesthetic or other errors. Submit only the pdf version.
• If using LaTeX, the source code is separately submitted in a .zip file. If using LaTeX correctly, there is a 7 point bonus. Failure to submit the source code voids potential bonus.
• If submitting with a partner, BOTH must have uploaded the SAME assignment by the due date.
• Watch videos on creating pseudocode if you need a refresher or quick reference to the idea. These are good starter videos:
  www.youtube.com/watch?v=4jLO0vXPktU
  www.youtube.com/watch?v=yGvfltxHKUU

1. (0 points) These are style points meant to enforce the skill of communicating technical information in a precise, concise and easily interpretable way. You are penalized for (a) using poor grammar/spelling, (b) disorganized presentation of solutions (including organizing your code into functions, as appropriate), (c) not commenting your source code well, (d) not using meaningful variable names in your code. Penalties are at the discretion of the TA (who should be grading this “hard”). The presumption is that you do not do any of these things, and so doing them will cost additional points (up to -10). Your goal is to get 0/0 on this question.

The assignment should have margins between 0.5 and 0.75 inches wide, with font size no larger than 12pt (11pt for code) and no smaller than 10pt - sub/superscripts, figure and plot captions excluded, but should be clearly legible. Clearly label each question.

If a question requires more than 40% of the page to answer, then that is the only answer on that page. If multiple pages are required, this rule applies to the last of those pages.

2. As discussed in lecture, k-means clustering can group samples into a user-defined number of clusters.
In this question, you will perform a k-means analysis for the Youtube Videos data set (posted on Blackboard). This data set includes 40,949 rows (representing individual videos), and each row has attributes/columns indicating values such as the video id, trending date, title, channel title, category id, publish time, tags, views, likes, dislikes, comment count, etc. As usual, submit your R code, as well as R console output and plots as appropriate.

(a) (5 points) Perform a k-means clustering by varying k = {2, ..., 6}. Consider only the attributes views, likes, dislikes and comment count! Provide 1-2 sentences to explain your results as to the usefulness of the resulting clusters, and anything else you may notice concerning cluster quality. No more than 7 lines of R code!

(b) (8 points) Perform cluster validation analysis. To do this, use the “clValid” package for k = {2, ..., 6}, considering three clustering methods {k-means, hierarchical and partitioning around medoids (pam)}. Only use the first 1000 rows of the data, after sorting by decreasing number of views. Provide the best cluster value for each clustering method and how you found it (note: use internal validation). No more than 5 lines of R code!

(c) (6 points) Create a biplot using the autoplot function in the ggfortify package using the best number of clusters for each clustering method you obtained in the previous step (so one plot per clustering method). Provide 1-2 sentences to explain and compare/contrast the results. No more than 3 lines of R code!

3. We also discussed classification in lecture. To this end, the goal of this question is twofold:

   1. Learn how to use the function prcomp to perform Principal Component Analysis (PCA) for dimensionality reduction. This will be accomplished by selecting the principal components that correspond to 90% of the variance in the data set.
   2. Compare the performance of the Naive Bayes classifier (using the e1071 package), and the Decision Tree classifier (using the tree package). For that purpose, you will report the confusion matrix and the ROC curve (with the ROCR package).

The dataset:

• The data set (‘2017 Financial Data.csv’ available on Blackboard) is composed of more than 200 financial indicators such as revenue, gross profit, operating expenses, etc., for more than 4000 publicly listed companies in 2017.
• The value of the column Class is “1” if the stock price increased (in percentage points) from the beginning of the year to the end of the year, or “0” if it decreased. From a trading perspective, the “1” identifies those stocks that a hypothetical trader should have bought at the start of the year and sold at the end of the year for a profit.

Before you start, make sure you have installed the e1071, tree, and ROCR packages.

STEP 1: Cleaning the dataset.

(a) (2 points) Use the function read.csv to read the data into a data frame in R called df. Set header=TRUE. You only need one line of R code.

(b) (5 points) Remove from df all of the columns that have more than 1000 NA entries. This will remove columns that have “too many” NA entries. Then, apply the function na.omit() to omit all the rows that contain NA values. Name the resulting data frame df_cleaned. No more than 3 lines of R code.

STEP 2: Apply PCA to select the minimum number of principal components that can explain ≥90% of the overall variance.

(c) (5 points) Since PCA is a linear algebra based method, we can only apply it to numeric columns. Create a data frame called df_numeric, where you remove the columns X, class, sector, which are the non-numeric columns of df_cleaned. Apply PCA with the prcomp function to df_numeric (use arguments center and scale to make each column have mean zero and variance 1) and save the result as a variable principal_components. Two lines of R code only.

(d) (2 points) Show a summary() of your PCA results.
Read the summary result and indicate the minimum number of principal components necessary to account for 90% of the cumulative variance. From the summary, you can note that the principal components are ordered by decreasing variance. Only 1 line of R code needed.

(e) (4 points) In this and the following item, we want to illustrate the variance and cumulative variance accounted for by each of the principal components by creating some plots. Plot the principal_components variable with screeplot, where the y-axis shows the variance accounted for by each of the principal components, whose number is shown on the x-axis. Only 1 line of R code needed.

(f) (6 points) Plot the cumulative sum of the variances where the x-axis is the principal component number and the y-axis the cumulative variance. Add a horizontal line to the plot at 0.9, so that we can visualize the number of principal components needed to account for 90% of the cumulative variance. (Hint: Use the column sdev from the principal_components variable to calculate the variance of each principal component, then apply the cumsum function to find the cumulative variance.) Only 4 lines of R code.

(g) (5 points) Create a data frame df_pca by binding the Class column (set as factor, in order to properly work with the Naive Bayes and tree models) and the first 54 principal components, which account for more than 90% of the variance. This data frame should have the format:

    Class PC1 PC2 ... PC54

(Hint: Use the column x from the result of the prcomp function to obtain a matrix with each principal component as column and the observations as rows.) Only 3 lines of R code needed.

STEP 3: Partition the data set into training and testing data.

(h) (6 points) For the below classifiers your training set will contain a random selection of 75% of the data and the testing set will be composed of the remaining 25%. You can use the createDataPartition function in the caret package, setting the attribute p to the appropriate proportion, and retrieve the column Resample1 from the result to get the randomly selected indices. No more than 3 lines of R code.

STEP 4: Learn the models using the training data.

(i) (5 points) Use the naiveBayes function in the e1071 package to learn a classifier that determines if a stock is “1” or “0” (meaning explained above). You only need one line of R code.

(j) (5 points) Use the tree function in the tree package, and learn a classifier to determine the stock class. You only need one line of R code.

STEP 5: Evaluate the performance of the models on unseen data (testing data). No more than 7 lines of R code.

(k) (8 points) Report the confusion matrix for both models from above using no more than 7 lines of R code. What percentage of the test data was correctly classified for each model?

(l) (10 points) Use the ROCR package in R to create a single ROC plot showing
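
The PCA part of Question 3 (items (c), (d) and (f)) may be easier to follow with a small illustration. The following is only a sketch on synthetic data, not the assignment's solution: the data frame here is randomly generated as a stand-in for df_numeric, the helper names latent, loadings, variance_share, cumulative_variance and n_components are made up for this example, and the number of components it finds has nothing to do with the 54 quoted for the real data set. It uses base R only (prcomp, cumsum, plot, abline).

```r
set.seed(332)

# Synthetic stand-in for df_numeric (hypothetical data): 200 rows, 10 columns
# driven by 3 hidden factors plus noise.
latent <- matrix(rnorm(200 * 3), nrow = 200)
loadings <- matrix(rnorm(3 * 10), nrow = 3)
noise <- matrix(rnorm(200 * 10, sd = 0.5), nrow = 200)
df_numeric <- as.data.frame(latent %*% loadings + noise)

# center/scale. standardise every column to mean 0 and variance 1.
principal_components <- prcomp(df_numeric, center = TRUE, scale. = TRUE)

# The variance of each component is sdev^2; convert to shares and accumulate.
variance_share <- principal_components$sdev^2 / sum(principal_components$sdev^2)
cumulative_variance <- cumsum(variance_share)

# Smallest number of components whose cumulative share reaches 90%.
n_components <- which(cumulative_variance >= 0.9)[1]
print(n_components)

# Cumulative variance against component number, with the 90% threshold marked.
plot(cumulative_variance, type = "b",
     xlab = "Principal component", ylab = "Cumulative variance")
abline(h = 0.9, lty = 2)
```

The same cumulative shares appear in the "Cumulative Proportion" row printed by summary(principal_components), so the two can be cross-checked against each other.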

Abr Writing · Accepted Answer

q2.R
library(clValid)
library(GGally)
library(ggfortify)
# Question 2
videos
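
The preview above stops at this point. Separately from it, the k-means sweep described in Question 2(a) could look roughly like the sketch below. This is an illustration under stated assumptions, not the accepted answer: the data are synthetic log-normal counts standing in for the views, likes, dislikes and comment count columns, the names video_counts, scaled_counts and between_share are hypothetical, and standardising the columns before clustering is a choice made here rather than something the assignment asks for. Only base R's kmeans is used.

```r
set.seed(332)

# Synthetic stand-in for the four count columns (hypothetical values,
# not the YouTube data set).
video_counts <- data.frame(
  views = rlnorm(300, meanlog = 10),
  likes = rlnorm(300, meanlog = 7),
  dislikes = rlnorm(300, meanlog = 4),
  comment_count = rlnorm(300, meanlog = 5)
)

# Assumption: put all columns on a common scale so views do not dominate.
scaled_counts <- scale(video_counts)

# Fit k-means for k = 2..6 and record the between-cluster share of the
# total sum of squares as a rough indicator of cluster separation.
between_share <- sapply(2:6, function(k) {
  fit <- kmeans(scaled_counts, centers = k, nstart = 10, iter.max = 50)
  fit$betweenss / fit$totss
})
names(between_share) <- paste0("k=", 2:6)
print(round(between_share, 3))
```

A higher between-cluster share alone does not show that the clusters are useful, since it tends to grow with k; the internal validation measures requested in part (b) are meant to address that.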
