
11/18/22, 4:22 PM WGU Performance Assessment
https://tasks.wgu.edu/student/ XXXXXXXXXX/course/ XXXXXXXXXX/task/2807/overview
NVM2 — NVM2 TASK 1: CLASSIFICATION ANALYSIS
DATA MINING I — D209
PRFA — NVM2
COMPETENCIES
XXXXXXXXXX : Classification Data Mining Models
The graduate applies observations to appropriate classes and categories using classification models.
XXXXXXXXXX : Data Mining Model Performance
The graduate evaluates data mining model performance for precision, accuracy, and model comparison.
INTRODUCTION
In this task, you will act as an analyst and create a data mining report. In doing so, you must select one of the data dictionary and data set files to use
for your report from the following link: Data Sets and Associated Data Dictionaries.
 
You should also refer to the data dictionary file for your chosen data set from the provided link. You will use Python or R to analyze the given data
and create a data mining report in a word processor (e.g., Microsoft Word). Throughout the submission, you must visually represent each step of you
work and the findings of your data analysis.
 
Note: All algorithms and visual representations used need to be captured either in tables or as screenshots added into the submitted document. A
separate Microsoft Excel (.xls or .xlsx) document of the cleaned data should be submitted along with the written aspects of the data mining report.
REQUIREMENTS
Your submission must be your original work. No more than a combined total of 30% of the submission and no more than a 10% match to any one individual source can be directly quoted or closely paraphrased from sources, even if cited correctly. The originality report that is provided when you submit your task can be used as a guide.
https://lrps.wgu.edu/provision/ XXXXXXXXXX
You must use the rubric to direct the creation of your submission because it provides detailed criteria that will be used to evaluate your work. Each requirement below may be evaluated by more than one rubric aspect. The rubric aspect titles may contain hyperlinks to relevant portions of the course.
Tasks may not be submitted as cloud links, such as links to Google Docs, Google Slides, OneDrive, etc., unless specified in the task requirements. All
other submissions must be file types that are uploaded and submitted as attachments (e.g., .csv, .docx, .pdf, .ppt). 
Part I: Research Question
A.  Describe the purpose of this data mining report by doing the following:
1.  Propose one question relevant to a real-world organizational situation that you will answer using one of the following classification methods:
•  k-nearest neighbor (KNN)
•  Naive Bayes
2.  Define one goal of the data analysis. Ensure that your goal is reasonable within the scope of the scenario and is represented in the available
data.
 
Part II: Method Justification
B.  Explain the reasons for your chosen classification method from part A1 by doing the following:
1.  Explain how the classification method you chose analyzes the selected data set. Include expected outcomes.
2.  Summarize one assumption of the chosen classification method.
3.  List the packages or libraries you have chosen for Python or R, and justify how each item on the list supports the analysis.
 
Part III: Data Preparation
C.  Perform data preparation for the chosen data set by doing the following:
1.  Describe one data preprocessing goal relevant to the classification method from part A1.
2.  Identify the initial data set variables that you will use to perform the analysis for the classification question from part A1, and classify each
variable as continuous or categorical.
3.  Explain each of the steps used to prepare the data for the analysis. Identify the code segment for each step.
4.  Provide a copy of the cleaned data set.
 
Part IV: Analysis
D.  Perform the data analysis and report on the results by doing the following:
1.  Split the data into training and test data sets and provide the file(s).
2.  Describe the analysis technique you used to appropriately analyze the data. Include screenshots of the intermediate calculations you
performed.
3.  Provide the code used to perform the classification analysis from part D2.
 
Part V: Data Summary and Implications
E.  Summarize your data analysis by doing the following:
1.  Explain the accuracy and the area under the curve (AUC) of your classification model.
2.  Discuss the results and implications of your classification analysis.
3.  Discuss one limitation of your data analysis.
4.  Recommend a course of action for the real-world organizational situation from part A1 based on your results and implications discussed in
part E2.
 
Part VI: Demonstration
F.  Provide a Panopto video recording that includes a demonstration of the functionality of the code used for the analysis and a summary of the
programming environment.
 
Note: The audiovisual recording should feature you visibly presenting the material (i.e., not in voiceover or embedded video) and should
simultaneously capture both you and your multimedia presentation.
 
Note: For instructions on how to access and use Panopto, use the "Panopto How-To Videos" web link provided below. To access Panopto's
website, navigate to the web link titled "Panopto Access," and then choose to log in using the “WGU” option. If prompted, log in using your WGU
student portal credentials, and then it will forward you to Panopto’s website.
 
To submit your recording, upload it to the Panopto drop box titled “Data Mining I – NVM2.” Once the recording has been uploaded and processed in Panopto's system, retrieve the URL of the recording from Panopto and copy and paste it into the Links option. Upload the remaining task requirements using the Attachments option.
 
G.  Record the web sources used to acquire data or segments of third-party code to support the analysis. Ensure the web sources are reliable.
H.  Acknowledge sources, using in-text citations and references, for content that is quoted, paraphrased, or summarized.
I.  Demonstrate professional communication in the content and presentation of your submission.
File Restrictions
File name may contain only letters, numbers, spaces, and these symbols: ! - _ . * ' ( )
File size limit: 200 MB
File types allowed: doc, docx, rtf, xls, xlsx, ppt, pptx, odt, pdf, txt, qt, mov, mpg, avi, mp3, wav, mp4, wma, flv, asf, mpeg, wmv, m4v, svg, tif, tiff, jpeg, jpg, gif, png,
zip, rar, tar, 7z
RUBRIC
A1:PROPOSAL OF QUESTION
A2:DEFINED GOAL
B1:EXPLANATION OF CLASSIFICATION METHOD
B2:SUMMARY OF METHOD ASSUMPTION
NOT EVIDENT
The submission does not propose 1 question.
APPROACHING COMPETENCE
The submission proposes 1 question that is
not relevant to a real-world organizational
situation. Or the proposal does not include 1
of the given classification methods.
COMPETENT
The submission proposes 1 question that is relevant to a real-world organizational situation, and the proposal includes 1 of the given classification methods.
NOT EVIDENT
The submission does not define 1 goal for data analysis.
APPROACHING COMPETENCE
The submission defines 1 goal for data analysis, but the goal is not reasonable, is not within the scope of the scenario, or is not represented in the available data.
COMPETENT
The submission defines 1 reasonable goal for data analysis that is within the scope of the scenario and is represented in the available data.
NOT EVIDENT
The submission does not explain how the cho-
sen classification method analyzes the se-
lected data set.
APPROACHING COMPETENCE
The submission does not logically explain how
the chosen classification method analyzes the
selected data set, or the explanation includes
inaccurate expected outcomes.
COMPETENT
The submission logically explains how the cho-
sen classification method analyzes the se-
lected data set and includes accurate expected
outcomes.
NOT EVIDENT
The submission does not summarize 1 as-
sumption of the chosen classification method.
APPROACHING COMPETENCE
The submission inadequately summarizes 1 assumption of the chosen classification method.
COMPETENT
The submission adequately summarizes 1 assumption of the chosen classification method.
B3:PACKAGES OR LIBRARIES LIST
C1:DATA PREPROCESSING
C2:DATA SET VARIABLES
C3:STEPS FOR ANALYSIS
NOT EVIDENT
The submission does not list the packages or libraries chosen for Python or R.
APPROACHING COMPETENCE
The submission lists the packages or libraries chosen for Python or R but does not justify how 1 or more items on the list support the analysis.
COMPETENT
The submission lists the packages or libraries chosen for Python or R and justifies how each item on the list supports the analysis.
NOT EVIDENT
The submission does not describe 1 data pre-
processing goal.
APPROACHING COMPETENCE
The submission describes 1 data preprocess-
ing goal, but it is not relevant to the classifica-
tion method from part A1.
COMPETENT
The submission describes 1 data preprocess-
ing goal that is relevant to the classification
method from part A1.
NOT EVIDENT
The submission does not identify any data set variables used to perform the analysis for the classification question from part A1 or does not classify the variables as continuous or categorical.
APPROACHING COMPETENCE
The submission identifies the data set vari-
ables used to perform the analysis for the
classification question from part A1, but the
submission inaccurately classifies 1 or more
variables as continuous or categorical.
COMPETENT
The submission identifies the data set vari-
ables used to perform the analysis for the clas-
sification question from part A1, and the sub-
mission accurately classifies each variable as
continuous or categorical.
C4:CLEANED DATA SET
D1:SPLITTING THE DATA
D2:OUTPUT AND INTERMEDIATE CALCULATIONS
NOT EVIDENT
The submission does not explain each step used to prepare the data for the analysis, or the submission does not identify the code segment for each step.
APPROACHING COMPETENCE
The submission inaccurately explains 1 or more steps used to prepare the data for analysis, or the submission identifies an inaccurate code segment for 1 or more steps.
COMPETENT
The submission accurately explains each step
used to prepare the data for analysis, and the
submission identifies an accurate code seg-
ment for each step.
NOT EVIDENT
The submission does not include a copy of the cleaned data set.
APPROACHING COMPETENCE
The submission includes a copy of the cleaned
data set, but the data set is inaccurate.
COMPETENT
The submission includes an accurate copy of
the cleaned data set.
NOT EVIDENT
The submission does not provide the training
and test data set file(s).
APPROACHING COMPETENCE
The submission provides training and test
data sets, but the split is not reasonably
proportioned.
COMPETENT
The submission provides reasonably propor-
tioned training and test data sets.
NOT EVIDENT
The submission does not describe the analy-
sis technique used to analyze the data, or it
does not include screenshots of the interme-
diate calculations performed.
APPROACHING COMPETENCE
The submission inaccurately describes the analysis technique used to appropriately analyze the data, or the submission includes screenshots of the intermediate calculations performed but they are inaccurate.
COMPETENT
The submission accurately describes the analysis technique used to appropriately analyze the data, and the submission includes accurate screenshots of the intermediate calculations performed.
D3:CODE EXECUTION
E1:ACCURACY AND AUC
E2:RESULTS AND IMPLICATIONS
E3:LIMITATION
NOT EVIDENT
The submission does not provide the code
used to perform the classification analysis
from part D2.
APPROACHING COMPETENCE
The submission provides the code used to perform the classification analysis from part D2, but 1 or more errors are evident during the execution of the code.
COMPETENT
The submission provides the code used to perform the classification analysis from part D2, and the code executes without errors.
NOT EVIDENT
The submission does not explain the accuracy
or the AUC of the classification model.
APPROACHING COMPETENCE
The submission does not logically explain the
accuracy or the AUC of the classification
model.
COMPETENT
The submission logically explains both the ac-
curacy and the AUC of the classification
model.
NOT EVIDENT
The submission does not discuss both the re-
sults and implications of the classification
analysis.
APPROACHING COMPETENCE
The submission discusses both the results and implications of the classification analysis, but the discussion is inadequate.
COMPETENT
The submission adequately discusses both the results and implications of the classification analysis.
E4:COURSE OF ACTION
F:PANOPTO RECORDING
G:SOURCES FOR THIRD-PARTY CODE
NOT EVIDENT
The submission does not discuss 1 limitation
of the data analysis.
APPROACHING COMPETENCE
The submission discusses 1 limitation of the
data analysis but lacks adequate detail or is
illogical.
COMPETENT
The submission logically discusses 1 limitation
of the data analysis with adequate detail.
NOT EVIDENT
The submission does not recommend a course of action for the real-world organizational situation from part A1.
APPROACHING COMPETENCE
The submission does not recommend a rea-
sonable course of action for the real-world
organizational situation from part A1, or the
course of action is not based on the results
and implications discussed in part E2.
COMPETENT
The submission recommends a reasonable
course of action for the real-world organiza-
tional situation from part A1 based on the re-
sults and implications discussed in part E2.
NOT EVIDENT
The submission does not provide a Panopto
video recording.
APPROACHING COMPETENCE
The submission provides a Panopto video recording, but it does not include a demonstration of the functionality of the code used for the analysis, a summary of the programming environment, or both.
COMPETENT
The submission provides a Panopto video recording that includes a demonstration of the functionality of the code used for the analysis and a summary of the programming environment.

Solution

Aditi answered on Nov 22 2022
Task 1 - KNN Classification
Configure Notebook:
Configure and import packages. An imports.py file contains all of the code needed for importing and configuration. There is also a second helper file, helpers.py, which defines several functions used throughout this notebook.
In [1]:
    from imports import *
    %matplotlib inline
    warnings.filterwarnings('ignore')
P:\code\wgu\py\Scripts\python.exe
python version: 3.9.7
pandas version: 1.3.0
numpy version: 1.19.5
scipy version: 1.7.1
sklearn version: 1.0.1
matplotlib version: 3.4.2
seaborn version: 0.11.2
graphviz version: 0.17
    from helpers import *
getFilename version: 1.0
saveTable version: 1.0
describeData version: 1.0
createScatter version: 1.0
createBarplot version: 1.1
get_unique_numbers version: 1.0
createCorrelationMatrix version: 1.0
createStackedHistogram version: 1.0
plotDataset version: 1.0
Part I: Research Question
A. Describe the purpose of this data mining report by doing the following:
In [3]:
A1. Propose one question relevant to a real-world organizational situation that you will answer using one of the following classification methods: (a) k-nearest neighbor (KNN) or (b) Naive Bayes.
Primary purpose: A telecommunications business has received an inquiry about churn. Churn occurs when a client decides to stop using services. Given that the firm has records of customers who have and have not churned in the past, is it feasible to classify a new (or current) client based on their resemblance to previous customers with comparable characteristics? Two (2) attributes, MonthlyCharge and Tenure, from the company's database of 10,000 customers will be taken into account in this research. Additionally, the analysis will make an effort to determine how accurate the prediction is.
A2. Define one goal of the data analysis. Ensure that your goal is reasonable within the scope of the scenario and is represented in the available data.
Primary purpose: With MonthlyCharge = $170.00 and Tenure = 1.0, the study will try to forecast customer churn for a new client. The company's customer data may be used to accomplish this aim, as both traits are included in the data for 10,000 customers and should give sufficient information for the forecast. K-nearest neighbours (KNN) will be used in the study to categorise the new client depending on the k-nearest existing customers with comparable features.
    import pandas as pd
newCustomer = pd.DataFrame([{'Tenure': 1.0,
'MonthlyCharge': 170.0,
'zTenure': 0.0,
'zMonthlyCharge': 0.0}])
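The zTenure and zMonthlyCharge fields above are initialized to 0.0 as placeholders. As a rough sketch of what the standardization step looks like, the z-scores for the new customer could be computed from the training data's mean and standard deviation. The (mean, std) pairs below are assumed illustrative values, not the real column statistics.

```python
# Hypothetical (mean, std) pairs for illustration only; in the notebook these
# would come from clean['Tenure'].mean()/.std() and
# clean['MonthlyCharge'].mean()/.std().
train_stats = {
    'Tenure': (34.5, 26.4),
    'MonthlyCharge': (172.6, 42.9),
}

new_customer = {'Tenure': 1.0, 'MonthlyCharge': 170.0}

# Standardize each feature of the new customer using the training statistics.
z_scores = {
    'z' + col: (value - train_stats[col][0]) / train_stats[col][1]
    for col, value in new_customer.items()
}
print(z_scores)
```

Standardizing the new observation with the training statistics (rather than leaving it at 0.0) keeps it on the same scale as the z-scored training features that KNN measures distances over.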
Part II: Method Justification
B. Explain the reasons for your chosen classification method from part A1 by doing the following:
B1. Explain how the classification method you chose analyzes the selected data set. Include expected outcomes.
Describe Method. KNN classification will search for comparable characteristics in the k nearest neighbors, which are in close proximity to a classification target value. A classification will be generated when the model determines which class value appears most frequently among those k neighbors. I expect the results to show the target variable in relation to the model's accuracy summary and the chosen k neighbors.
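The majority-vote behavior described above can be sketched as follows. This is a minimal illustration assuming scikit-learn; the feature arrays below are tiny synthetic stand-ins for the standardized training data, not the real 10,000-record data set.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic standardized features (zMonthlyCharge, zTenure) and churn labels;
# the real analysis would use the cleaned company data instead.
X_train = np.array([[-1.0, -1.2], [-0.8, -1.0], [0.9, 1.1], [1.0, 0.8]])
y_train = np.array([True, True, False, False])  # True = churned

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# A new observation is assigned the class held by the majority of its
# 3 nearest training points.
new_obs = np.array([[-0.9, -1.1]])
prediction = knn.predict(new_obs)
print(prediction)  # majority class among the 3 nearest neighbors
```

Here the new point sits next to the two churned records, so two of its three nearest neighbors vote True and the model predicts churn.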
B2. Summarize one assumption of the chosen classification method.
One assumption. It is a fundamental tenet of KNN modelling that related items are close to one another. To classify the new customer, the model will search for comparable customer records and choose the class that occurs most often among those near neighbors.
B3. List the packages or libraries you have chosen for Python or R, and justify how each item on the list supports the analysis.
At the very start of the notebook, all of the Python packages needed for this study were loaded, and their version numbers are displayed above. In addition to the typical Python tools (such as numpy, scipy, matplotlib, pandas, etc.), sklearn provides the main package needed to build and view the classification model. I also employ two (2) .py files; instead of putting all that code into this notebook, I keep it in separate files that can be reused across notebooks.
All necessary packages are included in imports.py, and helpers.py contains a wealth of useful functions that let me standardize my tables, figures, and other notebook components. Both of these .py files will accompany the notebook for your convenience.
Part III: Data Preparation
C. Perform data preparation for the chosen data set by doing the following:
C1. Describe one data preprocessing goal relevant to the classification method from part A1.
One purpose of data preprocessing. After importing the firm's data into the Python environment, the raw numerical data should be normalized before the KNN classification analysis can be applied to this problem. Additionally, the business data will be divided into two (2) subsets: a training dataset with 70% of the data and a testing (validation) dataset with the remaining 30%. The training set will then be used by the KNN to create the model, and the test set will be used to verify the model. To make it as easy as possible for anybody to follow the analysis throughout the notebook, the major objective of data preparation will be establishing these subsets of data. The following is a list of the planned data variables for this analysis:
Raw Data.
y = target data (i.e. Churn (categorical))
X = feature data (i.e. MonthlyCharge and Tenure)
rawData = y.merge(X)
Clean Data.
y = target data (i.e. Churn (bool))
X = feature data (i.e. MonthlyCharge, Tenure, zMonthlyCharge, and zTenure)
cleanData = y.merge(X)
Training Data. 70% of the cleaned data.
X_train = created using train-test-split (i.e. zMonthlyCharge and zTenure)
y_train = created using train-test-split
trainData = y_train.merge(X_train)
Testing Data. The remaining 30% of the cleaned data.
X_test = created using train-test-split (i.e. zMonthlyCharge and zTenure)
y_test = created using train-test-split
testData = y_test.merge(X_test)
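The 70/30 split planned above can be sketched with scikit-learn's train_test_split. A small synthetic frame stands in for the cleaned data here; the column names follow the plan, but the real notebook would operate on the 10,000-row cleaned data set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small synthetic stand-in for the cleaned data (100 rows).
clean = pd.DataFrame({
    'Churn': [True, False] * 50,
    'zMonthlyCharge': [0.1 * i for i in range(100)],
    'zTenure': [0.05 * i for i in range(100)],
})

X = clean[['zMonthlyCharge', 'zTenure']]
y = clean['Churn']

# 70% training / 30% testing, with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

print(len(X_train), len(X_test))  # 70 and 30 rows respectively
```

Fixing random_state makes the split reproducible, which matters when the training and test files are submitted alongside the report.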
C2. Identify the initial data set variables that you will use to perform the analysis for the classification question from part A1, and classify each variable as continuous or categorical.
Establish the initial variables. I will take into account two features, MonthlyCharge and Tenure, as well as one target, Churn, for my study. Reading is done with pandas; the usecols option returns only the selected columns from the raw CSV data file.
MonthlyCharge (FEATURE): the average monthly fee charged to each individual customer.
Tenure (FEATURE): the length of time a consumer has been a customer of the company.
Churn (TARGET): whether a client has stopped receiving services during the past month (yes, no).
In [4]:
TABLE 3-1.SELECTED RAW DATA.
Initial state of dataset before any manipulations.
    raw = pd.read_csv('data/churn_clean.csv',
                      usecols=['Churn', 'Tenure', 'MonthlyCharge'])
    saveTable(data=raw, title='RAW', sect='C2',
              course='D209', task='Task2', caption='3 1')
    
       Churn   Tenure  MonthlyCharge
    0  No       6.796        172.456
    1  Yes      1.157        242.633
    2  No      15.754        159.948
    3  No      17.087        119.957
shape: (10000, 3)
Table saved to: TABLES/D209_TASK2_C2_TAB_3_1_RAW.CSV
Summary. 10,000 customer records with three (3) variables each make up the raw customer data for the firm that has been read into the raw variable. Two (2) of the variables, which are continuous (numerical) data, will be employed as features, and the third variable is our binary target. In addition to the raw data, the conventional transformation, a z-scored column, will be added for each feature.
C3. Explain each of the steps used to prepare the data for the analysis. Identify the code segment for each step.
Step 1: Import the selected company data. The pandas.read_csv() method was used to read the relevant customer data (Churn, MonthlyCharge, and Tenure) into the Python environment using the usecols=[] option. This was completed earlier, in section C2 [9].
In [5]:
    # start with a copy of raw data
    clean = raw.copy()
Step 2: Each row of the Churn variable initially had Yes or No values, therefore this step used the pandas.replace() method to transform the categorical data into boolean data. Boolean data is a form of numerical data in Python that can be either 1 or 0 (int). This was completed earlier, in section C2 [9]. Target Data (y): convert categorical Churn to numeric boolean. Ref: (1) https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
    target = 'Churn'
    clean[target] = clean[target].replace({"No": False, "Yes": True})
    clean[target] = clean[target].astype('bool')
Step 3: Explain the starting set of variables. Describe the data, whether numerical or categorical, for each variable. I used a program I wrote to iterate over each one and list it along with a brief explanation. Additionally, the pandas.describe() function is used to display descriptive statistics for numerical data. Sections C2 [10] and C2 [11] above accomplished this.
    features = ['MonthlyCharge','Tenure']
for c in features:
clean['z'+c] = (clean[c] - clean[c].mean()) / clean[c].std()
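A quick sanity check on the standardization loop above: dividing each deviation from the mean by the sample standard deviation should leave the z-column with mean ≈ 0 and standard deviation ≈ 1. The sketch below uses a small synthetic column, not the company data.

```python
import pandas as pd

# Synthetic stand-in for one feature column.
clean = pd.DataFrame({'MonthlyCharge': [100.0, 150.0, 200.0, 250.0]})

c = 'MonthlyCharge'
clean['z' + c] = (clean[c] - clean[c].mean()) / clean[c].std()

# The z-column's mean is ~0 and its (sample) std is ~1 by construction.
print(round(clean['z' + c].mean(), 6), round(clean['z' + c].std(), 6))
```

This matches the zMonthlyCharge and zTenure summary statistics reported by describeData below, where both z-columns show Std: 1.000.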
    describeData(data=clean)
1. Churn is boolean (BINARY): [False    True].
2. Tenure is numerical (CONTINUOUS) - type: float64. Min: 1.000    Max: 71.999    Std: 26.443
3. MonthlyCharge is numerical (CONTINUOUS) - type: float64. Min: 79.979    Max: 290.160    Std: 42.943
4. zMonthlyCharge is numerical (CONTINUOUS) - type: float64. Min: -2.157    Max: 2.737    Std: 1.000
5. zTenure is numerical (CONTINUOUS) - type:...