


Data Mining
Project
Project Grading
· Submit ONE SINGLE file. Only a doc file is accepted
· The project file should contain the source code and documentation, including screenshots. The screenshots are used to demonstrate the running situation of your programs, particularly how the programs execute and produce output based on different input data and user-specified parameter values.
Final Term Project:
Supervised Data Mining (Classification)
· This option is to implement 2 classification algorithms of your choice on 1 dataset of your choice (each of the 2 algorithms must run on the dataset).
· Your final term project documentation must indicate clearly the algorithms and dataset you used in the project.
Final Term Project:
General Sources of Algorithms/Software
http://davidmlane.com/hyperstat/Statistical_analyses.html
http://statpages.org/javasta2.html
http://pcp.sourceforge.net/
http://www.cs.waikato.ac.nz/ml/weka
http://www.r-project.org/
http://mlflex.sourceforge.net/
Han Book, 3rd ed., Chapters 8, 9
Kumar Book, 1st ed., Chapters 4, 5
Final Term Project:
Specific Algorithms and Tools used in the Project
· There are 5 categories of algorithms listed on the following pages (categories 6-10 are software tools and platforms). The 2 classification algorithms you choose must come from two different categories; only one algorithm may be chosen from any single category.
· In your final term project documentation, for each algorithm you choose, specify clearly the category number, category name, and algorithm name in that category.
Category 1 (Support Vector Machines)
· LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) linear kernel
· LIBSVM polynomial kernel
· LIBSVM radial basis function (RBF) kernel
· LIBSVM sigmoid kernel
· Gist (http://www.chibi.ubc.ca/gist/). Pick a kernel of your choice and specify the kernel used in your project.
Category 2 (Random Forests)
· CRAN (http://cran.r-project.org/web/packages/randomForest/index.html)
· Willows (http://c2s2.yale.edu/software/Willows/)
· Weka (https://www.cs.waikato.ac.nz/ml/weka/)
Category 3 (Decision Trees)
Refer to Weka: http://www.cs.waikato.ac.nz/ml/weka
· ADTree
· J48 (C4.5)
· LMT
· M5P
· NBTree
Category 4 (Bayesian Networks)
· Weka BayesNet (https://www.cs.waikato.ac.nz/ml/weka)
· JBNC (http://jbnc.sourceforge.net/)
Category 5 (Naïve Bayes)
Refer to Weka: https://www.cs.waikato.ac.nz/ml/weka
· AODE
· ComplementNaiveBayes
· NaiveBayes
· NaiveBayesMultinomial
· NaiveBayesSimple
· NaiveBayesUpdateable
Category 6 (R Package)
· Refer to http://www.r-project.org. Pick any classification tool in R.
Category 7 (Mathematical Package)
· MATLAB
Category 8 (RapidMiner)
www.rapidminer.com
http://sourceforge.net/projects/rapidminer
Category 9 (Weka)
https://www.cs.waikato.ac.nz/ml/weka
Category 10 (Python)
https://scikit-learn.org
Final Term Project: Option 1
Sources of Data
http://archive.ics.uci.edu/ml/
http://www.cs.ucr.edu/~eamonn/time_series_data/
http://aws.amazon.com/datasets
http://www.trustlet.org/wiki/Repositories_of_datasets
Final Term Project:
Unsupervised Data Mining (Clustering)
Part 1
Generate a set S of 500 points (vectors) in 3-dimensional Euclidean space. Use the Euclidean distance to measure the distance between any two points. Write a program to find all the outliers in your set S and print out these outliers. If there is no outlier, your program should indicate so. Use any programming language of your choice (specify the programming language you use in the project).
Next, remove the outliers from S, and call the resulting set S'.
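A minimal sketch of one possible approach in Python, assuming a simple distance-based outlier rule (a point is an outlier if fewer than min_neighbors other points lie within radius r; both thresholds, and the use of NumPy, are illustrative assumptions, not requirements of the assignment):

import numpy as np

rng = np.random.default_rng(42)
S = rng.normal(size=(500, 3))  # 500 random points in 3-dimensional space

def find_outliers(points, r=1.0, min_neighbors=5):
    # Pairwise Euclidean distances between all points
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Count neighbors within radius r, excluding the point itself
    neighbor_counts = (dists < r).sum(axis=1) - 1
    return np.where(neighbor_counts < min_neighbors)[0]

outlier_idx = find_outliers(S)
if outlier_idx.size == 0:
    print("No outliers found in S.")
else:
    print("Outliers:", S[outlier_idx])

S_prime = np.delete(S, outlier_idx, axis=0)  # S' = S without the outliers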
Final Term Project:
Part 2
(1) Write a program that implements the hierarchical agglomerative clustering algorithm taught in the class to cluster the points in S’ into k clusters where k is a user-specified parameter value.
(2) Repeat Part 1 and step (1) above on two additional datasets.
Notes on the hierarchical agglomerative clustering algorithm
In determining the distance of two clusters, you should consider the following definitions respectively:
· the distance between the nearest two points in the two clusters,
· the distance between the farthest two points in the two clusters,
· the average distance between points in the two clusters,
· the distance between the centers of the two clusters.
Use the definition that yields the best performance, where performance is measured by the Silhouette coefficient. A sketch comparing the four definitions follows.
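A sketch of this comparison, assuming SciPy and scikit-learn are acceptable tools; SciPy's linkage names map onto the four definitions above (single = nearest points, complete = farthest points, average = mean pairwise distance, centroid = cluster centers). S_prime and the user-specified k are assumed to come from Part 1:

from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

def best_agglomerative_clustering(S_prime, k):
    best = None
    for method in ("single", "complete", "average", "centroid"):
        # Build the dendrogram with this inter-cluster distance definition
        Z = linkage(S_prime, method=method, metric="euclidean")
        # Cut the dendrogram into exactly k clusters
        labels = fcluster(Z, t=k, criterion="maxclust")
        score = silhouette_score(S_prime, labels, metric="euclidean")
        if best is None or score > best[0]:
            best = (score, method, labels)
    return best  # (silhouette, linkage definition, cluster labels)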
Final Term Project:
Submission (One File)
A word or pdf file (final term project report) containing:
· Source code of your clustering algorithm.
· The website where the complete datasets can be downloaded.
· All related documentation, including the manual you developed and screenshots showing the running situation and the input/output of your programs. The report should be written in a tutorial style, explaining through screenshots and examples how to run your tool on the datasets you chose.
Answered 1 day after Mar 15, 2022

Solution

Sharda answered on Mar 16 2022
114 Votes
Final Term Project:
Supervised Data Mining (Classification)
Category 1 (Support Vector Machines)
· Gist (http://www.chibi.ubc.ca/gist/). Pick a kernel of your choice and specify the kernel used in your project.
Category 2 (Random Forests)
· CRAN (http://cran.r-project.org/web/packages/randomForest/index.html)
Category 10 (Python)
· https://scikit-learn.org
Dataset
· http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/
Classification Algorithm 1 -> Category 1: Support Vector Machines in Python
Support Vector Machine for classification using scikit-learn and the Radial Basis Function (RBF) kernel.
The goal is to model and predict the presence of heart disease in patients using a Support Vector Machine. The dataset comes from the heart-disease section of the UCI Machine Learning Repository and contains both continuous and categorical attributes.
First we read the data using pandas' read_csv() function, then look at its structure using the .info() method, and inspect the first few rows using head(), as shown below:
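A sketch of this step; the file name processed.cleveland.data and the absence of a header row are assumptions based on the UCI heart-disease repository:

import pandas as pd

# The raw UCI file has no header row, so columns come in as numbers
df = pd.read_csv('processed.cleveland.data', header=None)
df.info()         # column types and non-null counts
print(df.head())  # first few rows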
We see that instead of column names, we just have column numbers. Since column names make it easier to know how to format the data, let's replace the column numbers with the following column names (a renaming sketch follows the list):
· age,
· sex,
· cp, chest pain
· restbp, resting blood pressure (in mm Hg)
· chol, serum cholesterol in mg/dl
· fbs, fasting blood sugar
· restecg, resting electrocardiographic results
· thalach, maximum heart rate achieved
· exang, exercise induced angina
· oldpeak, ST depression induced by exercise relative to rest
· slope, the slope of the peak exercise ST segment.
· ca, number of major vessels (0-3) colored by fluoroscopy
· thal, short for thallium heart scan
· hd, diagnosis of heart disease, the predicted attribute
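A sketch of the renaming step, assuming the 14 columns appear in the order listed above (which matches the UCI attribute documentation for the processed Cleveland file):

# Replace the numeric column labels with descriptive names
df.columns = ['age', 'sex', 'cp', 'restbp', 'chol', 'fbs', 'restecg',
              'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'hd']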
Get an overview of the distribution of each column:
Next we will look at how the variables correlate with each other using the corr() function:
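A sketch of both steps: describe() for the per-column distributions, corr() for the pairwise correlations. Note that ca and thal are read in as strings because of the '?' markers, so they are excluded from the correlation matrix:

print(df.describe())               # per-column summary statistics
print(df.corr(numeric_only=True))  # numeric columns only (pandas >= 1.5)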
Now that we have the data in a data frame called df, we are ready to identify and deal with Missing Data.
Missing Data Part 1: Identifying Missing Data
Dealing With Missing Data
Since scikit-learn's support vector machines do not support datasets with missing values, we need to figure out what to do with these question marks. We can either delete these patients from the training dataset or impute values for the missing data. First, let's see how many rows contain missing values.
Now let's count the number of rows in the full dataset.
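A sketch of these two checks; missing values in this file are encoded as '?', and the text above tells us they occur in the ca and thal columns:

# Rows where either 'ca' or 'thal' holds a question mark
missing_rows = df.loc[(df['ca'] == '?') | (df['thal'] == '?')]
print(missing_rows)
print(len(missing_rows), 'of', len(df), 'rows contain missing values')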
So 6 of the 303 rows, or 2%, contain missing values. Since 303 - 6 = 297, and 297 is plenty of data to build a support vector machine, we will remove the rows with missing values, rather than try to impute their values. We do this by selecting all of the rows that do not contain question marks in either the ca or thal columns:
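A sketch of the filtering step:

# Keep only the rows whose 'ca' and 'thal' values are not '?'
df_no_missing = df.loc[(df['ca'] != '?') & (df['thal'] != '?')]
print(len(df_no_missing))  # expected: 297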
Split the Data into Dependent and Independent Variables
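A sketch of the split, using hd (the diagnosis) as the dependent variable:

# X: independent variables (everything except the diagnosis); y: the value to predict
X = df_no_missing.drop('hd', axis=1).copy()
y = df_no_missing['hd'].copy()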
Format the Data Part 2: One-Hot Encoding
So we see that age, restbp, chol and thalach are all float64, which is good, because we want them to be floating point numbers. All of the other columns, however, need to be inspected to make sure they only contain reasonable values, and some of them need to change. This is because, while scikit-learn Support Vector Machines natively support continuous data, like resting blood pressure (restbp) and maximum heart rate (thalach), they do not natively support categorical data, like chest pain (cp), which contains 4 different categories. Thus, in order to use categorical data with scikit-learn Support Vector Machines, we have to use a trick that converts a column of categorical data into multiple columns of binary values. This trick is called One-Hot Encoding.
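A sketch of this step with pandas' get_dummies(). The text above names cp explicitly; including restecg, slope and thal in the encoded list is an assumption based on the column descriptions:

# Expand each listed categorical column into one binary column per category
X_encoded = pd.get_dummies(X, columns=['cp', 'restecg', 'slope', 'thal'])
print(X_encoded.head())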
At this...