Microsoft PowerPoint - Final_Term_Project_2021Spring - Compatibility Mode
Data Mining
Project
Project Grading
· Submit ONE SINGLE file. Only a doc file is accepted.
· The project file should contain the source code and documentation, including screenshots. The screenshots demonstrate how your programs run, in particular how they execute and produce output for different input data and user-specified parameter values.
Final Term Project:
Supervised Data Mining (Classification)
· This option is to implement 2 classification algorithms of your choice on 1 dataset of your choice (each of the 2 algorithms must run on the dataset).
· Your final term project documentation must indicate clearly the algorithms and dataset you used in the project.
Final Term Project:
General Sources of Algorithms/Software
http://davidmlane.com/hyperstat/Statistical_analyses.html
http://statpages.org/javasta2.html
http://pcp.sourceforge.net/
http://www.cs.waikato.ac.nz/ml/weka
http://www.r-project.org/
http://mlflex.sourceforge.net/
Han Book 3rd ed Chapters 8, 9
Kumar Book 1st ed Chapters 4, 5
Final Term Project:
Specific Algorithms and Tools used in the Project
· There are 5 categories of algorithms listed on the following pages (categories 6-10 are software tools and platforms). The 2 classification algorithms you choose must come from two different categories; at most one algorithm may be chosen from any single category.
· In your final term project documentation, for each algorithm you choose, specify clearly the category number/name and the algorithm name in that category.
Category 1 (Support Vector Machines)
· LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) linear kernel
· LIBSVM polynomial kernel
· LIBSVM radial basis function (RBF) kernel
· LIBSVM sigmoid kernel
· Gist (http://www.chibi.ubc.ca/gist/). Pick a kernel of your choice and specify the kernel used in your project.
Category 2 (Random Forests)
· CRAN (http://cran.r-project.org/web/packages/randomForest/index.html)
· Willows (http://c2s2.yale.edu/software/Willows/)
· Weka (https://www.cs.waikato.ac.nz/ml/weka/)
Category 3 (Decision Trees)
Refer to Weka: http://www.cs.waikato.ac.nz/ml/weka
· ADTree
· J48 (C4.5)
· LMT
· M5P
· NBTree
Category 4 (Bayesian Networks)
· Weka BayesNet (https://www.cs.waikato.ac.nz/ml/weka)
· JBNC (http://jbnc.sourceforge.net/)
Category 5 (Naïve Bayes)
Refer to Weka: https://www.cs.waikato.ac.nz/ml/weka
· AODE
· ComplementNaiveBayes
· NaiveBayes
· NaiveBayesMultinomial
· NaiveBayesSimple
· NaiveBayesUpdateable
Category 6 (R Package)
· Refer to http://www.r-project.org. Pick any classification tool in R.
Category 7 (Mathematical Package)
· MATLAB
Category 8 (RapidMiner)
www.rapidminer.com
http://sourceforge.net/projects/rapidminer/
Category 9 (Weka)
https://www.cs.waikato.ac.nz/ml/weka
Category 10 (Python)
https://scikit-learn.org
Final Term Project: Option 1
Sources of Data
http://archive.ics.uci.edu/ml/
http://www.cs.ucr.edu/~eamonn/time_series_data/
http://aws.amazon.com/datasets
http://www.trustlet.org/wiki/Repositories_of_datasets
Final Term Project:
Unsupervised Data Mining (Clustering)
Part 1
Generate a set S of 500 points (vectors) in 3-dimensional Euclidean space. Use the Euclidean distance to measure the distance between any two points. Write a program to find all the outliers in your set S and print out these outliers. If there are no outliers, your program should indicate so. Use any programming language of your choice (specify the programming language you use in the project).
Next, remove the outliers from S, and call the resulting set S’.
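One possible way to approach Part 1 is sketched below in Python. The slides do not define what counts as an outlier, so this sketch assumes a distance-based rule: a point is an outlier if the Euclidean distance to its k-th nearest neighbor exceeds the mean of all such distances by more than 3 standard deviations. The function names, the choice k = 5, the threshold, and the synthetic data (three Gaussian blobs plus five planted far-away points) are all illustrative assumptions, not requirements of the project.

```python
import math
import random


def generate_points(n=500, n_planted=5, seed=0):
    """Return n points in 3-D: three Gaussian blobs plus a few planted far-away points."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0, 0.0), (10.0, 10.0, 10.0), (0.0, 10.0, 0.0)]
    points = [
        tuple(rng.gauss(c, 1.0) for c in centers[i % len(centers)])
        for i in range(n - n_planted)
    ]
    # Hypothetical planted outliers, far from every blob, so the demo has something to find.
    points += [tuple(rng.uniform(40.0, 60.0) for _ in range(3)) for _ in range(n_planted)]
    return points


def kth_neighbor_distance(index, points, k):
    """Euclidean distance from points[index] to its k-th nearest other point."""
    p = points[index]
    dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != index)
    return dists[k - 1]


def find_outliers(points, k=5, n_sigmas=3.0):
    """Assumed rule: outlier if k-NN distance > mean + n_sigmas * std of all k-NN distances."""
    scores = [kth_neighbor_distance(i, points, k) for i in range(len(points))]
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    threshold = mean + n_sigmas * std
    return [p for p, s in zip(points, scores) if s > threshold]


if __name__ == "__main__":
    S = generate_points()
    outliers = find_outliers(S)
    if outliers:
        for p in outliers:
            print("outlier:", tuple(round(x, 2) for x in p))
    else:
        print("No outliers found.")
    outlier_set = set(outliers)
    S_prime = [p for p in S if p not in outlier_set]  # S with the outliers removed
    print(len(S), "points in S,", len(S_prime), "points in S'")
```

Other outlier definitions (for example, a fixed distance radius with a minimum neighbor count) fit the same structure; only find_outliers would change.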
Final Term Project:
Part 2
(1) Write a program that implements the hierarchical agglomerative clustering algorithm taught in class to cluster the points in S’ into k clusters, where k is a user-specified parameter value.
(2) Repeat Part 1 and step (1) above on two additional, different datasets.
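A minimal sketch of step (1) in Python follows. The exact algorithm taught in class is not reproduced in these slides, so this assumes the standard bottom-up procedure: start with every point in its own cluster and repeatedly merge the two closest clusters until k remain. The cluster distance used here (nearest pair of points) is just one assumed choice, and the function name and sample data are hypothetical.

```python
import math


def agglomerative(points, k):
    """Cluster points into k groups by repeatedly merging the two closest clusters.

    Cluster distance here is the nearest-pair distance, which is an assumption.
    Returns a list of clusters, each a list of indices into points.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Pairwise Euclidean distances, computed once.
    d = [[math.dist(p, q) for q in points] for p in points]
    clusters = [[i] for i in range(len(points))]  # start: every point is its own cluster
    while len(clusters) > k:
        best = None  # (distance, index of first cluster, index of second cluster)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = min(d[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # merge cluster b into cluster a
        del clusters[b]
    return clusters


if __name__ == "__main__":
    # Tiny made-up data set: three well-separated groups on the x-axis.
    S_prime = [(x + 0.1 * i, 0.0, 0.0) for x in (0.0, 50.0, 100.0) for i in range(4)]
    k = 3  # in a real program, read this value from the user
    for n, cluster in enumerate(agglomerative(S_prime, k), start=1):
        print("cluster", n, [S_prime[i] for i in cluster])
```

This naive version rescans all cluster pairs at every merge, which is simple to explain in a report but slow in pure Python for several hundred points; caching cluster-to-cluster distances would be a natural refinement.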
Notes on the hierarchical agglomerative clustering algorithm
In determining the distance between two clusters, consider each of the following definitions in turn:
· the distance between the nearest two points in the two clusters,
· the distance between the farthest two points in the two clusters,
· the average distance between points in the two clusters,
· the distance between the centers of the two clusters.
Use the definition that yields the best performance, where performance is measured by the Silhouette coefficient.
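The four definitions and the selection step can be sketched in Python as follows. This is an illustration under assumptions, not a reference solution: the function names are hypothetical, the Silhouette coefficient is taken to be the mean over all points of (b - a) / max(a, b), with a the mean distance to the other points in the same cluster and b the smallest mean distance to any other cluster, and singleton clusters are scored 0 by convention. The compact clustering routine is included only so the example is self-contained.

```python
import math


def avg(values):
    values = list(values)
    return sum(values) / len(values)


def nearest(a, b):   # distance between the nearest two points in the two clusters
    return min(math.dist(p, q) for p in a for q in b)


def farthest(a, b):  # distance between the farthest two points in the two clusters
    return max(math.dist(p, q) for p in a for q in b)


def average(a, b):   # average distance over all cross-cluster pairs
    return avg(math.dist(p, q) for p in a for q in b)


def centers(a, b):   # distance between the two cluster centers (coordinate-wise means)
    center_a = [avg(c) for c in zip(*a)]
    center_b = [avg(c) for c in zip(*b)]
    return math.dist(center_a, center_b)


DEFINITIONS = {"nearest": nearest, "farthest": farthest, "average": average, "centers": centers}


def agglomerative(points, k, cluster_distance):
    """Merge the two closest clusters (under cluster_distance) until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))  # j > i, so index i is unaffected by the pop
    return clusters


def silhouette(clusters):
    """Mean Silhouette coefficient over all points (needs at least 2 clusters)."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for p in cluster:
            if len(cluster) == 1:
                scores.append(0.0)  # assumed convention for singleton clusters
                continue
            # a: mean distance to the other points in the same cluster (dist(p, p) = 0)
            a = sum(math.dist(p, q) for q in cluster) / (len(cluster) - 1)
            # b: smallest mean distance to the points of any other cluster
            b = min(avg(math.dist(p, q) for q in other)
                    for cj, other in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return avg(scores)


def best_definition(points, k):
    """Cluster once per definition and return the name with the highest Silhouette score."""
    scores = {name: silhouette(agglomerative(points, k, fn)) for name, fn in DEFINITIONS.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    pts = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (10, 10, 10), (11, 10, 10), (10, 11, 10)]
    name, scores = best_definition(pts, 2)
    print("best definition:", name, scores)
```

On well-separated data all four definitions tend to produce the same clusters and therefore the same score; differences show up on elongated or overlapping clusters, which is where the comparison becomes informative.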
Final Term Project:
Submission (One File)
A Word or PDF file (final term project report) containing:
· Source code of your clustering algorithm.
· The website where the complete datasets can be downloaded.
· All related documentation, including the manual you developed and screenshots showing how your programs run and their input/output. This report should be written in a tutorial style, explaining through screenshots and examples how to run your tool on the datasets you choose.