
CISC 5790: Data Mining Prof. Yijun Zhao Fordham University, Spring 2023 Course Project Due: May 8 1 Introduction This project requires you to explore classification algorithms on a real world...

CISC 5790: Data Mining Prof. Yijun Zhao
Fordham University, Spring 2023
Course Project
Due: May 8
1 Introduction
This project requires you to explore classification algorithms on a real-world dataset and write a
report explaining your experimental results. The language of implementation is up to you; the
only requirement is that your program be able to interpret the data format specified below, and
be able to classify instances and produce interesting statistics such as accuracy, false positive rate,
false negative rate, etc. You are free to construct whatever user interface you like for your program, but
you must fully document your interface.
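The statistics named above can be derived from the four confusion-matrix counts. The following is a minimal sketch, not a required implementation: the function names are hypothetical, and treating ">50K" as the positive class is an assumption based on the prediction task described in the Data section.

```python
def confusion_counts(y_true, y_pred, positive=">50K"):
    """Count true/false positives and negatives.

    `positive` is an assumed label string; adjust it to match the data.
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def summary_statistics(y_true, y_pred, positive=">50K"):
    """Return accuracy, false positive rate, and false negative rate."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # FPR: share of actual negatives that were predicted positive.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        # FNR: share of actual positives that were predicted negative.
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }
```

Reporting the false positive and false negative rates alongside accuracy helps when one class is much more common than the other, since accuracy alone can look good for a classifier that rarely predicts the minority class.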
2 Algorithm
• Your algorithm should be based on the classification algorithms learned during the course.
Usually a straightforward implementation of one method will not lead to satisfactory performance.
Your algorithm can be a combination of methods and should incorporate one or more
data mining techniques when the situation arises. These techniques include (but are certainly
not limited to):
– Handling imbalanced datasets
– Proper imputation methods for missing values
– Different treatment of various types of features: continuous, discrete, categorical, etc.
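Two of the techniques listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the missing-value marker "?" is assumed (check census-income.names for the actual convention), median/mode imputation and inverse-frequency class weights are just one simple choice among many, and the function names are hypothetical.

```python
from collections import Counter
from statistics import median

MISSING = "?"  # assumed missing-value marker; verify against the .names file


def impute_column(values, is_numeric):
    """Fill missing entries with the median (numeric) or mode (categorical).

    Assumes the column has at least one observed value.
    """
    observed = [v for v in values if v != MISSING]
    if is_numeric:
        fill = median(float(v) for v in observed)
        return [float(v) if v != MISSING else fill for v in values]
    fill = Counter(observed).most_common(1)[0][0]
    return [v if v != MISSING else fill for v in values]


def class_weights(labels):
    """Inverse-frequency weights: rarer classes get larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}
```

The weights could be passed to a classifier that supports per-class or per-instance weighting; resampling the training set is an alternative way to handle imbalance.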
3 Data
You’ll be examining the behavior of your model on a dataset from the UCI machine learning lab.
The dataset is represented in a standard format, consisting of 3 files. The first file, census-income.names,
describes the categories and features of the dataset. It also has some empirical results for your
reference. The other two files are census-income.data and census-income.test, containing the
actual data instances, formatted at one instance per line, as follows:
$$
\begin{array}{l}
F_1^1,\ F_1^2,\ \ldots,\ F_1^k,\ \mathrm{label}_1 \\
F_2^1,\ F_2^2,\ \ldots,\ F_2^k,\ \mathrm{label}_2 \\
\quad\vdots \\
F_n^1,\ F_n^2,\ \ldots,\ F_n^k,\ \mathrm{label}_n
\end{array}
$$
where $F_i^j$ and $\mathrm{label}_i$ ($i = 1, \ldots, n$, $j = 1, \ldots, k$) represent the value of the $j$th feature and the class category
for the $i$th instance, respectively.
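A reader for this format might look like the sketch below. It assumes fields are separated by commas with optional surrounding whitespace and that the label is the last field. Stripping a trailing period from the label is a further assumption (some distributions of the test file end labels with "."), and the function names are hypothetical.

```python
def parse_instance(line):
    """Split one comma-separated line into (features, label)."""
    fields = [field.strip() for field in line.strip().split(",")]
    *features, label = fields
    # Assumption: test-file labels may carry a trailing period, e.g. ">50K."
    return features, label.rstrip(".")


def load_instances(lines):
    """Parse every non-blank line from an iterable of lines."""
    return [parse_instance(line) for line in lines if line.strip()]
```

For example, `load_instances(open("census-income.data"))` would return a list of `(features, label)` pairs, with all feature values still stored as strings.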
The data you will be examining was extracted from the census bureau database. Each instance
contains an individual’s educational, demographic and family information. The prediction task is to
determine whether a person makes over 50K a year. You should use census-income.data to
train your classifier and use census-income.test to evaluate the performance of your learning
algorithm.
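Before evaluating a real classifier, it can help to establish a reference point using the same train/test protocol. The sketch below is an assumption-laden illustration, not part of the assignment: it "trains" a majority-class baseline on the training labels only and then scores it on both sets. All function names are hypothetical.

```python
from collections import Counter


def fit_majority(train_labels):
    """Learn the most frequent class from the training labels only."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(labels, predicted_label):
    """Accuracy of a classifier that always predicts `predicted_label`."""
    return sum(1 for y in labels if y == predicted_label) / len(labels)


def evaluate_baseline(train_labels, test_labels):
    """Fit on the training set, then report train and test accuracy."""
    majority = fit_majority(train_labels)
    return {
        "predicted_label": majority,
        "train_accuracy": accuracy(train_labels, majority),
        "test_accuracy": accuracy(test_labels, majority),
    }
```

Any learned model should beat this baseline on census-income.test; if it does not, the class imbalance is probably dominating its predictions.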
4 Your Mission...
Deliverables for this project are:
• Code to implement the classification algorithm for the data file formats given above
• A README file, with simple, clear instructions on how to compile and run your
code
• Testing statistics for the application of your learning algorithm. At a minimum you should
provide training set accuracy and test set accuracy
• A discussion of data mining techniques employed in your algorithm
• A report analyzing the behavior of your algorithm on the dataset, including any unusual or
anomalous (in your opinion) behavior
5 How to turn in your code
• Create a README file, with simple, clear instructions on how to compile and run
your project. If the TA cannot run your program by following the instructions,
you will receive 50% of the programming score.
• Zip all your files (code, README, written report, etc.) in a zip file named
{firstname} {lastname} CS5790 project.zip and upload it to Blackboard
• Only one person in your group needs to turn in the code and the report. Make
sure every team member’s name is listed on the cover of the report
Answered 14 days After Mar 27, 2023

Solution

Mukesh answered on Apr 01 2023
Income Classification
About Dataset
An individual’s annual income results from various factors. Intuitively, it is influenced by the individual’s education level, age, gender, occupation, etc.
This is a widely cited KNN dataset. I encountered it during my course, and I wish to share it here because it is a good starter example for data pre-processing and machine learning practices.
Fields
The dataset contains 16 columns
Target field: Income
-- The income is divided into two classes: <=50K and >50K
Number of attributes: 14
-- These are the demographics and other features to describe a person
We can explore the possibility of predicting income level based on the individual’s personal information.
Independent features
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov,
Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales,...
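Because the features listed above mix continuous and categorical types, distance-based methods such as KNN usually need them converted to comparable numeric values first. The following is a minimal sketch of one common approach (one-hot encoding plus min-max scaling), not the method used in this solution. The `WORKCLASS` list copies the workclass values given above; the function names are hypothetical.

```python
# Category values for "workclass", copied from the feature list above.
WORKCLASS = [
    "Private", "Self-emp-not-inc", "Self-emp-inc", "Federal-gov",
    "Local-gov", "State-gov", "Without-pay", "Never-worked",
]


def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector.

    Unknown or missing values map to the all-zero vector.
    """
    return [1 if value == category else 0 for category in categories]


def min_max_scale(values):
    """Rescale a continuous column (e.g. age) to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Scaling matters for KNN in particular: without it, a wide-ranging column such as fnlwgt would dominate the distance computation over narrow columns such as age.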