This assignment requires you to perform a machine learning analysis of a company bankruptcy dataset that is freely available on Kaggle.
Use the dataset to carry out the analysis and submit a research report. The report should contain the following:
· Abstract
· Which classification algorithms will we use?
· Logistic Regression
· Naive Bayes Classifier
· K-Nearest Neighbors
· Decision Trees
· Which performance measures will we use?
· Confusion Matrix
· Precision
· Recall / Sensitivity
· Specificity
· F1-Score
· AUC & ROC Curve
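
The measures listed above can all be derived from the four confusion-matrix counts, plus a ranking of predicted scores for AUC. The following is a minimal sketch using only the Python standard library. The function names and the toy labels are illustrative assumptions, not part of the assignment or the dataset, and "bankrupt" is assumed to be encoded as the positive class 1.

```python
from dataclasses import dataclass


@dataclass
class ConfusionMatrix:
    tp: int  # bankrupt, predicted bankrupt
    fp: int  # not bankrupt, predicted bankrupt
    tn: int  # not bankrupt, predicted not bankrupt
    fn: int  # bankrupt, predicted not bankrupt


def confusion_matrix(y_true, y_pred, positive=1):
    """Count the four outcomes for a binary problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return ConfusionMatrix(tp, fp, tn, fn)


def _safe_div(num, den):
    # Return 0.0 instead of raising when a denominator is empty.
    return num / den if den else 0.0


def precision(cm):
    return _safe_div(cm.tp, cm.tp + cm.fp)


def recall(cm):  # also called sensitivity
    return _safe_div(cm.tp, cm.tp + cm.fn)


def specificity(cm):
    return _safe_div(cm.tn, cm.tn + cm.fp)


def f1_score(cm):
    p, r = precision(cm), recall(cm)
    return _safe_div(2 * p * r, p + r)


def auc(y_true, scores, positive=1):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs in which the positive example gets the
    higher score; ties count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (made-up labels, not taken from the dataset).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
cm = confusion_matrix(y_true, y_pred)
```

The pairwise AUC is quadratic in the number of examples, which is acceptable for a dataset of this size. A library implementation would normally be preferred in the actual report.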
· Motivation
· Many businesses have defaulted during the recent pandemic, which may make investors wary of providing further capital. We seek to build a model that predicts whether a company will go bankrupt, so that investors can use it for confidence and guidance. To address this problem, we will use a company bankruptcy dataset found on Kaggle together with classification algorithms such as Logistic Regression and Support Vector Machine.
· Literature Survey
· Published literature is available for whichever algorithms we choose, so the survey will cover prior work on each of them
· Methodology
· Data Research - the Taiwanese economy from 1999 to 2009, a period that includes the Great Recession
· Data Exploration and Visualization - look for correlation between features
· Data Preprocessing - The data is highly imbalanced: of the 6,819 entries, 6,599 are not bankrupt and 220 are bankrupt. We will need to address this obstacle through random sampling (undersampling/oversampling).
· Feature Engineering - remove or add features to determine which yield the best results, and experiment with attribute combinations. We have 95 features.
· Data Cleaning - check for nulls and missing entries, handle text and categorical attributes, and apply feature scaling
· Split the data into training and testing sets
· Train and evaluate different models
· Seek better evaluation results through fine-tuning
· Apply performance measures to find the best algorithm or best feature combinations
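
With 220 bankrupt entries out of 6,819 (about 3.2%), a model that always predicts "not bankrupt" would already reach about 96.8% accuracy, which is why the class imbalance has to be handled before training. The sketch below shows one possible ordering of the split and resampling steps: a stratified train/test split followed by random undersampling of the training portion only, so the test set keeps the original class ratio. It uses only the Python standard library. The function names, the 20% test fraction, and the seed are illustrative assumptions, and the label list merely reproduces the class counts quoted above rather than loading the actual Kaggle file.

```python
import random
from collections import Counter


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), preserving the class ratio in both."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


def undersample(indices, labels, seed=0):
    """Randomly drop rows until every class is as small as the rarest one."""
    rng = random.Random(seed)
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    kept = []
    for idx in by_class.values():
        kept.extend(rng.sample(idx, n_min))
    return sorted(kept)


# Stand-in labels with the class counts from the outline:
# 0 = not bankrupt (6,599 rows), 1 = bankrupt (220 rows).
labels = [0] * 6599 + [1] * 220

train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=42)
balanced_train_idx = undersample(train_idx, labels, seed=42)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
balanced_counts = Counter(labels[i] for i in balanced_train_idx)
```

Under these assumptions the test set holds 1,320 non-bankrupt and 44 bankrupt rows, and the balanced training set holds 176 rows of each class. Undersampling discards most of the majority class, so oversampling the minority class in the training set is the alternative to compare against.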