For the following assignments, please provide as much evidence of the results as possible, including...

Question

For the following assignments, please provide as much evidence of the results as possible, including the code, screenshots (only plots – not text or code) and documentation. Submit only one pdf file and .ipynb / .py files containing the code with documentation.
1.a. [10 points]
Please write a report summary (one page) as Machine Learning experts working in the industry and about the machine learning topics to understand how they are used in the industry. Submit your report in writing. List 5 key learnings / takeaways. Written submissions must be entirely your own.
1.b. [10 points]
Assume there’s a data set that has just three columns (two features and one label) and four rows (items). The four vectors corresponding to the items are at the corners of a square. The two vectors at the ends of one diagonal of this square belong to one class and the other two vectors on the other diagonal belong to the second class. Is this data separable by a straight line? Which algorithm that you studied in the class would you choose if you were to come up with a classifier for this toy data set and why?
2.(a) [15 points]
Follow the tutorial on Naïve Bayes classifier at https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/
Write your own code for Naïve Bayes Classification of the UCLA admissions dataset
Download from https://stats.idre.ucla.edu/stat/data/binary.csv
Comment on the performance of Naïve Bayes
2.(b) [10 points]
Describe five real-world applications in which regression can be used. For each of these applications, describe the y-value and the corresponding feature vector X. Also discuss whether linear regression can be used in each case.
3.(a) [10 points]
Refer to online tutorials on K-NN implementation such as https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/
Extend the implementation to use various distance metrics such as Manhattan distance and note if the classification changes with the distance metric (for a more exhaustive list of distances, see getDistMethods() in R). Choose one of the cleaned datasets at https://www.kaggle.com/annavictoria/ml-friendly-public-datasets
3.(b) [5 points]
In K-NN, we ignored the direction component of the vectorized representation of data items and only considered the distance. Does it make sense to also consider the direction of the nearest neighbor in addition to or instead of the distance from it? Why or why not?
3.(c) [10 points]
Listing which problem domains are best suited for each, briefly explain in your own words, the pros and cons of
· Logistic Regression
· K-NN
· SVM
· Naïve Bayes
· Decision Trees
4. [25 Points]
Manually generate the decision tree (as much as possible) for the following subset from a large dataset using the ID3 algorithm. Show the information gain computation at each stage. Then generate the decision tree programmatically using Python. Submit code and the decision tree so generated.

instructions-o1ri5xxc.docx

Answered 3 days After Apr 09, 2021

Solution

Sandeep Kumar answered on Apr 13 2021

Sandeep Kumar · Accepted Answer

id3.ipynb
{
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.4-final"
  },
  "orig_nbformat": 2,
  "kernelspec": {
   "name": "python394jvsc74a57bd081118431cc388d258ed977b65143603a98f8ad6ed776c173758a3af876bc6de9",
   "display_name": "Python 3.9.4 64-bit"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2,
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from matplotlib import pyplot as plt\n",
    "from sklearn import datasets\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn import tree\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "  Windy?\\tAir Quality Good?\\tHot?\\tPlay Tennis?\n",
       "0                                No\\tNo\\tNo\\tNo\n",
       "1                             Yes\\tNo\\tYes\\tYes\n",
       "2                             Yes\\tYes\\tNo\\tYes\n",
       "3                             Yes\\tYes\\tYes\\tNo"
      ]
     },
     "metadata": {},
     "execution_count": 2
    }
   ],
   "source": [
    "data = pd.read_csv('id3.csv')\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ]
}
id3.csv
Windy?,Air Quality good?,Hot?,Play Tennis?
0,0,0,0
1,0,1,1
1,1,0,1
1,1,1,0
ML_4.docx
1.a.
As machine learning experts, our key takeaways are:
1. Data cleaning comes first. Faulty values such as NaN or random entries cause errors, and columns that are not needed should be removed. The data should also be converted to datatypes that the model can process, for example by changing string values to floats. Missing values can be replaced (for example with 0), or the columns that contain them can be dropped, since they cause discrepancies while training the model. When the dataset is too small, data augmentation can be used to create more samples. An appropriate share of the dataset, typically 0.10 to 0.25, should be held out for testing.
2. Use the algorithm the problem requires. We often reach for machine learning algorithms that do not match the task. For example, a simple clustering problem can be handled with K-Means and a simple classification problem with K-NN, fitting a model that suits the data. Applying regression models such as ridge or elastic net to such problems gives unexpected results. Choose the algorithm based on the requirement and the dataset.
3. Overfitting and underfitting. Overtraining leads to overfitting, where the model takes the variations in the training data into account more than necessary: instead of learning the general pattern, it learns quirks of individual samples. Underfitting is the opposite problem, where the model fails to recognize the general trend and predicts wrongly. Choosing hyperparameters properly helps with both.
4. Use of deep learning. Deep learning models are very targeted in their requirements. If a problem can be solved with simple machine learning algorithms, deep learning should be avoided. Where deep learning is necessary, prefer models with the minimum or optimum number of layers, because more layers than necessary only waste training and testing time. Hyperparameter selection is a must. There will also be cases where reinforcement learning could be used, but it should be avoided when classical deep learning algorithms are sufficient. Pretrained models speed up training and testing.
5. Reading and implementing research papers. Given the speed at which machine learning research is moving, it is important to study newer research papers and implement them. Machine learning is a fast-moving field.
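The cleaning and hold-out steps from point 1 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code used for the assignment: the helper names `drop_incomplete` and `split_rows`, the 0.2 test fraction and the sample rows are all made up for the example.

```python
import random


def drop_incomplete(rows):
    """Keep only rows with no missing values (None or NaN)."""
    # v == v is False only for NaN, so it doubles as a NaN check.
    return [row for row in rows if all(v is not None and v == v for v in row)]


def split_rows(rows, test_fraction=0.2, seed=1):
    """Shuffle rows reproducibly and hold out a fraction for testing."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical rows: two of them contain a missing value.
raw_rows = [
    (1.0, 2.0), (float("nan"), 3.0), (4.0, None), (5.0, 6.0),
    (7.0, 8.0), (9.0, 1.0), (2.0, 2.0),
]
clean_rows = drop_incomplete(raw_rows)
train_rows, test_rows = split_rows(clean_rows, test_fraction=0.2)
print(len(clean_rows), len(train_rows), len(test_rows))
```

Dropping rows is the simplest policy. Replacing missing values, as the list also suggests, keeps more data but may bias the model.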
1.b.
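No, this data is not separable by a straight line. The four points form the XOR pattern: any line that puts both points of one diagonal on the same side also puts at least one point of the other diagonal on that side. Since there are two features and two values of the label, it is a classification problem, and K-NN (with k = 1) is a suitable choice because it classifies by the nearest labelled point rather than by a linear boundary. K-Means is not suitable here, since it is an unsupervised clustering method that ignores the labels.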
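The toy data set in 1.b can be checked directly. The sketch below assumes the square's corners sit at (0,0), (1,1), (0,1) and (1,0). It searches a coarse grid of straight lines for the best training accuracy, then classifies with a 1-nearest-neighbour rule. The grid search is illustrative rather than a proof, and all function names are hypothetical.

```python
from itertools import product

# Corners of the unit square; the two points on each diagonal share a label.
points = [(0, 0), (1, 1), (0, 1), (1, 0)]
labels = [0, 0, 1, 1]


def linear_predict(w1, w2, b, p):
    """Classify p by which side of the line w1*x + w2*y + b = 0 it falls on."""
    return 1 if w1 * p[0] + w2 * p[1] + b > 0 else 0


def best_linear_accuracy(steps=21):
    """Best training accuracy over a coarse grid of candidate lines."""
    grid = [-2 + 4 * i / (steps - 1) for i in range(steps)]
    best = 0.0
    for w1, w2, b in product(grid, repeat=3):
        hits = sum(linear_predict(w1, w2, b, p) == y
                   for p, y in zip(points, labels))
        best = max(best, hits / len(points))
    return best


def one_nn_predict(query):
    """Return the label of the training point closest to query."""
    dists = [(query[0] - p[0]) ** 2 + (query[1] - p[1]) ** 2 for p in points]
    return labels[dists.index(min(dists))]


print(best_linear_accuracy())                     # no line gets all four right
print([one_nn_predict(p) for p in points])        # 1-NN recovers every label
```

No line in the grid classifies more than three of the four points correctly, while the nearest-neighbour rule reproduces all four labels.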
2.a. (NB.ipynb provided)
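The notebook relies on scikit-learn's GaussianNB. For the "write your own code" part of the task, a from-scratch version could look roughly like the sketch below. It assumes a Gaussian likelihood for every feature (so rank is treated as a continuous value), uses only the first eight rows of binary.csv to stay short, and the function names `fit_gaussian_nb` and `predict_gaussian_nb` are hypothetical.

```python
import math
from collections import defaultdict

# First eight rows of binary.csv: (gre, gpa, rank) with the admit label.
features = [
    (380, 3.61, 3), (660, 3.67, 3), (800, 4.0, 1), (640, 3.19, 4),
    (520, 2.93, 4), (760, 3.0, 2), (560, 2.98, 1), (400, 3.08, 2),
]
admit = [0, 1, 1, 1, 0, 1, 1, 0]


def fit_gaussian_nb(rows, labels):
    """Store, per class, the prior and a (mean, variance) pair per feature."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, var + 1e-9))  # floor avoids division by zero
        model[label] = (len(members) / len(rows), stats)
    return model


def predict_gaussian_nb(model, row):
    """Pick the class with the highest log prior plus log likelihood."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for value, (mean, var) in zip(row, stats):
            score += -0.5 * math.log(2 * math.pi * var)
            score -= (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


nb_model = fit_gaussian_nb(features, admit)
print(predict_gaussian_nb(nb_model, (800, 4.0, 1)))
print(predict_gaussian_nb(nb_model, (380, 3.0, 3)))
```

Working in log space avoids underflow when many small probabilities are multiplied. To evaluate on the full file, the same functions could be applied to a train/test split of all rows.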
2.b.
The real-world applications of regression are:
1. Housing: based on the size and locality of a house (X), predict the price of the house (Y).
2. Cancer prognosis: based on the size of the tumor (X), predict the future state of the tumor and the stage of the cancer (Y).
3. Gambling spend: based on the customer's salary and lifestyle (X), predict the amount of money they will spend at a gambling den (Y).
4. Exam marks: based on past marks in maths, science and social studies (X), predict the marks in future tests (Y).
5. Running distance: based on past athletic records, stamina and physique (X), predict the distance a runner can travel (Y).
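To make the X and Y roles concrete, here is a minimal least-squares fit for the housing case with a single feature. The sizes and prices are invented for illustration (they lie exactly on a line), and `fit_line` is a hypothetical helper, not part of the submitted notebooks.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: X = house size, Y = price (arbitrary units).
sizes = [50.0, 80.0, 100.0, 120.0]
prices = [110.0, 170.0, 210.0, 250.0]

slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 90.0 + intercept  # prediction for an unseen size
print(slope, intercept, predicted_price)
```

With more than one feature, such as size and locality together, the same idea becomes a multi-column X solved with a matrix least-squares routine.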
3.a. (see knn.ipynb)
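Whether the classification changes with the distance metric can be seen on a tiny example. The sketch below is not the contents of knn.ipynb: the function names and the two training points are assumptions chosen so that Euclidean and Manhattan distance disagree.

```python
from collections import Counter


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train, train_labels, query, k, distance):
    """Majority vote among the k training points closest to query."""
    ranked = sorted(zip(train, train_labels),
                    key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training set with one point per class.
train = [(3, 3), (0, 5)]
train_labels = ["a", "b"]
query = (0, 0)

# (3, 3) is nearer in Euclidean terms (about 4.24 vs 5),
# but (0, 5) is nearer in Manhattan terms (5 vs 6).
print(knn_predict(train, train_labels, query, 1, euclidean))
print(knn_predict(train, train_labels, query, 1, manhattan))
```

On real data the metrics often agree on most points. Disagreements tend to appear for queries near the class boundary or when features are on different scales.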
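3.b. The direction of the nearest neighbor does not need to be considered, because K-NN only takes the k nearest points, i.e. the points inside the smallest circle around the query that contains k neighbors. Which side of the query they lie on does not affect the vote.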
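For comparison, a purely direction-based measure can be tried next to the distance-based one. The sketch below uses cosine distance, which is swapped in here as one possible way to "consider direction" and is not something the answer itself uses. The points and helper names are hypothetical.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 minus the cosine of the angle between a and b; ignores vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def nearest(candidates, query, distance):
    """Return the candidate with the smallest distance to query."""
    return min(candidates, key=lambda p: distance(p, query))


query_vec = (1, 1)
candidates = [(10, 10), (1, 0)]  # same direction but far away vs. close by

print(nearest(candidates, query_vec, euclidean_distance))
print(nearest(candidates, query_vec, cosine_distance))
```

The two measures pick different neighbours here: Euclidean distance favours the close point, cosine distance favours the point in the same direction. Direction-based measures are commonly used when vector length carries little meaning, as in text represented by word counts.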
3.c. 
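1. Logistic regression is best suited for binary classification problems with multiple features, like the admissions problem. It is fast to train and easy to interpret, but it only learns a linear decision boundary.
2. K-NN is mainly used for classification problems. It is a supervised method, not a clustering one. It needs no training and is simple to implement, but prediction is slow on large datasets and it is sensitive to feature scaling.
3. SVM can be used for both classification and regression problems, but it is mainly used for classification. With kernels it can handle non-linear boundaries, but training is slow on large datasets.
4. Naïve Bayes is used for classification problems. It is fast and works with little data, but it assumes the features are independent of each other.
5. Decision trees are used to solve both classification and regression problems. The tree is built incrementally by splitting the dataset into smaller datasets, and the results are represented in the leaf nodes. Trees are easy to interpret but tend to overfit.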
4.a. Decision tree:
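The information gain computation that ID3 performs at the root can be reproduced from the four rows of id3.csv. This is a minimal sketch of the standard entropy and gain formulas, with hypothetical function names. It is not the code from id3.ipynb, which stops after loading the data.

```python
import math
from collections import Counter

# Rows copied from id3.csv: Windy?, Air Quality good?, Hot?, Play Tennis?
COLUMNS = ["Windy?", "Air Quality good?", "Hot?", "Play Tennis?"]
ROWS = [(0, 0, 0, 0), (1, 0, 1, 1), (1, 1, 0, 1), (1, 1, 1, 0)]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, feature_index, target_index=-1):
    """Entropy of the target minus its weighted entropy after the split."""
    labels = [row[target_index] for row in rows]
    remainder = 0.0
    for value in set(row[feature_index] for row in rows):
        subset = [row[target_index] for row in rows
                  if row[feature_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


gains = {COLUMNS[i]: information_gain(ROWS, i) for i in range(3)}
root = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: {gain:.4f}")
print("root split:", root)
```

On these four rows the target entropy is 1 bit. Only Windy? has a non-zero gain (about 0.311), so ID3 splits on it first. Applying the same function to the Windy? = 1 subset gives the next level of the tree.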
	
NB.ipynb
{
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.4-final"
  },
  "orig_nbformat": 2,
  "kernelspec": {
   "name": "python394jvsc74a57bd081118431cc388d258ed977b65143603a98f8ad6ed776c173758a3af876bc6de9",
   "display_name": "Python 3.9.4 64-bit"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2,
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.naive_bayes import GaussianNB\n",
    "from sklearn import metrics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.read_csv('binary.csv')\n",
    "y = data[\"admit\"]\n",
    "X = data.drop([\"admit\"], axis=1)\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "gnb = GaussianNB()\n",
    "gnb.fit(X_train, y_train)\n",
    "\n",
    "y_pred = gnb.predict(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "71.875"
      ]
     },
     "metadata": {},
     "execution_count": 13
    }
   ],
   "source": [
    "metrics.accuracy_score(y_test, y_pred)*100"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ]
}
binary.csv
admit,gre,gpa,rank
0,380,3.61,3
1,660,3.67,3
1,800,4,1
1,640,3.19,4
0,520,2.93,4
1,760,3,2
1,560,2.98,1
0,400,3.08,2
1,540,3.39,3
0,700,3.92,2
0,800,4,4
0,440,3.22,1
1,760,4,1
0,700,3.08,2
1,700,4,1
0,480,3.44,3
0,780,3.87,4
0,360,2.56,3
0,800,3.75,2
1,540,3.81,1
0,500,3.17,3
1,660,3.63,2
0,600,2.82,4
0,680,3.19,4
1,760,3.35,2
1,800,3.66,1
1,620,3.61,1
1,520,3.74,4
1,780,3.22,2
0,520,3.29,1
0,540,3.78,4
0,760,3.35,3
0,600,3.4,3
1,800,4,3
0,360,3.14,1
0,400,3.05,2
0,580,3.25,1
0,520,2.9,3
1,500,3.13,2
1,520,2.68,3
0,560,2.42,2
1,580,3.32,2
1,600,3.15,2
0,500,3.31,3
0,700,2.94,2
1,460,3.45,3
1,580,3.46,2
0,500,2.97,4
0,440,2.48,4
0,400,3.35,3
0,640,3.86,3
0,440,3.13,4
0,740,3.37,4
1,680,3.27,2
0,660,3.34,3
1,740,4,3
0,560,3.19,3
0,380,2.94,3
0,400,3.65,2
0,600,2.82,4
1,620,3.18,2
0,560,3.32,4
0,640,3.67,3
1,680,3.85,3
0,580,4,3
0,600,3.59,2
0,740,3.62,4
0,620,3.3,1
0,580,3.69,1
0,800,3.73,1
0,640,4,3
0,300,2.92,4
0,480,3.39,4
0,580,4,2
0,720,3.45,4
0,720,4,3
0,560,3.36,3
1,800,4,3
0,540,3.12,1
1,620,4,1
0,700,2.9,4
0,620,3.07,2
0,500,2.71,2
0,380,2.91,4
1,500,3.6,3
0,520,2.98,2
0,600,3.32,2
0,600,3.48,2
0,700,3.28,1
1,660,4,2
0,700,3.83,2
1,720,3.64,1
0,800,3.9,2
0,580,2.93,2
1,660,3.44,2
0,660,3.33,2
0,640,3.52,4
0,480,3.57,2
0,700,2.88,2
0,400,3.31,3
0,340,3.15,3
0,580,3.57,3
0,380,3.33,4
0,540,3.94,3
1,660,3.95,2
1,740,2.97,2
1,700,3.56,1
0,480,3.13,2
0,400,2.93,3
0,480,3.45,2
0,680,3.08,4
0,420,3.41,4
0,360,3,3
0,600,3.22,1
0,720,3.84,3
0,620,3.99,3
1,440,3.45,2
0,700,3.72,2
1,800,3.7,1
0,340,2.92,3
1,520,3.74,2
1,480,2.67,2
0,520,2.85,3
0,500,2.98,3
0,720,3.88,3
0,540,3.38,4
1,600,3.54,1
0,740,3.74,4
0,540,3.19,2
0,460,3.15,4
1,620,3.17,2
0,640,2.79,2
0,580,3.4,2
0,500,3.08,3
0,560,2.95,2
0,500,3.57,3
0,560,3.33,4
0,700,4,3
0,620,3.4,2
1,600,3.58,1
0,640,3.93,2
1,700,3.52,4
0,620,3.94,4
0,580,3.4,3
0,580,3.4,4
0,380,3.43,3
0,480,3.4,2
0,560,2.71,3
1,480,2.91,1
0,740,3.31,1
1,800,3.74,1
0,400,3.38,2
1,640,3.94,2
0,580,3.46,3
0,620,3.69,3
1,580,2.86,4
0,560,2.52,2
1,480,3.58,1
0,660,3.49,2
0,700,3.82,3
0,600,3.13,2
0,640,3.5,2
1,700,3.56,2
0,520,2.73,2
0,580,3.3,2
0,700,4,1
0,440,3.24,4
0,720,3.77,3
0,500,4,3
0,600,3.62,3
0,400,3.51,3
0,540,2.81,3
0,680,3.48,3
1,800,3.43,2
0,500,3.53,4
1,620,3.37,2
0,520,2.62,2
1,620,