Decision Trees
Due: Friday, Feb 28th at 6pm
The purpose of this assignment is to give you experience with data cleaning and decision trees.
You can work in pairs on this assignment.
Once you have followed these instructions, your repository will contain a directory named pa4. That
directory will include:
transform.py: skeleton code for Task 1.
test_transform.py: test code for Task 1.
decision_tree.py: skeleton code for Task 2.
test_decision_tree.py: test code for Task 2.
data: a directory with sample data sets.
To pick up the data for this assignment, change to the data directory and run the following on the Linux
command-line:
$ ./get_files.sh
Pima Indian Data
The Pima Indian Data Set, which is from the UC Irvine Machine Learning Repository
(http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes), contains anonymized information on
women from the Pima Indian Tribe (https://en.wikipedia.org/wiki/Pima_people). This information was
collected by the National Institute of Diabetes and Digestive and Kidney Diseases to study diabetes, which
is prevalent among members of this tribe.
The data set has 768 entries, each of which contains the following attributes:
1. Number of pregnancies
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Has diabetes (1 means yes, 0 means no)
Task 1: Data Cleaning
Your first task is to clean and then transform the raw Pima data into a training set and a testing set.
We have seeded your pa4 directory with a file named transform.py. Your task is to complete the function
clean, which takes four arguments:
1. the name of a raw Pima Indians Diabetes data file,
2. a filename for the training data
3. a filename for the testing data
4. a seed for use with train_test_split.
Getting started
Your function should clean and transform the raw data as described below, split the resulting data into
training and testing sets, and save the split data in CSV files.
Each row in the raw file contains an observation. The raw attribute values are floating point numbers. For
every attribute but the first and the last, a zero should be interpreted as missing data. The "Triceps skin
fold thickness (mm)" and "2-Hour serum insulin (mu U/ml)" columns have a lot of missing data,
so you should eliminate them when you process the data. Also, you should remove any observation with a
value of zero for plasma glucose concentration, diastolic blood pressure, or body mass index.
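
For example, the column-dropping and row-filtering steps described above might look roughly like the
following sketch. The column names used here are placeholders; substitute the names that actually appear
in the raw file's header.

```python
import pandas as pd

def drop_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the cleaning rules above; column names are placeholders."""
    # Drop the two columns with a lot of missing data.
    df = df.drop(columns=["Triceps skin fold thickness (mm)",
                          "2-Hour serum insulin (mu U/ml)"])
    # Remove observations with a zero (that is, missing) value for plasma
    # glucose concentration, diastolic blood pressure, or body mass index.
    required = ["Plasma glucose concentration",
                "Diastolic blood pressure (mm Hg)",
                "Body mass index"]
    return df[(df[required] != 0).all(axis=1)]
```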
Once the data is cleaned, you will need to convert the numeric data into categorical data. We have included
a dictionary named BOUNDS in transform.py that specifies, for each attribute to be categorized, a list of
category boundaries and a list of category labels. For example, the categories for plasma glucose level are
represented with the following pair of lists: [0.1, 95, 141, float("inf")] and ["low", "medium", "high"].
Together, these lists specify that a plasma glucose level between 0.1 (inclusive) and 95 (exclusive) should be
labeled as "low", a level between 95 (inclusive) and 141 (exclusive) should be labeled as "medium", and a
level of 141 or higher should be labeled as "high".
Note: the Python expression float('inf') evaluates to positive infinity. For all finite floating point values x,
x < float('inf').
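
One way to apply a (boundaries, labels) pair is pandas' pd.cut with right=False, so that each bin includes
its left boundary and excludes its right one. A small sketch, assuming BOUNDS maps a column name to such
a pair (the exact structure and contents are in transform.py):

```python
import pandas as pd

# Mirrors the plasma glucose example from the text; the real BOUNDS
# dictionary is provided in transform.py.
BOUNDS = {"Plasma glucose concentration": ([0.1, 95, 141, float("inf")],
                                            ["low", "medium", "high"])}

def categorize(df):
    """Replace each numeric column listed in BOUNDS with category labels."""
    for col, (boundaries, labels) in BOUNDS.items():
        # right=False: bins are [left, right), e.g. 95 <= x < 141 -> "medium".
        df[col] = pd.cut(df[col], bins=boundaries, labels=labels, right=False)
    return df
```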
Finally, once the data is cleaned and transformed, you should randomly split the observations into a
training set and a testing set using the sklearn.model_selection function train_test_split, passing the
specified seed as the random_state parameter. The training set should contain roughly 90% of the
transformed data, with the remainder going into the testing set.
The raw data includes a header row, which should be suitably modified and included in both output files.
Do not include the row index in the output files.
Pandas is ideally suited for this task.
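
A sketch of the final split-and-save step, assuming df already holds the cleaned, categorized data (the
helper name is ours; only train_test_split and to_csv come from the libraries):

```python
from sklearn.model_selection import train_test_split

def split_and_save(df, training_filename, testing_filename, seed):
    """Split roughly 90/10 and write both pieces to CSV (header, no index)."""
    train, test = train_test_split(df, test_size=0.1, random_state=seed)
    train.to_csv(training_filename, index=False)
    test.to_csv(testing_filename, index=False)
```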
Testing Task 1

We have provided test code for Task 1 in test_transform.py.
Decision Trees
As we discussed in lecture, decision trees are a data structure used to solve classification problems. Here is
a sample decision tree that labels taxpayers as potential cheaters or non-cheaters.
This tree, for example, would classify a single person who did not get a refund and makes $85,000 a year
as a possible cheater.
We briefly summarize the algorithm for building decision trees below. See the chapter on Classification
and Decision Trees from Introduction to Data Mining by Tan, Steinbach, and Kumar
(https://www-users.cs.umn.edu/~kumar001/dmbook/ch3_classification.pdf) for a more detailed
description.

Definitions
Before we describe the decision tree algorithm, we need to define a few formulas. Let S = A_1 × ... × A_k
be a multiset of observations, r a “row” or “observation” in S, A ∈ {A_1, ..., A_k} an attribute, and r[A] the
value of attribute A in row r. Let |S| denote the number of observed elements in S (including repetition of
the same element). We use the following definitions:

    S_{A=j} = \{ r \in S \mid r[A] = j \}

    p(S, A, j) = \frac{|S_{A=j}|}{|S|}

to describe, respectively, the subset of the observations in S that have value j for attribute A and the
fraction of the observations in S that have value j for attribute A.
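
With a pandas DataFrame standing in for S (the target and split attributes are just column names), these
two quantities are straightforward to compute. A minimal sketch, with names of our own choosing:

```python
def subset(S, A, j):
    """S_{A=j}: the observations in S whose value for attribute A is j."""
    return S[S[A] == j]

def p(S, A, j):
    """p(S, A, j): the fraction of observations in S with value j for attribute A."""
    return len(subset(S, A, j)) / len(S)
```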
Decision Tree Algorithm
Given a multiset of observations S, a target attribute T (that is, the label we are trying to predict), and a
set, ATTR, of possible attributes to split on, the basic algorithm to build a decision tree, based on Hunt’s
algorithm, works as follows:
1. Create a tree node, N, with its class label set to the value from the target attribute that occurs most
often:

    argmax_{v \in values(T)} p(S, T, v)

where values(T) is the set of possible values for attribute T and argmax yields the value v that maximizes
the function. For interior nodes, the class label will be used when a traversal encounters an unexpected
value for the split attribute.
2. If all the observations in S are from the same target class, ATTR is the empty set, or the remaining
observations share the same values for the attributes in ATTR, return the node N.
3. Find the attribute A from ATTR that yields the largest gain ratio (defined below), set the split attribute
for tree node N to be A, and set the children of N to be decision trees computed from the subsets obtained
by splitting S on A with T as the target class and the remaining attributes (ATTR - {A}) as the set of
possible split attributes. The edge from N to the child computed from the subset S_{A=j} should be
labeled j. Stop the recursion if the largest gain ratio is zero.
We use the term gain to describe the increase in purity with respect to attribute T that can be obtained by
splitting the observations in S according to the values of attribute A. (In less formal terms, we want to
identify the attribute that will do the best job of splitting the data into groups such that more of the
members share the same value for the target attribute.)

There are multiple ways to define impurity; we’ll use the gini coefficient in this assignment:

    gini(S, A) = 1 - \sum_{j \in values(A)} p(S, A, j)^2

Given that definition, we can define gain formally as:

    gain(S, A, T) = gini(S, T) - \sum_{j \in values(A)} p(S, A, j) \cdot gini(S_{A=j}, T)

We might see a large gain merely because splitting on an attribute produces many small subsets. To
protect against this problem, we will compute a ratio of the gain from splitting on an attribute to the split
information for that attribute:

    gain_ratio(S, A, T) = \frac{gain(S, A, T)}{split_info(S, A)}

where split information is defined as:

    split_info(S, A) = -\sum_{j \in values(A)} p(S, A, j) \cdot \log p(S, A, j)
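
These formulas translate almost directly into code. Below is a sketch in the same spirit (the function names
are our own, and it reuses the p and subset helpers sketched earlier); natural log is used here, since the
assignment does not pin down a base:

```python
import numpy as np

def gini(S, A):
    """gini(S, A): one minus the sum of squared value fractions of attribute A."""
    return 1 - sum(p(S, A, j) ** 2 for j in S[A].unique())

def gain(S, A, T):
    """gain(S, A, T): reduction in the gini impurity of T from splitting S on A."""
    return gini(S, T) - sum(p(S, A, j) * gini(subset(S, A, j), T)
                            for j in S[A].unique())

def split_info(S, A):
    """split_info(S, A): minus the sum of p * log(p) over the values of A."""
    return -sum(p(S, A, j) * np.log(p(S, A, j)) for j in S[A].unique())

def gain_ratio(S, A, T):
    """Ratio of gain to split information (treated as 0 when split_info is 0)."""
    si = split_info(S, A)
    return gain(S, A, T) / si if si > 0 else 0.0
```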
Task 2: Building and using decision trees
We have seeded your pa4 directory with a file named decision_tree.py. This file includes a main block
that processes the expected command-line arguments (filenames for the training and testing data) and
then calls a function named go. Your task is to implement go and any necessary auxiliary functions. Your
go function should build a decision tree from the training data and then return a list (or Pandas series) of
the classifications obtained by using the decision tree to classify each observation in the testing data.
Your program must be able to handle any data set that:
1. has a header row,
2. has categorical attributes, and
3. in which the (binary) target attribute appears in the last column.
You should use all the columns except the last one as attributes when building the decision tree.
You could break ties in steps 1 and 3 of the algorithm arbitrarily, but to simplify the process of testing we
will dictate a specific method. In step 1, if both classes occur the same number of times, choose the value
that occurs earlier in the natural ordering for strings. For example, if "Yes" occurs six times and "No"
occurs six times, choose "No", because "No" < "Yes". In the unlikely event that the gain ratio for two
attributes a1 and a2, where a1 < a2, is the same, choose a1.
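
Both rules amount to preferring the candidate that sorts first. For step 1, for instance, a majority-label
helper (the name is ours, not part of the skeleton code) might look like:

```python
def majority_label(values):
    """Most common value in a pandas Series; ties go to the smaller string."""
    counts = values.value_counts()
    best = counts.max()
    return min(value for value, count in counts.items() if count == best)
```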
You must define a Python class to represent the nodes of the decision tree. We strongly encourage you to
use Pandas for this task as well. It is well suited to the task of computing the different metrics (gini, gain,
etc).
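
One possible shape for that class, sketched under our own naming assumptions (the assignment only
requires that some Python class represent the nodes):

```python
class Node:
    """A decision tree node: a class label, an optional split attribute,
    and the children reached by each value of that attribute."""

    def __init__(self, label):
        self.label = label           # majority class label at this node
        self.split_attribute = None  # attribute this node splits on, if any
        self.children = {}           # split-attribute value -> child Node

    def classify(self, observation):
        """Classify one observation (e.g., a row of the testing data),
        falling back on this node's label at leaves or unexpected values."""
        if self.split_attribute is None:
            return self.label
        child = self.children.get(observation[self.split_attribute])
        if child is None:
            return self.label
        return child.classify(observation)
```

With a class like this, go could build the root Node from the training data and return
[root.classify(row) for _, row in testing_df.iterrows()], where testing_df is a placeholder name for the
testing DataFrame (again, just one possible design).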
Testing Task 2

We have provided test code for Task 2 in test_decision_tree.py.
Grading
Completeness: 65%
Correctness: 10%
Design: 15%
Style: 10%
Obtaining your test score
The completeness part of your score will be determined using automated tests. To get your score for the
automated tests, simply run the following from the Linux command-line. (Remember to leave out the $
prompt when you type the command.)
$ py.test
$ ../common/grader.py
Notice that we’re running py.test without the -k or -x options: we want it to run all the tests. If you’re
still failing some tests, and don’t want to see the output from all the failed tests, you can add the --tb=no
option when running py.test:
$ py.test --tb=no
$ ../common/grader.py
Programming assignments will be graded according to a general rubric. Specifically, we will assign points
for completeness, correctness, design, and style.