Decision Trees
Due: Friday, Feb 28th at 6pm
The purpose of this assignment is to give you experience with data cleaning and decision trees.
You can work in pairs on this assignment.
Once you have followed these instructions, your repository will contain a directory named pa4. That
directory will include:
transform.py: skeleton code for Task 1.
test_transform.py: test code for Task 1.
decision_tree.py: skeleton code for Task 2.
test_decision_tree.py: test code for Task 2.
data: a directory with sample data sets.
To pick up the data for this assignment, change to the data directory and run the following on the Linux
command-line:
$ ./get_files.sh
Pima Indian Data
The Pima Indian Data Set, which is from the UC Irvine Machine Learning Repository
(http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes), contains anonymized information on
women from the Pima Indian Tribe (https://en.wikipedia.org/wiki/Pima_people). This information was
collected by the National Institute of Diabetes and Digestive and Kidney Diseases to study diabetes, which
is prevalent among members of this tribe.
The data set has 768 entries, each of which contains the following attributes:
1. Number of pregnancies
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Has diabetes (1 means yes, 0 means no)
Task 1: Data Cleaning
Your first task is to clean and then transform the raw Pima data into a training set and a testing set.
We have seeded your pa4 directory with a file named transform.py. Your task is to complete the function
clean, which takes four arguments:
1. the name of a raw Pima Indians Diabetes data file,
2. a filename for the training data
3. a filename for the testing data
4. a seed for use with train_test_split.
Getting started
Your function should clean and transform the raw data as described below, split the resulting data into
training and testing sets, and save the split data in CSV files.
Each row in the raw file contains an observation. The raw attribute values are floating point numbers. For
every attribute but the first and the last, a zero should be interpreted as missing data. The "Triceps skin
fold thickness (mm)" and "2-Hour serum insulin (mu U/ml)" columns have a lot of missing data,
so you should eliminate them when you process the data. Also, you should remove any observation with a
value of zero for plasma glucose concentration, diastolic blood pressure, or body mass index.
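
For example, the column-dropping and row-filtering steps described above might look roughly like the
following sketch. The column names used here are placeholders; substitute the names that actually appear
in the raw file's header.

```python
import pandas as pd

def drop_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the cleaning rules above; column names are placeholders."""
    # Drop the two columns with a lot of missing data.
    df = df.drop(columns=["Triceps skin fold thickness (mm)",
                          "2-Hour serum insulin (mu U/ml)"])
    # Remove observations with a zero (that is, missing) value for plasma
    # glucose concentration, diastolic blood pressure, or body mass index.
    required = ["Plasma glucose concentration",
                "Diastolic blood pressure (mm Hg)",
                "Body mass index"]
    return df[(df[required] != 0).all(axis=1)]
```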
Once the data is cleaned, you will need to convert the numeric data into categorical data. We have included
a dictionary named BOUNDS in transform.py that specifies, for each attribute to be categorized, a list of
category boundaries and a list of category labels. For example, the categories for plasma glucose level are
represented with the following pair of lists: [0.1, 95, 141, float("inf")] and ["low", "medium", "high"].
Together, these lists specify that a plasma glucose level between 0.1 (inclusive) and 95 (exclusive) should be
labeled as "low", a level between 95 (inclusive) and 141 (exclusive) should be labeled as "medium", and a
level of 141 or higher should be labeled as "high".
Note: the Python expression float('inf') evaluates to positive infinity. For all finite floating point values x,
x < float('inf').
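
One way to apply a (boundaries, labels) pair is pandas' pd.cut with right=False, so that each bin includes
its left boundary and excludes its right one. A small sketch, assuming BOUNDS maps a column name to such
a pair (the exact structure and contents are in transform.py):

```python
import pandas as pd

# Mirrors the plasma glucose example from the text; the real BOUNDS
# dictionary is provided in transform.py.
BOUNDS = {"Plasma glucose concentration": ([0.1, 95, 141, float("inf")],
                                            ["low", "medium", "high"])}

def categorize(df):
    """Replace each numeric column listed in BOUNDS with category labels."""
    for col, (boundaries, labels) in BOUNDS.items():
        # right=False: bins are [left, right), e.g. 95 <= x < 141 -> "medium".
        df[col] = pd.cut(df[col], bins=boundaries, labels=labels, right=False)
    return df
```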
Finally, once the data is cleaned and transformed, you should randomly split the observations into a
training set and a testing set using the sklearn.model_selection function train_test_split, passing the
specified seed as the random_state parameter. The training set should contain roughly 90% of the
transformed data, with the remainder going into the testing set.
The raw data includes a header row, which should be suitably modified and included in both output files.
Do not include the row index in the output files.
Pandas is ideally suited for this task.
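
A sketch of the final split-and-save step, assuming df already holds the cleaned, categorized data (the
helper name is ours; only train_test_split and to_csv come from the libraries):

```python
from sklearn.model_selection import train_test_split

def split_and_save(df, training_filename, testing_filename, seed):
    """Split roughly 90/10 and write both pieces to CSV (header, no index)."""
    train, test = train_test_split(df, test_size=0.1, random_state=seed)
    train.to_csv(training_filename, index=False)
    test.to_csv(testing_filename, index=False)
```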
Testing Task 1

We have provided test code for Task 1 in test_transform.py.
Decision Trees
As we discussed in lecture, decision trees are a data structure used to solve classification problems. Here is
a sample decision tree that labels taxpayers as potential cheaters or non-cheaters.
This tree, for example, would classify a single person who did not get a refund and makes $85,000 a year
as a possible cheater.
We briefly summarize the algorithm for building decision trees below. See the chapter on Classification
and Decision Trees from Introduction to Data Mining by Tan, Steinbach, and Kumar
(https://www-users.cs.umn.edu/~kumar001/dmbook/ch3_classification.pdf) for a more detailed
description.

Definitions
Before we describe the decision tree algorithm, we need to define a few formulas. Let S = A_1 × ... × A_k
be a multiset of observations, r a “row” or “observation” in S, A ∈ {A_1, ..., A_k} an attribute, and r[A] the
value of attribute A in row r. Let |S| denote the number of observed elements in S (including repetition of
the same element). We use the following definitions:

    S_{A=j} = \{ r \in S \mid r[A] = j \}

    p(S, A, j) = \frac{|S_{A=j}|}{|S|}

to describe, respectively, the subset of the observations in S that have value j for attribute A and the
fraction of the observations in S that have value j for attribute A.
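
With a pandas DataFrame standing in for S (the target and split attributes are just column names), these
two quantities are straightforward to compute. A minimal sketch, with names of our own choosing:

```python
def subset(S, A, j):
    """S_{A=j}: the observations in S whose value for attribute A is j."""
    return S[S[A] == j]

def p(S, A, j):
    """p(S, A, j): the fraction of observations in S with value j for attribute A."""
    return len(subset(S, A, j)) / len(S)
```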
Decision Tree Algorithm
Given a multiset of observations S, a target attribute T (that is, the label we are trying to predict), and a
set, ATTR, of possible attributes to split on, the basic algorithm to build a decision tree, based on Hunt’s
algorithm, works as follows:
1. Create a tree node, N, with its class label set to the value from the target attribute that occurs most
often:

    argmax_{v \in values(T)} p(S, T, v)

where values(T) is the set of possible values for attribute T and argmax yields the value v that maximizes
the function. For interior nodes, the class label will be used when a traversal encounters an unexpected
value for the split attribute.
2. If all the observations in S are from the same target class, ATTR is the empty set, or the remaining
observations share the same values for the attributes in ATTR, return the node N.
3. Find the attribute A from ATTR that yields the largest gain ratio (defined below), set the split attribute
for tree node N to be A, and set the children of N to be decision trees computed from the subsets obtained
by splitting S on A with T as the target class and the remaining attributes (ATTR - {A}) as the set of
possible split attributes. The edge from N to the child computed from the subset S_{A=j} should be
labeled j. Stop the recursion if the largest gain ratio is zero.
We use the term gain to describe the increase in purity with respect to attribute T that can be obtained by
splitting the observations in S according to the values of attribute A. (In less formal terms, we want to
identify the attribute that will do the best job of splitting the data into groups such that more of the
members share the same value for the target attribute.)

There are multiple ways to define impurity; we’ll use the gini coefficient in this assignment:

    gini(S, A) = 1 - \sum_{j \in values(A)} p(S, A, j)^2

Given that definition, we can define gain formally as:

    gain(S, A, T) = gini(S, T) - \sum_{j \in values(A)} p(S, A, j) \cdot gini(S_{A=j}, T)

We might see a large gain merely because splitting on an attribute produces many small subsets. To
protect against this problem, we will compute a ratio of the gain from splitting on an attribute to the split
information for that attribute:

    gain_ratio(S, A, T) = \frac{gain(S, A, T)}{split_info(S, A)}

where split information is defined as:

    split_info(S, A) = -\sum_{j \in values(A)} p(S, A, j) \cdot \log p(S, A, j)
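
These formulas translate almost directly into code. Below is a sketch in the same spirit (the function names
are our own, and it reuses the p and subset helpers sketched earlier); natural log is used here, since the
assignment does not pin down a base:

```python
import numpy as np

def gini(S, A):
    """gini(S, A): one minus the sum of squared value fractions of attribute A."""
    return 1 - sum(p(S, A, j) ** 2 for j in S[A].unique())

def gain(S, A, T):
    """gain(S, A, T): reduction in the gini impurity of T from splitting S on A."""
    return gini(S, T) - sum(p(S, A, j) * gini(subset(S, A, j), T)
                            for j in S[A].unique())

def split_info(S, A):
    """split_info(S, A): minus the sum of p * log(p) over the values of A."""
    return -sum(p(S, A, j) * np.log(p(S, A, j)) for j in S[A].unique())

def gain_ratio(S, A, T):
    """Ratio of gain to split information (treated as 0 when split_info is 0)."""
    si = split_info(S, A)
    return gain(S, A, T) / si if si > 0 else 0.0
```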
Task 2: Building and using decision trees
We have seeded your pa4 directory with a file named decision_tree.py. This file includes a main block
that processes the expected command-line arguments (filenames for the training and testing data) and
then calls a function named go. Your task is to implement go and any necessary auxiliary functions. Your
go function should build a decision tree from the training data and then return a list (or Pandas series) of
the classifications obtained by using the decision tree to classify each observation in the testing data.
Your program must be able to handle any data set that:
1. has a header row,
2. has categorical attributes, and
3. in which the (binary) target attribute appears in the last column.
You should use all the columns except the last one as attributes when building the decision tree.
You could break ties in steps 1 and 3 of the algorithm arbitrarily, but to simplify the process of testing we
will dictate a specific method. In step 1, if both classes occur the same number of times, choose the value
that occurs earlier in the natural ordering for strings. For example, if "Yes" occurs six times and "No"
occurs six times, choose "No", because "No" < "Yes". In the unlikely event that the gain ratio for two
attributes a1 and a2, where a1 < a2, is the same, choose a1.
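
Both rules amount to preferring the candidate that sorts first. For step 1, for instance, a majority-label
helper (the name is ours, not part of the skeleton code) might look like:

```python
def majority_label(values):
    """Most common value in a pandas Series; ties go to the smaller string."""
    counts = values.value_counts()
    best = counts.max()
    return min(value for value, count in counts.items() if count == best)
```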
You must define a Python class to represent the nodes of the decision tree. We strongly encourage you to
use Pandas for this task as well. It is well suited to the task of computing the different metrics (gini, gain,
etc).
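
One possible shape for that class, sketched under our own naming assumptions (the assignment only
requires that some Python class represent the nodes):

```python
class Node:
    """A decision tree node: a class label, an optional split attribute,
    and the children reached by each value of that attribute."""

    def __init__(self, label):
        self.label = label           # majority class label at this node
        self.split_attribute = None  # attribute this node splits on, if any
        self.children = {}           # split-attribute value -> child Node

    def classify(self, observation):
        """Classify one observation (e.g., a row of the testing data),
        falling back on this node's label at leaves or unexpected values."""
        if self.split_attribute is None:
            return self.label
        child = self.children.get(observation[self.split_attribute])
        if child is None:
            return self.label
        return child.classify(observation)
```

With a class like this, go could build the root Node from the training data and return
[root.classify(row) for _, row in testing_df.iterrows()], where testing_df is a placeholder name for the
testing DataFrame (again, just one possible design).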
Testing Task 2

We have provided test code for Task 2 in test_decision_tree.py.
Grading
Completeness: 65%
Correctness: 10%
Design: 15%
Style: 10%
Obtaining your test score
The completeness part of your score will be determined using automated tests. To get your score for the
automated tests, simply run the following from the Linux command-line. (Remember to leave out the $
prompt when you type the command.)
$ py.test
$ ../common/grader.py
Notice that we’re running py.test without the -k or -x options: we want it to run all the tests. If you’re
still failing some tests, and don’t want to see the output from all the failed tests, you can add the --tb=no
option when running py.test:
$ py.test --tb=no
$ ../common/grader.py
Programming assignments will be graded according to a general rubric. Specifically, we will assign points
for completeness, correctness, design, and style.