Lecture Recording URL: https://ucidce.zoom.us/rec/share/-tJOIKHfykJOSIXf6h7DQYQ7Rp73eaa81HUb8_MOxI7uyzsycMubf064CsL-1TM
Day 6: Assignment, Ensembles, Decision Trees, and Trading Systems
Instructions
This assignment looks at using k-nearest neighbors to create a simple recommendation engine.
Homework steps:
· Open the homework notebook link: LINK TO NOTEBOOK
· Save a copy to your Google Drive
· Answer the questions in your notebook copy with your code and written answers
· Set sharing to "Anyone with a link can view".
· Save the notebook and submit the link
Day 6: Content
Overview
In information-based modeling, we again utilize the structure of past data to build models for regression and classification problems. In this module, we cover decision trees, a modeling method based on information gain. The resulting model is a tree structure built from actual attribute values in the data, and it is often a big favorite in machine learning because of its readability. Ensemble methods are also touched on in this module as a way to combine the modeling power of multiple models.
Readings and Media
· Class slides: Information-based Modeling
Modeling Methods, Deploying, and Refining Predictive Models
UCI Spring 2020
Class 6 Information-based Modeling
Schedule
Introduction and Overview
Data and Modeling + Simulation Modeling
Error-based Modeling
Probability-based Modeling
Similarity-based Modeling
Information-based Modeling
Time-series Modeling
Deployment
At the end of this module:
You will learn how to build:
Decision Trees and
Ensembles
For regression and classification
Supervised Methods
Error-based
Similarity-based
Information-based
Probability-based
Neural networks and deep learning-based methods
Ensembles
Today’s Objectives
Information-based Modeling
Decision Trees
Ensembles
Information-based Algorithms
Models, such as decision trees, that are based on information gain in data sets.
Decision tree methods construct a model of decisions made based on actual values of attributes in the data.
Decisions fork in tree structures until a prediction decision is made for a given record. Decision trees are trained on data for classification and regression problems. Decision trees are often fast and accurate and a big favorite in machine learning.
The most popular decision tree algorithms are:
Classification and Regression Tree (CART)
Iterative Dichotomiser 3 (ID3)
C4.5 and C5.0 (different versions of a powerful approach)
Chi-squared Automatic Interaction Detection (CHAID)
Decision Stump
M5
Conditional Decision Trees
Today’s Objectives
Information-based Modeling
Decision Trees
Ensembles
Decision Trees
Robust and intuitive predictive models when the target attribute is categorical in nature and when the data set is of mixed data types
Compared with more numerical methods, decision trees are better at handling attributes that have missing or inconsistent values
Decision trees tell the user what is predicted, how confident that prediction is, and how we arrived at that prediction
Popular method when communicability is a priority
Computationally efficient
Applications
Medicine: used for diagnosis in numerous specialties
Financial analysis: credit risk modeling
Internet routing: used in routing tables to find next router to handle packet based on the prefix sequence of bits
Computer vision: tree-based classification for recognizing 3D objects
Many more…
An example of a decision tree developed in RapidMiner
Decision trees are made of nodes and leaves to represent the best predictor attributes in a data set
Elements of a decision tree
[Figure: an email-classification decision tree whose internal nodes test "Contains images", "Suspicious words", and "Unknown sender"; true/false branches lead to leaf nodes labeled "spam" or "legit". Annotations mark the root node, internal nodes, leaf nodes, the tree's depth, and a decision path.]
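To make these elements concrete, here is a small sketch of such a tree written as nested conditionals in Python. This is my own illustration: the node labels come from the figure, but the exact branch structure and leaf outcomes are assumptions.

```python
def classify_email(contains_images: bool, suspicious_words: bool, unknown_sender: bool) -> str:
    """Toy spam-filter decision tree: each if-test is a node, each return is a
    leaf, and the sequence of tests applied to one email is its decision path."""
    if contains_images:                # root node (depth 0)
        if suspicious_words:           # internal node (depth 1)
            return "spam"              # leaf node
        return "legit"                 # leaf node
    if unknown_sender:                 # internal node (depth 1)
        return "spam"                  # leaf node
    return "legit"                     # leaf node

print(classify_email(contains_images=True, suspicious_words=False, unknown_sender=True))  # legit
```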
The ABT for decision trees
Descriptive Feature 1 | Descriptive Feature 2 | … | Descriptive Feature m | Target Feature
Obs 1 | Obs 1 | … | Obs 1 | Target value 1
Obs 2 | Obs 2 | … | Obs 2 | Target value 2
… | … | … | … | …
Obs n | Obs n | … | Obs n | Target value n
The descriptive features can form a categorical, numeric, or mixed feature space, and the target feature can be numeric or categorical. Each column simply represents a set of values, and the heterogeneity of a set is its entropy.
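As a concrete illustration, a small hypothetical ABT as a pandas DataFrame (the feature names and values below are made up):

```python
import pandas as pd

# A tiny hypothetical ABT: mixed categorical/numeric descriptive features
# and a categorical target feature.
abt = pd.DataFrame({
    "contains_images":  [True, False, True, False, True],             # categorical feature
    "suspicious_words": [3, 0, 5, 1, 0],                              # numeric feature
    "unknown_sender":   [True, False, True, True, False],             # categorical feature
    "label":            ["spam", "legit", "spam", "legit", "legit"],  # target feature
})
print(abt)
```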
Shannon’s entropy model and cards
[Figure: card sets of increasing variety, captioned Entropy(card) = 0.0, 0.81, 1.0, 1.50, 1.58, and 3.58.]
Entropy increases as uncertainty increases.
Shannon’s Model of Entropy
Cornerstone of modern information theory
Measures heterogeneity of a set
Defined as:
H(d) = -\sum_{l=1}^{L} P(d=l) \log_s(P(d=l))
P(d=l): probability of randomly selecting an element d of type l
L is the number of different types of d in the set
s is an arbitrary base; for information modeling, base 2 is used so that entropy is measured in bits
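A minimal Python sketch of this formula (my own illustration). The distributions below are assumptions chosen to reproduce the entropy values shown in the card figure, which gives only the values themselves:

```python
import math

def entropy(probs, base=2):
    """Shannon entropy: H = -sum over levels l of P(d=l) * log_base(P(d=l)).
    Terms with p == 0 or p == 1 contribute nothing, so they are skipped."""
    return -sum(p * math.log(p, base) for p in probs if 0 < p < 1)

print(entropy([1.0]))              # 0 (zero entropy) - every card identical
print(entropy([0.75, 0.25]))       # ~0.81
print(entropy([0.5, 0.5]))         # 1.0 - two equally likely kinds
print(entropy([0.5, 0.25, 0.25]))  # 1.5
print(entropy([1/3] * 3))          # ~1.58 - three equally likely kinds
print(entropy([1/12] * 12))        # ~3.58 - twelve equally likely kinds
```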
The ABT for decision trees
Descriptive Feature 1 | Descriptive Feature 2 | … | Descriptive Feature m | Target Feature
Obs 1 | Obs 1 | … | Obs 1 | Target value 1
Obs 2 | Obs 2 | … | Obs 2 | Target value 2
… | … | … | … | …
Obs n | Obs n | … | Obs n | Target value n
Our dataset is D. We can partition it by a descriptive feature d, creating one subset D_{d=l} for each level l that d can take. Each partition reduces the entropy in the set; the difference is the information gain.
Levels(Y) is the set of levels in the domain of the target feature Y, and l is a value in Levels(Y); there are L levels in total.
Entropy for our dataset
H(Y, D) = -\sum_{l \in Levels(Y)} P(Y=l) \log_2(P(Y=l))
Remaining entropy in a partitioned dataset
Entropy remaining when we partition the dataset using feature d:
rem(d, D) = \sum_{l \in Levels(d)} \frac{|D_{d=l}|}{|D|} \, H(Y, D_{d=l})
Information gain
Information gained by splitting the dataset using the feature d:
IG(d, D) = H(Y, D) - rem(d, D)
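A short Python sketch of these two quantities (my own example; the records and feature names are made up for illustration):

```python
import math
from collections import Counter

def entropy(values, base=2):
    """Shannon entropy H of a list of target values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())

def information_gain(rows, feature, target):
    """IG(d, D) = H(Y, D) - rem(d, D) for a list of dict records."""
    total_entropy = entropy([r[target] for r in rows])
    remaining = 0.0
    for level in {r[feature] for r in rows}:
        part = [r[target] for r in rows if r[feature] == level]
        remaining += (len(part) / len(rows)) * entropy(part)
    return total_entropy - remaining

# A toy spam dataset (feature names are assumptions echoing the earlier figure)
data = [
    {"suspicious_words": True,  "unknown_sender": True,  "label": "spam"},
    {"suspicious_words": True,  "unknown_sender": False, "label": "spam"},
    {"suspicious_words": False, "unknown_sender": True,  "label": "legit"},
    {"suspicious_words": False, "unknown_sender": False, "label": "legit"},
]
print(information_gain(data, "suspicious_words", "label"))  # 1.0 - a perfect split
print(information_gain(data, "unknown_sender", "label"))    # 0.0 - no information gained
```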
Decision tree process
1. Compute the entropy of the original dataset with respect to the target feature. This measures how much information is required to organize the dataset into pure sets, i.e., the heterogeneity (entropy) of the set.
2. For each descriptive feature, create the sets that result from partitioning the instances by their values of that feature, then sum the entropy scores of these sets. This is the entropy remaining after the split: the information still required to organize the instances into pure sets once we have partitioned on that descriptive feature.
3. Subtract the remaining entropy from the original entropy to compute the information gain.
Implementation
The Iterative Dichotomiser 3 (ID3) algorithm is one of the most popular approaches.
Top-down, recursive, depth-first partitioning beginning at the root node and finishing at the leaf nodes.
Assumes categorical features and clean data but can be extended to handle numeric features and targets and noisy data via thresholding and pruning.
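For comparison, a minimal scikit-learn sketch (my own example, not from the slides). Note that scikit-learn implements an optimized CART rather than ID3, but criterion="entropy" makes it choose splits by information gain in the same spirit:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="entropy" selects splits by information gain; max_depth limits tree growth
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))  # holdout accuracy
print(export_text(tree))           # readable, rule-like view of the learned tree
```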
Today’s Objectives
Information-based Modeling
Decision Trees
Ensembles
Ensembles
Instead of focusing on a single model for prediction, what if we generate a set of independent models and aggregate their outputs?
Ensemble properties
Build multiple independent models from the same dataset but each model uses a modified subset of the dataset
Make a prediction by aggregating the predictions of the different models in the ensemble.
For categorical targets, this can be done using voting mechanisms.
For numeric targets, this can be done using a measure of central tendency like the mean or median.
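A tiny sketch of the aggregation step (my own illustration with made-up predictions):

```python
from collections import Counter
import numpy as np

# Predictions from three hypothetical ensemble members for five instances
categorical_preds = [
    ["spam",  "legit", "spam", "spam",  "legit"],
    ["spam",  "spam",  "spam", "legit", "legit"],
    ["legit", "legit", "spam", "spam",  "legit"],
]
# Categorical target: majority vote per instance
voted = [Counter(col).most_common(1)[0][0] for col in zip(*categorical_preds)]
print(voted)  # ['spam', 'legit', 'spam', 'spam', 'legit']

# Numeric target: aggregate with a measure of central tendency
numeric_preds = np.array([
    [10.0, 2.0, 5.0],
    [12.0, 1.0, 4.0],
    [11.0, 3.0, 6.0],
])
print(numeric_preds.mean(axis=0))        # [11.  2.  5.]
print(np.median(numeric_preds, axis=0))  # [11.  2.  5.]
```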
Boosting
Boosting repeatedly re-trains on re-weighted (replicated) instances so that each new model targets the cases where the current models perform weakly.
Boosting idea
Step 1:
Use a weighted dataset where each instance has an associated weight. Initially, distribute the weights uniformly to all instances.
Sample over this weighted set to create a replicated training set and create a model using the replicated training set.
Find the total error in the set of predictions made by the model; the model is summarized by its predictions and its error rate.
Boosting idea
Step 2
Increase the weights of the misclassified instances and decrease the weights of the correctly classified instances. The number of times an instance is replicated is proportional to its weight.
Calculate a confidence measure for the model based on its error. This is used to weight the predictions from the models.
[Figure: Model 1 and Model 2, each summarized by a (prediction, error rate) pair and a confidence measure, with the replicated instances used to train the next model.]
Boosting idea
Step 3
Make a prediction using the weighted models by:
For categorical targets, this can be done using voting mechanisms.
For numeric targets, this can be done using a measure of central tendency like the mean or median.
[Figure: the weighted predictions of Models 1, 2, and 3 are combined.]
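This three-step procedure is, in a refined form, what the AdaBoost algorithm does. A minimal scikit-learn sketch (my own example; the dataset is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The default base learner is a depth-1 decision tree (a decision stump).
# Each boosting round re-weights the instances the previous rounds misclassified,
# and the final prediction is a confidence-weighted vote of the 50 stumps.
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)
boosted.fit(X_train, y_train)
print(boosted.score(X_test, y_test))  # holdout accuracy
```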
Bagging (bootstrap aggregating)
Bagging and subspace sampling
Bagging is another method to generate ensembles.
Random samples the same size as the dataset are drawn with replacement from the dataset. These are the bootstrap samples.
For each of the bootstrap samples, we create a model (typically a decision tree).
Because the models are trained on samples drawn with replacement, each training set contains duplicated instances and is missing others. This creates many different models because each sees a different dataset; this is called subspace sampling.
Random Forest
The ensemble of decision trees resulting from subspace sampling is referred to as a Random Forest.
The ensemble makes predictions by returning the majority vote, or the median for continuous targets.
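A minimal scikit-learn sketch contrasting plain bagging with a random forest (my own example; the dataset is synthetic). A random forest adds one extra trick on top of bagging: each split also considers only a random subset of the features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: 100 trees, each trained on a bootstrap sample of the training data
# (the default base learner for BaggingClassifier is a decision tree).
bagged = BaggingClassifier(n_estimators=100, random_state=0)

# Random forest: bootstrap samples plus random feature subsets at each split.
forest = RandomForestClassifier(n_estimators=100, random_state=0)

for name, model in [("bagging", bagged), ("random forest", forest)]:
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```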
Boosting vs. Bagging
Which method is preferred is a matter of experimentation.
Typically, boosting exhibits a greater tendency toward overfitting when there is a large number of features.
Review of topics
Information-based Modeling
Decision Trees
Entropy
Information gain
Categorical and numeric prediction
Ensembles
Boosting
Bagging
Comparison of ABT/Feature matrix concepts
Error-based
Probability-based
Similarity-based
Information-based
We need to have an Analytics Base Table (ABT) before we can model anything
Descriptive Feature 1 | Descriptive Feature 2 | … | Descriptive Feature m | Target Feature
The ABT and the Model
Descriptive Feature 1 | Descriptive Feature 2 | … | Descriptive Feature m | Target Feature
Obs 1 | Obs 1 | … | Obs 1 | Categorical target value 1
Obs 2 | Obs 2 | … | Obs 2 | Categorical target value 2
… | … | … | … | …
Obs n | Obs n | … | Obs n | Categorical target value n
The existence of a target feature automatically makes the modeling problem supervised.
The data type of the target feature restricts which models can be used.
The dataset characteristics may restrict the resolution of the model, force you to make assumptions, or require modeling for imputation, de-noising, data generation, etc.
Understanding and manipulating feature spaces is the key to data analytics
N-dimensional vector space representation of language produces an incredible ability to perform word-vector arithmetic.
Image source: Deep Learning Illustrated by Krohn
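For example, the classic word-vector arithmetic king − man + woman ≈ queen. A toy numpy sketch of the idea (the vectors below are made up purely for illustration; real embeddings such as word2vec or GloVe have hundreds of dimensions learned from text):

```python
import numpy as np

# Made-up 3-dimensional "word vectors" for illustration only
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Word-vector arithmetic: king - man + woman should land closest to queen
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max(vectors, key=lambda w: cosine(vectors[w], target))
print(best)  # queen
```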
The ABT/ Feature space
The ABT/feature space representation is nothing more than an n-dimensional matrix
Modeling methods are just different ways to perform statistical, mathematical, or even heuristic operations on this feature space.