Modeling Methods, Deploying, and Refining Predictive Models
UCI Spring 2020
I&C X425.34 Modeling Methods, Deploying, and Refining Predictive Models
Module: Data Preparation and the Modeling Process
Schedule
Introduction and Overview
Data and Modeling + Simulation Modeling
Error-based Modeling
Probability-based Modeling
Similarity-based Modeling
Information-based Modeling
Time-series Modeling
Deployment
At the end of this module:
You will review how to:
Prepare and analyze data to understand its properties and relationships
In relation to:
The modeling process
You will learn how to build:
Simulation models
For:
Scenario analysis
Today’s Objectives
Data and the modeling process
Data pitfalls
Model risk
Simulation Modeling
First translate the business question to a data problem
[Process diagram: the phases Business Understanding, Data Understanding, and Data Preparation, spanning the tasks Understand Business Problem, Propose Analytics Solutions, Explore Data, Assess Analytics Solutions, Choose Analytics Solutions, Agree on Analytics Goals, Brainstorm Domain Concepts, Design Domain Concepts, Review Domain Concepts, Design Features, Review Features, Build ABT, Clean & Prepare Data, and Deploy Data.]
Next, understand the data available
Set up the data for modeling
The process is highly iterative
Real-world data must be represented in some digital form, such as:
Numeric: True numeric values that allow arithmetic operations (price, measurement, etc.)
Interval: Values that allow ordering and subtraction but not other arithmetic operations (date, time, etc.)
Ordinal: Values that allow ordering but do not permit arithmetic operations (e.g., size as S, M, L)
Categorical: A finite set of values that cannot be ordered and allow no arithmetic operations
Binary: A set of two values (e.g., T/F)
Textual: Free-form text
Tuples: n-tuple identifiers such as lat/lon coordinates
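As a minimal sketch (not from the original slides), here is how these representations might map onto pandas dtypes; all values below are made up for illustration:

```python
import pandas as pd

# Hypothetical records illustrating each digital representation
df = pd.DataFrame({
    "price": [19.99, 4.50, 7.25],                                     # numeric
    "order_date": pd.to_datetime(["2020-01-03", "2020-02-14",
                                  "2020-03-01"]),                     # interval
    "size": pd.Categorical(["S", "M", "L"],
                           categories=["S", "M", "L"], ordered=True),  # ordinal
    "region": pd.Categorical(["west", "east", "west"]),               # categorical
    "is_member": [True, False, True],                                 # binary
    "comment": ["great", "ok", "late delivery"],                      # textual
    "latlon": [(33.64, -117.84), (40.71, -74.01),
               (34.05, -118.24)],                                     # tuple
})

print(df.dtypes)
print(df["order_date"].diff())  # interval: subtraction is meaningful
print(df["size"].min())         # ordinal: ordering is meaningful -> 'S'
```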
Translating the business question to a target concept is hard
[Diagram: an Analytics Solution decomposes into Domain Concepts and a Target Concept; Domain Concepts break into Domain Subconcepts, which map to concrete Features; the Target Concept maps to the Target Feature.]
The more specific you are about the question, the better.
For example, instead of asking “how can we increase sales for the next quarter?”, we could ask much more narrowly focused questions:
What impact did the latest 5% pricing discount have on sales for item x?
Which demographic spent more online within 1 week of the big push email campaign?
What items are typically purchased together for customers who also bought item x?
Defining what domain concepts might impact the target feature is even harder
The range of domain concepts is only limited by what you can quantify or label
Prediction subject details:
Descriptive details of any aspect of the prediction subject
Demographics or Cohorts:
Features of users, customers, issuance, or origination, such as age, gender, origination date, occupation, category, race
Usage:
Frequency and recency
Cumulative value
Mix or diversity
Changes in Usage:
The same usage measures, but in change terms
Special Usage:
Extraordinary events
Unusual activity
Increased usage, decreased usage, drop-out
Lifecycle Phase:
Early, middle, late
Network Links:
Relationships between other measures from a structural, social, geo-spatial, or temporal view
Domain subconcept breakdowns can help to make the connection to features in the dataset
Features are the measurable attributes we use for creating the model to predict the target feature
Conceptually mapping Gross Domestic Product (GDP) is hard
From Econ 101, gross domestic product (expenditure approach) is calculated as:
GDP = C + I + G + (X − M)
where C is consumption, I is investment, G is government spending, and X − M is net exports.
Let’s say the question is what will be the % change in next quarter’s GDP?
The target feature: % change in next quarter GDP
Specifics? Adjusted for inflation? Detrended?
Motivation? Are we trying to gauge economic output or production? Are we trying to look at economic growth relative to other countries? Perhaps, other measures are more direct at asking the question?
What features do we have?
Are they up to date or reported with a lag?
Can we build a predictor for each of the inputs into GDP?
How do constant historical revisions of the data impact the analysis?
What conceptually goes into measuring each of the features?
Is there enough data to make a prediction?
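As a small illustration of just the target-feature step, here is one way to construct “% change in next quarter’s GDP” from a quarterly series with pandas; the GDP levels below are hypothetical placeholders:

```python
import pandas as pd

# Hypothetical quarterly GDP levels (billions)
gdp = pd.Series(
    [21098.8, 21340.3, 21542.5, 21729.1],
    index=pd.period_range("2019Q1", periods=4, freq="Q"),
)

# Target feature: % change in *next* quarter's GDP, aligned to the current quarter
target = gdp.pct_change().shift(-1) * 100
print(target)
# The final quarter has no target yet: a reminder that GDP arrives with a lag
# and is constantly revised, so the ABT must respect what was known and when.
```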
Next, we need to transform the raw data into features for the model
Raw
Sensor-based
Digitally tracked
Polled
User-submitted
Statistics
System defined
Etc.
Derived
Aggregates: defined over a group or period usually as: count, sum, average, min, max
Flags: binary features indicating presence or absence of characteristic in data
Ratios: continuous features that capture the relationship between two or more data values.
Mapping: converts continuous features into categorical features to reduce the number of unique values or provide higher level conceptual mapping
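A minimal sketch of all four derived-feature types on hypothetical transaction data (the column names and bin edges are illustrative, not from the slides):

```python
import pandas as pd

# Hypothetical raw transaction data
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "amount":      [25.0, 40.0, 10.0, 15.0, 300.0, 60.0],
    "returned":    [False, False, False, True, False, False],
})

# Aggregates: count / sum / average / min / max over a group
agg = tx.groupby("customer_id")["amount"].agg(["count", "sum", "mean", "min", "max"])

# Flag: did the customer ever return an item?
agg["ever_returned"] = tx.groupby("customer_id")["returned"].any()

# Ratio: relationship between two data values
agg["mean_to_max"] = agg["mean"] / agg["max"]

# Mapping: bin a continuous feature into categories
agg["spend_tier"] = pd.cut(agg["sum"], bins=[0, 50, 100, float("inf")],
                           labels=["low", "mid", "high"])
print(agg)
```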
We need to have an Analytics Base Table (ABT) before we can model anything
The ABT and the Model

Descriptive Feature 1 | Descriptive Feature 2 | … | Descriptive Feature m | Target Feature
Obs 1 | Obs 1 | … | Obs 1 | Categorical target value 1
Obs 2 | Obs 2 | … | Obs 2 | Categorical target value 2
… | … | … | … | …
Obs n | Obs n | … | Obs n | Categorical target value n
The existence of a target feature automatically makes the modeling problem supervised.
The data type of each feature restricts which models can be used.
The dataset characteristics may restrict the resolution of the model, force you to make assumptions, or require modeling for imputation, de-noising, data generation, etc.
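To make the ABT concrete, here is a minimal sketch of assembling descriptive features and a categorical target into one table (all names and values are hypothetical):

```python
import pandas as pd

# Hypothetical descriptive features, one row per prediction subject
features = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "total_spend": [65.0, 325.0, 60.0],
    "ever_returned": [False, True, False],
})

# Hypothetical categorical target -> a supervised classification problem
target = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "churned": ["no", "yes", "no"],
})

# The ABT: one row per subject, m descriptive features plus the target
abt = features.merge(target, on="customer_id", how="inner")
X = abt.drop(columns=["customer_id", "churned"])  # descriptive features
y = abt["churned"]                                # target feature
print(abt)
```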
Understanding and manipulating feature spaces is the key to data analytics
An n-dimensional vector space representation of language produces an incredible ability to perform word-vector arithmetic (e.g., king - man + woman ≈ queen).
Image source: Deep Learning Illustrated by Krohn
The ABT / Feature Space
The ABT/feature space representation is nothing more than an n-dimensional matrix
Modeling methods are just different ways to perform statistical, mathematical, or even heuristic transformations of the matrix to capture patterns and relationships.
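A toy illustration of that matrix view, using made-up 3-dimensional “embeddings” (real word vectors such as word2vec are learned and have hundreds of dimensions):

```python
import numpy as np

# Made-up 3-d vectors chosen only to mimic the famous analogy structure
vec = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

result = vec["king"] - vec["man"] + vec["woman"]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Which vocabulary word is closest to king - man + woman?
best = max(vec, key=lambda w: cosine(vec[w], result))
print(best)  # -> queen
```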
Today’s Objectives
Data and the modeling process
Data pitfalls
Model risk
Simulation Modeling
Data Pitfalls
Epistemic Errors: How we think about data
Assuming data is a perfect reflection of reality
Forming conclusions about future based on historical data only
Seeking to use data to verify a previously held belief rather than test its veracity
Technical Traps: How we process data
Dirty data with mismatched category levels, typos
Units of measurement, date/time misalignments
Merging of disparate data sources, duplication
Mathematical miscues: How we calculate data
Summing at various levels of aggregation
Calculating rates or ratios
Working with proportions and percentages
Dealing with different units
Statistical slipups
Correct distributional assumptions
Sampling issues
Comparative issues
Analytical aberrations
Overfitting
Missing signals
Extrapolation or interpolation issues
Using unnecessary metrics
Graphical issues
Suitable visualization type
Clarity and reasonableness of visualization
Charting errors
Embellishment issues
Communication
Interactivity
Data fallacies to avoid
data-literacy.geckoboard.com
Simpson’s Paradox
Source: Rafael Irizarry
The aggregation of data reverses the trend seen in the individual groups, so the resulting correlation is inverted and counter to both the individual groups and the average of the individual groups.
Simpson’s Paradox
Example: The effectiveness of kidney stone treatments A and B is shown in the following table (the classic figures from Charig et al., 1986):

Stone size | Treatment A | Treatment B
Small | 93% (81/87) | 87% (234/270)
Large | 73% (192/263) | 69% (55/80)
Overall | 78% (273/350) | 83% (289/350)

Treatment A works better for both small and large kidney stones. However, when you aggregate over stone size, treatment B appears to work better overall.
Simpson’s Paradox
How can this be?
It turns out that small stones are considered less serious cases. Treatment A is more invasive than treatment B.
Therefore, doctors are more likely to recommend the inferior treatment, B, for small kidney stones where the patient is more likely to recover successfully in the first place because the case is less severe. For large, serious stones, doctors more often go with the better but more invasive treatment A.
Even though treatment A performs better on both kinds of cases, because it is applied to the more serious cases, the overall recovery rate for treatment A is lower than for treatment B.
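The reversal is easy to reproduce with pandas using the classic figures above; a minimal sketch:

```python
import pandas as pd

# Classic kidney-stone figures (Charig et al., 1986): successes and cases
df = pd.DataFrame({
    "treatment":  ["A", "A", "B", "B"],
    "stone_size": ["small", "large", "small", "large"],
    "recovered":  [81, 192, 234, 55],
    "cases":      [87, 263, 270, 80],
})

# Within each stone size, A beats B...
by_size = df.pivot(index="stone_size", columns="treatment")
print(by_size["recovered"] / by_size["cases"])

# ...but pooled over sizes, B appears better: Simpson's paradox
pooled = df.groupby("treatment")[["recovered", "cases"]].sum()
print(pooled["recovered"] / pooled["cases"])
```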
Domain mapping to data oftentimes requires meta-level thinking
One way to resolve Simpson’s paradox is to employ causal analysis which requires domain knowledge akin to what we just heard.
The size of the kidney stone, which corresponds to the severity of the case, is a confounding variable because it affects both the independent variable (the treatment) and the dependent variable (successful recovery).
To resolve, we need to control for the confounding variable by segmenting the two groups rather than aggregating over them.
[Causal diagram: Size of stone, the confounding variable, has effects on both Treatment selected and Successful Recovery.]
Data is 99% of the work
Avoiding data pitfalls and fallacies is an unending task. It should almost certainly involve having or working with domain expertise, along with a solid understanding of the statistical issues that can arise from leaving data cleansing, transformations, aggregations, and derivations unchecked.
Today’s Objectives
Data and the modeling process
Data pitfalls
Model risk
Simulation Modeling
Typical Sources of Model Risk
Assumptions:
Linearity
Stationarity
Normality
Statistical Bias
Sampling Bias
Over- and under-fitting
Survivorship Bias
Confirmation Bias
Omitted-variable, Confounders
Linearity Assumption
The assumption that the relationship between any two variables can be expressed as a straight-line graph. Linearity is a common assumption buried in many models, because most correlation metrics reflect linearity between two variables or assume something about the distribution of the variables.
Pearson correlation (the most common correlation measure), scaled between -1 and 1, measures the propensity of two random variables to have a linear association.
As we’ll see in regression, the slope of the line of fit is fully determined by Pearson’s correlation coefficient.
If the relationship is non-linear, a linearity assumption will either:
Not detect the relationship, or
Over- or underestimate the relationship.
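A minimal sketch of that failure mode on simulated data: the relationship below is strong but purely quadratic, and Pearson correlation misses it almost entirely:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 1000)
y = x**2 + rng.normal(0, 0.1, 1000)  # strong but non-linear relationship

# Pearson correlation is near zero: the linear lens fails to detect it
print(np.corrcoef(x, y)[0, 1])
```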
Stationarity Assumption
The assumption that a variable, or the distribution from which a random variable is sampled, is constant over time. For many stochastic models, particularly those dealing with volatility and correlation, this is a strong assumption that can lead to completely unrealistic results.
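A minimal simulated illustration: a random walk is the textbook non-stationary series, and its rolling mean drifts rather than settling around a constant, while the differenced series behaves:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
walk = pd.Series(rng.normal(0, 1, 2000)).cumsum()  # random walk (non-stationary)

# The rolling mean of the level drifts; a model assuming a time-constant
# distribution would be misspecified here.
print(walk.rolling(500).mean().iloc[[499, 999, 1499, 1999]])

# The differenced series, by contrast, hovers near a constant mean of 0.
print(walk.diff().rolling(500).mean().iloc[[499, 999, 1499, 1999]])
```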
Normality Assumption
Normal distributions or Gaussian distributions are often used as a matter of convenience.
Sums of any combination of independent normal distributions are also normally distributed, which is part of what makes the normality assumption so convenient.
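A quick simulated check of that closure property (the distribution parameters below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.normal(1.0, 2.0, 100_000)    # N(1, 2^2)
b = rng.normal(-3.0, 1.5, 100_000)   # N(-3, 1.5^2)
s = a + b

# For independent normals, a + b ~ N(mu1 + mu2, sigma1^2 + sigma2^2)
print(s.mean(), s.std())             # ~ -2.0 and ~ 2.5
print(np.sqrt(2.0**2 + 1.5**2))      # theoretical sigma = 2.5
```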