Modeling Methods, Deploying, and Refining Predictive Models
UCI Spring 2020
I&C X425.34 Modeling Methods, Deploying, and Refining Predictive Models
Module: Error-based Modeling
Schedule
Introduction and Overview
Data and Modeling + Simulation Modeling
Error-based Modeling
Probability-based Modeling
Similarity-based Modeling
Information-based Modeling
Time-series Modeling
Deployment
At the end of this module:
You will learn how regression models work, in order to predict continuous values and classifications.
Today’s Objectives
Error-based models
Multiple regression
Simple linear regression
Logistic regression
Regression is a hammer
Error-based Modeling
Regression
Linear Regression
Logistic Regression
Supervised Methods
Error-based
Instance-based
Information-based
Probability-based
Neural networks and deep learning-based methods
Ensembles
Practical Approach to Learning Machine Learning
What does each method do?
What type of analytics can I apply it to? {descriptive, diagnostic, predictive, prescriptive}
What type of data can it input and output? {categorical, continuous, discrete, probabilistic, etc.}
How does the method work?
Assumptions
How do I use the model in practice?
Code, off-the-shelf solutions, etc.
Advantages/Disadvantages
Conceptual strengths and weaknesses
Applications
Model Evaluation
Deployment and Integration
Inputs/Outputs
Basic mathematical foundation or pseudocode
Ease of interpretability, monitoring and deployment
Error-based Methods
Error-based methods like regression model the relationship between variables, whether continuous or categorical, using a measure of the error in the model's predictions to determine the optimal relationship.
Regression methods are a workhorse of statistics and are regularly used as a baseline in machine learning. The terminology can be confusing because "regression" can refer both to a class of problem and to a class of algorithm; fundamentally, regression is a process.
The most popular regression algorithms are:
Ordinary Least Squares Regression (OLSR)
Linear Regression
Logistic Regression
Stepwise Regression
Multivariate Adaptive Regression Splines (MARS)
Locally Estimated Scatterplot Smoothing (LOESS)
Simulation versus Machine Learning
Our simulation model examples did not make use of all the information they had when devising a method for forecasting. We had to supply all the parameters and logic for the system; nothing in the algorithm produced its own parameters or logic.
The system never learned from the data, since we never gave the model a way to discover increasingly better parameters.
What we need is a way to optimize. All of machine learning revolves around optimization.
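As an illustration of that idea, here is a minimal sketch (made-up data and an arbitrary learning rate, not part of the course materials) of letting an algorithm discover its own parameters by gradient descent on the squared error:

```python
# Minimal sketch (made-up data, arbitrary learning rate) of letting an
# algorithm "learn" its own parameters: gradient descent on squared error.

def fit_line_gd(xs, ys, lr=0.01, steps=5000):
    m, b = 0.0, 0.0                          # start with arbitrary parameters
    n = len(xs)
    for _ in range(steps):
        # gradients of mean squared error with respect to m and b
        grad_m = sum(2 * (m * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (m * x + b - y) for x, y in zip(xs, ys)) / n
        m -= lr * grad_m                     # step downhill on the error surface
        b -= lr * grad_b
    return m, b

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]                        # generated from y = 2x + 1
m, b = fit_line_gd(xs, ys)                   # m ≈ 2, b ≈ 1
```

Unlike the simulation examples, nothing here is hand-tuned: the loop adjusts m and b purely from the data by repeatedly reducing the prediction error.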
If we compare simulation to regression
Types of analytics
  Simulation: predictive modeling for discrete, continuous, and probabilistic inputs and outputs
  Regression: predictive modeling for discrete, continuous, categorical, and probabilistic outputs
How it works
  Simulation: simulating the system's logic
  Regression: error minimization
Applications
  Simulation: complex systems, typically stochastic processes, Monte Carlo simulations, and probabilistic forecasting
  Regression: linear and non-linear modeling of discrete and continuous values for forecasting and classification
Advantages/Disadvantages
  Simulation: relatively robust and easy to understand; able to model complex systems; can become overly complex, with declining performance relative to simpler methods
  Regression: easy to understand and able to incorporate many features; flexible enough to handle most prediction and classification cases; however, the model is sensitive to the data
Machine Learning - Regressions
How does it work?
Regressions minimize the error between the predicted target value and the actual target value.
These types of methods can handle categorical and continuous-valued data in both the input and the output.
The ABT and the Model
Descriptive Feature 1 | Descriptive Feature 2 | … | Descriptive Feature m | Target Feature
Obs 1                 | Obs 1                 | … | Categorical           | Target value 1
Obs 2                 | Obs 2                 | … | Categorical           | Target value 2
…                     | …                     | … | …                     | …
Obs n-2               | Obs n-2               | … | Categorical           | …
Obs n-1               | Obs n-1               | … | Categorical           | …
Obs n                 | Obs n                 | … | Categorical           | Target value n
The existence of a target feature automatically makes the modeling problem supervised.
The data types of the features restrict which models can be used.
The dataset characteristics may restrict the resolution of the model, force you to make assumptions, or require modeling for imputation, de-noising, data generation, etc.
The ABT for a regression model
Descriptive Feature 1 | Descriptive Feature 2 | … | Descriptive Feature m | Target Feature
Obs 1                 | Obs 1                 | … | Categorical           | Target value 1
Obs 2                 | Obs 2                 | … | Categorical           | Target value 2
…                     | …                     | … | …                     | …
Obs n-2               | Obs n-2               | … | Categorical           | …
Obs n-1               | Obs n-1               | … | Categorical           | …
Obs n                 | Obs n                 | … | Categorical           | Target value n
Today’s Objectives
Error-based models
Multiple regression
Simple linear regression
Logistic regression
We are all familiar with linear regression on an intuitive level
Simply by looking at a graph of the predictor X and the target value Y, we can usually guess if there is a linear relation.
Mathematically, we can start with the equation of a line
Recall that the equation of a line can be written as Y = mX + b, where
m is the slope of the line, and
b is the y-intercept of the line (where the line meets the vertical axis when X = 0).
Defined by the slope and intercept
Linear regression finds values for m and b such that we then have an estimate of the target Y for any value of X.
And minimize the prediction error
We would like m and b to give us the smallest expected difference between the predicted value and the actual value.
Defined by the sum of squared errors
For simple linear regression, the most commonly used objective is the sum of squared errors:

    SSE = Σᵢ (yᵢ − ŷᵢ)²
To find the parameters for a simple linear regression
Given a set of points (x, y) on a scatterplot, find the optimal line ŷ = mx + b such that the sum of squared errors is minimized.
The statistical solution uses correlation
Recall that the correlation between two variables X and Y is:

    r = Cov(X, Y) / (s_X · s_Y)

It can be shown that the sum of squared errors is minimized when:

    m = Cov(X, Y) / Var(X),  b = ȳ − m·x̄

This is related to the correlation by:

    m = r · (s_Y / s_X)
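A small Python sketch of these closed-form formulas, using illustrative numbers (the data points below are made up):

```python
# Sketch of the closed-form solution above on illustrative numbers:
# m = Cov(X, Y) / Var(X), the line passes through the means, and
# equivalently m = r * (s_Y / s_X).
from statistics import mean, stdev

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

xbar, ybar = mean(xs), mean(ys)
cov = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (len(xs) - 1)
m = cov / stdev(xs) ** 2                  # slope = Cov(X, Y) / Var(X)
b = ybar - m * xbar                       # line passes through (x̄, ȳ)

r = cov / (stdev(xs) * stdev(ys))         # Pearson correlation
m_from_r = r * stdev(ys) / stdev(xs)      # same slope, via correlation
```

Both routes give the same slope, since r · (s_Y / s_X) algebraically reduces to Cov(X, Y) / Var(X).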
Correlation is at the heart of many models
Correlation implies some sort of dependence between the variables X and Y, even though it does not imply causation.
Even a correlation of exactly 1 does not by itself establish causality; it only indicates a perfect linear relationship.
CAPM, a benchmark model in finance
The Capital Asset Pricing Model (CAPM) is a famous linear model
Created by Nobel Prize winner in Economics William Sharpe
Estimates the return of an asset based on the return of the market and the asset’s linear relationship to the return of the market.
The linear relationship of an asset to the market is the “beta” coefficient.
CAPM “is the centerpiece of MBA investment courses. Indeed, it is often the only asset pricing model taught in these courses…unfortunately, the empirical record of the model is poor.” - Fama and French
CAPM is a linear model
More formally:

    E[r_a] = r_f + β · (E[r_m] − r_f)

where r_a is the asset return, r_m is the market return, and r_f is the risk-free rate.
But β is just a slope m, which is proportional to the correlation between the asset/portfolio and the market.
CAPM optimal parameters
Given a set of asset returns and market returns on a scatterplot, find the best-fit line such that the sum of squared errors is minimized. So:

    β = Cov(r_a, r_m) / Var(r_m)
Beta measures how much the asset will change when the market changes
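As a hedged sketch of that estimate, beta is just the regression slope of asset returns on market returns (the return series below are made up for illustration, not real market data):

```python
# Estimating CAPM beta as the regression slope of asset returns on
# market returns: beta = Cov(r_a, r_m) / Var(r_m).
# Illustrative, made-up return series — not real market data.
from statistics import mean

market = [0.01, -0.02, 0.03, 0.015, -0.01]
asset  = [0.02, -0.03, 0.05, 0.025, -0.02]

def capm_beta(asset_r, market_r):
    ma, mm = mean(asset_r), mean(market_r)
    cov = sum((a - m) * 0 + (a - ma) * (m - mm) for a, m in zip(asset_r, market_r))
    var = sum((m - mm) ** 2 for m in market_r)
    return cov / var                      # the (n - 1) factors cancel

beta = capm_beta(asset, market)           # > 1: moves more than the market
```

A beta above 1 means the asset tends to amplify market moves; below 1, it dampens them.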
A more general way to think about regression
Think about averages: regression essentially finds the line giving the average value of y for any input x.
In terms of statistics
Regression can be thought of as a conditional mean.
Which relates back to our linear regression
Regression can be thought of as a conditional mean. We can relate the mean back to our linear equation:

    E[Y | X] = mX + b
But, isn’t there some variance?
We have a conditional mean E[Y | X] = mX + b, but Y would have some variance around that mean.
Putting it back into the equation:

    Y = mX + b + ε,  with ε ~ N(0, σ)

normally distributed errors with mean 0 and standard deviation σ.
Our first model check
With ε ~ N(0, σ): normally distributed errors with mean 0 and standard deviation σ.
Check whether our assumptions are correct by testing the error distribution via t-tests, p-values, plotting, etc. These errors are often called the residuals, and we assume they are independently and identically distributed (i.i.d.) with a normal distribution.
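A minimal numeric version of this check on illustrative data (a real check would also plot the residuals and apply formal normality tests):

```python
# Minimal residual check on illustrative data. With an intercept in the
# model, OLS residuals always sum to exactly zero; real checks would
# also plot them and test for normality.
from statistics import mean

xs = [1, 2, 3, 4, 5, 6]
ys = [2.2, 3.9, 6.1, 8.2, 9.8, 12.1]

xbar, ybar = mean(xs), mean(ys)
m = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
    / sum((x - xbar) ** 2 for x in xs)
b = ybar - m * xbar

residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
```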
And we can score the model by:
The error differences:
Mean Absolute Error (MAE)
Mean Squared Error (MSE)
Root Mean Squared Error (RMSE)
Amount of variance explained by the model
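The three error scores can be sketched in a few lines of Python (the predicted and actual values below are illustrative; every prediction is off by exactly 0.5):

```python
# MAE, MSE, and RMSE computed on illustrative predicted vs. actual
# values (every prediction here is off by exactly 0.5).
import math

actual    = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.5, 9.5]

n = len(actual)
mae  = sum(abs(a - p) for a, p in zip(actual, predicted)) / n    # 0.5
mse  = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n  # 0.25
rmse = math.sqrt(mse)                                            # 0.5
```

MSE punishes large errors more heavily than MAE; RMSE rescales MSE back into the target's units.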
Variance explainability
SS_tot: the variance of the data
SS_reg: the variance explained by the model
R² = SS_reg / SS_tot is the amount of variance the model explains (equivalently, R² = 1 − SS_res / SS_tot).
The better the fit, the more variance the model accounts for. The closer R² is to 1, the better the model.
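A short Python sketch of R² on illustrative values:

```python
# R² = 1 - SS_res / SS_tot: the share of the target's variance that the
# model's predictions account for. Values are illustrative.
from statistics import mean

actual    = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.5, 9.5]

ybar = mean(actual)
ss_tot = sum((y - ybar) ** 2 for y in actual)                  # total variation
ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))  # unexplained
r2 = 1 - ss_res / ss_tot                                       # 0.95 here
```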
General structure of modeling
Data
Training Set
Test Set
Model Development
Model Evaluation
Performance measures:
Accuracy
Precision
Recall
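The workflow above, sketched with synthetic data (the seed, noise level, and 75/25 split are arbitrary choices; accuracy, precision, and recall apply to classification, so this regression sketch scores with test-set MSE instead):

```python
# Train/test workflow sketch: hold out part of the data, fit on the
# training set only, then score on the unseen test set.
# Synthetic data, seed, and 75/25 split are illustrative choices.
import random
from statistics import mean

random.seed(0)
data = [(x, 2 * x + 1 + random.gauss(0, 0.5)) for x in range(40)]

random.shuffle(data)
split = int(0.75 * len(data))
train, test = data[:split], data[split:]

# fit OLS on the training set only
xbar = mean(x for x, _ in train)
ybar = mean(y for _, y in train)
m = sum((x - xbar) * (y - ybar) for x, y in train) \
    / sum((x - xbar) ** 2 for x, _ in train)
b = ybar - m * xbar

# evaluate on the held-out test set
test_mse = mean((y - (m * x + b)) ** 2 for x, y in test)
```

Scoring on held-out data is what tells us whether the model generalizes rather than merely memorizing the training set.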
Simple Linear Regression
Demo
Very easy to fall into traps…
From XKCD
All the assumptions
The assumptions that must be met for linear regression to be valid depend on the purposes for which it will be used. Any application of linear regression makes two assumptions:
(A) The data used in fitting the model are representative of the population.
(B) The true underlying relationship between X and Y is linear.
All you need to assume to predict Y from X are (A) and (B). To estimate the standard error of the prediction, you must also assume that:
(C) The variance of the residuals is constant (homoscedastic, not heteroscedastic).
For linear regression to provide the best linear unbiased estimator of the true Y, (A) through (C) must be true, and you must also assume that:
(D) The residuals must be independent.
To make probabilistic statements, such as hypothesis tests involving b or r, or to construct confidence intervals, (A) through (D) must be true, and you must also assume that:
(E) The residuals are normally distributed.
Regression is a hammer
Linear regression does not assume anything about the distributions of either X or Y; it only makes assumptions about the distribution of the residuals. As with many other statistical techniques, it is not necessary for the data themselves to be normally distributed, only for the errors (residuals) to be normally distributed.
And this is only required for the statistical significance tests (and other probabilistic statements) to be valid; regression can be applied for many other purposes even if the errors are non-normally distributed.
Steps for Simple Regression Models
Plot and examine the data
Transform X and Y
Calculate the linear regression statistics. By hand that would be:
Calculate the slope: m = Cov(X, Y) / Var(X)
Calculate the intercept: b = ȳ − m·x̄
Examine the regression slope and intercept
Examine the residuals plot, that is
Plot the residuals versus X
If the residuals increase or decrease with X, they are heteroscedastic. Transform Y to cure this.
If the residuals are curved with X, the relationship between X and Y is nonlinear. Either transform X, or fit a nonlinear curve to the data.
If there are outliers, check their validity, and/or use robust regression techniques.
Plot the residuals versus the predicted values Ŷ and check for the same patterns as with X above
Plot the residuals against every other possible explanatory variable in the data set
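As a rough numeric stand-in for these residual plots (on illustrative data), one can compare the residual spread at small X versus large X; a ratio far from 1 hints at heteroscedasticity and suggests transforming Y:

```python
# Numeric stand-in for a residuals-vs-X plot on illustrative data:
# compare residual spread at small X vs. large X. A ratio far from 1
# would hint at heteroscedasticity.
from statistics import mean, pstdev

xs = list(range(1, 21))
ys = [2 * x + 1 + ((-1) ** x) * 0.3 for x in xs]   # small, even noise

xbar, ybar = mean(xs), mean(ys)
m = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
    / sum((x - xbar) ** 2 for x in xs)
b = ybar - m * xbar
res = [y - (m * x + b) for x, y in zip(xs, ys)]

half = len(xs) // 2
spread_ratio = pstdev(res[half:]) / pstdev(res[:half])   # ≈ 1: homoscedastic
```

An actual scatterplot of the residuals remains the better diagnostic; this numeric check only catches the grossest violations.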