ITEC 210 DATA ANALYSIS FOR BUSINESS
Analyzing Quantitative Variables
Prof. Itir KARAESMEN AYDIN
Outline and Learning Outcomes
In this presentation, you will learn
To build simple linear regression models.
To interpret the statistical output of a linear regression model.
To make predictions based on simple linear regression models (trend lines).
To define and interpret summary statistics (descriptive measures) for quantitative variables.
NOTE: This presentation does not show you *how* the work is done on Excel.
Scatter Plots and Trend Lines
Trend Line
A trend line is a straight line displayed on the scatter plot.
The trend line equation is
Y = b0 + b1 X
where
X: variable displayed on the horizontal axis of the scatter plot
b0: intercept of the line (the value Y takes when X=0).
b1: slope of the line (i.e., every 1-unit change in X results in b1 units of change in Y); the slope can be positive or negative.
Y: variable displayed on the vertical axis of the scatter plot
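The trend line equation can be evaluated directly once b0 and b1 are known. A minimal Python sketch (the coefficient values below are made up for illustration, not from any data set):

```python
# Evaluating the trend line Y = b0 + b1 * X for a few X values.
# The coefficient values below are hypothetical, for illustration only.
b0 = 50.0  # intercept: the value Y takes when X = 0
b1 = 2.5   # slope: change in Y per 1-unit increase in X

def trend(x):
    """Return the trend-line value of Y for a given X."""
    return b0 + b1 * x

print(trend(0))   # 50.0 (the intercept)
print(trend(10))  # 50.0 + 2.5 * 10 = 75.0
```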
Fitting a Trend Line to the Data
A
B
C
Which of these lines “fits best” to the data?
Simple Linear Regression
Learning Objectives
In this presentation, you will learn
How to use regression analysis to predict the value of a dependent variable based on an independent variable
The meaning of the regression coefficients b0 and b1
How to judge the goodness of fit
How to make inferences about the slope
Exploring the Relationship Between Two Quantitative Variables
A scatter plot shows the relationship between two variables
Correlation measures the strength of the linear relationship between two variables
Regression analysis is used to:
Predict the value of a dependent variable based on the value of at least one independent variable
Explain the impact of changes in an independent variable on the dependent variable
Dependent vs. Independent Variables in Regression
Dependent variable (or outcome variable): the variable we intend to predict or explain
Independent variable (or predictor): the variable we use to predict or explain the dependent variable
Simple Linear Regression Model
In a simple linear regression model
There is only one independent variable, X
The dependent variable Y is described by a linear function of X
The changes in Y are assumed to be related to changes in X
Linear component:
Yi = β0 + β1 Xi + εi
where
β0: population Y intercept
β1: population slope coefficient
εi: random error term
Yi: dependent variable
Xi: independent variable
[Figure: scatter plot of Y vs. X showing the population regression line with intercept β0 and slope β1; the observed value Yi at Xi deviates from the line by the random error εi.]
Simple Linear Regression Equation (Prediction Line)
The simple linear regression equation provides an estimate of the population regression line.
Population regression line: Yi = β0 + β1 Xi + εi
Prediction line (regression equation): Ŷi = b0 + b1 Xi
where
Ŷi: estimated (or predicted) Y value for observation i
b0: estimate of the regression intercept
b1: estimate of the regression slope
Xi: value of X for observation i
b0 is the estimated average value of Y when the value of X is zero.
b1 is the estimated change in the average value of Y as a result of a one-unit increase in X.
Regression Coefficients
The Least Squares Method
b0 and b1 are obtained by finding the values that minimize the sum of the squared differences between the observed Yi and the predicted Ŷi:
min Σ (Yi − Ŷi)²
The vertical distance Yi − Ŷi is the prediction error for observation i.
[Figure: scatter plot of Y vs. X showing the prediction line with intercept b0 and slope b1, and the prediction error between the observed Yi and the predicted Ŷi at Xi.]
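The least-squares estimates have a closed form: b1 is the sum of cross-deviations divided by the sum of squared X-deviations, and b0 makes the line pass through the point of means. A sketch with made-up data (not the Excel output from the exercise later in this deck):

```python
# Least-squares estimates for simple linear regression, computed from the
# closed-form formulas. The data below are made up for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar  # the fitted line passes through (x_bar, y_bar)

print(b1, b0)  # roughly b1 ≈ 1.96, b0 ≈ 0.14 for this data
```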
Goodness of Fit
How well does the estimated regression line fit the data?
We investigate the variation in regression to answer this.
Total sum of squares: SST = Σ (Yi − Ȳ)²
Regression sum of squares: SSR = Σ (Ŷi − Ȳ)²
Error sum of squares: SSE = Σ (Yi − Ŷi)²
SST = SSR + SSE
where:
Ȳ = mean value of the dependent variable
Yi = observed value of the dependent variable
Ŷi = predicted value of Y for the given Xi value
A measure of goodness of fit is the coefficient of determination, a.k.a., the R-squared value (R2).
R2 is the portion of the total variation in the dependent variable that is explained by variation in the independent variable: R2 = SSR / SST.
The higher the R2 value, the better the fit.
R-squared is NOT equal to the correlation between the dependent and independent variables.
In simple linear regression, R-squared is equal to the square of the correlation.
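The decomposition SST = SSR + SSE and the identity R² = (correlation)² can be checked numerically. A sketch with made-up illustrative data:

```python
import math

# R-squared from the sums of squares, and a check that in simple linear
# regression R^2 equals the squared correlation. Data are made up.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * x for x in xs]                     # predicted values

sst = sum((y - y_bar) ** 2 for y in ys)               # total sum of squares
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # regression sum of squares
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # error sum of squares

r_squared = ssr / sst
corr = sxy / math.sqrt(sxx * sst)  # sample correlation between X and Y

print(round(r_squared, 4))  # close to 1 for this nearly linear data
```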
Goodness of Fit (cont’d)
Chap 13-20
Inferences About the Slope
Questions
Is there a linear relationship between X and Y?
Could the slope of the regression line be 0?
Hypothesis Test
H0: β1 = 0 (the null hypothesis: slope=0)
H1: β1 ≠ 0 (the alternative hypothesis: slope ≠0)
Inferences About the Slope (cont’d)
Conducting the Hypothesis Test:
Obtain the p-value for the slope coefficient from the regression output.
Compare the p-value to a given significance level, α. Typical choices are α = 0.01, 0.05, 0.10.
Conclude:
Reject H0 if p-value < α.
Fail to reject H0 if p-value ≥ α.
Interpret the results and conclusions.
Inferences About the Slope (cont’d)
Interpretation of the hypothesis test results
Reject H0: There is enough statistical evidence to support the claim that the slope is not zero. We can say there is a linear relationship between X and Y. The strength of the linear relationship can be evaluated separately by computing the correlation between X and Y.
Fail to reject H0: Statistical evidence supports the claim that the slope is zero. We cannot say there is a linear relationship between X and Y. You can separately compute the correlation between X and Y to verify that the linear relationship is very weak or nonexistent (i.e., the correlation value should be close to 0).
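The test statistic behind the p-value in the regression output is t = b1 / SE(b1). A minimal sketch on made-up data, comparing the t-statistic to a tabulated critical value instead of a p-value (Excel's output reports the p-value directly):

```python
import math

# Hypothesis test H0: beta1 = 0 vs. H1: beta1 != 0, via the t-statistic
# for the slope. The data below are made up for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(sse / (n - 2))  # standard error of the estimate
se_b1 = s / math.sqrt(sxx)    # standard error of the slope b1
t_stat = b1 / se_b1

# Two-tailed critical value of the t distribution for alpha = 0.05
# with n - 2 = 3 degrees of freedom (from a t table).
t_crit = 3.182
reject_h0 = abs(t_stat) > t_crit
print(round(t_stat, 2), reject_h0)
```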
Steps of Regression Analysis
Prepare a scatter plot and add a trendline in Excel.
Obtain the regression output in Excel.
Use the regression output to:
write down the predicted regression line,
make predictions, i.e., compute the predicted value of the dependent variable for a given value of the independent variable,
discuss goodness of fit of the regression line,
discuss the existence (or lack of) a linear relationship between the dependent and independent variables,
discuss how reliable the predictions are based on the regression analysis.
Excel Exercise#11
You are asked to examine the relationship between the size (square feet) of a house and its sales price in a real estate market.
A random sample of 20 houses is selected
Dependent variable (Y) = house price in $1000s
Independent variable (X) = square feet
#11 (cont’d)
Snapshot of Data
#11 (cont’d)
Questions:
Prepare a scatter plot and fit a trend line to the data.
#11 (cont’d)
Questions:
2. Perform regression analysis using the Data Analysis Toolpak in Excel.
#11 (cont’d)
Questions:
3. What is the predicted regression line equation?
Trend line: Y = XXXXXXXXXX * X
From the Regression Output in Excel:
Predicted sales price (in $1000s)
= XXXXXXXXXX × Size of the House in Sq.Ft.
#11 - Regression Line Equation
Y = XXXXXXXXXXX
Predicted sales price (in $1000s)
= XXXXXXXXXX × Size of the House in Sq.Ft.
#11 (cont’d)
Questions:
4. What is the practical interpretation of the intercept of the regression equation in this example?
Y = XXXXXXXXXX when X=0, but house size will never be zero. Therefore, the intercept has no practical interpretation.
#11 (cont’d)
Questions:
5. What is the practical interpretation of the slope of the regression equation in this example?
The sales price increases by 0.1303x$1000 = $1303 for every 1 sq.ft increase in the size of the house.
#11 (cont’d)
Questions:
6. What is the predicted price for a house that is 1900 square feet?
House price = XXXXXXXXXX + 0.1303 × 1900
= $XXXXXXXXXX (in $1000s) = $338,838.70
#11 (cont’d)
Questions:
7. What is the value of the coefficient of determination?
Coefficient of determination = R-squared = 0.5992.
#11 – R-squared
R-squared
= SSR / SST
= XXXXXXXXXX / XXXXXXXXXX
= 0.5992
#11 (cont’d)
Interpretation of the coefficient of determination:
How good is the fit of the regression line to the data?
The house prices in our data set are not constant and vary from a minimum of $209,000 to a maximum of $498,000.
In this example, 59.9% of the variation in house prices is explained by the size of the house.
The size of a house is a good but not a perfect predictor of the price of a house. There must be other factors or variables that affect and determine the price of a house.
#11 (cont’d)
Questions:
8. Is there really a linear relationship between the size of the house and the house price? Conduct a hypothesis test on the slope coefficient of the regression line.
H0: β1 = 0
vs. H1: β1 ≠ 0
#11 – Inference about the slope
H0: β1 = 0
H1: β1 ≠ 0
P-value of the slope: p-value = 6.19422E-05 = XXXXXXXXXX.
Significance level: typical choices are α = 0.01, 0.05, 0.10. The p-value is smaller than any of these significance levels.
Decision: Reject H0, since p-value < α.
Conclusion: There is sufficient evidence that the size of the house affects the house price, i.e., the regression slope is not zero. We can claim that there is a linear relationship between these two variables. The strength of the linear relationship can be verified by separately calculating the correlation between these two variables.
Hypothesis test of the regression line slope:
#11 (cont’d)
Questions:
9. If we were to use the regression output to predict the sales price of a house that is 10000 square feet, how reliable would our prediction be?
Excel Exercise#11 (cont’d)
Answers:
9. Prediction: Y = XXXXXXXXXX + 0.1303 × 10000 = $XXXXXXXXXX (in $1000s) = $1,394,268.70
Check the following to assess the reliability of the prediction:
Goodness of fit: The R-squared value is XXXXXXXXXX. This is a good fit.
Inference on the slope: Reject H0. There is statistical evidence for a linear relationship between the size of a house and its price.
Risk of extrapolation: The given house size (XXXXXXXXXX sq.ft.) is beyond the range of values used in the regression analysis. This means we need to extrapolate.
The predicted value is not reliable because of extrapolation.
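The extrapolation check above can be automated: flag any prediction whose X value lies outside the range of the data used to fit the regression. A sketch in Python (the house sizes below are hypothetical stand-ins for the sample of 20 houses):

```python
# A simple guard against extrapolation: flag predictions whose X value lies
# outside the range of the fitting data. These sizes are hypothetical.
sizes = [1200, 1450, 1600, 1800, 2100, 2350]

def is_extrapolation(x_new, xs):
    """Return True if x_new lies outside [min(xs), max(xs)]."""
    return not (min(xs) <= x_new <= max(xs))

print(is_extrapolation(1900, sizes))   # within the sample range -> False
print(is_extrapolation(10000, sizes))  # far beyond the largest house -> True
```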
CAUTION
There is no "causation": regression only describes an association; it does not show that changes in the independent variable cause the values of the dependent variable to change.
You should build and interpret the model with the knowledge of the subject matter.
Do not extrapolate beyond the range of values used in the regression analysis.
You must ensure that the assumptions underlying least-squares regression are satisfied (beyond the scope of this presentation).