PowerPoint Presentation
ITEC 210 DATA ANALYSIS FOR BUSINESS
Exploring The Relationship
Between Two Variables
Prof. Itir KARAESMEN AYDIN
1
Outline and Learning Outcomes
In this presentation, you will learn
To interpret two-way tables that explore the relationship between two categorical variables
To explore the relationship between a quantitative (numerical) and categorical variable using side-by-side charts and histograms
To interpret scatter diagrams (plots)
To interpret co
elation between two numerical variables
To interpret the trend line on a scatter diagram.
NOTE: This presentation does not show you *how* the work is done on Excel.
2
Topic:
Relationship Between Two Categorical Variables
3
Learning Objective
After completing these examples, you will be able to
interpret two-way tables for two categorical variables
visualize two-way tables using charts
answer business questions or provide business insights with the help of two-way tables.
4
Two-Way Tables
Information displayed in each cell of the table
Frequency or % frequency by column (Variable-1)
Frequency or % frequency by row (Variable-2)
Frequency or % frequency by grand total
5
TABLE Categories of Variable 1
Cat-1 Cat-2 Cat-3 Cat-4 …
Categories of
Variable 2 Cat-A
Cat-B
…
Excel Exercise #4
Use the data from Exercise#2.
6
#4 (cont’d)
Considering all the customers who used a discount code, did the majority use the EStore Card to make a payment? How about all the customers who did not use a discount code?
To answer these questions:
Prepare a two-way table to display the percentage frequencies with respect to method of payment and use of discount code.
Visualize the data using charts.
7
#4 (cont’d)
Two-way table prepared using Pivot Tables
Frequencies are displayed below
8
#4 (cont’d)
Two-way table prepared using Pivot Tables
Frequencies as percentages of the Grand Total are displayed below
9
#4 (cont’d)
Two-way table prepared using Pivot Tables
Frequencies as percentages of the column totals (use of discount code) are displayed below
10
#4 (cont’d)
Side-by-side bar chart created via Pivot Charts
We display the percentage frequency for method of payment considering all the customers who used a discount code (vs. not used).
11
#4 (cont’d)
Q: Considering the customers who used a discount code, did the majority use the EStore Card to make a payment? How about the customers who did not use a discount code?
A:
Of all the customers who used a discount code, 80% used the EStore Card.
Of all the customers who did not use a discount code, 50% used the EStore Card.
12
Excel Exercise #5
Use the data in the previous exercise. Of all the customers who made a purchase within the past 3 months, what percentage used an EStore Card? Is this percentage different for the customers who did not make purchase in the past 3 months?
To answer these questions:
Prepare a two-way table for method of payment and time since last purchase.
Visualize the data using bar charts.
13
#5 (cont’d)
Two-way table prepared using Pivot Tables
Frequencies as percentages of the Grand Total are displayed below
14
#5 (cont’d)
Two-way table prepared using Pivot Tables
Frequencies as percentages of the column total (time since last payment) are displayed below
15
#5 (cont’d)
Side-by-side bar chart created via Pivot Charts
Displays percentage frequency of method of payment by customers who made a purchase in the past 3 months (vs. not)
16
#5 (cont’d)
Q: Of all the customers who made a purchase within the past 3 months, what percentage used an EStore Card? Is this percentage different for the customers who did not make purchase in the past 3 months?
A:
Of all the customers who made a purchase within the past 3 months, 74% used the EStore card.
Of all the customers who did not make a purchase within the past 3 months, 68% used the EStore card.
17
Summary
Two-way tables and side-by-side bar charts make comparison across categories easy.
However:
Just because we observe that the percentage of customers who used the EStore Card is different for two distinct groups of customers (discount code users vs. not) does not mean this difference is statistically significant.
Statistical tests (e.g. hypothesis tests) can be used to test for statistical significance of the differences.
18
Topic:
Relationship Between
A Categorical Variable
and
A Quantitative Variable
19
Learning Objective
After completing these examples, you will be able to
interpret tables that display information about a quantitative variable when data is grouped with respect to another categorical variable
compare histograms of a quantitative variable for different categories
answer business questions or provide business insights with the help of tables and charts.
20
Excel Exercise #6
Consider the online retailer data used in Exercises #2 to #5. The data has been expanded to include the transaction value ($ sales value) and the age of each customer.
21
#6 (cont’d)
On the average, did the customers who use a discount code spend more or less than the customers who did not use a discount code?
To answer this question:
Prepare a table to display the average $ amount spent per customer in each category (i.e. those that use discount code vs. not).
Visualize the data using bar charts.
22
#6 (cont’d)
Pivot Table that displays the average $ sales value for customers who used a discount code vs. not
23
Rows: Use of discount code
Column: Transaction value
Average transaction value
#6 (cont’d)
A bar chart created via Pivot Charts
24
#6 (cont’d)
Q: On the average, did the customers who use a discount code spend more or less than the customers who did not use a discount code?
A: The average amount spent by customers who used a discount code ($113) is lower than the average amount spent by those who did not use a discount code ($120).
Note: This difference may or may not statistically significant.
25
Excel Exercise #7
Use the data provided for Exercise#6. Is the transaction value distribution similar for customers who use the discount code vs. not?
To answer this question:
Prepare a table to display the statistical distribution of the $ amount spent for customers how used the discount code (vs. not).
Visualize the data using bar charts.
26
#7 (cont’d)
Pivot Table that displays the no. of customers who used a discount code (vs. not), grouped by the amount spent by customer.
Amount spent is grouped by $25 increments
27
#7 (cont’d)
Side-by-side bar chart is created using Pivot Charts
28
#7 (cont’d)
Histogram of amount spent for two separate categories, created using Pivot Charts
29
#7 (cont’d)
Q: Is the transaction value distribution similar for customers who use the discount code vs. not?
A: Both distributions are skewed to the right; there are a few customers who spent >$350 regardless of the discount code.
The bin ranges $50-$74 and $75-$99 have the highest frequency for both groups of customers.
The number of customers who did not use a discount code declines sharply beyond $99.
In contrast, there is a sizable portion of customers who used a discount code that spent more than $99.
30
Summary
We can compare quantitative data across categories by
Creating tables
Creating side-by-side charts
Creating separate histograms for each category
Pivot tables and pivot charts are quick ways to organize data
31
Topic:
Relationship Between Two Quantitative Variables
32
Learning Objective
After completing these examples, you will be able to
interpret co
elation between two quantitative variables
interpret scatter plots (diagrams)
interpret trend lines on scatter plots.
33
Review: Co
elation
Co
elation: measures the strength of the linear relationship between two quantitative variables
Co
elation takes values between -1 and +1.
34
Co
elation = -1
Co
elation = 0
Co
elation = +1
Types of Relationships
Y
X
Y
X
Y
Y
X
X
Linear relationships
Curvilinear relationships
Types of Linear Relationships
Y
X
Y
X
Strong
Y
Y
X
X
Weak
Y
X
Y
X
No relationship
Review: Co
elation
If co
elation value is close to 0, does this mean there is no apparent relationship between the variables?
Example: Co
elation between X and Y variables is 0.000 for the data below.
37
Review: Co
elation
If co
elation value is close to +1 (or -1), does this mean the relationship between the variables is perfectly linear?
Example: Co
elation is +0.90 for the data below.
38
Excel Exercise #8
Use the data provided for Exercise#6. What is the co
elation between the amount spent by a customer and the customer’s age?
Use the CORREL function in Excel.
Format: CORREL(Range1, Range2)
Range1: cell range for the amount spent
Range2: cell range for the age of custome
39
#8 (cont’d)
Co
elation between the amount spent by a customer and the customer’s age is +0.667.
There is a positive relationship between these two variables.
40
#8 (cont’d)
Wrong interpretation: “The more people spend, the older they get.”
This statement says that spending more at the retail shop is causing people to age.
Better interpretation: “The older customers in this data set tend to spend more and the younger customers tend to spend less.”
41
#8 (cont’d)
Possible reasons for a positive relationship between age and transaction value:
Older people have higher income and have higher discretionary budget to spend.
The retailer sells products that are demanded more by older people (e.g. medical devices).
42
Summary
Zero co
elation does not mean one variable is not related to another.
Co
elation is not causation.
When X and Y are co
elated, the changes in Y may not be “caused” or “induced” by changes in X but by outside factors that are related to both variables.
CORREL function in Excel computes the co
elation.
43
Excel Exercise #9
Use the data provided for Exercise#8. Explore the relationship between the amount spent by a customer and the customer’s age by preparing a scatter plot (scatter diagram).
Use the chart functions in Excel to prepare a scatter plot. Age should be on the horizontal axis. Transaction value should be on the vertical axis.
44
#9 (cont’d)
Scatter plot of Transaction Value and Age
45
#9 (cont’d)
Observations
Older customers tend to spend more.
Customers who are 35 or younger spent between $50 and $100.
Only the customers who are older than 35 spent more than $100.
The transactions with the highest $ value were made by customers between the ages of 35 and 45.
The relationship between age and transaction value is positive and linear for customers 35 or younger.
The relationship between age and transaction value does not appear to be linear for customers 35 and older.
46
Summary
A scatter plot can be used to explore the relationship between two quantitative variables.
Trends and patterns can be detected visually on a scatter plot.
The scatter plot in the Excel chart function plots two quantitative variables.
47
Trend Line
Trend line is a straight line
It is displayed on the scatter plot
The trend line equation is
Y = b0 + b1 X
where
X: variable displayed on the horizontal axis of the scatter plot
0: intercept of the line (the value Y takes