Lab 8. Chi Squares
Lab 8. Chi Squares
Introduction
Today you will be running both Goodness-of-Fit Chi Squared tests and Chi Squared Tests for Independence. You’re going to use these to evaluate a few fake data questions, and then do a walk through with real data to determine whether Tucson or Phoenix pilots are more efficient at using their giant planes to murder different species of birds - as well as their overall overall bird murder rates. Or, conversely, you’ll be testing which city has birds that are best at flying into planes - it could be their fault I suppose.
The real data is not in a contingency table, so you will be learning some tricks to make it into the right format. Then as your final portion of the lab, you’ll be picking two (or more) new airports to look at and cross-compare.
Note: I got a bit ca
ied away with synonyms and slang for dia
hea. So if you don’t understand a term… it’s probably a slang term for dia
hea. This will make more sense shortly.
Learning Outcomes
By the end of today’s class you should be able to do the following in R:
· Use the chisq.test() function to calculate goodness of fit
· Use the chisq.test() function to determine if data independence
· Interpret Chi Square p values and make recommendations
· Use dplyr and reshape2 to create contingency tables
· Use ggplot2 to make a filled barplot
· Use mutate() to make new columns
· Use filter() to, well, filte
· Use select() to pick different columns or drop them
· Use dcast() to make your data into wider format
Part 1: Revisiting Chi Square Test Types
Before we begin looking at how to run these, let’s make sure you can tell the difference between a goodness of fit test and a test of independence. If you are unsure on the answers to any of the following questions, you might want to revisit the lecture before you continue.
We will be practicing with fake data from two different examples.
##Part 1.1 Example 1 (Cars) White cars can be hard to see. You are trying to see if white cars get into more accidents then black cars or cars of other colors. You have data from 2019, the total number of cars that are white, black, or all other colors and the number of cars that got into accidents for each color category.
QUESTION 1: What type of chi-squared question is Example 1?
1. Goodness of fit
1. Independence
QUESTION 2: How would you calculate your degrees of freedom for Example 1?
1. n - 1, leaving you with 2 degrees of freedom
1. (R-1) * (C-1), leaving you with 4 degrees of freedom
Part 1.2 Example 2 Poosplosions
You are working with a group of people who have an intestinal disease that causes frequent and volcanic dia
hea. You are trying to see if a particular medication causes fewer cases of violent dia
hea. You have people on the medication and people off the medication, and their relative numbers of folks with and without dia
hea.
QUESTION 3: What type of chi-squared question is Example 2?
1. Goodness of fit
1. Independence
QUESTION 4: How would you calculate your degrees of freedom for Example 2?
1. n - 1, leaving you with 3 degrees of freedom
1. (R-1) * (C-1), leaving you with 1 degrees of freedom
Part 2: Fake Examples
To run today’s code you will need the ggplot2, reshape2 and dplyr packages installed. Once they are installed, use the li
ary() function to remind R to load them.
li
ary(dplyr)
li
ary(ggplot2)
li
ary(reshape2)
Part 2.1 Make The Data
To begin with, we will run both types of chi-squared tests using these fake data examples. The below code will create a data frame for each example – and I’m fairly sure that by the names you can tell which is which. The reason we are starting with these is because this data is already in contingency table format - when we work with real data in Part 3, we have to do a lot more work to get your data into the right format.
cars <- data.frame(Color = c("White", "Black", "Everything Else"), Accidents = c(252, 109, 450), TotalCars = c(2023, 4005, 12124))
poopoo <- data.frame(DevilsCoffeeStatus = c("Butt Barfing", "Enjoying Life"), Control = c(35, 78), Medication = c(53, 13))
Part 2.2 Cars Example
In this question we are looking to see if the proportion of cars of each color has an equal number of accidents. We have the total number of cars for each color, but we are more interested in asking whether or not white cars have a lot of accidents versus whether or not people prefer purchasing white cars. So to start off with we need to calculate what our expected percentages of accidents would be given the number of cars in each category.
To calculate percentages, we first need to find the total number of cars (all) using the sum() function.
all <- sum(cars$TotalCars)
Now we need to find the overall percentage of cars that fall into each category. We can do this using a new function from the dplyr package: mutate(). This essentially is a way of quickly doing column math without having to write a lot of code. The mutate function takes two arguments: the name of the new column, and whatever you wanted to have done.
We want to have the overall proportion of each type of car (Proportion.Of.Cars).
cars <- cars %>%
mutate(Proportion.Of.Cars = TotalCars/all)
The chisq.test() function takes two arguments: x, the actual number of accidents, and p, the probabilities of accidents expected for each type. Though the hand calculation of Chi Square requires expected observations rather than proportions, R likes to do that for you. It instead wants you to specify the expected probabilities.
chisq.test(x= cars$Accidents, p = cars$Proportion.Of.Cars)
This is statistically significant. What does that actually mean? We need to look at effect size to find out. To get this, we need the number of accidents, and to do a little column math.
total.accidents <- sum(cars$Accidents)
cars <- cars %>%
mutate(ExpectedAccidents = total.accidents*Proportion.Of.Cars) %>% #how many accidents we expected by category
mutate(EffectSize = Accidents/ExpectedAccidents)
QUESTION 5: Look at the chi square results and the effect size column. Are white cars more likely to be involved in accidents? Use your p value and your effect size to explain your answer.
Part 2.3 Pants Gravy Problem
In this case, we are interested in looking both at the people who took the medication, and those that didn’t, to see if there’s interaction between the medication and the likelihood of having a Downstairs Exxon Valdez. That means we’re not actually calculating expected values, we’re looking for a relationship.
To do that, we have to make this data into a format R likes, which is a matrix(). They look like data frames, but in a matrix all the data has to be the same type (no categories, etc.). We can use the select() function to only take the columns we want, and then the as.matrix() function to turn it into a matrix.
muddymatrix <- poopoo %>%
select(Control, Medication) %>%
as.matrix()
And now place that matrix in the chisq.test() function. No need to add anything else!
chisq.test(muddymatrix)
Again, we will want to look at effect size to interpret these results. In this case, our expected values are calculated a little differently - across columns, rather than compared to the whole. You can still use mutate() for this.
poopoo <- poopoo %>%
mutate(EffectSize = Medication/Control)
QUESTION 6: If you wanted to lower dia
hea rates in this group of people, should you put them on this medication? Use your p value and effect size to defend your answer.
Part 3: Real (Complicated) Data
We will be using some real data on collisions between wildlife and aircraft between 1990 and 2015, collected from https:
www.kaggle.com/faa/wildlife-strikes.
Import this dataset (angrybirds.csv) and call it birds.
If you are having problem loading it or downloading it, use the smolangrybirds.csv file instead, which has fewer states.
When I was initially looking at this data I thought it was only birds. It isn’t. There are coyotes and bats and other non-avian wildlife in here too (no shark tornadoes though). I just want to clarify before you do this part that I do have a degree in biology and understand that coyotes and bats are not birds, but I also have a dumb sense of humor and think it is funnier to pretend that they are.
Part 3.1 Subsetting Data
There is a lot of data in here, but what we really want is just Tucson and Phoenix airport data and we need it in a contingency table. So we have to do a lot of work to get it there. We are going to use the filter() function from dplyr to subset our data. We are giving the data set a new name (bi
s1) instead of overwriting birds because we are going to come back to this data set later.
i
s1 <- birds %>%
filter(Airport %in% c("TUCSON INTL","PHOENIX SKY HARBOR INTL ARPT"))
We now need to have a total count by species for each airport. We can use the group_by() and tally() functions from dplyr to get this information.
i
s1 <- bi
s1 %>%
group_by(Species.Name, Airport) %>%
tally()
But right now this data is long and hard to read, and won’t work well for a chi square test. Instead, we need to make it wide format. We can do this using dcast() from reshape2.
dcast() works as a formula, where you put the things you want to have stay in rows on the left-hand side and the thing you want to have be new columns on the right hand side. You do have to tell R what sort of data you want to have fill the new columns - sometimes it will figure that out automatically but it can be unpredictable. You use the value.var argument and put the column name in quotation marks ("n"). Finally, use the fill argument to tell R that any species&airport combination that doesn’t have a data point should be a 0.
widebi
s <- bi
s1 %>%
dcast(Species.Name~Airport, value.var = "n", fill = 0)
QUESTION 7: If you remove the fill = 0 portion of this function, what happens to missing values?
1. They are sad and lost foreve
1. They become zeros
1. They become NA values
QUESTION 8: If you remove the quotations around “n” from this function, what happens?
1. Your code runs totally perfectly and RStudio gives you a high five
1. R thinks that n is a function and can’t find it
1. R thinks n is an object,