FES 205
S&DS 230e Final Project Guidelines 1
S&DS 230e Data Analysis
Final Project Guidelines
Due Friday, XXXXXXXXXX, 11:59pm, uploaded to CANVAS as PDF or DOC AND RMD
Overview
Analyze a dataset of your choice and write a 10-20 page report of your findings. This report
must be created in RMarkdown and you’ll submit both a knitted PDF/doc file and the raw
Rmarkdown code. Your goal is to demonstrate your ability to code in R, to clean data, to use
appropriate graphical and statistical techniques in R, and to interpret your results.
Groups
You are encouraged but certainly not required to work in groups. Groups can be up to 4
students. Everyone in the group gets the same grade.
Data
You should choose a dataset that is interesting to you, OR you may use one of three datasets
that I provide. The dataset should have at least 10 variables and at least 50 observations.
You must have at least two continuous variables and at least two categorical variables. Some
datasets will have hundreds of variables and more than 100,000 observations. Getting and
cleaning the data may be the most difficult part of your project. YOU ABSOLUTELY SHOULD
DISCUSS YOUR DATA WITH ME OR A TA BEFORE TURNING IN YOUR PROJECT.
There are many online sources for data – you can just go to Google and search for a subject and
then add ‘data’. You can also scrape data off a website.
Here are some good sites:
• ICPSR (https://www.icpsr.umich.edu/icpsrweb/landing.jsp). More than 10,000 datasets here.
• Kaggle (https://www.kaggle.com/datasets)
• The Census Bureau (http://www.census.gov/)
• NOAA (http://www.nodc.noaa.gov/)
• The US Environmental Protection Agency (http://www.epa.gov/epahome/Data.html)
Other ideas:
• Use your web scraping tools to get data on all roll call votes in the 116th Senate (2nd
session, 2020)
You should NOT choose a dataset that has already been extensively cleaned and analyzed (i.e.
from a textbook or ‘nice example’ website). However, if there is minimal cleaning to do, then
put more effort into something else.
You do NOT need to use all the variables in your dataset; indeed, you may end up
cleaning/analyzing only 6 to 10 variables. Your goal is not to be comprehensive, but to
demonstrate what you’ve learned.
Senate roll call votes, 116th Congress, 2nd session: https://www.senate.gov/legislative/LIS/roll_call_lists/vote_menu_116_2.htm
If you decide not to find your own data, you can use one of the following three datasets, all
available on CANVAS under Files → Final Project Information. Information on variables and
collection methods is also provided.
• World Bank Data from 2016
• Environmental Attitudes from the General Social Survey of 2000
• Food Choices (we looked briefly at a few variables in class): https://www.kaggle.com/borapajo/food-choices
Format
Your project should be presented as a report; it should have appropriate RMarkdown
formatting and discussions should be in complete sentences. There is no minimum length
(brevity and clarity are admired), and your knitted report should not be more than 15 pages
long, including graphs and relevant output (just suppress irrelevant output). You should NOT
have pages of output that you don’t discuss. You also don’t need to have RMarkdown show
every last bit of output your code creates. It should feel more formal than a homework
assignment, but you should be extremely concise in your discussion.
Sections of the Report
• Introduction (Background, motivation) – not more than a short paragraph.
• DATA: Make a LIST of all variables you actually use – describe units, anything I should
know. Ignore variables you don’t discuss.
• Data cleaning process – describe the cleaning process you used on your data. Talk
about what issues you encountered.
• Descriptive Plots, summary information. Plots should be clearly labeled, well formatted,
and display an aesthetic sense.
• Analysis – see below
• Conclusions and Summary – a short paragraph.
Content Requirements
Your report should include evidence of your ability in each of the following areas:
1) Data Cleaning – demonstrate use of find/replace, data cleaning, dealing with missing
values, text character replacement, matching. It’s ok if your data didn’t require much of
this.
2) Graphics – show appropriate use of at least ONE of each of the following – boxplot,
scatterplot (can be matrix plot), normal quantile plot (can be related to regression),
residual plots, histogram.
3) Basic tests - t-test, correlation, AND ability to create bootstrap confidence interval for
either a t-test or a correlation.
4) Permutation Test – include at least one.
5) Multiple Regression – use either backwards stepwise regression or some form of best
subsets regression. Should include residual plots. A GLM with a mix of continuous and
categorical predictors is fine here.
6) AT LEAST ONE OF THE FOLLOWING TECHNIQUES – ANOVA, ANCOVA, Logistic
Regression, Multinomial Regression, OR data scraping off a website.
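As an illustration of items 3 and 4, here is a minimal sketch of a bootstrap percentile confidence interval for a correlation and a permutation test for a difference in group means. The data are simulated and every variable name (x, y, group, score) is hypothetical; the resampling counts are arbitrary choices, and this is only one of several reasonable ways to set these up.

```r
# Simulated stand-in data (hypothetical): two continuous variables and a
# two-level grouping variable. Replace these with columns from your own dataset.
set.seed(230)
n <- 200
x <- rnorm(n)
y <- 0.6 * x + rnorm(n, sd = 0.8)
group <- rep(c("A", "B"), each = n / 2)
score <- rnorm(n, mean = ifelse(group == "A", 0, 0.5))

# --- Bootstrap percentile confidence interval for a correlation ---
n_boot <- 2000
boot_cor <- replicate(n_boot, {
  idx <- sample(n, replace = TRUE)  # resample row indices with replacement
  cor(x[idx], y[idx])               # keep (x, y) pairs together
})
ci <- quantile(boot_cor, c(0.025, 0.975))
ci

# --- Permutation test for a difference in group means ---
obs_diff <- mean(score[group == "B"]) - mean(score[group == "A"])
n_perm <- 2000
perm_diff <- replicate(n_perm, {
  shuffled <- sample(group)         # shuffle labels to break any group/score link
  mean(score[shuffled == "B"]) - mean(score[shuffled == "A"])
})
# Two-sided p-value: share of shuffles at least as extreme as the observed difference
p_value <- mean(abs(perm_diff) >= abs(obs_diff))
p_value
```

In a report, a histogram of boot_cor or perm_diff with the interval or the observed difference marked on it would connect this to the graphics requirement as well.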
Additional Comments
Please do NOT have appendices – unlike a journal article, include relevant plots and output in
the section where you discuss the results (more of a narrative). This said, you should ONLY
include output that is relevant to your discussion. I can always look at your RMarkdown code if
I have questions. It is fine to suppress both long output and parts of your R code.
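One way to suppress output is through knitr chunk options. The sketch below shows a few commonly used ones; the chunk labels (load-data, long-summary, final-plot) are hypothetical, and which options you need will depend on your report.

```r
# Global defaults, typically set once in the setup chunk:
# hide package start-up messages and warnings in the knitted report.
knitr::opts_chunk$set(message = FALSE, warning = FALSE)

# Per-chunk options go in the chunk header, for example:
#   {r load-data, include = FALSE}      runs the code, shows neither code nor output
#   {r long-summary, results = "hide"}  shows the code, hides the printed output
#   {r final-plot, echo = FALSE}        shows the output, hides the code
```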
As you work on this project, I expect you will regularly pester me and the TAs.
Submission - Please read this carefully
1) ONLY ONE person in a group should upload a copy of the final project (i.e. if there are
three people in a group, only one person needs to upload the files).
2) BE SURE to put all members’ names on your project documents.
---
title: "Final Project Rideshare"
author: "Jack Kidney"
date: ' XXXXXXXXXX'
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r, eval = F}
rm(list = ls())
```
```{r}
rideshare <- read.csv("/Users/jackkidney/Downloads/Final Project/rideshare.csv")
attach(rideshare)
```
Data Cleaning:
```{r}
# There is some missing data.
# Count rides (rows) that are missing a value in at least one column.
total_missing <- sum(!complete.cases(rideshare))
paste0("We find that ", total_missing, " rides are missing data from at least one column, which is ", round(total_missing / nrow(rideshare) * 100, 1), "% of our total ride data.")
# It seems that data is missing only from the price column.
sapply(rideshare, function(x) sum(is.na(x)))
# timestamp is in Unix format (seconds since January 1st 1970).
# Convert to a friendlier format.
# install.packages("lubridate") # Use lubridate to convert.
# library("lubridate")
# Date is given in a nice format, but we're going to pretend that we only had the Unix time listed under timestamp.
drop <- c("hour", "day", "month", "datetime")
rideshare <- rideshare[, !(names(rideshare) %in% drop)]
# I double-checked the dataset and, for some reason the author doesn't explain, the origin in this case is 5am instead of 12am...
rideshare$timedate <- as.POSIXct(rideshare$timestamp, origin = " XXXXXXXXXX:00:00")
head(rideshare$timedate)
```
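One possible explanation for the 5am origin noted in the chunk above is a time zone difference: Unix time counts seconds from midnight UTC, and US Eastern Standard Time is 5 hours behind UTC. This has not been confirmed for this dataset. The sketch below uses made-up timestamps (not values from rideshare) to show how the same instant prints differently depending on the time zone passed to `format()`.

```r
# Hypothetical Unix timestamps (seconds since 1970-01-01 00:00:00 UTC)
ts <- c(1543622400, 1543626000)

# Interpret the seconds as instants on the UTC clock
when_utc <- as.POSIXct(ts, origin = "1970-01-01", tz = "UTC")

# The same instants, displayed in two different time zones
utc_label <- format(when_utc, "%Y-%m-%d %H:%M:%S", tz = "UTC")
nyc_label <- format(when_utc, "%Y-%m-%d %H:%M:%S", tz = "America/New_York")
utc_label
nyc_label
```

If the time zone is the cause, keeping the standard origin and setting `tz` explicitly would avoid hand-shifting the origin by 5 hours.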
```{r}
boxplot(price ~ name, data = rideshare)
hist(rideshare$price)
```
Our histogram and boxplots look pretty right-skewed, so maybe a transformation is in order here.
Let's check out a Box-Cox transformation.
```{r}
library(car) # boxCox() comes from the car package
# First, we fit the simplest model possible.
model1 <- lm(price ~ distance, data = rideshare)
# Figure out what value of lambda (x) gives the max value of the log-likelihood (y).
trans <- boxCox(model1)
trans$x[which.max(trans$y)]
```
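To see what the lambda from a Box-Cox search means in practice, here is a small sketch on simulated data where the right answer is known in advance. It uses `MASS::boxcox()` as a stand-in for `car::boxCox()`, and the data frame `sim` and helper `transform_bc()` are hypothetical names made up for this example. Because log(y) is built to be linear in x, the estimated lambda should land near 0, which corresponds to a log transformation.

```r
library(MASS)  # boxcox(); swapped in here as an alternative to car::boxCox()

# Simulated data (hypothetical): log(y) is linear in x, so lambda near 0 is expected
set.seed(1)
sim <- data.frame(x = runif(1000, 0, 10))
sim$y <- exp(1 + 0.3 * sim$x + rnorm(1000, sd = 0.4))

# Profile log-likelihood over a grid of lambda values (no plot)
bc <- boxcox(y ~ x, data = sim, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat

# Apply the transformation: log when lambda is (numerically) 0, power otherwise
transform_bc <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}
sim$y_bc <- transform_bc(sim$y, lambda_hat)

# Refit on the transformed response and inspect residuals as usual
model_bc <- lm(y_bc ~ x, data = sim)
```

In practice the estimate is usually rounded to a nearby interpretable value (for example 0 for log, 0.5 for square root) rather than used to many decimal places.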
Homework 09 Two Way ANOVA / ANCOVA / GLM
Due by 11:59pm, Monday, August 1, 2022
S&DS 230e
This assignment uses data from the International Social Survey Program on Environment
from 2000. There are over 100 questions from over 31000 individuals across 38 countries.
The data you’ll need is here. Be aware that it will take a few moments to load this data.
You’ll also want the codebook that describes the variables.
1) Data Set creation (23 pts - 3 pts each section, except part f which is 5 pts)
1.1) Read the data into an object called envdat (do NOT use the option as.is = TRUE).
Check the dimension to be sure the data loaded correctly. Then create a new object called
envdat2 which only contains information for the following countries : USA, Norway, Russia,
New Zealand, Canada, Japan, and Mexico. The variable that contains country is V3. You’ll
need to use the codebook to figure out which number goes with which country. Check the
dimensions of your results - you should have 9102 observations.
envdat <- read.csv("http://reuningscherer.net/s&ds230/data/envdata.csv",
                   as.is = FALSE)
dim(envdat)
## [1] XXXXXXXXXX
#making envdat2
envdat2 <- envdat[envdat$V3 %in% c(6, 12, 18, 19, 20, 24, 38),]
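The row filter above relies on `%in%`, which returns TRUE for each row whose code appears in the given set. A minimal sketch on a toy data frame (the rows in `toy` are made up and are not survey data) shows the pattern and a quick dimension check:

```r
# Toy stand-in for envdat (hypothetical rows); V3 holds numeric country codes
toy <- data.frame(V3   = c(6, 1, 12, 38, 2, 6, 24),
                  V200 = c(1, 2, 2, 1, 1, 2, 1))

# Codes to keep, as used in the filter above
keep_codes <- c(6, 12, 18, 19, 20, 24, 38)

# Keep rows whose code is in the set; the trailing comma keeps all columns
toy2 <- toy[toy$V3 %in% keep_codes, ]
dim(toy2)
table(toy2$V3)
```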
1.2) Create a new variable called Country on envdat2 which has Country names rather than
Country numbers. There are several ways to do this, but I suggest you use the recode()
function in the car package. The syntax for this function is something like
library(car)
envdat2$Country <- recode(envdat2$V3, "6 = 'USA'; 12 = 'Norway'; 18 =
'Russia'; 19 = 'New Zealand'; 20 = 'Canada'; 24 = 'Japan'; 38 = 'Mexico'")
Once you’ve created the variable, make a table of the resulting variable to see how many
observations there are from each country.
table(envdat2$Country)
## < table of extent 0 >
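As one of the "several ways" to do this recoding, here is a base-R sketch that uses a named lookup vector instead of `car::recode()`. It is not the approach the assignment suggests; the vector `v3` holds made-up codes, while the code-to-country mapping is the one given above. Tabulating with `useNA = "ifany"` makes any unmatched code show up as NA, which is a quick way to catch an empty or mis-coded result.

```r
# Toy country codes (hypothetical rows, not survey data)
v3 <- c(6, 12, 38, 6, 24, 19)

# Named lookup vector: names are the codes (as text), values are the country names
lookup <- c("6" = "USA", "12" = "Norway", "18" = "Russia", "19" = "New Zealand",
            "20" = "Canada", "24" = "Japan", "38" = "Mexico")

# Index the lookup by code; unname() drops the code labels from the result
country <- unname(lookup[as.character(v3)])
table(country, useNA = "ifany")
```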
1.3) Make a variable Gender on envdat2 that contains gender (which is variable V200).
Recode so that 1 becomes ‘Male’ and 2 becomes ‘Female’. Again, make a table of the resulting
variable to see how many people identify as Male and how many as Female.
Codebook: http://reuningscherer.net/s&ds230/data/Env_Survey_2000_Codebook.pdf
library(car)
envdat2$Gender <- recode(envdat2$V200, "1 = 'Male'; 2 = 'Female'")