Instructions:
You will be looking at data from a survey in the US state of Colorado on opinions of the oil and gas industry, and evaluating whether Facebook ads changed opinions of the oil & gas industry.
For context, in this study, some individuals in Colorado were randomly selected to receive video advertisements on Facebook, which highlighted the risks of the oil & gas industry. This is the ‘treatment’ group. Another set of individuals on Facebook were the ‘control’ group and did not receive ads.
All individuals in both the treatment and control groups were asked to complete a survey. Not all individuals started the survey, and not all individuals who started the survey completed it. The survey asked respondents a number of demographic questions, then asked “Do you believe your community is better or worse off because of the oil and gas industry?”. Respondents selected one of the following choices:
● 1 - Definitely better off
● 2 - Somewhat better off
● 3 - Neither better nor worse off
● 4 - Somewhat worse off
● 5 - Definitely worse off
We can compare answers between the treatment and control groups to evaluate the
effectiveness of the advertisements.
Data: You will use two datasets:
1. Survey Data: includes a row for every individual who started the survey. Includes fields for survey responses and attributes of individuals.
· Description of fields:

    Field                    Description
    person_id                ID for survey respondent
    county                   County of respondent
    FIPS                     5-digit FIPS code for county of respondent
    treatment                1 indicates respondent was in treatment group, 0 indicates respondent was in control group
    total_duration_in_sec    Time respondent took to respond to survey, in seconds
    Q1_answer_code           The respondent's numerical response to survey question 1
    Q1_answer_text           The respondent's text response to survey question 1
2. County Shapefiles: Standard zip file of county boundary shapefiles from the US Census
Objectives:
With this data, your goal is to:
● Clean up and QA survey data
● Understand scope of cleaned data: what is the geographic coverage of our survey respondents?
● Compare the survey responses of the treatment group (those who saw video
advertisements) and control group (those who did not see video advertisements).
ASSIGNMENT
Part 1: Data Intro and QA
In Part 1, we will load the survey data and clean it.
1.1: Set Up
Run the code below to import modules. Then read in the survey data into a dataframe called df_survey. The survey data is available on GitHub at the link below:
'https://raw.githubusercontent.com/smsidekick/project-sidekick/main
lihkjhdrsers.csv'
# Install Geopandas
! pip install geopandas --q
# Import pandas and numpy
import pandas as pd
import numpy as np
# Import geopandas
import geopandas as gpd
# Import plotnine
from plotnine import *
import plotnine
1.2: Explore Data
Orient to the survey data.
1.3: Duplicate IDs
Is the person_id field unique? Are there any duplicate values in that field? If there are duplicates, remove the duplicates. Save this back to df_survey
1.4: Complete Survey Responses
Using code, check if any individuals did not answer survey question 1. If so, filter df_survey to include responses only from those who completed survey question 1: filter out any rows where Q1_answer_code is null.
Save this filtered data to a new dataframe called df_complete.
1.5: Survey Speeders
Did any respondents in df_complete speed through the survey?
Filter out any responses that were impossibly fast outliers based on your judgement. Save this filtered data back to df_complete
Make the rationale for your decision clear. A histogram may be helpful.
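One way to make the speeder decision explicit is to state a minimum plausible duration and count what it removes. The sketch below uses invented durations and a hypothetical 10-second cutoff (`MIN_SECONDS` is an assumption, not a value from the assignment). The real threshold should be read off the histogram of the actual data.

```python
import pandas as pd

# Toy stand-in for df_complete; these durations are invented for illustration.
df_complete = pd.DataFrame({
    "person_id": [1, 2, 3, 4, 5, 6],
    "total_duration_in_sec": [2, 4, 35, 48, 61, 300],
})

# Hypothetical cutoff (an assumption): anything faster is treated as a speeder.
MIN_SECONDS = 10

# Flag responses faster than the cutoff, then keep everything else.
is_speeder = df_complete["total_duration_in_sec"] < MIN_SECONDS
n_speeders = int(is_speeder.sum())
df_filtered = df_complete.loc[~is_speeder].copy()

print(f"Removed {n_speeders} of {len(df_complete)} responses under {MIN_SECONDS}s")
```

Reporting how many rows the cutoff removes makes the rationale easy to check and to revisit if the threshold changes.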
1.6: Survey Responses
Show the distribution of the survey responses in Q1_answer_text (i.e. how many people responded with each answer?). In a sentence, brainstorm why you think some may say the oil & gas industry makes their community better off vs. worse off.
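A minimal sketch of the response distribution, using a few made-up answers rather than the survey data. It orders the counts by the survey's own 1-to-5 scale so that answers nobody chose still appear with a count of zero. The `scale` list and the `summary` table are illustrative names.

```python
import pandas as pd

# Invented answers standing in for df_complete["Q1_answer_text"].
answers = pd.Series([
    "Somewhat better off", "Definitely better off", "Somewhat worse off",
    "Somewhat better off", "Neither better nor worse off", "Somewhat better off",
], name="Q1_answer_text")

# Answer choices in the order they appear on the survey.
scale = [
    "Definitely better off",
    "Somewhat better off",
    "Neither better nor worse off",
    "Somewhat worse off",
    "Definitely worse off",
]

# Count each answer, then reorder to the survey scale (missing answers become 0).
counts = answers.value_counts().reindex(scale, fill_value=0)
shares = (counts / counts.sum()).round(2)

summary = pd.DataFrame({"n": counts, "share": shares})
print(summary)
```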
Part 2: Survey Coverage in Colorado
In Part 2, we will explore the survey results by Colorado county and then create a map to understand the geographic coverage of our responses. We'll explore all the results (for both treatment and control).
We're looking to inform two questions:
1. Do we think we have a good, representative sample of the entire state?
2. Do we think we have enough data to evaluate the experiment by county?
2.1: Read in County Shapefiles
Use command line code to read in the county shapefiles for the entire US from the link below. Read the data into a geodataframe, df_counties
https://www2.census.gov/geo/tiger/TIGER2019/COUNTY/tl_2019_us_county.zip
2.2: Filter Geodataframe
Filter df_counties to include only Colorado counties by filtering for when STATEFP is 08 (the State FIPS code for Colorado). Save this to a new geodataframe, df_counties_co.
2.3: Summarize Survey by County
Turning back to the survey results: create a dataframe summarizing the total number of survey responses by county and FIPS. Save this summary to a new dataframe, df_county_survey. (In the next step, we'll join this onto df_counties_co.)
Then, dig into the county results and answer:
· How many unique counties do we have in total in df_county_survey?
· What is the minimum number of responses in a county? Describe the new dataset, and the distribution of the number of survey responses by county
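The county summary can be sketched with a groupby, as below. The rows are toy data, and the count column name `N_resp` is an assumption chosen to pair with the `N_resp_bucket` column requested in the next step.

```python
import pandas as pd

# Toy stand-in for df_complete; rows are invented for illustration.
df_complete = pd.DataFrame({
    "person_id": [1, 2, 3, 4, 5, 6],
    "county": ["Denver", "Denver", "Denver", "Weld", "Weld", "Mesa"],
    "FIPS": ["08031", "08031", "08031", "08123", "08123", "08077"],
})

# One row per county/FIPS pair with the number of responses.
# "N_resp" is a hypothetical column name, not one given in the assignment.
df_county_survey = (
    df_complete
    .groupby(["county", "FIPS"], as_index=False)
    .agg(N_resp=("person_id", "count"))
)

n_counties = df_county_survey["FIPS"].nunique()
min_resp = df_county_survey["N_resp"].min()

print(df_county_survey)
print("Unique counties:", n_counties)
print(df_county_survey["N_resp"].describe())
```

`describe()` on the count column gives the minimum, quartiles, and maximum in one call, which covers the distribution question.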
2.4 Bucket Number of Responses
In df_county_survey, create a new column N_resp_bucket that buckets the number of survey responses in steps of 25: <25, 25-50, 50-100, etc.
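A hedged sketch of the bucketing with `pd.cut`. The example list in the prompt mixes 25-wide and 50-wide buckets, so the edges and labels below are one possible reading, not the required one. The `N_resp` count column and the toy rows are also assumptions.

```python
import numpy as np
import pandas as pd

# Toy county counts; "N_resp" is a hypothetical name for the response count.
df_county_survey = pd.DataFrame({
    "county": ["A", "B", "C", "D", "E"],
    "N_resp": [3, 25, 49, 50, 240],
})

# Assumed bucket edges and labels (one reading of "<25, 25-50, 50-100, etc.").
edges = [0, 25, 50, 100, np.inf]
labels = ["<25", "25-50", "50-100", "100+"]

# right=False makes each bucket include its lower edge, so 25 lands in "25-50".
df_county_survey["N_resp_bucket"] = pd.cut(
    df_county_survey["N_resp"], bins=edges, labels=labels, right=False
)

print(df_county_survey)
```

Because `pd.cut` returns an ordered categorical, the buckets keep their natural order in tables and plot legends instead of being sorted alphabetically.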
2.5: Join Survey and Geo Data
Join df_counties_co and df_county_survey, matching the FIPS column to the GEOID column. Save the joined dataframe to a new geodataframe, df_map.
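A common pitfall in this join is a key type mismatch. GEOID in the Census county file is a zero-padded string, while a FIPS column read from CSV may arrive as an integer with the leading zero dropped. The sketch below uses plain pandas tables as stand-ins (the real df_counties_co would be a GeoDataFrame with a geometry column), and the rows and `N_resp` column are illustrative assumptions.

```python
import pandas as pd

# Stand-in for df_counties_co (geometry column omitted in this toy version).
df_counties_co = pd.DataFrame({
    "GEOID": ["08001", "08031", "08123"],
    "NAME": ["Adams", "Denver", "Weld"],
})

# Stand-in for df_county_survey, with FIPS parsed as an integer.
df_county_survey = pd.DataFrame({
    "FIPS": [8031, 8123],
    "N_resp": [3, 2],
})

# Restore the 5-character, zero-padded key so it matches GEOID.
df_county_survey["FIPS"] = df_county_survey["FIPS"].astype(str).str.zfill(5)

# Left join keeps every county, including those with no responses.
df_map = df_counties_co.merge(
    df_county_survey, how="left", left_on="GEOID", right_on="FIPS"
)

# Counties with no match have a missing count; treat that as zero responses.
df_map["N_resp"] = df_map["N_resp"].fillna(0).astype(int)

print(df_map)
```

Keeping the county table on the left side of the merge means counties without responses still appear on the map. With geopandas, calling `merge` on the GeoDataFrame should also preserve the geometry.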
2.6: Map
Plot a choropleth map of df_map, coloring each county by the bucketed number of survey responses, N_resp_bucket.
2.7 Takeaways on Survey Scope
Take a few sentences to answer our two questions. Looking at this data, in your opinion:
1. Do we have a good, representative sample of the entire state?
2. Do we have enough data to evaluate the experiment by county?
What other information might you want to more robustly inform these questions?
(Don't worry if you don't know much about Colorado. Just discuss what you see and what you might want to know more about.)
Part 3: Evaluate Experiment
In Part 3, we'll evaluate if the survey responses from the treatment group (who saw the ads on Facebook about the negative impacts of oil & gas) were significantly different from those in the control group.
In the survey, question 1 asked respondents "Do you believe your community is better or worse off because of the oil and gas industry?" Respondents answered on a scale of 1 to 5, where 1 meant "Definitely better off" and 5 meant "Definitely worse off"
3.1: Treatment vs. Control Size
How many survey respondents were in the treatment group vs. the control group?
3.2: Differences between Treatment and Control
Calculate the average Q1_answer_code value for the treatment and the control groups.
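The group sizes and averages can be computed in one pass, as in this sketch on invented responses. Since 1 means "Definitely better off" and 5 means "Definitely worse off", a higher mean indicates a more negative view of the industry.

```python
import pandas as pd

# Toy stand-in for df_complete; values are invented for illustration.
df_complete = pd.DataFrame({
    "treatment": [1, 1, 1, 0, 0, 0, 0],
    "Q1_answer_code": [4, 3, 5, 2, 3, 2, 1],
})

# Group size and mean answer code for control (0) and treatment (1).
summary = (
    df_complete
    .groupby("treatment")["Q1_answer_code"]
    .agg(n="count", mean="mean")
)

# Positive difference: treatment respondents lean toward "worse off".
diff = summary.loc[1, "mean"] - summary.loc[0, "mean"]

print(summary)
print("Treatment minus control:", diff)
```

A difference in means alone does not show that the gap is statistically significant, so the group sizes reported alongside it matter for interpretation.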
3.3 Interpret Results
In a few sentences, discuss what you calculated above in 3.2. What is one follow up question you have, or what might be a next step to understand what is going on in greater detail?
Answered 3 days After Mar 24, 2023

Solution

Breeze answered on Mar 26 2023
42 Votes
Part 1: Data Intro and QA
1.1: Set Up
First, we will import the necessary modules and load the survey data into a dataframe called df_survey.
# Install Geopandas
!pip install geopandas --q
# Import pandas and numpy
import pandas as pd
import numpy as np
# Import geopandas
import geopandas as gpd
# Import plotnine
from plotnine import *
import plotnine
# Load survey data
df_survey = pd.read_csv('https://raw.githubusercontent.com/smsidekick/project-sidekick/main
lihkjhdrsers.csv')
1.2: Explore Data
Now that we have loaded the survey data, let's take a closer look at it to understand its structure and content.
# View the first 5 rows of the data
print(df_survey.head())
# Get summary statistics for the data
print(df_survey.describe())
# Get information on the data types and missing values
df_survey.info()  # info() prints its report directly; wrapping it in print() would also print None
1.3: Duplicate IDs
To determine if the person_id field is unique and if there are any duplicate values in that field,
we can count the number of unique IDs and compare it to the total number of rows in the dataframe.
# Count the number of unique person IDs
num_unique_ids = df_survey['person_id'].nunique()
print('Number of unique IDs:', num_unique_ids)
# Count the total number of rows in the dataframe
num_rows = df_survey.shape[0]
print('Number of rows:', num_rows)
# Check if there are any duplicate person IDs
if num_unique_ids == num_rows:
    print('There are no duplicate IDs.')
else:
    df_survey.drop_duplicates(subset='person_id', inplace=True)
    print('Duplicate IDs removed.')
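Before dropping rows, it can help to look at the duplicated records, since `drop_duplicates` keeps the first occurrence by default and discards the rest without comment. The sketch below shows this on a toy table. The IDs and answers are invented.

```python
import pandas as pd

# Toy stand-in for df_survey with one repeated person_id.
df_survey = pd.DataFrame({
    "person_id": [101, 102, 102, 103],
    "Q1_answer_code": [2, 4, 4, 1],
})

# keep=False marks every row that shares a person_id, not just the repeats.
dupes = df_survey[df_survey["person_id"].duplicated(keep=False)]
print(dupes)

# Keep the first row per person_id and confirm the column is now unique.
df_survey = (
    df_survey
    .drop_duplicates(subset="person_id", keep="first")
    .reset_index(drop=True)
)
print("person_id unique:", df_survey["person_id"].is_unique)
```

If the duplicated rows differ in their answers, keeping the first one is a judgment call that is worth stating alongside the result.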
1.4: Complete Survey Responses
We can check if any individuals did not answer survey question 1 by looking for null values in the Q1_answer_code field. If there are any null values,
we can filter the dataframe to include responses only from those who completed survey question 1.
# Check for null values in Q1_answer_code
num_null_values = df_survey['Q1_answer_code'].isnull().sum()
if num_null_values > 0:
    print('There are', num_null_values, 'null values in Q1_answer_code.')
    df_complete = df_survey[df_survey['Q1_answer_code'].notnull()]
    print('Filtered data to include only responses from those who completed survey question 1.')
else:
    df_complete = df_survey.copy()
    print('All responses completed survey question 1.')
1.5: Survey Speeders
To determine if any respondents in df_complete sped through the survey, we can look at the distribution of the total_duration_in_sec field.
We can use a histogram to visualize this distribution and
determine if there are any outliers that need to be removed.
# Create a histogram of total_duration_in_sec
(ggplot(df_complete, aes(x='total_duration_in_sec'))
+ geom_histogram(bins=50)
+...