
11/15/21, 8:28 PM mids-w200-assignments-upstream-fall2021/HW_units_11_12_13.ipynb at main · UC-Berkeley-I-School/mids-w200-assignments-upstream-fall…...

Week 12 Assignment - W200 Introduction to Data Science Programming, UC Berkeley MIDS
Write code in this Jupyter Notebook to solve the following problems. This assignment
addresses material covered in Unit 11. Please upload this Notebook with your solutions
to your GitHub repository in your SUBMISSIONS/week_12 folder by 11:59PM PST the
night before class. Do NOT push/upload the data file. If you turn in anything on ISVC,
please do so under the Week 12 Assignment category.
Objectives
Explore and glean insights from a real dataset using pandas
Practice using pandas for exploratory analysis, information gathering, and discovery
Practice using matplotlib for data visualization
Dataset
You are to analyze campaign contributions to the 2016 U.S. presidential primary races
made in California. Use the csv file located here:
https://drive.google.com/file/d/1Lgg-PwXQ6TQLDowd6XyBxZw5g1NGWPjB/view?usp=sharing
You should download and save this file in the same folder as this notebook is stored.
This file originally came from the U.S. Federal Election Commission (https://www.fec.gov/).
DO NOT PUSH THIS FILE TO YOUR GITHUB REPO!
Best practice is to not have DATA files in your code repo. As shown below, the
default load is outside of the folder this notebook is in. If you change the folder
where the file is stored please update the first cell!
If you do accidentally push the file to your github repo - follow the directions here to
fix it:
https://docs.google.com/document/d/15Irgb5V5G7pKPWgAerH7FPMpKeQRunbNflaW-hR2hTA/edit?usp=sharing
Documentation for this data can be found here:
https://drive.google.com/file/d/11o_SByceenv0NgNMstM-dxC1jL7I9fHL/view?usp=sharing
General Guidelines:
This is a real dataset and so it may contain errors and other peculiarities to work
through
This dataset is ~218mb, which will take some time to load (and probably won't load
in Google Sheets or Excel)
If you make assumptions, annotate them in your responses
While there is one code/markdown cell positioned after each question as a
placeholder, some of your code responses may require multiple cells
Double-click the markdown cells that say YOUR ANSWER HERE to enter your
written answers. If you need more cells for your written answers, make them
markdown cells (rather than code cells)
Setup
Run the two cells below.
The first cell will load the data into a pandas dataframe named contrib . Note that a
custom date parser is defined to speed up loading. If Python were to guess the date
format, it would take even longer to load.
The second cell subsets the dataframe to focus on just the primary period through May
2016. Otherwise, we would see general election donations which would make it harder
to draw conclusions about the primaries.
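The two setup steps can be illustrated on a tiny stand-in table. This is only a sketch under stated assumptions: the rows are invented, cand_nm is an assumed column name, and the cutoff date of May 31, 2016 is a hypothetical reading of "through May 2016" (the actual cutoff in the setup cell is cut off in this copy). It also uses pd.to_datetime with an explicit format rather than a custom parser passed to read_csv.

```python
import io
import datetime
import pandas as pd

# Toy stand-in for the real CSV; all rows and values are invented.
csv_text = """cand_nm,contb_receipt_amt,contb_receipt_dt
Candidate A,50.0,15-Mar-16
Candidate B,25.0,30-May-16
Candidate A,100.0,04-Jul-16
"""

toy = pd.read_csv(io.StringIO(csv_text))

# Giving an explicit format spares pandas from guessing the layout of each date.
toy["contb_receipt_dt"] = pd.to_datetime(toy["contb_receipt_dt"], format="%d-%b-%y")

# Hypothetical cutoff (assumption): the last day of May 2016.
cutoff = datetime.datetime(2016, 5, 31)

# Keep only rows on or before the cutoff; .copy() avoids chained-assignment warnings later.
primary = toy[toy["contb_receipt_dt"] <= cutoff].copy()
print(primary.shape)
```

Printing the shape before and after the subset is a quick way to confirm how many rows the filter dropped.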
1. Data Exploration (20 points)
1a. First, take a preliminary look at the data.
Print the shape of the data. What does this tell you about the number of variables
and rows you have?
Print a list of column names.
Review the documentation for this data (link above). Do you have all of the columns
you expect to have?
In [ ]: import pandas as pd
import matplotlib.pyplot as plt
import datetime

# These commands below set some options for pandas and to have matplotlib
pd.set_option('display.max_rows', 1000)
pd.options.display.float_format = '{:,.2f}'.format
%matplotlib inline

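# Define a date parser to pass to read_csv
# (pd.datetime was removed in pandas 2.0; the datetime module imported above is used instead)
d = lambda x: datetime.datetime.strptime(x, '%d-%b-%y')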

# Load the data
# We have this defaulted to the folder OUTSIDE of your repo - please chang
contrib = pd.read_csv('../../P XXXXXXXXXXCA.csv', index_col=False, parse_dat
print(contrib.shape)

# Note - for now, it is okay to ignore the warning about mixed types.
In [ ]: # Subset data to primary period
contrib = contrib.copy()[contrib['contb_receipt_dt'] <= datetime.datetime(
print(contrib.shape)
Sometimes variable names are not clear unless we read the documentation. In your
own words, based on the documentation, what information does the
election_tp variable contain?
1a YOUR RESPONSE HERE
1b. Print the first 5 rows from the dataset to manually check some of the data.
This is a good idea to ensure the data loaded and the columns parsed correctly!
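As a rough illustration of the first-look steps in 1a and 1b, the sketch below inspects a tiny invented DataFrame. The column cand_nm and all values are assumptions made for the example; with the real data the same calls would be made on contrib.

```python
import pandas as pd

# Invented stand-in rows; the real dataframe is named contrib.
toy = pd.DataFrame({
    "cand_nm": ["Candidate A", "Candidate B", "Candidate A"],
    "contb_receipt_amt": [50.0, 25.0, 100.0],
    "election_tp": ["P2016", "P2016", "G2016"],
})

# shape is a (rows, columns) tuple: rows are records, columns are variables.
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")

# Column names as a plain Python list, convenient for comparing with the documentation.
column_names = toy.columns.tolist()
print(column_names)

# First five rows (fewer here, since the toy table only has three).
print(toy.head())
```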
1c. Pick three variables from the dataset above and run some quick sanity checks.
When working with a new dataset, it is important to explore and sanity check your
variables. For example, you may want to examine the maximum and minimum values, a
frequency count, or something else. Use the three markdown cells below to explain if
your three chosen variables "pass" your sanity checks or if you have concerns about
the integrity of your data and why.
1c YOUR RESPONSE HERE
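The kinds of checks mentioned in 1c (minimum, maximum, frequency counts) might look like the following on invented data. This is a sketch, not a prescribed set of checks: the column contbr_occupation is an assumed name, and which checks are meaningful depends on the variables you pick.

```python
import pandas as pd

# Invented rows, including a negative amount and a missing occupation on purpose.
toy = pd.DataFrame({
    "contb_receipt_amt": [50.0, -25.0, 100.0, 2700.0],
    "contbr_occupation": ["RETIRED", "TEACHER", None, "RETIRED"],
})

amt = toy["contb_receipt_amt"]

# Range check: are the smallest and largest values plausible?
print(amt.min(), amt.max())

# Count suspicious values, e.g. negative amounts.
n_negative = int((amt < 0).sum())

# Count missing values in a text column.
n_missing = int(toy["contbr_occupation"].isna().sum())

# Frequency count, keeping missing values visible.
occupation_counts = toy["contbr_occupation"].value_counts(dropna=False)
print(occupation_counts)
```

A negative amount or a missing value is not automatically an error; the point of the check is to notice it and note any assumption you make about it.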
1d. Plotting a histogram
Make a histogram of one of the variables you picked above. What are some insights that
you can see from this histogram? Remember to include on your histogram:
Include a title
Include axis labels
The correct number of bins to see the breakout of values
Hint: For some variables the range of values is very large. To do a better exploration,
make the initial histogram the full range and then you can make a smaller histogram
'zoomed' in on a discrete range.
1d YOUR RESPONSE HERE
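The hint about a full-range histogram followed by a zoomed-in one can be sketched as below. The amounts are invented, and the bin counts and the $0 to $500 zoom range are arbitrary choices for the example rather than recommended values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs as a script; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented contribution amounts with a couple of large outliers.
amounts = pd.Series([5, 10, 25, 25, 50, 50, 100, 250, 2700, 10000], dtype=float)

fig, (ax_full, ax_zoom) = plt.subplots(1, 2, figsize=(10, 4))

# Full range: the outliers squeeze most values into the first bin.
ax_full.hist(amounts, bins=20)
ax_full.set_title("Contribution amounts (full range)")
ax_full.set_xlabel("Amount ($)")
ax_full.set_ylabel("Number of contributions")

# Zoomed view: range= restricts the histogram to the chosen interval.
counts, edges, _ = ax_zoom.hist(amounts, bins=10, range=(0, 500))
ax_zoom.set_title("Contribution amounts ($0 to $500)")
ax_zoom.set_xlabel("Amount ($)")
ax_zoom.set_ylabel("Number of contributions")

fig.tight_layout()
```

Values outside the range passed to hist are ignored, so the zoomed panel only counts the contributions between $0 and $500.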
2. Exploring Campaign Contributions (30 points)
In [2]: # 1a YOUR CODE HERE
In [3]: # 1b YOUR CODE HERE
In [4]: # 1c YOUR CODE HERE for variable #1
In [ ]: # 1c YOUR CODE HERE for variable #2
In [ ]: # 1c YOUR CODE HERE for variable #3
In [2]: # 1d YOUR CODE HERE
Let's investigate the donations to the candidates.
2a. Present a table that shows the number of donations to each candidate sorted
by number of donations.
When presenting data as a table, it is often best to sort the data in a meaningful
way. This makes it easier for your reader to examine what you've done and to glean
insights. From now on, all tables that you present in this assignment (and course)
should be sorted.
Hint: Use the groupby method. Groupby is explained in Unit 13: async 13.3 & 13.5
Hint: Use the sort_values method to sort the data so that candidates with the
largest number of donations appear on top.
Which candidate received the largest number of contributions (variable
'contb_receipt_amt')?
2a YOUR RESPONSE HERE
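The groupby and sort_values hints can be tried on a small invented table first. In this sketch the candidate column name cand_nm and all rows are assumptions; only contb_receipt_amt is named in the assignment text.

```python
import pandas as pd

# Invented donations; cand_nm is an assumed column name.
toy = pd.DataFrame({
    "cand_nm": ["Candidate A", "Candidate B", "Candidate A",
                "Candidate C", "Candidate A", "Candidate B"],
    "contb_receipt_amt": [50.0, 25.0, 100.0, 2700.0, 10.0, 75.0],
})

# Count the non-missing amounts per candidate, largest count first.
donation_counts = (
    toy.groupby("cand_nm")["contb_receipt_amt"]
    .count()
    .sort_values(ascending=False)
)
print(donation_counts)
```

Note that count() skips missing values in the selected column, while size() would count every row in each group; which one is appropriate depends on how you treat missing amounts.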
2b. Now, present a table that shows the total value of donations to each candidate,
sorted by total value of the donations.
Which candidate raised the most money in California?
2b YOUR RESPONSE HERE
2c. Combine the tables (sorted by either a or b above).
Looking at the two tables you presented above - if those tables are Series convert
them to DataFrames.
Rename the variable (column) names to accurately describe what is presented.
Merge together your tables to show the count and the value of donations to each
candidate in one table.
Hint: Use the merge method.
2d. Calculate and add a new variable to the table from 2c that shows the average $
per donation. Print this table sorted by the average donation.
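One possible shape for the steps in 2b through 2d (total per candidate, Series to DataFrame, merge, derived average) is sketched below on invented rows. The names cand_nm, num_donations, total_donations, and avg_donation are all hypothetical choices for the example.

```python
import pandas as pd

# Invented donations; cand_nm is an assumed column name.
toy = pd.DataFrame({
    "cand_nm": ["Candidate A", "Candidate B", "Candidate A",
                "Candidate C", "Candidate A", "Candidate B"],
    "contb_receipt_amt": [50.0, 25.0, 100.0, 2700.0, 10.0, 75.0],
})

by_candidate = toy.groupby("cand_nm")["contb_receipt_amt"]

# groupby results are Series; to_frame gives them a descriptive column name,
# and reset_index turns the candidate name back into an ordinary column.
counts = by_candidate.count().to_frame(name="num_donations").reset_index()
totals = by_candidate.sum().to_frame(name="total_donations").reset_index()

# Merge the two tables on the shared candidate column.
summary = counts.merge(totals, on="cand_nm")

# Derived variable: average dollars per donation.
summary["avg_donation"] = summary["total_donations"] / summary["num_donations"]

summary = summary.sort_values("avg_donation", ascending=False)
print(summary)
```

In this toy table the candidate with a single large donation ranks first by average even though another candidate has the most donations, which is the kind of contrast the combined table makes visible.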
2e. Plotting a Bar Chart
Make a single bar chart that shows two different bars per candidate with one bar as the
total value of the donations and the other as average $ per donation.
In [3]: # 2a YOUR CODE HERE
In [ ]: # 2b YOUR CODE HERE
In [ ]: # 2c YOUR CODE HERE
In [ ]: # 2d YOUR CODE HERE
Show the Candidates Name on the x-axis
Show the amount on the y-axis
Include a title
Include axis labels
Hint: Make the y-axis a log-scale to show both numbers! (matplotlib docs:
https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.yscale.html )
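A grouped bar chart with a log-scaled y-axis might be drawn as in the sketch below. The summary table is invented and its column names are hypothetical; the example uses the pandas plotting wrapper and the axes method set_yscale, which has the same effect as the pyplot yscale function mentioned in the hint.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs as a script; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented per-candidate summary; the index supplies the x-axis labels.
summary = pd.DataFrame(
    {
        "total_donations": [160.0, 100.0, 2700.0],
        "avg_donation": [53.33, 50.0, 2700.0],
    },
    index=["Candidate A", "Candidate B", "Candidate C"],
)

# One bar per column for each candidate, drawn side by side.
ax = summary.plot(kind="bar", figsize=(8, 4))

# Totals and averages differ by orders of magnitude in real data,
# so a log scale keeps the smaller bars visible.
ax.set_yscale("log")

ax.set_title("Total and average donation by candidate")
ax.set_xlabel("Candidate")
ax.set_ylabel("Amount ($, log scale)")
plt.tight_layout()
```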
2f. Comment on the results of your data analysis in a short paragraph.
There are several interesting conclusions you can draw from the table you have
created.
What have you learned about campaign contributions in California?
We are looking for data insights here rather than comments on the code!
2f YOUR RESPONSE HERE
3. Exploring Donor Occupations (30 points)
Above in part 2, we saw that some simple data analysis can give us insights into the
campaigns of our candidates. Now let's quickly look to see what kind of person is
donating to each campaign using the contbr_occupation variable.
3a. Show the top 5 occupations of individuals that contributed to Hillary Clinton.
Subset your data to create a dataframe with only donations for Hillary Clinton.
Then use the value_counts and head methods to present the top 5
occupations ( contbr_occupation ) for her donors.
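The subset, value_counts, and head pattern can be sketched on invented rows as follows. The candidate labels are placeholders (the exact candidate name strings in the real file are not shown here), cand_nm is an assumed column name, and the occupation column is assumed to be named contbr_occupation.

```python
import pandas as pd

# Invented donors; candidate labels and occupations are placeholders.
toy = pd.DataFrame({
    "cand_nm": ["Candidate A", "Candidate A", "Candidate B",
                "Candidate A", "Candidate A"],
    "contbr_occupation": ["RETIRED", "TEACHER", "ENGINEER",
                          "RETIRED", "ATTORNEY"],
})

# Boolean mask keeps only the rows for one candidate.
one_candidate = toy[toy["cand_nm"] == "Candidate A"]

# value_counts sorts by frequency (descending); head(5) keeps the top five.
top_occupations = one_candidate["contbr_occupation"].value_counts().head(5)
print(top_occupations)
```

By default value_counts leaves out missing occupations; passing dropna=False would show how many donors left the field blank.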
In [ ]: # 2e YOUR CODE HERE
Answered 1 days After Nov 16, 2021

Solution

Bikram answered on Nov 17 2021