Analyzing police traffic stop data
The goal of this assignment is to give you experience using the pandas data analysis library. It will also give
you experience using a well-documented third-party library, and navigating its documentation in search of
specific features that can help you complete your implementation.
You may work alone or in a pair on this assignment.
North Carolina traffic stop data
Much has been written about the impact of race on police traffic stops. In this assignment, we will examine
and analyze trends in traffic stops that occurred in the state of North Carolina from 2000 until 2015. We
will not be able to look at every single traffic stop, and will instead look at different subsets of data.
Feel free to continue exploring this data on your own after the assignment. You should have the necessary
programming skills to do so! You can find additional clean data from the Stanford Open Policing Project.




The pa6 directory includes a file named traffic_stops.py, which you will modify, a file named
pa6_helpers.py that contains a few useful functions, and a file named test_traffic_stops.py with
tests.
Please put all of your code for this assignment in traffic_stops.py. Do not add extra files and do not
modify any files other than traffic_stops.py.
The data/ directory in pa6 contains a file called get_files.sh that will download the data files necessary
for this assignment, along with some other files needed by the tests. To download these files, change into
the data/ directory and run this command from the Linux command-line:
$ sh get_files.sh
(Recall that we use $ to indicate the Linux command-line prompt. You should not include it when you run
this command.)
Please note that you must be connected to the network to use this script.
Do not add the data files to your repository! If you wish to use both CSIL & the Linux servers, and
your VM, you will need to run the get_files.sh script twice: once for CSIL & the Linux servers and once
for your VM.
Some of our utility code uses seaborn, a plotting library. This library is installed on the machines in CSIL.
You will need to install it on your VM using the following command:
sudo -H pip3 install seaborn
The sudo command will ask for a password. Use uccs as the password.

Due: Nov 20 at 23:59

Links:
http://pandas.pydata.org
https://www.citylab.com/life/2018/06/is-it-time-to-reconsider-traffic-stops/561557
https://openpolicing.stanford.edu/data

Getting started
We suggest that, as you work through the assignment, you keep an ipython3 session open to test the
functions you implement in traffic_stops.py. Run the following commands in ipython3 to get started:
In [1]: %load_ext autoreload

In [2]: %autoreload 2

In [3]: import pandas as pd

In [4]: import numpy as np

In [5]: import traffic_stops as ts
We will use ts to refer to the traffic_stops module in our examples below.
Data
The Stanford Open Policing Project maintains a database of records from traffic stops (i.e., when a police
officer pulls a driver over) around the country. We will be working with two different datasets extracted
from this database.
The first dataset contains data on traffic stops that occurred in the state of North Carolina. For each stop,
the dataset includes information related to the driver (gender, race, age, etc.), the stopping officer (a
unique identifier for the officer), and the stop itself (a unique identifier for the stop), the date of the stop,
the violation that triggered the stop, if any, etc. More specifically, the records from this dataset include the
following fields:
stop_id: a unique identifier of the stop
stop_date: the date of the stop
officer_id: a unique identifier for officers
driver_gender: the driver’s gender
driver_age: the driver’s age
driver_race: a column that combines information about the driver’s race and ethnicity
violation: the violation for which the driver was stopped
is_arrested: a boolean that indicates whether the driver was arrested
stop_outcome: the outcome of a stop (arrest, citation, written warning)
The gender column presumably contains information copied from the binary classification listed on the
driver’s license, which may or may not match the driver’s actual personal gender identity. The race column
presumably contains information about what the officer perceived the driver’s race to be, which may or
may not match the driver’s actual personal racial and ethnic identity.
We have constructed three files from this dataset for this assignment:
The first, all_stops_basic.csv, contains a small hand-picked sample of the data and is used in our
test code.
The second, all_stops_assignment.csv, contains a random sample of records from 500K stops (out
of 10M).
The third, all_stops_mini.csv, contains a random sample of 20 records and will be useful for
debugging.
Here, for example, is the data from all_stops_basic.csv:
stop_id,stop_date,officer_id,driver_gender,driver_age,driver_race,ethnicity,violation,is_arrested,stop_outcome
2168033, XXXXXXXXXX,10020,M,53.0,White,N,Registration/plates,False,Written Warning
4922383, XXXXXXXXXX,21417,M,22.0,Hispanic,H,Other,False,Citation
924766, XXXXXXXXXX,10231,M,38.0,White,N,Other,False,Citation
8559541, XXXXXXXXXX,11672,F,19.0,White,N,Other,False,Citation
8639335, XXXXXXXXXX,21371,F,76.0,White,N,Other,False,Citation
6198324, XXXXXXXXXX,11552,M,35.0,White,N,DUI,True,Arrest
58220, XXXXXXXXXX,,F,42.0,Black,N,Other,False,Citation
5109631, XXXXXXXXXX,11941,M,65.0,Black,N,Seat belt,False,Citation
Keep in mind that even “clean” data often contains irregularities. You’ll notice when you look at these files
that some values are missing. For example, the officer_id is missing in the eighth line of the file. When
you load the data into a dataframe, missing values like these will be represented with NaN values.
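Missing values like these surface as NaN once the data is loaded. A minimal sketch of spotting and filling them, using a made-up two-row CSV that mimics the missing officer_id above (the "UNKNOWN" placeholder is an illustrative choice, not something the assignment prescribes):

```python
import io

import pandas as pd

# A tiny in-memory CSV with a missing officer_id, mimicking the file above.
csv_text = io.StringIO(
    "stop_id,officer_id\n"
    "58220,\n"
    "5109631,11941\n"
)
df = pd.read_csv(csv_text, dtype={"officer_id": str})

# The empty field is loaded as NaN.
print(df["officer_id"].isna().tolist())

# fillna substitutes a placeholder for the missing entries.
df["officer_id"] = df["officer_id"].fillna("UNKNOWN")
```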
The second dataset contains information specific to those stops from the first dataset that resulted in a
search. Each record in this dataset includes fields for:
stop_id: the stop’s unique identifier
search_type: the type of search (e.g., incident to arrest or protective frisk)
contraband_found: indicates whether contraband was found during the search
search_basis: the reason for the search (e.g., erratic behavior or official information)
drugs_related_stop: indicates whether the stop was related to drugs
Here are the first ten lines from search_conducted_mini.csv:
stop_id,search_type,contraband_found,search_basis,drugs_related_stop
4173323,Probable Cause,False,Observation Suspected Contraband,
996719,Incident to Arrest,True,Observation Suspected Contraband,
5428741,Incident to Arrest,False,Other Official Info,
824895,Incident to Arrest,False,Erratic Suspicious Behaviour,
816393,Protective Frisk,False,Erratic Suspicious Behaviour,
5657242,Incident to Arrest,False,Other Official Info,
4534875,Incident to Arrest,False,Suspicious Movement,
4733445,Incident to Arrest,False,Other Official Info,
1537273,Incident to Arrest,False,Other Official Info,
As with the first dataset, some values are missing and will be represented with NaN values when you load
the data into a dataframe.
Please note that a stop from the first dataset will be represented in this second dataset only if it resulted in
a search.
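One way to relate the two datasets is a merge on stop_id. The sketch below uses made-up rows with the column names described above; the left merge and the search_conducted flag are one possible approach, not a required part of the assignment:

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets, using the
# column names described above (not the real CSV contents).
stops = pd.DataFrame({
    "stop_id": [1, 2, 3],
    "driver_race": ["White", "Black", "Hispanic"],
})
searches = pd.DataFrame({
    "stop_id": [2],
    "search_type": ["Protective Frisk"],
})

# A left merge keeps every stop; stops with no matching search
# get NaN in the search columns.
merged = stops.merge(searches, on="stop_id", how="left")

# Flag which stops involved a search.
merged["search_conducted"] = merged["stop_id"].isin(searches["stop_id"])
print(merged)
```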
Pandas
You could write the code for this assignment using the csv library, lists, dictionaries, and loops. The
purpose of this assignment, however, is to help you become more comfortable using pandas. As a result, you
are required to use pandas data frames to store the data and pandas methods to do the necessary
computations. If you use pandas methods efficiently and effectively, functions should be short and will
likely use multiple pandas methods.
Some of the tasks we will ask you to do require using pandas features that have not been covered in class.
This is by design: one of the goals of this assignment is for you to learn to read and use API
documentation. So, when figuring out these tasks, you are allowed (and, in fact, encouraged) to look at the Pandas
documentation. Not just that, any code you find in the Pandas documentation can be incorporated into
your code without attribution. (For your own convenience, though, we encourage you to include citations
for any code you get from the documentation that is more than one or two lines.) If, however, you find
Pandas examples elsewhere on the Internet, and use that code either directly or as inspiration, you must
include a code comment specifying its origin.
When solving the tasks in this assignment, you should assume that you can use a series of Pandas
operations to perform the required computations. Before trying to implement anything, you should spend some
time trying to find the right methods for the task. We also encourage you to experiment with them in
ipython3 before you incorporate them into your code.
Our implementation used filtering and vector operations, as well as methods like agg, apply, cut,
to_datetime, fillna, groupby, isin, loc, merge, read_csv, rename, size, transform, unstack,
np.mean, np.where, along with a small number of lists and loops. Do not worry if you are not using all of
these methods!
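To show how a few of those methods compose, here is a small sketch that counts stops by gender and outcome with groupby, size, and unstack. The data is made up for illustration; only the column names come from the dataset description:

```python
import pandas as pd

# A tiny illustrative dataframe (made-up values, not the real data).
df = pd.DataFrame({
    "driver_gender": ["M", "F", "M", "F", "M"],
    "stop_outcome": ["Citation", "Citation", "Arrest",
                     "Written Warning", "Citation"],
})

# Count stops per (gender, outcome) pair, then pivot the outcomes
# into columns with unstack; missing combinations become 0.
counts = df.groupby(["driver_gender", "stop_outcome"]).size().unstack(fill_value=0)
print(counts)
```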
http://pandas.pydata.org/pandas-docs/stable
Your tasks
Task 1: Reading in CSV files
Before we analyze our data, we must read it in. Often, we also need to process the data to make it
analysis-ready. It is usually good practice to define a function to read and process your data. In this task, you will
complete two such functions, one for each type of data.
You may find pd.read_csv, pd.to_datetime, pd.cut, and np.where along with dataframe methods, such
as fillna and isin, useful for Tasks 1a and 1b.
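To preview two of those tools, here is a sketch of pd.cut and np.where on a made-up age series. The bin edges and labels are illustrative assumptions, not necessarily the ones your solution should use:

```python
import numpy as np
import pandas as pd

ages = pd.Series([15, 25, 40, 70])

# pd.cut assigns each age to a labeled bin; by default each bin is
# half-open on the left, so 15 falls in (0, 21].
categories = pd.cut(ages, bins=[0, 21, 65, 100],
                    labels=["juvenile", "adult", "senior"])

# np.where picks between two values element-wise.
flag = np.where(ages < 21, "minor", "adult")
print(list(categories), list(flag))
```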
Task 1a: Building a dataframe from the stops CSV files
Your first task is to complete the function read_and_process_allstops in traffic_stops.py. This
function takes the name of a CSV file that pertains to the all_stops dataset and should return a pandas
dataframe, if the file exists.
If the file does not exist, your function should return None. (You can use the library function
os.path.exists to determine whether a file exists, or a try block (see R&S Exceptions) that returns None
when the file cannot be opened.)
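Both options can be sketched as follows; the file name passed in at the end is hypothetical, chosen only to demonstrate the None case:

```python
import os

import pandas as pd

def read_if_exists(csv_file):
    # Option 1: check for the file explicitly before reading.
    if not os.path.exists(csv_file):
        return None
    return pd.read_csv(csv_file)

def read_with_try(csv_file):
    # Option 2: attempt the read and catch the failure.
    try:
        return pd.read_csv(csv_file)
    except FileNotFoundError:
        return None

result = read_with_try("a_file_that_does_not_exist.csv")
```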
Note about reading the data
The pandas read_csv function allows you to read a CSV file into a dataframe. When you use this function,
it is good practice to specify data types for the columns. You can do so by passing a dictionary that maps
column names to types via the dtype parameter. The set of types available for this purpose is a little
primitive. In particular, you can specify str, int, float, and bool (or their np equivalents) as initial
column types. In some cases, you will need to adjust the types after you read in the data.
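A minimal sketch of this pattern, using an in-memory CSV with a few of the all_stops column names (the sample rows are made up). Reading officer_id as str keeps IDs from being mangled into floats, and the date column is converted after the read:

```python
import io

import pandas as pd

# A small in-memory CSV standing in for all_stops data (made-up rows).
csv_text = io.StringIO(
    "stop_id,stop_date,officer_id,driver_age\n"
    "1,2004-03-01,10020,53.0\n"
    "2,2005-07-15,,22.0\n"
)

# Specify simple initial types; officer_id is read as str so that
# IDs keep their original form (missing values become NaN).
df = pd.read_csv(csv_text, dtype={"stop_id": int, "officer_id": str})

# Adjust types after reading: dates become datetime64.
df["stop_date"] = pd.to_datetime(df["stop_date"])
print(df.dtypes)
```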
For this assignment (and in general), you should be very thoughtful about how you specify column data
types. Here are a few guidelines to consider:
A number that can begin
Answered Same Day Nov 18, 2021

Solution

Kshitij answered on Nov 21 2021
import numpy as np
import pandas as pd
# Defined constants for column names
ARREST_CITATION = 'arrest_or_citation'
IS_ARRESTED = 'is_arrested'
YEAR_COL = 'stop_year'
MONTH_COL = 'stop_month'
DATE_COL = 'stop_date'
STOP_SEASON = 'stop_season'
STOP_OUTCOME = 'stop_outcome'
SEARCH_TYPE = 'search_type'
SEARCH_CONDUCTED = 'search_conducted'
AGE_CAT = 'age_category'
OFFICER_ID = 'officer_id'
STOP_ID = 'stop_id'
DRIVER_AGE = 'driver_age'
DRIVER_RACE = 'driver_race'
DRIVER_GENDER = 'driver_gender'
VIOLATION = "violation"
SEASONS_MONTHS = {
    "winter": [12, 1, 2],
    "spring": [3, 4, 5],
    "summer": [6, 7, 8],
    "fall": [9, 10, 11]}
NA_DICT = {
    'drugs_related_stop': False,
    'search_basis': "UNKNOWN"
}
AGE_BINS = [0, 21, 36, 50, 65, 100]
AGE_LABELS = ['juvenile', 'young_adult', 'adult', 'middle_aged', 'senior']
SUCCESS_STOPS = ['Arrest', 'Citation']
CATEGORICAL_COLS = [AGE_CAT, DRIVER_GENDER, DRIVER_RACE,
                    STOP_SEASON, STOP_OUTCOME, VIOLATION]
# Task 1a
def read_and_process_allstops(csv_file):
    '''Read an all_stops CSV file into a dataframe; return None if it cannot be opened.'''
    type_dict = {STOP_ID: int, OFFICER_ID: str}
    try:
        df = pd.read_csv(csv_file, dtype=type_dict, parse_dates=[DATE_COL])
    except OSError:
        return None
    df[YEAR_COL] = df[DATE_COL].dt.year    # Create year column
    df[MONTH_COL] = df[DATE_COL].dt.month  # Create month column
    df[STOP_SEASON] = df[MONTH_COL].map({v_: k for k, v in...
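The last line above is cut off in the source; it appears to invert SEASONS_MONTHS into a month-to-season lookup for Series.map. A standalone sketch of that pattern, as a plausible reconstruction rather than the original code:

```python
import pandas as pd

SEASONS_MONTHS = {
    "winter": [12, 1, 2],
    "spring": [3, 4, 5],
    "summer": [6, 7, 8],
    "fall": [9, 10, 11]}

# Invert the dict into a month -> season lookup table.
month_to_season = {month: season
                   for season, months in SEASONS_MONTHS.items()
                   for month in months}

months = pd.Series([1, 4, 7, 10])
seasons = months.map(month_to_season)
print(seasons.tolist())
```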