Analyzing police traffic stop data
The goal of this assignment is to give you experience using the pandas data analysis library. It will also give
you experience using a well-documented third-party library, and navigating its documentation in search of
specific features that can help you complete your implementation.
You may work alone or in a pair on this assignment.
North Carolina traffic stop data
Much has been written about the impact of race on police traffic stops. In this assignment, we will examine
and analyze trends in traffic stops that occurred in the state of North Carolina from 2000 until 2015. We
will not be able to look at every single traffic stop, and will instead look at different subsets of data.
Feel free to continue exploring this data on your own after the assignment. You should have the necessary
programming skills to do so! You can find additional clean data from the Stanford Open Policing Project.




The pa6 directory includes a file named traffic_stops.py, which you will modify, a file named
pa6_helpers.py that contains a few useful functions, and a file named test_traffic_stops.py with
tests.
Please put all of your code for this assignment in traffic_stops.py. Do not add extra files and do not
modify any files other than traffic_stops.py.
The data/ directory in pa6 contains a file called get_files.sh that will download the data files necessary
for this assignment, along with some other files needed by the tests. To download these files, change into
the data/ directory and run this command from the Linux command-line:
$ sh get_files.sh
(Recall that we use $ to indicate the Linux command-line prompt. You should not include it when you run
this command.)
Please note that you must be connected to the network to use this script.
Do not add the data files to your repository! If you wish to use both CSIL & the Linux servers, and
your VM, you will need to run the get_files.sh script twice: once for CSIL & the Linux servers and once
for your VM.
Some of our utility code uses seaborn, a plotting library. This library is installed on the machines in CSIL.
You will need to install it on your VM using the following command:
sudo -H pip3 install seaborn
The sudo command will ask for a password. Use uccs as the password.

Due: Nov 20 at 23:59

Links:
http://pandas.pydata.org
https://www.citylab.com/life/2018/06/is-it-time-to-reconsider-traffic-stops/561557
https://openpolicing.stanford.edu/data

Getting started
We suggest that, as you work through the assignment, you keep an ipython3 session open to test the
functions you implement in traffic_stops.py. Run the following commands in ipython3 to get started:
In [1]: %load_ext autoreload

In [2]: %autoreload 2

In [3]: import pandas as pd

In [4]: import numpy as np

In [5]: import traffic_stops as ts
We will use ts to refer to the traffic_stops module in our examples below.
Data
The Stanford Open Policing Project maintains a database of records from traffic stops (i.e., when a police
officer pulls a driver over) around the country. We will be working with two different datasets extracted
from this database.
The first dataset contains data on traffic stops that occurred in the state of North Carolina. For each stop,
the dataset includes information related to the driver (gender, race, age, etc.), the stopping officer (a
unique identifier for the officer), and the stop itself (a unique identifier for the stop), the date of the stop,
the violation that triggered the stop, if any, etc. More specifically, the records from this dataset include the
following fields:
stop_id: a unique identifier of the stop
stop_date: the date of the stop
officer_id: a unique identifier for officers
driver_gender: the driver’s gender
driver_age: the driver’s age
driver_race: a column that combines information about the driver’s race and ethnicity
violation: the violation for which the driver was stopped
is_arrested: a boolean that indicates whether the driver was arrested
stop_outcome: the outcome of a stop (arrest, citation, written warning)
The gender column presumably contains information copied from the binary classification listed on the
driver’s license, which may or may not match the driver’s actual personal gender identity. The race column
presumably contains information about what the officer perceived the driver’s race to be, which may or
may not match the driver’s actual personal racial and ethnic identity.
We have constructed three files from this dataset for this assignment:
The first, all_stops_basic.csv, contains a small hand-picked sample of the data and is used in our
test code.
The second, all_stops_assignment.csv, contains a random sample of records from 500K stops (out
of 10M).
The third, all_stops_mini.csv, contains a random sample of 20 records and will be useful for
debugging.
Here, for example, is the data from all_stops_basic.csv:
stop_id,stop_date,officer_id,driver_gender,driver_age,driver_race,ethnicity,violation,is_arrested,stop_outcome
2168033, XXXXXXXXXX,10020,M,53.0,White,N,Registration/plates,False,Written Warning
4922383, XXXXXXXXXX,21417,M,22.0,Hispanic,H,Other,False,Citation
924766, XXXXXXXXXX,10231,M,38.0,White,N,Other,False,Citation
8559541, XXXXXXXXXX,11672,F,19.0,White,N,Other,False,Citation
8639335, XXXXXXXXXX,21371,F,76.0,White,N,Other,False,Citation
6198324, XXXXXXXXXX,11552,M,35.0,White,N,DUI,True,Arrest
58220, XXXXXXXXXX,,F,42.0,Black,N,Other,False,Citation
5109631, XXXXXXXXXX,11941,M,65.0,Black,N,Seat belt,False,Citation
Keep in mind that even “clean” data often contains irregularities. You’ll notice when you look at these files
that some values are missing. For example, the officer_id is missing in the eighth line of the file. When
you load the data into a dataframe, missing values like these will be represented with NaN values.
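Missing values like these surface as NaN once the data is loaded. A minimal sketch of spotting and filling them, using a made-up two-row CSV that mimics the missing officer_id above (the "UNKNOWN" placeholder is an illustrative choice, not something the assignment prescribes):

```python
import io

import pandas as pd

# A tiny in-memory CSV with a missing officer_id, mimicking the file above.
csv_text = io.StringIO(
    "stop_id,officer_id\n"
    "58220,\n"
    "5109631,11941\n"
)
df = pd.read_csv(csv_text, dtype={"officer_id": str})

# The empty field is loaded as NaN.
print(df["officer_id"].isna().tolist())

# fillna substitutes a placeholder for the missing entries.
df["officer_id"] = df["officer_id"].fillna("UNKNOWN")
```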
The second dataset contains information specific to those stops from the first dataset that resulted in a
search. Each record in this dataset includes fields for:
stop_id: the stop’s unique identifier
search_type: the type of search (e.g., incident to arrest or protective frisk)
contraband_found: indicates whether contraband was found during the search
search_basis: the reason for the search (e.g., erratic behavior or official information)
drugs_related_stop: indicates whether the stop was related to drugs
Here are the first ten lines from search_conducted_mini.csv:
stop_id,search_type,contraband_found,search_basis,drugs_related_stop
4173323,Probable Cause,False,Observation Suspected Contraband,
996719,Incident to Arrest,True,Observation Suspected Contraband,
5428741,Incident to Arrest,False,Other Official Info,
824895,Incident to Arrest,False,Erratic Suspicious Behaviour,
816393,Protective Frisk,False,Erratic Suspicious Behaviour,
5657242,Incident to Arrest,False,Other Official Info,
4534875,Incident to Arrest,False,Suspicious Movement,
4733445,Incident to Arrest,False,Other Official Info,
1537273,Incident to Arrest,False,Other Official Info,
As with the first dataset, some values are missing and will be represented with NaN values when you load
the data into a dataframe.
Please note that a stop from the first dataset will be represented in this second dataset only if it resulted in
a search.
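One way to relate the two datasets is a merge on stop_id. The sketch below uses made-up rows with the column names described above; the left merge and the search_conducted flag are one possible approach, not a required part of the assignment:

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets, using the
# column names described above (not the real CSV contents).
stops = pd.DataFrame({
    "stop_id": [1, 2, 3],
    "driver_race": ["White", "Black", "Hispanic"],
})
searches = pd.DataFrame({
    "stop_id": [2],
    "search_type": ["Protective Frisk"],
})

# A left merge keeps every stop; stops with no matching search
# get NaN in the search columns.
merged = stops.merge(searches, on="stop_id", how="left")

# Flag which stops involved a search.
merged["search_conducted"] = merged["stop_id"].isin(searches["stop_id"])
print(merged)
```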
Pandas
You could write the code for this assignment using the csv library, lists, dictionaries, and loops. The
purpose of this assignment, however, is to help you become more comfortable using pandas. As a result, you
are required to use pandas data frames to store the data and pandas methods to do the necessary
computations. If you use pandas methods efficiently and effectively, functions should be short and will
likely use multiple pandas methods.
Some of the tasks we will ask you to do require using pandas features that have not been covered in class.
This is by design: one of the goals of this assignment is for you to learn to read and use API
documentation. So, when figuring out these tasks, you are allowed (and, in fact, encouraged) to look at the Pandas
documentation. Not just that, any code you find in the Pandas documentation can be incorporated into
your code without attribution. (For your own convenience, though, we encourage you to include citations
for any code you get from the documentation that is more than one or two lines.) If, however, you find
Pandas examples elsewhere on the Internet, and use that code either directly or as inspiration, you must
include a code comment specifying its origin.
When solving the tasks in this assignment, you should assume that you can use a series of Pandas
operations to perform the required computations. Before trying to implement anything, you should spend some
time trying to find the right methods for the task. We also encourage you to experiment with them in
ipython3 before you incorporate them into your code.
Our implementation used filtering and vector operations, as well as methods like agg, apply, cut,
to_datetime, fillna, groupby, isin, loc, merge, read_csv, rename, size, transform, unstack,
np.mean, np.where, along with a small number of lists and loops. Do not worry if you are not using all of
these methods!
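To show how a few of those methods compose, here is a small sketch that counts stops by gender and outcome with groupby, size, and unstack. The data is made up for illustration; only the column names come from the dataset description:

```python
import pandas as pd

# A tiny illustrative dataframe (made-up values, not the real data).
df = pd.DataFrame({
    "driver_gender": ["M", "F", "M", "F", "M"],
    "stop_outcome": ["Citation", "Citation", "Arrest",
                     "Written Warning", "Citation"],
})

# Count stops per (gender, outcome) pair, then pivot the outcomes
# into columns with unstack; missing combinations become 0.
counts = df.groupby(["driver_gender", "stop_outcome"]).size().unstack(fill_value=0)
print(counts)
```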
http://pandas.pydata.org/pandas-docs/stable
Your tasks
Task 1: Reading in CSV files
Before we analyze our data, we must read it in. Often, we also need to process the data to make it
analysis-ready. It is usually good practice to define a function to read and process your data. In this task, you will
complete two such functions, one for each type of data.
You may find pd.read_csv, pd.to_datetime, pd.cut, and np.where along with dataframe methods, such
as fillna and isin, useful for Tasks 1a and 1b.
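To preview two of those tools, here is a sketch of pd.cut and np.where on a made-up age series. The bin edges and labels are illustrative assumptions, not necessarily the ones your solution should use:

```python
import numpy as np
import pandas as pd

ages = pd.Series([15, 25, 40, 70])

# pd.cut assigns each age to a labeled bin; by default each bin is
# half-open on the left, so 15 falls in (0, 21].
categories = pd.cut(ages, bins=[0, 21, 65, 100],
                    labels=["juvenile", "adult", "senior"])

# np.where picks between two values element-wise.
flag = np.where(ages < 21, "minor", "adult")
print(list(categories), list(flag))
```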
Task 1a: Building a dataframe from the stops CSV files
Your first task is to complete the function read_and_process_allstops in traffic_stops.py. This
function takes the name of a CSV file that pertains to the all_stops dataset and should return a pandas
dataframe, if the file exists.
If the file does not exist, your function should return None. (You can use the library function
os.path.exists to determine whether a file exists, or a try block (see R&S Exceptions) that returns None
when the file cannot be opened.)
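Both options can be sketched as follows; the file name passed in at the end is hypothetical, chosen only to demonstrate the None case:

```python
import os

import pandas as pd

def read_if_exists(csv_file):
    # Option 1: check for the file explicitly before reading.
    if not os.path.exists(csv_file):
        return None
    return pd.read_csv(csv_file)

def read_with_try(csv_file):
    # Option 2: attempt the read and catch the failure.
    try:
        return pd.read_csv(csv_file)
    except FileNotFoundError:
        return None

result = read_with_try("a_file_that_does_not_exist.csv")
```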
Note about reading the data
The pandas read_csv function allows you to read a CSV file into a dataframe. When you use this function,
it is good practice to specify data types for the columns. You can do so by passing a dictionary that maps
column names to types via the dtype parameter. The set of types available for this purpose is a little
primitive. In particular, you can specify str, int, float, and bool (or their np equivalents) as initial
column types. In some cases, you will need to adjust the types after you read in the data.
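A minimal sketch of this pattern, using an in-memory CSV with a few of the all_stops column names (the sample rows are made up). Reading officer_id as str keeps IDs from being mangled into floats, and the date column is converted after the read:

```python
import io

import pandas as pd

# A small in-memory CSV standing in for all_stops data (made-up rows).
csv_text = io.StringIO(
    "stop_id,stop_date,officer_id,driver_age\n"
    "1,2004-03-01,10020,53.0\n"
    "2,2005-07-15,,22.0\n"
)

# Specify simple initial types; officer_id is read as str so that
# IDs keep their original form (missing values become NaN).
df = pd.read_csv(csv_text, dtype={"stop_id": int, "officer_id": str})

# Adjust types after reading: dates become datetime64.
df["stop_date"] = pd.to_datetime(df["stop_date"])
print(df.dtypes)
```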
For this assignment (and in general), you should be very thoughtful about how you specify column data
types. Here are a few guidelines to consider:
A number that can begin
Answered Same Day Nov 18, 2021

Solution

Kshitij answered on Nov 21 2021
import numpy as np
import pandas as pd
# Defined constants for column names
ARREST_CITATION = 'arrest_or_citation'
IS_ARRESTED = 'is_arrested'
YEAR_COL = 'stop_year'
MONTH_COL = 'stop_month'
DATE_COL = 'stop_date'
STOP_SEASON = 'stop_season'
STOP_OUTCOME = 'stop_outcome'
SEARCH_TYPE = 'search_type'
SEARCH_CONDUCTED = 'search_conducted'
AGE_CAT = 'age_category'
OFFICER_ID = 'officer_id'
STOP_ID = 'stop_id'
DRIVER_AGE = 'driver_age'
DRIVER_RACE = 'driver_race'
DRIVER_GENDER = 'driver_gender'
VIOLATION = "violation"
SEASONS_MONTHS = {
    "winter": [12, 1, 2],
    "spring": [3, 4, 5],
    "summer": [6, 7, 8],
    "fall": [9, 10, 11]}
NA_DICT = {
    'drugs_related_stop': False,
    'search_basis': "UNKNOWN"
}
AGE_BINS = [0, 21, 36, 50, 65, 100]
AGE_LABELS = ['juvenile', 'young_adult', 'adult', 'middle_aged', 'senior']
SUCCESS_STOPS = ['Arrest', 'Citation']
CATEGORICAL_COLS = [AGE_CAT, DRIVER_GENDER, DRIVER_RACE,
                    STOP_SEASON, STOP_OUTCOME, VIOLATION]
# Task 1a
def read_and_process_allstops(csv_file):
    '''Read an all_stops CSV file into a dataframe; return None if it cannot be opened.'''
    type_dict = {STOP_ID: int, OFFICER_ID: str}
    try:
        df = pd.read_csv(csv_file, dtype=type_dict, parse_dates=[DATE_COL])
    except OSError:
        return None
    df[YEAR_COL] = df[DATE_COL].dt.year    # Create year column
    df[MONTH_COL] = df[DATE_COL].dt.month  # Create month column
    df[STOP_SEASON] = df[MONTH_COL].map({v_: k for k, v in...
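The last line above is cut off in the source; it appears to invert SEASONS_MONTHS into a month-to-season lookup for Series.map. A standalone sketch of that pattern, as a plausible reconstruction rather than the original code:

```python
import pandas as pd

SEASONS_MONTHS = {
    "winter": [12, 1, 2],
    "spring": [3, 4, 5],
    "summer": [6, 7, 8],
    "fall": [9, 10, 11]}

# Invert the dict into a month -> season lookup table.
month_to_season = {month: season
                   for season, months in SEASONS_MONTHS.items()
                   for month in months}

months = pd.Series([1, 4, 7, 10])
seasons = months.map(month_to_season)
print(seasons.tolist())
```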