Great Deal! Get Instant $10 FREE in Account on First Order + 10% Cashback on Every Order Order Now

Data files # Read in the Building_Permits.csv file in the data/ folder and store it in a data frame named# `permits`. This file contains building permits from the City of Seattle. See the link below#...

1 answer below »

Data files






# Read in the Building_Permits.csv file in the data/ folder and store it in a data frame named


# `permits`. This file contains building permits from the City of Seattle. See the link below


# for more info. Be sure set the working directory to the appropriate folder location.


# https://data.seattle.gov/Permitting/Building-Permits/76t5-zqzr






permits <->






# How many unique values are in the `OriginalCity` column of the dataset? Store the unique values


# in a vector called `unique_cities`. As you can see - there is only one unique city but there are


# multiple variations on how it is stored in the data. (hint: you may want to look into the


# `unique()` function)










# Suppose we only want information on closed permits that have a valid type description. Filter


# the permits data frame to only include rows where the `StatusCurrent` column is "Issued" and


# the `PermitTypeMapped` is "Building". Store the filtered data frame as `filtered_permits`. It is


# fine to do the filtering across successive steps










# How many unique values for the `PermitClass` column are in `filtered_permits`? Store the unique


# values in a vector called `unique_permit_classes`.










# We'll work with this filtered dataset in future exercises and look at the values in the


# `PermitClass` column. Store the entire `filtered_permits` data frame as a file named


# 'filtered_building_permits.csv'. Set `row.names` to FALSE when saving.










# Read the filtered_building_permits.csv file that you just saved back in and store the data in a


# data frame named `test_read_write`. Ensure that test_read_write has the same dimensions as your


# `filtered_permits`. If it doesn't, you should check that `row.names` was appropriately set


# in the previous step and re-save.






Basic grammar functions






## Note: Use `dplyr` and `tidyr` functions for all exercises in this module. Do not use dollar


## sign ($) or bracket ([]) notation!


library(tidyr)


library(dplyr)


# Read in the data/Building_Permits.csv file and store it in a data frame named `permits`. This


# file contains building permits from the City of Seattle. See the link below for more info. Be sure


# set the working directory to the appropriate folder location.


# https://data.seattle.gov/Permitting/Building-Permits/76t5-zqzr


permits <->


permits


# Also load in the data/filtered_building_permits.csv file (what you created as part of the last


# exercise in Module 3). Store the data in a data frame called `filtered_permits`.






filtered_permits <->






# We're going to use dplyr to filter the permits data using the same steps as we did previously.


# Specifically, we're going to keep only entries that have a status of 'Issued' in the


# `StatusCurrent` column and have a 'Building' value for the `PermitTypeMapped` column. Use only


# functions from the `dplyr` library to perform the filtering. Store the filtered data in a data


# frame called `filtered_permits_dplyr`. Store the dimensions of the resulting data frame in a


# variable named `dimensions`. Then, print the value of the `dimensions` variable. Also ensure


# that `filtered_building_permits` and `filtered_permits_dplyr` are identical










# Now, let's create a `Year` column that contains the year that the permit was issued. To do this,


# use the `mutate` dplyr function and extract the first 4 characters of the `AppliedDate` column.


# Add this `year` column to the `filtered_permits_dplyr` data frame.










# Keep only the rows of the `filtered_permits_dplyr` that have a valid (non-blank) year. To do so,


# filter out all Year values that are `NaN` values (hint: see the `drop_na()` function from tidyverse)


# and then drop all entries that have "" as the Year. Then, select only the following columns to keep


# in the data frame: PermitClass, PermitClassMapped, and Year. Store the resulting data frame as


# `filtered_permits_dplyr`










# Finally, arrange the `filtered_permits_dplyr` data frame such that the most recent years appear


# first and the earliest years appear last










# Save the `filtered_permits_dplyr` data frame as "filtered_permits_dplyr.csv". Set `row.names` to


# FALSE. We will be using this data frame in the next exercise.






Pipes






# Now, we're going to filter/mutate/arrange the permits data frame in the same manner that we did


# in the previous exercise. However, instead of performing subsequent calls using dplyr functions,


# we are going to chain together dplyr calls using pipes. In a single pipe, create a


# `filtered_permits_piped` data frame that is identical to the `filtered_permits_dplyr` data frame.


# The steps in the process are:


# 1. FILTER permits such that the `StatusCurrent` column is 'Issued'


# 2. FILTER permits such that the `PermitTypeMapped` column is 'Building'


# 3. MUTATE the data frame to add a `Year` column that contains the first 4 characters of the


# `AppliedDate` column


# 4. FILTER permits such that they have a valid (non-blank) year (removing NaNs and "" entries)


# 5. SELECT only the following columns from the data frame: `PermitClass`, `PermitClassMapped`, and


# `Year`


# 6. ARRANGE the data frame so the most recent years are first and the earliest years are last










# Ensure that your `filtered_permits_dplyr` and `filtered_permits_piped` data frames are identical.


# You will first need to MUTATE the `Year` column in your `filtered_permits_piped` data frame to


# be an an integer value (hint: use the `as.integer` function). This is because, when saving/loading


# data frames, R will default to storing the data using types that it sees most fit. As an example,


# though when you saved the `filtered_permits_dplyr` data frame, `Year` was a string (we literally


# took a substring to form the column), you can see that it is now a numeric/integer value because


# we filtered, saved, and then re-loaded.










# Now, you can use the `identical()` function to compare You do not need to save the


# `filtered_permits_piped` data frame






Groups






## Note: Use `dplyr` and `tidyr` functions for all exercises in this module. Do not use dollar


## sign ($) or bracket ([]) notation!






# Read in the data/Building_Permits.csv file and store it in a data frame named `permits`. This


# file contains building permits from the City of Seattle. See the link below for more info. Be sure


# set the working directory to the appropriate folder location.


# https://data.seattle.gov/Permitting/Building-Permits/76t5-zqzr










# Filter the `permits` to only include instances where the estimated project cost value


# (`EstProjectCost`) is greater than 0. Also filter to only include new permits for building


# (Hint: use the `PermitTypeMapped` and `PermitTypeDesc` columns). Also, as you did in the


# previous exercise, add a `Year` column by taking the first four characters from the `date.


# `AppliedDate`` column. However, do not filter on the `Year` column as you did previously.


# Store the result in a data frame named `filtered_permits`














# Group the permits by Year to get the mean estimated project cost for each year. Name the


# mean estimated cost column `MeanEstCost`. Store the results in a data frame named


# `grouped_by_year`










# Which year after 2010 had the highest mean estimated project cost? Use `dplyr` functions


# and pipes to find the value of interest (Hint: use filter operations). Store the value in


# a data frame with a single row that has `Year` and `MeanEstCost` columns. Name the data


# frame `highest_cost_by_year`.


# Note: if you wanted to extract the `Year` value, you could use dplyr's `pull()` function.


# However, you are not asked to do that here. Keep the result as a data frame with a single


# row










# Now, create a second grouping to get the mean estimated project cost for each year and


# `PermitClass` as well as the number of records for each year and `PermitClass`


# combination (i.e. group by both simultaneously and summarize by both simultaneously).


# Store the results in a data frame named 'grouped_by_year_and_class`. The data frame


# should have four columns: `PermitClass`, `Year`, `N` (for the number of records) and


# `MeanEstCost`. (Hint: use the `n()` function in your summarize to get counts)










# For year and permit type combinations that had at least 1000 permits, what was the


# minimum estimated cost? Store the result in a data frame called `min_cost_high_count`


# that has the same four columns as your `grouped_by_year_and_class` data frame.






Joins






## Note: Use `dplyr` and `tidyr` functions for all exercises in this module. Do not use dollar


## sign ($) or bracket ([]) notation!






# Read in the data/Building_Permits.csv file and store it in a data frame named `permits`. This


# file contains building permits from the City of Seattle. See the link below for more info. Be sure


# set the working directory to the appropriate folder location.


# https://data.seattle.gov/Permitting/Building-Permits/76t5-zqzr










# Read in the data/zip_code_demos_2000.csv file and store it in a data frame named `zip_codes` This


# file contains demographic information for many of Washington State's zip codes from the year


# 2000.










# In the `zip_code` data frame, you can see that both the `population` and `population_density`


# columns are stored as character columns while they contain numeric values. This is because of


# the commas present in the data source. Let's clean both columns and convert them to numeric


# values by first replacing the commas with "" characters (you can use functions from the


# `stringr` package to do this or use R's `gsub()` function) and then convert them to numeric


# values using the `as.numeric()` function. Instead of storing the results in a new variable,


# simply update the values of the existing `zip_codes` data frame.










# Filter the `permits` data frame to only include instances where the `OriginalZip` variable is


# not NA (use tidyverse's `drop_na()` function). Store the filtered data frame as a data frame


# named `filtered_permits`










# Which zip codes in the `filtered_permits` data frame do we not have information for in the


# `zip_codes` data frame? To answer this question, create a `not_found` vector that contains


# all the zip codes from `filtered_permits` that are not in `zip_codes`. One approach


# to create this variable is to first isolate all unique zip codes in `filtered_permits` and


# then use the `%in%` function to find which of the unique zip codes are NOT in the `zip_codes`


# data frame.










# Without changing the names of any columns, perform a left join of the `filtered_permits` data


# frame and the `zip_codes` data frame. Have the `filtered_permits` data frame be the "left"


# side of the join. Store the resulting data frame in a variable named `left_joined_data`










# Now, find which of the rows in the `filtered_permits` data frame were not matched to rows in


# the `zip_codes` data frame. To do so, filter to rows of `left_joined_data` where one of the


# columns from the `zip_codes` data frame is NA (you do not need to check for all columns, just


# pick one and use that). Store the result in a data frame named `not_matched_left_join`










# Do an inner join of the `filtered_permits` data frame and the `zip_codes` data frame. Store


# the result in a data frame named `inner_joined_data`.










# Now, we can see if there is any relationship between zip code demographics and the number of


# permits for a given area. Create a data frame named `grouped_by_zip` that is a grouping of


# the `inner_joined_data` data frame by the `OriginalZip` column and then summarized to give


# a column of counts (i.e. the number of permits in each zip code) and the mean population


# density of each zip code






# Is there any relationship that you can see between the number of permits in a zip code and


# the population density of that zip code? You do not need to do any formal tests of this -


# simply doing an "eyeball" test from looking at the data is enough.


#


# From here, we can see how joining multiple data sets would allow us to answer really


# interesting questions about the relationship between building permits and population density.


# Of course, in this instance, it comes with the caveat that the population demographic data is


# a bit older than the permit data. We can also expand this to see the relationship between


# population demographics (age, race, ethnicity, education) and the number of residential


# building permits in an area. We can also look at income and commercial building permits in


# an area. Being able to join data opens doors so we're no longer limited to a single data set!










Tidyr






filter the `permits` to only include instances where the


# estimated project cost value (`EstProjectCost`) is greater than 0. Also filter to only include


# new permits for building (Hint: use the `PermitTypeMapped` and `PermitTypeDesc` columns). Add


# a `Year` column by taking the first four characters from the date. Store the result in a data


# frame named `filtered_permits`










# Similar to what you did in a previous exercise: create a grouping to get the number of records


# for each year and `PermitClass` combination (i.e. group by both simultaneously and summarize


# by both simultaneously). Store the results in a data frame named 'grouped_by_year_and_class`.


# The data frame should have three columns: `PermitClass`, `Year`, and `N` (for the number of


# records). (Hint: use the `n()` function in your summarize to get counts)










# As you can see, the `grouped_by_year_and_class` data frame is a "long" data frame. Create a


# data frame named `grouped_by_year_and_class_wide` that is a wide version of the same data. Use


# tidyr's `pivote_wider()` function to do this. Your "wide" data frame should have a column for the


# Year and then a column for every category of `PermitClass` (8 columns in total). The values of


# the data frame should be counts (i.e. your values for `N`)










# Now, use the `grouped_by_year_and_class_wide` data frame to recreate the


# `grouped_by_year_and_class` data frame using tidyr's `pivot_longer()` function. The cols to pivot


# should be everything except the Year column (written as "-Year" , note the "-" symbol). The


# column names should go to a column called "PermitClass", and the values should go to a column


# called "N" (to match the orignal data frame). After creating the data frame, filter out rows with


# NA values. Store the results in a data frame named `grouped_by_year_and_class_long`. Check that


# the dimensions of the resulting data frame are the same as the `grouped_by_year_and_class` data frame










# Lastly, arrange both the `grouped_by_year_and_class` and `grouped_by_year_and_class_long` data


# frames by year and `PermitClass` (the order doesn't matter as long as it is consistent across)


# the two data frames). Ensure that the two data frames are identical by using the `identical()`


# function.










ggplot basics






# Read in the data/filtered_permits_dplyr.csv file (what was created during the previous module)


# as a data frame named `filtered_permits`. This data should be a filtered version of the permits


# data that has three columns: PermitClass, PermitClassMapped, and Year. The data should have


# 8,238 observations (rows).














# Group the data by year to get a count of the number of permits for each year. Store the result


# in a variable named `permits_by_year`. You should use dplyr to do this grouping (hint: use


# `group_by()` followed by a `summarize()` call using the `n()` function).














# Using ggplot, create a bar chart of the number of permits by year. Store the plot as `plot1`.


# After storing, you can show the plot using the `show` function (i.e. `show(plot1)`).














# Filter the `permits_by_year` data frame such that it only contains Years including and after


# 2010. Store the results in a data frame named `permits_by_year_filtered`. (hint: You can


# simply filter the `permits_by_year` data frame - you do not need to filter and perform the


# group by aggregation on the `permits` data frame, though that is an option.)














# Using ggplot, create a bar chart of the number of permits by year. Store the plot as `plot1`.


# After storing, you can show the plot using the `show` function (i.e. `show(plot1)`).














# Now, create a new plot (`plot2`) with axis labels. Name the bottom axis "Year" and the left


# (vertical) axis "Number of Permits". Change the title of the plot to


# "Number of Active Permits by Year".


# You can do make these changes by altering `plot1` (i.e. plot2 <- plot1="" +="" arguments)="">


# than recreating the plot from scratch Be sure the resulting plot is stored as `plot3`. You


# can use the `show` function to view the plot.










ggplot application






# Read in the data/filtered_permits_dplyr.csv file (what was created during the previous module)


# as a data frame named `filtered_permits`. This data should be a filtered version of the permits


# data that has three columns: PermitClass, PermitClassMapped, and Year. The data should have


# 8,238 observations (rows).














# As you did in the previous exercise, filter out permits prior to 2010. This time, however,


# instead of grouping by year, group by both year and the permit class to get a count of the


# number of permits for each class for each year. Store the result in a variable named


# `permits_by_year_by_class`. You should use dplyr pipes to do this filtering and grouping.














# Using ggplot, create a stacked bar chart that shows the number of permits each year with


# colors to indicate the `PermitClass` (see Figure 16.7 in the textbook for an example).


# Name the bottom axis "Year" and the left (vertical) axis "Number of Permits". Save the


# figure in a variable called `plot1`.














# Change the scale of the graph so it is colored using a ColorBrewer palette. You can see


# the available palettes here:


# https://www.r-graph-gallery.com/38-rcolorbrewers-palettes_files/figure-html/thecode-1.png


# Save the result as `plot1` (replacing the previous variable).














# Using ggplot, create a facet of the permits by year using `PermitClass` as your facet. This


# should result in a bar chart of the number of permits per year for each unique value of


# `PermitClass`. The facets should all be in a single plot that you should store as `plot2`.

























Answered Same Day Mar 24, 2023

Solution

Mohd answered on Mar 25 2023
27 Votes
SOLUTION.PDF

Answer To This Question Is Available To Download

Related Questions & Answers

More Questions »

Submit New Assignment

Copy and Paste Your Assignment Here