###############################################################################

### Assignment 3: Visualization ###

### Visual analysis of World Bank Indicators

###

### Below each prompt in the file, write the code necessary/indicated to generate

### the required plots. See the assignment page on Canvas for details.





###############################################################################

##### PART 1: Loading and Understanding the Data #####

### In this section you will load the data and necessary packages





# For this assignment, you'll be working with the database of World Bank Development

# Indicators. You can also explore all of this data online at

# https://data.worldbank.org/products/wdi

#

# For this assignment you will be using a separate R package called `wbstats` that

# will let you use basic functions to "access" the indicator database. (Note that

# you can look at the package documentation at

# https://cran.r-project.org/web/packages/wbstats/wbstats.pdf to see all of the

# different functions and options it offers as well as how to use them. This

# documentation is in the standard R package doc format, so is worth viewing at least

# once to have a sense for what these look like!)

# (Packages are installed once from the console, e.g. install.packages("wbstats"),
# and are deliberately not installed from within this script.)

# [1pt] Start by installing and loading (with library()) the `wbstats` package.

# DO NOT include the install.packages() command in your script file.

library(wbstats)

library(ggplot2)

# Also load in other required packages (`ggplot2`, `dplyr`, etc.) here.

# You can alternatively just load the whole `tidyverse` package.

# Do not include any `install.packages()` calls in this file.

library(tidyverse)

library(dplyr)

# [2pt] The World Bank organizes data about countries into different indicators

# (measures). For example: "Total Population" is an indicator, as is "Individuals

# using the Internet (% of population)". You can view a complete list of the

# indicators on the World Bank's website at https://data.worldbank.org/indicator?tab=all

# (you will be using this website repeatedly in this assignment).

#

# Get a listing of the available indicators by calling the package-provided

# `wb_indicators()` function, which returns a data frame of information about them.

# Print out the number of rows in this data frame to see how many indicators there

# are (and this listing is missing a few!) Also inspect the data frame (such as

# using View()) to see what information is available about each indicator.

#

# IMPORTANT: notice that each indicator contains an "Indicator ID", a special code

# used to refer to that indicator. This is because the names are so long and

# complex, so the World Bank uses codes to refer to each piece of data. Instead of

# "Individuals using the Internet...", you'd refer to indicator IT.NET.USER.ZS.

# In general, you will be using these IDs as identifiers, rather than the full text

# of the indicator's title.





list_assignment <- wb_indicators()

print(nrow(list_assignment))






# [2pt] You can find the codes for different indicators on the World Bank's website.

# You can visit the full list of indicators at https://data.worldbank.org/indicator?tab=all

# and click on each one to get more information on that indicator (including seeing

# a sample visualization). You can find the indicator ID by clicking the "Details"

# button, or by looking in the URL (it's part of the path).

# See the `examples/indicators.png` file in this project for an example.

#

# Using this website, find the indicator ID for the "CO2 emissions (kt)" indicator.

# In a comment below, state the ID for this indicator to show that you looked it up.

#

# (It's also possible to look up indicator codes by using the provided `wb_search()`

# function, but it can be a bit less reliable than checking the website, and it requires

# regular expressions to use well.)





# The ID for "CO2 emissions (kt)" is EN.ATM.CO2E.KT





# [3pt] Once you've identified an indicator of interest and its ID, you can use

# the `wbstats` package to access the data for that indicator. You get data

# from the World Bank by using the `wb_data()` function. This function expects at

# least two (named) arguments: `country` which should be a character vector of

# countries to get data on (with a few special options), and `indicator` which

# should be a character vector of indicator IDs to access. For example, you can

# get % Internet Users data for all countries with the following:

#

# wb_data(country = "all", indicator = c("IT.NET.USER.ZS"), mrv = 1, lang = "en")

#

# The `mrv = 1` argument ("most recent value") says to get just a single year's

# worth of data (the most recent year). It's also possible to give a specific

# range of years; see the `wbstats` documentation for details.

#

# Using the `wb_data()` function, get a data frame of the "CO2 emissions (kt)"

# for all countries for the 1 most recent year (use "countries_only" as the `country`

# argument to just get countries and not aggregations). Save (all) the data for

# the top 10 countries with highest carbon emissions in a data frame called

# `top_10_co2_countries`. You will need to do some light data wrangling to choose

# only these 10 rows.

#

# Note that for all data wrangling in this assignment, you can either use `dplyr`

# functions, base R syntax (dollar signs and brackets), or a mix of both. I

# strongly recommend you use `dplyr` primarily, but do what seems simplest and

# makes sense to you.





top_10_co2_countries <- wb_data(country = "countries_only", indicator = c("EN.ATM.CO2E.KT"), mrv = 1, lang = "en") %>%
  filter(!is.na(EN.ATM.CO2E.KT)) %>%  # column names are unquoted inside dplyr verbs
  arrange(desc(EN.ATM.CO2E.KT)) %>%   # largest emitters first
  head(10)
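
# Illustration only (made-up numbers, not World Bank data): a minimal sketch of
# the "keep the top N rows" wrangling described above. `toy_emissions` and
# `toy_top_2` are hypothetical names. Column names go into dplyr verbs unquoted;
# a quoted name such as "co2_kt" would be treated as a constant string.
library(dplyr)

toy_emissions <- data.frame(
  iso3c = c("AAA", "BBB", "CCC", "DDD"),
  co2_kt = c(120, NA, 950, 430),
  stringsAsFactors = FALSE
)

toy_top_2 <- toy_emissions %>%
  filter(!is.na(co2_kt)) %>%   # drop rows with no reported value
  arrange(desc(co2_kt)) %>%    # largest values first
  head(2)                      # keep the top 2 rows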





###############################################################################

##### PART 2: CO2 Emissions by Country #####

### In this section you will generate a bar chart of the total CO2 emissions of

### the top-10 countries with the highest emission levels

###

### You can see an example of this plot in `examples/top_10_co2_plot.png`

###

### The instructions below have multiple steps as a single comment; it is up to

### you to organize your code below that.

###

### Throughout this assignment, you are welcome to adjust the styling of the plots

### (e.g., make text different sizes, use different colors, etc), so long as you

### maintain the *effectiveness* and *expressiveness* of the plots.





# [2pt] Use the `ggplot()` function to create the plot. The data will be your

# `top_10_co2_countries` from the previous section.













#

# [4pt] You will need to use column geometry (https://ggplot2.tidyverse.org/reference/geom_bar.html)

# to create the chart. The country's ISO3 code (the three-letter code used to

# refer to that country, such as "USA" or "IND") will go on the x-axis, and the

# emission amount will go on the y-axis.

# You can use the `reorder()` function to "sort" the country ISO3 codes (a factor,

# the first argument) by the indicator value column (the second argument), and

# then use that sorted list as the aesthetic mapping. See

# https://www.r-graph-gallery.com/267-reorder-a-variable-in-ggplot2.html#reorder

# for an example

ggplot(data = top_10_co2_countries, mapping = aes(x = reorder(iso3c, EN.ATM.CO2E.KT), y = EN.ATM.CO2E.KT)) +
  geom_col()
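
# Illustration only: a small sketch of how `reorder()` sorts one vector by
# another, using made-up values (`toy_codes`, `toy_values` are hypothetical).
# The factor levels of the result follow the second argument in ascending
# order, and those levels are what control the order of bars along the axis.
toy_codes <- c("AAA", "BBB", "CCC")
toy_values <- c(30, 10, 20)
toy_sorted <- reorder(toy_codes, toy_values)
levels(toy_sorted)  # "BBB" "CCC" "AAA"
# Inside aes(), reorder(code, -value) would put the largest value first instead.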









#

# [2pt] Use the `labs()` function to specify the title and axis labels for your

# chart. The title should be "Top 10 Countries by CO2 Emissions", the x-axis

# should be labeled "Country (iso3)", and the y-axis should be labeled with the

# complete indicator name.

# Optionally, you can effectively adjust the formatting of the numbers on the

# y-axis using the scales package (https://scales.r-lib.org/). This will let you

# put e.g., commas in the large numbers. Note that this will also involve

# specifying a scale for your plot!

#

# [1pt] Once completed, save your plot in a variable called `top_10_c02_plot`.

# Note that you can print() out this variable in order to see the plot generated

# when you run your script.
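
# Illustration only: a minimal sketch of adding a title, axis labels, and
# comma-formatted y-axis numbers to a column chart. The data are made up, the
# names `toy_df` and `toy_plot` are hypothetical, and the `scales` package is
# assumed to be installed.
library(ggplot2)

toy_df <- data.frame(code = c("AAA", "BBB"), amount = c(1250000, 480000))

toy_plot <- ggplot(toy_df, aes(x = reorder(code, amount), y = amount)) +
  geom_col() +
  scale_y_continuous(labels = scales::comma) +  # 1,250,000 rather than 1250000
  labs(title = "Toy Title", x = "Code", y = "Amount (units)")

# print(toy_plot) would draw the chart when the script is run.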





















###############################################################################

##### PART 3: US Income Equality over Time #####

### In this section you will generate a line chart showing the change in income

### inequality (in the USA) over time.

###

### You can see an example of this plot in `examples/us_inequality_plot.png`

###

### The instructions below have multiple steps as a single comment; it is up to

### you to organize your code below that.





# You'll first need to access and wrangle the data in order to plot it. Save the

# wrangled data in a data frame called `us_income_years`.

#

# [2pt] Use the `wb_data()` function to access data for the following 3 indicators

# for the country "USA" (you'll need to look up their IDs):

# "Income share held by highest 10%", "Income share held by lowest 20%", and

# "Income share held by second 20%". Get the 20 most recent years worth of data.





us_income_years <- wb_data(country = "USA", indicator = c("SI.DST.10TH.10",
                                                          "SI.DST.FRST.20",
                                                          "SI.DST.02ND.20"),
                           mrv = 20,
                           lang = "en")





# [1pt] You'll need to mutate the data frame and convert the `date` column into a

# numeric value (using `as.numeric()`) so that you can plot it more easily.

#





us_income_years <- us_income_years %>%
  mutate(date_number = as.numeric(date))





# [2pt] Also mutate the data frame and create columns for the "wealth of the top 10%"

# (e.g., `wealth_top_10`), and for the "wealth of the bottom 40%" (e.g., `wealth_bottom_40`)

# (which is the lowest and second lowest 20% combined).





us_income_years <- us_income_years %>%
  mutate(wealth_top_10 = SI.DST.10TH.10, wealth_bottom_40 = SI.DST.02ND.20 + SI.DST.FRST.20)

#

# [3pt] You'll need to pivot this data into *long* format, gathering the values

# from the two columns ("top 10%" and "bottom 40%") into a single column. This

# will allow you to plot them as two separate lines using a single geometry.

# Optionally, so you can order your legend correctly, you should mutate the long

# data frame to convert the "category" column into a factor (using the `factor()`

# function), with the `wealth_top_10` as the first level.

#





us_income_years_long <- us_income_years %>%
  pivot_longer(cols = c(wealth_top_10, wealth_bottom_40),
               names_to = "Category",
               values_to = "US_Income") %>%
  arrange(desc(Category))





# In the end, your data frame should have 40 rows; 1 for each year-and-category

# (top 10% or bottom 40%).
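
# Illustration only: a minimal sketch of pivoting two value columns into long
# format and fixing the legend order with `factor()`. The numbers are made up
# and `toy_wide` / `toy_long` / `share` are hypothetical names.
library(dplyr)
library(tidyr)

toy_wide <- data.frame(
  date_number = c(2001, 2002),
  wealth_top_10 = c(30, 31),
  wealth_bottom_40 = c(15, 14)
)

toy_long <- toy_wide %>%
  pivot_longer(
    cols = c(wealth_top_10, wealth_bottom_40),
    names_to = "Category",
    values_to = "share"
  ) %>%
  # The first level listed here is the one shown first in a legend.
  mutate(Category = factor(Category, levels = c("wealth_top_10", "wealth_bottom_40")))

nrow(toy_long)  # 4: one row per year-and-category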













# You can then create your line plot:

#

# [1pt] The plot will use your us_income_years data frame as a data source.

#

# [5pt] The plot should include both point geometry and smooth line geometry:

# https://ggplot2.tidyverse.org/reference/geom_point.html

# https://ggplot2.tidyverse.org/reference/geom_path.html

# (so you can see the points and the trend—the smoothed trend looks better).

# Each should have the date mapped to the x-axis, the value mapped to the y-axis,

# and the category (top 10% or bottom 40%) mapped to the color.









ggplot_income <- ggplot(data = us_income_years_long, aes(x = date_number, y = US_Income, color = Category)) +
  geom_point(size = 1) +
  geom_smooth() +
  xlab('years') +
  ylab('Percentage of income')





#

# [2pt] Specify appropriately detailed title and axis labels for your chart.

# Also use an appropriate *scale function* (for colors that are *discrete*) to

# customize the labels of the color mapping legend and making them readable

# (e.g., "Top 10% of Pop." and "Bottom 40% of Pop.")

#

# [1pt] Once completed, save your plot in a variable called `us_wealth_plot`.

# Note that you can print() out this variable in order to see the plot generated

# when you run your script.
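
# Illustration only: a minimal sketch of relabelling a discrete color legend
# with a scale function. The data are made up and `toy_long_df` /
# `toy_line_plot` are hypothetical names. A named `labels` vector is matched to
# the category values, so the order in which the labels are written is not
# important.
library(ggplot2)

toy_long_df <- data.frame(
  year = rep(2001:2004, times = 2),
  Category = rep(c("wealth_top_10", "wealth_bottom_40"), each = 4),
  share = c(30, 31, 30.5, 32, 15, 14.5, 14, 13.5)
)

toy_line_plot <- ggplot(toy_long_df, aes(x = year, y = share, color = Category)) +
  geom_point() +
  geom_line() +  # plain lines for the toy data; the prompt asks for a smoothed trend
  scale_color_discrete(
    name = "Income group",
    labels = c(wealth_top_10 = "Top 10% of Pop.", wealth_bottom_40 = "Bottom 40% of Pop.")
  ) +
  labs(title = "Toy income shares", x = "Year", y = "Share of income (%)")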





















###############################################################################

##### PART 4: Health Expenditures by Country #####

### In this section you will generate a plot showing the amount spent on

### healthcare across "high-income" countries.

###

### You can see an example of this plot in `examples/health_costs_plot.png`

###

### The instructions below have multiple steps as a single comment; it is up to

### you to organize your code below that.





# You will again need to access and wrangle the data in order to plot it. Note

# that this wrangling is more complex than the previous plots. Save your fully-

# wrangled data in a data frame called `health_costs` (though there will be some

# steps before you get there!)

countries <- wb_countries() %>%
  filter(income_level_iso3c == "HIC") %>%
  pull(iso3c)





#

# [2pt] You'll first need a list of "high income" countries to get the data on.

# You can get access to general information about countries by calling the

# `wb_countries()` function. Filter this data frame for countries that are

# "High Income", and then extract (pull) a vector of the ISO3 codes for these

# countries (there should be around 80 of them).

countries <- wb_countries() %>%
  filter(income_level_iso3c == "HIC") %>%  # "HIC" is the high-income level code
  pull(iso3c)

#

# [2pt] Use the `wb_data()` function to access data on the following 4 indicators

# for the high-income countries (you will need to look up the indicator IDs):

# - "Current health expenditure per capita (current US$)"

# - "Domestic general government health expenditure per capita (current US$)"

# - "Domestic private health expenditure per capita (current US$)"

# - "Out-of-pocket expenditure per capita (current US$)"

# You should get the 1 most recent year.

#





health_costs <- wb_data(


country = countries,


indicator = c(


"SH.XPD.CHEX.PC.CD",


"SH.XPD.GHED.PC.CD",


"SH.XPD.PVTD.PC.CD",


"SH.XPD.OOPC.PC.CD"


),


mrv = 1,


lang = "en"

)

# [3pt] You will need to pivot this data into a *longer* format. You want a *names*

# column (e.g., `indicatorID`) of indicator names, and a *values* column of their values.

# After you pivot, filter out any countries with `NA` values. The `drop_na()`

# function works great for this.





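# Assumed column names: `indicatorID` (from the prompt above) and `value`.
health_costs <- health_costs %>%
  pivot_longer(cols = starts_with("SH.XPD"), names_to = "indicatorID", values_to = "value") %>%
  drop_na(value)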

#

# [2pt] In order to make sure that your chart legend is readable, replace the ID

# codes with understandable text (e.g., "Total Spending", "Government Spending",

# "Private Spending", and "Out of Pocket Costs"). Note that I find it easier to

# do this replacement using base R syntax (bracket notation) than dplyr, as a

# separate set of 4 statements.

#

# [2pt] Additionally, you'll need a separate data frame (e.g., `total_health_costs`)

# of just the "Total Spending" data.
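
# Illustration only: a minimal sketch of replacing indicator IDs with readable
# text using base R bracket notation, then keeping only the "Total Spending"
# rows. The values are made up and `toy_costs` / `toy_total_costs` are
# hypothetical names; only two of the four indicators are shown.
toy_costs <- data.frame(
  iso3c = c("AAA", "AAA", "BBB", "BBB"),
  indicatorID = c("SH.XPD.CHEX.PC.CD", "SH.XPD.OOPC.PC.CD",
                  "SH.XPD.CHEX.PC.CD", "SH.XPD.OOPC.PC.CD"),
  value = c(5000, 700, 3000, 400),
  stringsAsFactors = FALSE
)

# One statement per indicator: select the matching rows, overwrite the text.
toy_costs$indicatorID[toy_costs$indicatorID == "SH.XPD.CHEX.PC.CD"] <- "Total Spending"
toy_costs$indicatorID[toy_costs$indicatorID == "SH.XPD.OOPC.PC.CD"] <- "Out of Pocket Costs"

# Separate data frame holding only the total-spending rows.
toy_total_costs <- toy_costs[toy_costs$indicatorID == "Total Spending", ]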













# Once you have your data ready, you can create your plot:

#

# [1pt] Your plot will use the `health_costs` data frame as the primary data source.

#

# [1pt] Your plot will include multiple geometries that will share aesthetics.

# Because of this, you will define your "default" aesthetic mapping as an argument

# to the ggplot() function. You should map the country's `iso3c` code to the x-axis

# (`reorder()` it by value, as you did in the first plot); the indicator value to

# the y-axis; and the indicator name/ID to the color.

#

# [3pt] Your plot's primary geometry will be point geometry. Specify that the

# `shape` of each point will be based on the indicator.

#

# [4pt] Your plot will also need to include lines from the bottom axis to the

# total cost point (for readability). You can do this by adding in a `linerange`

# geometry https://ggplot2.tidyverse.org/reference/geom_linerange.html. This

# geometry should use the `total_health_costs` as its data source, have a minimum-y

# aesthetic of 0 and a maximum-y aesthetic of the value column (which will come

# from the `total_health_costs`).

# Add the `linerange` geom *before* the point geom to have the points appear "on top".

#

# [2pt] Use a scale function to give different colors to the points. I used

# Colorbrewer's "Dark2" palette, but you can choose a different palette (or define

# your own set of colors).

#

# [2pt] Specify appropriately detailed title and axis labels for your chart.

# Remember to also provide an identical label for the color & shape aesthetics

# to style the legend.

#

# [2pt] Finally, use the `theme()` function (https://ggplot2.tidyverse.org/reference/theme.html)

# to specify a "theme" and styling of your plot. In particular, you can set the

# `axis.text.x` to be an `element_text()` value (https://ggplot2.tidyverse.org/reference/element.html)

# with a smaller size and an angle--this will make the labels not overlap each other.

# You can also specify the `legend.position` in order to place the legend somewhere

# else (like in the otherwise blank space). I used `c(.2,.8)` as a position--the

# numbers are the "ratio" of how far along the axis to place the plot).

# Search the documentation and other resources for examples of these (common) adjustments.

#

# [1pt] When completed, save your plot in a variable called `health_costs_plot`.

# Note that you can print() out this variable in order to see the plot generated

# when you run your script.
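
# Illustration only: a minimal sketch of the layering described above, with
# default aesthetics set in ggplot(), a `linerange` layer drawn from a second
# data frame, and points on top. The data are made up and all `toy_` names are
# hypothetical. The legend is placed with a keyword here rather than with
# numeric coordinates.
library(ggplot2)

toy_health <- data.frame(
  iso3c = c("AAA", "AAA", "BBB", "BBB"),
  indicatorID = c("Total Spending", "Out of Pocket Costs",
                  "Total Spending", "Out of Pocket Costs"),
  value = c(5000, 700, 3000, 400)
)
toy_health_total <- toy_health[toy_health$indicatorID == "Total Spending", ]

toy_health_plot <- ggplot(toy_health, aes(x = reorder(iso3c, value), y = value, color = indicatorID)) +
  geom_linerange(data = toy_health_total, aes(ymin = 0, ymax = value)) +  # drawn first
  geom_point(aes(shape = indicatorID)) +                                  # drawn on top
  scale_color_brewer(palette = "Dark2") +
  labs(title = "Toy health spending", x = "Country (iso3)", y = "US$ per capita",
       color = "Type", shape = "Type") +  # identical labels merge the two legends
  theme(axis.text.x = element_text(size = 6, angle = 90), legend.position = "bottom")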





















###############################################################################

##### PART 5: Map: Changes in Forestation around the World #####

### In this section you will generate a choropleth map of the forestry changes

### for each country plotted on a global map.

###

### You can see an example of this plot in `examples/forested_map_plot.png`

###

### The instructions below have multiple steps as a single comment; it is up to

### you to organize your code below that.

###

### You may create this map using ggplot2 or using the `leaflet` package; use of

### other external mapping packages is not allowed. The below instructions cover

### how to do this using ggplot2.





# As before, you'll first need to wrangle the indicator data you need. Save your

# fully-wrangled data in a data frame called `forest_area`.

#

forest_area <-
  wb_data("AG.LND.FRST.ZS", country = "countries_only", mrv = 20)





# [2pt] Use the `wb_data()` function to access data for the "Forest area (% of land area)"

# indicator (you'll need to look up its ID). Get data for "countries_only", and

# data for the most recent 20 years.

#

# [4pt] You'll need to calculate the change in forest area between the earliest

# and most recent years (1999 and 2018). To do this, first spread out (pivot_wider)

# the values--the ISO3 number will be the primary id, the names will come from the

# `date`, and the values will come from the indicator column. Then add a new column

# (e.g., `forest_change`) that is the difference between the 2018 value and the

# 1999 value (` XXXXXXXXXX`).

# Because the column names will be strings that look like numbers (e.g., "1997"),

# it's easier to access the column values using double-bracket notation than using

# dplyr. Alternatively, you can rename the columns for easier access.

#

# [5pt] To make your choropleth map be readable and effective, you won't want to

# try and assign a different color to each of 260 different values (that will be

# a lot of colors and hard to distinguish!) Instead, you should break up (factor)

# the data in a small number of groups (called "bins"). Each "bin" will represent

# a range of values--for example, one bin might represent values from 0-5%, one bin

# values from 5%-10%, and so forth. You will then be able to give each "bin" a

# color, so that your map will only have 5 or 6 different colors representing

# different "levels" or "tiers" of forestry loss, rather than 260 colors.

# In short: you will color by a categorical value, rather than a continuous value!

#

# Use the `cut()` function to create a different column (i.e., `change_labels`) of

# "labels" representing each "bin" of data. This function takes as arguments a

# vector to divide (e.g., `forest_area$change`), and a vector of breaks--the

# values that should act as dividing lines or "cut-offs" for each bin level.

# For example, you'd use "15%" as a break point to divide the data into bins of

# 0%-15% and 15%-30%. Also specify a labels argument that is a vector of

# appropriate labels to use for naming each factor level (e.g., `c("0%-5%", "5%-10%")`).

# See https://rpubs.com/pierrelafortune/cutdocumentation (among other tutorials)

# for a more detailed example of using this function.

# You can look at the example image for a good set of "break points".

# You can assign this new factor (the result of the `cut()` function) to an

# additional column in your data frame (e.g., `forest_area$change_as_factor`)
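
# Illustration only: a minimal sketch of the wrangling described above, using
# made-up values. It spreads years into columns with `pivot_wider()`, reads the
# number-like column names with double-bracket notation, and bins the change
# with `cut()`. All `toy_` names, the break points, and the labels are
# hypothetical.
library(tidyr)

toy_forest <- data.frame(
  iso3c = c("AAA", "AAA", "BBB", "BBB"),
  date = c(1999, 2018, 1999, 2018),
  forest_pct = c(40, 33, 20, 22)
)

toy_forest_wide <- pivot_wider(toy_forest, id_cols = iso3c,
                               names_from = date, values_from = forest_pct)

# Column names are now the strings "1999" and "2018".
toy_forest_wide$forest_change <- toy_forest_wide[["2018"]] - toy_forest_wide[["1999"]]

# Four bins need five break points; each bin includes its upper bound.
toy_forest_wide$change_labels <- cut(
  toy_forest_wide$forest_change,
  breaks = c(-Inf, -5, 0, 5, Inf),
  labels = c("< -5%", "-5% to 0%", "0% to 5%", "> 5%")
)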

















# Once you have the indicator data, you'll need to prepare the map. For ggplot2,

# you'll need a set of polygons which represent each country in the world and

# can form the basis for a geometric object layer.

#

# [1pt] Get a data frame of these polygons by calling the `map_data("world")`

# function provided by ggplot2.

#

# [3pt] However, this data frame only lists countries by name, and country names

# are not standardized across data sets (e.g., different data sets may have

# "United States", "USA", "US", etc). Thus you need to provide the map data a

# three-letter country code (called an ISO3 code) based on their country name.

#

# You can find the ISO3 codes by using the `iso.alpha()` function from the `maps`

# package (https://cran.r-project.org/web/packages/maps/maps.pdf)

# (which you will need to install and load separately, at the top of your file).

# Pass the `iso.alpha()` function the `region` vector of the map data (the country

# names), and a `n = 3` argument to get the three-letter country codes. Mutate the

# map data frame to add a column of ISO3 codes for each country (the value returned

# from the `iso.alpha()` function).













# With the map data in hand, you can combine that with your indicator data to

# create a plottable data frame.

#

# [1pt] *left join* the world map data frame (on the left) to the `forest_area`.

# Join by ISO3 country code. This will create a giant data frame with a copy of

# the indicator value in each point of the polygon.
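
# Illustration only: a minimal sketch of adding ISO3 codes to map-style rows
# with `iso.alpha()` and then left-joining indicator values onto them. The
# coordinates and the indicator value are made up, all `toy_` names are
# hypothetical, and the `maps` package is assumed to be installed.
library(dplyr)
library(maps)

toy_map <- data.frame(
  region = c("France", "France", "Japan"),
  long = c(2, 3, 139),
  lat = c(48, 47, 36),
  stringsAsFactors = FALSE
) %>%
  mutate(iso3c = as.character(unname(iso.alpha(region, n = 3))))

toy_indicator <- data.frame(iso3c = "FRA", forest_change = 1.5, stringsAsFactors = FALSE)

# Every row of the map data is kept; rows without a match get NA.
toy_joined <- left_join(toy_map, toy_indicator, by = "iso3c")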













# Finally, the data is ready so you can create the choropleth map:

#

# [1pt] The data for your plot should be the joined map/forest-area data frame.

#

# [4pt] Use polygon geometry to create the plot. Map the `long` value to the

# x-coordinate, the `lat` value to the y-coordinate, and `group` points together.

#

# [3pt] Map the *fill* (not the color!) to the change value--this will color the

# inside of the polygons, not just the outlines. You must also specify a colorbrewer

# scale for the map fill; I used the "RdYlGn" palette (in reverse order).

#

# [2pt] Your plot should use a map coordinate system, such as `coord_quickmap()`

# https://ggplot2.tidyverse.org/reference/coord_map.html

#

# [2pt] You can easily get rid of the x and y axis labels by including a void theme

# https://ggplot2.tidyverse.org/reference/ggtheme.html

# You'll still need to add a title to the plot.

#

# [1pt] When completed, save your plot in a variable called `world_forest_plot`.

# Note that you can print() out this variable in order to see the plot generated

# when you run your script.

#

# This map will show a lot of data and geometry, so may take a couple seconds to

# generate. Be patient!

# Some countries may be missing values for some indicators or years if the data

# was unavailable. It's okay if these countries are left "blank" in your map.
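
# Illustration only: a minimal sketch of the polygon layering described above,
# drawn from two made-up squares instead of real country outlines. All `toy_`
# names and the bin labels are hypothetical.
library(ggplot2)

toy_polygons <- data.frame(
  long = c(0, 1, 1, 0, 2, 3, 3, 2),
  lat = c(0, 0, 1, 1, 0, 0, 1, 1),
  group = rep(c(1, 2), each = 4),  # one group per polygon
  change_labels = factor(rep(c("-5% to 0%", "0% to 5%"), each = 4))
)

toy_map_plot <- ggplot(toy_polygons) +
  geom_polygon(aes(x = long, y = lat, group = group, fill = change_labels)) +
  scale_fill_brewer(palette = "RdYlGn", direction = -1) +  # palette in reverse order
  coord_quickmap() +
  theme_void() +  # removes axes and grid lines
  labs(title = "Toy choropleth", fill = "Change")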





















###############################################################################

##### PART 6: Your Own Plot #####

### In this section you will create a visualization of something that is

### important to _you_ based on the World Bank data.





# For this visualization, you can choose to visualize any information from the

# World Bank data set that you wish. For example, you could visualize differences

# in Internet usage, economic development, or anything else. Look through the

# available indicators for topics that seem like they might be interesting.

# https://data.worldbank.org/indicator?tab=all

#

# Your visualization will need to use at least three "data features" (think: columns).

# This means you'll need to use either multiple indicators, use multiple years

# from a single indicator, or use multiple years from multiple indicators.

# Pro tip: you can often easily produce an "interesting" analysis by taking two

# seemingly unrelated topics and then comparing them to show that they are actually related!

#

# You will almost certainly need to do some light data wrangling to get the

# information you want. Think about what question your visualization will be able

# to answer, and then what data you'll need for that question.

# (But don't overthink this; the goal here is to practice making visualizations,

# not to be a time sink!)

#

# Your visualization must be created with the ggplot2 package. It will need to

# meet the following requirements:

# - At a minimum, it will need to include either 2 simple geometries (points,

# lines, columns) _or_ 1 "complex" geometry (e.g., polygons, hex bins, etc.).

# A visual element such as facets count as a simple geometry (so having a single

# point geometry with facets would be sufficient).

# - It will need to encode three (3) or more features (columns) to different aesthetics

# (e.g., x, y, and color).

# - It needs to include an adjusted scale for at least one of the aesthetics.

# Picking a color palette is sufficient.

# - (You are not required to specify position adjustments or coordinate systems,

# though you are welcome to if you wish)

# - It must include appropriate titles and labels. In particular, make sure that

# any "legend" labels are clear and understandable.

# When you're designing your visualization, think about how you can make it both

# effective and expressive.

#

# When completed, save your plot in a variable with a descriptive name (and not

# just `my_plot`). Note that you can print() out this variable in order to see

# the plot generated when you run your script.



