###############################################################################

### Assignment 3: Visualization ###

### Visual analysis of World Bank Indicators


### Below each prompt in the file, write the code necessary/indicated to generate

### the required plots. See the assignment page on Canvas for details.


##### PART 1: Loading and Understanding the Data #####

### In this section you will load the data and necessary packages

# For this assignment, you'll be working with the database of World Bank Development

# Indicators. You can also explore all of this data online on the World Bank's website.



# For this assignment you will be using a separate R package called `wbstats` that

# will let you use basic functions to "access" the indicator database. (Note that

# you can look at the package documentation to see all of the

# different functions and options it offers as well as how to use them. This

# documentation is in the standard R package doc format, so is worth viewing at least

# once to have a sense for what these look like!)



# [1pt] Start by installing and loading (with library()) the `wbstats` package.

# DO NOT include the install.packages() command in your script file.

library("wbstats")



# Also load in other required packages (`ggplot2`, `dplyr`, etc.) here.

# You can alternatively just load the whole `tidyverse` package.

# Do not include any `install.packages()` calls in this file.

library("tidyverse")



# [2pt] The World Bank organizes data about countries into different indicators

# (measures). For example: "Total Population" is an indicator, as is "Individuals

# using the Internet (% of population)". You can view a complete list of the

# indicators on the World Bank's website

# (you will be using this website repeatedly in this assignment).


# Get a listing of the available indicators by calling the package-provided

# `wb_indicators()` function, which returns a data frame of information about them.

# Print out the number of rows in this data frame to see how many indicators there

# are (and this listing is missing a few!). Also inspect the data frame (such as

# by using View()) to see what information is available about each indicator.


# IMPORTANT: notice that each indicator contains an "Indicator ID", a special code

# used to refer to that indicator. This is because the names are so long and

# complex, so the World Bank uses codes to refer to each piece of data. Instead of

# "Individuals using the Internet...", you'd refer to indicator IT.NET.USER.ZS.

# In general, you will be using these IDs as identifiers, rather than the full text

# of the indicator's title.

list_assignment <- wb_indicators()

print(nrow(list_assignment))

# [2pt] You can find the codes for different indicators on the World Bank's website.

# You can visit the full list of indicators there

# and click on each one to get more information on that indicator (including seeing

# a sample visualization). You can find the indicator ID by clicking the "Details"

# button, or by looking in the URL (it's part of the path).

# See the `examples/indicators.png` file in this project for an example.


# Using this website, find the indicator ID for the "CO2 emissions (kt)" indicator.

# In a comment below, state the ID for this indicator to show that you looked it up.


# (It's also possible to look up indicator codes by using the provided `wb_search()`

# function, but it can be a bit less reliable than checking the website (and requires

# regular expressions to use well).)


# [3pt] Once you've identified an indicator of interest and its ID, you can use

# the `wbstats` package to access the data for that indicator. You get data

# from the World Bank by using the `wb_data()` function. This function expects at

# least two (named) arguments: `country` which should be a character vector of

# countries to get data on (with a few special options), and `indicator` which

# should be a character vector of indicator IDs to access. For example, you can

# get % Internet Users data for all countries with the following:


# wb_data(country = "all", indicator = c("IT.NET.USER.ZS"), mrv = 1, lang = "en")


# The `mrv = 1` argument ("most recent value") says to get just a single year's

# worth of data (the most recent year). It's also possible to give a specific

# range of years; see the `wbstats` documentation for details.


# Using the `wb_data()` function, get a data frame of the "CO2 emissions (kt)"

# for all countries for the 1 most recent year (use "countries_only" as the `country`

# argument to just get countries and not aggregations). Save (all) the data for

# the top 10 countries with highest carbon emissions in a data frame called

# `top_10_co2_countries`. You will need to do some light data wrangling to choose

# only these 10 rows.


# Note that for all data wrangling in this assignment, you can either use `dplyr`

# functions, base R syntax (dollar signs and brackets), or a mix of both. I

# strongly recommend you use `dplyr` primarily, but do what seems simplest and

# makes sense to you.

top_10_co2_countries <- wb_data(
  country = "countries_only",
  indicator = c("EN.ATM.CO2E.KT"),
  mrv = 1,
  lang = "en"
) %>%
  # Drop countries with no reported value, then keep the 10 largest emitters
  filter(!is.na(EN.ATM.CO2E.KT)) %>%
  arrange(desc(EN.ATM.CO2E.KT)) %>%
  head(10)



##### PART 2: CO2 Emissions by Country #####

### In this section you will generate a bar chart of the total CO2 emissions of

### the top-10 countries with the highest emission levels


### You can see an example of this plot in `examples/top_10_co2_plot.png`


### The instructions below have multiple steps as a single comment; it is up to

### you to organize your code below that.


### Throughout this assignment, you are welcome to adjust the styling of the plots

### (e.g., make text different sizes, use different colors, etc), so long as you

### maintain the *effectiveness* and *expressiveness* of the plots.

# [2pt] Use the `ggplot()` function to create the plot. The data will be your

# `top_10_co2_countries` from the previous section.


# [4pt] You will need to use column geometry (`geom_col()`)

# to create the chart. The country's ISO3 code (the three-letter code used to

# refer to that country, such as "USA" or "IND") will go on the x-axis, and the

# emission amount will go on the y-axis.

# You can use the `reorder()` function to "sort" the country ISO3 codes (a factor,

# the first argument) by the indicator value column (the second argument), and

# then use that sorted list as the aesthetic mapping.
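
# A minimal sketch of how `reorder()` sorts factor levels by a second vector.
# The `example_` names and numbers are made up for illustration only; they are
# not World Bank data and are not part of the assignment.
example_codes <- factor(c("AAA", "BBB", "CCC"))
example_values <- c(20, 5, 10)
# Levels end up ordered by increasing value; pass `-example_values` instead to
# get decreasing order.
example_sorted <- reorder(example_codes, example_values)
levels(example_sorted) # "BBB" "CCC" "AAA"
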

ggplot(
  data = top_10_co2_countries,
  mapping = aes(x = reorder(iso3c, EN.ATM.CO2E.KT), y = EN.ATM.CO2E.KT)
) +
  geom_col()



# [2pt] Use the `labs()` function to specify the title and axis labels for your

# chart. The title should be "Top 10 Countries by CO2 Emissions", the x-axis

# should be labeled "Country (iso3)", and the y-axis should be labeled with the

# complete indicator name.

# Optionally, you can effectively adjust the formatting of the numbers on the

# y-axis using the `scales` package. This will let you

# put e.g., commas in the large numbers. Note that this will also involve

# specifying a scale for your plot!


# [1pt] Once completed, save your plot in a variable called `top_10_c02_plot`.

# Note that you can print() out this variable in order to see the plot generated

# when you run your script.


##### PART 3: US Income Equality over Time #####

### In this section you will generate a line chart showing the change in income

### inequality (in the USA) over time.


### You can see an example of this plot in `examples/us_inequality_plot.png`


### The instructions below have multiple steps as a single comment; it is up to

### you to organize your code below that.

# You'll first need to access and wrangle the data in order to plot it. Save the

# wrangled data in a data frame called `us_income_years`.


# [2pt] Use the `wb_data()` function to access data for the following 3 indicators

# for the country "USA" (you'll need to look up their IDs):

# "Income share held by highest 10%", "Income share held by lowest 20%", and

# "Income share held by second 20%". Get the 20 most recent years worth of data.

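us_income_years <- wb_data(
  country = "USA",
  # Highest 10%, lowest 20%, and second 20% income shares
  indicator = c("SI.DST.10TH.10", "SI.DST.FRST.20", "SI.DST.02ND.20"),
  mrv = 20,
  lang = "en"
)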

# [1pt] You'll need to mutate the data frame and convert the `date` column into a

# numeric value (using `as.numeric()`) so that you can plot it more easily.


us_income_years <- us_income_years %>%
  mutate(date_number = as.numeric(date))

# [2pt] Also mutate the data frame and create columns for the "wealth of the top 10%"

# (e.g., `wealth_top_10`), and for the "wealth of the bottom 40%" (e.g., `wealth_bottom_40`)

# (which is the lowest and second lowest 20% combined).

us_income_years <- us_income_years %>%
  mutate(
    wealth_top_10 = SI.DST.10TH.10,
    wealth_bottom_40 = SI.DST.02ND.20 + SI.DST.FRST.20
  )


# [3pt] You'll need to pivot this data into *long* format, gathering the values

# from the two columns ("top 10%" and "bottom 40%") into a single column. This

# will allow you to plot them as two separate lines using a single geometry.

# Optionally, so you can order your legend correctly, you should mutate the long

# data frame to convert the "category" column into a factor (using the `factor()`

# function), with the `wealth_top_10` as the first level.
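
# A minimal sketch of pivoting to long format and fixing the factor order,
# assuming the `tidyr` package is installed. The `example_` data frame and its
# column names are hypothetical and only illustrate the shape of the result.
example_wide <- data.frame(
  year = c(2000, 2001),
  top = c(30, 31),
  bottom = c(15, 14)
)
example_long <- tidyr::pivot_longer(
  example_wide,
  cols = c("top", "bottom"),
  names_to = "category",
  values_to = "share"
)
# Listing "top" first makes it the first level (and so the first legend entry)
example_long$category <- factor(example_long$category, levels = c("top", "bottom"))
nrow(example_long) # 4: one row per year-and-category
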


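us_income_years_long <- us_income_years %>%
  pivot_longer(
    cols = c("wealth_top_10", "wealth_bottom_40"),
    names_to = "Category",
    values_to = "US_Income"
  ) %>%
  # Put the top 10% first so the legend is ordered correctly
  mutate(Category = factor(Category, levels = c("wealth_top_10", "wealth_bottom_40")))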


# In the end, your data frame should have 40 rows; 1 for each year-and-category

# (top 10% or bottom 40%).

# You can then create your line plot:


# [1pt] The plot will use your us_income_years data frame as a data source.


# [5pt] The plot should include both point geometry and smooth line geometry



# (so you can see the points and the trend—the smoothed trend looks better).

# Each should have the date mapped to the x-axis, the value mapped to the y-axis,

# and the category (top 10% or bottom 40%) mapped to the color.

ggplot_income <- ggplot(
  data = us_income_years_long,
  mapping = aes(x = date_number, y = US_Income, color = Category)
) +
  geom_point(size = 1) +
  geom_smooth() +
  xlab('Years') +
  ylab('Percentage of income')


# [2pt] Specify appropriately detailed title and axis labels for your chart.

# Also use an appropriate *scale function* (for colors that are *discrete*) to

# customize the labels of the color mapping legend and make them readable

# (e.g., "Top 10% of Pop." and "Bottom 40% of Pop.")


# [1pt] Once completed, save your plot in a variable called `us_wealth_plot`.

# Note that you can print() out this variable in order to see the plot generated

# when you run your script.


##### PART 4: Health Expenditures by Country #####

### In this section you will generate a plot showing the amount spent on

### healthcare across "high-income" countries.


### You can see an example of this plot in `examples/health_costs_plot.png`


### The instructions below have multiple steps as a single comment; it is up to

### you to organize your code below that.

# You will again need to access and wrangle the data in order to plot it. Note

# that this wrangling is more complex than the previous plots. Save your fully-

# wrangled data in a data frame called `health_costs` (though there will be some

# steps before you get there!)

# [2pt] You'll first need a list of "high income" countries to get the data on.

# You can get access to general information about countries by calling the

# `wb_countries()` function. Filter this data frame for countries that are

# "High Income", and then extract (pull) a vector of the ISO3 codes for these

# countries (there should be around 80 of them).

countries <- wb_countries() %>%
  filter(income_level_iso3c == "HIC") %>%
  pull(iso3c)

# [2pt] Use the `wb_data()` function to access data on the following 4 indicators

# for the high-income countries (you will need to look up the indicator IDs):

# - "Current health expenditure per capita (current US$)"

# - "Domestic general government health expenditure per capita (current US$)"

# - "Domestic private health expenditure per capita (current US$)"

# - "Out-of-pocket expenditure per capita (current US$)"

# You should get the 1 most recent year.


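# NOTE: the indicator IDs were missing from this call. The four below are
# assumed from the World Bank listing for the indicators named above, so
# verify them before relying on the results.
health_costs <- wb_data(
  country = countries,
  indicator = c(
    "SH.XPD.CHEX.PC.CD", # Current health expenditure per capita
    "SH.XPD.GHED.PC.CD", # Domestic general government health expenditure per capita
    "SH.XPD.PVTD.PC.CD", # Domestic private health expenditure per capita
    "SH.XPD.OOPC.PC.CD"  # Out-of-pocket expenditure per capita
  ),
  mrv = 1,
  lang = "en"
)


# [3pt] You will need to pivot this data into a *longer* format. You want a *names*

# column (e.g., `indicatorID`) of indicator names, and a *values* column of their values.

# After you pivot, filter out any countries with `NA` values. The `drop_na()`

# function works great for this.

# Assumed reconstruction of the pivot step (the original statement was lost)
health_costs <- health_costs %>%
  pivot_longer(
    cols = starts_with("SH.XPD"),
    names_to = "indicatorID",
    values_to = "value"
  ) %>%
  drop_na(value)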



# [2pt] In order to make sure that your chart legend is readable, replace the ID

# codes with understandable text (e.g., "Total Spending", "Government Spending",

# "Private Spending", and "Out of Pocket Costs"). Note that I find it easier to

# do this replacement using base R syntax (bracket notation) than dplyr, as a

# separate set of 4 statements.
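
# A minimal sketch of replacing ID codes with readable text using base R
# bracket notation. The `example_` data frame, its IDs, and its labels are
# hypothetical placeholders, not real indicator codes.
example_costs <- data.frame(
  indicatorID = c("ID.ONE", "ID.TWO", "ID.ONE"),
  value = c(1, 2, 3),
  stringsAsFactors = FALSE
)
# One statement per code: select the matching rows, then overwrite the text
example_costs$indicatorID[example_costs$indicatorID == "ID.ONE"] <- "First Label"
example_costs$indicatorID[example_costs$indicatorID == "ID.TWO"] <- "Second Label"
example_costs$indicatorID # "First Label" "Second Label" "First Label"
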


# [2pt] Additionally, you'll need a separate data frame (e.g., `total_health_costs`)

# of just the "Total Spending" data.

# Once you have your data ready, you can create your plot:


# [1pt] Your plot will use the `health_costs` data frame as the primary data source.


# [1pt] Your plot will include multiple geometries that will share aesthetics.

# Because of this, you will define your "default" aesthetic mapping as an argument

# to the ggplot() function. You should map the country's `iso3c` code to the x-axis

# (`reorder()` it by value, as you did in the first plot); the indicator value to

# the y-axis; and the indicator name/ID to the color.


# [3pt] Your plot's primary geometry will be point geometry. Specify that the

# `shape` of each point will be based on the indicator.


# [4pt] Your plot will also need to include lines from the bottom axis to the

# total cost point (for readability). You can do this by adding in a `linerange`

# geometry (`geom_linerange()`). This

# geometry should use the `total_health_costs` as its data source, have a minimum-y

# aesthetic of 0 and a maximum-y aesthetic of the value column (which will come

# from the `total_health_costs`).

# Add the `linerange` geom *before* the point geom to have the points appear "on top".


# [2pt] Use a scale function to give different colors to the points. I used

# Colorbrewer's "Dark2" palette, but you can choose a different palette (or define

# your own set of colors).


# [2pt] Specify appropriately detailed title and axis labels for your chart.

# Remember to also provide an identical label for the color & shape aesthetics

# to style the legend.


# [2pt] Finally, use the `theme()` function

# to specify a "theme" and styling of your plot. In particular, you can set the

# `axis.text.x` to be an `element_text()` value

# with a smaller size and an angle--this will make the labels not overlap each other.

# You can also specify the `legend.position` in order to place the legend somewhere

# else (like in the otherwise blank space. I used `c(.2,.8)` as a position--the

# numbers are the "ratio" of how far along each axis to place the legend).

# Search the documentation and other resources for examples of these (common) adjustments.


# [1pt] When completed, save your plot in a variable called `health_costs_plot`.

# Note that you can print() out this variable in order to see the plot generated

# when you run your script.


##### PART 5: Map: Changes in Forestation around the World #####

### In this section you will generate a choropleth map of the forestry changes

### for each country, plotted on a global map.


### You can see an example of this plot in `examples/forested_map_plot.png`


### The instructions below have multiple steps as a single comment; it is up to

### you to organize your code below that.


### You may create this map using ggplot2 or using the `leaflet` package; use of

### other external mapping packages is not allowed. The below instructions cover

### how to do this using ggplot2.

# As before, you'll first need to wrangle the indicator data you need. Save your

# fully-wrangled data in a data frame called `forest_area`.


forest_area <- wb_data(
  indicator = "AG.LND.FRST.ZS",
  country = "countries_only",
  mrv = 20
)

# [2pt] Use the `wb_data()` function to access data for the "Forest area (% of land area)"

# indicator (you'll need to look up its ID). Get data for "countries_only", and

# data for the most recent 20 years.


# [4pt] You'll need to calculate the change in forest area between the earliest

# and most recent years (1999 and 2018). To do this, first spread out (pivot_wider)

# the values--the ISO3 number will be the primary id, the names will come from the

# `date`, and the values will come from the indicator column. Then add a new column

# (e.g., `forest_change`) that is the difference between the 2018 value and the

# 1999 value.

# Because the column names will be strings that look like numbers (e.g., "1997"),

# it's easier to access the column values using double-bracket notation than using

# dplyr. Alternatively, you can rename the columns for easier access.
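
# A minimal sketch of the double-bracket access described above, using a
# made-up data frame (the `example_` names and values are hypothetical).
example_area <- data.frame(iso3c = c("AAA", "BBB"))
example_area[["1999"]] <- c(30, 50)
example_area[["2018"]] <- c(28, 55)
# Column names that look like numbers are easy to reach as quoted strings
example_area$forest_change <- example_area[["2018"]] - example_area[["1999"]]
example_area$forest_change # -2  5
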


# [5pt] To make your choropleth map be readable and effective, you won't want to

# try and assign a different color to each of 260 different values (that will be

# a lot of colors and hard to distinguish!) Instead, you should break up (factor)

# the data in a small number of groups (called "bins"). Each "bin" will represent

# a range of values--for example, one bin might represent values from 0-5%, one bin

# values from 5%-10%, and so forth. You will then be able to give each "bin" a

# color, so that your map will only have 5 or 6 different colors representing

# different "levels" or "tiers" of forestry loss, rather than 260 colors.

# In short: you will color by a categorical value, rather than a continuous value!


# Use the `cut()` function to create a different column (e.g., `change_labels`) of

# "labels" representing each "bin" of data. This function takes as arguments a

# vector to divide (e.g., `forest_area$change`), and a vector of breaks--the

# values that should act as dividing lines or "cut-offs" for each bin level.

# For example, you'd use "15%" as a break point to divide the data into bins of

# 0%-15% and 15%-30%. Also specify a labels argument that is a vector of

# appropriate labels to use for naming each factor level (e.g., `c("0%-5%", "5%-10%")`).

# You can look at the example image for a good set of "break points"

# You can assign this new factor (the result of the `cut()` function) to an

# additional column in your data frame (e.g., `forest_area$change_as_factor`)
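
# A minimal sketch of binning with `cut()`, using made-up change values. The
# `example_` names, break points, and labels are hypothetical; pick your own
# from the example image.
example_change <- c(-12, -3, 0.5, 4, 20)
example_breaks <- c(-Inf, -5, 0, 5, Inf)
example_labels <- c("< -5%", "-5% to 0%", "0% to 5%", "> 5%")
# There is always one fewer label than break points. By default each bin
# includes its upper break (so a value of exactly 0 lands in "-5% to 0%").
example_bins <- cut(example_change, breaks = example_breaks, labels = example_labels)
as.character(example_bins) # "< -5%" "-5% to 0%" "0% to 5%" "0% to 5%" "> 5%"
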

# Once you have the indicator data, you'll need to prepare the map. For ggplot2,

# you'll need a set of polygons which represent each country in the world and

# can form the basis for a geometric object layer.


# [1pt] Get a data frame of these polygons by calling the `map_data("world")`

# function provided by ggplot2.


# [3pt] However, this data frame only lists countries by name, and country names

# are not standardized across data sets (e.g., different data sets may have

# "United States", "USA", "US", etc). Thus you need to provide the map data a

# three-letter country code (called an ISO3 code) based on their country name.


# You can find the ISO3 codes by using the `iso.alpha()` function from the `maps`

# package

# (which you will need to install and load separately, at the top of your file).

# Pass the `iso.alpha()` function the `region` vector of the map data (the country

# names), and a `n = 3` argument to get the three-letter country codes. Mutate the

# map data frame to add a column of ISO3 codes for each country (the value returned

# from the `iso.alpha()` function).

# With the map data in hand, you can combine that with your indicator data to

# create a plottable data frame.


# [1pt] *Left join* the world map data frame (on the left) to the `forest_area` data frame.

# Join by ISO3 country code. This will create a giant data frame with a copy of

# the indicator value in each point of the polygon.

# Finally, the data is ready so you can create the choropleth map:


# [1pt] The data for your plot should be the joined map/forest-area data frame.


# [4pt] Use polygon geometry to create the plot. Map the `long` value to the

# x-coordinate, the `lat` value to the y-coordinate, and `group` points together.


# [3pt] Map the *fill* (not the color!) to the change value--this will color the

# inside of the polygons, not just the outlines. You must also specify a colorbrewer

# scale for the map fill; I used the "RdYlGn" palette (in reverse order).


# [2pt] Your plot should use a map coordinate system, such as `coord_quickmap()`



# [2pt] You can easily get rid of the x and y axis labels by including a void theme


# You'll still need to add a title to the plot.


# [1pt] When completed, save your plot in a variable called `world_forest_plot`.

# Note that you can print() out this variable in order to see the plot generated

# when you run your script.


# This map will show a lot of data and geometry, so may take a couple seconds to

# generate. Be patient!

# Some countries may be missing values for some indicators or years if the data

# was unavailable. It's okay if these countries are left "blank" in your map.


##### PART 6: Your Own Plot #####

### In this section you will create a visualization of something that is

### important to _you_ based on the World Bank data.

# For this visualization, you can choose to visualize any information from the

# World Bank data set that you wish. For example, you could visualize differences

# in Internet usage, economic development, or anything else. Look through the

# available indicators for topics that seem like they might be interesting.



# Your visualization will need to use at least three "data features" (think: columns).

# This means you'll need to use either multiple indicators, use multiple years

# from a single indicator, or use multiple years from multiple indicators.

# Pro tip: you can often easily produce an "interesting" analysis by taking two

# seemingly unrelated topics and then comparing them to show that they are actually related!


# You will almost certainly need to do some light data wrangling to get the

# information you want. Think about what question your visualization will be able

# to answer, and then what data you'll need for that question.

# (But don't overthink this; the goal here is to practice making visualizations,

# not to be a time sink!)


# Your visualization must be created with the ggplot2 package. It will need to

# meet the following requirements:

# - At a minimum, it will need to include either 2 simple geometries (points,

# lines, columns) _or_ 1 "complex" geometry (e.g., polygons, hex bins, etc.).

# A visual element such as facets counts as a simple geometry (so having a single

# point geometry with facets would be sufficient).

# - It will need to encode three (3) or more features (columns) to different aesthetics

# (e.g., x, y, and color).

# - It needs to include an adjusted scale for at least one of the aesthetics.

# Picking a color palette is sufficient.

# - (You are not required to specify position adjustments or coordinate systems,

# though you are welcome to if you wish)

# - It must include appropriate titles and labels. In particular, make sure that

# any "legend" labels are clear and understandable.

# When you're designing your visualization, think about how you can make it both

# effective and expressive.


# When completed, save your plot in a variable with a descriptive name (and not

# just `my_plot`). Note that you can print() out this variable in order to see

# the plot generated when you run your script.

