Data Transformation
Zhichao Jiang
Visualisation is an important tool for insight generation, but it is rare that you get the data in exactly the right form you need. Often you will need to create some new variables or summaries, or maybe you just want to rename the variables or reorder the observations in order to make the data a little easier to work with.
1 Import data
1.1 Working directory
R associates itself with a folder (i.e. directory) on your computer. To see which one, run getwd() at the console.
    This folder is known as your “working directory”.
    When you save files, R will save them here
    When you load files, R will look for them here
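As a minimal sketch of how the working directory affects file paths, the base R snippet below prints the current directory and builds a path inside it. The file name my_data.csv is a hypothetical placeholder, not a file from this handout.

```r
# Show the current working directory (the folder R reads from and writes to).
wd <- getwd()
print(wd)

# Build a path to a file inside the working directory.
# "my_data.csv" is a hypothetical file name used only for illustration.
csv_path <- file.path(wd, "my_data.csv")
print(csv_path)

# file.exists() reports whether the file is really there before you try to load it.
print(file.exists(csv_path))
```

If the file lives elsewhere, setwd() changes the working directory, although passing a full path to the reading function is often the safer choice.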
2 Data transformation
What geoms should be used for this graph?
We will learn the five key dplyr functions that allow you to solve the vast majority of your data manipulation challenges:
    Pick observations by their values (filter()).
    Reorder the rows (arrange()).
    Pick variables by their names (select()).
    Create new variables with functions of existing variables (mutate()).
    Collapse many values down to a single summary (summarize()).
These can all be used in conjunction with group_by(), which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. These six functions provide the verbs for a language of data manipulation.
All verbs work similarly:
    The first argument is a data frame (tibble).
    The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
    The result is a new data frame.
Together these properties make it easy to chain together multiple simple steps to achieve a complex result. Let’s dive in and see how these verbs work.
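To make the shared pattern concrete, here is a small sketch that applies each verb in turn. The tibble toy and its values are invented for illustration and are not taken from babynames; the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# A tiny hand-made tibble (hypothetical values, not taken from babynames).
toy <- tibble(
  year = c(2000, 2000, 2001, 2001),
  name = c("Ann", "Bob", "Ann", "Bob"),
  n    = c(10, 20, 30, 40)
)

# Each verb takes a data frame first and returns a new data frame,
# so the output of one step can be fed straight into the next.
step1 <- filter(toy, n > 10)              # keep rows with n above 10
step2 <- mutate(step1, n_double = n * 2)  # add a new column
step3 <- arrange(step2, desc(n))          # sort by n, largest first
step4 <- select(step3, name, n_double)    # keep two columns
print(step4)

# group_by() makes summarize() work once per group instead of once overall.
by_name <- summarize(group_by(toy, name), total = sum(n))
print(by_name)
```

Because every step returns a data frame, the intermediate names step1 to step3 are optional; they are spelled out here only to show what each verb contributes.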
2.1 select()
select(babynames,name,prop)
## # A tibble: 1,924,665 x 2
## name prop
## 1 Mary XXXXXXXXXX
## 2 Anna XXXXXXXXXX
## 3 Emma XXXXXXXXXX
## 4 Elizabeth 0.0199
## 5 Minnie XXXXXXXXXX
## 6 Margaret XXXXXXXXXX
## 7 Ida XXXXXXXXXX
## 8 Alice XXXXXXXXXX
## 9 Bertha XXXXXXXXXX
## 10 Sarah XXXXXXXXXX
## # … with 1,924,655 more rows
2.2 Select helpers
    Use : to select a range of columns.
select(babynames,name:prop)
## # A tibble: 1,924,665 x 3
## name n prop
## 1 Mary XXXXXXXXXX
## 2 Anna XXXXXXXXXX
## 3 Emma XXXXXXXXXX
## 4 Elizabeth XXXXXXXXXX
## 5 Minnie XXXXXXXXXX
## 6 Margaret XXXXXXXXXX
## 7 Ida XXXXXXXXXX
## 8 Alice XXXXXXXXXX
## 9 Bertha XXXXXXXXXX
## 10 Sarah XXXXXXXXXX
## # … with 1,924,655 more rows
    Use - to select every column except the ones named.
select(babynames,-c(name,prop))
## # A tibble: 1,924,665 x 3
## year sex n
## XXXXXXXXXXF XXXXXXXXXX
## XXXXXXXXXXF XXXXXXXXXX
## XXXXXXXXXXF XXXXXXXXXX
## XXXXXXXXXXF XXXXXXXXXX
## XXXXXXXXXXF XXXXXXXXXX
## XXXXXXXXXXF XXXXXXXXXX
## XXXXXXXXXXF XXXXXXXXXX
## XXXXXXXXXXF XXXXXXXXXX
## XXXXXXXXXXF XXXXXXXXXX
## XXXXXXXXXXF XXXXXXXXXX
## # … with 1,924,655 more rows
    Use starts_with() to select columns whose names start with a given prefix.
select(babynames,starts_with("n"))
## # A tibble: 1,924,665 x 2
## name n
## 1 Mary XXXXXXXXXX
## 2 Anna XXXXXXXXXX
## 3 Emma XXXXXXXXXX
## 4 Elizabeth 1939
## 5 Minnie XXXXXXXXXX
## 6 Margaret 1578
## 7 Ida XXXXXXXXXX
## 8 Alice XXXXXXXXXX
## 9 Bertha XXXXXXXXXX
## 10 Sarah XXXXXXXXXX
## # … with 1,924,655 more rows
    Use ends_with() to select columns whose names end with a given suffix.
select(babynames,ends_with("e"))
## # A tibble: 1,924,665 x 1
## name
## 1 Mary
## 2 Anna
## 3 Emma
## 4 Elizabeth
## 5 Minnie
## 6 Margaret
## 7 Ida
## 8 Alice
## 9 Bertha
## 10 Sarah
## # … with 1,924,655 more rows
    Use contains() to select columns whose names contain a given string.
select(babynames,contains("e"))
## # A tibble: 1,924,665 x 3
## year sex name
## XXXXXXXXXXF Mary
## XXXXXXXXXXF Anna
## XXXXXXXXXXF Emma
## XXXXXXXXXXF Elizabeth
## XXXXXXXXXXF Minnie
## XXXXXXXXXXF Margaret
## XXXXXXXXXXF Ida
## XXXXXXXXXXF Alice
## XXXXXXXXXXF Bertha
## XXXXXXXXXXF Sarah
## # … with 1,924,655 more rows
    Use num_range() to select columns named in prefix-plus-number style, such as x1, x2, x3.
select(babynames,num_range("x",1:5))
## # A tibble: 1,924,665 x 0
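The num_range() call on babynames returns zero columns because that dataset has no columns named x1 to x5. The sketch below runs the same helpers on a hypothetical one-row tibble whose column names were chosen so that every helper matches something; it assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical tibble; the column names exist only to exercise each helper.
toy <- tibble(id = 1, x1 = 10, x2 = 20, x3 = 30, x4 = 40, x5 = 50, note = "a")

range_cols  <- names(select(toy, x1:x3))                 # columns from x1 through x3
all_but     <- names(select(toy, -c(id, note)))          # everything except id and note
starts_cols <- names(select(toy, starts_with("x")))      # names beginning with "x"
ends_cols   <- names(select(toy, ends_with("e")))        # names ending in "e"
has_cols    <- names(select(toy, contains("o")))         # names containing "o"
num_cols    <- names(select(toy, num_range("x", 2:4)))   # x2, x3, x4

print(range_cols)
print(num_cols)
```

Wrapping each call in names() keeps the output short; dropping it returns the selected columns as a tibble, as in the examples above.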
2.3 $ and select()
$ extracts column contents as a vector. select() extracts column contents as a tibble.
select(babynames, n)
babynames$n
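The difference in return type matters when the result is passed to another function. The following sketch uses a hypothetical two-row tibble (not babynames) and assumes dplyr is installed; pull() is dplyr's function for extracting a single column as a vector.

```r
library(dplyr)

# Hypothetical data for illustration only.
toy <- tibble(name = c("Ann", "Bob"), n = c(10, 20))

as_vec <- toy$n            # a plain numeric vector
as_tbl <- select(toy, n)   # still a tibble, with one column

print(is.numeric(as_vec))     # the $ result is a vector
print(is.data.frame(as_tbl))  # the select() result is a data frame

# Functions that expect a vector, such as mean(), work directly on the $ result.
print(mean(as_vec))

# pull() returns the same vector as $.
same <- identical(pull(toy, n), toy$n)
print(same)
```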
2.3.1 Your turn
Which of these is NOT a way to select the name and n columns together?
select(babynames, -c(year, sex, prop))
select(babynames, name:n)
select(babynames, starts_with("n"))
select(babynames, ends_with("n"))
2.4 filter()
filter() allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame.
filter(babynames, name == "Garret")
## # A tibble: 110 x 5
## year sex name n prop
## XXXXXXXXXXM Garret XXXXXXXXXX
## XXXXXXXXXXM Garret XXXXXXXXXX
## XXXXXXXXXXM Garret XXXXXXXXXX
## XXXXXXXXXXM Garret XXXXXXXXXX
## XXXXXXXXXXM Garret XXXXXXXXXX
## XXXXXXXXXXM Garret XXXXXXXXXX
## XXXXXXXXXXM Garret XXXXXXXXXX
## XXXXXXXXXXM Garret XXXXXXXXXX
## XXXXXXXXXXM Garret XXXXXXXXXX
## XXXXXXXXXXM Garret XXXXXXXXXX
## # … with 100 more rows
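As a small sketch of how filter() expressions behave, the snippet below filters a hypothetical four-row tibble (invented values, not babynames) by equality and by a numeric comparison. It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical data for illustration only.
toy <- tibble(
  name = c("Ann", "Bob", "Ann", "Cy"),
  year = c(2000, 2000, 2001, 2001),
  prop = c(0.05, 0.09, 0.10, 0.01)
)

# Use == (two equals signs) to test equality; a single = is argument assignment.
anns <- filter(toy, name == "Ann")
print(anns)

# Comparison operators such as >, >=, <, <= and != work the same way.
larger <- filter(toy, prop > 0.05)
print(larger)
```

filter() never modifies its input, so toy still has all four rows afterwards; assign the result to a name if you want to keep it.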
2.4.1 Missing values
One important feature of R that can make comparisons tricky is missing values, or NAs (“not availables”). NA represents an unknown value, so missing values are “contagious”: almost any operation involving an unknown value will also be unknown.
NA > 5
## [1] NA
NA + 10
## [1] NA
NA == NA
## [1] NA
NA | FALSE
## [1] NA
NA & FALSE
## [1] FALSE
NA*0
## [1] NA
Inf*0
## [1] NaN
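The results above follow one rule: the answer is NA only when it would depend on what the unknown value turns out to be. The base R sketch below spells this out for the logical operators and shows is.na() and the na.rm argument on a small example vector; the vector x is invented for illustration.

```r
# NA | TRUE is TRUE whatever the unknown is, so the result is known.
or_true   <- NA | TRUE
# NA | FALSE depends on the unknown, so it stays NA.
or_false  <- NA | FALSE
# NA & FALSE is FALSE whatever the unknown is.
and_false <- NA & FALSE
# NA & TRUE depends on the unknown, so it stays NA.
and_true  <- NA & TRUE
print(c(or_true, or_false, and_false, and_true))

# is.na() is the reliable way to detect missing values; x == NA is always NA.
x <- c(1, NA, 3)
missing_flags <- is.na(x)
print(missing_flags)

# Many summary functions accept na.rm = TRUE to drop missing values first.
total <- sum(x, na.rm = TRUE)
print(total)
```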
If you want to determine if a value is missing, use is.na(). filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly.
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
## # A tibble: 1 x 1
## x
## XXXXXXXXXX
filter(df, is.na(x) | x > 1)
## # A tibble: 2 x 1
## x
## 1 NA
## XXXXXXXXXX
2.4.2 Your turn
    Use filter, babynames, and the logical operators to find:
    All of the rows where prop is greater than or equal to 0.08
    All of the children named “Sea”
2.4.3 Boolean operators
filter(babynames, name == "Garrett", year == 1880)
## # A tibble: 1 x 5
## year sex name n prop
## XXXXXXXXXXM Garrett XXXXXXXXXX
filter(babynames, name == "Garrett" & year == 1880)
## # A tibble: 1 x 5
## year sex name n prop
## XXXXXXXXXXM Garrett XXXXXXXXXX
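The two calls above return the same row because filter() combines comma-separated conditions with "and". The sketch below checks this on a hypothetical four-row tibble (invented values) and also shows |, ! and %in%, which are standard R operators; it assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical data for illustration only.
toy <- tibble(
  name = c("Ann", "Bob", "Ann", "Cy"),
  year = c(2000, 2000, 2001, 2001)
)

# A comma between conditions means the same as &.
with_comma <- filter(toy, name == "Ann", year == 2000)
with_and   <- filter(toy, name == "Ann" & year == 2000)
print(nrow(with_comma) == nrow(with_and))

# | keeps rows where at least one condition is TRUE.
either <- filter(toy, name == "Ann" | year == 2000)

# ! negates a condition.
not_ann <- filter(toy, !(name == "Ann"))

# %in% is a compact way to write several == tests joined by |.
in_set <- filter(toy, name %in% c("Bob", "Cy"))
print(in_set)
```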
Answered Same Day Oct 30, 2021

Solution

Pooja answered on Oct 31 2021
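library(tidyr)
library(dplyr)
library(ggplot2)
library(nycflights13)
library(openair)
nycflights13::flights
#summary(flights)
#View(flights)
#2a#
# Count flights per destination, then sort the counts in increasing order.
table1 <- table(flights$dest)
sort(table1, decreasing = FALSE)
#b#
# Difference between arrival delay and departure delay for each flight.
flights$gained_time <- flights$arr_delay - flights$dep_delay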
plot(flights$dep_delay,...