Chapter 9 Tidyverse
9.1 Introduction
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
library(tidyverse)
9.2 tidyr
The goal of tidyr is to help you create tidy data. Tidy data describes a standard way of storing data that is used wherever possible throughout the tidyverse. Tidy data is data where:
Every column is a variable.
Every row is an observation.
Every cell is a single value.
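As a minimal sketch of what these three rules mean in practice, the example below reshapes a small made-up table so that every column is a variable. The `wide` data frame and its column names are hypothetical, and the sketch assumes tidyr's pivot_longer() is available.

```r
library(tidyr)

# Hypothetical untidy data: the "year" variable is spread across column names
wide <- data.frame(
  country = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(15, 25),
  check.names = FALSE
)

# Gather the year columns so that each row is one country-year observation
long <- pivot_longer(wide, cols = c("2019", "2020"),
                     names_to = "year", values_to = "cases")
long
```

After reshaping, `long` has one column per variable (country, year, cases) and one row per observation.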
9.3 dplyr
dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.
filter(): extract rows based on condition(s)
arrange(): sort observations in ascending or descending order of a particular variable
mutate(): change or create a column
summarize(): collapse the data (or each group) into summary values
group_by(): group rows by variable(s) to create a summary using summarize()
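The verbs are usually chained with the pipe. The sketch below is one possible combination on the built-in mtcars data; the `kpl` column, the conversion factor, and the `cyl_summary` name are illustrative choices, not part of the original text.

```r
library(dplyr)

cyl_summary <- mtcars %>%
  filter(hp > 100) %>%                           # keep cars with more than 100 horsepower
  mutate(kpl = mpg * 0.425) %>%                  # create a column: approximate km per litre
  group_by(cyl) %>%                              # group rows by number of cylinders
  summarize(mean_kpl = mean(kpl), n = n()) %>%   # one summary row per group
  arrange(desc(mean_kpl))                        # sort groups from highest to lowest mean

cyl_summary
```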
9.4 purrr
purrr enhances R’s functional programming toolkit by providing a complete and consistent set of tools for working with functions and vectors.
9.4.1 Eliminating for loops using map() function
- map(): Use this to apply a function to each element of a list or vector. It always returns a list.
# Define a function that returns the square of its input
square <- function(x) {
  return(x * x)
}
# Create a vector of numbers
vector1 <- c(2, 4, 5, 6)
# Use map() to generate the squares
map(vector1, square)
## [[1]]
## [1] 4
##
## [[2]]
## [1] 16
##
## [[3]]
## [1] 25
##
## [[4]]
## [1] 36
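When every result is a single number, a typed variant such as map_dbl() may be more convenient than a list. The sketch below reuses the same function and vector; the `squares` name is only illustrative.

```r
library(purrr)

square <- function(x) x * x
vector1 <- c(2, 4, 5, 6)

# map_dbl() applies the function like map(), but returns a double vector
squares <- map_dbl(vector1, square)
squares
```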
- map2(): Use this to apply a function to pairs of elements taken from two lists or vectors.
x <- c(2, 4, 5, 6)
y <- c(2, 3, 4, 5)
# Raise each element of x to the power of the matching element of y
to_Power <- function(x, y) {
  return(x^y)
}
map2(x, y, to_Power)
## [[1]]
## [1] 4
##
## [[2]]
## [1] 64
##
## [[3]]
## [1] 625
##
## [[4]]
## [1] 7776
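- pmap(): Use this to apply a function across the parallel elements of a list of vectors or lists. For a data frame, this means the function is called once per row.
mtcars_sub <- mtcars[1:5, c("mpg", "hp", "disp")]
# Sum mpg, hp and disp for each of the first five rows
pmap(mtcars_sub, sum)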
## [[1]]
## [1] 291
##
## [[2]]
## [1] 291
##
## [[3]]
## [1] 223.8
##
## [[4]]
## [1] 389.4
##
## [[5]]
## [1] 553.7
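Because pmap() passes the elements by name when the input is named, the function can refer to the columns directly. The following sketch assumes the same `mtcars_sub` subset and uses the typed variant pmap_dbl(); the ratio computed is only an illustration.

```r
library(purrr)

mtcars_sub <- mtcars[1:5, c("mpg", "hp", "disp")]

# Each column is matched to the function argument with the same name
hp_per_disp <- pmap_dbl(mtcars_sub, function(mpg, hp, disp) hp / disp)
hp_per_disp
```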
9.4.2 Working with lists using purrr package
The tasks related to lists can be put into five buckets as given below:
Filtering lists
Summarizing lists
Transforming lists
Reshaping Lists
Join or Combine Lists
9.4.2.1 Filtering lists
The five functions we find most useful here are:
- pluck() or chuck(): These functions extract a particular element from a list by name or index. The only difference is what happens when the element is not present in the list: pluck() consistently returns NULL, whereas chuck() always throws an error.
ls1 <- list("R", "Statistics", "Blog")
pluck(ls1, 2)
## [1] "Statistics"
If you pass an index of 4, which does not exist in the list, pluck() returns NULL.
ls1 <- list("R", "Statistics", "Blog")
pluck(ls1, 4)
## NULL
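To see the contrast with chuck(), the sketch below requests the same missing index with both functions. The fallback strings and the use of tryCatch() are illustrative choices made so the example runs without stopping.

```r
library(purrr)

ls1 <- list("R", "Statistics", "Blog")

# pluck() returns NULL for a missing element, or a fallback given via .default
missing_default <- pluck(ls1, 4, .default = "not found")

# chuck() signals an error instead; tryCatch() converts it into a value here
chuck_result <- tryCatch(
  chuck(ls1, 4),
  error = function(e) "chuck() raised an error"
)

missing_default
chuck_result
```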
- keep(): As the name suggests, this function keeps only those elements of a list that pass a logical test. Here we keep only the elements that are greater than five.
ls2 <- list(23, 12, 14, 7, 2, 0, 24, 98)
keep(ls2, function(x) x > 5)
## [[1]]
## [1] 23
##
## [[2]]
## [1] 12
##
## [[3]]
## [1] 14
##
## [[4]]
## [1] 7
##
## [[5]]
## [1] 24
##
## [[6]]
## [1] 98
- discard(): The opposite of keep(): this function drops the elements that pass the logical test. For example, to drop NA values, use is.na() as the test to discard the elements that are NA.
ls3 <- list(23, NA, 14, 7, NA, NA, 24, 98)
discard(ls3, is.na)
## [[1]]
## [1] 23
##
## [[2]]
## [1] 14
##
## [[3]]
## [1] 7
##
## [[4]]
## [1] 24
##
## [[5]]
## [1] 98
- compact(): A straightforward function that drops all the NULL values present in the list. Do not confuse NA with NULL: they are two different things in R, and compact() keeps NA values.
ls4 <- list(23, NULL, NA, 34)
compact(ls4)
## [[1]]
## [1] 23
##
## [[2]]
## [1] NA
##
## [[3]]
## [1] 34
- head_while(): This function tests each element in turn, starting from the top of the list, and returns the leading elements up to (but not including) the first one that fails the test. In the example below, the test is whether the element is a character.
ls5 <- list("R", "Statistics", "Blog", 2, 3, 1)
head_while(ls5, is.character)
## [[1]]
## [1] "R"
##
## [[2]]
## [1] "Statistics"
##
## [[3]]
## [1] "Blog"
If you are interested in the tail elements, the purrr package provides the tail_while() function. With this, we end the list-filtering functions. These are some of the most common functions you will find useful in day-to-day work.
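As a brief sketch of tail_while(), the example below reuses the same list as the head_while() example and collects the trailing numeric elements; the `trailing_numbers` name is only illustrative.

```r
library(purrr)

ls5 <- list("R", "Statistics", "Blog", 2, 3, 1)

# Walk backwards from the end and keep elements while the test holds
trailing_numbers <- tail_while(ls5, is.numeric)
trailing_numbers
```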
9.4.2.2 Summarising Lists
purrr provides several functions for this; here we cover the five most widely used.
- every(): This function returns TRUE if all the elements in a list pass a condition or test. In the example below, every() returns FALSE because one of the elements in the list is not a character.
sm1 <- list("R", 2, "Rstatistics", "Blog")
every(sm1, is.character)
## [1] FALSE
- some(): Like every(), it checks a condition against the elements of a list, but it returns TRUE if at least one element passes the test.
sm2 <- list("R", 2, "Rstatistics", "Blog")
some(sm2, is.character)
## [1] TRUE
- has_element(): The function returns TRUE if the list contains the given element.
sm2 <- list("R", 2, "Rstatistics", "Blog")
has_element(sm2, 2)
## [1] TRUE
- detect(): Returns the first element that passes the test or logical condition. The function returns the element itself. Below we look for numeric elements in the given list. Although the list contains two numeric elements, the function returns only the first one, 2.
sm3 <- list("R", 2, "Rstatistics", "Blog", 3)
detect(sm3, is.numeric)
## [1] 2
- detect_index(): Just like detect(), this function looks for the first element that passes the test, but it returns the index of that element rather than the element itself.
sm4 <- list(2, "Rstatistics", "Blog", TRUE)
detect_index(sm4, is.logical)
## [1] 4
9.4.2.3 Reshaping Lists
Flattening and transposing a list are two tasks that you will find yourself doing regularly as part of data wrangling. If you have made it this far in the tutorial, you know that flattening comes up often. Both tasks can be achieved using the following functions.
- flatten(): The function removes one level of hierarchy from a list of lists. The closest equivalent in base R is the unlist() function. Although the two are similar, flatten() removes only a single layer of hierarchy and is type-stable, meaning that you always know the output type. There are typed variants that guarantee a particular output type:
flatten_lgl(): returns a logical vector
flatten_int(): returns an integer vector
flatten_dbl(): returns a double vector
flatten_chr(): returns a character vector
flatten_dfr(): returns a data frame created by row-binding
flatten_dfc(): returns a data frame created by column-binding
Let’s look at the output generated by flatten() and its typed variants. First, let us create a list of numbers. If you prefer, you can use any list from the examples above.
x <- rerun(2, sample(6))
x
## [[1]]
## [1] 3 5 2 6 1 4
##
## [[2]]
## [1] 6 5 1 2 4 3
So our list consists of 2 integer vectors, each containing the numbers 1 to 6 in random order. We will now flatten the list using the flatten_int() function.
flatten_int(x)
## [1] 3 5 2 6 1 4 6 5 1 2 4 3
All the functions mentioned have simple, straightforward syntax.
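The difference between removing a single level and removing all levels can be seen on a more deeply nested list. The `nested` list below is a made-up example, and the sketch assumes a purrr version in which flatten() is still available.

```r
library(purrr)

# A hypothetical list with two levels of nesting
nested <- list(1, list(2, list(3)))

# flatten() removes only the outermost level; the inner list(3) survives
one_level <- flatten(nested)

# unlist() removes every level and returns an atomic vector
all_levels <- unlist(nested)

str(one_level)
all_levels
```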
- transpose(): The function turns a list of lists inside out, for example converting a pair of lists into a list of pairs.
x <- rerun(2, x = runif(1), y = runif(3))
x
## [[1]]
## [[1]]$x
## [1] 0.6565405
##
## [[1]]$y
## [1] 0.1978803 0.9411851 0.8463121
##
##
## [[2]]
## [[2]]$x
## [1] 0.1559741
##
## [[2]]$y
## [1] 0.02976176 0.45462761 0.30839978
x %>% transpose() %>% str()
## List of 2
## $ x:List of 2
## ..$ : num 0.657
## ..$ : num 0.156
## $ y:List of 2
## ..$ : num [1:3] 0.198 0.941 0.846
## ..$ : num [1:3] 0.0298 0.4546 0.3084
9.4.2.4 Join or Combine Lists
You can join two lists in different ways: you can append one after the other, or you can put one at the beginning of the other. Base R's append() and purrr's prepend() help to achieve these tasks.
- append(): This function appends one list to the end of another.
a <- list(22, 11, 44, 55)
b <- list(11, 99, 77)
flatten_dbl(append(a, b))
## [1] 22 11 44 55 11 99 77
- prepend(): Using this function, we can add a list before another list.
a <- list(22, 11, 44, 55)
b <- list(11, 99, 77)
flatten_dbl(prepend(a, b))
## [1] 11 99 77 22 11 44 55
9.4.3 Other useful functions
In this section, we cover functions that do not necessarily fall into the above categories but that are well worth knowing.
- cross_df(): The function returns a data frame where each row is one combination of the list elements.
df <- list(empId = c(100, 101, 102, 103),
           name = c("John", "Jack", "Jill", "Cathy"),
           exp = c(4, 10, 6, 8))
df
## $empId
## [1] 100 101 102 103
##
## $name
## [1] "John" "Jack" "Jill" "Cathy"
##
## $exp
## [1] 4 10 6 8
Here we have three vectors stored in a list. We can now use the cross_df() function to get a data frame of all 4 x 4 x 4 = 64 combinations.
cross_df(df)
## # A tibble: 64 x 3
## empId name exp
## <dbl> <chr> <dbl>
## 1 100 John 4
## 2 101 John 4
## 3 102 John 4
## 4 103 John 4
## 5 100 Jack 4
## 6 101 Jack 4
## 7 102 Jack 4
## 8 103 Jack 4
## 9 100 Jill 4
## 10 101 Jill 4
## # … with 54 more rows
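The row count follows from multiplying the lengths of the three vectors. As a rough cross-check, base R's expand.grid() builds a comparable table of combinations; the `value_list` and `combos` names below are illustrative, and the result is a plain data frame rather than a tibble.

```r
# The same three vectors as above, stored in a list
value_list <- list(empId = c(100, 101, 102, 103),
                   name = c("John", "Jack", "Jill", "Cathy"),
                   exp = c(4, 10, 6, 8))

# One row per combination; the first variable varies fastest
combos <- expand.grid(value_list, stringsAsFactors = FALSE)

nrow(combos)                    # 4 * 4 * 4 combinations
prod(lengths(value_list))       # the same count, computed from the lengths
```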
- rerun(): You can use rerun() to repeat an expression n times. It works much like base R's replicate() function with simplify = FALSE. The rerun() function is very useful when it comes to generating sample data in R.
rerun(1, print("Hello, World!"))
## [1] "Hello, World!"
## [[1]]
## [1] "Hello, World!"
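For generating sample data, base R's replicate() with simplify = FALSE produces a similar list of results. The sketch below is an illustration under that assumption; the seed and the `samples` name are arbitrary.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Evaluate sample(6) twice and keep the results as a list of two vectors
samples <- replicate(2, sample(6), simplify = FALSE)
str(samples)
```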
- reduce(): The reduce() function combines the elements of a list or vector into a single value by repeatedly applying a two-argument function. For example, say I want to add all the numbers in a vector. Notice that the operator is wrapped in backticks, not quotation marks.
reduce(c(4,12,30, 16), `+`)
## [1] 62
Let’s look at another example. Say I want to concatenate the vectors inside a list element by element. To achieve this, we can use the paste() function as shown below.
x <- list(c(0, 1), c(2, 3), c(4, 5))
reduce(x, paste)
## [1] "0 2 4" "1 3 5"
The function also has a variant named reduce2(). If the function you are applying needs an extra value from a second vector or list at each step, you can use reduce2() instead of reduce().
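A small sketch of reduce2(), assuming a helper named `paste2` (hypothetical, defined here) that takes the running result, the next element, and a separator drawn from the second vector. The second vector is one element shorter than the first because no separator is needed before the first element.

```r
library(purrr)

# Hypothetical helper: join the running result and the next value with sep
paste2 <- function(acc, nxt, sep) paste(acc, nxt, sep = sep)

# Each step uses the next letter and the next separator
joined <- reduce2(letters[1:4], c("-", ".", "-"), paste2)
joined
```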
- accumulate(): The function sequentially applies a two-argument function to a vector or list. It works just like reduce(), but also returns the intermediate results. At each iteration, the function takes two arguments: the accumulated result so far (the initial value or the result from the previous step) and the next value in the vector. For example, the code below returns the cumulative sum of the values in a vector.
accumulate(c(1,2,3,4,5), sum)
## [1] 1 3 6 10 15
The corresponding variant for working with two vectors or lists is accumulate2().
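As a sketch of accumulate2(), the example below joins letters with separators taken from a second vector and keeps every intermediate result. The `paste2` helper and the variable names are hypothetical.

```r
library(purrr)

# Hypothetical helper: join the running result and the next value with sep
paste2 <- function(acc, nxt, sep) paste(acc, nxt, sep = sep)

# Returns the running result after each step, starting with the first letter
steps <- accumulate2(letters[1:4], c("-", ".", "-"), paste2)
unlist(steps)
```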