Chapter 9 Tidyverse

9.1 Introduction

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

library(tidyverse)

9.2 tidyr

The goal of tidyr is to help you create tidy data. Tidy data describes a standard way of storing data that is used wherever possible throughout the tidyverse. Tidy data is data where:

  • Every column is a variable.

  • Every row is an observation.

  • Every cell is a single value.
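The three rules above can be illustrated with a small reshaping sketch. The data frame `wide` and its column names are made up for this illustration; `pivot_longer()` is the tidyr function assumed here for moving from one column per year to one row per observation.

```r
library(tidyr)

# Hypothetical untidy table: the year is stored in the column names
wide <- data.frame(
  country = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(12, 25),
  check.names = FALSE
)

# Gather the year columns so that each row is one country-year observation
long <- pivot_longer(
  wide,
  cols = c("2019", "2020"),
  names_to = "year",
  values_to = "cases"
)
long
```

After reshaping, `year` and `cases` are each a variable in their own column, and every row holds exactly one observation.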

9.3 dplyr

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.

filter(): extract rows based on condition(s)

arrange(): sort observations in ascending or descending order of a particular variable

mutate(): change or create a column

summarize(): summarize the data

group_by(): group rows by variable(s) to create a summary using summarize()
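The verbs listed above are usually chained with the pipe. The following is a minimal sketch on the built-in mtcars data; the column name `kpl` and the approximate conversion factor 0.425 (miles per gallon to kilometres per litre) are assumptions made for illustration.

```r
library(dplyr)

result <- mtcars %>%
  filter(hp > 100) %>%                        # keep rows with more than 100 horsepower
  mutate(kpl = mpg * 0.425) %>%               # create a new column (approximate km per litre)
  group_by(cyl) %>%                           # one group per cylinder count
  summarize(mean_kpl = mean(kpl), n = n()) %>% # one summary row per group
  arrange(desc(mean_kpl))                     # sort groups from most to least efficient
result
```

Each step takes a data frame and returns a data frame, which is what makes the verbs composable.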

9.4 purrr

purrr enhances R’s functional programming toolkit by providing a complete and consistent set of tools for working with functions and vectors.

9.4.1 Eliminating for loops using the map() function

  • map(): Use this to apply a function to each element of a list or vector.
# Define a function that returns the square of its input
square <- function(x){
  return(x*x)
}
# Create a vector of numbers
vector1 <- c(2,4,5,6)
# Use the map() function to generate the squares
map(vector1, square)
## [[1]]
## [1] 4
## 
## [[2]]
## [1] 16
## 
## [[3]]
## [1] 25
## 
## [[4]]
## [1] 36
  • map2(): Use if you are going to apply a function to a pair of elements from two different lists or vectors.
x <- c(2, 4, 5, 6)
y <- c(2, 3, 4, 5)
to_Power <- function(x, y){
  return(x^y)
}
map2(x, y, to_Power)
## [[1]]
## [1] 4
## 
## [[2]]
## [1] 64
## 
## [[3]]
## [1] 625
## 
## [[4]]
## [1] 7776
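  • pmap(): Use this to apply a function to several lists or vectors in parallel, passed together as a single list. A data frame also works, since it is a list of columns; in that case the function is applied row by row.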
mtcars_sub <- mtcars[1:5,c("mpg", "hp", "disp")]
pmap(mtcars_sub, sum)
## [[1]]
## [1] 291
## 
## [[2]]
## [1] 291
## 
## [[3]]
## [1] 223.8
## 
## [[4]]
## [1] 389.4
## 
## [[5]]
## [1] 553.7
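All three functions above return lists. When every result is a single number, the typed variants map_dbl(), map2_dbl() and pmap_dbl() return a plain double vector instead. A minimal sketch reusing the same inputs as above:

```r
library(purrr)

square <- function(x) x * x

# Same computations as above, but returned as double vectors rather than lists
squares <- map_dbl(c(2, 4, 5, 6), square)
powers  <- map2_dbl(c(2, 4, 5, 6), c(2, 3, 4, 5), function(x, y) x^y)
row_sums <- pmap_dbl(mtcars[1:5, c("mpg", "hp", "disp")], sum)

squares   # 4 16 25 36
powers    # 4 64 625 7776
row_sums  # 291.0 291.0 223.8 389.4 553.7
```

The typed variants raise an error if a result is not a single value of the requested type, which makes unexpected output easier to catch.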

9.4.2 Working with lists using the purrr package

The tasks related to lists can be put into five buckets as given below:

  • Filtering lists

  • Summarizing lists

  • Transforming lists

  • Reshaping Lists

  • Join or Combine Lists

9.4.2.1 Filtering lists

The functions we find most helpful here are:

  • pluck() or chuck(): Using these functions, you can extract a particular element from a list by its name or index. The only difference is what happens when the element is not present in the list: pluck() returns NULL, whereas chuck() throws an error.
ls1 <- list("R", "Statistics", "Blog")
pluck(ls1, 2)
## [1] "Statistics"

Notice that if you pass index 4, which does not exist in the list, pluck() returns NULL.

ls1 <- list("R", "Statistics", "Blog")
pluck(ls1, 4)
## NULL
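To see the contrast with chuck(), the sketch below wraps the call in tryCatch() so that the error is caught rather than stopping the script. The fallback strings are made up for illustration; the `.default` argument of pluck() is assumed to be available in the installed purrr version.

```r
library(purrr)

ls1 <- list("R", "Statistics", "Blog")

# chuck() signals an error for a missing index; catch it and return a message instead
chucked <- tryCatch(
  chuck(ls1, 4),
  error = function(e) "index 4 not found"
)
chucked

# pluck() can return a fallback value instead of NULL
plucked <- pluck(ls1, 4, .default = "missing")
plucked
```

chuck() is the safer choice when a missing element indicates a bug, while pluck() suits data where gaps are expected.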
  • keep(): A handy function that, as the name suggests, keeps only those elements of the list which pass a logical test. Here we keep only the elements that are greater than five.
ls2 <- list(23, 12, 14, 7, 2, 0, 24, 98)
keep(ls2, function(x) x > 5)
## [[1]]
## [1] 23
## 
## [[2]]
## [1] 12
## 
## [[3]]
## [1] 14
## 
## [[4]]
## [1] 7
## 
## [[5]]
## [1] 24
## 
## [[6]]
## [1] 98
  • discard(): The opposite of keep(): it drops the elements which pass the logical test. Say we want to drop NA values; we can use is.na() to discard the elements that are NA in the list.
ls3 <- list(23, NA, 14, 7, NA, NA, 24, 98)
discard(ls3, is.na)
## [[1]]
## [1] 23
## 
## [[2]]
## [1] 14
## 
## [[3]]
## [1] 7
## 
## [[4]]
## [1] 24
## 
## [[5]]
## [1] 98
  • compact(): A simple, straightforward function that drops all the NULL values present in the list. Do not confuse NA with NULL: NA is a missing value that still occupies a position, whereas NULL represents the absence of an element, so the NA in the example below is kept.
ls4 <- list(23, NULL, NA, 34)
compact(ls4)
## [[1]]
## [1] 23
## 
## [[2]]
## [1] NA
## 
## [[3]]
## [1] 34
  • head_while(): This function checks the logical condition for each element in the list, starting from the top, and returns the leading elements up to the first one that fails the test. In the example below, we check whether each element is a character.
ls5 <- list("R", "Statistics", "Blog", 2, 3, 1)
head_while(ls5, is.character)
## [[1]]
## [1] "R"
## 
## [[2]]
## [1] "Statistics"
## 
## [[3]]
## [1] "Blog"

If you are interested in the tail elements instead, the purrr package provides the tail_while() function. With this, we end the list filtering functions. These are some of the most common functions you will use in day-to-day work.
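For completeness, a minimal sketch of tail_while() on the same list used for head_while(); it starts from the end and collects elements until one fails the test.

```r
library(purrr)

ls5 <- list("R", "Statistics", "Blog", 2, 3, 1)

# Starting from the last element, keep going while the elements are numeric
tail_nums <- tail_while(ls5, is.numeric)
tail_nums
```

The returned elements keep their original order, so the result here is the list 2, 3, 1.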

9.4.2.2 Summarising Lists

purrr provides several functions for summarizing lists; here we will talk about the five most widely used.

  • every(): This function returns TRUE if all the elements in a list pass a condition or test. In the below example, every() function returns FALSE as one of the elements inside the list is not a character.
sm1 <- list("R", 2, "Rstatistics", "Blog")
every(sm1, is.character)
## [1] FALSE
  • some(): It is similar to every() in that it checks a condition against all the elements inside a list, but it returns TRUE if even one element passes the test.
sm2 <- list("R", 2, "Rstatistics", "Blog")
some(sm2, is.character)
## [1] TRUE
  • has_element(): The function returns TRUE if the list contains the given element.
sm2 <- list("R", 2, "Rstatistics", "Blog")
has_element(sm2, 2)
## [1] TRUE
  • detect(): Returns the first element that passes the test or logical condition. Here the function returns the element itself. Below we are looking for elements that are numeric in the given list. Although the list contains two numeric elements, the function only returns the first one, 2.
sm3 <- list("R", 2, "Rstatistics", "Blog", 3)
detect(sm3, is.numeric)
## [1] 2
  • detect_index(): Just like detect(), this function checks for elements which pass the test, but it returns the index of the first matching element rather than the element itself.
sm4 <- list(2, "Rstatistics", "Blog", TRUE)
detect_index(sm4, is.logical)
## [1] 4
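Two details of detect() and detect_index() are worth a quick sketch: searching from the end of the list, and what happens when nothing matches. The `.dir` argument is assumed to be available in the installed purrr version.

```r
library(purrr)

sm3 <- list("R", 2, "Rstatistics", "Blog", 3)

# Search from the end of the list instead of the start
last_num <- detect(sm3, is.numeric, .dir = "backward")
last_idx <- detect_index(sm3, is.numeric, .dir = "backward")
last_num  # 3
last_idx  # 5

# No logical elements in the list: detect() gives NULL, detect_index() gives 0
no_match <- detect(sm3, is.logical)
no_idx   <- detect_index(sm3, is.logical)
no_match  # NULL
no_idx    # 0
```

Checking for NULL or 0 is therefore a simple way to test whether any element matched.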

9.4.2.3 Reshaping Lists

Flattening and transposing a list are two tasks that you will find yourself doing regularly as part of data wrangling. If you have made it this far in the tutorial, you know that flattening in particular comes up often. Both tasks can be achieved using the following functions.

  • flatten(): The function removes one level of hierarchy from a list of lists. The closest base R equivalent is unlist(). Although the two are similar, flatten() removes only a single layer of hierarchy and is type-stable, meaning that you always know the output type. The typed variants listed below let you choose that output type:
  1. flatten_lgl(): returns a logical vector

  2. flatten_int(): returns an integer vector

  3. flatten_dbl(): returns a double vector

  4. flatten_chr(): returns a character vector

  5. flatten_dfr(): returns a data frame created by row-binding

  6. flatten_dfc(): returns a data frame created by column-binding

Let’s look at the output generated by flatten() and its typed variants. First, let us create a list of numbers. If you prefer, you can use any list from the examples above.

x <- rerun(2, sample(6))
x
## [[1]]
## [1] 3 5 2 6 1 4
## 
## [[2]]
## [1] 6 5 1 2 4 3

So our list consists of two integer vectors, each containing the numbers 1 to 6 in random order. We will now flatten the list using the flatten_int() function.

flatten_int(x)
##  [1] 3 5 2 6 1 4 6 5 1 2 4 3

All the functions mentioned have a simple, straightforward syntax.
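The difference between flatten() and unlist() only shows up with deeper nesting. The sketch below uses a made-up nested list to show that flatten() removes one level while unlist() removes all of them.

```r
library(purrr)

# A list nested two levels deep
nested <- list(1, list(2, list(3, 4)))

# flatten() removes only the outermost level: the inner list(3, 4) survives
one_level <- flatten(nested)
length(one_level)        # 3
is.list(one_level[[3]])  # TRUE

# unlist() removes every level and returns an atomic vector
all_levels <- unlist(nested)
all_levels               # 1 2 3 4
```

To remove further levels with flatten(), call it again on the result.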

  • transpose(): The function turns a list of lists inside out, for example converting a pair of lists into a list of pairs.
x <- rerun(2, x = runif(1), y = runif(3))
x
## [[1]]
## [[1]]$x
## [1] 0.6565405
## 
## [[1]]$y
## [1] 0.1978803 0.9411851 0.8463121
## 
## 
## [[2]]
## [[2]]$x
## [1] 0.1559741
## 
## [[2]]$y
## [1] 0.02976176 0.45462761 0.30839978
x %>% transpose() %>% str()
## List of 2
##  $ x:List of 2
##   ..$ : num 0.657
##   ..$ : num 0.156
##  $ y:List of 2
##   ..$ : num [1:3] 0.198 0.941 0.846
##   ..$ : num [1:3] 0.0298 0.4546 0.3084

9.4.2.4 Join or Combine Lists

You can join two lists in two ways: append one list after the other, or place it before the other. Base R's append() and purrr's prepend() handle these tasks.

  • append(): This base R function appends one list to the end of another.
a <- list(22, 11, 44, 55)
b <- list(11, 99, 77)
flatten_dbl(append(a, b))
## [1] 22 11 44 55 11 99 77
  • prepend(): Using this function, we can place one list before another.
a <- list(22, 11, 44, 55)
b <- list(11, 99, 77)
flatten_dbl(prepend(a, b))
## [1] 11 99 77 22 11 44 55
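The same result can be obtained without purrr, since base R's append() accepts an `after` argument giving the position at which the new values are inserted. A minimal sketch using the same two lists:

```r
a <- list(22, 11, 44, 55)
b <- list(11, 99, 77)

# after = 0 inserts b before the first element of a
front <- append(a, b, after = 0)
unlist(front)   # 11 99 77 22 11 44 55

# after = 2 inserts b after the second element of a
middle <- append(a, b, after = 2)
unlist(middle)  # 22 11 11 99 77 44 55
```

This may be useful with recent purrr releases, which, to our knowledge, mark prepend() as deprecated.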

9.4.3 Other useful functions

In this section, we will cover functions that do not necessarily fall into the above categories. But we believe knowing these functions will improve your programming skills tremendously.

  • cross_df(): The function returns a data frame where each row is a combination of list elements.
df <- list( empId = c(100, 101, 102, 103),
            name = c("John", "Jack", "Jill", "Cathy"),
            exp = c(4, 10, 6, 8))
df
## $empId
## [1] 100 101 102 103
## 
## $name
## [1] "John"  "Jack"  "Jill"  "Cathy"
## 
## $exp
## [1]  4 10  6  8

Here we have three vectors stored in a list. We can now use cross_df() function to get the data frame.

cross_df(df)
## # A tibble: 64 x 3
##    empId name    exp
##    <dbl> <chr> <dbl>
##  1   100 John      4
##  2   101 John      4
##  3   102 John      4
##  4   103 John      4
##  5   100 Jack      4
##  6   101 Jack      4
##  7   102 Jack      4
##  8   103 Jack      4
##  9   100 Jill      4
## 10   101 Jill      4
## # … with 54 more rows
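Recent purrr releases (1.0.0 onward, to our knowledge) mark cross_df() as deprecated. As a sketch of an alternative that does not depend on purrr, base R's expand.grid() builds the same set of combinations; the list name `vars` is made up for this illustration.

```r
vars <- list(empId = c(100, 101, 102, 103),
             name = c("John", "Jack", "Jill", "Cathy"),
             exp = c(4, 10, 6, 8))

# One row per combination; the first variable varies fastest
combos <- expand.grid(vars, stringsAsFactors = FALSE)
nrow(combos)     # 64
head(combos, 4)
```

The result is a plain data frame rather than a tibble, with 4 x 4 x 4 = 64 rows.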
  • rerun(): You can use rerun() to repeat an expression n times. It works much like replicate() with simplify = FALSE. The rerun() function is very useful when it comes to generating sample data in R.
rerun(1, print("Hello, World!"))
## [1] "Hello, World!"
## [[1]]
## [1] "Hello, World!"
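Recent purrr releases (1.0.0 onward, to our knowledge) mark rerun() as deprecated. A minimal sketch of two equivalent ways to generate repeated random samples follows; the seed value is arbitrary and only there to make the run repeatable.

```r
library(purrr)

set.seed(42)  # arbitrary seed so the samples can be reproduced

# Base R: evaluate the expression twice and keep the results in a list
samples_base <- replicate(2, sample(6), simplify = FALSE)

# purrr: map over a counter and ignore its value
samples_map <- map(seq_len(2), function(i) sample(6))

str(samples_base)
str(samples_map)
```

Both calls return a list of two integer vectors, each a random permutation of 1 to 6.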
  • reduce(): The reduce() function combines the elements of a list or vector into a single value by repeatedly applying a two-argument function: first to the first two elements, then to that result and the next element, and so on. For example, say I want to add all the numbers of a vector. Notice that + is wrapped in backticks rather than quotation marks here.
reduce(c(4,12,30, 16), `+`)
## [1] 62

Let’s look at another example. Say I want to concatenate the corresponding elements of each vector inside a list: all the first elements together, and all the second elements together. To achieve this, we can use the paste function as shown below.

x <- list(c(0, 1), c(2, 3), c(4, 5))
reduce(x, paste)
## [1] "0 2 4" "1 3 5"

The function also has a variant named reduce2(). If the function needs an additional argument that changes at each step, supplied from a second vector or list, you can use reduce2() instead of reduce().
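A minimal sketch of reduce2(), together with the `.init` argument of reduce(). The helper `paste2` and its argument names are made up for illustration; the second vector supplies a different separator at each step and is one element shorter than the first.

```r
library(purrr)

# Three-argument function: running result, next element, next separator
paste2 <- function(acc, nxt, sep) paste(acc, nxt, sep = sep)

joined <- reduce2(c("a", "b", "c", "d"), c("-", ".", "-"), paste2)
joined  # "a-b.c-d"

# .init supplies a starting value that is combined with the first element
total <- reduce(c(4, 12, 30, 16), `+`, .init = 100)
total   # 162
```

Supplying `.init` is also a way to get a well-defined result when the input may be empty.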

  • accumulate(): The function sequentially applies a function to a vector or list. It works just like reduce(), but also returns intermediate results. At each iteration, the function takes two arguments. One is the initial value or the result from the previous step, and the second is the next value in the vector. For further understanding, let’s take a look at the below example, which returns the cumulative sum of values in a vector.
accumulate(c(1,2,3,4,5), sum)
## [1]  1  3  6 10 15

To iterate over two lists at the same time, use accumulate2().
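As a sketch of both functions, the example below computes running products with accumulate() and builds a string step by step with accumulate2(). The helper `paste2` is made up for illustration.

```r
library(purrr)

# Running product: each step multiplies the previous result by the next value
products <- accumulate(1:5, `*`)
products  # 1 2 6 24 120

# accumulate2() passes the next element of a second vector as a third argument
paste2 <- function(acc, nxt, sep) paste(acc, nxt, sep = sep)
steps <- accumulate2(c("a", "b", "c", "d"), c("-", ".", "-"), paste2)
unlist(steps)  # "a" "a-b" "a-b.c" "a-b.c-d"
```

The last value in each result is what reduce() or reduce2() would return for the same inputs.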

9.5 stringr