We read through the fifth chapter of the book. It covers data transformation with dplyr, using its primary functions filter(), arrange(), select(), mutate(), summarise() and group_by(). Personally I am already familiar with these, so here I focus on eccentric trivia and tidbits I liked in the chapter.

#import the libraries we need
library(tidyverse)

NA values (Q4 in Exercise 5.2.4)

NA > 5
## [1] NA

NA + 10
## [1] NA

10 == NA
## [1] NA

#note that in the second expression, ^ binds tighter than unary minus, so NA ^ 0 is evaluated first and then negated
c( NA ^ 0, -NA ^ 0, (-NA) ^ 0 )
## [1]  1 -1  1

#note that any number other than 0 or 1 is equal to neither TRUE nor FALSE
c(-Inf, -10, -1, 0, 1, 10, Inf) == TRUE
## [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
c(-Inf, -10, -1, 0, 1, 10, Inf) == FALSE
## [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE

NA stands for a value we don’t know, so in a logical context it could be either TRUE or FALSE. This explains how it behaves with the | and & operators.

#only one side needs to be TRUE
NA | TRUE 
## [1] TRUE

#both sides need to be TRUE; a single FALSE is enough to make the result FALSE
NA & FALSE 
## [1] FALSE

#the result is unknown: it would be FALSE if the NA stood for FALSE
NA & TRUE 
## [1] NA

#the result is unknown: it would be TRUE if the NA stood for TRUE
NA | FALSE 
## [1] NA
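One way to convince yourself of these four results is to substitute both possible values for the unknown and see whether the answer changes. The sketch below does this with a helper vector named possible (a name made up for this illustration, not something from the book): when both substitutions agree, R can return a definite answer; when they disagree, the result has to stay NA.

```r
#the two values an unknown logical could take
possible <- c(TRUE, FALSE)

#both substitutions agree, so the result is determined
possible | TRUE    #TRUE TRUE, hence NA | TRUE is TRUE
possible & FALSE   #FALSE FALSE, hence NA & FALSE is FALSE

#the substitutions disagree, so the result stays unknown
possible & TRUE    #TRUE FALSE, hence NA & TRUE is NA
possible | FALSE   #TRUE FALSE, hence NA | FALSE is NA
```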

NA * 0
## [1] NA

NA * 0 is NA rather than 0 because the missing value could be infinite, and Inf * 0 evaluates to NaN (‘Not a Number’), not 0. Take note of the following behaviours.

Inf * 0
## [1] NaN

pi / 0 #a positive finite value divided by 0
## [1] Inf

0 / 0 
## [1] NaN
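Since NA and NaN are easy to mix up, here is a short sketch of how base R's is.na(), is.nan() and is.finite() tell them apart. It is not taken from the chapter; it only uses base functions on a small vector made up for the purpose.

```r
#a made-up vector mixing missing, undefined, infinite and ordinary values
v <- c(NA, NaN, Inf, 0)

#is.na() is TRUE for both NA and NaN
is.na(v)       #TRUE TRUE FALSE FALSE

#is.nan() is TRUE only for NaN
is.nan(v)      #FALSE TRUE FALSE FALSE

#is.finite() is FALSE for NA, NaN and infinite values
is.finite(v)   #FALSE FALSE FALSE TRUE

#the sign carries over when a non-zero number is divided by 0
-pi / 0        #-Inf
```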

TRUE/FALSE versus truthy/falsey (a minor technicality)

Numbers in R also have an inherent boolean value when they are used as a condition, generally known as either truthy or falsey. This is different from being equal to TRUE/FALSE, which is what the == operator tests for!

Remember from above that only 0 == FALSE and 1 == TRUE, everything else is neither.

x <- c(0, 1, pi, Inf, -1, -pi, -Inf)

x == TRUE
## [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE

x == FALSE
## [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

However, every non-zero number, positive or negative, is inherently truthy, which one can see when the numbers are passed as the test argument of ifelse().

ifelse(x, 'truthy', 'falsey')
## [1] "falsey" "truthy" "truthy" "truthy" "truthy" "truthy" "truthy"

Have a go and check how NA and NaN are assessed. Spoiler: ifelse() quietly returns NA, whereas a plain if statement throws an error!
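For anyone who would rather not run into the error blind, the sketch below shows both behaviours side by side. The names x_small, res_na and res_nan are made up for this illustration, and try() is used only to capture the errors so that the script keeps running.

```r
x_small <- c(0, 1, pi, -Inf)

#as.logical() shows the kind of coercion ifelse() applies to a numeric test
as.logical(x_small)   #FALSE TRUE TRUE TRUE

#ifelse() propagates the missing value instead of failing
ifelse(c(NA, NaN), 'truthy', 'falsey')   #NA NA

#if() needs a definite TRUE/FALSE, so both of these throw an error
res_na  <- try(if (NA) 'truthy' else 'falsey', silent = TRUE)
res_nan <- try(if (NaN) 'truthy' else 'falsey', silent = TRUE)

class(res_na)    #"try-error"
class(res_nan)   #"try-error"
```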

A useful idiom I tend to use for rowwise summaries

Let’s say, in this unrealistic task, for each row, we want to calculate the mean of values that belong to columns containing the term “time”.

#provide a glimpse of the dataset we are working with
library(nycflights13)

colnames(flights)
##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
## [13] "origin"         "dest"           "air_time"       "distance"      
## [17] "hour"           "minute"         "time_hour"

In other words, we want to calculate the mean of dep_time, sched_dep_time, arr_time, sched_arr_time and air_time for each row. Let’s take only a subset of the dataset so it’s easier to see. I also add a unique index to each row.

data_subset <- flights %>% 
  select(month, dep_time, sched_dep_time, arr_time, sched_arr_time, air_time) %>% 
  mutate(unique_id = 1:nrow(flights))

head(data_subset)
## # A tibble: 6 x 7
##   month dep_time sched_dep_time arr_time sched_arr_time air_time unique_id
##   <int>    <int>          <int>    <int>          <int>    <dbl>     <int>
## 1     1      517            515      830            819      227         1
## 2     1      533            529      850            830      227         2
## 3     1      542            540      923            850      160         3
## 4     1      544            545     1004           1022      183         4
## 5     1      554            600      812            837      116         5
## 6     1      554            558      740            728      150         6

The function summarise() only works column-wise. To get around the problem, I first gather the values from the five “_time” columns into a single column. I then use group_by() (on the unique identifier) and summarise() to calculate the mean for each row, which I join back onto the original data frame.

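data_subset %>% 
  #stack the five time columns into key/value pairs
  gather("key", "value", contains("time")) %>% 
  #one group per original row
  group_by(unique_id) %>% 
  summarise(mean_time = mean(value)) %>% 
  #name the join column explicitly instead of relying on the default
  full_join(data_subset, by = "unique_id")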
## # A tibble: 336,776 x 8
##    unique_id mean_time month dep_time sched_dep_time arr_time
##        <int>     <dbl> <int>    <int>          <int>    <int>
##  1         1      582.     1      517            515      830
##  2         2      594.     1      533            529      850
##  3         3      603      1      542            540      923
##  4         4      660.     1      544            545     1004
##  5         5      584.     1      554            600      812
##  6         6      546      1      554            558      740
##  7         7      616      1      555            600      913
##  8         8      528.     1      557            600      709
##  9         9      596.     1      557            600      838
## 10        10      559.     1      558            600      753
## # ... with 336,766 more rows, and 2 more variables: sched_arr_time <int>,
## #   air_time <dbl>

For such a task, I used to use rowwise(), but I have been advised that the function is not vectorised and therefore inefficient. There might still be a less clunky way to do what I showed; please email me if you know!
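One candidate worth trying is base R's rowMeans(), which is vectorised over rows. The sketch below is only an illustration on a made-up three-row tibble (toy, toy_means and the values are hypothetical, not taken from flights), and it assumes select() with contains() picks out the same columns as the gather() step does.

```r
library(dplyr)

#hypothetical toy data standing in for data_subset
toy <- tibble(
  unique_id = 1:3,
  dep_time  = c(517, 533, 542),
  arr_time  = c(830, 850, 923),
  air_time  = c(227, 227, 160)
)

#rowMeans() averages across the selected columns for every row at once
toy_means <- toy %>% 
  mutate(mean_time = rowMeans(select(toy, contains("time"))))

toy_means$mean_time
```

If any of the selected columns contains a missing value, rowMeans() returns NA for that row unless na.rm = TRUE is passed, which mirrors how mean() behaves in the summarise() version.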

Miscellaneous bits

#these two filters return the same rows (De Morgan's law)
flights %>% filter(arr_delay <= 120, dep_delay <= 120) 
flights %>% filter(!(arr_delay > 120 | dep_delay > 120)) 
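The two filter() calls above are equivalent by De Morgan's law, and that includes how they treat missing delays, since filter() drops rows where the condition is NA. A quick check on a made-up tibble (toy_delays, both_small, not_any_long and the values are all hypothetical) might look like this:

```r
library(dplyr)

#hypothetical delays, including missing values
toy_delays <- tibble(
  arr_delay = c(10, 130, NA, 50, NA),
  dep_delay = c(5, 20, 200, 150, 30)
)

#keep rows where both delays are at most two hours
both_small <- toy_delays %>% filter(arr_delay <= 120, dep_delay <= 120)

#the same condition written as "not (either delay is over two hours)"
not_any_long <- toy_delays %>% filter(!(arr_delay > 120 | dep_delay > 120))

#compare the two results
all.equal(both_small, not_any_long)
```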

#wrapping an assignment in parentheses also prints the result
(dec25 <- filter(flights, month == 12, day == 25))
## # A tibble: 719 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013    12    25      456            500        -4      649
##  2  2013    12    25      524            515         9      805
##  3  2013    12    25      542            540         2      832
##  4  2013    12    25      546            550        -4     1022
##  5  2013    12    25      556            600        -4      730
##  6  2013    12    25      557            600        -3      743
##  7  2013    12    25      557            600        -3      818
##  8  2013    12    25      559            600        -1      855
##  9  2013    12    25      559            600        -1      849
## 10  2013    12    25      600            600         0      850
## # ... with 709 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>