We read through the fifth chapter of the book. It was about data transformation with **dplyr**, using its primary functions `filter()`, `arrange()`, `select()`, `mutate()`, `summarise()` and `group_by()`. Personally, I am already familiar with these, so here I focus on eccentric trivia and tidbits I liked in the chapter.

```
#import the libraries we need
library(tidyverse)
```

- Almost any operation involving `NA` results in an unknown value.

```
NA > 5
## [1] NA
NA + 10
## [1] NA
10 == NA
## [1] NA
```
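Since even `10 == NA` is unknown, `==` cannot be used to find missing values; `is.na()` is the usual tool for that. A minimal sketch, where the vector `x` is made up for illustration:

```r
# x is a made-up vector for illustration
x <- c(3, NA, 7)

# comparing with == propagates the unknown value to every element
eq_result <- x == NA

# is.na() returns a definite TRUE/FALSE for every element
na_result <- is.na(x)

print(eq_result)
print(na_result)
```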

- Assuming `NA` is a number, it could be any value from `-Inf` to `Inf`. Any number raised to the power of zero is 1, so the following does not return an unknown value.

```
#note that in the second, NA ^ 0 is calculated first, before the negation
c( NA ^ 0, -NA ^ 0, (-NA) ^ 0 )
## [1]  1 -1  1
```

- According to the documentation (`?logical`), logical vectors are coerced to integer vectors in contexts where a numerical value is required, with `TRUE` being mapped to `1L`, and `FALSE` to `0L`:

```
#note that anything that is neither 0 nor 1 does not equate to TRUE/FALSE
c(-Inf, -10, -1, 0, 1, 10, Inf) == TRUE
## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE
c(-Inf, -10, -1, 0, 1, 10, Inf) == FALSE
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE
```
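One practical consequence of this coercion, sketched below with a made-up vector `delays`: `sum()` of a logical vector counts the `TRUE` values, and `mean()` gives their proportion.

```r
# delays is a made-up vector for illustration
delays <- c(5, 130, 45, 200)

is_late <- delays > 120      # a logical vector
n_late <- sum(is_late)       # TRUE counts as 1, FALSE as 0
prop_late <- mean(is_late)   # proportion of TRUE values

print(n_late)
print(prop_late)
```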

A logical `NA` could be either `TRUE` or `FALSE`; we don’t know which. This leads to the following behaviours with the `|` and `&` operators.

```
#only one side needs to be TRUE
NA | TRUE
## [1] TRUE
#both sides need to be TRUE, otherwise FALSE
NA & FALSE
## [1] FALSE
#the answer would be FALSE if the unknown left side were FALSE
NA & TRUE
## [1] NA
#the answer would be TRUE if the unknown left side were TRUE
NA | FALSE
## [1] NA
```
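The four cases above are part of a larger three-valued truth table. As a rough sketch, `outer()` can build the full tables for both operators; the names `vals`, `or_table` and `and_table` are just for illustration.

```r
# the three possible logical values, named so the tables are labelled
vals <- c(`TRUE` = TRUE, `FALSE` = FALSE, `NA` = NA)

# rows are the left operand, columns are the right operand
or_table <- outer(vals, vals, "|")
and_table <- outer(vals, vals, "&")

print(or_table)
print(and_table)
```

The result is only `NA` in the cells where the unknown value could change the answer.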

Contrary to what one might expect, `NA * 0` does not result in 0; it evaluates to `NA`. This is because `NA` could stand for `Inf`, and `Inf * 0` evaluates to `NaN` (‘Not a Number’). Take note of the following behaviours.

```
Inf * 0
## [1] NaN
pi / 0 #non-zero finite values divided by 0 give Inf or -Inf
## [1] Inf
0 / 0
## [1] NaN
```
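`NaN` and `NA` are related but not the same thing. The sketch below, using a made-up vector `vals`, shows how `is.nan()` and `is.na()` treat them differently.

```r
# a made-up vector mixing NaN, NA and Inf
vals <- c(Inf * 0, 0 / 0, NA, pi / 0)

nan_flags <- is.nan(vals)  # TRUE only for NaN
na_flags <- is.na(vals)    # TRUE for both NaN and NA

print(nan_flags)
print(na_flags)
```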

- What do you think `Inf - Inf` evaluates to? Check out this link.

Numeric values in R have an inherent boolean value, generally known as either **truthy** or **falsey**. This is **different** from being **equal** to `TRUE` or `FALSE`, which is what the `==` operator tests for!

Remember from above that only `0 == FALSE` and `1 == TRUE`; everything else equals neither.

```
x <- c(0, 1, pi, Inf, -1, -pi, -Inf)
x == TRUE
## [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE
x == FALSE
## [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE
```

However, **all numbers except 0** (positive or negative) are inherently **truthy**, which one can see when they are used as the test condition in `ifelse()`, or in an `if` statement.

```
ifelse(x, 'truthy', 'falsey')
## [1] "falsey" "truthy" "truthy" "truthy" "truthy" "truthy" "truthy"
```

Have a go and check how `NA` and `NaN` are assessed in an `if` statement. Spoiler: errors incoming!
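For anyone who wants to check without halting their script, here is a small sketch. The helper `check_if()` is a hypothetical name of my own; it wraps `if` in `tryCatch()` so an error is reported as the string `"error"`. The vectorised `ifelse()` is shown next to it for comparison.

```r
# check_if() is a hypothetical helper: it runs if() on a single value
# and reports "error" instead of stopping when the condition is invalid
check_if <- function(value) {
  tryCatch(
    if (value) "truthy" else "falsey",
    error = function(e) "error"
  )
}

if_results <- c(check_if(NA), check_if(NaN), check_if(0), check_if(2))

# ifelse() does not error on unknown conditions, it returns NA for them
ifelse_results <- ifelse(c(NA, NaN, 0, 2), "truthy", "falsey")

print(if_results)
print(ifelse_results)
```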

Let’s say, in this *unrealistic* task, for each row, we want to calculate the mean of values that belong to columns containing the term “time”.

```
#provide a glimpse of the dataset we are working with
library(nycflights13)
colnames(flights)
## [1] "year" "month" "day" "dep_time"
## [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
## [9] "arr_delay" "carrier" "flight" "tailnum"
## [13] "origin" "dest" "air_time" "distance"
## [17] "hour" "minute" "time_hour"
```

In other words, we want to calculate the mean of `dep_time`, `sched_dep_time`, `arr_time`, `sched_arr_time` and `air_time` **for each row**. Let’s take only a subset of the dataset so it’s easier to see. I also added a unique index to each row.

```
data_subset <- flights %>%
  select(month, dep_time, sched_dep_time, arr_time, sched_arr_time, air_time) %>%
  mutate(unique_id = row_number())
head(data_subset)
```

```
## # A tibble: 6 x 7
## month dep_time sched_dep_time arr_time sched_arr_time air_time unique_id
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 1 517 515 830 819 227 1
## 2 1 533 529 850 830 227 2
## 3 1 542 540 923 850 160 3
## 4 1 544 545 1004 1022 183 4
## 5 1 554 600 812 837 116 5
## 6 1 554 558 740 728 150 6
```

The function `summarise()` only works column-wise. To get around the problem, I first `gather()` the values from the five “_time” columns into a single column. I then use `group_by()` (on the unique identifiers) and `summarise()` to calculate the mean, which I join back onto the original data frame.

```
data_subset %>%
  gather("key", "value", contains("time")) %>%
  group_by(unique_id) %>%
  summarise(mean_time = mean(value)) %>%
  full_join(data_subset, by = "unique_id")
```

```
## # A tibble: 336,776 x 8
## unique_id mean_time month dep_time sched_dep_time arr_time
## <int> <dbl> <int> <int> <int> <int>
## 1 1 582. 1 517 515 830
## 2 2 594. 1 533 529 850
## 3 3 603 1 542 540 923
## 4 4 660. 1 544 545 1004
## 5 5 584. 1 554 600 812
## 6 6 546 1 554 558 740
## 7 7 616 1 555 600 913
## 8 8 528. 1 557 600 709
## 9 9 596. 1 557 600 838
## 10 10 559. 1 558 600 753
## # ... with 336,766 more rows, and 2 more variables: sched_arr_time <int>,
## # air_time <dbl>
```

For such a task, I used to use `rowwise()`, but I have been advised that it is not vectorised and therefore inefficient. There might still be a less clunky way to do what I showed; please email me if you know!
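One candidate, offered as a sketch rather than a definitive answer: base R's `rowMeans()` is vectorised over rows and can be combined with `across()` inside `mutate()`. This assumes dplyr 1.0.0 or later (where `across()` is available). The tibble `toy` is a stand-in built from the first three rows printed above, so the example runs without **nycflights13**.

```r
library(dplyr)

# toy is a stand-in for data_subset, copied from the first three rows shown above
toy <- tibble(
  month = c(1L, 1L, 1L),
  dep_time = c(517L, 533L, 542L),
  sched_dep_time = c(515L, 529L, 540L),
  arr_time = c(830L, 850L, 923L),
  sched_arr_time = c(819L, 830L, 850L),
  air_time = c(227, 227, 160)
)

# across() picks the columns whose names contain "time";
# rowMeans() then averages them row by row, with no reshaping or join
toy_means <- toy %>%
  mutate(mean_time = rowMeans(across(contains("time"))))

print(toy_means$mean_time)
```

Like `mean(value)` in the gather approach, `rowMeans()` returns `NA` for any row with a missing value unless `na.rm = TRUE` is passed.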

- The following statements are equivalent due to **de Morgan’s Law**.

```
flights %>% filter(arr_delay <= 120, dep_delay <= 120)
flights %>% filter(!(arr_delay > 120 | dep_delay > 120))
```
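The equivalence can be checked on plain logical vectors. In the sketch below the two delay vectors are made up, and `&` stands in for the comma inside `filter()`, which combines conditions with "and".

```r
# made-up delay values, including a missing one
arr_delay <- c(10, 150, 90, NA, 130)
dep_delay <- c(5, 20, 200, 30, 125)

# first form: both delays are at most 120
keep_a <- arr_delay <= 120 & dep_delay <= 120

# second form: not (either delay is above 120)
keep_b <- !(arr_delay > 120 | dep_delay > 120)

print(identical(keep_a, keep_b))
```

The two forms agree even for the row with the missing value, where both give `NA`.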

- Wrapping an assignment in parentheses both prints the result and saves it to a variable.

`(dec25 <- filter(flights, month == 12, day == 25))`

```
## # A tibble: 719 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 12 25 456 500 -4 649
## 2 2013 12 25 524 515 9 805
## 3 2013 12 25 542 540 2 832
## 4 2013 12 25 546 550 -4 1022
## 5 2013 12 25 556 600 -4 730
## 6 2013 12 25 557 600 -3 743
## 7 2013 12 25 557 600 -3 818
## 8 2013 12 25 559 600 -1 855
## 9 2013 12 25 559 600 -1 849
## 10 2013 12 25 600 600 0 850
## # ... with 709 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
```

- Here is a list of verbs/functions I found useful from the chapter:
  - `near` (section 5.2.1): useful when testing two numbers for equality due to finite precision arithmetic.
  - `between` (Q2 in exercise 5.2.4)
  - `everything` (section 5.4): used in conjunction with `select()`
  - `one_of` (Q3 in section 5.4.1)
  - `lead` and `lag` (section 5.5.1)
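A quick sketch of a few of these helpers in action, using made-up inputs rather than the flights data:

```r
library(dplyr)

# near() compares with a small tolerance, unlike ==
exact_equal <- sqrt(2) ^ 2 == 2
near_equal <- near(sqrt(2) ^ 2, 2)

# between(x, left, right) is shorthand for x >= left & x <= right
in_range <- between(c(1, 5, 12), 2, 10)

# lag() shifts values one position later, lead() one position earlier
x <- 1:5
lagged <- lag(x)
led <- lead(x)

print(c(exact_equal, near_equal))
print(in_range)
print(lagged)
print(led)
```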