We read through the first half of the third chapter about data visualisation using ggplot2.

## Questions for review (answers in the book)

1. What is an aesthetic? Name three examples of aesthetics in ggplot2. (section 3.3)

2. What are facets?
• Distinguish between facet_wrap() and facet_grid(). (section 3.5)
3. What is a geom? (section 3.6)
• What’s the difference between global and local mapping?
4. What is a stat? (section 3.7)
• Describe what the stats "identity" and "count" do in the context of geom_bar().
• Which is the default?
5. What is a position adjustment? (section 3.8)
• Describe what the position adjustments "stack", "identity", "dodge" and "fill" do in the context of geom_bar().
• Which is the default?
• How does position = "jitter" help with overplotting in scatterplots?
6. What is the default coordinate system in ggplot2?

## Questions to ponder (answers in the brain)

1. Distinguish between discrete and continuous data. For each of these aesthetics, discuss whether it is suitable for discrete data, continuous data, both or neither.
• size
• shape
• colour
• linetype
• alpha
• fill

Consider: discrete data can be categorical or non-categorical.

2. With respect to geom_point objects, what is the maximum number of dimensions of our data that could be represented simultaneously within a single ggplot object?
• Consider we have two axes, various aesthetics and facetting at our disposal.
3. Human eyes are positioned horizontally, giving us a wider stereoscopic vision.
• Is that why we prefer to have more columns than rows when arranging subplots in a plot (related to question 6 in section 3.5.1)?
• Perhaps that’s also why computer screens were designed to be wider than they are tall? Why did modern smartphones reverse the trend?

## Interesting discussions

• There are over 30 geoms and 20 stats in ggplot2, and they tell different stories about the same dataset. Check out their official cheatsheet, as well as the from Data to Viz website. The latter features a section about the caveats of commonly used plots, and how one could work around their pitfalls:
• Boxplots summarise a distribution into several numbers: Q1, median, Q3, etc. However, as described here, we lose information about the sample size and underlying data distribution.
• When comparing a pair of observations, the classical barplots with error bars can be misleading. This page describes how outliers and unequal sample sizes between the compared datasets can lead us to make false conclusions.
• Is mean or median better to summarise data? Would the sample size affect your choice?
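
One common workaround for the boxplot caveat is to draw the raw observations on top of the boxes, so that the sample size and the shape of each group stay visible. The sketch below is only an illustration: the choice of the `mpg` dataset (bundled with ggplot2) and of the `class` and `hwy` variables is an assumption made here, not something taken from the meeting.

```r
library(ggplot2)

# Boxplot of highway mileage per car class, with every observation overlaid.
# mpg, class and hwy are illustrative choices, not from the meeting notes.
p_box <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  # Hide the boxplot's own outlier markers; the jittered points show them anyway.
  geom_boxplot(outlier.shape = NA) +
  # Spread points sideways only, so the y values are left untouched.
  geom_jitter(width = 0.2, height = 0, alpha = 0.4)

print(p_box)
```

Classes with only a handful of points stand out immediately, which the boxes alone would hide.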

• Bar charts pose notable limitations for data visualisation. One point mentioned was that the bars have to be anchored at a specific value, usually zero. Stacked bar charts can be difficult to interpret, so try separating the stacks (use position = "dodge"). Visualisation gets even easier if the bars are ordered by height.
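
The two suggestions above could look roughly like this. It is a sketch under assumptions: the `diamonds` and `mpg` datasets and the variable choices are picked here for illustration, and pre-counting with `table()` plus `reorder()` is just one of several ways to order bars by height.

```r
library(ggplot2)

# Dodged bars: one bar per cut/clarity combination, placed side by side
# instead of stacked on top of each other.
p_dodge <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

# Bars ordered by height: count the rows per class first, then reorder the
# classes by descending count before plotting the pre-computed heights.
class_counts <- as.data.frame(table(class = mpg$class))
p_ordered <- ggplot(data = class_counts) +
  geom_col(mapping = aes(x = reorder(class, -Freq), y = Freq)) +
  labs(x = "class", y = "count")

print(p_dodge)
print(p_ordered)
```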

## Random code chunks from the book and meeting

Here is a template for any type of plot with ggplot2. The template emphasises the layered grammar of graphics, which is what “gg” in ggplot2 stands for. Confession: I only realised this after half a year of using ggplot2.

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(
    mapping = aes(<MAPPINGS>),
    stat = <STAT>,
    position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>
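
As a rough illustration, here is the template with every slot filled in. The dataset (`mpg`) and the particular mappings, position and facet are assumptions chosen for this example; in everyday use most of these arguments are left at their defaults.

```r
library(ggplot2)

p <- ggplot(data = mpg) +                       # <DATA>
  geom_point(                                   # <GEOM_FUNCTION>
    mapping = aes(x = displ, y = hwy, colour = drv),
    stat = "identity",                          # the default stat of geom_point()
    position = "jitter"                         # small random offsets against overplotting
  ) +
  coord_cartesian() +                           # <COORDINATE_FUNCTION>, the default written out
  facet_wrap(~ class, nrow = 2)                 # <FACET_FUNCTION>

print(p)
```

Each `+` adds one layer or component, which is the "layered" part of the grammar.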

Every geom has a default stat, and every stat has a default geom. The following code chunks generate the same plot. If desired, one could change the statistical transformation of a particular geom.

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))
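
To illustrate changing the stat of a geom, the sketch below pre-computes the counts and then tells geom_bar() to plot them as they are with stat = "identity". The helper data frame `cut_counts` is a name introduced here for illustration only.

```r
library(ggplot2)

# Default behaviour: geom_bar() counts the rows per cut itself (stat = "count").
p_count <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# Counts computed by hand; the Freq column holds the bar heights.
cut_counts <- as.data.frame(table(cut = diamonds$cut))

# stat = "identity": use the y values exactly as given, no counting.
p_identity <- ggplot(data = cut_counts) +
  geom_bar(mapping = aes(x = cut, y = Freq), stat = "identity")

print(p_count)
print(p_identity)
```

Both plots should show the same bar heights; only the place where the counting happens differs.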

• Conflicts between packages can occur when they export functions with the same name. An example is filter(), which exists in both dplyr and the base stats package: once dplyr is loaded, its filter() masks the one from stats. Avoid loading too many packages at once. Here is a way to use a function without loading the entire package:
dplyr::filter()
stats::filter()
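
A minimal sketch of the `package::function()` form, assuming dplyr is installed but deliberately not attached with library(). It contrasts dplyr::filter() with stats::filter(), the base R function of the same name that gets masked when dplyr is loaded. The `mtcars` dataset and the variable names are illustrative choices.

```r
# dplyr::filter() keeps the rows of a data frame that match a condition.
# dplyr is called through its namespace, so nothing is masked.
six_cyl <- dplyr::filter(mtcars, cyl == 6)
nrow(six_cyl)

# stats::filter() is an unrelated tool: it applies a linear filter to a series.
# Here it computes a centred 3-point moving average; the ends come out as NA.
smoothed <- stats::filter(1:5, rep(1 / 3, 3))
smoothed
```

Writing the namespace explicitly also documents, for the reader, which of the two functions was intended.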