We went through the first chapter of the book, which is an introduction to data science and its core principles.

Name the six packages that form the core of the **tidyverse**.

What size (in GB) are datasets in **big data**?

Define **visualisation** and **modeling** in the context of data science. What are their respective strengths and weaknesses?

Give three examples of **non-rectangular data**. Do you think these are categorised as **structured** or **unstructured** data?

What is **data**?

What is **data science**? How is it different from **statistics**?

Why do you want to join the book club? What do you aim to achieve from it?

**Exploratory data analysis is a lost art.** People seem conditioned to seek the statistical test that gives them an arbitrary p-value. One should actually explore the data, check for outliers and understand how the data are distributed. Do not naively apply a commonly used statistical test without any preliminary analysis, which is a common habit! Exploring the data is necessary to even begin making sense of it, especially if the data are too large and complex to decipher by eye. A counterargument, however, is that too much exploration could lead to “fishing” for patterns in the data. There has to be a balance, and this depends on a researcher’s integrity.

Unlike data science, **statistics** could be understood without data. A researcher could study statistics by running simulations on the computer and finding the best mathematical way to describe them. Simulations are also invaluable for equations that cannot be solved analytically. Check this out: Data science vs. statistics: two cultures?

**Rectangular data**, as mentioned in the book, probably refers to **tabular or spreadsheet data**, where the order of the columns and rows does not matter. *Images* are “rectangular” in some sense, but unlike tabular data, jumbling up the rows or columns would render an image meaningless.

The idea of what counts as **big data** seems to differ from person to person, depending on coding experience. The book mentioned that 10-100 GB counts as “larger data”, but to some attendees a table with 50,000 rows (likely only a few MB) might already be intimidating.
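
To make the exploratory data analysis point above more concrete, here is a minimal sketch of a preliminary check on a single numeric column before any statistical test is run. It is written in Python rather than R, uses only the standard library, and everything in it is an assumption for illustration: the simulated column, the planted extreme values, the helper name `summarise`, and the 1.5 × IQR rule for flagging outliers. It also reports how much memory a 50,000-row column of 8-byte numbers takes, as a rough check on the size estimate in the notes.

```python
import random
import statistics
from array import array


def summarise(values):
    """Return basic exploratory statistics and IQR-based outlier flags."""
    # Quartiles split the sorted data into four equal-sized groups.
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    # Tukey's rule of thumb: flag points more than 1.5 * IQR beyond the quartiles.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low or v > high]
    return {
        "n": len(values),
        "mean": statistics.fmean(values),
        "median": median,
        "q1": q1,
        "q3": q3,
        "outliers": outliers,
    }


random.seed(1)
# Hypothetical column: 50,000 simulated measurements plus two planted extreme values.
column = [random.gauss(10.0, 2.0) for _ in range(50_000)] + [95.0, -60.0]
summary = summarise(column)

# Memory taken by the column when stored as 8-byte floating-point numbers.
size_mb = array("d", column).itemsize * len(column) / 1e6

print("rows:", summary["n"])
print("mean:", round(summary["mean"], 2), "median:", round(summary["median"], 2))
print("flagged as possible outliers:", len(summary["outliers"]))
print("approximate size of one column (MB):", round(size_mb, 2))
```

The rule also flags a small share of perfectly ordinary points in bell-shaped data, so the flagged values are candidates to look at, not values to delete. That is the same balance the notes describe: explore enough to understand the data, without hunting for a pattern.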

Aim to have rotating chairs, so everyone has a say.

How long should each meeting be? We need to account for the busy schedules of people who work in the lab.

Attendees are encouraged to use *Slack* to mention topics of interest **2 days in advance** of a meeting. The chair could check these beforehand.

Separate channel for exercises on *Slack*?

Aim to pace the sessions to avoid anyone falling behind. Please shout (or type) if you need more time for a chapter.