tidyr

class: center, middle, inverse, title-slide

# tidyr
### Haley Jeppson, Sam Tyner

---

class: primary
# What is tidy data?

> "Happy families are all alike; every unhappy family is unhappy in its own way."<br>
*-Leo Tolstoy*

- Resource: follow along `tidyr` vignette

- available as `vignette("tidy-data", package="tidyr")`

- vignette is version of the [tidy data paper](https://www.jstatsoft.org/article/view/v059i10) with updated code

---
class: primary
# Outline

- Different sources of messiness

- Key-Value pairs

- `tidyr`: `spread` and `gather`, `separate`

---
class: primary
# Data storage

There are different ways of encoding the same information:

.pull-left[
**Option 1:**
<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;border-color:#aaa;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#aaa;color:#333;background-color:#fff;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#aaa;color:#fff;background-color:#f38630;}
.tg .tg-s6z2{text-align:center}
.tg .tg-gmsq{background-color:#FCFBE3;font-style:italic;text-align:center}
.tg .tg-lyaj{background-color:#FCFBE3;text-align:center}
.tg .tg-hgcj{font-weight:bold;text-align:center}
</style>
<table class="tg">
  <tr>
    <th class="tg-hgcj">Name</th>
    <th class="tg-hgcj">Treatment A</th>
    <th class="tg-hgcj">Treatment B</th>
  </tr>
  <tr>
    <td class="tg-s6z2">John Smith</td>
    <td class="tg-gmsq">NA</td>
    <td class="tg-s6z2">18</td>
  </tr>
  <tr>
    <td class="tg-s6z2">Jane Doe</td>
    <td class="tg-lyaj">4</td>
    <td class="tg-s6z2">1</td>
  </tr>
  <tr>
    <td class="tg-s6z2">Mary Johnson</td>
    <td class="tg-lyaj">6</td>
    <td class="tg-s6z2">7</td>
  </tr>
</table>

**Option 2:**

Neither #1 nor #2 are "clean" versions of the data: observed information is part of the data structure; some implicit information is assumed

---
class: primary
# Reality vs. Ideal

.pull-left[
**Wide format**

- some variables are spread out across columns.

- typically uses less space to display

- how you would typically choose to present your data

- far less repetition of labels and row elements

![](img/tablewide2.png)

]
.pull-right[

**Long format**

- each variable is a column

- each observation is a row

- is likely not the data's most compact form

]

---
class: primary
# Tidy Data

- Each variable is a column

- Each observation is a row

- Each type of observational unit forms a table

<br/>

---
class: primary
# Sources of Messiness

**Five main ways tables of data tend not to be tidy:**

1. Column headers are values, not variable names.

2. Multiple variables are stored in one column.

3. Variables are stored in both rows and columns.

4. Multiple types of observational units are stored in the same table.

5. A single observational unit is stored in multiple tables.

---
class: primary
# What makes data tidy?

.pull-left[
<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;border-color:#aaa;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#aaa;color:#333;background-color:#fff;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#aaa;color:#fff;background-color:#f38630;}
.tg .tg-baqh{text-align:center;vertical-align:top}
.tg .tg-amwm{font-weight:bold;text-align:center;vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-amwm">Name</th>
    <th class="tg-amwm">Treatment</th>
    <th class="tg-amwm">Measurement</th>
  </tr>
  <tr>
    <td class="tg-baqh">John Smith</td>
    <td class="tg-baqh">A</td>
    <td class="tg-baqh">NA</td>
  </tr>
  <tr>
    <td class="tg-baqh">John Smith</td>
    <td class="tg-baqh">B</td>
    <td class="tg-baqh">18</td>
  </tr>
  <tr>
    <td class="tg-baqh">Jane Doe</td>
    <td class="tg-baqh">A</td>
    <td class="tg-baqh">4</td>
  </tr>
  <tr>
    <td class="tg-baqh">Jane Doe</td>
    <td class="tg-baqh">B</td>
    <td class="tg-baqh">1</td>
  </tr>
  <tr>
    <td class="tg-baqh">Mary Johnson</td>
    <td class="tg-baqh">A</td>
    <td class="tg-baqh">6</td>
  </tr>
  <tr>
    <td class="tg-baqh">Mary Johnson</td>
    <td class="tg-baqh">B</td>
    <td class="tg-baqh">7</td>
  </tr>
</table>

]

.pull-right[

- `treatment` and `patient` uniquely describe a single row in the dataset.

- `treatment` and `patient` are **key variables**,

- `score` is a **measurement variable**

- this makes `treatment-patient` and `score` a **key-value pair**

]

---
class: primary
# Key-value pairs
 
**Key-Value pairs** (KVP) - also *attribute-value*, *field-value*, *name-value*:
abstract data representation that allows a lot of flexibility

One way of telling whether a data set is tidy is to check that all keys for a value are aligned in one row:

---
class: primary
# Key-value pairs

**Keys/Identifiers**:

- Identify a record (must be unique)

- Example: Indices on an random variable

- Fixed by design of experiment (known in advance)

- May be single or composite (may have one or more variables)

**Values/Measures**:

- Collected during the experiment (not known in advance)

- Usually numeric quantities

---
class: primary
# Tidying data plan

Very few functions are needed to tidy data:

**Messy (1)**: `tidyr` functions `gather` and `spread`.

- `gather (data, key, value, ...)`: take multiple columns and collapse into key-value pairs

- `spread (data, key, value, fill = NA)`: spread a key-value pair across multiple columns.

**Messy (2)**: `tidyr` function `separate (data, col, into, sep = " ")`:

- separate one column into multiple columns

**Messy (3)**: `dplyr` - some combination of the functions discussed previously

**Messy (4)**: `dplyr` functionality `join` (and friends) to combine multiple data sets

---
class: primary
# Untidy data

```r
fries <- read_csv("https://raw.githubusercontent.com/heike/rwrks/gh-pages/03-r-format/data/frenchfries.csv")
head(fries)
```

```
## # A tibble: 6 x 9
##    time treatment subject   rep potato buttery grassy rancid painty
##   <int>     <int>   <int> <int>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
## 1     1         1       3     1   2.90   0.      0.     0.     5.50
## 2     1         1       3     2  14.0    0.      0.     1.10   0.  
## 3     1         1      10     1  11.0    6.40    0.     0.     0.  
## 4     1         1      10     2   9.90   5.90    2.90   2.20   0.  
## 5     1         1      15     1   1.20   0.100   0.     1.10   5.10
## 6     1         1      15     2   8.80   3.00    3.60   1.50   2.30
```

---
class: primary
# Why untidy?

This format is not ideal for data analysis

```r
library(ggplot2)

ggplot(fries) + 
  geom_boxplot(aes(x="1_buttery", y=buttery), fill = "cyan4") +
  geom_boxplot(aes(x = "2_grassy", y = grassy), fill = "darkorange2") +
  geom_boxplot(aes(x = "3_painty", y = painty), fill = "darkorchid1") +
  geom_boxplot(aes(x = "4_potato", y = potato), fill = "chartreuse3") +
  geom_boxplot(aes(x = "5_rancid", y = rancid), fill = "deeppink") +
  xlab("variable") + ylab("rating")
```

![](09-r-tidy-data_files/figure-html/unnamed-chunk-3-1.png)

---
class: primary
# `tidyr::gather()`

Need to take data from wide format to long format.

When gathering, you need to specify: 
- the **keys** (identifiers) 
- the **values** (measures)

```r
fries_long <- fries %>% 
  gather(key = variable, value = rating, potato:painty)
head(fries_long)
```

```
## # A tibble: 6 x 6
##    time treatment subject   rep variable rating
##   <int>     <int>   <int> <int> <chr>     <dbl>
## 1     1         1       3     1 potato     2.90
## 2     1         1       3     2 potato    14.0 
## 3     1         1      10     1 potato    11.0 
## 4     1         1      10     2 potato     9.90
## 5     1         1      15     1 potato     1.20
## 6     1         1      15     2 potato     8.80
```

---
class: primary
# Better form

```r
ggplot(fries_long) + 
  geom_boxplot(aes(x = variable, y = rating, fill = variable))
```

![](09-r-tidy-data_files/figure-html/unnamed-chunk-5-1.png)

---
class: primary
# Long to Wide

In certain applications, we may wish to take a long dataset and convert it to a wide dataset (Perhaps for displaying the data in a table).

```r
head(fries_long)
```

---
class: primary
# Spread

We use the `spread` function from `tidyr` to do this:

```r
fries_wide <- fries_long %>% 
  spread(key = variable, value = rating)

head(fries_wide)
```

```
## # A tibble: 6 x 9
##    time treatment subject   rep buttery grassy painty potato rancid
##   <int>     <int>   <int> <int>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1     1         1       3     1   0.      0.     5.50   2.90   0.  
## 2     1         1       3     2   0.      0.     0.    14.0    1.10
## 3     1         1      10     1   6.40    0.     0.    11.0    0.  
## 4     1         1      10     2   5.90    2.90   0.     9.90   2.20
## 5     1         1      15     1   0.100   0.     5.10   1.20   1.10
## 6     1         1      15     2   3.00    3.60   2.30   8.80   1.50
```

We are now back to our original format

---
class: primary
# Spread

We can also use the `spread` function to compare reps

```r
fries_wide <- fries_long %>% 
  spread(key = rep, value = rating)

head(fries_wide)
```

```
## # A tibble: 6 x 6
##    time treatment subject variable   `1`   `2`
##   <int>     <int>   <int> <chr>    <dbl> <dbl>
## 1     1         1       3 buttery   0.    0.  
## 2     1         1       3 grassy    0.    0.  
## 3     1         1       3 painty    5.50  0.  
## 4     1         1       3 potato    2.90 14.0 
## 5     1         1       3 rancid    0.    1.10
## 6     1         1      10 buttery   6.40  5.90
```

---
class: primary
# Compare reps

```r
fries_wide %>%
  ggplot(aes(x = `1`, y = `2`)) + 
  geom_point() +
  facet_wrap(~variable) + 
  geom_abline(colour = "grey50")
```

![](09-r-tidy-data_files/figure-html/unnamed-chunk-9-1.png)

---
class: primary
# separate

```r
df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
df
```

```
##      x
## 1 <NA>
## 2  a.b
## 3  a.d
## 4  b.c
```

```r
df %>% separate(x, into = c("A", "B"))
```

```
##      A    B
## 1 <NA> <NA>
## 2    a    b
## 3    a    d
## 4    b    c
```

---
class: primary
# Your Turn

The Iowa Data Portal is a wealth of information on and about the State of Iowa.

The website 
[Campaign Expenditures](https://data.iowa.gov/Government/Campaign-Expenditures/3adi-mht4/data) provides data on campaign expenditures. The code below reads the data into an R session.

```
url <- "https://data.iowa.gov/api/views/3adi-mht4/rows.csv"
campaign <- readr::read_csv(url)
```

Assess the 'messiness' of the data. List issues that prevent us from working with the data directly. Which of these issues are of type (1) or (2) of messiness?

---
class: primary
# Messiness in data

- `Date` is text, in the format of Month/Day/Year (Messy 2)

- city coordinates are a combination of City name, state, zip code and geographic latitude and longitude. (Messy 2)

- expenditure amount is a textual expression, not a number (Messy different)

no Messy 1? - problems of type Messy 1 are typically hard to detect and often up to interpretation/dependent on the analysis to be done.

---
class: primary
# Your Turn

During the 1870 census data on people's occupation was collected. The data [occupation-1870](../data/occupation-1870.csv) contains state-level aggregates of occupation by gender.

- Use `tidyr` to get the data into a long format.

- Separate the `occupation.gender` type variable into two variables.

- Spread the data such that you can draw scatterplots of values for men against women facetted by occupation.

---
class: primary
# Dates and Times

Dates are deceptively hard to work with in R.

**Example**: 02/05/2012. Is it February 5th, or May 2nd?

Other things are difficult too:

- Time zones
- POSIXct format in base R is challenging

The **lubridate** package helps tackle some of these issues.

---
class: primary
# Basic Lubridate Use

```r
library(lubridate)

now()
today()
now() + hours(4)
today() - days(2)
```

```
## [1] "2018-06-18 08:43:39 CDT"
## [1] "2018-06-18"
## [1] "2018-06-18 12:43:39 CDT"
## [1] "2018-06-16"
```

---
class: primary
# Parsing Dates

```r
ymd("2013-05-14")
mdy("05/14/2013")
dmy("14052013")
ymd_hms("2013:05:14 14:50:30", tz = "America/Chicago")
```

```
## [1] "2013-05-14"
## [1] "2013-05-14"
## [1] "2013-05-14"
## [1] "2013-05-14 14:50:30 CDT"
```

---
class: primary
# Your Turn

The flights dataset contains information on over 300,000 flights that departed from New York City in the year 2013.

```r
flights <- read_csv("http://heike.github.io/rwrks/03-r-format/data/flights.csv")
```

---
class: primary
# Your Turn

1. Using the `flights` data, create a new column Date using lubridate. You will need to paste together the columns year, month, and day in order to do this. See the `paste` function.

2. Use `dplyr` to calculate the average departure delay for each date.

3. Plot the date versus the average departure delay

![](09-r-tidy-data_files/figure-html/unnamed-chunk-14-1.png)

---
class: primary
# Your Turn

- Read in the billboard top 100 music data:

```r
billboard <- read.csv("http://heike.github.io/rwrks/03-r-format/data/billboard.csv")
```
    
- Use `tidyr` to convert this data into a long format.

- Use `ggplot2` to create something like this:

![](09-r-tidy-data_files/figure-html/unnamed-chunk-16-1.png)