6 Manipulating Data with dplyr (Continuation)

6.1 Grouping & Summarising

Please load the following libraries and data frames for this chapter:

library(tidyverse)
library(magrittr)
url <- "http://www.phonetik.uni-muenchen.de/~jmh/lehre/Rdf"
int <- read.table(file.path(url, "intdauer.txt"))
coronal <- read.table(file.path(url, "coronal.txt"))
vdata <- read.table(file.path(url, "vdata.txt"))

Chapter 5 was concerned with summary statistics regarding the F1 values stored in the data frame vdata. Of course we can calculate these values just as well using the tidyverse syntax. This is actually what the function summarise() from the package dplyr was made for. This function changes that data frame completely because the original observations are summarised to new, descriptive values. The columns of the original data are also usually dropped and new columns containing the calculated values are created. The function takes as arguments the new column names and a function with which the values for the new column shall be calculated:

vdata %>% summarise(arith_mean = mean(F1))
##   arith_mean
## 1      407.3

The output of this pipe is a data frame of just one column called arith_mean and one observation which contains the arithmetic mean of vdata$F1. We can also compute several summary statistics within one call of summarise(), which results in a data frame with more columns:

vdata %>% summarise(arith_mean = mean(F1),
                    std_dev = sd(F1),
                    s = sum(F1),
                    maximum = max(F1),
                    Q1 = quantile(F1, 0.25))
##   arith_mean std_dev       s maximum  Q1
## 1      407.3   145.8 1214532    1114 300

The functions mutate() and summarise() have in common that they create new columns from the old ones. However, while mutate() keeps all original columns and observations, summarise() creates a new data frame with usually less columns and observations than before.

Say, you want to compute the arithmetic mean of F1 for a specific vowel V in the data frame vdata – how would you do that? Using tidyverse, probably like this (for the vowel V == "E"):

vdata %>% 
  filter(V == "E") %>% 
  summarise(arith_mean = mean(F1))
##   arith_mean
## 1      426.2

The mean F1 for “E” is approx. 426 Hz. If you’re interested in the vowel-specific mean F1 values, it is very inefficient to rerun the code above for every single vowel category in the data frame. Instead, you can use the function group_by() which takes as arguments the names of all columns by which you want to group the result of the code. summarise() computes the desired summary statistics then per group. In our example, let’s compute the mean for each vowel:

vdata %>% 
  group_by(V) %>% 
  summarise(arith_mean = mean(F1))
## # A tibble: 7 × 2
##   V     arith_mean
##   <chr>      <dbl>
## 1 %           424.
## 2 A           645.
## 3 E           426.
## 4 I           311.
## 5 O           434.
## 6 U           304.
## 7 Y           302.

This code created two columns: one contains the seven distinct vowel from the original data frame, the second the vowel-specific mean F1 values. Of course you can group the results by more than one column. It is, for instance, probably the case that the mean F1 is not just different for each vowel, but that the tenseness Tense of the vowel also affects F1. That is why we group now by vowel and tenseness and then compute the mean F1:

vdata %>% 
  group_by(V, Tense) %>% 
  summarise(arith_mean = mean(F1))
## `summarise()` has grouped output by 'V'. You can
## override using the `.groups` argument.
## # A tibble: 14 × 3
## # Groups:   V [7]
##    V     Tense arith_mean
##    <chr> <chr>      <dbl>
##  1 %     +           368.
##  2 %     -           479.
##  3 A     +           668.
##  4 A     -           622.
##  5 E     +           363.
##  6 E     -           488.
##  7 I     +           276.
##  8 I     -           346.
##  9 O     +           348.
## 10 O     -           520.
## 11 U     +           259.
## 12 U     -           348.
## 13 Y     +           266.
## 14 Y     -           338.

Now we can see the mean F1 for lax “%”, tense “%” (ignore the weird vowel encoding), lax “A”, tense “A” and so on.

Further Information: summarise() warning

Above you can see a warning that was thrown by the summarise() command. Warnings try to get your attention – don’t ignore them! This specific warning shows that the result of the pipe is a grouped data frame (object class grouped_df) and that V is the grouping variable:

vdata %>% 
  group_by(V, Tense) %>% 
  summarise(arith_mean = mean(F1)) %>% 
  class()
## `summarise()` has grouped output by 'V'. You can
## override using the `.groups` argument.
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

The warning also tells you that you can the change the grouping of the results by using the function’s argument .groups. This argument can take on different values as you can read on the help page for summarise().

In the previous code snippets in which we used group_by() together with summarise() the warning didn’t show up because we grouped by only one variable; thus the grouping is removed automatically in the results.

Importantly, you can only use group_by() on categorical columns, i.e. factors. It makes no sense to use grouping on non-categorical, numeric columns because there are no groups there (rather, every value in a numeric column is unique or at least rare). We, on the other hand, want to calculate summary statistics for categorical groups.

Lastly we want to introduce the function n() and n_distinct(). n() takes no arguments and is used within summarise() and after group_by() to return the number of observations (rows) per group. n_distinct() takes the name of one column as an argument and finds out how many distinct (unique) values of a variable there are per group.

# number of rows for every combination of V and Tense
vdata %>% 
  group_by(V, Tense) %>% 
  summarise(count = n())
## `summarise()` has grouped output by 'V'. You can
## override using the `.groups` argument.
## # A tibble: 14 × 3
## # Groups:   V [7]
##    V     Tense count
##    <chr> <chr> <int>
##  1 %     +       212
##  2 %     -       214
##  3 A     +       214
##  4 A     -       218
##  5 E     +       210
##  6 E     -       215
##  7 I     +       210
##  8 I     -       214
##  9 O     +       214
## 10 O     -       214
## 11 U     +       208
## 12 U     -       215
## 13 Y     +       211
## 14 Y     -       213
# number of unique participants `Vpn` per region and social class
coronal %>% 
  group_by(Region, Socialclass) %>% 
  summarise(count = n_distinct(Vpn))
## `summarise()` has grouped output by 'Region'. You can
## override using the `.groups` argument.
## # A tibble: 9 × 3
## # Groups:   Region [3]
##   Region Socialclass count
##   <chr>  <chr>       <int>
## 1 R1     LM             40
## 2 R1     UM             30
## 3 R1     W              11
## 4 R2     LM             26
## 5 R2     UM             18
## 6 R2     W              36
## 7 R3     LM             22
## 8 R3     UM             31
## 9 R3     W              26

Further Information: Describing functions unambiguously

Since functions from the tidyverse, especially from dplyr, have pretty common names (filter(), summarise(), rename()), they can be masked easily by functions of the same name from other packages. So if a function returns an error message, try re-loading the package of which the function you used is a part, or use the following notation: dplyr::filter().

6.2 Arranging

In your everyday work with data frames it can be useful to arrange the data frame with regard to rows and/or columns. Use arrange() for sorting rows and relocate() for sorting columns. Here we arrange the data frame int in ascending order by duration Dauer:

int %>% arrange(Dauer)
##    Vpn    dB Dauer
## 10  S2 25.60    55
## 5   S1 23.47    67
## 7   S2 30.08    81
## 14  S1 20.88   103
## 9   S1 21.37   116
## 2   S2 32.54   120
## 4   S2 28.38   131
## 13  S1 26.60   144
## 1   S1 24.50   162
## 6   S2 37.82   169
## 8   S1 24.50   192
## 15  S2 26.05   212
## 3   S2 38.02   223
## 12  S1 44.27   232
## 11  S1 40.20   252

arrange() can also sort alphabetically and with regard to several columns:

int %>% arrange(Vpn, Dauer)
##    Vpn    dB Dauer
## 5   S1 23.47    67
## 14  S1 20.88   103
## 9   S1 21.37   116
## 13  S1 26.60   144
## 1   S1 24.50   162
## 8   S1 24.50   192
## 12  S1 44.27   232
## 11  S1 40.20   252
## 10  S2 25.60    55
## 7   S2 30.08    81
## 2   S2 32.54   120
## 4   S2 28.38   131
## 6   S2 37.82   169
## 15  S2 26.05   212
## 3   S2 38.02   223

To put a data frame in descending order, you can use desc() within arrange():

int %>% arrange(Vpn, desc(Dauer))
##    Vpn    dB Dauer
## 11  S1 40.20   252
## 12  S1 44.27   232
## 8   S1 24.50   192
## 1   S1 24.50   162
## 13  S1 26.60   144
## 9   S1 21.37   116
## 14  S1 20.88   103
## 5   S1 23.47    67
## 3   S2 38.02   223
## 15  S2 26.05   212
## 6   S2 37.82   169
## 4   S2 28.38   131
## 2   S2 32.54   120
## 7   S2 30.08    81
## 10  S2 25.60    55

relocate() takes as arguments the names of all columns that are to be relocated. If you submit no further arguments, the columns are put in first place. Otherwise you can use the arguments .before and .after to specify the new location of the columns:

vdata %>% slice(1)
##       X    Y  F1  F2   dur V Tense Cons Rate Subj
## 1 52.99 4.36 313 966 106.9 %     -    P    a   bk
vdata %>% relocate(Subj) %>% slice(1)
##   Subj     X    Y  F1  F2   dur V Tense Cons Rate
## 1   bk 52.99 4.36 313 966 106.9 %     -    P    a
vdata %>% relocate(Subj, Cons) %>% slice(1)
##   Subj Cons     X    Y  F1  F2   dur V Tense Rate
## 1   bk    P 52.99 4.36 313 966 106.9 %     -    a
vdata %>% relocate(where(is.numeric), .after = Subj) %>% slice(1)
##   V Tense Cons Rate Subj     X    Y  F1  F2   dur
## 1 %     -    P    a   bk 52.99 4.36 313 966 106.9
vdata %>% relocate(where(is.character), .before = dur) %>% slice(1)
##       X    Y  F1  F2 V Tense Cons Rate Subj   dur
## 1 52.99 4.36 313 966 %     -    P    a   bk 106.9