9 Joining with dplyr

Please load the following packages and data frame for this chapter:

library(tidyverse)
library(magrittr)
url <- "http://www.phonetik.uni-muenchen.de/~jmh/lehre/Rdf"
vcv <- read.table(file.path(url, "vcvC.txt"))

Perhaps you are familiar with so called joins from your work with relational data bases and/or SQL. The principle of relational data bases is to store all information and values in themed tables. After an experiment, for instance, you might have a table of measured values (formants, fundamental frequency, etc.) and one table with metadata about the participants (age, origin, education, etc.). Additionally it might make sense to have a table with information about the collected material (words, lexical stress, phonemes, stimuli, etc.) and about the tested conditions (speech rate, which interlocutor, etc.). If necessary, one can (temporarily) join two tables via a key, i.e. a column that contains some sort of ID. In dplyr we differentiate between mutating joins and filtering joins.

9.1 Mutating Joins

Similarly to mutate(), mutating joins add columns to a data frame x. Contrary to mutate() though, the added columns come from a different data frame y. In both data frames x and y there must be a column that contains the identifier or key which connects the two tables.

9.1.1 Inner Join

The simplest form of a mutating join is the so called inner join. The function inner_join() takes as arguments the two data frames x and y and the identifier by. The result of an inner join contains all columns of both x and y as well as all rows that have matches in both data frames. Missing values (NA) do not appear in the result of an inner join, so be careful with that.

As an example we will be using a table of fundamental frequency values measured in the speech of ten speakers, as well as a table with metadata concerning the speakers:

measures <- tibble(subject = rep(paste0("s", 1:10), each = 10),
                   F0 = rnorm(100, 120, 15))
measures
## # A tibble: 100 × 2
##    subject    F0
##    <chr>   <dbl>
##  1 s1      103. 
##  2 s1      115. 
##  3 s1      129. 
##  4 s1      146. 
##  5 s1       95.2
##  6 s1      114. 
##  7 s1      114. 
##  8 s1      147. 
##  9 s1      104. 
## 10 s1       88.6
## # ℹ 90 more rows
meta <- tibble(subject = paste0("s", 1:10),
               age = rep(c("old", "young"), each = 5))
meta
## # A tibble: 10 × 2
##    subject age  
##    <chr>   <chr>
##  1 s1      old  
##  2 s2      old  
##  3 s3      old  
##  4 s4      old  
##  5 s5      old  
##  6 s6      young
##  7 s7      young
##  8 s8      young
##  9 s9      young
## 10 s10     young

Both tibbles have a column subject which we will use as key in our inner join:

inner_join(x = measures, y = meta, by = "subject")
## # A tibble: 100 × 3
##    subject    F0 age  
##    <chr>   <dbl> <chr>
##  1 s1      103.  old  
##  2 s1      115.  old  
##  3 s1      129.  old  
##  4 s1      146.  old  
##  5 s1       95.2 old  
##  6 s1      114.  old  
##  7 s1      114.  old  
##  8 s1      147.  old  
##  9 s1      104.  old  
## 10 s1       88.6 old  
## # ℹ 90 more rows

It can happen that the column that provides the identifier has different names in x and y. In this case we can tell the join via by = c("a"="b") that the column a from data frame x shall be matched with the column b from data frame y. This works for all joining functions presented in this chapter.

measures %<>% rename(Vpn = subject)
inner_join(x = measures, y = meta, by = c("Vpn"="subject"))
## # A tibble: 100 × 3
##    Vpn      F0 age  
##    <chr> <dbl> <chr>
##  1 s1    103.  old  
##  2 s1    115.  old  
##  3 s1    129.  old  
##  4 s1    146.  old  
##  5 s1     95.2 old  
##  6 s1    114.  old  
##  7 s1    114.  old  
##  8 s1    147.  old  
##  9 s1    104.  old  
## 10 s1     88.6 old  
## # ℹ 90 more rows
measures %<>% rename(subject = Vpn)

This example has so far always returned all observations from measures and all columns of both data frames. This is because measures and meta both contain information on the same ten speakers. If we have less information in one of the two data frames, the result is suddenly missing rows:

# measures of 20 instead of 10 speakers:
measures <- tibble(subject = rep(paste0("s", 1:20), each = 10),
                   F0 = rnorm(200, 120, 15))
inner_join(x = measures, y = meta, by = "subject")
## # A tibble: 100 × 3
##    subject    F0 age  
##    <chr>   <dbl> <chr>
##  1 s1      154.  old  
##  2 s1      133.  old  
##  3 s1      108.  old  
##  4 s1      127.  old  
##  5 s1      128.  old  
##  6 s1      107.  old  
##  7 s1       93.8 old  
##  8 s1      115.  old  
##  9 s1      103.  old  
## 10 s1      128.  old  
## # ℹ 90 more rows

The result of this inner join has no rows for speakers 11 to 20 because there is no information on these speakers in the data frame meta!

9.1.2 Outer Join

Contrary to the inner join, the outer join keeps rows with missing values. The missing values are shown as NA. The simplest version of an outer join is the full join which returns all rows and all columns from both data frames. The function to do this is full_join() (we point out an interesting part of the result here using slice()):

full_join(x = measures, y = meta, by = "subject") %>% 
  slice(95:105)
## # A tibble: 11 × 3
##    subject    F0 age  
##    <chr>   <dbl> <chr>
##  1 s10     131.  young
##  2 s10     101.  young
##  3 s10     144.  young
##  4 s10     111.  young
##  5 s10     135.  young
##  6 s10     123.  young
##  7 s11     133.  <NA> 
##  8 s11     109.  <NA> 
##  9 s11      96.3 <NA> 
## 10 s11     118.  <NA> 
## 11 s11     110.  <NA>

This result consists of 200 rows (if we do not use slice()) – when we used the inner join above with the exact same two tibbles, the result only consisted of 100 rows. This is because full_join() kept the 100 rows of values for speakers 11 to 20 while inner_join() removed them because there was no matching metadata for these speakers in meta. Thus, the result of the full join contains NA values in the appended column age for speakers 11 to 20.

If you don’t want to keep all rows from both data frames, the functions left_join() and right_join() are at your disposal. The left_join() keeps all rows from data frame x while the right_join() keeps all rows from data frame y. In our example we want to keep all rows from measures and just add the information on the speakers’ age from meta:

left_join(x = measures, y = meta, by = "subject")
## # A tibble: 200 × 3
##    subject    F0 age  
##    <chr>   <dbl> <chr>
##  1 s1      154.  old  
##  2 s1      133.  old  
##  3 s1      108.  old  
##  4 s1      127.  old  
##  5 s1      128.  old  
##  6 s1      107.  old  
##  7 s1       93.8 old  
##  8 s1      115.  old  
##  9 s1      103.  old  
## 10 s1      128.  old  
## # ℹ 190 more rows
right_join(x = meta, y = measures, by = "subject")
## # A tibble: 200 × 3
##    subject age      F0
##    <chr>   <chr> <dbl>
##  1 s1      old   154. 
##  2 s1      old   133. 
##  3 s1      old   108. 
##  4 s1      old   127. 
##  5 s1      old   128. 
##  6 s1      old   107. 
##  7 s1      old    93.8
##  8 s1      old   115. 
##  9 s1      old   103. 
## 10 s1      old   128. 
## # ℹ 190 more rows

9.2 Filtering Joins

The second type of joins in R are the so called filtering joins which do not add columns but instead only return selected rows. There are two functions for this in the tidyverse:

  • semi_join(): returns all observations in data frame x for which there is a match in data frame y
  • anti_join(): returns all observations in data frame x for which there is no match in data frame y

We will demonstrate these functions using the following data frames:

vcv %<>% as_tibble()
vcv
## # A tibble: 810 × 7
##       RT Subject Vowel Lang  Left  Right Context
##    <dbl> <chr>   <chr> <chr> <chr> <chr> <chr>  
##  1  648. S209    a     AE    f     h     f.h    
##  2  617  S209    a     AE    f     sh    f.sh   
##  3  728. S209    a     AE    f     sh    f.sh   
##  4  629  S209    a     AE    f     th    f.th   
##  5  688. S209    a     AE    f     th    f.th   
##  6  602. S209    a     AE    s     h     s.h    
##  7  632. S209    a     AE    s     sh    s.sh   
##  8  574  S209    a     AE    s     th    s.th   
##  9  719  S209    a     AE    s     h     s.h    
## 10  569  S209    a     AE    s     th    s.th   
## # ℹ 800 more rows
vcv_summary <- vcv %>% 
  group_by(Subject, Vowel) %>% 
  summarise(mean_rt = mean(RT)) %>% 
  ungroup() %>%
  slice_max(mean_rt, n = 5)
## `summarise()` has grouped output by 'Subject'. You can
## override using the `.groups` argument.
vcv_summary
## # A tibble: 5 × 3
##   Subject Vowel mean_rt
##   <chr>   <chr>   <dbl>
## 1 S502    i       1173.
## 2 S502    u       1076.
## 3 S502    a       1000.
## 4 S508    u        816.
## 5 S508    i        780.

The data frame vcv_summary contains the five highest mean reaction times and the corresponding speaker and vowel. If we want to find out from which observations in vcv these mean values were calculated, we can use the semi join. More precisely, we want all rows from vcv returned for which there is a match in vcv_summary with respect to the columns Subject and Vowel.

semi_join(x = vcv, y = vcv_summary, by = c("Subject", "Vowel"))
## # A tibble: 75 × 7
##       RT Subject Vowel Lang  Left  Right Context
##    <dbl> <chr>   <chr> <chr> <chr> <chr> <chr>  
##  1 1120. S502    a     D     f     h     f.h    
##  2 1066. S502    a     D     f     sh    f.sh   
##  3  848  S502    a     D     f     sh    f.sh   
##  4 1148. S502    a     D     f     th    f.th   
##  5 1130  S502    a     D     f     th    f.th   
##  6  938  S502    a     D     s     h     s.h    
##  7 1124. S502    a     D     s     sh    s.sh   
##  8  981  S502    a     D     s     th    s.th   
##  9  774  S502    a     D     s     h     s.h    
## 10 1104  S502    a     D     s     th    s.th   
## # ℹ 65 more rows

This result contains all observations from which the average reaction times in vcv_summary were computed. We can underline this by looking at the unique combinations of Subject and Vowel in the result of the semi join – they are the same five combinations as in vcv_summary.

semi_join(x = vcv, y = vcv_summary, by = c("Subject", "Vowel")) %>% 
  select(Subject, Vowel) %>% 
  unique()
## # A tibble: 5 × 2
##   Subject Vowel
##   <chr>   <chr>
## 1 S502    a    
## 2 S502    i    
## 3 S502    u    
## 4 S508    i    
## 5 S508    u

With the anti join, on the other hand, we get all observations from which the mean reaction times were not computed, or in other words: all observations from vcv for which there is no match in vcv_summary with regard to Subject and Vowel.

anti_join(x = vcv, y = vcv_summary, by = c("Subject", "Vowel"))
## # A tibble: 735 × 7
##       RT Subject Vowel Lang  Left  Right Context
##    <dbl> <chr>   <chr> <chr> <chr> <chr> <chr>  
##  1  648. S209    a     AE    f     h     f.h    
##  2  617  S209    a     AE    f     sh    f.sh   
##  3  728. S209    a     AE    f     sh    f.sh   
##  4  629  S209    a     AE    f     th    f.th   
##  5  688. S209    a     AE    f     th    f.th   
##  6  602. S209    a     AE    s     h     s.h    
##  7  632. S209    a     AE    s     sh    s.sh   
##  8  574  S209    a     AE    s     th    s.th   
##  9  719  S209    a     AE    s     h     s.h    
## 10  569  S209    a     AE    s     th    s.th   
## # ℹ 725 more rows

Anti joins can be helpful in searching for errors in your own data. Let’s reuse the example from above, measures with values for 20 speakers and meta with metadata for these speakers. If we apply the anti join here, we will immediately see for which rows in measures there is no match in meta, i.e. for which speakers we are missing metadata.

anti_join(x = measures, y = meta, by = "subject")
## # A tibble: 100 × 2
##    subject    F0
##    <chr>   <dbl>
##  1 s11     133. 
##  2 s11     109. 
##  3 s11      96.3
##  4 s11     118. 
##  5 s11     110. 
##  6 s11     127. 
##  7 s11     120. 
##  8 s11      97.4
##  9 s11     119. 
## 10 s11     101. 
## # ℹ 90 more rows

The result is the opposite of the inner join: Here we receive the 100 observations from speakers 11 to 20 for whom there is no metadata.

Further Information: dplyr

The joins were the last functions from the tidyverse which we introduce in this module. An overview of all functions presented here and many more can be found in this Cheatsheet.