9 Joining with dplyr
Please load the following packages and data frame for this chapter:
library(tidyverse)
library(magrittr)
url <- "http://www.phonetik.uni-muenchen.de/~jmh/lehre/Rdf"
vcv <- read.table(file.path(url, "vcvC.txt"))
Perhaps you are familiar with so called joins from your work with relational data bases and/or SQL. The principle of relational data bases is to store all information and values in themed tables. After an experiment, for instance, you might have a table of measured values (formants, fundamental frequency, etc.) and one table with metadata about the participants (age, origin, education, etc.). Additionally it might make sense to have a table with information about the collected material (words, lexical stress, phonemes, stimuli, etc.) and about the tested conditions (speech rate, which interlocutor, etc.). If necessary, one can (temporarily) join two tables via a key, i.e. a column that contains some sort of ID. In dplyr
we differentiate between mutating joins and filtering joins.
9.1 Mutating Joins
Similarly to mutate()
, mutating joins add columns to a data frame x
. Contrary to mutate()
though, the added columns come from a different data frame y
. In both data frames x
and y
there must be a column that contains the identifier or key which connects the two tables.
9.1.1 Inner Join
The simplest form of a mutating join is the so called inner join. The function inner_join()
takes as arguments the two data frames x
and y
and the identifier by
. The result of an inner join contains all columns of both x
and y
as well as all rows that have matches in both data frames. Missing values (NA) do not appear in the result of an inner join, so be careful with that.
As an example we will be using a table of fundamental frequency values measured in the speech of ten speakers, as well as a table with metadata concerning the speakers:
## # A tibble: 100 × 2
## subject F0
## <chr> <dbl>
## 1 s1 103.
## 2 s1 115.
## 3 s1 129.
## 4 s1 146.
## 5 s1 95.2
## 6 s1 114.
## 7 s1 114.
## 8 s1 147.
## 9 s1 104.
## 10 s1 88.6
## # ℹ 90 more rows
## # A tibble: 10 × 2
## subject age
## <chr> <chr>
## 1 s1 old
## 2 s2 old
## 3 s3 old
## 4 s4 old
## 5 s5 old
## 6 s6 young
## 7 s7 young
## 8 s8 young
## 9 s9 young
## 10 s10 young
Both tibbles have a column subject
which we will use as key in our inner join:
## # A tibble: 100 × 3
## subject F0 age
## <chr> <dbl> <chr>
## 1 s1 103. old
## 2 s1 115. old
## 3 s1 129. old
## 4 s1 146. old
## 5 s1 95.2 old
## 6 s1 114. old
## 7 s1 114. old
## 8 s1 147. old
## 9 s1 104. old
## 10 s1 88.6 old
## # ℹ 90 more rows
It can happen that the column that provides the identifier has different names in x
and y
. In this case we can tell the join via by = c("a"="b")
that the column a
from data frame x
shall be matched with the column b
from data frame y
. This works for all joining functions presented in this chapter.
## # A tibble: 100 × 3
## Vpn F0 age
## <chr> <dbl> <chr>
## 1 s1 103. old
## 2 s1 115. old
## 3 s1 129. old
## 4 s1 146. old
## 5 s1 95.2 old
## 6 s1 114. old
## 7 s1 114. old
## 8 s1 147. old
## 9 s1 104. old
## 10 s1 88.6 old
## # ℹ 90 more rows
This example has so far always returned all observations from measures
and all columns of both data frames. This is because measures
and meta
both contain information on the same ten speakers. If we have less information in one of the two data frames, the result is suddenly missing rows:
# measures of 20 instead of 10 speakers:
measures <- tibble(subject = rep(paste0("s", 1:20), each = 10),
F0 = rnorm(200, 120, 15))
inner_join(x = measures, y = meta, by = "subject")
## # A tibble: 100 × 3
## subject F0 age
## <chr> <dbl> <chr>
## 1 s1 154. old
## 2 s1 133. old
## 3 s1 108. old
## 4 s1 127. old
## 5 s1 128. old
## 6 s1 107. old
## 7 s1 93.8 old
## 8 s1 115. old
## 9 s1 103. old
## 10 s1 128. old
## # ℹ 90 more rows
The result of this inner join has no rows for speakers 11 to 20 because there is no information on these speakers in the data frame meta
!
9.1.2 Outer Join
Contrary to the inner join, the outer join keeps rows with missing values. The missing values are shown as NA. The simplest version of an outer join is the full join which returns all rows and all columns from both data frames. The function to do this is full_join()
(we point out an interesting part of the result here using slice()
):
## # A tibble: 11 × 3
## subject F0 age
## <chr> <dbl> <chr>
## 1 s10 131. young
## 2 s10 101. young
## 3 s10 144. young
## 4 s10 111. young
## 5 s10 135. young
## 6 s10 123. young
## 7 s11 133. <NA>
## 8 s11 109. <NA>
## 9 s11 96.3 <NA>
## 10 s11 118. <NA>
## 11 s11 110. <NA>
This result consists of 200 rows (if we do not use slice()
) – when we used the inner join above with the exact same two tibbles, the result only consisted of 100 rows. This is because full_join()
kept the 100 rows of values for speakers 11 to 20 while inner_join()
removed them because there was no matching metadata for these speakers in meta
. Thus, the result of the full join contains NA values in the appended column age
for speakers 11 to 20.
If you don’t want to keep all rows from both data frames, the functions left_join()
and right_join()
are at your disposal. The left_join()
keeps all rows from data frame x
while the right_join()
keeps all rows from data frame y
. In our example we want to keep all rows from measures
and just add the information on the speakers’ age from meta
:
## # A tibble: 200 × 3
## subject F0 age
## <chr> <dbl> <chr>
## 1 s1 154. old
## 2 s1 133. old
## 3 s1 108. old
## 4 s1 127. old
## 5 s1 128. old
## 6 s1 107. old
## 7 s1 93.8 old
## 8 s1 115. old
## 9 s1 103. old
## 10 s1 128. old
## # ℹ 190 more rows
## # A tibble: 200 × 3
## subject age F0
## <chr> <chr> <dbl>
## 1 s1 old 154.
## 2 s1 old 133.
## 3 s1 old 108.
## 4 s1 old 127.
## 5 s1 old 128.
## 6 s1 old 107.
## 7 s1 old 93.8
## 8 s1 old 115.
## 9 s1 old 103.
## 10 s1 old 128.
## # ℹ 190 more rows
9.2 Filtering Joins
The second type of joins in R are the so called filtering joins which do not add columns but instead only return selected rows. There are two functions for this in the tidyverse:
semi_join()
: returns all observations in data framex
for which there is a match in data framey
anti_join()
: returns all observations in data framex
for which there is no match in data framey
We will demonstrate these functions using the following data frames:
## # A tibble: 810 × 7
## RT Subject Vowel Lang Left Right Context
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 648. S209 a AE f h f.h
## 2 617 S209 a AE f sh f.sh
## 3 728. S209 a AE f sh f.sh
## 4 629 S209 a AE f th f.th
## 5 688. S209 a AE f th f.th
## 6 602. S209 a AE s h s.h
## 7 632. S209 a AE s sh s.sh
## 8 574 S209 a AE s th s.th
## 9 719 S209 a AE s h s.h
## 10 569 S209 a AE s th s.th
## # ℹ 800 more rows
vcv_summary <- vcv %>%
group_by(Subject, Vowel) %>%
summarise(mean_rt = mean(RT)) %>%
ungroup() %>%
slice_max(mean_rt, n = 5)
## `summarise()` has grouped output by 'Subject'. You can
## override using the `.groups` argument.
## # A tibble: 5 × 3
## Subject Vowel mean_rt
## <chr> <chr> <dbl>
## 1 S502 i 1173.
## 2 S502 u 1076.
## 3 S502 a 1000.
## 4 S508 u 816.
## 5 S508 i 780.
The data frame vcv_summary
contains the five highest mean reaction times and the corresponding speaker and vowel. If we want to find out from which observations in vcv
these mean values were calculated, we can use the semi join. More precisely, we want all rows from vcv
returned for which there is a match in vcv_summary
with respect to the columns Subject
and Vowel
.
## # A tibble: 75 × 7
## RT Subject Vowel Lang Left Right Context
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1120. S502 a D f h f.h
## 2 1066. S502 a D f sh f.sh
## 3 848 S502 a D f sh f.sh
## 4 1148. S502 a D f th f.th
## 5 1130 S502 a D f th f.th
## 6 938 S502 a D s h s.h
## 7 1124. S502 a D s sh s.sh
## 8 981 S502 a D s th s.th
## 9 774 S502 a D s h s.h
## 10 1104 S502 a D s th s.th
## # ℹ 65 more rows
This result contains all observations from which the average reaction times in vcv_summary
were computed. We can underline this by looking at the unique combinations of Subject
and Vowel
in the result of the semi join – they are the same five combinations as in vcv_summary
.
semi_join(x = vcv, y = vcv_summary, by = c("Subject", "Vowel")) %>%
select(Subject, Vowel) %>%
unique()
## # A tibble: 5 × 2
## Subject Vowel
## <chr> <chr>
## 1 S502 a
## 2 S502 i
## 3 S502 u
## 4 S508 i
## 5 S508 u
With the anti join, on the other hand, we get all observations from which the mean reaction times were not computed, or in other words: all observations from vcv
for which there is no match in vcv_summary
with regard to Subject
and Vowel
.
## # A tibble: 735 × 7
## RT Subject Vowel Lang Left Right Context
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 648. S209 a AE f h f.h
## 2 617 S209 a AE f sh f.sh
## 3 728. S209 a AE f sh f.sh
## 4 629 S209 a AE f th f.th
## 5 688. S209 a AE f th f.th
## 6 602. S209 a AE s h s.h
## 7 632. S209 a AE s sh s.sh
## 8 574 S209 a AE s th s.th
## 9 719 S209 a AE s h s.h
## 10 569 S209 a AE s th s.th
## # ℹ 725 more rows
Anti joins can be helpful in searching for errors in your own data. Let’s reuse the example from above, measures
with values for 20 speakers and meta
with metadata for these speakers. If we apply the anti join here, we will immediately see for which rows in measures
there is no match in meta
, i.e. for which speakers we are missing metadata.
## # A tibble: 100 × 2
## subject F0
## <chr> <dbl>
## 1 s11 133.
## 2 s11 109.
## 3 s11 96.3
## 4 s11 118.
## 5 s11 110.
## 6 s11 127.
## 7 s11 120.
## 8 s11 97.4
## 9 s11 119.
## 10 s11 101.
## # ℹ 90 more rows
The result is the opposite of the inner join: Here we receive the 100 observations from speakers 11 to 20 for whom there is no metadata.
Further Information: dplyr
The joins were the last functions from the tidyverse which we introduce in this module. An overview of all functions presented here and many more can be found in this Cheatsheet.