5 Summary Statistics

Please load the following packages and data frame for this chapter:

library(tidyverse)
library(magrittr)
url <- "http://www.phonetik.uni-muenchen.de/~jmh/lehre/Rdf"
vdata <- read.table(file.path(url, "vdata.txt"))

When you want to get an overview of some data, it is helpful to compute so-called summary statistics (sometimes called descriptive statistics). These include, among others, the arithmetic mean, median, variance, standard deviation, minimum, and maximum. Here we show how these values can be computed without using functions from the tidyverse. Since R is statistical software, functions for computing summary statistics are built in. The following code snippet presents the most important base R functions for descriptive statistics, applied to the F1 values in vdata:

mean(vdata$F1)             # arithmetic mean

## [1] 407.3

median(vdata$F1)           # median

## [1] 366

var(vdata$F1)              # variance

## [1] 21255

sd(vdata$F1)               # standard deviation

## [1] 145.8

min(vdata$F1)              # minimum

## [1] 0

max(vdata$F1)              # maximum

## [1] 1114

range(vdata$F1)            # minimum & maximum

## [1]    0 1114

quantile(vdata$F1, 0.25)   # first quartile

## 25% 
## 300

quantile(vdata$F1, 0.75)   # third quartile

##   75% 
## 509.8

IQR(vdata$F1)              # interquartile range

## [1] 209.8
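Base R can also report several of these values in a single call. The following is a minimal sketch using summary() on a small made-up vector (the vector nums is an assumption for illustration, not part of vdata); for a numeric vector it returns the minimum, first quartile, median, mean, third quartile, and maximum.

```r
# small illustrative vector (made up for this sketch, not taken from vdata)
nums <- c(12, 6, 24, 3, 17)

# one call that reports min, 1st quartile, median, mean, 3rd quartile, max
s <- summary(nums)
s

# individual entries can be picked out by name
s[["Median"]]
s[["Mean"]]
```

The same call should work on a data frame column, e.g. summary(vdata$F1).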

5.1 Mean & Median

The arithmetic mean is calculated by summing \(n\) numbers and then dividing this sum by \(n\). Here is a very simple example:

nums <- 1:5
s <- sum(nums)
s

## [1] 15

count <- length(nums)
count

## [1] 5

# mean:
s/count

## [1] 3

# for comparison:
mean(nums)

## [1] 3

The median, on the other hand, is the middle number in a sorted sequence of numbers. Let’s reuse the above example (in which the numbers are already in ascending order):

nums

## [1] 1 2 3 4 5

median(nums)

## [1] 3

For an even number of values, the median is the mean of the two middle values, e.g.:

nums <- 1:6
median(nums)

## [1] 3.5

mean(c(3, 4))

## [1] 3.5
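The two cases above (odd and even number of values) can be sketched as a small function. Note that manual_median() is a hypothetical helper written for illustration; in practice you would simply call median().

```r
# manual_median() is a hypothetical helper, not a base R function
manual_median <- function(x) {
  sorted <- sort(x)          # the median is defined on the sorted values
  n <- length(sorted)
  if (n %% 2 == 1) {
    # odd number of values: take the middle one
    sorted[(n + 1) / 2]
  } else {
    # even number of values: mean of the two middle ones
    mean(sorted[c(n / 2, n / 2 + 1)])
  }
}

manual_median(c(5, 1, 4, 2, 3))
## [1] 3

manual_median(c(6, 1, 4, 2, 5, 3))
## [1] 3.5
```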

The median is more robust against outliers than the mean. Outliers are data points that are more extreme than the majority of data points in a data set. Here is another simple example:

nums <- c(1:5, 100)
nums

## [1]   1   2   3   4   5 100

mean(nums)

## [1] 19.17

median(nums)

## [1] 3.5

The number 100 is obviously an outlier in the vector called nums. Because of that, the mean is now much higher than previously, while the median has changed only slightly.

5.2 Variance & Standard Deviation

Variance and standard deviation are related measures for the dispersion of values around their mean. More precisely, the variance is the sum of the squared deviations of the values from their mean, divided by the number of values minus 1, while the standard deviation is the square root of the variance. The following example demonstrates how to compute the variance and standard deviation manually.

nums <- c(12, 6, 24, 3, 17)
# mean
m <- mean(nums)
m

## [1] 12.4

# squared differences
squared_diff <- (nums - m)^2
squared_diff

## [1]   0.16  40.96 134.56  88.36  21.16

# number of values
n <- length(nums)
n

## [1] 5

# sum of the squared differences
s <- sum(squared_diff)
s

## [1] 285.2

# variance
variance <- s / (n - 1)
variance

## [1] 71.3

# using the function var():
var(nums)

## [1] 71.3

To compute the standard deviation from that (which is used far more often in statistics than the variance), we only need to take the square root of the variance:

std_dev <- sqrt(variance)
std_dev

## [1] 8.444

# using the function sd():
sd(nums)

## [1] 8.444
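Written as formulas, the definitions from this section read as follows. The symbols \(\bar{x}\) (mean), \(s^2\) (variance) and \(s\) (standard deviation) are notation assumed here for illustration; \(x_1, \dots, x_n\) are the \(n\) values.

\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2, \qquad
s = \sqrt{s^2}
\]

For the example above, the sum of the squared differences is 285.2 and \(n = 5\), so \(s^2 = 285.2 / 4 = 71.3\) and \(s = \sqrt{71.3} \approx 8.444\), matching the output of var() and sd().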

5.3 Quantiles

A quantile divides data points in such a way that a given proportion of the data points lies below the quantile. Quantile is a hypernym: depending on how many chunks you divide your data points into, you can also use the terms percentile (100 chunks) or quartile (4 chunks). The median is also a quantile because 50% of the data are below the median. In R, the function quantile() computes quantiles. The function takes as arguments the data points (i.e. a numeric vector) and then the proportion of data points that should be below the value to be computed. Important quantiles are the first and third quartile, i.e. the thresholds below which a quarter or three quarters of all data points lie.

quantile(vdata$F1, 0.25)   # first quartile

## 25% 
## 300

quantile(vdata$F1, 0.75)   # third quartile

##   75% 
## 509.8

IQR(vdata$F1)              # interquartile range

## [1] 209.8

The difference between the first and third quartile is called the interquartile range and can be computed with the function IQR().
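As a small illustration of these relationships, the following sketch uses a made-up vector (nums is an assumption, not part of vdata) to compute several quantiles in one call and to check that IQR() matches the difference between the third and first quartile. It relies on the default settings of quantile(); the function offers several calculation methods that can give slightly different values.

```r
# small illustrative vector (made up for this sketch)
nums <- c(12, 6, 24, 3, 17)

# several quantiles at once: pass a vector of proportions
q <- quantile(nums, c(0.25, 0.5, 0.75))
q

## 25% 50% 75% 
##   6  12  17 

# the 50% quantile is the median
median(nums)

## [1] 12

# interquartile range = third quartile - first quartile
unname(q["75%"] - q["25%"])

## [1] 11

IQR(nums)

## [1] 11
```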

5.4 Example of a Boxplot

A boxplot contains much of the descriptive information that we have learned about so far:

Median: the horizontal line within the box is the median.
Box: the box contains the middle 50% of the data points. The lower end of the box is the first quartile (Q1), the upper end is the third quartile (Q3). The box is as big as the interquartile range.
Whiskers: the vertical lines stretching upwards from Q3 and downwards from Q1 to the highest/lowest data point that lies within 1.5 * IQR of the box. This 1.5 * IQR rule is what ggplot2 uses for its boxplots; some other programs calculate the whiskers differently.
Points: outliers, i.e. all data points that are not contained in the box or whiskers.
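The elements listed above can be computed by hand. The following is a minimal sketch of the 1.5 * IQR rule, assuming the quartiles are taken from quantile() with its default settings; the variable names (lower_fence, upper_fence, inside) are made up for illustration, and a plotting function may compute its box slightly differently.

```r
# vector with one obvious outlier
nums <- c(1:5, 100)

# box: first and third quartile, and the interquartile range
q1 <- unname(quantile(nums, 0.25))
q3 <- unname(quantile(nums, 0.75))
iqr <- q3 - q1
c(q1, q3, iqr)

## [1] 2.25 4.75 2.50

# fences: the furthest a whisker is allowed to reach
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
c(lower_fence, upper_fence)

## [1] -1.5  8.5

# whiskers end at the most extreme data points inside the fences
inside <- nums[nums >= lower_fence & nums <= upper_fence]
range(inside)

## [1] 1 5

# data points outside the fences would be drawn as individual points
nums[nums < lower_fence | nums > upper_fence]

## [1] 100
```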

Here you see a boxplot for F1 from the data frame vdata:

Later in this course, you’ll learn how to create this boxplot yourself.