4.1.2 Measures of central tendency and measures of dispersion


4.1.2.1 Boxplot


For summarizing numerical data, five numbers are particularly important: the two extremes (minimum and maximum), the median, and the lower and upper quartiles. It is advisable to report the median and quartiles rather than the mean and standard deviation, because the latter are distorted more strongly by skewed or heavy-tailed distributions. What makes the boxplot (also called a box-and-whisker plot) so useful is that it shows all five of these numbers. It therefore takes only a few seconds to read off the central tendency and dispersion of one's data. In addition, a boxplot flags potential outliers. The important point, which the boxplot exemplifies, is that a measure of central tendency should never be reported without a measure of dispersion, because the measure of central tendency alone is not very representative (Gries 2009a: 201f.). For further information on the different measures of central tendency and dispersion and their functions, see Baayen (2008: 21ff.), Gries (2009a: 201) or Johnson (2008: 24ff.).
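As a minimal sketch of why the median and quartiles are preferred, the following lines compute both kinds of summary for one small sample. The vector x is hypothetical and not taken from the sources cited above. Note that fivenum returns Tukey's hinges, which can differ slightly from the quartiles that quantile and IQR compute.

```r
# Hypothetical, right-skewed sample with one extreme value (not from the source).
x <- c(2, 3, 3, 4, 5, 5, 6, 7, 8, 40)

# Tukey's five numbers: minimum, lower hinge, median, upper hinge, maximum.
five <- fivenum(x)
print(five)

# Robust measures: median (central tendency) and interquartile range (dispersion).
med <- median(x)
iqr <- IQR(x)

# Mean and standard deviation for comparison; both are pulled upwards
# by the single extreme value.
avg <- mean(x)
std <- sd(x)

print(c(median = med, IQR = iqr, mean = avg, sd = std))
```

In this sample the mean lies well above the median, which illustrates how a single extreme value can shift the mean while leaving the median untouched.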

In its simplest form, the function boxplot requires just one vector as an argument. The call below passes two vectors, one per town, and labels them with the argument names. It is advisable to also add notch=T (T meaning TRUE) in order to get notches, whose benefits are explained below.


> boxplot(town1, town2, notch=T, names=c("Town 1", "Town 2"))¶
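The temperature data behind town1 and town2 are not reproduced here. To try the call, one could substitute two made-up vectors, as in the following sketch. All values are hypothetical and were chosen only so that one town varies much more than the other.

```r
# Hypothetical monthly temperatures (not the data used for Fig. 8).
town1 <- c(-5, -2, 3, 9, 15, 20, 23, 22, 17, 10, 4, -3)   # widely spread
town2 <- c(9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15) # narrowly spread

# Simplest form: a single vector.
boxplot(town1)

# Two vectors side by side, with notches and group labels.
# boxplot() invisibly returns the statistics it has drawn.
bp <- boxplot(town1, town2, notch = TRUE, names = c("Town 1", "Town 2"))

# Rows of bp$stats: lower whisker, lower hinge, median, upper hinge, upper whisker.
print(bp$stats)
```

Writing TRUE instead of T is slightly safer, because T is an ordinary variable that can be overwritten.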


Every feature of a boxplot describes a certain aspect of the data (cf. Baayen 2008: 30, Gries 2009a: 204f.): The bold horizontal line represents the median, one of the measures of central tendency (cf. light green line in Fig. 8). The regular horizontal lines forming the upper and lower boundaries of the box mark the first and the third quartile, so that the height of the box corresponds to the interquartile range (cf. purple lines). Quartiles are the three values obtained by dividing a sorted data set into four equal-sized parts. Each quartile represents the boundary between two of the four subsets. The second quartile (the 50% quantile) is the median (Baayen 2008: 28). The whiskers, the dashed vertical lines with the horizontal limits, extend to the largest and smallest values that are not more than 1.5 interquartile ranges away from the box (cf. dark green lines). Values beyond the range of the whiskers, each of which is represented by an individual dot, are potential outliers. The notches on the left and right sides of the boxes extend to ±1.58*IQR/sqrt(n) around the median, IQR meaning the interquartile range, sqrt standing for square root, and n representing the number of data points. If the notches of two boxplots overlap, this suggests that the medians are not significantly different from each other; if they do not overlap, the medians probably differ. In either case, a test is still needed to verify the assumption (cf. orange lines).
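The features described above can be recomputed by hand, as in the following sketch. The vector x is hypothetical. The sketch follows the conventions of R's boxplot.stats, which takes the box boundaries from fivenum (hinges), so the interquartile range used here may differ slightly from the result of IQR.

```r
# Hypothetical sample with one extreme value (not from the source).
x <- c(2, 3, 3, 4, 5, 5, 6, 7, 8, 40)
n <- length(x)

# Box: lower hinge, median, upper hinge (positions 2 to 4 of fivenum).
five  <- fivenum(x)
lower <- five[2]
med   <- five[3]
upper <- five[4]
iqr   <- upper - lower

# Whiskers: most extreme observations within 1.5 * IQR of the box.
inside        <- x >= lower - 1.5 * iqr & x <= upper + 1.5 * iqr
whisker_range <- range(x[inside])

# Potential outliers: every value beyond the whiskers.
outliers <- x[!inside]

# Notch limits: median +/- 1.58 * IQR / sqrt(n).
notch <- med + c(-1.58, 1.58) * iqr / sqrt(n)

# Cross-check against the values R itself uses for drawing.
bs <- boxplot.stats(x)
print(bs$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$conf)   # notch limits
print(bs$out)    # potential outliers
```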

Looking at Fig. 8, it is apparent that the medians of both towns are quite close to each other and that the notches overlap. This means that the central tendencies of the two towns are very similar. You can also see that Town 1 exhibits much more heterogeneous values than Town 2, because the box and the whiskers of Town 1 cover a larger range of the y-axis and its notches are much wider. This represents a larger dispersion of the data values for Town 1 (Gries 2009a: 204f.).
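The visual reading of the two boxplots can be backed up numerically, as in the following sketch. The two vectors are hypothetical stand-ins for the temperature data, and the Wilcoxon rank-sum test is only one possible choice for the follow-up test, picked here as an assumption; the sources cited above do not prescribe it.

```r
# Hypothetical monthly temperatures (not the data used for Fig. 8).
town1 <- c(-5, -2, 3, 9, 15, 20, 23, 22, 17, 10, 4, -3)
town2 <- c(9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15)

# Notch limits (lower, upper) as computed for the boxplot.
notch1 <- boxplot.stats(town1)$conf
notch2 <- boxplot.stats(town2)$conf

# Two intervals overlap if each one starts before the other one ends.
notches_overlap <- notch1[1] <= notch2[2] && notch2[1] <= notch1[2]
print(notches_overlap)

# Dispersion: compare the interquartile ranges of the two towns.
spread <- c(town1 = IQR(town1), town2 = IQR(town2))
print(spread)

# One possible follow-up test (an assumption, see above).
# exact = FALSE requests the normal approximation, since the data contain ties.
test <- wilcox.test(town1, town2, exact = FALSE)
print(test$p.value)
```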


Fig. 8: Boxplot of the temperatures of the two towns (modified after Gries 2009b: 118).
