In statistics, a data set consists of items (objects organized in rows) and one or more variables (organized in columns). A variable is the operationalized characteristic of an item, for example ‘LENGTH’ when the length of subjects is measured in words. Each variable has a set of collected values or levels, the data. For example, for the sentence ‘The man went to work’ the variable LENGTH has a value of ‘2’, as the subject ‘The man’ consists of two words (Gries 2009a: 174).
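The row-and-column layout can be illustrated with a minimal sketch. Python is used here only for illustration; the second sentence and the helper name `subject_length` are assumptions that do not come from the cited sources.

```python
# Each dictionary is one item (a row); its keys are the variables (columns).
# The first sentence is the example from the text; the second is made up.
items = [
    {"SENTENCE": "The man went to work", "SUBJECT": "The man"},
    {"SENTENCE": "She left", "SUBJECT": "She"},
]


def subject_length(subject: str) -> int:
    """Return the length of a subject in words (hypothetical helper)."""
    return len(subject.split())


# Add the variable LENGTH to every item.
for item in items:
    item["LENGTH"] = subject_length(item["SUBJECT"])

# The collected values of LENGTH are the data for this variable.
lengths = [item["LENGTH"] for item in items]
print(lengths)  # [2, 1]
```

Counting whitespace-separated tokens is a deliberately simple operationalization of LENGTH; a real study would have to state how it treats clitics, compounds, and similar cases.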
Variables can be distinguished in two ways: first, by the role they play in an analysis (dependent or independent variable), and second, by their information value (level of measurement) (Gries 2009a: 174).
- The dependent variable is the one whose behavior or distribution is studied; it is expected to change whenever the independent variable is altered. In other words, the independent variable is assumed to cause the change in the dependent variable and therefore to correlate with how the dependent variable changes (Gries 2009a: 174).
- Variables can be divided into classes according to their information value or descriptive properties. The authors of the books used for writing this companion website each distinguish four classes: nominal/categorical, ordinal, interval and ratio variables. For the following explanations, compare Gries (2009a: 174f.) and Johnson (2008: 4f.).
- Nominal/categorical variables have values that are labels or names. For example, if the two NPs ‘the book’ and ‘a table’ are investigated with respect to the variable DEFINITENESS, then ‘the book’ can be categorized as ‘definite’ and ‘a table’ as ‘indefinite’. The fact that the two NPs have different labels means that they belong to different categories. However, the categories have no meaningful order on any scale, even if ‘definite’ were labeled ‘1’ and ‘indefinite’ were labeled ‘2’. The two numbers would still only represent two different levels of the variable DEFINITENESS.
- Ordinal variables have values that can be put in an order or sequence. Unlike the values of nominal/categorical variables, these values allow the items to be ranked, for example the values ‘pronominal’, ‘simple lexical’, ‘non-clausally modified lexical’ and ‘clausally modified lexical’ for the variable ‘Syntactic Complexity of NPs’. Alternatively, the values could be numbered from 1 to 4, with higher numbers reflecting higher degrees of complexity. Still, there is no way to precisely quantify the differences between the values on a measurable scale. School grades are another well-known example of an ordinal variable.
- Interval variables have values that are measured on a scale without a true zero value. Unlike ordinal variables, they do contain information about the size of the difference between values, but because the zero point on the interval scale is arbitrary, ratios between numbers on the scale are not meaningful. Well-known examples of interval data are the Celsius (centigrade) and Fahrenheit temperature scales.
- More often you will deal with ratio variables. These are measured on a scale that does have an absolute, non-arbitrary zero point, so that, as the name suggests, ratios of the collected values are meaningful. For instance, an NP with the value ‘four’ for the variable ‘syllabic length’ is twice as long as one with the value ‘two’.
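The four classes differ in which comparisons are meaningful. The following Python sketch illustrates this with the examples used above; the variable names, the temperature values, and the helper `celsius_to_fahrenheit` are illustrative assumptions, not taken from Gries or Johnson.

```python
# Nominal: values are labels, so only equality can be tested.
definiteness = {"the book": "definite", "a table": "indefinite"}
same_category = definiteness["the book"] == definiteness["a table"]

# Ordinal: values can be ranked, but the distance between ranks is undefined.
complexity_levels = [
    "pronominal",
    "simple lexical",
    "non-clausally modified lexical",
    "clausally modified lexical",
]
rank = {level: position + 1 for position, level in enumerate(complexity_levels)}
more_complex = rank["clausally modified lexical"] > rank["pronominal"]


# Interval: differences are meaningful, ratios are not.
def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


# Equal differences in Celsius remain equal differences in Fahrenheit ...
equal_steps = (
    celsius_to_fahrenheit(30) - celsius_to_fahrenheit(20)
    == celsius_to_fahrenheit(20) - celsius_to_fahrenheit(10)
)
# ... but the ratio changes with the scale, so "twice as warm" is meaningless.
ratio_celsius = 20 / 10
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

# Ratio: the zero point is absolute, so ratios can be interpreted directly.
syllable_ratio = 4 / 2

print(same_category, more_complex, equal_steps)
print(ratio_celsius, ratio_fahrenheit, syllable_ratio)
```

The same two temperatures give a ratio of 2.0 in Celsius but 1.36 in Fahrenheit, whereas four syllables are twice as many as two regardless of how the count is recorded.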
Depending on the research question, a data set can contain one, two, or more than two variables. Univariate (one-dimensional) data sets summarize the distribution of one variable, whereas bivariate (two-dimensional) and multivariate (multi-dimensional) data sets characterize the relation between two variables or among more than two variables, respectively. In the following chapters, functions for visualizing data sets are classified and discussed in these three categories.
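The difference between a univariate and a bivariate view can be sketched with simple frequency counts. The example is in Python and uses a small invented data set; it tabulates the data as text rather than plotting it, and it is not one of the functions discussed in the following chapters.

```python
from collections import Counter

# Hypothetical data set: each tuple is one NP (item) described by two
# variables, DEFINITENESS (nominal) and LENGTH in words (ratio).
data = [
    ("definite", 2),
    ("indefinite", 2),
    ("definite", 3),
    ("definite", 2),
]

# Univariate view: the distribution of DEFINITENESS on its own.
definiteness_counts = Counter(definiteness for definiteness, _ in data)

# Bivariate view: how often each combination of the two variables occurs.
joint_counts = Counter(data)

print(definiteness_counts["definite"], definiteness_counts["indefinite"])  # 3 1
print(joint_counts[("definite", 2)])  # 2
```

A multivariate view would extend each tuple by further variables and count combinations in the same way.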