Categorical Variable

In statistics, a categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values. Categorical variables are often used to represent categorical data. In computer science and some branches of mathematics, categorical variables are referred to as enumerations or enumerated types. Commonly (though not in this article), the word level is used to refer to one of the possible values of a categorical variable.

A categorical variable that can take on exactly two values is termed a binary variable or dummy variable and is typically treated on its own as a special case. As a result, categorical variables are often assumed to contain, or at least potentially contain, three or more values. See the discussion below.

Examples of values that might be represented in a categorical variable:

  • The blood type of a person: A, B, AB or O.
  • The state that a resident of the United States lives in.
  • The political party that a voter in a European country might vote for: Christian Democrat, Social Democrat, Green Party, etc.
  • The type of a rock: igneous, sedimentary or metamorphic.
  • The identity of a particular word (e.g., in a language model): One of V possible choices, for a vocabulary of size V.

For ease in statistical processing, categorical variables may be assigned numeric indices, e.g. 1 through K for a K-way categorical variable (i.e. a variable that can express exactly K possible values). In general, however, the numbers are arbitrary, and have no significance beyond simply providing a convenient label for a particular value. In other words, the values in a categorical variable exist on a nominal scale: they each represent a logically separate concept, and in general cannot be meaningfully ordered or otherwise manipulated as numbers would. Instead, valid operations are equivalence, set membership, and other set-related operations.

As a result, the central tendency of a set of categorical variables is given by its mode; neither the mean nor the median can be defined. As an example, given a set of people, we can consider the set of categorical variables corresponding to their last names. We can consider operations such as equivalence (whether two people have the same last name), set membership (whether a person has a name in a given list), counting (how many people have a given last name), or finding the mode (which name occurs most often). However, we cannot meaningfully compute the "sum" of Smith + Johnson, or ask whether Smith is "less than" or "greater than" Johnson. As a result, we cannot meaningfully ask what the "average name" (the mean) or the "middle-most name" (the median) is in a set of names.

Note that this ignores the concept of alphabetical order, which is a property that is not inherent in the names themselves, but in the way we construct the labels. For example, if we write the names in Cyrillic and consider the Cyrillic ordering of letters, we might get a different result of evaluating "Smith < Johnson" than if we write the names in the standard Latin alphabet; and if we write the names in Chinese characters, we cannot meaningfully evaluate "Smith < Johnson" at all, because no consistent ordering is defined for such characters. However, if we do consider the names as written, e.g., in the Latin alphabet, and define an ordering corresponding to standard alphabetical order, then we have effectively converted them into ordinal variables defined on an ordinal scale.

Categorical random variables are normally described statistically by a categorical distribution, which allows an arbitrary K-way categorical variable to be expressed with separate probabilities specified for each of the K possible outcomes. Multiple independent identically distributed categorical variables are often grouped together using a multinomial distribution, which counts the number of times each possible outcome has been seen. Regression analysis on categorical outcomes is accomplished through multinomial logistic regression, multinomial probit or a related type of discrete choice model.

Categorical variables that have only two possible outcomes (e.g., "yes" vs. "no" or "success" vs. "failure") are known as binary variables (or Bernoulli variables). Because of their importance, these variables are often considered a separate category, with a separate distribution (the Bernoulli distribution) and separate regression models (logistic regression, probit regression, etc.). As a result, the term "categorical variable" is often reserved for cases with 3 or more outcomes, sometimes termed a multi-way variable in opposition to a binary variable.

It is also possible to consider categorical variables where the number of categories is not fixed in advance. As an example, for a categorical variable describing a particular word, we might not know in advance the size of the vocabulary, and we would like to allow for the possibility of encountering words that we haven't already seen. Standard statistical models, such as those involving the categorical distribution and multinomial logistic regression, assume that the number of categories is known in advance, and changing the number of categories on the fly is tricky. In such cases, more advanced techniques must be used. An example is the Dirichlet process, which falls in the realm of nonparametric statistics. In such a case, it is logically assumed that an infinite number of categories exist, but at any one time most of them (in fact, all but a finite number) have never been seen. All formulas are phrased in terms of the number of categories actually seen so far rather than the (infinite) total number of potential categories in existence, and methods are created for incremental updating of statistical distributions, including adding "new" categories.

Read more about Categorical Variable:  Categorical Variables in Regression

Famous quotes containing the words categorical and/or variable:

    We do the same thing to parents that we do to children. We insist that they are some kind of categorical abstraction because they produced a child. They were people before that, and they’re still people in all other areas of their lives. But when it comes to the state of parenthood they are abruptly heir to a whole collection of virtues and feelings that are assigned to them with a fine arbitrary disregard for individuality.
    Leontine Young (20th century)

    There is not so variable a thing in nature as a lady’s head-dress.
    Joseph Addison (1672–1719)