Categorical Distribution - With A Conjugate Prior

In Bayesian statistics, the Dirichlet distribution is the conjugate prior of the categorical distribution (and also of the multinomial distribution). This means that if a data point has a categorical distribution with unknown parameter vector p, and (in standard Bayesian style) we treat this parameter as a random variable and give it a prior distribution defined using a Dirichlet distribution, then the posterior distribution of the parameter, after incorporating the knowledge gained from the observed data, is also a Dirichlet distribution. Intuitively, in such a case, starting from what we know about the parameter prior to observing the data point, we can then update our knowledge based on the data point and end up with a new distribution of the same form as the old one. This means that we can successively update our knowledge of a parameter by incorporating new observations one at a time, without running into mathematical difficulties.
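Before the formal statement below, here is a minimal sketch of this one-at-a-time updating in plain Python (the prior values and observations are made up for illustration). It checks that incorporating observations sequentially yields the same posterior parameters as incorporating them all at once.

# Dirichlet prior over K = 3 categories, written as pseudocounts.
alpha = [1.0, 1.0, 1.0]
observations = [0, 2, 2, 1, 2]   # category indices of N = 5 data points

# Sequential updating: each observation adds 1 to its category's parameter.
posterior_seq = list(alpha)
for x in observations:
    posterior_seq[x] += 1

# Batch updating: add the full count vector c to alpha in one step.
counts = [observations.count(i) for i in range(len(alpha))]
posterior_batch = [a + c for a, c in zip(alpha, counts)]

assert posterior_seq == posterior_batch  # same Dirichlet parameters either way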

Formally, this can be expressed as follows. Given a model

\begin{array}{lclcl}
\boldsymbol\alpha &=& (\alpha_1, \ldots, \alpha_K) &=& \text{concentration hyperparameter} \\
\mathbf{p}\mid\boldsymbol\alpha &=& (p_1, \ldots, p_K) &\sim& \operatorname{Dir}(K, \boldsymbol\alpha) \\
\mathbb{X}\mid\mathbf{p} &=& (x_1, \ldots, x_N) &\sim& \operatorname{Cat}(K,\mathbf{p})
\end{array}

then the following holds:

\begin{array}{lclcl}
\mathbf{c} &=& (c_1, \ldots, c_K) &=& \text{number of occurrences of category }i\text{:}\quad c_i = \sum_{j=1}^N [x_j = i] \\
\mathbf{p} \mid \mathbb{X},\boldsymbol\alpha &\sim& \operatorname{Dir}(K,\mathbf{c}+\boldsymbol\alpha) &=& \operatorname{Dir}(K,c_1+\alpha_1,\ldots,c_K+\alpha_K)
\end{array}

where [x_j = i] is the Iverson bracket, equal to 1 if x_j = i and 0 otherwise.

This relationship is used in Bayesian statistics to estimate the underlying parameter p of a categorical distribution given a collection of N samples. Intuitively, we can view the hyperparameter vector α as pseudocounts, i.e. as representing the number of observations in each category that we have already seen. Then we simply add in the counts for all the new observations (the vector c) in order to derive the posterior distribution.
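To make the pseudocount view concrete, here is a short sketch assuming NumPy is available (the prior and count values are hypothetical). It forms the posterior parameters c + α and draws a few plausible parameter vectors p from the resulting Dirichlet posterior.

import numpy as np

alpha = np.array([2.0, 2.0, 2.0])   # prior pseudocounts (hypothetical values)
c = np.array([3, 1, 6])             # counts per category in N = 10 new observations

posterior_params = alpha + c        # parameters of Dir(K, c + alpha)
samples = np.random.dirichlet(posterior_params, size=4)
print(samples)                      # each row is one plausible parameter vector p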

Further intuition comes from the expected value of the posterior distribution (see the article on the Dirichlet distribution):

\operatorname{E}[p_i \mid \mathbb{X},\boldsymbol\alpha] = \frac{c_i + \alpha_i}{N + \sum_{k=1}^K \alpha_k}

This says that the expected probability of seeing a category i among the various discrete distributions generated by the posterior distribution is simply equal to the proportion of occurrences of that category actually seen in the data, including the pseudocounts in the prior distribution. This makes a great deal of intuitive sense: If, for example, there are three possible categories, and we saw category 1 in our observed data 40% of the time, we would expect on average to see category 1 40% of the time in the posterior distribution as well.
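A quick numeric check of this claim, assuming NumPy (the counts are invented to match the 40% example; the flat prior contributes three extra pseudocounts, so the result is only approximately 0.40):

import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # flat prior: one pseudocount per category
c = np.array([400, 300, 300])       # category 1 seen 40% of the time (N = 1000)

posterior_mean = (c + alpha) / (c.sum() + alpha.sum())
print(posterior_mean)               # approximately [0.4, 0.3, 0.3]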

(Note that this intuition is ignoring the effect of the prior distribution. Furthermore, it's important to keep in mind that the posterior is a distribution over distributions. Remember that the posterior distribution in general tells us what we know about the parameter in question, and in this case the parameter itself is a discrete probability distribution, i.e. the actual categorical distribution that generated our data. For example, if we saw the 3 categories in the ratio 40:5:55 in our observed data, then ignoring the effect of the prior distribution, we would expect the true parameter — i.e. the true, underlying distribution that generated our observed data — to have the average value of (0.40,0.05,0.55), which is indeed what the posterior tells us. However, the true distribution might actually be (0.35,0.07,0.58) or (0.42,0.04,0.54) or various other nearby possibilities. The amount of uncertainty involved here is specified by the variance of the posterior, which is controlled by the total number of observations – the more data we observe, the less our uncertainty about the true parameter.)
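The point about uncertainty can be illustrated with a short sketch, again assuming NumPy (prior and counts are hypothetical): holding the 40:5:55 proportions fixed while scaling up the number of observations makes the sampled parameter vectors cluster ever more tightly around the posterior mean.

import numpy as np

alpha = np.ones(3)                      # flat prior
for scale in (1, 10, 100):
    c = scale * np.array([40, 5, 55])   # same 40:5:55 proportions, more data
    samples = np.random.dirichlet(alpha + c, size=10_000)
    print(scale, samples.std(axis=0))   # spread of plausible p vectors narrows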

(Technically, the prior parameter α_i should actually be seen as representing α_i − 1 prior observations of category i. Then, the updated posterior parameter c_i + α_i represents c_i + α_i − 1 posterior observations. This reflects the fact that a Dirichlet distribution with α_i = 1 for all i has a completely flat shape, essentially a uniform distribution over the simplex of possible values of p. Logically, a flat distribution of this sort represents total ignorance, corresponding to no observations of any sort. However, the mathematical updating of the posterior works fine if we ignore the −1 term and simply think of the α vector as directly representing a set of pseudocounts. Furthermore, doing this avoids the issue of interpreting α_i values less than 1.)
