Latent Dirichlet Allocation - Inference

Inference

See also: Dirichlet-multinomial distribution

Learning the various distributions (the set of topics, their associated word probabilities, the topic of each word, and the particular topic mixture of each document) is a problem of Bayesian inference. The original paper used a variational Bayes approximation of the posterior distribution; alternative inference techniques use Gibbs sampling and expectation propagation.

Following is the derivation of the equations for collapsed Gibbs sampling, which means the θs and φs will be integrated out. For simplicity, in this derivation the documents are all assumed to have the same length N. The derivation is equally valid if the document lengths vary.

According to the model, the total probability of the model is:

 P(\boldsymbol{W}, \boldsymbol{Z}, \boldsymbol{\theta},
\boldsymbol{\varphi};\alpha,\beta) = \prod_{i=1}^K
P(\varphi_i;\beta) \prod_{j=1}^M P(\theta_j;\alpha) \prod_{t=1}^N
P(Z_{j,t}|\theta_j)P(W_{j,t}|\varphi_{Z_{j,t}}) ,

where the bold-font variables denote the vector versions of the variables. First of all, θ and φ need to be integrated out.
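Before carrying out the integration, the joint distribution above can be made concrete with a minimal sketch of the generative process it encodes (assuming NumPy; the dimensions, hyperparameter values, and variable names below are illustrative, not part of the original model specification):

import numpy as np

rng = np.random.default_rng(0)

K, M, N, V = 4, 10, 50, 30            # topics, documents, tokens per document, vocabulary size
alpha = np.full(K, 0.1)               # Dirichlet prior on the document-topic mixtures theta_j
beta = np.full(V, 0.01)               # Dirichlet prior on the topic-word distributions phi_i

phi = rng.dirichlet(beta, size=K)     # phi_i ~ Dirichlet(beta), one row per topic (K x V)
theta = rng.dirichlet(alpha, size=M)  # theta_j ~ Dirichlet(alpha), one row per document (M x K)

Z = np.empty((M, N), dtype=int)       # Z[j, t]: topic assignment of token t in document j
W = np.empty((M, N), dtype=int)       # W[j, t]: observed word of token t in document j
for j in range(M):
    for t in range(N):
        Z[j, t] = rng.choice(K, p=theta[j])      # Z_{j,t} ~ Categorical(theta_j)
        W[j, t] = rng.choice(V, p=phi[Z[j, t]])  # W_{j,t} ~ Categorical(phi_{Z_{j,t}})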


\begin{align}
& P(\boldsymbol{Z}, \boldsymbol{W};\alpha,\beta) = \int_{\boldsymbol{\theta}} \int_{\boldsymbol{\varphi}} P(\boldsymbol{W}, \boldsymbol{Z}, \boldsymbol{\theta}, \boldsymbol{\varphi};\alpha,\beta) \, d\boldsymbol{\varphi} \, d\boldsymbol{\theta} \\ = & \int_{\boldsymbol{\varphi}} \prod_{i=1}^K P(\varphi_i;\beta) \prod_{j=1}^M \prod_{t=1}^N P(W_{j,t}|\varphi_{Z_{j,t}}) \, d\boldsymbol{\varphi} \int_{\boldsymbol{\theta}} \prod_{j=1}^M P(\theta_j;\alpha) \prod_{t=1}^N P(Z_{j,t}|\theta_j) \, d\boldsymbol{\theta}.
\end{align}

Note that all the θs are independent of each other, and the same holds for all the φs. So we can treat each θ and each φ separately. We now focus only on the θ part.


\int_{\boldsymbol{\theta}} \prod_{j=1}^M P(\theta_j;\alpha) \prod_{t=1}^N P(Z_{j,t}|\theta_j) d\boldsymbol{\theta} = \prod_{j=1}^M \int_{\theta_j} P(\theta_j;\alpha) \prod_{t=1}^N
P(Z_{j,t}|\theta_j) \, d\theta_j .

We can further focus on only one θ_j, as follows:

 \int_{\theta_j} P(\theta_j;\alpha) \prod_{t=1}^N
P(Z_{j,t}|\theta_j) \, d\theta_j .

This is the hidden part of the model for the j-th document. We now replace the probabilities in the above equation with the true distribution expressions to write out the explicit equation.


\begin{align}
& \int_{\theta_j} P(\theta_j;\alpha) \prod_{t=1}^N P(Z_{j,t}|\theta_j) \, d\theta_j \\ = & \int_{\theta_j} \frac{\Gamma\bigl(\sum_{i=1}^K \alpha_i \bigr)}{\prod_{i=1}^K \Gamma(\alpha_i)} \prod_{i=1}^K \theta_{j,i}^{\alpha_i - 1} \prod_{t=1}^N P(Z_{j,t}|\theta_j) \, d\theta_j.
\end{align}

Let n_{j,r}^i be the number of word tokens in the j-th document with the same word symbol (the r-th word in the vocabulary) assigned to the i-th topic. So, n_{j,r}^i is three dimensional. If any of the three dimensions is not limited to a specific value, we use a parenthesized point (·) to denote it. For example, n_{j,(·)}^i denotes the number of word tokens in the j-th document assigned to the i-th topic. Thus, the rightmost part of the above equation can be rewritten as:

 \prod_{t=1}^N P(Z_{j,t}|\theta_j) = \prod_{i=1}^K
\theta_{j,i}^{n_{j,(\cdot)}^i} .
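In code, these count arrays can be accumulated directly from the topic assignments Z and the words W; a minimal sketch, assuming NumPy and the arrays from the generative sketch above (the function and variable names are hypothetical):

import numpy as np

def topic_counts(Z, W, K, V):
    """Accumulate the count arrays used throughout the derivation.

    n_doc[j, i]  = n_{j,(.)}^i : tokens in document j assigned to topic i.
    n_word[i, r] = n_{(.),r}^i : tokens of vocabulary word r assigned to topic i.
    """
    M, N = Z.shape
    n_doc = np.zeros((M, K), dtype=int)
    n_word = np.zeros((K, V), dtype=int)
    for j in range(M):
        for t in range(N):
            n_doc[j, Z[j, t]] += 1
            n_word[Z[j, t], W[j, t]] += 1
    return n_doc, n_word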

So the integration formula can be changed to:


\begin{align}
& \int_{\theta_j} \frac{\Gamma\bigl(\sum_{i=1}^K \alpha_i \bigr)}{\prod_{i=1}^K \Gamma(\alpha_i)} \prod_{i=1}^K \theta_{j,i}^{\alpha_i - 1} \prod_{i=1}^K \theta_{j,i}^{n_{j,(\cdot)}^i} \, d\theta_j \\ = & \int_{\theta_j} \frac{\Gamma\bigl(\sum_{i=1}^K \alpha_i \bigr)}{\prod_{i=1}^K \Gamma(\alpha_i)} \prod_{i=1}^K \theta_{j,i}^{n_{j,(\cdot)}^i+\alpha_i - 1} \, d\theta_j.
\end{align}

Clearly, the expression inside the integral has the same form as the Dirichlet distribution. Since a Dirichlet density integrates to one,

 \int_{\theta_j} \frac{\Gamma\bigl(\sum_{i=1}^K
n_{j,(\cdot)}^i+\alpha_i \bigr)}{\prod_{i=1}^K
\Gamma(n_{j,(\cdot)}^i+\alpha_i)} \prod_{i=1}^K
\theta_{j,i}^{n_{j,(\cdot)}^i+\alpha_i - 1} \, d\theta_j =1 .

Thus,


\begin{align}
& \int_{\theta_j} P(\theta_j;\alpha) \prod_{t=1}^N P(Z_{j,t}|\theta_j) \, d\theta_j = \int_{\theta_j} \frac{\Gamma\bigl(\sum_{i=1}^K \alpha_i \bigr)}{\prod_{i=1}^K \Gamma(\alpha_i)} \prod_{i=1}^K \theta_{j,i}^{n_{j,(\cdot)}^i+\alpha_i - 1} \, d\theta_j \\
= & \frac{\Gamma\bigl(\sum_{i=1}^K \alpha_i \bigr)}{\prod_{i=1}^K \Gamma(\alpha_i)}\frac{\prod_{i=1}^K \Gamma(n_{j,(\cdot)}^i+\alpha_i)}{\Gamma\bigl(\sum_{i=1}^K n_{j,(\cdot)}^i+\alpha_i \bigr)} \int_{\theta_j} \frac{\Gamma\bigl(\sum_{i=1}^K n_{j,(\cdot)}^i+\alpha_i \bigr)}{\prod_{i=1}^K \Gamma(n_{j,(\cdot)}^i+\alpha_i)} \prod_{i=1}^K \theta_{j,i}^{n_{j,(\cdot)}^i+\alpha_i - 1} \, d\theta_j \\
= & \frac{\Gamma\bigl(\sum_{i=1}^K \alpha_i \bigr)}{\prod_{i=1}^K \Gamma(\alpha_i)}\frac{\prod_{i=1}^K \Gamma(n_{j,(\cdot)}^i+\alpha_i)}{\Gamma\bigl(\sum_{i=1}^K n_{j,(\cdot)}^i+\alpha_i \bigr)}.
\end{align}
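Numerically, this closed form is usually evaluated in log space with the log-Gamma function to avoid overflow in the large Gamma terms; a minimal sketch for a single document, assuming SciPy (the helper name is hypothetical):

from scipy.special import gammaln  # log Gamma

def log_theta_term(n_doc_j, alpha):
    """log of the closed-form theta_j integral for one document j.

    n_doc_j and alpha are 1-D NumPy arrays of length K;
    n_doc_j[i] = n_{j,(.)}^i, the number of tokens in document j assigned to topic i.
    """
    return (gammaln(alpha.sum()) - gammaln(alpha).sum()
            + gammaln(n_doc_j + alpha).sum()
            - gammaln((n_doc_j + alpha).sum()))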

Now we turn our attention to the φ part. Actually, the derivation of the φ part is very similar to the θ part. Here we only list the steps of the derivation:


\begin{align}
& \int_{\boldsymbol{\varphi}} \prod_{i=1}^K P(\varphi_i;\beta) \prod_{j=1}^M \prod_{t=1}^N P(W_{j,t}|\varphi_{Z_{j,t}}) \, d\boldsymbol{\varphi} \\
= & \prod_{i=1}^K \int_{\varphi_i} P(\varphi_i;\beta) \prod_{j=1}^M \prod_{t=1}^N P(W_{j,t}|\varphi_{Z_{j,t}}) \, d\varphi_i\\
= & \prod_{i=1}^K \int_{\varphi_i} \frac{\Gamma\bigl(\sum_{r=1}^V \beta_r \bigr)}{\prod_{r=1}^V \Gamma(\beta_r)} \prod_{r=1}^V \varphi_{i,r}^{\beta_r - 1} \prod_{r=1}^V \varphi_{i,r}^{n_{(\cdot),r}^i} \, d\varphi_i \\
= & \prod_{i=1}^K \int_{\varphi_i} \frac{\Gamma\bigl(\sum_{r=1}^V \beta_r \bigr)}{\prod_{r=1}^V \Gamma(\beta_r)} \prod_{r=1}^V \varphi_{i,r}^{n_{(\cdot),r}^i+\beta_r - 1} \, d\varphi_i \\
= & \prod_{i=1}^K \frac{\Gamma\bigl(\sum_{r=1}^V \beta_r
\bigr)}{\prod_{r=1}^V \Gamma(\beta_r)}\frac{\prod_{r=1}^V
\Gamma(n_{(\cdot),r}^i+\beta_r)}{\Gamma\bigl(\sum_{r=1}^V
n_{(\cdot),r}^i+\beta_r \bigr)} .
\end{align}

For clarity, here we write down the final equation with both θ and φ integrated out:


\begin{align}
& P(\boldsymbol{Z}, \boldsymbol{W};\alpha,\beta) \\
= & \prod_{j=1}^M \frac{\Gamma\bigl(\sum_{i=1}^K \alpha_i
\bigr)}{\prod_{i=1}^K \Gamma(\alpha_i)}\frac{\prod_{i=1}^K
\Gamma(n_{j,(\cdot)}^i+\alpha_i)}{\Gamma\bigl(\sum_{i=1}^K
n_{j,(\cdot)}^i+\alpha_i \bigr)} \times \prod_{i=1}^K
\frac{\Gamma\bigl(\sum_{r=1}^V \beta_r \bigr)}{\prod_{r=1}^V
\Gamma(\beta_r)}\frac{\prod_{r=1}^V
\Gamma(n_{(\cdot),r}^i+\beta_r)}{\Gamma\bigl(\sum_{r=1}^V
n_{(\cdot),r}^i+\beta_r \bigr)} .
\end{align}
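In log space the whole expression becomes a sum of log-Gamma terms over documents and topics; a minimal sketch, again assuming SciPy and the count arrays from the earlier sketch (the function name is hypothetical):

from scipy.special import gammaln  # log Gamma

def log_joint(n_doc, n_word, alpha, beta):
    """log P(Z, W; alpha, beta) with theta and phi integrated out.

    n_doc[j, i]  = n_{j,(.)}^i   (M x K document-topic counts)
    n_word[i, r] = n_{(.),r}^i   (K x V topic-word counts)
    """
    M, K = n_doc.shape
    theta_part = (M * (gammaln(alpha.sum()) - gammaln(alpha).sum())
                  + gammaln(n_doc + alpha).sum()
                  - gammaln((n_doc + alpha).sum(axis=1)).sum())
    phi_part = (K * (gammaln(beta.sum()) - gammaln(beta).sum())
                + gammaln(n_word + beta).sum()
                - gammaln((n_word + beta).sum(axis=1)).sum())
    return theta_part + phi_part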

The goal of Gibbs sampling here is to approximate the distribution of P(Z | W; α, β). Since P(W; α, β) is invariant for any Z, Gibbs sampling equations can be derived from P(Z, W; α, β) directly. The key point is to derive the following conditional probability:

 P(Z_{(m,n)}|\boldsymbol{Z_{-(m,n)}},
\boldsymbol{W};\alpha,\beta)=\frac{P(Z_{(m,n)},
\boldsymbol{Z_{-(m,n)}},\boldsymbol{W};\alpha,\beta)}
{P(\boldsymbol{Z_{-(m,n)}}, \boldsymbol{W};\alpha,\beta)} ,

where Z_{(m,n)} denotes the hidden topic variable of the n-th word token in the m-th document, and we further assume that its word symbol is the v-th word in the vocabulary. Z_{-(m,n)} denotes all the Zs but Z_{(m,n)}. Note that Gibbs sampling needs only to sample a value for Z_{(m,n)}; according to the above probability, we do not need the exact value of P(Z_{(m,n)} | Z_{-(m,n)}, W; α, β) but only the ratios among the probabilities that Z_{(m,n)} can take. So, the above equation can be simplified as:


\begin{align}
& P(Z_{(m,n)}=k|\boldsymbol{Z_{-(m,n)}}, \boldsymbol{W};\alpha,\beta) \\
\propto &
P(Z_{(m,n)}=k,\boldsymbol{Z_{-(m,n)}},\boldsymbol{W};\alpha,\beta) \\
= & \left(\frac{\Gamma\left(\sum_{i=1}^K \alpha_i
\right)}{\prod_{i=1}^K \Gamma(\alpha_i)}\right)^M \prod_{j\neq m}
\frac{\prod_{i=1}^K
\Gamma(n_{j,(\cdot)}^i+\alpha_i)}{\Gamma\bigl(\sum_{i=1}^K
n_{j,(\cdot)}^i+\alpha_i \bigr)} \\
& \times \left( \frac{\Gamma\bigl(\sum_{r=1}^V \beta_r
\bigr)}{\prod_{r=1}^V \Gamma(\beta_r)}\right)^K \prod_{i=1}^K
\prod_{r\neq v}
\Gamma(n_{(\cdot),r}^i+\beta_r) \\
& \times \frac{\prod_{i=1}^K
\Gamma(n_{m,(\cdot)}^i+\alpha_i)}{\Gamma\bigl(\sum_{i=1}^K
n_{m,(\cdot)}^i+\alpha_i \bigr)} \prod_{i=1}^K \frac{
\Gamma(n_{(\cdot),v}^i+\beta_v)}{\Gamma\bigl(\sum_{r=1}^V
n_{(\cdot),r}^i+\beta_r \bigr)} \\
\propto & \frac{\prod_{i=1}^K
\Gamma(n_{m,(\cdot)}^i+\alpha_i)}{\Gamma\bigl(\sum_{i=1}^K
n_{m,(\cdot)}^i+\alpha_i \bigr)} \prod_{i=1}^K \frac{
\Gamma(n_{(\cdot),v}^i+\beta_v)}{\Gamma\bigl(\sum_{r=1}^V
n_{(\cdot),r}^i+\beta_r \bigr)}.
\end{align}

Finally, let n_{j,r}^{i,-(m,n)} have the same meaning as n_{j,r}^i but with the word token Z_{(m,n)} excluded. The above equation can be further simplified by treating terms not dependent on k as constants:


\begin{align}
\propto & \frac{\prod_{i\neq k}
\Gamma(n_{m,(\cdot)}^{i,-(m,n)}+\alpha_i)}{\Gamma\bigl((\sum_{i=1}^K
n_{m,(\cdot)}^{i,-(m,n)}+\alpha_i) +1\bigr)} \prod_{i\neq k} \frac{
\Gamma(n_{(\cdot),v}^{i,-(m,n)}+\beta_v)}{\Gamma\bigl(\sum_{r=1}^V
n_{(\cdot),r}^{i,-(m,n)}+\beta_r \bigr)}\\
\times & \Gamma(n_{m,(\cdot)}^{k,-(m,n)}+\alpha_k + 1) \frac{
\Gamma(n_{(\cdot),v}^{k,-(m,n)}+\beta_v +
1)}{\Gamma\bigl((\sum_{r=1}^V n_{(\cdot),r}^{k,-(m,n)}+\beta_r)+1
\bigr)} \\
\propto & \frac{\Gamma(n_{m,(\cdot)}^{k,-(m,n)}+\alpha_k +
1)}{\Gamma\bigl((\sum_{i=1}^K n_{m,(\cdot)}^{i,-(m,n)}+\alpha_i)+1
\bigr)} \frac{ \Gamma(n_{(\cdot),v}^{k,-(m,n)}+\beta_v +
1)}{\Gamma\bigl((\sum_{r=1}^V n_{(\cdot),r}^{k,-(m,n)}+\beta_r)+1
\bigr)}\\
= &
\frac{\Gamma(n_{m,(\cdot)}^{k,-(m,n)}+\alpha_k)\bigl(n_{m,(\cdot)}^{k,-(m,n)}+\alpha_k\bigr)}
{\Gamma\bigl(\sum_{i=1}^K
n_{m,(\cdot)}^{i,-(m,n)}+\alpha_i\bigr)\bigl(\sum_{i=1}^K
n_{m,(\cdot)}^{i,-(m,n)}+\alpha_i\bigr)} \frac{
\Gamma\bigl(n_{(\cdot),v}^{k,-(m,n)}+\beta_v\bigr)\bigl(n_{(\cdot),v}^{k,-(m,n)}+\beta_v\bigr)}
{\Gamma\bigl(\sum_{r=1}^V n_{(\cdot),r}^{k,-(m,n)}+\beta_r\bigr)
\bigl(\sum_{r=1}^V n_{(\cdot),r}^{k,-(m,n)}+\beta_r\bigr)} \\
\propto & \frac{\bigl(n_{m,(\cdot)}^{k,-(m,n)}+\alpha_k\bigr)}
{\bigl(\sum_{i=1}^K n_{m,(\cdot)}^{i,-(m,n)}+\alpha_i\bigr)} \frac{
\bigl(n_{(\cdot),v}^{k,-(m,n)}+\beta_v\bigr)} {\bigl(\sum_{r=1}^V
n_{(\cdot),r}^{k,-(m,n)}+\beta_r\bigr)}\\
\propto & \bigl(n_{m,(\cdot)}^{k,-(m,n)}+\alpha_k\bigr)\frac{
\bigl(n_{(\cdot),v}^{k,-(m,n)}+\beta_v\bigr)} {\bigl(\sum_{r=1}^V
n_{(\cdot),r}^{k,-(m,n)}+\beta_r\bigr)}.
\end{align}

Note that the same formula is derived in the article on the Dirichlet-multinomial distribution, as part of a more general discussion of integrating Dirichlet distribution priors out of a Bayesian network.
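The last proportionality is exactly what a collapsed Gibbs sampler evaluates for each word token before drawing its new topic. A minimal sketch of one sampling sweep, assuming NumPy, the count arrays from the sketches above, and an extra per-topic total n_topic[i] = Σ_r n_{(·),r}^i (all function and variable names are hypothetical):

def gibbs_sweep(Z, W, n_doc, n_word, n_topic, alpha, beta, rng):
    """One collapsed Gibbs sampling sweep over all word tokens.

    Z, W         : (M x N) topic assignments and observed words
    n_doc[m, i]  : n_{m,(.)}^i document-topic counts
    n_word[i, r] : n_{(.),r}^i topic-word counts
    n_topic[i]   : sum_r n_{(.),r}^i, kept so the denominator is O(1) per topic
    """
    M, N = Z.shape
    beta_sum = beta.sum()
    for m in range(M):
        for n in range(N):
            v, k_old = W[m, n], Z[m, n]
            # remove token (m, n) from the counts: these are the "-(m,n)" counts
            n_doc[m, k_old] -= 1
            n_word[k_old, v] -= 1
            n_topic[k_old] -= 1
            # P(Z_{(m,n)} = k | Z_{-(m,n)}, W) proportional to
            # (n_{m,(.)}^{k,-(m,n)} + alpha_k) * (n_{(.),v}^{k,-(m,n)} + beta_v)
            #                                  / (sum_r n_{(.),r}^{k,-(m,n)} + beta_r)
            p = (n_doc[m] + alpha) * (n_word[:, v] + beta[v]) / (n_topic + beta_sum)
            k_new = rng.choice(len(p), p=p / p.sum())
            # add the token back under its newly sampled topic
            Z[m, n] = k_new
            n_doc[m, k_new] += 1
            n_word[k_new, v] += 1
            n_topic[k_new] += 1

After enough such sweeps, point estimates of θ and φ can be recovered from the counts in the usual way, e.g. θ_{m,k} ≈ (n_{m,(·)}^k + α_k) / Σ_i (n_{m,(·)}^i + α_i).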
