Boltzmann Machine - Training

The units in the Boltzmann machine are divided into "visible" units, $V$, and "hidden" units, $H$. The visible units are those that receive information from the "environment", i.e. our training set is a set of binary vectors over the set $V$. The distribution over the training set is denoted $P^{+}(V)$.

As discussed above, the distribution over global states converges as the Boltzmann machine reaches thermal equilibrium. We denote this distribution, after marginalizing it over the hidden units, as $P^{-}(V)$.

Our goal is to approximate the "real" distribution $P^{+}(V)$ using the distribution $P^{-}(V)$ which will (eventually) be produced by the machine. To measure how similar the two distributions are, we use the Kullback-Leibler divergence, $G$:

$$ G = \sum_{v} P^{+}(v) \ln\!\left(\frac{P^{+}(v)}{P^{-}(v)}\right) $$

where the sum is over all possible states $v$ of $V$. $G$ is a function of the weights, since they determine the energy of a state, and the energy determines $P^{-}(v)$, as promised by the Boltzmann distribution. Hence, we can use a gradient descent algorithm over $G$, so a given weight, $w_{ij}$, is changed by subtracting the partial derivative of $G$ with respect to that weight.
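As a concrete illustration (a minimal Python sketch, not part of the original article; the numbers below are hypothetical), $G$ can be computed directly whenever both distributions over the visible states are known explicitly:

```python
import numpy as np

def kl_divergence(p_plus: np.ndarray, p_minus: np.ndarray) -> float:
    """G = sum_v P+(v) * ln(P+(v) / P-(v)), summed over all visible states v.

    Both arguments are 1-D arrays indexed by visible state, each summing to 1.
    States with P+(v) = 0 contribute nothing; P-(v) must be positive wherever
    P+(v) is positive for G to be finite.
    """
    mask = p_plus > 0
    return float(np.sum(p_plus[mask] * np.log(p_plus[mask] / p_minus[mask])))

# Hypothetical example with four visible states:
p_plus = np.array([0.5, 0.25, 0.25, 0.0])   # empirical distribution of the training set
p_minus = np.array([0.4, 0.3, 0.2, 0.1])    # model distribution at thermal equilibrium
print(kl_divergence(p_plus, p_minus))       # strictly positive unless the distributions match
```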

There are two phases to Boltzmann machine training, and we switch between them iteratively. One is the "positive" phase, where the visible units' states are clamped to a particular binary state vector sampled from the training set (according to $P^{+}$). The other is the "negative" phase, where the network is allowed to run freely, i.e. no units have their state determined by external data. Surprisingly enough, the gradient with respect to a given weight, $w_{ij}$, is given by the very simple equation (proved in Ackley et al.):

$$ \frac{\partial G}{\partial w_{ij}} = -\frac{1}{R}\left[ p_{ij}^{+} - p_{ij}^{-} \right] $$

where:

  • $p_{ij}^{+}$ is the probability that units $i$ and $j$ are both on when the machine is at equilibrium in the positive phase.
  • $p_{ij}^{-}$ is the probability that units $i$ and $j$ are both on when the machine is at equilibrium in the negative phase.
  • $R$ denotes the learning rate.

This result follows from the fact that at thermal equilibrium the probability of any global state when the network is free-running is given by the Boltzmann distribution (hence the name "Boltzmann machine").
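The following is a minimal sketch of this learning rule (an illustrative Python/NumPy implementation, not the original authors' code; the function names, the number of Gibbs sweeps, and the step size eta are all assumptions). It estimates $p_{ij}^{+}$ and $p_{ij}^{-}$ by Gibbs sampling, clamping the visible units in the positive phase and letting every unit run freely in the negative phase, and then changes each weight in proportion to the difference between the two statistics:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(state, W, theta, clamped=()):
    """One sweep of Gibbs sampling at temperature T = 1: each unclamped unit i is
    switched on with its equilibrium probability sigma(sum_j w_ij * s_j + theta_i)."""
    for i in range(len(state)):
        if i in clamped:
            continue
        p_on = 1.0 / (1.0 + np.exp(-(W[i] @ state + theta[i])))
        state[i] = float(rng.random() < p_on)
    return state

def coactivation(samples):
    """Estimate p_ij = P(units i and j both on) from a batch of sampled global states."""
    s = np.asarray(samples)
    return s.T @ s / len(s)

def train_step(W, theta, data, eta=0.05, sweeps=20, n_free=100):
    """One weight update: delta w_ij proportional to (p_ij_plus - p_ij_minus).

    `data` is a 2-D array of binary training vectors over the visible units.
    """
    n = len(theta)
    n_visible = data.shape[1]
    visible = tuple(range(n_visible))

    # Positive phase: clamp the visible units to each training vector and let the
    # hidden units settle toward equilibrium before recording the joint state.
    positive = []
    for v in data:
        s = np.concatenate([v, rng.integers(0, 2, n - n_visible)]).astype(float)
        for _ in range(sweeps):
            s = gibbs_sweep(s, W, theta, clamped=visible)
        positive.append(s.copy())

    # Negative phase: let the whole network run freely from random initial states.
    negative = []
    for _ in range(n_free):
        s = rng.integers(0, 2, n).astype(float)
        for _ in range(sweeps):
            s = gibbs_sweep(s, W, theta)
        negative.append(s.copy())

    p_plus, p_minus = coactivation(positive), coactivation(negative)

    # Local learning rule: raise a weight when the clamped phase shows more
    # co-activation than the free-running phase, lower it otherwise.
    W += eta * (p_plus - p_minus)
    np.fill_diagonal(W, 0.0)   # no self-connections
    return W, p_plus, p_minus
```

For example (toy, hypothetical setup), one could initialize $W$ and $\theta$ to zero for three visible and two hidden units and call `train_step(W, theta, np.array([[1, 0, 1], [1, 1, 1]]))` repeatedly; the quality of the gradient estimate depends on how many Gibbs sweeps and free-running samples are used.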

Remarkably, this learning rule is fairly biologically plausible because the only information needed to change a weight is local: the connection (a synapse, biologically speaking) does not need information about anything other than the two neurons it connects. This is far more biologically realistic than the information needed by a connection in many other neural network training algorithms, such as backpropagation.

The training of a Boltzmann machine does not use the EM algorithm, which is heavily used in machine learning. Minimizing the KL divergence is equivalent to maximizing the log-likelihood of the data, so the training procedure performs gradient ascent on the log-likelihood of the observed data. This is in contrast to the EM algorithm, in which the posterior distribution of the hidden nodes must be calculated before maximizing the expected value of the complete-data likelihood in the M-step.

Training the biases is similar, but uses only single-node activity:

$$ \frac{\partial G}{\partial \theta_{i}} = -\frac{1}{R}\left[ p_{i}^{+} - p_{i}^{-} \right] $$

where $p_{i}^{+}$ and $p_{i}^{-}$ are the probabilities of unit $i$ being on when the machine is at equilibrium in the positive and negative phases, respectively.
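Continuing the hypothetical sketch above (same caveats apply), the required single-node statistics are just the diagonal entries of the co-activation matrices, since $p_{ii} = p_{i}$, so the bias update can reuse the quantities already collected:

```python
# Bias update, continuing the sketch above: p_plus and p_minus are the
# co-activation matrices returned by train_step, and their diagonals are the
# single-node on-probabilities p_i^+ and p_i^-.
eta = 0.05
theta += eta * (np.diag(p_plus) - np.diag(p_minus))
```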
