Training
The units in the Boltzmann machine are divided into "visible" units, V, and "hidden" units, H. The visible units are those which receive information from the "environment", i.e. our training set is a set of binary vectors over the set V. The distribution over the training set is denoted $P^{+}(V)$.
As is discussed above, the distribution over global states converges as the Boltzmann machine reaches thermal equilibrium. We denote this distribution, after we marginalize it over the hidden units, as $P^{-}(V)$.
Our goal is to approximate the "real" distribution $P^{+}(V)$ using the distribution $P^{-}(V)$ which will be produced (eventually) by the machine. To measure how similar the two distributions are, we use the Kullback–Leibler divergence, $G$:

$$G = \sum_{v} P^{+}(v)\,\ln\!\left(\frac{P^{+}(v)}{P^{-}(v)}\right)$$
where the sum is over all the possible states of $V$. $G$ is a function of the weights, since they determine the energy of a state, and the energy determines $P^{-}(v)$, as promised by the Boltzmann distribution. Hence, we can use a gradient descent algorithm over $G$, so a given weight, $w_{ij}$, is changed by subtracting the partial derivative of $G$ with respect to the weight.
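As a concrete illustration, here is a short NumPy sketch of evaluating $G$ for two hypothetical distributions over the $2^{|V|}$ visible states; the example numbers and function name are made up for illustration only.

```python
import numpy as np

def kl_divergence(p_plus, p_minus, eps=1e-12):
    """G = sum_v P+(v) ln(P+(v) / P-(v)).

    p_plus  : empirical distribution over all visible states (from the data)
    p_minus : model's equilibrium distribution over the same states
    Both arrays must sum to 1 and enumerate the visible states in the same order.
    """
    p_plus = np.asarray(p_plus, dtype=float)
    p_minus = np.asarray(p_minus, dtype=float)
    mask = p_plus > 0  # terms with P+(v) = 0 contribute nothing
    return float(np.sum(p_plus[mask] * np.log(p_plus[mask] / (p_minus[mask] + eps))))

# Toy example: 2 visible units -> 4 possible states (00, 01, 10, 11).
p_data  = np.array([0.1, 0.4, 0.4, 0.1])      # hypothetical P+(V)
p_model = np.array([0.25, 0.25, 0.25, 0.25])  # hypothetical P-(V) of an untrained machine
print(kl_divergence(p_data, p_model))  # > 0; shrinks toward 0 as P- approaches P+
```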
There are two phases to Boltzmann machine training, and we switch iteratively between them. One is the "positive" phase, where the visible units' states are clamped to a particular binary state vector sampled from the training set (according to $P^{+}$). The other is the "negative" phase, where the network is allowed to run freely, i.e. no units have their state determined by external data. Surprisingly enough, the gradient with respect to a given weight, $w_{ij}$, is given by the very simple equation (proved in Ackley et al.):

$$\frac{\partial G}{\partial w_{ij}} = -\frac{1}{R}\left[p_{ij}^{+} - p_{ij}^{-}\right]$$

where:
- $p_{ij}^{+}$ is the probability of units i and j both being on when the machine is at equilibrium in the positive phase.
- $p_{ij}^{-}$ is the probability of units i and j both being on when the machine is at equilibrium in the negative phase.
- $R$ denotes the learning rate
This result follows from the fact that at thermal equilibrium the probability of any global state when the network is free-running is given by the Boltzmann distribution (hence the name "Boltzmann machine").
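To make the two phases concrete, the following is a minimal NumPy sketch of a single weight update. It is not the procedure of Ackley et al. verbatim: the stochastic unit-update rule $p(s_i = 1) = 1/(1 + e^{-\Delta E_i/T})$ with $\Delta E_i = \sum_j w_{ij} s_j + \theta_i$, the network size, the sample counts, the burn-in length, and all names (gibbs_step, phase_statistics, R, theta, ...) are illustrative assumptions, and the equilibrium statistics are approximated by short Gibbs-sampling runs rather than exact averages.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_step(state, W, theta, clamped, T=1.0):
    """One sweep of stochastic unit updates; units where `clamped` is True are left fixed."""
    for i in rng.permutation(len(state)):
        if clamped[i]:
            continue
        # Assumed update rule: p(s_i = 1) = sigmoid((sum_j w_ij s_j + theta_i) / T).
        delta_e = W[i] @ state + theta[i]
        state[i] = 1.0 if rng.random() < 1.0 / (1.0 + np.exp(-delta_e / T)) else 0.0
    return state

def phase_statistics(W, theta, n_visible, data=None, n_samples=200, burn_in=20):
    """Estimate pairwise co-activation probabilities p_ij and single-node activities p_i
    at (approximate) equilibrium. If `data` is given, the visible units are clamped to
    training vectors (positive phase); otherwise the network runs freely (negative phase)."""
    n_units = W.shape[0]
    pair = np.zeros((n_units, n_units))
    single = np.zeros(n_units)
    for k in range(n_samples):
        state = rng.integers(0, 2, n_units).astype(float)
        clamped = np.zeros(n_units, dtype=bool)
        if data is not None:
            state[:n_visible] = data[k % len(data)]
            clamped[:n_visible] = True
        for _ in range(burn_in):
            state = gibbs_step(state, W, theta, clamped)
        pair += np.outer(state, state)
        single += state
    return pair / n_samples, single / n_samples

# Toy setup: 3 visible + 2 hidden units, symmetric weights, no self-connections.
n_visible, n_hidden = 3, 2
n_units = n_visible + n_hidden
W = np.zeros((n_units, n_units))
theta = np.zeros(n_units)          # biases (their update is shown further below)
data = np.array([[1, 1, 0], [0, 1, 1]], dtype=float)  # hypothetical training vectors over V
R = 0.1                            # step size playing the role of the learning rate

# One training update: estimate both phases, then move each weight
# proportionally to (p_ij+ - p_ij-).
p_plus, a_plus = phase_statistics(W, theta, n_visible, data=data)
p_minus, a_minus = phase_statistics(W, theta, n_visible, data=None)
dW = R * (p_plus - p_minus)
np.fill_diagonal(dW, 0.0)          # no self-connections
W += dW                            # p_ij matrices are symmetric, so W stays symmetric
```

A practical implementation would typically anneal the temperature T downward while approaching equilibrium rather than sampling at a fixed T, and would repeat this update over many iterations.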
Remarkably, this learning rule is fairly biologically plausible because the only information needed to change the weights is provided by "local" information. That is, the connection (or synapse biologically speaking) does not need information about anything other than the two neurons it connects. This is far more biologically realistic than the information needed by a connection in many other neural network training algorithms, such as backpropagation.
The training of a Boltzmann machine does not use the EM algorithm, which is heavily used in machine learning. Minimizing the KL divergence is equivalent to maximizing the log-likelihood of the data, so the training procedure performs gradient ascent on the log-likelihood of the observed data. This is in contrast to the EM algorithm, in which the posterior distribution of the hidden nodes must be calculated before maximizing the expected value of the complete-data likelihood in the M-step.
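To spell out the equivalence, expand $G$ using its definition above:

$$G = \sum_{v} P^{+}(v)\,\ln P^{+}(v) \;-\; \sum_{v} P^{+}(v)\,\ln P^{-}(v)$$

The first term is the negative entropy of the data distribution and does not depend on the weights, so minimizing $G$ is the same as maximizing $\sum_{v} P^{+}(v)\,\ln P^{-}(v)$, the expected log-likelihood that the free-running machine assigns to the training data.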
Training the biases is similar, but uses only single-node activity:

$$\frac{\partial G}{\partial \theta_{i}} = -\frac{1}{R}\left[p_{i}^{+} - p_{i}^{-}\right]$$

where $p_{i}^{+}$ and $p_{i}^{-}$ are the probabilities of unit i being on in the positive and negative phases, respectively.
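In the illustrative sketch above, the single-node activity averages a_plus and a_minus already estimate $p_{i}^{+}$ and $p_{i}^{-}$, so the corresponding bias update (for the assumed bias vector theta, using the same step size R) is one extra line:

```python
# Bias change proportional to the difference in single-node activities
# between the positive (clamped) and negative (free-running) phases.
theta += R * (a_plus - a_minus)
```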