Pattern Recognition - Problem Statement (Supervised Version)

Formally, the problem of supervised pattern recognition can be stated as follows: Given an unknown function $g: \mathcal{X} \to \mathcal{Y}$ (the ground truth) that maps input instances $x \in \mathcal{X}$ to output labels $y \in \mathcal{Y}$, along with training data $\mathbf{D} = \{(x_1, y_1), \dots, (x_n, y_n)\}$ assumed to represent accurate examples of the mapping, produce a function $h: \mathcal{X} \to \mathcal{Y}$ that approximates as closely as possible the correct mapping $g$. (For example, if the problem is filtering spam, then $x_i$ is some representation of an email and $y$ is either "spam" or "non-spam".)

In order for this to be a well-defined problem, "approximates as closely as possible" needs to be defined rigorously. In decision theory, this is done by specifying a loss function that assigns a specific value to the "loss" resulting from producing an incorrect label. The goal is then to minimize the expected loss, with the expectation taken over the probability distribution of $\mathcal{X}$. In practice, neither the distribution of $\mathcal{X}$ nor the ground-truth function $g: \mathcal{X} \to \mathcal{Y}$ is known exactly; both can only be estimated empirically by collecting a large number of samples of $\mathcal{X}$ and hand-labeling them with the correct value of $\mathcal{Y}$ (a time-consuming process, which is typically the limiting factor in the amount of data of this sort that can be collected). The particular loss function depends on the type of label being predicted. For example, in the case of classification, the simple zero-one loss function is often sufficient. This corresponds to assigning a loss of 1 to any incorrect labeling, and is equivalent to computing the accuracy of the classification procedure over the set of test data (i.e. counting the fraction of instances that the learned function $h: \mathcal{X} \to \mathcal{Y}$ labels correctly). The goal of the learning procedure is then to maximize this accuracy on a "typical" test set.
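As a concrete illustration, the sketch below (not part of the original statement; the keyword rule, the helper names, and the example emails are all invented) evaluates a hypothetical learned classifier under the zero-one loss, so that test accuracy is simply the fraction of correctly labeled instances:

```python
# Minimal sketch: zero-one loss and test accuracy for an assumed classifier h.
def h(email_text):
    """Hypothetical stand-in for the learned function h: X -> Y."""
    return "spam" if "free money" in email_text.lower() else "non-spam"

def zero_one_loss(predicted, true):
    """Assign a loss of 1 to any incorrect label, 0 otherwise."""
    return 0 if predicted == true else 1

# Hand-labeled test set: pairs (x_i, y_i) of email text and correct label.
test_data = [
    ("Claim your FREE MONEY now!!!", "spam"),
    ("Meeting moved to 3pm tomorrow", "non-spam"),
    ("Lunch on Friday?", "non-spam"),
]

total_loss = sum(zero_one_loss(h(x), y) for x, y in test_data)
accuracy = 1 - total_loss / len(test_data)  # fraction labeled correctly
print(f"zero-one loss: {total_loss}, test accuracy: {accuracy:.2f}")
```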

For a probabilistic pattern recognizer, the problem is instead to estimate the probability of each possible output label given a particular input instance, i.e. to estimate a function of the form

$$p(\mathrm{label} \mid \mathbf{x}, \boldsymbol\theta) = f(\mathbf{x}; \boldsymbol\theta)$$
where $\mathbf{x}$ is the feature vector input, and the function $f$ is typically parameterized by some parameters $\boldsymbol\theta$. In a discriminative approach to the problem, $f$ is estimated directly. In a generative approach, however, the inverse probability $p(\mathbf{x} \mid \mathrm{label})$ is instead estimated and combined with the prior probability $p(\mathrm{label} \mid \boldsymbol\theta)$ using Bayes' rule, as follows:

$$p(\mathrm{label} \mid \mathbf{x}, \boldsymbol\theta) = \frac{p(\mathbf{x} \mid \mathrm{label}, \boldsymbol\theta)\, p(\mathrm{label} \mid \boldsymbol\theta)}{\sum_{L \in \text{all labels}} p(\mathbf{x} \mid L, \boldsymbol\theta)\, p(L \mid \boldsymbol\theta)}$$
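As a rough illustration of this generative computation (a sketch only; the Poisson likelihood, the assumed rates, and the prior values are invented for the example), the posterior over labels is obtained by normalizing likelihood times prior over all labels:

```python
# Minimal sketch of Bayes' rule with a summation denominator, assuming the
# class-conditional likelihood and the prior are already available.
import math

def posterior_over_labels(x, likelihood, prior):
    """Return p(label | x) for every label, given p(x | label) and p(label)."""
    joint = {label: likelihood(x, label) * prior[label] for label in prior}
    evidence = sum(joint.values())  # denominator: sum over all labels
    return {label: j / evidence for label, j in joint.items()}

def poisson_likelihood(x, label):
    """Hypothetical toy model: x is a count, modeled with an assumed Poisson rate."""
    rate = {"spam": 6.0, "non-spam": 1.5}[label]
    return math.exp(-rate) * rate**x / math.factorial(x)

prior = {"spam": 0.4, "non-spam": 0.6}
print(posterior_over_labels(4, poisson_likelihood, prior))
```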
When the labels are continuously distributed (e.g. in regression analysis), the denominator involves integration rather than summation:

$$p(\mathrm{label} \mid \mathbf{x}, \boldsymbol\theta) = \frac{p(\mathbf{x} \mid \mathrm{label}, \boldsymbol\theta)\, p(\mathrm{label} \mid \boldsymbol\theta)}{\int_{L} p(\mathbf{x} \mid L, \boldsymbol\theta)\, p(L \mid \boldsymbol\theta)\, \mathrm{d}L}$$
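A minimal numerical sketch of the continuous-label case, assuming Gaussian forms for the likelihood and prior purely to make the example self-contained, approximates the integral in the denominator on a grid:

```python
# Minimal sketch: continuous labels, denominator approximated by the
# trapezoidal rule on a grid. All distributions and numbers are assumptions.
import numpy as np

def gaussian(z, mean, std):
    return np.exp(-0.5 * ((z - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

x = 2.3                                    # observed input
L = np.linspace(-10, 10, 2001)             # grid over continuous labels
likelihood = gaussian(x, mean=L, std=1.0)  # p(x | L, theta)
prior = gaussian(L, mean=0.0, std=3.0)     # p(L | theta)

evidence = np.trapz(likelihood * prior, L)   # integral in the denominator
posterior = likelihood * prior / evidence    # p(L | x, theta) on the grid
print("posterior mean of the label:", np.trapz(L * posterior, L))
```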
The value of $\boldsymbol\theta$ is typically learned using maximum a posteriori (MAP) estimation. This finds the best value that simultaneously meets two conflicting objectives: to perform as well as possible on the training data and to find the simplest possible model. Essentially, this combines maximum likelihood estimation with a regularization procedure that favors simpler models over more complex ones. In a Bayesian context, the regularization procedure can be viewed as placing a prior probability $p(\boldsymbol\theta)$ on different values of $\boldsymbol\theta$. Mathematically:

$$\boldsymbol\theta^{*} = \arg\max_{\boldsymbol\theta} p(\boldsymbol\theta \mid \mathbf{D})$$
where $\boldsymbol\theta^{*}$ is the value used for $\boldsymbol\theta$ in the subsequent evaluation procedure, and $p(\boldsymbol\theta \mid \mathbf{D})$, the posterior probability of $\boldsymbol\theta$, is given by

$$p(\boldsymbol\theta \mid \mathbf{D}) = \frac{\left[\prod_{i=1}^{n} p(y_i \mid x_i, \boldsymbol\theta)\right] p(\boldsymbol\theta)}{\int_{\boldsymbol\theta'} \left[\prod_{i=1}^{n} p(y_i \mid x_i, \boldsymbol\theta')\right] p(\boldsymbol\theta')\, \mathrm{d}\boldsymbol\theta'}$$
In the Bayesian approach to this problem, instead of choosing a single parameter vector $\boldsymbol\theta^{*}$, the probability of a given label for a new instance $\mathbf{x}$ is computed by integrating over all possible values of $\boldsymbol\theta$, weighted according to the posterior probability:

$$p(\mathrm{label} \mid \mathbf{x}, \mathbf{D}) = \int p(\mathrm{label} \mid \mathbf{x}, \boldsymbol\theta)\, p(\boldsymbol\theta \mid \mathbf{D})\, \mathrm{d}\boldsymbol\theta$$