k-nearest Neighbor Algorithm - Algorithm

Algorithm

The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples.

In the classification phase, k is a user-defined constant, and an unlabeled vector (a query or test point) is classified by assigning the label that is most frequent among the k training samples nearest to that query point.
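As a rough sketch of both phases in plain Python (the names euclidean, knn_classify and the toy data below are illustrative, not part of any standard library):

    from collections import Counter
    import math

    def euclidean(a, b):
        # Straight-line distance between two equal-length feature vectors.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def knn_classify(query, points, labels, k=3):
        # "Training" consists only of keeping points and labels around.
        # Classification: rank the stored points by distance to the query ...
        ranked = sorted(range(len(points)), key=lambda i: euclidean(query, points[i]))
        # ... and take a majority vote among the k nearest.
        votes = Counter(labels[i] for i in ranked[:k])
        return votes.most_common(1)[0][0]

    # Two classes in a 2-D feature space.
    X = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
    y = ["red", "red", "blue", "blue"]
    print(knn_classify((1.1, 0.9), X, y, k=3))  # -> red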

Usually Euclidean distance is used as the distance metric; however, this is only applicable to continuous variables. In cases such as text classification, another metric, such as the overlap metric (or Hamming distance), can be used. Often, the classification accuracy of k-NN can be improved significantly if the distance metric is learned with specialized algorithms such as Large Margin Nearest Neighbor or Neighbourhood Components Analysis.
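For comparison with the Euclidean helper above, a minimal sketch of the overlap/Hamming distance over categorical or binary feature vectors (the example vectors are illustrative):

    def hamming(a, b):
        # Overlap metric: the number of positions at which two equal-length
        # categorical (or binary) feature vectors differ.
        return sum(x != y for x, y in zip(a, b))

    # e.g. two binary bag-of-words vectors for short texts
    print(hamming((1, 0, 1, 1, 0), (1, 1, 1, 0, 0)))  # -> 2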

A drawback of the basic "majority voting" classification is that classes with more frequent examples tend to dominate the prediction of the new vector: because of their large number, they tend to appear among the k nearest neighbors. One way to overcome this problem is to weight the votes, taking into account the distance from the test point to each of its k nearest neighbors. Another way is to add a level of abstraction in the data representation. For example, with a self-organizing map (SOM), each node becomes a representative (cluster center) of a group of similar points, regardless of their density in the original training data set. k-NN is then applied to the nodes of the trained SOM rather than to the original training data.
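One common (though not the only) weighting scheme gives each neighbor a vote proportional to 1/d, where d is its distance to the query point. A sketch under that assumption (the function name weighted_knn_classify is illustrative):

    from collections import defaultdict
    import math

    def weighted_knn_classify(query, points, labels, k=3):
        def dist(a, b):
            return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
        # Keep the k stored points nearest to the query.
        nearest = sorted(range(len(points)), key=lambda i: dist(query, points[i]))[:k]
        # Each neighbor votes with weight 1/d, so closer neighbors count for more;
        # the small epsilon guards against division by zero when d = 0.
        scores = defaultdict(float)
        for i in nearest:
            scores[labels[i]] += 1.0 / (dist(query, points[i]) + 1e-9)
        return max(scores, key=scores.get)

With this weighting, a single very close neighbor can outvote several distant neighbors of a more frequent class.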

k-NN is a special case of a variable-bandwidth, kernel density "balloon" estimator with a uniform kernel.
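In density-estimation terms, the correspondence can be sketched as follows: with n stored samples in d dimensions, the k-NN density estimate at a point x is

    \hat{f}(x) = \frac{k}{n \, c_d \, R_k(x)^d},

where R_k(x) is the distance from x to its k-th nearest training sample and c_d is the volume of the unit ball in d dimensions. The local "bandwidth" R_k(x) widens in sparse regions and shrinks in dense ones, which is the balloon-estimator behaviour, and the uniform kernel corresponds to counting every sample inside the ball equally.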
