Outlier - Identifying Outliers

There is no rigid mathematical definition of what constitutes an outlier; determining whether or not an observation is an outlier is ultimately a subjective exercise.

Outlier detection has been used for centuries to detect and, where appropriate, remove anomalous observations from data. It can identify system faults and fraud before they escalate with potentially catastrophic consequences. The original outlier detection methods were arbitrary, but principled and systematic techniques drawn from the full gamut of computer science and statistics are now used.

There are three fundamental approaches to the problem of outlier detection:

  • Type 1 - Determine the outliers with no prior knowledge of the data. This is essentially a learning approach analogous to unsupervised clustering. The approach processes the data as a static distribution, pinpoints the most remote points, and flags them as potential outliers.
  • Type 2 - Model both normality and abnormality. This approach is analogous to supervised classification and requires pre-labeled data, tagged as normal or abnormal.
  • Type 3 - Model only normality (or, in a few cases, model abnormality). This is analogous to a semi-supervised recognition or detection task. It may be considered semi-supervised because only the normal class is taught, yet the algorithm learns to recognize abnormality.

Commonly used model-based methods assume that the data come from a normal distribution and identify observations that are deemed "unlikely" based on the mean and standard deviation:

  • Chauvenet's criterion
  • Grubbs' test for outliers
  • Peirce's criterion

It is proposed to determine in a series of m observations the limit of error, beyond which all observations involving so great an error may be rejected, provided there are as many as n such observations. The principle upon which it is proposed to solve this problem is, that the proposed observations should be rejected when the probability of the system of errors obtained by retaining them is less than that of the system of errors obtained by their rejection multiplied by the probability of making so many, and no more, abnormal observations. (Quoted in the editorial note on page 516 to Peirce (1982 edition) from A Manual of Astronomy 2:558 by Chauvenet.)

  • Dixon's Q test
  • ASTM E178 Standard Practice for Dealing With Outlying Observations
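
As a rough illustration of the normal-model idea behind the methods listed above, the following Python sketch applies Chauvenet's criterion in its usual textbook form: an observation is flagged when the expected number of observations at least as extreme, under a normal distribution fitted to the sample, falls below 0.5. The function name `chauvenet_outliers` and the sample readings are invented for this example, and the sketch makes a single pass rather than re-testing after removal.

```python
import math
import statistics


def chauvenet_outliers(values):
    """Return the values flagged by a single pass of Chauvenet's criterion.

    Assumes the data are roughly normal; the name and interface are
    hypothetical, chosen only for this illustration.
    """
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    if sd == 0:
        return []  # all values identical, nothing to flag

    flagged = []
    for x in values:
        z = abs(x - mean) / sd
        # Two-sided probability that a standard normal deviate is at least z.
        tail = math.erfc(z / math.sqrt(2))
        # Expected count of equally extreme observations in n draws.
        if n * tail < 0.5:
            flagged.append(x)
    return flagged


# Made-up measurements with one suspicious reading.
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.1, 14.5, 10.0]
suspects = chauvenet_outliers(readings)
```

Because the mean and standard deviation are themselves inflated by the suspect value, a test of this kind can miss outliers when several are present; that limitation belongs to the approach, not only to this sketch.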

Other methods flag observations based on measures such as the interquartile range. For example, if $Q_1$ and $Q_3$ are the lower and upper quartiles respectively, then one could define an outlier to be any observation outside the range:

$[Q_1 - k(Q_3 - Q_1),\; Q_3 + k(Q_3 - Q_1)]$

for some nonnegative constant $k$.
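
The interquartile-range rule can be sketched in a few lines of Python. The function name `iqr_outliers`, the default multiplier of 1.5 (a common convention, not something fixed by the rule itself), and the choice of the "inclusive" quartile method are all assumptions of this example; different quartile definitions can move the fences slightly.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside the interquartile-range fences.

    The name and the default k are illustrative assumptions.
    """
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]


# Made-up sample: the quartiles are 4 and 7, so the fences are -0.5 and 11.5.
data = [2, 3, 4, 5, 5, 6, 7, 8, 30]
flagged = iqr_outliers(data)
```

Unlike the mean and standard deviation, the quartiles are barely affected by a few extreme values, so the fences stay stable even when the sample contains the outliers being sought.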

Other approaches are distance-based or density-based, and many of them use the distance to the k-nearest neighbors to label observations as outliers or non-outliers.
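
One simple distance-based score is each point's distance to its k-th nearest neighbor: isolated points receive large scores. The sketch below is a brute-force illustration only; the function name `kth_neighbor_distances`, the Euclidean metric, and the sample points are assumptions of this example, and no threshold for declaring an outlier is implied.

```python
import math


def kth_neighbor_distances(points, k):
    """Return, for each point, the Euclidean distance to its k-th nearest neighbor.

    Brute-force O(n^2) sketch; the name and interface are hypothetical.
    """
    if not 1 <= k <= len(points) - 1:
        raise ValueError("k must be between 1 and len(points) - 1")

    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Four points on a unit square plus one far-away point (made-up data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (8.0, 8.0)]
scores = kth_neighbor_distances(points, k=2)
most_remote = points[scores.index(max(scores))]
```

Here the four clustered points each score 1.0 while the distant point scores above 10, so ranking by score singles it out. Density-based methods refine this idea by comparing a point's neighborhood with the neighborhoods of its neighbors.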
