BIRCH (data Clustering) - BIRCH Clustering Algorithm

BIRCH Clustering Algorithm

Given a set of N d-dimensional data points, the clustering feature of the set is defined as the triple, where is the linear sum and is the square sum of data points.

Clustering features are organized in a CF tree, which is a height balanced tree with two parameters: branching factor and threshold . Each non-leaf node contains at most entries of the form, where is a pointer to its th child node and the clustering feature representing the associated subcluster. A leaf node contains at most entries each of the form . It also has two pointers prev and next which are used to chain all leaf nodes together. The tree size depends on the parameter T. A node is required to fit in a page of size P. B and L are determined by P. So P can be varied for performance tuning. It is a very compact representation of the dataset because each entry in a leaf node is not a single data point but a subcluster.

In the algorithm in the first step it scans all data and builds an initial memory CF tree using the given amount of memory. In the second step it scans all the leaf entries in the initial CF tree to rebuild a smaller CF tree, while removing outliers and grouping crowded subclusters into larger ones. In step three an existing clustering algorithm is used to cluster all leaf entries. Here an agglomerative hierarchical clustering algorithm is applied directly to the subclusters represented by their CF vectors. It also provides the flexibility of allowing the user to specify either the desired number of clusters or the desired diameter threshold for clusters. After this step a set of clusters is obtained that captures major distribution pattern in the data. However there might exist minor and localized inaccuracies which can be handled by an optional step 4. In step 4 the centroids of the clusters are used produced in step as seeds and redistribute the data points to its closest seeds to obtain a new set of clusters. Step 4 also provides us with an option of discarding outliers. That is a point which is too far from its closest seed can be treated as an outlier.

Read more about this topic:  BIRCH (data Clustering)

Famous quotes containing the word birch:

    The birch stripped of its bark, or the charred stump where a tree has been burned down to be made into a canoe,—these are the only traces of man, a fabulous wild man to us. On either side, the primeval forest stretches away uninterrupted to Canada, or to the “South Sea”; to the white man a drear and howling wilderness, but to the Indian a home, adapted to his nature, and cheerful as the smile of the Great Spirit.
    Henry David Thoreau (1817–1862)