Random Forest - Visualization

Visualization

To build an intuitive visualization of the model space represented by a random forest, a dataset of 200 random points (100 green and 100 red) was created. The green points were drawn from a Gaussian distribution centered at (0,1), and the red points from a Gaussian distribution centered at (1,0). In both cases the distribution was circular (equal variance in every direction), with an average radius of 1.
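A dataset like the one described could be generated roughly as follows. This is a minimal sketch, not the original authors' code: it assumes that "an average radius of 1" means a standard deviation of about 1 along each axis, and the function name `make_two_blobs`, the seed, and the 0/1 label encoding (0 = green, 1 = red) are illustrative choices.

```python
import numpy as np


def make_two_blobs(n_per_class=100, seed=0):
    """Return (X, y): two circular Gaussian clouds of n_per_class points each.

    Assumption: a per-axis standard deviation of 1 stands in for the
    "average radius of 1" mentioned in the text.
    """
    rng = np.random.default_rng(seed)
    green = rng.normal(loc=(0.0, 1.0), scale=1.0, size=(n_per_class, 2))
    red = rng.normal(loc=(1.0, 0.0), scale=1.0, size=(n_per_class, 2))
    X = np.vstack([green, red])
    # Label encoding is an illustrative choice: 0 = green, 1 = red.
    y = np.repeat([0, 1], n_per_class)
    return X, y


X, y = make_two_blobs()
```

Because the two centroids are only about 1.4 units apart while each cloud has a spread of about 1, the classes overlap heavily, which is what makes the overfitting easy to see.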

A random forest model consisting of 50 trees was trained on this data. In the resulting visualization, the purity of the color indicates the proportion of the 50 trees that voted in agreement. Significant overfitting can be observed.
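The per-point vote proportion that drives the color purity can be sketched with scikit-learn. This is one possible reconstruction under stated assumptions: the library, the grid extent, the seeds, and the variable names are not from the original, and the data are regenerated with a per-axis standard deviation of 1. Hard votes are counted from the individual trees, since `predict_proba` on a forest averages leaf class probabilities rather than counting votes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Regenerate the two-class data (assumed std of 1 per axis; 0 = green, 1 = red).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal((0.0, 1.0), 1.0, size=(100, 2)),
    rng.normal((1.0, 0.0), 1.0, size=(100, 2)),
])
y = np.repeat([0, 1], 100)

# 50 trees, as in the text; all other settings are scikit-learn defaults.
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Evaluate every tree on a regular grid covering the data (extent is arbitrary).
xs, ys = np.meshgrid(np.linspace(-3, 4, 100), np.linspace(-3, 4, 100))
grid = np.column_stack([xs.ravel(), ys.ravel()])

# votes has shape (50, n_grid_points); each row is one tree's hard prediction.
votes = np.stack([tree.predict(grid) for tree in forest.estimators_])
red_fraction = (votes == 1).mean(axis=0)

# Agreement is the share of trees siding with the majority: 0.5 (split) to 1.0.
agreement = np.maximum(red_fraction, 1.0 - red_fraction).reshape(xs.shape)
```

Plotting `red_fraction.reshape(xs.shape)` as a color map (for example with matplotlib's `pcolormesh`) would give a picture of the kind described, with small isolated islands around individual training points where the forest has fit noise.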

For contrast, a logistic regression model (which is somewhat less prone to overfitting) was also trained on the same data.
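A comparable logistic regression fit might look like the following sketch. It again assumes scikit-learn and regenerated data with a per-axis standard deviation of 1; the seed and variable names are illustrative. With two input features the fitted model has only three parameters, so its decision boundary is a single straight line and it cannot carve out islands around individual points.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Regenerate the two-class data (assumed std of 1 per axis; 0 = green, 1 = red).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal((0.0, 1.0), 1.0, size=(100, 2)),
    rng.normal((1.0, 0.0), 1.0, size=(100, 2)),
])
y = np.repeat([0, 1], 100)

logit = LogisticRegression().fit(X, y)

# The boundary is the line w0*x + w1*y + b = 0.
w0, w1 = logit.coef_[0]
b = logit.intercept_[0]

# Probability of "red" on a grid; it varies smoothly across the plane.
xs, ys = np.meshgrid(np.linspace(-3, 4, 100), np.linspace(-3, 4, 100))
grid = np.column_stack([xs.ravel(), ys.ravel()])
red_probability = logit.predict_proba(grid)[:, 1].reshape(xs.shape)
```

Here the color purity would correspond to the predicted probability rather than to a vote count, so the two pictures are comparable only qualitatively.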

(Typically, random forest is best suited for use with categorical features, but continuous features were used in this illustration because they are easier to visualize.)
