Multivariate Adaptive Regression Splines - The Basics


This section introduces MARS using a few examples. We start with a set of data: a matrix of input variables x, and a vector of the observed responses y, with a response for each row in x. For example, the data could be:

x y
10.5 16.4
10.7 18.8
10.8 19.7
... ...
20.6 77.0

Here there is only one independent variable, so the x matrix is just a single column. Given these measurements, we would like to build a model which predicts the expected y for a given x.

A linear model for the above data is:


\hat{y} = -37 + 5.1 x

The hat on ŷ indicates that ŷ is estimated from the data. The accompanying figure shows a plot of this function: a line giving the predicted ŷ versus x, with the original values of y shown as red dots.
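As a quick check of the linear model, the sketch below evaluates it on the four rows that are visible in the table above. The elided rows ("...") are not reproduced, and the helper name linear_predict is only illustrative.

```python
# Rows visible in the table above; the elided rows ("...") are left out.
observed = [(10.5, 16.4), (10.7, 18.8), (10.8, 19.7), (20.6, 77.0)]


def linear_predict(x):
    """Linear model quoted in the text: y_hat = -37 + 5.1 x."""
    return -37 + 5.1 * x


for x, y in observed:
    y_hat = linear_predict(x)
    # The residual is the observed value minus the predicted value.
    print(f"x={x:5.1f}  y={y:5.1f}  y_hat={y_hat:6.2f}  residual={y - y_hat:6.2f}")
```

On these rows the residual is small near x = 10.5 and grows to roughly 8.9 at x = 20.6, which is the kind of misfit at the extremes that the text goes on to discuss.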

The data at the extremes of x indicate that the relationship between y and x may be non-linear (compare the red dots with the regression line at low and high values of x). We thus turn to MARS to build a model that automatically takes the non-linearities into account. Given the same x and y, MARS software constructs the following model:


\begin{align}
\hat{y} = &\ 25 \\
& + 6.1 \max(0, x - 13) \\
& - 3.1 \max(0, 13 - x) \\
\end{align}

The accompanying figure shows a plot of this function: the predicted ŷ versus x, with the original values of y once again shown as red dots. The predicted response is now a better fit to the original y values.
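
Expanding the two max terms on either side of x = 13 shows the shape of the fitted curve. This is only a restatement of the expression above, with the intercepts worked out by hand:

\hat{y} = \begin{cases}
25 - 3.1 (13 - x) = -15.3 + 3.1 x & \text{if } x < 13 \\
25 + 6.1 (x - 13) = -54.3 + 6.1 x & \text{if } x \ge 13
\end{cases}

The slope changes from 3.1 to 6.1 at x = 13, and both pieces give ŷ = 25 there, so the curve is continuous but bends at that point.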

MARS has automatically produced a kink in the predicted ŷ to take the non-linearity into account. The kink is produced by hinge functions, which are the expressions starting with max (where max(a, b) is a if a > b, else b). Hinge functions are described in more detail below.
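
The hinge terms can be written directly in code. The following is a minimal sketch of the single-variable model quoted above; the function names hinge and mars_predict are chosen here for illustration and do not come from any MARS package.

```python
def hinge(z):
    """Hinge function max(0, z): zero for negative z, z itself otherwise."""
    return max(0.0, z)


def mars_predict(x):
    """Single-variable MARS model from the text, with its knot at x = 13."""
    return 25 + 6.1 * hinge(x - 13) - 3.1 * hinge(13 - x)


# Evaluate on the x values visible in the table, plus the knot itself.
for x in (10.5, 10.7, 10.8, 13.0, 20.6):
    print(f"x={x:5.1f}  y_hat={mars_predict(x):6.2f}")
```

The two hinges are mirror images of each other: hinge(x - 13) is active only to the right of the knot and hinge(13 - x) only to the left. At x = 13 both are zero, so the prediction equals the intercept, 25.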

In this simple example, we can easily see from the plot that y has a non-linear relationship with x (and might perhaps guess that y varies with the square of x). However, in general there will be multiple independent variables, and the relationship between y and these variables will be unclear and not easily visible by plotting. We can use MARS to discover that non-linear relationship.

An example MARS expression with multiple variables is:


\begin{align} \mathrm{ozone} = &\ 5.2 \\
& + 0.93 \max(0, \mathrm{temp} - 58) \\
& - 0.64 \max(0, \mathrm{temp} - 68) \\
& - 0.046 \max(0, 234 - \mathrm{ibt}) \\
& - 0.016 \max(0, \mathrm{wind} - 7) \max(0, 200 - \mathrm{vis})\\
\end{align}

This expression models air pollution (the ozone level) as a function of the temperature and a few other variables. Note that the last term in the formula (on the last line) incorporates an interaction between wind and vis.

The accompanying figure plots the predicted ozone as wind and vis vary, with the other variables fixed at their median values. The figure shows that wind does not affect the ozone level unless visibility is low. We see that MARS can build quite flexible regression surfaces by combining hinge functions.
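
The interaction can be checked numerically. The sketch below codes the ozone expression as written above. The values temp = 80 and ibt = 200 are arbitrary placeholders chosen for illustration, not the medians used for the figure, which are not given here.

```python
def hinge(z):
    """Hinge function max(0, z)."""
    return max(0.0, z)


def ozone_predict(temp, ibt, wind, vis):
    """Multi-variable MARS expression for ozone, as quoted in the text."""
    return (5.2
            + 0.93 * hinge(temp - 58)
            - 0.64 * hinge(temp - 68)
            - 0.046 * hinge(234 - ibt)
            # Interaction term: non-zero only when wind > 7 AND vis < 200.
            - 0.016 * hinge(wind - 7) * hinge(200 - vis))


# Placeholder values for the variables held fixed (assumed, not the medians).
TEMP, IBT = 80, 200

for vis in (250, 100):
    for wind in (5, 15):
        print(f"vis={vis:3d}  wind={wind:2d}  "
              f"ozone={ozone_predict(TEMP, IBT, wind, vis):6.2f}")
```

With vis = 250 the factor hinge(200 - vis) is zero, so changing wind leaves the prediction unchanged. With vis = 100 the same change in wind lowers the prediction, which matches the behaviour described for the figure.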

To obtain the above expression, the MARS model building procedure automatically selects which variables to use (some variables are important, others not), the positions of the kinks in the hinge functions, and how the hinge functions are combined.
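
To give a feel for how a kink position might be chosen, here is a heavily simplified sketch: a brute-force search over candidate knots for one variable, fitting an intercept plus a mirrored hinge pair by least squares and keeping the knot with the smallest residual sum of squares. It is not the full MARS procedure, which also selects variables, combines hinge functions, and prunes terms. The data are synthetic, generated from the single-variable model quoted earlier, and all function names are illustrative.

```python
def hinge(z):
    """Hinge function max(0, z)."""
    return max(0.0, z)


def solve(a, b):
    """Solve a small linear system a * coef = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_with_knot(xs, ys, knot):
    """Least-squares fit of y ~ c0 + c1*max(0, x-knot) + c2*max(0, knot-x)."""
    rows = [[1.0, hinge(x - knot), hinge(knot - x)] for x in xs]
    # Normal equations (A^T A) coef = A^T y for the three basis columns.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    coef = solve(ata, aty)
    rss = sum((y - sum(c * v for c, v in zip(coef, r))) ** 2
              for r, y in zip(rows, ys))
    return coef, rss


# Synthetic, noise-free data from the model in the text (assumption: not the
# original measurements, which are only partly shown).
xs = [10 + 0.5 * i for i in range(21)]
ys = [25 + 6.1 * hinge(x - 13) - 3.1 * hinge(13 - x) for x in xs]

# Try interior data points as candidate knots and keep the best-fitting one.
candidates = xs[2:-2]
fits = {knot: fit_with_knot(xs, ys, knot) for knot in candidates}
best_knot = min(fits, key=lambda k: fits[k][1])
best_coef, best_rss = fits[best_knot]
print("best knot:", best_knot)
print("coefficients:", [round(c, 3) for c in best_coef])
```

Because the synthetic data contain no noise, the search should recover the knot at x = 13 and coefficients close to 25, 6.1 and -3.1. On real data the fit would be approximate, and a practical implementation would need a way to avoid overfitting.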
