Generalized Linear Model - Intuition

Intuition

Ordinary linear regression predicts the expected value of a given unknown quantity (the response variable, a random variable) as a linear combination of a set of observed values (predictors). This implies that a constant change in a predictor leads to a constant change in the response variable (i.e. a linear-response model). This is appropriate when the response variable has a normal distribution (intuitively, when a response variable can vary essentially indefinitely in either direction with no fixed "zero value", or more generally for any quantity that only varies by a relatively small amount, e.g. human heights).

However, these assumptions are inappropriate for many types of response variables. For example, in many cases when the response variable must be positive and can vary over a wide scale, constant input changes lead to geometrically varying rather than constantly varying output changes. As an example, a model that predicts that each increase in 10 degrees leads to 1,000 more people going to a given beach is unlikely to generalize well over both small beaches (e.g. those where the expected attendance was 50 at the lower temperature) and large beaches (e.g. those where the expected attendance was 10,000 at the lower temperature). An even worse problem is that, since the model also implies that a drop in 10 degrees leads 1,000 fewer people going to a given beach, a beach whose expected attendance was 50 at the higher temperature would now be predicted to have the impossible attendance value of -950! Logically, a more realistic model would instead predict a constant rate of increased beach attendance (e.g. an increase in 10 degrees leads to a doubling in beach attendance, and a drop in 10 degrees leads to a halving in attendance). Such a model is termed an exponential-response model (or log-linear model, since the logarithm of the response is predicted to vary linearly).

Similarly, a model that predicts a probability of making a yes/no choice (a Bernoulli variable) is even less suitable as a linear-response model, since probabilities are bounded on both ends (they must be between 0 and 1). Imagine, for example, a model that predicts the likelihood of a given person going to the beach as a function of temperature. A reasonable model might predict, for example, that a change in 10 degrees makes a person two times more or less likely to go to the beach. But what does "twice as likely" mean in terms of a probability? It cannot literally mean to double the probability value (e.g. 50% becomes 100%, 75% becomes 150%, etc.). Rather, it is the odds that are doubling: from 2:1 odds, to 4:1 odds, to 8:1 odds, etc. Such a model is a log-odds model.

Generalized linear models cover all these situations by allowing for response variables that have arbitrary distributions (rather than simply normal distributions), and for an arbitrary function of the response variable (the link function) to vary linearly with the predicted values (rather than assuming that the response itself must vary linearly). For example, the case above of predicted number of beach attendees would typically be modeled with a Poisson distribution and a log link, while the case of predicted probability of beach attendance would typically be modeled with a Bernoulli distribution (or binomial distribution, depending on exactly how the problem is phrased) and a log-odds (or logit) link function.

Read more about this topic:  Generalized Linear Model

Famous quotes containing the word intuition:

    All appeared new, and strange at first, inexpressibly rare and delightful and beautiful. I was a little stranger, which at my entrance into the world was saluted and surrounded with innumerable joys. My knowledge was divine. I knew by intuition those things which since my Apostasy, I collected again by the highest reason.
    Thomas Traherne (1636–1674)