Multi-armed Bandit

In probability theory, the multi-armed bandit problem is the problem a gambler faces at a row of slot machines when deciding which machines to play, how many times to play each machine, and in which order to play them. When played, each machine provides a random reward drawn from a distribution specific to that machine. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls.
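The setting above can be sketched in code. This is a minimal illustration, not from the original text: the class name, the choice of Gaussian reward distributions, and all parameter values are assumptions made for the example.

```python
import random

class Bandit:
    """A hypothetical K-armed bandit: each arm pays a random reward
    drawn from its own distribution (Gaussian here, for illustration)."""

    def __init__(self, means, stdev=1.0, seed=None):
        self.means = list(means)        # true (hidden) mean reward of each arm
        self.stdev = stdev
        self.rng = random.Random(seed)

    def pull(self, arm):
        """Play one machine and receive a random reward from its distribution."""
        return self.rng.gauss(self.means[arm], self.stdev)

# The gambler's objective: maximize the sum of rewards over a sequence of pulls.
bandit = Bandit(means=[0.1, 0.5, 0.9], seed=0)
total = sum(bandit.pull(2) for _ in range(100))  # e.g. always playing arm 2
```

A strategy is then a rule for choosing which arm to pull next, given the rewards observed so far.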

Robbins, realizing the importance of the problem, constructed convergent population selection strategies in 1952 in "Some aspects of the sequential design of experiments".

The Gittins index, first published by John C. Gittins, gives an optimal policy for maximizing the expected discounted reward.

In practice, multi-armed bandits have been used to model the problem of managing research projects in a large organization, like a science foundation or a pharmaceutical company. Given its fixed budget, the problem is to allocate resources among the competing projects, whose properties are only partially known now but may be better understood as time passes.

In the early versions of the multi-armed bandit problem, the gambler has no initial knowledge about the levers. The crucial tradeoff the gambler faces at each trial is between "exploitation" of the lever that has the highest expected payoff and "exploration" to get more information about the expected payoffs of the other levers.
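One common way to balance this tradeoff is the epsilon-greedy strategy: with small probability explore a random lever, otherwise exploit the lever with the highest estimated payoff. The sketch below is illustrative only; the function name, the `pull` callback, and the parameter values are assumptions, and epsilon-greedy is just one of many bandit strategies.

```python
import random

def epsilon_greedy(pull, n_arms, n_rounds=1000, epsilon=0.1, seed=0):
    """Epsilon-greedy sketch: pull(arm) -> float is assumed to return
    a random reward for that arm. Returns total reward and the
    estimated mean payoff of each arm."""
    rng = random.Random(seed)
    counts = [0] * n_arms            # number of pulls per arm
    values = [0.0] * n_arms          # running mean reward per arm
    total = 0.0
    for _ in range(n_rounds):
        if rng.random() < epsilon:   # explore: try a random lever
            arm = rng.randrange(n_arms)
        else:                        # exploit: lever with highest estimated payoff
            arm = max(range(n_arms), key=lambda a: values[a])
        reward = pull(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        total += reward
    return total, values

# Usage with a toy bandit whose true mean rewards are 0.2, 0.5, 0.8:
means = [0.2, 0.5, 0.8]
r = random.Random(1)
total, est = epsilon_greedy(lambda a: r.gauss(means[a], 0.1), n_arms=3)
```

With enough rounds, the running estimates concentrate near the true means and exploitation settles on the best arm, while the occasional exploration guards against being misled by early unlucky draws.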

The multi-armed bandit is sometimes called a K-armed bandit or N-armed bandit.
