The Multi-armed Bandit Model

The multi-armed bandit (or just bandit for short) can be seen as a set of real reward distributions $B = \{R_1, \dots, R_K\}$, each distribution being associated with the rewards delivered by one of the $K$ levers. Let $\mu_1, \dots, \mu_K$ be the mean values associated with these reward distributions. The gambler iteratively plays one lever per round and observes the associated reward. The objective is to maximize the sum of the collected rewards. The horizon $H$ is the number of rounds that remain to be played. The bandit problem is formally equivalent to a one-state Markov decision process. The regret $\rho$ after $T$ rounds is defined as the difference between the reward sum associated with an optimal strategy and the sum of the collected rewards: $\rho = T\mu^* - \sum_{t=1}^{T} \hat{r}_t$, where $\mu^* = \max_k \mu_k$ is the maximal reward mean and $\hat{r}_t$ is the reward at time $t$. A strategy whose average regret per round tends to zero with probability 1 as the number of played rounds tends to infinity is called a zero-regret strategy. Intuitively, zero-regret strategies are guaranteed to converge to an optimal strategy, not necessarily a unique one, if enough rounds are played.
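
To make the definitions above concrete, the following minimal Python sketch simulates a $K$-armed bandit and then computes the regret as defined above. It assumes Bernoulli reward distributions and an ε-greedy playing rule; neither is prescribed by the model, and the function and variable names are illustrative only.

    import random

    # Minimal sketch of the bandit model described above, assuming Bernoulli
    # reward distributions and an epsilon-greedy strategy (both are
    # illustrative choices, not part of the model itself).

    def run_bandit(means, horizon=10_000, epsilon=0.1, seed=0):
        """Play one lever per round for `horizon` rounds and return the regret."""
        rng = random.Random(seed)
        k = len(means)
        counts = [0] * k          # number of pulls per lever
        estimates = [0.0] * k     # empirical mean reward per lever
        total_reward = 0.0

        for t in range(horizon):
            # Explore with probability epsilon, otherwise exploit the best estimate.
            if rng.random() < epsilon:
                arm = rng.randrange(k)
            else:
                arm = max(range(k), key=lambda i: estimates[i])

            # Draw a Bernoulli reward from the chosen lever's distribution.
            reward = 1.0 if rng.random() < means[arm] else 0.0
            total_reward += reward

            # Incremental update of the empirical mean for the chosen lever.
            counts[arm] += 1
            estimates[arm] += (reward - estimates[arm]) / counts[arm]

        # Regret: rho = T * mu_star - sum of collected rewards.
        mu_star = max(means)
        return horizon * mu_star - total_reward

    if __name__ == "__main__":
        regret = run_bandit(means=[0.2, 0.5, 0.7], horizon=10_000)
        print(f"regret after 10000 rounds: {regret:.1f}")

Note that with a fixed ε this sketch keeps exploring forever, so it is not a zero-regret strategy; letting ε decay over the rounds (an ε-decreasing schedule) is one standard way to obtain the convergence property described above.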
