Markov Decision Process - Algorithms

MDPs can be solved by linear programming or dynamic programming. In what follows we present the latter approach.

Suppose we know the state transition function and the reward function, and we wish to calculate the policy that maximizes the expected discounted reward.

The standard family of algorithms to calculate this optimal policy requires storage for two arrays indexed by state: value V, which contains real values, and policy π, which contains actions. At the end of the algorithm, π will contain the solution and V(s) will contain the discounted sum of the rewards to be earned (on average) by following that solution from state s.
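For concreteness, a minimal sketch of this storage is given below; the state and action names are hypothetical and not taken from the text above.

    # Hypothetical state and action sets, used only to illustrate the two arrays.
    states = ["s0", "s1", "s2"]
    actions = ["stay", "move"]

    # value: one real number per state; policy: one action per state.
    value = {s: 0.0 for s in states}
    policy = {s: actions[0] for s in states}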

The algorithm has the following two kinds of steps, which are repeated in some order for all the states until no further changes take place. They are a value update and a policy update, each of which recomputes one array entry from the current estimate of the other:

    V(s) := Σ_{s'} P_{π(s)}(s, s') ( R_{π(s)}(s, s') + γ V(s') )

    π(s) := argmax_a Σ_{s'} P_a(s, s') ( R_a(s, s') + γ V(s') )

where P_a(s, s') is the probability of moving from state s to state s' under action a (the state transition function), R_a(s, s') is the corresponding reward (the reward function), and γ is the discount factor.

Their order depends on the variant of the algorithm; one can also do them for all states at once or state by state, and more often to some states than others. As long as no state is permanently excluded from either of the steps, the algorithm will eventually arrive at the correct solution.
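A minimal sketch of one common variant (policy iteration, alternating full sweeps of the two updates) is given below. The toy transition and reward tables, the state and action names, and the discount factor are all assumptions introduced only to make the example self-contained, not part of the text above.

    # Policy iteration on a tiny hypothetical MDP: repeat the value update and the
    # policy update over all states until the policy no longer changes.

    # P[s][a] is a list of (probability, next_state, reward) triples; the numbers
    # are made up purely for illustration.
    P = {
        "s0": {"stay": [(1.0, "s0", 0.0)],
               "move": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
        "s1": {"stay": [(1.0, "s1", 0.5)],
               "move": [(1.0, "s2", 2.0)]},
        "s2": {"stay": [(1.0, "s2", 0.0)],
               "move": [(1.0, "s0", 0.0)]},
    }
    gamma = 0.9  # discount factor (assumed)

    states = list(P)
    actions = ["stay", "move"]

    value = {s: 0.0 for s in states}      # V, real values indexed by state
    policy = {s: "stay" for s in states}  # π, actions indexed by state

    def q(s, a, V):
        """Expected discounted return of taking action a in state s, then following V."""
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    changed = True
    while changed:
        # Value update: sweep the states until the values of the current policy stabilise.
        for _ in range(1000):
            delta = 0.0
            for s in states:
                new_v = q(s, policy[s], value)
                delta = max(delta, abs(new_v - value[s]))
                value[s] = new_v
            if delta < 1e-9:
                break

        # Policy update: make each state's action greedy with respect to the current values.
        changed = False
        for s in states:
            best = max(actions, key=lambda a: q(s, a, value))
            if best != policy[s]:
                policy[s] = best
                changed = True

    print(policy)  # final policy for this toy MDP
    print(value)   # its discounted values per state

Running the value update to convergence before each policy update, as here, gives policy iteration; interleaving a single value sweep with the greedy update instead gives value iteration. Both are covered by the convergence statement above, provided no state is permanently excluded from either step.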
