Intelligent Systems Lecture Notes

23 November 2011 • Markov Decision Processes


From the definition of \(V^\pi\):
(1) \(V^\pi(s) = Q^\pi(s, \pi(s))\)
From the definition of \(Q^\pi\):
(2) \(Q^\pi(s, \pi(s)) = \sum_i \Pr(s_i \mid s, \pi(s))\,\bigl(R(s, \pi(s), s_i) + \gamma V^\pi(s_i)\bigr)\)
Distribute the summation in equation 2 over the two terms to get
(3) \(Q^\pi(s, \pi(s)) = \sum_i \Pr(s_i \mid s, \pi(s))\,R(s, \pi(s), s_i) + \sum_i \Pr(s_i \mid s, \pi(s))\,\gamma V^\pi(s_i)\)
The first summation is a constant; call it \(c_0\). The product \(\Pr(s_i \mid s, \pi(s))\,\gamma\) in the second summation is also a constant; call it \(c_i\). Substitute the constants and expand the summation in equation 3 to get
(4) \(Q^\pi(s, \pi(s)) = c_0 + c_1 V^\pi(s_1) + \cdots + c_n V^\pi(s_n)\)
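The constants above are straightforward to compute from the transition model. A small numeric sketch (the probabilities, rewards, and discount below are made-up illustrations, not values from the notes):

```python
import numpy as np

# Hypothetical numbers: from state s, the policy's action pi(s) leads to
# n = 2 successor states s_1, s_2 with these probabilities and rewards.
gamma = 0.9
pr = np.array([0.7, 0.3])    # Pr(s_i | s, pi(s))
r = np.array([5.0, -1.0])    # R(s, pi(s), s_i)

c0 = np.dot(pr, r)           # first summation: expected immediate reward
c = gamma * pr               # c_i = Pr(s_i | s, pi(s)) * gamma

print(c0)                    # 3.2
print(c)                     # [0.63, 0.27]
```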
Use equation 1 to substitute for the left-hand side of equation 4 to get
(5) \(V^\pi(s) = c_0 + c_1 V^\pi(s_1) + \cdots + c_n V^\pi(s_n)\)
When \(s\) is itself one of the states \(s_i\), the left-hand side of equation 5 can be brought over to the right and combined with the matching term, defining the linear equation
(6) \(0 = c_0 + c_1 V^\pi(s_1) + \cdots + (c_i - 1) V^\pi(s_i) + \cdots + c_n V^\pi(s_n)\)
Each of the n states si defines an instance of equation 6, yielding a system of n linear equations in the n unknowns Vπ(s1), …, Vπ(sn).
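The system of equations above can be assembled and solved directly with a standard linear solver. A minimal sketch, assuming a hypothetical 3-state, 2-action MDP (the transition probabilities, rewards, and policy below are invented for illustration):

```python
import numpy as np

# Hypothetical MDP: P[s, a, s'] = Pr(s' | s, a), R[s, a, s'] = reward,
# pi[s] = the action the (deterministic) policy picks in state s.
n = 3
gamma = 0.9
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],
    [[0.0, 0.5, 0.5], [0.3, 0.0, 0.7]],
    [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
])
R = np.ones((n, 2, n))       # reward of 1 on every transition, for simplicity
pi = np.array([0, 1, 0])

# Build one copy of equation 6 per state s: 0 = c0 + sum_j c_j V(s_j),
# where the V(s) brought over from the left appears as (c_s - 1) V(s).
A = np.zeros((n, n))
b = np.zeros(n)
for s in range(n):
    a = pi[s]
    b[s] = np.dot(P[s, a], R[s, a])   # c0: the constant first summation
    A[s] = gamma * P[s, a]            # c_j = gamma * Pr(s_j | s, pi(s))
    A[s, s] -= 1.0                    # combine with the term for s itself

# Equation 6 reads 0 = b + A v, so solve A v = -b for the n unknowns.
v = np.linalg.solve(A, -b)
print(v)                              # V^pi(s_1), ..., V^pi(s_n)
```

Solving the system in one shot like this is an alternative to iterative policy evaluation; it is exact but costs O(n^3), so it suits small state spaces.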
This page last modified on 2006 January 24.