Intelligent Systems Lecture Notes

23 November 2011 • Markov Decision Processes


From the definition of \(V^\pi\):
(1) \(V^\pi(s) = Q^\pi(s, \pi(s))\)
From the definition of \(Q^\pi\):
(2) \(Q^\pi(s, \pi(s)) = \sum_i \Pr(s_i \mid s, \pi(s))\,\bigl(R(s, \pi(s), s_i) + \gamma V^\pi(s_i)\bigr)\)
Distribute the summation in equation 2 over the two terms to get
(3) \(Q^\pi(s, \pi(s)) = \sum_i \Pr(s_i \mid s, \pi(s))\,R(s, \pi(s), s_i) + \sum_i \Pr(s_i \mid s, \pi(s))\,\gamma V^\pi(s_i)\)
The first summation is a constant; call it \(c_0\). The product \(\Pr(s_i \mid s, \pi(s))\,\gamma\) in the second summation is also a constant; call it \(c_i\). Substitute the constants and expand the summation in equation 3 to get
(4) \(Q^\pi(s, \pi(s)) = c_0 + c_1 V^\pi(s_1) + \cdots + c_n V^\pi(s_n)\)
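The constants above are straightforward to compute from the transition model. A small numeric sketch (the probabilities, rewards, and discount below are made-up illustrations, not values from the notes):

```python
import numpy as np

# Hypothetical numbers: from state s, the policy's action pi(s) leads to
# n = 2 successor states s_1, s_2 with these probabilities and rewards.
gamma = 0.9
pr = np.array([0.7, 0.3])    # Pr(s_i | s, pi(s))
r = np.array([5.0, -1.0])    # R(s, pi(s), s_i)

c0 = np.dot(pr, r)           # first summation: expected immediate reward
c = gamma * pr               # c_i = Pr(s_i | s, pi(s)) * gamma

print(c0)                    # 3.2
print(c)                     # [0.63, 0.27]
```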
Use equation 1 to substitute for the left-hand side of equation 4 to get
(5) \(V^\pi(s) = c_0 + c_1 V^\pi(s_1) + \cdots + c_n V^\pi(s_n)\)
When \(s\) is itself one of the states \(s_i\), the left-hand side of equation 5 can be brought over to the right and combined with the matching term, defining the linear equation
(6) \(0 = c_0 + c_1 V^\pi(s_1) + \cdots + (c_i - 1) V^\pi(s_i) + \cdots + c_n V^\pi(s_n)\)
Each of the n states si defines an instance of equation 6, yielding a system of n linear equations in the n unknowns Vπ(s1), …, Vπ(sn).
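The system of equations above can be assembled and solved directly with a standard linear solver. A minimal sketch, assuming a hypothetical 3-state, 2-action MDP (the transition probabilities, rewards, and policy below are invented for illustration):

```python
import numpy as np

# Hypothetical MDP: P[s, a, s'] = Pr(s' | s, a), R[s, a, s'] = reward,
# pi[s] = the action the (deterministic) policy picks in state s.
n = 3
gamma = 0.9
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],
    [[0.0, 0.5, 0.5], [0.3, 0.0, 0.7]],
    [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
])
R = np.ones((n, 2, n))       # reward of 1 on every transition, for simplicity
pi = np.array([0, 1, 0])

# Build one copy of equation 6 per state s: 0 = c0 + sum_j c_j V(s_j),
# where the V(s) brought over from the left appears as (c_s - 1) V(s).
A = np.zeros((n, n))
b = np.zeros(n)
for s in range(n):
    a = pi[s]
    b[s] = np.dot(P[s, a], R[s, a])   # c0: the constant first summation
    A[s] = gamma * P[s, a]            # c_j = gamma * Pr(s_j | s, pi(s))
    A[s, s] -= 1.0                    # combine with the term for s itself

# Equation 6 reads 0 = b + A v, so solve A v = -b for the n unknowns.
v = np.linalg.solve(A, -b)
print(v)                              # V^pi(s_1), ..., V^pi(s_n)
```

Solving the system in one shot like this is an alternative to iterative policy evaluation; it is exact but costs O(n^3), so it suits small state spaces.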
This page last modified on 2006 January 24.