From the definition of \(V^\pi\):
(1) \(V^\pi(s) = Q^\pi(s, \pi(s))\)
From the definition of \(Q^\pi\):
(2) \(Q^\pi(s, \pi(s)) = \sum_{s_i} \Pr(s_i \mid s, \pi(s)) \bigl(R(s, \pi(s), s_i) + \gamma V^\pi(s_i)\bigr)\)
Distribute the summation over the two terms in equation 2 to get
(3) \(Q^\pi(s, \pi(s)) = \sum_{s_i} \Pr(s_i \mid s, \pi(s)) R(s, \pi(s), s_i) + \sum_{s_i} \Pr(s_i \mid s, \pi(s)) \gamma V^\pi(s_i)\)
For a fixed state \(s\) and policy \(\pi\), the first summation is a constant; call it \(c_0\). In the second summation, each product \(\gamma \Pr(s_i \mid s, \pi(s))\) is also a constant; call it \(c_i\). Substitute these constants and expand the summation in equation 3 to get
(4) \(Q^\pi(s, \pi(s)) = c_0 + c_1 V^\pi(s_1) + \cdots + c_n V^\pi(s_n)\)
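As a rough illustration of how the constants in equation 4 could be computed for one state, here is a minimal sketch. The array layout (`P[s, a, t]` for transition probabilities, `R[s, a, t]` for rewards), the function name `bellman_constants`, and the two-state MDP are all assumptions made for this example; none of them come from the derivation itself.

```python
import numpy as np

def bellman_constants(P, R, policy, gamma, s):
    """Return (c0, c) for state s under a fixed policy.

    Assumed layout (hypothetical):
      P[s, a, t] = Pr(t | s, a)
      R[s, a, t] = reward for moving from s to t under action a
    """
    a = policy[s]
    # c0: the first summation in equation 3 (expected immediate reward).
    c0 = float(np.dot(P[s, a], R[s, a]))
    # c[i]: gamma * Pr(s_i | s, pi(s)), the coefficient of V(s_i).
    c = gamma * P[s, a]
    return c0, c

# Hypothetical 2-state, 2-action MDP used only for illustration.
P = np.array([[[0.5, 0.5], [1.0, 0.0]],
              [[0.0, 1.0], [0.2, 0.8]]])
R = np.array([[[1.0, 0.0], [0.0, 0.0]],
              [[0.0, 2.0], [0.0, 0.0]]])
policy = [0, 0]

c0, c = bellman_constants(P, R, policy, gamma=0.9, s=0)
print(c0, c)
```

Note that the constants depend on the state `s` passed in, so each state gets its own `c0` and its own coefficient vector.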
Use equation 1 to substitute the left-hand side of equation 4 to get
(5) \(V^\pi(s) = c_0 + c_1 V^\pi(s_1) + \cdots + c_n V^\pi(s_n)\)
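Because \(s\) is itself one of the states, say \(s = s_i\), the left-hand side of equation 5 can be brought over to the right and combined with the \(c_i V^\pi(s_i)\) term, which replaces \(c_i\) with \(c_i - 1\), to define the linear equation
(6) \(0 = c_0 + c_1 V^\pi(s_1) + \cdots + (c_i - 1) V^\pi(s_i) + \cdots + c_n V^\pi(s_n)\)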
Each of the \(n\) states \(s_i\) defines its own version of equation 6, with its own constants, giving a system of \(n\) linear equations in the \(n\) unknowns \(V^\pi(s_1), \ldots, V^\pi(s_n)\).
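The system can be solved numerically. The sketch below is one possible way to do it, assuming the same hypothetical array layout (`P[s, a, t]`, `R[s, a, t]`) and an illustrative two-state MDP that is not part of the original notes. Stacking equation 6 for every state and rearranging gives \((I - \gamma P_\pi) V^\pi = c_0\), where row \(s\) of \(P_\pi\) holds \(\Pr(s_i \mid s, \pi(s))\); the code hands that system to `numpy.linalg.solve`.

```python
import numpy as np

def evaluate_policy(P, R, policy, gamma):
    """Solve the n linear equations for V under a fixed policy.

    Assumed layout (hypothetical):
      P[s, a, t] = Pr(t | s, a)
      R[s, a, t] = reward for moving from s to t under action a
    """
    n = P.shape[0]
    states = np.arange(n)
    # Row s holds the transition probabilities and rewards for action policy[s].
    P_pi = P[states, policy]
    R_pi = R[states, policy]
    # c0 for every state: expected immediate reward.
    c0 = np.sum(P_pi * R_pi, axis=1)
    # Equation 6 for all states at once: (I - gamma * P_pi) V = c0.
    A = np.eye(n) - gamma * P_pi
    return np.linalg.solve(A, c0)

# Hypothetical 2-state, 2-action MDP used only for illustration.
P = np.array([[[0.5, 0.5], [1.0, 0.0]],
              [[0.0, 1.0], [0.2, 0.8]]])
R = np.array([[[1.0, 0.0], [0.0, 0.0]],
              [[0.0, 2.0], [0.0, 0.0]]])
policy = [0, 0]
gamma = 0.9

V = evaluate_policy(P, R, policy, gamma)
print(V)
```

With \(\gamma < 1\) the matrix \(I - \gamma P_\pi\) is nonsingular, so the solve step has a unique solution. A direct solve is a reasonable choice for small state spaces; for large ones an iterative method may be preferable.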
This page last modified on 2006 January 24.