Policy evaluation: following a fixed policy π, estimate its value function \(V^\pi\) from experience.
A trajectory has the form \(s_0, a_0, r_0, \ldots, s_i, a_i, r_i, \ldots, s_k, -, r_k\), where the terminal state \(s_k\) has no action.
\(Q^\pi(s, a) = \sum_n \Pr(n \mid s, a)\,(R(s, a, n) + \gamma V^\pi(n))\)
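Written as code, this one-step lookahead is a single sum over next states. A minimal sketch, where Pr (keyed [n, s, a], matching the pseudocode below), R, and V are hypothetical dictionaries:

    def q_value(s, a, states, Pr, R, V, gamma):
        # Q(s, a) = sum over n of Pr(n | s, a) * (R(s, a, n) + gamma * V(n)).
        return sum(Pr[n, s, a] * (R[s, a, n] + gamma * V[n]) for n in states)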
For example, suppose the observed trajectories contain three transitions from state 11 under action up:

    …, 11, up, 12, …
    …, 11, up, 21, …
    …, 11, up, 12, …

Counting these gives the empirical estimates

    Pr(12 | 11, up) = 2/3
    Pr(21 | 11, up) = 1/3
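A minimal sketch of this counting estimate in Python; the transition list and state labels come from the example above, and all names are illustrative:

    from collections import Counter

    # Observed (state, action, next_state) transitions from the example.
    transitions = [("11", "up", "12"), ("11", "up", "21"), ("11", "up", "12")]

    # Count each (s, a) pair and each full (s, a, n) triple.
    n_sa = Counter((s, a) for s, a, _ in transitions)
    n_san = Counter(transitions)

    # Maximum-likelihood estimate: Pr(n | s, a) = N[s, a, n] / N[s, a].
    pr = {(s, a, n): c / n_sa[s, a] for (s, a, n), c in n_san.items()}

    print(pr["11", "up", "12"])   # 0.666...
    print(pr["11", "up", "21"])   # 0.333...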
passive ADP(sn, rn)
    if V[sn] = null then V[sn] ← R[sn] ← rn
    if sc ≠ null then
        Nsa[sc, ac]++
        Nssa[sn, sc, ac]++
        for each s ∈ S do
            Pr[s, sc, ac] ← Nssa[s, sc, ac] / Nsa[sc, ac]
        V ← the solution to the related linear equations
    if sn is terminal
        then sc, ac ← null
        else sc, ac ← sn, π[sn]
    return ac
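A runnable Python sketch of this update; the fixed policy pi (a dict from states to actions), the known finite state set, and gamma are assumptions, and solving "the related linear equations" here means solving \((I - \gamma P)V = r\) for the current model estimate:

    import numpy as np

    class PassiveADP:
        def __init__(self, states, pi, gamma=0.9):
            self.states = list(states)      # assumed known, finite state set
            self.idx = {s: i for i, s in enumerate(self.states)}
            self.pi = pi                    # fixed policy: state -> action
            self.gamma = gamma
            self.R = {}                     # first observed reward per state
            self.terminal = set()
            self.Nsa = {}                   # counts of (sc, ac)
            self.Nssa = {}                  # counts of (sn, sc, ac)
            self.prev = None                # (sc, ac) from the previous call
            self.V = np.zeros(len(self.states))

        def observe(self, sn, rn, terminal=False):
            # First visit to sn: record its reward.
            self.R.setdefault(sn, rn)
            if terminal:
                self.terminal.add(sn)
            if self.prev is not None:
                sc, ac = self.prev
                self.Nsa[sc, ac] = self.Nsa.get((sc, ac), 0) + 1
                self.Nssa[sn, sc, ac] = self.Nssa.get((sn, sc, ac), 0) + 1
                self._solve()
            if terminal:
                self.prev = None
                return None
            ac = self.pi[sn]
            self.prev = (sn, ac)
            return ac

        def _solve(self):
            # Build the estimated transition matrix under pi and solve
            # V = r + gamma P V, i.e. (I - gamma P) V = r.
            k = len(self.states)
            P, r = np.zeros((k, k)), np.zeros(k)
            for s in self.states:
                i = self.idx[s]
                r[i] = self.R.get(s, 0.0)
                sa = (s, self.pi[s])
                if s in self.terminal or sa not in self.Nsa:
                    continue
                for n in self.states:
                    c = self.Nssa.get((n,) + sa, 0)
                    P[i, self.idx[n]] = c / self.Nsa[sa]
            self.V = np.linalg.solve(np.eye(k) - self.gamma * P, r)

Note that each observation triggers a full solve of the linear system, which is what makes ADP sample-efficient but computationally heavy per step.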
The simple average \((v_1 + \cdots + v_n)/n\) treats (weights) all values the same.
Let \(A_k\) be the average of \(k\) samples:

    \(A_k = (v_1 + \cdots + v_k)/k\)

Multiply by \(k\):

    \(k A_k = v_1 + \cdots + v_{k-1} + v_k = (k - 1)A_{k-1} + v_k\)

Divide by \(k\):

    \(A_k = (1 - 1/k)A_{k-1} + v_k/k\)

Let \(\alpha_k = 1/k\):

    \(A_k = (1 - \alpha_k)A_{k-1} + \alpha_k v_k = A_{k-1} + \alpha_k(v_k - A_{k-1})\)
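A quick check that the incremental form computes the ordinary average; running_average is a hypothetical helper name:

    def running_average(values):
        # A_k = A_{k-1} + alpha_k * (v_k - A_{k-1}), with alpha_k = 1/k.
        a = 0.0
        for k, v in enumerate(values, start=1):
            a += (v - a) / k
        return a

    vals = [2.0, 4.0, 9.0]
    assert abs(running_average(vals) - sum(vals) / len(vals)) < 1e-12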
Q-learning(S, A, γ, α)
    Q[s, a] ← arbitrary initial values, for all s, a
    s ← start state
    repeat
        pick a from A and perform it in state s
        observe the reward r and next state n
        Q[s, a] ← Q[s, a] + α(r + γ max_{a′} Q[n, a′] − Q[s, a])
        s ← n
    until done
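A runnable sketch of tabular Q-learning, assuming an environment interface step(s, a) -> (r, n, done) and using ε-greedy exploration as one concrete way to "pick a from A"; all names are illustrative:

    import random

    def q_learning(states, actions, step, start, gamma=0.9, alpha=0.1,
                   epsilon=0.1, episodes=500):
        Q = {(s, a): 0.0 for s in states for a in actions}
        for _ in range(episodes):
            s, done = start, False
            while not done:
                # Epsilon-greedy: explore with probability epsilon, else exploit.
                if random.random() < epsilon:
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda x: Q[s, x])
                r, n, done = step(s, a)
                # TD update toward r + gamma * max_a' Q[n, a'] (just r at the end).
                target = r if done else r + gamma * max(Q[n, a2] for a2 in actions)
                Q[s, a] += alpha * (target - Q[s, a])
                s = n
        return Q

Unlike passive ADP, this update needs no transition model and no linear solve; each step costs only a max over the actions.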