[Figure panels: a policy \(\pi\) and its value function \(V^\pi\)]
\(s_0, a_0, r_0, \ldots, s_i, a_i, r_i, \ldots, s_k, \text{–}, r_k\)


\(Q^\pi(s, a) = \sum_n \Pr(n \mid s, a)\,\bigl(R(s, a, n) + \gamma V^\pi(n)\bigr)\)
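As a rough illustration of the formula above, the following Python sketch computes \(Q^\pi(s, a)\) from a transition model and a table of values. The function name `q_value`, the dictionary layout of `pr`, and all numbers are hypothetical and chosen only for this example.

```python
def q_value(s, a, pr, reward, v, gamma):
    """Q^pi(s, a) = sum over n of Pr(n | s, a) * (R(s, a, n) + gamma * V^pi(n))."""
    return sum(p * (reward(s, a, n) + gamma * v[n])
               for n, p in pr[(s, a)].items())

# Hypothetical model: in state "s", action "a" leads to "n1" or "n2".
pr = {("s", "a"): {"n1": 0.8, "n2": 0.2}}
v = {"n1": 10.0, "n2": 0.0}            # made-up values of V^pi


def reward(s, a, n):
    return 1.0                          # made-up constant reward


q = q_value("s", "a", pr, reward, v, gamma=0.9)
print(q)  # 0.8 * (1 + 9) + 0.2 * (1 + 0) = 8.2, up to floating-point rounding
```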
…, 11, up, 12, …
…, 11, up, 21, …
…, 11, up, 12, …
\(\Pr(12 \mid 11, \text{up}) = 2/3\)
\(\Pr(21 \mid 11, \text{up}) = 1/3\)
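The estimates above are relative frequencies: of the three observed experiences that take action up in state 11, two end in state 12 and one ends in state 21. A minimal sketch of that counting, with hypothetical names `n_sa`, `n_sas`, and `pr_hat`:

```python
from collections import Counter
from fractions import Fraction

# The three observed (state, action, next state) experiences.
experiences = [("11", "up", "12"), ("11", "up", "21"), ("11", "up", "12")]

n_sa = Counter()    # how often action a was taken in state s
n_sas = Counter()   # how often that led to next state n
for s, a, n in experiences:
    n_sa[(s, a)] += 1
    n_sas[(s, a, n)] += 1


def pr_hat(n, s, a):
    """Relative-frequency estimate of Pr(n | s, a)."""
    return Fraction(n_sas[(s, a, n)], n_sa[(s, a)])


print(pr_hat("12", "11", "up"))  # 2/3
print(pr_hat("21", "11", "up"))  # 1/3
```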
passive ADP(sn, rn)
    if V[sn] = null then V[sn] = R[sn] = rn
    if sc ≠ null
        Nsa[sc, ac]++
        Nssa[sn, sc, ac]++
        for each s ∈ S do
            Pr[s, sc, ac] ← Nssa[s, sc, ac]/Nsa[sc, ac]
        V ← the solution to the related linear equations
    if sn is terminal
        then sc, ac ← null
        else sc, ac ← sn, π[sn]
    return ac
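The passive ADP procedure can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the course's implementation: the class name `PassiveADP` and its parameters are hypothetical, the "related linear equations" are assumed to be \(V(s) = R(s) + \gamma \sum_n \Pr(n \mid s, \pi(s))\,V(n)\), and repeated sweeps of that equation stand in for an exact linear solve.

```python
from collections import defaultdict


class PassiveADP:
    """Sketch of a passive ADP agent that follows a fixed policy."""

    def __init__(self, policy, terminals, gamma=0.9, sweeps=200):
        self.policy = policy             # pi: state -> action
        self.terminals = set(terminals)
        self.gamma = gamma
        self.sweeps = sweeps             # assumption: iterate instead of solving exactly
        self.V = {}                      # value estimates
        self.R = {}                      # reward observed in each state
        self.n_sa = defaultdict(int)     # Nsa[sc, ac]
        self.n_ssa = defaultdict(int)    # Nssa[sn, sc, ac]
        self.sc = self.ac = None         # current state and action

    def _evaluate(self):
        # Approximates "V <- the solution to the related linear equations".
        for _ in range(self.sweeps):
            for s in self.V:
                if s in self.terminals:
                    self.V[s] = self.R[s]
                    continue
                a = self.policy[s]
                total = self.n_sa[(s, a)]
                if total == 0:
                    continue             # no transitions observed from s yet
                expected_next = sum(
                    self.n_ssa[(n, s, a)] / total * self.V[n] for n in self.V
                )
                self.V[s] = self.R[s] + self.gamma * expected_next

    def step(self, sn, rn):
        """Process new state sn with reward rn; return the next action."""
        if sn not in self.V:
            self.V[sn] = self.R[sn] = rn
        if self.sc is not None:
            self.n_sa[(self.sc, self.ac)] += 1
            self.n_ssa[(sn, self.sc, self.ac)] += 1
            self._evaluate()
        if sn in self.terminals:
            self.sc = self.ac = None
        else:
            self.sc, self.ac = sn, self.policy[sn]
        return self.ac


# Hypothetical two-state world: "A" always leads to terminal "B" (reward 1).
agent = PassiveADP(policy={"A": "go"}, terminals={"B"}, gamma=0.9)
first = agent.step("A", 0.0)
second = agent.step("B", 1.0)
```

After these two steps the agent has seen one transition from A to B, so its estimate for A is the discounted value of B.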
The plain average \((v_1 + \cdots + v_n)/n\) treats (weights) all values the same.
Let \(A_k\) be the average of \(k\) samples:
\[ A_k = \frac{v_1 + \cdots + v_k}{k} \]
Multiply by \(k\):
\[ k A_k = v_1 + \cdots + v_{k-1} + v_k = (k-1) A_{k-1} + v_k \]
Divide by \(k\):
\[ A_k = \left(1 - \frac{1}{k}\right) A_{k-1} + \frac{v_k}{k} \]
Let \(\alpha_k = 1/k\):
\[ A_k = (1 - \alpha_k) A_{k-1} + \alpha_k v_k = A_{k-1} + \alpha_k (v_k - A_{k-1}) \]
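A quick numerical check of the incremental form: with a step size of 1/k, updating the running estimate by a fraction of the difference between the new sample and the old estimate reproduces the ordinary average. The function name `running_average` and the sample values are made up for this sketch.

```python
def running_average(values):
    """Incremental average: A_k = A_{k-1} + alpha_k * (v_k - A_{k-1}), alpha_k = 1/k."""
    a = 0.0
    for k, v in enumerate(values, start=1):
        alpha = 1.0 / k
        a = a + alpha * (v - a)
    return a


samples = [4.0, 8.0, 6.0, 2.0]           # made-up samples
incremental = running_average(samples)
batch = sum(samples) / len(samples)
print(incremental, batch)  # 5.0 5.0
```

Only the previous average and the count need to be stored, not the samples themselves.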
Q-learning(S, A, γ, α)
    Q[s, a] ← arbitrary initial value, for all s ∈ S, a ∈ A
    s ← start state
    repeat
        pick a from A and perform it in state s
        observe the reward r and next state n
        Q[s, a] ← Q[s, a] + α(r + γ max_an Q[n, an] - Q[s, a])
        s ← n
    until done
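The Q-learning loop can be sketched in Python as below. Several details are assumptions made for this illustration and are not specified by the pseudocode: actions are picked uniformly at random, learning is split into episodes that restart at the start state, Q is initialized to zero, and the environment is a hypothetical four-state chain (`chain_step`) that pays reward 1 on reaching the last state.

```python
import random


def q_learning(states, actions, step, start, gamma, alpha, episodes=500, seed=0):
    """Tabular Q-learning with a uniformly random behaviour policy (an assumption)."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in states for a in actions}   # "whatever" initial values
    for _ in range(episodes):
        s = start
        done = False
        while not done:
            a = rng.choice(actions)            # pick a from A
            r, n, done = step(s, a)            # observe reward r and next state n
            best_next = max(Q[(n, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = n
    return Q


def chain_step(s, a):
    """Hypothetical chain 0-1-2-3: entering state 3 pays 1 and ends the episode."""
    n = min(s + 1, 3) if a == "right" else max(s - 1, 0)
    done = n == 3
    return (1.0 if done else 0.0), n, done


actions = ["left", "right"]
Q = q_learning([0, 1, 2, 3], actions, chain_step, start=0, gamma=0.9, alpha=0.5)
greedy = {s: max(actions, key=lambda b: Q[(s, b)]) for s in (0, 1, 2)}
print(greedy)
```

Entries for the terminal state are never updated, so they stay at zero and contribute nothing to the target. Because the update uses the maximum over next actions, the learned values do not depend on the random behaviour policy, provided every state-action pair keeps being tried.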