TD(λ) Converges with Probability 1.

Peter Dayan   Terry Sejnowski
Machine Learning, 14, 295-301.


Abstract

The methods of temporal differences (Samuel, 1959; Sutton, 1984, 1988) allow agents to learn accurate predictions of stationary stochastic future outcomes. The learning is effectively stochastic approximation based on samples extracted from the process generating the agent's future. Sutton (1988) proved that for a special case of temporal differences the expected values of the predictions converge to their correct values as more samples are taken, and Dayan (1992) extended his proof to the general case. This paper proves the stronger result that the predictions of a slightly modified form of temporal difference learning converge with probability one, and shows how to quantify the rate of convergence.
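As a rough illustration of the prediction method the abstract describes, the following is a minimal sketch of tabular TD(λ) with eligibility traces on sampled transitions; the function name, state representation, and the step-size, λ, and discount parameters are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def td_lambda(episodes, n_states, alpha=0.1, lam=0.9, gamma=1.0):
    """Estimate state values V from sampled episodes.

    episodes: iterable of lists of (state, reward, next_state) transitions,
              with next_state = None at absorption.
    """
    V = np.zeros(n_states)
    for episode in episodes:
        e = np.zeros(n_states)                            # eligibility traces
        for state, reward, next_state in episode:
            v_next = 0.0 if next_state is None else V[next_state]
            delta = reward + gamma * v_next - V[state]    # temporal-difference error
            e[state] += 1.0                               # accumulating trace for visited state
            V += alpha * delta * e                        # update every traced state
            e *= gamma * lam                              # decay traces toward zero
    return V
```

With a suitably decreasing step size α, updates of this form are the stochastic-approximation iterations whose almost-sure convergence the paper establishes.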