TD(λ) Converges with Probability 1.
Peter Dayan and Terry Sejnowski
Machine Learning, 14, 295-301.
Abstract
The methods of temporal differences (Samuel, 1959; Sutton, 1984, 1988)
allow agents to learn accurate predictions of stationary stochastic
future outcomes. The learning is effectively stochastic approximation
based on samples extracted from the process generating the agent's
future.
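
As a concrete illustration of this sampling-based learning rule, the following is a minimal sketch of the standard TD(λ) update with linear predictions and eligibility traces. The discounted formulation, the linear features, and all variable names are assumptions made for the example; they are not taken from the paper's notation.

    import numpy as np

    def td_lambda_trajectory(features, rewards, w, alpha=0.1, gamma=1.0, lam=0.9):
        """One pass of linear TD(lambda) over a single observed trajectory.

        features : feature vectors x_1, ..., x_m seen along the trajectory
        rewards  : reward received after each step (a terminal outcome can be
                   folded into the final reward)
        w        : weight vector; the prediction at step t is w . x_t
        alpha, gamma, lam : step size, discount factor, trace-decay parameter
        (Illustrative names and setup, not the paper's notation.)
        """
        e = np.zeros_like(w)  # eligibility trace
        for t, r in enumerate(rewards):
            x_t = features[t]
            x_next = features[t + 1] if t + 1 < len(features) else np.zeros_like(x_t)
            # TD error: sampled one-step return minus the current prediction
            delta = r + gamma * (w @ x_next) - (w @ x_t)
            # decay the trace, add the current features, then move w along the trace
            e = gamma * lam * e + x_t
            w = w + alpha * delta * e
        return w

Each update uses only the sampled transition actually generated by the process, which is why the procedure can be analysed as a stochastic approximation scheme.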
Sutton (1988) proved that for a special case of temporal differences,
the expected values of the predictions converge to their correct
values, as larger samples are taken, and Dayan (1992) extended his
proof to the general case. This paper proves the stronger result that
the predictions of a slightly modified form of temporal difference
learning converge with probability one, and shows how to quantify the
rate of convergence.
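
For reference, the flavour of result being proved can be stated in the language of stochastic approximation. The display below shows the per-step TD(λ) update in Sutton's (1988) prediction form, together with the classical Robbins-Monro step-size conditions that typically underlie probability-one convergence of such updates; whether these conditions coincide exactly with the paper's "slightly modified form" is an assumption here, not something the abstract states.

    % TD(lambda) update for a prediction P_t = w^\top x_t along a trajectory,
    % followed by the classical Robbins--Monro step-size conditions (standard
    % form; assumed illustrative, not necessarily the paper's exact modification).
    \begin{align}
      \Delta w_t &= \alpha_n \,\bigl(P_{t+1} - P_t\bigr)
                    \sum_{k=1}^{t} \lambda^{\,t-k}\, \nabla_w P_k, \\
      \sum_{n=1}^{\infty} \alpha_n &= \infty,
      \qquad
      \sum_{n=1}^{\infty} \alpha_n^{2} < \infty .
    \end{align}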