Using EM for Reinforcement Learning
Peter Dayan   Geoff Hinton
Neural Computation, 9, 271-278.
Abstract
We discuss Hinton's (1989)
relative payoff procedure (RPP), a static reinforcement
learning algorithm whose foundation is not stochastic gradient
ascent. We show circumstances under which applying the RPP is
guaranteed to increase the mean return, even though it can make
large changes in the values of the parameters. The proof is based on
a mapping between the RPP and a form of the expectation-maximisation
procedure of Dempster, Laird & Rubin (1977).
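For a single stochastic binary unit, the RPP update described above reduces to setting the firing probability to the reward-weighted frequency of firing, p ← E[r·a]/E[r] (with non-negative payoffs). The following minimal sketch, with a hypothetical two-armed payoff function of my own devising, illustrates how repeated RPP updates can increase mean return without following a stochastic gradient:

```python
import random

def payoff(a, rng):
    # Hypothetical non-negative payoff: firing (a=1) is the better action.
    return 1.0 if rng.random() < (0.8 if a == 1 else 0.2) else 0.0

def rpp_update(p, trials, rng):
    """One Relative Payoff Procedure step for a single Bernoulli unit.

    The unit fires (a=1) with probability p. The new probability is the
    payoff-weighted average of the action: p_new = sum(r*a) / sum(r).
    """
    num = den = 0.0
    for _ in range(trials):
        a = 1 if rng.random() < p else 0
        r = payoff(a, rng)
        num += r * a
        den += r
    return num / den if den > 0 else p

rng = random.Random(0)
p = 0.5
for _ in range(5):
    p = rpp_update(p, trials=20000, rng=rng)
# p moves toward the action with higher expected payoff
```

Note that each update can move p a long way in parameter space (here roughly from 0.5 to 0.8 in a single step), which is the sense in which the RPP makes "large changes in the values of the parameters" while still, under the EM correspondence, never decreasing the mean return.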