Temporal Difference Learning of Position Evaluation in the Game of Go
Nici Schraudolph   Peter Dayan   Terry Sejnowski
NIPS 6, 817-824.
Abstract
Despite their facility at most other games of strategy, computers
remain inept players of the game of Go. Its high branching factor
defeats the tree search approach used in computer chess, while its
long-range spatiotemporal interactions make position evaluation
extremely challenging. Further development of conventional Go programs
is hampered by their knowledge-intensive nature. We demonstrate a
viable alternative by training neural networks to evaluate Go
positions via temporal difference (TD) learning.
We developed network architectures that reflect the spatial
organisation of both input and reinforcement signals on the Go
board, and training protocols that provide exposure to competent
(though unlabelled) play. These techniques yield far better
performance than undifferentiated networks trained by self-play
alone. A network with fewer than 500 weights learned within 3,000
games of 9x9 Go a position evaluation function that enables a
primitive one-ply search to defeat a commercial Go program at a
low playing level.
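The core idea — learning a position-value function by temporal difference updates over game trajectories — can be sketched as follows. This is a minimal illustrative TD(0) sketch, not the paper's actual architecture: the single-layer logistic evaluator, the +1/0/-1 board encoding, the learning rate, and the toy training loop are all assumptions introduced here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 9 * 9                      # 9x9 board, flattened to 81 inputs
w = np.zeros(N)                # weight vector (well under 500 weights)

def value(board, w):
    """Predicted probability that Black wins from this position
    (logistic output of a hypothetical single-layer evaluator)."""
    return 1.0 / (1.0 + np.exp(-board @ w))

def td0_update(trajectory, outcome, w, alpha=0.1):
    """TD(0): move each position's predicted value toward the value
    of its successor; the terminal target is the game outcome
    (1.0 = Black win, 0.0 = White win)."""
    for t in range(len(trajectory)):
        v = value(trajectory[t], w)
        target = outcome if t == len(trajectory) - 1 else value(trajectory[t + 1], w)
        # gradient of the logistic value w.r.t. w is v * (1 - v) * board
        w = w + alpha * (target - v) * v * (1.0 - v) * trajectory[t]
    return w

# Toy stand-in for self-play: a fixed sequence of random boards
# (+1 Black stone, -1 White stone, 0 empty) ending in a Black win.
game = [rng.choice([-1.0, 0.0, 1.0], size=N) for _ in range(20)]
for _ in range(200):
    w = td0_update(game, outcome=1.0, w=w)

# After training, the final position is evaluated as favourable for Black.
print(round(value(game[-1], w), 3))
```

In the paper's setting the evaluator is a neural network whose architecture mirrors the spatial structure of the board, and trajectories come from self-play and exposure to competent play rather than a fixed toy game; the update rule above only illustrates the bootstrapping principle.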