
GATSBY COMPUTATIONAL NEUROSCIENCE UNIT

Paul Schrater, C. Shawn Green and Daniel Acuna

Center for Cognitive Sciences, University of Minnesota

Monday 17 May 2010

14.00

Seminar Room B10 (Basement)

Alexandra House, 17 Queen Square, London, WC1N 3AR

Rational control of aspiration in learning

One of the fundamental questions for any agent entering a novel environment is how to balance exploratory and exploitative actions. This is especially difficult when the total number of states, the number of rewarded states, and the distribution of rewards at rewarded states are initially unknown. How much to explore in an unknown environment is controlled by the agent's beliefs about the value of unexperienced states, which we refer to as the agent's "aspiration." To examine the effect of aspiration on exploratory behavior in humans, we used a well-known, challenging test problem for exploration in reinforcement learning: the "chain game." Briefly, this task admits two reasonable policies: a small-reward policy that requires little exploration to find, and a large-reward policy that requires more exploratory (and unrewarded) actions to find. Human strategies fell into two distinct groups: one group performed enough unrewarded exploratory actions to find the larger-reward policy, while the second group under-explored and stuck with the low-reward policy without ever experiencing the larger-reward state. In debriefing, subjects in the former group reported finding the small local maximum quickly, but believed that higher rewards were possible and thus continued exploring. Conversely, subjects in the latter group typically reported an initial exploratory phase, but upon finding only the local maximum and otherwise only unrewarded states, concluded that the optimal solution was to exploit the small local maximum. Using model-based Bayesian reinforcement learning, an agent can be made to mimic either of these groups by manipulating the agent's initial prior belief about the size of the state space and/or the magnitude of potential rewards. In particular, we model these prior beliefs in terms of hyperparameters of a hierarchical Dirichlet process prior on the entries in the state-action transition matrix. Interestingly, these hyperparameters can be learned from data.
Given that aspiration is critical in determining exploratory behavior, the question then arises: what factors determine aspiration in humans? One potential source of information is knowledge of the reward history of others. To test the effect of this type of knowledge on exploratory choice behavior, subjects were again placed in an environment with an easy-to-find low-reward policy and a hard-to-find high-reward policy. After half the total trials (and convergence to the easy-to-find policy), a "high-score" sheet was shown to the subjects. One group was shown high-reward values, while the other group was shown low-reward values similar to their own score. While low-reward scores produced no change in behavior in the second half of the experiment, the group exposed to high-reward scores showed a complete reinitialization of exploration. Finally, we also tested whether subjects are capable of inferring aspiration directly from the statistics of the environment. Subjects played repeated games with similar reward structures (i.e., similar probabilities of states being rewarded, similar distributions of reward amounts, etc.). Choice behavior was extremely sensitive to these statistics (e.g., in sparse environments with few rewarded states, subjects typically exploited the first rewarded state they encountered).
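
The chain game mentioned in the abstract can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not taken from the talk: the chain length (5), the reward sizes (2 and 10), the deterministic transitions, and the action names "back" and "forward". The `patience` parameter is a hypothetical, crude stand-in for aspiration (how many unrewarded exploratory steps the agent tolerates before settling for the small reward); it is not the hierarchical Dirichlet process model described above.

```python
from dataclasses import dataclass


@dataclass
class ChainGame:
    """Deterministic chain (illustrative assumption, not the talk's exact task).

    'back' pays a small reward and resets to the start of the chain;
    'forward' pays nothing until the end of the chain, then pays a large reward.
    """
    n_states: int = 5            # assumed chain length
    small_reward: float = 2.0    # assumed payoff for 'back'
    large_reward: float = 10.0   # assumed payoff at the end of the chain
    state: int = 0

    def step(self, action: str) -> float:
        if action == "back":
            self.state = 0
            return self.small_reward
        if action == "forward":
            if self.state == self.n_states - 1:
                return self.large_reward  # stay at the end and keep collecting
            self.state += 1
            return 0.0
        raise ValueError(f"unknown action: {action!r}")


def run_agent(patience: int, n_steps: int = 20) -> float:
    """Total reward for an agent that tolerates `patience` unrewarded
    exploratory steps before giving up and exploiting the small reward.
    `patience` is a hypothetical proxy for aspiration."""
    env = ChainGame()
    total = 0.0
    unrewarded = 0
    found_large = False
    for _ in range(n_steps):
        # Keep going forward once the large reward is known, or while the
        # agent still believes something better may lie ahead.
        explore = found_large or unrewarded < patience
        reward = env.step("forward" if explore else "back")
        if explore and not found_large:
            if reward > 0:
                found_large = True
            else:
                unrewarded += 1
        total += reward
    return total


if __name__ == "__main__":
    for patience in (0, 2, 4, 5):
        print(patience, run_agent(patience))
```

Under these assumptions, an agent that gives up after four unrewarded steps stops one step short of the large reward and ends up worse off than an agent that never explored, while an agent that tolerates five unrewarded steps finds the large-reward policy. This mirrors, in a very simplified form, the split between the two groups of subjects.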