GATSBY COMPUTATIONAL NEUROSCIENCE UNIT
The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond

Olivier Cappé

Télécom ParisTech & CNRS, Paris
In this joint work with Aurélien Garivier, we present a finite-time analysis of the KL-UCB algorithm, an online, horizon-free index policy for stochastic bandit problems. Two distinct results are discussed: first, for arbitrary bounded rewards, KL-UCB satisfies a uniformly better regret bound than the well-known UCB algorithm; second, in the particular case of Bernoulli rewards, it reaches the lower bound of Lai and Robbins.
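To make the index policy concrete, the following is a minimal sketch (not the authors' code) of the KL-UCB index for Bernoulli rewards: the index of an arm is the largest mean q above the empirical mean whose binary KL divergence from the empirical mean, scaled by the number of pulls, stays below an exploration level. The sketch uses the level log(t) (the theoretical analysis uses log(t) + c·log log(t); taking c = 0 is the common practical choice) and finds the index by bisection, since d(p, ·) is increasing on [p, 1].

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """Binary KL divergence d(p, q) = p log(p/q) + (1-p) log((1-p)/(1-q))."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def kl_ucb_index(p_hat, n_pulls, t, precision=1e-6):
    """KL-UCB index for a Bernoulli arm: the largest q >= p_hat such that
    n_pulls * d(p_hat, q) <= log(t), computed by bisection."""
    level = math.log(t) / n_pulls
    lo, hi = p_hat, 1.0
    while hi - lo > precision:
        mid = (lo + hi) / 2.0
        if bernoulli_kl(p_hat, mid) > level:
            hi = mid  # divergence too large: shrink from above
        else:
            lo = mid  # still within the confidence level: move up
    return lo
```

At each round, the policy pulls the arm with the largest index; as an arm is pulled more often, its index tightens toward its empirical mean, which is what drives the logarithmic regret.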
A numerical study comparing KL-UCB with its main competitors (UCB, UCB2, UCB-Tuned, UCB-V, DMED) confirms that KL-UCB is remarkably efficient and stable, including for short time horizons. KL-UCB is also the only method that always performs better than the basic UCB policy.
Finally, we show how direct adaptations of the KL-UCB algorithm are also optimal for specific classes of (possibly unbounded) rewards, including those generated from exponential families of distributions.
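As an illustrative sketch of such an adaptation (not the authors' code), consider exponentially distributed rewards, a one-parameter exponential family: the Bernoulli divergence is simply replaced by the KL divergence between exponential distributions with means μ1 and μ2, namely log(μ2/μ1) + μ1/μ2 − 1, and the index is again found by bisection after bracketing from above.

```python
import math

def exp_kl(mu1, mu2):
    """KL divergence between exponential distributions with means mu1, mu2."""
    return math.log(mu2 / mu1) + mu1 / mu2 - 1.0

def kl_ucb_exp_index(mu_hat, n_pulls, t, precision=1e-6):
    """KL-UCB index for exponentially distributed rewards: the largest
    mu >= mu_hat with n_pulls * KL(mu_hat, mu) <= log(t)."""
    level = math.log(t) / n_pulls
    lo, hi = mu_hat, mu_hat
    # Rewards are unbounded, so first expand the upper bracket
    # until the divergence exceeds the exploration level.
    while exp_kl(mu_hat, hi) < level:
        hi *= 2.0
    while hi - lo > precision * mu_hat:
        mid = (lo + hi) / 2.0
        if exp_kl(mu_hat, mid) > level:
            hi = mid
        else:
            lo = mid
    return lo
```

The same pattern applies to other one-parameter exponential families: only the divergence function changes, which is what makes the adaptation direct.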