Learning Theoretic and Bayesian Inductive Principles
19-21 July 2004

Programme: abstracts and links

A. Philip Dawid University College London Bayesian and Prequential Inference for Model Selection Links: Philip Dawid

The Bayesian approach to inference licenses us to express any kind of uncertainty in simple probabilistic form. In the context of model selection, this applies both to uncertainty about the unknown parameter of a model, and uncertainty about the model itself. It also supports probabilistic prediction of unobserved quantities, a feature that is particularly valuable for model selection, as well as being important in its own right.

I shall describe how a predictive Bayesian approach to model selection automatically exhibits desirable behaviour --- in particular, guarding against overfitting --- without the need for extraneous or ad hoc ingredients such as regularisation, or oversimplifying assumptions such as independent observations.

I shall then generalise this by introducing the methodology of Prequential Analysis, which assesses and compares models directly in terms of their one-step-ahead predictive performance, and so is naturally suited to the tasks of model criticism and model selection. An important result is the availability of "optimal" forecasts, based on an extension of Empirical Risk Minimisation, even when the model is incorrectly specified.
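The prequential idea can be sketched in a few lines: score each model on each outcome by the probability it assigned before the outcome was seen, and accumulate the log-loss. The toy models below (a fixed fair coin versus a Bayesian predictor using Laplace's rule of succession) are illustrative assumptions, not examples from the talk.

```python
import math

def prequential_logloss(sequence, predictor):
    """Accumulate one-step-ahead log-loss: each outcome is scored by the
    probability the predictor assigned to it *before* seeing it."""
    total, state = 0.0, predictor.init()
    for x in sequence:
        p1 = predictor.prob_one(state)          # P(next = 1 | past)
        p = p1 if x == 1 else 1.0 - p1
        total += -math.log(p)
        state = predictor.update(state, x)
    return total

class Fixed:
    """Model 1: a coin with a fixed bias, no learning."""
    def __init__(self, p): self.p = p
    def init(self): return None
    def prob_one(self, state): return self.p
    def update(self, state, x): return state

class Laplace:
    """Model 2: Bayesian predictor with a uniform prior on the bias
    (Laplace's rule of succession: (ones + 1) / (n + 2))."""
    def init(self): return (0, 0)               # (ones seen, total seen)
    def prob_one(self, state): return (state[0] + 1) / (state[1] + 2)
    def update(self, state, x): return (state[0] + x, state[1] + 1)

seq = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1] * 5        # roughly 80% ones
loss_fixed = prequential_logloss(seq, Fixed(0.5))
loss_bayes = prequential_logloss(seq, Laplace())
```

The adaptive Bayesian predictor accumulates lower one-step-ahead loss than the fair coin on this sequence; the difference in cumulative loss is itself a prequential model-selection criterion, with no holdout set needed.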

Paul Vitanyi        
CWI and University of Amsterdam, Netherlands
Extracting Meaning with Kolmogorov and Shannon Links: See paper and recent work with Peter Grunwald and Nikolai Vereshchagin.

As perhaps the last mathematical innovation of an extraordinary scientific career, Kolmogorov in 1974 proposed to found statistical theory on finite combinatorial and computability principles, independent of probabilistic assumptions, via the relation between the individual data and its explanation (model), expressed by Kolmogorov's structure function. It turns out that this proposal is formally a rate-distortion function, in Shannon's sense, for individual data, and is related to lossy compression.

In classical probabilistic statistics the goodness of the selection process is measured in terms of expectations over probabilistic ensembles. For current applications, average relations are often irrelevant, since the part of the support of the probability density function that will ever be observed has essentially zero measure. This may be the case in, for example, complex video and sound analysis. There arises the problem that for individual cases the selection performance may be bad although the performance is good on average, or vice versa. There is also the problem of what probability means: whether it is subjective, objective, or exists at all. Kolmogorov's proposal outlined above strives for the firmer and less contentious ground of finite combinatorics and effective computation.

Kolmogorov's structure function, its variations, and its relation to model selection have attained some notoriety (many papers, and the Cover and Thomas textbook on information theory) but had not before been comprehensively analyzed and understood. It has always been questioned why Kolmogorov chose to focus on the mysterious function denoted $h_x$, rather than on the more evident function denoted $\beta_x$ (for details see the paper referred to below). Our main result, with the beauty of truth, justifies Kolmogorov's intuition. One easily stated consequence is: for all data, minimizing a two-part code consisting of one part model description and one part data-to-model code (essentially the celebrated MDL code \cite{Ri83}), subject to a given model-complexity constraint, as well as minimizing the one-part code consisting of just the data-to-model code (essentially the maximum likelihood estimator), {\em in every case} (and not only with high probability) selects a model that is a ``best explanation'' of the data within the given constraint. In particular, when the ``true'' model that generated the data is not in the model class considered, the ML or MDL estimator still gives a model that ``best fits'' the data. The notions of ``best explanation'' and ``best fit'' are understood in the sense that the data is ``most typical'' for the selected model, in a rigorous mathematical sense that is discussed below. A practical consequence is as follows: while the best fit (minimal randomness deficiency under complexity constraints on the model) cannot be computationally monotonically approximated, we can monotonically minimize the two-part code, or the one-part code, and thus monotonically approximate the best-fitting model {\em implicitly}. This should be sufficient: we want the best model rather than a number that measures its goodness.
These results open the possibility of model selection and prediction that are best possible for {\em individual} data samples, and thus usher in a new era of statistical inference that is *always* best, rather than best only *in expectation*. It turns out that the structure function can be viewed as a rate-distortion function for individual data. With this change of viewpoint we connect Kolmogorov's approach to non-probabilistic statistics and model selection with Claude Shannon's rate-distortion theory and the theory of lossy compression. This leads to a theory of rate distortion for individual data and a general class of structure functions.
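For reference, the functions alluded to above can be written out as follows. This is a standard rendering from the Kolmogorov-complexity literature, with all relations holding only up to logarithmic additive terms; it is a sketch, not taken verbatim from the talk.

```latex
% Structure function: log-size of the best model (finite set S containing x)
% of complexity at most \alpha
h_x(\alpha) = \min_{S} \{\, \log |S| : x \in S,\; K(S) \le \alpha \,\}

% Randomness deficiency: how far x is from being typical for S
\delta(x \mid S) = \log |S| - K(x \mid S)

% Best-fit function, the seemingly more natural target
\beta_x(\alpha) = \min_{S} \{\, \delta(x \mid S) : x \in S,\; K(S) \le \alpha \,\}

% Two-part (MDL-style) code length under the same constraint
\lambda_x(\alpha) = \min_{S} \{\, K(S) + \log |S| : x \in S,\; K(S) \le \alpha \,\}
```

The result described above says, roughly, that $\lambda_x(\alpha) = h_x(\alpha) + \alpha$ and $\beta_x(\alpha) = h_x(\alpha) + \alpha - K(x)$ up to logarithmic terms, so minimizing the witnessable two-part code also implicitly minimizes the randomness deficiency.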

Sham Kakade University of Pennsylvania, USA Online Bounds for Bayesian Algorithms Links: shamkakadeslides.pdf;

We provide competitive online bounds for a number of Bayesian algorithms, without making any assumptions on the data generation process. These bounds show that some of the most popular Bayesian algorithms (such as least squares regression and Bayesian logistic regression) perform favorably when compared, in retrospect, to the single best model in the model class. Interestingly, our lower bound also shows that the performance on the worst case sequence of examples is sometimes not much different from the performance on the worst i.i.d. sequence of examples. We also provide bounds for some widely used maximum a posteriori (MAP) estimation algorithms, including regularized logistic regression.
(joint work with Andrew Ng)
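The flavour of such a bound can be sketched with one-dimensional online least-squares (ridge) regression, the MAP predictor under a Gaussian prior: run it on an arbitrary sequence and compare its cumulative square loss to the single best weight chosen in hindsight. The data and constants below are illustrative assumptions, not the setting analysed in the talk.

```python
def online_ridge(pairs, reg=1.0):
    """Online ridge regression in one dimension: at each step predict with
    the current regularised least-squares fit, then reveal the outcome."""
    sxx, sxy = reg, 0.0          # accumulated x*x (plus prior) and x*y
    loss = 0.0
    for x, y in pairs:
        w = sxy / sxx            # current MAP weight estimate
        loss += (y - w * x) ** 2
        sxx += x * x
        sxy += x * y
    return loss

def best_fixed_loss(pairs):
    """Cumulative loss of the single best weight chosen in hindsight."""
    sxx = sum(x * x for x, y in pairs)
    sxy = sum(x * y for x, y in pairs)
    w = sxy / sxx
    return sum((y - w * x) ** 2 for x, y in pairs)

# An arbitrary (not assumed i.i.d.) sequence: y is roughly 2x with
# sign-alternating noise.
data = [(x, 2.0 * x + ((-1) ** x) * 0.1) for x in range(1, 41)]
online = online_ridge(data)
offline = best_fixed_loss(data)
regret = online - offline        # the quantity such online bounds control
```

The bounds in the talk control this regret without any assumption on how the (x, y) pairs are generated; the code above merely exhibits the quantity being bounded.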

Marcus Hutter IDSIA, Switzerland Online Prediction - Bayes versus Experts Links: marcuspres.pdf

We derive a very general regret bound in the framework of prediction with expert advice, which challenges the best known regret bound for Bayesian sequence prediction. Both bounds, of the form sqrt(Loss × complexity), hold for any bounded loss function, any prediction and observation spaces, arbitrary expert/environment classes and weights, and unknown sequence length.
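The expert-advice framework in question can be sketched with the standard exponentially weighted average forecaster. The learning rate and the toy experts below are illustrative assumptions; only the sqrt(Loss × complexity) shape of the guarantee is from the abstract.

```python
import math

def hedge(expert_losses, eta=0.5):
    """Exponentially weighted average forecaster: keep a weight per expert,
    suffer the weighted-average loss, decay each weight by exp(-eta*loss).
    expert_losses[t][i] is the loss of expert i in round t, in [0, 1]."""
    n = len(expert_losses[0])
    w = [1.0 / n] * n                       # uniform prior over experts
    total = 0.0
    for losses in expert_losses:
        total += sum(wi * li for wi, li in zip(w, losses))  # mixture loss
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, losses)]
        s = sum(w)
        w = [wi / s for wi in w]            # renormalise
    return total

# Two experts over 100 rounds: expert 0 is wrong in 1 round out of 10,
# expert 1 is wrong in 9 rounds out of 10.
block = [[0.0, 1.0]] * 9 + [[1.0, 0.0]]
seq = block * 10
forecaster_loss = hedge(seq)
best_expert_loss = min(sum(r[i] for r in seq) for i in range(2))
```

The forecaster's cumulative loss exceeds the best expert's only by a small additive term, consistent with regret growing like the square root of the best loss times the (log) complexity of the expert class.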

John Langford   TTI - Chicago, USA The precise role of "bias" or "prior" in sample complexity bounds Links: Presentation; Paper

Sample complexity bounds are frequentist/PAC approaches for analyzing the performance of learning machines.  Despite being frequentist, a wide variety of "train set bounds" have some notion of "prior" built in. This notion is often hidden and differs in subtle ways from Bayesian notions of "prior". 

I'll discuss the precise role of a "prior" in frequentist/PAC style bounds and discuss where the "prior" is in several such bounds. 

Suboptimal Behavior of Bayes (and MDL) under a Misspecified Prior Links: Presentation; Related research 1 and 2

One of the drawbacks of the Bayesian approach to learning is the work required to elicit or specify a Bayesian prior over the distributions producing the observed data. To compensate, there are various techniques which take a partial specification of a prior and automatically compile it into a distribution over distributions producing the observed data. For classification, there is an example showing that the process of compiling a "prior" over classifiers into a prior over distributions results in poor behavior. In particular, for any number of samples, the Bayesian, MAP, and MDL algorithms will all choose a significantly suboptimal classifier. This is particularly interesting because there exists a simple algorithm which succeeds in this setting for any "prior" over classifiers on any learning problem.



Olivier Catoni   CNRS -- Universite Paris 6, France Transductive PAC-Bayesian classification Links: Paper

We will discuss PAC-Bayesian classification theory in the so-called transductive setting, where both a learning sample and a test sample, jointly drawn from an exchangeable distribution, are given to the statistician. The labels from the learning sample are known; the labels from the test sample are to be predicted. The following points will be discussed:
• The use of pseudo-prior distributions depending on the design (the objects present in either the learning sample or the test sample); their link with Vapnik-Chervonenkis bounds; the way they can be recovered by integrating with respect to the test set without using a symmetrisation trick, resulting in some improvement in the constants; and the way they can be further improved by considering a test sample larger than the learning sample (made possible by the fact that we do not symmetrise).
• Localized generalization error bounds and their link with Gibbs posterior distributions.
• The special cases of compression schemes and of support vector machines.
• The use of relative bounds (comparing two classification rules instead of comparing the learning error rate and the test error rate of a single rule). Their usefulness in the presence of noisy labels.

David MacKay University of Cambridge, UK Text entry, Bayesian inference, and asymptotically universal prediction Links:

Jan Poland IDSIA, Switzerland Online Methods in Learning Theory Links: janpoland.pdf; slides

We present theoretical results on the performance of Bayesian online learning, namely predictive error bounds for the Bayes mixture and for MDL with respect to a countable model class. We briefly discuss how corresponding assertions and improvements might be obtained for active learning.

Joel Ratsaby University College London A Sharp Threshold Result for VC-Classes of Large-Margin Functions Links: Presentation; Relevant papers

Claudio Gentile Università dell'Insubria, Italy Statistical learning techniques based on worst-case on-line algorithms Links: Talk

An on-line learning algorithm is an algorithm that sweeps through a sequence of sample items only once. In the worst-case setting of on-line learning, theoretical performance is proven via a pointwise analysis, i.e., an analysis that avoids statistical assumptions on the way data are generated. In this tutorial, we first review some of the most relevant (and recent) on-line algorithms in the worst-case setting; then we discuss how to adapt these algorithms to a more classical statistical learning framework. Some of the results presented are well-known, some are very recent, others are unpublished.
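A minimal sketch of such a single-sweep algorithm is the classical Perceptron, together with the simple averaging step that is one standard route from worst-case online analyses to statistical guarantees. The toy data below is an assumption for illustration only.

```python
def perceptron_sweep(samples):
    """One pass of the Perceptron over (x, y) pairs, y in {-1, +1}.
    Returns the final weights, the averaged weights (the usual
    online-to-batch hypothesis), and the mistake count."""
    dim = len(samples[0][0])
    w = [0.0] * dim
    avg = [0.0] * dim
    mistakes = 0
    for x, y in samples:
        if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
            w = [wi + y * xi for wi, xi in zip(w, x)]   # mistake-driven update
            mistakes += 1
        avg = [a + wi for a, wi in zip(avg, w)]         # running sum for averaging
    n = len(samples)
    return w, [a / n for a in avg], mistakes

# Linearly separable toy data: the label is the sign of the first coordinate.
data = [([1.0, 0.2], 1), ([0.8, -0.1], 1),
        ([-1.0, 0.3], -1), ([-0.7, -0.2], -1)] * 5
w, w_avg, mistakes = perceptron_sweep(data)
```

The worst-case mistake bound for the Perceptron makes no statistical assumption; returning the averaged weights `w_avg` instead of the last `w` is one simple way to convert the online mistake guarantee into a risk bound in the statistical setting.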

Gabor Lugosi Pompeu Fabra University, Spain Sequential prediction under limited feedback Links: gaborlugosipres.pdf; lugositalk.pdf

In this talk we survey some variants of the sequential prediction problem introduced by Hannan and Blackwell in the 1950s and later taken up by various areas of research, including information theory, computational learning theory, game theory, and statistics. The common theme of the variants studied here is that the decision maker (or forecaster) has access only to limited information about the sequence of past outcomes. Label-efficient prediction, "apple tasting", the adversarial bandit problem, and other partial-monitoring problems can be formulated in this framework. We study consistency, establish some bounds on rates of convergence, and pose some open questions.
(joint work with Nicolo` Cesa-Bianchi and Gilles Stoltz)

Amiran Ambroladze University of Southampton, UK Rademacher Complexity and Lipschitz Functions Links: Slides

Rademacher and Gaussian complexities are successfully used in learning theory for measuring the capacity of the class of functions to be learned. One of the most important (and difficult) properties of these complexities is their connection to the Lipschitz condition. In this work we investigate this connection and state some open problems.
(joint work with John Shawe-Taylor)
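The empirical Rademacher complexity of a class F on a sample x_1, ..., x_n is the expectation, over independent uniform random signs sigma_i, of sup over f in F of (1/n) sum_i sigma_i f(x_i). A minimal Monte Carlo sketch for a tiny finite class follows; the class and sample size are illustrative assumptions.

```python
import random

def empirical_rademacher(values, trials=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of a
    finite function class. values[j] is the vector (f_j(x_1),...,f_j(x_n))
    of one function's values on the fixed sample."""
    rng = random.Random(seed)
    n = len(values[0])
    total = 0.0
    for _ in range(trials):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        # sup over the (finite) class of the sign-correlation with sigma
        total += max(sum(s * v for s, v in zip(sigma, fv)) / n
                     for fv in values)
    return total / trials

# Tiny class on a sample of size 8: a constant function and its negation,
# for which the complexity is exactly E|mean(sigma)| (about 0.27 here).
f1 = [1.0] * 8
F = [f1, [-v for v in f1]]
r_hat = empirical_rademacher(F)
```

The Lipschitz connection mentioned above (contraction) says, roughly, that composing every function in the class with a fixed 1-Lipschitz map cannot increase this quantity by more than a constant factor, which is what lets margin-based losses be analysed through the complexity of the underlying class.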

Peter Bartlett University of California, Berkeley, USA Large margin classifiers: consequences of convex cost functions Links: peterbartlett.pdf; bartlettasspaper.pdf; bartlettasspaper2.pdf

Many of the classification algorithms developed in the machine learning literature, including the support vector machine and boosting, minimize a cost function that can be viewed as a convex surrogate of the 0-1 loss function. The convexity makes these algorithms computationally efficient. The use of a surrogate, however, has statistical consequences that must be balanced against the computational virtues of convexity. This talk will cover some statistical and computational aspects of these methods, including universal consistency, convergence rates, and estimation of conditional probabilities of class labels. In the context of kernel classifiers, we will also discuss the sparseness of the kernel expansion, and algorithms for structured multi-class classification that exploit structure analogous to the factorizations common in generative models.
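The surrogate idea can be made concrete by viewing each loss as a function of the margin m = y f(x): the hinge loss (support vector machines), the logistic loss, and the exponential loss (boosting) are convex in m and upper-bound the 0-1 loss, which is the property the statistical analysis leans on. A minimal sketch; the 1/log 2 scaling of the logistic loss, so that it passes through 1 at m = 0, is a common normalisation assumed here.

```python
import math

# Each surrogate is a convex function of the margin m = y*f(x) that
# upper-bounds the 0-1 loss (which is 1 if m <= 0, else 0).
def zero_one(m):    return 1.0 if m <= 0 else 0.0
def hinge(m):       return max(0.0, 1.0 - m)                      # SVM
def logistic(m):    return math.log(1.0 + math.exp(-m)) / math.log(2.0)
def exponential(m): return math.exp(-m)                           # boosting

# Verify the upper-bound property on a few margins.
for m in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    assert hinge(m) >= zero_one(m)
    assert logistic(m) >= zero_one(m)
    assert exponential(m) >= zero_one(m)
```

Minimising a convex surrogate is computationally tractable where minimising the 0-1 loss is not; the talk's topic is the statistical price of that substitution (consistency, rates, and what the surrogate's minimiser reveals about conditional class probabilities).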