EU PASCAL WORKSHOP ON
Learning Theoretic and Bayesian Inductive Principles
19-21 July 2004
Programme: abstracts and links
A. Philip Dawid University College London  Bayesian and Prequential Inference for Model Selection  Links: Philip Dawid 
The Bayesian approach to inference licenses us to express any kind of uncertainty in simple probabilistic form. In the context of model selection, this applies both to uncertainty about the unknown parameter of a model and to uncertainty about the model itself. It also supports probabilistic prediction of unobserved quantities, a feature that is particularly valuable for model selection, as well as being important in its own right. I shall describe how a predictive Bayesian approach to model selection automatically exhibits desirable behaviour (in particular, guarding against overfitting) without the need for any extraneous or ad hoc ingredients such as regularisation, nor oversimplifying assumptions such as independent observations. I shall then generalise this by introducing the methodology of Prequential Analysis, which assesses and compares models directly in terms of their one-step-ahead predictive performance and so is naturally suited to the tasks of model criticism and model selection. An important result is the availability of "optimal" forecasts, based on an extension of Empirical Risk Minimisation, even when the model is incorrectly specified. 
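The prequential idea above can be illustrated with a toy sketch (not from the talk; the sequence and both forecasters below are invented): each model issues a one-step-ahead probability forecast, is scored by log loss, and only then sees the outcome.

```python
import math

def prequential_log_loss(seq, forecaster):
    """Accumulate one-step-ahead log loss: each outcome is predicted
    before it is revealed, then appended to the history."""
    total, history = 0.0, []
    for x in seq:
        p = forecaster(history)          # P(next bit = 1 | past)
        total += -math.log(p if x == 1 else 1.0 - p)
        history.append(x)
    return total

# Bayesian predictor under a uniform prior on the Bernoulli parameter
# (Laplace's rule of succession) versus a fixed fair-coin forecaster.
laplace = lambda h: (sum(h) + 1) / (len(h) + 2)
fair    = lambda h: 0.5

seq = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]     # a 1-heavy toy sequence
print(prequential_log_loss(seq, laplace) < prequential_log_loss(seq, fair))
# → True: the adaptive Bayesian forecaster incurs lower prequential loss
```

On this sequence the Bayesian predictor's accumulated loss equals minus the log marginal likelihood, which is why prequential scoring and Bayesian model comparison agree.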

Paul Vitanyi CWI and University of Amsterdam, Netherlands 
Extracting Meaning with Kolmogorov and Shannon  Links: See paper at http://www.cwi.nl/~paulv/selection.html or http://arXiv.org/abs/cs.CC/0204037 and recent work with Peter Grunwald and Nikolai Vereshchagin. 
As perhaps the last mathematical innovation of an extraordinary scientific career, Kolmogorov in 1974 proposed to found statistical theory on finite combinatorial and computability principles, independent of probabilistic assumptions, via the relation between the individual data and its explanation (model), expressed by Kolmogorov's structure function. It turns out that this proposal is formally a rate-distortion function in Shannon's sense for individual data, and is related to lossy compression. In classical probabilistic statistics the goodness of the selection process is measured in terms of expectations over probabilistic ensembles. For current applications, average relations are often irrelevant, since the part of the support of the probability density function that will ever be observed has about zero measure. This may be the case in, for example, complex video and sound analysis. There arises the problem that for individual cases the selection performance may be bad although the performance is good on average, or vice versa. There is also the problem of what probability means: whether it is subjective, objective, or exists at all. Kolmogorov's proposal strives for the firmer and less contentious ground of finite combinatorics and effective computation. Kolmogorov's structure function, its variations, and its relation to model selection have obtained some notoriety (many papers, and the Cover and Thomas textbook on information theory) but have not before been comprehensively analyzed and understood. It has always been questioned why Kolmogorov chose to focus on the mysterious function denoted $h_x$, rather than on the more evident function denoted $\beta_x$ (for details see the paper linked above). Our main result, with the beauty of truth, justifies Kolmogorov's intuition. 
One easily stated consequence is: for all data, minimizing a two-part code consisting of one part model description and one part data-to-model code (essentially the celebrated MDL code \cite{Ri83}), subject to a given model-complexity constraint, as well as minimizing the one-part code consisting of just the data-to-model code (essentially the maximum likelihood estimator), {\em in every case} (and not only with high probability) selects a model that is a ``best explanation'' of the data within the given constraint. In particular, when the ``true'' model that generated the data is not in the model class considered, the ML or MDL estimator still gives a model that ``best fits'' the data. This notion of ``best explanation'' and ``best fit'' is understood in the sense that the data is ``most typical'' for the selected model, in a rigorous mathematical sense that is discussed below. A practical consequence is as follows: while the best fit (minimal randomness deficiency under complexity constraints on the model) cannot be computationally monotonically approximated, we can monotonically minimize the two-part code, or the one-part code, and thus monotonically approximate the best-fitting model {\em implicitly}. But this should be sufficient: we want the best model rather than a number that measures its goodness. These results open the possibility of model selection and prediction that are best possible for {\em individual} data samples, and thus usher in statistical inference that is best in {\em every} case rather than merely on average. It turns out that the structure function can be viewed as a rate-distortion function for individual data. With this change of viewpoint we connect Kolmogorov's approach to non-probabilistic statistics and model selection with Claude Shannon's rate-distortion theory and the theory of lossy compression. This leads to a theory of rate distortion for individual data and a general class of structure functions. 
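As a hedged toy illustration of two-part code minimization (the real theory uses Kolmogorov complexity, which is uncomputable; the Bernoulli model class with a discretised parameter below is an invented stand-in): minimizing model bits plus data-to-model bits recovers the maximum-likelihood parameter.

```python
from math import log2

def two_part_len(x, m):
    """Idealized two-part code for a binary string x: log2(n+1) bits
    name a Bernoulli model theta = m/n, then the data are coded with
    -log2 P(x | theta) bits (the data-to-model part)."""
    n, k = len(x), sum(x)
    theta = m / n
    if theta in (0.0, 1.0) and 0 < k < n:
        return float("inf")              # model assigns x probability 0
    data_bits = 0.0
    if k:
        data_bits += -k * log2(theta)
    if n - k:
        data_bits += -(n - k) * log2(1 - theta)
    return log2(n + 1) + data_bits

x = [1, 0, 1, 1, 0, 1, 1, 1]             # 6 ones out of 8
best = min(range(len(x) + 1), key=lambda m: two_part_len(x, m))
print(best)   # → 6: the two-part minimum picks the ML parameter 6/8
```

Because the model-description cost is constant here, the two-part and one-part minimizations coincide, which is the toy analogue of the MDL/ML agreement stated above.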

Sham Kakade University of Pennsylvania, USA  Online Bounds for Bayesian Algorithms  Links: shamkakadeslides.pdf; shamkakadepaper.ps 
We provide competitive online bounds for a number of Bayesian algorithms, without making any assumptions on the data generation process. These bounds show that some of the most popular Bayesian algorithms (such as least squares regression and Bayesian logistic regression) perform favorably when compared, in retrospect, to the single best model in the model class. Interestingly, our lower bound also shows that the performance on the worst case sequence of examples is sometimes not much different from the performance on the worst i.i.d. sequence of examples. We also provide bounds for some widely used maximum a posteriori (MAP) estimation algorithms, including regularized logistic regression. 
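A minimal sketch of the flavour of such a comparison, using one-dimensional online ridge regression (the data and regulariser below are invented for illustration, not taken from the paper): the online Bayesian/MAP predictor is compared in retrospect with the single best linear predictor chosen in hindsight.

```python
def online_ridge_loss(data, lam=1.0):
    """Online ridge regression: at each step predict with the ridge
    (MAP) estimate fitted to the past, then receive the true label."""
    s_xx, s_xy, loss = lam, 0.0, 0.0
    for x, y in data:
        w = s_xy / s_xx                  # ridge estimate from past data
        loss += (y - w * x) ** 2
        s_xx += x * x
        s_xy += x * y
    return loss

def best_fixed_loss(data):
    """Loss of the single best linear predictor in hindsight."""
    s_xx = sum(x * x for x, _ in data)
    s_xy = sum(x * y for x, y in data)
    w = s_xy / s_xx
    return sum((y - w * x) ** 2 for x, y in data)

data = [(1.0, 1.1), (2.0, 1.9), (1.5, 1.4), (3.0, 3.2), (2.5, 2.4)]
regret = online_ridge_loss(data) - best_fixed_loss(data)
print(regret > 0)    # online loss exceeds the hindsight loss by a small regret
```

The bounds in the paper control exactly this gap, without any assumption on how the (x, y) pairs are generated.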

Marcus Hutter IDSIA, Switzerland  Online Prediction  Bayes versus Experts  Links: marcuspres.pdf 
We derive a very general regret bound in the framework of prediction with expert advice, which challenges the best known regret bound for Bayesian sequence prediction. Both bounds, of the form sqrt(Loss × complexity), hold for any bounded loss function, any prediction and observation spaces, arbitrary expert/environment classes and weights, and unknown sequence length. 
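For context, a minimal sketch of the expert-advice setting (the exponentially weighted forecaster below is the standard textbook algorithm, not the specific construction of the talk; the loss sequence is invented): its cumulative loss exceeds that of the best expert by at most ln(n)/eta + eta*T/8 for losses in [0, 1].

```python
import math

def hedge(losses, eta=0.5):
    """Exponentially weighted average forecaster: keep one weight per
    expert, suffer the weight-averaged loss each round, then shrink the
    weights of experts that did badly."""
    w = [1.0] * len(losses[0])
    total = 0.0
    for step in losses:                  # step[i] = expert i's loss in [0, 1]
        W = sum(w)
        total += sum(wi * li for wi, li in zip(w, step)) / W
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, step)]
    return total

# Two experts over 15 rounds; the first is wrong only a third of the time.
losses = [[0, 1], [1, 0], [0, 1]] * 5
best = min(sum(step[i] for step in losses) for i in range(2))
regret = hedge(losses) - best            # bounded by ln(2)/eta + eta*15/8
print(regret)
```

The sqrt(Loss × complexity) bounds discussed in the talk refine this kind of guarantee by tuning eta to the loss of the best expert.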

John Langford TTI  Chicago, USA  The precise role of "bias" or "prior" in sample complexity bounds  Links: Presentation; Paper 
Sample complexity bounds are frequentist/PAC approaches for analyzing the performance of learning machines. Despite being frequentist, a wide variety of "train set bounds" have some notion of "prior" built in. This notion is often hidden and differs in subtle ways from Bayesian notions of "prior". I'll discuss the precise role of a "prior" in frequentist/PAC style bounds and discuss where the "prior" is in several such bounds. 
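A simple instance of a train-set bound with an explicit "prior" is the Occam's-razor bound for a countable hypothesis class: with probability at least 1 - delta, every h satisfies err(h) <= train_err(h) + sqrt((ln(1/p(h)) + ln(1/delta)) / (2m)). A sketch of computing the bound (the numbers are invented):

```python
from math import log, sqrt

def occam_bound(train_err, prior_h, m, delta=0.05):
    """Occam's-razor bound: the penalty charges each hypothesis h by the
    'prior' mass p(h) placed on it before seeing the data."""
    return train_err + sqrt((log(1 / prior_h) + log(1 / delta)) / (2 * m))

# The same training error yields a tighter guarantee for a hypothesis
# that the "prior" favoured before the data arrived.
print(occam_bound(0.10, prior_h=2 ** -5,  m=1000))
print(occam_bound(0.10, prior_h=2 ** -50, m=1000))
```

Note the frequentist role of p(h) here: it is fixed before the data and trades off against the penalty, rather than being updated into a posterior.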

Suboptimal Behavior of Bayes (and MDL) under a Misspecified Prior 
Links: Presentation; Related research 1 and 2  
One of the drawbacks of the Bayesian approach to learning is the work required to elicit or specify a Bayesian prior over distributions producing the observed data. To compensate for this, there are various techniques which take a partial specification of a prior and automatically compile it into a distribution over distributions producing the observed data. For classification, there is an example showing that this process of compiling a "prior" over classifiers into a prior over distributions results in poor behavior. In particular, for any number of samples, the Bayesian, MAP, and MDL algorithms will all choose a significantly suboptimal classifier. This is particularly interesting because there exists a simple algorithm which succeeds in this test for any "prior" over classifiers on any learning problem. 

Olivier Catoni CNRS / Université Paris 6, France  Transductive PAC-Bayesian classification  Links: Paper 
We will discuss PAC-Bayesian classification theory in the so-called transductive setting, where both a learning sample and a test sample, jointly drawn from an exchangeable distribution, are given to the statistician. The labels from the learning sample are known; the labels from the test sample are to be predicted. Several aspects of this setting will be discussed. 

David MacKay University of Cambridge, UK  Text entry, Bayesian inference, and Asymptotically universal prediction  Links: http://www.inference.phy.cam.ac.uk/mackay/presentations/gatsby04/ 

Jan Poland IDSIA, Switzerland  Online Methods in Learning Theory  Links: janpoland.pdf; slides 
We present theoretical results on the performance of Bayesian online learning, namely predictive error bounds for the Bayes mixture and MDL with respect to a countable model class. We briefly discuss how assertions and improvements might be obtained for active learning. 

Joel Ratsaby University College London  A Sharp Threshold Result for VC Classes of Large-Margin Functions  Links: Presentation; Relevant papers 

Claudio Gentile Università dell'Insubria, Italy  Statistical learning techniques based on worst-case online algorithms  Links: Talk 
An online learning algorithm is an algorithm that sweeps through a sequence of sample items only once. In the worst-case setting of online learning, theoretical performance is proven via a pointwise analysis, i.e., an analysis that avoids statistical assumptions on the way data are generated. In this tutorial, we first review some of the most relevant (and recent) online algorithms in the worst-case setting; then we discuss how to adapt these algorithms to a more classical statistical learning framework. Some of the results presented are well-known, some are very recent, others are unpublished. 
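The classical Perceptron is the archetype of such a worst-case online algorithm: one sweep, a pointwise mistake-bound analysis, no statistical assumptions. A minimal sketch (the toy stream is invented):

```python
def perceptron(stream):
    """One sweep of the classical Perceptron: predict with the sign of
    the current weights, update only when a mistake is made."""
    w = [0.0] * len(stream[0][0])
    mistakes = 0
    for x, y in stream:                  # labels y are in {-1, +1}
        yhat = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
        if yhat != y:
            mistakes += 1
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return w, mistakes

stream = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([1.0, 1.0], 1), ([0.0, 2.0], -1)]
w, m = perceptron(stream)
print(w, m)   # → [1.0, 0.0] 1: a single mistake suffices on this stream
```

The conversions discussed in the tutorial turn such pointwise mistake bounds into statistical risk bounds when the items are, after all, drawn i.i.d.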

Gabor Lugosi Pompeu Fabra University, Spain  Sequential prediction under limited feedback  Links: gaborlugosipres.pdf; lugositalk.pdf 
In this talk we survey some variants of the sequential prediction problem introduced by Hannan and Blackwell in the 1950s and picked up later by various areas of research, including information theory, computational learning theory, game theory, and statistics. The common theme of the variants studied here is that the decision maker (or forecaster) has access only to limited information about the sequence of past outcomes. Label-efficient prediction, "apple tasting", the adversarial bandit problem, and other partial-monitoring problems may be formulated in this framework. We study consistency, establish some bounds on rates of convergence, and ask some questions. (Joint work with Nicolò Cesa-Bianchi and Gilles Stoltz.) 
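A key device behind limited-feedback forecasting is the importance-weighted loss estimate used by bandit algorithms such as Exp3: dividing the observed loss by the probability of the pulled arm makes the estimate unbiased. A small sketch (the loss and probability vectors are invented):

```python
def importance_estimate(losses, probs, pulled):
    """Only the loss of the pulled arm is observed; compensate by
    dividing it by the probability of pulling that arm."""
    return [losses[i] / probs[i] if i == pulled else 0.0
            for i in range(len(losses))]

losses, probs = [0.3, 0.8, 0.5], [0.5, 0.25, 0.25]
expected = [0.0, 0.0, 0.0]
for arm in range(3):                     # average the estimate over the draw
    est = importance_estimate(losses, probs, arm)
    expected = [e + probs[arm] * v for e, v in zip(expected, est)]
print(expected)   # recovers the full loss vector [0.3, 0.8, 0.5]
```

Unbiasedness is what lets full-information regret arguments carry over to the bandit and other partial-monitoring settings, at the price of larger variance.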

Amiran Ambroladze University of Southampton, UK  Rademacher Complexity and Lipschitz Functions  Links: Slides 
Rademacher and Gaussian complexities are successfully used in learning theory for measuring the capacity of the class of functions to be learned. One of the most important (and difficult) properties of these complexities relies on their connection to the Lipschitz condition. In this paper we investigate this connection and state some open problems. 
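For a finite function class on a small sample, the empirical Rademacher complexity can be computed exactly by enumerating all sign vectors; the two-element class below is an invented example.

```python
from itertools import product

def empirical_rademacher(values):
    """Exact empirical Rademacher complexity of a finite class, given
    values[f][i] = f(x_i): average over all sign vectors sigma of the
    supremum correlation sup_f (1/m) sum_i sigma_i f(x_i)."""
    m = len(values[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=m):
        total += max(sum(s * v for s, v in zip(sigma, f)) / m
                     for f in values)
    return total / 2 ** m

# Two {-1,+1}-valued functions evaluated on three sample points.
F = [[1, 1, 1], [1, -1, 1]]
print(empirical_rademacher(F))   # → 1/3 on this toy class
```

The Lipschitz contraction property studied in the paper controls how such a complexity changes when each f is composed with a Lipschitz function.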

Peter Bartlett University of California, Berkeley, USA  Large margin classifiers: consequences of convex cost functions  Links: peterbartlett.pdf; bartlettasspaper.pdf; bartlettasspaper2.pdf 
Many of the classification algorithms developed in the machine learning literature, including the support vector machine and boosting, minimize a cost function that can be viewed as a convex surrogate of the 0-1 loss function. The convexity makes these algorithms computationally efficient. The use of a surrogate, however, has statistical consequences that must be balanced against the computational virtues of convexity. This talk will cover some statistical and computational aspects of these methods, including universal consistency, convergence rates, and estimation of conditional probabilities of class labels. In the context of kernel classifiers, we will also discuss the sparseness of the kernel expansion, and algorithms for structured multiclass classification that exploit structure analogous to the factorizations common in generative models. 
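The surrogate idea can be made concrete: the hinge loss (SVM) and a rescaled logistic loss are convex functions of the margin y·f(x) that upper-bound the 0-1 loss pointwise. A brief sketch (the sampled margins are arbitrary):

```python
import math

# Losses as functions of the margin z = y * f(x).
zero_one = lambda z: 1.0 if z <= 0 else 0.0
hinge    = lambda z: max(0.0, 1.0 - z)                  # SVM surrogate
logistic = lambda z: math.log(1.0 + math.exp(-z)) / math.log(2.0)

# Each surrogate dominates the 0-1 loss pointwise, which is what makes
# minimizing the (convex) surrogate a sensible proxy for classification.
for z in (-2.0, -0.5, 0.0, 0.5, 2.0):
    assert hinge(z) >= zero_one(z)
    assert logistic(z) >= zero_one(z)
print("surrogates dominate the 0-1 loss on the sampled margins")
```

The statistical consequences mentioned above concern how far minimizing such a surrogate can drift from minimizing the 0-1 risk itself.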