Arthur Gretton

links

home group publications talks teaching workshops software Gatsby Unit ELLIS Unit, UCL

Contact arthur.gretton@gmail.com
Gatsby Computational Neuroscience Unit
Sainsbury Wellcome Centre
25 Howland Street
London W1T 4JG UK

Recent talks

Causal Effect Estimation with Hidden Confounders using Instruments and Proxies
Causal effect estimation from observational data requires taking into account hidden confounders. I will describe two practial approaches to ensuring robustness against these hidden confounders: Instrumental Variable Regression, and Proxy Methods. In both cases, emphasis will be placed on settings where variables may be continuous valued, high dimensional, and complex: for example, treatment might be the design and placement of a web ad, rather than simply whether or not the ad is shown. I will demonstrate the method in the setting of offline policy evaluation for reinforcement learning.

Slides from the ELLIS RobustML Workshop.
Causal Effect Estimation with Context and Confounders
A fundamental causal modelling task is to predict the effect of an intervention (or treatment) on an outcome, given context/covariates. Examples include predicting the effect of a medical treatment on patient health given patient symptoms and demographic information, or predicting the effect of ticket pricing on airline sales given seasonal fluctuations in demand. The problem becomes especially challenging when the treatment and context are complex (for instance, "treatment" might be a web ad design or a radiotherapy plan), and when only observational data is available (i.e., we have access to historical data, but cannot intervene ourselves). The challenge is greater still when the covariates are not observed, and constitute a hidden source of confounding.
I will give an overview of some practical tools and methods for estimating causal effects of complex, high dimensional treatments from observational data. The approach is based on conditional feature means, which represent conditional expectations of relevant model features. These features can be deep neural nets (adaptive, finite dimensional, learned from data), or kernel features (fixed, infinite dimensional, enforcing smoothness). When hidden confounding is present, a neural net implementation of instrumental variable regression can be used to correct for this confounding. The methods will be applied to modelling employment outcomes for the US Job Corps program for Disadvantaged Youth, and in policy evaluation for reinforcement learning.

Slides from the AISTATS keynote, Valencia 2023.
Gradient Flows on Kernel Divergence Measures
We construct Wasserstein gradient flows on two measures of divergence, and study their convergence properties. The first divergence measure is the Maximum Mean Discrepancy (MMD): an integral probability metric defined for a reproducing kernel Hilbert space (RKHS), which serves as a metric on probability measures for a sufficiently rich RKHS. We obtain conditions for convergence of the gradient flow towards a global optimum, and relate this flow to the problem of optimizing neural networks.
The second divergence measure on which we define a flow is the KALE (KL Approximate Lower-bound Estimator) divergence. This is a regularized version of the Fenchel dual problem defining the KL over a restricted class of functions (again, a Reproducing Kernel Hilbert Space (RKHS)). We also propose a way to regularize both the MMD and KALE gradient flows, based on an injection of noise in the gradient. This algorithmic fix comes with theoretical and empirical evidence. We compare the MMD and KALE flows, illustrating that the KALE gradient flow is particularly well suited when the target distribution is supported on a low-dimensional manifold.

Slides from the talk given at the Geometry and Statistics in Data Sciences Workshop IHP, Paris (November 2022).
Causal modelling with distribution embeddings: treatment effects, counterfactuals, mediation, and proxies
A fundamental causal modelling task is to predict the effect of an intervention (or treatment) D=d on outcome Y in the presence of observed covariates X. As a common example from medical science, treatment might be a particular medicine, covariates might be relevant patient information (blood pressure, cholesterol levels, age), and outcome is whether a cure is achieved.
We can estimate the average effect of a treatment from observations of historical data, by marginalising our estimate of the conditional mean E(Y|X,D) over P(X): for instance, what is the average probability of being cured if we give Medicine A to patients with covariate distribution P(X)? More complex causal questions require taking conditional expectations. For instance, the average treatment on the treated (ATT) addresses a counterfactual: what is the outcome of an intervention d' on the subpopulation that received treatment d (for instance, for the population that received Medicine A, what would have happened had we given them Medicine B)? Or we might be interested in the Conditional Average Treatment Effect: what is the average effect of Medicine A for patients with certain blood pressure readings? Finally, we might be interested in the case that covariates are not observed, but indirect "proxy" covariate information is known.
We address these questions in the nonparametric setting using both kernel and NN methods, which apply for very general treatments D and covariates X (the presentation will focus primarily on the kernel case for simplicity). We provide strong statistical guarantees under general smoothness assumptions, and a straightforward and robust implementation (a few lines of code). The method is mostly demonstrated by addressing causal modelling questions arising from the US Job Corps program for Disadvantaged Youth.

Slides from the talk given at the University of Oxford Statistics Department (November 2022). Talk also given at the Deeplearn Summer School 2022 (with video), G-Reseach, CMU, and UC Berkeley.
Generalized Energy-Based Models
I will introduce Generalized Energy Based Models (GEBM) for generative modelling. These models combine two trained components: a base distribution (generally an implicit model), which can learn the support of data with low intrinsic dimension in a high dimensional space; and an energy function, to refine the probability mass on the learned support. Both the energy function and base jointly constitute the final model, unlike GANs, which retain only the base distribution (the "generator"). In particular, while the energy function is analogous to the GAN critic function, it is not discarded after training.
GEBMs are trained by alternating between learning the energy and the base, much like a GAN. Both training stages are well-defined: the energy is learned by maximising a generalized likelihood, and the resulting energy-based loss provides informative gradients for learning the base. Samples from the posterior on the latent space of the trained model can be obtained via MCMC, thus finding regions in this space that produce better quality samples. Empirically, the GEBM samples on image-generation tasks are of better quality than those from the learned generator alone, indicating that all else being equal, the GEBM will outperform a GAN of the same complexity. GEBMs also return state-of-the-art performance on density modelling tasks, and when using base measures with an explicit form.

Slides from the talks given at the Simons Institute, Georgia Tech, LSE, University of Bristol, University of York, McGill, the University of Edinburgh, G Research, the Institute for Advanced Study at Princeton.
GANs with integral probability metrics: some results and conjectures
I will explore issues of critic design for generative adversarial networks. The talk will focus on integral probability metric (IPM) losses, specifically the Wasserstein loss as implemented in the WGAN-GP, and the MMD GAN. We will begin with an introduction to IPM losses, their relation to moment matching in the case of the Maximum Mean Discrepancy (MMD), and how IPMs relate to f-divergences (answer: almost not at all). Next, we will look at GAN design using these IPM losses: we will discuss the convergence of the GAN training algorithm when critic and generator are alternately trained, with a focus on exploring the effects of problem-specific critic gradient penalties in GAN optimisation and critic regularisation. We'll end with some theoretical results on gradient bias for IPM-based GANs, and on the relation between the gradient-penalised MMD and the Wasserstein-1 distance.

Slides from the talk at Oxford, February 2020.

Earlier slides from the talk at MILA, October 2019.
Kernel tests of goodness-of-fit using Stein's method
I will describe nonparametric, kernel-based tests to assess the relative goodness of fit of models with intractable unnormalized densities. We will begin with the case of models for which the marginal densities are known in closed form, up to normalisation. In this case, we compare expectations of infinite dictionaries of features under the model and data distributions, where these expectations agree when the model and data match. The features are chosen to have zero expectation under the model, which can be achieved for unnormalised densities using the Stein trick. We will obtain improved test performance by computing a small set of Stein features, chosen so as to maximise the power of the resulting test. These features yield an interpretable picture of model and data mismatch, and are computable in linear time. Finally, I will describe a test of relative goodness of fit for multiple models, where it is desired to find which model fits best, with the understanding that “all models are wrong." This final test applies even in the case where the models contain latent variables, and closed-form marginal distributions of the observed variables cannot be computed. In the case of models with low-dimensional latent structure and high-dimensional observations, our test significantly outperforms the relative maximum mean discrepancy test, which cannot exploit the latent structure.

Slides from the talk at Duke University Statistics Department, October 2019.
The Maximum Mean Discrepancy for training Generative Adversarial Networks (2018)
Generative adversarial networks (GANs) use neural networks as generative models, creating realistic samples that mimic real-life reference samples (for instance, images of faces, bedrooms, and more). These networks require an adaptive critic function while training, to teach the networks how to move improve their samples to better match the reference data. I will describe a kernel divergence measure, the maximum mean discrepancy (MMD), which represents one such critic function. This function can be used both in hypothesis testing (to test whether the distributions of two samples are significantly different), and to train a GAN. With gradient regularisation, the MMD is used to obtain state-of-the art performance (as of June 2018) on challenging image generation tasks, including 160 × 160 CelebA and 64 × 64 ImageNet. In addition to adversarial network training, I'll discuss the Kernel Inception Distance (KID), a measure for benchmarking GAN performance.

Slides from various talks given in 2018, at the University of Cardiff, the University of Chicago Booth, Carnegie Mellon University, the ICML 2018 Workshop on Deep Generative Models, the University of Bristol, MIT, Facebook AI Research Paris (in roughly reverse Chronological order)

Papers: On gradient regularizers for MMD GANs at NeurIPS 2018 ; Demystifying MMD GANs at ICLR 2018 .
UAI 2017 Tutorial on representing and comparing probabilities (August 2017)
The purpose of this tutorial is to provide an introduction to distribution embeddings and their applications, with a focus on recent tools (from 2014-2017) developed to represent and compare probabilty distributions. The first part of the talk will focus entirely on the representation of probability distributions, using both kernel methods and explicit features. The focus will be on designing kernels or features to make two distributions as distinguishable as possible. The second part of the talk will focus on more sophisticated applications of distribution representations: model criticism (using Stein's method to test against a parametric model), learning from probabilities as inputs, testing for independence and testing multi-way interaction.

Slides 1 and Slides 2, and Video.
Conditional Densities and Efficient Models in Infinite Exponential Families (December 2017)
The exponential family is one of the most powerful and widely used classes of models in statistics. A method was recently developed to fit this model when the natural parameter and sufficient statistic are infinite dimensional, using a score matching approach. The infinite exponential family is a natural generalisation of the finite case, much like the Gaussian and Dirichlet processes generalise their respective finite modfels.
I'll describe two recent results which make this model more applicable in practice, by reducing the computational burden and improving performance for high-dimensional data. The firsrt is a Nytsrom-like approximation to the full solution. We prove that this approximate solution has the same consistency and convergence rates as the full-rank solution (exactly in Fisher distance, and nearly in other distances), with guarantees on the degree of cost and storage reduction. The second result is a generalisation of the model family to the conditional case, again with consistency guarantees. In experiments, the conditional model generally outperforms a competing approach with consistency guarantees, and is competitive with a deep conditional density model on datasets that exhibit abrupt transitions and heteroscedasticity.

Talk slides for the NeurIPS 2017 workshop on Modeling and Learning Interactions from Complex Data.

Papers: Efficient and principled score estimation with Nyström kernel exponential families paper ; Kernel Conditional Exponential Family paper .
Learning Interpretable Features to Compare Distributions
We presnt adaptive two-sample tests with maximum testing power and interpretable features, using two divergence measures: the maximum mean discrepancy (MMD), and differences of learned smooth features (the mean embedding (ME) test, NeurIPS 2016). In both cases, the key point is that variance matters: it is not enough to have a large empirical divergence; we also need to have high confidence in the value of our divergence. We use an optimized MMD discriminator to evaluate and troubleshoot a generative adversarial network (GAN), by detecting subtle differences in the distribution of GAN outputs and real hand-written digits which humans are unable to find (for instance, small imbalances in the proportions of certain digits, or minor distortions that are implausible in normal handwriting). We use the linear-time ME test to distinguish positive and negative emotions on a facial expression database, showing that a distinguishing feature reveals the facial areas most relevant to emotion.

Talk slides for the NeurIPS 2016 workshop on generative adversarial networks;more detailed slides from the Dagstuh workshop.

Papers: Adaptive MMD test paper ; linear-time ME test paper .

Software: Adaptive MMD test code; linear-time ME test code.
Kernel Statistical Tests for Random Processes
We propose a kernel approach to statistical testing for random processes, which we illustrate using two settings: a two-sample test (comparison of marginal distributions of random processes), and an independence test. Our test statistics are in both cases straightforward V- statistics, derived from embeddings of probability measures to a reproducing kernel Hilbert space (RKHS). We use both shuffling and wild bootstrap approaches to construct provably consistent tests, and demonstrate that these are successful for cases where a naive permutation-based bootstrap fails. In experiments, the wild bootstrap gives strong performance on synthetic examples, on audio data, and in performance benchmarking for the Gibbs sampler.

Talk slides from the ISNPS 2016 conference.

Papers: A Wild Bootstrap for Degenerate Kernel Tests (NeurIPS 2014, uses wild bootstrap for null), and A Kernel Independence Test for Random Processes (ICML 2014, uses shuffling for null).

Software: Wild Bootstrap and Shuffling tests.
A Kernel Test of Goodness of Fit
We describe a nonparametric statistical test for goodness-of-fit: given a set of samples, the test determines how likely it is that these were generated from a target density function. The measure of goodness-of-fit is a divergence constructed via Stein's method using functions from a Reproducing Kernel Hilbert Space. Our test statistic is based on an empirical estimate of this divergence, taking the form of a V-statistic in terms of the log gradients of the target density and the kernel. We derive a statistical test, both for i.i.d. and non-i.i.d. samples, where we estimate the null distribution quantiles using a wild bootstrap procedure. We apply our test to quantifying convergence of approximate Markov Chain Monte Carlo methods, statistical model criticism, and evaluating quality of fit vs model complexity in nonparametric density estimation.

Talk slides from ICML 2016 .

Paper and Code.
Learning with probabilities as inputs, using kernels
Kernels have been widely applied when learning labels or regression values from very broad classes of inputs, for example images, strings, histograms, and graphs. In this talk, we deal with the case where inputs to a learning algorithm are probability distributions, or samples from distributions. The latter setting might include the task of labelling bags of image patches, or regressing from a collection of air samples to the pollution level.

We first show how to summarize probability distribution-valued inputs in terms of expected features, and to learn from kernels defined from these features. We demonstrate how to efficiently approximate the features to scale up our approach. We address two supervised learning settings: expectation propagation for general models, and classifying the direction of causal influence from observed samples of cause and effect. We also show, in the case of learning from samples from distributions, how many samples per distribution are needed for learning to occur. Time permitting, we might also cover the problem of learning with probability-valued outputs, for nonparametric Bayesian inference.

Talk slides from the NeurIPS 2015 workshop on Probabilistic Integration.

Papers: Learning for passing expectation propagation messages (UAI 2015), Two-Stage Sampled Learning Theory on Distributions (AISTATS 2015) and extended JMLR submission.

Software: Learning for passing expectation propagation messages and Two-Stage Sampled Learning Theory on Distributions
Kernel Adaptive Metropolis-Hastings

Talk slides from the NeurIPS 2015 workshop on Scalable Monte Carlo

Papers: Gradient-free Hamiltonian Monte Carlo with Efficient Kernel Exponential Families (NeurIPS 2015), Kernel Adaptive Metropolis-Hastings (ICML 2014).

Software: Gradient-free Hamiltonian Monte Carlo with Efficient Kernel Exponential Families, Kernel Adaptive Metropolis-Hastings.
Nonparametric Bayesian inference using kernel distribution embeddings
A method is presented for approximate Bayesian inference, where explicit models for the prior and likelihood are unknown (or difficult to compute), but sampling from these distributions is possbile. The method expresses the prior as an element in a reproducing kernel Hilbert space, and the likelihood as a family of such elements, indexed by the conditioning variable. These distribution embeddings may be computed directly from training samples: in particular, the likelihood embedding is the solution of a mulitask learning problem with (potentially) infinite tasks, where each task is a different feature space dimension.

Kernel message passing can be applied to any domains where kernels are defined, handling challenging cases such as discrete variables with huge domains, or very complex, non-Gaussian continuous distributions. An empirical comparison with approximate Bayesian computation (ABC) shows better performance in high dimensions. Finally, the approach is applied to camera angle recovery from captured images, showing better performance than the extended Kalman filter.

Talk slides from the NeurIPS 2013 workshop on new directions in transfer and multi-task learning (the current slides are from an extended version given at the Lancaster Statistics Department).

IEEE Signal Processing Magazine 2013 Tutorial, JMLR 2013 Paper , ICML 2013 Paper , ICML 2012 Paper
Optimal kernel choice for kernel hypothesis testing
We consider two nonparametric hypothesis testing problems: (1) Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis p=q; and (2) Given a joint distribution p_xy over random variables x and y, an independence test determines whether to reject the null hypothesis of independence, p_xy = p_x p_y. In testing whether two distributions are identical, or whether two random variables are independent, we require a test statistic which is a measure of distance between probability distributions. One choice of test statistic is the maximum mean discrepancy (MMD), a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability.

In this talk, I will provide a tutorial overview of kernel distances on probabilities, and show how these may be used in two-sample and independence testing. I will then describe a strategy for optimal kernel choice, and compare it with earlier heuristics (including other multiple kernel learning approaches).

Talk slides from the CMU ML lunch series (also presented at Microsoft Research Cambridge), video (scroll down for the talk)

Software

NeurIPS 2012 paper
Hypothesis Testing and Bayesian Inference: New Applications of Kernel Methods
In the early days of kernel machines research, the "kernel trick" was considered a useful way of constructing nonlinear learning algorithms from linear ones, by applying the linear algorithms to feature space mappings of the original data. More recently, it has become clear that a potentially more far reaching use of kernels is as a linear way of dealing with higher order statistics, by mapping probabilities to a suitable reproducing kernel Hilbert space (i.e., the feature space is an RKHS).

I will describe how probabilities can be mapped to kernel feature spaces, and how to compute distances between these mappings. A measure of strength of dependence between two random variables follows naturally from this distance. Applications that make use of kernel probability embeddings include:
• Nonparametric two-sample testing and independence testing in complex (high dimensional) domains. In the latter case, we test whether text in English is translated from the French, as opposed to being random extracts on the same topic.
• Inference on graphical models, in cases where the variable interactions are modeled nonparametrically (i.e., when parametric models are impractical or unknown).

Talk slides from a talk at the Oxford statistics department (also presented at Google Mountain View, Yahoo, Inria)
Bayesian Inference with Kernels
An embedding of probability distributions into a reproducing kernel Hilbert space (RKHS) has been introduced: like the characteristic function, this provides a unique representation of a probability distribution in a high dimensional feature space. This representation forms the basis of an inference procedure on graphical models, where the likelihoods are represented as RKHS functions. The resulting algorithm is completely nonparametric: all aspects of the model are represented implicitly, and learned from a training sample. Both exact inference on trees and loopy belief propagation on pairwise Markov random fields are demonstrated.

Kernel message passing can be applied to general domains where kernels are defined, handling challenging cases such as discrete variables with huge domains, or very complex, non-Gaussian continuous distributions. We apply kernel message passing and competing approaches to cross-language document retrieval, depth prediction from still images, protein configuration prediction, and paper topic inference from citation networks: these are all large-scale problems, with continuous-valued or structured random variables having complex underlying probability distributions. In all cases, kernel message passing performs outstandingly, being orders of magnitude faster than state-of-the-art nonparametric alternatives, and returning more accurate results.

Talk slides (with proofs in supplementary slides at the end) and video from the CMU machine learning lunch series (scroll down to the talk)

Software

AISTATS 2010 paper for inference on trees; AISTATS 2011 paper for inference on loopy graphs

NeurIPS 2010 Workshop on Challenges of Data Visualization (invited talk); also presented at the University of Cambridge, MSR Cambridge, University of Bristol, and CMU.

Consistent Nonparametric Tests of Independence: L1, Log-Likelihood and Kernel
Three simple and explicit procedures for testing the independence of two multi-dimensional random variables are described. Two of the associated test statistics (L1, log-likelihood) are defined when the empirical distribution of the variables is restricted to finite partitions. A third test statistic is defined as a kernel-based independence measure. Two kinds of tests are provided. Distribution-free strong consistent tests are derived on the basis of large deviation bounds on the test statistics: these tests make almost surely no Type I or Type II error after a random sample size. Asymptotically alpha-level tests are obtained from the limiting distribution of the test statistics. For the latter tests, the Type I error converges to a fixed non-zero value alpha, and the Type II error drops to zero, for increasing sample size. All tests reject the null hypothesis of independence if the test statistics become large. The performance of the tests is evaluated experimentally on benchmark data.

Talk slides

Software

JMLR paper and earlier ALT paper for the L1 and log-likelihood tests, NeurIPS paper for the kernel test

Prague Stochastics 2010

Covariate Shift by Kernel Mean Matching
Assume we are given sets of observations of training and test data, where (unlike in the classical setting) the training and test distributions are allowed to differ. Thus for learning purposes, we face the problem of re-weighting the training data such that its distribution more closely matches that of the test data. We consider specifically the case where the difference in training and test distributions occurs only in the marginal distribution of the covariates: the conditional distribution of the outputs given the covariates is unchanged. We achieve covariate shift correction by matching covariate distributions between training and test sets in a high dimensional feature space (specifically, a reproducing kernel Hilbert space). This approach does not require distribution estimation, making it suited to high dimensions and structured data, where distribution estimates may not be practical.

We first describe the general setting of covariate shift correction, and the importance weighting approach. While direct density estimation provides an estimate of the importance weights, this has two potential disadvantages: it may not offer the best bias/variance tradeoff, and density estimation might be difficult on complex, high dimensional domains (such as text). We then describe how distributions may be mapped to reproducing kernel Hilbert spaces (RKHS), and review distances between such mappings. We demonstrate a transfer learning algorithm that reweights the training points such that their RKHS mapping matches that of the (unlabeled) test points. The sample weights are obtained by a simple quadratic programming procedure. Our correction method yields its greatest and most consistent advantages when the learning algorithm returns a classifier/regressor that is "simpler" than the data might suggest. On the other hand, even an ideal sample reweighting may not be of practical benefit given a sufficiently powerful learning algorithm (if available).

Talk slides and video

Software

Book chapter and NeurIPS paper

NeurIPS 2009 Workshop on Transfer Learning for Structured Data (invited talk)

Introduction to Independent Component Analysis
Independent component analysis (ICA) is a technique for extracting underlying sources of information from linear mixtures of these sources, based only on the assumption that the sources are independent of each other. To illustrate the idea, we might be in a room containing several people (the sources) talking simultaneously, with microphones picking up multiple conversations at once (the mixtures), and we might wish to automatically recover the original separate conversations from these mixtures. More broadly, ICA is used in a very wide variety of applications, including signal extraction from EEG, image processing, bioinformatics, and economics. I will present an introduction to ICA, which includes a description of principal component analysis (PCA), and how ICA differs from PCA, the maximum likelihood approach, the case where fixed nonlinearities are used as heuristics for source extraction, some more modern information theoretic approaches, and a kernel-based method. I will also cover two optimization strategies, and provide a comparison of the various approaches on benchmark data, to reveal the strengths and failure modes of different ICA algorithms (with a focus on modern, recently published methods).

Talk Slides

Software for Fast Kernel ICA demo

MLD student research symposium (invited talk)

Earlier talks

here

Videolectures

ICML tutorial