Links
Gatsby Computational Neuroscience Unit, CSML, University College London
Collaborations
I am affiliated with the MPI for Biological Cybernetics as a research scientist.
I collaborate with the Select Lab at CMU.
Contact
arthur.gretton@gmail.com
Gatsby Computational Neuroscience Unit
Alexandra House
17 Queen Square
London - WC1N 3AR
Phone
020 7679 1186
Recent talks
- Optimal kernel choice for kernel hypothesis testing
We consider two nonparametric hypothesis testing problems: (1) Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis p=q; and (2) Given a joint distribution p_xy over random variables x and y, an independence test determines whether to reject the null hypothesis of independence, p_xy = p_x p_y. In testing whether two distributions are identical, or whether two random variables are independent, we require a test statistic which is a measure of distance between probability distributions. One choice of test statistic is the maximum mean discrepancy (MMD), a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability.
In this talk, I will provide a tutorial overview of kernel distances on probabilities, and show how these may be used in two-sample and independence testing. I will then describe a strategy for optimal kernel choice, and compare it with earlier heuristics (including other multiple kernel learning approaches).
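As a concrete illustration of the basic statistic (though not of the kernel selection procedure itself), the following minimal numpy sketch computes the quadratic-time unbiased MMD^2 estimate with a Gaussian kernel; the bandwidth sigma is an assumed free parameter, and choosing it well is precisely the problem the talk addresses.

    import numpy as np

    def gaussian_kernel(X, Y, sigma=1.0):
        # X: (m, d), Y: (n, d); k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
        sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
        return np.exp(-sq / (2 * sigma**2))

    def mmd2_unbiased(X, Y, sigma=1.0):
        # Unbiased estimate of MMD^2(p, q) from samples X ~ p and Y ~ q.
        m, n = len(X), len(Y)
        Kxx = gaussian_kernel(X, X, sigma)
        Kyy = gaussian_kernel(Y, Y, sigma)
        Kxy = gaussian_kernel(X, Y, sigma)
        term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
        term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
        return term_x + term_y - 2 * Kxy.mean()

Under the null hypothesis p=q this estimate fluctuates around zero; the test rejects when it exceeds a threshold obtained, for example, by permuting the pooled sample.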
Talk slides from the CMU ML lunch series (also presented at Microsoft Research Cambridge), video (scroll down for the talk)
Software
NIPS 2012 paper
- Hypothesis Testing and Bayesian Inference: New Applications of Kernel Methods
In the early days of kernel machines research, the "kernel trick" was considered a useful way of constructing nonlinear learning algorithms from linear ones, by applying the linear algorithms to feature space mappings of the original data. More recently, it has become clear that a potentially more far reaching use of kernels is as a linear way of dealing with higher order statistics, by mapping probabilities to a suitable reproducing kernel Hilbert space (i.e., the feature space is an RKHS).
I will describe how probabilities can be mapped to kernel feature spaces, and how to compute distances between these mappings; a measure of the strength of dependence between two random variables follows naturally from this distance (see the definitions after the list below). Applications that make use of kernel probability embeddings include:
- Nonparametric two-sample testing and independence testing in complex (high dimensional) domains. In the latter case, we test whether text in English is translated from the French, as opposed to being random extracts on the same topic.
- Inference on graphical models, in cases where the variable interactions are modeled nonparametrically (i.e., when parametric models are impractical or unknown).
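For concreteness, the embedding and the two quantities built on it can be written as follows (standard definitions, in my notation: k and l are kernels with RKHSs $\mathcal{H}$ and $\mathcal{G}$ on the two domains):

    \mu_P := \mathbb{E}_{X \sim P}\, k(\cdot, X), \qquad
    \mathrm{MMD}(P, Q) := \lVert \mu_P - \mu_Q \rVert_{\mathcal{H}},

    \mathrm{HSIC}(X, Y) := \lVert \mu_{P_{XY}} - \mu_{P_X \otimes P_Y} \rVert_{\mathcal{H} \otimes \mathcal{G}}^{2},

i.e. the dependence measure is the squared distance between the embedding of the joint distribution and that of the product of the marginals.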
Talk slides from the Oxford statistics department seminar (also presented at Google Mountain View, Yahoo, Inria)
- Bayesian Inference with Kernels
An embedding of probability distributions into a reproducing kernel Hilbert space (RKHS) has been introduced: like the characteristic function, this provides a unique representation of a probability distribution in a high dimensional feature space. This representation forms the basis of an inference procedure on graphical models, where the likelihoods are represented as RKHS functions. The resulting algorithm is completely nonparametric: all aspects of the model are represented implicitly, and learned from a training sample. Both exact inference on trees and loopy belief propagation on pairwise Markov random fields are demonstrated.
Kernel message passing can be applied to general domains where kernels are defined, handling challenging cases such as discrete variables with huge domains, or very complex, non-Gaussian continuous distributions. We apply kernel message passing and competing approaches to cross-language document retrieval, depth prediction from still images, protein configuration prediction, and paper topic inference from citation networks: these are all large-scale problems, with continuous-valued or structured random variables having complex underlying probability distributions. In all cases, kernel message passing performs outstandingly: it is orders of magnitude faster than state-of-the-art nonparametric alternatives, and returns more accurate results.
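The core estimate behind these RKHS likelihood representations is the conditional mean embedding, obtained from paired training samples by a regularized linear solve. The numpy sketch below is my own illustration rather than the paper's code; the bandwidth sigma and the regularizer lam are assumed parameters.

    import numpy as np

    def gaussian_kernel(X, Y, sigma=1.0):
        sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
        return np.exp(-sq / (2 * sigma**2))

    def conditional_embedding_weights(X, x_star, sigma=1.0, lam=1e-3):
        # Given pairs (x_i, y_i), the embedding of p(y | x = x_star) is
        # approximated as sum_i beta_i phi(y_i) with the weights below.
        n = len(X)
        K = gaussian_kernel(X, X, sigma)
        k_star = gaussian_kernel(X, x_star[None, :], sigma)[:, 0]
        return np.linalg.solve(K + lam * n * np.eye(n), k_star)

Any expectation E[f(Y) | X = x_star] with f in the RKHS is then estimated as sum_i beta_i f(y_i); messages in kernel belief propagation are built by composing such weighted sums.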
Talk slides (with proofs in supplementary slides at the end) and video from the CMU machine learning lunch series (scroll down to the talk)
Software
AISTATS 2010 paper for inference on trees; AISTATS 2011 paper for inference on loopy graphs
NIPS 2010 Workshop on Challenges of Data Visualization (invited talk); also presented at the University of Cambridge, MSR Cambridge, University of Bristol, and CMU.
- Consistent Nonparametric Tests of Independence: L1, Log-Likelihood and Kernel
Three simple and explicit procedures for testing the independence of two multi-dimensional random variables are described. Two of the associated test statistics (L1, log-likelihood) are defined when the empirical distribution of the variables is restricted to finite partitions. A third test statistic is defined as a kernel-based independence measure. Two kinds of tests are provided. Distribution-free strong consistent tests are derived on the basis of large deviation bounds on the test statistics: these tests make almost surely no Type I or Type II error after a random sample size. Asymptotically alpha-level tests are obtained from the limiting distribution of the test statistics. For the latter tests, the Type I error converges to a fixed non-zero value alpha, and the Type II error drops to zero, for increasing sample size. All tests reject the null hypothesis of independence if the test statistics become large. The performance of the tests is evaluated experimentally on benchmark data.
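For the kernel statistic, here is a minimal numpy sketch of the (biased) HSIC estimate together with a permutation-approximated null, purely for illustration: the tests described in the talk instead use large-deviation bounds or the asymptotic null distribution, and the bandwidth sigma is an assumed parameter.

    import numpy as np

    def gaussian_kernel(X, sigma=1.0):
        sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
        return np.exp(-sq / (2 * sigma**2))

    def hsic(K, L):
        # Biased estimate: trace(K H L H) / (n - 1)^2, with centring matrix H.
        n = K.shape[0]
        H = np.eye(n) - np.ones((n, n)) / n
        return np.trace(K @ H @ L @ H) / (n - 1) ** 2

    def hsic_permutation_test(X, Y, sigma=1.0, n_perm=500, alpha=0.05, rng=None):
        # Reject independence if the observed HSIC exceeds the (1 - alpha)
        # quantile of its value under random permutations of the y sample.
        rng = np.random.default_rng() if rng is None else rng
        K, L = gaussian_kernel(X, sigma), gaussian_kernel(Y, sigma)
        stat = hsic(K, L)
        null = [hsic(K, L[np.ix_(p, p)])
                for p in (rng.permutation(len(Y)) for _ in range(n_perm))]
        return stat, stat > np.quantile(null, 1 - alpha)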
Talk slides
Software
JMLR paper and earlier ALT paper for the L1 and log-likelihood tests, NIPS paper for the kernel test
Prague Stochastics 2010
- Covariate Shift by Kernel Mean Matching
Assume we are given sets of observations of training and test data, where (unlike in the classical setting) the training and test distributions are allowed to differ. Thus for learning purposes, we face the problem of re-weighting the training data such that its distribution more closely matches that of the test data. We consider specifically the case where the difference in training and test distributions occurs only in the marginal distribution of the covariates: the conditional distribution of the outputs given the covariates is unchanged. We achieve covariate shift correction by matching covariate distributions between training and test sets in a high dimensional feature space (specifically, a reproducing kernel Hilbert space). This approach does not require distribution estimation, making it suited to high dimensions and structured data, where distribution estimates may not be practical.
We first describe the general setting of covariate shift correction, and the importance weighting approach. While direct density estimation provides an estimate of the importance weights, this has two potential disadvantages: it may not offer the best bias/variance tradeoff, and density estimation might be difficult on complex, high dimensional domains (such as text). We then describe how distributions may be mapped to reproducing kernel Hilbert spaces (RKHS), and review distances between such mappings. We demonstrate a transfer learning algorithm that reweights the training points such that their RKHS mapping matches that of the (unlabeled) test points. The sample weights are obtained by a simple quadratic programming procedure. Our correction method yields its greatest and most consistent advantages when the learning algorithm returns a classifier/regressor that is "simpler" than the data might suggest. On the other hand, even an ideal sample reweighting may not be of practical benefit given a sufficiently powerful learning algorithm (if available).
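The quadratic program has a simple form: with K the kernel matrix on the training covariates and kappa_i proportional to the average kernel evaluation between training point i and the test sample, the weights minimize 0.5 beta' K beta - kappa' beta subject to box and mean constraints. The numpy/scipy sketch below illustrates this kernel mean matching step, solved with SLSQP for convenience; the weight bound B, the slack eps and the bandwidth sigma are assumed tuning parameters.

    import numpy as np
    from scipy.optimize import minimize

    def gaussian_kernel(X, Y, sigma=1.0):
        sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
        return np.exp(-sq / (2 * sigma**2))

    def kmm_weights(X_train, X_test, sigma=1.0, B=1000.0, eps=0.1):
        n_tr, n_te = len(X_train), len(X_test)
        K = gaussian_kernel(X_train, X_train, sigma)
        kappa = (n_tr / n_te) * gaussian_kernel(X_train, X_test, sigma).sum(axis=1)
        # The squared RKHS distance between the reweighted training mean and
        # the test mean is, up to scaling and an additive constant,
        # 0.5 * beta' K beta - kappa' beta.
        obj = lambda b: 0.5 * b @ K @ b - kappa @ b
        grad = lambda b: K @ b - kappa
        cons = [{"type": "ineq", "fun": lambda b: n_tr * (1 + eps) - b.sum()},
                {"type": "ineq", "fun": lambda b: b.sum() - n_tr * (1 - eps)}]
        res = minimize(obj, np.ones(n_tr), jac=grad, method="SLSQP",
                       bounds=[(0.0, B)] * n_tr, constraints=cons)
        return res.x  # importance weights for the training points

The weights are then passed to any learner that accepts per-example weights (e.g. a weighted least-squares fit or a weighted SVM).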
Talk slides and video
Software
Book chapter and NIPS paper
NIPS 2009 Workshop on Transfer Learning for Structured Data (invited talk)
- Introduction to Independent Component Analysis
Independent component analysis (ICA) is a technique for extracting underlying sources of information from linear mixtures of these sources, based only on the assumption that the sources are independent of each other. To illustrate the idea, we might be in a room containing several people (the sources) talking simultaneously, with microphones picking up multiple conversations at once (the mixtures), and we might wish to automatically recover the original separate conversations from these mixtures. More broadly, ICA is used in a wide variety of applications, including signal extraction from EEG, image processing, bioinformatics, and economics. I will present an introduction to ICA, covering: principal component analysis (PCA) and how ICA differs from it; the maximum likelihood approach; the use of fixed nonlinearities as heuristics for source extraction; some more modern information theoretic approaches; and a kernel-based method. I will also cover two optimization strategies, and provide a comparison of the various approaches on benchmark data, to reveal the strengths and failure modes of different ICA algorithms (with a focus on modern, recently published methods).
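As a small worked example of the "fixed nonlinearity" family of methods mentioned above (and not one of the specific algorithms compared in the talk), here is a numpy sketch of whitening followed by FastICA-style deflation with a tanh nonlinearity.

    import numpy as np

    def whiten(X):
        # X: (n_mixtures, n_samples). Centre, then decorrelate to unit variance.
        X = X - X.mean(axis=1, keepdims=True)
        d, E = np.linalg.eigh(np.cov(X))
        return E @ np.diag(1.0 / np.sqrt(d)) @ E.T @ X

    def fastica_deflation(X, n_iter=200, tol=1e-6, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        Z = whiten(X)
        m = Z.shape[0]
        W = np.zeros((m, m))
        for k in range(m):
            w = rng.standard_normal(m)
            w /= np.linalg.norm(w)
            for _ in range(n_iter):
                wz = w @ Z
                # Fixed-point update with g = tanh, g' = 1 - tanh^2.
                w_new = (Z * np.tanh(wz)).mean(axis=1) - (1 - np.tanh(wz) ** 2).mean() * w
                # Deflation: project out the components already found.
                w_new -= W[:k].T @ (W[:k] @ w_new)
                w_new /= np.linalg.norm(w_new)
                converged = abs(abs(w_new @ w) - 1.0) < tol
                w = w_new
                if converged:
                    break
            W[k] = w
        return W @ Z  # estimated sources, up to permutation and scaling

Mixing a few independent signals (e.g. a sine wave and uniform noise) and passing the mixtures to fastica_deflation recovers the originals up to order and scale.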
Talk slides
Software for Fast Kernel ICA demo
MLD student research symposium (invited talk)
Earlier talks
- Slides from earlier talks may be found here (scroll down to the "talks" category in each year). I also have talks on Videolectures, but I have given better presentations on some of the earlier topics (e.g. the ICML tutorial).