# Bayesian Approaches to Distribution Regression

Ho Chung Leon Law (Oxford) · Dougal J. Sutherland (Gatsby Unit, UCL) · Dino Sejdinovic (Oxford) · Seth Flaxman (Imperial College London)

arXiv:1705.04293

Learning on Distributions, Functions, Graphs and Groups
NIPS 2017

## Distribution regression

*(Figure: observe labelled bags of samples, e.g. labels −0.856, 0.562, 1.39; bags contain different numbers of samples; the model maps each bag to a label.)*

## Distribution regression applications

• “Just-in-time” regression for expectation propagation (EP) messages
• Choosing summary statistics for approximate Bayesian computation (ABC)
• Infer demographic voting behavior [Flaxman+ KDD-15, 2016]
• Galaxy cluster mass from velocities [Ntampaka+ ApJ 2015/2016]
• Redshift estimation from galaxy clusters
• Model radioactive isotope behavior [Jin+ NSS-16]
• Predict grape yields from vineyard images
• (Plus lots of classification, anomaly detection, etc)

## Distribution regression with kernel mean embeddings

• Standard approach [e.g. Muandet+ NIPS 2012]
• Choose an RKHS $\mathcal H_k$ with kernel $k$
• e.g. the Gaussian RBF kernel $k(x, y) = \exp\!\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$
• Use the mean embedding $\mu_P = \mathbb E_{X \sim P}[k(X, \cdot)] \in \mathcal H_k$
• Reproducing property: for $f \in \mathcal H_k$, $\langle f, \mu_P \rangle_{\mathcal H_k} = \mathbb E_{X \sim P} f(X)$
• Mean embeddings are a good representation for regression
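The reproducing property above can be checked numerically: for $f = \sum_i \alpha_i k(z_i, \cdot)$, the inner product with the empirical embedding equals the sample mean of $f$. A minimal numpy sketch (the centers, weights, and samples are all illustrative):

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
Z = rng.normal(size=(3, 2))          # centers defining f in the RKHS
alpha = np.array([0.5, -1.0, 2.0])   # f = sum_i alpha_i k(z_i, .)
X = rng.normal(size=(50, 2))         # samples x_j from P

# <f, mu_hat_P> expands to sum_i alpha_i * (1/N) sum_j k(z_i, x_j)
inner = sum(a * np.mean([rbf(z, xj) for xj in X]) for a, z in zip(alpha, Z))

# the reproducing property says this equals the empirical mean of f
sample_mean_f = np.mean([sum(a * rbf(z, xj) for a, z in zip(alpha, Z)) for xj in X])
```

The two quantities agree exactly (up to floating point), which is why mean embeddings turn expectations into inner products.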

## Estimating kernel mean embeddings

• $\mu_P$ is a good summary of $P$
• But we don't know $P$ or $\mu_P$; we just have samples $x_1, \dots, x_N \sim P$
• Natural estimate: the empirical mean $\hat\mu_P = \frac{1}{N} \sum_{j=1}^N k(x_j, \cdot)$
• Inner products are easy: $\langle \hat\mu_P, \hat\mu_Q \rangle = \frac{1}{NM} \sum_{i=1}^N \sum_{j=1}^M k(x_i, y_j)$
• But… the point estimate gets worse for small $N$
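A minimal numpy sketch of the empirical estimate (the RBF kernel and all sample sizes here are illustrative choices, not fixed by the slides):

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Pairwise Gaussian RBF kernel matrix between two sample sets."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def mean_embedding_inner(X, Y, sigma=1.0):
    """<mu_hat_P, mu_hat_Q> = (1/(NM)) sum_ij k(x_i, y_j):
    simply the mean of the pairwise kernel matrix."""
    return rbf_kernel(X, Y, sigma).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # samples from P
Y = rng.normal(size=(200, 5))  # samples from Q
val = mean_embedding_inner(X, Y)
```

Note that the estimate never forms $\hat\mu_P$ explicitly; everything reduces to kernel evaluations between samples.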

## Posterior for mean embeddings [Flaxman+ UAI 2016]

• Place a GP prior: $\mu \sim \mathcal{GP}(m_0, r)$
• Likelihood: $\hat\mu$ is “observed” at the points $x_j$: $\hat\mu(x_j) = \mu(x_j) + \varepsilon_j$, $\varepsilon_j \sim \mathcal N(0, \sigma^2)$
• Get a closed-form GP posterior for $\mu$: $\mu \mid \hat\mu \sim \mathcal{GP}(\xi, \Sigma)$ by standard Gaussian conditioning
• The posterior mean matches a Stein shrinkage estimator

## Distribution regression with kernel mean embeddings

• Model the label as a function of the mean embedding: $y_i = f(\mu_i) + \varepsilon_i$
• Estimate $f$ with kernel ridge regression
• Landmark approximation: evaluate each embedding at landmarks $u_1, \dots, u_d$, so $f(\mu_i) \approx \beta^\top \mu_i(\mathbf u)$ with $\mu_i(\mathbf u) = [\mu_i(u_1), \dots, \mu_i(u_d)]^\top$
• We have uncertainty about both $\mu_i$ and $\beta$
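A small numpy sketch of the point-estimate pipeline: empirical embeddings evaluated at landmarks, then ridge regression on those features. The toy data, landmarks, and regularization strength are illustrative:

```python
import numpy as np

def rbf_kernel(X, U, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(U**2, 1)[None, :] - 2 * X @ U.T
    return np.exp(-sq / (2 * sigma**2))

def bag_features(bags, U, sigma=1.0):
    """Phi[i, l] = empirical embedding of bag i evaluated at landmark u_l."""
    return np.stack([rbf_kernel(Xi, U, sigma).mean(axis=0) for Xi in bags])

def ridge_fit(Phi, y, lam=1.0):
    """beta = (Phi^T Phi + lam I)^{-1} Phi^T y."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

# toy setup: bag i is drawn from N(y_i, 1), so the embedding determines y_i
rng = np.random.default_rng(0)
y = rng.uniform(-2, 2, size=60)
bags = [rng.normal(yi, 1.0, size=(50, 1)) for yi in y]
U = np.linspace(-4, 4, 10)[:, None]   # landmark points
Phi = bag_features(bags, U)
beta = ridge_fit(Phi, y, lam=0.1)
preds = Phi @ beta
```

This is the baseline the Bayesian models below improve on: it ignores the uncertainty in both $\mu_i$ and $\beta$.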

## Shrinkage model

• Uses $\mu_i$'s GP posterior for each bag
• Point estimate for the weights $\beta$
• Observations: $y_i = \beta^\top \mu_i(\mathbf u) + \varepsilon_i$, $\varepsilon_i \sim \mathcal N(0, \sigma^2)$
• Landmark approximation: $\mu_i(\mathbf u) \sim \mathcal N(\xi_i, \Sigma_i)$
• Get $y_i \sim \mathcal N(\beta^\top \xi_i,\; \sigma^2 + \beta^\top \Sigma_i \beta)$
• Can get MAP estimates for $\beta$, $\sigma^2$, kernel params, …
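The marginal likelihood above can be sketched directly (a hedged numpy version; variable names are not from the paper):

```python
import numpy as np

def shrinkage_nll(beta, xi, Sigma, y, sigma2):
    """Negative log-likelihood under the shrinkage marginal
    y_i ~ N(beta^T xi_i, sigma2 + beta^T Sigma_i beta):
    integrating out mu_i(u) ~ N(xi_i, Sigma_i) inflates the noise
    variance by beta^T Sigma_i beta, so uncertain bags count less.
    xi has shape (n, d), Sigma shape (n, d, d)."""
    mean = xi @ beta                                           # (n,)
    var = sigma2 + np.einsum('j,ijk,k->i', beta, Sigma, beta)  # (n,)
    return 0.5 * np.sum(np.log(2 * np.pi * var) + (y - mean) ** 2 / var)
```

Minimizing this in $\beta$, $\sigma^2$, and the kernel parameters gives the MAP fit; with $\Sigma_i = 0$ it reduces to ordinary Gaussian regression.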

## Bayesian linear regression

• Assumes the $\mu_i$ are known exactly: point estimate at $\hat\mu_i$
• Regression weight uncertainty: $\beta \sim \mathcal N(0, \rho^2 I)$
• Observations: $y_i = \beta^\top \hat\mu_i(\mathbf u) + \varepsilon_i$, $\varepsilon_i \sim \mathcal N(0, \sigma^2)$
• The posterior for $\beta$ is normal (standard conjugate Bayesian linear regression)
• Hyperparameters: $\rho^2$, $\sigma^2$, kernel params, …
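A minimal numpy sketch of the conjugate posterior (standard Bayesian linear regression algebra; the synthetic data and all hyperparameter values are illustrative):

```python
import numpy as np

def blr_posterior(Phi, y, rho2=1.0, sigma2=0.1):
    """Conjugate posterior beta | y ~ N(m, S) for the model
    beta ~ N(0, rho2 I),  y = Phi beta + eps,  eps ~ N(0, sigma2 I)."""
    d = Phi.shape[1]
    S = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(d) / rho2)
    m = S @ Phi.T @ y / sigma2
    return m, S

# sanity check on synthetic features with a known weight vector
rng = np.random.default_rng(0)
beta_true = np.array([1.0, -2.0, 0.5])
Phi = rng.normal(size=(200, 3))
y = Phi @ beta_true + 0.1 * rng.normal(size=200)
m, S = blr_posterior(Phi, y, rho2=10.0, sigma2=0.01)
```

In the distribution-regression setting, `Phi` would be the matrix of point-estimate embeddings $\hat\mu_i(\mathbf u)$.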

## Full Bayesian Distribution Regression

• Shrinkage posterior for each $\mu_i$: $\mu_i(\mathbf u) \sim \mathcal N(\xi_i, \Sigma_i)$
• Normal model for the regression weights: $\beta \sim \mathcal N(0, \rho^2 I)$
• Observations: $y_i = \beta^\top \mu_i(\mathbf u) + \varepsilon_i$
• The product $\beta^\top \mu_i$ is non-conjugate; MCMC inference with Stan
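In place of a full Stan program, here is a hedged numpy sketch of the unnormalized log joint that the MCMC sampler targets (constant normalizers dropped, Cholesky factors of each $\Sigma_i$ assumed given; all names are illustrative):

```python
import numpy as np

def bdr_log_joint(beta, mu, xi, L, y, rho2=1.0, sigma2=0.1):
    """Unnormalized log joint for BDR:
       beta ~ N(0, rho2 I)             (regression weights)
       mu_i ~ N(xi_i, Sigma_i)         (shrinkage posterior; L[i] = chol(Sigma_i))
       y_i  ~ N(beta^T mu_i, sigma2)   (labels)
    The bilinear product beta^T mu_i is what breaks conjugacy, so beta and
    the mu_i must be sampled jointly. Constants w.r.t. (beta, mu) are dropped."""
    lp = -0.5 * beta @ beta / rho2
    for i in range(len(y)):
        z = np.linalg.solve(L[i], mu[i] - xi[i])     # whitened deviation from xi_i
        lp -= 0.5 * z @ z
        lp -= 0.5 * (y[i] - beta @ mu[i]) ** 2 / sigma2
    return lp
```

A sampler such as NUTS explores `beta` and all the `mu[i]` under this density; marginally this averages the regression over the embedding uncertainty.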

## Models recap

| Model | $\mu_i$ | $\beta$ | Inference |
| --- | --- | --- | --- |
| Shrinkage | GP model | point est. | conjugate + MAP |
| BLR | point est. | normal | conjugate + MAP |
| BDR | GP model | normal | conjugate + MCMC |

## Toy experiment

• Labels $y_i$ drawn uniformly over an interval
• 5-dimensional data points; bags have varying sizes $N_i$

## Toy experiment: NLL

*(Plot: BDR beats shrinkage, which beats BLR, in NLL.)*

## Toy experiment results: MSE

*(Plot: same ordering, BDR beats shrinkage beats BLR, in MSE; predicting the mean label gives MSE 1.3.)*

## Toy experiment

• Variant with constant $N_i$ and added label noise:
• BDR beats BLR beats shrinkage in NLL and MSE
• BDR can take advantage of both situations

## Age prediction from face images

• IMDb database: 400k images of 20k celebrities

## Age prediction results

• Features: last hidden layer of Rothe et al.'s CNN
• Shrinkage really helps!

## Recap

Three Bayesian models for distribution regression:

| Model | $\mu_i$ | $\beta$ | Inference |
| --- | --- | --- | --- |
| Shrinkage | GP model | point est. | conjugate + MAP |
| BLR | point est. | normal | conjugate + MAP |
| BDR | GP model | normal | conjugate + MCMC |
• Both kinds of uncertainty can help
• BDR can take advantage of both settings

Bayesian Approaches to Distribution Regression

Ho Chung Leon Law, Dougal J. Sutherland, Dino Sejdinovic, Seth Flaxman

arXiv:1705.04293