# Efficient and principled score estimation with kernel exponential families

Dougal Sutherland
Gatsby unit, UCL

Based on arXiv:1705.08360 and arXiv:1711.05363.

Approximating high dimensional functions workshop, 19 December 2017

## Our goal

• A multivariate density model that:
• Is flexible enough to model complex data (nonparametric)
• Is computationally efficient to estimate and evaluate
• Works in high(ish) dimensions
• Has statistical guarantees
| Parametric model | Nonparametric model |
|---|---|
| Multivariate Gaussian | Gaussian process |
| Finite mixture model | Dirichlet process mixture model |
| Exponential family | ??? |

## Exponential families

• Can write many densities on $\mathbb{R}^d$ as $p_\theta(x) \propto q_0(x) \exp\big( \langle \theta, T(x) \rangle \big)$
• Gaussian (1-D): $T(x) = (x, x^2)$
• Gamma: $T(x) = (\log x, x)$
• Density only sees data through $T(x)$ (and $q_0$)
• Can we make $T$ richer?
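As a quick sanity check of the natural-parameter form, a 1-D Gaussian written with $T(x) = (x, x^2)$ matches the usual density formula (a minimal numpy sketch; function names are illustrative):

```python
import numpy as np

def gaussian_natural_params(mu, sigma):
    """Natural parameters of N(mu, sigma^2) for T(x) = (x, x^2)."""
    return np.array([mu / sigma**2, -1.0 / (2 * sigma**2)])

def log_partition(theta):
    """A(theta) for the Gaussian family, so the density normalizes."""
    th1, th2 = theta
    return -th1**2 / (4 * th2) + 0.5 * np.log(-np.pi / th2)

def expfam_density(x, theta):
    """p_theta(x) = exp(<theta, T(x)> - A(theta)), base measure 1."""
    T = np.array([x, x**2])
    return np.exp(theta @ T - log_partition(theta))

mu, sigma = 1.0, 2.0
theta = gaussian_natural_params(mu, sigma)
x = 0.3
direct = np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
print(expfam_density(x, theta), direct)  # the two agree
```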

## Infinitely many features with kernels!

• Kernel: a “similarity” function $k(x, y)$
• For any positive definite $k$ on $\mathcal{X}$, there's a reproducing kernel Hilbert space $\mathcal{H}$ of functions on $\mathcal{X}$ with $\langle k(x, \cdot), k(y, \cdot) \rangle_{\mathcal{H}} = k(x, y)$
• $k(x, \cdot)$ is the (possibly infinite-dimensional) feature map of $x$

## Functions in the RKHS

• Functions in $\mathcal{H}$ are (the closure of) linear combinations of kernel sections: $f = \sum_i \alpha_i k(x_i, \cdot)$

## Reproducing property

• Functions in $\mathcal{H}$ are (the closure of) linear combinations of kernel sections: $f = \sum_i \alpha_i k(x_i, \cdot)$
• Reproducing property: $f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}}$
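These identities are easy to check numerically for finite combinations of kernel sections (Gaussian kernel; all values illustrative):

```python
import numpy as np

def k(x, y, sigma=1.0):
    """Gaussian kernel; broadcasts over numpy arrays."""
    return np.exp(-(x - y)**2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=5)        # centres of f = sum_i alpha_i k(X_i, .)
alpha = rng.normal(size=5)
Y = rng.normal(size=3)        # centres of a second function g
beta = rng.normal(size=3)

def f(y):
    return sum(a * k(xi, y) for a, xi in zip(alpha, X))

# Reproducing property: f(y) = <f, k(y, .)>_H = sum_i alpha_i k(X_i, y)
y = 0.7
print(f(y), alpha @ k(X, y))  # equal, by the reproducing property

# Inner products of combinations reduce to kernel evaluations:
# <f, g>_H = alpha^T K(X, Y) beta, and ||f||_H^2 = alpha^T K(X, X) alpha >= 0
K_XY = k(X[:, None], Y[None, :])
K_XX = k(X[:, None], X[None, :])
print("  <f,g> =", alpha @ K_XY @ beta)
print("||f||^2 =", alpha @ K_XX @ alpha)
```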

## Kernel exponential families

• Lift parameters from $\mathbb{R}^s$ to the RKHS $\mathcal{H}$ with kernel $k$: $p_f(x) \propto q_0(x) \exp\big( \langle f, k(x, \cdot) \rangle_{\mathcal{H}} \big)$
• Use $T(x) = k(x, \cdot)$ and $\theta = f$, so: $p_f(x) \propto q_0(x) \exp(f(x))$
• Covers standard exponential families:
• e.g. the 1-D normal distribution has $k(x, y) = x y + x^2 y^2$

## Rich class of densities

• For integrally strictly positive definite kernels $k$ that decay to 0,
$\{ p_f : f \in \mathcal{H} \}$ is dense in the set of densities that decay like $q_0$
• Example kernels: Gaussian, Matérn, inverse multiquadric

## Density estimation

• Given a sample $x_1, \ldots, x_n \sim p_0$, want $f$ so that $p_f \approx p_0$
• Default approach: maximum likelihood, $\hat{f} = \arg\max_f \sum_i \log p_f(x_i)$
• The normalizer $Z(f) = \int q_0(x) \exp(f(x)) \, \mathrm{d}x$ and its gradient are hard to compute
• Ill-posed for characteristic kernels
• Need a way to fit an unnormalized probabilistic model
• Could then estimate $Z(f)$ once, after fitting
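To see why brute-force normalization is a one-time, low-dimensional option at best: in 1-D the normalizer is a simple quadrature, but a grid per axis costs $g^d$ evaluations in $d$ dimensions. A sketch, assuming a standard normal base measure $q_0$:

```python
import numpy as np

def log_q0(x):
    """Base measure: standard normal log-density."""
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

def Z(f, lo=-10.0, hi=10.0, g=4001):
    """Z(f) = int q0(x) exp(f(x)) dx by a fine Riemann sum on [lo, hi]."""
    x = np.linspace(lo, hi, g)
    return np.sum(np.exp(log_q0(x) + f(x))) * (x[1] - x[0])

print(Z(lambda x: 0 * x))      # f = 0 means p = q0, so Z = 1
print(Z(lambda x: 0.5 * x))    # tilting by exp(x/2) changes the normalizer
# A g-point grid per axis costs g**d evaluations in d dimensions:
# quadrature is hopeless beyond a handful of dimensions.
```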

## Unnormalized density / score estimation

• Don't necessarily need to compute $Z(f)$ afterwards
• The unnormalized log-density $f(x) + \log q_0(x)$ (the “energy”) lets us:
• Find modes (global or local)
• Sample (with MCMC)
• The score, $\nabla_x \log p_f(x)$, lets us:
• Run HMC when we can't evaluate gradients of the target
• Construct Monte Carlo control functionals
• But again, need a way to find $f$

## Score matching in KEFs

[Sriperumbudur, Fukumizu, Gretton, Hyvärinen, and Kumar, JMLR 2017]

• Idea: minimize the Fisher divergence $J(p_0 \| p_f) = \frac{1}{2} \int p_0(x) \left\| \nabla_x \log p_0(x) - \nabla_x \log p_f(x) \right\|^2 \mathrm{d}x$
• Under mild assumptions, integration by parts gives $J(p_0 \| p_f) = \int p_0(x) \sum_{d=1}^{D} \left[ \partial_d^2 \log p_f(x) + \tfrac{1}{2} \big( \partial_d \log p_f(x) \big)^2 \right] \mathrm{d}x + \text{const}$
• Estimate with the empirical mean: $\hat{J}(f) = \frac{1}{n} \sum_{i=1}^{n} \sum_{d=1}^{D} \left[ \partial_d^2 \log p_f(x_i) + \tfrac{1}{2} \big( \partial_d \log p_f(x_i) \big)^2 \right]$
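A quick numerical check that minimizing the empirical objective recovers the right model, for a one-parameter Gaussian scale family (an illustrative toy, not the kernel estimator):

```python
import numpy as np

# Fit the scale s of log p_s(x) = -x^2 / (2 s^2) + const to N(0,1) draws by
# minimizing the empirical score-matching objective
#   J_hat(s) = (1/n) sum_i [ (log p_s)''(x_i) + (1/2) (log p_s)'(x_i)^2 ]
rng = np.random.default_rng(0)
x = rng.normal(size=20000)

def J_hat(s):
    dlog = -x / s**2        # first derivative of log p_s at the data
    d2log = -1.0 / s**2     # second derivative (constant in x)
    return np.mean(d2log + 0.5 * dlog**2)

grid = np.linspace(0.5, 2.0, 301)
s_hat = grid[np.argmin([J_hat(s) for s in grid])]
print(s_hat)  # close to the true scale, 1
```

Note that $\hat{J}$ never touches the normalizer of $p_s$, which is the whole point.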

## Score matching in KEFs

• Minimize the regularized loss function: $\hat{f} = \arg\min_{f \in \mathcal{H}} \hat{J}(f) + \lambda \| f \|_{\mathcal{H}}^2$
• Thanks to a representer theorem, we know $\hat{f}$ lies in the span of the kernel's partial derivatives at the data, $\{ \partial_d k(x_i, \cdot) \}$ (plus one fixed function determined by the data)

## Score matching fit

• Can get an analytic solution for the weights in that span: $\beta = -(G + n \lambda I)^{-1} h$
• $G$ (built from kernel derivatives) and $h$ have simple closed forms
• But solving an $(nd + 1) \times (nd + 1)$ linear system takes $O(n^3 d^3)$ time!

## The Nyström approximation

• Representer theorem: know the best $\hat{f}$ is in $\mathcal{H}_X = \operatorname{span}\{ \partial_d k(x_i, \cdot) \}$
• Nyström approximation: find the best $f$ in a subspace $\mathcal{H}_Y \subseteq \mathcal{H}_X$
• If $\mathcal{H}_Y$ has a size-$m$ basis, can find the coefficients in $O(n d m^2 + m^3)$ time
• Worst case: $m = nd$, no improvement
• If $m = O(\sqrt{nd})$, cost is $O(n^2 d^2)$
• If $m = O(\log nd)$, cost is nearly linear in $nd$

## Nyström basis choice

• Representer theorem: know the best $\hat{f}$ is in $\mathcal{H}_X = \operatorname{span}\{ \partial_d k(x_i, \cdot) : i \le n, d \le D \}$
• Reasonable $\mathcal{H}_Y$: pick $m$ points $y_j$ at random from the data, then $\mathcal{H}_Y = \operatorname{span}\{ \partial_d k(y_j, \cdot) : j \le m, d \le D \}$
• “lite”: pick $m$ points at random, then $\mathcal{H}_Y = \operatorname{span}\{ k(y_j, \cdot) : j \le m \}$
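A 1-D sketch in the spirit of the “lite” estimator: a small random basis of kernel sections, fit by minimizing the empirical score-matching loss. Two simplifications versus the paper: a flat base measure $q_0$, and a plain $\ell_2$ penalty on the coefficients rather than the RKHS norm.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=2000)     # data from N(0,1); the true score is -x
z = x[:30]                    # m = 30 random "lite" basis points
sigma, lam = 1.0, 1e-3        # bandwidth and ridge strength (illustrative)

def dk(z, x):   # d/dx of the Gaussian kernel k(z, x)
    return np.exp(-(x - z)**2 / (2 * sigma**2)) * (z - x) / sigma**2

def d2k(z, x):  # d^2/dx^2 of k(z, x)
    return np.exp(-(x - z)**2 / (2 * sigma**2)) * ((x - z)**2 / sigma**4 - 1 / sigma**2)

# With f(x) = sum_j alpha_j k(z_j, x), the empirical score-matching loss is
# the quadratic (1/2) a^T G a + b^T a, minimized by a = -(G + lam I)^{-1} b.
D = dk(z[None, :], x[:, None])                 # n x m: f'(x_i) = (D a)_i
G = D.T @ D / len(x)
b = d2k(z[None, :], x[:, None]).mean(axis=0)
alpha = np.linalg.solve(G + lam * np.eye(len(z)), -b)

score_hat = lambda t: dk(z, t) @ alpha         # estimated d/dx log p
for t in (-1.0, 0.0, 1.0):
    print(t, score_hat(t))                     # roughly -t on the data region
```

Only an $m \times m$ system is solved, with matrices assembled in one pass over the data.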

## Analysis: Main result

• Well-specified case: there is some $f_0 \in \mathcal{H}$ so that $p_0 = p_{f_0}$
• Some technical assumptions on the kernel and on density smoothness
• $\beta$ is a parameter depending on problem smoothness:
• small in the worst case, large in the best
• Choose a basis of size $m = \Omega(\sqrt{n} \log n)$ uniformly at random, with appropriately decaying $\lambda$:
• Fisher distance $J$:
• Consistent; same rate as the full solution
• $L^2$, KL, and Hellinger distances:
• Same rate, consistency for hard problems
• Rate saturates slightly sooner for easy problems

## Proof ideas

• Error can be bounded relative to that of the full solution
• Define the best estimator in the basis $\mathcal{H}_Y$, and compare to it
• New Bernstein inequality for sums of correlated operators
• Key approximation step: $\mathcal{H}_Y$ covers “most of” $\mathcal{H}_X$

## Other main score estimator

• Train a denoising autoencoder $r(\cdot)$ with normal noise of scale $\sigma$
• Normalized reconstruction error estimates the score: $\nabla_x \log p(x) \approx (r(x) - x) / \sigma^2$
• …as long as the autoencoder has infinite capacity
• …and is at a global optimum
• $\sigma$ is like a bandwidth; we have to tune it
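In the linear-Gaussian case the infinite-capacity optimum is known in closed form, so the claim can be checked directly; a least-squares slope plays the role of the trained autoencoder (an illustrative toy):

```python
import numpy as np

# For x ~ N(0,1) and x_noisy = x + eps with eps ~ N(0, s^2), the optimal
# denoiser is linear: r(u) = u / (1 + s^2). Fitting the slope by least
# squares stands in for the infinite-capacity autoencoder at its optimum.
rng = np.random.default_rng(0)
s = 0.5
x = rng.normal(size=100000)
x_noisy = x + s * rng.normal(size=x.size)

w = (x_noisy @ x) / (x_noisy @ x_noisy)    # least-squares slope: r(u) = w u
score_est = lambda u: (w * u - u) / s**2   # (r(u) - u) / s^2

print(w)               # about 1 / (1 + s^2) = 0.8
print(score_est(1.0))  # score of the *smoothed* density N(0, 1 + s^2): -u/(1+s^2)
```

This also shows the bandwidth effect: the estimate is the score of the $\sigma$-smoothed density, so $\sigma$ trades bias against the difficulty of training.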

## Experiments

• Two categories:
• Synthetic experiments: we know the true score, and so can compute $J$
• Sampling experiments: we can evaluate the quality of proposed gradient steps

## Synthetic experiment: Grid

Normal distributions at the vertices of a hypercube in $\mathbb{R}^d$


## Grid results


## Grid runtimes


## Synthetic experiment: Rings

• Three circles, plus small Gaussian noise in the radial direction
• Other dimensions are independent Gaussian noise


## Rings results

## Sampling GP hyperparameters with gradient-free HMC

• Want to marginalize out the hyperparameter choice for a GP classifier
• Can estimate the likelihood of hyperparameters with EP
• Can't get gradients this way: no HMC
• Approximate HMC method (kernel HMC; Strathmann et al., NIPS 2015):
• Fit model to estimate scores based on current samples
• Propose HMC trajectories based on score estimate
• Metropolis rejection step accounts for errors in score estimate
• Our experiment (UCI Glass dataset):
• Fit the model on “ground truth” samples
• 2,000 samples from 40 heavily-thinned random walks
• Track the HMC acceptance rate over many trajectories
• Higher acceptance rate ⇒ better score model
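The correction mechanism is easy to see in a 1-D sketch (not the kernel HMC implementation itself): leapfrog integration uses a deliberately imperfect score, standing in for an estimated one, while the Metropolis step uses the true unnormalized log-density, so the chain still targets the right distribution and a worse score only lowers acceptance.

```python
import numpy as np

rng = np.random.default_rng(0)
log_p = lambda x: -0.5 * x**2    # true unnormalized log-density (target N(0,1))
score = lambda x: -0.9 * x       # imperfect score estimate used for dynamics

def hmc_step(x, eps=0.2, L=10):
    p0 = rng.normal()
    xn, pn = x, p0 + 0.5 * eps * score(x)      # initial half-step in momentum
    for _ in range(L):
        xn = xn + eps * pn                     # full position step
        pn = pn + eps * score(xn)              # full momentum step
    pn = pn - 0.5 * eps * score(xn)            # make the last step a half-step
    # Accept/reject with the TRUE density: corrects the gradient error.
    log_acc = (log_p(xn) - 0.5 * pn**2) - (log_p(x) - 0.5 * p0**2)
    return (xn, 1) if np.log(rng.uniform()) < log_acc else (x, 0)

x, acc, samples = 0.0, 0, []
for t in range(6000):
    x, a = hmc_step(x)
    acc += a
    if t >= 1000:                              # discard burn-in
        samples.append(x)
samples = np.array(samples)
print(acc / 6000, samples.mean(), samples.var())  # high acceptance; moments near (0, 1)
```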

## Hyperparameter surface

• Nice surface to optimize with gradient descent
• Potentially allows for complex (deep?) kernels

## Recap so far

• Kernel exponential families are rich density models
• Hard to normalize, so maximum likelihood is intractable
• Can do score matching, but it's slow
• Nyström gives speedups with guarantees on performance
• Lite version even faster, but no guarantees (yet)

Efficient and principled score estimation with kernel exponential families
Dougal J. Sutherland*, Heiko Strathmann*, Michael Arbel, Arthur Gretton
arXiv:1705.08360

Kernel Conditional Exponential Family (arXiv:1711.05363)
Michael Arbel and Arthur Gretton

• $p(x)$ often has a graphical structure: $p(x) = \prod_i p\big(x_i \mid x_{\mathrm{pa}(i)}\big)$
• Estimate each factor independently
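The factorization is easy to verify numerically on a toy two-variable Gaussian chain (all values illustrative): the sum of factor log-densities equals the joint log-density, so each factor can indeed be modelled separately.

```python
import numpy as np

# Chain x1 -> x2 with x1 ~ N(0,1) and x2 | x1 ~ N(a*x1, s^2).
a, s = 0.7, 0.5

def log_normal(x, mu, var):
    return -0.5 * ((x - mu)**2 / var + np.log(2 * np.pi * var))

def log_joint_factored(x1, x2):
    """log p(x1) + log p(x2 | x1): one term per factor of the graph."""
    return log_normal(x1, 0.0, 1.0) + log_normal(x2, a * x1, s**2)

# Same density via the joint covariance of the bivariate Gaussian:
Sigma = np.array([[1.0, a], [a, a**2 + s**2]])

def log_joint_direct(x1, x2):
    v = np.array([x1, x2])
    return (-0.5 * v @ np.linalg.solve(Sigma, v)
            - 0.5 * np.log((2 * np.pi)**2 * np.linalg.det(Sigma)))

print(log_joint_factored(0.3, -0.2), log_joint_direct(0.3, -0.2))  # equal
```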

## Kernel conditional exponential family

• General definition: use a joint feature map $\phi(x, y)$: $p_f(y \mid x) \propto q_0(y) \exp\big( \langle f, \phi(x, y) \rangle \big)$
• Proposed particular case: let the natural parameter vary with $x$, i.e. $p_f(y \mid x) \propto q_0(y) \exp\big( \langle f(x), k(y, \cdot) \rangle \big)$
• Feature map given by the (operator-valued) kernel of the space of maps $x \mapsto f(x)$
• That space is a vector-valued RKHS

## What loss function to use?

• Obvious approach: minimize the expected conditional Fisher divergence $\mathbb{E}_{x \sim p_0} \big[ J\big( p_0(\cdot \mid x) \,\|\, p_f(\cdot \mid x) \big) \big]$
• Problem: the expression still depends on the unknown $p_0$
• Zero iff the conditional densities are equal for $p_0$-almost all $x$
• Leads to an analytical minimizer of the same form as before

## Failure mode of expected conditional score

Why does it fail?

The loss can be driven arbitrarily low, since the supports are disjoint and we drop the normalizer…

This violates the model's assumptions, but we need to be careful as we approach this situation!

## Unconditional vs conditional model

• Parkinsons: biomedical voice measurements from patients with early-stage Parkinson's disease
• Red Wine: physicochemical measurements on wine samples

| | Parkinsons | Red Wine |
|---|---|---|
| Dimension | 15 | 11 |
| Samples | 5,875 | 1,599 |

## Unconditional vs conditional model


Compare to:

• LSCDE: least-squares conditional density (ratio) estimation, has guarantees
• RNADE: deep learning, no guarantees

Negative log-likelihoods (smaller is better; 5 splits):

| | Parkinsons | Red Wine |
|---|---|---|
| KCEF | 2.86 ± 0.77 | 11.8 ± 0.93 |
| LSCDE | 15.89 ± 1.48 | 14.43 ± 1.5 |

## Recap

• Kernel exponential families are rich density models
• Hard to normalize, so need score matching
• Nyström gives speedups with guarantees on performance
• Conditional version: more complex, higher-dimensional models
