Efficient and principled score estimation with kernel exponential families

arXiv:1705.08360

UCL CSML seminar, 15 December 2017

- A multivariate density model that:
- Is flexible enough to model complex data (nonparametric)
- Is computationally efficient to estimate and evaluate
- Has statistical guarantees

- Can write many densities on $\mathbb{R}^d$ as: $p(x) = q_0(x) \exp\big(\langle \eta, T(x) \rangle - A(\eta)\big)$
- Gaussian: $T(x) = (x, x^2)$, $\eta = \big(\mu/\sigma^2,\, -1/(2\sigma^2)\big)$
- Gamma: $T(x) = (\log x,\, x)$
- Density only sees the data through $T(x)$ (and $q_0$)
- Can we make $T$ richer?

*Kernel*: “similarity” function $k(x, y)$, e.g. the exponentiated quadratic $k(x, y) = \exp\big(-\tfrac{1}{2\sigma^2} \|x - y\|^2\big)$
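A minimal NumPy sketch of this kernel (the function name `exp_quad` and the bandwidth parameter `sigma` are my choices):

```python
import numpy as np

def exp_quad(x, y, sigma=1.0):
    """Exponentiated quadratic kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    x, y = np.atleast_1d(x), np.atleast_1d(y)
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))
```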

- For any positive definite $k$ on $\mathcal{X}$, there's a reproducing kernel Hilbert space $\mathcal{H}$ of functions on $\mathcal{X}$ with $k(x, \cdot) \in \mathcal{H}$
- $k(x, \cdot)$ is the infinite-dimensional *feature map* of $x$
- Functions in $\mathcal{H}$ are (the closure of) linear combinations of feature maps: $f(\cdot) = \sum_i \alpha_i k(x_i, \cdot)$
- Reproducing property: $f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}}$ (numerical check below)
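A quick numerical illustration of both facts, assuming the exponentiated quadratic kernel above (all names are mine):

```python
import numpy as np

def k(x, y, sigma=1.0):
    return np.exp(-np.sum((np.atleast_1d(x) - np.atleast_1d(y)) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 2))    # centers x_i
al = rng.normal(size=5)         # coefficients alpha_i

def f(x):
    # f = sum_i alpha_i k(x_i, .), a finite combination of feature maps
    return sum(a * k(xi, x) for a, xi in zip(al, xs))

def inner(a1, c1, a2, c2):
    # <sum_i a_i k(x_i,.), sum_j b_j k(y_j,.)>_H = sum_{i,j} a_i b_j k(x_i, y_j)
    return sum(a * b * k(u, v) for a, u in zip(a1, c1) for b, v in zip(a2, c2))

x = rng.normal(size=2)
# reproducing property: <f, k(x, .)>_H == f(x)
assert np.isclose(inner(al, xs, [1.0], [x]), f(x))
```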

- Lift the parameters from $\mathbb{R}^s$ to the RKHS $\mathcal{H}$ with kernel $k$:
- Use $T(x) = k(x, \cdot)$ and $\eta = f \in \mathcal{H}$, so: $p_f(x) = q_0(x) \exp\big(f(x) - A(f)\big)$ (sketch below)
- Covers standard exponential families:
- e.g. the normal distribution has $k(x, y) = xy + x^2 y^2$
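In code, the model's unnormalized log density is just $f(x) + \log q_0(x)$; a sketch assuming a standard normal $q_0$ (my illustrative choice), for any $f$ like the toy one above:

```python
import numpy as np

def log_q0(x):
    # standard normal base measure (illustrative choice)
    x = np.atleast_1d(x)
    return -0.5 * np.sum(x**2) - 0.5 * len(x) * np.log(2 * np.pi)

def unnorm_logpdf(x, f):
    # log p_f(x) = f(x) + log q0(x) - A(f); we drop the unknown A(f)
    return f(x) + log_q0(x)
```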

- For rich-enough $\mathcal{H}$ whose functions decay to 0, $\{p_f : f \in \mathcal{H}\}$ is dense in the set of distributions that decay like $q_0$
- Examples of $q_0$: (figures in the slides)

- Given a sample $\{x_i\}_{i=1}^n \sim p_0$, want $f$ so that $p_f \approx p_0$
- $A(f) = \log \int q_0(x)\, e^{f(x)}\, \mathrm{d}x$ is hard to compute
- Maximum likelihood would need $A(f)$ at each step: intractable
- Need a way to fit an unnormalized probabilistic model
- Could then estimate $A(f)$ *once* after fitting (sketch below)
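For instance, since $A(f) = \log \mathbb{E}_{x \sim q_0}\!\big[e^{f(x)}\big]$, a plain Monte Carlo estimate can be run once after fitting (a sketch assuming $q_0 = \mathcal{N}(0, I)$; fine in low dimension, while high dimensions need smarter estimators):

```python
import numpy as np

def estimate_A(f, d, n_samples=100_000, seed=0):
    """A(f) = log E_{x ~ N(0, I)}[exp(f(x))], by simple Monte Carlo."""
    rng = np.random.default_rng(seed)
    fz = np.array([f(z) for z in rng.standard_normal((n_samples, d))])
    m = fz.max()                      # stabilized log-mean-exp
    return m + np.log(np.mean(np.exp(fz - m)))
```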

- Don't necessarily need to compute $A(f)$ afterwards
- $f(x) + \log q_0(x)$ (the “energy”, i.e. $\log p_f$ up to a constant) lets us:
- Find modes (global or local)
- Sample (with MCMC)
- …
- The score, $\nabla_x \log p_f(x) = \nabla f(x) + \nabla \log q_0(x)$, lets us:
- Run HMC when we can't evaluate gradients of the target (sketch after this list)
- Construct Monte Carlo control functionals
- …
- But again, we need a way to find $f$
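To illustrate the HMC point, here is a leapfrog integrator driven by an estimated score in place of $\nabla_x \log p$ (a sketch with names of my choosing, not the paper's implementation; in gradient-free HMC the accept/reject step still uses the true unnormalized target, so score errors affect efficiency, not correctness):

```python
import numpy as np

def leapfrog(x0, p0, score_hat, step=0.1, n_steps=20):
    """Leapfrog dynamics using an estimated score score_hat(x) ~ grad log p(x)."""
    x, p = np.asarray(x0, float).copy(), np.asarray(p0, float).copy()
    p += 0.5 * step * score_hat(x)        # initial half step (momentum)
    for _ in range(n_steps - 1):
        x += step * p                     # full step (position)
        p += step * score_hat(x)          # full step (momentum)
    x += step * p                         # final position step
    p += 0.5 * step * score_hat(x)        # final half step (momentum)
    return x, p
```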

- Idea: minimize the Fisher divergence
$$J(p) = \tfrac{1}{2} \int p_0(x)\, \big\| \nabla_x \log p(x) - \nabla_x \log p_0(x) \big\|^2\, \mathrm{d}x$$
- Under mild assumptions, integration by parts gives
$$J(p) = \int p_0(x) \sum_{j=1}^{d} \Big[ \partial_j^2 \log p(x) + \tfrac{1}{2} \big( \partial_j \log p(x) \big)^2 \Big]\, \mathrm{d}x + \text{const}$$
- Estimate with
$$\hat{J}(p) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{d} \Big[ \partial_j^2 \log p(x_i) + \tfrac{1}{2} \big( \partial_j \log p(x_i) \big)^2 \Big]$$
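A direct (if slow) implementation of $\hat J$ by central finite differences, valid for unnormalized log densities since $A(f)$ drops out of the derivatives (all names mine):

```python
import numpy as np

def J_hat(unnorm_logp, X, eps=1e-4):
    """Empirical score-matching objective over rows of X (shape (n, d)):
    (1/n) sum_i sum_j [ second derivative + 0.5 * (first derivative)^2 ]."""
    n, d = X.shape
    total = 0.0
    for x in X:
        lp0 = unnorm_logp(x)
        for j in range(d):
            e = np.zeros(d); e[j] = eps
            lp_plus, lp_minus = unnorm_logp(x + e), unnorm_logp(x - e)
            grad_j = (lp_plus - lp_minus) / (2 * eps)
            sec_j = (lp_plus - 2 * lp0 + lp_minus) / eps**2
            total += sec_j + 0.5 * grad_j**2
    return total / n
```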

[Sriperumbudur, Fukumizu, Gretton, Hyvärinen, and Kumar, JMLR 2017]

- Minimize the regularized loss function: $\hat f_\lambda = \operatorname*{argmin}_{f \in \mathcal{H}}\; \hat J(p_f) + \lambda \|f\|_{\mathcal{H}}^2$
- Thanks to the representer theorem, we know $\hat f_\lambda$ lies in the span of $\{\partial_i k(x_a, \cdot) : a \le n,\ i \le d\}$

- Can get an analytic solution for the weights in that span:
- The solution has a simple closed form (schematic below)
- Solving an $nd \times nd$ linear system takes $O(n^3 d^3)$ time!
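Schematically, writing $f_\beta = \sum_{a,i} \beta_{ai}\, \partial_i k(x_a, \cdot)$, the regularized objective is a quadratic in $\beta$:
$$\hat J(p_{f_\beta}) + \lambda \|f_\beta\|_{\mathcal{H}}^2 = \tfrac{1}{2}\, \beta^\top A_\lambda\, \beta + b^\top \beta + \text{const}, \qquad \hat\beta = -A_\lambda^{-1} b,$$
where $A_\lambda$ is an $nd \times nd$ matrix combining the quadratic term of $\hat J$ with the regularizer, and $A_\lambda$ and $b$ are built from first and second kernel derivatives at the data (exact entries in the papers; this is a sketch in my notation).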

- Representer theorem: we know the best $\hat f$ is in $\operatorname{span}\{\partial_i k(x_a, \cdot)\}$
- Nyström approximation: find the best $f$ in a subspace $\mathcal{H}_Y \subseteq \mathcal{H}$
- If $\mathcal{H}_Y$ has a size-$m$ basis, can find the coefficients in $O(n m^2 d^3)$ time
- Worst case $m = n$: $O(n^3 d^3)$, no improvement
- If $m = O(\sqrt{n})$: $O(n^2 d^3)$
- If $m = O(\log n)$: near-linear in $n$

- Representer theorem: we know the best $\hat f$ is in $\operatorname{span}\{\partial_i k(x_a, \cdot)\}$
- Reasonable $\mathcal{H}_Y$: pick $m$ data points $y_b$ at random, then $\mathcal{H}_Y = \operatorname{span}\{\partial_i k(y_b, \cdot)\}$
- *“lite”*: pick $m$ data points $y_b$ at random, then $\mathcal{H}_Y = \operatorname{span}\{k(y_b, \cdot)\}$ (1-D sketch below)
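A minimal 1-D sketch of a “lite”-style fit, under simplifying assumptions of mine: a flat base measure $q_0$ (so the model score is just $f'$), a plain ridge penalty on the coefficients rather than the RKHS norm, and arbitrary bandwidth and $\lambda$:

```python
import numpy as np

s, lam = 0.5, 1e-3                     # bandwidth and regularizer (my choices)

def k(y, x):   return np.exp(-(x - y) ** 2 / (2 * s**2))
def dk(y, x):  return -(x - y) / s**2 * k(y, x)                  # d/dx k
def d2k(y, x): return ((x - y) ** 2 / s**4 - 1 / s**2) * k(y, x)  # d^2/dx^2 k

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(2, 0.5, 500)])
n, m = len(X), 30
Y = rng.choice(X, size=m, replace=False)   # "lite": random subset as basis

# With f = sum_b alpha_b k(y_b, .), J_hat(alpha) is quadratic:
#   0.5 * alpha' G alpha + b' alpha
D = dk(Y[None, :], X[:, None])             # (n, m) first derivatives at data
G = D.T @ D / n
b = d2k(Y[None, :], X[:, None]).mean(axis=0)
alpha = np.linalg.solve(G + lam * np.eye(m), -b)

def score_hat(x):                          # estimated d/dx log p(x)
    return dk(Y[None, :], np.atleast_1d(x)[:, None]) @ alpha
```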

- Well-specified case: there is some $f_0 \in \mathcal{H}$ so that $p_0 = p_{f_0}$
- Some technical assumptions on the kernel and on density smoothness
- If we choose the basis points uniformly at random:
- $\theta$ is a parameter depending on problem smoothness:
- at its smallest in the worst case, largest in the best (smoothest) case

- As long as $m$ grows quickly enough with $n$, with an appropriate $\lambda_n \to 0$ we get:
- Fisher distance $J$: converges to 0 at a rate governed by $\theta$
- Consistent; same rate as the full solution

- In other distances (KL, Hellinger, $L_r$):
- Same rate, and consistency, for hard problems
- Consistent, with a slightly slower rate, for easy problems

- The error can be bounded relative to the best estimator in the basis
- Define the best estimator in basis $\mathcal{H}_Y$: $f_Y$, the minimizer of the population loss over $\mathcal{H}_Y$ (see the schematic below)
- Uses lots of techniques from earlier analyses of the full estimator and of Nyström regression
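Schematically, the analysis splits the error into two terms, bounded in the next two parts (a sketch of the proof strategy in my notation, not the paper's exact statement):
$$\text{(total error)} \;\lesssim\; \underbrace{\big\| \hat f_{Y,\lambda} - f_Y \big\|}_{\text{estimation: finite } n} \;+\; \underbrace{\big\| f_Y - f_0 \big\|}_{\text{approximation: finite basis } Y}$$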

- Error due to the finite number of samples
- We show it vanishes at the rate needed for the final bound
- The proof is based on:
- standard concentration inequalities in Hilbert spaces
- a new Bernstein inequality for sums of correlated operators
- some tricks manipulating operators in $\mathcal{H}$

- We show it's small, as long as $m$ grows appropriately with $n$
- Key part of the proof: $\mathcal{H}_Y$ covers “most of” $\mathcal{H}$
- Also needs the new Bernstein inequality for correlated operators
- Along with lots of tricks
- This is the only part specific to the choice of basis $Y$

- The final bound comes from plugging those together and optimizing over $\lambda$
- Slightly better rate in the Fisher divergence
- A loss that only cares about regions where $p_0$ lies

- Doesn't account for the dimension $d$ in a satisfying way
- Constants about problem difficulty depend on $p_0$

- Only 17 pages of dense manipulations!

- Train a denoising autoencoder with normal noise $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$
- The normalized reconstruction error estimates the score: $\nabla_x \log p(x) \approx \big(r(x) - x\big)/\sigma^2$ (sketch below)
- …as long as the autoencoder has infinite capacity
- …and is at a global optimum
- $\sigma$ acts like a bandwidth, and has to be tuned
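The resulting score estimate, given a trained reconstruction function `r` (assumed available here; this is the Alain–Bengio identity, not the paper's method), is a one-liner:

```python
import numpy as np

def dae_score(r, x, sigma):
    """grad_x log p(x) ~= (r(x) - x) / sigma^2, exact only for an optimal,
    infinite-capacity denoising autoencoder as sigma -> 0."""
    x = np.asarray(x, dtype=float)
    return (r(x) - x) / sigma**2
```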

- Two categories:
- Synthetic experiments
- We know the true score, and so can compute the loss $J$

- Gradient-free HMC
- We can evaluate the quality of proposed gradient steps

Normal distributions at the vertices of a hypercube in $\mathbb{R}^d$

- Three circles, plus small Gaussian noise in the radial direction
- Other dimensions are independent Gaussian noise

- Hyperparameter selection for a GP classifier on the UCI Glass dataset
- Can estimate the likelihood of hyperparameters with expectation propagation (EP)
- Can't get gradients this way
- Fit models to 2000 samples from 40 heavily-thinned random walks
- Track the HMC Metropolis acceptance rate across many trajectories
- Higher acceptance rate means a better model

- Nice surface to optimize with gradient descent
- Potentially allows for complex (deep?) kernels

- Kernel exponential families are rich density models
- Hard to normalize, so maximum likelihood is intractable
- Can do score matching, but it's slow: $O(n^3 d^3)$
- Nyström gives big speedups, with guarantees on performance
- The lite version is even faster, but has no guarantees (yet)
- Could apply to the kernel *conditional* exponential family

**Efficient and principled score estimation with kernel exponential families**
Dougal J. Sutherland\*, Heiko Strathmann\*, Michael Arbel, Arthur Gretton (\* equal contribution)

arXiv:1705.08360