Efficient and principled score estimation
with kernel exponential families

arXiv:1705.08360

UCL CSML seminar, 15 December 2017

\DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator{\bigO}{\mathcal O} \DeclareMathOperator*{\E}{\mathbb E} \newcommand{\h}{\mathcal H} \newcommand{\R}{\mathbb R} \DeclareMathOperator{\spn}{span} \newcommand{\tp}{^\mathsf{T}} \newcommand{\ud}{\mathrm d} \newcommand{\nc}{\color{red}{n}} \newcommand{\Mc}{\color{#55b}{M}} \newcommand{\mc}{\color{#00f}{m}} \newcommand{\dc}{\color{green}{d}}

Our goal

  • A multivariate density model that:
    • Is flexible enough to model complex data (nonparametric)
    • Is computationally efficient to estimate and evaluate
    • Has statistical guarantees

Exponential families

  • Can write many densities on \R^d as: p(x) = \exp\big( \big\langle \underbrace{\eta}_{\substack{\text{natural}\\\text{parameter}}}, \underbrace{T(x)}_{\substack{\text{sufficient}\\\text{statistic}}} \big\rangle_{\R^s} - \underbrace{A(\eta)}_\text{log-normalizer} \big) \, \underbrace{q_0(x)}_{\substack{\text{base}\\\text{measure}}}
    • Gaussian: T(x) = \begin{bmatrix} x & x^2 \end{bmatrix} (worked example after this list)
    • Gamma: T(x) = \begin{bmatrix} x & \log x \end{bmatrix}
  • The fitted density sees the data only through \E_X T(X) (and q_0 )
    • Can we make T richer?
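
For concreteness, here is one way to write the Gaussian bullet above in this form (taking the base measure to be Lebesgue, q_0 \equiv 1 ): \begin{align} \mathcal N(x; \mu, \sigma^2) &= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) \\ &= \exp\Big( \Big\langle \underbrace{\begin{bmatrix} \tfrac{\mu}{\sigma^2} & -\tfrac{1}{2\sigma^2} \end{bmatrix}}_{\eta}, \underbrace{\begin{bmatrix} x & x^2 \end{bmatrix}}_{T(x)} \Big\rangle_{\R^2} - \underbrace{\left( \tfrac{\mu^2}{2\sigma^2} + \tfrac12 \log(2\pi\sigma^2) \right)}_{A(\eta)} \Big) \end{align}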

Infinitely many features with kernels!

  • Kernel: “similarity” function k(x, y) \in \R
    • e.g. exponentiated quadratic k(x, y) = \exp\left( - \tfrac{1}{2 \sigma^2} \lVert x - y \rVert^2 \right) (code sketch after this list)
  • For any positive definite k on \mathcal X , there's a reproducing kernel Hilbert space \h of functions on \mathcal X with k(x, y) = \langle k(x, \cdot), k(y, \cdot) \rangle_\h
  • k(x, \cdot) \in \h is the infinite-dimensional feature map of x
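
A minimal NumPy sketch of the exponentiated quadratic kernel above (not part of the original slides): build a Gram matrix and check numerically that it is positive semi-definite.

```python
import numpy as np

def exp_quad_kernel(X, Y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), for all pairs of rows of X and Y."""
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Y**2, axis=1)[None, :]
                - 2 * X @ Y.T)
    return np.exp(-sq_dists / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))           # 50 points in R^3
K = exp_quad_kernel(X, X, sigma=1.0)   # 50 x 50 Gram matrix

# Positive definiteness in practice: eigenvalues are nonnegative up to round-off.
print(np.linalg.eigvalsh(K).min() > -1e-10)
```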

Functions in the RKHS

  • Functions in \h are (the closure of) linear combinations of feature maps k(X_a, \cdot) : f(x) = \sum_a \beta_a k(X_a, x)

Reproducing property

  • Functions in \h are (the closure of) linear combinations of feature maps k(X_a, \cdot) : f(x) = \sum_a \beta_a k(X_a, x)
  • Reproducing property: \begin{align} \langle f, k(x, \cdot) \rangle_\h &{}= \left\langle \sum_a \beta_a k(X_a, \cdot), k(x, \cdot) \right\rangle_\h \\&{}= \sum_a \beta_a \left\langle k(X_a, \cdot), k(x, \cdot) \right\rangle_\h \\&{}= \sum_a \beta_a k(X_a, x) \\&{}= f(x) \end{align}

Kernel exponential families

  • Lift the parameters from \R^s to the RKHS \h with kernel k :
  • Use \eta = f \in \h and T(x) = k(x, \cdot) , so (code sketch after this list): \begin{align} p_f(x) &= \exp\big( \big\langle \eta, T(x) \big\rangle_\h - A(\eta) \big) \, q_0(x) \\&{}= \exp\big( \big\langle f, k(x, \cdot) \big\rangle_\h - A(f) \big) \, q_0(x) \\&{}= \exp\big( f(x) - A(f) \big) \, q_0(x) \end{align}
  • Covers standard exponential families: k(x, y) = T(x) \cdot T(y)
    • e.g. normal distribution has k(x, y) = x y + x^2 y^2
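
A minimal sketch of evaluating such a model (assumptions not in the slides: exponentiated quadratic kernel, standard normal base measure q_0 , and arbitrary placeholder coefficients for f ). Since A(f) is left uncomputed, this returns the unnormalized log density f(x) + \log q_0(x) = \log p_f(x) + A(f) .

```python
import numpy as np

def exp_quad_kernel(X, Y, sigma=1.0):
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def unnormalized_log_density(x, X_basis, beta, sigma=1.0):
    """f(x) + log q_0(x), with f(x) = sum_a beta_a k(X_a, x) and q_0 = N(0, I)."""
    f_x = exp_quad_kernel(x[None, :], X_basis, sigma)[0] @ beta
    log_q0 = -0.5 * np.sum(x**2) - 0.5 * len(x) * np.log(2 * np.pi)
    return f_x + log_q0

rng = np.random.default_rng(0)
X_basis = rng.normal(size=(20, 2))     # "data" points X_a
beta = 0.1 * rng.normal(size=20)       # placeholder natural-parameter weights
print(unnormalized_log_density(np.zeros(2), X_basis, beta))
```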

Rich class of densities

p_f(x) = \exp\big(f(x) - A(f) \big) \, q_0(x)
  • For a rich-enough k that decays to 0,
    \{ p_f \}_{f \in \h} is dense in the set of distributions that decay like q_0
    • Examples of \exp(f(x)) :

Density estimation

p_f(x) = \exp\big(f(x) - A(f) \big) \, q_0(x)
  • Given a sample \{ X_a \}_{a \in [n]} \sim p_0 , we want f so that p_f \approx p_0
  • A(f) is hard to compute
  • Maximum likelihood would need A(f) at each step: intractable
  • Need a way to fit an unnormalized probabilistic model
  • Could then estimate A(f) once after fitting

Unnormalized density / score estimation

  • Don't necessarily need to compute A(f) afterwards
  • f + \log q_0 = \log p_f + A(f) (“energy”) lets us:
    • Find modes (global or local)
    • Sample (with MCMC)
  • The score, \nabla_x \log p_f(x) = \nabla_x f(x) + \nabla_x \log q_0(x) , lets us:
    • Run HMC when we can't evaluate gradients
    • Construct Monte Carlo control functionals
  • But again, need a way to find f

Score matching in KEFs

    [Sriperumbudur, Fukumizu, Gretton, Hyvärinen, and Kumar, JMLR 2017]

  • Idea: minimize the Fisher divergence J(f) = J(p_0 \| p_f) = \frac{1}{2} \int p_0(x) \, \left\lVert \nabla_x \log p_f(x) - \nabla_x \log p_0(x) \right\rVert_2^2 \, \ud x
  • Under mild assumptions, integration by parts gives J(f) = \int p_0(x) \sum_{i=1}^d \left[ \partial_i^2 f(x) + \frac12 \left( \partial_i f(x) \right)^2 \right] \ud x + C(p_0, q_0)
  • Estimate with \hat J(f) = \frac1n \sum_{a=1}^n \sum_{i=1}^d \left[ \partial_i^2 f(X_a) + \frac12 \left( \partial_i f(X_a) \right)^2 \right]
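
A minimal numerical sketch of \hat J (not the paper's kernel estimator): evaluate the objective above for any callable log-density model f via central finite differences, with a standard normal sanity check.

```python
import numpy as np

def score_matching_objective(f, X, eps=1e-4):
    """(1/n) sum_a sum_i [ d_i^2 f(X_a) + 0.5 * (d_i f(X_a))^2 ] via finite differences."""
    n, d = X.shape
    total = 0.0
    for a in range(n):
        fx = f(X[a])
        for i in range(d):
            e = np.zeros(d)
            e[i] = eps
            f_plus, f_minus = f(X[a] + e), f(X[a] - e)
            d_i = (f_plus - f_minus) / (2 * eps)            # first partial derivative
            d_ii = (f_plus - 2 * fx + f_minus) / eps**2     # second partial derivative
            total += d_ii + 0.5 * d_i**2
    return total / n

# Sanity check: for f(x) = -0.5 ||x||^2 (standard normal log density, flat q_0)
# and data from N(0, I), the objective should be close to -d/2.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
print(score_matching_objective(lambda x: -0.5 * np.sum(x**2), X))
```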

Score matching in kernel exp. families

  • Minimize regularized loss function: \hat J_\lambda(f) = \frac1n \sum_{a=1}^n \sum_{i=1}^d \left[ \partial_i^2 f(X_a) + \frac12 \left( \partial_i f(X_a) \right)^2 \right] + \frac12 \lambda \lVert f \rVert_\h^2
  • Thanks to the representer theorem, we know f_{\lambda,n} \in \spn\left( \left\{ \partial_i k(X_a, \cdot) \right\}_{a \in [n]}^{i \in [d]} \cup \left\{ \partial_i^2 k(X_a, \cdot) \right\}_{a \in [n]}^{i \in [d]} \right)

Density estimation fit

  • Can get an analytic solution to weights in that span: f_{\lambda,\nc}(x) = \sum_{a=1}^{\nc} \sum_{i=1}^{\dc} \beta_{(a, i)} \partial_i k(X_a, x) - \frac{\hat\xi}{\lambda}
    • \hat\xi has simple closed form
    • \beta = -\frac1\lambda \bigl( \underbrace{G_{XX}}_{\nc\dc \times \nc\dc} + n \lambda I \bigr)^{-1} \underbrace{h_X}_{\nc\dc \times 1}
    • Solving an \nc\dc \times \nc\dc linear system takes \bigO(\nc^3 \dc^3) time!
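
A minimal sketch of that linear solve (the actual G_{XX} and h_X are built from kernel derivatives as in the paper; random placeholders are used here only to make the \bigO(\nc^3 \dc^3) cost concrete).

```python
import numpy as np
import time

n, d, lam = 200, 5, 1e-3
rng = np.random.default_rng(0)
A = rng.normal(size=(n * d, n * d))
G_XX = A @ A.T                       # placeholder nd x nd PSD matrix
h_X = rng.normal(size=n * d)         # placeholder nd vector

t0 = time.perf_counter()
beta = -np.linalg.solve(G_XX + n * lam * np.eye(n * d), h_X) / lam
print(f"solved a {n * d} x {n * d} system in {time.perf_counter() - t0:.2f} s")
```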

The Nyström approximation

  • Representer theorem: the best f \in \h lies in \spn\left( \left\{ \partial_i k(X_a, \cdot) \right\}_{a \in [\nc]}^{i \in [\dc]} \cup \left\{ \partial_i^2 k(X_a, \cdot) \right\}_{a \in [\nc]}^{i \in [\dc]} \right)
  • Nyström approximation: find best f in subspace \h_Y \subset \h
  • If \h_Y has a size- \Mc basis, we can find the coefficients in \bigO(\nc\dc \Mc^2) time (code sketch after this list): \beta = - \big(\tfrac1n \underbrace{B_{XY}\tp}_{\Mc \times \nc\dc} \; \underbrace{B^\phantom{\mathsf T}_{XY}}_{\nc\dc \times \Mc} + \lambda \underbrace{G_{YY}}_{\Mc \times \Mc}\big)^\dagger \underbrace{h_Y}_{\Mc \times 1}
    • Worst case: \Mc = \nc \dc , no improvement
    • If \Mc = \mc \dc , \bigO(\nc \mc^2 \dc^3)
    • If \Mc = \mc , \bigO(\nc \mc^2 \dc)
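
And a matching sketch of the Nyström solve above (again with random placeholder matrices; the paper gives the actual construction of B_{XY} , G_{YY} , and h_Y from kernel derivatives). The dominant cost is forming B\tp B , i.e. \bigO(\nc\dc\Mc^2) .

```python
import numpy as np

n, d, M, lam = 200, 5, 50, 1e-3
rng = np.random.default_rng(0)
B_XY = rng.normal(size=(n * d, M))   # placeholder nd x M cross matrix
G_YY = np.eye(M)                     # placeholder M x M Gram matrix of the basis
h_Y = rng.normal(size=M)             # placeholder M vector

# beta = -((1/n) B^T B + lam G_YY)^+ h_Y, per the formula above
beta = -np.linalg.pinv(B_XY.T @ B_XY / n + lam * G_YY) @ h_Y
print(beta.shape)                    # (M,) coefficients instead of (n*d,)
```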

Nyström basis choice

  • Representer theorem: the best f \in \h lies in \spn\left( \left\{ \partial_i k(X_a, \cdot) \right\}_{a \in [\nc]}^{i \in [\dc]} \cup \left\{ \partial_i^2 k(X_a, \cdot) \right\}_{a \in [\nc]}^{i \in [\dc]} \right)
  • Reasonable \h_Y : pick Y = \{ (a, i) \} \subset [\nc] \times [\dc] at random, then \h_Y = \spn\left\{ \partial_i k(X_a, \cdot) \right\}_{(a, i) \in Y}
  • “lite”: pick Y = \{ a \} \subset [\nc] at random, then \h_Y = \spn\left\{ k(X_a, \cdot) \right\}_{a \in Y}
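
A minimal sketch of the two basis-sampling schemes (index sampling only; building the corresponding feature vectors follows the paper).

```python
import numpy as np

n, d, M, m = 500, 10, 60, 40
rng = np.random.default_rng(0)

# Random (a, i) pairs: H_Y = span{ d_i k(X_a, .) : (a, i) in Y }
flat = rng.choice(n * d, size=M, replace=False)
Y_pairs = [(idx // d, idx % d) for idx in flat]   # (data index a, coordinate i)

# "lite": random data points only, H_Y = span{ k(X_a, .) : a in Y }
Y_lite = rng.choice(n, size=m, replace=False)

print(Y_pairs[:3], Y_lite[:3])
```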

Theory: Main result

  • Well-specified case: there is some f_0 \in \h so that p_0 = p_{f_0}
  • Some technical assumptions on kernel, density smoothness
  • If we choose basis points uniformly:
    • \theta is a parameter depending on problem smoothness:
      • \frac12 in the worst case, \frac13 in the best
    • As long as M = \Omega(n^\theta d \log n) , with \lambda = n^{-\theta} we get:
      • Fisher divergence J :
        • Consistent; same rate as full solution
      • \lVert\cdot\rVert_\h , KL, Hellinger, L_r ( 1 \le r \le \infty ):
        • Same rate, consistency for hard problems
        • Consistent with slightly slower rate for easy problems

Proof ideas

  • Error can be bounded relative to \lVert f - f_0 \rVert_\h
  • Define the best estimator in the basis, f_\lambda^m = \argmin_{f \in \h_Y} J_\lambda(f)
  • \lVert f_{\lambda,n}^m - f_0 \rVert_\h \le \underbrace{\lVert f_{\lambda,n}^m - f_\lambda^m \rVert_\h}_\text{estimation} + \underbrace{\lVert f_{\lambda}^m - f_0 \rVert_\h}_\text{approximation}
  • Using lots of techniques from earlier work (references in the paper)

Estimation error: \lVert f_{\lambda,n}^m - f_\lambda^m \rVert_\h

  • Error due to finite number of samples n
  • We show it's \bigO_{p_0}\left( \frac{1}{\lambda \sqrt n} \right)
  • Proof is based on:
    • standard concentration inequalities in \h
    • a new Bernstein inequality for sums of correlated operators
    • some tricks manipulating operators in \h

Approximation error: \lVert f_{\lambda}^m - f_0 \rVert_\h

  • We show it's \bigO_{p_0}\left( \lambda^b \right) , as long as M = \Omega\left( \frac1\lambda \log \frac1\lambda \right)
  • Key part of proof: \h_Y covers “most of” f_0
    • Also needs new Bernstein inequality for correlated operators
    • Along with lots of tricks
    • Only part specific to the choice of \h_Y

Nyström convergence rate

  • Final bound is just plugging those together and optimizing \lambda
  • Slightly better rate in Fisher divergence
    • Loss that only cares about where p_0 lies
  • Doesn't account for d in a satisfying way
    • Constants about problem difficulty depend on d
  • Only 17 pages of dense manipulations!

Other main score estimator

  • Train a denoising autoencoder with Gaussian noise of scale \sigma
  • The normalized reconstruction error estimates the score (sketch after this list):
    • \frac{r_\sigma(x) - x}{\sigma^2} \to \nabla_x \log p_0(x)
  • …as long as the autoencoder has infinite capacity
  • …and is at a global optimum
  • \sigma acts like a bandwidth, so it has to be tuned
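
A minimal sanity-check sketch of this estimator (not the talk's autoencoder experiment): for p_0 = \mathcal N(0, I) the Bayes-optimal denoiser under \mathcal N(0, \sigma^2 I) noise is r_\sigma(x) = x / (1 + \sigma^2) , so the estimate approaches the true score -x as \sigma \to 0 .

```python
import numpy as np

def dae_score_estimate(r_sigma, x, sigma):
    """Score estimate from a trained denoiser r_sigma, per the formula above."""
    return (r_sigma(x) - x) / sigma**2

x = np.array([1.0, -2.0, 0.5])
true_score = -x                              # grad_x log N(x; 0, I)
for sigma in (1.0, 0.1, 0.01):
    r_sigma = lambda y: y / (1 + sigma**2)   # Bayes-optimal denoiser for N(0, I)
    err = np.max(np.abs(dae_score_estimate(r_sigma, x, sigma) - true_score))
    print(sigma, err)                        # error shrinks as sigma -> 0
```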

Experiments

  • Two categories:
    • Synthetic experiments
      • We know true score, and so can compute J
    • Gradient-free HMC
      • We can evaluate quality of proposed gradient steps

Synthetic experiment: Grid

Normal distributions at d vertices of a hypercube in \R^d

Grid results

Normal distributions at d vertices of a hypercube in \R^d


Grid runtimes

Normal distributions at d vertices of a hypercube in \R^d


Synthetic experiment: Rings

  • Three circles, plus small Gaussian noise in radial direction
  • The remaining dimensions are independent Gaussian noise

Rings results


Gradient-free HMC

  • Hyperparameter selection for GP classifier on UCI Glass dataset
  • Can estimate likelihood of hyperparameters with EP
    • Can't get gradients this way
  • Fit models to 2000 samples from 40 very-thinned random walks
  • Track HMC Metropolis acceptance rate for many trajectories
    • higher acceptance rate \approx better model

Hyperparameter surface

  • Nice surface to optimize with gradient descent
    • Potentially allows for complex (deep?) kernels

Recap

  • Kernel exponential families are rich density models
  • Hard to normalize, so maximum likelihood is intractable
  • Can do score matching, but it's slow
  • Nyström gives speedups with guarantees on performance
  • Lite version even faster, but no guarantees (yet)
  • Could apply to kernel conditional exponential family

Efficient and principled score estimation with kernel exponential families
Dougal J. Sutherland*, Heiko Strathmann*, Michael Arbel, Arthur Gretton
arXiv:1705.08360