Efficiently estimating densities and scores
with Nyström kernel exponential families

Gatsby unit, University College London

Gatsby Tri-Centre Meeting, June 2018

\DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator{\bigO}{\mathcal O} \DeclareMathOperator*{\E}{\mathbb E} \newcommand{\h}{\mathcal H} \newcommand{\g}{\mathcal G} \newcommand{\R}{\mathbb R} \DeclareMathOperator{\spn}{span} \newcommand{\tp}{^\mathsf{T}} \newcommand{\ud}{\mathrm d} \newcommand{\nc}{\color{#d62728}{n}} \newcommand{\Xc}{\color{#d62728}{X}} \newcommand{\Mc}{\color{#1f77b4}{M}} \newcommand{\Yc}{\color{#1f77b4}{Y}} \newcommand{\mc}{\color{#17becf}{m}} \newcommand{\dc}{\color{#2ca02c}{d}}

Our goal

  • A multivariate density model that:
    • Is flexible enough to model complex data (nonparametric)
    • Is computationally efficient to estimate and evaluate
    • Works in high(ish) dimensions
    • Has statistical guarantees
  • Parametric model → nonparametric model:
    • Multivariate Gaussian → Gaussian process
    • Finite mixture model → Dirichlet process mixture model
    • Exponential family → Kernel exponential family

Exponential families

  • Can write many classic densities on \R^d as: p(x) = \exp\big( \big\langle \underbrace{\eta}_{\substack{\text{natural}\\\text{parameter}}}, \underbrace{T(x)}_{\substack{\text{sufficient}\\\text{statistic}}} \big\rangle_{\R^s} - \underbrace{A(\eta)}_\text{log-normalizer} \big) \, \underbrace{q_0(x)}_{\substack{\text{base}\\\text{measure}}}
    • Gaussian: T(x) = \begin{bmatrix} x & x^2 \end{bmatrix}
    • Gamma: T(x) = \begin{bmatrix} x & \log x \end{bmatrix}
  • Density only sees data x through \E_X T(X) (and q_0 )
    • Can we make T richer?
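As a concrete check of this parametrization, here is a minimal numpy sketch recovering the Gaussian density from its natural parameters (assuming base measure q_0 = 1, i.e. Lebesgue, with the \sqrt{2\pi\sigma^2} factor absorbed into A(\eta)):

```python
import numpy as np

def gaussian_natural_params(mu, sigma2):
    """eta for N(mu, sigma2) with T(x) = (x, x^2) and q0 = 1 (Lebesgue)."""
    return np.array([mu / sigma2, -1.0 / (2.0 * sigma2)])

def log_normalizer(eta):
    # A(eta) = -eta1^2 / (4 eta2) + 0.5 * log(-pi / eta2)
    return -eta[0]**2 / (4.0 * eta[1]) + 0.5 * np.log(-np.pi / eta[1])

def density(x, eta):
    T = np.array([x, x**2])   # sufficient statistic
    return np.exp(eta @ T - log_normalizer(eta))

# Agrees with the closed-form N(mu, sigma2) pdf:
mu, sigma2, x = 1.0, 2.0, 0.3
eta = gaussian_natural_params(mu, sigma2)
ref = np.exp(-(x - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
assert np.isclose(density(x, eta), ref)
```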

Infinitely many features with kernels!

  • Kernel: “similarity” function k(x, y) \in \R
    • e.g. exponentiated quadratic k(x, y) = \exp\left( - \tfrac{1}{2 \sigma^2} \lVert x - y \rVert^2 \right)
  • Reproducing kernel Hilbert space \h of functions on \mathcal X with f(x) = \langle f, k_x \rangle_\h
  • k_x \in \h is the infinite-dimensional feature map of x
    • Similarity to all possible inputs y : k_x(y) = k(x, y)
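A minimal numpy sketch of these objects (the bandwidth \sigma and toy data are arbitrary illustrative choices); for functions in the span of feature maps, the reproducing property f(x) = \langle f, k_x \rangle_\h reduces to a finite kernel sum:

```python
import numpy as np

def k(x, y, sigma=1.0):
    """Exponentiated quadratic kernel on R^d."""
    d = np.asarray(x) - np.asarray(y)
    return np.exp(-0.5 * np.dot(d, d) / sigma**2)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))   # toy inputs
alpha = rng.normal(size=5)    # coefficients

# A function in H built from feature maps: f = sum_a alpha_a k_{X_a};
# evaluating it via the reproducing property is just a kernel sum.
def f(x):
    return sum(a * k(Xa, x) for a, Xa in zip(alpha, X))

# k_x(y) = k(x, y): similarity of x to every possible input y
assert np.isclose(k(X[0], X[0]), 1.0)   # maximal self-similarity
```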

Kernel exponential families

  • Lift parameters from \R^s to the RKHS \h , with kernel k :
  • Use parameter \eta = f \in \h and sufficient statistic T(x) = k_x : \begin{align} p_f(x) &= \exp\big( \big\langle f, k_x \big\rangle_\h - A(f) \big) \, q_0(x) \\ &= \exp\big( f(x) - A(f) \big) \, q_0(x) \end{align}
  • Covers standard exponential families: k(x, y) = T(x) \cdot T(y)
    • e.g. normal distribution has k(x, y) = x y + x^2 y^2
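Since A(f) has no closed form in general, in practice one evaluates the unnormalized log density. A small sketch (the kernel centers, coefficients, and base measure q_0 = N(0, 4I) on \R^2 are all arbitrary illustrative choices):

```python
import numpy as np

def k(x, y, sigma=1.0):
    d = x - y
    return np.exp(-0.5 * np.dot(d, d) / sigma**2)

rng = np.random.default_rng(0)
centers = rng.normal(size=(6, 2))   # kernel centers defining f
alpha = rng.normal(size=6)          # coefficients of f in the span

def log_p_unnorm(x):
    """f(x) + log q0(x); the log-normalizer A(f) is left uncomputed."""
    f = sum(a * k(c, x) for a, c in zip(alpha, centers))
    log_q0 = -np.dot(x, x) / 8.0 - np.log(8 * np.pi)   # q0 = N(0, 4 I) on R^2
    return f + log_q0
```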

Rich class of densities

p_f(x) = \exp\big(f(x) - A(f) \big) \, q_0(x)
  • For “strong” k that decay to 0, \{ p_f \}_{f \in \h} is dense in the set of continuous densities with tails like q_0
    • Examples of \exp(f(x)) with Gaussian k :

Density estimation

p_f(x) = \exp\big(f(x) - A(f) \big) \, q_0(x)
  • Given sample \{ X_b \}_{b \in [n]} \sim p_0 , want f so that p_f \approx p_0
  • Usual approach: maximum likelihood
    • A(f) , \nabla A(f) are hard to compute
    • Likelihood equations ill-posed for characteristic kernels
  • Need a way to fit an unnormalized probabilistic model
    • Could then estimate A(f) once after fitting

Unnormalized density / score estimation

  • Don't necessarily need to compute A(f) afterwards
  • f + \log q_0 = \log p_f + A(f) (“energy”) lets us:
    • Find modes (global or local)
    • Sample (with MCMC)
  • The score, \nabla_x f(x) = \nabla_x \log p_f(x) , lets us:
    • Run HMC when we can't evaluate gradients
    • Construct Monte Carlo control functionals
  • But again, need a way to find f

Score matching in KEFs

    [Sriperumbudur, Fukumizu, Gretton, Hyvärinen, and Kumar, JMLR 2017]

  • Idea: minimize the Fisher divergence J(p_0 \| p_f) : J(f) = \frac{1}{2} \int p_0(x) \left\lVert \nabla_x \log p_f(x) - \nabla_x \log p_0(x) \right\rVert_2^2 \, \ud x
  • Under mild assumptions, integration by parts gives J(f) = \int p_0(x) \sum_{i=1}^d \left[ \partial_i^2 f(x) + \frac12 \left( \partial_i f(x) \right)^2 \right] \ud x + C(p_0, q_0)
  • Can estimate with Monte Carlo
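A quick Monte Carlo illustration of this objective (taking q_0 as Lebesgue measure so \nabla_x \log p_f = \nabla_x f, with the toy parametric family f_s(x) = -\lVert x \rVert^2 / (2s), an assumption for illustration only): up to the constant C, the estimate is minimized by the model matching p_0.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 3
X = rng.normal(size=(n, d))   # sample from p0 = N(0, I)

def J_hat(grad_f, second_f, X):
    """(1/n) sum_a sum_i [ d_i^2 f(X_a) + 0.5 (d_i f(X_a))^2 ]."""
    return np.mean(np.sum(second_f(X) + 0.5 * grad_f(X)**2, axis=1))

def make(s):
    # f_s(x) = -||x||^2 / (2 s): d_i f = -x_i / s, d_i^2 f = -1 / s
    return (lambda X: -X / s), (lambda X: np.full_like(X, -1.0 / s))

scores = {s: J_hat(*make(s), X) for s in (0.5, 1.0, 2.0)}
# s = 1 matches p0 (its score is exactly -x), so it attains the minimum
assert min(scores, key=scores.get) == 1.0
```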

Score matching in KEFs

  • Minimize regularized loss function: \hat J_\lambda(f) = \frac1n \sum_{a=1}^n \sum_{i=1}^d \left[ \partial_i^2 f(X_a) + \frac12 \left( \partial_i f(X_a) \right)^2 \right] + \frac12 \lambda \lVert f \rVert_\h^2
  • Representer theorem tells us minimizer of \hat J_\lambda over \h is f_{\lambda,\Xc} \in \spn\left\{ \partial_i k_{X_a} \right\}_{a \in [n]}^{i \in [d]} \cup \spn\left\{ \partial_i^2 k_{X_a} \right\}_{a \in [n]}^{i \in [d]}

Best score matching fit in subspace

  • Best f \in \h is in \color{#8172b2}{\h_\text{full}} = \spn\left\{ \partial_i k_{X_a} \right\}_{a \in [\nc]}^{i \in [\dc]} \cup \spn\left\{ \partial_i^2 k_{X_a} \right\}_{a \in [\nc]}^{i \in [\dc]}
  • Can find best f in a subspace of dim \Mc in \bigO(\nc\dc \Mc^2 + \Mc^3) time: \beta = - (\tfrac1n \underbrace{B_{\Xc \Yc}\tp}_{\Mc \times \nc\dc} \; \underbrace{B^\phantom{\mathsf T}_{\Xc \Yc}}_{\nc\dc \times \Mc} + \lambda \underbrace{G_{\Yc \Yc}}_{\Mc \times \Mc})^\dagger \underbrace{h_\Yc}_{\Mc \times 1}
    • \color{#8172b2}{\h_\text{full}} : \Mc = \nc \dc , \mathcal{O}(\nc^3 \dc^3) time!

Nyström approximation

  • \color{#8172b2}{\h_\text{full}} = \spn\left\{ \partial_i k_{X_a} \right\}_{a \in [\nc]}^{i \in [\dc]} \cup \spn\left\{ \partial_i^2 k_{X_a} \right\}_{a \in [\nc]}^{i \in [\dc]}
  • Nyström approximation: find fit in different (smaller) \h_{\Yc}
  • One choice: pick \Yc \subset [\nc] , \lvert \Yc \rvert = \mc at random, then \color{#ccb974}{\h_\text{nys}^Y} = \spn\left\{ \partial_i k_{X_a} \right\}_{a \in \Yc}^{i \in [\dc]} \qquad \mathcal{O}(\nc \mc^2 \dc^3)\text{ time}
  • “lite”: pick \Yc at random, then \color{#64b5c2}{\h_\text{lite}^Y} = \spn\left\{ k_{X_a} \right\}_{a \in \Yc} \qquad \mathcal{O}(\nc \mc^2 \dc)\text{ time}
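A numpy sketch of the "lite" fit with a Gaussian kernel (bandwidth, regularization, and data sizes are illustrative choices, not the paper's experimental settings); it mirrors the closed-form \beta = -(\tfrac1n B\tp B + \lambda G)^\dagger h, with B stacking first kernel derivatives at the data and G the Gram matrix regularizing \lVert f \rVert_\h^2:

```python
import numpy as np

def gauss_kernel_derivs(X, Z, sigma):
    """k(X_a, Z_b) plus first and second partials in the first argument."""
    diff = X[:, None, :] - Z[None, :, :]                   # (n, m, d)
    K = np.exp(-0.5 * (diff**2).sum(-1) / sigma**2)        # (n, m)
    dK = -diff / sigma**2 * K[..., None]                   # (n, m, d)
    d2K = (diff**2 / sigma**4 - 1.0 / sigma**2) * K[..., None]
    return K, dK, d2K

def fit_lite(X, Z, sigma=1.0, lam=1e-3):
    """Closed-form score-matching fit of f = sum_b beta_b k(., Z_b)."""
    n, d = X.shape
    _, dK, d2K = gauss_kernel_derivs(X, Z, sigma)
    h = d2K.sum(axis=(0, 2)) / n                           # (m,)
    B = dK.transpose(0, 2, 1).reshape(n * d, -1)           # (n d, m)
    G = gauss_kernel_derivs(Z, Z, sigma)[0]                # Gram for ||f||_H^2
    C = B.T @ B / n + lam * G
    beta = -np.linalg.lstsq(C, h, rcond=None)[0]
    return beta, h, C

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
beta, h, C = fit_lite(X, X[:40])                           # m = 40 basis points

# The fitted beta minimizes the quadratic loss over the basis, so it can
# do no worse than f = 0, whose regularized loss is exactly 0:
J_at_beta = h @ beta + 0.5 * beta @ C @ beta
assert J_at_beta <= 1e-8
```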

Analysis: Main result

  • Well-specified case: there is some f_0 \in \h so that p_0 = p_{f_0}
  • Some technical assumptions on kernel, density smoothness
    • \theta is a parameter depending on problem smoothness:
      • \frac12 in the worst case, \frac13 in the best
  • Choose \color{#ccb974}{\h_\text{nys}^Y} basis, size \mc = \Omega(\nc^\theta \log \nc) uniformly, \lambda = \nc^{-\theta} :
    • Fit takes \mathcal{O}(\nc^{1 + 2 \theta} \dc^3 \log^2\!\nc) time instead of \mathcal{O}(\nc^3 \dc^3)
    • Fisher distance J :
    • \lVert\cdot\rVert_\h , KL, Hellinger, L_r ( 1 \le r \le \infty ):
      • Same rate, consistency for hard problems
      • Rate saturates slightly sooner for easy problems

Proof ideas

  • Error can be bounded relative to \lVert f - f_0 \rVert_\h
  • Define best estimator in basis f_\lambda^{\Yc} = \argmin_{f \in \h_{\Yc}} J_\lambda(f) : \lVert f_{\lambda,\Xc}^{\Yc} - f_0 \rVert_\h \le \underbrace{\lVert f_{\lambda,\Xc}^{\Yc} - f_\lambda^{\Yc} \rVert_\h}_\text{estimation} + \underbrace{\lVert f_{\lambda}^{\Yc} - f_0 \rVert_\h}_\text{approximation}
  • New Bernstein inequality for sums of correlated operators
  • Approximation key part: \h_{\Yc} covers “most of” f_0

Synthetic experiment: Grid

Normal distributions at d vertices of a hypercube in \R^d

Grid results


Grid runtimes


Synthetic experiment: Rings

  • Three circles, plus small Gaussian noise in radial direction
  • Other dimensions are uniform Gaussian noise

Rings results

Gradient-free Hamiltonian Monte Carlo

  • Want to marginalize out hyperparameter choice for a GP classifier
    • Can estimate likelihood of hyperparameters with EP (expectation propagation)
    • Can't get gradients this way: no HMC
  • Kernel Adaptive HMC [Strathmann et al. NIPS-15]:
    • Fit model to estimate scores based on current samples
    • Propose HMC trajectories based on score estimate
    • Metropolis rejection step accounts for errors in score estimate

Gradient-free HMC

Learning the kernel

  • Nice surface to optimize with gradient descent

Learning deep kernels

Use kernel k(x, y) = k_\text{SE}(\varphi(x), \varphi(y)) , with \varphi a deep network
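A toy sketch of such a kernel (the two-layer tanh network \varphi and its random weights are placeholders; in this setting the weights would be learned through the score-matching objective):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 8)) / np.sqrt(2)   # toy weights, would be learned
W2 = rng.normal(size=(8, 4)) / np.sqrt(8)

def phi(x):
    """A small feedforward feature extractor R^2 -> R^4."""
    return np.tanh(np.tanh(x @ W1) @ W2)

def deep_kernel(x, y, sigma=1.0):
    """k(x, y) = k_SE(phi(x), phi(y)): still a valid PSD kernel."""
    d = phi(x) - phi(y)
    return np.exp(-0.5 * np.dot(d, d) / sigma**2)

x = rng.normal(size=2)
y = rng.normal(size=2)
assert np.isclose(deep_kernel(x, x), 1.0)                 # k(x, x) = 1
assert np.isclose(deep_kernel(x, y), deep_kernel(y, x))   # symmetry
```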

Fits with 0, 1, 2, and 3 layers

Kernel Conditional Exponential Family

Michael Arbel, Arthur Gretton; AISTATS 2018, arXiv:1711.05363

  • Add kernel for conditioning variable too
  • P(X_1, \dots, X_d) = \prod_i P(X_i \mid X_{\pi(i)})

Summary
  • Kernel exponential families are rich density models
  • Hard to normalize, so maximum likelihood is intractable
  • Score matching works, but it's slow
  • Nyström gives speedups with guarantees on performance
  • Lite variant even faster, but no guarantees (yet)
  • More complex densities: learn kernel and/or conditional structure

Efficient and principled score estimation
with Nyström kernel exponential families

Dougal J. Sutherland*, Heiko Strathmann*, Michael Arbel, Arthur Gretton
AISTATS 2018; arXiv:1705.08360