Efficiently estimating densities and scores
with Nyström kernel exponential families
Gatsby unit, University College London
Gatsby Tri-Centre Meeting, June 2018
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator{\bigO}{\mathcal O}
\DeclareMathOperator*{\E}{\mathbb E}
\newcommand{\h}{\mathcal H}
\newcommand{\g}{\mathcal G}
\newcommand{\R}{\mathbb R}
\DeclareMathOperator{\spn}{span}
\newcommand{\tp}{^\mathsf{T}}
\newcommand{\ud}{\mathrm d}
\newcommand{\nc}{\color{#d62728}{n}}
\newcommand{\Xc}{\color{#d62728}{X}}
\newcommand{\Mc}{\color{#1f77b4}{M}}
\newcommand{\Yc}{\color{#1f77b4}{Y}}
\newcommand{\mc}{\color{#17becf}{m}}
\newcommand{\dc}{\color{#2ca02c}{d}}
Our goal

A multivariate density model that:
- Is flexible enough to model complex data (nonparametric)
- Is computationally efficient to estimate and evaluate
- Works in high(ish) dimensions
- Has statistical guarantees

Parametric model          Nonparametric model
Multivariate Gaussian     Gaussian process
Finite mixture model      Dirichlet process mixture model
Exponential family        Kernel exponential family
Exponential families

Can write many classic densities on
\R^d
as:
p(x) =
\exp\big(
\big\langle
\underbrace{\eta}_{\substack{\text{natural}\\\text{parameter}}},
\underbrace{T(x)}_{\substack{\text{sufficient}\\\text{statistic}}}
\big\rangle_{\R^s}
- \underbrace{A(\eta)}_\text{log-normalizer}
\big) \, \underbrace{q_0(x)}_{\substack{\text{base}\\\text{measure}}}
Gaussian:
T(x) = \begin{bmatrix} x & x^2 \end{bmatrix}
Gamma:
T(x) = \begin{bmatrix} x & \log x \end{bmatrix}
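For concreteness, here is the familiar normal density N(\mu, \sigma^2) written in this form (a standard identity, not shown on the slide), with base measure q_0(x) = 1:

\exp\Big(
\Big\langle
\underbrace{\big(\tfrac{\mu}{\sigma^2},\, -\tfrac{1}{2\sigma^2}\big)}_{\eta},\,
\underbrace{(x,\, x^2)}_{T(x)}
\Big\rangle_{\R^2}
- \underbrace{\Big(\tfrac{\mu^2}{2\sigma^2} + \log(\sigma\sqrt{2\pi})\Big)}_{A(\eta)}
\Big)
= \frac{1}{\sigma\sqrt{2\pi}} \exp\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big)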
Density only sees data
x
through
\E_X T(X)
(and
q_0
)

Infinitely many features with kernels!

Kernel: “similarity” function
k(x, y) \in \R
e.g. exponentiated quadratic
k(x, y) = \exp\left( - \tfrac{1}{2 \sigma^2} \lVert x - y \rVert^2 \right)
Reproducing kernel Hilbert space
\h
of functions on
\mathcal X
with
f(x) = \langle f, k_x \rangle_\h
k_x \in \h
is the infinite-dimensional feature map of
x
Similarity to all possible inputs
y
:
k_x(y) = k(x, y)
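As a minimal numerical illustration (my own sketch, not part of the talk; the function name k and the bandwidth default are mine), the exponentiated quadratic kernel and the feature-map reading k_x(y) = k(x, y):

```python
import numpy as np

def k(x, y, sigma=1.0):
    """Exponentiated quadratic kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

# k_x is the (infinite-dimensional) feature map of x: as a function, y -> k(x, y).
x = np.array([0.0, 0.0])
k_x = lambda y: k(x, y)
print(k_x(np.array([1.0, 0.0])))  # similarity of x to the input y = (1, 0)
```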
Kernel exponential families

Lift parameters from
\R^s
to the RKHS
\h
, with kernel
k
: Use parameter
\eta = f \in \h
and sufficient statistic
T(x) = k_x
:
\begin{align}
p_f(x)
&= \exp\big(
\big\langle f, k_x \big\rangle_\h
- A(f)
\big) \, q_0(x)
\\ &= \exp\big(
f(x)
- A(f)
\big) \, q_0(x)
\end{align}
Covers standard exponential families:
k(x, y) = T(x) \cdot T(y)
e.g. normal distribution has
k(x, y) = x y + x^2 y^2
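A toy sketch (my own, with made-up centres and coefficients, not from the talk) of evaluating the unnormalized density exp(f(x)) q_0(x) for an f in the span of Gaussian kernel functions, with q_0 a standard normal:

```python
import numpy as np

# Toy illustration: an unnormalized kernel-exponential-family density
# exp(f(x)) q_0(x), with f a combination of Gaussian kernel functions
# centred at a few points z_a, and q_0 a standard normal base measure.
rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 2))    # centres z_a in R^2 (arbitrary)
alpha = rng.normal(size=5)     # coefficients of f in span{k_{z_a}}
sigma = 1.0

def f(x):
    """f(x) = sum_a alpha_a k(z_a, x), exponentiated quadratic kernel."""
    sqd = np.sum((Z - x) ** 2, axis=1)
    return alpha @ np.exp(-sqd / (2 * sigma ** 2))

def unnormalized_density(x):
    """p_f(x) up to the factor exp(-A(f)): exp(f(x)) q_0(x)."""
    q0 = np.exp(-0.5 * np.sum(x ** 2)) / (2 * np.pi)  # standard normal in 2D
    return np.exp(f(x)) * q0
```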
Rich class of densities
p_f(x) = \exp\big(f(x) - A(f) \big) \, q_0(x)
For “strong”
k
that decay to 0,
\{ p_f \}_{f \in \h}
is dense in the set of continuous densities with tails like
q_0
Examples of
\exp(f(x))
with Gaussian
k
:

Density estimation
p_f(x) = \exp\big(f(x) - A(f) \big) \, q_0(x)
Given sample
\{ X_b \}_{b \in [n]} \sim p_0
, want
f
so that
p_f \approx p_0
Usual approach: maximum likelihood
A(f)
,
\nabla A(f)
are hard to compute
Likelihood equations ill-posed for characteristic kernels

Need a way to fit an unnormalized probabilistic model
Could then estimate
A(f)
once after fitting

Unnormalized density / score estimation

Don't necessarily need to compute
A(f)
afterwards
f + \log q_0 = \log p_f + A(f)
(“energy”) lets us:
- Find modes (global or local)
- Sample (with MCMC)
- …

The score,
\nabla_x f(x) = \nabla_x \log p_f(x)
, lets us:
- Run HMC when we can't evaluate gradients
- Construct Monte Carlo control functionals
- …

But again, need a way to find
f
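One toy use of an estimated score (an illustration of mine, not from the talk; the talk instead uses HMC with a Metropolis correction, discussed later): unadjusted Langevin dynamics driven by a score function grad_x log p(x):

```python
import numpy as np

def langevin_sample(score, x0, step=1e-2, n_steps=1000, seed=0):
    """Unadjusted Langevin dynamics: x <- x + step * score(x) + sqrt(2*step) * noise.
    `score` maps a point to an estimate of grad_x log p(x)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + step * score(x) + np.sqrt(2 * step) * rng.normal(size=x.shape)
    return x
```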
Score matching in KEFs [Sriperumbudur, Fukumizu, Gretton, Hyvärinen, and Kumar, JMLR 2017]
Idea: minimize Fisher divergence
J(p_0 \| p_f)
J(f)
= \frac{1}{2} \int p_0(x) \, \left\lVert \nabla_x \log p_f(x) - \nabla_x \log p_0(x) \right\rVert_2^2 \, \ud x
Under mild assumptions, integration by parts gives
J(f) =
\int p_0(x) \sum_{i=1}^d \left[
\partial_i^2 f(x)
+ \frac12 \left( \partial_i f(x) \right)^2
\right]
\ud x
+ C(p_0, q_0)
Can estimate with Monte Carlo

Score matching in KEFs

Minimize regularized loss function:
\hat J_\lambda(f)
=
\frac1n \sum_{a=1}^n \sum_{i=1}^d \left[
\partial_i^2 f(X_a)
+ \frac12 \left( \partial_i f(X_a) \right)^2
\right]
+ \frac12 \lambda \lVert f \rVert_\h^2
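As a sanity-check sketch (mine, not the paper's code), the data-dependent part of \hat J_\lambda can be evaluated for any candidate f by approximating the partial derivatives with central finite differences; the \lambda/2 \lVert f \rVert_\h^2 penalty is omitted since it depends on how f is represented:

```python
import numpy as np

def empirical_score_matching_loss(f, X, eps=1e-4):
    """(1/n) sum_a sum_i [ d_i^2 f(X_a) + 0.5 * (d_i f(X_a))^2 ],
    with derivatives approximated by central finite differences.
    `f` maps a length-d array to a scalar; X has shape (n, d)."""
    n, d = X.shape
    total = 0.0
    for x in X:
        for i in range(d):
            e = np.zeros(d)
            e[i] = eps
            fp, fm, f0 = f(x + e), f(x - e), f(x)
            d1 = (fp - fm) / (2 * eps)            # ~ d_i f(x)
            d2 = (fp - 2 * f0 + fm) / eps ** 2    # ~ d_i^2 f(x)
            total += d2 + 0.5 * d1 ** 2
    return total / n
```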
Representer theorem tells us minimizer of
\hat J_\lambda
over
\h
is
f_{\lambda,\Xc} \in
\spn\Big(\left\{ \partial_i k_{X_a} \right\}_{a \in [n]}^{i \in [d]}
\cup
\left\{ \partial_i^2 k_{X_a} \right\}_{a \in [n]}^{i \in [d]}\Big)
Best score matching fit in subspace

Best
f \in \h
is in
\color{#8172b2}{\h_\text{full}}
=
\spn\Big(\left\{ \partial_i k_{X_a} \right\}_{a \in [\nc]}^{i \in [\dc]}
\cup
\left\{ \partial_i^2 k_{X_a} \right\}_{a \in [\nc]}^{i \in [\dc]}\Big)
Can find best
f
in subspace of dim
\Mc
in
\bigO(\nc\dc \Mc^2 + \Mc^3)
time
\beta = - (\tfrac1n \underbrace{B_{\Xc \Yc}\tp}_{\Mc \times \nc\dc} \; \underbrace{B^\phantom{\mathsf T}_{\Xc \Yc}}_{\nc\dc \times \Mc} + \lambda \underbrace{G_{\Yc \Yc}}_{\Mc \times \Mc})^\dagger \underbrace{h_\Yc}_{\Mc \times 1}
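In code, given the (precomputed) matrices in the display above, this is a single regularized linear solve; a sketch of mine, using a least-squares solve in place of the pseudoinverse:

```python
import numpy as np

def solve_beta(B_XY, G_YY, h_Y, lam, n):
    """beta = -( (1/n) B_XY^T B_XY + lam * G_YY )^+ h_Y.
    B_XY: (n*d, M) matrix of basis-function derivatives at the data,
    G_YY: (M, M) Gram matrix of the basis, h_Y: (M,) vector.
    Forming the system costs O(n d M^2); solving it costs O(M^3)."""
    A = B_XY.T @ B_XY / n + lam * G_YY
    return -np.linalg.lstsq(A, h_Y, rcond=None)[0]
```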
\color{#8172b2}{\h_\text{full}}
:
\Mc = \nc \dc
,
\mathcal{O}(\nc^3 \dc^3)
time!

Nyström approximation
\color{#8172b2}{\h_\text{full}}
=
\spn\Big(\left\{ \partial_i k_{X_a} \right\}_{a \in [\nc]}^{i \in [\dc]}
\cup
\left\{ \partial_i^2 k_{X_a} \right\}_{a \in [\nc]}^{i \in [\dc]}\Big)
Nyström approximation: find fit in different (smaller)
\h_{\Yc}
One choice:
pick
\Yc \subset [\nc]
,
\lvert \Yc \rvert = \mc
at random, then
\color{#ccb974}{\h_\text{nys}^Y} = \spn\left\{ \partial_i k_{X_a} \right\}_{a \in \Yc}^{i \in [\dc]}
\qquad
\mathcal{O}(\nc \mc^2 \dc^3)\text{ time}
“lite” :
pick
\Yc
at random, then
\color{#64b5c2}{\h_\text{lite}^Y} = \spn\left\{ k_{X_a} \right\}_{a \in \Yc}
\qquad
\mathcal{O}(\nc \mc^2 \dc)\text{ time}
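Below is a minimal, self-contained sketch of the “lite” variant for a Gaussian kernel (my own code, not the paper's; the function name, bandwidth, and regularization defaults are assumptions). It picks m points at random, models f(x) = sum_{a in Y} alpha_a k(X_a, x), and minimizes the score-matching loss in closed form, mirroring the coefficient solve from the earlier slide:

```python
import numpy as np

def lite_score_matching_fit(X, m, sigma=1.0, lam=1e-3, seed=0):
    """"Lite" Nyström score matching with k(y, x) = exp(-||x - y||^2 / (2 sigma^2)).
    Returns coefficients alpha, inducing points Y, and f(x) = sum_a alpha_a k(Y_a, x)."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    Y = X[rng.choice(n, size=m, replace=False)]           # inducing points, |Y| = m

    diff = X[:, None, :] - Y[None, :, :]                   # (n, m, d): X_b - Y_a
    sqd = np.sum(diff ** 2, axis=-1)                        # (n, m)
    K = np.exp(-sqd / (2 * sigma ** 2))                     # k(Y_a, X_b)

    # First and second partial derivatives of k(Y_a, x) in x, at x = X_b
    dK = -(diff / sigma ** 2) * K[:, :, None]                            # (n, m, d)
    d2K = (diff ** 2 / sigma ** 4 - 1 / sigma ** 2) * K[:, :, None]      # (n, m, d)

    G = dK.transpose(0, 2, 1).reshape(n * d, m)             # rows (b, i), cols a
    h = d2K.sum(axis=(0, 2)) / n                            # (m,)
    K_YY = np.exp(-np.sum((Y[:, None] - Y[None, :]) ** 2, axis=-1)
                  / (2 * sigma ** 2))                       # (m, m) Gram matrix

    # Loss is quadratic in alpha: h.alpha + 0.5 alpha^T (G^T G / n + lam K_YY) alpha
    A = G.T @ G / n + lam * K_YY
    alpha = -np.linalg.lstsq(A, h, rcond=None)[0]

    def f(x):
        """Unnormalized log-density (up to A(f) and log q_0)."""
        k = np.exp(-np.sum((x - Y) ** 2, axis=-1) / (2 * sigma ** 2))
        return alpha @ k

    return alpha, Y, f
```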
Analysis: Main result

Well-specified case: there is some
f_0 \in \h
so that
p_0 = p_{f_0}
Some technical assumptions on kernel, density smoothness
\theta
is a parameter depending on problem smoothness:
\frac12
in the worst case,
\frac13
in the best

Choose
\color{#ccb974}{\h_\text{nys}^Y}
basis, size
\mc = \Omega(\nc^\theta \log \nc)
uniformly,
\lambda = \nc^{-\theta}
:
Fit takes
\mathcal{O}(\nc^{1 + 2 \theta} \dc^3 \log^2\!\nc)
time (plugging this
\mc
into the
\bigO(\nc \mc^2 \dc^3)
Nyström cost), instead of
\mathcal{O}(\nc^3 \dc^3)
Fisher distance
J
:
\lVert\cdot\rVert_\h
, KL, Hellinger,
L_r
(
1 \le r \le \infty
):
- Same rate, consistency for hard problems
- Rate saturates slightly sooner for easy problems

Proof ideas

Error can be bounded relative to
\lVert f - f_0 \rVert_\h
Define best estimator in basis
f_\lambda^{\Yc} = \argmin_{f \in \h_{\Yc}} J_\lambda(f)
:
\lVert f_{\lambda,\Xc}^{\Yc} - f_0 \rVert_\h
\le \underbrace{\lVert f_{\lambda,\Xc}^{\Yc} - f_\lambda^{\Yc} \rVert_\h}_\text{estimation}
+ \underbrace{\lVert f_{\lambda}^{\Yc} - f_0 \rVert_\h}_\text{approximation}
New Bernstein inequality for sums of correlated operators
Key part for the approximation term:
\h_{\Yc}
covers “most of”
f_0
Synthetic experiment: Grid

Normal distributions at
d
vertices of a hypercube in
\R^d

Grid results

Grid runtimes
Synthetic experiment: Rings

Three circles, plus small Gaussian noise in radial direction
Other dimensions are uniform Gaussian noise

Rings results

Gradient-free Hamiltonian Monte Carlo

Want to marginalize out hyperparameter choice for a GP classifier
- Can estimate likelihood of hyperparameters with EP
- Can't get gradients this way: no HMC
Kernel Adaptive HMC [Strathmann et al., NIPS 2015]:
- Fit model to estimate scores based on current samples
- Propose HMC trajectories based on score estimate
- Metropolis rejection step accounts for errors in score estimate

Gradient-free HMC

Learning the kernel

Nice surface to optimize with gradient descent

Learning deep kernels

Use kernel
k(x, y) = k_\text{SE}(\varphi(x), \varphi(y))
, with
\varphi
a deep network
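A minimal sketch of such a deep kernel (mine, not the talk's code; a fixed random MLP stands in for the learned network, and the layer sizes are arbitrary):

```python
import numpy as np

# Deep kernel k(x, y) = k_SE(phi(x), phi(y)), with phi a small feature network.
rng = np.random.default_rng(0)
d_in, d_hidden = 2, 16
W1, b1 = rng.normal(size=(d_hidden, d_in)), rng.normal(size=d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d_hidden)), rng.normal(size=d_hidden)

def phi(x):
    """Toy feature network; in the talk, phi is a learned deep network."""
    h = np.tanh(W1 @ x + b1)
    return np.tanh(W2 @ h + b2)

def deep_kernel(x, y, sigma=1.0):
    """Exponentiated quadratic kernel applied to the network features."""
    return np.exp(-np.sum((phi(x) - phi(y)) ** 2) / (2 * sigma ** 2))
```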
(Figure panels: “Truth”; “0 layers”)
Kernel Conditional Exponential Family

Michael Arbel, Arthur Gretton;
AISTATS 2018,
arXiv:1711.05363
Add kernel for conditioning variable too
P(X_1, \dots, X_d) = \prod_i P(X_i \mid X_{\pi(i)})
Recap

- Kernel exponential families are rich density models
- Hard to normalize, so maximum likelihood is intractable
- Score matching works, but it's slow
- Nyström gives speedups with guarantees on performance
- Lite variant even faster, but no guarantees (yet)
- More complex densities: learn kernel and/or conditional structure

Efficient and principled score estimation with Nyström kernel exponential families
Dougal J. Sutherland*, Heiko Strathmann*, Michael Arbel, Arthur Gretton
AISTATS 2018; arXiv:1705.08360