# Efficient and principled score estimation with kernel exponential families

Dougal Sutherland
Gatsby unit, UCL

Based on arXiv:1705.08360 and arXiv:1711.05363.

Approximating high dimensional functions workshop, 19 December 2017

## Our goal

• A multivariate density model that:
• Is flexible enough to model complex data (nonparametric)
• Is computationally efficient to estimate and evaluate
• Works in high(ish) dimensions
• Has statistical guarantees
| Parametric model | Nonparametric model |
|---|---|
| Multivariate Gaussian | Gaussian process |
| Finite mixture model | Dirichlet process mixture model |
| Exponential family | ??? |

## Exponential families

• Can write many densities on $\mathbb{R}^d$ as $p_\theta(x) \propto q_0(x) \exp\big( \langle \theta, T(x) \rangle \big)$
• Gaussian (1-D): $T(x) = (x, x^2)$
• Gamma: $T(x) = (\log x, x)$
• Density only sees data through $T(x)$ (and $q_0$)
• Can we make $T$ richer?
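As a quick sanity check of the natural-parameter form, a 1-D Gaussian written with $T(x) = (x, x^2)$ matches the usual density formula (a minimal numpy sketch; function names are illustrative):

```python
import numpy as np

def gaussian_natural_params(mu, sigma):
    """Natural parameters of N(mu, sigma^2) for T(x) = (x, x^2)."""
    return np.array([mu / sigma**2, -1.0 / (2 * sigma**2)])

def log_partition(theta):
    """A(theta) for the Gaussian family, so the density normalizes."""
    th1, th2 = theta
    return -th1**2 / (4 * th2) + 0.5 * np.log(-np.pi / th2)

def expfam_density(x, theta):
    """p_theta(x) = exp(<theta, T(x)> - A(theta)), base measure 1."""
    T = np.array([x, x**2])
    return np.exp(theta @ T - log_partition(theta))

mu, sigma = 1.0, 2.0
theta = gaussian_natural_params(mu, sigma)
x = 0.3
direct = np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
print(expfam_density(x, theta), direct)  # the two agree
```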

## Infinitely many features with kernels!

• Kernel: a “similarity” function $k(x, y)$
• For any positive definite $k$ on $\mathcal{X}$, there's a reproducing kernel Hilbert space $\mathcal{H}$ of functions on $\mathcal{X}$ with $\langle k(x, \cdot), k(y, \cdot) \rangle_{\mathcal{H}} = k(x, y)$
• $k(x, \cdot)$ is the (possibly infinite-dimensional) feature map of $x$

## Functions in the RKHS

• Functions in $\mathcal{H}$ are (the closure of) linear combinations of kernel sections: $f = \sum_i \alpha_i k(x_i, \cdot)$

## Reproducing property

• Functions in $\mathcal{H}$ are (the closure of) linear combinations of kernel sections: $f = \sum_i \alpha_i k(x_i, \cdot)$
• Reproducing property: $f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}}$
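These identities are easy to check numerically for finite combinations of kernel sections (Gaussian kernel; all values illustrative):

```python
import numpy as np

def k(x, y, sigma=1.0):
    """Gaussian kernel; broadcasts over numpy arrays."""
    return np.exp(-(x - y)**2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=5)        # centres of f = sum_i alpha_i k(X_i, .)
alpha = rng.normal(size=5)
Y = rng.normal(size=3)        # centres of a second function g
beta = rng.normal(size=3)

def f(y):
    return sum(a * k(xi, y) for a, xi in zip(alpha, X))

# Reproducing property: f(y) = <f, k(y, .)>_H = sum_i alpha_i k(X_i, y)
y = 0.7
print(f(y), alpha @ k(X, y))  # equal, by the reproducing property

# Inner products of combinations reduce to kernel evaluations:
# <f, g>_H = alpha^T K(X, Y) beta, and ||f||_H^2 = alpha^T K(X, X) alpha >= 0
K_XY = k(X[:, None], Y[None, :])
K_XX = k(X[:, None], X[None, :])
print("  <f,g> =", alpha @ K_XY @ beta)
print("||f||^2 =", alpha @ K_XX @ alpha)
```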

## Kernel exponential families

• Lift parameters from $\mathbb{R}^s$ to the RKHS $\mathcal{H}$ with kernel $k$: $p_f(x) \propto q_0(x) \exp\big( \langle f, k(x, \cdot) \rangle_{\mathcal{H}} \big)$
• Use $T(x) = k(x, \cdot)$ and $\theta = f$, so: $p_f(x) \propto q_0(x) \exp(f(x))$
• Covers standard exponential families:
• e.g. the 1-D normal distribution has $k(x, y) = x y + x^2 y^2$

## Rich class of densities

• For integrally strictly positive definite kernels $k$ that decay to 0,
$\{ p_f : f \in \mathcal{H} \}$ is dense in the set of densities that decay like $q_0$
• Example kernels: Gaussian, Matérn, inverse multiquadric

## Density estimation

• Given a sample $x_1, \ldots, x_n \sim p_0$, want $f$ so that $p_f \approx p_0$
• Default approach: maximum likelihood, $\hat{f} = \arg\max_f \sum_i \log p_f(x_i)$
• The normalizer $Z(f) = \int q_0(x) \exp(f(x)) \, \mathrm{d}x$ and its gradient are hard to compute
• Ill-posed for characteristic kernels
• Need a way to fit an unnormalized probabilistic model
• Could then estimate $Z(f)$ once, after fitting
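To see why brute-force normalization is a one-time, low-dimensional option at best: in 1-D the normalizer is a simple quadrature, but a grid per axis costs $g^d$ evaluations in $d$ dimensions. A sketch, assuming a standard normal base measure $q_0$:

```python
import numpy as np

def log_q0(x):
    """Base measure: standard normal log-density."""
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

def Z(f, lo=-10.0, hi=10.0, g=4001):
    """Z(f) = int q0(x) exp(f(x)) dx by a fine Riemann sum on [lo, hi]."""
    x = np.linspace(lo, hi, g)
    return np.sum(np.exp(log_q0(x) + f(x))) * (x[1] - x[0])

print(Z(lambda x: 0 * x))      # f = 0 means p = q0, so Z = 1
print(Z(lambda x: 0.5 * x))    # tilting by exp(x/2) changes the normalizer
# A g-point grid per axis costs g**d evaluations in d dimensions:
# quadrature is hopeless beyond a handful of dimensions.
```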

## Unnormalized density / score estimation

• Don't necessarily need to compute $Z(f)$ afterwards
• The unnormalized log-density $f(x) + \log q_0(x)$ (the “energy”) lets us:
• Find modes (global or local)
• Sample (with MCMC)
• The score, $\nabla_x \log p_f(x)$, lets us:
• Run HMC when we can't evaluate gradients of the target
• Construct Monte Carlo control functionals
• But again, need a way to find $f$

## Score matching in KEFs

[Sriperumbudur, Fukumizu, Gretton, Hyvärinen, and Kumar, JMLR 2017]

• Idea: minimize the Fisher divergence $J(p_0 \| p_f) = \frac{1}{2} \int p_0(x) \left\| \nabla_x \log p_0(x) - \nabla_x \log p_f(x) \right\|^2 \mathrm{d}x$
• Under mild assumptions, integration by parts gives $J(p_0 \| p_f) = \int p_0(x) \sum_{d=1}^{D} \left[ \partial_d^2 \log p_f(x) + \tfrac{1}{2} \big( \partial_d \log p_f(x) \big)^2 \right] \mathrm{d}x + \text{const}$
• Estimate with the empirical mean: $\hat{J}(f) = \frac{1}{n} \sum_{i=1}^{n} \sum_{d=1}^{D} \left[ \partial_d^2 \log p_f(x_i) + \tfrac{1}{2} \big( \partial_d \log p_f(x_i) \big)^2 \right]$
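A quick numerical check that minimizing the empirical objective recovers the right model, for a one-parameter Gaussian scale family (an illustrative toy, not the kernel estimator):

```python
import numpy as np

# Fit the scale s of log p_s(x) = -x^2 / (2 s^2) + const to N(0,1) draws by
# minimizing the empirical score-matching objective
#   J_hat(s) = (1/n) sum_i [ (log p_s)''(x_i) + (1/2) (log p_s)'(x_i)^2 ]
rng = np.random.default_rng(0)
x = rng.normal(size=20000)

def J_hat(s):
    dlog = -x / s**2        # first derivative of log p_s at the data
    d2log = -1.0 / s**2     # second derivative (constant in x)
    return np.mean(d2log + 0.5 * dlog**2)

grid = np.linspace(0.5, 2.0, 301)
s_hat = grid[np.argmin([J_hat(s) for s in grid])]
print(s_hat)  # close to the true scale, 1
```

Note that $\hat{J}$ never touches the normalizer of $p_s$, which is the whole point.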

## Score matching in KEFs

• Minimize the regularized loss function: $\hat{f} = \arg\min_{f \in \mathcal{H}} \hat{J}(f) + \lambda \| f \|_{\mathcal{H}}^2$
• Thanks to a representer theorem, we know $\hat{f}$ lies in the span of the kernel's partial derivatives at the data, $\{ \partial_d k(x_i, \cdot) \}$ (plus one fixed function determined by the data)

## Score matching fit

• Can get an analytic solution for the weights in that span: $\beta = -(G + n \lambda I)^{-1} h$
• $G$ (built from kernel derivatives) and $h$ have simple closed forms
• But solving an $(nd + 1) \times (nd + 1)$ linear system takes $O(n^3 d^3)$ time!

## The Nyström approximation

• Representer theorem: know the best $\hat{f}$ is in $\mathcal{H}_X = \operatorname{span}\{ \partial_d k(x_i, \cdot) \}$
• Nyström approximation: find the best $f$ in a subspace $\mathcal{H}_Y \subseteq \mathcal{H}_X$
• If $\mathcal{H}_Y$ has a size-$m$ basis, can find the coefficients in $O(n d m^2 + m^3)$ time
• Worst case: $m = nd$, no improvement
• If $m = O(\sqrt{nd})$, cost is $O(n^2 d^2)$
• If $m = O(\log nd)$, cost is nearly linear in $nd$

## Nyström basis choice

• Representer theorem: know the best $\hat{f}$ is in $\mathcal{H}_X = \operatorname{span}\{ \partial_d k(x_i, \cdot) : i \le n, d \le D \}$
• Reasonable $\mathcal{H}_Y$: pick $m$ points $y_j$ at random from the data, then $\mathcal{H}_Y = \operatorname{span}\{ \partial_d k(y_j, \cdot) : j \le m, d \le D \}$
• “lite”: pick $m$ points at random, then $\mathcal{H}_Y = \operatorname{span}\{ k(y_j, \cdot) : j \le m \}$
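A 1-D sketch in the spirit of the “lite” estimator: a small random basis of kernel sections, fit by minimizing the empirical score-matching loss. Two simplifications versus the paper: a flat base measure $q_0$, and a plain $\ell_2$ penalty on the coefficients rather than the RKHS norm.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=2000)     # data from N(0,1); the true score is -x
z = x[:30]                    # m = 30 random "lite" basis points
sigma, lam = 1.0, 1e-3        # bandwidth and ridge strength (illustrative)

def dk(z, x):   # d/dx of the Gaussian kernel k(z, x)
    return np.exp(-(x - z)**2 / (2 * sigma**2)) * (z - x) / sigma**2

def d2k(z, x):  # d^2/dx^2 of k(z, x)
    return np.exp(-(x - z)**2 / (2 * sigma**2)) * ((x - z)**2 / sigma**4 - 1 / sigma**2)

# With f(x) = sum_j alpha_j k(z_j, x), the empirical score-matching loss is
# the quadratic (1/2) a^T G a + b^T a, minimized by a = -(G + lam I)^{-1} b.
D = dk(z[None, :], x[:, None])                 # n x m: f'(x_i) = (D a)_i
G = D.T @ D / len(x)
b = d2k(z[None, :], x[:, None]).mean(axis=0)
alpha = np.linalg.solve(G + lam * np.eye(len(z)), -b)

score_hat = lambda t: dk(z, t) @ alpha         # estimated d/dx log p
for t in (-1.0, 0.0, 1.0):
    print(t, score_hat(t))                     # roughly -t on the data region
```

Only an $m \times m$ system is solved, with matrices assembled in one pass over the data.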

## Analysis: Main result

• Well-specified case: there is some $f_0 \in \mathcal{H}$ so that $p_0 = p_{f_0}$
• Some technical assumptions on the kernel and on density smoothness
• $\beta$ is a parameter depending on problem smoothness:
• small in the worst case, large in the best
• Choose a basis of size $m = \Omega(\sqrt{n} \log n)$ uniformly at random, with appropriately decaying $\lambda$:
• Fisher distance $J$:
• Consistent; same rate as the full solution
• $L^2$, KL, and Hellinger distances:
• Same rate, consistency for hard problems
• Rate saturates slightly sooner for easy problems

## Proof ideas

• Error can be bounded relative to that of the full solution
• Define the best estimator in the basis $\mathcal{H}_Y$, and compare to it
• New Bernstein inequality for sums of correlated operators
• Key approximation step: $\mathcal{H}_Y$ covers “most of” $\mathcal{H}_X$

## Other main score estimator

• Train a denoising autoencoder $r(\cdot)$ with normal noise of scale $\sigma$
• Normalized reconstruction error estimates the score: $\nabla_x \log p(x) \approx (r(x) - x) / \sigma^2$
• …as long as the autoencoder has infinite capacity
• …and is at a global optimum
• $\sigma$ is like a bandwidth; we have to tune it
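In the linear-Gaussian case the infinite-capacity optimum is known in closed form, so the claim can be checked directly; a least-squares slope plays the role of the trained autoencoder (an illustrative toy):

```python
import numpy as np

# For x ~ N(0,1) and x_noisy = x + eps with eps ~ N(0, s^2), the optimal
# denoiser is linear: r(u) = u / (1 + s^2). Fitting the slope by least
# squares stands in for the infinite-capacity autoencoder at its optimum.
rng = np.random.default_rng(0)
s = 0.5
x = rng.normal(size=100000)
x_noisy = x + s * rng.normal(size=x.size)

w = (x_noisy @ x) / (x_noisy @ x_noisy)    # least-squares slope: r(u) = w u
score_est = lambda u: (w * u - u) / s**2   # (r(u) - u) / s^2

print(w)               # about 1 / (1 + s^2) = 0.8
print(score_est(1.0))  # score of the *smoothed* density N(0, 1 + s^2): -u/(1+s^2)
```

This also shows the bandwidth effect: the estimate is the score of the $\sigma$-smoothed density, so $\sigma$ trades bias against the difficulty of training.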

## Experiments

• Two categories:
• Synthetic experiments: we know the true score, and so can compute $J$
• Sampling experiments: we can evaluate the quality of proposed gradient steps

## Synthetic experiment: Grid

Normal distributions at the vertices of a hypercube in $\mathbb{R}^d$


## Grid results


## Grid runtimes


## Synthetic experiment: Rings

• Three circles, plus small Gaussian noise in the radial direction
• Other dimensions are independent Gaussian noise


## Rings results

## Sampling GP hyperparameters with gradient-free HMC

• Want to marginalize out the hyperparameter choice for a GP classifier
• Can estimate the likelihood of hyperparameters with EP
• Can't get gradients this way: no HMC
• Approximate HMC method (kernel HMC; Strathmann et al., NIPS 2015):
• Fit model to estimate scores based on current samples
• Propose HMC trajectories based on score estimate
• Metropolis rejection step accounts for errors in score estimate
• Our experiment (UCI Glass dataset):
• Fit the model on “ground truth” samples
• 2,000 samples from 40 heavily-thinned random walks
• Track the HMC acceptance rate over many trajectories
• Higher acceptance rate ⇒ better score model
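The correction mechanism is easy to see in a 1-D sketch (not the kernel HMC implementation itself): leapfrog integration uses a deliberately imperfect score, standing in for an estimated one, while the Metropolis step uses the true unnormalized log-density, so the chain still targets the right distribution and a worse score only lowers acceptance.

```python
import numpy as np

rng = np.random.default_rng(0)
log_p = lambda x: -0.5 * x**2    # true unnormalized log-density (target N(0,1))
score = lambda x: -0.9 * x       # imperfect score estimate used for dynamics

def hmc_step(x, eps=0.2, L=10):
    p0 = rng.normal()
    xn, pn = x, p0 + 0.5 * eps * score(x)      # initial half-step in momentum
    for _ in range(L):
        xn = xn + eps * pn                     # full position step
        pn = pn + eps * score(xn)              # full momentum step
    pn = pn - 0.5 * eps * score(xn)            # make the last step a half-step
    # Accept/reject with the TRUE density: corrects the gradient error.
    log_acc = (log_p(xn) - 0.5 * pn**2) - (log_p(x) - 0.5 * p0**2)
    return (xn, 1) if np.log(rng.uniform()) < log_acc else (x, 0)

x, acc, samples = 0.0, 0, []
for t in range(6000):
    x, a = hmc_step(x)
    acc += a
    if t >= 1000:                              # discard burn-in
        samples.append(x)
samples = np.array(samples)
print(acc / 6000, samples.mean(), samples.var())  # high acceptance; moments near (0, 1)
```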

## Hyperparameter surface

• Nice surface to optimize with gradient descent
• Potentially allows for complex (deep?) kernels

## Recap so far

• Kernel exponential families are rich density models
• Hard to normalize, so maximum likelihood is intractable
• Can do score matching, but it's slow
• Nyström gives speedups with guarantees on performance
• Lite version even faster, but no guarantees (yet)

Efficient and principled score estimation with kernel exponential families
Dougal J. Sutherland*, Heiko Strathmann*, Michael Arbel, Arthur Gretton
arXiv:1705.08360

Kernel Conditional Exponential Family (arXiv:1711.05363)
Michael Arbel and Arthur Gretton

• $p(x)$ often has a graphical structure: $p(x) = \prod_i p\big(x_i \mid x_{\mathrm{pa}(i)}\big)$
• Estimate each factor independently
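The factorization is easy to verify numerically on a toy two-variable Gaussian chain (all values illustrative): the sum of factor log-densities equals the joint log-density, so each factor can indeed be modelled separately.

```python
import numpy as np

# Chain x1 -> x2 with x1 ~ N(0,1) and x2 | x1 ~ N(a*x1, s^2).
a, s = 0.7, 0.5

def log_normal(x, mu, var):
    return -0.5 * ((x - mu)**2 / var + np.log(2 * np.pi * var))

def log_joint_factored(x1, x2):
    """log p(x1) + log p(x2 | x1): one term per factor of the graph."""
    return log_normal(x1, 0.0, 1.0) + log_normal(x2, a * x1, s**2)

# Same density via the joint covariance of the bivariate Gaussian:
Sigma = np.array([[1.0, a], [a, a**2 + s**2]])

def log_joint_direct(x1, x2):
    v = np.array([x1, x2])
    return (-0.5 * v @ np.linalg.solve(Sigma, v)
            - 0.5 * np.log((2 * np.pi)**2 * np.linalg.det(Sigma)))

print(log_joint_factored(0.3, -0.2), log_joint_direct(0.3, -0.2))  # equal
```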

## Kernel conditional exponential family

• General definition: use a joint feature map $\phi(x, y)$: $p_f(y \mid x) \propto q_0(y) \exp\big( \langle f, \phi(x, y) \rangle \big)$
• Proposed particular case: let the natural parameter vary with $x$, i.e. $p_f(y \mid x) \propto q_0(y) \exp\big( \langle f(x), k(y, \cdot) \rangle \big)$
• Feature map given by the (operator-valued) kernel of the space of maps $x \mapsto f(x)$
• That space is a vector-valued RKHS

## What loss function to use?

• Obvious approach: minimize the expected conditional Fisher divergence $\mathbb{E}_{x \sim p_0} \big[ J\big( p_0(\cdot \mid x) \,\|\, p_f(\cdot \mid x) \big) \big]$
• Problem: the expression still depends on the unknown $p_0$
• Zero iff the conditional densities are equal for $p_0$-almost all $x$
• Leads to an analytical minimizer of the same form as before

## Failure mode of expected conditional score

Why does it fail?

The loss can be driven arbitrarily low, since the supports are disjoint and we drop the normalizer…

This violates the model's assumptions, but we need to be careful as we approach this situation!

## Unconditional vs conditional model

• Parkinsons: biomedical voice measurements from patients with early-stage Parkinson's disease
• Red Wine: physicochemical measurements on wine samples

| | Parkinsons | Red Wine |
|---|---|---|
| Dimension | 15 | 11 |
| Samples | 5,875 | 1,599 |

## Unconditional vs conditional model


Compare to:

• LSCDE: least-squares conditional density (ratio) estimation, has guarantees
• RNADE: deep learning, no guarantees

Negative log-likelihoods (smaller is better; 5 splits):

| | Parkinsons | Red Wine |
|---|---|---|
| KCEF | 2.86 ± 0.77 | 11.8 ± 0.93 |
| LSCDE | 15.89 ± 1.48 | 14.43 ± 1.5 |

## Recap

• Kernel exponential families are rich density models
• Hard to normalize, so need score matching
• Nyström gives speedups with guarantees on performance
• Conditional version: more complex, higher-dimensional models
