Evaluating and Training
Implicit Generative Models
with Two-Sample Tests
Dougal J. Sutherland
Gatsby unit, University College London
Implicit Generative Models workshop, ICML 2017
Implicit generative models
- Given some samples from a distribution $P$ on $\mathcal{X}$
- Goal: generate more samples from $P$
- Don't have an explicit likelihood model
Evaluating implicit generative models
- Can't evaluate standard test-set likelihood
- Early GAN papers: estimate this with KDE
- KDE doesn't work in high dimension, theoretically or empirically
- Models with high likelihoods can have terrible samples; those with good samples can have awful likelihoods [Theis+ ICLR-16]
Other evaluation methods
- Birthday paradox test [Arora/Zhang 2017]
- Needs a human
- Only measures diversity
- Inception score [Salimans+ NIPS-16]
- Domain-specific
- Only measures label-level diversity
- …
- Look at a bunch of pictures and see if they're pretty or not
- Easy to find bad samples
- Hard to see if modes are missing or probabilities are wrong
- Hard to compare models
Two-sample tests
- Given samples from two unknown distributions $P$ and $Q$
- Question: is $P = Q$?
- Hypothesis testing approach: test $H_0\colon P = Q$ against $H_1\colon P \neq Q$
Applications of two-sample testing
- Does my generative model match the data distribution $P$?
- Do smokers/non-smokers have different cancer rates?
- Do these neurons fire differently when the subject is reading?
- Do these columns from different databases mean the same?
- Independence: is $P_{XY} = P_X P_Y$?
General scheme for two-sample tests
- Choose some notion of distance $D(P, Q)$
- Ideally, $D(P, Q) = 0$ iff $P = Q$
- Estimate the distribution distance from data: $\hat{D}(X, Y)$
- Say $P \neq Q$ when $\hat{D}(X, Y) > c_\alpha$ (a generic permutation-test sketch follows this list)
- Want a test of (at least approximately) level $\alpha$: $\Pr_{H_0}(\text{reject}) \leq \alpha$
- Test power is the probability of a true rejection: $\Pr(\text{reject } H_0)$ when $P \neq Q$
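As a concrete illustration of this recipe, here is a minimal NumPy sketch (not from the slides) of a generic permutation test: `distance_estimate` is any plug-in estimator $\hat{D}$, and the threshold $c_\alpha$ comes from the permutation null.

```python
import numpy as np

def permutation_test(X, Y, distance_estimate, n_perms=500, alpha=0.05, rng=None):
    """Generic two-sample test: reject P = Q when the observed distance
    exceeds the (1 - alpha) quantile of its permutation null."""
    rng = np.random.default_rng() if rng is None else rng
    observed = distance_estimate(X, Y)
    pooled = np.concatenate([X, Y], axis=0)
    n = len(X)
    null_stats = np.empty(n_perms)
    for i in range(n_perms):
        perm = rng.permutation(len(pooled))
        null_stats[i] = distance_estimate(pooled[perm[:n]], pooled[perm[n:]])
    c_alpha = np.quantile(null_stats, 1 - alpha)          # rejection threshold
    p_value = (1 + np.sum(null_stats >= observed)) / (1 + n_perms)
    return observed > c_alpha, p_value

# e.g. a crude mean-difference test on toy 1-D samples
rng = np.random.default_rng(0)
X, Y = rng.normal(0.0, 1, (200, 1)), rng.normal(0.3, 1, (200, 1))
reject, p = permutation_test(X, Y, lambda A, B: abs(A.mean() - B.mean()), rng=rng)
```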
Mean difference
Variance difference
Higher-order differences
Still need higher-order features
Could keep stacking up moments, but they get hard to estimate
Instead: use features $\varphi(x) \in \mathcal{H}$, for $\mathcal{H}$ an RKHS
Refresher: Reproducing Kernel Hilbert Spaces
- Using mean embedding $\mu_P = \mathbb{E}_{X \sim P}[\varphi(X)] \in \mathcal{H}$
- $\varphi$ corresponds to kernel $k(x, y) = \langle \varphi(x), \varphi(y) \rangle_{\mathcal{H}}$
- For any positive semidefinite $k$, a matching $\mathcal{H}$ and $\varphi$ exist
- e.g. the Gaussian kernel $k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$ (a small sketch follows this list)
- Reproducing property: for $f \in \mathcal{H}$, $f(x) = \langle f, \varphi(x) \rangle_{\mathcal{H}}$
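A tiny NumPy sketch of that Gaussian kernel, just to fix notation for the later examples (the bandwidth $\sigma$ here is an arbitrary placeholder):

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian RBF kernel matrix: K[i, j] = exp(-||x_i - y_j||^2 / (2 sigma^2))."""
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Y**2, axis=1)[None, :]
                - 2 * X @ Y.T)
    return np.exp(-np.maximum(sq_dists, 0) / (2 * sigma**2))
```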
Maximum Mean Discrepancy
MMD two-sample test
- Distance: $\mathrm{MMD}(P, Q) = \| \mu_P - \mu_Q \|_{\mathcal{H}}$
- Need to choose a kernel $k$
- For characteristic $k$: $\mathrm{MMD}(P, Q) = 0$ iff $P = Q$
- Estimate the distance from data with the unbiased estimator (sketched after this list): $\widehat{\mathrm{MMD}}^2 = \frac{1}{m(m-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{mn} \sum_{i, j} k(x_i, y_j)$
- Choose a rejection threshold $c_\alpha$
- Use permutation testing to set $c_\alpha$
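A minimal NumPy sketch of this unbiased estimator, reusing the `gaussian_kernel` and `permutation_test` sketches above (illustration only, not the opt-mmd code):

```python
import numpy as np

def mmd2_unbiased(X, Y, kernel):
    """Unbiased quadratic-time estimate of MMD^2 between samples X ~ P, Y ~ Q."""
    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = kernel(X, X), kernel(Y, Y), kernel(X, Y)
    np.fill_diagonal(Kxx, 0)   # drop i = j terms for unbiasedness
    np.fill_diagonal(Kyy, 0)
    return (Kxx.sum() / (m * (m - 1))
            + Kyy.sum() / (n * (n - 1))
            - 2 * Kxy.sum() / (m * n))

# MMD test with a permutation threshold, plugging into the earlier sketch:
# reject, p = permutation_test(X, Y, lambda A, B: mmd2_unbiased(A, B, gaussian_kernel))
```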
Functional form of MMD
$\mathrm{MMD}(P, Q) = \sup_{\|f\|_{\mathcal{H}} \leq 1} \mathbb{E}_{X \sim P}[f(X)] - \mathbb{E}_{Y \sim Q}[f(Y)]$
This form is called an integral probability metric
The maximizing function $f$ is called the witness function (or critic)
Witness function
The kernel matters!
Choosing a kernel
But how do we actually choose the kernel?
We want the most powerful test
Optimizing test power
- Turns out a good proxy for asymptotic power is $\widehat{\mathrm{MMD}}^2 / \hat{\sigma}_{H_1}$: the estimate divided by an estimate of its standard deviation under $H_1$
- Can estimate this in quadratic time
- …in an autodiff-friendly way (a rough sketch follows below)
github.com/dougalsutherland/opt-mmd
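Here is a rough NumPy sketch of that power proxy for equal sample sizes. It uses a simple plug-in approximation to $\hat{\sigma}_{H_1}$ rather than the exact variance estimator from the paper; writing the same computation in TensorFlow or PyTorch is what makes it autodiff-friendly, so the kernel parameters can be tuned by gradient ascent on the ratio.

```python
import numpy as np

def mmd2_power_proxy(X, Y, kernel, eps=1e-8):
    """MMD^2 estimate and the ratio MMD^2 / sigma, for equal sample sizes.
    sigma is a crude plug-in estimate of the standard deviation of the
    MMD^2 estimator under H1 (dominant variance term only)."""
    m = len(X)
    assert len(Y) == m, "this sketch assumes equal sample sizes"
    Kxx, Kyy, Kxy = kernel(X, X), kernel(Y, Y), kernel(X, Y)
    H = Kxx + Kyy - Kxy - Kxy.T            # h((x_i, y_i), (x_j, y_j))
    np.fill_diagonal(H, 0)
    mmd2 = H.sum() / (m * (m - 1))
    row_means = H.sum(axis=1) / (m - 1)    # conditional means E_j[h | i]
    var = 4.0 / m * row_means.var()
    return mmd2, mmd2 / np.sqrt(var + eps)
```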
Generative model criticism
Take a really good GAN on MNIST: [Salimans+ NIPS-16]
Samples are distinguishable
- ARD kernel on pixels: a Gaussian kernel with one learned lengthscale per pixel
p-values almost exactly zero
Investigating indicative points
From testing to generative modeling
- Natural idea: train a generator to minimize power of test
- With a consistent test, a rich enough generator class, and infinite samples: the generator recovers $P$
- Tradeoffs for unrealizable case depend on test
Integral probability metrics (IPMs)
Get different distances $D_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F}} \mathbb{E}_{X \sim P}[f(X)] - \mathbb{E}_{Y \sim Q}[f(Y)]$ for different choices of $\mathcal{F}$:
- $\mathcal{F} = \{ f : \|f\|_{\mathcal{H}} \leq 1 \}$: MMD
- $\mathcal{F} = \{ f : \|f\|_\infty \leq 1 \}$: total variation
- $\mathcal{F} = \{ f : f \text{ is 1-Lipschitz} \}$: Wasserstein
- …
Classifier two sample tests (C2STs)
[Lopez-Paz/Oquab ICLR-17]
- Let $\mathcal{F}$ be a set of classifiers $f : \mathcal{X} \to \{0, 1\}$:
- Estimator: held-out accuracy of a classifier trained to tell samples of $P$ from samples of $Q$ (a sketch follows this list)
- Asymptotic power is a monotonic function of the classifier's accuracy
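A minimal scikit-learn sketch of a C2ST; the choice of logistic regression as the classifier and a binomial null for the held-out accuracy are assumptions here (the paper also uses neural-network classifiers):

```python
import numpy as np
from scipy.stats import binom
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def c2st(X, Y, alpha=0.05, seed=0):
    """Classifier two-sample test: train a classifier to tell X from Y;
    held-out accuracy near 1/2 is consistent with P = Q."""
    Z = np.concatenate([X, Y])
    labels = np.concatenate([np.zeros(len(X)), np.ones(len(Y))])
    Z_tr, Z_te, y_tr, y_te = train_test_split(
        Z, labels, test_size=0.5, stratify=labels, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(Z_tr, y_tr)
    acc = clf.score(Z_te, y_te)
    n_te = len(y_te)
    # one-sided p-value: P(accuracy this high | H0: correct count ~ Binomial(n_te, 1/2))
    p_value = binom.sf(int(round(acc * n_te)) - 1, n_te, 0.5)
    return acc, p_value, p_value < alpha
```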
Generative model based on C2STs
- Train the generator to minimize C2ST power, i.e. the accuracy of the classifier
- Accuracy is hard to optimize directly, so use a logistic surrogate
- Envelope theorem: can take gradients for the generator while holding the optimal classifier fixed
- It's wasteful to retrain the classifier from scratch each step: keep one discriminator and keep updating it
- …and now we have a GAN (a minimal sketch follows)
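For concreteness, a minimal PyTorch sketch of that alternating scheme; `gen` and `disc` are hypothetical modules (`disc` outputs a logit), and the generator uses the standard non-saturating logistic surrogate rather than literally minimizing accuracy:

```python
import torch
import torch.nn.functional as F

def gan_step(gen, disc, g_opt, d_opt, data_batch, noise_dim=32):
    """One discriminator update + one generator update (envelope-theorem style:
    the generator gradient treats the current discriminator as fixed)."""
    noise = torch.randn(len(data_batch), noise_dim)
    fake = gen(noise)
    ones = torch.ones(len(data_batch), 1)
    zeros = torch.zeros(len(data_batch), 1)

    # Discriminator: classify data (label 1) vs generated samples (label 0).
    d_loss = (F.binary_cross_entropy_with_logits(disc(data_batch), ones)
              + F.binary_cross_entropy_with_logits(disc(fake.detach()), zeros))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator: push generated samples toward being classified as data.
    g_loss = F.binary_cross_entropy_with_logits(disc(fake), ones)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```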
Generative Moment Matching Networks
[Li+ ICML-15], [Dziugaite+ UAI-15]
- Minimize $\widehat{\mathrm{MMD}}^2$ between data and generated samples, with a fixed kernel (a rough training sketch follows this list)
- Samples are okay on MNIST
- t-GMMN, using the test-power criterion instead: basically the same
- Hard to choose a good kernel
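A rough PyTorch sketch of one GMMN update, minimizing a biased, fixed-bandwidth Gaussian-kernel MMD^2 between data and generated samples; `gen` is a hypothetical generator module and this is not the authors' implementation:

```python
import torch

def mmd2_biased(X, Y, sigma=1.0):
    """Biased MMD^2 estimate with a fixed Gaussian kernel (differentiable)."""
    Z = torch.cat([X, Y], dim=0)
    K = torch.exp(-torch.cdist(Z, Z) ** 2 / (2 * sigma ** 2))
    m = X.shape[0]
    Kxx, Kyy, Kxy = K[:m, :m], K[m:, m:], K[:m, m:]
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

def gmmn_step(gen, opt, data_batch, noise_dim=32):
    """One generator update: move generated samples toward the data under MMD^2."""
    noise = torch.randn(len(data_batch), noise_dim)
    loss = mmd2_biased(gen(noise), data_batch)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```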
Optimizing the kernel
- Alternate updating the generator and updating the test kernel
- As-is, runs into serious stability problems. Various fixes:
- MMD GAN [Li+ 2017]
- Cramér GAN [Bellemare+ 2017]
- Distributional Adversarial Networks [Li+ 2017]
- dfGMMN [Liu 2017]
- TextGAN [Zhang+ ICML-17]
Evaluating implicit generative models
- One useful way is via two-sample-testing framework
- MMD is a nice two-sample test, when you learn the kernel
- Can help diagnose problems
- More things to try for use on practical image problems
Training implicit generative models
- Can define models based on power of two-sample tests
- Might help with stability of training, etc
dougal@gmail.com
github.com/dougalsutherland/opt-mmd
Thanks!