Implicit Generative Models

with Two-Sample Tests

Dougal J. Sutherland

Gatsby unit, University College London

Implicit Generative Models workshop, ICML 2017

- Given some samples $X_1, \dots, X_n \sim P$, a distribution on a space $\mathcal{X}$
- Goal: generate more samples from $P$
- Don't have an explicit likelihood model

- Can't evaluate standard test-set likelihood
- Early GAN papers: estimate this with KDE
- KDE doesn't work in high dimension $d$, theoretically or empirically
- Models with high likelihoods can have terrible samples; those with good samples can have awful likelihoods [Theis+ ICLR-16]

Max-likelihood objective vs WGAN objective [Danihelka+ 2017]

- Birthday paradox test [Arora/Zhang 2017]
- Needs a human
- Only measures diversity
- Inception score [Salimans+ NIPS-16]
- Domain-specific
- Only measures label-level diversity
- …
- Look at a bunch of pictures and see if they're pretty or not
- Easy to find bad samples
- Hard to see if modes missing, wrong probabilities
- Hard to compare models

- Given samples $X_1, \dots, X_n \sim P$ and $Y_1, \dots, Y_m \sim Q$ from two unknown distributions
- Question: is $P = Q$?
- Hypothesis testing approach: $H_0 : P = Q$ versus $H_1 : P \ne Q$

- Does my generative model match the data distribution $P$?
- Do smokers/non-smokers have different cancer rates?
- Do these neurons fire differently when the subject is reading?
- Do these columns from different databases mean the same?
- Independence: is $P_{XY} = P_X P_Y$?

- Choose some notion of distance $D(P, Q)$
- Ideally, $D(P, Q) = 0$ iff $P = Q$
- Estimate the distribution distance from data: $\hat{D}(X, Y)$
- Say $P \ne Q$ when $\hat{D}(X, Y) > c_\alpha$
- Want a test of (at least approximately) *level* $\alpha$:
  - when $P = Q$, the test rejects with probability at most $\alpha$
- *Power* is the probability of a true rejection: $\Pr\bigl(\hat{D}(X, Y) > c_\alpha\bigr)$ when $P \ne Q$
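To make level and power concrete, here is a toy Monte Carlo estimate of both for a simple difference-of-means test (an illustrative setup, not from the talk; all names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_diff_test(x, y, threshold=1.96):
    # Toy two-sample test: reject P = Q when the standardized difference
    # of sample means exceeds 1.96, the two-sided 5% normal threshold.
    n, m = len(x), len(y)
    se = np.sqrt(x.var(ddof=1) / n + y.var(ddof=1) / m)
    return abs(x.mean() - y.mean()) / se > threshold

def rejection_rate(make_x, make_y, trials=2000):
    # Fraction of independent trials in which the test rejects.
    return np.mean([mean_diff_test(make_x(), make_y()) for _ in range(trials)])

n = 100
# Level: rejection rate when P = Q; should be close to alpha = 0.05.
level = rejection_rate(lambda: rng.normal(0, 1, n), lambda: rng.normal(0, 1, n))
# Power: rejection rate when P != Q; the larger, the better the test.
power = rejection_rate(lambda: rng.normal(0, 1, n), lambda: rng.normal(0.5, 1, n))
print(f"level ~ {level:.3f}, power ~ {power:.3f}")
```

With a mean shift of 0.5 and 100 samples per side, this test rejects most of the time while holding its level near 5%.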

Need higher-order features still

Could keep stacking up moments, but get hard to estimate

Instead: use features $\varphi(x)$, for an RKHS $\mathcal{H}$

- Using mean embedding $\mu_P = \mathbb{E}_{X \sim P}[\varphi(X)]$
- $\mathcal{H}$ corresponds to kernel $k(x, y) = \langle \varphi(x), \varphi(y) \rangle_{\mathcal{H}}$
- For any positive semidefinite $k$, a matching $\mathcal{H}$ and $\varphi$ exist
- e.g. the Gaussian kernel $k(x, y) = \exp\bigl(-\tfrac{1}{2\sigma^2} \|x - y\|^2\bigr)$
- Reproducing property: $\langle f, \varphi(x) \rangle_{\mathcal{H}} = f(x)$, so $\langle f, \mu_P \rangle_{\mathcal{H}} = \mathbb{E}_P[f(X)]$
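One practical consequence of this machinery: the inner product of two empirical mean embeddings is just an average of kernel evaluations, so nothing ever has to be computed in $\mathcal{H}$ explicitly. A minimal numpy sketch with a Gaussian kernel on 1-D data (the function names are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)

def gauss_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-(x - y)^2 / (2 sigma^2)) for 1-D inputs,
    # returned as the full pairwise kernel matrix.
    return np.exp(-(x[:, None] - y[None, :]) ** 2 / (2 * sigma**2))

x = rng.normal(0, 1, 200)
y = rng.normal(1, 1, 200)

# <mu_hat_X, mu_hat_Y>_H = (1/nm) sum_ij k(x_i, y_j): the inner product of
# two (possibly infinite-dimensional) mean embeddings reduces to the mean
# of the cross-kernel matrix.
inner = gauss_kernel(x, y).mean()
print(inner)
```

The same identity with $X = Y$ gives $\|\hat\mu_X\|^2$, which is the building block of the MMD estimator below.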

Example kernel matrix $K_{ij} = k(x_i, x_j)$:

| | $x_1$ | $x_2$ | $x_3$ | $x_4$ | $x_5$ | $x_6$ |
| --- | --- | --- | --- | --- | --- | --- |
| $x_1$ | 1.0 | 0.6 | 0.5 | 0.2 | 0.4 | 0.2 |
| $x_2$ | 0.6 | 1.0 | 0.7 | 0.4 | 0.1 | 0.1 |
| $x_3$ | 0.5 | 0.7 | 1.0 | 0.3 | 0.1 | 0.2 |
| $x_4$ | 0.2 | 0.4 | 0.3 | 1.0 | 0.7 | 0.8 |
| $x_5$ | 0.4 | 0.1 | 0.1 | 0.7 | 1.0 | 0.6 |
| $x_6$ | 0.2 | 0.1 | 0.2 | 0.8 | 0.6 | 1.0 |

- Distance: $\mathrm{MMD}(P, Q) = \| \mu_P - \mu_Q \|_{\mathcal{H}}$
- Need to choose a kernel
- For *characteristic* kernels, $\mathrm{MMD}(P, Q) = 0$ iff $P = Q$
- Estimate the distance from data:
  $$\widehat{\mathrm{MMD}}^2(X, Y) = \frac{1}{n(n-1)} \sum_{i \ne j} k(X_i, X_j) + \frac{1}{m(m-1)} \sum_{i \ne j} k(Y_i, Y_j) - \frac{2}{nm} \sum_{i, j} k(X_i, Y_j)$$
- Choose a rejection threshold $c_\alpha$
- Use permutation testing to set $c_\alpha$
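The estimator and permutation threshold above can be sketched as follows, for 1-D data with a Gaussian kernel (a minimal numpy implementation; names and parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def mmd2_unbiased(x, y, sigma=1.0):
    # Unbiased quadratic-time estimator of MMD^2 with a Gaussian kernel.
    k = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma**2))
    n, m = len(x), len(y)
    kxx, kyy, kxy = k(x, x), k(y, y), k(x, y)
    return ((kxx.sum() - np.trace(kxx)) / (n * (n - 1))
            + (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
            - 2 * kxy.mean())

def permutation_pvalue(x, y, n_perms=200):
    # Under H0: P = Q the pooled samples are exchangeable, so shuffling
    # the pooled data simulates the null distribution of the statistic.
    observed = mmd2_unbiased(x, y)
    pooled = np.concatenate([x, y])
    null = np.array([
        mmd2_unbiased(*np.split(rng.permutation(pooled), [len(x)]))
        for _ in range(n_perms)])
    return (1 + np.sum(null >= observed)) / (1 + n_perms)

x = rng.normal(0, 1, 100)
p_same = permutation_pvalue(x, rng.normal(0, 1, 100))
p_diff = permutation_pvalue(x, rng.normal(1, 1, 100))
print("p (same):", p_same, " p (diff):", p_diff)
```

Rejecting when the p-value is below $\alpha$ is equivalent to choosing $c_\alpha$ as the $(1-\alpha)$-quantile of the permutation null.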

MMD has the form $\sup_{\|f\|_{\mathcal{H}} \le 1} \mathbb{E}_P[f(X)] - \mathbb{E}_Q[f(Y)]$, called an *integral probability metric*

The maximizing function $f$ is called the *witness function* (or *critic*)

We want the *most powerful* test

- Turns out a good proxy for asymptotic power is the ratio $\mathrm{MMD}^2 / \sigma_{H_1}$: the MMD relative to the standard deviation of its estimator under $H_1$
- Can estimate this in quadratic time
- …in an autodiff-friendly way
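As a simplified stand-in for the talk's quadratic-time criterion, here is the analogous signal-to-noise ratio for the *linear-time* MMD statistic, where the mean and standard deviation are immediate; a real implementation would compute the criterion with autodiff and optimize the kernel parameters against it (this toy only evaluates it):

```python
import numpy as np

rng = np.random.default_rng(3)

def power_proxy(x, y, sigma=1.0):
    # Linear-time MMD^2: average of h over disjoint pairs of points.
    # mean(h) / std(h) is the statistic's signal-to-noise ratio, a
    # simplified analogue of the power criterion discussed in the talk.
    k = lambda a, b: np.exp(-(a - b) ** 2 / (2 * sigma**2))
    x1, x2 = x[0::2], x[1::2]
    y1, y2 = y[0::2], y[1::2]
    h = k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1)
    return h.mean() / h.std(ddof=1)

n = 4000
x = rng.normal(0, 1, n)
proxy_same = power_proxy(x, rng.normal(0, 1, n))  # ~0 when P = Q
proxy_diff = power_proxy(x, rng.normal(1, 1, n))  # clearly positive when P != Q
print(proxy_same, proxy_diff)
```

A larger ratio means the distributions are easier to tell apart at a given sample size, which is exactly what kernel selection should maximize.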

Take a really good GAN on MNIST: [Salimans+ NIPS-16]

ARD kernel on pixels: $k(x, y) = \exp\bigl(-\sum_d \gamma_d (x_d - y_d)^2\bigr)$, with a learned weight $\gamma_d$ per pixel

*p*-values almost exactly zero

- Natural idea: train a generator to minimize power of test
- Consistent test, powerful generator class, infinite samples: the minimizer recovers $P$
- Tradeoffs for unrealizable case depend on test

Get different distances for different choices of the function class $\mathcal{F}$:

- $f$ with $\|f\|_{\mathcal{H}} \le 1$: MMD
- $f$ with $\|f\|_\infty \le 1$: total variation
- $f$ that are 1-Lipschitz: Wasserstein
- …

- Let $\mathcal{F}$ be a set of classifier functions $f : \mathcal{X} \to \{0, 1\}$
- Estimator $\hat{D}$: accuracy of the best classifier at distinguishing $X$ from $Y$
- Asymptotic power is a monotonic function of this accuracy
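A minimal version of this classifier two-sample test, using a hand-rolled logistic regression as the classifier (an assumed toy setup, not the talk's):

```python
import numpy as np

rng = np.random.default_rng(4)

def c2st_accuracy(x, y, steps=500, lr=0.1):
    # Label X as 0 and Y as 1, fit 1-D logistic regression by gradient
    # descent, and report held-out accuracy: ~0.5 suggests P = Q,
    # clearly above 0.5 suggests P != Q.
    feats = np.concatenate([x, y])
    labels = np.concatenate([np.zeros(len(x)), np.ones(len(y))])
    idx = rng.permutation(len(feats))
    feats, labels = feats[idx], labels[idx]
    half = len(feats) // 2
    ftr, ltr, fte, lte = feats[:half], labels[:half], feats[half:], labels[half:]
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(w * ftr + b)))
        w -= lr * np.mean((p - ltr) * ftr)
        b -= lr * np.mean(p - ltr)
    return np.mean(((w * fte + b) > 0) == lte)

n = 1000
x = rng.normal(0, 1, n)
acc_same = c2st_accuracy(x, rng.normal(0, 1, n))
acc_diff = c2st_accuracy(x, rng.normal(1, 1, n))
print(acc_same, acc_diff)
```

Held-out accuracy is the key: training accuracy alone would overstate the difference between the samples.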

- Train the generator to minimize C2ST power = accuracy of the classifier
- Accuracy is hard to optimize, so use a logistic surrogate
- Envelope theorem: differentiate through the sup by differentiating at the maximizing classifier
- Waste to retrain classifier each time: keep one discriminator
- …and now we have a GAN

[Li+ ICML-15], [Dziugaite+ UAI-15]

- Minimize $\widehat{\mathrm{MMD}}^2(X, G_\theta(Z))$ directly by gradient descent on $\theta$
- Samples are okay on MNIST
- A GMMN variant using test power instead: basically the same
- Hard to choose a good kernel
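A toy GMMN in this spirit: a two-parameter linear "generator" trained to minimize $\widehat{\mathrm{MMD}}^2$ against samples from $\mathcal{N}(2, 1)$. Finite-difference gradients keep the sketch dependency-free; real GMMNs use autodiff and neural-network generators:

```python
import numpy as np

rng = np.random.default_rng(5)

def mmd2(x, y, sigma=1.0):
    # Biased MMD^2 estimator (fine as an optimization objective).
    k = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma**2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

data = rng.normal(2, 1, 500)     # target samples
z = rng.normal(0, 1, 500)        # fixed noise batch, for simplicity
theta = np.array([0.1, 0.0])     # generator g(z) = theta[0] * z + theta[1]

def loss(theta):
    return mmd2(data, theta[0] * z + theta[1])

eps, lr = 1e-4, 0.2
for _ in range(500):
    # Central finite-difference gradient in each of the two parameters.
    grad = np.array([
        (loss(theta + eps * np.eye(2)[i]) - loss(theta - eps * np.eye(2)[i]))
        / (2 * eps)
        for i in range(2)])
    theta -= lr * grad

samples = theta[0] * z + theta[1]
print("mean ~", samples.mean(), "std ~", samples.std())
```

After training, the generated samples should roughly match the target's mean and spread; the hard part in practice, as noted above, is choosing a kernel that exposes the differences that matter.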

- Alternate updating the generator and updating the test kernel
- As-is, runs into serious stability problems. Various fixes:
  - MMD GAN [Li+ 2017]
    - RBF kernel, WGAN weight clipping [Arjovsky+ ICML-17]
  - Cramér GAN [Bellemare+ 2017]
    - Distance kernel, WGAN-GP gradient penalty [Gulrajani+ 2017]
  - Distributional Adversarial Networks [Li+ 2017]
  - dfGMMN [Liu 2017]
  - TextGAN [Zhang+ ICML-17]

- One useful way to evaluate implicit models is via the two-sample-testing framework
- MMD is a nice two-sample test, when you learn the kernel
- Can help diagnose problems
- More things to try on practical image problems

- Can define models based on power of two-sample tests
- Might help with stability of training, etc