Advances in GANs based on the MMD
Based on
“Demystifying MMD GANs”
“On gradient regularizers for MMD GANs”
Mikołaj Bińkowski
Sheffield Machine Learning seminar,
30 May 2018
Implicit generative models
Given samples from a distribution $\mathbb{P}$ over $\mathcal{X}$
Want a model that can produce new samples from $\mathbb{Q}_\theta \approx \mathbb{P}$
Don't necessarily care about likelihoods, interpretability, …
Model: Generator network
Deep network $G_\theta$ (params $\theta$) mapping from noise $Z$ to images
DCGAN generator [Radford+ ICLR-16]
The noise $Z$ is e.g. uniform on $[-1, 1]^d$
Choose $\theta$ by minimizing…something
Loss function
- Can't evaluate likelihood of samples under model
- Likelihood maybe not the best choice anyway [Theis+ ICLR-16]
- Doesn't hurt likelihoods much to take 99% white noise
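- Instead, we'll minimize some distance $\mathcal{D}(\mathbb{P}, \mathbb{Q}_\theta)$ between the target $\mathbb{P}$ and the model $\mathbb{Q}_\theta$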
Maximum Mean Discrepancy [Gretton+ 2012]
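$$\mathrm{MMD}_k(\mathbb{P}, \mathbb{Q}) = \sup_{f : \|f\|_{\mathcal{H}} \le 1} \mathbb{E}_{X \sim \mathbb{P}}[f(X)] - \mathbb{E}_{Y \sim \mathbb{Q}}[f(Y)]$$
($\mathcal{H}$ is RKHS with kernel $k$)
Can do optimization in closed form:
$$\mathrm{MMD}_k^2(\mathbb{P}, \mathbb{Q}) = \mathbb{E}[k(X, X')] + \mathbb{E}[k(Y, Y')] - 2\, \mathbb{E}[k(X, Y)]$$
Estimate generator based on SGD
with minibatches $X \sim \mathbb{P}^m$, noise samples $Z$, and generated samples $Y = G_\theta(Z)$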
Hard to pick a good kernel for images
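The closed-form minibatch estimate of $\mathrm{MMD}^2$ can be sketched as below. This is only an illustration: the Gaussian kernel, its bandwidth, and the function names `rbf_kernel` and `mmd2_unbiased` are assumptions chosen for the example, not something the slides fix.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 bandwidth^2))."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by round-off before exponentiating.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def mmd2_unbiased(x, y, kernel=rbf_kernel):
    """Unbiased estimate of MMD^2 between samples x (m, d) and y (n, d)."""
    m, n = len(x), len(y)
    k_xx, k_yy, k_xy = kernel(x, x), kernel(y, y), kernel(x, y)
    # Exclude the diagonals: the k(x_i, x_i) terms would bias the estimate.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * k_xy.mean()

rng = np.random.default_rng(0)
# Two samples from the same distribution: estimate should be near zero.
same = mmd2_unbiased(rng.normal(size=(500, 2)), rng.normal(size=(500, 2)))
# Second sample shifted by 2 in every coordinate: estimate clearly positive.
shifted = mmd2_unbiased(rng.normal(size=(500, 2)),
                        rng.normal(loc=2.0, size=(500, 2)))
print(same, shifted)
```

Because the estimator is unbiased, it can come out slightly negative when the two samples share a distribution.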
MMD GANs: Deep kernels [Li+ NIPS-17]
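- Use a class of deep kernels: $k_\psi(x, y) = k_{\mathrm{top}}(\phi_\psi(x), \phi_\psi(y))$
- Choose most-discriminative out of those kernels: $\mathcal{D}_{\mathrm{MMD}}(\mathbb{P}, \mathbb{Q}) = \sup_\psi \mathrm{MMD}_{k_\psi}(\mathbb{P}, \mathbb{Q})$
- Initialize random generator $G_\theta$ and representation $\phi_\psi$
- Repeat:
  - $n_{\mathrm{critic}}$ times:
    - Take SGD step in $\psi$ to maximize the minibatch MMD estimate
  - Take SGD step in $\theta$ to minimize the minibatch MMD estimate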
Wasserstein and WGANs
- WGANs [Arjovsky+ ICML-17],
WGAN-GPs [Gulrajani+ NIPS-17]:
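- Train a neural network for critic $f_\psi$
- Enforce Lipschitz constraint on $f_\psi$ (more on this later)
- Run SGD on minibatches $X$, $Y$ with that critic
- Consider linear-kernel MMD GAN, $k_\psi(x, y) = \phi_\psi(x)\, \phi_\psi(y)$ with scalar $\phi_\psi$:
$$\text{loss} = \left| \mathbb{E}[\phi_\psi(X)] - \mathbb{E}[\phi_\psi(Y)] \right|$$
- WGAN has (with critic $f_\psi = \phi_\psi$):
$$\text{loss} = \mathbb{E}[\phi_\psi(X)] - \mathbb{E}[\phi_\psi(Y)]$$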
- Linear-kernel MMD GAN and WGAN almost the same
- MMD GAN “offloads” some of the critic's work to closed-form optimization in the RKHS
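The near-equivalence of the linear-kernel MMD GAN and WGAN losses can be checked numerically. In this sketch the scalar representation `critic_features` (a `tanh` of a linear map) and the weights `w` are hypothetical stand-ins for a critic network, and the MMD is the biased (V-statistic) version so that the identity holds exactly.

```python
import numpy as np

def critic_features(x, w):
    """Hypothetical scalar representation phi(x) = tanh(x . w)."""
    return np.tanh(x @ w)

def wgan_loss(x, y, w):
    """WGAN-style critic objective: mean phi(X) - mean phi(Y)."""
    return critic_features(x, w).mean() - critic_features(y, w).mean()

def linear_mmd_biased(x, y, w):
    """Biased MMD estimate with the linear kernel k(a, b) = phi(a) * phi(b)."""
    fx, fy = critic_features(x, w), critic_features(y, w)
    k_xx = np.outer(fx, fx).mean()
    k_yy = np.outer(fy, fy).mean()
    k_xy = np.outer(fx, fy).mean()
    # Guard against a tiny negative value from round-off before the sqrt.
    return np.sqrt(max(k_xx + k_yy - 2.0 * k_xy, 0.0))

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, size=(256, 3))
y = rng.normal(loc=0.0, size=(256, 3))
w = np.array([1.0, 0.5, -0.25])

# The two losses agree up to the absolute value.
gap = abs(linear_mmd_biased(x, y, w) - abs(wgan_loss(x, y, w)))
print(gap)
```

The only difference is the absolute value: the kernel mean terms collapse to the squared difference of the two feature means.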
Estimator bias
- Bellemare+ [2017] say that:
- WGANs have biased gradients
- which can lead SGD to wrong minimum even in expectation
- but Cramér GANs have unbiased gradients
- Cramér GAN $\approx$ MMD GAN with particular kernel
- We show:
- Gradients of the loss with a fixed critic are unbiased
- Gradients of the loss with an optimized critic are biased
- Exact same situation for WGAN and MMD GAN
Unbiasedness theorem for fixed critic
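- For almost all feedforward architectures (critic $\phi_\psi$, generator $G_\theta$),
- Works for ReLU, max-pooling, …
- For any distributions $\mathbb{P}$ (data) and $\mathbb{Z}$ (noise) satisfying mild moment conditions,
- For most kernels $k$ used in practice,
- Includes linear kernel, RBF, RQ, distance kernel, …
- For Lebesgue-almost all parameters $\theta$, $\psi$:
- $\mathbb{E}\left[\partial_{\theta,\psi}\, k(\phi_\psi(X), \phi_\psi(G_\theta(Z)))\right] = \partial_{\theta,\psi}\, \mathbb{E}\left[k(\phi_\psi(X), \phi_\psi(G_\theta(Z)))\right]$, so:
- gradients of the unbiased MMD estimate are unbiased (MMD GANs), and gradients of the minibatch estimate of $\mathbb{E}[\phi_\psi(X)] - \mathbb{E}[\phi_\psi(G_\theta(Z))]$ are unbiased (WGANs)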
Proof of unbiasedness theorem
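- By Fubini's theorem, only need to show:
- For fixed input $x$, the network output is differentiable in the parameters for almost all parameter values
- Recall: the network composes affine layers with piecewise smooth activations; for each activation input, as the parameters vary:
- case 1: inside domain of analyticity
- case 2: on boundary
- case 3: crosses the boundary:
- the parameters lie in a union of manifolds of 0 measure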
Gradient bias for “full” loss
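- Recall $\mathcal{D}(\mathbb{P}, \mathbb{Q}) = \sup_\psi \mathcal{D}_\psi(\mathbb{P}, \mathbb{Q})$
- Estimator splits data:
- Pick $\hat\psi$ on train set, estimate $\mathcal{D}_{\hat\psi}$ on test set
- GAN test set: current minibatch
- Showed: gradient of $\hat{\mathcal{D}}_\psi$ is unbiased for any fixed $\psi$
- Now: gradient of $\hat{\mathcal{D}}_{\hat\psi}$ is biased
- Estimators have non-constant bias iff gradients are biased
- Will show $\hat{\mathcal{D}}_{\hat\psi}$ is biased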
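Bias of the data-splitting estimator $\hat{\mathcal{D}}_{\hat\psi}$
- Eval on test set is unbiased: $\mathbb{E}_{\text{test}}[\hat{\mathcal{D}}_{\hat\psi}] = \mathcal{D}_{\hat\psi}(\mathbb{P}, \mathbb{Q})$
- But training introduces bias: $\mathcal{D}_{\hat\psi}(\mathbb{P}, \mathbb{Q}) \le \sup_\psi \mathcal{D}_\psi(\mathbb{P}, \mathbb{Q}) = \mathcal{D}(\mathbb{P}, \mathbb{Q})$
- If $\hat\psi$ is not almost surely optimal, estimator must be biased down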
- Probably not a big deal in practice
- No (direct) bias due to minibatch size
- Can decrease bias by training critic longer
- But informs theory
- Convergence arguments based on SGD on the full loss $\mathcal{D}$ are made more difficult
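The downward bias of a critic picked on one split and evaluated on another can be seen in a small simulation. Everything here is a toy assumption made for illustration: one-dimensional Gaussians, a two-element critic class $f_a(x) = a x$ with $a \in \{-1, +1\}$, and the chosen sample sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
shift, n_train, n_test, n_reps = 0.5, 5, 5, 20000

# With critics f_a(x) = a * x, a in {-1, +1}, the true distance is
# sup_a a * (E[X] - E[Y]) = |shift|.
true_distance = abs(shift)

# Each row is one independent repetition of the train/test experiment.
x_train = rng.normal(loc=shift, size=(n_reps, n_train))
y_train = rng.normal(loc=0.0, size=(n_reps, n_train))
x_test = rng.normal(loc=shift, size=(n_reps, n_test))
y_test = rng.normal(loc=0.0, size=(n_reps, n_test))

# Pick the critic using the training split only.
a_hat = np.where(x_train.mean(axis=1) >= y_train.mean(axis=1), 1.0, -1.0)

# Evaluate the chosen critic on the held-out split.
estimates = a_hat * (x_test.mean(axis=1) - y_test.mean(axis=1))

bias = estimates.mean() - true_distance
print(estimates.mean(), true_distance, bias)
```

For any fixed `a` the test-split evaluation is unbiased. The average falls below the true distance only because the training split sometimes selects the wrong critic, and a larger `n_train` shrinks that gap.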
Non-existence of an unbiased IPM estimator
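- Beautiful argument of Bickel & Lehmann [Ann. Math. Stat. 1969]:
- Let $\mathbb{Q}_0 \ne \mathbb{Q}_1$, and take the mixtures $\mathbb{Q}_\alpha = (1 - \alpha) \mathbb{Q}_0 + \alpha \mathbb{Q}_1$
- Suppose $\hat{\mathcal{D}}$ is unbiased on samples of size $n$ from $\mathbb{Q}_\alpha$; then its expectation is a polynomial in $\alpha$ of degree at most $n$
- But for an IPM, $\mathcal{D}(\mathbb{Q}_{1/2}, \mathbb{Q}_\alpha) = \left| \tfrac12 - \alpha \right| \mathcal{D}(\mathbb{Q}_0, \mathbb{Q}_1)$
isn't a polynomial
- So no unbiased $\hat{\mathcal{D}}$ can exist, though an unbiased estimator of $\mathcal{D}^2$ could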
Smooth critics in MMD GANs
- Toy problem in [Mescheder+ ICML-18]:
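- Point mass target $\mathbb{P} = \delta_0$, model $\mathbb{Q}_\theta = \delta_\theta$
- Representation $\phi_\psi(x) = \psi x$, $\psi \in \mathbb{R}$
- Gaussian kernel $k_\psi(x, y) = \exp\left(-\tfrac12 \psi^2 (x - y)^2\right)$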
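- Taking $\psi \to \infty$ gives $\mathrm{MMD}^2 \to 2$ for any $\theta \ne 0$, but $0$ at $\theta = 0$: a discontinuous loss
- But if we restrict ourselves to $\psi$ where the optimal critic is bounded Lipschitz,
$\mathbb{Q}_\theta \to \mathbb{P}$ in distribution implies the loss goes to $0$,
and the loss is continuous and a.e. differentiable in $\theta$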
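The discontinuity in the toy problem can be reproduced in a few lines. This sketch assumes the setup is a point-mass target $\delta_0$, model $\delta_\theta$, representation $\phi_\psi(x) = \psi x$, and Gaussian kernel $\exp(-\tfrac12 \psi^2 (x - y)^2)$. Capping $|\psi|$ is used here as a crude stand-in for restricting to Lipschitz critics, and is not the constraint the slides go on to propose.

```python
import numpy as np

def toy_mmd2(theta, psi):
    """MMD^2 between delta_0 and delta_theta for the kernel exp(-psi^2 (x-y)^2 / 2)."""
    # k(0, 0) + k(theta, theta) - 2 k(0, theta), with k(x, x) = 1.
    return 2.0 - 2.0 * np.exp(-0.5 * (psi * theta) ** 2)

def sup_over_psi(theta, psi_max):
    """Maximize the loss over a grid of critic parameters in [0, psi_max]."""
    psis = np.linspace(0.0, psi_max, 2001)
    return toy_mmd2(theta, psis).max()

thetas = np.array([0.0, 1e-3, 1e-2, 1e-1, 1.0])

# Nearly unconstrained critic: the loss jumps from 0 to 2 as soon as theta != 0.
unconstrained = np.array([sup_over_psi(t, psi_max=1e6) for t in thetas])

# Critic parameter capped at 1: the loss shrinks smoothly as theta -> 0.
bounded = np.array([sup_over_psi(t, psi_max=1.0) for t in thetas])

print(unconstrained)
print(bounded)
```

With the cap in place the loss behaves like $\theta^2$ near the optimum, so the generator gets a useful gradient there.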
Enforcing Lipschitz constraint
- First attempt: only optimize over critics that are $1$-Lipschitz
- Hard to specify Lipschitz constant of deep nets
- WGAN [Arjovsky+ ICML-17] tried with simple box constraint
- Also original MMD GAN paper [Li+ NIPS-17]
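- WGAN-GP [Gulrajani+ NIPS-17]:
penalize non-Lipschitzness with $\lambda\, \mathbb{E}_{\tilde X}\left(\|\nabla_{\tilde X} f_\psi(\tilde X)\|_2 - 1\right)^2$
(with $\tilde X$ drawn in between the $\mathbb{P}$ and $\mathbb{Q}$ minibatches)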
- Looser constraint than Lipschitz
- Tends to work better in practice
- But barely changes anything on the toy problem!
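A minimal sketch of the WGAN-GP-style penalty follows. The critic is a hypothetical closed-form one, `tanh` of a linear map, so that its input gradient can be written by hand; a real critic would be a deep network differentiated by an autodiff library. The names `critic_input_grad` and `gradient_penalty` are illustrative.

```python
import numpy as np

def critic_input_grad(x, w):
    """Input gradient of the toy critic f(x) = tanh(x . w), one row per sample."""
    return (1.0 - np.tanh(x @ w) ** 2)[:, None] * w[None, :]

def gradient_penalty(x, y, w, rng):
    """Mean of (||grad f(x_tilde)||_2 - 1)^2 at random interpolates of x and y."""
    # One interpolation weight per pair of real and generated samples.
    eps = rng.uniform(size=(len(x), 1))
    x_tilde = eps * x + (1.0 - eps) * y
    grad_norms = np.linalg.norm(critic_input_grad(x_tilde, w), axis=1)
    return np.mean((grad_norms - 1.0) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, size=(64, 3))   # stand-in for a data minibatch
y = rng.normal(loc=0.0, size=(64, 3))   # stand-in for a generated minibatch
w = np.array([0.8, -0.3, 0.5])
penalty = gradient_penalty(x, y, w, rng)
print(penalty)
```

The penalty only looks at gradient norms at the sampled interpolates, which is why it is a looser condition than a global Lipschitz constraint.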
Built-in gradient constraints
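- Remember $\mathrm{MMD}_k(\mathbb{P}, \mathbb{Q}) = \sup_{\|f\|_{\mathcal{H}_k} \le 1} \mathbb{E}_{\mathbb{P}}[f(X)] - \mathbb{E}_{\mathbb{Q}}[f(Y)]$
- New Gradient-Constrained MMD: $\mathrm{GCMMD}_{\mu,k,\lambda}(\mathbb{P}, \mathbb{Q}) = \sup_{\|f\|_{S(\mu),k,\lambda} \le 1} \mathbb{E}_{\mathbb{P}}[f(X)] - \mathbb{E}_{\mathbb{Q}}[f(Y)]$, with $\|f\|_{S(\mu),k,\lambda}^2 = \|f\|_{L^2(\mu)}^2 + \|\nabla f\|_{L^2(\mu)}^2 + \lambda \|f\|_{\mathcal{H}_k}^2$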
- Variance constraint makes it like a Sobolev norm
- Doesn't quite constrain the Lipschitz constant
Estimating Gradient-Constrained MMD
- Say we have samples
- Let
- Then
has ,
has ,
- Dropping kernel matrix gives
- Solving this linear system takes $O(M^3 d^3)$ time for $M$ samples in $d$ dimensions!
Gradient-Constrained MMD on MNIST
- It's a reasonable distance to optimize:
- …but this took days to run
- even with a low-rank incomplete-Cholesky approximation
The Scaled MMD
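- Using a bit of RKHS theory, can write $\|f\|_{S(\mu),k,\lambda}^2 \le \sigma_{\mu,k,\lambda}^{-2} \|f\|_{\mathcal{H}_k}^2$
- Define the Scaled MMD, a lower bound on the Gradient-Constrained MMD: $\mathrm{SMMD}_{\mu,k,\lambda}(\mathbb{P}, \mathbb{Q}) = \sigma_{\mu,k,\lambda}\, \mathrm{MMD}_k(\mathbb{P}, \mathbb{Q})$, where $\sigma_{\mu,k,\lambda} = \left(\lambda + \int k(x, x)\, \mathrm{d}\mu(x) + \sum_{i=1}^d \int \frac{\partial^2 k(y, z)}{\partial y_i\, \partial z_i}\Big|_{y = z = x} \mathrm{d}\mu(x)\right)^{-1/2}$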
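To make the scaling concrete, here is a sketch under stated assumptions. It takes the Scaled MMD to be $\sigma \cdot \mathrm{MMD}_k$ with $\sigma = (\lambda + \mathbb{E}_\mu k(X, X) + \mathbb{E}_\mu \sum_i \partial_{y_i} \partial_{z_i} k(y, z)|_{y=z=X})^{-1/2}$, and specializes to a Gaussian top kernel on a hypothetical linear representation $\phi(x) = W x$. In that case $k(x, x) = 1$ and the derivative term equals $\|W\|_F^2 / \text{bandwidth}^2$. The function name and defaults are illustrative.

```python
import numpy as np

def smmd_scale(w, bandwidth=1.0, lam=1.0):
    """Scale factor sigma for k(x, y) = exp(-||W x - W y||^2 / (2 bandwidth^2)).

    For this kernel k(x, x) = 1, and the summed mixed second derivatives on
    the diagonal equal ||W||_F^2 / bandwidth^2, since the Jacobian of the
    linear representation is W itself (independent of x).
    """
    grad_term = np.sum(w**2) / bandwidth**2
    return (lam + 1.0 + grad_term) ** -0.5

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 6))

sigma_small = smmd_scale(w)
# Scaling up the critic weights sharpens the kernel but shrinks sigma.
sigma_large = smmd_scale(10.0 * w)
print(sigma_small, sigma_large)
```

Inflating the critic's weights therefore no longer inflates the loss for free, because the scale factor shrinks as the gradients grow.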
Scaled MMD vs MMD with Gradient Penalty
- When ,
Rank collapse
- Optimization failure we sometimes see on SMMD, GC-MMD:
- Generator doing reasonably well
- Critic filters become low-rank
- Generator corrects it by breaking everything else
- Generator gets stuck
- Spectral parametrization: $W = \gamma \bar{W} / \|\bar{W}\|_{\mathrm{op}}$; learn $\gamma$ and $\bar{W}$ freely
- Encourages diversity without limiting representation
What if we just did spectral normalization?
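- $W = \bar{W} / \|\bar{W}\|_{\mathrm{op}}$, so that $\|W\|_{\mathrm{op}} = 1$, $\|\phi_\psi\|_{\mathrm{Lip}} \le 1$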
- Works well for original GANs [Miyato+ ICLR-18]
- …but doesn't work at all as only constraint in a WGAN
- Limits representation too much
- In the toy problem, constrains to $|\psi| = 1$
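Spectral normalization of a single weight matrix can be sketched as below, next to a variant with a freely learned scale `gamma` (an assumption about what the rank-collapse slide describes). The sketch uses an exact spectral norm from NumPy; Miyato et al. approximate it with power iteration during training.

```python
import numpy as np

def spectral_normalize(w_bar):
    """Rescale a weight matrix so its operator (spectral) norm is 1."""
    # ord=2 on a 2-D array returns the largest singular value.
    return w_bar / np.linalg.norm(w_bar, ord=2)

def spectral_parametrize(w_bar, gamma):
    """Same direction as w_bar, with the operator norm set to a free scale gamma."""
    return gamma * spectral_normalize(w_bar)

rng = np.random.default_rng(0)
w_bar = rng.normal(size=(4, 6))

w_sn = spectral_normalize(w_bar)
w_sp = spectral_parametrize(w_bar, gamma=2.5)
print(np.linalg.norm(w_sn, ord=2), np.linalg.norm(w_sp, ord=2))
```

With plain normalization every layer is pinned to norm 1. The `gamma` variant keeps the well-conditioned direction but lets another mechanism, such as a scaled loss, decide how large the layer should be.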
Implicit generative model evaluation
- No likelihoods, so…how to compare models?
- Main approach:
look at a bunch of pictures and see if they're pretty or not
- Easy to find (really) bad samples
- Hard to see if modes are missing / have wrong probabilities
- Hard to compare models beyond certain threshold
- Need better, quantitative methods
- Current standard quantitative method: Inception score [Salimans+ NIPS-16]
- Based on ImageNet classifier label predictions
- Classifier should be confident on individual images
- Predicted labels should be diverse across sample
- No notion of target distribution
- Scores completely meaningless on LSUN, Celeb-A, SVHN, …
- Not great on CIFAR-10 either
Fréchet Inception Distance (FID) [Heusel+ NIPS-17]
- Fit normals to Inception hidden layer activations of real samples ($\mathbb{P}$) and model samples ($\mathbb{Q}$)
- Compute Fréchet (Wasserstein-2) distance between fits
- Meaningful on not-ImageNet datasets
- Estimator extremely biased, tiny variance
- ,
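The Fréchet distance between two Gaussian fits can be sketched with NumPy alone. The feature arrays below are random stand-ins for Inception activations, and `frechet_distance` is an illustrative name, not a reference implementation of FID.

```python
import numpy as np

def frechet_distance(feats_p, feats_q):
    """Frechet distance between Gaussians fit to two feature sets of shape (n, d)."""
    mu_p, mu_q = feats_p.mean(axis=0), feats_q.mean(axis=0)
    cov_p = np.cov(feats_p, rowvar=False)
    cov_q = np.cov(feats_q, rowvar=False)
    # Tr((cov_p cov_q)^{1/2}) is the sum of square roots of the eigenvalues
    # of cov_p @ cov_q, which are real and nonnegative in exact arithmetic;
    # drop imaginary parts and clip negatives that come from round-off.
    eigvals = np.linalg.eigvals(cov_p @ cov_q)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_p - mu_q) ** 2)
    return mean_term + np.trace(cov_p) + np.trace(cov_q) - 2.0 * trace_sqrt

rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 4))
shift = np.array([1.0, 2.0, 0.0, 0.0])

# Shifting every feature vector changes only the mean term.
fd_shifted = frechet_distance(feats, feats + shift)
fd_same = frechet_distance(feats, feats)
print(fd_same, fd_shifted)
```

The estimate plugs sample means and covariances into a nonlinear formula, which is where the bias at finite sample sizes comes from.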
New method: Kernel Inception Distance (KID)
- $\mathrm{MMD}^2$ between Inception hidden layer activations
- Use default polynomial kernel: $k(x, y) = \left(\tfrac{1}{d} x^\top y + 1\right)^3$
- Unbiased estimator: more able to compare estimates
- Reasonable estimates with fewer samples
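A sketch of a KID-style computation: the unbiased $\mathrm{MMD}^2$ estimate with the cubic polynomial kernel $(x^\top y / d + 1)^3$. The feature arrays are random stand-ins for Inception activations, and `poly_kernel` and `kid` are illustrative names. Practical implementations usually also average over several subsets of the samples, which is omitted here.

```python
import numpy as np

def poly_kernel(a, b):
    """Cubic polynomial kernel matrix k(x, y) = (x . y / d + 1)^3."""
    d = a.shape[1]
    return (a @ b.T / d + 1.0) ** 3

def kid(feats_p, feats_q):
    """Unbiased MMD^2 estimate between two feature sets of shape (n, d)."""
    m, n = len(feats_p), len(feats_q)
    k_pp = poly_kernel(feats_p, feats_p)
    k_qq = poly_kernel(feats_q, feats_q)
    k_pq = poly_kernel(feats_p, feats_q)
    # Leave out the diagonal terms so that the estimate is unbiased.
    term_pp = (k_pp.sum() - np.trace(k_pp)) / (m * (m - 1))
    term_qq = (k_qq.sum() - np.trace(k_qq)) / (n * (n - 1))
    return term_pp + term_qq - 2.0 * k_pq.mean()

rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 8))
good_model = rng.normal(size=(1000, 8))           # same distribution as `real`
bad_model = rng.normal(loc=1.0, size=(1000, 8))   # shifted distribution

kid_good = kid(real, good_model)
kid_bad = kid(real, bad_model)
print(kid_good, kid_bad)
```

Since the estimator is unbiased, scores computed from different sample sizes remain comparable, and a well-matched model can score slightly below zero.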
Automatic learning rate adaptation with KID
- Models need appropriate learning rate schedule to work well
- Automate with three-sample MMD test [Bounliphone+ ICLR-16]:
CIFAR-10 | Small | Big |
WGAN-GP | .116 ± .002 | .026 ± .001 |
MMD GAN-GP | .032 ± .001 | .027 ± .001 |
LSUN Bedrooms | Small | Big |
WGAN-GP | .370 ± .003 | .039 ± .002 |
MMD GAN-GP | .091 ± .002 | .028 ± .002 |
Training on CelebA
CelebA Samples
KID: 0.006
KID: 0.022
ImageNet Samples
KID: 0.035
KID: 0.044
KID: 0.047
- MMD GANs do some optimization in closed form
- Can handle smaller critic networks
- Bias situation is the same between WGAN and MMD GAN
- Evaluation and learning rate adaptation with KID
- Gradient control is important
- Scaled MMD does it in closed form, seems to help a lot
- Spectral normalization plays nice with SMMD