Generative Adversarial Networks

Dougal J. Sutherland
Gatsby, UCL→TTIC→UBC

(from thispersondoesnotexist.com)

MLCC 2019

(Swipe or arrow keys to move through slides; for a menu to jump; to show more.)

Generative models

Start with a bunch of examples:
Want a model for the data:
Might want to do different things with the model:
- Find most representative data points / modes
- Find outliers, anomalies, …
- Discover underlying structure of the data
- Impute missing values
- Use as prior (semi-supervised, machine translation, …)
- Produce “more samples”
- …

Why produce samples?

Generative models: a traditional way

Maximum likelihood:
Equivalent:

Traditional models for images

1987-style generative model of faces (Eigenface via Alex Egg)
Can do fancier versions, of course…
Usually based on Gaussian noise loss

A hard case for traditional approaches

One use case of generative models is inpainting [Harry Yang]:
loss / Gaussians will pick the mean of possibilities

Next-frame video prediction

[Lotter+ 2016]

Trick a discriminator [Goodfellow+ NeurIPS-14]

Generator ()

Discriminator

Target ()

Is this real?

No way!

:( I'll try harder…

⋮

Is this real?

Umm…

Aside: deep learning in one slide

MLCC so far: models ,
is an activation function:
Classification usually uses log loss (cross-entropy):
Optimize with gradient descent

Generator networks

How to specify ?
[Radford+ ICLR-16]
,

GANs in equations

Tricking the discriminator:
Using the generator network for :
Can do alternating gradient descent!

Original paper's results [Goodfellow+ NeurIPS-14]

DCGAN results [Radford+ ICLR-16]

Training instability

Running code from [Salimans+ NeurIPS-16]:

Run 1, epoch 1

Run 1, epoch 2

Run 1, epoch 3

Run 1, epoch 4

Run 1, epoch 5

Run 1, epoch 6

Run 1, epoch 11

Run 1, epoch 501

Run 1, epoch 900

Run 2, epoch 1

Run 2, epoch 2

Run 2, epoch 3

Run 2, epoch 4

Run 2, epoch 5

One view: distances between distributions

What happens when is at its optimum?
If distributions have densities,
If stays optimal throughout, tries to minimize which is

Jensen-Shannon divergence

JS with disjoint support [Arjovsky/Bottou ICLR-17]

If and have (almost) disjoint support so

Discriminator point of view

Generator ()

Discriminator

Target ()

Is this real?

No way!

:( I don't know how to do any better…

How likely is disjoint support?

At initialization, pretty reasonable:
:
:
Remember we might have
For usual , is supported on a countable union of
manifolds with dim
“Natural image manifold” usually considered low-dim
No chance that they'd align at init, so

A heuristic partial workaround

Original GANs almost never use the minimax game
If is near-perfect, near instead of
When is near-perfect, makes it unstable instead of stuck

Solution 1: the Wasserstein distance

is a -Lipschitz critic function

Turns out is continuous: if , then

Wasserstein GANs

Idea: turn discriminator into a critic
Need to enforce
Easy ways to do this are way too stringent
Instead, control on average, near the data
Specifically: ,

Solution 2: add noise

Make the problem harder so there's no perfect discriminator
Use , for some independent, full-dim noise
But…how much noise to add? Also need more samples.
If and we take , get
Same kind of gradient penalty!
Can also simplify to e.g.

Solution 3: Spectral normalization [Miyato+ ICLR-18]

Regular deep nets:
Spectral normalization:
is the spectral norm
Guarantees
Faster to evaluate than gradient penalties
Not as well understood yet

New samples [Mescheder+ ICML-18]

How to evaluate?

FID [Heusel+ NeurIPS-17] and KID [Bińkowski+ ICLR-18]

Consider distance between distributions of image features
Features from a pretrained ImageNet classifier
FID:
- Estimator very biased, small variance
KID: use Maximum Mean Discrepancy instead
- Similar distance with unbiased, ~normal estimator!

Comparing approaches [Kurach+ ICML-19]

Maximum Mean Discrepancy

is smoothness induced by kernel

Optimal analytically:

Estimating MMD


1.0	0.2	0.6
0.2	1.0	0.5
0.6	0.5	1.0


1.0	0.8	0.7
0.8	1.0	0.6
0.7	0.6	1.0


0.3	0.1	0.2
0.2	0.3	0.3
0.2	0.1	0.4

MMD models [Li+ ICML-15, Dziugaite+ UAI-15]

No need for a discriminator – just minimize !
Continuous loss

Generator ()

Critic

Target ()

How are these?

Not great!

:( I'll try harder…

⋮

MMD models [Li+ ICML-15, Dziugaite+ UAI-15]

MNIST, mix of Gaussian kernels

Celeb-A, mix of rational quadratic + linear kernels

MMD loss with a smarter kernel

from pretrained Inception net
simple: exponentiated quadratic or polynomial

We just got adversarial examples!

[anishathalye/obfuscated-gradients]

Optimized MMD: MMD GANs [Li+ NeurIPS-17]

Don't just use one kernel, use a class parameterized by :
New distance based on all these kernels:
Turns out that isn't continuous: have but
Scaled MMD GANs [Arbel+ NeurIPS-18] correct with a gradient penalty to make it continuous