VAEs and GANs

Three of the most popular types of generative models are FVBNs (Fully Visible Bayes Nets), VAEs (Variational Auto-Encoders), and GANs (Generative Adversarial Networks). This blog talks about how VAEs and GANs are set up, and briefly discusses their similarity.

One of the important tasks in AI and machine learning is inference, where we need to figure out how the distribution of a variable $z$ (model class) changes under the influence of another variable, $x$ (observation). Such inference is based on Bayes’ Rule:

$$p(z|x) = \frac{p(x|z)\,p(z)}{p(x)} = \frac{p(x|z)\,p(z)}{\int p(x|z)\,p(z)\,dz}$$

If we treat $p(x|z)$ as the likelihood of the data given the class, then the Bayes formula shows how $p(z|x)$, the “a posteriori” distribution, changes as new observation data accumulate. In other words, the Bayes formula shows how inference is passed from the data variable $x$ to the latent class variable $z$. If the density is tractable, we can integrate it directly, or, “pass a message from $x$ to $z$”. Fully Visible Bayes Nets work in this way.

In reality, the integration term on the right-hand side is hard to compute, so we want to find methods to reduce its complexity.
One way to make the probability tractable is approximation. Let $q_\phi(z|x)$, a probability distribution parameterized by $\phi$, approximate $p(z|x)$. By minimizing their difference, we can use an easier-to-calculate PDF in place of an intractable one. Two metrics for measuring the difference between probability distributions are the Kullback-Leibler divergence and the Jensen-Shannon divergence. Minimizing the former leads to the VAE, and the latter leads to the GAN.
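
For reference, the two divergences between distributions $q$ and $p$ are defined as

$$\mathrm{KL}(q\,\|\,p) = \mathbb{E}_{z \sim q}\left[\log \frac{q(z)}{p(z)}\right], \qquad \mathrm{JS}(q\,\|\,p) = \frac{1}{2}\,\mathrm{KL}\!\left(q\,\Big\|\,\frac{q+p}{2}\right) + \frac{1}{2}\,\mathrm{KL}\!\left(p\,\Big\|\,\frac{q+p}{2}\right).$$

Note that KL is asymmetric while JS is symmetric.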

Let’s first look at how to minimize $\mathrm{KL}\big(q_\phi(z|x)\,\|\,p(z|x)\big)$. It is not easily computable, since $p(z|x)$ is already intractable. Notice that we have the following equivalence:

$$\log p(x) - \mathrm{KL}\big(q_\phi(z|x)\,\|\,p(z|x)\big) = \mathbb{E}_{z \sim q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - \mathrm{KL}\big(q_\phi(z|x)\,\|\,p(z)\big) = \mathcal{L}(\theta, \phi; x)$$

The right-hand side of the above equation is called the Evidence Lower BOund (ELBO). Since $\log p(x)$ is fixed for a given observation $x$, maximizing the ELBO minimizes $\mathrm{KL}\big(q_\phi(z|x)\,\|\,p(z|x)\big)$.
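
To see where this identity comes from, expand the KL divergence and substitute $p(z|x) = p(x,z)/p(x)$:

$$\mathrm{KL}\big(q_\phi(z|x)\,\|\,p(z|x)\big) = \mathbb{E}_{q_\phi}\big[\log q_\phi(z|x) - \log p(x,z)\big] + \log p(x),$$

then write $p(x,z) = p_\theta(x|z)\,p(z)$, so that $\log p(x) - \mathrm{KL}\big(q_\phi(z|x)\,\|\,p(z|x)\big) = \mathbb{E}_{q_\phi}\big[\log p_\theta(x|z)\big] - \mathrm{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)$.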

In order to optimize the ELBO, we need the numerical values of $\frac{\partial \mathcal L}{\partial \theta}$ and $\frac{\partial \mathcal L}{\partial \phi}$. With tools like TensorFlow, Theano, and Autograd, we can hope to compute these gradients automatically, as long as the ELBO is composed of continuous, differentiable functions. However, we need to make sure such numerical optimization is theoretically sound.

Problems still exist. First, only under certain conditions (e.g., when $q_\phi(z|x)$ and the prior $p(z)$ both belong to the exponential family) does the KL term in the ELBO have a closed analytical form.
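
For instance, in the common case where $q_\phi(z|x) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ and the prior is $p(z) = \mathcal{N}(0, I)$, the KL term has the well-known closed form

$$\mathrm{KL}\big(q_\phi(z|x)\,\|\,p(z)\big) = \frac{1}{2} \sum_{j} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right).$$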

Second, the remaining term, $\mathbb{E}_{z \sim q_\phi(z|x)}\big[\log p_\theta(x|z)\big]$, is an integral over a probability distribution: $\int q_\phi(z|x) \log p_\theta(x|z)\, dz$, where $z$ obeys the distribution $q_\phi(z|x)$. To numerically calculate the integral, we need to sample $z$ from $q_\phi(z|x)$; however, the naive Monte Carlo gradient estimator with respect to $\phi$ exhibits very high variance, and the sampling step itself is not differentiable. Deadlock! How do we acquire $z$ then?

(Kingma and Welling, 2014)1 puts forward the “vanilla” variational autoencoder, which sets up an SGVB (Stochastic Gradient Variational Bayes) estimator to acquire $z$ via $\tilde{z} = g_\phi(\epsilon, x)$, where $\epsilon \sim p(\epsilon)$ is an auxiliary noise variable and $g_\phi$ is a deterministic, differentiable function.

This is a reparameterization (approximation) from a probability distribution, $q_\phi(z|x)$, to a continuous, differentiable function, $g_\phi(\epsilon, x)$, which makes the ELBO optimizable. In this model, $q_\phi(z|x)$ is similar to an encoder that “encodes” data into the latent variable, and $p_\theta(x|z)$ works like a decoder that “decodes” the latent information. Its name, Variational Autoencoder, reflects this feature.

How should we select the function $g(\cdot)$ and the latent-variable prior, though? The VAE1 paper suggested a table of corresponding reparameterization functions $g(\cdot)$ for various assumed distributions $q_\phi(z|x)$.
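
To make this concrete, here is a minimal sketch of the most common entry in that table: a diagonal Gaussian $q_\phi(z|x)$ with $g_\phi(\epsilon, x) = \mu + \sigma \odot \epsilon$ and $\epsilon \sim \mathcal{N}(0, I)$. The PyTorch framing is my own choice, not the paper's.

```python
import torch

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    The randomness lives entirely in eps, so gradients can flow
    through mu and log_var back to the encoder parameters phi.
    """
    std = torch.exp(0.5 * log_var)   # sigma = exp(log_var / 2)
    eps = torch.randn_like(std)      # auxiliary noise, independent of phi
    return mu + std * eps            # g_phi(eps, x)
```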

There have been other efforts to make the ELBO optimizable. SVI2 uses Markov Chain Monte Carlo sampling from $q_\phi(z|x)$ to approximate the required values. ADVI3 first transforms the model into one with unconstrained real-valued latent variables, and then performs elliptical standardization, after which the ELBO is differentiable.

The above is the story of the variational autoencoder. Training a VAE model involves an encoder and a decoder, both implemented as neural networks. Let $q_\phi(z|x)$ and $p_\theta(x|z)$ denote the probabilities output by the encoder and decoder networks; training the VAE then minimizes $-\mathcal{L}(\theta, \phi; x)$, the negative ELBO. However, it seems like there is no specific “purpose” to $q_\phi$ and $p_\theta$: what does it mean to map the data $x$ to the latent space $z$?
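
Putting the pieces together, here is a sketch of the resulting training objective, assuming a Gaussian encoder, a Bernoulli decoder, and the `reparameterize` helper above (these modeling choices are illustrative, not prescribed by the paper):

```python
import torch
import torch.nn.functional as F

def vae_loss(decoder, mu, log_var, x):
    """Negative ELBO: reconstruction loss plus KL(q_phi(z|x) || N(0, I))."""
    z = reparameterize(mu, log_var)        # sample z with the trick above
    x_logits = decoder(z)                  # decoder outputs Bernoulli logits
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    kl = 0.5 * torch.sum(mu**2 + log_var.exp() - log_var - 1)
    return recon + kl                      # minimizing this maximizes the ELBO
```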

One way to think about their purpose is to let one network be a discriminator function $D$, which tries to push $D(x)$ toward 1 and $D(G(z))$ toward 0, where $x$ comes from real data and $G(z)$ comes from the counterfeit. Let the other network be a counterfeiter $G$, which strives to make it as hard as possible for the discriminator to correctly tell the difference. In addition, modify the training goal, so that instead of the ELBO, the optimization minimizes the “cross-entropy loss training a standard binary classifier with a sigmoid output”:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]$$

If we rename the decoder $p_\theta(x|z)$ to the generator $G$, and the approximating network $q_\phi$ to the discriminator $D$, the model in the GAN paper4 is established.

This is not a small modification to the model. First, the GAN model minimizes the JS divergence instead of the KL. The f-GAN5 paper generalized the GAN model to the whole family of f-divergences ($D_f(P\,\|\,Q) = \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx$, where $f$ is a convex generator function). Second, the decoder / discriminator inputs come from different places. The VAE adds a noise $p(\epsilon)$ to an approximated encoder $q_\phi(z|x)$, and feeds the decoder a reparameterized sample $\tilde{z}$. The GAN directly samples from the generator, labeled as counterfeit data, takes real samples labeled as ground truth, and feeds both into the discriminator.
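
To make the contrast concrete, here is a minimal sketch of one adversarial training step in the same PyTorch style; the networks, optimizers, and names (`G`, `D`, `opt_g`, `opt_d`) are assumed for illustration, and `D` is assumed to end in a sigmoid:

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, x_real, opt_g, opt_d, z_dim):
    """One alternating update of the discriminator and the generator."""
    n = x_real.size(0)
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

    # Discriminator: push D(x) toward 1 (ground truth) and D(G(z)) toward 0.
    x_fake = G(torch.randn(n, z_dim)).detach()   # detach: no gradient into G
    d_loss = F.binary_cross_entropy(D(x_real), ones) + \
             F.binary_cross_entropy(D(x_fake), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: fool D (non-saturating variant of the minimax objective).
    g_loss = F.binary_cross_entropy(D(G(torch.randn(n, z_dim))), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```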

An important idea that the GAN brings in is adversarial training. Since the GAN paper came out, a number of works have incorporated the adversarial training framework, including:

  • DCGAN6 applies the GAN framework to deep convolutional nets.
  • WGAN7 minimizes the Wasserstein distance, which induces a weaker topology than KL or JS.
  • Many others.

References: