
Neural Variational Inference: Variational Autoencoders and Helmholtz machines


So far there has been little that is "neural" in our VI methods. Now it's time to fix that, as we're going to consider Variational Autoencoders (VAE), a paper by D. Kingma and M. Welling that made a lot of buzz in the ML community. It has two main contributions: a new approach (AEVB) to large-scale inference in non-conjugate models with continuous latent variables, and a probabilistic model of autoencoders as an example of this approach. We then discuss connections to Helmholtz machines, a predecessor of VAEs.

Auto-Encoding Variational Bayes

As noted in the introduction of the post, this approach, called Auto-Encoding Variational Bayes (AEVB), works only for some models with continuous latent variables. Recall from our discussion of Blackbox VI and Stochastic VI that we're interested in maximizing the ELBO L(Θ, Λ):

$$\mathcal{L}(\Theta, \Lambda) = \mathbb{E}_{q(z \mid x, \Lambda)} \log \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)}$$

It's not a problem to compute an estimate of the gradient of the ELBO w.r.t. the model parameters Θ, but estimating the gradient w.r.t. the approximation parameters Λ is tricky, as these parameters influence the distribution the expectation is taken over, and, as we know from the post on Blackbox VI, the naive gradient estimator based on the score function exhibits high variance.

It turns out that for some distributions we can make a change of variables: z ∼ q(z | x, Λ) can be represented as a (differentiable) transformation g(ε; Λ, x) of some auxiliary random variable ε whose distribution does not depend on Λ. A well-known example is the Gaussian distribution: if z ∼ N(μ, Σ), then z can be represented as z = μ + Σ^{1/2} ε for ε ∼ N(0, I). This transformation is called the reparametrization trick. After the reparametrization the ELBO becomes

$$\mathcal{L}(\Theta, \Lambda) = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I)} \log \frac{p(x, g(\varepsilon; \Lambda, x) \mid \Theta)}{q(g(\varepsilon; \Lambda, x) \mid \Lambda, x)} \approx \frac{1}{L} \sum_{l=1}^{L} \log \frac{p(x, g(\varepsilon^{(l)}; \Lambda, x) \mid \Theta)}{q(g(\varepsilon^{(l)}; \Lambda, x) \mid \Lambda, x)}, \quad \text{where } \varepsilon^{(l)} \sim \mathcal{N}(0, I)$$

This objective is much better, as we no longer need to differentiate w.r.t. the expectation's distribution, essentially putting the variational parameters Λ in the same regime as the model parameters Θ. It's now sufficient to take gradients of the ELBO estimate and run any optimization algorithm like Adam.
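To make this concrete, here's a minimal sketch of the one-sample (L = 1) reparametrized estimate for a toy model with scalar x and z (z ∼ N(0, 1), x | z ∼ N(z, 1)); the toy model and all names are illustrative, and PyTorch is used only so that gradients w.r.t. Λ flow through g automatically.

```python
# A minimal sketch of the reparametrized one-sample ELBO estimate (L = 1) for a
# toy scalar model: z ~ N(0, 1), x | z ~ N(z, 1). All names are illustrative.
import math
import torch

x = torch.tensor(1.5)                          # a single observation
mu = torch.tensor(0.0, requires_grad=True)     # variational parameters (Lambda)
log_sigma = torch.tensor(0.0, requires_grad=True)

def log_joint(x, z):
    # log p(x, z | Theta) for the toy model above
    log_prior = -0.5 * z ** 2 - 0.5 * math.log(2 * math.pi)
    log_lik = -0.5 * (x - z) ** 2 - 0.5 * math.log(2 * math.pi)
    return log_prior + log_lik

eps = torch.randn(())                          # eps ~ N(0, I), independent of Lambda
z = mu + torch.exp(log_sigma) * eps            # z = g(eps; Lambda, x)
log_q = (-0.5 * ((z - mu) / torch.exp(log_sigma)) ** 2
         - log_sigma - 0.5 * math.log(2 * math.pi))

elbo_estimate = log_joint(x, z) - log_q        # one-sample estimate of the ELBO
elbo_estimate.backward()                       # gradients w.r.t. mu and log_sigma
```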

Oh, and if you wonder what the "Auto-Encoding" in Auto-Encoding Variational Bayes means, there's an interesting interpretation of the ELBO in terms of autoencoding:

$$\mathcal{L}(\Theta, \Lambda) = \mathbb{E}_{q(z \mid x, \Lambda)} \log \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)} = \mathbb{E}_{q(z \mid x, \Lambda)} \log \frac{p(x \mid z, \Theta)\, p(z \mid \Theta)}{q(z \mid x, \Lambda)} = \mathbb{E}_{q(z \mid x, \Lambda)} \log p(x \mid z, \Theta) - D_{\mathrm{KL}}\big(q(z \mid x, \Lambda) \,\|\, p(z \mid \Theta)\big)$$

Here the first term can be treated as the expected reconstruction loss (of x from the code z), while the second one is just a regularization term.

Variational Autoencoder

One particular application of the AEVB framework comes from using neural networks as the model p(x | z, Θ) and the approximation q(z | x, Λ). The model has no special requirements, and x can be discrete or continuous (or mixed). z, however, has to be continuous. Moreover, we need to be able to apply the reparametrization trick. Therefore, in many practical applications q(z | x, Λ) is set to be a Gaussian distribution q(z | x, Λ) = N(z | μ(x; Λ), Σ(x; Λ)), where μ and Σ are outputs of a neural network taking x as input, and Λ denotes the set of the neural network's weights, the parameters you optimize the ELBO with respect to (along with Θ). In order to make the reparametrization trick practical, you'd like to be able to compute Σ^{1/2} quickly. You don't want to actually compute this quantity from Σ, as it'd be too computationally expensive. Instead, you might want to predict Σ^{1/2} with the neural network in the first place, or consider only diagonal covariance matrices (as is done in the paper).
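For concreteness, here's a sketch of such a diagonal-Gaussian recognition network, assuming x is a flattened 784-dimensional binary image and a 20-dimensional latent code; the layer sizes and names are illustrative, not prescribed by the paper.

```python
# A sketch of a diagonal-Gaussian encoder q(z | x, Lambda); sizes are illustrative.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh())
        self.mu = nn.Linear(h_dim, z_dim)        # mean of q(z | x, Lambda)
        self.log_var = nn.Linear(h_dim, z_dim)   # log of the diagonal of Sigma

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.log_var(h)

    def sample(self, x):
        # Reparametrization: z = mu + Sigma^{1/2} * eps, with diagonal Sigma
        mu, log_var = self(x)
        eps = torch.randn_like(mu)
        return mu + torch.exp(0.5 * log_var) * eps
```

Predicting log σ² directly makes Σ^{1/2} = exp(½ log σ²) trivial to compute, which is exactly why the diagonal parametrization is convenient.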

In the case of a Gaussian approximation q(z | x, Λ) and a Gaussian prior p(z | Θ) we can compute the KL divergence D_KL(q(z | x, Λ) || p(z | Θ)) analytically, see the formula at stats.stackexchange. This reduces the variance of the gradient estimator, though one can still train a VAE by estimating the KL divergence using Monte Carlo, just like the other part of the ELBO.
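In the common special case of a diagonal Gaussian q(z | x, Λ) = N(μ, diag(σ²)) and a standard normal prior p(z) = N(0, I), the closed form reduces to D_KL = −½ Σ_j (1 + log σ_j² − μ_j² − σ_j²). A minimal sketch, assuming μ and log σ² come from an encoder like the one above:

```python
import torch

def gaussian_kl(mu, log_var):
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over the latent dimensions
    return -0.5 * torch.sum(1 + log_var - mu ** 2 - torch.exp(log_var), dim=-1)
```

The per-datapoint objective is then the Monte Carlo estimate of the reconstruction term minus this KL.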

We optimize both the model and the approximation by gradient ascent. This joint optimization pushes the approximation towards the model, and the model towards the approximation. This leads not only to efficient inference using the approximation; the model is also encouraged to learn latent representations z such that the true posterior p(z | x, Θ) is approximately factorial.

This model has generated a lot of buzz because it can be used as a generative model; essentially, a VAE is an autoencoder with a natural sampling procedure. Suppose you've trained the model and now want to generate new samples similar to those in the training set. To do so, you first sample z from the prior p(z), and then generate x using the model p(x | z, Θ). Both operations are easy: the first one is sampling from some standard distribution (a Gaussian, for example), and the second one is just one feed-forward pass followed by sampling from another standard distribution (a Bernoulli, for example, in case x is a binary image).
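As a sketch, assuming a trained decoder network mapping z to per-pixel Bernoulli probabilities (a hypothetical counterpart of the encoder above), the whole sampling procedure is just a few lines:

```python
import torch

def generate(decoder, z_dim=20, n_samples=16):
    z = torch.randn(n_samples, z_dim)    # z ~ p(z) = N(0, I)
    probs = decoder(z)                   # one feed-forward pass: Bernoulli means of p(x | z, Theta)
    return torch.bernoulli(probs)        # sample binary images x
```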

If you want to read more on Variational Auto-Encoders, I refer you to a great tutorial by Carl Doersch.

Helmholtz machines

In the end, I'd like to add some historical perspective. The idea of two networks, one "encoding" an observation x into some latent representation (code) z, and another "decoding" it back, is definitely not new. In fact, the whole idea is a special case of the Helmholtz machines introduced by Geoffrey Hinton 20 years ago.

A Helmholtz machine can be thought of as a neural network with stochastic hidden layers. Namely, we now have M stochastic hidden layers (latent variables) h_1, …, h_M (with deterministic h_0 = x), where the layer h_{k−1} is stochastically produced by the layer h_k, that is, it is sampled from some distribution p(h_{k−1} | h_k), which, as you might have guessed already, is parametrized in the same way as in usual VAEs. Actually, a VAE is a special case of a Helmholtz machine with just one stochastic layer (but each stochastic layer can contain a neural network of arbitrarily many deterministic layers inside of it).

This image shows an instance of a Helmholtz machine with 2 stochastic layers (blue cloudy nodes), each stochastic layer having 2 deterministic hidden layers (white rectangles).

The joint model distribution is

$$p(x, h_1, \ldots, h_M \mid \Theta) = p(h_M \mid \Theta) \prod_{m=0}^{M-1} p(h_m \mid h_{m+1}, \Theta), \quad \text{where } h_0 = x$$

And the approximate posterior factorizes similarly, but in the inverse order:

$$q(h_1, \ldots, h_M \mid x, \Lambda) = \prod_{m=1}^{M} q(h_m \mid h_{m-1}, \Lambda), \quad \text{where } h_0 = x$$

The distribution p(x, h_1, …, h_{M−1} | h_M) is usually called the generative network (or model), as it allows one to generate samples from the latent representation(s). The approximate posterior q(h_1, …, h_M | x, Λ) is in this framework called the recognition network (or model). Presumably, the name reflects the purpose of the network: to recognize the hidden structure of observations.
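To make the layered structure concrete, here's a minimal sketch of both networks for a Helmholtz machine with two binary (Bernoulli) stochastic layers; the layer sizes are hypothetical, and this only illustrates the two sampling passes, not the Wake-Sleep training procedure discussed below.

```python
# A sketch of a Helmholtz machine with M = 2 binary stochastic layers.
# Each arrow h_{k+1} -> h_k (generative) or h_{k-1} -> h_k (recognition)
# is a small network producing Bernoulli probabilities, followed by sampling.
import torch
import torch.nn as nn

class StochasticLayer(nn.Module):
    """Maps the previous layer to Bernoulli probabilities and samples from them."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.Sigmoid())

    def forward(self, previous):
        probs = self.net(previous)
        return torch.bernoulli(probs)

x_dim, h1_dim, h2_dim = 784, 200, 50

# Generative network: p(h_2) p(h_1 | h_2) p(x | h_1)
gen_h1 = StochasticLayer(h2_dim, h1_dim)   # p(h_1 | h_2, Theta)
gen_x = StochasticLayer(h1_dim, x_dim)     # p(x | h_1, Theta), i.e. h_0 = x

# Recognition network: q(h_1 | x) q(h_2 | h_1)
rec_h1 = StochasticLayer(x_dim, h1_dim)    # q(h_1 | x, Lambda)
rec_h2 = StochasticLayer(h1_dim, h2_dim)   # q(h_2 | h_1, Lambda)

# Ancestral sampling from the generative model (top-down):
h2 = torch.bernoulli(torch.full((1, h2_dim), 0.5))   # a simple prior over h_2
x = gen_x(gen_h1(h2))

# Recognition pass (bottom-up): infer latent codes for an observation x
h1 = rec_h1(x)
h2_inferred = rec_h2(h1)
```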

So, if the VAE is a special case of Helmholtz machines, what's new then? The standard algorithm for learning Helmholtz machines, the Wake-Sleep algorithm, turns out to optimize a different objective. Thus, one of the significant contributions of Kingma and Welling is the application of the reparametrization trick to make optimization of the ELBO w.r.t. Λ tractable.
