
Neural Variational Inference: Blackbox Mode


In the previous post we covered Stochastic VI: an efficient and scalable variational inference method for exponential family models. However, there are many more distributions than those belonging to the exponential family, and inference in those cases requires a significant amount of model-specific analysis. In this post we consider Black Box Variational Inference by Ranganath et al. This work, like the previous one, comes from the lab of David Blei, one of the leading researchers in VI. And, just for dessert, we'll touch upon another paper, which will finally introduce some neural networks into VI.

Blackbox Variational Inference

As we have learned so far, the goal of VI is to maximize the ELBO $\mathcal{L}(\Theta, \Lambda)$. When we maximize it over $\Lambda$, we decrease the gap between the ELBO and the marginal log-likelihood of the model $\log p(x \mid \Theta)$, and when we maximize it over $\Theta$ we actually fit the model. So let's concentrate on optimizing this objective:

$$\mathcal{L}(\Theta, \Lambda) = \mathbb{E}_{q(z \mid x, \Lambda)} \left[ \log p(x, z \mid \Theta) - \log q(z \mid x, \Lambda) \right]$$

Let's find the gradients of this objective:

$$\begin{aligned}
\nabla_\Lambda \mathcal{L}(\Theta, \Lambda)
&= \nabla_\Lambda \int q(z \mid x, \Lambda) \left[ \log p(x, z \mid \Theta) - \log q(z \mid x, \Lambda) \right] dz \\
&= \int \nabla_\Lambda q(z \mid x, \Lambda) \left[ \log p(x, z \mid \Theta) - \log q(z \mid x, \Lambda) \right] dz - \int q(z \mid x, \Lambda) \nabla_\Lambda \log q(z \mid x, \Lambda) \, dz \\
&= \mathbb{E}_q \left[ \frac{\nabla_\Lambda q(z \mid x, \Lambda)}{q(z \mid x, \Lambda)} \log \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)} \right] - \int q(z \mid x, \Lambda) \frac{\nabla_\Lambda q(z \mid x, \Lambda)}{q(z \mid x, \Lambda)} \, dz \\
&= \mathbb{E}_q \left[ \nabla_\Lambda \log q(z \mid x, \Lambda) \log \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)} \right] - \nabla_\Lambda \underbrace{\int q(z \mid x, \Lambda) \, dz}_{= 1} \\
&= \mathbb{E}_q \left[ \nabla_\Lambda \log q(z \mid x, \Lambda) \log \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)} \right]
\end{aligned}$$

In statistics, $\nabla_\Lambda \log q(z \mid x, \Lambda)$ is known as the score function. For more on this "trick" see a blogpost by Shakir Mohamed. In many cases of practical interest $\log p(x, z \mid \Theta)$ is too complicated to compute this expectation in closed form. Recall that we have already used stochastic optimization successfully, so we can settle for just an estimate of the true gradient. We get one by approximating the expectation with a Monte Carlo estimate based on $L$ samples $z^{(l)} \sim q(z \mid x, \Lambda)$ (in practice we sometimes use just $L = 1$ sample, expecting correct averaging to happen automagically due to the use of minibatches):

$$\nabla_\Lambda \mathcal{L}(\Theta, \Lambda) \approx \frac{1}{L} \sum_{l=1}^{L} \nabla_\Lambda \log q(z^{(l)} \mid x, \Lambda) \log \frac{p(x, z^{(l)} \mid \Theta)}{q(z^{(l)} \mid x, \Lambda)}$$
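
To make this concrete, here is a minimal numpy sketch of the estimator, assuming a diagonal Gaussian $q(z \mid x, \Lambda)$ with $\Lambda = (\mu, \log\sigma)$ and a user-supplied black-box `log_joint` for $\log p(x, z \mid \Theta)$ (the helper names and the choice of $q$ are mine, not from the paper):

```python
import numpy as np

# Sketch of the score-function gradient estimator above, for a diagonal
# Gaussian q(z | x, Lambda) with Lambda = (mu, log_sigma).
# log_joint(z) is the black box computing log p(x, z | Theta).

def elbo_score_gradient(log_joint, mu, log_sigma, L=16, rng=np.random):
    sigma = np.exp(log_sigma)
    grad_mu = np.zeros_like(mu)
    grad_log_sigma = np.zeros_like(log_sigma)
    for _ in range(L):
        z = mu + sigma * rng.randn(*mu.shape)         # z ~ q(z | x, Lambda)
        log_q = -0.5 * np.sum(((z - mu) / sigma) ** 2
                              + 2.0 * log_sigma + np.log(2.0 * np.pi))
        signal = log_joint(z) - log_q                 # log p(x,z|Theta) - log q
        # Score function grad_Lambda log q for the diagonal Gaussian:
        grad_mu += (z - mu) / sigma ** 2 * signal / L
        grad_log_sigma += (((z - mu) / sigma) ** 2 - 1.0) * signal / L
    return grad_mu, grad_log_sigma
```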

For the model parameters $\Theta$ the gradient looks even simpler, as we don't need to differentiate with respect to the parameters of the distribution the expectation is taken over:

$$\nabla_\Theta \mathcal{L}(\Theta, \Lambda) = \mathbb{E}_q \left[ \nabla_\Theta \log p(x, z \mid \Theta) \right] \approx \frac{1}{L} \sum_{l=1}^{L} \nabla_\Theta \log p(x, z^{(l)} \mid \Theta)$$

We can even "naturalize" these gradients by premultiplying them by the inverse Fisher information matrix $I(\Lambda)^{-1}$. And that's it! Much simpler than before, right? Of course, there's no free lunch, so there must be a catch... And there is: the performance of stochastic optimization methods crucially depends on the variance of the gradient estimators. This makes perfect sense: the higher the variance, the less information about the step direction we get. Unfortunately, in practice the aforementioned estimator based on the score function has impractically high variance. Luckily, the Monte Carlo community knows many variance reduction techniques; we now describe some of them.

The first technique we'll describe is Rao-Blackwellization. The idea is simple: if it's possible to compute the expectation with respect to some of the random variables analytically, you should do it. If you think about it, this is obvious advice, as you essentially reduce the amount of randomness in your Monte Carlo estimates. But let's put it more formally: we use the law of total expectation to rewrite a joint expectation as a marginal expectation of a conditional one:

$$\mathbb{E}_{X,Y} f(X, Y) = \mathbb{E}_X \left[ \mathbb{E}_{Y \mid X} f(X, Y) \right]$$

Let's see what happens to the variance (in the scalar case) when we use $\mathbb{E}_{Y \mid X} f(X, Y)$ as the estimator instead of $f(X, Y)$:

$$\begin{aligned}
\text{Var}_X\left(\mathbb{E}_{Y \mid X} f(X, Y)\right)
&= \mathbb{E}_X \left(\mathbb{E}_{Y \mid X} f(X, Y)\right)^2 - \left(\mathbb{E}_{X,Y} f(X, Y)\right)^2 \\
&= \text{Var}_{X,Y}(f(X, Y)) - \mathbb{E}_X \left( \mathbb{E}_{Y \mid X} f(X, Y)^2 - \left(\mathbb{E}_{Y \mid X} f(X, Y)\right)^2 \right) \\
&= \text{Var}_{X,Y}(f(X, Y)) - \mathbb{E}_X \text{Var}_{Y \mid X}(f(X, Y))
\end{aligned}$$

This formula says that Rao-Blackwellizing an estimator reduces its variance by $\mathbb{E}_X \text{Var}_{Y \mid X}(f(X, Y))$. You can think of this term as a measure of how much extra randomness $Y$ contributes to $f(X, Y)$ beyond what $X$ already determines. Suppose $Y = X$: then you have $\mathbb{E}_X f(X, X)$, and taking the expectation with respect to $Y$ does not reduce the amount of randomness in the estimator. And this is what the formula tells us, as $\text{Var}_{Y \mid X} f(X, Y)$ would be 0 in this case. Here's another example: suppose $f$ does not use $X$ at all (and $Y$ is independent of $X$): then only the randomness in $Y$ affects the estimate, and after Rao-Blackwellization we expect the variance to drop to 0. The formula agrees with our expectations, since $\text{Var}_{Y \mid X} f(X, Y) = \text{Var}_Y f(X, Y)$ for any $X$, so the reduction term equals the whole variance of the original estimator.
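
A quick numerical sanity check (a toy example of my own, not from the paper): take $X \sim \mathcal{N}(0, 1)$, $Y \mid X \sim \mathcal{N}(X, 1)$ and $f(X, Y) = X + Y$, so that the conditional expectation $\mathbb{E}_{Y \mid X} f(X, Y) = 2X$ is available in closed form:

```python
import numpy as np

# Both estimators are unbiased for E[X + Y] = 0, but the Rao-Blackwellized
# one has lower variance: 5 vs 4, a gap of E_X Var_{Y|X} f(X, Y) = 1.
rng = np.random.default_rng(0)
n = 100_000
x = rng.standard_normal(n)
y = x + rng.standard_normal(n)       # Y | X ~ N(X, 1)

naive = x + y                        # f(X, Y)
rao_blackwellized = 2 * x            # E_{Y|X} f(X, Y)

print(naive.mean(), rao_blackwellized.mean())   # both close to 0
print(naive.var(), rao_blackwellized.var())     # approx. 5 vs approx. 4
```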

The next technique is Control Variates, which is slightly less intuitive. The idea is that we can subtract a scaled zero-mean function $\alpha h(X)$, which preserves the expectation but can reduce the variance. Again, in the scalar case:

$$\text{Var}(f(X) - \alpha h(X)) = \text{Var}(f(X)) - 2 \alpha \, \text{Cov}(f(X), h(X)) + \alpha^2 \text{Var}(h(X))$$

The optimal choice is $\alpha = \frac{\text{Cov}(f(X), h(X))}{\text{Var}(h(X))}$. This formula reflects an obvious fact: if we want to reduce the variance, $h(X)$ must be correlated with $f(X)$. The sign of the correlation does not matter, as $\alpha$ will adjust. By the way, in reinforcement learning such a subtracted term is known as a baseline.
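
Another toy example of my own: estimating $\mathbb{E}[e^X]$ for $X \sim \mathcal{N}(0, 1)$ with the zero-mean control variate $h(X) = X$, which is clearly correlated with $e^X$:

```python
import numpy as np

# Control variate demo: subtracting alpha * X keeps the estimator unbiased
# for E[exp(X)] = e^{1/2} but noticeably shrinks its variance.
rng = np.random.default_rng(1)
x = rng.standard_normal(100_000)
f, h = np.exp(x), x

alpha = np.cov(f, h)[0, 1] / h.var()     # optimal alpha = Cov(f, h) / Var(h)
controlled = f - alpha * h

print(f.mean(), controlled.mean())       # both approx. 1.6487
print(f.var(), controlled.var())         # approx. 4.67 vs approx. 1.95
```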

As we have already learned, $\mathbb{E}_{q(z \mid x, \Lambda)} \nabla_\Lambda \log q(z \mid x, \Lambda) = 0$ (this is exactly the $\int \nabla_\Lambda q \, dz$ term from the derivation above), so the score function is a good candidate for $h$. Therefore our estimates become

$$\nabla_\Lambda \mathcal{L}(\Theta, \Lambda) \approx \frac{1}{L} \sum_{l=1}^{L} \nabla_\Lambda \log q(z^{(l)} \mid x, \Lambda) \circ \left( \log \frac{p(x, z^{(l)} \mid \Theta)}{q(z^{(l)} \mid x, \Lambda)} - \alpha \right)$$

where $\circ$ is pointwise multiplication and $\alpha$ is a vector of $|\Lambda|$ components, with $\alpha_i$ being the baseline for the variational parameter $\Lambda_i$:

$$\alpha_i = \frac{\text{Cov}\left( \nabla_{\Lambda_i} \log q(z \mid x, \Lambda) \left( \log p(x, z \mid \Theta) - \log q(z \mid x, \Lambda) \right),\ \nabla_{\Lambda_i} \log q(z \mid x, \Lambda) \right)}{\text{Var}\left( \nabla_{\Lambda_i} \log q(z \mid x, \Lambda) \right)}$$
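
In practice these covariances and variances are themselves estimated from Monte Carlo samples. One way this might look, with hypothetical helper names of my own (`scores[l, i]` holds $\nabla_{\Lambda_i} \log q(z^{(l)} \mid x, \Lambda)$ and `signal[l]` holds $\log p(x, z^{(l)} \mid \Theta) - \log q(z^{(l)} \mid x, \Lambda)$):

```python
import numpy as np

def optimal_baselines(scores, signal):
    # alpha_i = Cov(f_i, h_i) / Var(h_i), with f_i = score_i * signal
    # and control variate h_i = score_i.
    f = scores * signal[:, None]
    h = scores
    cov_fh = ((f - f.mean(0)) * (h - h.mean(0))).mean(0)
    return cov_fh / (h.var(0) + 1e-12)   # epsilon avoids division by zero

def controlled_gradient(scores, signal, alpha):
    # (1/L) sum_l  score_l o (signal_l - alpha)
    return (scores * (signal[:, None] - alpha[None, :])).mean(0)
```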

Neural Variational Inference and Learning

Hooray, neural networks! In this section I'll briefly describe a variance reduction technique proposed by A. Mnih and K. Gregor in Neural Variational Inference and Learning in Belief Networks. The idea is surprisingly simple: why not learn the baseline $\alpha$ using a neural network?

$$\nabla_\Lambda \mathcal{L}(\Theta, \Lambda) \approx \frac{1}{L} \sum_{l=1}^{L} \nabla_\Lambda \log q(z^{(l)} \mid x, \Lambda) \circ \left( \log \frac{p(x, z^{(l)} \mid \Theta)}{q(z^{(l)} \mid x, \Lambda)} - \alpha - \alpha(x) \right)$$

where $\alpha(x)$ is a neural network (alongside the constant baseline $\alpha$) trained to minimize

$$\mathbb{E}_{q(z \mid x, \Lambda)} \left( \log \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)} - \alpha - \alpha(x) \right)^2$$

What's the motivation for this objective? The gradient step along $\nabla_\Lambda \mathcal{L}(\Theta, \Lambda)$ can be seen as pushing $q(z \mid x, \Lambda)$ towards $p(x, z \mid \Theta)$. Since $q$ has to be normalized like any other proper distribution, it's actually pushed towards the true posterior $p(z \mid x, \Theta)$. We can rewrite the gradient $\nabla_\Lambda \mathcal{L}(\Theta, \Lambda)$ as

$$\begin{aligned}
\nabla_\Lambda \mathcal{L}(\Theta, \Lambda)
&= \mathbb{E}_q \left[ \nabla_\Lambda \log q(z \mid x, \Lambda) \left( \log p(x, z \mid \Theta) - \log q(z \mid x, \Lambda) \right) \right] \\
&= \mathbb{E}_q \left[ \nabla_\Lambda \log q(z \mid x, \Lambda) \left( \log p(z \mid x, \Theta) - \log q(z \mid x, \Lambda) + \log p(x \mid \Theta) \right) \right]
\end{aligned}$$

While this additional $\log p(x \mid \Theta)$ term does not contribute to the expectation (the score function has zero mean), it does affect the variance of the estimator. Therefore, $\alpha(x)$ is essentially trained to estimate the marginal log-likelihood $\log p(x \mid \Theta)$.
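
Here is a minimal sketch of such an input-dependent baseline. For brevity (and unlike the paper, which uses a proper neural network) I use a linear model $\alpha(x) = Wx + b$ trained with plain SGD on the squared-error objective above:

```python
import numpy as np

class LinearBaseline:
    """alpha(x) = W @ x + b, one SGD step per observed learning signal."""
    def __init__(self, dim, lr=1e-3):
        self.W = np.zeros(dim)
        self.b = 0.0
        self.lr = lr

    def __call__(self, x):
        return self.W @ x + self.b

    def update(self, x, signal):
        # Gradient descent on (signal - alpha(x))^2;
        # the factor of 2 is absorbed into the learning rate.
        residual = signal - self(x)
        self.W += self.lr * residual * x
        self.b += self.lr * residual
        return residual   # centered signal to plug into the gradient estimate
```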

The paper also lists several other variance reduction techniques that can be used in combination with the neural network-based baseline:

  • Constant baseline — an analogue of Control Variates that uses a running average of $\log p(x, z \mid \Theta) - \log q(z \mid x, \Lambda)$ as the baseline (see the sketch after this list)
  • Variance normalization — normalizes the learning signal to unit variance, which is equivalent to an adaptive learning rate
  • Local learning signals — falls outside the scope of this post, as it requires model-specific analysis and alterations, and can't be used in the blackbox regime
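
A sketch of the first two tricks via exponential running averages (my own minimal rendering; the exact update rules in the paper may differ):

```python
import numpy as np

class SignalNormalizer:
    """Constant baseline + variance normalization for the learning signal."""
    def __init__(self, momentum=0.9):
        self.mean, self.var, self.momentum = 0.0, 1.0, momentum

    def __call__(self, signal):
        m = self.momentum
        self.mean = m * self.mean + (1 - m) * signal
        self.var = m * self.var + (1 - m) * (signal - self.mean) ** 2
        # Center by the running mean (constant baseline), then scale by the
        # running standard deviation, but never amplify an already-small signal.
        return (signal - self.mean) / max(np.sqrt(self.var), 1.0)
```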
