Understanding Bayesian Linear Regression and Gaussian Processes with Normal Distributions

家有仙妻宝宝 2022-04-03

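In the previous article, we studied the properties of the normal distribution that are central to machine learning (ML). Here, we apply that knowledge, manipulating normal distributions to solve otherwise intractable Bayesian inference. We demonstrate it with Bayesian linear regression and Gaussian processes, and we work through some proofs to apply what we learned. We assume you have read the previous article; if you have not, please read it first.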

Bayesian linear regression

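Bayesian linear regression takes advantage of how convenient operations on normal distributions are to solve the regression problem analytically.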

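For a dataset with sample size n in which each data point has m features, Bayesian linear regression is defined as

y = Xθ + ε

where X is the n × m matrix of inputs, and both the prior on the parameters θ and the error ε are assumed to be normally distributed.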

Posterior

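When working with normal distributions, if the result is known to be a probability distribution, we can concentrate on combining the normal distributions and ignore any scaling factors; all we need are the parameters of the resulting normal distribution. Here, the posterior is proportional to the product of the likelihood and the prior,

p(θ | X, y) ∝ p(y | X, θ) p(θ)

so the posterior is itself a normal distribution.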
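As a concrete sketch of this posterior, suppose (as an assumption, since the exact parameterization is not shown here) that the prior is θ ~ N(μ₀, Σ₀) and the noise is ε ~ N(0, σ²I). Multiplying the Gaussian likelihood by the Gaussian prior and completing the square in θ then gives a normal posterior with covariance Σₙ = (Σ₀⁻¹ + XᵀX/σ²)⁻¹ and mean μₙ = Σₙ(Σ₀⁻¹μ₀ + Xᵀy/σ²). The NumPy snippet below evaluates these on synthetic data; the names `posterior`, `mu0`, `Sigma0`, and `sigma2` are illustrative, not taken from the article.

```python
import numpy as np

def posterior(X, y, mu0, Sigma0, sigma2):
    """Posterior N(mu_n, Sigma_n) of theta for y = X @ theta + eps.

    Assumed (not from the article): prior theta ~ N(mu0, Sigma0) and
    noise eps ~ N(0, sigma2 * I).
    """
    prior_precision = np.linalg.inv(Sigma0)
    # Precisions (inverse covariances) add when two Gaussians in theta are multiplied.
    post_precision = prior_precision + X.T @ X / sigma2
    Sigma_n = np.linalg.inv(post_precision)
    mu_n = Sigma_n @ (prior_precision @ mu0 + X.T @ y / sigma2)
    return mu_n, Sigma_n

# Synthetic data with a hypothetical ground-truth theta.
rng = np.random.default_rng(0)
n, m = 200, 3
theta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, m))
sigma2 = 0.25
y = X @ theta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

mu_n, Sigma_n = posterior(X, y, np.zeros(m), np.eye(m), sigma2)
print("posterior mean:", mu_n)
print("posterior std: ", np.sqrt(np.diag(Sigma_n)))
```

With a zero-mean, identity-covariance prior, the posterior mean coincides with the ridge regression solution with penalty σ², which is a handy sanity check.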

Posterior predictive distribution

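To compute the posterior predictive distribution in Bayesian inference, we integrate over all possible values of the model parameters θ, weighting each by the posterior:

p(y* | x*, D) = ∫ p(y* | x*, θ) p(θ | D) dθ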

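This integration is usually intractable. In Bayesian linear regression, however, the likelihood has the form of a Gaussian function.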

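We can therefore use the properties of the normal distribution to compute the posterior predictive distribution more easily. Recall that the Bayesian regression model is defined as

y = Xθ + ε

and that the linear transformation rule for a normal distribution is: if x ~ N(μ, Σ), then Ax ~ N(Aμ, AΣAᵀ).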

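Substituting x*ᵀ for A and θ for x gives a normal distribution for x*ᵀθ. Applying the sum rule for independent normal variables to account for ε in the regression equation adds the noise variance to it, so the posterior predictive distribution of y* is again a normal distribution.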

The mean of this distribution is the point estimate of y* given x*.
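To make the two rules concrete, assume the posterior is θ | D ~ N(μₙ, Σₙ) and the noise is ε ~ N(0, σ²); this notation is an assumption of the sketch. The linear transformation rule gives x*ᵀθ ~ N(x*ᵀμₙ, x*ᵀΣₙx*), and the sum rule then gives y* | x*, D ~ N(x*ᵀμₙ, x*ᵀΣₙx* + σ²). The snippet below evaluates this closed form for hypothetical posterior parameters and compares it with a Monte Carlo estimate of the same integral.

```python
import numpy as np

def predictive(x_star, mu_n, Sigma_n, sigma2):
    """Mean and variance of y* = x*^T theta + eps with theta ~ N(mu_n, Sigma_n)."""
    mean = x_star @ mu_n                          # linear transformation rule (mean)
    var = x_star @ Sigma_n @ x_star + sigma2      # transformed variance plus noise (sum rule)
    return mean, var

# Hypothetical posterior parameters, chosen only for illustration.
mu_n = np.array([1.0, -2.0])
Sigma_n = np.array([[0.04, 0.01],
                    [0.01, 0.09]])
sigma2 = 0.25
x_star = np.array([0.5, 2.0])

mean, var = predictive(x_star, mu_n, Sigma_n, sigma2)

# Monte Carlo check: integrate over theta by sampling instead of analytically.
rng = np.random.default_rng(0)
theta = rng.multivariate_normal(mu_n, Sigma_n, size=200_000)
y_samples = theta @ x_star + rng.normal(scale=np.sqrt(sigma2), size=len(theta))

print("closed form:", mean, var)
print("sampled:    ", y_samples.mean(), y_samples.var())
```

The sampled mean and variance should agree with the closed form up to Monte Carlo error, which is the point of the analytic shortcut: no sampling or numerical integration is needed.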

Next, we discuss the Gaussian process, which explores the relationship between data points using normal distributions.

Gaussian process (GP)

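Let's take a quick look at what a GP can do. A Gaussian process (GP) defines a distribution over functions. What does that mean? Given the training dataset D = {(x₁, y₁), (x₂, y₂), (x₃, y₃), (x₄, y₄)}, the function f below fits D exactly.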

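But a Bayesian never settles for a point estimate! In fact, infinitely many functions fit these data points exactly, but some functions are more likely than others. For example, the function f² below seems more likely than the others because it bends less to fit the data. Frequentists would use maximum likelihood estimation (MLE) to give us one final regression model. Bayesians instead model all the possibilities. To demonstrate, we can sample from the GP repeatedly and plot the functions that the Bayesian model predicts.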

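A GP is a generative model. It generates functions (samples) that fit

the observations, and
our belief about how the data are correlated and what their expected values (means) are.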

In GP regression, the model predicts a normal distribution for y* given x*, i.e. the probability density function of f(x*).

The lazy accountant

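Let's walk through a quick illustration. At the beginning of each quarter, an accountant reports to the CEO the balance between accounts receivable (AR) and accounts payable (AP), where balance = AR minus AP. At any point in time, AR can be higher than AP or vice versa. Unfortunately, this company only manages to break even: on average, the balance is zero.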

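The accountant retired, and Paul, a smart but lazy accountant, was hired. Instead of doing his job, he created the multivariate normal distribution below and generated the balances for the next 20 months automatically.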

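This distribution makes sense because the expected value is 0. Here is a plot of a single sample containing the balances for the next 20 months.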

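The covariance matrix Σ controls how the data points are correlated. When data point i is uncorrelated with data point j, Σᵢⱼ equals 0; when Σᵢⱼ is greater than zero, they are positively correlated. The Σ that Paul chose is an exponential function of the distance between data points, so the balances in neighboring months are similar. s is a scale factor that reproduces the range of the balances, and l is a tunable hyperparameter (also known as the kernel width). With a larger l, each data point influences neighbors that are farther away. Put another way, each data point is influenced by more of its neighbors, so the resulting curve is smoother.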
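The effect of l can be checked numerically. The sketch below assumes the common squared-exponential form Σᵢⱼ = s² exp(−(xᵢ − xⱼ)² / (2l²)); the exact formula Paul used is not visible here, so treat this form, the function name `se_cov`, and the values of s and l as assumptions. It draws 20-month balance curves for two kernel widths and measures how much they change from month to month.

```python
import numpy as np

def se_cov(x, s=1.0, l=1.0):
    """Assumed kernel: Sigma_ij = s^2 * exp(-(x_i - x_j)^2 / (2 * l^2))."""
    d = x[:, None] - x[None, :]
    return s**2 * np.exp(-d**2 / (2 * l**2))

months = np.arange(1, 21, dtype=float)       # the next 20 months
mean = np.zeros(len(months))                 # the company breaks even on average
jitter = 1e-8 * np.eye(len(months))          # small diagonal term for numerical stability
rng = np.random.default_rng(0)

roughness = {}
for l in (1.0, 4.0):
    Sigma = se_cov(months, s=1.0, l=l) + jitter
    # 200 sampled curves, each holding 20 monthly balances.
    curves = rng.multivariate_normal(mean, Sigma, size=200)
    # Average absolute month-to-month change; smaller means smoother curves.
    roughness[l] = np.abs(np.diff(curves, axis=1)).mean()

print(roughness)
```

The larger width should report the smaller month-to-month change, matching the claim that a larger l produces smoother curves.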

The scheme was so successful that Paul created a new model that samples one data point per day for the next 20 months.

The plotted graph is smooth and just looks like a function. So he calls his system a sampler of functions. Every sample is a sample of the function f.

The CFO is impressed with the plot and asks him to redo the monthly balance for the last financial year. He cannot reuse the last function sampler anymore. It will not match the previous quarterly reports.

As discussed before, if we know a multivariate normal distribution, we can recover the probability distribution of the missing data given the observed data.

As in this example, the probability distribution for x₁ given x₂=2 is
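For a bivariate normal, the conditional distribution of x₁ given x₂ = a is normal with mean μ₁ + (Σ₁₂/Σ₂₂)(a − μ₂) and variance Σ₁₁ − Σ₁₂²/Σ₂₂. The snippet below applies this formula; the mean vector and covariance matrix are hypothetical values chosen for illustration and are not the ones from the example above.

```python
import numpy as np

def condition_bivariate(mu, Sigma, x2):
    """Mean and variance of x1 given x2 for (x1, x2) ~ N(mu, Sigma)."""
    mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2 - mu[1])
    var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
    return mean, var

# Hypothetical parameters: zero means, unit variances, correlation 0.8.
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])

cond_mean, cond_var = condition_bivariate(mu, Sigma, x2=2.0)
print(cond_mean, cond_var)
```

With these numbers, observing x₂ = 2 moves the mean of x₁ from 0 to about 1.6 and shrinks its variance from 1 to about 0.36: the stronger the correlation, the more an observation tells us about the missing value.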

Paul realizes that too. If he knows f (the last four quarterly balances), he can recreate the distribution of f* for the last 12 months. The first four entries in his new model are the reported balances at months 1, 4, 7, and 10 (one per financial quarter). The next 12 entries (f*) represent the new curve for the last financial year.

The diagram below plots one of the functions sampled.

The original prior defines the mean and the covariance of the functions. It is our prior belief about the expected value of f and about how the data are correlated. Given the observations (the four balances), the posterior enforces the constraint that every sampled function must pass through the red dots below.
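Paul's trick can be sketched with the conditional formula for a partitioned normal: with a zero mean, f* | f ~ N(K*ᵀ... in matrix form, mean = K_s K⁻¹ f and covariance = K_ss − K_s K⁻¹ K_sᵀ, where K, K_s, and K_ss are the kernel matrices among the observed months, between new and observed months, and among the new months. The kernel form, its parameters, and the four reported balances below are hypothetical.

```python
import numpy as np

def kernel(a, b, s=1.0, l=2.0):
    """Assumed squared-exponential kernel with hypothetical s and l."""
    d = a[:, None] - b[None, :]
    return s**2 * np.exp(-d**2 / (2 * l**2))

x_obs = np.array([1.0, 4.0, 7.0, 10.0])      # months of the quarterly reports
f_obs = np.array([0.5, -0.3, 0.8, 0.1])      # hypothetical reported balances
x_new = np.arange(1.0, 13.0)                 # the 12 months to regenerate

K = kernel(x_obs, x_obs) + 1e-10 * np.eye(len(x_obs))   # tiny jitter for stability
K_s = kernel(x_new, x_obs)
K_ss = kernel(x_new, x_new)

# Conditional of a zero-mean joint normal: f* | f ~ N(mean, cov).
mean = K_s @ np.linalg.solve(K, f_obs)
cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)

# Symmetrize and add jitter before sampling one posterior function.
cov_j = (cov + cov.T) / 2 + 1e-8 * np.eye(len(x_new))
rng = np.random.default_rng(0)
sample = rng.multivariate_normal(mean, cov_j)

print(np.round(sample, 3))
```

At months 1, 4, 7, and 10 the sampled curve reproduces the reported balances (up to the jitter), while the months in between vary from sample to sample.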

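But sometimes the CEO is only interested in a few data points, for example the balances on day 60 and day 280. This can be modeled with the new multivariate normal distribution below. It samples a function f* that predicts the outputs for specific inputs. For example, one possible sample of f* is [f*¹(125) = -0.2, f*¹(300) = 1.1].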

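There is one more important observation. As we move farther away from the red dots, the variance of f increases: the predictions cover a wider range of guesses with lower certainty. In short, the uncertainty of a prediction increases as we move away from the known data points.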
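This growth in uncertainty can be read directly from the posterior variance κ(x*, x*) − k*ᵀK⁻¹k*, which depends only on where the observations are, not on their values. The sketch below assumes a squared-exponential kernel with hypothetical parameters and evaluates the variance at query months that move progressively away from the last observed month.

```python
import numpy as np

def kernel(a, b, s=1.0, l=2.0):
    """Assumed squared-exponential kernel with hypothetical s and l."""
    d = a[:, None] - b[None, :]
    return s**2 * np.exp(-d**2 / (2 * l**2))

x_obs = np.array([1.0, 4.0, 7.0, 10.0])              # observed months (the red dots)
x_query = np.array([10.0, 11.0, 12.0, 14.0, 18.0])   # moving away from month 10

K = kernel(x_obs, x_obs) + 1e-10 * np.eye(len(x_obs))
K_s = kernel(x_query, x_obs)

# Diagonal of K_ss - K_s K^{-1} K_s^T, computed row by row.
prior_var = np.diag(kernel(x_query, x_query))
post_var = prior_var - np.sum(K_s * np.linalg.solve(K, K_s.T).T, axis=1)

for x, v in zip(x_query, post_var):
    print(f"month {x:4.0f}: posterior variance {v:.3f}")
```

The variance is essentially zero at the observed month and climbs back toward the prior variance s² as the query point moves out of reach of the data.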

GP definition

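A Gaussian process is a normal distribution over functions. Each GP is defined by a mean function m(x) and a covariance function κ(x, x’). κ models the covariance (correlation) between f(x) and f(x’).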

Paul sets m(x) to 0 to reflect the expected balance. If we keep sampling functions, the average of f¹(x), f²(x), f³(x), … approaches m(x).

Paul models the covariance function with a Gaussian kernel.

For any pair of x and x’, the sampled data points f(x) and f(x’) form a bivariate normal distribution. As a simple demonstration, let x’ equal x. As shown below, f(x), the red curve, is then normally distributed with variance κ(x, x).
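Written out, the definition and the pairwise statement look as follows. The Gaussian kernel is given in a common parameterization with scale s and width l, which is an assumption here because the original formula is not shown.

```latex
\begin{aligned}
f(x) &\sim \mathcal{GP}\big(m(x),\, \kappa(x, x')\big), \qquad m(x) = 0,\\
\kappa(x, x') &= s^2 \exp\!\left(-\frac{(x - x')^2}{2 l^2}\right)
  \quad \text{(assumed parameterization)},\\
\begin{bmatrix} f(x) \\ f(x') \end{bmatrix}
 &\sim \mathcal{N}\!\left(
   \begin{bmatrix} m(x) \\ m(x') \end{bmatrix},
   \begin{bmatrix} \kappa(x, x) & \kappa(x, x') \\ \kappa(x', x) & \kappa(x', x') \end{bmatrix}
 \right).
\end{aligned}
```

Taking a single point gives the marginal f(x) ~ N(m(x), κ(x, x)), so under this kernel every f(x) has variance s².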

How do we sample data from a GP? A function can be viewed as a collection of random variables f(x₁), f(x₂), f(x₃), … For a continuous function, the number of random variables is unlimited, so a GP has infinitely many dimensions. In practice, we can focus on the k variables of interest, even though k can be very large.

Definition: A GP is a collection of random variables in which the joint distribution of every finite subset of random variables is multivariate normally distributed.

By taking advantage of this definition, we can model a k-variate normal distribution taken from the joint distribution. For example, Paul creates the finite subset X = {x₁, x₂, x₃, x₄, …, x₁₂} for the coming year's report. It contains 12 random variables, each holding the balance for one month. Once the 12-variate normal distribution is defined, we sample values from it to create sample functions.
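A minimal sketch of this finite-subset sampling is shown below. It assumes a zero mean function and a Gaussian kernel with hypothetical scale and width; the names `m` and `kappa` mirror the symbols in the text. It also checks two consequences of the definition: the average of many sampled functions approaches m(x), and any smaller subset of months has the matching sub-block as its covariance.

```python
import numpy as np

def m(x):
    """Mean function: Paul's zero expected balance."""
    return np.zeros_like(x)

def kappa(a, b, s=1.0, l=2.0):
    """Assumed Gaussian kernel with hypothetical scale s and width l."""
    d = a[:, None] - b[None, :]
    return s**2 * np.exp(-d**2 / (2 * l**2))

X = np.arange(1.0, 13.0)                       # x1 ... x12, one per month
cov = kappa(X, X) + 1e-8 * np.eye(len(X))      # jitter for numerical stability

rng = np.random.default_rng(0)
functions = rng.multivariate_normal(m(X), cov, size=5000)   # each row is one sampled f

# The average of the sampled functions approaches the mean function m(x) = 0.
print("largest |average|:", np.abs(functions.mean(axis=0)).max())

# Any sub-collection is again multivariate normal with the matching sub-block
# as covariance, so a 3-month model agrees with the 12-month model.
subset = [0, 5, 11]
sub_cov = kappa(X[subset], X[subset])
empirical = np.cov(functions[:, subset], rowvar=False)
print(np.round(sub_cov, 3))
print(np.round(empirical, 3))
```

Plotting a few rows of `functions` against `X` gives the sampled curves; using a denser grid for `X` makes them look like continuous functions.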