【原】R笔记：多重线性回归（二）_适用条件考察

Memo_Cleon 2021-01-28

展开全文

转自个人微信公众号【Memo_Cleon】的统计学习笔记：R笔记：多重线性回归（二）_适用条件考察。

书接上回……

我们在<<多重线性回归（一）_模型拟合>>一文中已经建立了回归模型lmfit并对模型参数进行了估计，但建立回归方程实际上是整个回归分析里面最简单的一部分，数据适不适合采用线性回归，回归模型对数据的拟合性如何还需要更进一步的考察验证。

线性回归基本适用条件：独立、线性、正态和方差齐同，本次笔记在上次笔记建立多重回归模型lmfit后考察这几个适用条件。

【4】适用条件的考察

线性回归适用条件和模型的诊断多涉及残差分析。R的基础包几个常用的残差计算分析函数如下：

residuals()：模型残差；

rstandard()：标准化残差，或者直接计算sqrt(deviance(model)/df.residual(model)),…)；

rstudent()：学生化残差；

predict()：预测值；

plot()：绘制模型诊断图或构建相应的残差图。plot函数可利用模型残差绘制6幅图（默认4幅，相当于which=1:4，Cook距离和Cook距离对杠杆值图默认不显示），实现模型正态性、方差齐性及异常点等的分析。

plot.lm {stats}

Six plots (selectable by which) are currently available: a plot of residuals against fitted values, a Scale-Location plot of sqrt(|residuals|) against fitted values, a Normal Q-Q plot, a plot of Cook's distances versus row labels, a plot of residuals against leverages, and a plot of Cook's distances against leverage/(1-leverage). By default, the first three and 5 are provided. 1:Residuals vs Fitted; 2: Normal Q-Q; 3: Scale-Location; 4: Cook's distance; 5: Residuals vs Leverage; 6:Cook's dist vs Leverage h_ii/(1-h_ii).

par(mfrow=c(2,3)) #par {graphics}将plot()函数生成的四幅图绘制在一个大图中，大图按2×3排列四幅小图

plot(lmfit, which=1:6, col ="blue")

结果如下，基本满足线性、正态性和方差齐性，但可能存在异常点。

残差对拟合值散点图：可用于判断残差中是否还存在非线性趋势成分，红色的是lowess(局部加权回归散点修匀法)曲线；There could be a non-linear relationship between predictor variables and an outcome variable and the pattern could show up in this plot if the model doesn’t capture the non-linear relationship. If you find equally spread residuals around a horizontal line without distinct patterns, that is a good indication you don’t have non-linear relationships。该散点图也常用来判断判断方差齐性。

残差的正态QQ图：用于判断残差的正态性，It’s good if residuals are lined well on the straight dashed line.

Scale-Location图：残差绝对值平方根对拟合值散点图，常用于判断方差齐性。It’s also called Spread-Location plot. This plot shows if residuals are spread equally along the ranges of predictors. This is how you can check the assumption of equal variance (homoscedasticity). It’s good if you see a horizontal line with equally (randomly) spread points.

Cook距离图：各观察值的Cook距离，判断异常值Cook's distance is a measure of how much the residuals of all records would change if a particular record were excluded from the calculation of the model coefficients. A large Cook's distance indicates that excluding a record from changes the coefficients substantially, and should therefore be considered influential.[SPSS]

标准化残差对杠杆值散点图：寻找强影响点。This plot helps us to find influential cases.We watch out for outlying values at the upper right corner or at the lower right corner. Those spots are the places where cases can be influential against a regression line. Look for cases outside of a dashed line, Cook’s distance. When cases are outside of the Cook’s distance (meaning they have high Cook’s distance scores), the cases are influential to the regression results. The regression results will be altered if we exclude those cases.

In the Cook's distance vs leverage/(1-leverage) plot, contours of standardized residuals that are equal in magnitude are lines through the origin. The contour lines are labelled with the magnitudes.通过原点的线是标准化残差等高线。

参考：

A Modern Approach to Regression with R.

Understanding Diagnostic Plots for Linear Regression Analysis：https://data.library./diagnostic-plots/

How does plot.lm() determine outliers for residual vs fitted plot?：https:///questions/39259252/how-does-plot-lm-determine-outliers-for-residual-vs-fitted-plot

除了基础包，R中有大量的包（比如car包、olsrr包、plotmo包、HH包、gvlma包）都可以实现对线性回归适用条件的考察。

（1）独立性

独立性的判断还是以事实经验为主，没有理由认为一个婴儿的出生体重会对其他婴儿的出生体重有影响，可以认为因变量值（或者残差）相互独立。Durbin-Watson检验可用于检测数据的是否具有随时间变化的自相关（相隔时间越近相关性越大），但该检验统计不显著仅代表没有自相关关系，并不代表数据就不相关，本例非聚集性数据并不适用。R中的Durbin-Watson检验函数有很多：DurbinWatsonTest {DescTools}、dwtest {lmtest}、test.DW {dcv}、durbinWatsonTest {car}、score_dw {auditor}、reg.dw {YRmisc}、reg.dw {PMmisc}……，使用很简单需要时可参考，比如

library(DescTools)

DurbinWatsonTest(lmfit)

（2）结局变量与各个解释变量呈线性关系

自变量与结局变量间的散点图、残差与自变量的散点图、残差对拟合值散点图、偏残差图（成分残差图）是常用的考察办法。当解释变量和结局变量不满足线性关系时，解释变量是分类变量应考虑将其设为哑变量，解释变量是连续变量进行Box-Tidwell变换。

各个自变量与结局变量间的散点图矩阵可以直观的大体了解各个自变量与响应变量间的关系。在数据初步考察里面我们用到了各个变量间的散点图来初步观察这种关系。相比这种散点图，残差与自变量的散点图效率更高一些，该图在<<线性回归中的线性考察>>一文中做过介绍。以绘制残差与自变量age的散点图为例：

lmres<-residuals(lmfit)

plot(lmres~lmdata$age,col="blue")

abline(h=0,lwd=2,col="red") #添水平直线y=0，线宽为2，颜色为绿色。添加水平直线用v，添加垂直直线h。添加残差与变量的的拟合直线：abline(lm(lmres~lmdata$age),lwd=1, col="blue")

年龄对应的残差点基本上是平均分布在y=0的直线两侧，无明显偏负或偏正，年龄与新生儿体重基本呈线性。

我们也可以直接利用现成的函数来绘制，比如residual.plots {HH}，但函数residual.plots()结果显示不友好，可以改用函数residual.plots.lattice {HH}。

residual.plots {HH}

Residual plots for a linear model. Four sets of plots are produced: (1) response against each of the predictor variables, (2) residuals against each of the predictor variables, (3) partial residuals for each predictor against that predictor ("partial residuals plots", and (4) partial residuals against the residuals of each predictor regressed on the other predictors ("added variable plots").

residual.plots.lattice {HH}

Construct four sets of regression plots. Response variable $Y$ against each $X_j$, residuals $e$ against each $X_j$, partial residuals plots of $e^j$ against each $X_j$, added variable plots of $e^j$ against the residuals of each $X_j$ adjusted for the other $X$ columns. The slopes shown in the panels of both bottom rows are equal to the regression coefficients.

library(HH);library(lattice);library(grid);library(latticeExtra);library(multcomp);library(survival);library(TH.data);library(MASS);library(gridExtra)

residual.plots.lattice(lmfit)

结果中第一行是各个自变量与因变量的关系；第二行是各个自变量与残差的关系，从各个自变量与残差的关系看，基本满足线性关系；第三行是各自变量的偏残差图；第四行是添加变量图（偏回归图）。

除了普通残差图，偏残差图（Partial Residual Plots）是更为专业的线性判断方法，它可以排除混杂因素的影响，能够更准确地判定自变量和因变量是否为线性。偏残差图也称成分+残差图（Component plus Residual Plots），偏残差图就是偏残差（用除自变量Xi外其他所有自变量校正过的因变量）与预测变量Xi构建的散点图，详细介绍可参见<<偏回归图与偏残差图>>。上述residual.plots.lattice {HH}函数给出的结果中，第三行结果就是各个自变量的偏残差图，结果显示校正了其他变量的影响后，年龄和随访次数对新生儿体重的影响不大，孕前体重越重新生儿体重越重，母亲吸烟、有早产经历、患高血压、有应激性新生儿体重越轻，相比白人，黑人和其他种族的孕妇的新生儿体重更轻一些。当然这些只是图片上显示出来的信息，并不能判定有没有统计学意义。

变量较多时residual.plots.lattice {HH}的结果细节显示不足，可以使用ols_plot_comp_plus_resid{olsrr}和crPlots{car}来绘制出成分残差图。crPlots{car}结果的成分残差图不仅给出了偏残差图的散点（Xi,ε+βXi）及其平滑拟合曲线（粉红色，局部加权回归散点平滑法(locally weighted scatterplot smoothing，LOWESS或LOESS)），同时给出了Xi与βXi的拟合直线，如果直线与拟合曲线差别较大则表明自变量与因变量之间可能存在非线性关系，则需要添加一些曲线成分如二次项、三次项等；如果拟合直线与曲线接近重合则表明自变量Xi与因变量存在很好的线性关系，但非水平直线才表明对影响变量有影响（水平线表示该自变量系数为0）。

library(car);library(carData)

crPlots(lmfit)

结果显示年龄、随访次数与新生儿体重偏残差拟合线与平滑拟合线重合性还可以，在年龄中可能存在一个异常点需要注意，但直线拟合线基本呈水平表明其对新生儿体重影响不大。孕前体重与新生儿体重的线性假设是合适的，而且对新生儿体重也有影响。

library(olsrr)

ols_plot_comp_plus_resid(lmfit)

我们仅展示age、lwt及race的结果。由于种族是3个水平被设置为因子后结果以2个虚拟变量来展示。各变量线性关系良好，但age对bwt影响不大，孕前体重与新生儿体重呈正相关，与白人相比，黑人和其他种族孕母的新生儿体重更轻一些。

olsrr包功能强大，Tools for Building OLS Regression Models.Tools designed to make it easier for users, particularly beginner/intermediate R users to build ordinary least squares regression models. Includes comprehensive regression output, heteroskedasticity tests, collinearity diagnostics, residual diagnostics, measures of influence, model fit assessment and variable selection procedures.

https://cran./web/packages/olsrr/

（3）正态性

线性回归模型的正态性指的并不是指因变量呈正态分布，而是要求模型的残差服从正态分布，详细可参见<<线性回归中的正态分布>>。你可以可对模型残差进行统计学检验，也可以采用相关的专业图进行判定。如果数据不满足正态分布，可以考虑进行数据变换，比如Box-Cox变换，或者换用其他的统计学方法。

第一类方法是先生成模型的残差，然后对残差进行正态分布的检验。比较常见的方法是Shapiro-Wilk W检验和Kolmogorov-Smirnov检验，这些我们在<<R笔记：正态分布的检验>>一文中都有介绍，以Shapiro-Wilk W检验为例，

lmres<-residuals(lmfit)

shapiro.test(lmres)

结果如下，表明残差满足正态分布（P=0.612>0.05）。

Shapiro-Wilk normality test

data: lmres

W = 0.99378, p-value = 0.612

除了直接对残差进行检验，我们还可以进行图示法，比如在前面plot函数结果中有模型残差的Q-Q图，

qqnorm(lmres)

结果就是Q-Q图我们就不单独展示了。我们还可以使用专用函数，比如qqPlot{car}：

library(car);library(carData)

qqPlot(lmfit) #Quantile-Comparison Plot.these functions return the labels of identified points, unless a grouping factor is employed, in which case NULL is returned invisibly.这是对函数返回值的解释，我其实没怎么看懂，但结合前面的图示返回值应该是两个异常点。

[1] 130 132

所有的点都在Q-Q线95%CI内（如在95%CI的之外的点可初步判定为离群点），表明满足残差满足正态分布的要求。但130和132可能是异常点。

（4）方差齐性

线性回归中的等方差指因变量残差不随所有自变量取值水平的变化而变化，在自变量与残差或预测值与残差的散点图上，标准化残差随机、均匀的散布在0横线上下两侧，即不论自变量或因变量预测值如何变化，标准化残差的波动基本保持稳定，可认为方差基本相等。同正态性检验一样，方差齐性检验除了图示法，也可以进行统计检验。这些我们在<<线性回归中的方差齐性探察>>一文中都有介绍。如果残差不满足方差齐同的要求，因变量变换、加权最小二乘法、稳健回归是常用的处理方法。

图示法我们在前面用plot()绘制模型诊断图时已有过说明，此处演示plotres{plotmo}绘制Sqrt abs residuals vs fitted图，结果同前，不再赘述。该函数同样可以输出残差的其他诊断图，比如Q-Q图，残差对预测值等。

library(plotmo);library(Formula);library(plotrix);library(TeachingDemos)

plotres(lmfit,which=c(1,3,4,6),col="blue") #Which plots do draw. Default is 1:4. 1:Model plot. What gets plotted here depends on the model class. For example, for earth models this is a model selection plot. Nothing will be displayed for some models. For details, please see the plotres vignette; 2:Cumulative distribution of abs residuals; 3:Residuals vs fitted; 4:QQ plot; 5:Abs residuals vs fitted; 6:Sqrt abs residuals vs fitted; 7:Abs residuals vs log fitted; 8:Cube root of the squared residuals vs log fitted; 9:Log abs residuals vs log fitted.

R中很多函数都可以实现线性回归的方差齐性检验：bptest{lmtest}、ncvTest{car}、spreadLevelPlot{car}。统计检验法我们以bptest()和ncvTest()为例，两种方法的检测结果均显示残差满足方差等同（P均＞0.05）。

library(lmtest);library(zoo)

bptest(lmfit)

studentized Breusch-Pagan test

data: lmfit

BP = 12.718, df = 9, p-value = 0.1758

library(car);library(carData)

ncvTest(lmfit)

Non-constant Variance Score Test

Variance formula: ~ fitted.values

Chisquare = 0.0005196456, Df = 1, p = 0.98181

olsrr 包也提供了4种异方差检验Bartlett Test、Breusch Pagan Test、Score Test、F Test，感兴趣可以自己去练习一下。

最后介绍一个集线性（Linearity）、同方差性（Homoscedasticity）、不相关（即独立，Uncorrelatedness）、正态性（Normality）的检验为一体的综合假设检验（Global Validation of Linear Models Assumptions）：gvlma{gvlma}，该函数还单独给出了偏度、峰度和异方差检验结果。

library(gvlma)

gvlma(lmfit)

结果显示前提条件并没有完全满足，比如异方差检验未通过。这与前面单独检测结果有差异。