R语言分层线性模型

计量经济圈 2023-06-04 发布于浙江

展开全文

在之前的一篇文章中我们介绍了如何在R语言中实现基础的线性回归（文章链接：R语言线性回归模型基础），今天我们来介绍如何在R中实现分层线性模型。(≧∇≦)/

加载包

library(tidyverse)

library(lme4)

library(lmerTest)

library(modelr)
library(broom)

library(broom.mixed)

library(ggsci)
library(gt)

让我们从一个有意思的案例开始

不同院系教职员的收入:一般情况下，不同的院系，制定教师收入的依据和标准可能是不同的。我们假定有一份大学教职员的收入清单，这个学校包括信息学院、外国语学院、社会政治学、生物学院、统计学院共五个机构，我们通过数据建模，探索这个学校的薪酬制定规则。

读入数据

（关注公众号并回复关键字：分层线性模型，即可获得数据(・ω< )★）

df <- read_csv('df.csv')

df$department <- factor(df$department)
head(df)%>% gt() %>% fmt_auto()

ids	department	bases	experience	raises	salary
1	sociology	40,952.274	3	2,064.726	47,146.452
2	biology	54,169.215	4	463.968	56,025.086
3	english	61,000.425	5	544.438	63,722.617
4	informatics	71,476.101	8	1,610.774	84,362.291
5	statistics	85,205.298	0	535.088	85,205.298
6	sociology	40,879.63	9	1,865.304	57,667.368

线性模型

薪酬制定规则一：假定教师收入主要取决于他从事工作的时间，也就说说工作时间越长收入越高。意味着，每个院系的起始薪酬（起薪）是一样的，并有相同的年度增长率。那么，这个收入问题就是一个简单线性模型：y=α+β1x1+…+βnxn具体到我们的案例中，薪酬模型可以写为salary=α+β∗experience通过这个等式，可以计算出各个系数，即截距α就是起薪，斜率β就是年度增长率。确定了斜率和截距，也就确定了每个教职员工的收入曲线。

m1 <- lm(salary ~ experience, data = df)
broom::tidy(m1)%>% gt() %>% fmt_auto()

term	estimate	std.error	statistic	p.value
(Intercept)	59,886.79	2,587.765	23.142	1.672 × 10^-41
experience	1,193.294	513.159	2.325	0.022

使用线性模型进行预测

df %>% add_predictions(m1)%>% head() %>% gt() %>% fmt_auto()

ids	department	bases	experience	raises	salary	pred
1	sociology	40,952.274	3	2,064.726	47,146.452	63,466.672
2	biology	54,169.215	4	463.968	56,025.086	64,659.966
3	english	61,000.425	5	544.438	63,722.617	65,853.26
4	informatics	71,476.101	8	1,610.774	84,362.291	69,433.141
5	statistics	85,205.298	0	535.088	85,205.298	59,886.79
6	sociology	40,879.63	9	1,865.304	57,667.368	70,626.435

将预测值和真实值可视化后进行比较

df %>%
add_predictions(m1) %>%
ggplot(aes(x = experience, y = salary)) +
geom_point() +
geom_line(aes(x = experience, y = pred)) +
labs(x = 'Experience', y = 'Predicted Salary') +
ggtitle('linear model Salary Prediction') +
scale_colour_discrete('Deparment')

我们可以看出对每个教师来说，不管来自哪个学院的，系数α和β是一样的，是固定的，因此这种简单线性模型也称之为固定效应模型。但事实上从图中可以看出，这种线性模型的拟合方法太过于粗犷，构建的线性直线不能直接反映收入随院系的变化。

简单截距模型

既然线性模型行不通，那我们换一种模型进行数据的拟合，首先这一数据和普通用来进行线性回归的数据最大不同在于它包含了更高层次的变量department，即不属于个体教师层面的高层次变量学院，我们可以先试着不考虑薪酬指定的规则用每个学院的平均工资来预测个体的薪酬。

fit1 <- lmer(salary ~ 1 + (1 | department), data = df)

可以使用ranef()函数来获得模型中的第二层残差

ranef(fit1)

## $department
##             (Intercept)
## biology     -12923.9670
## english       -903.6319
## informatics 13248.3654
## sociology   -16011.9560
## statistics   16591.1895
##
## with conditional variances for 'department'

残差在一定程度上说明了第二层信息的变异性，这就引出了分层线性模型中的一个重要指标：Intra-class Correlation Coefficient，组内相关系数法，ICC，它测量了因变量的方差中由二层变量所解释的部分，能够为是否有必要采用分层线性模型提供正式的参考依据。

ICC可以从上面这个模型的汇总结果中计算出来

summary(fit1)

ICC为

220212154/(220212154 + 29362674)

## [1] 0.8823492

ICC的值在[0.01, 0.059]之间是低关联强度，不必使用多层次模型；在[0.059,0.138]之间是中关联程度，应该使用多层次模型；而大于等于0.138就强烈推荐多层次模型，我们计算出的ICC值为约为0.882非常适合使用多层次模型。

变化的截距

薪酬制定规则二，假定不同的院系起薪不同，但年度增长率是相同的。这种统计模型，相比于之前的固定效应模型（简单线性模型）而言，加入了截距会随所在学院不同而变化的思想，统计模型写为（其中教师i，他所在的学院j，记为j[i]，那么教师i所在学院j对应的α，很自然的记为）

这个等式中，斜率β代表着年度增长率，是一个固定值，也就前面说的固定效应项，而截距α代表着起薪，随学院变化，是五个值，因为一个学院对应一个α，所以称α为变化效应项（也叫随机效应项）。这里模型中既有固定效应项又有变化效应项，因此称之为混合效应模型。

# 有随机效应项的模型
m2 <- lmer(formula = salary ~ experience + (1 | department),
           #括号里所表示的内容是对截距项根据department变量进行分层
           #括号前的是固定效应的部分，括号内的是变化效应的部分
           data = df)
summary(m2)

在结果中Fixed Effects类似于之前的线性模型，截距项和系数项的值都被列了出来，Random effects随机效应，在这一部分中，首先看到Groups意思是怎么分层，而Name则表示对回归中的哪一部分进行分层，然后结果中也给出了标准差和残差。

提取模型中不同层所对应的不同截距

coef(m2)

## $department
##             (Intercept) experience
## biology        47098.82     1204.4
## english        58430.10     1204.4
## informatics    72918.02     1204.4
## sociology      43883.02     1204.4
## statistics     76869.10     1204.4
##
## attr(,'class')
## [1] 'coef.mer'

使用更好地方式查看模型中的固定效应和随机效应

broom.mixed::tidy(m2, effects = 'fixed')%>% gt() %>% fmt_auto()#固定效应

effect	term	estimate	std.error	statistic	df	p.value
fixed	(Intercept)	59,839.813	6,693.018	8.941	4.079	7.909 × 10^-4
fixed	experience	1,204.4	155.918	7.725	94.007	1.202 × 10^-11

broom.mixed::tidy(m2, effect = 'ran_vals')%>% gt() %>% fmt_auto()#随机效应

effect	group	level	term	estimate	std.error
ran_vals	department	biology	(Intercept)	-12,740.989	950.732
ran_vals	department	english	(Intercept)	-1,409.712	950.732
ran_vals	department	informatics	(Intercept)	13,078.209	950.732
ran_vals	department	sociology	(Intercept)	-15,956.793	950.732
ran_vals	department	statistics	(Intercept)	17,029.286	950.732

结果中estimate表示分层产生的截距上的差异或调整，比如-12740.989表示要在原来固定效应的基础上（59839.81）减去12740.989，得到分层模型的截距47098.82。

进行可视化

df %>%
add_predictions(m2) %>%
ggplot(aes(
x = experience, y = salary, group = department,
color = department)) +
geom_point() +
geom_line(aes(x = experience, y = pred)) +
labs(x = 'Experience', y = 'Predicted Salary') +
ggtitle('Varying Intercept Salary Prediction') +
scale_colour_discrete('Deparment') +
scale_color_npg()+
theme_bw()

## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.

变化的斜率

同样我们还可以假设薪酬制定规则三，不同的院系起始薪酬是相同的，但年度增长率不同。与薪酬模型规则二的统计模型比较，我们只需要把变化的截距变成变化的斜率，那么统计模型可写为

这里，截距α对所有教师而言是固定不变的，而斜率β会随学院不同而变化，5个学院对应着5个斜率。这里，截距α对所有教师而言是固定不变的，而斜率β会随学院不同而变化，5个学院对应着5个斜率。

m3 <- lmer(salary ~ experience + (0 + experience | department),
#这里的0是一定要加的，表示不要截距只留下变量，如果不加程序会默认有截距
data = df)
summary(m3)

从结果中我们可以看到只有的变化的斜率，固定效应模型的斜率并不显著，所以这样的拟合结果并不理想

接下来我们根据上面的步骤把系数、固定效应和随机效应提取出来，并进行可视化

coef(m3)

## $department
##             (Intercept) experience
## biology        61067.39 -1456.150
## english        61067.39    628.862
## informatics    61067.39   3275.784
## sociology      61067.39 -1324.683
## statistics     61067.39   3433.939
##
## attr(,'class')
## [1] 'coef.mer'

broom.mixed::tidy(m3, effects = 'fixed')%>% gt() %>% fmt_auto()

effect	term	estimate	std.error	statistic	df	p.value
fixed	(Intercept)	61,067.388	1,693.218	36.066	94.067	7.063 × 10^-57
fixed	experience	911.55	1,130.785	0.806	4.543	0.46

broom.mixed::tidy(m3, effects = 'ran_vals')%>% gt() %>% fmt_auto()

effect	group	level	term	estimate	std.error
ran_vals	department	biology	experience	-2,367.7	438.874
ran_vals	department	english	experience	-282.688	367.959
ran_vals	department	informatics	experience	2,364.234	387.536
ran_vals	department	sociology	experience	-2,236.234	405.747
ran_vals	department	statistics	experience	2,522.389	412.692

df %>%
add_predictions(m3) %>%
ggplot(aes(
x = experience, y = salary, group = department,
color = department)) +
geom_point() +
geom_line(aes(x = experience, y = pred)) +
labs(x = 'Experience', y = 'Predicted Salary') +
ggtitle('Varying slope Salary Prediction') +
scale_colour_discrete('Deparment') +
scale_color_nejm() +
theme_minimal()

## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.

注意到在可视化图形中，分层的回归线是具有同一个起点的，这就代表了固定的截距，但其实正如一开始模型显示的那样，固定效应的斜率系数并不显著，有一些学院随着工作时间的增加工资居然越来越低，可见这样的模型拟合的效果并不理想。

变化的斜率 + 变化的截距

我们还可以设想薪酬制定规则四：不同的学院起始薪酬和年度增长率也不同。这可能是最现实的一种情形了，它实际上是规则二和规则三的一种组合，要求截距和斜率都会随学院的不同变化，数学上记为

具体来说就是教师i，所在的学院j，他的入职的起始收入表示为αji，年度增长率表示为βji.

m4 <- lmer(salary ~ experience + (1 + experience | department),
data = df)
summary(m4)

我们注意到在结果中Random effects中处理方差标准差之外多了一个相关系数，这是截距和斜率之间的相关系数。我们可以自己验算一下这一相关系数值

ran_vals <- broom.mixed::tidy(m4, effects = 'ran_vals')
intercept <- ran_vals %>%
filter(term == '(Intercept)') %>%
pull(estimate)
beta <- ran_vals %>%
filter(term == 'experience') %>%
pull(estimate)
cor(intercept, beta)

## [1] -0.5607434

相关系数有所差距是因为算法上有差异，但大体上并没有差异

同样，我们根据上面的步骤把系数、固定效应和随机效应提取出来，并进行可视化

coef(m4)

## $department
##             (Intercept) experience
## biology        47420.04 1125.2275
## english        60844.84   688.3418
## informatics    70410.61 1771.2658
## sociology      40243.54 2074.4345
## statistics     79664.87   494.3011
##
## attr(,'class')
## [1] 'coef.mer'

broom.mixed::tidy(m4, effects = 'fixed')%>% gt() %>% fmt_auto()

effect	term	estimate	std.error	statistic	df	p.value
fixed	(Intercept)	59,716.782	7,289.686	8.192	4.006	0.001
fixed	experience	1,230.714	359.762	3.421	4.115	0.026

broom.mixed::tidy(m4, effects = 'ran_vals')%>% gt() %>% fmt_auto()

effect	group	level	term	estimate	std.error
ran_vals	department	biology	(Intercept)	-12,296.737	1,631.275
ran_vals	department	english	(Intercept)	1,128.056	1,510.276
ran_vals	department	informatics	(Intercept)	10,693.833	1,494.81
ran_vals	department	sociology	(Intercept)	-19,473.241	1,453.691
ran_vals	department	statistics	(Intercept)	19,948.09	1,345.757
ran_vals	department	biology	experience	-105.487	341.316
ran_vals	department	english	experience	-542.372	266.954
ran_vals	department	informatics	experience	540.552	277.73
ran_vals	department	sociology	experience	843.72	282.241
ran_vals	department	statistics	experience	-736.413	265.557

df %>%
add_predictions(m4) %>%
ggplot(aes(
x = experience, y = salary, group = department,
colour = department
)) +
geom_point() +
geom_line(aes(x = experience, y = pred)) +
labs(x = 'Experience', y = 'Predicted Salary') +
ggtitle('Varying Intercept and Slopes Salary Prediction') +
scale_colour_discrete('Department') +
scale_color_startrek() +
theme_light()

## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.

从图中可以看到，在这种情况下，我们计算出了五种不同斜率和不同截距的直线，那么这就引发出了一个问题：在薪酬制定规则四中，不同的院系起薪不同，年度增长率也不同，我们得出了5组不同的截距和斜率，那么是不是可以等价为，先按照院系分5组，然后各算各的截距和斜率？比如：

df %>%
group_by(department) %>%
group_modify(
~ broom::tidy(lm(salary ~ 1 + experience, data = .))
) %>%
gt() %>%
fmt_auto()

term	estimate	std.error	statistic	p.value
biology
(Intercept)	47,852.456	1,272.709	37.599	1.464 × 10^-18
experience	1,005.392	277.398	3.624	0.002
english
(Intercept)	61,293.411	1,666.562	36.778	2.168 × 10^-18
experience	587.388	303.011	1.939	0.068
informatics
(Intercept)	69,792.443	1,711.172	40.786	3.440 × 10^-19
experience	1,926.971	328.102	5.873	1.463 × 10^-5
sociology
(Intercept)	39,747.643	1,114.122	35.676	3.722 × 10^-18
experience	2,185.078	223.947	9.757	1.302 × 10^-8
statistics
(Intercept)	79,989.567	1,854.726	43.127	1.272 × 10^-19
experience	422.22	379.386	1.113	0.28

从数据中可以看出分组各自回归，与这里的（变化的截距+变化的斜率）模型，不是一回事。

信息共享

分层线性模型各层之间的信息是部分共享的，而分成5组各自回归那么各组之间的信息是完全不共享的，而向本文开始但建立一个简单的线性回归模型那么各层之间的信息就是完全共享的。其实这很好理解，在本例中，一个学院的薪资水平不可能不受到其他学院的影响，但也不可能完全和别的学院一模一样，这也是不同层次变量信息所引起的差异，所以分层线性模型存在的必要就是解决这一问题。

完全共享

complete_pooling <-
broom::tidy(m1) %>%
dplyr::select(term, estimate) %>%
tidyr::pivot_wider(
names_from = term,
values_from = estimate
) %>%
dplyr::rename(Intercept = `(Intercept)`, slope = experience) %>%
dplyr::mutate(pooled = 'complete_pool') %>%
dplyr::select(pooled, Intercept, slope)

complete_pooling %>% gt() %>% fmt_auto()

pooled	Intercept	slope
complete_pool	59,886.79	1,193.294

部分共享

fix_effect <- broom.mixed::tidy(m4, effects = 'fixed')
var_effect <- broom.mixed::tidy(m4, effects = 'ran_vals')
#随机效应加上固定效应的参数
partial_pooling <- var_effect %>%
dplyr::select(level, term, estimate) %>%
tidyr::pivot_wider(
    names_from = term,
    values_from = estimate
) %>%
dplyr::rename(Intercept = `(Intercept)`, estimate = experience) %>%
dplyr::mutate(
    Intercept = Intercept + fix_effect$estimate[1],
    estimate = estimate + fix_effect$estimate[2]
) %>%
dplyr::mutate(pool = 'partial_pool') %>%
dplyr::select(pool, level, Intercept, estimate)
partial_pooling <-
coef(m4)$department %>%
tibble::rownames_to_column() %>%
dplyr::rename(level = rowname, Intercept = `(Intercept)`, slope = experience) %>%
dplyr::mutate(pooled = 'partial_pool') %>%
dplyr::select(pooled, level, Intercept, slope)

partial_pooling %>% gt() %>% fmt_auto()

pooled	level	Intercept	slope
partial_pool	biology	47,420.045	1,125.227
partial_pool	english	60,844.837	688.342
partial_pool	informatics	70,410.614	1,771.266
partial_pool	sociology	40,243.541	2,074.434
partial_pool	statistics	79,664.871	494.301

不共享

no_pool <- df %>%
dplyr::group_by(department) %>%
dplyr::group_modify(
    ~ broom::tidy(lm(salary ~ 1 + experience, data = .))
)
un_pooling <- no_pool %>%
dplyr::select(department, term, estimate) %>%
tidyr::pivot_wider(
    names_from = term,
    values_from = estimate
) %>%
dplyr::rename(Intercept = `(Intercept)`, slope = experience) %>%
dplyr::mutate(pooled = 'no_pool') %>%
dplyr::select(pooled, level = department, Intercept, slope)

un_pooling %>% gt() %>% fmt_auto()

pooled	Intercept	slope
biology
no_pool	47,852.456	1,005.392
english
no_pool	61,293.411	587.388
informatics
no_pool	69,792.443	1,926.971
sociology
no_pool	39,747.643	2,185.078
statistics
no_pool	79,989.567	422.22

可视化

library(ggrepel)

un_pooling %>%
dplyr::bind_rows(partial_pooling) %>%
ggplot(aes(x = Intercept, y = slope)) +
purrr::map(
    c(seq(from = 0.1, to = 0.9, by = 0.1)),
    .f = function(level) {
      stat_ellipse(
        geom = 'polygon', type = 'norm',
        size = 0, alpha = 1 / 10, fill = 'gray10',
        level = level
      )
    }
) +
geom_point(aes(group = pooled, color = pooled)) +
geom_line(aes(group = level), size = 1 / 4) +
# geom_point(data = complete_pooling, size = 4, color = 'red') +
geom_text_repel(
    data = . %>% filter(pooled == 'no_pool'),
    aes(label = level)
) +
scale_color_manual(
    name = 'information pool',
    values = c(
      'no_pool' = 'black',
      'partial_pool' = 'red' # ,
      # 'complete_pool' = '#A65141'
    ),
    labels = c(
      'no_pool' = 'no share',
      'partial_pool' = 'partial share' # ,
      # 'complete_pool' = 'complete share'
    )
) +
theme_classic()

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.