在之前的一篇文章中我们介绍了如何在R语言中实现基础的线性回归(文章链接:R语言线性回归模型基础),今天我们来介绍如何在R中实现分层线性模型。(≧∇≦)/
加载包 library(tidyverse) library(lme4) library(lmerTest) library(modelr) library(broom) library(broom.mixed) library(ggsci) library(gt) 让我们从一个有意思的案例开始 不同院系教职员的收入:一般情况下,不同的院系,制定教师收入的依据和标准可能是不同的。我们假定有一份大学教职员的收入清单,这个学校包括信息学院、外国语学院、社会政治学、生物学院、统计学院共五个机构,我们通过数据建模,探索这个学校的薪酬制定规则。 读入数据 (关注公众号并回复关键字:分层线性模型,即可获得数据(・ω< )★) df <- read_csv('df.csv') df$department <- factor(df$department) head(df)%>% gt() %>% fmt_auto() ids | department | bases | experience | raises | salary | 1 | sociology | 40,952.274 | 3 | 2,064.726 | 47,146.452 | 2 | biology | 54,169.215 | 4 | 463.968 | 56,025.086 | 3 | english | 61,000.425 | 5 | 544.438 | 63,722.617 | 4 | informatics | 71,476.101 | 8 | 1,610.774 | 84,362.291 | 5 | statistics | 85,205.298 | 0 | 535.088 | 85,205.298 | 6 | sociology | 40,879.63 | 9 | 1,865.304 | 57,667.368 |
线性模型 薪酬制定规则一:假定教师收入主要取决于他从事工作的时间,也就说说工作时间越长收入越高。意味着,每个院系的起始薪酬(起薪)是一样的,并有相同的年度增长率。那么,这个收入问题就是一个简单线性模型:y=α+β1x1+…+βnxn具体到我们的案例中,薪酬模型可以写为salary=α+β∗experience通过这个等式,可以计算出各个系数,即截距α就是起薪,斜率β就是年度增长率。确定了斜率和截距,也就确定了每个教职员工的收入曲线。 m1 <- lm(salary ~ experience, data = df) broom::tidy(m1)%>% gt() %>% fmt_auto() term | estimate | std.error | statistic | p.value | (Intercept) | 59,886.79 | 2,587.765 | 23.142 | 1.672 × 10^-41 | experience | 1,193.294 | 513.159 | 2.325 | 0.022 |
使用线性模型进行预测 df %>% add_predictions(m1)%>% head() %>% gt() %>% fmt_auto() ids | department | bases | experience | raises | salary | pred | 1 | sociology | 40,952.274 | 3 | 2,064.726 | 47,146.452 | 63,466.672 | 2 | biology | 54,169.215 | 4 | 463.968 | 56,025.086 | 64,659.966 | 3 | english | 61,000.425 | 5 | 544.438 | 63,722.617 | 65,853.26 | 4 | informatics | 71,476.101 | 8 | 1,610.774 | 84,362.291 | 69,433.141 | 5 | statistics | 85,205.298 | 0 | 535.088 | 85,205.298 | 59,886.79 | 6 | sociology | 40,879.63 | 9 | 1,865.304 | 57,667.368 | 70,626.435 |
将预测值和真实值可视化后进行比较 df %>% add_predictions(m1) %>% ggplot(aes(x = experience, y = salary)) + geom_point() + geom_line(aes(x = experience, y = pred)) + labs(x = 'Experience', y = 'Predicted Salary') + ggtitle('linear model Salary Prediction') + scale_colour_discrete('Deparment') 我们可以看出对每个教师来说,不管来自哪个学院的,系数α和β是一样的,是固定的,因此这种简单线性模型也称之为固定效应模型。但事实上从图中可以看出,这种线性模型的拟合方法太过于粗犷,构建的线性直线不能直接反映收入随院系的变化。 简单截距模型 既然线性模型行不通,那我们换一种模型进行数据的拟合,首先这一数据和普通用来进行线性回归的数据最大不同在于它包含了更高层次的变量department,即不属于个体教师层面的高层次变量学院,我们可以先试着不考虑薪酬指定的规则用每个学院的平均工资来预测个体的薪酬。 fit1 <- lmer(salary ~ 1 + (1 | department), data = df) 可以使用ranef()函数来获得模型中的第二层残差 ranef(fit1) ## $department ## (Intercept) ## biology -12923.9670 ## english -903.6319 ## informatics 13248.3654 ## sociology -16011.9560 ## statistics 16591.1895 ## ## with conditional variances for 'department' 残差在一定程度上说明了第二层信息的变异性,这就引出了分层线性模型中的一个重要指标:Intra-class Correlation Coefficient,组内相关系数法,ICC,它测量了因变量的方差中由二层变量所解释的部分,能够为是否有必要采用分层线性模型提供正式的参考依据。 ICC可以从上面这个模型的汇总结果中计算出来 summary(fit1) ICC为 220212154/(220212154 + 29362674) ## [1] 0.8823492 ICC的值在[0.01, 0.059]之间是低关联强度,不必使用多层次模型;在[0.059,0.138]之间是中关联程度,应该使用多层次模型;而大于等于0.138就强烈推荐多层次模型,我们计算出的ICC值为约为0.882非常适合使用多层次模型。 变化的截距 薪酬制定规则二,假定不同的院系起薪不同,但年度增长率是相同的。这种统计模型,相比于之前的固定效应模型(简单线性模型)而言,加入了截距会随所在学院不同而变化的思想,统计模型写为(其中教师i,他所在的学院j,记为j[i],那么教师i所在学院j对应的α,很自然的记为) 这个等式中,斜率β代表着年度增长率,是一个固定值,也就前面说的固定效应项,而截距α代表着起薪,随学院变化,是五个值,因为一个学院对应一个α,所以称α为变化效应项(也叫随机效应项)。这里模型中既有固定效应项又有变化效应项,因此称之为混合效应模型。 # 有随机效应项的模型 m2 <- lmer(formula = salary ~ experience + (1 | department), #括号里所表示的内容是对截距项根据department变量进行分层 #括号前的是固定效应的部分,括号内的是变化效应的部分 data = df) summary(m2) 在结果中Fixed Effects类似于之前的线性模型,截距项和系数项的值都被列了出来,Random effects随机效应,在这一部分中,首先看到Groups意思是怎么分层,而Name则表示对回归中的哪一部分进行分层,然后结果中也给出了标准差和残差。 提取模型中不同层所对应的不同截距 coef(m2) ## $department ## (Intercept) experience ## biology 47098.82 1204.4 ## english 58430.10 1204.4 ## informatics 72918.02 1204.4 ## sociology 43883.02 1204.4 ## statistics 76869.10 1204.4 ## ## attr(,'class') ## [1] 'coef.mer' 使用更好地方式查看模型中的固定效应和随机效应 broom.mixed::tidy(m2, effects = 'fixed')%>% gt() %>% fmt_auto()#固定效应 effect | term | estimate | std.error | statistic | df | p.value | fixed | (Intercept) | 59,839.813 | 6,693.018 | 8.941 | 4.079 | 7.909 × 10^-4 | fixed | experience | 1,204.4 | 155.918 | 7.725 | 94.007 | 1.202 × 10^-11 |
broom.mixed::tidy(m2, effect = 'ran_vals')%>% gt() %>% fmt_auto()#随机效应 effect | group | level | term | estimate | std.error | ran_vals | department | biology | (Intercept) | -12,740.989 | 950.732 | ran_vals | department | english | (Intercept) | -1,409.712 | 950.732 | ran_vals | department | informatics | (Intercept) | 13,078.209 | 950.732 | ran_vals | department | sociology | (Intercept) | -15,956.793 | 950.732 | ran_vals | department | statistics | (Intercept) | 17,029.286 | 950.732 |
结果中estimate表示分层产生的截距上的差异或调整,比如-12740.989表示要在原来固定效应的基础上(59839.81)减去12740.989,得到分层模型的截距47098.82。 进行可视化 df %>% add_predictions(m2) %>% ggplot(aes( x = experience, y = salary, group = department, color = department)) + geom_point() + geom_line(aes(x = experience, y = pred)) + labs(x = 'Experience', y = 'Predicted Salary') + ggtitle('Varying Intercept Salary Prediction') + scale_colour_discrete('Deparment') + scale_color_npg()+ theme_bw() ## Scale for colour is already present. ## Adding another scale for colour, which will replace the existing scale. 变化的斜率 同样我们还可以假设薪酬制定规则三,不同的院系起始薪酬是相同的,但年度增长率不同。与薪酬模型规则二的统计模型比较,我们只需要把变化的截距变成变化的斜率,那么统计模型可写为 这里,截距α对所有教师而言是固定不变的,而斜率β会随学院不同而变化,5个学院对应着5个斜率。这里,截距α对所有教师而言是固定不变的,而斜率β会随学院不同而变化,5个学院对应着5个斜率。 m3 <- lmer(salary ~ experience + (0 + experience | department), #这里的0是一定要加的,表示不要截距只留下变量,如果不加程序会默认有截距 data = df) summary(m3) 从结果中我们可以看到只有的变化的斜率,固定效应模型的斜率并不显著,所以这样的拟合结果并不理想 接下来我们根据上面的步骤把系数、固定效应和随机效应提取出来,并进行可视化 coef(m3) ## $department ## (Intercept) experience ## biology 61067.39 -1456.150 ## english 61067.39 628.862 ## informatics 61067.39 3275.784 ## sociology 61067.39 -1324.683 ## statistics 61067.39 3433.939 ## ## attr(,'class') ## [1] 'coef.mer' broom.mixed::tidy(m3, effects = 'fixed')%>% gt() %>% fmt_auto() effect | term | estimate | std.error | statistic | df | p.value | fixed | (Intercept) | 61,067.388 | 1,693.218 | 36.066 | 94.067 | 7.063 × 10^-57 | fixed | experience | 911.55 | 1,130.785 | 0.806 | 4.543 | 0.46 |
broom.mixed::tidy(m3, effects = 'ran_vals')%>% gt() %>% fmt_auto() effect | group | level | term | estimate | std.error | ran_vals | department | biology | experience | -2,367.7 | 438.874 | ran_vals | department | english | experience | -282.688 | 367.959 | ran_vals | department | informatics | experience | 2,364.234 | 387.536 | ran_vals | department | sociology | experience | -2,236.234 | 405.747 | ran_vals | department | statistics | experience | 2,522.389 | 412.692 |
df %>% add_predictions(m3) %>% ggplot(aes( x = experience, y = salary, group = department, color = department)) + geom_point() + geom_line(aes(x = experience, y = pred)) + labs(x = 'Experience', y = 'Predicted Salary') + ggtitle('Varying slope Salary Prediction') + scale_colour_discrete('Deparment') + scale_color_nejm() + theme_minimal() ## Scale for colour is already present. ## Adding another scale for colour, which will replace the existing scale. 注意到在可视化图形中,分层的回归线是具有同一个起点的,这就代表了固定的截距,但其实正如一开始模型显示的那样,固定效应的斜率系数并不显著,有一些学院随着工作时间的增加工资居然越来越低,可见这样的模型拟合的效果并不理想。 变化的斜率 + 变化的截距 我们还可以设想薪酬制定规则四:不同的学院起始薪酬和年度增长率也不同。这可能是最现实的一种情形了,它实际上是规则二和规则三的一种组合,要求截距和斜率都会随学院的不同变化,数学上记为 具体来说就是教师i,所在的学院j,他的入职的起始收入表示为αji,年度增长率表示为βji. m4 <- lmer(salary ~ experience + (1 + experience | department), data = df) summary(m4) 我们注意到在结果中Random effects中处理方差标准差之外多了一个相关系数,这是截距和斜率之间的相关系数。我们可以自己验算一下这一相关系数值 ran_vals <- broom.mixed::tidy(m4, effects = 'ran_vals') intercept <- ran_vals %>% filter(term == '(Intercept)') %>% pull(estimate) beta <- ran_vals %>% filter(term == 'experience') %>% pull(estimate) cor(intercept, beta) ## [1] -0.5607434 相关系数有所差距是因为算法上有差异,但大体上并没有差异 同样,我们根据上面的步骤把系数、固定效应和随机效应提取出来,并进行可视化 coef(m4) ## $department ## (Intercept) experience ## biology 47420.04 1125.2275 ## english 60844.84 688.3418 ## informatics 70410.61 1771.2658 ## sociology 40243.54 2074.4345 ## statistics 79664.87 494.3011 ## ## attr(,'class') ## [1] 'coef.mer' broom.mixed::tidy(m4, effects = 'fixed')%>% gt() %>% fmt_auto() effect | term | estimate | std.error | statistic | df | p.value | fixed | (Intercept) | 59,716.782 | 7,289.686 | 8.192 | 4.006 | 0.001 | fixed | experience | 1,230.714 | 359.762 | 3.421 | 4.115 | 0.026 |
broom.mixed::tidy(m4, effects = 'ran_vals')%>% gt() %>% fmt_auto() effect | group | level | term | estimate | std.error | ran_vals | department | biology | (Intercept) | -12,296.737 | 1,631.275 | ran_vals | department | english | (Intercept) | 1,128.056 | 1,510.276 | ran_vals | department | informatics | (Intercept) | 10,693.833 | 1,494.81 | ran_vals | department | sociology | (Intercept) | -19,473.241 | 1,453.691 | ran_vals | department | statistics | (Intercept) | 19,948.09 | 1,345.757 | ran_vals | department | biology | experience | -105.487 | 341.316 | ran_vals | department | english | experience | -542.372 | 266.954 | ran_vals | department | informatics | experience | 540.552 | 277.73 | ran_vals | department | sociology | experience | 843.72 | 282.241 | ran_vals | department | statistics | experience | -736.413 | 265.557 |
df %>% add_predictions(m4) %>% ggplot(aes( x = experience, y = salary, group = department, colour = department )) + geom_point() + geom_line(aes(x = experience, y = pred)) + labs(x = 'Experience', y = 'Predicted Salary') + ggtitle('Varying Intercept and Slopes Salary Prediction') + scale_colour_discrete('Department') + scale_color_startrek() + theme_light() ## Scale for colour is already present. ## Adding another scale for colour, which will replace the existing scale. 从图中可以看到,在这种情况下,我们计算出了五种不同斜率和不同截距的直线,那么这就引发出了一个问题:在薪酬制定规则四中,不同的院系起薪不同,年度增长率也不同,我们得出了5组不同的截距和斜率,那么是不是可以等价为,先按照院系分5组,然后各算各的截距和斜率?比如: df %>% group_by(department) %>% group_modify( ~ broom::tidy(lm(salary ~ 1 + experience, data = .)) ) %>% gt() %>% fmt_auto() term | estimate | std.error | statistic | p.value | biology | (Intercept) | 47,852.456 | 1,272.709 | 37.599 | 1.464 × 10^-18 | experience | 1,005.392 | 277.398 | 3.624 | 0.002 | english | (Intercept) | 61,293.411 | 1,666.562 | 36.778 | 2.168 × 10^-18 | experience | 587.388 | 303.011 | 1.939 | 0.068 | informatics | (Intercept) | 69,792.443 | 1,711.172 | 40.786 | 3.440 × 10^-19 | experience | 1,926.971 | 328.102 | 5.873 | 1.463 × 10^-5 | sociology | (Intercept) | 39,747.643 | 1,114.122 | 35.676 | 3.722 × 10^-18 | experience | 2,185.078 | 223.947 | 9.757 | 1.302 × 10^-8 | statistics | (Intercept) | 79,989.567 | 1,854.726 | 43.127 | 1.272 × 10^-19 | experience | 422.22 | 379.386 | 1.113 | 0.28 |
从数据中可以看出分组各自回归,与这里的(变化的截距+变化的斜率)模型,不是一回事。 信息共享 分层线性模型各层之间的信息是部分共享的,而分成5组各自回归那么各组之间的信息是完全不共享的,而向本文开始但建立一个简单的线性回归模型那么各层之间的信息就是完全共享的。其实这很好理解,在本例中,一个学院的薪资水平不可能不受到其他学院的影响,但也不可能完全和别的学院一模一样,这也是不同层次变量信息所引起的差异,所以分层线性模型存在的必要就是解决这一问题。 完全共享 complete_pooling <- broom::tidy(m1) %>% dplyr::select(term, estimate) %>% tidyr::pivot_wider( names_from = term, values_from = estimate ) %>% dplyr::rename(Intercept = `(Intercept)`, slope = experience) %>% dplyr::mutate(pooled = 'complete_pool') %>% dplyr::select(pooled, Intercept, slope)
complete_pooling %>% gt() %>% fmt_auto() pooled | Intercept | slope | complete_pool | 59,886.79 | 1,193.294 |
部分共享 fix_effect <- broom.mixed::tidy(m4, effects = 'fixed') var_effect <- broom.mixed::tidy(m4, effects = 'ran_vals') #随机效应加上固定效应的参数 partial_pooling <- var_effect %>% dplyr::select(level, term, estimate) %>% tidyr::pivot_wider( names_from = term, values_from = estimate ) %>% dplyr::rename(Intercept = `(Intercept)`, estimate = experience) %>% dplyr::mutate( Intercept = Intercept + fix_effect$estimate[1], estimate = estimate + fix_effect$estimate[2] ) %>% dplyr::mutate(pool = 'partial_pool') %>% dplyr::select(pool, level, Intercept, estimate) partial_pooling <- coef(m4)$department %>% tibble::rownames_to_column() %>% dplyr::rename(level = rowname, Intercept = `(Intercept)`, slope = experience) %>% dplyr::mutate(pooled = 'partial_pool') %>% dplyr::select(pooled, level, Intercept, slope)
partial_pooling %>% gt() %>% fmt_auto() pooled | level | Intercept | slope | partial_pool | biology | 47,420.045 | 1,125.227 | partial_pool | english | 60,844.837 | 688.342 | partial_pool | informatics | 70,410.614 | 1,771.266 | partial_pool | sociology | 40,243.541 | 2,074.434 | partial_pool | statistics | 79,664.871 | 494.301 |
不共享 no_pool <- df %>% dplyr::group_by(department) %>% dplyr::group_modify( ~ broom::tidy(lm(salary ~ 1 + experience, data = .)) ) un_pooling <- no_pool %>% dplyr::select(department, term, estimate) %>% tidyr::pivot_wider( names_from = term, values_from = estimate ) %>% dplyr::rename(Intercept = `(Intercept)`, slope = experience) %>% dplyr::mutate(pooled = 'no_pool') %>% dplyr::select(pooled, level = department, Intercept, slope)
un_pooling %>% gt() %>% fmt_auto() pooled | Intercept | slope | biology | no_pool | 47,852.456 | 1,005.392 | english | no_pool | 61,293.411 | 587.388 | informatics | no_pool | 69,792.443 | 1,926.971 | sociology | no_pool | 39,747.643 | 2,185.078 | statistics | no_pool | 79,989.567 | 422.22 |
可视化 library(ggrepel)
un_pooling %>% dplyr::bind_rows(partial_pooling) %>% ggplot(aes(x = Intercept, y = slope)) + purrr::map( c(seq(from = 0.1, to = 0.9, by = 0.1)), .f = function(level) { stat_ellipse( geom = 'polygon', type = 'norm', size = 0, alpha = 1 / 10, fill = 'gray10', level = level ) } ) + geom_point(aes(group = pooled, color = pooled)) + geom_line(aes(group = level), size = 1 / 4) + # geom_point(data = complete_pooling, size = 4, color = 'red') + geom_text_repel( data = . %>% filter(pooled == 'no_pool'), aes(label = level) ) + scale_color_manual( name = 'information pool', values = c( 'no_pool' = 'black', 'partial_pool' = 'red' # , # 'complete_pool' = '#A65141' ), labels = c( 'no_pool' = 'no share', 'partial_pool' = 'partial share' # , # 'complete_pool' = 'complete share' ) ) + theme_classic() ## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0. ## ℹ Please use `linewidth` instead. ## This warning is displayed once every 8 hours. ## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was ## generated. 从图中可以看出,处于中间位置的灰色深的区域是信息完全共享,黑色的点是信息完全不共享,而红色的点是信息部分共享,可见信息部分共享向内收缩,聚合了信息但又不完全聚合成一个点,保持独立性的同时考虑到了特殊性。
|