1. 二分类logistic回归分析 概念解释「logistic回归介绍:」
Logistic回归适用于二分类变量(0和1)。模型假设Y服从二项分布,线性模型的拟合形式为:
「什么是优势比(odds radio)」
其中
为Y=1时的优势比。
「什么是对数优势比」
其中
为对数优势比,也称logit。
「什么是连接函数」
这里,
为连接函数,概率分布为二项分布。
2 R语言如何拟合logistic模型 「R语言中如何拟合二分类logistic回归?」
mod = glm(Y ~ X1 + X2 + X3, family = binomial(link = "logit" ), data = mydata)
上面公式中:
family = binomial(link = "logit")
为Y分类和连接函数3 二分类logistic何时使用 「什么时候应用二分类logistic回归分析?」
❝ 当通过一系列连续型或类别变量来预测二分类变量时,Logistic回归是一个非常有用的工具。
❞ 「说人话解释:」
比如根据年龄,性别,学历,收入,工作类型,判断批给你信用卡是否违约,构建一个模型。另外有人申请信用卡时,根据他的年龄,性别,学历,收入,工作类型,预测他违约的可能性,然后判断是否批他信用卡。 生存或者死亡(to be or not to be) 4 概率和log(odds)
的关系 「概率和对数优势比的关系」
进而可以推断出(后面有用到这个公式):
# 作图 p = seq(0,1,0.01) odds = p/(1-p) plot(log (odds), p, type = "l" , col = "blue" ,ylab = "Probability" ,las = 1) abline(h=.5,lty="dashed" ) abline(v=0,lty="dashed" )
由图可知,概率P的最小值为0,最大值为1,中间值为0.5, 它对应的对数优势比分别是无穷小,无穷大和0.即:
P=0, log(odds) = -Inf(负无穷大) P = 1, log(odds) = Inf(正无穷大) 因此,Logistic回归常常用于估计给定暴露水平时结果事件发生的概率。例如,我们可以用它来预测在给定年龄、性别和行为方式等情形下某人患病的概率。
5 数据和代码演练 5.1 数据介绍数据来源于MASS
包中的birthwt
数据,
The birthwt data frame has 189 rows and 10 columns. The data were collected at Baystate Medical Center, Springfield, Mass during 1986. Usage birthwt Format This data frame contains the following columns: low indicator of birth weight less than 2.5 kg. age mother's age in years. lwt mother' s weight in pounds at last menstrual period. race mother's race (1 = white, 2 = black, 3 = other). smoke smoking status during pregnancy. ptl number of previous premature labours. ht history of hypertension. ui presence of uterine irritability. ftv number of physician visits during the first trimester. bwt birth weight in grams.
5.2 将数据转化为分析所用的数据我们将数据相关的列,转化为因子:
library(MASS) data("birthwt" ) library(dplyr) birthweight = birthwt %>% mutate(race = factor(race,labels = c("white" ,"black" ,"other" )), smoke = factor(smoke,labels = c("no" ,"yes" )), ptl = ifelse(ptl >0,"1+" ,ptl), ptl = factor(ptl), ht = factor(ht,labels = c("no" ,"yes" )), ui = factor(ui,labels = c("no" ,"yes" )), ftv = ifelse(ftv >1,"2+" ,ftv), ftv = factor(ftv)) head(birthweight) str(birthweight)
「数据结构:」
> str(birthweight)'data.frame' : 189 obs. of 10 variables: $ low : int 0 0 0 0 0 0 0 0 0 0 ... $ age : int 19 33 20 21 18 21 22 17 29 26 ... $ lwt : int 182 155 105 108 107 124 118 103 123 113 ... $ race : Factor w/ 3 levels "white" ,"black" ,..: 2 3 1 1 1 3 1 3 1 1 ... $ smoke: Factor w/ 2 levels "no" ,"yes" : 1 1 2 2 2 1 1 1 2 2 ... $ ptl : Factor w/ 2 levels "0" ,"1+" : 1 1 1 1 1 1 1 1 1 1 ... $ ht : Factor w/ 2 levels "no" ,"yes" : 1 1 1 1 1 1 1 1 1 1 ... $ ui : Factor w/ 2 levels "no" ,"yes" : 2 1 1 2 2 1 1 1 1 1 ... $ ftv : Factor w/ 3 levels "0" ,"1" ,"2+" : 1 3 2 3 1 1 2 2 2 1 ... $ bwt : int 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
5.3 明确分析目的「数据分析目的:」
❝ low为二分类变量,表示婴儿是否体重偏轻,我们考虑几个因素:年龄,母亲体重,肤色,是否有早产等因素。其中年龄,母亲体重为连续变量,肤色,是否早产为二类变量。数据分析的目的是,用年龄,体重,肤色,是否早产等因素,去预测胎儿的体重是否偏轻。
❞ 5.4 用R语言glm
建模「使用R语言glm
模型拟合模型」
m1 = glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv, # 1 family = binomial("logit" ), # 2 data = birthweight) #3 summary(m1) #4
上面代码中,共有4行:
第二行,定义y变量的分布(二分类)和连接函数(logit) 结果如下:
> summary(m1) Call: glm(formula = low ~ age + lwt + race + smoke + ptl + ht + ui + ftv, family = binomial("logit" ), data = birthweight) Deviance Residuals: Min 1Q Median 3Q Max -1.7038 -0.8068 -0.5008 0.8835 2.2152 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 0.82302 1.24471 0.661 0.50848 age -0.03723 0.03870 -0.962 0.33602 lwt -0.01565 0.00708 -2.211 0.02705 * raceblack 1.19241 0.53597 2.225 0.02609 * raceother 0.74069 0.46174 1.604 0.10869 smokeyes 0.75553 0.42502 1.778 0.07546 . ptl1+ 1.34376 0.48062 2.796 0.00518 ** htyes 1.91317 0.72074 2.654 0.00794 ** uiyes 0.68019 0.46434 1.465 0.14296 ftv1 -0.43638 0.47939 -0.910 0.36268 ftv2+ 0.17901 0.45638 0.392 0.69488 --- Signif. codes: 0 '***’ 0.001 '**’ 0.01 '*’ 0.05 '.’ 0.1 ' ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 234.67 on 188 degrees of freedom Residual deviance: 195.48 on 178 degrees of freedom AIC: 217.48 Number of Fisher Scoring iterations: 4
5.5 使用step
进行逐步回归「可以用drop1
一个个查看,先删掉哪个变量」
> drop1(m1) Single term deletions Model: low ~ age + lwt + race + smoke + ptl + ht + ui + ftv Df Deviance AIC <none> 195.48 217.48 age 1 196.42 216.42 lwt 1 200.95 220.95 race 2 201.23 219.23 smoke 1 198.67 218.67 ptl 1 203.58 223.58 ht 1 202.93 222.93 ui 1 197.59 217.59 ftv 2 196.83 214.83
可以看到,不去掉时,AIC为195.48,去掉flv
只有AIC为196.83
。
「更简单的方法,使用step
进行逐步回归」
> glm2 = step(m1) Start: AIC=217.48 low ~ age + lwt + race + smoke + ptl + ht + ui + ftv Df Deviance AIC - ftv 2 196.83 214.83 - age 1 196.42 216.42 <none> 195.48 217.48 - ui 1 197.59 217.59 - smoke 1 198.67 218.67 - race 2 201.23 219.23 - lwt 1 200.95 220.95 - ht 1 202.93 222.93 - ptl 1 203.58 223.58 Step: AIC=214.83 low ~ age + lwt + race + smoke + ptl + ht + ui Df Deviance AIC - age 1 197.85 213.85 <none> 196.83 214.83 - ui 1 199.15 215.15 - race 2 203.24 217.24 - smoke 1 201.25 217.25 - lwt 1 201.83 217.83 - ptl 1 203.95 219.95 - ht 1 204.01 220.01 Step: AIC=213.85 low ~ lwt + race + smoke + ptl + ht + ui Df Deviance AIC <none> 197.85 213.85 - ui 1 200.48 214.48 - smoke 1 202.57 216.57 - race 2 205.47 217.47 - lwt 1 203.82 217.82 - ptl 1 204.22 218.22 - ht 1 205.16 219.16
可以看到,模型low ~ lwt + race + smoke + ptl + ht + ui
的AIC最小,为最终模型。
「模型结果:」
> summary(glm2) Call: glm(formula = low ~ lwt + race + smoke + ptl + ht + ui, family = binomial("logit" ), data = birthweight) Deviance Residuals: Min 1Q Median 3Q Max -1.7308 -0.7841 -0.5144 0.9539 2.1980 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -0.125326 0.967561 -0.130 0.89694 lwt -0.015918 0.006954 -2.289 0.02207 * raceblack 1.300856 0.528484 2.461 0.01384 * raceother 0.854414 0.440907 1.938 0.05264 . smokeyes 0.866582 0.404469 2.143 0.03215 * ptl1+ 1.128857 0.450388 2.506 0.01220 * htyes 1.866895 0.707373 2.639 0.00831 ** uiyes 0.750649 0.458815 1.636 0.10183 --- Signif. codes: 0 '***’ 0.001 '**’ 0.01 '*’ 0.05 '.’ 0.1 ' ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 234.67 on 188 degrees of freedom Residual deviance: 197.85 on 181 degrees of freedom AIC: 213.85 Number of Fisher Scoring iterations: 4
「更友好的结果显示」
> library(epiDisplay) > logistic.display(glm2) Logistic regression predicting low crude OR(95%CI) adj. OR(95%CI) P(Wald's test) P(LR-test) lwt (cont. var.) 0.99 (0.97,1) 0.98 (0.97,1) 0.022 0.015 race: ref.=white 0.022 black 2.33 (0.94,5.77) 3.67 (1.3,10.35) 0.014 other 1.89 (0.96,3.74) 2.35 (0.99,5.58) 0.053 smoke: yes vs no 2.02 (1.08,3.78) 2.38 (1.08,5.26) 0.032 0.03 ptl: 1+ vs 0 4.32 (1.92,9.73) 3.09 (1.28,7.48) 0.012 0.012 ht: yes vs no 3.37 (1.02,11.09) 6.47 (1.62,25.88) 0.008 0.007 ui: yes vs no 2.58 (1.14,5.83) 2.12 (0.86,5.21) 0.102 0.105 Log-likelihood = -98.9258 No. of observations = 189 AIC value = 213.8516
可以看出,结果没有直接给出回归系数,而是给出了优势比(odds radio)和P值(wald检验和LRT检验)
「手动计算优势比:」
> exp(coef(glm2)) (Intercept) lwt raceblack raceother smokeyes ptl1+ htyes uiyes 0.8822092 0.9842076 3.6724379 2.3499973 2.3787659 3.0921190 6.4681832 2.1183740
可以看出,结果和上面的adj.OR(95%)
列的结果一致。
6 通过模型预测 ❝ 哲学家不是为了解释世界,重点是要改变世界……马克思
❞ 预测才是最重要的。
上面的模型构建好了,现在又来了一个人,我们收集了她的体重,肤色等信息,是否能够预测胎儿出生体重偏轻的概率是多少?
「数据如下:」
数据结构要和构建模型的数据结构一致。
newdata = data.frame(lwt = 120, race = "black" ,smoke = "yes" ,ptl = "0" ,ht = "yes" ,ui = "no" ) newdata birthweight %>% head
> newdata lwt race smoke ptl ht ui 1 120 black yes 0 yes no > birthweight %>% head low age lwt race smoke ptl ht ui ftv bwt 1 0 19 182 black no 0 no yes 0 2523 2 0 33 155 other no 0 no no 2+ 2551 3 0 20 105 white yes 0 no no 1 2557 4 0 21 108 white yes 0 no yes 2+ 2594 5 0 18 107 white yes 0 no yes 0 2600 6 0 21 124 other no 0 no no 0 2622
「手动根据公式计算P值」
logit = predict(glm2,newdata = newdata) logit exp(logit)/(1+exp(logit)) # 计算概率值
公式:
> exp(logit)/(1+exp(logit)) # 计算概率值 1 0.88067
可以看到,上面的数据预测的概率为0.88
也可以定义predict
中的type = response
参数,直接计算P值:
> predict(glm2,newdata = newdata,type = "response" ) # 直接得到概率值 1 0.88067
7. asreml的解决方法 asreml中也可以分析广义线性模型,虽然它的优势在于混合线性模型,但是分析起广义线性混合模型
也非常方便。下面使用优化后的模型,比较一下结果:
as2 = asreml(low ~ lwt + race + smoke + ptl + ht + ui, family = asr_binomial(link="logit" ), data = birthweight) summary(as2)$varcomp wald(as2) # pvalue summary(as2,coef=T)$coef .fixed # effect
结果:
> wald(as2) # pvalue Wald tests for fixed effects. Response: low Terms added sequentially; adjusted for those above. Note: These Wald statistics are based on the working variable and are not equivalent to an Analysis of Deviance. Df Sum of Sq Wald statistic Pr(Chisq) (Intercept) 1 14.9106 14.9106 0.0001127 *** lwt 1 3.0123 3.0123 0.0826351 . race 2 3.2332 3.2332 0.1985701 smoke 1 6.4746 6.4746 0.0109429 * ptl 1 6.7581 6.7581 0.0093323 ** ht 1 6.1986 6.1986 0.0127854 * ui 1 2.6767 2.6767 0.1018287 residual (MS) 1.0000 --- Signif. codes: 0 '***’ 0.001 '**’ 0.01 '*’ 0.05 '.’ 0.1 ' ’ 1 > summary(as2,coef=T)$coef .fixed # effect solution std error z.ratio ui_no 0.00000000 NA NA ui_yes 0.75064880 0.458817140 1.6360522 ht_no 0.00000000 NA NA ht_yes 1.86689527 0.707378249 2.6391754 ptl_0 0.00000000 NA NA ptl_1+ 1.12885661 0.450389557 2.5064005 smoke_no 0.00000000 NA NA smoke_yes 0.86658183 0.404473690 2.1424924 race_white 0.00000000 NA NA race_black 1.30085571 0.528489036 2.4614621 race_other 0.85441418 0.440911907 1.9378342 lwt -0.01591847 0.006953881 -2.2891495 (Intercept) -0.12532604 0.967572468 -0.1295263
可以看出,Effect和Pvalue和R中的glm
结果是一致的。
8. 后面计划写