DID Methods: How to Implement Cohort DID in Stata

我是张金康呀 2021-12-15


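Cohort DID is an ingenious identification strategy, often used to evaluate the long-term effects of particular historical events on individuals and households. Earlier posts in this series covered three classic cohort DID papers: "The Long-Term Health and Economic Consequences of the 1959-1961 Famine in China" (Yuyu Chen and Li-An Zhou, 2007, JHE); "Did Early-Life Famine Experience Affect People's Saving Behavior? A New Explanation for China's High Household Saving Rate" (程令国 and 张晔, 2011, 经济研究 [Economic Research Journal], in Chinese; title translated here); and "Arrival of Young Talent: The Send-Down Movement and Rural Education in China" (Yi Chen, Ziying Fan, Xiaomin Gu, and Li-An Zhou, 2020, AER). Readers should therefore already have some familiarity with cohort DID. Econometrics is best learned together with software, so this post walks through the Stata implementation of cohort DID.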

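Data source

Chen, Fan, Gu, and Zhou have published the data and code for "Arrival of Young Talent: The Send-Down Movement and Rural Education in China" on the AER website. This post relies mainly on part of the authors' data and code to explain the Stata implementation of cohort DID.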

Yi Chen, Ziying Fan, Xiaomin Gu, and Li-An Zhou. 2020. "Replication Data for: Arrival of Young Talent: The Send-Down Movement and Rural Education in China." American Economic Association [publisher], Inter-university Consortium for Political and Social Research [distributor]. https://doi.org/10.3886/E119690V1

Standard cohort DID estimation

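Birth-cohort variation: the individual's year of birth. Individuals born in 1956-1969 form the treatment group; those born in 1946-1955 form the control group. (Variable: treat, as it enters the regression below.)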

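Regional variation: the density of sent-down youths in each county, computed as the total number of sent-down youths the county received divided by the county's population in 1964. (Variable: sdy_density, as it enters the regression below.)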
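As a minimal sketch of the two definitions above, the variables could be built as follows if they were not already in the data. The input names sdy_total and pop1964 are hypothetical and do not come from the replication package; the _demo suffix is used so that nothing overwrites the treat and sdy_density variables used in the regressions.

```stata
* Hypothetical inputs (names are assumptions, not from the source):
*   sdy_total = sent-down youths received by the county
*   pop1964   = county population in 1964
gen double sdy_density_demo = sdy_total / pop1964

* Cohort indicator: 1 for 1956-1969 births, 0 for 1946-1955 births,
* missing for anyone born outside 1946-1969
gen byte treat_demo = inrange(year_birth, 1956, 1969) if inrange(year_birth, 1946, 1969)
```

Leaving the indicator missing outside 1946-1969 is a choice made for this sketch; it keeps other cohorts out of a regression that uses treat_demo.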

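It is standard practice to include fixed effects in a DID model: they capture the variation along both dimensions more precisely and help mitigate omitted-variable bias to some extent. For a cohort DID model this means region fixed effects and birth-cohort fixed effects, as in Chen and Zhou (2007, JHE). Chen, Fan, Gu, and Zhou (2020, AER) go further. Besides county fixed effects region1990 and province-by-cohort fixed effects prov#year_birth (the interaction of province and birth cohort, which absorbs anything specific to each birth cohort within each province and is therefore finer than plain cohort fixed effects), they also include the interactions of each county's baseline education conditions with birth cohort, c.primary_base#year_birth and c.junior_base#year_birth.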
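Written out, the specification described above looks roughly like the equation below. The notation is my own shorthand built from the Stata variable names, not the paper's: i indexes individuals, g birth cohorts, c counties and p provinces; \lambda_c stands for the county fixed effects, \mu_{p,g} for the province-by-cohort fixed effects, and \theta_g and \phi_g are the cohort-specific slopes on the two baseline education measures.

```latex
\begin{aligned}
\text{yedu}_{i,g,c,p} ={}& \beta_0
  + \beta_1\,(\text{sdy\_density}_c \times \text{treat}_g)
  + \gamma_1\,\text{male}_i + \gamma_2\,\text{han\_ethn}_i \\
&+ \lambda_c + \mu_{p,g}
  + \theta_g\,\text{primary\_base}_c
  + \phi_g\,\text{junior\_base}_c
  + \varepsilon_{i,g,c,p}
\end{aligned}
```

The coefficient of interest is \beta_1. The main effects of sdy_density and treat do not appear separately because the first is absorbed by \lambda_c and the second by \mu_{p,g}.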

. reghdfe yedu c.sdy_density#c.treat male han_ethn if rural==1, absorb(region1990 prov#year_birth c.primary_base#year_birth c.junior_base#year_birth) cluster(region1990)
(MWFE estimator converged in 14 iterations)

HDFE Linear regression                            Number of obs   =  2,775,858
Absorbing 4 HDFE groups                           F(   3,   1767) =    1462.40
Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                  R-squared       =     0.2934
                                                  Adj R-squared   =     0.2928
                                                  Within R-sq.    =     0.1018
Number of clusters (region1990) =      1,768      Root MSE        =     2.7831

                                  (Std. Err. adjusted for 1,768 clusters in region1990)
---------------------------------------------------------------------------------------
                      |               Robust
                 yedu |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------------------+----------------------------------------------------------------
c.sdy_density#c.treat |   3.237091   .7011438     4.62   0.000     1.861933     4.61225
                      |
                 male |   1.874483    .028378    66.05   0.000     1.818825     1.93014
             han_ethn |   .1500992   .0565311     2.66   0.008     .0392243    .2609742
                _cons |   5.436428   .0551559    98.56   0.000      5.32825    5.544606
---------------------------------------------------------------------------------------

Absorbed degrees of freedom:
-------------------------------------------------------------------+
               Absorbed FE | Categories  - Redundant  = Num. Coefs |
---------------------------+---------------------------------------|
                region1990 |      1768        1768           0    *|
           prov#year_birth |       624           0         624     |
 year_birth#c.primary_base |        24           0          24    ?|
  year_birth#c.junior_base |        24           0          24    ?|
-------------------------------------------------------------------+
? = number of redundant parameters may be higher
* = FE nested within cluster; treated as redundant for DoF computation

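Because the fixed effects are high-dimensional, reghdfe (a user-written command, installable with ssc install reghdfe) is the better choice. The coefficient on the interaction term is 3.237091, which matches the estimate reported in the paper. The other columns are estimated the same way; only the sample or the dependent variable changes.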

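Reduced-form cohort DID estimation

Birth-cohort variation: the individual's year of birth, with one cohort dummy per birth year. The years 1946 through 1969 span 24 cohorts, so we generate 24 cohort dummies (the 1941-1945 cohorts serve as the base group).

Regional variation: same as above.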

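* One interaction per birth cohort, 1946-1969; the 1941-1945 cohorts are the omitted base group
forvalues y = 1946/1969 {
    gen I`y' = sdy_density * (year_birth == `y')
}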

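Reduced-form cohort DID is in effect a dynamic DID model; the name follows Duflo (2001, AER), the founding paper of the cohort DID literature. We first generate the 24 interactions between the cohort dummies and the county's density of sent-down youths, as the loop above does, and then use them as the regressors of interest. Each interaction coefficient identifies the average causal effect of the send-down movement on the years of schooling of people born in that particular year. Taken together, the coefficients show the dynamic effect of the movement on rural residents' schooling, which helps us check the parallel-trends assumption of the DID model.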
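In equation form, the dynamic specification replaces the single interaction with one term per birth cohort. As before, the notation is my own shorthand rather than the paper's: i indexes individuals, g birth cohorts, c counties and p provinces, and \mathbf{1}[\cdot] is an indicator function.

```latex
\begin{aligned}
\text{yedu}_{i,g,c,p} ={}& \beta_0
  + \sum_{k=1946}^{1969} \beta_k\,\bigl(\text{sdy\_density}_c \times \mathbf{1}[g = k]\bigr)
  + \gamma_1\,\text{male}_i + \gamma_2\,\text{han\_ethn}_i \\
&+ \lambda_c + \mu_{p,g}
  + \theta_g\,\text{primary\_base\_older}_c
  + \phi_g\,\text{junior\_base\_older}_c
  + \varepsilon_{i,g,c,p}
\end{aligned}
```

Each \beta_k is measured relative to the omitted 1941-1945 cohorts, so \beta_{1946} through \beta_{1955} act as placebo coefficients and \beta_{1956} through \beta_{1969} trace out the effect on exposed cohorts.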

. reghdfe yedu I1946-I1969 male han_ethn if rural==1, absorb(region1990 prov#year_birth c.primary_base_older#year_birth c.junior_base_older#year_birth) cluster(region1990)
(MWFE estimator converged in 14 iterations)

HDFE Linear regression                            Number of obs   =  3,082,370
Absorbing 4 HDFE groups                           F(  26,   1761) =     197.11
Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                  R-squared       =     0.3101
                                                  Adj R-squared   =     0.3095
                                                  Within R-sq.    =     0.1062
Number of clusters (region1990) =      1,762      Root MSE        =     2.8137

                         (Std. Err. adjusted for 1,762 clusters in region1990)
------------------------------------------------------------------------------
             |               Robust
        yedu |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       I1946 |    1.66256   .7684583     2.16   0.031     .1553734    3.169747
       I1947 |  -.0350028   .7205034    -0.05   0.961    -1.448135    1.378129
       I1948 |   .3873289   .8108702     0.48   0.633    -1.203041    1.977698
       I1949 |  -.0806064   .7207926    -0.11   0.911    -1.494306    1.333093
       I1950 |   .0444851   .7051097     0.06   0.950    -1.338455    1.427425
       I1951 |   .0476566   .9115992     0.05   0.958    -1.740274    1.835587
       I1952 |  -.5168174   .8303635    -0.62   0.534    -2.145419    1.111784
       I1953 |  -1.399235   .8577277    -1.63   0.103    -3.081506     .283037
       I1954 |   .2770551   .9313479     0.30   0.766    -1.549609    2.103719
       I1955 |   .8836312   .8897173     0.99   0.321     -.861382    2.628644
       I1956 |   1.517669   1.018799     1.49   0.136    -.4805141    3.515853
       I1957 |   2.300266   1.016131     2.26   0.024     .3073154    4.293216
       I1958 |   2.858782   1.074015     2.66   0.008     .7523023    4.965261
       I1959 |   4.254199   1.127633     3.77   0.000     2.042558     6.46584
       I1960 |    4.22015   1.221957     3.45   0.001      1.82351    6.616789
       I1961 |    3.86647   1.477515     2.62   0.009     .9686016    6.764339
       I1962 |   4.990398    1.16422     4.29   0.000     2.706999    7.273797
       I1963 |   4.390182   1.066651     4.12   0.000     2.298147    6.482217
       I1964 |   3.146871    .984423     3.20   0.001     1.216111    5.077632
       I1965 |   3.551206   .9706773     3.66   0.000     1.647405    5.455007
       I1966 |   3.329448   .9370527     3.55   0.000     1.491595      5.1673
       I1967 |   3.273766   1.006823     3.25   0.001     1.299072    5.248459
       I1968 |    4.09358   .9098564     4.50   0.000     2.309068    5.878092
       I1969 |   3.144828   1.013811     3.10   0.002     1.156428    5.133227
        male |   1.939975   .0275946    70.30   0.000     1.885853    1.994096
    han_ethn |   .1394096    .055681     2.50   0.012     .0302018    .2486174
       _cons |   5.209187   .0551155    94.51   0.000     5.101089    5.317286
------------------------------------------------------------------------------

Absorbed degrees of freedom:
-------------------------------------------------------------------------+
                     Absorbed FE | Categories  - Redundant  = Num. Coefs |
---------------------------------+---------------------------------------|
                      region1990 |      1762        1762           0    *|
                 prov#year_birth |       754           0         754     |
 year_birth#c.primary_base_older |        29           0          29    ?|
  year_birth#c.junior_base_older |        29           0          29    ?|
-------------------------------------------------------------------------+
? = number of redundant parameters may be higher
* = FE nested within cluster; treated as redundant for DoF computation

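Next, we can plot the interaction coefficients to show how they evolve across cohorts (the dynamic effect of the send-down movement on rural residents' years of schooling). The authors use a fairly involved plotting routine, which interested readers can look up; here I use the quicker coefplot command (user-written, installable with ssc install coefplot). coefplot draws point estimates and confidence intervals directly from the stored regression results and is commonly used for DID parallel-trends plots.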

* Plot the cohort-specific coefficients from the regression above
coefplot, baselevels ///
keep(I19*) ///
vertical /// put the cohorts on the x-axis
coeflabels(I1946=1946 I1947=1947 I1948=1948 I1949=1949 I1950=1950 ///
I1951=1951 I1952=1952 I1953=1953 I1954=1954 I1955=1955 I1956=1956 ///
I1957=1957 I1958=1958 I1959=1959 I1960=1960 I1961=1961 I1962=1962 ///
I1963=1963 I1964=1964 I1965=1965 I1966=1966 I1967=1967 I1968=1968 ///
I1969=1969) ///
yline(0,lwidth(vthin) lpattern(solid) lcolor(teal)) ///
xline(10,lwidth(vthin) lpattern(solid) lcolor(teal)) /// position 10 is the 1955 cohort, the last control cohort
ylabel(-4(2)8,labsize(*0.85) angle(0)) xlabel(,labsize(*0.75) angle(45)) ///
ytitle("Coefficients") ///
xtitle("Birth cohort") ///
msymbol(O) msize(small) mcolor(gs1) /// marker style
addplot(line @b @at,lcolor(gs1) lwidth(medthick)) /// connect the point estimates
ciopts(recast(rline) lwidth(thin) lpattern(dash) lcolor(gs2)) /// confidence-interval style
graphregion(color(white)) // white background

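Note: the dashed lines in the figure mark the 95% confidence interval.

Beyond this, changing the option inside recast() produces plots in different styles.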
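For example, the sketch below swaps recast(rline) for recast(rcap), which draws each confidence interval as a capped vertical spike instead of a pair of dashed lines. It assumes the reghdfe results with I1946-I1969 are still the active estimates; the styling options are simply carried over from the plot above, and the cohort labels are left out to keep it short.

```stata
* Same estimates, with confidence intervals drawn as capped spikes
coefplot, keep(I19*) vertical ///
    ciopts(recast(rcap) lcolor(gs2)) ///
    msymbol(O) msize(small) mcolor(gs1) ///
    yline(0, lwidth(vthin) lpattern(solid) lcolor(teal)) ///
    xlabel(, labsize(*0.75) angle(45)) ///
    graphregion(color(white))
```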

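As the figure shows, for cohorts born before 1956 the interaction coefficients are mostly close to zero and their 95% confidence intervals include zero; the one exception in the regression table above is the 1946 cohort, whose interval (0.155 to 3.170) excludes zero. This indicates that, before the send-down movement, counties with different exposure to sent-down youths did not follow different cohort trends, which supports the parallel-trends assumption. From the 1956 cohort onward the coefficients rise gradually. This suggests that, from the moment the sent-down youths arrived, the movement's positive effect on rural education accumulated as more and more school-age children came into contact with them. A comparison with the coefficient plot reported in the paper shows that the results are consistent.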
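The visual check can be complemented with a joint test that all pre-period coefficients are zero. This test is not part of the original post or, as far as I know, the authors' code; it is a sketch that assumes it is run immediately after the reghdfe command with I1946-I1969, and that I1946 through I1955 sit next to each other in the dataset (they do if they were created by the loop above).

```stata
* H0: the interaction coefficients for the 1946-1955 cohorts are jointly zero
testparm I1946-I1955
```

A large p-value is consistent with parallel pre-trends, while a small one (possibly driven by the 1946 cohort) would call for a closer look.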