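1. Introduction

In empirical research we often run into missing data. With a large sample we can simply drop the observations that contain missing values. With a small sample the impact of missing values becomes much larger, and filling in the data becomes correspondingly important.

This article introduces the Stata commands commonly used for handling missing data.

2. Problem and methods

2.1 Problems caused by missing data

When part of the data is missing, we can usually drop the affected observations without much effect on the results, provided the dataset is large enough. However, when the sample is small and many values are missing, simply dropping observations loses a lot of information and biases the results.

For example, the table below contains 8 cases. The left half is the data with missing values and the right half is the complete data; on the left, age is missing for 4 cases. After dropping the missing values, the mean of age on the left is 39, whereas the mean of age on the right is 29.75, a substantial difference.

+---------------------------------------+
| With missing data | Complete data     |
+---------------------------------------+
| Case  Age  Gender | Case  Age  Gender |
+---------------------------------------+
|    1    .  Female |    1   21  Female |
|    2    .  Male   |    2   22  Male   |
|    3   39  Male   |    3   39  Male   |
|    4    .  Female |    4   20  Female |
|    5   42  Male   |    5   42  Male   |
|    6    .  Female |    6   18  Female |
|    7   37  Male   |    7   37  Male   |
|    8   38  Male   |    8   39  Male   |
+---------------------------------------+
(Basic layout of the data with missing values and the complete data)

2.2 Common methods for handling missing data

Mean substitution
Deletion of missing values
Imputation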
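The two means quoted in section 2.1 can be checked by typing the eight cases into Stata. This is only a sketch: the variable names age_obs and age_full are illustrative labels that do not appear in the original, and summarize is used because it ignores missing values, which amounts to deleting them before averaging.

```stata
clear
* the eight cases from the table in section 2.1
* age_obs  = age with missing values (left half)
* age_full = age in the complete data (right half)
input case age_obs age_full
1  . 21
2  . 22
3 39 39
4  . 20
5 42 42
6  . 18
7 37 37
8 38 39
end

* summarize skips missing values, i.e. listwise deletion
summarize age_obs
display "mean after dropping missing ages: " r(mean)

summarize age_full
display "mean of the complete data:        " r(mean)
```

The first mean uses only the 4 observed ages, the second uses all 8, which is why the two figures differ so much in such a small sample.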
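3. Overview of the tsfill and ipolate commands

In time-series data, tsfill fills gaps in the time variable by adding new observations whose other variables are missing, and ipolate then fills missing values by linear interpolation. For instance, in the data below y is missing at t = 1:

+-------------+
| t    y    x |
|-------------|
| 0   65    8 |
| 1    .   15 |
| 2   80   20 |
+-------------+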
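A minimal sketch of what ipolate does with the three observations above. Linear interpolation between two known points (x0, y0) and (x1, y1) gives y = y0 + (y1 - y0)(x - x0)/(x1 - x0). The names y_t and y_x are illustrative, and whether the original intended t or x as the interpolation variable is an assumption, so both are shown.

```stata
clear
input t y x
0 65  8
1  . 15
2 80 20
end

* interpolate y linearly against t
ipolate y t, generate(y_t)

* interpolate y linearly against x
ipolate y x, generate(y_x)

list t y x y_t y_x, sep(0)
```

Against t, the missing value sits halfway between 65 and 80, giving 72.5. Against x, 15 lies 7/12 of the way from 8 to 20, giving 65 + 15 × 7/12 = 73.75. The choice of the second variable therefore changes the filled value.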
4. Stata in practice: the tsfill and ipolate commands

The examples below demonstrate the two commands.

4.1 Time-series data

Taking tsfillxmpl.dta as an example, tsset reports that the monthly series has gaps.

    . use https://www./data/r16/tsfillxmpl, clear

    . tsset
            time variable:  mdate, 1995m7 to 1996m3, but with gaps
                    delta:  1 month

    . list mdate income, sep(0)

         +------------------+
         |   mdate   income |
         |------------------|
      1. |  1995m7     1153 |
      2. |  1995m8     1181 |
      3. | 1995m11     1236 |
      4. | 1995m12     1297 |
      5. |  1996m1     1265 |
      6. |  1996m3     1282 |
         +------------------+
    . tsfill

    . list mdate income, sep(0)

         +------------------+
         |   mdate   income |
         |------------------|
      1. |  1995m7     1153 |
      2. |  1995m8     1181 |
      3. |  1995m9        . |
      4. | 1995m10        . |
      5. | 1995m11     1236 |
      6. | 1995m12     1297 |
      7. |  1996m1     1265 |
      8. |  1996m2        . |
      9. |  1996m3     1282 |
         +------------------+

With these new observations in place, we can use ipolate to fill in the missing values of income.

    . ipolate income mdate, gen(ipinc)    // income as a function of mdate

    . list mdate income ipinc, sep(0)

         +------------------------------+
         |   mdate   income       ipinc |
         |------------------------------|
      1. |  1995m7     1153        1153 |
      2. |  1995m8     1181        1181 |
      3. |  1995m9        .   1199.3333 |
      4. | 1995m10        .   1217.6667 |
      5. | 1995m11     1236        1236 |
      6. | 1995m12     1297        1297 |
      7. |  1996m1     1265        1265 |
      8. |  1996m2        .      1273.5 |
      9. |  1996m3     1282        1282 |
         +------------------------------+

In fact, ipolate also accepts the epolate option, which extrapolates in addition to interpolating. To see the difference, we first generate a dataset, then set part of it to missing, and finally create the variables y1 and y2 by "interpolation" and "interpolation plus extrapolation" respectively. In the table below, the last three rows have x beyond the largest x for which ymissing is observed: there y1 stays missing, while y2 is filled in by extrapolation.

    *- generate a dataset
    clear all
    set obs 20
    set seed 10101
    gen id = _n
    gen year = _n + 1999
    gen x = rnormal(8,1)
    gen e = rnormal(2,1)
    gen y = 1 + 2*x + e
    tsset year

    *- set y > 20 to missing
    gen ymissing = y
    replace ymissing = . if ymissing > 20

    *- interpolate, then interpolate plus extrapolate
    ipolate ymissing x, gen(y1)
    ipolate ymissing x, gen(y2) epolate

    *- list the data
    sort x
    list year y x ymissing y1 y2, sep(0)

         +---------------------------------------------------------------+
         | year          y          x   ymissing          y1          y2 |
         |---------------------------------------------------------------|
      1. | 2016   14.22344   6.224262   14.22344   14.223439   14.223439 |
      2. | 2003   16.01714   6.299623   16.01714   16.017143   16.017143 |
      3. | 2012    15.6483   6.631131    15.6483   15.648301   15.648301 |
      4. | 2018   15.23791   6.776969   15.23791   15.237909   15.237909 |
      5. | 2017   19.02291   7.256191   19.02291   19.022915   19.022915 |
      6. | 2014   18.78441   7.443309   18.78441    18.78441    18.78441 |
      7. | 2004   17.76631    7.66682   17.76631    17.76631    17.76631 |
      8. | 2019   19.20397   7.694068   19.20397   19.203974   19.203974 |
      9. | 2009   19.79814   7.845325   19.79814   19.798141   19.798141 |
     10. | 2011   18.87022   8.242313   18.87022   18.870222   18.870222 |
     11. | 2002   20.07537   8.258301          .   19.029664   19.029664 |
     12. | 2008   19.78762   8.334302   19.78762   19.787622   19.787622 |
     13. | 2010   19.64958   8.389261   19.64958   19.649576   19.649576 |
     14. | 2000   20.51375   8.392216          .   19.649501   19.649501 |
     15. | 2013   20.21895   8.769518          .   19.639873   19.639873 |
     16. | 2001   19.63783   8.849615   19.63783   19.637829   19.637829 |
     17. | 2005   18.35985   9.265242   18.35985   18.359848   18.359848 |
     18. | 2007   22.23178   9.515868          .           .   17.589214 |
     19. | 2015    25.5771   10.40832          .           .   14.845067 |
     20. | 2006    24.7197   10.81053          .           .   13.608347 |
         +---------------------------------------------------------------+

4.2 Panel data

Taking tsfillxmpl2.dta as an example, we can see that panel 2 is missing the year 1991.

    . webuse tsfillxmpl2, clear

    . tsset
           panel variable:  edlevel (unbalanced)
            time variable:  year, 1988 to 1992, but with a gap
                    delta:  1 unit

    . list edlevel year income, sep(0)

         +-------------------------+
         | edlevel   year   income |
         |-------------------------|
      1. |       1   1988    14500 |
      2. |       1   1989    14750 |
      3. |       1   1990    14950 |
      4. |       1   1991    15100 |
      5. |       2   1989    22100 |
      6. |       2   1990    22200 |
      7. |       2   1992    22800 |
         +-------------------------+

As with time-series data, we can use tsfill to add the missing observation.

    . tsfill

    . list edlevel year income, sep(0)

         +-------------------------+
         | edlevel   year   income |
         |-------------------------|
      1. |       1   1988    14500 |
      2. |       1   1989    14750 |
      3. |       1   1990    14950 |
      4. |       1   1991    15100 |
      5. |       2   1989    22100 |
      6. |       2   1990    22200 |
      7. |       2   1991        . |
      8. |       2   1992    22800 |
         +-------------------------+

We can also fill in observations so that the data form a "balanced panel"; this only requires adding the full option to tsfill.

    . webuse tsfillxmpl2, clear

    . xtset edlevel year
           panel variable:  edlevel (unbalanced)
            time variable:  year, 1988 to 1992, but with a gap
                    delta:  1 unit

    . tsfill, full

    . list edlevel year income

         +-------------------------+
         | edlevel   year   income |
         |-------------------------|
      1. |       1   1988    14500 |
      2. |       1   1989    14750 |
      3. |       1   1990    14950 |
      4. |       1   1991    15100 |
      5. |       1   1992        . |
         |-------------------------|
      6. |       2   1988        . |
      7. |       2   1989    22100 |
      8. |       2   1990    22200 |
      9. |       2   1991        . |
     10. |       2   1992    22800 |
         +-------------------------+

After 3 observations are added, the data become a "balanced panel". Next, we fill in the missing values.

    . ipolate income year, gen(ipinc1)

    . list edlevel year income ipinc1

         +----------------------------------+
         | edlevel   year   income   ipinc1 |
         |----------------------------------|
      1. |       1   1988    14500    14500 |
      2. |       1   1989    14750    18425 |
      3. |       1   1990    14950    18575 |
      4. |       1   1991    15100    15100 |
      5. |       1   1992        .    22800 |
         |----------------------------------|
      6. |       2   1988        .    14500 |
      7. |       2   1989    22100    18425 |
      8. |       2   1990    22200    18575 |
      9. |       2   1991        .    15100 |
     10. |       2   1992    22800    22800 |
         +----------------------------------+

Because this ipolate call is not prefixed with by edlevel:, it treats the ten observations as a single series. Years that appear in both panels are tied values of year, so ipinc1 takes the mean of income across the two panels (18425 for 1989 is the mean of 14750 and 22100), and the newly added observations are filled with values from the other panel (22800 for panel 1 in 1992). To interpolate within each panel, prefix the command with by edlevel:.

5. Evaluation of the ipolate method

Advantages
Disadvantages
Notes
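One point to check with panel data is whether interpolation runs within each panel. The sketch below rebuilds the seven observations listed in section 4.2 with input (rather than webuse, so that it runs offline) and applies ipolate under the by prefix. The variable name ipinc2 is illustrative and not from the original.

```stata
clear
* the seven observations of the panel example in section 4.2
input edlevel year income
1 1988 14500
1 1989 14750
1 1990 14950
1 1991 15100
2 1989 22100
2 1990 22200
2 1992 22800
end

xtset edlevel year
tsfill, full

* interpolate separately within each panel;
* epolate is not specified, so nothing is extrapolated
bysort edlevel: ipolate income year, generate(ipinc2)

list edlevel year income ipinc2, sepby(edlevel)
```

With the by prefix, only the 1991 gap in panel 2 is filled (22500, midway between 22200 and 22800), and observed incomes are left unchanged. The 1992 observation in panel 1 and the 1988 observation in panel 2 remain missing, because they lie outside each panel's observed range and epolate is not specified.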