Stata Study Notes

lyricLee7v7c1q 2017-09-16


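This article was collected from online sources and compiled by 计量经济学服务中心 (Econometrics Service Center).

Please credit the source when reposting.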



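Basic Stata operations

set memory 500m, permanently

(Needed only in Stata 11 and earlier; from Stata 12 onward memory is managed automatically and this setting is ignored.)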

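Display an expression or a string

display 1

display "clive"

Show the structure of the dataset: describe

describe (can be abbreviated to d)

Edit the data: edit

edit

Rename a variable

rename var1 var2

Show the contents of the dataset: list / browse

list in 1

list in 2/10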

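Importing data from a text file (.csv)

insheet using "C:\Documents and Settings\Administrator\桌面\ST9007\dataset\Fees1.csv", clear

A dataset can be imported only when memory is empty; otherwise Stata stops with the message "you must start with an empty dataset". There are two ways around this:

Clear all variables from memory first: drop _all

Or add the clear option to the import command, as in the insheet line above.

Open a saved data file: use

use path_and_filename, clear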

Record commands and output (log)

1. Start a log file: log using "J:\phd\output.log", replace

2. Pause logging: log off

3. Resume logging: log on

4. Close the log file: log close

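Create and save a program file (do-file): doedit, do

1. Open the do-file editor: doedit

2. Type in the commands

3. Save the file with the .do extension

4. Run it: do path_and_filename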

Combine several datasets with the same variables and structure into one (stacking them vertically): append

insheet using "J:\phd\Fees1.csv", clear

save "J:\phd\Fees1.dta", replace

insheet using "J:\phd\Fees2.csv", clear

append using "J:\phd\Fees1.dta"

save "J:\phd\Fees1.dta", replace
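The append workflow can be tried without the Fees files. The following is a minimal sketch that assumes the auto example dataset shipped with Stata; the split by foreign and the temporary file name domestic are purely illustrative and not part of the original notes.

```stata
* Sketch: split the auto data into two pieces, then stack them again
sysuse auto, clear

preserve
keep if foreign == 0          // first piece: domestic cars
tempfile domestic             // temporary .dta file, deleted when Stata exits
save `domestic'
restore

keep if foreign == 1          // second piece: foreign cars (now in memory)
append using `domestic'       // add the domestic rows below the foreign rows

display _N                    // number of observations after stacking
```

Because both pieces have identical variables, append simply adds rows; a variable present in only one file would be filled with missing values in the rows that come from the other.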

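Horizontal merge, adding further variables to the original dataset: merge

1. Sort both files by the key variables, then merge:

insheet using "J:\phd\Fees1.csv", clear

sort companyid yearend

save "J:\phd\Fees1.dta", replace

describe

insheet using "J:\phd\Fees6.csv", clear

sort companyid yearend

merge companyid yearend using "J:\phd\Fees1.dta"

save "J:\phd\Fees1.dta", replace

describe

(This is the old-style merge syntax. Stata 11 and later expect the match type to be stated, e.g. merge 1:1 companyid yearend using ..., and no longer require the data to be sorted first.)

2. merge creates the variable _merge, which records where each observation came from:

_merge==1: observation from the master data only

_merge==2: observation from the using data only

_merge==3: observation found in both the master and the using data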
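To see the _merge codes in action, here is a small sketch using the newer merge 1:1 syntax. It assumes the auto example dataset shipped with Stata, in which make identifies each observation uniquely; the temporary file prices is a hypothetical stand-in for a second data file.

```stata
* Sketch: remove a variable, then bring it back with a one-to-one merge
sysuse auto, clear

preserve
keep make price               // the "using" file holds the key and one extra variable
tempfile prices
save `prices'
restore

drop price                    // the master data no longer contains price
merge 1:1 make using `prices' // match on the key variable make
tabulate _merge               // every row should be matched (_merge == 3)
```

Checking tabulate _merge right after every merge is a cheap way to catch unmatched keys before they silently turn into missing values.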

Help files: help

help describe

Descriptive statistics

summarize incorporationyear (a single variable)

summarize incorporationyear-big6 (a consecutive range of variables)

summarize _all, or simply summarize (all variables)

More detailed statistics

summarize incorporationyear, detail

Percentiles: centile

centile auditfees, centile(0(10)100)

centile auditfees, centile(0(5)100)

tabulate: frequencies and proportions for categorical variables

tabulate companytype

tabulate companytype big6, column (column percentages)

tabulate companytype big6, row (row percentages)

tab companytype big6 if companytype<=3, row col

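Count the observations that satisfy a condition

count if big6==1

count if big6==0 | big6==1

Sort by a discrete variable and compute descriptive statistics of a continuous variable within each group:

by companytype, sort: summarize auditfees, detail

or, equivalently, sort first and then use by:

sort companytype

by companytype: summarize auditfees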

Transforming variables

Based on company type, code publicly listed companies as 1 and all others as 0:

gen listed=0

replace listed=1 if companytype==2

replace listed=1 if companytype==3

replace listed=1 if companytype==5

replace listed=. if companytype==.
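The same recoding can be written in one line. The sketch below is an alternative rather than the notes' own method: it uses inlist() and a missing-value guard, and the six companytype values are made-up illustration data.

```stata
* Sketch: one-line version of the listed dummy on hypothetical data
clear
input companytype
1
2
3
4
5
.
end

* inlist() is 1 when companytype equals 2, 3 or 5 and 0 otherwise;
* the if qualifier leaves listed missing where companytype is missing
generate listed = inlist(companytype, 2, 3, 5) if !missing(companytype)

list
```

The guard matters: without it, a missing companytype would be coded 0 instead of missing.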

Create a new variable: generate (gen)

generate newvar = expression

Models

format x1 %10.3f (displays x1 in a column 10 characters wide with three decimal places)

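Simple regression with one explanatory variable

regress y x

Stored regression results

After regress, each estimated coefficient is available as _b[varname], and the constant term as _b[_cons].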
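A short sketch of how these stored coefficients can be used, assuming the auto example dataset shipped with Stata (price and weight stand in for the notes' y and x; yhat_manual is a name chosen for illustration):

```stata
* Sketch: read the stored coefficients and rebuild the fitted values by hand
sysuse auto, clear
regress price weight

display _b[weight]            // slope
display _b[_cons]             // constant term
display _se[weight]           // standard error of the slope

generate double yhat_manual = _b[_cons] + _b[weight]*weight
predict double yhat           // fitted values computed by Stata

* the two versions should agree up to rounding error
assert reldif(yhat, yhat_manual) < 1e-8
```

_b[] and _se[] always refer to the most recent estimation command, so they must be used before the next model is fitted.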

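Predicted values and residuals

predict yhat

predict yres, resid

yres is the difference between the actual and the predicted value.

Scatter plot of the residuals against x

twoway (scatter yres x) (lfit yres x)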

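Measuring the precision of an estimated coefficient: the standard error.

The t-statistic is the coefficient divided by its standard error. The p-value is computed from the distribution of the t-statistic: it is the probability of observing a t-statistic at least this far from zero if the true coefficient were zero. These inferences are valid only if the residuals satisfy the following assumptions:

Homoscedasticity (i.e., the residuals have a constant variance)

Non-correlation (i.e., the residuals are not correlated with each other)

Normality (i.e., the residuals are normally distributed)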
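The t-statistic and p-value described above can be reproduced by hand. This is a sketch under the usual OLS assumptions, using the auto example dataset shipped with Stata; the scalar names t_stat and p_val are illustrative.

```stata
* Sketch: t-statistic and two-sided p-value for one coefficient
sysuse auto, clear
regress price weight

scalar t_stat = _b[weight] / _se[weight]          // coefficient / standard error
scalar p_val  = 2 * ttail(e(df_r), abs(t_stat))   // e(df_r) = residual df (n - k)

display "t = " t_stat "   p = " p_val
```

The two numbers should match the t and P>|t| columns of the regress output for weight.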

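Meaning of some items in the regression output

Degrees of freedom of each sum of squares:

For the ESS, df = k - 1, where k = number of regression coefficients (df = 2 - 1)

For the RSS, df = n - k, where n = number of observations (= 11 - 2)

For the TSS, df = n - 1 (= 11 - 1)

MS (mean square): the last column (MS) reports the ESS, RSS and TSS divided by their respective degrees of freedom.

R-squared = ESS / TSS

Adj R-squared = 1 - (1 - R2)(n - 1)/(n - k). This corrects for the fact that R-squared never decreases when an explanatory variable is added, even a weakly related one.

Root MSE = square root of RSS/(n - k): the estimated standard deviation of the regression error, i.e. the typical size of a residual.

F-statistic = (ESS/(k - 1)) / (RSS/(n - k)): a test of the joint explanatory power of the model.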
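These formulas can be checked against the results that regress stores in e(). The sketch below assumes the auto example dataset shipped with Stata; Stata calls the ESS the "model" sum of squares, e(mss), and the RSS the "residual" sum of squares, e(rss). The scalar names are illustrative.

```stata
* Sketch: rebuild R-squared, adjusted R-squared, Root MSE and F by hand
sysuse auto, clear
regress price weight

scalar ess_m  = e(mss)                 // explained (model) sum of squares
scalar rss_m  = e(rss)                 // residual sum of squares
scalar tss_m  = ess_m + rss_m          // total sum of squares
scalar n_obs  = e(N)                   // number of observations
scalar k_coef = e(df_m) + 1            // coefficients including the constant

display "R-squared     = " ess_m/tss_m
display "Adj R-squared = " 1 - (1 - ess_m/tss_m)*(n_obs - 1)/(n_obs - k_coef)
display "Root MSE      = " sqrt(rss_m/(n_obs - k_coef))
display "F             = " (ess_m/(k_coef - 1)) / (rss_m/(n_obs - k_coef))
```

Each displayed value should agree with the corresponding entry in the header of the regress output.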

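Heteroscedasticity (hettest)

Testing for homoscedasticity:

Run hettest after the regression (in current versions of Stata the command is estat hettest):

· reg auditfees nonauditfees totalassets big6 listed

· hettest

Heteroscedasticity does not bias the coefficients, but it does bias their standard errors. One possible cause is that the data are bounded, which produces high skewness. Some heteroscedasticity can be removed by taking logarithms. When heteroscedasticity is found, adjust the standard errors with the Huber/White/sandwich estimator; in Stata this is done by adding the robust option to the regression.

reg auditfees nonauditfees totalassets big6 listed, robust

With robust the coefficients are the same but the standard errors differ; in this example the t-values become smaller, the p-values larger and the F-value smaller, while R-squared is unchanged.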
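The effect of robust can be seen side by side in a small sketch. It assumes the auto example dataset shipped with Stata instead of the audit-fee data, so the direction in which the standard errors move may differ from the example in the notes; the scalar names are illustrative.

```stata
* Sketch: same coefficients, different standard errors
sysuse auto, clear

regress price weight mpg
estat hettest                          // Breusch-Pagan / Cook-Weisberg test
scalar b_ols  = _b[weight]
scalar se_ols = _se[weight]

regress price weight mpg, robust       // Huber/White/sandwich standard errors
display "coefficient:  OLS " b_ols  "   robust " _b[weight]
display "std. error:   OLS " se_ols "   robust " _se[weight]
```

Only the variance estimate changes; the point estimates come from the same least-squares fit.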

Correlated errors

In panel data, the residuals of a given firm are correlated across years ("time-series dependence"): there are likely to be unobserved company-specific characteristics that are relatively constant over time and affect every year's observation, so the observations are not independent.

The standard errors are then biased downward. This problem can be avoided by adjusting the standard errors for the clustering of yearly observations across a given company.

Dealing with correlated errors:

Add robust cluster() to the regression:

reg lnaf lnta big6 listed, robust cluster(companyid)
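The downward bias can be illustrated with simulated data. This is a sketch on a made-up panel, not the notes' dataset: 100 hypothetical companies observed for 5 years, with a company effect u that is constant over time and enters both x and the error term.

```stata
* Sketch: robust versus cluster-robust standard errors on a simulated panel
clear
set seed 12345
set obs 100                            // 100 hypothetical companies
generate companyid = _n
generate u = rnormal()                 // unobserved company effect
expand 5                               // 5 yearly observations per company
bysort companyid: generate year = 1997 + _n
generate x = rnormal() + u
generate y = 1 + 0.5*x + u + rnormal()

regress y x, robust                    // ignores within-company correlation
scalar se_rob = _se[x]

regress y x, robust cluster(companyid) // allows correlation within a company
display "robust SE = " se_rob "   clustered SE = " _se[x]
```

With a persistent company effect of this kind the clustered standard error is typically the larger of the two, which is why the unclustered t-statistics look too significant.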

How to check whether the residuals of the same company are correlated across years

reg lnaf lnta

predict res, resid

keep companyid year res

sort companyid year

drop if companyid==companyid[_n-1] & year==year[_n-1]

reshape wide res, i(companyid) j(year)

browse

pwcorr res1998-res2002

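Points to note when using panel data:

If you use only robust to control for heteroscedasticity, without cluster() to control for time-series dependence, the t-statistics are still biased upward.

If heteroscedasticity is not controlled for either, the upward bias in the t-statistics is even more severe.

Therefore, add the robust cluster() option when using panel data; otherwise your "significant" results from pooled regressions may be spurious.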

When does multicollinearity arise?

We have seen that when there is perfect collinearity between independent variables, Stata has to exclude one of them. For example, year_1 + year_2 + year_3 + year_4 + year_5 = 1, so the five year dummies are perfectly collinear with the constant term:

reg lnaf year_1 year_2 year_3 year_4 year_5

Stata automatically throws away one of the year dummies so that the model can be estimated. (With the nocons option there is no constant to be collinear with, so all five dummies can be kept.)

Even if the independent variables are not perfectly collinear, there can still be a problem if they are highly correlated.

Consequences:

The standard errors of the coefficients become large (i.e., the coefficients are not estimated precisely).

The coefficient estimates can be highly unstable.

How to measure it:

Variance-inflation factors (VIF) can be used to assess whether multicollinearity is present.

reg lnaf lnta big6 lnta1

vif

reg lnaf lnta big6

vif
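What the VIF measures can be made concrete: the VIF of a regressor equals 1/(1 - R2), where R2 comes from regressing that regressor on all the other regressors. The sketch below assumes the auto example dataset shipped with Stata and uses estat vif, the current name of the vif command; the scalar vif_weight is illustrative.

```stata
* Sketch: VIF from the command and by hand
sysuse auto, clear

regress price weight length mpg
estat vif                              // VIFs for weight, length and mpg

* auxiliary regression: weight on the other explanatory variables
regress weight length mpg
scalar vif_weight = 1/(1 - e(r2))
display "VIF of weight computed by hand = " vif_weight
```

The hand-computed value should match the row for weight in the estat vif table. A high value means that most of the variation in weight is already explained by the other regressors.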