Data Mining Series | How to Run a T-Test in R

萌小芊 2018-01-22


### Statistical basics of the t-test

http://www./english/wiki/t-test-formula

Probably one of the most popular research questions is whether two independent samples differ from each other. Student’s t-test is one of the most common statistical tests for comparing the means of two independent or paired samples.
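To make the test concrete, here is a minimal sketch that computes the pooled-variance (Student's) t statistic by hand and compares it with `t.test`. The vectors `x` and `y` are simulated, hypothetical samples, not the hsb2 data used later in this post.

```r
set.seed(42)
# Hypothetical samples for illustration only (not taken from hsb2)
x <- rnorm(30, mean = 50, sd = 10)
y <- rnorm(35, mean = 55, sd = 10)

n1 <- length(x)
n2 <- length(y)

# Pooled variance: Student's t-test assumes both groups share one variance
sp2 <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)

# t statistic, degrees of freedom, and two-sided p-value
t_manual  <- (mean(x) - mean(y)) / sqrt(sp2 * (1 / n1 + 1 / n2))
df_manual <- n1 + n2 - 2
p_manual  <- 2 * pt(-abs(t_manual), df = df_manual)

# Same test via the built-in function
res <- t.test(x, y, var.equal = TRUE)

# Both comparisons should report TRUE
all.equal(unname(res$statistic), t_manual)
all.equal(res$p.value, p_manual)
```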

### Test data

data = read.csv('https://stats.idre./stat/data/hsb2.csv')

### Task 1: does the write variable differ significantly between ses=1 and ses=2?

http://www./Statistical_analysis/t-test/

About the var.equal parameter (whether to assume equal variances):

By default, t.test does not assume equal variances; it uses the Welch t-test. Note that in the Welch t-test, df=88.466, because of the adjustment for unequal variances. To use Student’s t-test, set var.equal=TRUE.
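A minimal sketch of the two variants side by side. The vectors `g1` and `g2` are simulated stand-ins for the write scores of two ses groups (an assumption for illustration), so the numbers will not match the df=88.466 quoted above.

```r
set.seed(1)
# Hypothetical stand-ins for write scores in two ses groups
g1 <- rnorm(40, mean = 50, sd = 10)
g2 <- rnorm(60, mean = 52, sd = 8)

welch   <- t.test(g1, g2)                    # default: var.equal = FALSE
student <- t.test(g1, g2, var.equal = TRUE)  # pooled-variance Student's t-test

welch$method        # the label mentions "Welch"
welch$parameter     # Welch-adjusted df, generally not an integer
student$parameter   # n1 + n2 - 2 = 98
c(welch = welch$p.value, student = student$p.value)
```

With real data, the same calls would take the two subsets of the data frame, for example the write values where ses is 1 and where ses is 2.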

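### Task 2: does the write variable differ significantly across ses levels (1, 2, 3)?

The ANOVA results are as follows:

As the figure shows, write differs significantly across the ses groups. But which groups drive that difference, and does write differ significantly between each pair of ses groups?

Post-hoc multiple comparisons (post-hoc analysis)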
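A one-way ANOVA of this kind can be sketched as follows. The data frame `sim` is simulated and hypothetical; it only mimics the shape of the hsb2 columns. Note that the grouping variable is converted to a factor: if ses were left as the numbers 1, 2, 3, `aov` would fit it as a single numeric covariate instead of three groups.

```r
set.seed(1)
# Hypothetical data with the same column names as the post (not hsb2 itself)
sim <- data.frame(
  ses   = factor(rep(1:3, times = c(40, 60, 50))),
  write = c(rnorm(40, 50, 9), rnorm(60, 52, 9), rnorm(50, 56, 9))
)

# One-way ANOVA: does mean write differ across the three ses levels?
fit <- aov(write ~ ses, data = sim)
summary(fit)

# Pull out the p-value of the overall F test
p_anova <- summary(fit)[[1]][["Pr(>F)"]][1]
p_anova
```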

https://stats.idre./r/faq/how-can-i-do-post-hoc-pairwise-comparisons-in-r/

After an ANOVA, you may know that the means of your response variable differ significantly across your factor, but you do not know which pairs of the factor levels are significantly different from each other. At this point, you can conduct pairwise comparisons.

### Pairwise analysis

http://psych./moore/Rpdf/610-R3_post-hoc_one-way_betw.pdf

https://rtutorialseries./2011/03/r-tutorial-series-anova-pairwise.html

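That is, every pair of groups is compared, which here means three comparisons (ses=1 vs. ses=2; ses=1 vs. ses=3; ses=2 vs. ses=3).

# No correction:

pairwise.t.test defaults to pool.sd = T, which behaves much like setting var.equal=T in t.test. One discrepancy is easy to spot, though: for ses 1 vs. 2, the p-value is 0.4306 in the pairwise calculation but 0.4279... in t.test.

The same question turns up online: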

https://stat./pipermail/r-help/2010-September/252267.html

The pool.sd switch calculates a common SD for all groups, so the denominator is not the same as when testing each pair separately.

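The fix is as follows:

pairwise.t.test(data$write, data$ses, p.adj = 'none', pool.sd = FALSE, var.equal = TRUE)

With these settings, the pairwise results agree with those from analysing each pair separately.

If only pool.sd=F is set, the results instead match t.test with its default parameters (Welch’s t-test).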

To get separate standard deviation estimates instead of a pooled standard deviation, set pool.sd=FALSE. This is usually unnecessary, since the default is to assume homogeneity of variance. If that assumption weren’t true for your data, you could use pool.sd=F.
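The relationship between the `pool.sd` settings and `t.test` can be checked with a short sketch. The data frame `sim` is simulated and hypothetical, so the p-values differ from the 0.4306 and 0.4279 quoted above; only the pattern of agreement is the point.

```r
set.seed(1)
# Hypothetical three-group data (not hsb2)
sim <- data.frame(
  ses   = factor(rep(1:3, times = c(40, 60, 50))),
  write = c(rnorm(40, 50, 9), rnorm(60, 52, 9), rnorm(50, 56, 9))
)
g1 <- sim$write[sim$ses == 1]
g2 <- sim$write[sim$ses == 2]

# Default: one SD pooled across all three groups
pooled <- pairwise.t.test(sim$write, sim$ses, p.adjust.method = "none")

# Separate tests per pair, equal variances assumed within each pair
separate <- pairwise.t.test(sim$write, sim$ses, p.adjust.method = "none",
                            pool.sd = FALSE, var.equal = TRUE)

# Separate tests per pair, Welch correction
welch <- pairwise.t.test(sim$write, sim$ses, p.adjust.method = "none",
                         pool.sd = FALSE)

# Row "2", column "1" of the p-value matrix is the ses 1 vs. 2 comparison
c(pooled   = pooled$p.value["2", "1"],
  separate = separate$p.value["2", "1"],
  student  = t.test(g1, g2, var.equal = TRUE)$p.value,
  welch    = welch$p.value["2", "1"],
  default  = t.test(g1, g2)$p.value)
```

The `separate` and `student` entries should coincide, as should `welch` and `default`, while `pooled` uses a denominator estimated from all three groups.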

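# FDR correction:

Because several comparisons are involved, multiple-testing correction has to be considered. There are many ways to correct p-values; the most widely used at present is the BH correction, i.e. FDR. Corrected p-values are never smaller than the raw ones and are usually larger:

The corrected results match those obtained by applying p.adjust to the p-values directly:

In pairwise.t.test the correction method is set through the p.adj parameter (short for p.adjust.method). Its allowed values are the methods built into p.adjust: 'holm', 'hochberg', 'hommel', 'bonferroni', 'BH', 'BY', 'fdr', 'none'.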
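A minimal sketch of this equivalence, again on simulated, hypothetical data (`sim` is not hsb2): the BH-adjusted matrix from `pairwise.t.test` is reproduced by running `p.adjust` on the uncorrected p-values.

```r
set.seed(1)
# Hypothetical three-group data (not hsb2)
sim <- data.frame(
  ses   = factor(rep(1:3, times = c(40, 60, 50))),
  write = c(rnorm(40, 50, 9), rnorm(60, 52, 9), rnorm(50, 56, 9))
)

raw <- pairwise.t.test(sim$write, sim$ses, p.adjust.method = "none")$p.value
bh  <- pairwise.t.test(sim$write, sim$ses, p.adjust.method = "BH")$p.value

# The three comparisons sit in the lower triangle (diagonal included)
idx <- lower.tri(raw, diag = TRUE)
manual <- raw
manual[idx] <- p.adjust(raw[idx], method = "BH")

all.equal(manual, bh)       # hand-adjusted values match the built-in ones
all(bh[idx] >= raw[idx])    # BH never lowers a p-value

# 'fdr' is an alias for 'BH' in p.adjust
identical(p.adjust(raw[idx], "fdr"), p.adjust(raw[idx], "BH"))
```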

# TukeyHSD:

Tukey’s test is a single-step multiple comparison procedure and statistical test. It is a post-hoc analysis, which means that it is used in conjunction with an ANOVA. The Tukey Honest Significant Difference (HSD) method controls for the Type I error rate across multiple comparisons and is generally considered an acceptable technique.

diff is the difference in means between the two groups, lwr is the lower bound of the 95% confidence interval of the difference in means, upr is the upper bound of the same 95% confidence interval, and p adj is the significance of the test after correcting for the family-wise error rate. The adjustment of the p-value is necessary to control Type I error inflation.
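The columns described above can be seen in a short sketch. The data frame `sim` is simulated and hypothetical, so the values are illustrative only.

```r
set.seed(1)
# Hypothetical three-group data (not hsb2); ses must be a factor for TukeyHSD
sim <- data.frame(
  ses   = factor(rep(1:3, times = c(40, 60, 50))),
  write = c(rnorm(40, 50, 9), rnorm(60, 52, 9), rnorm(50, 56, 9))
)

fit <- aov(write ~ ses, data = sim)
tuk <- TukeyHSD(fit, conf.level = 0.95)

# One row per pair ("2-1", "3-1", "3-2"); columns diff, lwr, upr, p adj
tuk$ses

# 'diff' is simply the difference of the two group means
mean(sim$write[sim$ses == 2]) - mean(sim$write[sim$ses == 1])
```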

# When comparing two groups out of several, which result is more trustworthy: the t-test or TukeyHSD?

https://www./post/What_is_the_difference_between_Tukeys_Post_Hoc_Test_and_Students_t-test

https://stats./questions/61732/two-sample-t-test-vs-tukeys-method

The recommendation: TukeyHSD, because it uses all of the data and corrects for multiple comparisons.

# Comparison of the methods

www.ijsmi.com/Journal/index.php/IJSMI/article/download/1/pdf

https://www./meeting/chicago16/slides/chicago16_ender.pdf

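### Summary

For Welch’s t-test on two groups, use t.test with its default parameters;

For Student’s t-test on two groups, use t.test with var.equal=T;

For pairwise differences among several groups, TukeyHSD is the convenient choice.

### Further reading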

http://www./english/wiki/print.php?id=94

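In most cases (as in everything above), we take for granted that the data are normally distributed with equal variances, and reach straight for Student’s t-test without a second thought. The problem is that conclusions drawn from an unsuitable statistical method may be biased or even wrong.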

Statistical errors are common in scientific literature, and about 50% of the published articles have at least one error. Many of the statistical procedures including correlation, regression, t test, and analysis of variance assume that the data are normally distributed.
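One possible way to look at these assumptions before choosing a test is sketched below. This is an illustrative assumption, not something the post itself runs: `g1` and `g2` are simulated, and `shapiro.test` and `var.test` are only two of several available checks.

```r
set.seed(1)
# Hypothetical samples for illustration only
g1 <- rnorm(40, mean = 50, sd = 9)
g2 <- rnorm(60, mean = 52, sd = 9)

# Shapiro-Wilk test: a small p-value suggests departure from normality
norm_p <- c(g1 = shapiro.test(g1)$p.value,
            g2 = shapiro.test(g2)$p.value)

# F test comparing two variances: a small p-value suggests unequal variances
var_p <- var.test(g1, g2)$p.value

norm_p
var_p
```

If the variance check raises doubts, the default Welch form of `t.test` avoids the equal-variance assumption.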