[Original] Rothman, a Giant of Contemporary Epidemiology: 18 Misinterpretations of the P Value and the Truth Behind Them

妙趣横生统计学 (Fun with Statistics) 2019-12-08


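In 2016, Kenneth J. Rothman, Sander Greenland, and their coauthors published a review article in the European Journal of Epidemiology, "Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations". The paper catalogs and corrects the many misunderstandings that have built up since Fisher introduced the P value as a basis for rejecting or not rejecting H0. I will cover the paper in three posts. This first post discusses the misinterpretations of the P value and the truth behind them.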

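I. Who Is Kenneth J. Rothman?

Anyone who studies public health should know Kenneth J. Rothman. He is widely regarded as the foremost figure in contemporary epidemiology, and his book Modern Epidemiology is the bible of the field. Search online if you are curious. Sander Greenland is also an author of Modern Epidemiology.

So this paper is close to the most authoritative reading of the P value in medicine so far this century!

II. Misinterpretations of the P Value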

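Misinterpretation 1:

The P value is the probability that H0 is true. For example, if a hypothesis test gives P = 0.01, H0 has less than a 1% chance of being true; conversely, if P = 0.4, H0 has a 40% chance of being true. No!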

The P value is the probability that the test hypothesis is true; for example, if a test of the null hypothesis gave P = 0.01, the null hypothesis has only a 1 % chance of being true; if instead it gave P = 0.40, the null hypothesis has a 40 % chance of being true. No!

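Interpretation 1:

The P value is computed on the assumption that H0 is true; it is not the probability that H0 holds. It only indicates how closely the sample data conform to what would be expected if H0 were true. So P = 0.01 says the data are not very close to what the statistical model under H0 predicts, while P = 0.4 says the data are much closer to that prediction.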

The P value assumes the test hypothesis is true; it is not a hypothesis probability and may be far from any reasonable probability for the test hypothesis. The P value simply indicates the degree to which the data conform to the pattern predicted by the test hypothesis and all the other assumptions used in the test (the underlying statistical model). Thus P = 0.01 would indicate that the data are not very close to what the statistical model (including the test hypothesis) predicted they should be, while P = 0.40 would indicate that the data are much closer to the model prediction, allowing for chance variation.

Misinterpretation 2:

(I admit I do not fully understand this one myself, so here is the original text.) The P value for the null hypothesis is the probability that chance alone produced the observed association; for example, if the P value for the null hypothesis is 0.08, there is an 8 % probability that chance alone produced the association.

Interpretation 2:

No! This is a common variation of the first fallacy and it is just as false. To say that chance alone produced the observed association is logically equivalent to asserting that every assumption used to compute the P value is correct, including the null hypothesis. Thus to claim that the null P value is the probability that chance alone produced the observed association is completely backwards: The P value is a probability computed assuming chance was operating alone. The absurdity of the common backwards interpretation might be appreciated by pondering how the P value, which is a probability deduced from a set of assumptions (the statistical model), can possibly refer to the probability of those assumptions. Note: One often sees "alone" dropped from this description (becoming "the P value for the null hypothesis is the probability that chance produced the observed association"), so that the statement is more ambiguous, but just as wrong.

Misinterpretation 3:

A statistically significant result (P ≤ 0.05) means that the null hypothesis (H0) is false and should be rejected. No!

A significant test result (P ≤ 0.05) means that the test hypothesis is false or should be rejected.

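Interpretation 3:

A small P value only means that a sample like ours would be unusual if H0 and all the other assumptions were correct. It may be small because of a large random error, or because some assumption other than H0 was violated. P ≤ 0.05 only means that a discrepancy from what H0 predicts (for example, no difference between two groups) at least as large as the one observed would arise no more than 5% of the time if chance alone were at work.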

No! A small P value simply flags the data as being unusual if all the assumptions used to compute it (including the test hypothesis) were correct; it may be small because there was a large random error or because some assumption other than the test hypothesis was violated (for example, the assumption that this P value was not selected for presentation because it was below 0.05). P ≤ 0.05 only means that a discrepancy from the hypothesis prediction (e.g., no difference between treatment groups) would be as large or larger than that observed no more than 5 % of the time if only chance were creating the discrepancy (as opposed to a violation of the test hypothesis or a mistaken assumption).

Misinterpretation 4:

A nonsignificant P value means that the null hypothesis is true or should be accepted. No!

A nonsignificant test result (P > 0.05) means that the test hypothesis is true or should be accepted.

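Interpretation 4:

A large P value only means that the sample would not be unusual if H0 were true. Moreover, even if H0 is false, the P value may be large because of a large random error or because some other assumption is wrong (for example, the assumption that this P value was not selected for presentation because it was above 0.05). P > 0.05 only means that, if H0 holds (for example, no difference between two groups), a discrepancy at least as large as the one observed would arise by chance more than 5% of the time.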

No! A large P value only suggests that the data are not unusual if all the assumptions used to compute the P value (including the test hypothesis) were correct. The same data would also not be unusual under many other hypotheses. Furthermore, even if the test hypothesis is wrong, the P value may be large because it was inflated by a large random error or because of some other erroneous assumption (for example, the assumption that this P value was not selected for presentation because it was above 0.05). P > 0.05 only means that a discrepancy from the hypothesis prediction (e.g., no difference between treatment groups) would be as large or larger than that observed more than 5 % of the time if only chance were creating the discrepancy.

Misinterpretation 5:

A large P value is strong evidence for accepting H0. No!

A large P value is evidence in favor of the test hypothesis. No!

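Interpretation 5:

In fact, any P value less than 1 means that H0 is not the hypothesis that best fits our sample; other hypotheses may fit it better. A P value cannot be said to favor H0 except relative to hypotheses with smaller P values. In addition, a large P value often means only that the data are unable to discriminate among many competing hypotheses. For example, many authors conclude from P = 0.70 that the treatment has no effect. In fact, P = 0.70 means that although H0 is compatible with the data, it is not the hypothesis most compatible with them; that would be a hypothesis with P = 1. And even when P = 1, many other hypotheses remain highly consistent with the data. So "no association" cannot be concluded from a P value, however large.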

In fact, any P value less than 1 implies that the test hypothesis is not the hypothesis most compatible with the data, because any other hypothesis with a larger P value would be even more compatible with the data. A P value cannot be said to favor the test hypothesis except in relation to those hypotheses with smaller P values. Furthermore, a large P value often indicates only that the data are incapable of discriminating among many competing hypotheses (as would be seen immediately by examining the range of the confidence interval). For example, many authors will misinterpret P = 0.70 from a test of the null hypothesis as evidence for no effect, when in fact it indicates that, even though the null hypothesis is compatible with the data under the assumptions used to compute the P value, it is not the hypothesis most compatible with the data; that honor would belong to a hypothesis with P = 1. But even if P = 1, there will be many other hypotheses that are highly consistent with the data, so that a definitive conclusion of "no association" cannot be deduced from a P value, no matter how large.
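
To make the P = 0.70 example concrete, here is a minimal Python sketch. The numbers (a point estimate of 0.76 with standard error 2.0) and the helper name `two_sided_p` are hypothetical, chosen only so that the null test gives roughly P = 0.70. The normal (Wald) test is also an assumption for illustration, not something the paper prescribes.

```python
import math

def two_sided_p(estimate, hypothesized, se):
    """Two-sided P value from a normal (Wald) test of a hypothesized effect."""
    z = (estimate - hypothesized) / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area

# Hypothetical study: observed mean difference 0.76, standard error 2.0.
estimate, se = 0.76, 2.0

p_null = two_sided_p(estimate, 0.0, se)              # test of "no effect"
p_at_estimate = two_sided_p(estimate, estimate, se)  # hypothesis equal to the point estimate
p_effect_of_3 = two_sided_p(estimate, 3.0, se)       # test of "effect = 3.0"

print(f"P for effect = 0:    {p_null:.2f}")
print(f"P for effect = 0.76: {p_at_estimate:.2f}")
print(f"P for effect = 3:    {p_effect_of_3:.2f}")
```

Under these assumed numbers the null hypothesis is compatible with the data, but so is an effect of 3.0, and the hypothesis equal to the point estimate is the one with P = 1.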

Misinterpretation 6:

If the P value for H0 is greater than 0.05, the treatment has no effect, or no effect was observed. No!

A null-hypothesis P value greater than 0.05 means that no effect was observed, or that absence of an effect was shown or demonstrated.

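Interpretation 6:

P > 0.05 for H0 only means that H0 is one of many hypotheses with P > 0.05. Unless the point estimate exactly equals the null value, we cannot conclude that the study found no association or no evidence of an effect. If the null P value is less than 1, some association must be present in the data, and the researcher must look at the point estimate to see how large the effect is.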

No! Observing P > 0.05 for the null hypothesis only means that the null is one among the many hypotheses that have P > 0.05. Thus, unless the point estimate (observed association) equals the null value exactly, it is a mistake to conclude from P > 0.05 that a study found "no association" or "no evidence" of an effect. If the null P value is less than 1 some association must be present in the data, and one must look at the point estimate to determine the effect size most compatible with the data under the assumed model.

Misinterpretation 7:

Statistical significance means that a scientifically or substantively important association has been observed. No!

Statistical significance indicates a scientifically or substantively important relation has been detected. No!

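Interpretation 7:

Especially in a large sample, even a small effect will often yield a statistically significant result. A small P value only means that the data would be unusual if H0 were true; it does not mean the result is clinically meaningful. One must look at the confidence interval to judge whether the effect sizes have scientific or practical importance.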

Especially when a study is large, very minor effects or small assumption violations can lead to statistically significant tests of the null hypothesis. Again, a small null P value simply flags the data as being unusual if all the assumptions used to compute it (including the null hypothesis) were correct; but the way the data are unusual might be of no clinical interest. One must look at the confidence interval to determine which effect sizes of scientific or other substantive (e.g., clinical) importance are relatively compatible with the data, given the model.

Misinterpretation 8:

Lack of statistical significance means that the effect size is small. No!

Lack of statistical significance indicates that the effect size is small. No!

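Interpretation 8:

Especially when the sample is small and noisy, a statistically significant difference often goes undetected. A large P value only says that the data would not be unusual under H0; the same data would also not be unusual under many other hypotheses, so it does not show that the data agree with H0 in particular. Again, we must look at the confidence interval.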

Especially when a study is small, even large effects may be "drowned in noise" and thus fail to be detected as statistically significant by a statistical test. A large null P value simply flags the data as not being unusual if all the assumptions used to compute it (including the test hypothesis) were correct; but the same data will also not be unusual under many other models and hypotheses besides the null. Again, one must look at the confidence interval to determine whether it includes effect sizes of importance.
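
Misinterpretations 7 and 8 can be contrasted in a short Python sketch. Both studies are invented for illustration, and the z test with a known common standard deviation, together with a 95% interval of estimate ± 1.96 standard errors, is a simplifying assumption rather than the paper's own method.

```python
import math

def z_test_with_ci(diff, sd, n_per_group):
    """Two-sided P value and 95% CI for a difference in means,
    assuming a known common standard deviation (z test)."""
    se = sd * math.sqrt(2.0 / n_per_group)
    z = diff / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return p, (diff - 1.96 * se, diff + 1.96 * se)

# Hypothetical large study: tiny difference, 100,000 subjects per group.
p_large_study, ci_large_study = z_test_with_ci(diff=0.02, sd=1.0, n_per_group=100_000)

# Hypothetical small study: large difference, 8 subjects per group.
p_small_study, ci_small_study = z_test_with_ci(diff=0.80, sd=1.0, n_per_group=8)

print(f"large study: P = {p_large_study:.2g}, 95% CI = {ci_large_study}")
print(f"small study: P = {p_small_study:.2g}, 95% CI = {ci_small_study}")
```

In this sketch the large study is "significant" even though its whole interval covers only very small effects, while the small study is "nonsignificant" even though its interval reaches effects well above the observed 0.80.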

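Misinterpretation 9:

The P value is the probability of our sample occurring if H0 is true. For example, P = 0.05 means that the observed statistic would occur 5% of the time when H0 is true. No!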

The P value is the chance of our data occurring if the test hypothesis is true; for example, P = 0.05 means that the observed association would occur only 5 % of the time under the test hypothesis.

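Interpretation 9:

P = 0.05 does not mean that the statistic we observed has probability 0.05. It means that the observed value and all values more extreme than it together have probability 5%.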

No! The P value refers not only to what we observed, but also observations more extreme than what we observed (where "extremity" is measured in a particular way). And again, the P value refers to a data frequency when all the assumptions used to compute it are correct. In addition to the test hypothesis, these assumptions include randomness in sampling, treatment assignment, loss, and missingness, as well as an assumption that the P value was not selected for presentation based on its size or some other aspect of the results.
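
The gap between "the probability of what we observed" and "the probability of what we observed or something more extreme" is easy to see with a coin. This is a minimal sketch with a made-up example (15 heads in 20 tosses, testing a fair coin). Doubling the one-sided tail to get the two-sided P value is one common convention, and it is exact here only because the distribution is symmetric under H0.

```python
import math

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical data: 15 heads in 20 tosses; H0 says the coin is fair.
n, observed = 20, 15

# Probability of exactly the observed result under H0.
prob_exact = binom_pmf(observed, n, 0.5)

# One-sided P value: the observed result or anything more extreme.
p_one_sided = sum(binom_pmf(k, n, 0.5) for k in range(observed, n + 1))

# Two-sided P value by doubling (the distribution is symmetric under H0).
p_two_sided = min(1.0, 2 * p_one_sided)

print(f"P(exactly 15 heads)  = {prob_exact:.4f}")
print(f"one-sided P value    = {p_one_sided:.4f}")
print(f"two-sided P value    = {p_two_sided:.4f}")
```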

Misinterpretation 10:

If you reject H0 because P ≤ 0.05, the chance that you have made a Type I error is 5%. No!

If you reject the test hypothesis because P ≤ 0.05, the chance you are in error (the chance your "significant finding" is a false positive) is 5 %.

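Interpretation 10:

Why? Suppose H0 is in fact true. If you reject it, the chance that you are in error is 100%, not 5%. The 5% only describes how often you would reject a true H0 over many uses of the test (about 5 rejections in 100 such tests); it is not the probability of error for your single test.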

No! To see why this description is false, suppose the test hypothesis is in fact true. Then, if you reject it, the chance you are in error is 100 %, not 5 %. The 5 % refers only to how often you would reject it, and therefore be in error, over very many uses of the test across different studies when the test hypothesis and all other assumptions used for the test are true. It does not refer to your single use of the test, which may have been thrown off by assumption violations as well as random errors. This is yet another version of misinterpretation #1.
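
The "5% over very many uses" reading can be checked with a small simulation. This is only a sketch under stated assumptions: every simulated study has a true test hypothesis, the test statistic is exactly standard normal, and the seed and the number of studies are arbitrary choices.

```python
import math
import random

rng = random.Random(12345)  # arbitrary fixed seed, for reproducibility
n_studies = 100_000

rejections = 0
for _ in range(n_studies):
    z = rng.gauss(0.0, 1.0)               # H0 is true in every simulated study
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided P value
    if p <= 0.05:
        rejections += 1

# How often the test rejects across many studies in which H0 is true.
rejection_rate = rejections / n_studies

# Every one of those rejections is an error, because H0 was true each time.
false_rejections = rejections

print(f"share of studies rejecting H0: {rejection_rate:.3f}")
print(f"rejections that are errors:    {false_rejections} of {rejections}")
```

The rejection rate lands near 5%, yet within the rejecting studies the error rate is 100%, which is the point of the paper's counterexample.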

Misinterpretation 11:

P = 0.05 and P < 0.05 are the same thing. No!

P = 0.05 and P < 0.05 mean the same thing. No!

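Interpretation 11:

This is like saying that "height = 2 m" and "height ≤ 2 m" are the same thing. "Height = 2 m" covers very few people, and they would be considered tall, whereas "height ≤ 2 m" covers almost everyone. Likewise, P = 0.05 is a borderline result in terms of statistical significance, whereas P ≤ 0.05 lumps borderline results together with results that are highly incompatible with H0.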

This is like saying reported height = 2 m and reported height ≤ 2 m are the same thing: "height = 2 m" would include few people and those people would be considered tall, whereas "height ≤ 2 m" would include most people including small children. Similarly, P = 0.05 would be considered a borderline result in terms of statistical significance, whereas P < 0.05 lumps borderline results together with results very incompatible with the model (e.g., P = 0.0001) thus rendering its meaning vague, for no good purpose.

Misinterpretation 12:

P values can be reported as > 0.05 or < 0.05. No!

P values are properly reported as inequalities (e.g., report "P < 0.02" when P = 0.015 or report "P > 0.05" when P = 0.06 or P = 0.70). No!

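Interpretation 12:

This is a very bad habit because it makes it hard for readers to interpret the statistical result accurately. An inequality is justifiable only when the P value is very small (for example, under 0.001), since there is little point in distinguishing among such tiny P values.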

This is bad practice because it makes it difficult or impossible for the reader to accurately interpret the statistical result. Only when the P value is very small (e.g., under 0.001) does an inequality become justifiable: There is little practical difference among very small P values when the assumptions used to compute P values are not known with enough certainty to justify such precision, and most methods for computing P values are not numerically accurate below a certain point.

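Misinterpretation 13:

Statistical significance is a property of the phenomenon under study, and so hypothesis tests can detect significance. No!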

Statistical significance is a property of the phenomenon being studied, and thus statistical tests detect significance. No!

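Interpretation 13:

This misinterpretation arises because many people use the P value to split findings in two: a difference or no difference. That split is only a dichotomy of the test result (whether P falls below the chosen cutoff). It is not a dichotomy of the real phenomenon, and many real phenomena have no sharp boundary.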

This misinterpretation is promoted when researchers state that they have or have not found "evidence of" a statistically significant effect. The effect being tested either exists or does not exist. "Statistical significance" is a dichotomous description of a P value (that it is below the chosen cut-off) and thus is a property of a result of a statistical test; it is not a property of the effect or population being studied.

Misinterpretation 14:

Researchers should always use two-sided P values. No!

One should always use two-sided P values. No!

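Interpretation 14:

A two-sided P value tests whether the effect equals the H0 value, being neither above nor below it. When the hypothesis of scientific or practical interest is one-sided, however, a one-sided P value is appropriate. For example, to ask whether a new drug is at least as good as the standard drug, a one-sided test is the better choice.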

Two-sided P values are designed to test hypotheses that the targeted effect measure equals a specific value (e.g., zero), and is neither above nor below this value. When, however, the test hypothesis of scientific or practical interest is a one-sided (dividing) hypothesis, a one-sided P value is appropriate. For example, consider the practical question of whether a new drug is at least as good as the standard drug for increasing survival time. This question is one-sided, so testing this hypothesis calls for a one-sided P value. Nonetheless, because two-sided P values are the usual default, it will be important to note when and why a one-sided P value is being used instead.
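
As a rough illustration of how the two kinds of P value relate, here is a Python sketch for a normal test statistic. The z value of 1.8 and the choice of test hypothesis ("the effect is at most zero", so that large positive z counts against it) are assumptions made up for this example.

```python
import math

def two_sided_p(z):
    """P value for the hypothesis that the effect equals the test value."""
    return math.erfc(abs(z) / math.sqrt(2))

def upper_tail_p(z):
    """One-sided P value for the hypothesis that the effect is at most the
    test value: probability of a z statistic at least this large."""
    return 0.5 * math.erfc(z / math.sqrt(2))

z = 1.8  # hypothetical: estimate lies 1.8 standard errors above the test value

p_two = two_sided_p(z)                     # about 0.072
p_one = upper_tail_p(z)                    # about 0.036
p_one_opposite_sign = upper_tail_p(-z)     # about 0.964

print(f"two-sided P:                 {p_two:.3f}")
print(f"one-sided P (z = +1.8):      {p_one:.3f}")
print(f"one-sided P (z = -1.8):      {p_one_opposite_sign:.3f}")
```

The one-sided value is half the two-sided value only when the estimate falls on the side that counts against the test hypothesis; with the opposite sign it is close to 1.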

Misinterpretation 15:

If different studies test the same H0 and their P values are all greater than 0.05, the overall evidence supports H0. No!

When the same hypothesis is tested in different studies and none or a minority of the tests are statistically significant (all P > 0.05), the overall evidence supports the hypothesis.

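Interpretation 15:

This belief shows up often in literature reviews, and it reflects researchers' tendency to overestimate statistical power. In reality, studies that are individually nonsignificant can tell a different story when combined. For example, if five studies each have P = 0.10 and we combine them with the Fisher formula, the overall P value is about 0.01. So a lack of significance in the individual studies does not mean a lack of significance overall.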

No! This belief is often used to claim that a literature supports no effect when the opposite is the case. It reflects a tendency of researchers to "overestimate the power of most research" [89]. In reality, every study could fail to reach statistical significance and yet when combined show a statistically significant association and persuasive evidence of an effect. For example, if there were five studies each with P = 0.10, none would be significant at 0.05 level; but when these P values are combined using the Fisher formula [9], the overall P value would be 0.01. There are many real examples of persuasive evidence for important effects when few studies or even no study reported "statistically significant" associations [90, 91]. Thus, lack of statistical significance of individual studies should not be taken as implying that the totality of evidence supports no effect.
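
The five-study example can be reproduced in a few lines of Python. This sketch assumes the usual form of Fisher's method for independent studies: the statistic is -2 times the sum of the log P values, referred to a chi-square distribution with 2k degrees of freedom. The function name `fisher_combined_p` is my own, and the tail area uses the closed form that exists when the degrees of freedom are even.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent P values with Fisher's method.

    Returns the chi-square statistic (2k degrees of freedom for k studies)
    and its upper-tail probability.
    """
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    # Chi-square upper tail for even degrees of freedom 2k:
    # exp(-x/2) * sum_{i<k} (x/2)^i / i!
    tail = math.exp(-half) * sum(half**i / math.factorial(i) for i in range(k))
    return statistic, tail

statistic, combined_p = fisher_combined_p([0.10] * 5)
print(f"chi-square = {statistic:.2f} on 10 df, combined P = {combined_p:.4f}")
```

The combined P value comes out at roughly 0.011, which rounds to the 0.01 quoted in the paper.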

Misinterpretation 16:

If two studies give P values on opposite sides of 0.05, one above and one below, their results conflict. No!

When the same hypothesis is tested in two different populations and the resulting P values are on opposite sides of 0.05, the results are conflicting. No!

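Interpretation 16:

Statistical tests are sensitive to differences between study populations, such as sample size, that have nothing to do with whether the results agree. So two studies can yield very different P values and still be entirely consistent. For example, take two randomized trials, A with a standard error of 2 and B with a standard error of 1, both observing an effect of 3. The P value is 0.13 for A and 0.003 for B, yet this does not mean the two studies conflict. Differences between results should be assessed directly, for example with a confidence interval and P value for the difference itself (an analysis of heterogeneity, interaction, or modification).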

Statistical tests are sensitive to many differences between study populations that are irrelevant to whether their results are in agreement, such as the sizes of compared groups in each population. As a consequence, two studies may provide very different P values for the same test hypothesis and yet be in perfect agreement (e.g., may show identical observed associations). For example, suppose we had two randomized trials A and B of a treatment, identical except that trial A had a known standard error of 2 for the mean difference between treatment groups whereas trial B had a known standard error of 1 for the difference. If both trials observed a difference between treatment groups of exactly 3, the usual normal test would produce P = 0.13 in A but P = 0.003 in B. Despite their difference in P values, the test of the hypothesis of no difference in effect across studies would have P = 1, reflecting the perfect agreement of the observed mean differences from the studies. Differences between results must be evaluated directly, for example by estimating and testing those differences to produce a confidence interval and a P value comparing the results (often called analysis of heterogeneity, interaction, or modification).
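
The numbers in this example can be checked with a short Python sketch. It assumes independent trials with known standard errors and takes the "test of no difference in effect across studies" to be a normal test on the difference between the two estimates. The function names are illustrative only.

```python
import math

def two_sided_p(z):
    """Two-sided P value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def compare_trials(diff_a, se_a, diff_b, se_b):
    """P value of each trial against 'no effect', plus the P value for
    'both trials estimate the same effect' (independent trials)."""
    p_a = two_sided_p(diff_a / se_a)
    p_b = two_sided_p(diff_b / se_b)
    se_of_gap = math.sqrt(se_a**2 + se_b**2)
    p_heterogeneity = two_sided_p((diff_a - diff_b) / se_of_gap)
    return p_a, p_b, p_heterogeneity

# Trial A: difference 3, standard error 2. Trial B: difference 3, standard error 1.
p_a, p_b, p_het = compare_trials(3.0, 2.0, 3.0, 1.0)
print(f"P in A = {p_a:.2f}, P in B = {p_b:.3f}, heterogeneity P = {p_het:.2f}")
```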

Misinterpretation 17:

If hypothesis tests in two populations give the same P value, the results agree. No!

When the same hypothesis is tested in two different populations and the same P values are obtained, the results are in agreement. No!

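Interpretation 17:

As with the previous misinterpretation, studies differ in features such as sample size, so their standard errors differ. Even identical P values do not show that two studies agree; the effect sizes may be quite different. Take two randomized trials: A observes a difference of 3 with a standard error of 1, and B observes a difference of 12 with a standard error of 4. Both give P = 0.003, yet the test of no difference in effect between the studies gives P = 0.03, so the results are in fact quite different.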

Again, tests are sensitive to many differences between populations that are irrelevant to whether their results are in agreement. Two different studies may even exhibit identical P values for testing the same hypothesis yet also exhibit clearly different observed associations. For example, suppose randomized experiment A observed a mean difference between treatment groups of 3.00 with standard error 1.00, while B observed a mean difference of 12.00 with standard error 4.00. Then the standard normal test would produce P = 0.003 in both; yet the test of the hypothesis of no difference in effect across studies gives P = 0.03, reflecting the large difference (12.00 - 3.00 = 9.00) between the mean differences.
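
Here is the same arithmetic as a Python sketch, assuming the two experiments are independent so that the standard error of the gap between them is the square root of the sum of the squared standard errors. Variable names are my own.

```python
import math

def two_sided_p(z):
    """Two-sided P value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

diff_a, se_a = 3.0, 1.0    # experiment A
diff_b, se_b = 12.0, 4.0   # experiment B

# Each experiment tested against "no effect": both z statistics equal 3.
p_a = two_sided_p(diff_a / se_a)
p_b = two_sided_p(diff_b / se_b)

# Test of "no difference in effect across studies".
gap = diff_b - diff_a                    # 9.0
se_gap = math.sqrt(se_a**2 + se_b**2)    # sqrt(17)
p_het = two_sided_p(gap / se_gap)

print(f"P in A = {p_a:.3f}, P in B = {p_b:.3f}, heterogeneity P = {p_het:.2f}")
```

Identical P values, yet the direct comparison flags the two estimates as clearly different.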

Misinterpretation 18:

If one study yields a small P value, the next study will probably yield a small P value too. No!

If one observes a small P value, there is a good chance that the next study will produce a P value at least as small for the same hypothesis. No!

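Interpretation 18:

The situation is similar: two studies are carried out in different settings. Even when two studies are very similar and all assumptions are the same, it is hard to reproduce the earlier result. Why? Even if a real difference exists, a new sample still has a substantial chance of being nonsignificant (this depends on statistical power), whether because of sample size or the luck of the draw.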

This is false even under the ideal condition that both studies are independent and all assumptions including the test hypothesis are correct in both studies. In that case, if (say) one observes P = 0.03, the chance that the new study will show P ≤ 0.03 is only 3 %; thus the chance the new study will show a P value as small or smaller (the "replication probability") is exactly the observed P value! If on the other hand the small P value arose solely because the true effect exactly equaled its observed estimate, there would be a 50 % chance that a repeat experiment of identical design would have a larger P value [37]. In general, the size of the new P value will be extremely sensitive to the study size and the extent to which the test hypothesis or other assumptions are violated in the new study [86]; in particular, P may be very small or very large depending on whether the study and the violations are large or small.
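
Both claims (3% and 50%) can be checked by simulation. The sketch below assumes a standard normal test statistic with unit standard error in the repeat study. The seed and the number of repeats are arbitrary, and the variable names are invented for this example.

```python
import random
from statistics import NormalDist

std_normal = NormalDist()  # standard normal distribution

def two_sided_p(z):
    """Two-sided P value for a standard normal test statistic."""
    return 2.0 * (1.0 - std_normal.cdf(abs(z)))

p_first = 0.03
z_first = std_normal.inv_cdf(1.0 - p_first / 2.0)  # z statistic giving P = 0.03

rng = random.Random(2016)  # arbitrary fixed seed, for reproducibility
n_repeats = 100_000

# Case 1: the test hypothesis is true, so the repeat z statistic is N(0, 1).
share_as_small = sum(
    two_sided_p(rng.gauss(0.0, 1.0)) <= p_first for _ in range(n_repeats)
) / n_repeats

# Case 2: the true effect equals the first estimate, so the repeat z is N(z_first, 1).
share_larger = sum(
    two_sided_p(rng.gauss(z_first, 1.0)) > p_first for _ in range(n_repeats)
) / n_repeats

print(f"H0 true:         share of repeats with P <= 0.03: {share_as_small:.3f}")
print(f"effect = estimate: share of repeats with larger P: {share_larger:.3f}")
```

The first share comes out close to 0.03 and the second close to 0.50, in line with the two cases described in the paper.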

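III. Teacher Zheng's Afterword to the Translation

1. To be honest, I only half understand this myself. If you are interested, take your time digesting the original text.

2. The masters' level of insight is probably hard for most of us to digest right away, but do not conclude that you are not up to it.

3. In my view, although calls to abandon the P value are louder than ever, the P value still has the final word, and no one can do without it.

4. Tomorrow I will share excerpts on how Professor Fang Jiqian, a leading authority in biostatistics in China, understands the P value. Stay tuned!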