Some decades ago, correcting for multiple testing became popular. And don’t we all indulge in running lots of tests to compare groups at baseline, examine correlations between various measures, or evaluate outcomes at different end points? Nowadays almost no research meeting passes without the big question being raised by a PhD candidate or by a senior researcher pressed for an intelligent remark. There are two rivalling schemes: the classical view and rational analysis (Feise, 2002).

To Bonferroni

Classicists argue that correction for multiple testing is mandatory. If the null hypothesis (H0 = nil) is true, then we expect that no more than 1 in 20 tests will show a statistically significant difference by chance. This is the type I error rate, or α = 5% (P < 0.05). When n independent tests are performed, the chance of at least one test being significant (a false positive) is 1 − (1 − α)^n; with 20 tests that chance is 64%. Correcting for multiple testing, e.g. with the Bonferroni adjustment, ensures that the study-wide error rate remains at 0.05. The adjusted significance level is 1 − (1 − α)^(1/n), or approximately α/n (0.0025 with 20 tests). Because the Bonferroni correction is rather conservative, alternative procedures have been suggested. A well-known alternative is the Holm–Bonferroni method, a sequential procedure: when testing 10 hypotheses, first reject the one with the smallest P-value if it is smaller than α/10; then look at the hypothesis with the next smallest P-value and reject it if that value is smaller than α/9; and so on. Stop at the first hypothesis that is not rejected. However, these more sophisticated approaches do not eliminate the basic objections of the rivalling scheme.
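
The arithmetic above can be sketched in a few lines of Python. The P-values below are hypothetical, chosen only to show where the two procedures diverge; the functions are minimal illustrations, not a statistics library.

```python
def fwer(alpha, n):
    """Chance of at least one false positive among n independent tests."""
    return 1 - (1 - alpha) ** n

print(round(fwer(0.05, 20), 2))  # -> 0.64, the 64% from the text

# Bonferroni: compare every P-value against alpha / n.
def bonferroni_rejects(p_values, alpha=0.05):
    n = len(p_values)
    return [p < alpha / n for p in p_values]

# Holm-Bonferroni: step down through the sorted P-values, comparing the
# k-th smallest against alpha / (n - k); stop at the first non-rejection.
def holm_rejects(p_values, alpha=0.05):
    n = len(p_values)
    rejected = [False] * n
    order = sorted(range(n), key=lambda i: p_values[i])
    for k, i in enumerate(order):
        if p_values[i] < alpha / (n - k):
            rejected[i] = True
        else:
            break
    return rejected

p = [0.001, 0.011, 0.02, 0.04, 0.30]
print(bonferroni_rejects(p))  # only 0.001 clears 0.05/5 = 0.01
print(holm_rejects(p))        # 0.001 clears 0.05/5 and 0.011 clears 0.05/4
```

Note how Holm rejects one hypothesis more than Bonferroni on the same data, which is why it is the less conservative of the two.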

Not to Bonferroni

Epidemiologists, or rationalists, argue that the Bonferroni adjustment defies common sense and increases type II errors (the chance of false negatives). Rothman (1990) states that "no adjustments are needed". Common sense tells us that it doesn’t make sense to evaluate a statistic differently depending on the number of tests in a particular study. The same difference in means (for instance, a 10-point difference between men and women on a scale measuring psychiatric symptoms) would be considered non-significant in a study with many comparisons and statistically significant in a similar study focussing on a few hypotheses. One practical issue when correcting for multiple comparisons is which statistical tests are included in the ‘study-wide’ or ‘family-wise’ error rate. Is the correction based simply on all the tests in a given study? Or only on the number of tests that were reported? Or should the number of tests include the ‘family of studies’ on the same hypotheses, or by the same investigator, or published in the same journal? Apart from common sense, the main objection is that Bonferroni-type methods inflate type II errors: decreasing the type I error rate increases the probability of not rejecting the null hypothesis when an alternative hypothesis is true. Because many null hypotheses are implausible nil hypotheses (for instance: no gender differences), the chance of getting a false positive is nil to begin with, and any correction for multiple testing unnecessarily increases the type II error rate (Cohen, 1994).
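
The inflation of type II errors is easy to make concrete. The sketch below uses hypothetical numbers (a true standardized effect of 0.5 and 50 participants per group in a two-sided two-sample z-test) and only the standard library; it is an illustration of the trade-off, not a power-analysis tool.

```python
from statistics import NormalDist

def beta(alpha, effect=0.5, n_per_group=50):
    """Type II error rate of a two-sided two-sample z-test.

    effect is the standardized difference in means; the small
    probability of rejecting in the wrong tail is ignored.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # critical value
    ncp = effect * (n_per_group / 2) ** 0.5        # noncentrality, here 2.5
    power = 1 - NormalDist().cdf(z_crit - ncp)
    return 1 - power

print(round(beta(0.05), 2))       # uncorrected: beta is about 0.29
print(round(beta(0.05 / 20), 2))  # Bonferroni for 20 tests: about 0.70
```

With the same data, the Bonferroni-corrected test misses this real effect roughly twice as often as the uncorrected one, which is exactly the rationalists' complaint.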

The third road to multiple testing

An alternative scheme is to differentiate between study objectives. Mayo & Cox (2006) propose to consider the context of the study (an inferential goal versus a behavioristic goal or decision making) together with a good understanding of the possible difficulties in the problem at hand, or in the inference to be drawn from the facts. To simplify this third approach, I use a typology based on two study attributes: correlated versus independent variables, and confirmatory analyses versus exploratory research.

First, when there are no explicit hypotheses and the tests are correlated (for instance, different items on types of referral), the chance of a false positive is high and a correction for multiple testing seems appropriate. The practical issue of which tests to include is solved, because the tests are correlated for some reason: the null hypothesis "stretches out" over all the relevant items.

Secondly, in the case of independent tests of different domains or aspects, for instance types of psychological and social functioning, there is no need to correct for multiple testing, just as we would not apply any correction if these aspects were tested in separate studies.

Thirdly, if many explicit hypotheses are explored (as in some genetic studies), then it could be efficient to use a smaller P-value as a ‘crude surprise index’ (the expression is used by Senn, 2001).

Finally, when there are several explicit hypotheses on different topics, it could be wise not to correct for multiple testing, but instead to combine outcomes, or to prioritize one primary outcome (and hold on to the power of that test) while exploring the results of the secondary research questions (Feise, 2002; Streiner & Norman, 2011).

So ‘to Bonferroni or not to Bonferroni’: that is a matter of study design. In practice, however, the combinations in this typology are endpoints on two continua: the study’s objective and the difficulties in the dataset. In short: should we correct for multiple testing? Most of the time we should not, but it depends.