In many papers, Table 1 summarizes the sample characteristics. In reports of randomized clinical trials (RCTs), and in other comparison scenarios, it is common practice to test for differences between treatment groups at baseline before analysing the effects on outcome variables. However, this common practice goes against expert advice.

The CONsolidated Standards of Reporting Trials (CONSORT) statement, endorsed by over 50% of PubMed journals, states that baseline information is most efficiently presented as descriptive statistics. Likewise, in “Strengthening the Reporting of Observational Studies in Epidemiology (STROBE): Explanation and Elaboration”, Vandenbroucke et al (2007) state that “significance tests should be avoided in descriptive tables. Also, P values are not an appropriate criterion for selecting which confounders to adjust for in analysis”. In “CONSORT 2010 Explanation and Elaboration”, Moher et al (2010) sigh that “unfortunately significance tests of baseline differences are still common”.

Referring to Senn (1995), the CONSORT group considers that tests of baseline homogeneity are not necessarily wrong, just illogical, superfluous, and potentially misleading. So in three U-words: Unsound, Useless, and Unclarifying. In my view, such practice is necessarily wrong in the sense that we should not be teaching regular students, or PhD students, this kind of “Mindless Statistics” (Gigerenzer, 2004).


Why are tests of baseline differences unsound and useless?

The idea of tests of baseline homogeneity in RCT reports (comparing group means and calculating standard errors) is unsound because we have not been sampling from a population or from different populations. Most clinical trials are convenience samples: people are invited via advertisements, mail, or posters to participate, or patients referred by their general practitioner are screened. Inclusion in such samples is a complex process and can change over time. Therefore, the relationship between the sample and the underlying population of interest is not straightforward.

Tests of baseline differences are not directed at specific hypotheses about population characteristics; they are aimed at evaluating the quality of the allocation procedure. The null hypothesis is that we expect no difference because of randomisation; the alternative hypothesis is that the researcher applied an inappropriate or fraudulent allocation procedure.

  • Non-significant results cannot prove that patients were allocated randomly. The P-value is calculated assuming homogeneity is true: pr(D|H), which is different from pr(H|D). In other words: the probability of Death-by-Hanging is not the probability of Hanging-as-cause-of-Death.
  • Statistically significant baseline differences will occur even if the null hypothesis is true (randomisation was properly carried out). Moher et al (2010) state that “significance tests assess the probability that observed baseline differences could have occurred by chance; however, we already know that any differences are caused by chance.” This formulation is somewhat awkward because significance tests do not calculate error probabilities. The P-value of a baseline homogeneity test gives us the probability of the observed differences, or larger ones, if the null hypothesis is true. This is not an error rate. With multiple baseline tests per trial, the proportion of significant results is far larger than the nominal 5% Type I error rate. If we take a thousand samples of 150 cases, divided over two groups, with ten items consisting of random numbers, in about 40% of the samples one or more of the ten t-tests on "pure noise" will be statistically significant (see Figure 1).
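The roughly 40% figure is easy to check by simulation (1 − 0.95^10 ≈ 0.40). Below is a minimal sketch; the standard-normal noise and all variable names are my own assumptions, not taken from the original figure:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_trials = 1000   # simulated trials
n_cases = 150     # cases per trial, split over two groups of 75
n_items = 10      # baseline variables, all pure noise

trials_with_hit = 0
for _ in range(n_trials):
    group = rng.permutation(np.repeat([0, 1], n_cases // 2))
    data = rng.standard_normal((n_cases, n_items))
    # t-test each "baseline variable" between the two randomised groups
    pvals = [stats.ttest_ind(data[group == 0, j], data[group == 1, j]).pvalue
             for j in range(n_items)]
    if min(pvals) < 0.05:
        trials_with_hit += 1

print(f"Trials with at least one significant baseline test: "
      f"{trials_with_hit / n_trials:.0%}")
```

The printed proportion hovers around 40%, even though every "baseline difference" is random noise by construction.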

So how useful are baseline tests that cannot prove randomness and will wrongly show significant differences at high rates?


Why are tests of baseline differences useless and ‘unclarifying’?

Again, Moher et al (2010) seem to be off balance when they argue that “comparisons at baseline should be based on consideration of the prognostic strength of the variables measured and the size of any chance imbalances that have occurred.” This could give the wrong impression that we need balance, which is one of the many myths of randomisation (Senn, 2013).

  • Balance contributes to the efficiency of a trial, not to the validity of its conclusions. When comparing an intervention with treatment as usual, it helps when outcome differences are not complicated by different male and female proportions in the intervention and control groups. Imbalances will increase the variance of, and the confidence intervals around, the estimated intervention effect. However, an analysis of covariance with gender as covariate will adjust for these baseline differences.
  • A common misunderstanding is that balanced variables can be discarded in a multiple-variable analysis. But, as an example, even when gender is balanced, men in the treatment condition could benefit more from the intervention than women in the treatment condition, and more than both men and women in the control condition. Even more complicated interaction effects could be relevant to the research question.
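The efficiency point in the first bullet can be illustrated with a small simulation: adjusting for a strong prognostic baseline covariate (a continuous stand-in for the gender example) shrinks the standard error of the estimated intervention effect, while the unadjusted comparison remains valid but noisier. A sketch with invented effect sizes:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
treat = rng.permutation(np.repeat([0.0, 1.0], n // 2))
x = rng.standard_normal(n)                      # prognostic baseline covariate
y = 0.5 * treat + 1.5 * x + rng.standard_normal(n)  # true effect = 0.5

def ols_se(X, y):
    """Return OLS coefficient estimates and their standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta, np.sqrt(np.diag(cov))

ones = np.ones(n)
b_raw, se_raw = ols_se(np.column_stack([ones, treat]), y)     # unadjusted
b_adj, se_adj = ols_se(np.column_stack([ones, treat, x]), y)  # ANCOVA-style

print(f"unadjusted effect SE: {se_raw[1]:.3f}, adjusted SE: {se_adj[1]:.3f}")
```

The covariate soaks up outcome variance, so the adjusted standard error is markedly smaller, regardless of whether the covariate happened to be balanced at baseline.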

So important prognostic variables should be identified in the trial plan and fitted in an analysis of covariance (Senn, 1995).


What next?

P-values as a “surprise index” (Senn, 2001) can be used by a trial monitor to explore the bumps and ridges of the randomisation process. Forget multiple testing: use logistic regression analysis as a discriminant analysis, with treatment versus control group as the “dependent” variable and relevant covariates as “predictors”. This cannot demonstrate that patients were allocated randomly, but it could signal problems in the research protocol or clumsy attempts at cheating (Senn, 1995). However, we will often get a false alarm: when we sample ten items consisting of random numbers and create a dichotomous dependent variable, backward stepwise logistic regression analyses will often show one or more statistically significant predictors (see figure). Anyway, there is no place for these analyses in Table 1 of the RCT report.
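That false-alarm rate can be sketched with a simplified backward elimination on pure noise. The hand-rolled Newton-Raphson logistic fit with Wald tests below stands in for a full stepwise procedure, and the 0.05 threshold, sample sizes, and names are my own assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

def logit_pvalues(X, y, iters=25):
    """Fit logistic regression by Newton-Raphson; return Wald p-values."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ beta))
        H = X.T @ (X * (p * (1 - p))[:, None])   # observed information
        beta += np.linalg.solve(H, X.T @ (y - p))
    p = 1 / (1 + np.exp(-X @ beta))
    cov = np.linalg.inv(X.T @ (X * (p * (1 - p))[:, None]))
    z = beta / np.sqrt(np.diag(cov))
    return 2 * stats.norm.sf(np.abs(z))

n, k, sims, hits = 150, 10, 200, 0
for _ in range(sims):
    y = rng.permutation(np.repeat([0.0, 1.0], n // 2))  # "group" as outcome
    X = rng.standard_normal((n, k))                     # ten noise "covariates"
    keep = list(range(k))
    while keep:                         # backward elimination: drop worst p
        pv = logit_pvalues(X[:, keep], y)
        worst = int(np.argmax(pv))
        if pv[worst] < 0.05:            # all remaining predictors significant
            break
        keep.pop(worst)
    hits += bool(keep)

print(f"simulations retaining >=1 'significant' predictor: {hits / sims:.0%}")
```

(The intercept is omitted here because the groups are exactly balanced, so it is essentially zero.) A substantial share of the simulations ends with at least one "significant" predictor, even though every covariate is noise.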

Table 1 should summarise the sample characteristics regardless of the statistical significance of any baseline differences. The CONSORT statement advocates a table of descriptive statistics of the baseline characteristics. Continuous variables can be summarised for each group by the mean and standard deviation (not standard errors or confidence intervals, which concern the population, not the sample). For variables that are not normally distributed, report the median and a centile range (such as the 25th and 75th centiles). Ordered categories should be summarised using numbers and proportions for each category. That’s all folks.
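Such a descriptive Table 1 is straightforward to produce. A sketch with simulated data; the variable names and distributions are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 150
# hypothetical trial data, one row per participant
df = pd.DataFrame({
    "group": rng.permutation(np.repeat(["treatment", "control"], n // 2)),
    "age": rng.normal(50, 10, n),              # roughly normal -> mean (SD)
    "biomarker": rng.lognormal(1.0, 0.8, n),   # skewed -> median (centiles)
    "stage": rng.choice(["I", "II", "III"], n) # ordered categories -> n (%)
})

for name, g in df.groupby("group"):
    print(f"--- {name} (n={len(g)}) ---")
    print(f"age: mean {g['age'].mean():.1f} (SD {g['age'].std():.1f})")
    q1, med, q3 = g["biomarker"].quantile([0.25, 0.5, 0.75])
    print(f"biomarker: median {med:.2f} (25th-75th centile {q1:.2f}-{q3:.2f})")
    for cat, c in g["stage"].value_counts().sort_index().items():
        print(f"stage {cat}: {c} ({c / len(g):.0%})")
```

No P-values anywhere: each summary matches the variable's distribution, and the reader can judge any imbalance directly from the descriptive statistics.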