Van den Bosch and colleagues conducted a pilot study to prepare a randomised clinical trial comparing a short intensive course of dialectical behaviour therapy (DBT) with standard outpatient treatment. For 39 female patients with borderline problems, information was collected on (para)suicidal behaviour, drop-out, severity of borderline problems and quality of life. The English abstract suggests that the authors used participant observation as a data collection method: “We participated in 3-month-long inpatient dbt programme” (something was added in translation). Results showed that the severity of borderline problems was significantly reduced, but there was no significant reduction in (para)suicidal behaviour. However, what happened to the dropouts?
What about good intentions?
Van den Bosch and colleagues report that they carried out “an intention-to-treat analysis”. This approach compares subjects in the groups to which they were originally randomly assigned, and thus requires complete outcome data for all randomised patients (otherwise there would be nothing to compare). However, the pilot study did not randomise patients, and outcome data for the dropouts were not available. For 13 patients, baseline scores were used as estimates of the final outcome (“last observation carried forward”).
In this case no time trend after dropout is to be expected (possibly even a relapse), which makes a baseline-observation-carried-forward (BOCF) approach somewhat reasonable. However, the authors used BOCF imputation for a relatively large number of patients: in a pre-post analysis, change was artificially zero for one third of the patients (13 of 39, 33%), underestimating the standard deviation of the differences and inflating the t-statistic. The Guideline on Missing Data adopted in 2011 by the Committee for Medicinal Products for Human Use (CHMP) states that single imputation can give an artificial impression of precision, and “this possibility should be addressed when results from these analyses are presented.”
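To make the mechanics concrete, here is a small sketch (with hypothetical numbers, not the study's data) of what BOCF imputation does to pre-post change scores:

```python
import numpy as np

# Hypothetical scores for six patients; np.nan marks dropouts without a final score.
t1 = np.array([28.0, 35.0, 31.0, 24.0, 40.0, 30.0])
t2 = np.array([40.0, np.nan, 25.0, np.nan, 50.0, 28.0])

# Baseline observation carried forward: replace missing outcomes by the baseline.
dropped = np.isnan(t2)
t2_bocf = np.where(dropped, t1, t2)

change = t2_bocf - t1
# Change is exactly zero for every imputed case; with these numbers the
# SD of the differences shrinks, which inflates a paired t-statistic.
print(change)
print(round((t2 - t1)[~dropped].std(ddof=1), 2),  # SD of observed changes
      round(change.std(ddof=1), 2))               # SD after BOCF imputation
```

Whether the SD shrinks depends on the data (zeros added far from the mean can also widen it), but when the spread of the observed changes is large relative to the mean change, as here, the imputed zeros pull the SD down.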
Here is a simple simulation to illustrate this point. The left pie chart shows the p-values of paired t-tests for 1000 random samples (n = 39) from normal distributions with the pre-post means and standard deviations reported by Van den Bosch et al. The right pie chart illustrates the effect of substituting baseline scores for the final outcome: the percentage of statistically significant tests increases, and the percentage of p < .01 results roughly doubles.
[Figure: two pie charts of p-values; t1 mean = 30.45 (SD 9.61), t2 mean = 35.81 (SD 8.88). Left: “Sampled” (for 13 cases t2 sampled from the t1 distribution); right: “Carried forward” (for 13 cases t2 = t1).]
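The simulation can be reproduced along these lines (a sketch: the original code and seed are not given, and I assume pre and post scores were drawn independently):

```python
import numpy as np

rng = np.random.default_rng(42)
N, N_CF, REPS = 39, 13, 1000          # sample size, carried-forward cases, replications
M1, S1, M2, S2 = 30.45, 9.61, 35.81, 8.88  # reported pre/post means and SDs
T_CRIT = 2.0244                        # two-sided 5% critical value of t, df = 38

def share_significant(carry_forward):
    """Fraction of paired t-tests with p < .05 over REPS simulated samples."""
    hits = 0
    for _ in range(REPS):
        t1 = rng.normal(M1, S1, N)
        t2 = rng.normal(M2, S2, N)
        if carry_forward:
            t2[:N_CF] = t1[:N_CF]                 # BOCF: change forced to exactly zero
        else:
            t2[:N_CF] = rng.normal(M1, S1, N_CF)  # dropouts unchanged, but still noisy
        d = t2 - t1
        t = d.mean() / (d.std(ddof=1) / np.sqrt(N))
        hits += abs(t) > T_CRIT
    return hits / REPS

print(share_significant(False), share_significant(True))
```

In both scenarios the 13 dropouts show no systematic change, yet the carried-forward version removes their random variation from the difference scores and so rejects the null noticeably more often.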
Thus the p-values will be overly optimistic. However, the authors applied a Bonferroni correction to account for multiple testing, lowering the conventional alpha of 0.05 to 0.003. This is not a very helpful approach in general (read here why) and should certainly be avoided in a pilot study: a pilot is not intended to formally test the primary hypotheses but to explore the implementation problems of the upcoming trial. Maybe in this case the p-value correction counterbalanced the inflated t-statistics. Maybe, who knows? But “two wrongs don't make a right”.
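The power cost of such a correction is easy to quantify. A sketch using a normal approximation for a two-sided test (the standardized effect of 2.0 is an arbitrary illustration, not derived from the study):

```python
from statistics import NormalDist

nd = NormalDist()

def approx_power(delta, alpha):
    # Normal-approximation power of a two-sided test for a
    # standardized effect (noncentrality parameter) delta.
    z = nd.inv_cdf(1 - alpha / 2)
    return 1 - nd.cdf(z - delta) + nd.cdf(-z - delta)

print(round(approx_power(2.0, 0.05), 2))   # conventional alpha
print(round(approx_power(2.0, 0.003), 2))  # Bonferroni-adjusted alpha
```

With these numbers, power drops from roughly 50% to under 20% purely as a consequence of lowering alpha, which is the trade-off Cohen's objection to routine correction points at.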
Post hoc power analysis is of no use
The authors repeatedly state that the statistical tests were under-powered (little chance to reject the no-difference hypothesis even when a real difference exists, i.e. a type II error). However, they also claim that a post hoc power analysis showed that the chance of a type II error stayed within “acceptable limits” (not further specified). But a Bonferroni correction further lowers the power of the statistical tests, which is in fact one of the reasons not to correct for multiple testing (Cohen, 1994). More importantly, a post hoc power analysis presupposes that the no-difference hypothesis is false (power is the probability that a difference will be found when there really is an effect to be detected), and that assumption is not supported by the data.
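There is a further, well-known problem: under a normal approximation, “observed power” (power recomputed at the effect size actually observed) is nothing but a transformation of the p-value, so it cannot stay within “acceptable limits” independently of the test results. A sketch, assuming a two-sided z-test at alpha = 0.05:

```python
from statistics import NormalDist

nd = NormalDist()
Z_CRIT = nd.inv_cdf(0.975)  # two-sided 5% critical value

def observed_power(p_value):
    # Power recomputed at the observed effect size: a one-to-one,
    # decreasing function of the p-value, nothing more.
    z_obs = nd.inv_cdf(1 - p_value / 2)
    return 1 - nd.cdf(Z_CRIT - z_obs) + nd.cdf(-Z_CRIT - z_obs)

for p in (0.01, 0.05, 0.20, 0.50):
    print(p, round(observed_power(p), 2))
```

A result that just misses significance at p = 0.05 always yields an observed power of about 50%, and larger p-values always yield less: the “acceptable limits” claim adds no information beyond the p-values already reported.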
Van den Bosch and colleagues hope that the randomised trial that began in 2012 will reveal whether short-term inpatient therapy leads to better outcomes than standard outpatient treatment. Whatever the results may be, the pilot study may have highlighted some study design problems, but the reported statistical analyses cannot have been very helpful in preparing the clinical trial.