I don’t really want to write again and again about the New England Journal of Medicine and p-values for baseline characteristics, but just to continue the story from previous posts (here and here)…
I got in touch with the journal to raise the issue. To recap: in a recent submission of a clinical trial report, we were asked to add a marker indicating which comparisons of baseline characteristics had p < 0.05. In my view that is substantially worse than including the exact p-value, because it looks exactly like a significance test and is extremely likely to be interpreted as one. Just to be clear: if the randomisation has worked, some comparisons of baseline characteristics will have p < 0.05 purely by chance, and that doesn’t mean there is anything special or different about those variables. It doesn’t make sense to talk about “significant” differences because there is no null hypothesis to test; finding p < 0.05 cannot (or shouldn’t) lead us to “reject the null hypothesis”, because we already know the null hypothesis is true. Anyway, we complied (not my choice!), so that’s now in the paper.
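The point about chance “significance” is easy to demonstrate. Here is a quick simulation (an illustrative sketch of my own, not anything from the trial in question): both arms are drawn from the same distribution, so randomisation has “worked” by construction and every null hypothesis is true, yet about 5% of baseline comparisons still come out with p < 0.05. I use a two-sample z-test with a normal approximation to keep it dependency-free; with 100 patients per arm that is close enough to a t-test for this purpose.

```python
import math
import random

random.seed(1)

def two_sample_p(x, y):
    """Two-sided p-value from a two-sample z-test (normal approximation,
    adequate for moderate sample sizes)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    z = (mx - my) / math.sqrt(vx / nx + vy / ny)
    return math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - Phi(|z|))

n_tests = 2000    # baseline comparisons across many simulated trials
n_per_arm = 100   # patients per arm
alpha = 0.05

# Both arms come from the SAME distribution, so randomisation has
# "worked" and any p < 0.05 result is pure chance.
hits = 0
for _ in range(n_tests):
    arm_a = [random.gauss(0, 1) for _ in range(n_per_arm)]
    arm_b = [random.gauss(0, 1) for _ in range(n_per_arm)]
    if two_sample_p(arm_a, arm_b) < alpha:
        hits += 1

print(f"fraction with p < {alpha}: {hits / n_tests:.3f}")
```

The observed fraction hovers around 0.05, exactly as expected under a true null: one in twenty baseline characteristics will be flagged as “significant” even when the groups differ only by chance.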
The response from the journal made three points:
- They recognise that p-values for baseline characteristics do not have the same interpretation as p-values for comparisons of outcomes. That’s good, but I suspect the majority of readers don’t understand the distinction.
- If the randomisation has not worked properly, due to error or fraud, significant differences in the table of baseline characteristics might be a sign of this.
- They will include a sentence in the guidance reiterating that differences in baseline characteristics occur at random if the randomisation works properly.
So, the main justification for including p-values for baseline characteristics is that they may reveal problems in randomisation. I’m not sure this holds up. You might get more or fewer low p-values if something has gone wrong, or if somebody is committing fraud, but that’s by no means guaranteed. I’d have thought that problems in randomisation would generally show up in other ways – imbalances in numbers between groups or subgroups, or obvious differences in variables between the trial arms. You wouldn’t rely on significance tests to detect those sorts of problems.
What do others think? Leave a comment below!