Elephant in the Room (by Banksy)
This violent bias of classical procedures [against the null hypothesis] is not an unmitigated disaster. Many null hypotheses tested by classical procedures are scientifically preposterous, not worthy of a moment's credence even as approximations. If a hypothesis is preposterous to start with, no amount of bias against it can be too great. On the other hand, if it is preposterous to start with, why test it? Ward Edwards, Psychological Bulletin (1965)
Several interesting blog posts, discussions on Twitter and regular media articles (e.g. Times Higher Education) have recently focused on the role of negative (so-called null) findings and the publish or perish culture.
In his blog, Pete gives the following description of how we generally think null findings influence the scientific process:
...if you run a perfectly good, well-designed experiment, but your analysis comes up with a null result, you're much less likely to get it published, or even actually submit it for publication. This is bad, because it means that the total body of research that does get published on a particular topic might be completely unrepresentative of what's actually going on. It can be a particular issue for medical science - say, for example, I run a trial for a new behavioural therapy that's supposed to completely cure anxiety. My design is perfectly robust, but my results suggest that the therapy doesn't work. That's a bit boring, and I don't think it will get published anywhere that's considered prestigious, so I don't bother writing it up; the results just get stashed away in my lab, and maybe I'll come back to it in a few years. But what if labs in other institutions run the same experiment? They don't know I've already done it, so they just carry on with it. Most of them find what I found, and again don't bother to publish their results - it's a waste of time. Except a couple of labs did find that the therapy works. They report their experiments, and now it looks like we have good evidence for a new and effective anxiety therapy, despite the large body of (unpublished) evidence to the contrary.

The hand-wringing about negative or null findings is not new...and worryingly, psychology fares worse than most other disciplines, has done for a long time, and (aside from the hand-wringing) does little to change the situation. For example, see Greenwald's 'Consequences of Prejudice against the Null Hypothesis', published in Psychological Bulletin in 1975. The table below comes from Sterling et al (1995), showing that <0.2% of papers in this sample accepted the null hypothesis (compare with the sample of medical journals below).
Table 1. From Sterling et al 1995
More recently, Fanelli (2010) confirmed the earlier reports that psychology/psychiatry is especially prone to the bias of publishing positive findings. Table 2 below outlines the probability of a paper reporting positive results in various disciplines. It is evident that, compared with every other discipline, psychology fares the worst - being about five times more likely than the baseline (space science) to publish positive results! We might ask: why psychology? And what effect does this have?
Table 2 Psychology/Psychiatry bottom of the league
Issues from Meta-Analysis
It is certainly my experience that negative findings are more commonly published in more medically oriented journals. In this context, the use of meta-analysis becomes very interesting.

The File Drawer Effect and Fail-Safes
Obviously meta-analysis is based on quantitatively summarising the findings that are accessible (which tend, of course, to be those that were published). This raises the so-called file-drawer effect, whereby negative studies may be tucked away in a file drawer because they are viewed as less publishable. It is possible in meta-analysis to statistically estimate the file-drawer effect - the classic and still widely used method is the fail-safe N, originally devised by Rosenthal and later adapted for effect sizes by Orwin, which essentially estimates how many unpublished negative studies would be needed to overturn a significant effect size in a meta-analysis. A marginal effect size may require just one or two unpublished negative studies to overturn it, while a strong effect may require thousands of unpublished negative studies to eliminate the effect.
So, at least we have a method for estimating the potential influence of negative unpublished studies.
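For the curious, here is a rough sketch of what such a calculation looks like in Python. The functions and the numbers fed into them are my own illustrative inventions, not values from any of the papers cited: Orwin's version asks how many hidden null studies would drag the mean effect size down to some 'trivial' threshold, while Rosenthal's works on combined p-values.

import numpy as np
from scipy.stats import norm

def orwin_fail_safe_n(effect_sizes, d_trivial=0.2):
    """Orwin's fail-safe N: number of unpublished studies with a mean effect of
    zero needed to pull the observed mean effect size down to `d_trivial`."""
    k = len(effect_sizes)
    d_bar = np.mean(effect_sizes)
    return k * (d_bar - d_trivial) / d_trivial

def rosenthal_fail_safe_n(p_values, alpha=0.05):
    """Rosenthal's fail-safe N: number of unpublished null studies (mean Z = 0)
    needed to bring the combined (Stouffer) Z below the one-tailed alpha cutoff."""
    z = norm.isf(np.asarray(p_values))      # convert one-tailed p-values to Z scores
    z_alpha = norm.isf(alpha)               # e.g. 1.645 for alpha = .05, one-tailed
    return (z.sum() ** 2) / (z_alpha ** 2) - len(p_values)

# A marginal mean effect needs only a study or two in the drawer to overturn it...
print(orwin_fail_safe_n([0.25, 0.30, 0.22, 0.28]))    # ~1 study
# ...while a strong mean effect needs many more.
print(orwin_fail_safe_n([0.9, 1.1, 0.8, 1.0, 0.95]))  # ~19 studies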
Where wild psychologists roam - Negativland by Neu!
Funnel Plots: imputing missing negative findings
Related to the first point, we may also estimate the number of missing effect sizes and even how large they might be. Crucially, we can then impute the missing values to see how they change the overall effect size in a meta-analysis. This was recently spotlighted by Cuijpers et al (2010) in their timely meta-analysis of psychological treatments for depression, which highlighted a strong bias toward positive reporting.

A standard way to look for bias in a meta-analysis is to examine funnel plots of the individual study effect sizes plotted against their sample sizes or standard errors. When no bias exists, studies with smaller error and larger sample sizes cluster around the mean effect size, while smaller samples and greater error variance produce far more variable effect sizes (in the tails). Ideally, we should observe a nicely symmetrical inverted funnel shape.
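To make the idea concrete, here is a small simulated example - entirely invented data, not Cuijpers' method or results. Studies are generated around a true effect, a crude 'publication filter' suppresses most of the small nonsignificant ones, and the resulting funnel plot comes out visibly asymmetric.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
true_effect = 0.4
n_per_arm = rng.integers(10, 200, size=300)
se = np.sqrt(2 / n_per_arm)        # rough SE of a standardised mean difference
d = rng.normal(true_effect, se)    # each study's observed effect size

# Crude publication filter: nonsignificant studies only 'appear' 30% of the time
z = d / se
published = (z > 1.96) | (rng.random(300) < 0.3)

plt.scatter(d[published], se[published], s=12)
plt.axvline(true_effect, linestyle="--")
plt.gca().invert_yaxis()           # convention: most precise studies at the top
plt.xlabel("Effect size (d)")
plt.ylabel("Standard error")
plt.title("Asymmetric funnel: small negative studies are missing")
plt.show()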
Turning to Cuijpers' paper, the first funnel plot below is clearly asymmetrical, showing a lack of negative findings (left side). Statistical techniques now exist for imputing, or filling in, these assumed missing values (see the second plot, where this has been done). The lower funnel plot gives a more realistic picture and adjusts the overall effect size downwards as a consequence: Cuijpers imputed 51 missing negative findings (the dark circles), which reduced the effect size considerably, from 0.67 to 0.42.
Figure 1. Before and After Science: Funnel Plots from Cuijpers et al (2010)
No One Receiving (Brian Eno - from Before & After Science)
Question "What do you call a group of negative (null) findings?" Answer: "A positive effect"
As noted already, some more medically oriented disciplines seem happier to publish null findings - but what precisely are the implications of this, especially in meta-analyses? Just publishing negative findings is not the end of the questioning!

Although some may disagree, one area that I think I know a fair bit about is the use of CBT for psychosis. The forest plot in Figure 2 is taken from a meta-analysis of CBT for psychosis by Wykes et al (2008).
Figure 2. Forest plot displaying Effect sizes for CBT as treatment for Psychosis
(from Wykes et al 2008)
In meta-analysis, forest plots are often one of the most informative sources of information, because they reveal so much about the individual studies. This example shows 24 published studies. The crucial information here, however, concerns not the magnitude of the individual effect sizes (where the rectangle sits on the X axis), but
...the confidence intervals - these tell us everything!
When the confidence intervals pass through zero, we know the effect size was nonsignificant. So, looking at this forest plot, only 6/24 (25%) studies show clear evidence of being significant trials (Trower 2004; Kuipers 1997; Drury 1997; Milton 1978; Gaudiano 2006; Pinto 1999). Although only one quarter of all trials were clearly significant, the overall effect is significant (around 0.4, as indicated by the diamond at the foot of the figure).
In other words, it is quite possible for a vast majority of negative (null) findings to produce an overall significant effect size - surprising? Other examples exist (e.g. streptokinase: Lau et al 1992; for a recent example, see Rerkasem & Rothwell 2010), and indeed I referred to one in my recent blog "What's your Poison?" on the meta-analysis assessing LSD as a treatment for alcoholism (where no individual study was significant!).
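To see why this is not a paradox, here is a toy inverse-variance (fixed-effect) pooling sketch with invented numbers: each study's 95% confidence interval crosses zero, yet the pooled estimate, with its much smaller standard error, is clearly significant.

import numpy as np
from scipy.stats import norm

d  = np.array([0.35, 0.40, 0.30, 0.45, 0.38])   # per-study effect sizes (hypothetical)
se = np.array([0.22, 0.25, 0.20, 0.28, 0.24])   # per-study standard errors (hypothetical)

# Each study alone: the lower 95% CI bound is below zero, so none is significant
print(np.round(d - 1.96 * se, 2))

# Pool with inverse-variance weights
w = 1 / se**2
d_pooled = np.sum(w * d) / np.sum(w)
se_pooled = np.sqrt(1 / np.sum(w))
z = d_pooled / se_pooled

# Pooled effect ~0.37, lower CI bound ~0.16, p < .001: significant overall
print(round(d_pooled, 2), round(d_pooled - 1.96 * se_pooled, 2), 2 * norm.sf(abs(z)))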
Some argue that the negative studies are only negative because they are underpowered - however, this only seems plausible when a moderate-to-large effect size accompanies the nonsignificant result. They further speculate that a large trial would prove the effectiveness of the treatment; however, when treatments have subsequently been evaluated in definitive large trials, they have often failed to reach significance. Egger and colleagues have written extensively on the unreliability of conclusions in meta-analyses where small numbers of nonsignificant trials are pooled to produce significant effects (Egger & Davey Smith, 1995).
So, negative or null findings are perhaps both more and less worrisome than we might think. It's not just an issue of not publishing negative results in psychology; it's also an issue of what to do with them when we have them.
The Journal of Null Results?
Hi Jim
I am inclined to argue against 'Journals for Null Results' on the basis that they make negative findings into a 'special case' - when they are a part of 'normal science'.
I would prefer to see null results appear more often in regular journals. Indeed, psychology is massively out of step on this issue - as other disciplines publish far more null results in their regular journals.
I agree entirely. Negative findings are sometimes described as "failures", but if the studies were run properly, then they are just information. I think we should just drop hypothesis testing for most situations and focus on effect size estimation. Once someone convincingly argues that a phenomenon is worth measuring, the field should focus on getting a precise estimate of its magnitude (how precise depends on the importance of the topic and the use of the value). This approach largely removes the motivation for withholding null findings. A null finding with 500 subjects will reduce the range of a confidence interval (or credibility interval, for the Bayesians) about as much as a significant finding with 500 subjects. Our investigative efforts should determine the precision of our estimates, not their magnitude.
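[As a toy illustration of that precision argument - my own invented example, not the commenter's calculation - the half-width of a 95% confidence interval for a two-group mean difference depends on the sample size, not on whether the point estimate happened to cross a significance threshold.]

import numpy as np

def ci_half_width(sd, n_per_group):
    """Approximate half-width of a 95% CI for a two-group mean difference."""
    se = sd * np.sqrt(2 / n_per_group)
    return 1.96 * se

for n in (20, 100, 500):
    print(n, round(ci_half_width(sd=1.0, n_per_group=n), 3))
# 500 subjects pin the estimate down to roughly +/-0.12 SD units,
# whether the observed difference is 0.02 ('null') or 0.40 ('significant').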
Null results must always be interpreted in the context of statistical power: given my sample size, what would have been the statistical power to detect an effect of magnitude x?
Hi Dirk,
Yes, you are absolutely correct - as I remarked at the end of the blog, power is one key to understanding what null findings mean. If one finds a nonsignificant p-value but the effect size is moderate-to-large, then the study may well be underpowered (and it may be worth pursuing with a larger sample). If the null finding is accompanied by a small or near-zero effect size, then there probably is nothing to see!
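[For anyone wanting to put numbers on Dirk's question, here is a minimal sketch using the power calculator in statsmodels; the sample sizes and effect sizes are illustrative only.]

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Post hoc: with n = 30 per group, what power did I have for small/medium/large effects?
for d in (0.2, 0.5, 0.8):
    power = analysis.solve_power(effect_size=d, nobs1=30, alpha=0.05)
    print(f"d = {d}: power with n = 30 per group ~ {power:.2f}")

# The same machinery answers the planning question:
# how many subjects per group for 80% power to detect d = 0.5?
print(analysis.solve_power(effect_size=0.5, power=0.8, alpha=0.05))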
I work in Genetics myself and we have had quite a bit of debate about negative findings there, especially in terms of candidate genes for human disease. A not-so-recent review actually showed that 'we' were not doing too badly in terms of also publishing NS results. However, given the genetic heterogeneity of different populations and the very complex genetic architecture of most traits, a negative result can mean several things.
In the same way that 'they have found a gene for [insert favourite disease]' really means 'we have found a genetic variant that increases susceptibility to this disease in this particular population or family', the interpretation of a negative result can go in the other direction. Again, meta-analyses have been the order of the day in terms of integrating over many separate genome studies.
...You could also move on to Bayesian methods :-)
(even better: do both things!)
It seems to me that the core issue relates to the philosophy of science and not so much about the negative results themselves or what to do about them. After all, there is an implicit message here that social and psychological (and perhaps economic!) factors are playing a key part in publication across disciplines. I have argued in the past that there is a need in many areas (especially in areas in psychology) to focus away from simple pseudo or proxy dependent measures and simple 'models' of them. To my mind there needs to be a greater consideration of what to use as reasonable dependent measures in studies. This is a key issue I think - for example is reaction time, or a BDI score (or whatever), really the right thing to measure? Not as often as they are used I would guess. And possibly generally not! One partial solution is to use change measures and to focus on the conditions that induce greatest change, without the current focus on control and simple models.
Hi Ben, I agree it's a Philosophy of Science issue, but I would argue that one role of Philosophy of Science is precisely to inform us how to deal with negative results - indeed, this formed the bedrock of Popper's Philosophy of Science.
ReplyDeleteWhether we agree with Popper's view or not, all well-known Philosophies of Science are essentially rooted in how to deal with negative results (whether its Popper, Kuhn, Lakatos - possibly Feyerabend is exception)
Hi Keith,
Agree with that, but my point really is about the nature of the dependent measures we should have to deal with...
Hi Ben
I agree that chosen dependent variables often seem inappropriate...certainly for methodological reasons, e.g. the use of the BDI as the main DV in depression intervention studies.
I am not sure I follow the 'change' measures idea and its advantage over more standard DV choices - could you give an example? Would a 'change' measure not still be a proxy?
The mandatory registration of trials in international registries was the solution to the problem of non-publication of drug trials with negative results. It became impossible to "bury bad results" from clinical trials because there is now a trail that people can follow to uncover the existence of the trial. Decent medical journals won't publish unregistered trials and it is high time that psychology journals adopted the same policy.
Phil Wilson
I agree Phil - it might also encourage mainstream journals to publish negative results more frequently (rather than authors having to consider alternative low-impact outlets specialising in negation!). As I said above, negative results are as much a part of science as positive findings...and deserve the same 'respect'.