The decline effect
Why so many published research findings are false
As the
medical technology industry knows very well, before the
effectiveness of a drug or device can be confirmed, it
must be tested and tested again. Different scientists in
different labs need to repeat the protocols and publish
their results. The test of replicability, as it’s known,
is the foundation of modern research. Replicability is
how the scientific community polices itself. It’s a
safeguard against subjectivity. But what if
replicability itself is in question?
It’s as if facts are losing their truth
The problem first reared its ugly head in Brussels
in September 2007, when a few dozen
neuroscientists, psychiatrists, and drug company
executives gathered in a hotel conference room to
hear about a class of drugs known as atypical or
second-generation antipsychotics. According to the data
released at this conference, the therapeutic power of
many of these drugs appeared to be steadily
waning. A recent study showed an effect that was
less than half of that documented in trials carried out
in the early nineteen-nineties. According to extensive
replication studies, expensive pharmaceuticals appeared
to be no better than previous generations of far less
sophisticated, and less profitable, medications.
As recent findings show, all sorts of well-established,
multiply confirmed findings have started to look
increasingly uncertain. It’s as if our facts are losing
their truth: claims that have been enshrined in medical
tomes are suddenly unprovable. While this phenomenon
does not yet have an official name, it is occurring
across a wide range of fields, from ecology to
psychology. In the field of medicine, the phenomenon
seems extremely widespread, affecting therapies ranging
from cardiac stents to vitamin E and antidepressants.
For instance, a forthcoming analysis demonstrates that
the efficacy of antidepressants has gone down as much as
threefold in recent decades.
The academic superstar with a secret
This “problem” was first noticed by Jonathan Schooler back
in the 1990s. As a young graduate student at the
University of Washington in the 1980s, Schooler had appeared
to discover an interesting fact about language and
memory. At that time, it was widely believed that the
act of describing our memories improved them. But, in a
series of clever experiments, Schooler demonstrated this
was not true. He showed that subjects shown a face and
asked to describe it were much less likely to remember
that face when shown it later than those subjects who
had simply looked at it. Schooler dubbed the phenomenon
“verbal overshadowing”.
This “discovery” turned Schooler into an academic
superstar, and his study was cited more than 400 times over
the years that followed. Buoyed by the success of his
first experiment, Schooler went on to extend the model
to a variety of other similar tasks which also seemed to
demonstrate the same results. But while Schooler was
publishing these results in highly reputable journals, a
secret worry was gnawing at him: he was finding it
impossible to replicate his original findings. He would
often see an effect of some kind, but the “verbal
overshadowing” was just not as strong. It was as if it
was getting weaker. At first, he assumed that he’d made
an error in the design of his experiment or had made a
mistake in his statistical calculations.
Cosmic habituation
Schooler tried to put the pesky problem out of his mind. His
colleagues assured him that such things happen all the
time. Years passed and Schooler found new things to
research; he got married and raised a family. But his
replication problems kept on getting worse. His first
attempt at replicating his 1990 study was in 1995. His
results that time turned out to be 30% smaller. The next
year, the size of the effect shrank by yet another 30%.
When other labs repeated his experiments, a similar
spread of data occurred. It seemed as if nature had
changed its mind and was now taking back the original
results. Schooler started referring to the problem as
“cosmic habituation”, by analogy to the response that
occurs when individual subjects habituate to particular
stimuli. Although he started joking that the cosmos was
habituating to his ideas, he took it very personally.
Now a tenured professor at the University of California,
Schooler still ruminates over those experiments. “I know
I should just move on already. I should really stop
talking about this. But I can’t.” Schooler is convinced
that he’s stumbled on a serious problem, one that
afflicts many new ideas in many fields of
research. And he’s not alone. Other scientists have also
noticed the phenomenon, looked into it and come up blank
as well.
Extraordinarily lucky researchers
It has been argued that the decline effect is largely a product
of publication bias, or the tendency of scientists and
scientific journals to prefer positive data over null
results, which is what happens when no effect is found.
The bias was first identified by the statistician
Theodore Sterling back in 1959. Sterling noticed that
97% of all published psychological studies with
statistically significant data found the effect they
were looking for. A significant result is defined as any
data point that would be produced by chance less than
5% of the time.
This test was invented in 1922 by the English
mathematician Ronald Fisher, who picked 5% as the
boundary line, somewhat arbitrarily, because it made
pencil and slide-rule calculations easier. Sterling saw
that if 97% of psychology studies were proving their
hypotheses, either psychologists were extraordinarily
lucky or they published only the outcomes of successful
experiments. In recent years, publication bias has
mostly been seen as a problem for clinical trials, since
pharmaceutical companies are less interested in
publishing results that are not favourable. But it is
becoming increasingly clear that publication bias also
produces major distortions in fields without large
corporate incentives, such as psychology and ecology.
While publication bias almost certainly plays a role in
the decline effect, it remains an incomplete
explanation. For one thing, it fails to account for the
initial prevalence of positive results among studies
that never even get submitted to journals. It also fails
to explain the experience of people who have been unable
to replicate their initial data despite their best
efforts.
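How a filter on significance can inflate early results is easy to sketch in a few lines. The simulation below is only an illustration under stated assumptions, not a model of any study mentioned here: the true effect of 0.2, the 20 subjects per group, the unit-variance outcomes and the rule that "journals" accept only significant positive results are all hypothetical choices.

```python
import math
import random
import statistics


def simulate_publication_bias(true_effect=0.2, n_per_group=20,
                              n_studies=10_000, seed=0):
    """Simulate many two-group studies of the same small true effect.

    Assumes unit-variance outcomes, so each observed mean difference is
    normal with standard error sqrt(2 / n_per_group). Every number here
    is a hypothetical value chosen for illustration.
    """
    rng = random.Random(seed)
    se = math.sqrt(2 / n_per_group)
    z_crit = 1.96  # two-sided 5% threshold for a z-test

    # One observed effect per simulated study.
    all_effects = [rng.gauss(true_effect, se) for _ in range(n_studies)]

    # Assumed filter: only significant results in the expected
    # direction get "published".
    published = [d for d in all_effects if d / se > z_crit]

    return {
        "true_effect": true_effect,
        "mean_all": statistics.mean(all_effects),
        "mean_published": statistics.mean(published),
        "share_published": len(published) / n_studies,
    }


if __name__ == "__main__":
    for key, value in simulate_publication_bias().items():
        print(f"{key}: {value:.3f}")
```

Under these assumptions the published studies overstate the effect, because only unusually large estimates clear the threshold. An unbiased repeat of any one of them would tend to land closer to the true value, which from the outside looks like a decline. As the text notes, this mechanism does not cover every case.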
Our beliefs as a form of blindness
One of the classic examples of selective reporting concerns
the testing of acupuncture in different countries. Widely
accepted as a medical treatment in various Asian
countries, acupuncture is much more contested in the
West. These cultural differences have profoundly
influenced the results of clinical trials. Between 1966
and 1995, there were 47 studies of acupuncture in China,
Taiwan and Japan. Every single trial concluded that
acupuncture was an effective treatment. During the same
period, there were 94 clinical trials of acupuncture in
the United States, Sweden and the UK. Only 56% of these
studies found any therapeutic benefits. This wide
discrepancy suggests that scientists find ways to
confirm their preferred hypothesis, disregarding what
they don’t want to see. Our beliefs are a form of
blindness.
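A rough calculation shows how wide that discrepancy is. The sketch below assumes, purely for illustration, that every trial is independent and that the 56% success rate seen in the Western trials applied everywhere; neither assumption comes from the studies themselves. It then asks how likely 47 positive results out of 47 would be.

```python
def prob_all_positive(p_positive: float, n_trials: int) -> float:
    """Chance that every one of n independent trials comes out positive."""
    return p_positive ** n_trials


# Figures quoted in the text: 47 trials in China, Taiwan and Japan, all
# positive; 56% of the 94 trials in the US, Sweden and the UK positive.
p = prob_all_positive(0.56, 47)
print(f"{p:.2e}")
```

Under these assumptions the probability is far below one in a billion, so chance alone is not a plausible explanation for the unanimous results.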
Reporting only positive results
John Ioannidis, an epidemiologist at Stanford University,
argues that such distortions are a serious issue in
biomedical research. “These exaggerations are why the
decline has become so common,” he says. “It’d be really
great if the initial studies gave us an accurate summary
of things. But they don’t. And so what happens is we
waste a lot of money treating millions of patients and
doing lots of follow-up studies on other themes based on
results that are misleading.”
In 2005, Ioannidis published an article in the Journal
of the American Medical Association that looked at the
forty-nine most cited clinical-research studies in three
major medical journals. Forty-five of these studies
reported positive results, suggesting that the
intervention being tested was effective. Because most of
these studies were randomized controlled trials–the gold
standard of medical evidence–they tended to have a
significant impact on clinical practice, and led to the
spread of treatments such as hormone replacement therapy
for menopausal women and daily low-dose aspirin to
prevent heart attacks and strokes. Nevertheless, the
data Ioannidis found were disturbing: of the thirty-four
claims that had been subject to replication, 41% had
either been directly contradicted or had their effect
size downgraded significantly.
Fashionable subjects are worse
The situation is even worse when a subject is fashionable.
In recent years, for instance, there have been hundreds
of studies on the various genes that control the
differences in disease risk between men and
women. These findings have included everything from the
mutations responsible for the increased risk of
schizophrenia to the genes underlying hypertension.
Ioannidis and his colleagues looked at four hundred and
thirty-two of these claims. They quickly discovered that
the vast majority had serious flaws. But the most
troubling fact emerged when he looked at the test of
replication: out of four hundred and thirty-two claims,
only a single one was consistently replicable. “This
doesn’t mean that none of these claims will turn out to
be true,” he says. “But, given the fact that most of them
were done badly, I wouldn’t hold my breath.”
According to Ioannidis, the main problem is that too
many researchers engage in what he calls “significance
chasing,” or finding ways to interpret the data so that
it passes the statistical test of significance–the 95%
boundary invented by Ronald Fisher. “The scientists are
so eager to pass this magical test that they start
playing around with the numbers, trying to find anything
that seems worthy,” Ioannidis says. In recent years,
Ioannidis has become increasingly blunt about the
pervasiveness of the problem. One of his most often
cited papers has a deliberately provocative title: “Why
most published research findings are false.”
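The arithmetic behind significance chasing can be sketched as follows. The example assumes a researcher runs several independent analyses on data with no real effect and reports whichever one passes the 5% threshold; the function names and the number of analyses are hypothetical. When there is no effect, each p-value is uniformly distributed between 0 and 1, which is what the simulation draws.

```python
import random

ALPHA = 0.05  # the conventional 5% significance threshold


def analytic_false_positive_rate(n_tests: int, alpha: float = ALPHA) -> float:
    """P(at least one of n independent null tests has p < alpha)."""
    return 1 - (1 - alpha) ** n_tests


def simulated_false_positive_rate(n_tests: int, n_experiments: int = 50_000,
                                  alpha: float = ALPHA, seed: int = 0) -> float:
    """Monte Carlo check: with no true effect, p-values are uniform on [0, 1]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        # The "chaser" keeps only the smallest p-value among the analyses.
        if min(rng.random() for _ in range(n_tests)) < alpha:
            hits += 1
    return hits / n_experiments


if __name__ == "__main__":
    for k in (1, 5, 10, 20):
        print(k,
              round(analytic_false_positive_rate(k), 3),
              round(simulated_false_positive_rate(k), 3))
```

With a single analysis the false-positive rate stays at 5%, but under these assumptions it climbs to roughly 40% once ten analyses are tried. Real analyses of the same data are rarely independent, so this is an idealized upper-end picture rather than an estimate for any particular field.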
A fundamental cognitive flaw
The problem of selective reporting is rooted in a
fundamental cognitive flaw: we like proving ourselves
right and hate being wrong. Or, as Ioannidis puts it:
“It feels good to validate a hypothesis. It feels even
better when you’ve got a financial interest in the idea or
your career depends on it. And that’s why, even after a
claim has been systematically disproven, you still see
some researchers citing the first few studies that show
a strong effect. They really want to believe that it’s
true.”
That’s why scientists need to become more rigorous about
data collection before they publish. We’re wasting too
much time chasing after bad studies and underpowered
experiments. The current “obsession” with replicability
distracts from the real problem, which is faulty design.
Nobody even tries to replicate most science papers –
there are simply too many. According to Nature, a third
of all studies never even get cited, let alone repeated.
“I’ve learned the hard way to be exceedingly careful,”
Schooler says. “Every researcher should have to spell
out, in advance, how many subjects they’re going to use,
and what exactly they’re testing, and what constitutes a
sufficient level of proof. We have the tools to be much
more transparent about our experiments.”
In a forthcoming paper, Schooler recommends the
establishment of an open-source database, in which
researchers would be required to outline their planned
investigations and document all of their results. “I
think this would provide a huge increase in access to
scientific work and give us a much better way to judge
the quality of an experiment,” Schooler says. “It would
help us finally deal with all these issues that the
decline effect is exposing.”
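What an entry in such a database might hold can be sketched as a small data structure. This is a guess at the idea, not Schooler's design: every class, field and function name below is hypothetical. It records the three things he asks researchers to declare in advance (the number of subjects, what is being tested, and the level of proof) and checks a reported result against that plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preregistration:
    """Plan filed before data collection (all field names hypothetical)."""
    hypothesis: str        # what exactly is being tested
    planned_subjects: int  # how many subjects will be used
    alpha: float           # threshold that counts as sufficient proof


@dataclass(frozen=True)
class ReportedResult:
    """What the researchers report once the study is done."""
    subjects_used: int
    p_value: float


def deviations(plan: Preregistration, result: ReportedResult) -> list[str]:
    """List the ways a reported study departs from its registered plan."""
    issues = []
    if result.subjects_used != plan.planned_subjects:
        issues.append(
            f"planned {plan.planned_subjects} subjects, "
            f"used {result.subjects_used}"
        )
    return issues


def supports_hypothesis(plan: Preregistration, result: ReportedResult) -> bool:
    """Count a result only if it followed the plan and met the declared bar."""
    return not deviations(plan, result) and result.p_value < plan.alpha
```

Because the plan is frozen once filed, a study that quietly adds subjects until the numbers look better would show up as a deviation rather than as a clean positive result.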
We still have to choose what to believe
Ultimately,
anomalies in test results demonstrate the slipperiness
of empiricism. Although many scientific ideas generate
conflicting results and suffer from falling effect
sizes, they often continue to get cited in textbooks and
drive standard medical practice. Why? Because these
ideas seem true. Because they make sense. Because we
can’t bear to let them go. And this is why the decline
effect is so troubling. Not because it reveals the human
fallibility of science. Not because data can be tweaked
and beliefs shape perception. (Such shortcomings aren’t
surprising, at least for scientists.) And not because it
reveals that many of our most exciting theories are
fleeting fads and will soon be rejected. The decline
effect is troubling because it reminds us of how
difficult it is to prove anything. We like to pretend
that our experiments define the truth for us. But that’s
often not the case. Just because an idea is true doesn’t
mean it can be proven. And just because an idea can be
proven doesn’t mean it’s true. Once all the experiments
are done, we still have to choose what to believe.
This text was edited from an article written by Jonah Lehrer and
published in the December 13, 2010 edition of the New
Yorker magazine.