What is the Problem With Reproducing Behavioral Results?
The original paper was published in Proceedings of the National Academy of Sciences USA.
In June 1999 a disturbing and influential report was published in Science magazine. John Crabbe, Douglas Wahlsten and Bruce Dudek (CWD) reported that they encountered considerable difficulties replicating behavioral results in three different laboratories. CWD made an exceptional effort to equalize the conditions of their tests. They used several inbred strains of laboratory mice (within each inbred strain all the animals are more than 99% identical genetically, more similar than identical twins). They used several standard behavioral tests that are common in the biomedical community for phenotyping and for assessing the behavioral effects of drugs. They didn’t merely equalize the conditions of housing and testing, but conducted each test on practically the same day and at the same hour in all three labs.

The good news in the CWD report was that they found large and highly significant strain differences in all the measures they used. (Most of the tests use more than one measure to evaluate the results; these measures are termed “endpoints” in the phenotyping community.) Apparently, the genotype does have a considerable effect on the behavior of mice. The not-so-bad news was that they found a highly significant laboratory effect in many endpoints. A lab effect is anything that shifts the results of all the genotypes tested in a laboratory by the same amount. That is, we have a lab-specific factor that equally increases or decreases the results of all the genotypes. This is not a very serious problem, because you can run some common genotype in your own lab and use it as a reference to calculate your lab’s specific factor. The differences between the genotypes will remain the same, although the absolute values will change from lab to lab.
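A small numerical sketch may make the purely additive case concrete. The strain names, lab names and scores below are invented for illustration and are not taken from the CWD report; `observed` and `reference_corrected` are hypothetical helpers, not an established procedure.

```python
# Hypothetical mean scores (arbitrary units) for three strains.
# These numbers are made up; they are NOT data from the CWD report.
strain_means = {"A": 10.0, "B": 14.0, "C": 21.0}

# Assumption: each lab shifts every strain by the same amount (a pure lab effect).
lab_shift = {"lab1": 0.0, "lab2": 3.0, "lab3": -2.0}


def observed(lab):
    """Scores a lab would measure if only an additive lab effect were present."""
    return {strain: mean + lab_shift[lab] for strain, mean in strain_means.items()}


def reference_corrected(scores, reference="A"):
    """Express every strain relative to a reference strain run in the same lab."""
    return {strain: value - scores[reference] for strain, value in scores.items()}


for lab in lab_shift:
    raw = observed(lab)
    print(lab, "raw:", raw, "relative to A:", reference_corrected(raw))
```

In this toy setup the raw scores differ from lab to lab, but the scores relative to the reference strain come out the same in every lab, which is why a pure lab effect is only a nuisance.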
The real problem that CWD reported was that in many of the endpoints they also found a significant lab x genotype interaction (LxG). This means that even the strain differences were not the same in each laboratory. Such an interaction cannot be solved by running a reference genotype, because it means that each lab affects each genotype in a different way. CWD thus concluded: “experiments characterizing mutants may yield results that are idiosyncratic to a particular laboratory”.
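To see what an interaction looks like in numbers, here is a minimal sketch that removes the strain and lab main effects from a table of cell means and inspects what is left. The table is invented for illustration and is not CWD data; `interaction_terms` and `strain_difference` are hypothetical helper names, and the sketch only describes cell means, it does not test for statistical significance.

```python
# Hypothetical strain-by-lab cell means (arbitrary units).
# These numbers are made up; they are NOT data from the CWD report.
cell_means = {
    "A": {"lab1": 10.0, "lab2": 13.0, "lab3": 8.0},
    "B": {"lab1": 14.0, "lab2": 12.0, "lab3": 16.0},
}


def strain_difference(table, first, second):
    """Difference between two strains, computed separately within each lab."""
    return {lab: table[second][lab] - table[first][lab] for lab in table[first]}


def interaction_terms(table):
    """What remains of each cell after removing strain and lab main effects."""
    strains = list(table)
    labs = list(table[strains[0]])
    grand = sum(table[s][l] for s in strains for l in labs) / (len(strains) * len(labs))
    strain_mean = {s: sum(table[s].values()) / len(labs) for s in strains}
    lab_mean = {l: sum(table[s][l] for s in strains) / len(strains) for l in labs}
    return {
        s: {l: table[s][l] - strain_mean[s] - lab_mean[l] + grand for l in labs}
        for s in strains
    }


print("B minus A per lab:", strain_difference(cell_means, "A", "B"))
print("interaction terms:", interaction_terms(cell_means))
```

In this invented table strain B scores above strain A in lab1 and lab3 but below it in lab2, so subtracting a reference strain no longer gives the same answer everywhere. With a purely additive lab effect, all the interaction terms would be zero.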
Of course, most investigators in the behavioral neurosciences have known from personal experience that behavioral results from one laboratory might prove very difficult to replicate in another lab, even after investing considerable effort in standardizing housing and testing conditions. The CWD report made it official, because they used genetically identical animals, because they used some of the most common standard behavioral tests, and mainly because for the first time they actually conducted the experiment in a controlled way across several labs.
It is difficult to say how much the CWD report changed the way people in the field treat behavioral results. I personally heard from many investigators that this report was used as an argument to allocate less funding to behavioral experiments, and that some grant applications were explicitly turned down because the referees thought they did not address this issue. Perhaps the most absurd thing about this situation is that the studies most likely to suffer from this criticism are those that, like the original CWD report, use inbred mouse strains and relatively simple, standardized setups, although there is absolutely no reason to assume that it is safer to trust results from behavioral studies that use outbred (genetically variable) animals and complex, idiosyncratic setups. The confounding factors that prevent replication of results in a different lab most probably exist in the second type of study even more than in the first. It is only that in the first type they are easier to discover.
To be continued...