Test Fatigue – Why it Happens

First of all super thanks to all of the great comments on my previous post about Test Fatigue. If you didn’t read my previous post or you don’t know what I mean by Test Fatigue, then please go ahead and read it now. I’ll wait.

Now, to the point – why do we often see the lift from a challenger in a split test decrease after it seems to be going strong and steady?

Statistical significance is for the winner, not the lift.
First and foremost, most split testing tools (I’ve only used Test&Target and Google Website Optimizer extensively) will provide a confidence level for your results. If the control has a conversion rate of 4% and the challenger a conversion rate of 6% (a 50% lift) with a 97% confidence level, the tool is NOT telling you that there is a 97% chance that there will be a 50% lift. The confidence level is referring to the confidence that the the challenger will outperform the control.

You don’t have enough data and there are many variables outside of your control.
We tend to think that in a split test all variables other than the visitor being presented with the control vs. the challenger are identical. In reality there are many external variables outside of our control, some of which we aren’t even aware of. All things being equal, we often see fluctuations in conversion rates even when we don’t make any changes in our site. Meta Brown provided some excellent points in her comments in my previous post.

Results aren’t always reproducible. Learn to live with it.
Lisa Seaman pointed out an excellent article from the New Yorker magazine about this very same phenomenon in other sciences. This is a must read for anyone doing any type of testing in any field. Read it. Now: The Truth Wears Off

What was especially eye opening for me was this part of the article (on page 5). Here is a shortened version of it:

In the late nineteen-nineties, John Crabbe, a neuroscientist at the Oregon Health and Science University, conducted an experiment that showed how unknowable chance events can skew tests of replicability. He performed a series of experiments on mouse behavior in three different science labs: in Albany, New York; Edmonton, Alberta; and Portland, Oregon. Before he conducted the experiments, he tried to standardize every variable he could think of.

The premise of this test of replicability, of course, is that each of the labs should have generated the same pattern of results. “If any set of experiments should have passed the test, it should have been ours,” Crabbe says. “But that’s not the way it turned out.” In one experiment, Crabbe injected a particular strain of mouse with cocaine. In Portland the mice given the drug moved, on average, six hundred centimetres more than they normally did; in Albany they moved seven hundred and one additional centimetres. But in the Edmonton lab they moved more than five thousand additional centimetres. Similar deviations were observed in a test of anxiety. Furthermore, these inconsistencies didn’t follow any detectable pattern. In Portland one strain of mouse proved most anxious, while in Albany another strain won that distinction.

The disturbing implication of the Crabbe study is that a lot of extraordinary scientific data are nothing but noise.

So there you have it. While I know you really want a silver bullet that will make your positive results always stay the same, reality isn’t so simple.

They say that conversion optimization is part art and part science, but I think we have to accept that it’s also part noise :)

Ophir