Test Fatigue – Why it Happens

First of all, a big thank-you for all of the great comments on my previous post about Test Fatigue. If you didn’t read that post, or you don’t know what I mean by Test Fatigue, please go ahead and read it now. I’ll wait.

Now, to the point – why do we often see the lift from a challenger in a split test decrease after it seems to be going strong and steady?

Statistical significance is for the winner, not the lift.
First and foremost, most split testing tools (I’ve only used Test&Target and Google Website Optimizer extensively) will provide a confidence level for your results. If the control has a conversion rate of 4% and the challenger a conversion rate of 6% (a 50% lift) with a 97% confidence level, the tool is NOT telling you that there is a 97% chance that there will be a 50% lift. The confidence level refers only to how confident the tool is that the challenger will outperform the control; it says nothing about by how much.
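
To make the distinction concrete, here is a minimal Python sketch. The visitor counts (850 per version) are my own assumption for illustration, and the math is a textbook two-proportion normal approximation; I am not claiming this is what Test&Target or Google Website Optimizer compute internally. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def chance_to_beat_control(conv_a, n_a, conv_b, n_b):
    """Approximate confidence that the challenger's true rate beats the control's."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Standard error of the difference between two independent proportions.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return NormalDist().cdf((p_b - p_a) / se)


def relative_lift_interval(conv_a, n_a, conv_b, n_b, level=0.95):
    """Rough interval for the relative lift.

    Simplification: the control rate in the denominator is treated as exact.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    diff = p_b - p_a
    return (diff - z * se) / p_a, (diff + z * se) / p_a


# Assumed counts: 850 visitors per version, 34 vs. 51 conversions (4% vs. 6%).
confidence = chance_to_beat_control(34, 850, 51, 850)
low, high = relative_lift_interval(34, 850, 51, 850)
print(f"confidence the challenger wins: {confidence:.1%}")
print(f"plausible relative lift: {low:.0%} to {high:.0%}")
```

With these made-up counts the confidence comes out at roughly 97%, yet the plausible range for the lift runs from around zero to roughly 100%. The observed 50% is just one point somewhere in that range.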

You don’t have enough data and there are many variables outside of your control.
We tend to think that in a split test all variables other than the visitor being presented with the control vs. the challenger are identical. In reality there are many external variables outside of our control, some of which we aren’t even aware of. All things being equal, we often see fluctuations in conversion rates even when we don’t make any changes to our site. Meta Brown provided some excellent points in her comments on my previous post.
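
Here is a small Python simulation of that last point. It is only a sketch under assumptions of my own: a true conversion rate of 4% that never changes, 2,000 visitors a week, and a hypothetical helper called `simulate_weekly_rates`. It models nothing but random sampling noise, so real-world external variables would add even more movement on top of this.

```python
import random


def simulate_weekly_rates(true_rate, visitors_per_week, weeks, seed=0):
    """Simulate observed weekly conversion rates for a page that never changes."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rates = []
    for _ in range(weeks):
        # Each visitor converts independently with probability true_rate.
        conversions = sum(rng.random() < true_rate for _ in range(visitors_per_week))
        rates.append(conversions / visitors_per_week)
    return rates


# Assumed numbers: 4% true rate, 2,000 visitors per week, 8 weeks.
rates = simulate_weekly_rates(0.04, 2000, 8)
for week, rate in enumerate(rates, start=1):
    print(f"week {week}: {rate:.2%}")
```

Even though nothing about the "site" changes between weeks, the observed rate moves around from week to week.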

Results aren’t always reproducible. Learn to live with it.
Lisa Seaman pointed out an excellent article from the New Yorker magazine about this very same phenomenon in other sciences. This is a must-read for anyone doing any type of testing in any field. Read it. Now: The Truth Wears Off

What was especially eye-opening for me was this part of the article (on page 5). Here is a shortened version of it:

In the late nineteen-nineties, John Crabbe, a neuroscientist at the Oregon Health and Science University, conducted an experiment that showed how unknowable chance events can skew tests of replicability. He performed a series of experiments on mouse behavior in three different science labs: in Albany, New York; Edmonton, Alberta; and Portland, Oregon. Before he conducted the experiments, he tried to standardize every variable he could think of.

The premise of this test of replicability, of course, is that each of the labs should have generated the same pattern of results. “If any set of experiments should have passed the test, it should have been ours,” Crabbe says. “But that’s not the way it turned out.” In one experiment, Crabbe injected a particular strain of mouse with cocaine. In Portland the mice given the drug moved, on average, six hundred centimetres more than they normally did; in Albany they moved seven hundred and one additional centimetres. But in the Edmonton lab they moved more than five thousand additional centimetres. Similar deviations were observed in a test of anxiety. Furthermore, these inconsistencies didn’t follow any detectable pattern. In Portland one strain of mouse proved most anxious, while in Albany another strain won that distinction.

The disturbing implication of the Crabbe study is that a lot of extraordinary scientific data are nothing but noise.

So there you have it. While I know you really want a silver bullet that will make your positive results always stay the same, reality isn’t so simple.

They say that conversion optimization is part art and part science, but I think we have to accept that it’s also part noise :)

Ophir

Test Fatigue – Conversion Optimization’s Dirty Little Secret

I’m going to let you in on a phenomenon that’s fairly common in split testing, but no one other than veteran split testers seems to be talking about it, and I don’t think it’s ever been blogged about (please add a comment if I’m wrong).

It has to do with the question:
“Will the lift I see during my split test continue over time?”

Let’s start by looking at a scenario commonly used by practically everyone in the business of split testing.

Your web site is currently generating \$400k a month in sales, which has been steady for the past few months. You hire a conversion optimization company, which does a split test on your checkout page.

After running the test for 3-4 weeks, the challenger version provides a 10% lift in conversion and RPV (revenue per visitor) at a 99% statistical confidence level. The conversion optimization company turns off the test and you hard code the winning challenger.

First of all – Wooohoo!!! (Seriously, that’s an excellent win.)

A 10% lift from \$400k a month is an extra \$40k a month. Annualized that amounts to an extra \$480k a year. So your potential increased yearly revenue from using the winning checkout page is almost half a million dollars. Sounds pretty good to me.

Here’s the problem.

All things being equal, by using the winning version of the checkout page and not your old checkout page, there is a good chance you won’t be making an extra \$480k in the next 12 months.

Don’t get me wrong. You will indeed be making more money with the winning checkout page than with the old one, but in all likelihood it will be less than what you get by simply annualizing the lift you saw during the test.

The culprit is what I like to call “Test Fatigue” (a term I think I just coined).

Here’s what often happens if, instead of stopping your split test after 3-4 weeks, you let it run for an entire year. It’s a phenomenon I’ve often, but not always, seen with very long-running split tests: after a while (this might be 3 weeks or 3 months), the performance of the winning version and the control (original) version starts to converge.

They usually won’t totally converge, but that 10% lift which was going strong for a while with full statistical confidence is now a 9% lift or an 8% lift or a 5% lift or maybe even less.
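
To see what that does to the annualized number, here is a back-of-the-envelope Python sketch. The decay schedule is purely an assumption of mine for illustration (a lift sliding in a straight line from 10% to 5% over twelve months); real tests won’t follow such a tidy curve, and `extra_revenue` is a hypothetical helper.

```python
def extra_revenue(monthly_revenue, monthly_lifts):
    """Total extra revenue given a baseline and a lift for each month."""
    return sum(monthly_revenue * lift for lift in monthly_lifts)


BASELINE = 400_000  # monthly sales from the scenario above

# Naive projection: the 10% lift from the test holds for all 12 months.
naive = extra_revenue(BASELINE, [0.10] * 12)

# Assumed fatigue: the lift slides linearly from 10% in month 1 to 5% in month 12.
decaying_lifts = [0.10 - 0.05 * month / 11 for month in range(12)]
fatigued = extra_revenue(BASELINE, decaying_lifts)

print(f"naive annualized lift: ${naive:,.0f}")
print(f"with assumed fatigue:  ${fatigued:,.0f}")
```

Under this made-up schedule the extra revenue for the year is \$360k rather than \$480k. Still a nice win, just not the one you put in the spreadsheet.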

As I mentioned before, this doesn’t always happen and the time frame can vary, but this is a very real phenomenon.

Why does this happen?

Please read my next post, Why Test Fatigue Happens, where I offer some explanations.

I’d also love to hear whether you have seen this phenomenon in your own tests and what your personal theories are as to why it happens.

Thanks
Ophir