First of all, super thanks for all of the great comments on my previous post about Test Fatigue. If you haven't read that post or you don't know what I mean by Test Fatigue, please go ahead and read it now. I'll wait.
Now, to the point – why do we often see the lift from a challenger in a split test decrease after it seems to be going strong and steady?
Statistical significance is for the winner, not the lift.
First and foremost, most split testing tools (I’ve only used Test&Target and Google Website Optimizer extensively) will report a confidence level for your results. If the control has a conversion rate of 4% and the challenger a conversion rate of 6% (a 50% lift) with a 97% confidence level, the tool is NOT telling you that there is a 97% chance of a 50% lift. The confidence level refers to the confidence that the challenger will outperform the control at all; it says nothing about the size of the lift.
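To make that concrete, here is a minimal simulation, not tied to any particular testing tool; the 4% and 6% true rates and the 5,000 visitors per variation are illustrative assumptions. Even when the challenger genuinely converts better, the lift you observe in any single test swings widely:

```python
# Minimal sketch: the challenger truly converts at 6% vs. a 4% control,
# so it almost always "wins", but the lift we actually observe varies a lot.
import numpy as np

rng = np.random.default_rng(42)

true_control, true_challenger = 0.04, 0.06   # assumed true conversion rates
visitors_per_arm = 5_000                     # hypothetical traffic per variation
n_experiments = 10_000                       # number of simulated split tests

control_conv = rng.binomial(visitors_per_arm, true_control, n_experiments)
challenger_conv = rng.binomial(visitors_per_arm, true_challenger, n_experiments)

observed_lift = (challenger_conv - control_conv) / control_conv  # relative lift

print(f"Challenger beats control in {np.mean(challenger_conv > control_conv):.1%} of tests")
print(f"Middle 95% of observed lifts: {np.percentile(observed_lift, 2.5):.0%} "
      f"to {np.percentile(observed_lift, 97.5):.0%}")
```

With these numbers the challenger wins essentially every time, yet the lift read off any single test can plausibly land anywhere from roughly +25% to +75%. That spread is exactly what the confidence level is silent about.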
You don’t have enough data and there are many variables outside of your control.
We tend to assume that in a split test, the only thing that differs between visitors is whether they see the control or the challenger. In reality there are many external variables outside of our control, some of which we aren’t even aware of. Even when we don’t change anything on our site, we often see fluctuations in conversion rates. Meta Brown provided some excellent points in her comments on my previous post.
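Just to put a number on how big those "no change" fluctuations can be, here is a quick back-of-the-envelope sketch (the 4% rate and the 2,000 weekly visitors are made-up figures):

```python
# How much can a conversion rate wobble week to week when NOTHING changes?
# Pure sampling noise from who happens to show up is enough to move it.
import math

true_rate = 0.04          # assumed constant "true" conversion rate
weekly_visitors = 2_000   # hypothetical weekly traffic

std_error = math.sqrt(true_rate * (1 - true_rate) / weekly_visitors)
low, high = true_rate - 1.96 * std_error, true_rate + 1.96 * std_error

print(f"~95% of weeks will read between {low:.1%} and {high:.1%}")
# -> roughly 3.1% to 4.9%, with no change to the site at all
```

A week that reads 4.8% followed by a week that reads 3.3% can both come from the exact same page.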
Results aren’t always reproducible. Learn to live with it.
Lisa Seaman pointed out an excellent article from The New Yorker about this very same phenomenon in other sciences. It is a must-read for anyone doing any type of testing in any field. Read it. Now: The Truth Wears Off
What was especially eye-opening for me was this part of the article (on page 5). Here is a shortened version of it:
In the late nineteen-nineties, John Crabbe, a neuroscientist at the Oregon Health and Science University, conducted an experiment that showed how unknowable chance events can skew tests of replicability. He performed a series of experiments on mouse behavior in three different science labs: in Albany, New York; Edmonton, Alberta; and Portland, Oregon. Before he conducted the experiments, he tried to standardize every variable he could think of.
The premise of this test of replicability, of course, is that each of the labs should have generated the same pattern of results. “If any set of experiments should have passed the test, it should have been ours,” Crabbe says. “But that’s not the way it turned out.” In one experiment, Crabbe injected a particular strain of mouse with cocaine. In Portland the mice given the drug moved, on average, six hundred centimetres more than they normally did; in Albany they moved seven hundred and one additional centimetres. But in the Edmonton lab they moved more than five thousand additional centimetres. Similar deviations were observed in a test of anxiety. Furthermore, these inconsistencies didn’t follow any detectable pattern. In Portland one strain of mouse proved most anxious, while in Albany another strain won that distinction.
The disturbing implication of the Crabbe study is that a lot of extraordinary scientific data are nothing but noise.
So there you have it. While I know you really want a silver bullet that will make your positive results always stay the same, reality isn’t so simple.
They say that conversion optimization is part art and part science, but I think we have to accept that it’s also part noise :)
Ophir
Hi Ophir
The fourth reason is that the underlying mechanisms responsible for the observed change are complex, as they are in the case of a website with multiple page elements, visited by people from different backgrounds with different intentions, genders, preferences, expectations, motivations, and therefore different levels of susceptibility to the marketing messages. Disambiguating these confounders is usually impossible in this sphere, as the information is not always available. We often do not know the gender or age of the visitors who did not convert, and people can use the same keyword phrase with very different intent (and since we have no visibility into their prior searches, we cannot glean this from the data we have).

Often it is not feasible to run a test long enough to gather sufficient data to slice and dice the demographics finely enough to spot the real gems (the quick calculation below shows why). Such is life: we must always operate with incomplete information. On the other hand, we are incredibly lucky to have the wealth of data we can get our hands on, so quickly and relatively cheaply. Before the internet, this would have been an incredibly expensive exercise, both financially and temporally.
Regards,
Salvatore
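Salvatore's point about slicing and dicing the demographics is easy to put numbers on. Here is a rough sketch using the standard two-proportion sample-size approximation; the baseline 4% rate, the hoped-for 5%, and the conventional 95% significance / 80% power settings are illustrative assumptions, not figures from his comment:

```python
# Rough illustration: how many visitors do you need PER VARIATION just to
# reliably detect a 4% -> 5% improvement? (Standard two-proportion formula.)
from scipy.stats import norm

p1, p2 = 0.04, 0.05          # baseline and hoped-for conversion rates
alpha, power = 0.05, 0.80    # conventional significance level and power

z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96
z_beta = norm.ppf(power)            # ~0.84

n_per_arm = ((z_alpha + z_beta) ** 2
             * (p1 * (1 - p1) + p2 * (1 - p2))
             / (p1 - p2) ** 2)
print(f"~{n_per_arm:,.0f} visitors per variation")   # roughly 6,700

# Want to analyse, say, 8 demographic segments separately? Each segment then
# needs roughly this much traffic per variation all on its own.
```

Every segment you want to analyse on its own needs roughly that much traffic per variation, which is why fine-grained slicing so often runs out of data before it runs out of questions.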
Hi Ophir, a big problem that I am seeing right now is that users are becoming more accustomed to using several devices before making a final decision (e.g. starting on a laptop, continuing on mobile, and ending on a tablet). This can obfuscate test results significantly. I wrote a small post about this at http://bit.ly/H6IM7T. If you have the time to read it, I’d love to hear whether you are experiencing similar problems.
Cheers from a former POP colleague!
David White
Hi David,
Thanks for the link. You make a very valid point, though the phenomenon you describe is more an issue of correct attribution than of irreproducible results.
Ultimately it’s a reminder that so much of what really happens is outside our ability to measure.
Thanks
Ophir
Hi Ophir,
If you really want to be entertained, I suggest you run an AA test of two identical ads on Google AdWords. Watch how the CTR, conversion rates and other metrics change, sometimes flip-flop, converge and diverge over time.
You can probably run an AA test with landing pages and get similar results, but I never tried that.
In some cases I have let these run for a long, long time, so there is no question of not having enough data. But you can still see divergent metrics in many cases. Very depressing to see. Actually, we ran AAB tests, as we wanted to wait for the two A variations to match before we declared the B or the A a winner.
To overcome your “test fatigue” problem we try to test pages that are very different from each other. We can then sometimes get very big differences in the metrics, which allows us to be more confident that the winner is really a winner.
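The flip-flopping AA behaviour described in the last comment is easy to reproduce in a quick simulation. This is a minimal sketch rather than a model of any real AdWords campaign; the 2% click-through rate and the traffic volume are arbitrary assumptions:

```python
# Minimal sketch of an AA test: two identical "ads" with the SAME true CTR
# still show cumulative click-through rates that cross back and forth.
import numpy as np

rng = np.random.default_rng(7)

true_ctr = 0.02          # identical for both "ads"
daily_impressions = 500  # hypothetical traffic per ad per day
days = 90

clicks_a = rng.binomial(daily_impressions, true_ctr, days)
clicks_b = rng.binomial(daily_impressions, true_ctr, days)

impressions_so_far = daily_impressions * np.arange(1, days + 1)
cum_ctr_a = np.cumsum(clicks_a) / impressions_so_far
cum_ctr_b = np.cumsum(clicks_b) / impressions_so_far

lead_changes = np.sum(np.diff(np.sign(cum_ctr_a - cum_ctr_b)) != 0)
print(f"Final CTR: A {cum_ctr_a[-1]:.2%} vs. B {cum_ctr_b[-1]:.2%}")
print(f"The lead changed hands roughly {lead_changes} times in {days} days")
```

Two "ads" with the exact same true CTR still trade the lead repeatedly and rarely settle at the same number, which is precisely the divergence described above.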