I’m going to share with you a phenomenon that’s fairly common in split testing, but that no one seems to be talking about (other than veteran split testers), and that I don’t think has ever been blogged about (please add a comment if I’m wrong).
It has to do with the question:
“Will the lift I see during my split test continue over time?”
Let’s start by looking at a scenario commonly used by practically everyone in the business of split testing.
Your web site is currently generating $400k a month in sales, and that figure has been steady for the past few months. You hire a conversion optimization company, which runs a split test on your checkout page.
After running the test for 3-4 weeks, the challenger version shows a 10% lift in conversion rate and RPV at a 99% statistical confidence level. The conversion optimization company turns off the test and you hard-code the winning challenger.
First of all – Wooohoo!!! (Seriously, that’s an excellent win.)
A 10% lift from $400k a month is an extra $40k a month. Annualized that amounts to an extra $480k a year. So your potential increased yearly revenue from using the winning checkout page is almost half a million dollars. Sounds pretty good to me.
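For the record, here’s that projection as a quick sketch, with the key assumption spelled out (it’s the assumption the rest of this post questions):

```python
def annualized_lift(monthly_revenue: float, lift: float) -> float:
    """Extra yearly revenue IF a relative lift persists unchanged.

    This is the naive projection: it assumes the lift measured
    during the test continues at full strength for all 12 months.
    """
    return monthly_revenue * lift * 12

# $400k/month with a 10% lift -> an extra $480k/year on paper.
print(annualized_lift(400_000, 0.10))  # 480000.0
```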
Here’s the problem.
All things being equal, by using the winning version of the checkout page and not your old checkout page, there is a good chance you won’t be making an extra $480k in the next 12 months.
Don’t get me wrong. You will indeed be making more money with the winning checkout page than with the old one, but in all likelihood it will be less than what you’d get by simply annualizing the lift observed during the test itself.
The culprit is what I like to call “Test Fatigue” (a term I think I just coined).
Here’s what often happens if, instead of stopping your split test after 3-4 weeks, you let it run for an entire year. There is a phenomenon that I’ve often, but not always, seen with very long-running split tests: after a while (this might be 3 weeks or 3 months) the performance of the winning version and the control (original) version starts to converge.
They usually won’t totally converge, but that 10% lift which was going strong for a while with full statistical confidence is now a 9% lift or an 8% lift or a 5% lift or maybe even less.
As I mentioned before this doesn’t always happen and the time frame can change, but this is a very real phenomenon.
Why does this happen?
Please read my next post, Why Test Fatigue Happens, where I provide some explanations for why this happens.
Also, I’d love to hear if you have also seen this phenomenon with your own tests and what your personal theories are as to why it happens.
22 thoughts on “Test Fatigue – Conversion Optimization’s Dirty Little Secret”
We have also seen “test fatigue” as you describe above. To overcome this, we start testing the individual variables (multivariate testing) on the page (text, field names, buttons, etc.) which continues to increase conversions of the page over time.
Great article Ophir! I hate this effect with all my heart and soul :-) You are all happiness with the statistically significant results and then, slowly, your happiness starts getting undermined…
Really looking forward to the next post, please do not be shy to add all the statistics that you know so well!
Hi Ophir, interesting topic. We have seen this in our clients test as well. Never actually stopped and researched the topic. Great cliffhanger here… curious what your findings are.
Ditto, convergence over time, often to within 0.5%, which is more or less within the margin of error. Our conversion process at LegalMatch is multi-step – I’ve found it useful to test both on the next page and at the conversion step.
This emphasizes the importance of continuing improvement. The internet is constantly changing and your users are too. If you stand still, you’re falling behind. This same trend is why stale sites have slowly eroding conversion rates whether or not they do A/B testing. You’ve got to keep making improvements.
We’ve seen “test fatigue” in few of the tests we have left running for longer periods, great to hear someone comment on this phenomenon. Keen to hear your theory on why this might happen.
I think Robert has a good point about continually optimising and not standing still – your site traffic doesn’t stand still either and is more fluid. Users who were on your site 2 months ago are either not on your site now, or, if you have a good retention program, they are in a different mindset than before.
Annualising makes the assumption your test sample is a constant.
Test tools, “black boxes” that they generally are, are built to reach statistical confidence as fast as possible. That means they are angled towards reaching confidence with the smallest sample size possible. I’m all for that approach, even though it seems to cause the unpleasant phenomenon you posted about. I believe the effect is caused by increasing sample size and increasing accuracy, mixed with a ‘diminishing returns’ factor that happens naturally over time on the Web. Looking forward to your thoughts in part II, of course :-)
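A toy simulation makes the “fast as possible” danger concrete (illustrative numbers only, not any vendor’s actual algorithm): run an A/A test where there is no real difference between the versions, check for significance every day, and stop at the first “win.” It declares a winner far more often than the nominal 5% error rate would suggest:

```python
# A/A test with daily peeking: there is NO true difference between
# arms, yet stopping at the first significant peek produces many
# false "winners". All parameters below are illustrative choices.
import random
import math

random.seed(7)

def peeking_false_positive(p=0.05, daily=200, days=30, z_crit=1.96):
    """Return True if a no-difference test ever 'wins' at some daily peek."""
    conv_a = conv_b = n = 0
    for _ in range(days):
        n += daily  # visitors per arm so far
        conv_a += sum(random.random() < p for _ in range(daily))
        conv_b += sum(random.random() < p for _ in range(daily))
        pooled = (conv_a + conv_b) / (2 * n)
        se = math.sqrt(2 * pooled * (1 - pooled) / n)
        if se > 0 and abs(conv_b / n - conv_a / n) / se > z_crit:
            return True  # declared a "winner" at this peek
    return False

runs = 200
fp = sum(peeking_false_positive() for _ in range(runs))
print(f"false positives with daily peeking: {fp}/{runs} (nominal rate: 5%)")
```

The spurious winners found this way are exactly the ones whose “lift” evaporates once the test is hard-coded and the noise that produced it is gone.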
I find this phenomenon typically happens when the test variations are only testing superficial changes to the test page (CTA color, size, headline text change) and don’t provide enough substance for one variation to win over the other. A company’s time and energy is better spent on running split tests that actually mean something to the consumer/customer. Clarify some point of confusion, provide real value to the customer, and verify that value through testing. If the value is only superficial, the results will not last.
Hi Ophir. Your “test fatigue” sounds similar to the topic of a December 2010 New Yorker article, which called it the “decline effect.” Here’s a snippet from the concluding paragraph.
“The decline effect is troubling because it reminds us how difficult it is to prove anything. We like to pretend that our experiments define the truth for us. But that’s often not the case. Just because an idea is true doesn’t mean it can be proved. And just because an idea can be proved doesn’t mean it’s true. When the experiments are done, we still have to choose what to believe.”
Read more of “THE TRUTH WEARS OFF” by Jonah Lehrer:
Thanks for the comments and the link to the New Yorker article. It’s a true gem, and I never realized how prevalent the issue of “non-reproducible results” is in other fields.
From a statistician’s point of view, I can think of a couple of obvious reasons why you might see this effect.
As Brendan pointed out, tests are conducted with relatively small samples. Small samples are by nature more variable than large ones. Sometimes you will just happen to draw a sample where the effect of the change is larger than usual, and when the change is applied to the larger group the overall effect just isn’t as big as it was for that sample.
Another thing to consider is confounding – that is, factors you don’t control, and may not even know about, are coming into play. For example, when you conducted the test, it might be that people who were on the fence about buying your product saw the test page and it somehow pushed them over the edge to purchase. Once the test is over, there might be fewer of those people left in the population to buy. Or there might have been something going on when you did the test – such as a mention of your product in the media – and while it appeared to you that the lift was caused by the new page, some of it was actually caused by the publicity. Your test might have reached a sample that was not representative of your everyday site visitor for some other reason that you haven’t detected yet. These factors can alter the results of the test, but it’s not obvious, so it seems it was all due to the test page.
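The first point – regression to the mean, sometimes called the “winner’s curse” – can be sketched with a small simulation (all numbers here are illustrative). Many tests are run where the true underlying lift is a modest 5%; among only the tests that happen to clear a significance bar, the measured lift is far higher than the true one:

```python
# Winner's-curse sketch: tests that clear the significance bar are
# disproportionately the ones where noise inflated the measured lift.
import random
import math

random.seed(42)

TRUE_P_A = 0.05          # control conversion rate (illustrative)
TRUE_LIFT = 0.05         # the real underlying lift is only 5%
TRUE_P_B = TRUE_P_A * (1 + TRUE_LIFT)
N = 2000                 # visitors per arm in each test

def one_test():
    """Run one simulated A/B test; return (measured lift, z-score)."""
    conv_a = sum(random.random() < TRUE_P_A for _ in range(N))
    conv_b = sum(random.random() < TRUE_P_B for _ in range(N))
    ra, rb = conv_a / N, conv_b / N
    pooled = (conv_a + conv_b) / (2 * N)
    se = math.sqrt(2 * pooled * (1 - pooled) / N)
    return (rb - ra) / ra, (rb - ra) / se

results = [one_test() for _ in range(1000)]
winners = [lift for lift, z in results if z > 1.64]  # tests that "won"
avg_all = sum(lift for lift, _ in results) / len(results)
avg_win = sum(winners) / len(winners)
print(f"true lift: {TRUE_LIFT:.0%}  "
      f"avg measured (all tests): {avg_all:.1%}  "
      f"avg measured (significant tests only): {avg_win:.1%}")
```

Once the winner is deployed to everyone, the noise that pushed it over the bar is gone, and measured performance drifts back toward the true, smaller lift.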
Great comments, and I agree that while we often think we have a big enough sample size, we really don’t.
I’ll be incorporating some of your comments in part two.
Meta, you make some good points.
With all tests, you need to ensure there is no channel bias. For example, if PPC (paid traffic) is driving the conversions, then if/when this activity is switched off it will obviously alter the conversion %. Always look at the channel split for all tests.
@Ophir, look forward to seeing some output in part 2.
I think this is an area where optimisers need collaboration.
These are really important considerations. Every effect has a cause but we have to be careful about linking specific effects to specific causes. There’s a danger of wishful thinking influencing our interpretation of changes in data.
I agree with Meta’s first point about small sample sizes. I would suggest that this is at the core of the phenomenon you are describing, Ophir. Especially since you mention that they’re long-running tests, I would guess that they’re also likely running in lower-traffic target areas.
External factors should be less of a concern if your test is run with a concurrent control. What Meta describes is a “Pre & Post” or “Before & After” test, which will always produce questionable results. Your test sample must be compared against a similar sample, which, in the case of marketing testing, must cover the same time period.
We certainly see this in some of our tests, but we also see the opposite, so it’s not a general rule. It’s perhaps just more painful to see, so it makes a more significant impression on us and seems more general than it is.
Unfortunately, small sample sizes are not at the core of this issue. Some of my tests involve hundreds of thousands of visitors a day with thousands of conversions.
By external factors I don’t mean always external to the site, but external to the test.
For example, with one ecommerce client, we were testing a product category page. Midway through the test they changed the promotional offer on the homepage and it greatly impacted test results.
If you haven’t yet, read the article that Lisa Seaman pointed out:
“THE TRUTH WEARS OFF” by Jonah Lehrer:
Thanks for the comments!
Interesting observation. This is a common problem with (using and interpreting) statistics – it applies to every science where an underlying mechanism cannot be demonstrated by controlled experiments. Unless the changes to the winning page are due to better psychology – so they apply to the vast majority of normal humans, if such a beast exists (for example: providing a story that demonstrates product benefits appealing to the underlying human need for acceptance will win over a list of features, and will probably win forever, by a significant margin) – there is a possibility that the rise is not due to anything more than a passing fashion preference. This is why so many web pages are so easy to optimise: they look old-fashioned, so changing them to look more modern fits with what people expect to see, so they are more likely to buy. Sorry for the parentheses and convoluted sentences – too much caffeine and not enough time for edits! ;-)
Interesting point. But what can you really do to stop it from happening? If this were me, I would probably not even raise this with my business or clients, because it may derail attempts to actually do more testing in the future… And don’t forget, website traffic sources and quality change over time, as does the rest of the website, so naturally the results won’t stay elevated for much more than 6 months. I suggest you do follow-up tests a year later to see how you can improve your initial result further (test iteration).
Great points Rich!
I agree that pro-actively raising the point isn’t the best idea, but on the other hand I have personally been asked by clients why the lift from the test we did 3 months ago (whose winner we implemented) isn’t showing up in their bottom line.
It’s not about doing something about it so much as being aware of the phenomenon.
I would not hesitate to discuss this with clients, although if they were complete novices to testing it would not be a high priority topic. My take is simple – testing gives you the best information available at any given moment.
Internetland is changing often. Those who are very serious about testing use it often, much more than every year. Some would consider a day to be a long time.