A/B and Multivariate Test Validity: Beware of Bad Data!
Clearly, validating your test results should be a key part of the conversion testing process, or you’re going to be acting on bad data (and losing cash).
But how can you tell when there might be problems with your numbers? Look out for these four types of validity threats:
Too small a sample size
To find a winner, test your layout and copy variations with enough test subjects to reach a high level of confidence in your results. But how many is enough? Several factors affect the sample size you’ll need, including:
- The current conversion rate of the page you're testing (note: not the same as the conversion rate of your entire site)
- The average number of daily visits to the test page
- The number of versions you’re testing
- The percentage of visitors in the experiment (sometimes you want to test with just a segment of your traffic)
- The percentage improvement you expect over the control
- How confident you need to be in the results (usually 95% but could be higher if the risks of being wrong are high)
To estimate how long your test needs to run for the results to be statistically significant at the 95% confidence level (i.e. a 5% chance you’ll think the variations are performing differently when really they aren’t), look to the Google Website Optimizer calculator. Amadesa also has an A/B experiment duration calculator that's a little more flexible: it lets you choose the level of confidence you want to achieve. Playing with the calculators shows that if your site gets limited traffic, you won’t be able to run as many versions, or segment your test traffic as much, as a higher-volume site.
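As a rough illustration of the arithmetic behind this kind of estimate, the Python sketch below uses the standard normal-approximation formula for comparing two conversion rates. It is not the method used by either calculator mentioned above. The function names, the 80% statistical power default, and the example numbers are all assumptions made for illustration, and the sketch ignores any correction for comparing several versions at once.

```python
import math
from statistics import NormalDist


def visitors_per_version(baseline_rate, expected_lift, confidence=0.95, power=0.80):
    """Approximate visitors needed per version (two-sided, two-proportion test).

    baseline_rate: current conversion rate of the test page, e.g. 0.05
    expected_lift: relative improvement expected over the control, e.g. 0.20
    power: chance of detecting the lift if it is real (assumed default: 80%)
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + expected_lift)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def estimated_duration_days(per_version, versions, daily_visits, share_in_test=1.0):
    """Days needed to send `per_version` visitors to each of `versions` pages."""
    visitors_needed = per_version * versions
    visitors_per_day = daily_visits * share_in_test
    return math.ceil(visitors_needed / visitors_per_day)


if __name__ == "__main__":
    # Hypothetical page: 5% conversion rate, hoping for a 20% relative lift.
    n = visitors_per_version(baseline_rate=0.05, expected_lift=0.20)
    days = estimated_duration_days(n, versions=2, daily_visits=1000, share_in_test=0.5)
    print(f"{n} visitors per version, about {days} days")
```

Changing the inputs shows the trade-offs described above: a smaller expected lift, more versions, or a smaller share of traffic in the experiment all lengthen the test.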
Multiplying the estimated duration of your experiment by the average number of daily visitors included in the test gives you an indication of your sample size (or you can use a complicated formula). It’s helpful to have a sample size in mind before you start testing because many testing tools can be a little misleading. They can turn “green” or “red” after only a few visits, falsely indicating a high level of confidence that you have a winner or loser, and then quickly revert to "yellow", or inconclusive, results. If you heed the first "green" bar, you will stop your test too early. By waiting until you’ve tested with your full predetermined sample size, you stand a better chance of finding the truly superior performer. But don't worry: peeking during a test is OK, and necessary, as we'll see below.
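To make the early-stopping problem concrete, here is a minimal Python sketch of a two-proportion z-test that refuses to declare a result until each version has reached its planned sample size. The function names, the "keep running" rule, and the example counts are hypothetical; real testing tools may compute confidence differently.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions): nothing to compare
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def verdict(conv_a, n_a, conv_b, n_b, planned_per_version, alpha=0.05):
    """Only evaluate significance once the planned sample size is reached."""
    if min(n_a, n_b) < planned_per_version:
        return "keep running"
    p = two_proportion_p_value(conv_a, n_a, conv_b, n_b)
    return "significant difference" if p < alpha else "inconclusive"


if __name__ == "__main__":
    # After only 40 visits per version the p-value can already look "green"...
    print(two_proportion_p_value(4, 40, 13, 40))
    # ...but the verdict holds off until the planned sample size is reached.
    print(verdict(4, 40, 13, 40, planned_per_version=8000))
```

The guard in `verdict` is the point of the sketch: a low p-value on a handful of visits is exactly the premature "green" bar described above.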
An external event that changes visitor behavior
Events outside of an experiment, often called “history” threats, can affect response rates. Often, these are time-bound events (e.g. holidays, major industry or company events, or news stories) that significantly but temporarily affect the attitudes and behaviors of visitors, and the amount of traffic, so much so that you can’t tell whether response differences are due to page changes or to the external event.
This is why we never recommend sequential testing, such as trying one page version against your control in the first half of the month and another version in the second half. An external event that happens only during the second half of the month can alter your results. But even A/B split testing is susceptible to external influences. Although an external event affects all test versions equally, your overall results might have been different had you started the test earlier or later.
To minimize the risk of “history” impacts, here are a few tips:
- Regularly analyze your data for consistency during the test, especially if it’s particularly long running.
- Don’t run tests that extend into holidays (or across periods that differ significantly for your industry) unless holiday behavior is what you want to study in your test.
- During the test, look out for industry or news stories that may temporarily affect purchase behavior or traffic.
- Test over a longer duration, or repeat a test (to a point), until you are confident in your data.
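One way to act on the first tip is to compare each day's conversion rate for a page version with that version's overall rate and flag days that stand out. The Python sketch below is a simple illustration under assumed inputs: the `(label, visits, conversions)` layout, the function name, and the threshold of 3 standard errors are hypothetical choices, not a prescribed method.

```python
import math


def flag_unusual_days(daily, z_threshold=3.0):
    """Flag days whose conversion rate is far from the overall rate.

    daily: list of (label, visits, conversions) tuples for ONE page version.
    Returns a list of (label, z_score) for days beyond the threshold.
    """
    total_visits = sum(visits for _, visits, _ in daily)
    total_conversions = sum(conv for _, _, conv in daily)
    if total_visits == 0:
        return []
    overall = total_conversions / total_visits

    flagged = []
    for label, visits, conv in daily:
        if visits == 0:
            continue
        # Standard error of a daily rate if that day behaved like the average.
        se = math.sqrt(overall * (1 - overall) / visits)
        if se == 0:
            continue
        z = (conv / visits - overall) / se
        if abs(z) > z_threshold:
            flagged.append((label, round(z, 2)))
    return flagged


if __name__ == "__main__":
    # Hypothetical control data: six ordinary days and one unusual one.
    control = [(f"day {i}", 1000, 50) for i in range(1, 7)] + [("day 7", 1000, 95)]
    print(flag_unusual_days(control))
```

A flagged day is a prompt to investigate (a holiday, a news story, a promotion), not proof that the test is invalid.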
A change in your technical environment or measurement tools
“Instrument change”, where something happens to the technical environment or the measurement tools used during a test, can invalidate the results of your A/B or multivariate experiment. This could be things like:
- Inconsistent placement of test control code (e.g. in the body on some pages but in the header on others)
- A code deploy happening during a test that disables or alters your control code
- Performance issues stemming from web server or network problems
- Testing software or reporting tool malfunction
- Response time slowdowns due to heavy page weights or page code, or server overload
And to reduce the risk of “instrument change”, follow these guidelines:
- Make sure your control code is placed correctly and consistently across all your versions.
- Check your versions for browser compatibility before launch.
- Be careful when deploying code while a test is running not to alter or delete your test control code.
- Monitor for odd data during the test. If you have multiple sources of the same data, cross-check your numbers every once in a while to make sure there are no major differences.
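The cross-checking guideline above can be automated with a small comparison script. This Python sketch assumes you can export the same metrics from two sources (say, your testing tool and your analytics package) as dictionaries; the metric names, the function names, and the 5% tolerance are illustrative assumptions only.

```python
def relative_gap(count_a, count_b):
    """Difference between two counts as a fraction of the larger one."""
    larger = max(count_a, count_b)
    if larger == 0:
        return 0.0
    return abs(count_a - count_b) / larger


def cross_check(source_a, source_b, tolerance=0.05):
    """Return {metric: gap} for metrics where the sources disagree too much."""
    mismatches = {}
    # Only compare metrics that both sources report.
    for metric in sorted(source_a.keys() & source_b.keys()):
        gap = relative_gap(source_a[metric], source_b[metric])
        if gap > tolerance:
            mismatches[metric] = gap
    return mismatches


if __name__ == "__main__":
    # Hypothetical numbers for one page version over the same date range.
    testing_tool = {"visits": 10000, "conversions": 520}
    analytics = {"visits": 10150, "conversions": 410}
    for metric, gap in cross_check(testing_tool, analytics).items():
        print(f"{metric}: sources differ by {gap:.0%}")
```

Two tools rarely agree exactly, so the tolerance should reflect the gap you normally see between them; a sudden jump beyond it is the signal worth chasing.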
A change in incoming traffic sources or traffic mix
When different types of visitors are not distributed equally between page versions, the test outcome can be affected. This is called “selection bias”. It can happen, for example, if your incoming traffic sources, or mix of traffic, change dramatically during the test (due to a big email send or other channel-specific marketing activity), or if the profile of your testers doesn’t match the profile of your actual customers.
While your test is running, monitor your control to make sure it’s not deviating significantly from past performance. As with "instrument change" threats, look out for sudden changes in the performance of one page version over another, or in the distribution of traffic amongst your variations.
Here are a few ways to minimize “selection bias”:
- Use traffic sources that most closely match the target audience for the page being tested.
- Make sure visitors are being randomly distributed between your test versions. It should be impossible for you to predict which version a given visitor will see, and visitors shouldn’t be able to self-select the version they see either.
- Compare the performance of your control with its recent historical performance for consistency.
- Gather enough analytics data to allow deeper analysis post-test. For example, to compare weekend vs. weekday results, or new vs. returning visitors.
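To check that traffic really is being split as intended, you can compare the visitor counts of two versions against the split you configured. The Python sketch below runs a chi-square goodness-of-fit test for a two-version experiment. The function name, the default 50/50 split, and the strict 0.001 threshold are assumptions for illustration; a test with more versions would need a more general chi-square calculation.

```python
import math


def split_looks_as_planned(count_a, count_b, share_a=0.5, alpha=0.001):
    """Return True if the observed split is consistent with the planned one.

    Uses a chi-square goodness-of-fit test with 1 degree of freedom.
    share_a is the planned fraction of test traffic sent to version A.
    """
    total = count_a + count_b
    if total == 0:
        return True  # nothing observed yet
    expected_a = total * share_a
    expected_b = total * (1 - share_a)
    chi2 = ((count_a - expected_a) ** 2 / expected_a
            + (count_b - expected_b) ** 2 / expected_b)
    # For 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return p_value >= alpha


if __name__ == "__main__":
    print(split_looks_as_planned(5000, 5050))  # small gap
    print(split_looks_as_planned(5000, 5600))  # large gap
```

A failed check suggests visitors are not being distributed randomly, or that one version is losing visits to a tracking problem, so it is worth resolving before trusting the conversion numbers.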
And finally, always take time to carefully analyze your data after each test has run its course. Sometimes you’ll uncover interesting patterns, for example that one variation worked better for one audience while another worked better for a different group. Running follow-up tests can confirm these findings.