Jun 15, 2011 | 6 minute read
written by Amanda Dhalla
According to the 2011 MarketingSherpa Landing Page Optimization Benchmark Report, 40% of the more than 2,000 marketers surveyed did not calculate the statistical significance of their A/B and multivariate test results in 2010. 40%! That’s a big chunk of marketers.
Clearly, validating your test results should be a key part of the conversion testing process, or you’re going to be acting on bad data (and losing cash).
But how can you tell when there might be problems with your numbers? Look out for these 4 types of validity threats:
To find a winner, test your layout and copy variations with enough test subjects to reach a high level of confidence in your results. But how many is enough? Several factors impact the sample size you’ll need, including your baseline conversion rate, the minimum improvement you want to detect, the confidence level you’re aiming for, and the number of variations you’re testing.
To estimate how long you need to run your test for your results to be statistically significant at the 95% confidence level (i.e. a 5% chance you’ll think the variations are performing differently when really they aren’t), look to the Google Website Optimizer calculator. Amadesa also has an A/B experiment duration calculator that’s a little more flexible: it lets you choose the level of confidence you want to achieve. Play with the calculators and you’ll find that if your site gets limited traffic, you won’t be able to run as many versions or segment your test traffic as much as a higher-volume site can.
Multiplying the potential duration of your experiment by your average daily visitors gives you an indication of your sample size (or you can use a complicated formula). It’s helpful to have a sample size in mind before you start testing because many testing tools can be a little misleading. They can turn “green” or “red” after only a few visits, falsely indicating a high level of confidence that you have a winner or loser, and then quickly revert to “yellow” or inconclusive results. If you heed the first “green” bar, you will stop your test too early. By waiting until you’ve tested with your full predetermined sample size, you stand a better chance of finding the real superior performer. But don’t worry, peeking during a test is OK, and necessary, as we’ll see below.
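If you’d rather see the math than trust a black-box calculator, here’s a rough sketch (in Python) of the kind of calculation these tools perform. It uses a standard two-proportion sample-size formula; the function names, the default 80% power, and the example numbers are my own assumptions, not figures from the calculators above.

```python
# A rough, illustrative take on the "complicated formula": per-variation
# sample size for comparing two conversion rates, plus the implied test
# duration. Assumes an even traffic split and 80% statistical power.
from math import ceil, sqrt
from scipy.stats import norm

def visitors_per_variation(baseline_rate, min_relative_lift,
                           confidence=0.95, power=0.80):
    """Visitors needed in each variation to detect the given lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_relative_lift)
    p_bar = (p1 + p2) / 2
    z_alpha = norm.ppf(1 - (1 - confidence) / 2)   # 1.96 at 95% confidence
    z_beta = norm.ppf(power)                       # 0.84 at 80% power
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

def test_duration_days(baseline_rate, min_relative_lift,
                       daily_visitors, variations=2):
    """Days needed if traffic is split evenly across all variations."""
    n = visitors_per_variation(baseline_rate, min_relative_lift)
    return ceil(n * variations / daily_visitors)

# Example: 3% baseline conversion rate, hoping to detect a 20% relative
# lift, with 1,000 visitors a day split across an A/B test.
print(visitors_per_variation(0.03, 0.20))    # roughly 13,900 per variation
print(test_duration_days(0.03, 0.20, 1000))  # roughly 28 days
```

The takeaway is the same one the calculators give you: small lifts on low-traffic pages need surprisingly large samples, and therefore long test durations.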
Events outside of an experiment, often called “history” threats, can affect response rates. Often these are external events (e.g. holidays, major industry or company events, or news stories) that significantly but temporarily affect visitors’ attitudes and behavior, as well as the amount of traffic, so much so that you can’t tell whether response differences are due to your page changes or to the event itself.
This is why we never recommend sequential testing, like trying one page version against your control in the first half of the month and another version in the second half. An external event that happens only during the second half of the month can skew your results. Even A/B split testing is susceptible to external influences: while an external event impacts all test versions equally, your overall results might have looked different if you had started the test earlier or later.
To minimize the risk of “history” impacts, here are a few tips:
“Instrument change”, where something happens to the technical environment or the measurement tools used during a test, can invalidate the results of your A/B or multivariate experiment. This could be things like:
While your test is running, if you spot sudden changes in performance or in the distribution of traffic between variations, take a look under the covers to see whether your technical environment or testing toolkit has changed in any way.
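One simple check you can run when the traffic split looks suspicious: compare the number of visitors actually assigned to each variation against the split you configured, using a chi-square test. This is just a sketch of the idea; the counts and threshold below are made up, and your testing tool may already surface something similar.

```python
# Compare the actual visitor counts per variation with the split the test
# was configured to use. A very small p-value suggests the tool or the
# technical environment is mis-assigning traffic. Numbers are illustrative.
from scipy.stats import chisquare

observed = [5124, 4356]        # visitors actually bucketed into A and B
intended_split = [0.5, 0.5]    # the 50/50 allocation you set up

total = sum(observed)
expected = [total * share for share in intended_split]
stat, p_value = chisquare(observed, f_exp=expected)

if p_value < 0.001:
    print("Traffic split looks off -- check your tags and testing tool.")
else:
    print("Split is consistent with the configured allocation.")
```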
And to reduce the risk of “instrument change”, follow these guidelines:
When different types of visitors are not distributed equally between page versions, the test outcome can be affected. This is called “selection bias”. It can creep in when your incoming traffic sources, or mix of traffic, change dramatically during the test (due to a big email send or other channel-specific marketing activity), or when the profile of your testers doesn’t match the profile of your actual customers.
While your test is running, monitor your control to make sure it’s not deviating significantly from past performance. As with "instrument change" threats, look out for sudden changes in the performance of one page version over another, or in the distribution of traffic amongst your variations.
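To make “monitor your control” concrete, here’s one way you might do it (a sketch, not a prescription): compare the control’s conversion rate during the test with its historical rate using a two-proportion z-test. The numbers and the alert threshold are invented for the example.

```python
# Compare the control's conversion rate during the test window with its
# historical rate. A big, statistically significant gap hints at selection
# bias or an instrument change. All figures below are hypothetical.
from math import sqrt
from scipy.stats import norm

def control_drift_p_value(hist_conversions, hist_visitors,
                          test_conversions, test_visitors):
    p_hist = hist_conversions / hist_visitors
    p_test = test_conversions / test_visitors
    pooled = (hist_conversions + test_conversions) / (hist_visitors + test_visitors)
    se = sqrt(pooled * (1 - pooled) * (1 / hist_visitors + 1 / test_visitors))
    z = (p_test - p_hist) / se
    return 2 * norm.sf(abs(z))   # two-sided p-value

# Control converted at 3.1% last quarter but only 2.2% so far in the test.
p = control_drift_p_value(1860, 60000, 110, 5000)
if p < 0.01:
    print("Control is deviating from its past performance -- investigate.")
```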
Here are a few ways to minimize “selection bias”:
And finally, always take time to carefully analyze your data after each test has run its course. Sometimes you’ll uncover interesting insights, like one variation working better for one audience while another worked better for a different group. Running follow-up tests can confirm these findings.
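If it helps to picture that post-test breakdown, here’s a bare-bones sketch: group the results by a visitor attribute (traffic source, in this made-up example) and compare conversion rates per variation within each segment. Any pattern it surfaces is a hypothesis for a follow-up test, not a verdict.

```python
# Break test results down by segment and see whether the leading variation
# changes from one audience to another. Segments and counts are made up.
results = {
    # segment: {variation: (conversions, visitors)}
    "email":  {"A": (120, 2400), "B": (95, 2350)},
    "search": {"A": (210, 8100), "B": (260, 8000)},
}

for segment, variations in results.items():
    rates = {name: conv / visits for name, (conv, visits) in variations.items()}
    leader = max(rates, key=rates.get)
    summary = ", ".join(f"{name}: {rate:.2%}" for name, rate in rates.items())
    print(f"{segment:<7} {summary}  -> leader: {leader}")
```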