A/B testing, sometimes called “two-sample hypothesis testing,” is an essential tool for virtually all online businesses. Properly applied, an A/B testing regime allows businesses to optimize their marketing strategies by testing two alternatives (strategy A versus strategy B) against one another and seeing which is more effective. The idea seems simple, but A/B testing can be deceptively complex. Like any tool, its usefulness depends on the skill of the operator. These five common errors in research design and data analysis can lead to incorrect, and sometimes catastrophic, conclusions.
Choosing the Wrong Dependent Variable
The central mission of any A/B testing program is to identify the most effective possible strategy, but defining “most effective” isn’t always intuitive. For some businesses, the goal is to maximize click-through rates, but others want to attract users interested in buying a product. The problem is that a strategy that generates lots of clicks doesn’t necessarily generate lots of buyers.
Suppose that version A of an advertisement is intriguing and controversial, but a bit misleading about the content on your website. Version B is a bit boring, but communicates what your site has to offer clearly and directly. Version A might generate 20% more clicks than version B, but the vast majority of those visitors will navigate away almost immediately. Version B might get fewer eyeballs on the site, but those eyeballs might be significantly more valuable. If the goal is to maximize revenue, an A/B test of click-through rates would lead to the wrong conclusion. Even if the goal is to maximize traffic rather than purchases, click-through rates can still be a poor metric. If version A attracts many more clicks but version B attracts users who spend five times longer on the site, then version B could be preferable depending on your business model.
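As a rough illustration of why the choice of metric matters, here is a minimal sketch in Python with invented numbers: version A wins on click-through rate, but version B wins on revenue per impression, which is the number a revenue-driven business actually cares about.

```python
# Hypothetical results: version A wins on clicks, version B wins on revenue.
impressions = {"A": 10_000, "B": 10_000}
clicks = {"A": 600, "B": 500}        # A gets 20% more clicks (invented)
revenue = {"A": 900.0, "B": 1500.0}  # but B's visitors spend more (invented)

for version in ("A", "B"):
    ctr = clicks[version] / impressions[version]
    rev_per_impression = revenue[version] / impressions[version]
    print(f"Version {version}: CTR = {ctr:.1%}, "
          f"revenue per impression = ${rev_per_impression:.3f}")

# Judged by CTR alone, A wins; judged by the business goal (revenue), B wins.
```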
The solution is to define your objectives before implementing the test. Clearly identify the goal your business is seeking to achieve first, and then worry about how to measure it.
Testing Too Many Variables
Imagine you’re testing a new banner ad, and you’re considering two different slogans, two different colors, and two different images. It might be tempting to try to cram all these variations into a single A/B test, but it’s a big mistake.
Suppose that version A features the first slogan in blue text with a female model and version B features the second slogan with red text and a male model. A quick look at the results shows that A is almost twice as effective as B. At first, this seems like a fantastic result, but this design leaves valuable information on the table. Because the ads differ on three levels, it’s impossible to determine which factor drives the difference between the two samples. Does A work better because people are attracted to the model, or because they prefer the slogan? Is B’s problem the image, or the use of red text instead of blue?
The better approach is to resist temptation and test only one variable at a time. Use ads that are identical in every way except the slogan. Choose the winner, and then retest using ads that are identical in every way except the image, and repeat as needed. This approach is more time-consuming, but it also yields richer information, and ultimately, greater profits.
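To make the one-variable-at-a-time comparison concrete, here is a minimal sketch of a single test on the slogan alone, using a two-proportion z-test from the statsmodels library. The click and impression counts are invented for illustration.

```python
# Two ads identical except for the slogan; compare click-through rates.
from statsmodels.stats.proportion import proportions_ztest

clicks = [430, 510]            # clicks for slogan A and slogan B (invented)
impressions = [10_000, 10_000] # impressions served for each version

z_stat, p_value = proportions_ztest(count=clicks, nobs=impressions)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")

if p_value < 0.05:
    print("The slogans differ; keep the winner, then test the image next.")
else:
    print("No significant difference; pick either slogan and move on to the next variable.")
```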
Overlooking Interaction Effects
In A/B testing, it’s common to use basic statistical tests like the t-test and one-way analysis of variance (ANOVA) to compare the mean of one sample against the mean of the other. Suppose you want to figure out which format for an email newsletter will attract more committed readers to your site. A test comparing the means of the two groups on a metric like time on site or revenue per visitor will tell you whether version A performs significantly better than version B. If format A and format B yield almost identical results, then it’s easy to conclude neither strategy is better than the other.
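As a minimal sketch of that comparison, the snippet below runs an independent-samples t-test on time on site for the two newsletter formats. The visit durations are simulated stand-ins, not real data.

```python
# Compare mean time on site between two newsletter formats with a t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
time_on_site_a = rng.exponential(scale=180, size=2000)  # seconds, format A (simulated)
time_on_site_b = rng.exponential(scale=195, size=2000)  # seconds, format B (simulated)

t_stat, p_value = stats.ttest_ind(time_on_site_a, time_on_site_b, equal_var=False)
print(f"mean A = {time_on_site_a.mean():.0f}s, mean B = {time_on_site_b.mean():.0f}s")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```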
This sounds simple enough, but there’s a problem. What if version A is 25% more effective than version B for men, but the opposite is true for women? The ideal marketing strategy would be to send version A to all the men on your list and version B to all the women, but a simple comparison between groups won’t give you that crucial information. Assuming the ratio of men to women is roughly equivalent across the two conditions, a t-test or an ANOVA will find no difference between them. Email format interacts with gender to produce a powerful effect for both version A and version B, but the effects cancel each other out.
Interaction effects are quite common in advertising, although they usually aren’t as extreme as in the example above. Some messages work better for older audiences, others for younger. Some messages work well for high-income readers, but others perform well with lower-income groups. Device type can produce interaction effects too because an ad format that functions well on a desktop might have problems on a cell phone.
The quick-and-dirty way to identify interaction effects is to split the sample for both versions on the variable of interest and compare again. For example, you could isolate all the men who received version A and compare them against all the men who received version B, excluding the women (and vice versa). A more statistically rigorous solution is a multiple regression model that includes the interaction term explicitly.
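Here is a sketch of that regression approach using statsmodels. The data are simulated so that version A works better for men and version B works better for women, which is exactly the pattern a plain two-group comparison would miss; the version-by-gender interaction term is what reveals it. The variable names and effect sizes are invented.

```python
# Model time on site as a function of email version, gender, and their interaction.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 4000
gender = rng.choice(["male", "female"], size=n)
version = rng.choice(["A", "B"], size=n)

# Invented effect: +30 seconds for (A, male) and (B, female), baseline otherwise.
boost = np.where(((version == "A") & (gender == "male")) |
                 ((version == "B") & (gender == "female")), 30, 0)
time_on_site = 120 + boost + rng.normal(0, 40, size=n)

df = pd.DataFrame({"version": version, "gender": gender,
                   "time_on_site": time_on_site})

# The C(version):C(gender) row of the output is the interaction effect;
# a simple t-test on version alone would find nothing in this dataset.
model = smf.ols("time_on_site ~ C(version) * C(gender)", data=df).fit()
print(model.summary().tables[1])
```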
Too Little Statistical Power
Statistical power is the ability of a given statistical test to identify an effect (i.e., a difference between A and B) if one exists. If a research design is “underpowered,” then even if one version is clearly better than the other in reality, test statistics aren’t likely to show it. Three major factors determine how much statistical power you have.
First, and most intuitively, statistical power increases with sample size. The larger the number of people you test the two versions on, the greater the chance that you’ll be able to find a difference between them. If you test your email newsletter on 10 people you probably won’t find anything interesting, but test it on 10,000 people and your chance of discovering something is much improved.
Second, large effects are easier to detect than small effects. If A is only slightly better than B, you’ll need a very large sample size to detect the difference. But if A is vastly superior to B, you should be able to tell even with a relatively small sample. In other words, you have greater statistical power to detect large differences than small differences.
Finally, statistical power depends on the degree of certainty you want to achieve. This degree of certainty, called the alpha level, is the cutoff point for determining statistical significance. The standard alpha level in most advertising contexts is .05, meaning that researchers conclude version A is preferable to version B only if a difference at least as large as the one observed in the data would occur less than 5% of the time when the two versions are actually equivalent.
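These three factors trade off against one another, and you can put numbers on the trade-off before running a test. Here is a sketch using the power calculator in statsmodels, with hypothetical click-through rates: the smaller the difference you hope to detect, the more impressions each version needs.

```python
# Required sample size per version to detect various click-through-rate
# differences with 80% power at alpha = .05 (rates are hypothetical).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

power_calc = NormalIndPower()

for baseline, variant in [(0.050, 0.060), (0.050, 0.055), (0.050, 0.052)]:
    effect = proportion_effectsize(variant, baseline)
    n_per_group = power_calc.solve_power(effect_size=effect, alpha=0.05,
                                         power=0.80, alternative="two-sided")
    print(f"{baseline:.1%} vs {variant:.1%}: ~{n_per_group:,.0f} impressions per version")
```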
If your A/B test is underpowered, you could wrongly conclude that two strategies are essentially equal when they are not. The best way to guard against this type of error is to increase the sample size. If that’s not possible, another option is to try to increase the effect size by accentuating the differences between the two versions. That technique might yield a large enough difference to detect, but it also introduces the problem of testing too many variables at the same time.
Too Much Statistical Power
An underpowered design presents major problems for data analysis, but a design with too much statistical power can cause problems as well. At first glance, it might seem like being able to detect even minute differences between two possible strategies would be ideal. In some circumstances, it can be. But under many conditions, an excess of statistical power can lead to spurious conclusions.
Suppose you’re testing a geo-located social media advertisement in a large city with a sample size of 50,000 impressions. Version A has a blue background, and version B has a red background. With a sample that large and an alpha level of .05, your statistical test will flag even a tiny difference between the two versions as statistically significant. Great news, right?
The problem is that, because of the excess of statistical power, you are very likely to find differences that are functionally meaningless. Imagine, for example, that on the morning of the test, a popular local radio station plays the irritatingly catchy Eiffel 65 hit “Blue (Da Ba Dee).” As a result, a small fraction of the city’s population spends half the morning with that song stuck in their heads, and is thus slightly more likely to click on the ad with the blue background.
From a marketing perspective, there is no real difference between A and B. The precise set of circumstances that led to the blue ad winning is exceedingly unlikely to occur again, and there’s no way to predict when every popular radio station in the target region will play that song in the future. The information is useless. But the overpowered statistical test, deaf to the influence of the hit song, will suggest otherwise.
Fortunately, there’s an easy solution to the problem of excessive power: adopt a stricter alpha level. Instead of adopting .05 as the threshold for statistical significance, set the threshold to .001, or even lower. The stricter statistical test will help to weed out the effects of random coincidences and improve your odds of identifying a true signal in the noise.
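Here is a sketch of how the stricter threshold plays out, again with a two-proportion z-test from statsmodels and invented counts: a gap of a few tenths of a percentage point in click-through rate clears the .05 bar at this sample size but not the .001 bar.

```python
# 50,000 impressions per version and a tiny, practically meaningless gap in
# click-through rate (counts are invented for illustration).
from statsmodels.stats.proportion import proportions_ztest

clicks = [2590, 2410]           # 5.18% vs 4.82% click-through (invented)
impressions = [50_000, 50_000]

z_stat, p_value = proportions_ztest(count=clicks, nobs=impressions)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
print("significant at .05: ", p_value < 0.05)   # passes the conventional threshold
print("significant at .001:", p_value < 0.001)  # fails the stricter one
```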
Whether you’re seeking to increase traffic, boost sales, or improve customer satisfaction, A/B testing is an invaluable resource. But the helpfulness of a research program depends on the quality of its design. Avoid these five common pitfalls, and your experiments will guide you toward the optimal marketing strategy for your business.