Thoughts on A/B testing

A/B testing is part of a push towards software engineering as an experimental science, which I support, but there are plenty of open problems.

I've been mulling over these points for a long while, but, after running into this excellent and amusing post by John Moult, about the pains and perils of doing analytics, I was led by example to write up my thoughts about the more specific subject of A/B testing.

There is undoubtedly a strong trend in the software industry toward the adoption of A/B testing as a powerful antidote to arbitrary and opinion-based decision making. I think this is part of a larger push to inject the powerful methods of the natural sciences into software engineering and make it more "scientific", and I wholeheartedly support that. But there is now an extreme-programming-like tendency to take a good thing too far, a very visible advocate of which has been Eric Ries (note, though, his own caveats in the linked entry). I agree with him: there are many problems with A/B testing. So why shouldn't we A/B test everything, from the almost proverbial 41 shades of blue to bug fixes?

Experiments do not replace theory

When we perform a test, we need to pick a probability of rejecting the null when the null is true, or, in plain English, of promoting a change that is actually detrimental for some important metric. People set this probability fairly low, say 5%, and think their job is done. They should read "Why most published research findings are false". The jargon of that paper might be a little unfamiliar to a software developer, so let me try and explain some ideas using the following example. If we test a very carefully selected change, where multiple people have used their knowledge and experience to reach a consensus that the change is promising and should be given a run, the pre-test probability that it's going to be successful is, based on historical data, 90%. At the opposite end, we hire monkeys to throw random changes at the A/B testing system and their pre-test probability of success is 1%. Now that 5% chance of erroneously giving the green light to a harmful change has very different consequences in the two cases. In the first, we first need to hit the 10% case where the expert committee makes a mistake, then the 5% case where the A/B test doesn't catch it. These two events are most likely independent, so we end up with a 0.5% chance of a bad change being pushed to the site. In the monkeys' case, the two probabilities are 99% and 5%, with a 4.95% chance of a bad change making its way to the users.
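The arithmetic in this example can be checked with a minimal sketch. The function name and parameters are hypothetical, and the calculation assumes, as the text does, that the proposer's mistake and the test's false positive are independent.

```python
def p_bad_change_shipped(pretest_success: float, alpha: float) -> float:
    """Probability that a change is bad AND the A/B test approves it anyway.

    Assumes the two events are independent, as in the example above.
    """
    return (1.0 - pretest_success) * alpha


# Pre-test success probabilities taken from the example: 90% and 1%.
experts = p_bad_change_shipped(pretest_success=0.90, alpha=0.05)
monkeys = p_bad_change_shipped(pretest_success=0.01, alpha=0.05)

print(f"experts: {experts:.2%}")  # experts: 0.50%
print(f"monkeys: {monkeys:.2%}")  # monkeys: 4.95%
```

The same 5% threshold lets through roughly ten times more bad changes when the proposals are random.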

Statistics and computational molecular biology have been dealing with monkeys for a long time: it's actually called multiple testing. Think of those genome-wide scans for genes of interest in some disease: for each gene, before seeing the data, the chance of being involved in the disease is negligible. I've been involved in some of those scans myself. The solutions involve carefully raising the significance bar, which means that some potentially successful changes will be discarded, victims of the huge number of tests being performed. Another consequence is that monkeys raise the bar for everyone, including the team of experts, by submitting large numbers of random changes. There you have it: a statistical view of why teams are often bogged down by their weakest members. My point of view is that instead of succumbing to the Lake Wobegon effect, according to which even the most mediocre startup is hiring only the best and brightest, one should be able to model the pre-test likelihood of success based on repeated observation of the same developer or team. There are always going to be performance differences in any team and it's better to try and make the best of them instead of trying to eliminate them.
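To make "raising the significance bar" concrete, here is a minimal sketch of two standard corrections, Bonferroni and Benjamini-Hochberg. The post does not say which correction it has in mind, so both are my choice for illustration, and the p-values are made up.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values below alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Control the false discovery rate with the step-up BH procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k such that the k-th smallest p-value is <= k/m * alpha.
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        rejected[i] = rank <= max_rank
    return rejected


# Hypothetical p-values from six A/B tests run at the same time.
p_values = [0.001, 0.008, 0.012, 0.041, 0.20, 0.74]

print([p <= 0.05 for p in p_values])  # [True, True, True, True, False, False]
print(bonferroni(p_values))           # [True, True, False, False, False, False]
print(benjamini_hochberg(p_values))   # [True, True, True, False, False, False]
```

The fourth change would pass an uncorrected 5% test but is discarded by either correction, which is the cost the paragraph describes.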

It seems to me the quarrel between designers, who seem to be opposed to A/B testing, and engineers, who seem more inclined to support it, derives from a misunderstanding of the role of A/B testing: it doesn't formulate interesting and important hypotheses to test, it doesn't direct our efforts where it matters. Experiments do not replace theory.

All observable metrics can be misleading

In theory, we all agree that we are working towards the long term success of our employer or client. In practice, nobody can observe that in the present. What is the closest thing? Maximizing revenue? Adoption or user satisfaction? Content generation or consumption? A/B tests rigorously assess the impact of a change on one or more of these proxies for success. The test is rigorous, but the proxy relation is an educated guess at best. For example, more page views are good, but using AJAX or adding video can reduce the number of page views while increasing user satisfaction. More impressions are good, right? Sure, but one can replace all content with ads based on this metric alone and short term measurements. In fact, such a thing happens: it's called a homepage takeover. If you A/B tested the takeover, you would likely find that revenue has increased, but nobody would be so ill-advised as to deploy the takeover for good. Aware of the limitations and uncertainties of each of these metrics, management often uses the "let's do both (all)" heuristic and metrics inflation ensues. Let's measure everything: page views, page by page, content creation, by type of content, time on site, impressions, ad clicks, adoption, virality. Each metric requires a separate test, compounding the multiple testing issue described above. Even if we can take advantage of very large samples, it is very often the case that some metrics are significantly up and some are down, just because people's time and patience are finite and using any feature competes with using every other. Add video and people have less time for pictures. Add chat and messages decline. If you have, say, 100 metrics, every decision becomes a judgement call because some will be up and some down, and this is the opposite of what we were trying to achieve with A/B testing in the first place. The quest for the practical measure that best correlates with long term goals is still ongoing.
The issue of long term objectives is also covered in a recent and highly recommended paper on A/B testing, where it is somewhat downplayed. The authors suggest incorporating them into the analysis, but how can you A/B test ROI over 5 years? Another recommendation they make is to use complex metrics that combine revenue with user satisfaction, such as a function of revenue and visit frequency. I think they are somewhat in denial of the key issue. At one level, we can use A/B testing to verify an assertion like: "feature X increases visit frequency". It's a statement about the present or a very near future we can wait for so that we can measure it, and we'd better not be wrong about it. At another level, we want A/B testing to replace human decision making in all its fallibility, and then we are entering the prediction business, which is a lot more difficult.
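As a rough illustration of the 100-metrics scenario, the sketch below counts the spurious signals to expect when a change has no real effect on any metric. It assumes the metrics are independent, which they rarely are in practice, so the numbers are only indicative; the function name is hypothetical.

```python
def false_alarm_summary(n_metrics: int, alpha: float = 0.05):
    """Expected false positives and chance of at least one, under no real effect.

    Assumes one independent test per metric, each run at level alpha.
    """
    expected_false_positives = n_metrics * alpha
    p_at_least_one = 1.0 - (1.0 - alpha) ** n_metrics
    return expected_false_positives, p_at_least_one


expected, p_any = false_alarm_summary(n_metrics=100, alpha=0.05)
print(f"expected spurious 'significant' metrics: {expected:.1f}")
print(f"chance that at least one metric moves by luck: {p_any:.1%}")
```

Under these assumptions about five metrics look significant by chance alone, and it is almost certain that at least one does.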

Speed is important, but it is the enemy of accuracy, unless your name is Sundance Kid. As Google CEO Eric Schmidt recently observed, when launching a new product what counts is the speed of adoption after the first burst of interest has declined. I believe the same to be true for all but the least visible features, albeit on different time scales: some need just a look to get used to, like a different background, and some cause protests and user churn before taking off. The great chess master Capablanca, asked how many moves he considered before making one, famously answered: "Only one, but it's always the right one". For everyone else it is necessary to look several moves ahead. Almost any reasonable heuristic becomes effective if the position space is searched thoroughly enough. The converse seems to be true with web-related metrics: in the short term all of them can be misleading. Yet, from what I hear in the industry, most A/B tests are run over fixed time periods, ignoring temporal effects. It is wise to experiment with the duration of A/B tests and model time-dependent effects. The way to deploy more changes faster is not to perform shorter A/B tests, but to run many in parallel. Check out this report of two experimental studies from Microsoft and Google on the impact of web search speed on user behavior. From the graphs you can tell the experiments lasted at least 6 and 11 weeks respectively. The aforementioned paper recommends a full week at least to steer clear of day-of-the-week effects. I think those are very consistent and there is a chance to model them successfully. It's the rare events that worry me, like a sudden concentration of soccer matches on TV, and the variable nature of users' reactions and learning curves. If you combine 11-week-long tests with the 50 releases per day that are possible with modern build-test-deploy systems, and count 5 working days per week, that means 50 * 5 * 11 = 2750 ongoing experiments at any given time!
That's an extreme I am not sure can be managed effectively without an equally extreme level of modularity, and it might not be necessary. Moreover, lengthy tests affect time to market and delay useful feedback, no matter how many are performed at the same time. One way to work around that is to run a "B/A test" for some time after a release, that is, to keep a small sample of users behind to observe long term effects without slowing down the development of the product. A change has to be reconsidered only when the supplemental, long term analysis disagrees with the initial, fast assessment, and by then the analyst has the time to really understand what is happening.
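One way the "B/A test" holdback could be implemented is with a deterministic hash of the user ID, so that the same users stay on the old version for as long as the long term analysis runs. This is a minimal sketch of a common technique, not something the post prescribes; the function name, the release label and the 1% fraction are all hypothetical.

```python
import hashlib


def in_holdback(user_id: str, release: str, holdback_fraction: float = 0.01) -> bool:
    """Return True if this user should stay on the pre-release version.

    Hashing the release label together with the user ID gives each release
    its own stable, pseudo-random holdback group.
    """
    digest = hashlib.sha256(f"{release}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < holdback_fraction


# Hypothetical usage: check what share of users ends up held back.
users = [f"user-{i}" for i in range(100_000)]
held_back = sum(in_holdback(u, release="release-42") for u in users)
print(f"held back: {held_back / len(users):.2%}")
```

Because assignment depends only on the inputs, no per-user state has to be stored, and the holdback group can be reconstructed later for analysis.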

Testability bias

Some things are easier to test than others. If we mandate that any product change undergo an A/B test, we might end up introducing a bias not only for changes that are successful but also for changes that are testable. This is a non-exhaustive list of testability biases:

User attitude: users of many web sites have shown resistance to change, the most famous example of which might be Facebook's introduction of the newsfeed, now a mandatory feature of any social network. Visible changes also create user confusion and envy, because users might become aware of different features available to other users. Broadcast communication channels to announce new features, like company blogs and PR channels, cannot be used because they don't selectively reach the B population. This means that users are surprised by the changes, exacerbating a negative attitude. This creates a bias toward small incremental changes that go unnoticed.
Learning curve: related to the previous one, any feature that requires learning will undergo a ramp-up period during which it could be rejected by an A/B test. So the bias is toward intuitive features vs. powerful but difficult to use ones. Or it could be a novelty curve: a feature gets an initial spike of interest followed by oblivion, for the opposite bias.
Networked features such as messaging or chat or multiplayer games or anything social: it's not clear how to sample communities in a statistically unbiased way. Facebook revealed they sample whole countries (can't locate the reference, could any reader help?), with all the limits of that approach. People at the "social boundaries" of a sampled community will find it harder to engage in the new feature. On the other hand some simple sampling techniques are biased toward more connected users. The bias here could be positive or negative.
Non-users vs. users: users are more readily available for an A/B test than non-users, but if user base growth is a priority one needs to take non-users' needs into account: what is it that prevents a potential user from joining in? It is unlikely that deploying a new feature to a small subset of users will make a sizeable difference to non-users who would join if aware of that feature. Therefore user acquisition efforts focus on marketing and virality, which can be A/B tested, not on catering to non-user needs, that is, creating a more broadly appealing product. One can A/B test casual visitors of a site and maximize their conversion into regular users, but this focuses mostly on the subscription process and the perception of the product more than its reality, and potential users need to be visiting the site in the first place. The bias here is in favor of current users.
Premium features: since adoption is generally lower than for free features, sample sizes are smaller, sometimes by orders of magnitude. Gains need to be bigger to be detectable, all else being equal. The bias is toward free features.
Resource intensive features: depending on deployment and load balancing techniques, it's very possible that resource intensive features might be at an unfair advantage or disadvantage during testing. Let's say you have a fixed pool of machines to use for testing and that pool is not resized when testing a resource intensive feature: the test might indicate to drop that feature, but the only reason is that we needed to allocate more resources and the additional cost could have been justified. The opposite scenario is when the load is balanced across A and B pool users, using the same set of resources. Then any speed problem with a new feature is going to be hidden.
Interaction: a change that affects several sections of a web site is going to interact with other changes we are trying to A/B test in parallel, making testing more difficult. Even when features appear to be unrelated, they are competing for user time and attention. The same influential paper on A/B testing states that interactions are neither very important nor very common, but my experience does not support that. It could be that different types of sites are more prone to interaction: for instance, on a purely recreational site people are going to spend 10 minutes on any given day, and if offered activity X they won't engage in activity Y, which was the hot new thing just the week before. On an e-commerce web site or search engine things might work in a different way.

Contrasting hypotheses is at the heart of the scientific method, but science has the luxury of picking which problems are within its domain. When creating a product, one doesn't have that luxury. If these biases go unnoticed and are not addressed, we might end up with a product that is conservative in its development, dumbed down, catering to a niche of early adopters, free and so forth. The point of using statistics was to help us go where we want to go, not to introduce biases in decision making. Are there ways to mitigate these effects?

User attitude: educate users by creating channels specific to test users when the new feature or improvement being tested is important enough. When in doubt, split the B population into two random samples, educate B1, let B2 figure it out, and study the differences.
Learning curve: look at temporal effects. Let the test run until a steady state is reached, or incorporate those effects into the model.
Networked features: this is an open research problem as far as I know. I would start from this paper.
Non-users vs. users: this is a hard one. I will risk two suggestions. One is to advertise or announce non-existing or prototype-level features and see what the response is. Some call it "fake it till you make it". The other one is that newcomers are likely to be more similar to potential users than early adopters are. You could even try to model the shifts in usage patterns and extrapolate. I don't have experience in either, so take them as brainstorming.
Premium features: put more data gathering effort into measuring premium features. If you are sampling, be sure to use adjustable sampling rates so that you don't end up with 99% of your data being about page views of the home page. Run longer tests involving a larger user base when dealing with premium features. Use better statistics: when the sample sizes are small, accurate modeling is more important, and errors and changes of direction are less costly and public.
Resource intensive features: in general, testing should always be done in a real-life setting and therefore resource constraints should be part of it. Developers and managers that propose new features should also provide estimates of the resources necessary to make them successful. This way, a cost-benefit analysis is possible. But for bolder, complicated features, it might be helpful to know first what the user reaction is, assuming there is no resource issue, then try to get the feature out within a realistic budget. It's a form of prototyping: leave optimizations for later. Of course one can't assume P = NP during prototyping and leave that as an implementation detail.
Interaction: modeling interactions is best, because you need to know if two independently useful changes will be a disaster when combined. One can also try and avoid interactions using multiple test pools, but subsequent combination testing is necessary.
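To put numbers on the premium-features point (smaller samples need bigger gains), here is a minimal sketch of the smallest absolute lift in a conversion rate that a two-arm test can be expected to detect. It relies on the usual normal approximation with equal arm sizes; the 5% baseline rate and the sample sizes are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(baseline_rate, users_per_arm, alpha=0.05, power=0.8):
    """Approximate smallest absolute lift detectable with the given power.

    Uses the normal approximation for a two-sided, two-sample test on
    proportions, with both arms at roughly the baseline rate.
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    std_error = sqrt(2.0 * baseline_rate * (1.0 - baseline_rate) / users_per_arm)
    return (z_alpha + z_power) * std_error


# Hypothetical sizes: a free feature seen by 1,000,000 users per arm
# versus a premium feature seen by 10,000 users per arm.
mde_free = minimum_detectable_effect(0.05, users_per_arm=1_000_000)
mde_premium = minimum_detectable_effect(0.05, users_per_arm=10_000)

print(f"free feature:    {mde_free:.4%} absolute lift")
print(f"premium feature: {mde_premium:.4%} absolute lift")
```

A sample 100 times smaller needs an effect 10 times larger, which is why longer tests or larger user bases help for premium features.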

Statistical testing might not be the right way

I am not going to say that we should give up on statistics in software engineering. I just think hypothesis testing in particular has been overemphasized in its application to software development and we need to look at a richer set of statistical tools. Imagine this situation: after an A/B test, the data is not enough to reject the hypothesis that A and B perform equally well. Of course the numbers are not identical, but the test doesn't reach the predetermined significance level. What is a manager to do?

1. Coin toss
2. More experiments and analysis
3. Go with the currently deployed version
4. Go with the experimental version
5. Go with the version that had slightly better numbers
6. All of the above

Imagine the converse. The data shows that B is better than A. Unfortunately, B requires 10% more servers to meet response time requirements. The test doesn't tell you by how much B is ahead. In both cases and in my experience, an estimate of the performance gap and a confidence interval are much more useful or, as they say, actionable than a test. In the first case, if the potential difference is big, albeit uncertain, let's say the 95% confidence interval is -3–11%, one might go with 2, more experiments, to try and narrow it down. If it is 0–1%, it could be better to move on — my pick is option 5 all else being equal. In the second case, if the lower end of the confidence interval is related to business goals that are deemed worth the cost of additional servers, the manager might go ahead, otherwise try and get more data. Of course there is a cost to gathering more data and delaying decisions as well.
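A minimal sketch of the kind of estimate argued for here: the difference between two conversion rates with a confidence interval around it. It uses the simple Wald interval on the absolute difference (the intervals quoted in the text are relative, in percent), and the counts are made up.

```python
from math import sqrt
from statistics import NormalDist


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, level=0.95):
    """Wald confidence interval for the absolute difference rate_B - rate_A."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    std_error = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    diff = rate_b - rate_a
    return diff - z * std_error, diff + z * std_error


# Hypothetical counts: 500 conversions out of 10,000 users on A, 520 on B.
low, high = diff_confidence_interval(500, 10_000, 520, 10_000)
print(f"B - A is between {low:+.2%} and {high:+.2%} (95% confidence)")
```

With these counts the interval straddles zero, so a test would not reject equality, but the interval also shows how large a gain or loss is still plausible, which is what the manager needs in order to choose among the options above.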

Another way in which statistical testing is not appropriate is illustrated again by the "41 shades of blue" example. In this case regression seems a more natural approach and multivariate regression would allow us to study border colors and background colors in a coordinated fashion, which makes a lot of sense to me. The expertise of both the domain expert, like a UI designer, and the data analyst here becomes important to formulate appropriate models. For instance, we have a strong expectation that similar shades of color will have very close performance, based on color perception theory: therefore the model should use a continuous or Lipschitz function to relate color and user behavior. In summary, there is a range of statistical tools beyond testing that might be more appropriate for different problems that arise in experimental software development: use them!
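One simple way to encode the expectation that similar shades perform similarly is kernel smoothing over hue. This is my choice for illustration, not a method the post prescribes; a real model would be designed with the UI designer and the analyst. The click counts are made up.

```python
from math import exp


def smoothed_click_rate(hue, observations, bandwidth=10.0):
    """Estimate the click rate at a hue by pooling data from nearby hues.

    observations is a list of (hue, clicks, views) tuples. Each one is
    weighted by a Gaussian kernel, so close shades count more than distant
    ones and the estimate varies smoothly with hue.
    """
    weighted_clicks = weighted_views = 0.0
    for obs_hue, clicks, views in observations:
        weight = exp(-0.5 * ((hue - obs_hue) / bandwidth) ** 2)
        weighted_clicks += weight * clicks
        weighted_views += weight * views
    return weighted_clicks / weighted_views


# Hypothetical data: (hue in degrees, clicks, views) for five shades of blue.
observations = [
    (200, 48, 1000),
    (210, 52, 1000),
    (220, 61, 1000),
    (230, 55, 1000),
    (240, 47, 1000),
]

for hue in (205, 215, 225):  # shades that were never tested directly
    print(hue, round(smoothed_click_rate(hue, observations), 4))
```

Instead of 41 separate tests, all shades contribute to one model, and untested shades get an estimate too.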

Check your assumptions

In the aforementioned post, John Moult finds that fastidious attention to every single factor that can compromise the validity of a statistical study is a killer for the work of the analytics engineer or scientist, and I agree with that. But in the case of A/B testing there is something practical we can do to allay some concerns. Perform periodic A/A tests and look for differences, and also look at the distribution of p-values, which should be uniform in this case. An A/A test is one in which everything is done according to the protocol for an A/B test, but the two products being compared are actually identical. In my experience I had a few surprises from this kind of check, which triggered somewhat painful investigations, but eventually generated more confidence in the method and better understanding of how our system worked. On the other hand, it's a little harder to check for power, or false negatives. If a generative model of the data is available, it is possible to run the test on synthetic data and see how it goes. One can also subsample the data and see what size effects are detected at what sample size. Under normality assumptions, there is an analytic solution that connects effect size, variance and sample size — see yet again this paper.
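The A/A check can be rehearsed on synthetic data before running it on a real system. The sketch below simulates many A/A tests, computes a two-proportion z-test for each, and tabulates the p-values by decile; under a sound protocol each decile should hold roughly a tenth of them. The test statistic, the 10% conversion rate and the sample sizes are assumptions for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_error == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / std_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def simulate_aa_p_values(n_tests, users_per_arm, rate, seed=0):
    """Run n_tests A/A tests in which both arms share the same true rate."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_tests):
        conv_a = sum(rng.random() < rate for _ in range(users_per_arm))
        conv_b = sum(rng.random() < rate for _ in range(users_per_arm))
        p_values.append(
            two_proportion_p_value(conv_a, users_per_arm, conv_b, users_per_arm)
        )
    return p_values


p_values = simulate_aa_p_values(n_tests=1000, users_per_arm=500, rate=0.10)

# Count p-values per decile; a roughly flat histogram is the healthy outcome.
deciles = [0] * 10
for p in p_values:
    deciles[min(int(p * 10), 9)] += 1
print("p-values per decile:", deciles)

false_alarm_rate = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"share of A/A tests 'significant' at 5%: {false_alarm_rate:.1%}")
```

On a real system, a clearly non-flat histogram, or a false alarm rate far from 5%, is the kind of surprise that should trigger an investigation.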

Conclusions

I believe that the experimental approach to software engineering is here to stay and will develop further, becoming easier to apply, more predictive of long term success and more generally accepted. By contrast, overemphasizing statistical tests as the only and infallible tool of this trade, or as a shrink-wrapped solution for all sorts of development decisions, is not helpful, is not supported by evidence or theory, and is generating a backlash among less quantitatively oriented people. The answer is more and better statistics, but also an acceptance of its role and limits.

Credits

John Moult suggested an important reference for this post

Comments

Mike

I'm sorry but you took an already complex concept and managed to make it even more complicated. Nice topic anyway ;)