In this series of articles, let’s take a quick look at the basics of A/B testing and highlight some best-practice tips. There are plenty of books and articles on statistical hypothesis testing, but very few of them discuss what can go wrong.
Although A/B testing is widespread these days, a lot can go wrong while setting up a test and while interpreting its results. Let’s delve into the important questions to consider during A/B testing, and later follow up with an overview of a promising alternative called ‘multi-armed bandits’.
Offline and online: these are roughly the two regimes for machine learning evaluation. Offline evaluation happens during the prototyping phase, when one tries out different features, models, and hyper-parameters. It is an iterative process in which multiple rounds of evaluation are carried out against a chosen baseline on a set of chosen evaluation metrics. Once we have a model that performs reasonably well, it is ready to be deployed to production, where online testing evaluates its performance on live data.
What is A/B Testing?
In industry, A/B testing is now the predominant method of online testing. The technique is often used to answer questions like:
“Is my new model superior to the old one?”
“Which colour suits this button better, red or green?”
The A/B testing setup typically has a new model (or design) and an incumbent model (or design).
In A/B testing, live traffic is split into two groups: Group A, the control group, and Group B, the experiment group. Group A is routed to the old model, and Group B to the new model. The performance of the two is then compared, and an analysis of the observations leads to a decision about whether the new model performs substantially better than the old one. In a nutshell, this is what A/B testing is all about, although a complete machinery, known as statistical hypothesis testing, makes this statement much more precise.
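One common way to implement the split is to hash a user identifier, so that the same user always lands in the same group. The sketch below is only an illustration under assumptions that are not in the original text: the function name `assign_group`, the experiment label, and the 50/50 default share are all hypothetical.

```python
import hashlib


def assign_group(user_id: str,
                 experiment: str = "new-model-test",  # hypothetical experiment label
                 treatment_share: float = 0.5) -> str:
    """Deterministically assign a user to group "A" (control) or "B" (experiment)."""
    # Salting the hash with the experiment name keeps assignments
    # independent across different experiments.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "B" if bucket < treatment_share else "A"


if __name__ == "__main__":
    for uid in ["user-1", "user-2", "user-3"]:
        print(uid, assign_group(uid))
```

Hashing, rather than drawing a fresh random number on every request, means a returning user keeps seeing the same variant, which is usually what one wants when comparing behaviour over time.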
Statistical hypothesis testing decides between a null hypothesis and an alternative hypothesis. The null hypothesis is usually of the form “the new model doesn’t change the average value of the key metric,” while the alternative hypothesis states that “the new model changes the average value of the key metric.” The test for the average value (the population mean, in statistical language) is the most common, but there are tests for other population parameters as well. More often than not, A/B tests are designed to answer the question, “Does this new model lead to a statistically significant change in the key metric?”
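Written out in symbols, the two hypotheses for a test on the mean might look as follows. The notation is an assumption introduced here for illustration: $\mu_A$ and $\mu_B$ stand for the population mean of the key metric under the old and the new model, respectively.

```latex
\begin{aligned}
H_0 &: \mu_B = \mu_A    && \text{(the new model does not change the mean of the key metric)} \\
H_1 &: \mu_B \neq \mu_A && \text{(the new model changes the mean of the key metric)}
\end{aligned}
```

This is the two-sided form, matching the wording “changes” rather than “improves”.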
Plenty of books and online resources describe statistical hypothesis testing in great detail, so we will not attempt to expound on it here.
In short, A/B testing involves the following steps:
- Splitting the live traffic into randomized control and experiment groups.
- Observing the behavior of both groups on the proposed methods.
- Computing the test statistic.
- Computing the p-value.
- Making a decision.
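The steps above can be sketched end to end on simulated data. This is a minimal illustration, not a recommended implementation: it assumes the key metric is a binary conversion, uses a two-proportion z-test with a normal approximation, and the function names, conversion rates, and significance level `alpha = 0.05` are all made up for the example.

```python
import math
import random


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (normal approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        # No variation at all in the data: nothing to distinguish the groups.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def run_ab_test(rate_a, rate_b, n_users, alpha=0.05, seed=0):
    """Simulate an A/B test with assumed true conversion rates for each group."""
    rng = random.Random(seed)
    users = {"A": 0, "B": 0}
    conversions = {"A": 0, "B": 0}
    for _ in range(n_users):
        # Step 1: randomized split of the traffic.
        group = "A" if rng.random() < 0.5 else "B"
        users[group] += 1
        # Step 2: observe the (simulated) behavior of the user.
        true_rate = rate_a if group == "A" else rate_b
        if rng.random() < true_rate:
            conversions[group] += 1
    # Steps 3 and 4: test statistic and p-value.
    z, p_value = two_proportion_z_test(
        conversions["A"], users["A"], conversions["B"], users["B"]
    )
    # Step 5: decision.
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    return {"users": users, "conversions": conversions,
            "z": z, "p_value": p_value, "decision": decision}


if __name__ == "__main__":
    print(run_ab_test(rate_a=0.10, rate_b=0.12, n_users=20000))
```

A different metric (for example, a continuous one such as revenue per user) would call for a different test statistic, but the five steps stay the same.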
Looks like a cakewalk, doesn’t it? What, then, could go wrong?
Quite a bit! While understanding A/B tests is easy, doing them right is fairly tricky.
In the forthcoming posts of this series, we will explore common pitfalls of A/B testing. We will include a list of things to watch out for, ranging from the pedantic to the pragmatic. Although a few of the pitfalls are straightforward and well known, others are trickier than they sound.