Continuing from my previous post
in this series on A/B testing, let us dig deeper into the process by examining several practical aspects.
- Separate the Experiences Completely
The very first thing to do is take a hard look at the user randomization and group-splitting module you have built. Is the portion of users assigned to the experimentation group clearly demarcated? Are those users experiencing only the new design or model?
It is of utmost importance to separate the experiences of the two groups clearly and completely. Suppose a new button is being tested on a website. If this button appears on every page, then the same user should see the same version of the button everywhere on the site. It makes more sense to split by user ID (if one exists) or by user session, rather than by individual page visit.
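One common way to get this consistency is deterministic bucketing: hash the user ID (salted with an experiment name) so the same user always lands in the same group. A minimal sketch, where the function name and salting scheme are illustrative assumptions, not a prescribed API:

```python
import hashlib

def assign_bucket(user_id: str, experiment: str, treatment_fraction: float = 0.5) -> str:
    """Deterministically assign a user to 'treatment' or 'control'.

    Hashing the user ID (salted with the experiment name) guarantees
    the same user sees the same experience on every page and session,
    while different experiments get independent splits.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    # Map the 128-bit hash to a number in [0, 1) and compare to the split fraction.
    fraction = int(digest, 16) / 16**32
    return "treatment" if fraction < treatment_fraction else "control"
```

Because the assignment is a pure function of the user ID, no per-user state needs to be stored, and the split is reproducible across servers.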
There is also a chance that some of your users have been conditioned, or "trained," by the old model or design and prefer things the way they were. Kohavi et al. discuss this "carryover" effect in their KDD 2012 paper. Such users carry the baggage of the old system and are more likely to give biased responses to any new model. If this is the case, you should seriously consider acquiring a brand-new set of users or re-randomizing the test buckets.
It is always advisable to do some A/A testing to ensure the soundness of your testing framework. Simply put, this means performing the randomization and the split as usual, but serving both groups the same model or design, then checking for any observable differences. Switch over to A/B testing only once the system passes the A/A test.
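An A/A test can be simulated end to end to sanity-check the analysis pipeline. In this sketch both groups draw clicks from an identical distribution (the 10% click rate and sample sizes are made-up numbers), so a healthy framework should usually fail to find a significant difference:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# A/A test: both buckets are served the SAME experience, so any metric
# difference should be pure noise. The metric here is simulated
# click-through, drawn from an identical distribution for both groups.
group_a = rng.binomial(1, 0.10, size=5000)
group_b = rng.binomial(1, 0.10, size=5000)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
# A sound framework should fail to reject the null in roughly 95% of
# A/A runs (at alpha = 0.05). Consistently small p-values signal a bug
# in the randomization, logging, or analysis code.
print(f"p-value: {p_value:.3f}")
```

In practice you would run this against real logged traffic rather than simulated draws, and repeat it many times to estimate the actual false-alarm rate.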
- Which Metric Should be Used?
A pertinent question at this juncture is: which metric should be used to evaluate the model? Ultimately, the right metric is in all probability a business metric, but such metrics may not be easy for the live system to measure. For example, search engines care about the number of users, the time they spend on the site, and their overall market share. The live system has no readily available market-share statistics to compare against, so it must approximate the ultimate business metric with more measurable proxies such as the number of unique visitors per day and the average session length.
Designing the right metric is genuinely tricky. In practice, short-term, measurable live metrics may not always align with long-term business metrics.
There are four classes of metrics to consider: measurable live metrics, business metrics, training metrics, and offline evaluation metrics. The difference between business metrics and measurable live metrics has just been touched upon.
Offline evaluation metrics are the classification, regression, and ranking metrics discussed previously in this series. Training metrics are the loss functions optimized during the training process; for instance, a support vector machine optimizes a combination of the norm of the weight vector and misclassification penalties.
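The SVM training metric just mentioned can be written down directly. A schematic implementation of the standard soft-margin objective (the function name and the choice of regularization constant `C` are illustrative):

```python
import numpy as np

def svm_objective(w, X, y, C=1.0):
    """Soft-margin linear SVM training metric:
    0.5 * ||w||^2 (norm of the weight vector)
    + C * sum of hinge losses  (misclassification penalties).
    Labels y are assumed to be in {-1, +1}.
    """
    margins = y * (X @ w)                     # signed margins per example
    hinge = np.maximum(0.0, 1.0 - margins)    # penalty only inside the margin
    return 0.5 * np.dot(w, w) + C * hinge.sum()
```

This is the quantity the optimizer drives down during training; it is distinct from the offline evaluation metric (say, accuracy) and further still from live or business metrics.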
Ideally, all four of these metrics would be either exactly the same or linearly aligned with each other. The first is impossible and the second highly unlikely, so the practical goal is for these metrics to at least increase and decrease together. Even then, you may come across situations where a linear decrease in RMSE (a regression metric) does not translate into a linear increase in click-through rate. (Kohavi et al. describe some interesting examples in their KDD 2012 paper.) Keep this in mind so you can focus your optimization effort where it counts most. Constantly track all of these metrics, so that you are alerted immediately when things go out of control: usually a sign of software or instrumentation bugs, or of distribution drift.
- How Much of a Change Counts as a Real Change?
Having decided upon the metric, the next step is figuring out how much of a change in this metric actually matters. Answering this is essential for deciding the number of observations the experiment requires. Like the previous question, this is in all probability a business question, not purely a data science question. Select a reasonable value up front and thereafter resist the temptation to change it as the results start to come in.
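Once the minimum change worth detecting is fixed, the required number of observations follows from a standard power calculation. A sketch using the usual normal-approximation formula (the default significance level of 0.05 and power of 0.8 are conventional choices, not requirements):

```python
import math
from scipy.stats import norm

def sample_size_per_group(delta, sigma, alpha=0.05, power=0.8):
    """Observations needed in EACH group to detect a difference of
    `delta` in a metric with standard deviation `sigma`, using a
    two-sided test at significance `alpha` with the given power.

    Normal-approximation formula:
        n = 2 * (z_{1-alpha/2} + z_{power})^2 * (sigma / delta)^2
    """
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the two-sided test
    z_beta = norm.ppf(power)            # quantile corresponding to desired power
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return math.ceil(n)
```

Note how the sample size grows quadratically as the detectable difference shrinks: halving `delta` quadruples the number of observations needed, which is exactly why the threshold should be a deliberate business decision rather than an afterthought.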
- Should We Use a One-Sided or a Two-Sided Test?
A one-sided (or one-tailed) test can only tell you whether the new model is better than the baseline; it does not tell you whether it is in fact worse. A two-sided (or two-tailed) test tells you whether the new model is different from the baseline, either better or worse, though a separate check is then needed to determine which is the case. Use a one-sided test only if you are absolutely confident the new model can never be worse, or if there are no consequences to it being worse; otherwise, always test in both directions.
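The distinction is a one-line difference in most statistical libraries. A sketch using SciPy's independent-samples t-test (requires SciPy 1.6 or later for the `alternative` argument; the simulated metric values are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
baseline = rng.normal(loc=10.0, scale=2.0, size=1000)   # control group metric
candidate = rng.normal(loc=10.3, scale=2.0, size=1000)  # new model metric

# Two-sided: is the new model DIFFERENT from the baseline (better or worse)?
_, p_two = stats.ttest_ind(candidate, baseline)

# One-sided: is the new model strictly BETTER than the baseline?
_, p_one = stats.ttest_ind(candidate, baseline, alternative="greater")
```

When the observed effect is in the hypothesized direction, the one-sided p-value is half the two-sided one, which is precisely why the one-sided test is tempting and why it should be reserved for cases where "worse" is genuinely impossible or harmless.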
- How Many False Positives Can be Tolerated?
A false positive in A/B testing means the null hypothesis has been rejected when it is in fact true. Simply put, you've decided that your model is better than the baseline when it actually isn't. What is the cost of a false positive? That depends on the application.
For instance, in a drug effectiveness study, a false positive could mean a patient takes an ineffective drug, while a false negative could mean an effective drug never reaches a patient who needs it. In both cases, the cost to the patient's health is very high.
In a machine learning A/B test, a false positive means switching to a model that was supposed to increase revenue but in fact does not. A false negative means passing over a better model and missing out on a potential revenue increase.
The results of these tests are probabilistic statements made at a specified confidence level. A statistical hypothesis test lets us control the probability of false positives by setting the significance level, and of false negatives through the power of the test. For example, if we select a false positive rate of 0.05, then out of every 20 new models that do not actually improve on the baseline, on average one will be falsely identified by the test as an improvement. The pertinent question is: would this be an acceptable outcome for the business?
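The "1 in 20" claim can be checked directly by simulation: run many A/A-style experiments where the null is true by construction and count how often the test rejects at a significance level of 0.05. A sketch (the 10% conversion rate and sample sizes are made-up numbers):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_experiments = 2000

false_positives = 0
for _ in range(n_experiments):
    # The null is true: both "models" produce the same conversion rate,
    # so any rejection is by definition a false positive.
    a = rng.binomial(1, 0.10, size=2000)
    b = rng.binomial(1, 0.10, size=2000)
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1

rate = false_positives / n_experiments
# The observed rate should hover around alpha, i.e. roughly 5% of null
# experiments get flagged as "improvements".
print(f"false positive rate: {rate:.3f}")
```

Seeing the 5% materialize concretely makes the business question easier to pose: if you launch 20 experiments a quarter, roughly one spurious "win" per quarter is the price of this significance level.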