This blog post is the continuation of my previous articles: part 1, part 2, and part 3.
Caution: The Difference Between Training Metrics and Evaluation Metrics
Sometimes the model training procedure optimizes a different metric (the training objective, also known as a loss function) than the one used for evaluation. This can happen when we repurpose a model for a task other than the one it was designed for. For example, we might train a personalized recommender by minimizing the loss between its predictions and observed ratings, and then use this recommender to produce a ranked list of recommendations.
This is not an optimal scenario. It makes life difficult for the model by asking it to do a task that it was not trained to do, and it should be avoided whenever possible. It is generally better to train the model to directly optimize the metric on which it will be evaluated. For certain metrics, however, this may be very hard or even impossible. (For instance, it’s very difficult to directly optimize the AUC.) Always think about what the right evaluation metric is, and check whether the training procedure can optimize it directly.
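To make the mismatch concrete, here is a minimal sketch with made-up numbers. Two hypothetical recommenders are scored both by RMSE (a typical training-style loss on ratings) and by precision at k (a ranking-style evaluation metric). The ratings, the names `model_a` and `model_b`, and the relevance threshold of 4 are all illustrative assumptions, not values from any real system.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted ratings."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def precision_at_k(y_true, y_pred, k, relevant_threshold):
    """Fraction of the k highest-scored items whose true rating is relevant."""
    ranked = sorted(range(len(y_pred)), key=lambda i: y_pred[i], reverse=True)
    hits = sum(1 for i in ranked[:k] if y_true[i] >= relevant_threshold)
    return hits / k

# Made-up true ratings for five items (assumption for illustration).
true_ratings = [5, 4, 3, 2, 1]

# model_a: predictions close in value, but the top of the ranking is wrong.
model_a = [3.1, 3.2, 3.3, 2.5, 2.4]
# model_b: predictions far off in value, but the ordering is correct.
model_b = [9.0, 8.0, 3.0, 2.0, 1.0]

rmse_a = rmse(true_ratings, model_a)
rmse_b = rmse(true_ratings, model_b)
p_at_2_a = precision_at_k(true_ratings, model_a, k=2, relevant_threshold=4)
p_at_2_b = precision_at_k(true_ratings, model_b, k=2, relevant_threshold=4)

print(f"model_a: RMSE={rmse_a:.2f}, precision@2={p_at_2_a:.2f}")
print(f"model_b: RMSE={rmse_b:.2f}, precision@2={p_at_2_b:.2f}")
```

In this toy setup, `model_a` wins on RMSE while `model_b` wins on precision at 2, so a model selected by the rating loss would not be the one that produces the better ranked list.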
Caution: Skewed Datasets—Imbalanced Classes, Outliers, and Rare Data
Though it’s easy to write down the formula of a metric, it isn’t as easy to interpret the actual metric measured on real data. Book knowledge is not a substitute for working experience. Both are essential for successful applications of machine learning.
Always consider what the data looks like and how it affects the metric. In particular, always keep an eye out for data skew. By data skew, I mean situations where one “kind” of data is much rarer than others, or where there are very large or very small outliers that could drastically change the metric.
Earlier, we talked about how imbalanced classes could be a caveat in measuring per-class accuracy. This is an example of data skew—one of the classes is much rarer compared to the other. It is problematic not just for per-class accuracy, but for all of the metrics that give equal weight to each data point. Suppose the positive class is only a tiny portion of the observed data, say 1%. This is a common situation for real-world datasets such as click-through rates for ads, user-item interaction data for recommenders, malware detection, etc.
This means that a “dumb” baseline classifier that always classifies incoming data as negative would achieve 99% accuracy. A good classifier should therefore have an accuracy noticeably higher than 99%, which leaves very little headroom. Similarly, if you are looking at the ROC curve, only the top left corner of the curve would be important, so the AUC would need to be very high in order to beat the baseline.
Any metric that gives equal weight to each instance of a class has a hard time handling imbalanced classes, because by definition, the metric will be heavily dependent on the class(es) with the most data.
Furthermore, imbalanced classes are problematic not only at the evaluation stage, but even more so when training the model. If class imbalance is not properly dealt with, the resulting model may not know how to predict the rare classes at all.
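The baseline effect described above is easy to check numerically. The following is a minimal sketch on synthetic labels, assuming 1,000 data points of which 1% are positive; the helper names `accuracy` and `per_class_accuracy` are illustrative, and per-class accuracy is taken here to mean the unweighted average of each class's own accuracy.

```python
def accuracy(y_true, y_pred):
    """Fraction of all data points classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_accuracy(y_true, y_pred):
    """Average of the accuracy computed separately within each class."""
    per_class = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        per_class.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(per_class) / len(per_class)

# Synthetic labels (assumption): 10 positives and 990 negatives, i.e. 1% positive.
y_true = [1] * 10 + [0] * 990
# The "dumb" baseline: always predict the negative class.
y_pred = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)
baseline_per_class = per_class_accuracy(y_true, y_pred)

print(f"accuracy:           {baseline_accuracy:.2f}")
print(f"per-class accuracy: {baseline_per_class:.2f}")
```

Overall accuracy rewards the baseline for the abundant negative class, while the per-class average exposes that it never identifies a single positive.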
Data skew can also create problems for personalized recommenders.
Real-world user-item interaction data often contains items that are rated by very few users, as well as many users who rate very few items. Rare users and rare items are problematic for the recommender, both during training and evaluation. When too little training data is available for them, a recommender model won’t be able to learn the user’s preferences, or which items are similar to a rare item. Rare users and items in the evaluation data would lead to a very low estimate of the recommender’s performance, which adds to the problem of having a badly trained recommender.
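One simple way to gauge this kind of skew before training is to count how many interactions each user and each item has. The sketch below uses a tiny made-up interaction list; the identifiers, the helper `rare_fraction`, and the cutoff of two interactions are assumptions chosen only for illustration.

```python
from collections import Counter

def rare_fraction(ids, min_count):
    """Fraction of distinct ids that appear fewer than min_count times."""
    counts = Counter(ids)
    rare = [key for key, n in counts.items() if n < min_count]
    return len(rare) / len(counts)

# Made-up (user, item) interactions (assumption for illustration).
interactions = [
    ("u1", "i1"), ("u1", "i2"), ("u1", "i3"),
    ("u2", "i1"),
    ("u3", "i1"),
    ("u4", "i4"),
]

users = [user for user, _ in interactions]
items = [item for _, item in interactions]

rare_users = rare_fraction(users, min_count=2)
rare_items = rare_fraction(items, min_count=2)

print(f"users with fewer than 2 interactions: {rare_users:.0%}")
print(f"items with fewer than 2 interactions: {rare_items:.0%}")
```

If a large share of users or items falls below whatever cutoff makes sense for the application, both the training results and the evaluation numbers deserve extra scrutiny.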
Another kind of data skew is outliers. Large outliers can result in problems for a regressor. For instance, in the Million Song Dataset, a user’s score for a song is taken to be the number of times he or she has listened to the particular song. The highest score is greater than 16,000! This means that any error made by the regressor on this data point would overshadow all other errors. The effect of large outliers during evaluation can be diminished through robust metrics such as quantiles of errors. Yet, this would not solve the problem for the training phase.
Effective solutions for large outliers would probably involve careful data cleaning, and possibly reformulating the task so that it’s not sensitive to large outliers.
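The effect of a single large outlier on an evaluation metric can be sketched as follows. The play counts and predictions are made up, with one value of 16,000 standing in for an extreme listener; they are not taken from the Million Song Dataset itself. The sketch compares RMSE with the median absolute error, one example of a quantile-based metric.

```python
import math
import statistics

def rmse(errors):
    """Root mean squared error, given a list of per-point errors."""
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# Made-up play counts with one extreme outlier (assumption for illustration).
true_counts = [1, 2, 3, 5, 8, 16000]
predicted = [2, 2, 4, 4, 9, 10]

abs_errors = [abs(t - p) for t, p in zip(true_counts, predicted)]

# RMSE is dominated by the single outlier.
rmse_all = rmse(abs_errors)
# RMSE on the remaining points, for comparison.
rmse_without_outlier = rmse(abs_errors[:-1])
# The median (50th percentile) of absolute errors barely notices the outlier.
median_abs_error = statistics.median(abs_errors)

print(f"RMSE (all points):       {rmse_all:.1f}")
print(f"RMSE (outlier removed):  {rmse_without_outlier:.2f}")
print(f"median absolute error:   {median_abs_error:.1f}")
```

The robust metric gives a fairer picture at evaluation time, but as noted above it does nothing for training: a model fit with squared loss on the same data would still be pulled toward the outlier.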