This blog post is the continuation of my previous articles part 1 and part 2.
The average per-class accuracy is a variation of accuracy. It is defined as the average of the accuracy for each individual class. Accuracy is an example of what is known as a micro-average, while average per-class accuracy is a macro-average.
In general, when there are different numbers of examples per class, the average per-class accuracy will be different from the accuracy. (Exercise for the curious reader: Try proving this mathematically!)
This matters when the classes are imbalanced, i.e., when there are many more examples of one class than of the other. In that case, accuracy gives a distorted picture, because the class with more examples dominates the statistic. You should therefore look at two things: the average per-class accuracy and the individual per-class accuracies.
Per-class accuracy is not without its own limitations. For instance, if there are very few examples of one class, then test statistics for that class will have a large variance, which means that its accuracy estimate will not be as reliable as those of the other classes. Taking the average over all the classes hides how much confidence each individual class estimate deserves.
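To make the difference concrete, here is a minimal sketch in plain Python. The function names and the toy data (90 examples of class 0, 10 of class 1, and a classifier that always predicts class 0) are made up for illustration.

```python
from collections import defaultdict

def accuracy(y_true, y_pred):
    """Micro-average: fraction of all examples predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def per_class_accuracy(y_true, y_pred):
    """Accuracy computed separately over the examples of each true class."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += (t == p)
    return {c: hits[c] / totals[c] for c in totals}

def average_per_class_accuracy(y_true, y_pred):
    """Macro-average: unweighted mean of the per-class accuracies."""
    per_class = per_class_accuracy(y_true, y_pred)
    return sum(per_class.values()) / len(per_class)

# Hypothetical imbalanced data set: 90 examples of class 0, 10 of class 1.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100  # a classifier that always predicts the majority class

print(accuracy(y_true, y_pred))                    # 0.9
print(per_class_accuracy(y_true, y_pred))          # {0: 1.0, 1: 0.0}
print(average_per_class_accuracy(y_true, y_pred))  # 0.5
```

The plain accuracy looks respectable, while the per-class view shows that the minority class is never predicted correctly.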
Log-loss, or logarithmic loss, gets into the finer details of a classifier. In particular, if the raw output of the classifier is a numeric probability instead of a class label of 0 or 1, then log-loss can be used. The probability can be understood as a measure of confidence. If the true label is 0 but the classifier thinks it belongs to class 1 with probability 0.51, then even though the classifier would be making a mistake, it’s a near miss because the probability is very close to the decision boundary of 0.5. Log-loss is a “soft” measurement of accuracy that incorporates this idea of probabilistic confidence.
Mathematically, log-loss for a binary classifier looks like this:
$$\text{log-loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]$$
Formulas like the one above are incomprehensible without years of grueling, inhuman training. Let’s unpack it. $p_i$ is the probability that the $i$-th data point belongs to class 1, as judged by the classifier. $y_i$ is the true label and is either 0 or 1. Since $y_i$ is either 0 or 1, the formula essentially “selects” either the left or the right summand. The minimum is 0, which happens when the prediction and the true label match up. (We follow the convention that defines 0 log 0 = 0.)
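As a rough sketch, binary log-loss can be computed in a few lines of Python. The function name `log_loss` and the clipping constant `eps` are assumptions made here. Clipping probabilities away from exactly 0 and 1 is a common practical workaround to avoid taking log(0), and it is not the same as the 0 log 0 = 0 convention.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary log-loss; p_pred holds predicted probabilities of class 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Assumption: clip p into [eps, 1 - eps] so that log() is always defined.
        p = min(max(p, eps), 1 - eps)
        # y is 0 or 1, so exactly one of the two terms is non-zero.
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# True label is 0 in both cases; only the classifier's confidence differs.
near_miss = log_loss([0], [0.51])       # wrong, but barely: about 0.71
confident_miss = log_loss([0], [0.99])  # wrong and very sure: about 4.61
print(near_miss, confident_miss)
```

Both predictions count as the same mistake under plain accuracy, but log-loss penalizes the confident one far more heavily.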
The beautiful thing about this definition is that it is closely tied to information theory: log-loss is the cross entropy between the distribution of the true labels and the predictions, and it is very intimately related to what’s known as the relative entropy, or Kullback–Leibler divergence. Entropy measures the unpredictability of something.
Cross entropy incorporates the entropy of the true distribution, plus the extra unpredictability when one assumes a different distribution than the true distribution. So, log-loss is an information-theoretic measure to determine the “extra noise” that comes from using a predictor as opposed to the true labels. By minimizing the cross entropy, we maximize the accuracy of the classifier.
Precision and recall treat all retrieved items equally; a relevant item in position k counts just as much as a relevant item in position 1. But this is not usually how people think. When we look at the results from a search engine, the top few answers matter much more than answers that are lower down on the list.
NDCG tries to take this behavior into account. NDCG stands for ‘normalized discounted cumulative gain’. There are three closely related metrics here: cumulative gain (CG), discounted cumulative gain (DCG), and finally, normalized discounted cumulative gain.
Cumulative gain sums up the relevance of the top k items. Discounted cumulative gain discounts items that are further down the list. Normalized discounted cumulative gain, true to its name, is a normalized version of discounted cumulative gain. It divides the DCG by the perfect DCG score, so that the normalized score always lies between 0.0 and 1.0.
DCG and NDCG are important metrics in information retrieval and in any application where the positioning of the returned items is important.
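The three quantities can be sketched as follows. The text does not specify the discount function, so this example assumes one common choice: the relevance of the item at rank r is divided by log2(r + 1). It also assumes that the ideal ordering is obtained by sorting the same list of relevance scores, which only holds when the list contains every candidate item. The function names are hypothetical.

```python
import math

def cumulative_gain(relevances, k):
    """CG: sum of the relevance scores of the top k items; order is ignored."""
    return sum(relevances[:k])

def dcg(relevances, k):
    """DCG with an assumed log2(rank + 1) discount, rank starting at 1."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )

def ndcg(relevances, k):
    """NDCG: DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Made-up relevance scores, listed in the order the system returned the items.
good_ranking = [3, 2, 1, 0]  # most relevant item first
bad_ranking = [0, 1, 2, 3]   # most relevant item last

print(cumulative_gain(good_ranking, 4), cumulative_gain(bad_ranking, 4))
print(ndcg(good_ranking, 4), ndcg(bad_ranking, 4))
```

Cumulative gain is identical for the two rankings, because it ignores position. NDCG is 1.0 for the ideal ordering and lower for the reversed one.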
In a regression task, the model learns to predict numeric scores. For example, when we try to predict the price of a stock on future days based on past price history and other information about the company and the market, we can treat it as a regression task. Another example is personalized recommenders that try to explicitly predict a user’s rating for an item. (A recommender can alternatively optimize for ranking.)
For regression tasks, the most commonly used metric is RMSE (root-mean-square error), also known as RMSD (root-mean-square deviation). This is defined as the square root of the average squared distance between the actual score and the predicted score:
$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
Here, $y_i$ denotes the true score for the $i$-th data point, and $\hat{y}_i$ denotes the predicted value. One intuitive way to understand this formula is that it is the Euclidean distance between the vector of the true scores and the vector of the predicted scores, divided by $\sqrt{n}$, where $n$ is the number of data points.
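Here is a minimal sketch of RMSE in plain Python, using made-up scores. The function name `rmse` is an assumption, and `math.dist` requires Python 3.8 or later.

```python
import math

def rmse(y_true, y_pred):
    """Square root of the mean squared difference between true and predicted."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

# Hypothetical true and predicted scores.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

# Squared errors are 1, 0, 4, 1, so the mean squared error is 1.5.
error = rmse(y_true, y_pred)
print(error)

# Equivalent view: Euclidean distance between the two vectors over sqrt(n).
euclidean = math.dist(y_true, y_pred)
print(euclidean / math.sqrt(len(y_true)))
```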
Quantiles of Errors
RMSE may be the most common metric, but it has some problems. Most crucially, because it is an average, it is sensitive to large outliers. If the regressor performs really badly on a single data point, then the average error could be huge. In statistical terms, we say that the mean is not robust (to large outliers).
Quantiles (or percentiles), on the other hand, are much more robust. To see why this is the case, let’s take a look at the median (the 50th percentile), which is the element of a set that is larger than half of the set, and smaller than the other half. If the largest element of a set changes from 1 to 100, the mean will shift, but the median will not be affected at all.
One thing that is certain with real data is that there will always be “outliers.” The model will probably not perform very well on them. So it’s important to look at robust estimators of performance that aren’t affected by large outliers. It is also useful to look at the median absolute percentage error (MAPE):
$$\text{MAPE} = \operatorname{median}\left( \left| \frac{y_i - \hat{y}_i}{y_i} \right| \right)$$
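A quick way to see the difference is to change the largest value in a small list of numbers and compare the two statistics. The numbers below are made up for illustration.

```python
import statistics

# Hypothetical values; the largest one is 1.0.
values = [0.1, 0.3, 0.5, 0.8, 1.0]
# The same values, with the largest one changed from 1.0 to 100.
with_outlier = [0.1, 0.3, 0.5, 0.8, 100.0]

print(statistics.mean(values), statistics.mean(with_outlier))
print(statistics.median(values), statistics.median(with_outlier))
```

The mean jumps by a large amount, while the median stays at 0.5 in both cases.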
It gives us a relative measure of the typical error. We could also compute the 90th percentile of the absolute percent error, which would give an indication of an “almost worst case” behavior.
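Both quantities can be sketched in plain Python. The helper names are hypothetical, the data is made up, and the percentile uses the simple nearest-rank definition, which is only one of several conventions in use. The sketch also assumes that no true value is zero, since the percent error divides by the true value.

```python
import math
import statistics

def absolute_percent_errors(y_true, y_pred):
    """|(y - y_hat) / y| for each data point; assumes no true value is zero."""
    return [abs((t - p) / t) for t, p in zip(y_true, y_pred)]

def percentile(values, q):
    """Nearest-rank percentile for 0 < q <= 100 (an assumed convention)."""
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical data: every true value is 100, and one prediction is far off.
y_true = [100.0] * 10
y_pred = [101.0, 98.0, 105.0, 95.0, 110.0, 100.0, 103.0, 97.0, 150.0, 102.0]

errors = absolute_percent_errors(y_true, y_pred)
median_ape = statistics.median(errors)  # typical relative error
p90_ape = percentile(errors, 90)        # "almost worst case" relative error
print(median_ape, p90_ape)
```

Here the median error is about 3% and the 90th percentile is about 10%. Neither figure is pulled up by the single prediction that is off by 50%.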
“Almost Correct” Predictions
Perhaps the easiest metric to interpret is the percent of estimates that differ from the true value by no more than X%. The choice of X depends on the nature of the problem. For example, the percent of estimates within 10% of the true values would be computed as the percent of data points for which $|(y_i - \hat{y}_i)/y_i| < 0.1$. This gives us a notion of the precision of the regression estimate.
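A minimal sketch of this metric follows. The function name `fraction_within` and the data are made up, and the comparison is strict, following the `< 0.1` in the formula above. True values are assumed to be non-zero.

```python
def fraction_within(y_true, y_pred, tolerance=0.1):
    """Fraction of predictions whose relative error is below the tolerance."""
    hits = sum(
        abs((t - p) / t) < tolerance  # assumes t is never zero
        for t, p in zip(y_true, y_pred)
    )
    return hits / len(y_true)

# Hypothetical data: every true value is 100.
y_true = [100.0] * 10
y_pred = [101.0, 98.0, 105.0, 95.0, 110.0, 100.0, 103.0, 97.0, 150.0, 102.0]

# The predictions 110 and 150 miss by 10% and 50%, so 8 of 10 qualify.
print(fraction_within(y_true, y_pred, tolerance=0.1))  # 0.8
```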