Not Hot Dog: Exploring the Accuracy Paradox

I recently had my examinations, and like any dutiful Computer Science student, I spent more time procrastinating than studying. Standardized vectors were swapped for Silicon Valley (the television show), and Edsger Dijkstra for Erlich Bachman. One story arc particularly stood out – that of the ‘SeeFood’ app. For the more responsible amongst us: one of the characters creates an application that classifies whether a given image contains a hot dog or not. The app was touted to have “outperformed the image recognition software in the industry”.

The question arises: what metric was used to measure this performance? Intuitively, we assume it to be accuracy – that the application labelled more images correctly than other algorithms. But as Tim Anglade, the developer who actually built the ‘Not Hot Dog’ application for the show, mentions in his article, this is not always the case:

If you train your algorithm on 3 hotdog images and 97 non-hotdog images, and it recognizes 0% of the former but 100% of the latter, it will still score 97% accuracy by default!

Clearly, such a model would be highly accurate, yet have no predictive power at all! This problem is known as the accuracy paradox, and we will explore ways to combat it in this blog post.

What is the accuracy paradox?

The accuracy paradox states that “A predictive model with low accuracy can have higher predictive power than one with high accuracy”. Under the classification system devised by the American logician W.V. Quine, the accuracy paradox falls under the branch of veridical paradoxes – statements that seem absurd, but nevertheless have a true conclusion. In contrast, Zeno’s paradoxes are falsidical – their conclusions are false. (A third branch, antinomies, consists of paradoxes that are neither veridical nor falsidical. Here’s an excellent explanation of them.)

It’s easiest to understand this paradox using some data: suppose we run a lending agency and, swept up by the recent fervour for machine learning, decide to train a classifier on our data to predict whether a person with a given set of attributes is likely to miss a payment and go into arrears, or to pay on time.

Table 1 Sample debt payment data

In this example, {0: made payment; 1: missed payment}. As would be expected in a real scenario, relatively few people default on their payments in a given time period.

Table 2 shows the predictions of two classifiers – one that predicts 0 for every record (the zero classifier) and one that predicts 1 for every record (the one classifier). While the zero classifier is 40 percentage points more accurate than the one classifier, it fails to predict a single instance of a person defaulting on their loan. On the other hand, the one classifier satisfies our goal of catching every default, but does not help us in the decision-making process (“Stellar credit score – should we approve the loan? – No.”).

Table 2 Comparison of classifiers on the dataset
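To make the comparison concrete, here is a minimal sketch in Python. The records are hypothetical stand-ins for Table 1 – ten people, three of whom missed a payment – chosen to be consistent with the figures quoted above.

```python
# Hypothetical stand-in for Table 1: ten people, three of whom missed a payment.
# 0 = made payment, 1 = missed payment.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

zero_preds = [0] * len(y_true)  # zero classifier: always predicts "made payment"
one_preds = [1] * len(y_true)   # one classifier: always predicts "missed payment"

def accuracy(truth, preds):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(truth, preds)) / len(truth)

print(accuracy(y_true, zero_preds))  # 0.7 -- more accurate, yet catches no defaults
print(accuracy(y_true, one_preds))   # 0.3 -- less accurate, yet catches every default
```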

Resolving the Accuracy Paradox

In order to resolve any paradox, we need to identify its source. In our situation, the problem is caused by using accuracy as the metric, so that is what we need to change. (We could instead try to redefine what we mean by predictive power, but that would not help us here.) How do we go about choosing a metric that is actually useful?

Looking back at Table 2, we see that the main problem with accuracy is that it only captures how many predictions were correct – it carries no information about where the incorrect predictions came from. It would have been helpful to know the number of ones that the zero classifier tagged as zero – the false negatives – and, similarly, the number of zeros that the one classifier tagged as one – the false positives. Perhaps ironically, we choose to define the event of a person going into arrears as ‘positive’, as our aim is to build a system that catches more of such events.

We introduce two new metrics:
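In terms of true positives (TP), false positives (FP) and false negatives (FN), they are defined as follows:

\[
\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}
\]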

Precision tells us what proportion of all the positive predictions was correct, and recall tells us what proportion of true instances have been correctly identified. Using these metrics, we see that our zero classifier has a precision and a recall of 0, while our one classifier has a precision of 0.3 and a recall of 1. We can see that using these metrics gives us a much better understanding of the workings of the classifier as compared to the metric accuracy.
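As a quick sanity check, the same numbers can be reproduced with scikit-learn on the hypothetical ten-record labelling from the accuracy sketch above:

```python
from sklearn.metrics import precision_score, recall_score

# Same hypothetical labels as before: 0 = made payment, 1 = missed payment.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
zero_preds = [0] * 10
one_preds = [1] * 10

# The zero classifier never flags a default, so it catches none of them.
print(precision_score(y_true, zero_preds, zero_division=0))  # 0.0
print(recall_score(y_true, zero_preds))                      # 0.0

# The one classifier flags everyone: it catches every default,
# at the cost of many false alarms.
print(precision_score(y_true, one_preds))  # 0.3
print(recall_score(y_true, one_preds))     # 1.0
```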

Since juggling two values can be cumbersome, a variety of metrics combine them into one. The most common in information retrieval and machine learning is the F1-score: the harmonic mean of precision and recall.
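Written out:

\[
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\]

Being a harmonic mean, the F1-score is dragged down by whichever of precision and recall is smaller, so a classifier cannot score well by excelling at only one of them.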

It is important to note that not all applications want to take the plain harmonic mean of the two; many would rather weight one of precision and recall over the other. For example, we would prefer a classifier with higher recall than precision, as it is more important that we identify most of the people who will go into arrears. In such scenarios, we can use a generalised version of the above formula:
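This is the F-beta score, where the parameter β controls the trade-off: β > 1 weights recall more heavily (an F2-score would suit our arrears example), β < 1 weights precision more heavily, and β = 1 recovers the F1-score.

\[
F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}
\]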

However, I find there to be two problems with the F-score. Firstly, it doesn’t take the true negatives into consideration – they don’t form part of the definition of either precision or recall, and so aren’t reflected in the F-score. Secondly, it’s quite hard to think intuitively in F-scores, or to visualise them. In such scenarios, it is often helpful to use a receiver operating characteristic (ROC) graph, which plots recall (the true positive rate) against the false positive rate (# false positives / (# false positives + # true negatives)).

The dotted diagonal signifies a random classifier. The space above the diagonal consists of classifiers that are better than random guessing, while the space below consists of classifiers that are worse. The perfect classifier sits at the point (0,1), where the numbers of false negatives and false positives are both zero, so classifiers closer to (0,1) are better than those farther away. But how can we differentiate between two points that are equidistant from (0,1) on the ROC graph? Suppose we had classifiers X and X' at points (0.2,0.6) and (0.4,0.8) respectively. Both lie above the line of random guessing and are equidistant from (0,1). In this case, we can say that classifier X is more conservative than X': it makes positive classifications only on strong evidence, and as a result it may have a lower true positive rate than a more liberal classifier such as X'. We can see this on a graph:
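Here is a minimal matplotlib sketch of these two hypothetical operating points against the random-guess diagonal:

```python
import matplotlib.pyplot as plt

# The two hypothetical operating points from the discussion above.
points = {"X (conservative)": (0.2, 0.6), "X' (liberal)": (0.4, 0.8)}

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="random classifier")
for name, (fpr, tpr) in points.items():
    ax.scatter(fpr, tpr, label=name)

ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate (recall)")
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.legend(loc="lower right")
plt.show()
```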

Now suppose we have run our classifiers, plotted them on the ROC graph, and found the one closest to (0,1). It is possible to go one step further. So far we have worked in a development environment, and it’s time to ship this model out to production. Earlier, we made assumptions about whether false positives or false negatives were more harmful to the application. But now that we can gather data such as the marginal costs of false positive and false negative classifications, we can plot a utility curve to find the classifier that maximises profit. An in-depth explanation of this technique can be found here.
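As a rough sketch of the idea, suppose (purely hypothetically) that turning away a good customer (a false positive) costs us 100, while writing off a missed default (a false negative) costs us 1,000. The expected cost of operating each classifier at its ROC point can then be compared directly:

```python
# Hypothetical per-case costs: these numbers are assumptions for illustration only.
COST_FP = 100    # good customer turned away
COST_FN = 1_000  # defaulter we failed to flag

def expected_cost(fpr, tpr, positive_rate, n_customers=10_000):
    """Expected total cost of running a classifier at a given ROC operating point."""
    n_pos = n_customers * positive_rate  # expected number of defaulters
    n_neg = n_customers - n_pos          # expected number of reliable payers
    false_negatives = n_pos * (1 - tpr)  # defaulters we miss
    false_positives = n_neg * fpr        # payers we wrongly flag
    return false_negatives * COST_FN + false_positives * COST_FP

# Compare the two hypothetical operating points from the ROC discussion,
# assuming 30% of applicants default (as in the toy dataset above).
for name, (fpr, tpr) in {"X": (0.2, 0.6), "X'": (0.4, 0.8)}.items():
    print(name, expected_cost(fpr, tpr, positive_rate=0.3))
```

Under these assumed costs the more liberal X' comes out cheaper; flip the cost ratio and the conservative X wins. Making that trade-off explicit is exactly what the utility-curve analysis does.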

NStack Addendum

Have you built a model with a crazy high ROC score and are you wondering how to productionise it? NStack provides simple tools that let you deploy your models as APIs, and share them throughout your organisation as reusable modules. If you're interested in learning more, check us out at nstack.com or tweet @nstackcom with any questions!

Utsav Popat is studying Mathematics and Computer Science at the University of Oxford. He enjoys learning about machine learning as it falls in the intersection of these two subjects, and because saying so will help him get a job. He can be reached at utsav.popat@balliol.ox.ac.uk
