He or She? Or: The basics of (binary) classifier evaluation

Of all the amazing scientific discoveries of the 20th century, the most astonishing has to be that “men are from Mars, [and] women are from Venus”. (Not that the differences weren’t obvious pre-20th century, but it’s always good to have something in writing.)

If indeed the genders do originate from different planets, then surely the ways in which they use language must be very different as well. In fact, the differences should be so gleamingly obvious that even a computer can tell them apart, right?

So we’re building an author gender classifier…

In natural language processing, there is a task called author profiling. One of its subtasks, author gender identification, deals with detecting the gender of a text’s author. Please note that for the sake of didactic simplicity (and not an old-fashioned view of gender identity), I’ll confine myself to the two traditional genders.

In supervised machine learning, a classifier is a function that takes some object or element and assigns it to one of a set of pre-defined classes. As it turns out, the task of author gender identification is a nice example of a classification problem. More specifically, we are dealing with binary classification since we assume only two possible classes.

By default, these classes are labelled as positive (aka “yes”) and negative (aka “no”). Needless to say, it is perfectly fine to adapt the naming of the two possible outcomes. In our case, female and male (aka “not female”) seem like plausible choices.

It all starts with the data

We are about to train supervised classifiers and so we first need to obtain a good amount of training data. Understandably, I wasn’t too excited about manually collecting thousands of training examples. Therefore, I went ahead and wrote a Scrapy spider to automatically collect articles from nytimes.com on a per-author basis.

If you are interested in the spider code, you’re welcome to check it out. Our industrious spider managed to collect the titles and summaries of more than 210,000 articles as well as their authors’ genders. All in all, there were about 2.5 times more male articles than female ones. This is a great real-world example of a problem known as class imbalance (or data imbalance).

Meet the stars of the show

With the data kindly collected by the NewYorkTimesSpider, we’ll train two supervised classifiers and compare their performance. To this purpose, we’ll make use of scikit-learn, one of the most popular Python frameworks for machine learning. We’ll be training two different classification models: Naive Bayes (NB) and Gradient Boosting (GB).

NB is a classic and historically quite successful model in all kinds of real-world domains including text analysis & classification. The GB model is a more recent development that has achieved considerable success on problems posed on kaggle.com.

This article will not delve into the algorithmic details of these two models. Rather, we’ll assume a black box view and focus on their evaluation. The same goes for the topic of feature extraction. For instructional purposes, we’ll go with a very basic feature set based on the tried-and-tested bag-of-words representation. scikit-learn comes with an efficient implementation which spares us having to reinvent the wheel.
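To make the setup concrete, here is a minimal sketch of the bag-of-words pipeline for both models, assuming scikit-learn; the example texts and labels below are made-up stand-ins for the scraped articles:

```python
# Sketch of the feature extraction + model setup described above.
# The tiny corpus here is invented purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline

texts = [
    "markets rally on strong earnings",
    "new novel explores family ties",
    "team wins championship game",
    "recipe for a quick weeknight dinner",
]
labels = ["male", "female", "male", "female"]

# The same bag-of-words features feed both models; only the estimator differs
nb_model = make_pipeline(CountVectorizer(), MultinomialNB())
gb_model = make_pipeline(CountVectorizer(), GradientBoostingClassifier())

nb_model.fit(texts, labels)
gb_model.fit(texts, labels)
print(nb_model.predict(["stocks surge after earnings report"]))
```

In the real notebook, `texts` and `labels` come from the scraped NYT data instead of a hard-coded list.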

Evaluation metrics 101

Unfortunately, no classifier is perfect and so each decision (positive vs. negative or female vs. male) can either be true (correct) or false (incorrect). This leaves us with a total of 2 × 2 = 4 boxes we can put each classifier decision (aka prediction) into:

| Predicted \ Actual | Positive            | Negative            |
|--------------------|---------------------|---------------------|
| **Positive**       | True positive (TP)  | False positive (FP) |
| **Negative**       | False negative (FN) | True negative (TN)  |

As presented in the table, true positives are positive examples correctly classified as positive. On the other hand, false negatives are positive examples misclassified as negative. The same relationship holds for true negatives and false positives. In the area of machine learning, a 2-by-2 table structure such as the above is commonly referred to as a confusion matrix.
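As a quick sanity check, the four cells can be computed with scikit-learn’s `confusion_matrix`; the label vectors below are invented for illustration:

```python
# Counting the four confusion-matrix cells with scikit-learn.
# The label vectors are made up for this example.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # 1 = positive, 0 = negative
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# For labels [0, 1], sklearn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)  # → 3 1 1 3
```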

A confusion matrix can serve as the basis for calculating a number of metrics. A metric is a method of reducing the confusion matrix to a single (scalar) value. This reduction is very important because it gives us one value to focus on when improving our classifiers. If we didn’t have this one value, we could endlessly argue back and forth about whether this or that confusion matrix represents a better result.

The below table summarizes some of the most fundamental & widely used metrics for classifier evaluation. Note that although all of them result in values between 0 and 1, I will describe them in terms of percentages for the sake of intuition. Also, some metrics have different names in different fields and contexts. I will highlight the names most commonly used in machine learning in bold.

| Metric | Formula | Description / Intuition |
|--------|---------|-------------------------|
| **Accuracy** | \frac{TP + TN}{TP + TN + FP + FN} | What percentage of elements were predicted correctly? How good is the classifier at finding both positive & negative elements? |
| True positive rate (aka **recall**, sensitivity) | \frac{TP}{TP + FN} | What percentage of positive elements were predicted correctly? How good is the classifier at finding positive elements? |
| False positive rate | \frac{FP}{FP + TN} | What percentage of negative elements were incorrectly predicted as positive? How prone is the classifier to false alarms? |
| True negative rate (aka specificity) | \frac{TN}{TN + FP} | What percentage of negative elements were predicted correctly? How good is the classifier at finding negative elements? |
| False negative rate | \frac{FN}{FN + TP} | What percentage of positive elements were incorrectly predicted as negative? How many positive elements does the classifier miss? |
| **Precision** (aka positive predictive value) | \frac{TP}{TP + FP} | What percentage of elements predicted as positive were actually positive? |
| **F1 score** | \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} | Harmonic mean of precision and recall, weighting both equally. How good is the classifier in terms of both precision & recall? |

Now that we have a basic understanding of the fundamental metrics for evaluating classifiers, it’s time to put the theory into practice (i.e. write some code). Luckily for us, scikit-learn comes with many pre-implemented metrics. In addition to the metrics, scikit-learn also provides us with a number of pre-implemented cross-validation schemes.
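For instance, the metrics from the table map directly onto scikit-learn functions. A small illustration with made-up label vectors (for which TP = 3, FP = 1, FN = 1, TN = 3):

```python
# Computing the table's metrics with scikit-learn on invented labels.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 1, 0, 0]

print(accuracy_score(y_true, y_pred))   # (TP+TN)/(TP+TN+FP+FN) = 6/8 = 0.75
print(recall_score(y_true, y_pred))     # TP/(TP+FN) = 3/4 = 0.75
print(precision_score(y_true, y_pred))  # TP/(TP+FP) = 3/4 = 0.75
print(f1_score(y_true, y_pred))         # harmonic mean of the two = 0.75
```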

One of the primary motivations for cross-validating your classifiers is to reduce the variance between multiple runs of the same evaluation setup. This holds especially true for situations where only a limited amount of data is available in the first place. In such cases, splitting your data into multiple datasets (a training and a test dataset) will reduce the number of training samples even further.

Oftentimes, this reduction will lead to significant performance differences between two or more evaluation runs, caused by the particular random choices of training and test sets. By partitioning the dataset and running the evaluation multiple times, we can average the results and thereby arrive at a more reliable overall evaluation result.
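A hedged sketch of such a cross-validated evaluation, assuming scikit-learn; the tiny corpus below stands in for the scraped articles:

```python
# Cross-validated evaluation sketch; the corpus is invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["budget vote delayed", "garden party recap", "trade talks resume",
         "book club picks", "city council meets", "fashion week opens"]
labels = ["male", "female", "male", "female", "male", "female"]

model = make_pipeline(CountVectorizer(), MultinomialNB())

# Stratified splits keep the male/female ratio stable in every fold,
# which matters with imbalanced classes like ours
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(model, texts, labels, cv=cv, scoring="f1_macro")
print(scores.mean())
```

Averaging `scores` over the folds is exactly the variance-reduction step described above.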

The importance of a baseline

The evaluation code is available as a Jupyter notebook. Besides a data loading function and the two classifiers to be tested, the notebook also contains the definition of a baseline for our evaluation (HeOrSheBaselineClassifier). A baseline is a simple classifier that gives us a reference point against which to compare our actual models.

In many cases, choosing a baseline is a quite straightforward process. For example, in our domain of newspaper articles, about 71.5% of articles were written by men. Therefore, it makes sense to define a baseline classifier that unconditionally predicts an article to have a male author. If a classifier can’t deliver a better performance than this super simple baseline classifier, then obviously it can’t be any good.

To summarize, a baseline provides us with a performance minimum that we should be able to exceed in any case. scikit-learn accelerates the development of baseline classifiers by providing the DummyClassifier class that the HeOrSheBaselineClassifier inherits from.
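As a sketch of how such a majority-class baseline might look (the data here is hypothetical, with roughly the 71.5% male share from our corpus; the real HeOrSheBaselineClassifier lives in the notebook):

```python
# Majority-class baseline via scikit-learn's DummyClassifier.
# Features and labels are invented to mimic the ~70% male share.
from sklearn.dummy import DummyClassifier

X = [[0]] * 10                      # features are ignored by the dummy
y = ["male"] * 7 + ["female"] * 3

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

print(baseline.predict([[0], [0]]))  # always predicts "male"
print(baseline.score(X, y))          # accuracy equals the majority share: 0.7
```

Any real model that can’t beat this 0.7 accuracy on our data is not worth keeping.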

Finally, results

If we take a look at the Jupyter evaluation notebook, we can see that both classifiers significantly outperform our baseline in every metric. Though overall the GB classifier offers better performance, the NB model features a better precision score.

Obviously, the classifiers presented in the course of this post are only the tip of the iceberg. But even though we haven’t performed any optimization, the results are already significantly better than the expected minimum performance (i.e. the baseline). What this means is that there is a statistical difference in how often each gender uses specific words since word counts were the only features employed by the presented models.

The results of the above evaluation might serve as the basis for another post on where to go from here. Further resources on how to improve upon the existing performance can be found in the academic literature (e.g. Author gender identification from text).