Evaluating ML classification models

Recently I have been learning about ways to evaluate machine learning models, especially classifiers. This post is a note on some basic metrics, along with some thoughts on whether these metrics actually indicate good models and whether they should be pursued.

Basic metrics

The four basic counts: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

                     Predicted positive   Predicted negative
Actually positive            TP                   FN
Actually negative            FP                   TN

Precision: Of those predicted positive, how many are actually positive?
Recall: Of those actually positive samples, how many did I pick up correctly?
F1: The harmonic mean of precision and recall. In sklearn, there are several averaging schemes:

  • average="binary": Only report results for the class labeled as True
  • average="micro": Calculate precision and recall globally over all classes; micro F1 is then the harmonic mean of these two. Note that in single-label classification, micro F1 is simply equal to accuracy.
  • average="macro": Calculate precision, recall, and F1 on each class separately, then take the unweighted mean of the per-class F1 scores. (Note that sklearn averages the per-class F1 scores; it does not take the harmonic mean of a "macro p" and "macro r", which would give a different number.)
  • average="weighted": Like macro F1, but the per-class F1 scores are weighted by class support (the number of true samples in each class) when averaging.

Do high metrics imply a good model?

At first thought, yes. However, some questions need to be answered.
(1) For datasets where classes differ only in subtle features, is the model predicting based on superficial criteria instead of the things I actually want it to look for? For example, in a binary classification problem involving pictures of sky and sea: if you classify every image dominated by blue as sea and every image dominated by white as sky, you are not capturing the fine details. If you classify everything with wave patterns as sea, you are ignoring the effect of clouds. To address this, feature engineering can be regarded as a manual step that forces the classifier to consider some fine-grained aspects. The model should also have enough capacity to capture that amount of information; you definitely should not expect a single neuron to model an XOR gate in Euclidean space.
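That last claim is easy to check: a single linear neuron trained with the perceptron rule can never fit XOR, because the two classes are not linearly separable in the plane. A self-contained sketch in plain Python:

```python
# XOR truth table: no line separates {(0,0),(1,1)} from {(0,1),(1,0)},
# so a single linear threshold unit cannot reach 100% accuracy.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]

w = [0.0, 0.0]
b = 0.0
lr = 0.1
for _ in range(1000):  # perceptron update rule
    for (x1, x2), target in zip(X, y):
        pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
        err = target - pred
        w[0] += lr * err * x1
        w[1] += lr * err * x2
        b += lr * err

acc = sum((1 if w[0] * x1 + w[1] * x2 + b > 0 else 0) == t
          for (x1, x2), t in zip(X, y)) / 4
print(acc)  # stays below 1.0, whatever the learning rate or epoch count
```

A hidden layer (i.e. at least one extra neuron) is what makes the XOR boundary representable.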

(2) For imbalanced datasets, is the classifier just predicting based on the frequency of the classes? For a dataset containing 80% positive and 20% negative samples, blindly predicting "positive" already gives 80% accuracy. This is where inspecting the confusion matrix becomes important: if you see something suspicious there, the model is highly likely to be troublesome.
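The confusion matrix exposes exactly this failure mode. A sketch of the blind 80/20 baseline (assuming scikit-learn; the labels are synthetic):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# 80 positive and 20 negative samples; the "model" blindly predicts positive.
y_true = [1] * 80 + [0] * 20
y_pred = [1] * 100

print(accuracy_score(y_true, y_pred))   # 0.8 — looks respectable
print(confusion_matrix(y_true, y_pred))
# [[ 0 20]    <- every actual negative is misclassified
#  [ 0 80]]   <- all predictions land in the positive column
```

The empty first column (zero negative predictions) is the suspicious pattern to watch for.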

(3) High metrics might be produced by a leak of evaluation data into the training set.
I realized this while working on unsupervised learning algorithms. One example of leakage is selecting the features that have the highest correlation with the class labels across the whole dataset, test samples included. For another example: if you use the labels of a dataset to validate an unsupervised model, run the model multiple times with different random initializations, and save only the best-performing run, you are essentially tuning on the test labels. Leaking again.
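The feature-selection leak is easy to demonstrate: scan enough pure-noise features across the whole dataset and some will look strongly correlated with the labels by chance alone. A sketch using NumPy (all numbers here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 100, 2000

# Pure noise: the features carry no real information about the labels.
X = rng.standard_normal((n_samples, n_features))
y = np.array([0] * 50 + [1] * 50)

# "Leaky" selection: scan ALL samples (train and test alike) for the
# features most correlated with the labels.
corrs = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
print(max(corrs))  # noticeably large, even though every feature is noise
```

A feature picked this way would also look predictive on a test split drawn from the same data. The fix is to do the selection inside the training folds only.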

Should we aim at high metrics?

On Kaggle, we make lots of modifications to a model, submit, and keep the highest score evaluated on the test data. If this process is repeated many times, we are essentially using the test data to validate models so as to climb the leaderboard. This of course makes the scores look much better, but does it amount to overfitting, and does it hurt the model's performance when we want to use the trained model on real-world data?

Is there a way out?

This is where the validation set comes in. In an 80-10-10 split, 80% of the data is used for training, 10% is used for tuning, and the remaining 10% of test data is not touched until just before you write the results into the report.