3. Evaluation

When we began working with supervised machine learning methods, we evaluated a classifier's performance using its accuracy. Accuracy, as you might recall, is the fraction of samples that were classified correctly, that is, where the classifier's predicted label matched the correct (true) label.

Learning Objectives

  • Understand why accuracy only gives a partial picture of a classifier's performance.
  • Understand the motivation and definition of important evaluation metrics in machine learning
  • Learn how to use a variety of evaluation metrics to evaluate supervised machine learning models
  • Learn about choosing the right metric for selecting between models or for doing parameter tuning

Represent / Train / Evaluate / Refine Cycle

Representation (extract and select object features) ==> Train models (fit the estimator to the data) ==> Evaluation ==> Feature and model refinement ==> back to Representation

Evaluation

  • Different applications have very different goals
  • Accuracy is widely used, but many others are possible, e.g.:
    • User satisfaction (Web search)
    • Amount of revenue (e-commerce)
    • Increase in patient survival rates (medical)
  • It's very important to choose evaluation methods that match the goal of your application
  • Compute your selected evaluation metric for multiple different models
  • Then select the model with the 'best' value of the evaluation metric

Evaluation for Classification

Accuracy with Imbalanced Classes

  • Suppose you have two classes:
    • Relevant (R): the positive class
    • Not_Relevant (N): the negative class
  • Out of 1000 randomly selected items, on average:
    • One item is relevant and has an "R" label
    • The rest of the items (999 of them) are not relevant and labelled "N".
  • Recall that: $$Accuracy = \frac{\text{correct predictions}}{\text{total instances}}$$

First, let's consider a binary classification task where there are many instances of the negative class but only a few instances of the positive class. We might see this scenario in online search or recommender systems, where the system has to predict whether or not to display an advertisement, product suggestion, or query suggestion that is likely to be relevant given a user's query, their click history, and so on. Those relevant items are the positive examples, but there are many, many irrelevant items in the negative class that don't make sense to show the user. This is called an imbalanced class scenario.

  • You build a classifier to predict relevant items, and see that its accuracy on a test set is 99.9%
  • Wow! Amazingly good, right?
  • For comparison, suppose we had a "dummy" classifier that didn't look at the features at all, and always just blindly predicted the most frequent class (i.e. the negative "N" class)
  • Assuming a test set of 1000 instances, what would this dummy classifier's accuracy be?
  • Answer: $Accuracy_{DUMMY}=999/1000=99.9\%$
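As a quick check of that arithmetic, here is a minimal sketch (using a simulated label array with one relevant item out of 1000, an assumption matching the example above) computing the accuracy of a classifier that always predicts "N":

import numpy as np

# Simulated test labels: one relevant "R" item (1) and 999 not-relevant "N" items (0)
y_true = np.array([1] + [0] * 999)

# A dummy classifier that blindly predicts the majority (negative) class for every item
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()   # correct predictions / total instances
print(accuracy)                        # 0.999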

Preamble

In [1]:
%matplotlib notebook
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits #using the digits dataset, which has images of handwritten digits labeled with
                                         #ten classes, representing the digits 0 through 9

dataset = load_digits()
X, y = dataset.data, dataset.target

for class_name, class_count in zip(dataset.target_names, np.bincount(dataset.target)):
    print(class_name,class_count)
0 178
1 182
2 177
3 183
4 181
5 182
6 181
7 179
8 174
9 180
In [34]:
dataset
Out[34]:
{'DESCR': "Optical Recognition of Handwritten Digits Data Set\n===================================================\n\nNotes\n-----\nData Set Characteristics:\n    :Number of Instances: 5620\n    :Number of Attributes: 64\n    :Attribute Information: 8x8 image of integer pixels in the range 0..16.\n    :Missing Attribute Values: None\n    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)\n    :Date: July; 1998\n\nThis is a copy of the test set of the UCI ML hand-written digits datasets\nhttp://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits\n\nThe data set contains images of hand-written digits: 10 classes where\neach class refers to a digit.\n\nPreprocessing programs made available by NIST were used to extract\nnormalized bitmaps of handwritten digits from a preprinted form. From a\ntotal of 43 people, 30 contributed to the training set and different 13\nto the test set. 32x32 bitmaps are divided into nonoverlapping blocks of\n4x4 and the number of on pixels are counted in each block. This generates\nan input matrix of 8x8 where each element is an integer in the range\n0..16. This reduces dimensionality and gives invariance to small\ndistortions.\n\nFor info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.\nT. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.\nL. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,\n1994.\n\nReferences\n----------\n  - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their\n    Applications to Handwritten Digit Recognition, MSc Thesis, Institute of\n    Graduate Studies in Science and Engineering, Bogazici University.\n  - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.\n  - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.\n    Linear dimensionalityreduction using relevance weighted LDA. School of\n    Electrical and Electronic Engineering Nanyang Technological University.\n    2005.\n  - Claudio Gentile. A New Approximate Maximal Margin Classification\n    Algorithm. NIPS. 2000.\n",
 'data': array([[  0.,   0.,   5., ...,   0.,   0.,   0.],
        [  0.,   0.,   0., ...,  10.,   0.,   0.],
        [  0.,   0.,   0., ...,  16.,   9.,   0.],
        ..., 
        [  0.,   0.,   1., ...,   6.,   0.,   0.],
        [  0.,   0.,   2., ...,  12.,   0.,   0.],
        [  0.,   0.,  10., ...,  12.,   1.,   0.]]),
 'images': array([[[  0.,   0.,   5., ...,   1.,   0.,   0.],
         [  0.,   0.,  13., ...,  15.,   5.,   0.],
         [  0.,   3.,  15., ...,  11.,   8.,   0.],
         ..., 
         [  0.,   4.,  11., ...,  12.,   7.,   0.],
         [  0.,   2.,  14., ...,  12.,   0.,   0.],
         [  0.,   0.,   6., ...,   0.,   0.,   0.]],
 
        [[  0.,   0.,   0., ...,   5.,   0.,   0.],
         [  0.,   0.,   0., ...,   9.,   0.,   0.],
         [  0.,   0.,   3., ...,   6.,   0.,   0.],
         ..., 
         [  0.,   0.,   1., ...,   6.,   0.,   0.],
         [  0.,   0.,   1., ...,   6.,   0.,   0.],
         [  0.,   0.,   0., ...,  10.,   0.,   0.]],
 
        [[  0.,   0.,   0., ...,  12.,   0.,   0.],
         [  0.,   0.,   3., ...,  14.,   0.,   0.],
         [  0.,   0.,   8., ...,  16.,   0.,   0.],
         ..., 
         [  0.,   9.,  16., ...,   0.,   0.,   0.],
         [  0.,   3.,  13., ...,  11.,   5.,   0.],
         [  0.,   0.,   0., ...,  16.,   9.,   0.]],
 
        ..., 
        [[  0.,   0.,   1., ...,   1.,   0.,   0.],
         [  0.,   0.,  13., ...,   2.,   1.,   0.],
         [  0.,   0.,  16., ...,  16.,   5.,   0.],
         ..., 
         [  0.,   0.,  16., ...,  15.,   0.,   0.],
         [  0.,   0.,  15., ...,  16.,   0.,   0.],
         [  0.,   0.,   2., ...,   6.,   0.,   0.]],
 
        [[  0.,   0.,   2., ...,   0.,   0.,   0.],
         [  0.,   0.,  14., ...,  15.,   1.,   0.],
         [  0.,   4.,  16., ...,  16.,   7.,   0.],
         ..., 
         [  0.,   0.,   0., ...,  16.,   2.,   0.],
         [  0.,   0.,   4., ...,  16.,   2.,   0.],
         [  0.,   0.,   5., ...,  12.,   0.,   0.]],
 
        [[  0.,   0.,  10., ...,   1.,   0.,   0.],
         [  0.,   2.,  16., ...,   1.,   0.,   0.],
         [  0.,   0.,  15., ...,  15.,   0.,   0.],
         ..., 
         [  0.,   4.,  16., ...,  16.,   6.,   0.],
         [  0.,   8.,  16., ...,  16.,   8.,   0.],
         [  0.,   1.,   8., ...,  12.,   1.,   0.]]]),
 'target': array([0, 1, 2, ..., 8, 9, 8]),
 'target_names': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])}
In [35]:
# Creating a dataset with imbalanced binary classes:  
# Negative class (0) is 'not digit 1' 
# Positive class (1) is 'digit 1'
y_binary_imbalanced = y.copy()
y_binary_imbalanced[y_binary_imbalanced != 1] = 0

print('Original labels:\t', y[1:30])
print('New binary labels:\t', y_binary_imbalanced[1:30])
Original labels:	 [1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9]
New binary labels:	 [1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
In [36]:
np.bincount(y_binary_imbalanced)    # Negative class (0) is the most frequent class
Out[36]:
array([1615,  182])
In [37]:
X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)

# Accuracy of Support Vector Machine classifier
from sklearn.svm import SVC

svm = SVC(kernel='rbf', C=1).fit(X_train, y_train)
svm.score(X_test, y_test)
Out[37]:
0.90888888888888886

Dummy Classifiers

DummyClassifier is a classifier that makes predictions using simple rules, which can be useful as a baseline for comparison against real classifiers, especially with imbalanced classes. Dummy classifiers are called that because they don't even look at the data to make a prediction; they simply use the strategy, or rule of thumb, that you instruct them to use when creating them. In fact, when you create the classifier, you set the strategy argument to tell it what rule of thumb to use to make its predictions. Here we set the strategy to 'most_frequent' to predict the most frequent class.

In [38]:
from sklearn.dummy import DummyClassifier

# Negative class (0) is most frequent
dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(X_train, y_train)
# Therefore the dummy 'most_frequent' classifier always predicts class 0
y_dummy_predictions = dummy_majority.predict(X_test)

y_dummy_predictions
Out[38]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
In [39]:
dummy_majority.score(X_test, y_test)
Out[39]:
0.9044444444444445

Dummy classifiers completely ignore the input data!

  • Dummy classifiers serve as a sanity check on your classifier's performance
  • They provide a null metric (e.g. null accuracy) baseline
  • Dummy classifiers should not be used for real problems
  • Some commonly-used settings for the strategy parameter for DummyClassifier in scikit-learn:
    • most_frequent : predicts the most frequent label in the training set.
    • stratified : random predictions based on training set class distribution.
    • uniform : generates predictions uniformly at random
    • constant : always predicts a constant label provided by the user
      • A major motivation for this strategy is $F_1$-scoring, when the positive class is in the minority.

When we have a binary classification task where the most frequent class is the negative class, using the most_frequent strategy will never predict the positive class, so the count of correctly predicted positive instances will always be 0. This causes important metrics like the $F_1$-score to always be 0. Using the constant strategy, we can force a dummy classifier to always predict the positive class even if it's the minority class, and this leads to a more meaningful computation of the $F_1$-score.
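As a minimal sketch (reusing X_train, y_train, X_test, and y_test from the cells above), a constant-strategy dummy that always predicts the positive class 1 gives a defined, non-zero F1 baseline:

from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

# Always predict the positive (minority) class 1, regardless of the input features
dummy_constant = DummyClassifier(strategy='constant', constant=1).fit(X_train, y_train)
y_constant_predicted = dummy_constant.predict(X_test)

# Unlike the 'most_frequent' dummy, this baseline produces a non-zero F1-score
print('F1 (constant-positive dummy): {:.2f}'.format(f1_score(y_test, y_constant_predicted)))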

What if my classifier accuracy is close to the null accuracy baseline? This could be a sign of:

  • Ineffective, erroneous or missing features
  • Poor choice of kernel or hyperparameter
  • Large class imbalance

For example, if we change the support vector classifier's kernel parameter from rbf to linear and recompute the accuracy on this retrained classifier, we can see that this leads to much better performance of almost 98%, compared to the most_frequent baseline of about 90%.

In [40]:
svm = SVC(kernel='linear', C=1).fit(X_train, y_train)
svm.score(X_test, y_test)
Out[40]:
0.97777777777777775

In general, for imbalanced classification problems, you should use metrics other than accuracy; we'll look at one shortly called AUC, which is short for area under the curve.

Dummy Regressors strategy parameter options:

  • mean: predicts the mean of the training targets
  • median: predicts the median of the training targets
  • quantile: predicts a user-provided quantile of the training targets
  • constant: predicts a constant user-provided value

Confusion matrices

Binary (two-class) confusion matrix

In [41]:
from sklearn.metrics import confusion_matrix

# Negative class (0) is most frequent
dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(X_train, y_train)
y_majority_predicted = dummy_majority.predict(X_test)
confusion = confusion_matrix(y_test, y_majority_predicted)

print('Most frequent class (dummy classifier)\n', confusion)
Most frequent class (dummy classifier)
 [[407   0]
 [ 43   0]]
In [42]:
# produces random predictions w/ same class proportion as training set
dummy_classprop = DummyClassifier(strategy='stratified').fit(X_train, y_train)
y_classprop_predicted = dummy_classprop.predict(X_test)
confusion = confusion_matrix(y_test, y_classprop_predicted)

print('Random class-proportional prediction (dummy classifier)\n', confusion)
Random class-proportional prediction (dummy classifier)
 [[373  34]
 [ 41   2]]
In [43]:
svm = SVC(kernel='linear', C=1).fit(X_train, y_train)
svm_predicted = svm.predict(X_test)
confusion = confusion_matrix(y_test, svm_predicted)

print('Support vector machine classifier (linear kernel, C=1)\n', confusion)
Support vector machine classifier (linear kernel, C=1)
 [[402   5]
 [  5  38]]
In [44]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression().fit(X_train, y_train)
lr_predicted = lr.predict(X_test)
confusion = confusion_matrix(y_test, lr_predicted)

print('Logistic regression classifier (default settings)\n', confusion)
Logistic regression classifier (default settings)
 [[401   6]
 [  6  37]]
In [45]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
tree_predicted = dt.predict(X_test)
confusion = confusion_matrix(y_test, tree_predicted)

print('Decision tree classifier (max_depth = 2)\n', confusion)
Decision tree classifier (max_depth = 2)
 [[400   7]
 [ 17  26]]

Evaluation metrics for binary classification

Precision is the fraction of instances predicted positive that are truly positive, $TP/(TP+FP)$, and recall (also known as sensitivity or true positive rate) is the fraction of truly positive instances that the classifier finds, $TP/(TP+FN)$. There is often a tradeoff between precision and recall:

  • Recall-oriented machine learning tasks:
    • Search and information extraction in legal discovery
    • Tumor detection
    • Often paired with a human expert to filter out false positives
  • Precision-oriented machine learning tasks:
    • Search engine ranking, query suggestion
    • Document classification

$F_1$-score: combining precision & recall into a single number

When evaluating classifiers, it's often convenient to compute a quantity known as the $F_1$-score, which combines precision and recall into a single number. Mathematically it is the harmonic mean of precision and recall: $$F_1=2\cdot \frac{Precision\cdot Recall}{Precision+Recall} = \frac {2\cdot TP}{2\cdot TP+FN+FP}$$

$$F_{\beta}=(1+\beta^2)\cdot \frac{Precision\cdot Recall}{(\beta^2\cdot Precision)+Recall} = \frac {(1+\beta^2)\cdot TP}{(1+\beta^2)\cdot TP+\beta^2 \cdot FN+FP}$$

$\beta$ allows adjustment of the metric to control the emphasis on recall vs precision:

  • Precision-oriented users: $\beta = 0.5$ (false positives hurt performance more than false negatives)
  • Recall-oriented users: $\beta = 2$ (false negatives hurt performance more than false positives)
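As a quick sketch of the $\beta$ adjustment, scikit-learn's fbeta_score can be applied to the decision tree predictions from the cells above (reusing y_test and tree_predicted); $\beta=1$ recovers the ordinary $F_1$:

from sklearn.metrics import fbeta_score

# beta < 1 weights precision more heavily; beta > 1 weights recall more heavily; beta = 1 recovers F1
for beta in [0.5, 1, 2]:
    print('F_beta (beta = {}): {:.2f}'.format(beta, fbeta_score(y_test, tree_predicted, beta=beta)))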
In [46]:
# Once a model is built, we can compute its accuracy, precision, recall, F1, and so on
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Accuracy = (TP + TN) / (TP + TN + FP + FN)
# Precision = TP / (TP + FP)
# Recall = TP / (TP + FN)  Also known as sensitivity, or True Positive Rate
# F1 = 2 * Precision * Recall / (Precision + Recall) 
print('Accuracy: {:.2f}'.format(accuracy_score(y_test, tree_predicted)))
print('Precision: {:.2f}'.format(precision_score(y_test, tree_predicted)))
print('Recall: {:.2f}'.format(recall_score(y_test, tree_predicted)))
print('F1: {:.2f}'.format(f1_score(y_test, tree_predicted)))   
Accuracy: 0.95
Precision: 0.79
Recall: 0.60
F1: 0.68
In [47]:
# Combined report with all above metrics
from sklearn.metrics import classification_report

print(classification_report(y_test, tree_predicted, target_names=['not 1', '1']))
# the last column "support" shows the number of instances in the test set that have that true label
             precision    recall  f1-score   support

      not 1       0.96      0.98      0.97       407
          1       0.79      0.60      0.68        43

avg / total       0.94      0.95      0.94       450

In [48]:
print('Random class-proportional (dummy)\n', 
      classification_report(y_test, y_classprop_predicted, target_names=['not 1', '1']))
print('SVM\n', 
      classification_report(y_test, svm_predicted, target_names = ['not 1', '1']))
print('Logistic regression\n', 
      classification_report(y_test, lr_predicted, target_names = ['not 1', '1']))
print('Decision tree\n', 
      classification_report(y_test, tree_predicted, target_names = ['not 1', '1']))
Random class-proportional (dummy)
              precision    recall  f1-score   support

      not 1       0.90      0.92      0.91       407
          1       0.06      0.05      0.05        43

avg / total       0.82      0.83      0.83       450

SVM
              precision    recall  f1-score   support

      not 1       0.99      0.99      0.99       407
          1       0.88      0.88      0.88        43

avg / total       0.98      0.98      0.98       450

Logistic regression
              precision    recall  f1-score   support

      not 1       0.99      0.99      0.99       407
          1       0.86      0.86      0.86        43

avg / total       0.97      0.97      0.97       450

Decision tree
              precision    recall  f1-score   support

      not 1       0.96      0.98      0.97       407
          1       0.79      0.60      0.68        43

avg / total       0.94      0.95      0.94       450

Decision functions

Many classifiers in scikit-learn can provide information about the uncertainty associated with a particular prediction, either through the decision_function method or the predict_proba method.

  • Each classifier score value per test point indicates how confidently the classifier predicts the positive class (large-magnitude positive values) or the negative class (large-magnitude negative values).
  • Choosing a fixed decision threshold gives a classification rule.
  • By sweeping the decision threshold through the entire range of possible score values, we get a series of classification outcomes that form a curve.
In [49]:
X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)
y_scores_lr = lr.fit(X_train, y_train).decision_function(X_test)
y_score_list = list(zip(y_test[0:20], y_scores_lr[0:20]))

# We can see that instances in the negative class often have large-magnitude negative scores, while instances in
# the positive class have positive scores from the logistic regression classifier.
# show the decision_function scores for first 20 instances
y_score_list
Out[49]:
[(0, -23.177112362903241),
 (0, -13.541470729541413),
 (0, -21.722900989694729),
 (0, -18.907438437430027),
 (0, -19.735821729002289),
 (0, -9.7498078195600613),
 (1, 5.2349604859009276),
 (0, -19.307551661127864),
 (0, -25.101182889530396),
 (0, -21.827362391350579),
 (0, -24.151343401889438),
 (0, -19.576969790071697),
 (0, -22.574689400560423),
 (0, -10.823324268750714),
 (0, -11.912123406737392),
 (0, -10.97922371337485),
 (1, 11.206006114721543),
 (0, -27.646002317931909),
 (0, -12.859381428186682),
 (0, -25.848764845244997)]

Predicted Probability of Class Membership (predict_proba)

  • Typical rule: choose the most likely class
    • e.g. predict class 1 if its estimated probability is above the default threshold of 0.50
  • Adjusting the threshold affects the predictions of the classifier
  • A higher threshold results in a more conservative classifier
    • e.g. only predict class 1 if the estimated probability of class 1 is above 70%
    • This increases precision: the classifier doesn't predict class 1 as often, but when it does, it gets a high proportion of class 1 instances correct
  • Not all models provide realistic probability estimates
In [50]:
X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)
y_proba_lr = lr.fit(X_train, y_train).predict_proba(X_test)
y_proba_list = list(zip(y_test[0:20], y_proba_lr[0:20,1]))

# show the probability of positive class for first 20 instances
y_proba_list

# you can see that many entries with a positive label of 1 have a high probability like 0.995, while many
# negative-label instances have a very low predicted probability
Out[50]:
[(0, 8.59622935435435e-11),
 (0, 1.3152654562877047e-06),
 (0, 3.6801368532691122e-10),
 (0, 6.1461593657683724e-09),
 (0, 2.6843647432086779e-09),
 (0, 5.8302468525539246e-05),
 (1, 0.99470119196307671),
 (0, 4.119427394145904e-09),
 (0, 1.2551475474865303e-11),
 (0, 3.3151024835512065e-10),
 (0, 3.2449262447677433e-11),
 (0, 3.1465167386169614e-09),
 (0, 1.5701348842784633e-10),
 (0, 1.9928808739415592e-05),
 (0, 6.7085339693943354e-06),
 (0, 1.7052039108840334e-05),
 (1, 0.99998640787132853),
 (0, 9.8513045790707473e-13),
 (0, 2.6015997081086803e-06),
 (0, 5.9432508848392272e-12)]

Note that not all models provide useful probability estimates of this type. For example, a model that was over-fit to a training set might provide overly optimistic high probabilities that are in fact not accurate.
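To illustrate the threshold bullets above, here is a minimal sketch (reusing y_test and y_proba_lr from the cell above) comparing the default 0.5 threshold against a more conservative 0.7 threshold:

from sklearn.metrics import precision_score, recall_score

for threshold in [0.5, 0.7]:
    # Predict class 1 only when the estimated probability of class 1 exceeds the threshold
    y_pred_at_threshold = (y_proba_lr[:, 1] > threshold).astype(int)
    print('threshold = {:.1f}  precision = {:.2f}  recall = {:.2f}'.format(
        threshold,
        precision_score(y_test, y_pred_at_threshold),
        recall_score(y_test, y_pred_at_threshold)))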

Precision-recall curves

Varying the Decision Threshold

We can use these decision scores or prediction probabilities to get a more complete picture of a classifier's performance. For a particular application, we might pick a specific decision threshold depending on whether we want the classifier to be more or less conservative about making false-positive or false-negative errors. When developing a new model, it might not be entirely clear what the right decision threshold is, or how that choice will affect evaluation metrics like precision and recall. So instead, we'll look at how the classifier performs across all possible decision thresholds.

(This discussion refers to a lecture figure not reproduced here: a list of test instances sorted by classifier score, shown with their true labels.) If we set a decision threshold, for example at -20, then the instances with scores below the threshold are predicted to be in the negative class, and the instances with scores at or above the threshold are predicted to be in the positive class.

Having picked this specific threshold of -20 and partitioned the test points this way, we can compute precision and recall for the points that are predicted to be in the positive class. In the figure's example, 12 instances are predicted positive, but only four of them are actually positive, so the precision at this threshold is 4/12 ≈ 0.33.
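The same computation applied to this notebook's data is sketched below (reusing y_test and y_scores_lr from the earlier logistic regression cell; the -20 threshold is just the value borrowed from the figure's example):

from sklearn.metrics import precision_score, recall_score

threshold = -20
# Instances with a decision score above the threshold are predicted positive
y_pred_at_threshold = (y_scores_lr > threshold).astype(int)

print('Precision at threshold {}: {:.2f}'.format(threshold, precision_score(y_test, y_pred_at_threshold)))
print('Recall at threshold {}: {:.2f}'.format(threshold, recall_score(y_test, y_pred_at_threshold)))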

In [67]:
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, y_scores_lr)
closest_zero = np.argmin(np.abs(thresholds))
closest_zero_p = precision[closest_zero]
closest_zero_r = recall[closest_zero]

plt.figure()
plt.xlim([0.0, 1.01])
plt.ylim([0.0, 1.01])
plt.plot(precision, recall, label='Precision-Recall Curve')
plt.plot(closest_zero_p, closest_zero_r, 'o', markersize = 12, fillstyle = 'none', c='r', mew=3)
plt.xlabel('Precision', fontsize=16)
plt.ylabel('Recall', fontsize=16)
plt.gca().set_aspect('equal')
plt.show()

The red circle indicates the precision and recall that's achieved when the decision threshold is zero.

In this particular case you can also see that it's not a perfectly smooth curve: there are some jagged sections, and the jumps tend to get a little bigger as we approach maximum precision. This is a consequence of how the formulas for precision and recall are computed: they use discrete counts that include the number of true positives.

As the decision threshold increases, fewer and fewer points remain as positive predictions, so the fractions computed from these smaller counts can change quite dramatically with small changes in the decision threshold. That's why the trailing edge of the precision-recall curve can appear a bit jagged when you plot it.

ROC curves, Area-Under-Curve (AUC)

ROC curves, or receiver operating characteristic curves, are a very widely used visualization method that illustrates the performance of a binary classifier.

X-axis: False Positive Rate
Y-axis: True Positive Rate

Top left corner:

  • The "ideal" point
  • False positive rate of zero
  • True positive rate of one

"Steepness" of ROC curves is important:

  • Maximize the true positive rate
  • while minimizing the false positive rate
In [68]:
from sklearn.metrics import roc_curve, auc

X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)

y_score_lr = lr.fit(X_train, y_train).decision_function(X_test)
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_score_lr)
roc_auc_lr = auc(fpr_lr, tpr_lr)

plt.figure()
plt.xlim([-0.01, 1.00])
plt.ylim([-0.01, 1.01])
plt.plot(fpr_lr, tpr_lr, lw=3, label='LogRegr ROC curve (area = {:0.2f})'.format(roc_auc_lr))
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve (1-of-10 digits classifier)', fontsize=16)
plt.legend(loc='lower right', fontsize=13)
plt.plot([0, 1], [0, 1], color='navy', lw=3, linestyle='--')
plt.gca().set_aspect('equal')
plt.show()

The dotted line here is the curve that results from a classifier that randomly guesses the label for a binary class. It's basically like flipping a coin: if you have two classes with equal numbers of positive and negative instances, then flipping a coin will give you roughly equal rates of false positives and true positives, over a large enough data set.

So the dotted line is used as a baseline. A bad classifier will have performance that is random, maybe even worse than random, or only slightly better than random; a reasonably good classifier will give an ROC curve that is consistently better than random across all decision threshold choices; and an excellent classifier would be one like the LogRegr ROC curve shown here, which in this particular example comes from a logistic regression classifier.

The shape of the curve matters as well: in terms of steepness, we want classifiers that maximize the true positive rate while minimizing the false positive rate. We can quantify the goodness of a classifier by looking at how much area there is underneath the curve; the area underneath the random classifier's curve is 0.5. We use AUC, the single number that measures this total area underneath the ROC curve, as a way to summarize a classifier's performance.

In [69]:
from matplotlib import cm

X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)

plt.figure()
plt.xlim([-0.01, 1.00])
plt.ylim([-0.01, 1.01])
for g in [0.01, 0.1, 0.20, 1]:
    svm = SVC(gamma=g).fit(X_train, y_train)
    y_score_svm = svm.decision_function(X_test)
    fpr_svm, tpr_svm, _ = roc_curve(y_test, y_score_svm)
    roc_auc_svm = auc(fpr_svm, tpr_svm)
    accuracy_svm = svm.score(X_test, y_test)
    print("gamma = {:.2f}  accuracy = {:.2f}   AUC = {:.2f}".format(g, accuracy_svm, 
                                                                    roc_auc_svm))
    plt.plot(fpr_svm, tpr_svm, lw=3, alpha=0.7, 
             label='SVM (gamma = {:0.2f}, area = {:0.2f})'.format(g, roc_auc_svm))

plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate (Recall)', fontsize=16)
plt.plot([0, 1], [0, 1], color='k', lw=0.5, linestyle='--')
plt.legend(loc="lower right", fontsize=11)
plt.title('ROC curve: (1-of-10 digits classifier)', fontsize=16)
plt.gca().set_aspect('equal')

plt.show()
gamma = 0.01  accuracy = 0.91   AUC = 1.00
gamma = 0.10  accuracy = 0.90   AUC = 0.98
gamma = 0.20  accuracy = 0.90   AUC = 0.66
gamma = 1.00  accuracy = 0.90   AUC = 0.50

Evaluation measures for multi-class classification

  • Multi-class evaluation is an extension of the binary case
    • A collection of true vs predicted binary outcomes, one per class
    • Confusion matrices are especially useful
    • Classification report
  • Overall evaluation metrics are averages across classes
    • But there are different ways to average multi-class results
    • The support (number of instances) for each class is important to consider, e.g. in case of imbalanced classes
  • Multi-label classification: each instance can have multiple labels (not covered here)

Multi-class confusion matrix

The multi-class confusion matrix is a straightforward extension of the binary classifier's 2x2 confusion matrix.

As a general rule of thumb, as part of model evaluation I suggest always looking at the confusion matrix for your classifier, to get some insight into what kinds of errors it is making for each class, including whether some classes are much more prone to certain kinds of errors than others.

In [70]:
dataset = load_digits()
X, y = dataset.data, dataset.target
X_train_mc, X_test_mc, y_train_mc, y_test_mc = train_test_split(X, y, random_state=0)


svm = SVC(kernel = 'linear').fit(X_train_mc, y_train_mc)
svm_predicted_mc = svm.predict(X_test_mc)
confusion_mc = confusion_matrix(y_test_mc, svm_predicted_mc)
df_cm = pd.DataFrame(confusion_mc, 
                     index = [i for i in range(0,10)], columns = [i for i in range(0,10)])

plt.figure(figsize=(5.5,4))
sns.heatmap(df_cm, annot=True)
plt.title('SVM Linear Kernel \nAccuracy:{0:.3f}'.format(accuracy_score(y_test_mc, 
                                                                       svm_predicted_mc)))
plt.ylabel('True label')
plt.xlabel('Predicted label')


svm = SVC(kernel = 'rbf').fit(X_train_mc, y_train_mc)
svm_predicted_mc = svm.predict(X_test_mc)
confusion_mc = confusion_matrix(y_test_mc, svm_predicted_mc)
df_cm = pd.DataFrame(confusion_mc, index = [i for i in range(0,10)],
                  columns = [i for i in range(0,10)])

plt.figure(figsize = (5.5,4))
sns.heatmap(df_cm, annot=True)
plt.title('SVM RBF Kernel \nAccuracy:{0:.3f}'.format(accuracy_score(y_test_mc, 
                                                                    svm_predicted_mc)))
plt.ylabel('True label')
plt.xlabel('Predicted label');

Multi-class classification report

Get a classification report that summarizes multiple evaluation metrics for a multi-class classifier, with each metric computed per class along with overall averages.

In [55]:
print(classification_report(y_test_mc, svm_predicted_mc))
             precision    recall  f1-score   support

          0       1.00      0.65      0.79        37
          1       1.00      0.23      0.38        43
          2       1.00      0.39      0.56        44
          3       1.00      0.93      0.97        45
          4       0.14      1.00      0.25        38
          5       1.00      0.33      0.50        48
          6       1.00      0.54      0.70        52
          7       1.00      0.35      0.52        48
          8       1.00      0.02      0.04        48
          9       1.00      0.55      0.71        47

avg / total       0.93      0.49      0.54       450

Micro- vs. macro-averaged metrics

  • If the classes have about the same number of instances, macro- and micro-average will be about the same
  • If some classes are much larger (more instances) than others, and you want to:
    • Weight your metric toward the largest ones, use micro average
    • Weight your metric toward the smallest ones, use macro average
  • If the micro-average is much lower than the macro-average, examine the larger classes for poor metric performance. The model is probably misclassifying the frequent labels more than the infrequent labels.
  • If the macro-average is much lower than the micro-average, examine the smaller classes for poor metric performance. The model is probably misclassifying the infrequent labels more than the frequent labels.
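As a sketch of what the averaging does (reusing y_test_mc and svm_predicted_mc from the cells above): the macro average is the unweighted mean of the per-class precisions, while the micro average pools true and false positive counts across all classes first:

import numpy as np
from sklearn.metrics import precision_score

# Macro: compute precision separately for each class, then take the unweighted mean
per_class_precision = precision_score(y_test_mc, svm_predicted_mc, average=None)
print('Macro-averaged precision (by hand): {:.2f}'.format(np.mean(per_class_precision)))

# Micro: pool the true/false positive counts over all classes before dividing
print('Micro-averaged precision: {:.2f}'.format(precision_score(y_test_mc, svm_predicted_mc, average='micro')))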

In [56]:
print('Micro-averaged precision = {:.2f} (treat instances equally)'
      .format(precision_score(y_test_mc, svm_predicted_mc, average = 'micro')))
print('Macro-averaged precision = {:.2f} (treat classes equally)'
      .format(precision_score(y_test_mc, svm_predicted_mc, average = 'macro')))
Micro-averaged precision = 0.49 (treat instances equally)
Macro-averaged precision = 0.91 (treat classes equally)
In [57]:
print('Micro-averaged f1 = {:.2f} (treat instances equally)'
      .format(f1_score(y_test_mc, svm_predicted_mc, average = 'micro')))
print('Macro-averaged f1 = {:.2f} (treat classes equally)'
      .format(f1_score(y_test_mc, svm_predicted_mc, average = 'macro')))
Micro-averaged f1 = 0.49 (treat instances equally)
Macro-averaged f1 = 0.54 (treat classes equally)

Regression evaluation metrics

  • Typically r2_score is enough; although it is usually positive, it can go negative for bad model fits, such as when fitting non-linear functions to data.
    • Reminder: computes how well future instances will be predicted
    • Best possible score is 1.0
    • A constant prediction that ignores the input (the DummyRegressor baseline) scores 0.0
  • Alternative metrics include (a short sketch computing these follows the regression example below):
    • mean_absolute_error (absolute difference of target & predicted values); in machine learning terms this corresponds to the expected value of the L1-norm loss. It is sometimes used, for example, to assess focused outcomes for regression in time series analysis.
    • mean_squared_error (squared difference of target & predicted values); this corresponds to the expected value of the L2-norm loss. It is widely used for many regression problems. Like mean_absolute_error, mean_squared_error doesn't distinguish between over- and under-estimates.
    • median_absolute_error (robust to outliers); when ignoring outliers is important, you can use the median_absolute_error score, which is robust to outliers because it uses the median of the error distribution rather than the mean.

Dummy Regressors

The DummyRegressor class implements four simple baseline rules for regression, using the strategy parameter:

  • mean predicts the mean of the training target values
  • median predicts the median of the training target values
  • quantile predicts a user-provided quantile of the training values (e.g. value at the $75^{th}$ percentile)
  • constant predicts a custom constant value provided by the user

Although regression typically has simpler evaluation needs than classification, it does pay to double-check that the evaluation metric you choose for a regression problem penalizes errors in a way that reflects the consequences of those errors for the business, organizational, or user needs of your application.

In [63]:
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.dummy import DummyRegressor

diabetes = datasets.load_diabetes()

X = diabetes.data[:, None, 6]
y = diabetes.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lm = LinearRegression().fit(X_train, y_train)
lm_dummy_mean = DummyRegressor(strategy = 'mean').fit(X_train, y_train)

y_predict = lm.predict(X_test)
y_predict_dummy_mean = lm_dummy_mean.predict(X_test)

print('Linear model, coefficients: ', lm.coef_)
print("Mean squared error (dummy): {:.2f}".format(mean_squared_error(y_test, 
                                                                     y_predict_dummy_mean)))
print("Mean squared error (linear model): {:.2f}".format(mean_squared_error(y_test, y_predict)))
print("r2_score (dummy): {:.2f}".format(r2_score(y_test, y_predict_dummy_mean)))
print("r2_score (linear model): {:.2f}".format(r2_score(y_test, y_predict)))

# Plot outputs
plt.scatter(X_test, y_test,  color='black')
plt.plot(X_test, y_predict, color='green', linewidth=2)
plt.plot(X_test, y_predict_dummy_mean, color='red', linestyle = 'dashed', 
         linewidth=2, label = 'dummy')

plt.show()
Linear model, coefficients:  [-698.80206267]
Mean squared error (dummy): 4965.13
Mean squared error (linear model): 4646.74
r2_score (dummy): -0.00
r2_score (linear model): 0.06
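For the alternative regression metrics mentioned earlier, here is a minimal sketch (reusing y_test, y_predict, and y_predict_dummy_mean from the cell above) computing mean_absolute_error and median_absolute_error:

from sklearn.metrics import mean_absolute_error, median_absolute_error

print('Mean absolute error (dummy): {:.2f}'.format(mean_absolute_error(y_test, y_predict_dummy_mean)))
print('Mean absolute error (linear model): {:.2f}'.format(mean_absolute_error(y_test, y_predict)))
print('Median absolute error (linear model): {:.2f}'.format(median_absolute_error(y_test, y_predict)))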

Beyond the essential evaluation metrics covered in this course, online controlled experiments, which involve A/B testing and other techniques, are perhaps the most important way that machine learning algorithms are evaluated for real-world use in Web and other online applications. This article by Ron Kohavi, Randal Henne, and Dan Sommerfield reviews the key points of running controlled experiments, along with important engineering issues and limitations to keep in mind when using them.

Kohavi, R., Henne, R. M., & Sommerfield, D. (2007). Practical guide to controlled experiments on the web. Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '07. doi:10.1145/1281192.1281295


Model selection using evaluation metrics

  • Train/test on same data
    • Single metric
    • Typically overfits and likely won't generalize well on new data
    • But can serve as sanity check: low accuracy on the training set may indicate an implementation problem
  • Single train/test split
    • Single metric
    • Speed and simplicity
    • Lack of variance information
  • K-fold cross-validation
    • K train-test splits
    • Average metric over all splits
    • Can be combined with parameter grid search: GridSearchCV (def. cv=3)

Cross-validation example

In [59]:
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

dataset = load_digits()
# again, making this a binary problem with 'digit 1' as positive class 
# and 'not 1' as negative class
X, y = dataset.data, dataset.target == 1
clf = SVC(kernel='linear', C=1)

# accuracy is the default scoring metric
print('Cross-validation (accuracy)', cross_val_score(clf, X, y, cv=5))
# use AUC as scoring metric
print('Cross-validation (AUC)', cross_val_score(clf, X, y, cv=5, scoring = 'roc_auc'))
# use recall as scoring metric
print('Cross-validation (recall)', cross_val_score(clf, X, y, cv=5, scoring = 'recall'))
Cross-validation (accuracy) [ 0.91944444  0.98611111  0.97214485  0.97493036  0.96935933]
Cross-validation (AUC) [ 0.9641871   0.9976571   0.99372205  0.99699002  0.98675611]
Cross-validation (recall) [ 0.81081081  0.89189189  0.83333333  0.83333333  0.83333333]

Grid search example

In [60]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score

dataset = load_digits()
X, y = dataset.data, dataset.target == 1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel='rbf')
grid_values = {'gamma': [0.001, 0.01, 0.05, 0.1, 1, 10, 100]}

# default metric to optimize over grid parameters: accuracy
grid_clf_acc = GridSearchCV(clf, param_grid = grid_values)
grid_clf_acc.fit(X_train, y_train)
y_decision_fn_scores_acc = grid_clf_acc.decision_function(X_test) 

print('Grid best parameter (max. accuracy): ', grid_clf_acc.best_params_)
print('Grid best score (accuracy): ', grid_clf_acc.best_score_)

# alternative metric to optimize over grid parameters: AUC
grid_clf_auc = GridSearchCV(clf, param_grid = grid_values, scoring = 'roc_auc')
grid_clf_auc.fit(X_train, y_train)
y_decision_fn_scores_auc = grid_clf_auc.decision_function(X_test) 

print('Test set AUC: ', roc_auc_score(y_test, y_decision_fn_scores_auc))
print('Grid best parameter (max. AUC): ', grid_clf_auc.best_params_)
print('Grid best score (AUC): ', grid_clf_auc.best_score_)
Grid best parameter (max. accuracy):  {'gamma': 0.001}
Grid best score (accuracy):  0.996288047513
Test set AUC:  0.999828581224
Grid best parameter (max. AUC):  {'gamma': 0.001}
Grid best score (AUC):  0.99987412783

Evaluation metrics supported for model selection

You can see the complete list of names for the evaluation metrics supported by the scoring parameter by running the following code, which uses the SCORERS variable imported from sklearn.metrics.scorer. You can see metrics for classification, such as the string 'precision_micro' that represents micro-averaged precision, as well as metrics for regression, such as the 'r2' metric for the R-squared score.

In [61]:
from sklearn.metrics.scorer import SCORERS

print(sorted(list(SCORERS.keys())))
['accuracy', 'adjusted_mutual_info_score', 'adjusted_rand_score', 'average_precision', 'completeness_score', 'explained_variance', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'fowlkes_mallows_score', 'homogeneity_score', 'log_loss', 'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'mutual_info_score', 'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_median_absolute_error', 'normalized_mutual_info_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc', 'v_measure_score']

Two-feature classification example using the digits dataset

Example: Optimizing a classifier using different evaluation metrics

This example shows how a classifier's decision boundary changes when it's optimized for different evaluation metrics. The classification problem is based on the same binary digit classifier training and test sets we've been using as an example throughout the notebook.

In these classification visualization examples, the positive examples (the digit 1) are shown as black points, and the region of positive class prediction is shown in the light yellow region to the right of the decision boundary. The negative examples (all other digits) are shown as white points, and the region of negative class prediction in these figures is to the left of the decision boundary. The data points have been plotted using two of the 64 feature values in the digits dataset and have been jittered a little.

In [9]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from adspy_shared_utilities import plot_class_regions_for_classifier_subplot
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV


dataset = load_digits()
X, y = dataset.data, dataset.target == 1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Create a two-feature input vector matching the example plot above
# We jitter the points (add a small amount of random noise) in case there are areas
# in feature space where many instances have the same features.
jitter_delta = 0.25
X_twovar_train = X_train[:,[20,59]]+ np.random.rand(X_train.shape[0], 2) - jitter_delta
X_twovar_test  = X_test[:,[20,59]] + np.random.rand(X_test.shape[0], 2) - jitter_delta

# We apply GridSearchCV here to explore different values of the optional class_weight parameter that controls how much
# weight is given to each of the two classes during training
clf = SVC(kernel = 'linear').fit(X_twovar_train, y_train)
grid_values = {'class_weight':['balanced', {1:2},{1:3},{1:4},{1:5},{1:10},{1:20},{1:50}]}
plt.figure(figsize=(9,6))
for i, eval_metric in enumerate(('precision','recall', 'f1','roc_auc')):
    grid_clf_custom = GridSearchCV(clf, param_grid=grid_values, scoring=eval_metric)
    grid_clf_custom.fit(X_twovar_train, y_train)
    print('Grid best parameter (max. {0}): {1}'
          .format(eval_metric, grid_clf_custom.best_params_))
    print('Grid best score ({0}): {1}'
          .format(eval_metric, grid_clf_custom.best_score_))
    plt.subplots_adjust(wspace=0.3, hspace=0.3)
    plot_class_regions_for_classifier_subplot(grid_clf_custom, X_twovar_test, y_test, None,
                                             None, None,  plt.subplot(2, 2, i+1))
    
    plt.title(eval_metric+'-oriented SVC')
plt.tight_layout()
plt.show()