3. Evaluation

When we began working with supervised machine learning methods, we evaluated a classifier's performance using its accuracy. Accuracy, as you might recall, is the fraction of samples that were classified correctly, that is, where the classifier's predicted label matched the correct (true) label.

Learning Objectives

  • Understand why accuracy only gives a partial picture of a classifier's performance.
  • Understand the motivation and definition of important evaluation metrics in machine learning
  • Learn how to use a variety of evaluation metrics to evaluate supervised machine learning models
  • Learn about choosing the right metric for selecting between models or for doing parameter tuning

Represent / Train / Evaluate / Refine Cycle

Representation (extract and select object features) ==> Train models (fit the estimator to the data) ==> Evaluation ==> Feature and model refinement ==> back to Representation

Evaluation

  • Different applications have very different goals
  • Accuracy is widely used, but many others are possible, e.g. :
    • User satisfaction (Web search)
    • Amount of revenue (e-commerce)
    • Increase in patient survival rates (medical)
  • It's very important to choose evaluation methods that match the goal of your application
  • Compute your selected evaluation metric for multiple different models
  • Then select the model with 'best' value of evaluation metric

Evaluation for Classification

Accuracy with Imbalanced Classes

  • Suppose you have two classes:
    • Relevant(R) : the positive class
    • Not_Relevant(N) : the negative class
  • Out of 1000 randomly selected items, on average
    • One item is relevant and has an "R" label
    • The rest of the items(999 of them) are not relevant and labelled "N".
  • Recall that: $$Accuracy = \frac{\text{correct predictions}}{\text{total instances}}$$

First, let's consider a binary classification task where there are many instances labeled with the negative class, but only a few instances that belong to the positive class. We might see this scenario in online search or recommender systems, where the system has to predict whether or not to display an advertisement, product suggestion, or query suggestion that is likely to be relevant given a user's query, what they clicked on in the past, and so on. Those relevant items are the positive examples, but there are many, many irrelevant items in the negative class that don't make sense to show the user. This is called an imbalanced class scenario.

  • You build a classifier to predict relevant items, and see that its accuracy on a test set is 99.9%
  • Wow! Amazingly good, right?
  • For comparison, suppose we had a "dummy" classifier that didn't look at the features at all, and always just blindly predicted the most frequent class (i.e. the negative "N" class)
  • Assuming a test set of 1000 instances, what would this dummy classifier's accuracy be?
  • Answer: $Accuracy_{DUMMY}=999/1000=99.9\%$ (see the short sketch below)
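A quick hedged sketch confirming this baseline, using scikit-learn's accuracy_score on a synthetic label array that mirrors the hypothetical 1-in-1000 scenario above:

import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.zeros(1000, dtype=int)         # 999 "N" items, encoded as 0
y_true[0] = 1                              # a single relevant "R" item, encoded as 1
y_pred_dummy = np.zeros(1000, dtype=int)   # always predict the majority class "N"

print(accuracy_score(y_true, y_pred_dummy))   # 0.999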

Preamble

In [1]:
%matplotlib notebook
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits # using the digits dataset, which has images of handwritten digits labeled with
                                         # ten classes, representing the digits 0 through 9

dataset = load_digits()
X, y = dataset.data, dataset.target

for class_name, class_count in zip(dataset.target_names, np.bincount(dataset.target)):
    print(class_name,class_count)
0 178
1 182
2 177
3 183
4 181
5 182
6 181
7 179
8 174
9 180
In [34]:
dataset
Out[34]:
{'DESCR': "Optical Recognition of Handwritten Digits Data Set\n===================================================\n\nNotes\n-----\nData Set Characteristics:\n    :Number of Instances: 5620\n    :Number of Attributes: 64\n    :Attribute Information: 8x8 image of integer pixels in the range 0..16.\n    :Missing Attribute Values: None\n    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)\n    :Date: July; 1998\n\nThis is a copy of the test set of the UCI ML hand-written digits datasets\nhttp://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits\n\nThe data set contains images of hand-written digits: 10 classes where\neach class refers to a digit.\n\nPreprocessing programs made available by NIST were used to extract\nnormalized bitmaps of handwritten digits from a preprinted form. From a\ntotal of 43 people, 30 contributed to the training set and different 13\nto the test set. 32x32 bitmaps are divided into nonoverlapping blocks of\n4x4 and the number of on pixels are counted in each block. This generates\nan input matrix of 8x8 where each element is an integer in the range\n0..16. This reduces dimensionality and gives invariance to small\ndistortions.\n\nFor info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.\nT. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.\nL. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,\n1994.\n\nReferences\n----------\n  - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their\n    Applications to Handwritten Digit Recognition, MSc Thesis, Institute of\n    Graduate Studies in Science and Engineering, Bogazici University.\n  - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.\n  - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.\n    Linear dimensionalityreduction using relevance weighted LDA. School of\n    Electrical and Electronic Engineering Nanyang Technological University.\n    2005.\n  - Claudio Gentile. A New Approximate Maximal Margin Classification\n    Algorithm. NIPS. 2000.\n",
 'data': array([[  0.,   0.,   5., ...,   0.,   0.,   0.],
        [  0.,   0.,   0., ...,  10.,   0.,   0.],
        [  0.,   0.,   0., ...,  16.,   9.,   0.],
        ..., 
        [  0.,   0.,   1., ...,   6.,   0.,   0.],
        [  0.,   0.,   2., ...,  12.,   0.,   0.],
        [  0.,   0.,  10., ...,  12.,   1.,   0.]]),
 'images': array([[[  0.,   0.,   5., ...,   1.,   0.,   0.],
         [  0.,   0.,  13., ...,  15.,   5.,   0.],
         [  0.,   3.,  15., ...,  11.,   8.,   0.],
         ..., 
         [  0.,   4.,  11., ...,  12.,   7.,   0.],
         [  0.,   2.,  14., ...,  12.,   0.,   0.],
         [  0.,   0.,   6., ...,   0.,   0.,   0.]],
 
        [[  0.,   0.,   0., ...,   5.,   0.,   0.],
         [  0.,   0.,   0., ...,   9.,   0.,   0.],
         [  0.,   0.,   3., ...,   6.,   0.,   0.],
         ..., 
         [  0.,   0.,   1., ...,   6.,   0.,   0.],
         [  0.,   0.,   1., ...,   6.,   0.,   0.],
         [  0.,   0.,   0., ...,  10.,   0.,   0.]],
 
        [[  0.,   0.,   0., ...,  12.,   0.,   0.],
         [  0.,   0.,   3., ...,  14.,   0.,   0.],
         [  0.,   0.,   8., ...,  16.,   0.,   0.],
         ..., 
         [  0.,   9.,  16., ...,   0.,   0.,   0.],
         [  0.,   3.,  13., ...,  11.,   5.,   0.],
         [  0.,   0.,   0., ...,  16.,   9.,   0.]],
 
        ..., 
        [[  0.,   0.,   1., ...,   1.,   0.,   0.],
         [  0.,   0.,  13., ...,   2.,   1.,   0.],
         [  0.,   0.,  16., ...,  16.,   5.,   0.],
         ..., 
         [  0.,   0.,  16., ...,  15.,   0.,   0.],
         [  0.,   0.,  15., ...,  16.,   0.,   0.],
         [  0.,   0.,   2., ...,   6.,   0.,   0.]],
 
        [[  0.,   0.,   2., ...,   0.,   0.,   0.],
         [  0.,   0.,  14., ...,  15.,   1.,   0.],
         [  0.,   4.,  16., ...,  16.,   7.,   0.],
         ..., 
         [  0.,   0.,   0., ...,  16.,   2.,   0.],
         [  0.,   0.,   4., ...,  16.,   2.,   0.],
         [  0.,   0.,   5., ...,  12.,   0.,   0.]],
 
        [[  0.,   0.,  10., ...,   1.,   0.,   0.],
         [  0.,   2.,  16., ...,   1.,   0.,   0.],
         [  0.,   0.,  15., ...,  15.,   0.,   0.],
         ..., 
         [  0.,   4.,  16., ...,  16.,   6.,   0.],
         [  0.,   8.,  16., ...,  16.,   8.,   0.],
         [  0.,   1.,   8., ...,  12.,   1.,   0.]]]),
 'target': array([0, 1, 2, ..., 8, 9, 8]),
 'target_names': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])}
In [35]:
# Creating a dataset with imbalanced binary classes:  
# Negative class (0) is 'not digit 1' 
# Positive class (1) is 'digit 1'
y_binary_imbalanced = y.copy()
y_binary_imbalanced[y_binary_imbalanced != 1] = 0

print('Original labels:\t', y[1:30])
print('New binary labels:\t', y_binary_imbalanced[1:30])
Original labels:	 [1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9]
New binary labels:	 [1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
In [36]:
np.bincount(y_binary_imbalanced)    # Negative class (0) is the most frequent class
Out[36]:
array([1615,  182])
In [37]:
X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)

# Accuracy of Support Vector Machine classifier
from sklearn.svm import SVC

svm = SVC(kernel='rbf', C=1).fit(X_train, y_train)
svm.score(X_test, y_test)
Out[37]:
0.90888888888888886

Dummy Classifiers

DummyClassifier is a classifier that makes predictions using simple rules, which can be useful as a baseline for comparison against real classifiers, especially with imbalanced classes. Dummy classifiers are called that because they don't even look at the data to make a prediction; they simply use the strategy, or rule of thumb, that you instruct them to use when creating them. In fact, when you create the classifier, you set the strategy argument to tell it what rule of thumb to use to make its predictions. Here we set the strategy to 'most_frequent' to predict the most frequent class.

In [38]:
from sklearn.dummy import DummyClassifier

# Negative class (0) is most frequent
dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(X_train, y_train)
# Therefore the dummy 'most_frequent' classifier always predicts class 0
y_dummy_predictions = dummy_majority.predict(X_test)

y_dummy_predictions
Out[38]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
In [39]:
dummy_majority.score(X_test, y_test)
Out[39]:
0.9044444444444445

Dummy classifiers completely ignore the input data!

  • Dummy classifiers serve as a sanity check on your classifier's performance
  • They provide a null metric (e.g. null accuracy) baseline
  • Dummy classifiers should not be used for real problems
  • Some commonly-used settings for the strategy parameter for DummyClassifier in scikit-learn:
    • most_frequent : predicts the most frequent label in the training set.
    • stratified : random predictions based on training set class distribution.
    • uniform : generates predictions uniformly at random
    • constant : always predicts a constant label provided by the user
      • A major motivation of this strategy is $F_1$-scoring, when the positive class is in the minority.

When we have a binary classification task where the most frequent class is the negative class, it turns out that using the most_frequent strategy will never predict the positive class, so there will never be any correctly predicted positive instances. The count of such positive correct predictions will be 0, which causes important metrics like the $F_1$-score to always be 0. Using the constant strategy, we can force a dummy classifier to always predict the positive class even if it's the minority class, and this leads to a more meaningful computation of the $F_1$-score.
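A minimal sketch of that constant-strategy baseline, assuming the X_train, X_test, y_train, y_test split created above (the F1 value itself depends on the data):

from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

# always predict the positive class (1), even though it is the minority class
dummy_constant = DummyClassifier(strategy='constant', constant=1).fit(X_train, y_train)
print('F1 (constant, always 1):', f1_score(y_test, dummy_constant.predict(X_test)))

# for comparison: the majority-class dummy never predicts 1, so its F1 is 0
dummy_majority = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
print('F1 (most_frequent):     ', f1_score(y_test, dummy_majority.predict(X_test)))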

What if my classifier accuracy is close to the null accuracy baseline? This could be a sign of:

  • Ineffective, erroneous or missing features
  • Poor choice of kernel or hyperparameter
  • Large class imbalance

For example, if we change the support vector classifier's kernel parameter from rbf to linear and recompute the accuracy on this retrained classifier, we can see that this leads to much better performance of almost 98%, compared to the most_frequent baseline of 90%.

In [40]:
svm = SVC(kernel='linear', C=1).fit(X_train, y_train)
svm.score(X_test, y_test)
Out[40]:
0.97777777777777775

In general, for imbalanced classification problems, you should use metrics other than accuracy; we will look at one shortly called AUC, which is short for area under the curve.

Dummy Regressors strategy parameter options:

  • mean: predicts the mean of the training targets
  • median: predicts the median of the training targets
  • quantile: predicts a user-provided quantile of the training targets
  • constant: predicts a constant user-provided value

Confusion matrices

Binary (two-class) confusion matrix

In [41]:
from sklearn.metrics import confusion_matrix

# Negative class (0) is most frequent
dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(X_train, y_train)
y_majority_predicted = dummy_majority.predict(X_test)
confusion = confusion_matrix(y_test, y_majority_predicted)

print('Most frequent class (dummy classifier)\n', confusion)
Most frequent class (dummy classifier)
 [[407   0]
 [ 43   0]]
In [42]:
# produces random predictions w/ same class proportion as training set
dummy_classprop = DummyClassifier(strategy='stratified').fit(X_train, y_train)
y_classprop_predicted = dummy_classprop.predict(X_test)
confusion = confusion_matrix(y_test, y_classprop_predicted)

print('Random class-proportional prediction (dummy classifier)\n', confusion)
Random class-proportional prediction (dummy classifier)
 [[373  34]
 [ 41   2]]
In [43]:
svm = SVC(kernel='linear', C=1).fit(X_train, y_train)
svm_predicted = svm.predict(X_test)
confusion = confusion_matrix(y_test, svm_predicted)

print('Support vector machine classifier (linear kernel, C=1)\n', confusion)
Support vector machine classifier (linear kernel, C=1)
 [[402   5]
 [  5  38]]
In [44]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression().fit(X_train, y_train)
lr_predicted = lr.predict(X_test)
confusion = confusion_matrix(y_test, lr_predicted)

print('Logistic regression classifier (default settings)\n', confusion)
Logistic regression classifier (default settings)
 [[401   6]
 [  6  37]]
In [45]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
tree_predicted = dt.predict(X_test)
confusion = confusion_matrix(y_test, tree_predicted)

print('Decision tree classifier (max_depth = 2)\n', confusion)
Decision tree classifier (max_depth = 2)
 [[400   7]
 [ 17  26]]
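As a small follow-up sketch (assuming the confusion variable from the decision tree cell above), the four cells of the binary confusion matrix can be unpacked with ravel() and used to recompute accuracy by hand:

# unpack counts from the 2x2 confusion matrix: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion.ravel()
print('TN = {}, FP = {}, FN = {}, TP = {}'.format(tn, fp, fn, tp))

# accuracy recomputed from the four counts
print('Accuracy = {:.3f}'.format((tp + tn) / (tp + tn + fp + fn)))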

Evaluation metrics for binary classification

There is often a tradeoff between precision and recall

  • Recall-oriented machine learning tasks:
    • Search and information extraction in legal discovery
    • Tumor detection
    • Often paired with a human expert to filter out false positives
  • Precision-oriented machine learning tasks:
    • Search engine ranking, query suggestion
    • Document classification

$F_1$-score: combining precision & recall into a single number

When evaluating classifiers, it's often convenient to compute a quantity known as the $F_1$-score, which combines precision and recall into a single number. Mathematically it is the harmonic mean of precision and recall: $$F_1=2\cdot \frac{Precision\cdot Recall}{Precision+Recall} = \frac {2\cdot TP}{2\cdot TP+FN+FP}$$

$$F_{\beta}=(1+\beta^2)\cdot \frac{Precision\cdot Recall}{(\beta^2\cdot Precision)+Recall} = \frac {(1+\beta^2)\cdot TP}{(1+\beta^2)\cdot TP+\beta^2 \cdot FN+FP}$$

$\beta$ allows adjustment of the metric to control the emphasis on recall vs precision:

  • Precision-oriented users: $\beta = 0.5$ (false positives hurt performance more than false negatives)
  • Recall-oriented users: $\beta = 2$ (false negatives hurt performance more than false positives; see the sketch below)
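A hedged sketch of the effect of beta, using scikit-learn's fbeta_score with the decision tree predictions computed above; the exact values depend on the classifier:

from sklearn.metrics import fbeta_score

# beta < 1 weights precision more heavily; beta > 1 weights recall more heavily
print('F0.5: {:.2f}'.format(fbeta_score(y_test, tree_predicted, beta=0.5)))
print('F1:   {:.2f}'.format(fbeta_score(y_test, tree_predicted, beta=1)))
print('F2:   {:.2f}'.format(fbeta_score(y_test, tree_predicted, beta=2)))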
In [46]:
# Once a model is built, we can get its "accuracy", "precision", "recall", "F1" and so on
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Accuracy = (TP + TN) / (TP + TN + FP + FN)
# Precision = TP / (TP + FP)
# Recall = TP / (TP + FN)  Also known as sensitivity, or True Positive Rate
# F1 = 2 * Precision * Recall / (Precision + Recall) 
print('Accuracy: {:.2f}'.format(accuracy_score(y_test, tree_predicted)))
print('Precision: {:.2f}'.format(precision_score(y_test, tree_predicted)))
print('Recall: {:.2f}'.format(recall_score(y_test, tree_predicted)))
print('F1: {:.2f}'.format(f1_score(y_test, tree_predicted)))   
Accuracy: 0.95
Precision: 0.79
Recall: 0.60
F1: 0.68
In [47]:
# Combined report with all above metrics
from sklearn.metrics import classification_report

print(classification_report(y_test, tree_predicted, target_names=['not 1', '1']))
# the last column "support" shows the number of instances in the test set that have each true label
             precision    recall  f1-score   support

      not 1       0.96      0.98      0.97       407
          1       0.79      0.60      0.68        43

avg / total       0.94      0.95      0.94       450

In [48]:
print('Random class-proportional (dummy)\n', 
      classification_report(y_test, y_classprop_predicted, target_names=['not 1', '1']))
print('SVM\n', 
      classification_report(y_test, svm_predicted, target_names = ['not 1', '1']))
print('Logistic regression\n', 
      classification_report(y_test, lr_predicted, target_names = ['not 1', '1']))
print('Decision tree\n', 
      classification_report(y_test, tree_predicted, target_names = ['not 1', '1']))
Random class-proportional (dummy)
              precision    recall  f1-score   support

      not 1       0.90      0.92      0.91       407
          1       0.06      0.05      0.05        43

avg / total       0.82      0.83      0.83       450

SVM
              precision    recall  f1-score   support

      not 1       0.99      0.99      0.99       407
          1       0.88      0.88      0.88        43

avg / total       0.98      0.98      0.98       450

Logistic regression
              precision    recall  f1-score   support

      not 1       0.99      0.99      0.99       407
          1       0.86      0.86      0.86        43

avg / total       0.97      0.97      0.97       450

Decision tree
              precision    recall  f1-score   support

      not 1       0.96      0.98      0.97       407
          1       0.79      0.60      0.68        43

avg / total       0.94      0.95      0.94       450

Decision functions

Many classifiers in scikit-learn can provide information about the uncertainty associated with a particular prediction, either through the decision_function method or the predict_proba method.

  • Each classifier score value per test point indicates how confidently the classifier predicts the positive class (large-magnitude positive values) or the negative class (large-magnitude negative values).
  • Choosing a fixed decision threshold gives a classification rule.
  • By sweeping the decision threshold through the entire range of possible score values, we get a series of classification outcomes that form a curve.
In [49]:
X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)
y_scores_lr = lr.fit(X_train, y_train).decision_function(X_test)
y_score_list = list(zip(y_test[0:20], y_scores_lr[0:20]))

# We can see that the instances in the negative class often have large-magnitude negative scores, and the instances in
# the positive class have positive scores from the logistic regression classifier.
# show the decision_function scores for first 20 instances
y_score_list
Out[49]:
[(0, -23.177112362903241),
 (0, -13.541470729541413),
 (0, -21.722900989694729),
 (0, -18.907438437430027),
 (0, -19.735821729002289),
 (0, -9.7498078195600613),
 (1, 5.2349604859009276),
 (0, -19.307551661127864),
 (0, -25.101182889530396),
 (0, -21.827362391350579),
 (0, -24.151343401889438),
 (0, -19.576969790071697),
 (0, -22.574689400560423),
 (0, -10.823324268750714),
 (0, -11.912123406737392),
 (0, -10.97922371337485),
 (1, 11.206006114721543),
 (0, -27.646002317931909),
 (0, -12.859381428186682),
 (0, -25.848764845244997)]

Predicted Probability of Class Membership (predict_proba)

  • Typical rule: choose the most likely class
    • e.g. predict class 1 if its estimated probability > 0.50
  • Adjusting the threshold affects the predictions of the classifier
  • A higher threshold results in a more conservative classifier
    • e.g. only predict class 1 if the estimated probability of class 1 is above 70%
    • This increases precision. The classifier doesn't predict class 1 as often, but when it does, it gets a high proportion of class 1 instances correct
  • Not all models provide realistic probability estimates
In [50]:
X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)
y_proba_lr = lr.fit(X_train, y_train).predict_proba(X_test)
y_proba_list = list(zip(y_test[0:20], y_proba_lr[0:20,1]))

# show the probability of positive class for first 20 instances
y_proba_list

# you can see that many entries with a positive label of 1 have a high predicted probability, like 0.995, while many
# negative-label instances have a very low predicted probability
Out[50]:
[(0, 8.59622935435435e-11),
 (0, 1.3152654562877047e-06),
 (0, 3.6801368532691122e-10),
 (0, 6.1461593657683724e-09),
 (0, 2.6843647432086779e-09),
 (0, 5.8302468525539246e-05),
 (1, 0.99470119196307671),
 (0, 4.119427394145904e-09),
 (0, 1.2551475474865303e-11),
 (0, 3.3151024835512065e-10),
 (0, 3.2449262447677433e-11),
 (0, 3.1465167386169614e-09),
 (0, 1.5701348842784633e-10),
 (0, 1.9928808739415592e-05),
 (0, 6.7085339693943354e-06),
 (0, 1.7052039108840334e-05),
 (1, 0.99998640787132853),
 (0, 9.8513045790707473e-13),
 (0, 2.6015997081086803e-06),
 (0, 5.9432508848392272e-12)]

Note that not all models provide useful probability estimates of this type. For example, a model that was over-fit to a training set might provide overly optimistic high probabilities that are in fact not accurate.
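Tying this back to the threshold discussion above, here is a minimal sketch (reusing y_test and y_proba_lr from the cell above) of how raising the probability threshold from 0.50 to 0.70 typically trades recall for precision:

from sklearn.metrics import precision_score, recall_score

for threshold in [0.50, 0.70]:
    # predict class 1 only when its estimated probability exceeds the threshold
    y_pred_t = (y_proba_lr[:, 1] >= threshold).astype(int)
    print('threshold = {:.2f}   precision = {:.2f}   recall = {:.2f}'.format(
        threshold,
        precision_score(y_test, y_pred_t),
        recall_score(y_test, y_pred_t)))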

Precision-recall curves

Varying the Decision Threshold

We can use these decision scores or prediction probabilities to get a more complete evaluation picture of a classifier's performance. For a particular application, we might pick a specific decision threshold depending on whether we want the classifier to be more or less conservative about making false-positive or false-negative errors. It might not be entirely clear when developing a new model what the right decision threshold would be, and how that choice will affect evaluation metrics like precision and recall. So instead, we will look at how the classifier performs across all possible decision thresholds.

Consider a small example: a list of test instances with their true labels and classifier scores, sorted by score. If we set a decision threshold, say -20, then every instance with a score below the threshold is predicted to be in the negative class, and every instance with a score at or above it is predicted to be in the positive class.

Having picked that specific threshold of -20 and partitioned the test points this way, we can compute precision and recall over the points predicted to be positive. In this example, 12 instances are predicted as positive and only four of them are actually positive, so the precision is 4/12, roughly 0.33.

In [67]:
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, y_scores_lr)
closest_zero = np.argmin(np.abs(thresholds))
closest_zero_p = precision[closest_zero]
closest_zero_r = recall[closest_zero]

plt.figure()
plt.xlim([0.0, 1.01])
plt.ylim([0.0, 1.01])
plt.plot(precision, recall, label='Precision-Recall Curve')
plt.plot(closest_zero_p, closest_zero_r, 'o', markersize = 12, fillstyle = 'none', c='r', mew=3)
plt.xlabel('Precision', fontsize=16)
plt.ylabel('Recall', fontsize=16)
plt.gca().set_aspect('equal')
plt.show()

The red circle indicates the precision and recall that's achieved when the decision threshold is zero.

In this particular case you can also see that it's not exactly a smooth curve; there are some jagged edges, and in fact the jumps tend to get a little bigger as we approach maximum precision. This is a consequence of how the formulas for precision and recall are computed: they use discrete counts that include the number of true positives.

So as the decision threshold increases, there are fewer and fewer points that remain as positive predictions, and the fractions computed from these smaller counts can change pretty dramatically with small changes in the decision threshold. That's why the trailing edges of the precision-recall curve can appear a bit jagged when you plot them.
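As a sanity check on what the curve represents, each (precision[i], recall[i]) point can be reproduced by thresholding the scores by hand at thresholds[i]. A minimal sketch, reusing precision, recall, thresholds, y_test and y_scores_lr from above (exact agreement assumes no tied scores at the chosen threshold):

from sklearn.metrics import precision_score, recall_score

i = len(thresholds) // 2                       # pick an arbitrary threshold index
y_pred_at_t = (y_scores_lr >= thresholds[i]).astype(int)

print('curve:  precision = {:.3f}  recall = {:.3f}'.format(precision[i], recall[i]))
print('manual: precision = {:.3f}  recall = {:.3f}'.format(
    precision_score(y_test, y_pred_at_t), recall_score(y_test, y_pred_at_t)))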

ROC curves, Area-Under-Curve (AUC)

ROC curves, or receiver operating characteristic curves, are a very widely used visualization method that illustrates the performance of a binary classifier.

X-axis: False Positive Rate
Y-axis: True Positive Rate

Top left corner:

  • The "ideal" point
  • False positive rate of zero
  • True positive rate of one

"Steepness" of ROC curves is important:

  • Maximize the true positive rate
  • while minimizing the false positive rate
In [68]:
from sklearn.metrics import roc_curve, auc

X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)

y_score_lr = lr.fit(X_train, y_train).decision_function(X_test)
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_score_lr)
roc_auc_lr = auc(fpr_lr, tpr_lr)

plt.figure()
plt.xlim([-0.01, 1.00])
plt.ylim([-0.01, 1.01])
plt.plot(fpr_lr, tpr_lr, lw=3, label='LogRegr ROC curve (area = {:0.2f})'.format(roc_auc_lr))
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve (1-of-10 digits classifier)', fontsize=16)
plt.legend(loc='lower right', fontsize=13)
plt.plot([0, 1], [0, 1], color='navy', lw=3, linestyle='--')
plt.gca().set_aspect('equal')
plt.show()

The dotted line here is the curve that results from a classifier that randomly guesses the label for a binary class; it's basically like flipping a coin. If you have two classes with equal numbers of positive and negative instances, then flipping a coin will give you, for a large enough dataset, equal numbers of false positives and true positives.

So the dotted line is used as a baseline. A bad classifier will have performance that is random, maybe even worse than random, or only slightly better than random; a reasonably good classifier will give an ROC curve that is consistently better than random across all decision threshold choices; and an excellent classifier would be one like the LogRegr ROC curve shown here, which comes from a logistic regression classifier.

The shape of the curve matters as well: in terms of steepness, we want classifiers that maximize the true positive rate while minimizing the false positive rate. We can quantify the goodness of a classifier by looking at how much area there is underneath the curve. The area underneath the random classifier's curve is 0.5. We use AUC, the single number that measures this total area underneath the ROC curve, as a way to summarize a classifier's performance.
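A minimal sketch of that single-number summary, reusing y_test, y_score_lr and roc_auc_lr from the cell above: roc_auc_score computes AUC directly from the true labels and decision scores, and should agree with the auc(fpr, tpr) value already computed.

from sklearn.metrics import roc_auc_score

print('AUC via roc_auc_score: {:.3f}'.format(roc_auc_score(y_test, y_score_lr)))
print('AUC via auc(fpr, tpr): {:.3f}'.format(roc_auc_lr))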

In [69]:
from matplotlib import cm

X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)

plt.figure()
plt.xlim([-0.01, 1.00])
plt.ylim([-0.01, 1.01])
for g in [0.01, 0.1, 0.20, 1]:
    svm = SVC(gamma=g).fit(X_train, y_train)
    y_score_svm = svm.decision_function(X_test)
    fpr_svm, tpr_svm, _ = roc_curve(y_test, y_score_svm)
    roc_auc_svm = auc(fpr_svm, tpr_svm)
    accuracy_svm = svm.score(X_test, y_test)
    print("gamma = {:.2f}  accuracy = {:.2f}   AUC = {:.2f}".format(g, accuracy_svm, 
                                                                    roc_auc_svm))
    plt.plot(fpr_svm, tpr_svm, lw=3, alpha=0.7, 
             label='SVM (gamma = {:0.2f}, area = {:0.2f})'.format(g, roc_auc_svm))

plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate (Recall)', fontsize=16)
plt.plot([0, 1], [0, 1], color='k', lw=0.5, linestyle='--')
plt.legend(loc="lower right", fontsize=11)
plt.title('ROC curve: (1-of-10 digits classifier)', fontsize=16)
plt.axes().set_aspect('equal')

plt.show()
gamma = 0.01  accuracy = 0.91   AUC = 1.00
gamma = 0.10  accuracy = 0.90   AUC = 0.98
gamma = 0.20  accuracy = 0.90   AUC = 0.66
gamma = 1.00  accuracy = 0.90   AUC = 0.50

Evaluation measures for multi-class classification

  • Multi-class evaluation is an extension of the binary case
    • A collection of true vs predicted binary outcomes, one per class
    • Confusion matrices are especially useful
    • Classification report
  • Overall evaluation metrics are averages across classes
    • But there are different ways to average multi-class results
    • The support (number of instances) for each class is important to consider, e.g. in case of imbalanced classes
  • Multi-label classification: each instance can have multiple labels (not covered here)

Multi-class confusion matrix

The multi-class confusion matrix is a straightforward extension of the binary classifier 2X2 confusion matrix.

As a general rule of thumb, as part of model evaluation I suggest always looking at the confusion matrix for your classifier, to get some insight into what kinds of errors it is making for each class, including whether some classes are much more prone to certain kinds of errors than others.

In [70]:
dataset = load_digits()
X, y = dataset.data, dataset.target
X_train_mc, X_test_mc, y_train_mc, y_test_mc = train_test_split(X, y, random_state=0)


svm = SVC(kernel = 'linear').fit(X_train_mc, y_train_mc)
svm_predicted_mc = svm.predict(X_test_mc)
confusion_mc = confusion_matrix(y_test_mc, svm_predicted_mc)
df_cm = pd.DataFrame(confusion_mc, 
                     index = [i for i in range(0,10)], columns = [i for i in range(0,10)])

plt.figure(figsize=(5.5,4))
sns.heatmap(df_cm, annot=True)
plt.title('SVM Linear Kernel \nAccuracy:{0:.3f}'.format(accuracy_score(y_test_mc, 
                                                                       svm_predicted_mc)))
plt.ylabel('True label')
plt.xlabel('Predicted label')


svm = SVC(kernel = 'rbf').fit(X_train_mc, y_train_mc)
svm_predicted_mc = svm.predict(X_test_mc)
confusion_mc = confusion_matrix(y_test_mc, svm_predicted_mc)
df_cm = pd.DataFrame(confusion_mc, index = [i for i in range(0,10)],
                  columns = [i for i in range(0,10)])

plt.figure(figsize = (5.5,4))
sns.heatmap(df_cm, annot=True)
plt.title('SVM RBF Kernel \nAccuracy:{0:.3f}'.format(accuracy_score(y_test_mc, 
                                                                    svm_predicted_mc)))
plt.ylabel('True label')
plt.xlabel('Predicted label');

Multi-class classification report

Get a classification report that summarizes multiple evaluation metrics for a multi-class classifier, with each metric computed per class.

In [55]:
print(classification_report(y_test_mc, svm_predicted_mc))
             precision    recall  f1-score   support

          0       1.00      0.65      0.79        37
          1       1.00      0.23      0.38        43
          2       1.00      0.39      0.56        44
          3       1.00      0.93      0.97        45
          4       0.14      1.00      0.25        38
          5       1.00      0.33      0.50        48
          6       1.00      0.54      0.70        52
          7       1.00      0.35      0.52        48
          8       1.00      0.02      0.04        48
          9       1.00      0.55      0.71        47

avg / total       0.93      0.49      0.54       450

Micro- vs. macro-averaged metrics

  • If the classes have about the same number of instances, macro- and micro-average will be about the same
  • If some classes are much larger (more instances) than others, and you want to:
    • Weight your metric toward the largest ones, use micro average
    • Weight your metric toward the smallest ones, use macro average
  • If the micro-average is much lower than the macro-average, then examine the larger classes for poor metric performance. The model is probably misclassifying the frequent labels more than the infrequent labels.
  • If the macro-average is much lower than the micro-average, then examine the smaller classes for poor metric performance. The model is probably misclassifying the infrequent labels more than the frequent labels.

In [56]:
print('Micro-averaged precision = {:.2f} (treat instances equally)'
      .format(precision_score(y_test_mc, svm_predicted_mc, average = 'micro')))
print('Macro-averaged precision = {:.2f} (treat classes equally)'
      .format(precision_score(y_test_mc, svm_predicted_mc, average = 'macro')))
Micro-averaged precision = 0.49 (treat instances equally)
Macro-averaged precision = 0.91 (treat classes equally)
In [57]:
print('Micro-averaged f1 = {:.2f} (treat instances equally)'
      .format(f1_score(y_test_mc, svm_predicted_mc, average = 'micro')))
print('Macro-averaged f1 = {:.2f} (treat classes equally)'
      .format(f1_score(y_test_mc, svm_predicted_mc, average = 'macro')))
Micro-averaged f1 = 0.49 (treat instances equally)
Macro-averaged f1 = 0.54 (treat classes equally)
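One useful cross-check, sketched below reusing y_test_mc and svm_predicted_mc from above: for single-label multi-class problems, micro-averaged precision, recall and F1 all coincide with plain accuracy, because every misclassified instance contributes exactly one false positive and one false negative overall.

from sklearn.metrics import accuracy_score, f1_score

print('Accuracy:          {:.2f}'.format(accuracy_score(y_test_mc, svm_predicted_mc)))
print('Micro-averaged F1: {:.2f}'.format(f1_score(y_test_mc, svm_predicted_mc, average='micro')))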

Regression evaluation metrics

  • Typically r2_score is enough. Although it is usually positive, it can go negative for bad model fits, i.e. when the model does worse than simply predicting the mean of the targets.
    • Reminder: computes how well future instances will be predicted
    • Best possible score is 1.0
    • Constant prediction score is 0.0 without looking at the input - DummyRegressor
  • Alternative metrics include (compared in a short sketch below):
    • mean_absolute_error (absolute difference of target & predicted values); in machine learning terms this corresponds to the expected value of the L1 norm loss. This is sometimes used, for example, to assess focused outcomes for regression in time series analysis.
    • mean_squared_error (squared difference of target & predicted values); this corresponds to the expected value of the L2 norm loss. It is widely used for many regression problems. Like mean_absolute_error, mean_squared_error doesn't distinguish between over- and under-estimates.
    • median_absolute_error (robust to outliers); when ignoring outliers is important, you can use the median_absolute_error score, which is robust to the presence of outliers because it uses the median of the error distribution rather than the mean.
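The sketch below uses small made-up numbers to show how a single large outlier error inflates the mean-based metrics much more than the median-based one:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, median_absolute_error

y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
y_pred = np.array([11.0, 11.0, 12.0, 12.0, 52.0])   # the last prediction is a large outlier error

print('mean_absolute_error:   {:.2f}'.format(mean_absolute_error(y_true, y_pred)))
print('mean_squared_error:    {:.2f}'.format(mean_squared_error(y_true, y_pred)))
print('median_absolute_error: {:.2f}'.format(median_absolute_error(y_true, y_pred)))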

Dummy Regressors The DummyRegressor class implements four simple baseline rules for regression, using the strategy parameter:

  • mean predicts the mean of the training target values
  • median predicts the median of the training target values
  • quantile predicts a user-provided quantile of the training values (e.g. value at the $75^{th}$ percentile)
  • constant predicts a custom constant value provided by the user

Although regression typically has simpler evaluation needs than classification, it pays to double check that the evaluation metric you choose for a regression problem penalizes errors in a way that reflects the consequences of those errors for the business, organizational, or user needs of your application.
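A self-contained sketch of the four DummyRegressor strategies on a tiny made-up training target vector (the feature values are ignored by design; the constant and quantile values here are arbitrary illustrations):

import numpy as np
from sklearn.dummy import DummyRegressor

y_train_toy = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
X_train_toy = np.zeros((5, 1))          # dummy regressors ignore the features entirely

for strategy, kwargs in [('mean', {}),
                         ('median', {}),
                         ('quantile', {'quantile': 0.75}),
                         ('constant', {'constant': 42.0})]:
    dummy = DummyRegressor(strategy=strategy, **kwargs).fit(X_train_toy, y_train_toy)
    print('{:<10s} predicts {:.1f}'.format(strategy, dummy.predict(np.zeros((1, 1)))[0]))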

In [63]:
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.dummy import DummyRegressor

diabetes = datasets.load_diabetes()

X = diabetes.data[:, None, 6]
y = diabetes.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lm = LinearRegression().fit(X_train, y_train)
lm_dummy_mean = DummyRegressor(strategy = 'mean').fit(X_train, y_train)

y_predict = lm.predict(X_test)
y_predict_dummy_mean = lm_dummy_mean.predict(X_test)

print('Linear model, coefficients: ', lm.coef_)
print("Mean squared error (dummy): {:.2f}".format(mean_squared_error(y_test, 
                                                                     y_predict_dummy_mean)))
print("Mean squared error (linear model): {:.2f}".format(mean_squared_error(y_test, y_predict)))
print("r2_score (dummy): {:.2f}".format(r2_score(y_test, y_predict_dummy_mean)))
print("r2_score (linear model): {:.2f}".format(r2_score(y_test, y_predict)))

# Plot outputs
plt.scatter(X_test, y_test,  color='black')
plt.plot(X_test, y_predict, color='green', linewidth=2)
plt.plot(X_test, y_predict_dummy_mean, color='red', linestyle = 'dashed', 
         linewidth=2, label = 'dummy')

plt.show()
Linear model, coefficients:  [-698.80206267]
Mean squared error (dummy): 4965.13
Mean squared error (linear model): 4646.74
r2_score (dummy): -0.00
r2_score (linear model): 0.06

Beyond the essential evaluation metrics covered in this course, online controlled experiments, which involve A-B testing and other techniques, are perhaps the most important way that machine learning algorithms are evaluated for real-world use in Web and other online applications. This article by Ron Kohavi, Randal Henne, and Dan Sommerfield, reviews the key points of running controlled experiments, along with important engineering issues and limitations to keep in mind when using them.

Kohavi, R., Henne, R. M., & Sommerfield, D. (2007). Practical guide to controlled experiments on the web. Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '07. doi:10.1145/1281192.1281295


Model selection using evaluation metrics

  • Train/test on same data
    • Single metric
    • Typically overfits and likely won't generalize well on new data
    • But can serve as sanity check: low accuracy on the training set may indicate an implementation problem
  • Single train/test split
    • Single metric
    • Speed and simplicity
    • Lack of variance information
  • K-fold cross-validation
    • K train-test splits
    • Average metric over all splits
    • Can be combined with parameter grid search: GridSearchCV (def. cv=3)

Cross-validation example

In [59]:
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

dataset = load_digits()
# again, making this a binary problem with 'digit 1' as positive class 
# and 'not 1' as negative class
X, y = dataset.data, dataset.target == 1
clf = SVC(kernel='linear', C=1)

# accuracy is the default scoring metric
print('Cross-validation (accuracy)', cross_val_score(clf, X, y, cv=5))
# use AUC as scoring metric
print('Cross-validation (AUC)', cross_val_score(clf, X, y, cv=5, scoring = 'roc_auc'))
# use recall as scoring metric
print('Cross-validation (recall)', cross_val_score(clf, X, y, cv=5, scoring = 'recall'))
Cross-validation (accuracy) [ 0.91944444  0.98611111  0.97214485  0.97493036  0.96935933]
Cross-validation (AUC) [ 0.9641871   0.9976571   0.99372205  0.99699002  0.98675611]
Cross-validation (recall) [ 0.81081081  0.89189189  0.83333333  0.83333333  0.83333333]

Grid search example

In [60]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score

dataset = load_digits()
X, y = dataset.data, dataset.target == 1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel='rbf')
grid_values = {'gamma': [0.001, 0.01, 0.05, 0.1, 1, 10, 100]}

# default metric to optimize over grid parameters: accuracy
grid_clf_acc = GridSearchCV(clf, param_grid = grid_values)
grid_clf_acc.fit(X_train, y_train)
y_decision_fn_scores_acc = grid_clf_acc.decision_function(X_test) 

print('Grid best parameter (max. accuracy): ', grid_clf_acc.best_params_)
print('Grid best score (accuracy): ', grid_clf_acc.best_score_)

# alternative metric to optimize over grid parameters: AUC
grid_clf_auc = GridSearchCV(clf, param_grid = grid_values, scoring = 'roc_auc')
grid_clf_auc.fit(X_train, y_train)
y_decision_fn_scores_auc = grid_clf_auc.decision_function(X_test) 

print('Test set AUC: ', roc_auc_score(y_test, y_decision_fn_scores_auc))
print('Grid best parameter (max. AUC): ', grid_clf_auc.best_params_)
print('Grid best score (AUC): ', grid_clf_auc.best_score_)
Grid best parameter (max. accuracy):  {'gamma': 0.001}
Grid best score (accuracy):  0.996288047513
Test set AUC:  0.999828581224
Grid best parameter (max. AUC):  {'gamma': 0.001}
Grid best score (AUC):  0.99987412783

Evaluation metrics supported for model selection

You can see the complete list of names for the evaluation metrics supported by the scoring parameter by running the following code, which uses the SCORERS variable imported from sklearn.metrics. You can see metrics for classification, such as the string 'precision_micro' that represents micro-averaged precision, as well as metrics for regression, such as the 'r2' metric for the R-squared score.

In [61]:
from sklearn.metrics.scorer import SCORERS

print(sorted(list(SCORERS.keys())))
['accuracy', 'adjusted_mutual_info_score', 'adjusted_rand_score', 'average_precision', 'completeness_score', 'explained_variance', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'fowlkes_mallows_score', 'homogeneity_score', 'log_loss', 'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'mutual_info_score', 'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_median_absolute_error', 'normalized_mutual_info_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc', 'v_measure_score']

Two-feature classification example using the digits dataset

Example: Optimizing a classifier using different evaluation metrics

Here we show a specific example of how a classifier's decision boundary changes when it's optimized for different evaluation metrics. This classification problem is based on the same binary digit classifier training and test sets we've been using as an example throughout the notebook.

In these classification visualization examples, the positive examples (the digit 1) are shown as black points, and the region of positive-class prediction is shown in the light-colored or yellow region to the right of the decision boundary. The negative examples (all other digits) are shown as white points, and the region of negative-class prediction in these figures is to the left of the decision boundary. The data points have been plotted using two of the 64 feature values in the digits dataset and have been jittered a little.

In [9]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from adspy_shared_utilities import plot_class_regions_for_classifier_subplot
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV


dataset = load_digits()
X, y = dataset.data, dataset.target == 1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Create a two-feature input vector matching the example plot above
# We jitter the points (add a small amount of random noise) in case there are areas
# in feature space where many instances have the same features.
jitter_delta = 0.25
X_twovar_train = X_train[:,[20,59]]+ np.random.rand(X_train.shape[0], 2) - jitter_delta
X_twovar_test  = X_test[:,[20,59]] + np.random.rand(X_test.shape[0], 2) - jitter_delta

# We apply `GridSearchCV` here to explore different values of the optional class_weight parameter that controls how much
# weight is given to each of the two classes during training
clf = SVC(kernel = 'linear').fit(X_twovar_train, y_train)
grid_values = {'class_weight':['balanced', {1:2},{1:3},{1:4},{1:5},{1:10},{1:20},{1:50}]}
plt.figure(figsize=(9,6))
for i, eval_metric in enumerate(('precision','recall', 'f1','roc_auc')):
    grid_clf_custom = GridSearchCV(clf, param_grid=grid_values, scoring=eval_metric)
    grid_clf_custom.fit(X_twovar_train, y_train)
    print('Grid best parameter (max. {0}): {1}'
          .format(eval_metric, grid_clf_custom.best_params_))
    print('Grid best score ({0}): {1}'
          .format(eval_metric, grid_clf_custom.best_score_))
    plt.subplots_adjust(wspace=0.3, hspace=0.3)
    plot_class_regions_for_classifier_subplot(grid_clf_custom, X_twovar_test, y_test, None,
                                             None, None,  plt.subplot(2, 2, i+1))
    
    plt.title(eval_metric+'-oriented SVC')
plt.tight_layout()
plt.show()
Grid best parameter (max. precision): {'class_weight': {1: 2}}
Grid best score (precision): 0.5491341497717321
Grid best parameter (max. recall): {'class_weight': {1: 50}}
Grid best score (recall): 0.935661321592438
Grid best parameter (max. f1): {'class_weight': {1: 3}}
Grid best score (f1): 0.5148982937407222
Grid best parameter (max. roc_auc): {'class_weight': {1: 20}}
Grid best score (roc_auc): 0.893214145043902

As it turns out, optimization for different evaluation metrics results in different optimal values of the class_weight parameter. As the class weight parameter increases, more emphasis is given to correctly classifying the positive-class instances. The precision-oriented classifier we see here, with a class weight of two, tries hard to reduce false positives while increasing true positives, so it focuses on the cluster of positive-class points in the lower right corner, where there are relatively few negative-class points.

In contrast, the recall-oriented classifier with class weight of 50, tries hard to reduce the number of false negatives while increasing true positives. That is it tries to find most of the positive class points as part of its positive class predictions.

We can also see that the decision boundary for the F1-oriented classifier has an optimal class weight of three, which is between the optimal class weight values for the precision- and recall-oriented classifiers. Visually, the F1-oriented classifier also has a kind of intermediate positioning between the precision- and recall-oriented decision boundaries. This makes sense given that F1 is the harmonic mean of precision and recall.

The AUC-oriented classifier, with an optimal class weight of 20, has a similar decision boundary to the F1-oriented classifier, but shifted slightly in favor of higher recall.

Precision-recall curve for the default SVC classifier (with balanced class weights)

In [10]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve
from adspy_shared_utilities import plot_class_regions_for_classifier
from sklearn.svm import SVC

dataset = load_digits()
X, y = dataset.data, dataset.target == 1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# create a two-feature input vector matching the example plot above
jitter_delta = 0.25
X_twovar_train = X_train[:,[20,59]]+ np.random.rand(X_train.shape[0], 2) - jitter_delta
X_twovar_test  = X_test[:,[20,59]] + np.random.rand(X_test.shape[0], 2) - jitter_delta

clf = SVC(kernel='linear', class_weight='balanced').fit(X_twovar_train, y_train)

y_scores = clf.decision_function(X_twovar_test)

precision, recall, thresholds = precision_recall_curve(y_test, y_scores)
closest_zero = np.argmin(np.abs(thresholds))
closest_zero_p = precision[closest_zero]
closest_zero_r = recall[closest_zero]

plot_class_regions_for_classifier(clf, X_twovar_test, y_test)
plt.title("SVC, class_weight = 'balanced', optimized for accuracy")
plt.show()

plt.figure()
plt.xlim([0.0, 1.01])
plt.ylim([0.0, 1.01])
plt.title ("Precision-recall curve: SVC, class_weight = 'balanced'")
plt.plot(precision, recall, label = 'Precision-Recall Curve')
plt.plot(closest_zero_p, closest_zero_r, 'o', markersize=12, fillstyle='none', c='r', mew=3)
plt.xlabel('Precision', fontsize=16)
plt.ylabel('Recall', fontsize=16)
plt.gca().set_aspect('equal')
plt.show()
print('At zero threshold, precision: {:.2f}, recall: {:.2f}'
      .format(closest_zero_p, closest_zero_r))
At zero threshold, precision: 0.22, recall: 0.74

As the decision threshold is shifted to become less and less conservative, tracing the curve up and to the left, the classifier becomes more and more like the recall-oriented support vector classifier example. Again, the red circle represents the precision-recall trade-off achieved at the zero-score mark, which is the actual decision boundary chosen for the trained classifier.

Training, Validation, and Test Framework for Model Selection and Evaluation

  • Using only cross-validation or a test set to do model selection may lead to subtle overfitting and overly optimistic generalization estimates. An intuitive explanation for this might be the following:

    "remember that the whole point of evaluating on a test set is to estimate how well a learning algorithm might perform on future, unseen data. The more information we see about our dataset as part of repeated cross-validation passes in choosing our model, the more influence any potential held-up test data has played into selecting the final model. Not merely evaluating it.

This is sometimes called data leakage. So we haven't done an evaluation with a truely held-out test set unless we commit to holding back a test split that isn't seen by any process until the very end of the evaluation.

  • Instead, use three data splits:

    1. Training set (model building)
    2. Validation set (model selection)
    3. Test set (final evaluation)
  • In practice:

    • Create an initial training/test split
    • Do cross-validation on the training data for model/parameter selection
    • Save the held-out test set for final model evaluation

Machine learning researchers take this protocol very seriously. The train-validate-test design is a very important, universally applied framework for the effective evaluation of machine learning models; a sketch of the protocol follows.
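A hedged sketch of this three-way protocol using the same binary digits task: hold out a final test split first, let GridSearchCV's internal cross-validation on the training portion play the role of the validation step, and only then score once on the untouched test set. The parameter grid here is an arbitrary illustration.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

dataset = load_digits()
X, y = dataset.data, dataset.target == 1

# 1. set aside a test split that is not touched during model selection
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=0)

# 2. cross-validation on the training data acts as the validation step
grid = GridSearchCV(SVC(kernel='rbf'), param_grid={'gamma': [0.001, 0.01, 0.1]},
                    scoring='roc_auc', cv=5)
grid.fit(X_trainval, y_trainval)
print('Best parameter (selected via CV on training data):', grid.best_params_)

# 3. a single final evaluation on the held-out test set
print('Held-out test AUC: {:.3f}'.format(grid.score(X_test, y_test)))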

Concluding Notes

  • Accuracy is often not the right evaluation metric for many real-world machine learning tasks.
    • False positive and false negatives may need to be treated very differently
    • Make sure you understand the needs of your application and choose an evaluation metric that matches your application, user, or business goals
  • Examples of additional evaluation methods include:
    • Learning curve: How much does accuracy (or another metric) change as a function of the amount of training data? Learning curves may be useful as part of a cost-benefit analysis
    • Sensitivity analysis: How much does accuracy (or another metric) change as a function of key learning parameter values? Sensitivity analysis amounts to looking at how an evaluation metric changes as small adjustments are made to important model parameters, which helps assess how robust the model is to the choice of parameters (see the sketch after this list)
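As an illustration of a simple sensitivity analysis, a hedged sketch using scikit-learn's validation_curve to see how cross-validated AUC moves as the SVC gamma parameter changes; the parameter range is chosen arbitrarily for illustration.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

dataset = load_digits()
X, y = dataset.data, dataset.target == 1

param_range = np.logspace(-4, 1, 6)
train_scores, test_scores = validation_curve(SVC(kernel='rbf'), X, y,
                                             param_name='gamma',
                                             param_range=param_range,
                                             scoring='roc_auc', cv=3)

# mean cross-validated AUC for each gamma shows how sensitive performance is to this parameter
for g, score in zip(param_range, test_scores.mean(axis=1)):
    print('gamma = {:>8.4f}   mean CV AUC = {:.3f}'.format(g, score))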