When we began working with supervised machine learning methods, we evaluated a classifier's performance using its accuracy. Accuracy, as you might recall, is the fraction of samples that were classified correctly, that is, where the classifier's predicted label matched the correct (true) label.
Learning Objectives
Represent / Train / Evaluate / Refine Cycle
Representation: Extract and select object features
==> Train models: Fit the estimator to the data
==> Evaluation
==> Feature and model refinement
==> Representation
Evaluation
First, let's consider a binary classification task in which many instances are labeled with the negative class but only a few belong to the positive class. We might see this scenario in online search or recommender systems, where the system has to predict whether or not to display an advertisement, product suggestion, query suggestion, or item on a page that is likely to be relevant given a user's query, what they clicked on in the past, and so on. Those relevant items are the positive examples, but there are many, many irrelevant items in the negative class that don't make sense to show a user. This is called an imbalanced class scenario.
%matplotlib notebook
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
# Use the digits dataset: images of handwritten digits labeled with
# ten classes, representing the digits 0 through 9
from sklearn.datasets import load_digits
dataset = load_digits()
X, y = dataset.data, dataset.target
for class_name, class_count in zip(dataset.target_names, np.bincount(dataset.target)):
    print(class_name, class_count)
dataset
# Creating a dataset with imbalanced binary classes:
# Negative class (0) is 'not digit 1'
# Positive class (1) is 'digit 1'
y_binary_imbalanced = y.copy()
y_binary_imbalanced[y_binary_imbalanced != 1] = 0
print('Original labels:\t', y[1:30])
print('New binary labels:\t', y_binary_imbalanced[1:30])
np.bincount(y_binary_imbalanced) # Negative class (0) is the most frequent class
X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)
# Accuracy of Support Vector Machine classifier
from sklearn.svm import SVC
svm = SVC(kernel='rbf', C=1).fit(X_train, y_train)
svm.score(X_test, y_test)
DummyClassifier is a classifier that makes predictions using simple rules, which can be useful as a baseline for comparison against actual classifiers, especially with imbalanced classes. Dummy classifiers are called that because they don't even look at the data to make a prediction; they simply use the strategy or rule of thumb that you instruct them to use when creating them. When you create the classifier, you set the strategy argument to tell it which rule of thumb to use to make its predictions. Here we set it to the most_frequent strategy, which always predicts the most frequent class.
from sklearn.dummy import DummyClassifier
# Negative class (0) is most frequent
dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(X_train, y_train)
# Therefore the dummy 'most_frequent' classifier always predicts class 0
y_dummy_predictions = dummy_majority.predict(X_test)
y_dummy_predictions
dummy_majority.score(X_test, y_test)
Dummy classifiers completely ignore the input data!
strategy parameter for DummyClassifier in scikit-learn:
most_frequent: predicts the most frequent label in the training set.
stratified: random predictions based on training set class distribution.
uniform: generates predictions uniformly at random.
constant: always predicts a constant label provided by the user.
When we have a binary classification task where the most frequent class is the negative class, the most_frequent strategy will never predict the positive class. The count of correctly predicted positive instances will therefore always be 0, which causes important metrics like $F_1$-score to always be 0. Using the constant strategy, we can force a dummy classifier to always predict the positive class even if it is the minority class, which leads to a more meaningful computation of $F_1$-score.
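As a minimal sketch of the point about $F_1$-score, the snippet below compares the most_frequent and constant strategies on a small made-up label vector (an assumption for illustration, not the digits data). The variable names are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score, recall_score

# Toy imbalanced labels (illustrative only): 9 negatives, 1 positive
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
X_toy = np.zeros((len(y_toy), 1))  # dummy classifiers ignore the features

majority = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
always_pos = DummyClassifier(strategy='constant', constant=1).fit(X_toy, y_toy)

# zero_division=0 silences the warning raised when no positives are predicted
f1_majority = f1_score(y_toy, majority.predict(X_toy), zero_division=0)
f1_always_pos = f1_score(y_toy, always_pos.predict(X_toy))
recall_always_pos = recall_score(y_toy, always_pos.predict(X_toy))

print('F1 (most_frequent): {:.3f}'.format(f1_majority))
print('F1 (constant=1):    {:.3f}'.format(f1_always_pos))
print('Recall (constant=1): {:.3f}'.format(recall_always_pos))
```

The always-positive baseline has perfect recall but poor precision, so its $F_1$-score is low but non-zero, which gives other classifiers something meaningful to beat.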
What if my classifier accuracy is close to the null accuracy baseline? This could be a sign of:
For example, if we change the support vector classifier's kernel parameter from rbf to linear and recompute the accuracy on this retrained classifier, we can see that this leads to much better performance of almost 98%, compared to the most_frequent class baseline of 90%.
svm = SVC(kernel='linear', C=1).fit(X_train, y_train)
svm.score(X_test, y_test)
In general, for imbalanced classification problems, you should use metrics other than accuracy. We'll look at one shortly called AUC, which is short for area under the curve.
Dummy Regressors
strategy parameter options:
mean: predicts the mean of the training targets
median: predicts the median of the training targets
quantile: predicts a user-provided quantile of the training targets
constant: predicts a constant user-provided value
from sklearn.metrics import confusion_matrix
# Negative class (0) is most frequent
dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(X_train, y_train)
y_majority_predicted = dummy_majority.predict(X_test)
confusion = confusion_matrix(y_test, y_majority_predicted)
print('Most frequent class (dummy classifier)\n', confusion)
# produces random predictions w/ same class proportion as training set
dummy_classprop = DummyClassifier(strategy='stratified').fit(X_train, y_train)
y_classprop_predicted = dummy_classprop.predict(X_test)
confusion = confusion_matrix(y_test, y_classprop_predicted)
print('Random class-proportional prediction (dummy classifier)\n', confusion)
svm = SVC(kernel='linear', C=1).fit(X_train, y_train)
svm_predicted = svm.predict(X_test)
confusion = confusion_matrix(y_test, svm_predicted)
print('Support vector machine classifier (linear kernel, C=1)\n', confusion)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression().fit(X_train, y_train)
lr_predicted = lr.predict(X_test)
confusion = confusion_matrix(y_test, lr_predicted)
print('Logistic regression classifier (default settings)\n', confusion)
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
tree_predicted = dt.predict(X_test)
confusion = confusion_matrix(y_test, tree_predicted)
print('Decision tree classifier (max_depth = 2)\n', confusion)
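To make the layout of the printed matrices easier to read, here is a small sketch that unpacks a binary confusion matrix into its four counts. The labels are made up for illustration and are not the digits data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (illustrative only, not the digits data)
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# For binary labels {0, 1}, scikit-learn lays the matrix out as
# [[TN, FP],
#  [FN, TP]]  (rows = true label, columns = predicted label)
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
print('TN = {}, FP = {}, FN = {}, TP = {}'.format(tn, fp, fn, tp))
```

Reading the four counts by name this way avoids mixing up rows and columns when computing precision and recall by hand.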
There is often a tradeoff between precision and recall
$F_1$-score: combining precision & recall into a single number
When evaluating classifiers, it's often convenient to compute a quantity known as an $F_1$-score, which combines precision and recall into a single number. Mathematically, this is based on the harmonic mean of precision and recall, using this formula:
$$F_1=2\cdot \frac{Precision\cdot Recall}{Precision+Recall} = \frac {2\cdot TP}{2\cdot TP+FN+FP}$$
$$F_{\beta}=(1+\beta^2)\cdot \frac{Precision\cdot Recall}{(\beta^2\cdot Precision)+Recall} = \frac {(1+\beta^2)\cdot TP}{(1+\beta^2)\cdot TP+\beta^2 \cdot FN+FP}$$
$\beta$ allows adjustment of the metric to control the emphasis on recall vs precision:
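As a rough illustration of this adjustment, the sketch below evaluates $F_{\beta}$ for a few values of $\beta$ on made-up labels, once from the precision/recall formula and once with scikit-learn's fbeta_score. Values of $\beta$ above 1 weight recall more heavily and values below 1 weight precision more heavily. The helper f_beta_manual is a hypothetical name introduced here.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Toy labels (illustrative only): TP = 3, FP = 2, FN = 1
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

p = precision_score(y_true_toy, y_pred_toy)   # 3 / (3 + 2)
r = recall_score(y_true_toy, y_pred_toy)      # 3 / (3 + 1)

def f_beta_manual(precision, recall, beta):
    """F-beta computed directly from the precision/recall formula."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

for beta in (0.5, 1.0, 2.0):
    manual = f_beta_manual(p, r, beta)
    library = fbeta_score(y_true_toy, y_pred_toy, beta=beta)
    print('beta = {:.1f}  manual = {:.3f}  sklearn = {:.3f}'.format(beta, manual, library))
```

Because recall is higher than precision in this toy example, the score rises as $\beta$ grows.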
# Once a model is built, we can get its accuracy, precision, recall, F1 and so on
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Accuracy = (TP + TN) / (TP + TN + FP + FN)
# Precision = TP / (TP + FP)
# Recall = TP / (TP + FN) Also known as sensitivity, or True Positive Rate
# F1 = 2 * Precision * Recall / (Precision + Recall)
print('Accuracy: {:.2f}'.format(accuracy_score(y_test, tree_predicted)))
print('Precision: {:.2f}'.format(precision_score(y_test, tree_predicted)))
print('Recall: {:.2f}'.format(recall_score(y_test, tree_predicted)))
print('F1: {:.2f}'.format(f1_score(y_test, tree_predicted)))
# Combined report with all above metrics
from sklearn.metrics import classification_report
print(classification_report(y_test, tree_predicted, target_names=['not 1', '1']))
# the last column, "support", shows the number of test-set instances that have each true label
print('Random class-proportional (dummy)\n',
classification_report(y_test, y_classprop_predicted, target_names=['not 1', '1']))
print('SVM\n',
classification_report(y_test, svm_predicted, target_names = ['not 1', '1']))
print('Logistic regression\n',
classification_report(y_test, lr_predicted, target_names = ['not 1', '1']))
print('Decision tree\n',
classification_report(y_test, tree_predicted, target_names = ['not 1', '1']))
Many classifiers in scikit-learn can provide information about the uncertainty associated with a particular prediction, either through the decision_function method or the predict_proba method.
X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)
y_scores_lr = lr.fit(X_train, y_train).decision_function(X_test)
y_score_list = list(zip(y_test[0:20], y_scores_lr[0:20]))
# We can see that instances in the negative class often have large-magnitude negative scores,
# while instances in the positive class have positive scores from the logistic regression classifier.
# show the decision_function scores for first 20 instances
y_score_list
Predicted Probability of Class Membership (predict_proba)
X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)
y_proba_lr = lr.fit(X_train, y_train).predict_proba(X_test)
y_proba_list = list(zip(y_test[0:20], y_proba_lr[0:20,1]))
# show the probability of positive class for first 20 instances
y_proba_list
# you can see that many entries with a positive label of one, have a high probability like 0.995. While many negative label
# instances have a very low prediction probability
Note that not all models provide useful probability estimates of this type. For example, a model that was overfit to a training set might provide overly optimistic high probabilities that are in fact not accurate.
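To connect the two kinds of scores, here is a small sketch on synthetic data (generated with make_classification, an assumption for illustration). For a binary logistic regression in scikit-learn, the predicted probability of the positive class is the logistic sigmoid of the decision_function score. Other classifiers may relate the two differently, or offer only one of the methods.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf_demo = LogisticRegression().fit(X_demo, y_demo)

scores = clf_demo.decision_function(X_demo[:5])     # signed scores, any real value
probas = clf_demo.predict_proba(X_demo[:5])[:, 1]   # probability of class 1, in [0, 1]

# For binary logistic regression, the probability is the sigmoid of the score
sigmoid_of_scores = 1.0 / (1.0 + np.exp(-scores))
for s, p in zip(scores, probas):
    print('score = {:+.3f}  probability = {:.3f}'.format(s, p))
```

A score of zero corresponds to a probability of 0.5, which is why the default decision boundary can be described with either quantity.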
Varying the Decision Threshold
We can use these decision scores or prediction probabilities to get a more complete picture of a classifier's performance. For a particular application, we might pick a specific decision threshold depending on whether we want the classifier to be more or less conservative about making false-positive or false-negative errors. When developing a new model, it might not be entirely clear what the right decision threshold would be, or how that choice will affect evaluation metrics like precision and recall. So instead, we'll look at how the classifier performs for all possible decision thresholds.
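As a sketch of what varying the threshold means in practice, the snippet below applies three different thresholds to a handful of hypothetical classifier scores (made-up numbers, not output from the models in this notebook) and recomputes precision and recall each time. The helper name precision_recall_at is an assumption introduced for illustration.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical classifier scores and true labels (illustrative only)
y_true_demo = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1])
scores_demo = np.array([-30, -25, -18, -12, -8, -3, 2, 6, 15, 22])

def precision_recall_at(threshold):
    """Predict positive when score > threshold, then score the predictions."""
    y_pred = (scores_demo > threshold).astype(int)
    return precision_score(y_true_demo, y_pred), recall_score(y_true_demo, y_pred)

for t in (-20, 0, 10):
    p, r = precision_recall_at(t)
    print('threshold = {:>4}  precision = {:.2f}  recall = {:.2f}'.format(t, p, r))
```

In this toy example, raising the threshold makes the classifier more conservative: precision goes up while recall goes down.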
On the left there is a list of test instances with their true label and classifier score. Suppose we set a decision threshold of -20. The instances above that line in the list have scores of -20 or less, so they are below the threshold, and the instances in the other direction have scores above the threshold of -20. The ones below the threshold will be predicted to be in the negative class, and the ones above the threshold will be predicted to be in the positive class.
So if we pick the specific threshold, in this case, -20, and we partition the test points in this way, we can compute precision and recall for the points that are predicted to be in the positive class. So in this case, we have 12 instances here, 12 total instances. They are being predicted as positive and only four of them
from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(y_test, y_scores_lr)
closest_zero = np.argmin(np.abs(thresholds))
closest_zero_p = precision[closest_zero]
closest_zero_r = recall[closest_zero]
plt.figure()
plt.xlim([0.0, 1.01])
plt.ylim([0.0, 1.01])
plt.plot(precision, recall, label='Precision-Recall Curve')
plt.plot(closest_zero_p, closest_zero_r, 'o', markersize = 12, fillstyle = 'none', c='r', mew=3)
plt.xlabel('Precision', fontsize=16)
plt.ylabel('Recall', fontsize=16)
plt.gca().set_aspect('equal')  # gca() reuses the current axes; plt.axes() creates a new one in recent matplotlib
plt.show()
The red circle indicates the precision and recall that's achieved when the decision threshold is zero.
In this particular case you can also see that it's not exactly a smooth curve. There are some jagged areas, and in fact the jumps tend to get a little bigger as we approach maximum precision. This is a consequence of how the formulas for precision and recall are computed: they use discrete counts that include the number of true positives.
As the decision threshold increases, there are fewer and fewer points that remain as positive predictions, so the fractions computed from these smaller numbers can change pretty dramatically with small changes in the decision threshold. That's why the trailing edges of a precision-recall curve can appear a bit jagged when you plot them.
ROC curves, or receiver operating characteristic curves, are a very widely used visualization method that illustrates the performance of a binary classifier.
X-axis: False Positive Rate
Y-axis: True Positive Rate
Top left corner:
"Steepness" of ROC curves is important:
from sklearn.metrics import roc_curve, auc
X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)
y_score_lr = lr.fit(X_train, y_train).decision_function(X_test)
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_score_lr)
roc_auc_lr = auc(fpr_lr, tpr_lr)
plt.figure()
plt.xlim([-0.01, 1.00])
plt.ylim([-0.01, 1.01])
plt.plot(fpr_lr, tpr_lr, lw=3, label='LogRegr ROC curve (area = {:0.2f})'.format(roc_auc_lr))
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve (1-of-10 digits classifier)', fontsize=16)
plt.legend(loc='lower right', fontsize=13)
plt.plot([0, 1], [0, 1], color='navy', lw=3, linestyle='--')
plt.gca().set_aspect('equal')  # gca() reuses the current axes; plt.axes() creates a new one in recent matplotlib
plt.show()
The dotted line here is the curve that results from a classifier that randomly guesses the label for a binary class. It's basically like flipping a coin: if you have two classes with equal numbers of positive and negative instances, then flipping a coin will give you roughly equal numbers of false positives and true positives for large data sets.
So the dotted line here is used as a baseline. A bad classifier will have performance that is random, maybe even worse than random, or only slightly better than random. A reasonably good classifier will give an ROC curve that is consistently better than random across all decision threshold choices. An excellent classifier would be one like the one shown here (LogRegr ROC curve), which in this particular example is a logistic regression classifier.
So the shape of the curve can be important as well. Looking at the steepness of the curve, we want classifiers that maximize the true positive rate while minimizing the false positive rate. We can quantify the goodness of a classifier in some sense by looking at how much area there is underneath the curve. The area underneath the random classifier's curve is 0.5. We use AUC, the single number that measures the total area underneath the ROC curve, as a way to summarize a classifier's performance.
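The following sketch illustrates the 0.5 baseline on synthetic data (random labels and made-up scores, both assumptions for illustration). Scores that are unrelated to the labels give an AUC near 0.5, while scores that carry information about the labels give a clearly higher AUC. The dictionary name areas is hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

rng = np.random.RandomState(0)
y_true_demo = rng.randint(0, 2, size=2000)       # synthetic labels (illustrative)

random_scores = rng.rand(2000)                   # scores unrelated to the labels
informative_scores = y_true_demo + rng.normal(scale=0.8, size=2000)  # noisy but useful

areas = {}
for name, s in [('random', random_scores), ('informative', informative_scores)]:
    fpr, tpr, _ = roc_curve(y_true_demo, s)
    areas[name] = auc(fpr, tpr)                  # trapezoidal area under the ROC curve
    # roc_auc_score computes the same quantity in one call
    assert np.isclose(areas[name], roc_auc_score(y_true_demo, s))
    print('{:<12} AUC = {:.3f}'.format(name, areas[name]))
```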
from matplotlib import cm
X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)
plt.figure()
plt.xlim([-0.01, 1.00])
plt.ylim([-0.01, 1.01])
for g in [0.01, 0.1, 0.20, 1]:
    svm = SVC(gamma=g).fit(X_train, y_train)
    y_score_svm = svm.decision_function(X_test)
    fpr_svm, tpr_svm, _ = roc_curve(y_test, y_score_svm)
    roc_auc_svm = auc(fpr_svm, tpr_svm)
    accuracy_svm = svm.score(X_test, y_test)
    print("gamma = {:.2f} accuracy = {:.2f} AUC = {:.2f}".format(g, accuracy_svm,
                                                                  roc_auc_svm))
    plt.plot(fpr_svm, tpr_svm, lw=3, alpha=0.7,
             label='SVM (gamma = {:0.2f}, area = {:0.2f})'.format(g, roc_auc_svm))
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate (Recall)', fontsize=16)
plt.plot([0, 1], [0, 1], color='k', lw=0.5, linestyle='--')
plt.legend(loc="lower right", fontsize=11)
plt.title('ROC curve: (1-of-10 digits classifier)', fontsize=16)
plt.gca().set_aspect('equal')  # gca() reuses the current axes; plt.axes() creates a new one in recent matplotlib
plt.show()
The multi-class confusion matrix is a straightforward extension of the binary classifier 2X2 confusion matrix.
As a general rule of thumb as part of model evaluation, I suggest always looking at the confusion matrix for your classifier, to get some insight into what kind of errors it is making for each class including whether some classes are much more prone to certain kinds of errors than others
dataset = load_digits()
X, y = dataset.data, dataset.target
X_train_mc, X_test_mc, y_train_mc, y_test_mc = train_test_split(X, y, random_state=0)
svm = SVC(kernel = 'linear').fit(X_train_mc, y_train_mc)
svm_predicted_mc = svm.predict(X_test_mc)
confusion_mc = confusion_matrix(y_test_mc, svm_predicted_mc)
df_cm = pd.DataFrame(confusion_mc,
index = [i for i in range(0,10)], columns = [i for i in range(0,10)])
plt.figure(figsize=(5.5,4))
sns.heatmap(df_cm, annot=True)
plt.title('SVM Linear Kernel \nAccuracy:{0:.3f}'.format(accuracy_score(y_test_mc,
svm_predicted_mc)))
plt.ylabel('True label')
plt.xlabel('Predicted label')
svm = SVC(kernel = 'rbf').fit(X_train_mc, y_train_mc)
svm_predicted_mc = svm.predict(X_test_mc)
confusion_mc = confusion_matrix(y_test_mc, svm_predicted_mc)
df_cm = pd.DataFrame(confusion_mc, index = [i for i in range(0,10)],
columns = [i for i in range(0,10)])
plt.figure(figsize = (5.5,4))
sns.heatmap(df_cm, annot=True)
plt.title('SVM RBF Kernel \nAccuracy:{0:.3f}'.format(accuracy_score(y_test_mc,
svm_predicted_mc)))
plt.ylabel('True label')
plt.xlabel('Predicted label');
Get a classification report that summarizes multiple evaluation metrics for a multi-class classifier with an average metric computed for each class
print(classification_report(y_test_mc, svm_predicted_mc))
print('Micro-averaged precision = {:.2f} (treat instances equally)'
.format(precision_score(y_test_mc, svm_predicted_mc, average = 'micro')))
print('Macro-averaged precision = {:.2f} (treat classes equally)'
.format(precision_score(y_test_mc, svm_predicted_mc, average = 'macro')))
print('Micro-averaged f1 = {:.2f} (treat instances equally)'
.format(f1_score(y_test_mc, svm_predicted_mc, average = 'micro')))
print('Macro-averaged f1 = {:.2f} (treat classes equally)'
.format(f1_score(y_test_mc, svm_predicted_mc, average = 'macro')))
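To show what the micro and macro averages are doing, the sketch below computes both by hand from a confusion matrix on a made-up three-class example (illustrative labels, not the digits data) and checks them against precision_score. Variable names are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

# Toy 3-class labels (illustrative only); class 0 is much more frequent
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_pred_demo = np.array([0, 0, 0, 0, 0, 1, 1, 0, 2, 0])

cm = confusion_matrix(y_true_demo, y_pred_demo)
tp_per_class = np.diag(cm)                 # correct predictions per class
predicted_per_class = cm.sum(axis=0)       # column sums = times each class was predicted

# Macro: average the per-class precisions, so each class counts equally
macro_manual = np.mean(tp_per_class / predicted_per_class)
# Micro: pool all decisions first, so each instance counts equally
micro_manual = tp_per_class.sum() / predicted_per_class.sum()

print('Macro precision: manual = {:.3f}, sklearn = {:.3f}'.format(
    macro_manual, precision_score(y_true_demo, y_pred_demo, average='macro')))
print('Micro precision: manual = {:.3f}, sklearn = {:.3f}'.format(
    micro_manual, precision_score(y_true_demo, y_pred_demo, average='micro')))
```

With imbalanced classes the two averages can differ noticeably, because the micro average is dominated by the most frequent class.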
DummyRegressor
mean_absolute_error (absolute difference of target & predicted values): in machine learning terms, this corresponds to the expected value of the L1 norm loss. It is sometimes used, for example, to assess forecast outcomes for regression in time series analysis.
mean_squared_error (squared difference of target & predicted values): this corresponds to the expected value of the L2 norm loss and is widely used for many regression problems. Like mean_absolute_error, mean_squared_error doesn't distinguish between over- and under-estimates.
median_absolute_error (robust to outliers): when ignoring outliers is important, you can use the median_absolute_error score, which is robust to the presence of outliers because it uses the median of the error distribution rather than the mean.
Dummy Regressors
The DummyRegressor class implements four simple baseline rules for regression, using the strategy parameter:
mean: predicts the mean of the training target values
median: predicts the median of the training target values
quantile: predicts a user-provided quantile of the training values (e.g. value at the $75^{th}$ percentile)
constant: predicts a custom constant value provided by the user
Although regression typically has simpler evaluation needs than classification, it does pay to double-check that the evaluation metric you choose for a regression problem penalizes errors in a way that reflects the consequences of those errors for the business, organizational, or user needs of your application.
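As a small illustration of how the choice of regression metric changes what gets penalized, the sketch below scores two sets of made-up predictions (illustrative numbers only) that differ by a single large error. The mean-based metrics react strongly to that one error, while median_absolute_error does not.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, median_absolute_error

# Toy targets and predictions (illustrative only); every error is 1.0
y_true_demo = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
y_pred_clean = y_true_demo + 1.0

# Same predictions, except one is off by 50
y_pred_outlier = y_pred_clean.copy()
y_pred_outlier[-1] = y_true_demo[-1] + 50.0

for name, y_pred in [('clean', y_pred_clean), ('one outlier', y_pred_outlier)]:
    print('{:<12} MAE = {:6.2f}  MSE = {:7.2f}  MedAE = {:.2f}'.format(
        name,
        mean_absolute_error(y_true_demo, y_pred),
        mean_squared_error(y_true_demo, y_pred),
        median_absolute_error(y_true_demo, y_pred)))
```

Whether the single large error should dominate the score depends on the application, which is the point of choosing the metric deliberately.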
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.dummy import DummyRegressor
diabetes = datasets.load_diabetes()
X = diabetes.data[:, None, 6]
y = diabetes.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
lm = LinearRegression().fit(X_train, y_train)
lm_dummy_mean = DummyRegressor(strategy = 'mean').fit(X_train, y_train)
y_predict = lm.predict(X_test)
y_predict_dummy_mean = lm_dummy_mean.predict(X_test)
print('Linear model, coefficients: ', lm.coef_)
print("Mean squared error (dummy): {:.2f}".format(mean_squared_error(y_test,
y_predict_dummy_mean)))
print("Mean squared error (linear model): {:.2f}".format(mean_squared_error(y_test, y_predict)))
print("r2_score (dummy): {:.2f}".format(r2_score(y_test, y_predict_dummy_mean)))
print("r2_score (linear model): {:.2f}".format(r2_score(y_test, y_predict)))
# Plot outputs
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, y_predict, color='green', linewidth=2)
plt.plot(X_test, y_predict_dummy_mean, color='red', linestyle = 'dashed',
linewidth=2, label = 'dummy')
plt.show()
Beyond the essential evaluation metrics covered in this course, online controlled experiments, which involve A-B testing and other techniques, are perhaps the most important way that machine learning algorithms are evaluated for real-world use in Web and other online applications. This article by Ron Kohavi, Randal Henne, and Dan Sommerfield, reviews the key points of running controlled experiments, along with important engineering issues and limitations to keep in mind when using them.
Kohavi, R., Henne, R. M., & Sommerfield, D. (2007). Practical guide to controlled experiments on the web. Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '07. doi:10.1145/1281192.1281295
GridSearchCV (default cv=3 in older scikit-learn releases; cv=5 since version 0.22)
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
dataset = load_digits()
# again, making this a binary problem with 'digit 1' as positive class
# and 'not 1' as negative class
X, y = dataset.data, dataset.target == 1
clf = SVC(kernel='linear', C=1)
# accuracy is the default scoring metric
print('Cross-validation (accuracy)', cross_val_score(clf, X, y, cv=5))
# use AUC as scoring metric
print('Cross-validation (AUC)', cross_val_score(clf, X, y, cv=5, scoring = 'roc_auc'))
# use recall as scoring metric
print('Cross-validation (recall)', cross_val_score(clf, X, y, cv=5, scoring = 'recall'))
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
dataset = load_digits()
X, y = dataset.data, dataset.target == 1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = SVC(kernel='rbf')
grid_values = {'gamma': [0.001, 0.01, 0.05, 0.1, 1, 10, 100]}
# default metric to optimize over grid parameters: accuracy
grid_clf_acc = GridSearchCV(clf, param_grid = grid_values)
grid_clf_acc.fit(X_train, y_train)
y_decision_fn_scores_acc = grid_clf_acc.decision_function(X_test)
print('Grid best parameter (max. accuracy): ', grid_clf_acc.best_params_)
print('Grid best score (accuracy): ', grid_clf_acc.best_score_)
# alternative metric to optimize over grid parameters: AUC
grid_clf_auc = GridSearchCV(clf, param_grid = grid_values, scoring = 'roc_auc')
grid_clf_auc.fit(X_train, y_train)
y_decision_fn_scores_auc = grid_clf_auc.decision_function(X_test)
print('Test set AUC: ', roc_auc_score(y_test, y_decision_fn_scores_auc))
print('Grid best parameter (max. AUC): ', grid_clf_auc.best_params_)
print('Grid best score (AUC): ', grid_clf_auc.best_score_)
You can see the complete list of names for the evaluation metrics supported by the scoring parameter by running the following code, which lists the scorer names defined in sklearn.metrics. You can see metrics for classification, such as the string precision_micro, which represents micro-averaged precision, as well as metrics for regression, such as r2 for the R-squared regression score.
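# sklearn.metrics.scorer and SCORERS were removed in recent scikit-learn releases;
# get_scorer_names() (scikit-learn >= 1.0) returns the same list of scoring names
from sklearn.metrics import get_scorer_names
print(sorted(get_scorer_names()))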
Here is a specific example of how a classifier's decision boundary changes when it's optimized for different evaluation metrics. This classification problem is based on the same binary digit classifier training and test sets we've been using as an example throughout the notebook.
In these classification visualization examples, the positive examples (the digit one) are shown as black points, and the region of positive class prediction is shown as the light-colored or yellow region to the right of the decision boundary. The negative examples (all other digits) are shown as white points, and the region of negative class prediction in these figures is to the left of the decision boundary. The data points have been plotted using two of the 64 feature values in the digits dataset and have been jittered a little.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from adspy_shared_utilities import plot_class_regions_for_classifier_subplot
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
dataset = load_digits()
X, y = dataset.data, dataset.target == 1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Create a two-feature input vector matching the example plot above
# We jitter the points (add a small amount of random noise) in case there are areas
# in feature space where many instances have the same features.
jitter_delta = 0.25
X_twovar_train = X_train[:,[20,59]]+ np.random.rand(X_train.shape[0], 2) - jitter_delta
X_twovar_test = X_test[:,[20,59]] + np.random.rand(X_test.shape[0], 2) - jitter_delta
# We apply `GridSearchCV` here to explore different values of the optional class_weight parameter that controls how much
# weight is given to each of the two classes during training
clf = SVC(kernel = 'linear').fit(X_twovar_train, y_train)
grid_values = {'class_weight':['balanced', {1:2},{1:3},{1:4},{1:5},{1:10},{1:20},{1:50}]}
plt.figure(figsize=(9,6))
for i, eval_metric in enumerate(('precision', 'recall', 'f1', 'roc_auc')):
    grid_clf_custom = GridSearchCV(clf, param_grid=grid_values, scoring=eval_metric)
    grid_clf_custom.fit(X_twovar_train, y_train)
    print('Grid best parameter (max. {0}): {1}'
          .format(eval_metric, grid_clf_custom.best_params_))
    print('Grid best score ({0}): {1}'
          .format(eval_metric, grid_clf_custom.best_score_))
    plt.subplots_adjust(wspace=0.3, hspace=0.3)
    plot_class_regions_for_classifier_subplot(grid_clf_custom, X_twovar_test, y_test, None,
                                              None, None, plt.subplot(2, 2, i+1))
    plt.title(eval_metric+'-oriented SVC')
plt.tight_layout()
plt.show()
As it turns out, optimization for different evaluation metrics results in different optimal values of the class weight parameter. As the class weight parameter increases, more emphasis is given to correctly classifying the positive class instances. The precision-oriented classifier we see here, with a class weight of two, tries hard to reduce false positives while increasing true positives. So it focuses on the cluster of positive class points in the lower right corner, where there are relatively few negative class points.
In contrast, the recall-oriented classifier with class weight of 50, tries hard to reduce the number of false negatives while increasing true positives. That is it tries to find most of the positive class points as part of its positive class predictions.
We can also see that the decision boundary for the F1-oriented classifier has an optimal class weight of two, which is between the optimal class weight values for the precision and recall-oriented classifiers. Visually we can see that the F1-oriented classifier also has a kind of intermediate positioning between the precision and recall-oriented decision boundaries. This makes sense given that F1 is the harmonic mean of precision and recall.
The AUC-oriented classifier, with an optimal class weight of 5, has a similar decision boundary to the F1-oriented classifier, but shifted slightly in favor of higher recall.
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve
from adspy_shared_utilities import plot_class_regions_for_classifier
from sklearn.svm import SVC
dataset = load_digits()
X, y = dataset.data, dataset.target == 1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# create a two-feature input vector matching the example plot above
jitter_delta = 0.25
X_twovar_train = X_train[:,[20,59]]+ np.random.rand(X_train.shape[0], 2) - jitter_delta
X_twovar_test = X_test[:,[20,59]] + np.random.rand(X_test.shape[0], 2) - jitter_delta
clf = SVC(kernel='linear', class_weight='balanced').fit(X_twovar_train, y_train)
y_scores = clf.decision_function(X_twovar_test)
precision, recall, thresholds = precision_recall_curve(y_test, y_scores)
closest_zero = np.argmin(np.abs(thresholds))
closest_zero_p = precision[closest_zero]
closest_zero_r = recall[closest_zero]
plot_class_regions_for_classifier(clf, X_twovar_test, y_test)
plt.title("SVC, class_weight = 'balanced', optimized for accuracy")
plt.show()
plt.figure()
plt.xlim([0.0, 1.01])
plt.ylim([0.0, 1.01])
plt.title ("Precision-recall curve: SVC, class_weight = 'balanced'")
plt.plot(precision, recall, label = 'Precision-Recall Curve')
plt.plot(closest_zero_p, closest_zero_r, 'o', markersize=12, fillstyle='none', c='r', mew=3)
plt.xlabel('Precision', fontsize=16)
plt.ylabel('Recall', fontsize=16)
plt.gca().set_aspect('equal')  # gca() reuses the current axes; plt.axes() creates a new one in recent matplotlib
plt.show()
print('At zero threshold, precision: {:.2f}, recall: {:.2f}'
.format(closest_zero_p, closest_zero_r))
As the decision threshold is shifted to become less and less conservative, tracing the curve up and to the left, the classifier becomes more and more like the recall-oriented support vector classifier example. Again, the red circle represents the precision-recall trade-off achieved at the zero score mark, which is the actual decision boundary chosen for the trained classifier.
Training, Validation, and Test Framework for Model Selection and Evaluation
Remember that the whole point of evaluating on a test set is to estimate how well a learning algorithm might perform on future, unseen data. The more information we see about our dataset as part of repeated cross-validation passes in choosing our model, the more influence any potential held-out test data has had on selecting the final model, not merely evaluating it.
This is sometimes called data leakage. We haven't done an evaluation with a truly held-out test set unless we commit to holding back a test split that isn't seen by any process until the very end of the evaluation.
Instead, use three data splits:
In practice:
Machine learning researchers take this protocol very seriously. The train-validate-test design is a very important, universally applied framework for effective evaluation of machine learning models.
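A minimal sketch of such a three-way split is shown below, using the binary digits problem from earlier. The 60/20/20 proportions, the small gamma grid, and the use of accuracy for model selection are all assumptions chosen for brevity; as discussed above, a metric such as AUC may be a better choice for imbalanced data.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()
X_all, y_all = digits.data, digits.target == 1

# First hold back a test set that is not touched until the very end
X_trainval, X_test_final, y_trainval, y_test_final = train_test_split(
    X_all, y_all, test_size=0.2, random_state=0)
# Then split the remainder into training and validation sets
X_tr, X_val, y_tr, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

# Model selection uses only the validation set
best_gamma, best_val_score = None, -1.0
for gamma in [0.0001, 0.001, 0.01]:
    val_score = SVC(kernel='rbf', gamma=gamma).fit(X_tr, y_tr).score(X_val, y_val)
    if val_score > best_val_score:
        best_gamma, best_val_score = gamma, val_score

# Final, one-time evaluation on the held-out test set
final_model = SVC(kernel='rbf', gamma=best_gamma).fit(X_trainval, y_trainval)
print('Selected gamma:', best_gamma)
print('Validation accuracy: {:.3f}'.format(best_val_score))
print('Held-out test accuracy: {:.3f}'.format(final_model.score(X_test_final, y_test_final)))
```

The test split is scored exactly once, after the hyperparameter has been fixed, so it plays no part in choosing the model.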
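import subprocess
import sys
def install(package):
    # pip.main() was removed in pip 10; invoke pip as a module of the current interpreter instead
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', package])
install('graphviz')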