When we began working with supervised machine learning methods, we evaluated a classifier's performance using its accuracy. Accuracy, as you might recall, is the fraction of samples that were classified correctly, that is, where the classifier's predicted label matched the correct or true label.

**Learning Objectives**

- Understand why accuracy only gives a partial picture of a classifier's performance.
- Understand the motivation and definition of important evaluation metrics in machine learning
- Learn how to use a variety of evaluation metrics to evaluate supervised machine learning models
- Learn about choosing the right metric for selecting between models or for doing parameter tuning

**Represent / Train / Evaluate / Refine Cycle**

`Representation` (extract and select object features) ==> `Train models` (fit the estimator to the data) ==> `Evaluation` ==> `Feature and model refinement` ==> `Representation`

**Evaluation**

- Different applications have very different goals
- Accuracy is widely used, but many other metrics are possible, e.g.:
  - User satisfaction (Web search)
  - Amount of revenue (e-commerce)
  - Increase in patient survival rates (medical)
- It's very important to choose evaluation methods that match the goal of your application
- Compute your selected evaluation metric for multiple different models
- Then select the model with the 'best' value of the evaluation metric

- Suppose you have two classes:
  - Relevant (R): the positive class
  - Not_Relevant (N): the negative class
- Out of 1000 randomly selected items, on average:
  - One item is relevant and has an "R" label
  - The rest of the items (999 of them) are not relevant and are labelled "N"

- Recall that: $$Accuracy = \frac{\text{correct predictions}}{\text{total instances}}$$

First, let's consider a binary classification task where there are a lot of instances labeled with the negative class, but only a few instances that belong to the positive class. We might see this scenario in online search or recommender systems, where the system has to predict whether or not to display an advertisement, product suggestion, query suggestion, or page item that is likely to be relevant given a user's query, what they clicked on in the past, and so on. Those relevant items are the positive examples, but there are many, many irrelevant items in the negative class that don't make sense to show a user. This is called an **imbalanced class scenario**.

- You build a classifier to predict relevant items, and see that its accuracy on a test set is 99.9%
- Wow! Amazingly good, right?
- For comparison, suppose we had a "dummy" classifier that didn't look at the features at all, and always just blindly predicted the most frequent class (i.e. the negative "N" class)
- Assuming a test set of 1000 instances, what would this dummy classifier's accuracy be?
- Answer: $Accuracy_{DUMMY}=999/1000=99.9\%$
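The arithmetic above can be checked with a few lines of plain Python. This is only a sketch: the label list is a made-up test set that matches the 1-in-1000 example, and the `accuracy` helper is written here for illustration rather than taken from scikit-learn.

```python
# Hypothetical test set matching the example: 1 relevant item out of 1000.
y_true = ["R"] + ["N"] * 999

# A "dummy" classifier that ignores the features and always predicts the majority class.
y_dummy = ["N"] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


dummy_accuracy = accuracy(y_true, y_dummy)
# Number of relevant items the dummy classifier actually finds.
relevant_found = sum(t == "R" and p == "R" for t, p in zip(y_true, y_dummy))

print("Dummy accuracy: {:.1%}".format(dummy_accuracy))  # 99.9%
print("Relevant items found:", relevant_found)          # 0
```

The dummy classifier matches the 99.9% accuracy while finding none of the relevant items, which is why accuracy alone is a weak signal in this setting.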

In [1]:

```
%matplotlib notebook
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits #using the digits dataset, which has images of handwritten digits labeled with
#ten classes, representing the digits 0 though 9
dataset = load_digits()
X, y = dataset.data, dataset.target
for class_name, class_count in zip(dataset.target_names, np.bincount(dataset.target)):
    print(class_name, class_count)
```

In [34]:

```
dataset
```


In [35]:

```
# Creating a dataset with imbalanced binary classes:
# Negative class (0) is 'not digit 1'
# Positive class (1) is 'digit 1'
y_binary_imbalanced = y.copy()
y_binary_imbalanced[y_binary_imbalanced != 1] = 0
print('Original labels:\t', y[1:30])
print('New binary labels:\t', y_binary_imbalanced[1:30])
```

In [36]:

```
np.bincount(y_binary_imbalanced) # Negative class (0) is the most frequent class
```


In [37]:

```
X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)
# Accuracy of Support Vector Machine classifier
from sklearn.svm import SVC
svm = SVC(kernel='rbf', C=1).fit(X_train, y_train)
svm.score(X_test, y_test)
```


`DummyClassifier` is a classifier that makes predictions using simple rules, which can be useful as a baseline for comparison against actual classifiers, especially with imbalanced classes. Dummy classifiers are called that because they don't even look at the data to make a prediction. They simply use the strategy, or rule of thumb, that you instruct them to use when creating them: when you create the classifier, you set the `strategy` argument to tell it which rule of thumb to use for its predictions. Here we set it to the `most_frequent` strategy to predict the most frequent class.

In [38]:

```
from sklearn.dummy import DummyClassifier
# Negative class (0) is most frequent
dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(X_train, y_train)
# Therefore the dummy 'most_frequent' classifier always predicts class 0
y_dummy_predictions = dummy_majority.predict(X_test)
y_dummy_predictions
```


In [39]:

```
dummy_majority.score(X_test, y_test)
```


**Dummy classifiers completely ignore the input data!**

- Dummy classifiers serve as a sanity check on your classifier's performance
- They provide a *null metric* (e.g. null accuracy) baseline
- Dummy classifiers should not be used for real problems
- Some commonly-used settings for the `strategy` parameter for DummyClassifier in scikit-learn:
  - `most_frequent`: predicts the most frequent label in the training set.
  - `stratified`: random predictions based on training set class distribution.
  - `uniform`: generates predictions uniformly at random.
  - `constant`: always predicts a constant label provided by the user.
    - A major motivation of this method is **$F_1$-scoring**, when the positive class is in the minority.

When we have a binary classification task where the most frequent class is the negative class, the `most_frequent` strategy will never predict the positive class. The count of correctly predicted positive instances will therefore be 0, which causes important metrics like the $F_1$-score to always be 0. Using the `constant` strategy, we can force a dummy classifier to always predict the positive class, even if it's the minority class, and this leads to a more meaningful computation of the $F_1$-score.
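A small pure-Python sketch of this effect follows. The 10/90 class split is an assumed toy example, and `f1_from_labels` is a helper written here for illustration (it uses the count form $2TP / (2TP + FP + FN)$ and returns 0 when there are no positives at all to count).

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """F1 computed from counts: 2*TP / (2*TP + FP + FN)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Assumed toy data: 10 positive instances, 90 negative instances.
y_true = [1] * 10 + [0] * 90

always_negative = [0] * len(y_true)  # what a most-frequent-class baseline predicts here
always_positive = [1] * len(y_true)  # what a constant baseline set to the positive class predicts

f1_negative = f1_from_labels(y_true, always_negative)
f1_positive = f1_from_labels(y_true, always_positive)

print("F1, always negative: {:.3f}".format(f1_negative))  # 0.000
print("F1, always positive: {:.3f}".format(f1_positive))  # 0.182
```

In scikit-learn the second baseline corresponds to `DummyClassifier(strategy='constant', constant=1)`; a real classifier should beat its $F_1$ score, not just the zero produced by the majority-class baseline.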

**What if my classifier accuracy is close to the null accuracy baseline?**
This could be a sign of:

- Ineffective, erroneous or missing features
- Poor choice of kernel or hyperparameter
- Large class imbalance

For example, if we change the support vector classifier's kernel parameter from `rbf` to `linear` and recompute the accuracy of the retrained classifier, we see much better performance: almost 98%, compared to the `most_frequent` class baseline of 90%.

In [40]:

```
svm = SVC(kernel='linear', C=1).fit(X_train, y_train)
svm.score(X_test, y_test)
```


**In general**, for imbalanced classification problems, you should use metrics other than accuracy. We'll look at one shortly called **AUC**, which is short for area under the curve.

**Dummy Regressors**

`strategy` parameter options:

- `mean`: predicts the mean of the training targets
- `median`: predicts the median of the training targets
- `quantile`: predicts a user-provided quantile of the training targets
- `constant`: predicts a constant user-provided value

In [41]:

```
from sklearn.metrics import confusion_matrix
# Negative class (0) is most frequent
dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(X_train, y_train)
y_majority_predicted = dummy_majority.predict(X_test)
confusion = confusion_matrix(y_test, y_majority_predicted)
print('Most frequent class (dummy classifier)\n', confusion)
```

In [42]:

```
# produces random predictions w/ same class proportion as training set
dummy_classprop = DummyClassifier(strategy='stratified').fit(X_train, y_train)
y_classprop_predicted = dummy_classprop.predict(X_test)
confusion = confusion_matrix(y_test, y_classprop_predicted)
print('Random class-proportional prediction (dummy classifier)\n', confusion)
```

In [43]:

```
svm = SVC(kernel='linear', C=1).fit(X_train, y_train)
svm_predicted = svm.predict(X_test)
confusion = confusion_matrix(y_test, svm_predicted)
print('Support vector machine classifier (linear kernel, C=1)\n', confusion)
```

In [44]:

```
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression().fit(X_train, y_train)
lr_predicted = lr.predict(X_test)
confusion = confusion_matrix(y_test, lr_predicted)
print('Logistic regression classifier (default settings)\n', confusion)
```

In [45]:

```
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
tree_predicted = dt.predict(X_test)
confusion = confusion_matrix(y_test, tree_predicted)
print('Decision tree classifier (max_depth = 2)\n', confusion)
```

**There is often a tradeoff between precision and recall**

*Recall-oriented* machine learning tasks:

- Search and information extraction in legal discovery
- Tumor detection
- Often paired with a human expert to filter out false positives

*Precision-oriented* machine learning tasks:

- Search engine ranking, query suggestion
- Document classification

**$F_1$-score: combining precision & recall into a single number**

When evaluating classifiers, it's often convenient to compute a quantity known as the $F_1$-score, which combines `precision` and `recall` into a single number. Mathematically, it is the **harmonic mean** of `precision` and `recall`:

$$F_1=2\cdot \frac{Precision\cdot Recall}{Precision+Recall} = \frac {2\cdot TP}{2\cdot TP+FN+FP}$$

The more general $F_{\beta}$-score weights the two quantities unequally:

$$F_{\beta}=(1+\beta^2)\cdot \frac{Precision\cdot Recall}{(\beta^2\cdot Precision)+Recall} = \frac {(1+\beta^2)\cdot TP}{(1+\beta^2)\cdot TP+\beta^2 \cdot FN+FP}$$

$\beta$ allows adjustment of the metric to control the emphasis on recall vs precision:

- **Precision-oriented users**: $\beta = 0.5$ (false positives hurt performance more than false negatives)
- **Recall-oriented users**: $\beta = 2$ (false negatives hurt performance more than false positives)
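As a worked check, the sketch below computes $F_{\beta}$ both from precision and recall and directly from the counts, with $\beta^2$ weighting the false negatives in the count form. The counts `tp`, `fp`, `fn` are assumed numbers chosen for illustration, and the two helper functions are written here rather than taken from a library.

```python
def f_beta_from_rates(tp, fp, fn, beta=1.0):
    """F-beta from precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def f_beta_from_counts(tp, fp, fn, beta=1.0):
    """Equivalent F-beta written directly in terms of TP, FN and FP."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


# Assumed counts: precision = 30/40 = 0.75, recall = 30/60 = 0.50.
tp, fp, fn = 30, 10, 30

f_scores = {}
for beta in (0.5, 1.0, 2.0):
    from_rates = f_beta_from_rates(tp, fp, fn, beta)
    from_counts = f_beta_from_counts(tp, fp, fn, beta)
    f_scores[beta] = (from_rates, from_counts)
    print("beta = {:.1f}: {:.4f} (rates), {:.4f} (counts)".format(beta, from_rates, from_counts))
```

Because precision is higher than recall in this example, the precision-oriented $\beta = 0.5$ gives the largest score (about 0.68), $F_1$ sits in the middle (0.60), and the recall-oriented $\beta = 2$ gives the smallest (about 0.54).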

In [46]:

```
# Once a model is built, we can get its accuracy, precision, recall, F1 and so on
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Accuracy = (TP + TN) / (TP + TN + FP + FN)
# Precision = TP / (TP + FP)
# Recall = TP / (TP + FN) Also known as sensitivity, or True Positive Rate
# F1 = 2 * Precision * Recall / (Precision + Recall)
print('Accuracy: {:.2f}'.format(accuracy_score(y_test, tree_predicted)))
print('Precision: {:.2f}'.format(precision_score(y_test, tree_predicted)))
print('Recall: {:.2f}'.format(recall_score(y_test, tree_predicted)))
print('F1: {:.2f}'.format(f1_score(y_test, tree_predicted)))
```

In [47]:

```
# Combined report with all above metrics
from sklearn.metrics import classification_report
print(classification_report(y_test, tree_predicted, target_names=['not 1', '1']))
# the last column "support" shows the number of instances in the test set that have true label
```

In [48]:

```
print('Random class-proportional (dummy)\n',
classification_report(y_test, y_classprop_predicted, target_names=['not 1', '1']))
print('SVM\n',
classification_report(y_test, svm_predicted, target_names = ['not 1', '1']))
print('Logistic regression\n',
classification_report(y_test, lr_predicted, target_names = ['not 1', '1']))
print('Decision tree\n',
classification_report(y_test, tree_predicted, target_names = ['not 1', '1']))
```

Many classifiers in scikit-learn can provide information about the uncertainty associated with a particular prediction, either through the **decision function** method (`decision_function`) or the **predicted probability** method (`predict_proba`).

- Each classifier score value per test point indicates how confidently the classifier predicts the positive class (large-magnitude positive values) or the negative class (large-magnitude negative values).
- Choosing a fixed decision threshold gives a classification rule.
- By sweeping the decision threshold through the entire range of possible score values, we get a series of classification outcomes that form a curve.

In [49]:

```
X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)
y_scores_lr = lr.fit(X_train, y_train).decision_function(X_test)
y_score_list = list(zip(y_test[0:20], y_scores_lr[0:20]))
# The instances in the negative class often have large-magnitude negative scores, while the instances in
# the positive class have positive scores from the logistic regression classifier.
# Show the decision_function scores for the first 20 instances
y_score_list
```


**Predicted Probability of Class Membership (`predict_proba`)**

- Typical rule: choose the most likely class
  - e.g. class 1 if the predicted probability is > 0.50
- Adjusting the threshold affects the predictions of the classifier
- A higher threshold results in a more conservative classifier
  - e.g. only predict class 1 if the estimated probability of class 1 is above 70%
  - This typically increases precision: the classifier doesn't predict class 1 as often, but when it does, it gets a high proportion of class 1 instances correct
- Not all models provide realistic probability estimates
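To make the effect of the threshold concrete, here is a minimal sketch that compares a 0.50 and a 0.70 probability threshold for the positive class. The labels and probabilities are invented for illustration (they are not output from the notebook's classifier), and `precision_recall_at` is a helper defined here.

```python
# Assumed toy data: true labels and predicted probabilities of the positive class.
y_true = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
proba = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.30, 0.20, 0.10, 0.05]


def precision_recall_at(threshold):
    """Predict positive when the probability exceeds the threshold, then score."""
    predicted = [1 if p > threshold else 0 for p in proba]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, predicted))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, predicted))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, predicted))
    return tp / (tp + fp), tp / (tp + fn)


precision_50, recall_50 = precision_recall_at(0.50)
precision_70, recall_70 = precision_recall_at(0.70)

print("threshold 0.50: precision = {:.2f}, recall = {:.2f}".format(precision_50, recall_50))
print("threshold 0.70: precision = {:.2f}, recall = {:.2f}".format(precision_70, recall_70))
```

On this toy data the higher threshold raises precision from 0.80 to 1.00 and lowers recall from 1.00 to 0.75, which is the conservative behaviour described in the list above.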

In [50]:

```
X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)
y_proba_lr = lr.fit(X_train, y_train).predict_proba(X_test)
y_proba_list = list(zip(y_test[0:20], y_proba_lr[0:20,1]))
# show the probability of positive class for first 20 instances
y_proba_list
# you can see that many entries with a positive label of one, have a high probability like 0.995. While many negative label
# instances have a very low prediction probability
```


**Note that** not all models provide useful probability estimates of this type. For example, a model that was overfit to a training set might provide overly optimistic high probabilities that were in fact not accurate.

**Varying the Decision Threshold**

We can use these decision scores or prediction probabilities to get a more complete picture of a classifier's performance. For a particular application, we might pick a specific decision threshold depending on whether we want the classifier to be more or less conservative about making false-positive or false-negative errors. When developing a new model, it might not be entirely clear what the right decision threshold would be, or how that choice will affect evaluation metrics like *precision* and *recall*. So instead, we'll look at how the classifier performs for all possible decision thresholds.

Consider a list of test instances with their true labels and classifier scores, sorted by score. If we set a decision threshold, say -20, it partitions the list: the instances with scores below -20 are predicted to be in the negative class, and the instances with scores above -20 are predicted to be in the positive class.

Having picked a specific threshold, in this case -20, and partitioned the test points this way, we can compute *precision* and *recall* for the points that are predicted to be in the positive class. In this case, 12 instances are predicted as positive and only four of them are actually positive, so the precision at this threshold is 4/12 ≈ 0.33.

In [67]:

```
from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(y_test, y_scores_lr)
closest_zero = np.argmin(np.abs(thresholds))
closest_zero_p = precision[closest_zero]
closest_zero_r = recall[closest_zero]
plt.figure()
plt.xlim([0.0, 1.01])
plt.ylim([0.0, 1.01])
plt.plot(precision, recall, label='Precision-Recall Curve')
plt.plot(closest_zero_p, closest_zero_r, 'o', markersize = 12, fillstyle = 'none', c='r', mew=3)
plt.xlabel('Precision', fontsize=16)
plt.ylabel('Recall', fontsize=16)
plt.gca().set_aspect('equal')  # gca() reuses the current axes; plt.axes() creates new axes in recent matplotlib
plt.show()
```

*The red circle* indicates the precision and recall achieved when the decision threshold is zero.

In this particular case you can also see that it's not exactly a smooth curve. There are some jagged areas, and in fact the jumps tend to get a little bigger as we approach maximum precision. This is a consequence of how precision and recall are computed: the formulas use discrete counts that include the number of true positives.

As the decision threshold increases, there are fewer and fewer points that remain as positive predictions, so the fractions computed from these smaller numbers can change quite dramatically with small changes in the decision threshold. That's why the trailing edges of a **precision-recall curve** can appear a bit jagged when you plot them.
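The threshold sweep behind a precision-recall curve can be sketched in plain Python. The six labels and scores below are assumed toy values, and `precision_recall_points` is an illustrative helper, not scikit-learn's `precision_recall_curve` (which handles ties and end points in its own way).

```python
# Assumed toy data: true labels and decision-function style scores.
y_true = [0, 0, 1, 0, 1, 1]
scores = [-3.0, -1.0, -0.5, 0.5, 1.5, 4.0]


def precision_recall_points(y_true, scores):
    """Return (threshold, precision, recall) for every distinct score used as a threshold."""
    points = []
    for threshold in sorted(set(scores)):
        predicted_positive = [s >= threshold for s in scores]
        tp = sum(p and t == 1 for p, t in zip(predicted_positive, y_true))
        fp = sum(p and t == 0 for p, t in zip(predicted_positive, y_true))
        fn = sum((not p) and t == 1 for p, t in zip(predicted_positive, y_true))
        points.append((threshold, tp / (tp + fp), tp / (tp + fn)))
    return points


points = precision_recall_points(y_true, scores)
for threshold, precision, recall in points:
    print("threshold {:5.1f}: precision = {:.3f}, recall = {:.3f}".format(
        threshold, precision, recall))
```

With only three predicted positives left, moving the threshold from 0.5 to 1.5 drops a single false positive and precision jumps from 0.667 to 1.000. Jumps of this kind, caused by small discrete counts, are what make the high-precision end of the curve look jagged.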

**ROC curves**, or *receiver operating characteristic curves*, are a very widely used visualization method that illustrates the performance of a binary classifier.

**X-axis**: False Positive Rate
**Y-axis**: True Positive Rate

**Top left corner**:

- The "ideal" point
- False positive rate of zero
- True positive rate of one

**"Steepness" of ROC curves is important**:

- Maximize the true positive rate
- while minimizing the false positive rate

In [68]:

```
from sklearn.metrics import roc_curve, auc
X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)
y_score_lr = lr.fit(X_train, y_train).decision_function(X_test)
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_score_lr)
roc_auc_lr = auc(fpr_lr, tpr_lr)
plt.figure()
plt.xlim([-0.01, 1.00])
plt.ylim([-0.01, 1.01])
plt.plot(fpr_lr, tpr_lr, lw=3, label='LogRegr ROC curve (area = {:0.2f})'.format(roc_auc_lr))
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve (1-of-10 digits classifier)', fontsize=16)
plt.legend(loc='lower right', fontsize=13)
plt.plot([0, 1], [0, 1], color='navy', lw=3, linestyle='--')
plt.gca().set_aspect('equal')  # gca() reuses the current axes; plt.axes() creates new axes in recent matplotlib
plt.show()
```

The dotted line here is the curve that results from a classifier that randomly guesses the label for a binary class. It's basically like flipping a coin: if you have two classes with equal numbers of positive and negative instances, then flipping a coin will give you roughly equal numbers of false positives and true positives for large data sets.

So the dotted line is used as a baseline. A bad classifier will have performance that is random, maybe even worse than random, or only slightly better than random. A reasonably good classifier will give an ROC curve that is consistently better than random across all decision threshold choices. An excellent classifier would be one like the one shown here (`LogRegr ROC curve`), which in this example is a logistic regression classifier.

The shape of the curve can be important as well. Looking at its steepness, we want classifiers that maximize the true positive rate while minimizing the false positive rate. We can also quantify the goodness of a classifier, in some sense, by looking at how much area there is underneath the curve. The area underneath the random classifier's curve is 0.5. We use `AUC`, the single number that measures the total area underneath the `ROC` curve, as a way to summarize a classifier's performance.
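The area computation can be sketched with the trapezoid rule over the ROC points. The labels and scores are assumed toy values, and `roc_points` and `auc_trapezoid` are helpers written for this illustration; they are a simplified stand-in for what `roc_curve` and `auc` do, not a reimplementation of them.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by lowering the threshold through every distinct score."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Assumed toy data: true labels and classifier scores.
y_true = [0, 0, 1, 0, 1, 1]
scores = [-3.0, -1.0, -0.5, 0.5, 1.5, 4.0]

classifier_auc = auc_trapezoid(roc_points(y_true, scores))
random_auc = auc_trapezoid([(0.0, 0.0), (1.0, 1.0)])  # the diagonal baseline

print("Toy classifier AUC: {:.3f}".format(classifier_auc))  # 0.889
print("Random baseline AUC: {:.3f}".format(random_auc))     # 0.500
```

The toy AUC of 8/9 also equals the fraction of (positive, negative) pairs in which the positive instance receives the higher score, which is a common way to interpret the number.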

In [69]:

```
from matplotlib import cm
X_train, X_test, y_train, y_test = train_test_split(X, y_binary_imbalanced, random_state=0)
plt.figure()
plt.xlim([-0.01, 1.00])
plt.ylim([-0.01, 1.01])
for g in [0.01, 0.1, 0.20, 1]:
    svm = SVC(gamma=g).fit(X_train, y_train)
    y_score_svm = svm.decision_function(X_test)
    fpr_svm, tpr_svm, _ = roc_curve(y_test, y_score_svm)
    roc_auc_svm = auc(fpr_svm, tpr_svm)
    accuracy_svm = svm.score(X_test, y_test)
    print("gamma = {:.2f} accuracy = {:.2f} AUC = {:.2f}".format(g, accuracy_svm,
                                                                 roc_auc_svm))
    plt.plot(fpr_svm, tpr_svm, lw=3, alpha=0.7,
             label='SVM (gamma = {:0.2f}, area = {:0.2f})'.format(g, roc_auc_svm))
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate (Recall)', fontsize=16)
plt.plot([0, 1], [0, 1], color='k', lw=0.5, linestyle='--')
plt.legend(loc="lower right", fontsize=11)
plt.title('ROC curve: (1-of-10 digits classifier)', fontsize=16)
plt.gca().set_aspect('equal')  # gca() reuses the current axes
plt.show()
```

- Multi-class evaluation is an extension of the binary case
  - A collection of true vs predicted binary outcomes, one per class
  - Confusion matrices are especially useful
  - Classification report
- Overall evaluation metrics are averages across classes
  - But there are different ways to average multi-class results
  - The support (number of instances) for each class is important to consider, e.g. in case of imbalanced classes
- Multi-label classification: each instance can have multiple labels (not covered here)

The multi-class confusion matrix is a straightforward extension of the binary classifier 2X2 confusion matrix.

As a general rule of thumb as part of model evaluation, I suggest always looking at the confusion matrix for your classifier, to get some insight into what kind of errors it is making for each class including whether some classes are much more prone to certain kinds of errors than others
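To show what such a matrix contains, here is a small sketch that builds one by hand for three classes. The label lists are invented for illustration, and `confusion_counts` is a helper defined here; it follows the same layout as the notebook's heatmaps, with rows for the true label and columns for the predicted label.

```python
def confusion_counts(y_true, y_pred, labels):
    """matrix[i][j] = number of instances with true label labels[i] predicted as labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


# Assumed toy labels for a three-class problem.
labels = [0, 1, 2]
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 2, 2, 1, 2]

matrix = confusion_counts(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    # Per-class recall is the diagonal entry divided by the row total.
    recall = row[label] / sum(row)
    print("true {}: {}  recall = {:.2f}".format(label, row, recall))
```

Off-diagonal entries show which classes get confused with each other: here class 1 is sometimes predicted as class 2 and vice versa, while class 0 is never misclassified.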

In [70]:

```
dataset = load_digits()
X, y = dataset.data, dataset.target
X_train_mc, X_test_mc, y_train_mc, y_test_mc = train_test_split(X, y, random_state=0)
svm = SVC(kernel = 'linear').fit(X_train_mc, y_train_mc)
svm_predicted_mc = svm.predict(X_test_mc)
confusion_mc = confusion_matrix(y_test_mc, svm_predicted_mc)
df_cm = pd.DataFrame(confusion_mc,
index = [i for i in range(0,10)], columns = [i for i in range(0,10)])
plt.figure(figsize=(5.5,4))
sns.heatmap(df_cm, annot=True)
plt.title('SVM Linear Kernel \nAccuracy:{0:.3f}'.format(accuracy_score(y_test_mc,
svm_predicted_mc)))
plt.ylabel('True label')
plt.xlabel('Predicted label')
svm = SVC(kernel = 'rbf').fit(X_train_mc, y_train_mc)
svm_predicted_mc = svm.predict(X_test_mc)
confusion_mc = confusion_matrix(y_test_mc, svm_predicted_mc)
df_cm = pd.DataFrame(confusion_mc, index = [i for i in range(0,10)],
columns = [i for i in range(0,10)])
plt.figure(figsize = (5.5,4))
sns.heatmap(df_cm, annot=True)
plt.title('SVM RBF Kernel \nAccuracy:{0:.3f}'.format(accuracy_score(y_test_mc,
svm_predicted_mc)))
plt.ylabel('True label')
plt.xlabel('Predicted label');
```

Get a classification report that summarizes multiple evaluation metrics for a multi-class classifier with an average metric computed for each class

In [55]:

```
print(classification_report(y_test_mc, svm_predicted_mc))
```

- If the classes have about the same number of instances, macro- and micro-average will be about the same
- If some classes are much larger (more instances) than others, and you want to:
  - Weight your metric toward the largest ones, use **micro average**
  - Weight your metric toward the smallest ones, use **macro average**
- If the micro-average is much lower than the macro-average, then examine the larger classes for poor metric performance. The model is probably misclassifying the frequent labels more than the infrequent labels.
- If the macro-average is much lower than the micro-average, then examine the smaller classes for poor metric performance. The model is probably misclassifying the infrequent labels more than the frequent labels.
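The difference between the two averages is easiest to see with numbers. The sketch below uses assumed per-class counts of true positives and false positives for one frequent and one rare class; the class names and counts are hypothetical.

```python
# Hypothetical per-class counts for a two-class imbalanced problem.
counts = {
    "frequent": {"tp": 900, "fp": 100},  # precision 0.90
    "rare": {"tp": 1, "fp": 9},          # precision 0.10
}

# Macro average: compute precision per class, then average, so each class counts equally.
per_class_precision = {
    name: c["tp"] / (c["tp"] + c["fp"]) for name, c in counts.items()
}
macro_precision = sum(per_class_precision.values()) / len(per_class_precision)

# Micro average: pool the counts first, so each instance counts equally.
total_tp = sum(c["tp"] for c in counts.values())
total_fp = sum(c["fp"] for c in counts.values())
micro_precision = total_tp / (total_tp + total_fp)

print("Macro-averaged precision = {:.3f}".format(macro_precision))  # 0.500
print("Micro-averaged precision = {:.3f}".format(micro_precision))  # 0.892
```

The macro average is far below the micro average here, which points at the rare class as the one with poor precision.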

In [56]:

```
print('Micro-averaged precision = {:.2f} (treat instances equally)'
.format(precision_score(y_test_mc, svm_predicted_mc, average = 'micro')))
print('Macro-averaged precision = {:.2f} (treat classes equally)'
.format(precision_score(y_test_mc, svm_predicted_mc, average = 'macro')))
```

In [57]:

```
print('Micro-averaged f1 = {:.2f} (treat instances equally)'
.format(f1_score(y_test_mc, svm_predicted_mc, average = 'micro')))
print('Macro-averaged f1 = {:.2f} (treat classes equally)'
.format(f1_score(y_test_mc, svm_predicted_mc, average = 'macro')))
```

- Typically `r2_score` is enough. It is usually positive, but it can go negative for bad model fits, such as when fitting non-linear functions to data.
  - Reminder: it measures how well future instances will be predicted
  - Best possible score is 1.0
  - A constant prediction of the mean target value, made without looking at the input (as `DummyRegressor` does), scores 0.0
- Alternative metrics include:
  - `mean_absolute_error` (absolute difference of target & predicted values): in machine learning terms this corresponds to the expected value of the *L1 norm loss*. It is sometimes used, for example, to assess forecast outcomes for regression in time series analysis.
  - `mean_squared_error` (squared difference of target & predicted values): this corresponds to the expected value of the *L2 norm loss*. It is widely used for many regression problems. Like `mean_absolute_error`, `mean_squared_error` doesn't distinguish between over- and under-estimates.
  - `median_absolute_error` (robust to outliers): when ignoring outliers is important, you can use the `median_absolute_error` score, which is robust to the presence of outliers because it uses the *median* of the error distribution rather than the *mean*.
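These metrics are simple enough to compute by hand, which makes their behaviour easy to inspect. The sketch below uses invented target and prediction values and helper functions written here with the standard library; the constant baseline predicts the mean of the same targets being scored, which is what makes its $R^2$ exactly 0.

```python
from statistics import mean, median


def r2(y_true, y_pred):
    """R-squared: 1 - (residual sum of squares / total sum of squares)."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def absolute_errors(y_true, y_pred):
    return [abs(t - p) for t, p in zip(y_true, y_pred)]


# Assumed toy targets; the last one acts as an outlier.
y_true = [1.0, 2.0, 3.0, 4.0, 10.0]
y_model = [1.5, 2.0, 2.5, 4.0, 6.0]       # reasonable fit that misses the outlier
y_constant = [mean(y_true)] * len(y_true)  # constant prediction of the mean target
y_bad = [10.0, 4.0, 3.0, 2.0, 1.0]         # trend in the wrong direction

mae = mean(absolute_errors(y_true, y_model))
mse = mean(e ** 2 for e in absolute_errors(y_true, y_model))
medae = median(absolute_errors(y_true, y_model))

print("MAE = {:.2f}, MSE = {:.2f}, median AE = {:.2f}".format(mae, mse, medae))
print("R2 (model)    = {:.2f}".format(r2(y_true, y_model)))
print("R2 (constant) = {:.2f}".format(r2(y_true, y_constant)))
print("R2 (bad fit)  = {:.2f}".format(r2(y_true, y_bad)))
```

The single large error on the outlier dominates the MSE (3.30) and pulls the MAE up to 1.00, while the median absolute error stays at 0.50. The bad fit gets a negative $R^2$ because it does worse than the constant baseline.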

**Dummy Regressors**

The `DummyRegressor` class implements four simple baseline rules for regression, using the `strategy` parameter:

- `mean`: predicts the mean of the training target values
- `median`: predicts the median of the training target values
- `quantile`: predicts a user-provided quantile of the training values (e.g. value at the $75^{th}$ percentile)
- `constant`: predicts a custom constant value provided by the user

Although regression typically has simpler evaluation needs than classification, it does pay to double-check that the evaluation metric you choose for a regression problem penalizes errors in a way that reflects the consequences of those errors for the business, organizational, or user needs of your application.

In [63]:

```
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.dummy import DummyRegressor
diabetes = datasets.load_diabetes()
X = diabetes.data[:, None, 6]
y = diabetes.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
lm = LinearRegression().fit(X_train, y_train)
lm_dummy_mean = DummyRegressor(strategy = 'mean').fit(X_train, y_train)
y_predict = lm.predict(X_test)
y_predict_dummy_mean = lm_dummy_mean.predict(X_test)
print('Linear model, coefficients: ', lm.coef_)
print("Mean squared error (dummy): {:.2f}".format(mean_squared_error(y_test,
y_predict_dummy_mean)))
print("Mean squared error (linear model): {:.2f}".format(mean_squared_error(y_test, y_predict)))
print("r2_score (dummy): {:.2f}".format(r2_score(y_test, y_predict_dummy_mean)))
print("r2_score (linear model): {:.2f}".format(r2_score(y_test, y_predict)))
# Plot outputs
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, y_predict, color='green', linewidth=2)
plt.plot(X_test, y_predict_dummy_mean, color='red', linestyle = 'dashed',
linewidth=2, label = 'dummy')
plt.show()
```

Beyond the essential evaluation metrics covered in this course, online controlled experiments, which involve A-B testing and other techniques, are perhaps the most important way that machine learning algorithms are evaluated for real-world use in Web and other online applications. This article by Ron Kohavi, Randal Henne, and Dan Sommerfield, reviews the key points of running controlled experiments, along with important engineering issues and limitations to keep in mind when using them.

Kohavi, R., Henne, R. M., & Sommerfield, D. (2007). Practical guide to controlled experiments on the web. Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '07. doi:10.1145/1281192.1281295

- Train/test on same data
  - Single metric
  - Typically overfits and likely won't generalize well to new data
  - But can serve as a sanity check: low accuracy on the training set may indicate an implementation problem
- Single train/test split
  - Single metric
  - Speed and simplicity
  - Lack of variance information
- K-fold cross-validation
  - K train-test splits
  - Average metric over all splits
  - Can be combined with parameter grid search: `GridSearchCV` (default `cv=5` as of scikit-learn 0.22; earlier versions used `cv=3`)
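The mechanics of K-fold cross-validation can be sketched without any library. In this illustration the labels are made up, the "model" is just a majority-class predictor fitted on each training split, and `k_fold_indices` is a helper written here that uses contiguous, unshuffled folds (scikit-learn's `cross_val_score` uses stratified folds for classifiers by default, so its splits would differ).

```python
from collections import Counter


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


# Assumed toy labels for ten instances.
y = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]

fold_scores = []
for train_idx, test_idx in k_fold_indices(len(y), k=3):
    # "Fit": find the majority label in the training split only.
    majority = Counter(y[i] for i in train_idx).most_common(1)[0][0]
    # "Evaluate": accuracy of always predicting that label on the held-out split.
    correct = sum(y[i] == majority for i in test_idx)
    fold_scores.append(correct / len(test_idx))

mean_score = sum(fold_scores) / len(fold_scores)
print("Per-fold accuracy:", ["{:.2f}".format(s) for s in fold_scores])
print("Mean accuracy: {:.2f}".format(mean_score))
```

The per-fold scores (0.75, 0.67, 1.00) differ noticeably from each other, which is the variance information a single train/test split cannot provide.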

In [59]:

```
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
dataset = load_digits()
# again, making this a binary problem with 'digit 1' as positive class
# and 'not 1' as negative class
X, y = dataset.data, dataset.target == 1
clf = SVC(kernel='linear', C=1)
# accuracy is the default scoring metric
print('Cross-validation (accuracy)', cross_val_score(clf, X, y, cv=5))
# use AUC as scoring metric
print('Cross-validation (AUC)', cross_val_score(clf, X, y, cv=5, scoring = 'roc_auc'))
# use recall as scoring metric
print('Cross-validation (recall)', cross_val_score(clf, X, y, cv=5, scoring = 'recall'))
```

In [60]:

```
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
dataset = load_digits()
X, y = dataset.data, dataset.target == 1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = SVC(kernel='rbf')
grid_values = {'gamma': [0.001, 0.01, 0.05, 0.1, 1, 10, 100]}
# default metric to optimize over grid parameters: accuracy
grid_clf_acc = GridSearchCV(clf, param_grid = grid_values)
grid_clf_acc.fit(X_train, y_train)
y_decision_fn_scores_acc = grid_clf_acc.decision_function(X_test)
print('Grid best parameter (max. accuracy): ', grid_clf_acc.best_params_)
print('Grid best score (accuracy): ', grid_clf_acc.best_score_)
# alternative metric to optimize over grid parameters: AUC
grid_clf_auc = GridSearchCV(clf, param_grid = grid_values, scoring = 'roc_auc')
grid_clf_auc.fit(X_train, y_train)
y_decision_fn_scores_auc = grid_clf_auc.decision_function(X_test)
print('Test set AUC: ', roc_auc_score(y_test, y_decision_fn_scores_auc))
print('Grid best parameter (max. AUC): ', grid_clf_auc.best_params_)
print('Grid best score (AUC): ', grid_clf_auc.best_score_)
```

You can see the complete list of names for the evaluation metrics supported by the `scoring` parameter by running the following code, which lists the scorer names registered in `sklearn.metrics`. The list includes metrics for classification, such as the string `precision_micro`, which represents micro-averaged precision, as well as metrics for regression, such as `r2` for the R-squared regression score.

In [61]:

```
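# The SCORERS dict (sklearn.metrics.scorer) is not available in recent scikit-learn
# releases; get_scorer_names() returns the valid `scoring` strings instead.
from sklearn.metrics import get_scorer_names
print(sorted(get_scorer_names()))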
```

Here is a specific example of how a classifier's decision boundary changes when it's optimized for different evaluation metrics. This classification problem is based on the same binary digit classifier training and test sets we've been using as an example throughout the notebook.

In these classification visualization examples, the positive examples (the digit one) are shown as black points, and the region of positive class prediction is shown in the light-colored or yellow region to the right of the decision boundary. The negative examples (all other digits) are shown as white points, and the region of negative class prediction in these figures is to the left of the decision boundary. The data points have been plotted using two out of the 64 feature values in the digits dataset and have been jittered a little.

In [9]:

```
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from adspy_shared_utilities import plot_class_regions_for_classifier_subplot
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
dataset = load_digits()
X, y = dataset.data, dataset.target == 1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Create a two-feature input vector matching the example plot above
# We jitter the points (add a small amount of random noise) in case there are areas
# in feature space where many instances have the same features.
jitter_delta = 0.25
X_twovar_train = X_train[:,[20,59]]+ np.random.rand(X_train.shape[0], 2) - jitter_delta
X_twovar_test = X_test[:,[20,59]] + np.random.rand(X_test.shape[0], 2) - jitter_delta
# We apply `GridSearchCV` here to explore different values of the optional class weight parameter that contols how much
# weight is given to each of the two classes during training
clf = SVC(kernel = 'linear').fit(X_twovar_train, y_train)
grid_values = {'class_weight':['balanced', {1:2},{1:3},{1:4},{1:5},{1:10},{1:20},{1:50}]}
plt.figure(figsize=(9,6))
for i, eval_metric in enumerate(('precision','recall', 'f1','roc_auc')):
    grid_clf_custom = GridSearchCV(clf, param_grid=grid_values, scoring=eval_metric)
    grid_clf_custom.fit(X_twovar_train, y_train)
    print('Grid best parameter (max. {0}): {1}'
          .format(eval_metric, grid_clf_custom.best_params_))
    print('Grid best score ({0}): {1}'
          .format(eval_metric, grid_clf_custom.best_score_))
    plt.subplots_adjust(wspace=0.3, hspace=0.3)
    plot_class_regions_for_classifier_subplot(grid_clf_custom, X_twovar_test, y_test, None,
                                              None, None, plt.subplot(2, 2, i+1))
    plt.title(eval_metric+'-oriented SVC')
plt.tight_layout()
plt.show()
```