Applied Machine Learning: Module 2 (Supervised Learning, Part I)

Classification and Regression

  • Both classification and regression take a set of training instances and learn a mapping to a target value
  • For classification, the target value is a discrete class value

    • Binary: target value 0 (negative class) or 1 (positive class)
      • e.g. detecting a fraudulent credit card transaction
    • Multi-class: target value is one of a set of discrete values
      • e.g. labelling the type of fruit from physical attributes
    • Multi-label: there are multiple target values (labels)
      • e.g. labelling the topics discussed on a Web page
  • For regression, the target value is continuous (floating point / real-valued)

    • e.g. predicting the selling price of a house from its attributes
  • Looking at the target value's type will guide you on what supervised learning method to use
  • Many supervised learning methods have 'flavors' for both classification and regression

Supervised learning methods: Overview

  • To start with, we'll look at two simple but powerful prediction algorithms:
    • K-nearest neighbors
    • Linear model fit using least-squares
  • These represent two complementary approaches to supervised learning:
    • K-nearest neighbors makes few assumptions about the structure of the data and gives potentially accurate but sometimes unstable predictions (sensitive to small changes in the training data)
    • Linear models make strong assumptions about the structure of the data and give stable but potentially inaccurate predictions.

What is a Model?

It's a specific mathematical or computational description that expresses the relationship between a set of input variables and one or more outcome variables that are being studied or predicted. In statistical terms the input variables are called independent variables and the outcome variables are termed dependent variables. In machine learning we use the term features to refer to the input (independent) variables, and target value or target label to refer to the output (dependent) variables.


Generalization, Overfitting, and Underfitting

  • Generalization ability refers to an algorithm's ability to give accurate predictions for new, previously unseen data
  • Assumptions:
    • Future unseen data (the test set) will have the same properties as the current training set, or more technically, will be drawn from the same underlying distribution as the training set.
    • Thus, models that are accurate on the training set are expected to be accurate on the test set.
    • But that may not happen if the trained model is tuned too specifically to the training set.
  • Models that are too complex for the amount of training data available are said to overfit and are not likely to generalize well to new examples.
  • Models that are too simple, that don't even do well on the training data, are said to underfit and are also not likely to generalize well (a short sketch contrasting the two follows this list)
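A minimal sketch (not from the lecture; the dataset and variable names here are illustrative) of how the gap between training and test accuracy reveals over- and underfitting, using k-NN classifiers with different values of k:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=2, n_redundant=0,
                                     n_informative=2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

for k in [1, 5, 55]:
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # A large gap between train and test accuracy suggests overfitting (k too small);
    # low accuracy on both suggests underfitting (k too large for this dataset size).
    print('k={:>2}  train acc={:.2f}  test acc={:.2f}'
          .format(k, clf.score(X_tr, y_tr), clf.score(X_te, y_te)))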

Preamble and Review

In [9]:
%matplotlib notebook
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

np.set_printoptions(precision=2)


fruits = pd.read_table('fruit_data_with_colors.txt')

feature_names_fruits = ['height', 'width', 'mass', 'color_score']
X_fruits = fruits[feature_names_fruits]
y_fruits = fruits['fruit_label']
target_names_fruits = ['apple', 'mandarin', 'orange', 'lemon']

X_fruits_2d = fruits[['height', 'width']]
y_fruits_2d = fruits['fruit_label']

X_train, X_test, y_train, y_test = train_test_split(X_fruits, y_fruits, random_state=0)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
# we must apply the same scaling to the test set, using the scaler fit on the training set
X_test_scaled = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train_scaled, y_train)
print('Accuracy of K-NN classifier on training set: {:.2f}'
     .format(knn.score(X_train_scaled, y_train)))
print('Accuracy of K-NN classifier on test set: {:.2f}'
     .format(knn.score(X_test_scaled, y_test)))

example_fruit = [[5.5, 2.2, 10, 0.70]]
example_fruit_scaled = scaler.transform(example_fruit)
print('Predicted fruit type for ', example_fruit, ' is ', 
          target_names_fruits[knn.predict(example_fruit_scaled)[0]-1])
Accuracy of K-NN classifier on training set: 0.95
Accuracy of K-NN classifier on test set: 1.00
Predicted fruit type for  [[5.5, 2.2, 10, 0.7]]  is  mandarin

Datasets

In [11]:
from sklearn.datasets import make_classification, make_blobs
from matplotlib.colors import ListedColormap
from sklearn.datasets import load_breast_cancer
from adspy_shared_utilities import load_crime_dataset

cmap_bold = ListedColormap(['#FFFF00', '#00FF00', '#0000FF','#000000'])


# synthetic dataset for simple regression
from sklearn.datasets import make_regression
plt.figure()
plt.title('Sample regression problem with one input variable')
X_R1, y_R1 = make_regression(n_samples = 100, n_features=1,
                            n_informative=1, bias = 150.0,
                            noise = 30, random_state=0)
plt.scatter(X_R1, y_R1, marker= 'o', s=50)
plt.show()


# synthetic dataset for more complex regression
from sklearn.datasets import make_friedman1
plt.figure()
plt.title('Complex regression problem with one input variable')
X_F1, y_F1 = make_friedman1(n_samples = 100,
                           n_features = 7, random_state=0)

plt.scatter(X_F1[:, 2], y_F1, marker= 'o', s=50)
plt.show()

# synthetic dataset for classification (binary) 
plt.figure()
plt.title('Sample binary classification problem with two informative features')
X_C2, y_C2 = make_classification(n_samples = 100, n_features=2,
                                n_redundant=0, n_informative=2,
                                n_clusters_per_class=1, flip_y = 0.1,
                                class_sep = 0.5, random_state=0)
plt.scatter(X_C2[:, 0], X_C2[:, 1], c=y_C2,
           marker= 'o', s=50, cmap=cmap_bold)
plt.show()


# more difficult synthetic dataset for classification (binary) 
# with classes that are not linearly separable
X_D2, y_D2 = make_blobs(n_samples = 100, n_features = 2, centers = 8,
                       cluster_std = 1.3, random_state = 4)
y_D2 = y_D2 % 2
plt.figure()
plt.title('Sample binary classification problem with non-linearly separable classes')
plt.scatter(X_D2[:,0], X_D2[:,1], c=y_D2,
           marker= 'o', s=50, cmap=cmap_bold)
plt.show()


# Breast cancer dataset for classification
cancer = load_breast_cancer()
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)


# Communities and Crime dataset
(X_crime, y_crime) = load_crime_dataset()

K-Nearest Neighbors

Classification

In [3]:
from adspy_shared_utilities import plot_two_class_knn

X_train, X_test, y_train, y_test = train_test_split(X_C2, y_C2,
                                                   random_state=0)

plot_two_class_knn(X_train, y_train, 1, 'uniform', X_test, y_test)
plot_two_class_knn(X_train, y_train, 3, 'uniform', X_test, y_test)
plot_two_class_knn(X_train, y_train, 11, 'uniform', X_test, y_test)

Regression

In [4]:
from sklearn.neighbors import KNeighborsRegressor

X_train, X_test, y_train, y_test = train_test_split(X_R1, y_R1, random_state = 0)

knnreg = KNeighborsRegressor(n_neighbors = 5).fit(X_train, y_train)

print(knnreg.predict(X_test))
print('R-squared test score: {:.3f}'
     .format(knnreg.score(X_test, y_test)))
[ 231.71  148.36  150.59  150.59   72.15  166.51  141.91  235.57  208.26
  102.1   191.32  134.5   228.32  148.36  159.17  113.47  144.04  199.23
  143.19  166.51  231.71  208.26  128.02  123.14  141.91]
R-squared test score: 0.425

The $R^2$ (R-squared) regression score

  • Measures how well a prediction model for regression fits the given data
  • The score is between 0 and 1:
    • A value of 0 corresponds to a constant model that predicts the mean value of all training target values.
    • A value of 1 corresponds to perfect prediction
  • Also known as "coefficient of determination" (a short worked example follows this list)
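A minimal worked example (illustrative values, not from the notebook) of what the $R^2$ score computes: one minus the ratio of the residual sum of squares to the total sum of squares around the mean; knnreg.score() above reports this same metric.

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
print('manual  R^2:', 1 - ss_res / ss_tot)       # 0.925
print('sklearn R^2:', r2_score(y_true, y_pred))  # same value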
In [5]:
fig, subaxes = plt.subplots(1, 2, figsize=(8,4))
X_predict_input = np.linspace(-3, 3, 50).reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X_R1[0::5], y_R1[0::5], random_state = 0)

for thisaxis, K in zip(subaxes, [1, 3]):
    knnreg = KNeighborsRegressor(n_neighbors = K).fit(X_train, y_train)
    y_predict_output = knnreg.predict(X_predict_input)
    thisaxis.set_xlim([-2.5, 0.75])
    thisaxis.plot(X_predict_input, y_predict_output, '^', markersize = 10,
                 label='Predicted', alpha=0.8)
    thisaxis.plot(X_train, y_train, 'o', label='True Value', alpha=0.8)
    thisaxis.set_xlabel('Input feature')
    thisaxis.set_ylabel('Target value')
    thisaxis.set_title('KNN regression (K={})'.format(K))
    thisaxis.legend()
plt.tight_layout()

Regression model complexity as a function of K

In [6]:
# plot k-NN regression on sample dataset for different values of K
fig, subaxes = plt.subplots(5, 1, figsize=(5,20))
X_predict_input = np.linspace(-3, 3, 500).reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X_R1, y_R1,
                                                   random_state = 0)

for thisaxis, K in zip(subaxes, [1, 3, 7, 15, 55]):
    knnreg = KNeighborsRegressor(n_neighbors = K).fit(X_train, y_train)
    y_predict_output = knnreg.predict(X_predict_input)
    train_score = knnreg.score(X_train, y_train)
    test_score = knnreg.score(X_test, y_test)
    thisaxis.plot(X_predict_input, y_predict_output)
    thisaxis.plot(X_train, y_train, 'o', alpha=0.9, label='Train')
    thisaxis.plot(X_test, y_test, '^', alpha=0.9, label='Test')
    thisaxis.set_xlabel('Input feature')
    thisaxis.set_ylabel('Target value')
    thisaxis.set_title('KNN Regression (K={})\n\
Train $R^2 = {:.3f}$,  Test $R^2 = {:.3f}$'
                      .format(K, train_score, test_score))
    thisaxis.legend()
    plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)

Sum up: KNeighborsClassifier and KNeighborsRegressor: important parameters

Model complexity:

  • n_neighbors: number of nearest neighbors (k) to consider
    • Default = 5

Model fitting:

  • metric: distance function between data points
    • Default: Minkowski distance with power parameter p = 2 (Euclidean); a short sketch setting these parameters follows this list
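A minimal sketch (the parameter values here are arbitrary) of setting these two parameters explicitly on a k-NN estimator:

from sklearn.neighbors import KNeighborsRegressor

# Defaults: n_neighbors=5, Minkowski distance with p=2 (Euclidean)
knn_default = KNeighborsRegressor()
# Explicit settings, e.g. 7 neighbors and p=1 (Manhattan distance)
knn_custom = KNeighborsRegressor(n_neighbors=7, metric='minkowski', p=1)
print(knn_default.get_params()['n_neighbors'], knn_custom.get_params()['p'])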

Linear models for regression

Linear regression

$y = w_0x + b$

$w_0$: linreg.coef_

$b$: linreg.intercept_

Note: If a scikit-learn object attribute ends with an underscore, it means the attribute was estimated from the training data, rather than being a quantity set by the user.

In [7]:
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X_R1, y_R1,
                                                   random_state = 0)
linreg = LinearRegression().fit(X_train, y_train)

print('linear model coeff (w): {}'
     .format(linreg.coef_))
print('linear model intercept (b): {:.3f}'
     .format(linreg.intercept_))
print('R-squared score (training): {:.3f}'
     .format(linreg.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'
     .format(linreg.score(X_test, y_test)))
linear model coeff (w): [ 45.71]
linear model intercept (b): 148.446
R-squared score (training): 0.679
R-squared score (test): 0.492

Linear regression: example plot

In [8]:
plt.figure(figsize=(5,4))
plt.scatter(X_R1, y_R1, marker= 'o', s=50, alpha=0.8)
plt.plot(X_R1, linreg.coef_ * X_R1 + linreg.intercept_, 'r-')
plt.title('Least-squares linear regression')
plt.xlabel('Feature value (x)')
plt.ylabel('Target value (y)')
plt.show()
In [9]:
X_train, X_test, y_train, y_test = train_test_split(X_crime, y_crime,
                                                   random_state = 0)
linreg = LinearRegression().fit(X_train, y_train)

print('Crime dataset')
print('linear model intercept: {}'
     .format(linreg.intercept_))
print('linear model coeff:\n{}'
     .format(linreg.coef_))
print('R-squared score (training): {:.3f}'
     .format(linreg.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'
     .format(linreg.score(X_test, y_test)))
Crime dataset
linear model intercept: 3861.708902399444
linear model coeff:
[  1.62e-03  -1.03e+02   1.61e+01  -2.94e+01  -1.92e+00  -1.47e+01
  -2.41e-03   1.46e+00  -1.46e-02  -1.08e+01   4.35e+01  -6.92e+00
   4.95e+00  -4.11e+00  -3.63e+00   8.98e-03   8.33e-03   4.84e-03
  -5.25e+00  -1.59e+01   7.47e+00   2.31e+00  -2.48e-01   1.22e+01
  -2.90e+00  -1.49e+00   4.96e+00   5.21e+00   1.82e+02   1.15e+01
   1.54e+02  -3.40e+02  -1.22e+02   2.75e+00  -2.87e+01   2.39e+00
   9.44e-01   3.18e+00  -1.17e+01  -5.46e-03   4.24e+01  -1.10e-03
  -9.23e-01   5.13e+00  -4.69e+00   1.13e+00  -1.70e+01  -5.00e+01
   5.64e+01  -2.94e+01   3.42e-01  -3.10e+01   2.89e+01  -5.46e+01
   6.75e+02   8.54e+01  -3.35e+02  -3.17e+01   2.96e+01   7.07e+00
   7.46e+01   2.01e-02  -3.96e-01   3.15e+01   1.00e+01  -1.60e+00
  -5.63e-01   2.82e+00  -2.96e+01   1.08e+11  -1.01e-03  -1.08e+11
   1.08e+11  -3.13e+08  -4.95e-01   3.13e+08  -3.13e+08   1.47e+00
  -2.78e+00   1.12e+00  -3.70e+01   1.09e-01   3.07e-01   2.06e+01
   9.24e-01  -6.05e-01  -1.92e+00   5.88e-01]
R-squared score (training): 0.668
R-squared score (test): 0.520

Ridge regression

  • Ridge regression learns w, b using the same least-squares criterion but adds a penalty for large variations in w parameters

$$RSS_{RIDGE}(w, b) = \sum_{i=1}^N(y_i - (w\cdot x_i + b))^2 + \alpha\sum_{j=1}^pw_j^2$$

  • Once the parameters are learned, the ridge regression prediction formula is the same as ordinary least-squares
  • The addition of a parameter penalty is called regularization. Regularization prevents overfitting by restricting the model, typically to reduce its complexity
  • Ridge regression uses L2 regularization: minimize sum of squares of w entries
  • The influence of the regularization term is controlled by the alpha parameter
  • Higher alpha means more regularization and simpler models.

Think for a moment, intuitively, about what ridge regression is doing. It regularizes the linear regression by imposing that sum-of-squares penalty on the size of the w coefficients. So the effect of increasing alpha is to shrink the w coefficients toward zero and toward each other. But if the input features have very different scales, then when this shrinkage happens, features with different scales will contribute very differently to the L2 penalty, because the penalty is a sum of squares over all the coefficients. Transforming the input features so they are all on the same scale means the ridge penalty is, in some sense, applied more fairly to all features, without unduly weighting some more than others.
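A minimal sketch (synthetic data, illustrative only) of the $RSS_{RIDGE}$ objective above: the ordinary least-squares data-fit term plus alpha times the sum of squared coefficients, along with a hint of the shrinkage effect of increasing alpha.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X_demo = rng.randn(50, 3)
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.randn(50) * 0.1

alpha = 1.0
ridge = Ridge(alpha=alpha).fit(X_demo, y_demo)
rss = np.sum((y_demo - ridge.predict(X_demo)) ** 2)   # data-fit term
penalty = alpha * np.sum(ridge.coef_ ** 2)            # L2 penalty term
print('RSS_RIDGE objective:', rss + penalty)
print('coefficients (alpha=1):  ', ridge.coef_)
print('coefficients (alpha=100):', Ridge(alpha=100).fit(X_demo, y_demo).coef_)  # shrunk toward zero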


The Need for Feature Normalization

  • Important for some machine learning methods that all features are on the same scale (e.g. faster convergence in learning, more uniform or 'fair' influence for all weights)
    • e.g. regularized regression, k-NN, support vector machines, neural networks,...
  • Can also depend on the data. More on feature engineering later in the course. For now, we use MinMax scaling of the features (a short sketch of the formula follows this list):
    • For each feature $x_i$: compute the min value $x_i^{MIN}$ and the max value $x_i^{MAX}$ achieved across all instances in the training set
    • For each feature: transform a given feature $x_i$ value to a scaled version $x_i^{'}$ using the formula $$x_i^{'}=\frac{(x_i-x_i^{MIN})}{x_i^{MAX}-x_i^{MIN}}$$
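A minimal sketch (made-up values) applying the MinMax formula above by hand and checking it against scikit-learn's MinMaxScaler:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train_demo = np.array([[1.0, 10.0],
                         [2.0, 20.0],
                         [4.0, 40.0]])
x_min = X_train_demo.min(axis=0)                     # per-feature minimum on the training set
x_max = X_train_demo.max(axis=0)                     # per-feature maximum on the training set
print((X_train_demo - x_min) / (x_max - x_min))      # the formula applied per feature
print(MinMaxScaler().fit_transform(X_train_demo))    # same result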
In [10]:
from sklearn.linear_model import Ridge
X_train, X_test, y_train, y_test = train_test_split(X_crime, y_crime,
                                                   random_state = 0)

linridge = Ridge(alpha=20.0).fit(X_train, y_train)

print('Crime dataset')
print('ridge regression linear model intercept: {}'
     .format(linridge.intercept_))
print('ridge regression linear model coeff:\n{}'
     .format(linridge.coef_))
print('R-squared score (training): {:.3f}'
     .format(linridge.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'
     .format(linridge.score(X_test, y_test)))
print('Number of non-zero features: {}'
     .format(np.sum(linridge.coef_ != 0)))
Crime dataset
ridge regression linear model intercept: -3352.4230358464793
ridge regression linear model coeff:
[  1.95e-03   2.19e+01   9.56e+00  -3.59e+01   6.36e+00  -1.97e+01
  -2.81e-03   1.66e+00  -6.61e-03  -6.95e+00   1.72e+01  -5.63e+00
   8.84e+00   6.79e-01  -7.34e+00   6.70e-03   9.79e-04   5.01e-03
  -4.90e+00  -1.79e+01   9.18e+00  -1.24e+00   1.22e+00   1.03e+01
  -3.78e+00  -3.73e+00   4.75e+00   8.43e+00   3.09e+01   1.19e+01
  -2.05e+00  -3.82e+01   1.85e+01   1.53e+00  -2.20e+01   2.46e+00
   3.29e-01   4.02e+00  -1.13e+01  -4.70e-03   4.27e+01  -1.23e-03
   1.41e+00   9.35e-01  -3.00e+00   1.12e+00  -1.82e+01  -1.55e+01
   2.42e+01  -1.32e+01  -4.20e-01  -3.60e+01   1.30e+01  -2.81e+01
   4.39e+01   3.87e+01  -6.46e+01  -1.64e+01   2.90e+01   4.15e+00
   5.34e+01   1.99e-02  -5.47e-01   1.24e+01   1.04e+01  -1.57e+00
   3.16e+00   8.78e+00  -2.95e+01  -2.34e-04   3.14e-04  -4.13e-04
  -1.80e-04  -5.74e-01  -5.18e-01  -4.21e-01   1.53e-01   1.33e+00
   3.85e+00   3.03e+00  -3.78e+01   1.38e-01   3.08e-01   1.57e+01
   3.31e-01   3.36e+00   1.61e-01  -2.68e+00]
R-squared score (training): 0.671
R-squared score (test): 0.494
Number of non-zero features: 88

Ridge regression with feature normalization

In [11]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

from sklearn.linear_model import Ridge
X_train, X_test, y_train, y_test = train_test_split(X_crime, y_crime,
                                                   random_state = 0)
# The `fit` method uses the training data to compute each feature's min and max values
# To apply the scaler, call its `transform` method
# Note: `fit_transform` performs fitting and transforming in a single step on the training set
X_train_scaled = scaler.fit_transform(X_train) 
X_test_scaled = scaler.transform(X_test)

linridge = Ridge(alpha=20.0).fit(X_train_scaled, y_train)

print('Crime dataset')
print('ridge regression linear model intercept: {}'
     .format(linridge.intercept_))
print('ridge regression linear model coeff:\n{}'
     .format(linridge.coef_))
print('R-squared score (training): {:.3f}'
     .format(linridge.score(X_train_scaled, y_train)))
print('R-squared score (test): {:.3f}'
     .format(linridge.score(X_test_scaled, y_test)))
print('Number of non-zero features: {}'
     .format(np.sum(linridge.coef_ != 0)))
Crime dataset
ridge regression linear model intercept: 933.3906385044113
ridge regression linear model coeff:
[  88.69   16.49  -50.3   -82.91  -65.9    -2.28   87.74  150.95   18.88
  -31.06  -43.14 -189.44   -4.53  107.98  -76.53    2.86   34.95   90.14
   52.46  -62.11  115.02    2.67    6.94   -5.67 -101.55  -36.91   -8.71
   29.12  171.26   99.37   75.07  123.64   95.24 -330.61 -442.3  -284.5
 -258.37   17.66 -101.71  110.65  523.14   24.82    4.87  -30.47   -3.52
   50.58   10.85   18.28   44.11   58.34   67.09  -57.94  116.14   53.81
   49.02   -7.62   55.14  -52.09  123.39   77.13   45.5   184.91  -91.36
    1.08  234.09   10.39   94.72  167.92  -25.14   -1.18   14.6    36.77
   53.2   -78.86   -5.9    26.05  115.15   68.74   68.29   16.53  -97.91
  205.2    75.97   61.38  -79.83   67.27   95.67  -11.88]
R-squared score (training): 0.615
R-squared score (test): 0.599
Number of non-zero features: 88

Feature Normalization: The test set must use identical scaling to the training set

  • Fit the scaler using the training set, then apply the same scaler to transform the test set
  • Do not scale the training and test sets using different scalers: this could lead to random skew in the data
  • Do not fit the scaler using any part of the test data: referencing the test data can lead to a form of data leakage, where the training phase has access to information leaked from the test set, for example the distribution of extreme values of each feature in the test data, which the learner should never see during training. This in turn can cause the learning method to give unrealistically good estimates on that same test set.

One downside to performing feature normalization is that the resulting model and the transformed features may be harder to interpret.

In the end, the type of feature normalization that's best to apply can depend on the dataset, the learning task, and the learning algorithm to be used.

Summary: In general, regularization works especially well when you have relatively small amounts of training data compared to the number of features in your model. Regularization becomes less important as the amount of training data increases.

Ridge regression with regularization parameter: alpha

Significantly larger or smaller values of alpha both lead to significantly worse model fit. This is another illustration of the general relationship between model complexity and test set performance: there is often an intermediate best value of a model complexity parameter that leads to neither underfitting nor overfitting.

In [12]:
print('Ridge regression: effect of alpha regularization parameter\n')
for this_alpha in [0, 1, 10, 20, 50, 100, 1000]:
    linridge = Ridge(alpha = this_alpha).fit(X_train_scaled, y_train)
    r2_train = linridge.score(X_train_scaled, y_train)
    r2_test = linridge.score(X_test_scaled, y_test)
    num_coeff_bigger = np.sum(abs(linridge.coef_) > 1.0)
    print('Alpha = {:.2f}\nnum abs(coeff) > 1.0: {}, \
r-squared training: {:.2f}, r-squared test: {:.2f}\n'
         .format(this_alpha, num_coeff_bigger, r2_train, r2_test))
Ridge regression: effect of alpha regularization parameter

Alpha = 0.00
num abs(coeff) > 1.0: 87, r-squared training: 0.67, r-squared test: 0.50

Alpha = 1.00
num abs(coeff) > 1.0: 87, r-squared training: 0.66, r-squared test: 0.56

Alpha = 10.00
num abs(coeff) > 1.0: 87, r-squared training: 0.63, r-squared test: 0.59

Alpha = 20.00
num abs(coeff) > 1.0: 88, r-squared training: 0.61, r-squared test: 0.60

Alpha = 50.00
num abs(coeff) > 1.0: 86, r-squared training: 0.58, r-squared test: 0.58

Alpha = 100.00
num abs(coeff) > 1.0: 87, r-squared training: 0.55, r-squared test: 0.55

Alpha = 1000.00
num abs(coeff) > 1.0: 84, r-squared training: 0.31, r-squared test: 0.30

Lasso regression

Lasso regression is another form of regularized linear regression that uses an L1 regularization penalty for training (instead of ridge's L2 penalty)

  • L1 penalty: Minimize the sum of the absolute values of the coefficients $$RSS_{LASSO}(w, b)=\sum_{i=1}^N(y_i-(w\cdot x_i + b))^2 + \alpha \sum_{j=1}^p{|w_j|}$$
  • This has the effect of setting parameter weights in w to zero for the least influential variables. This is called a sparse solution: a kind of feature selection
  • The parameter $\alpha$ controls the amount of L1 regularization (default = 1.0)
  • The prediction formula is the same as ordinary least-squares.
  • When to use ridge vs lasso regression:
    • Many small/medium sized effects: use ridge
    • Only a few variables with medium/large effect: use lasso
In [13]:
from sklearn.linear_model import Lasso
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

X_train, X_test, y_train, y_test = train_test_split(X_crime, y_crime,
                                                   random_state = 0)

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

linlasso = Lasso(alpha=2.0, max_iter = 10000).fit(X_train_scaled, y_train)

print('Crime dataset')
print('lasso regression linear model intercept: {}'
     .format(linlasso.intercept_))
print('lasso regression linear model coeff:\n{}'
     .format(linlasso.coef_))
print('Non-zero features: {}'
     .format(np.sum(linlasso.coef_ != 0)))
print('R-squared score (training): {:.3f}'
     .format(linlasso.score(X_train_scaled, y_train)))
print('R-squared score (test): {:.3f}\n'
     .format(linlasso.score(X_test_scaled, y_test)))
print('Features with non-zero weight (sorted by absolute magnitude):')

for e in sorted (list(zip(list(X_crime), linlasso.coef_)),
                key = lambda e: -abs(e[1])):
    if e[1] != 0:
        print('\t{}, {:.3f}'.format(e[0], e[1]))
Crime dataset
lasso regression linear model intercept: 1186.6120619985809
lasso regression linear model coeff:
[    0.       0.      -0.    -168.18    -0.      -0.       0.     119.69
     0.      -0.       0.    -169.68    -0.       0.      -0.       0.
     0.       0.      -0.      -0.       0.      -0.       0.       0.
   -57.53    -0.      -0.       0.     259.33    -0.       0.       0.
     0.      -0.   -1188.74    -0.      -0.      -0.    -231.42     0.
  1488.37     0.      -0.      -0.      -0.       0.       0.       0.
     0.       0.      -0.       0.      20.14     0.       0.       0.
     0.       0.     339.04     0.       0.     459.54    -0.       0.
   122.69    -0.      91.41     0.      -0.       0.       0.      73.14
     0.      -0.       0.       0.      86.36     0.       0.       0.
  -104.57   264.93     0.      23.45   -49.39     0.       5.2      0.  ]
Non-zero features: 20
R-squared score (training): 0.631
R-squared score (test): 0.624

Features with non-zero weight (sorted by absolute magnitude):
	PctKidsBornNeverMar, 1488.365
	PctKids2Par, -1188.740
	HousVacant, 459.538
	PctPersDenseHous, 339.045
	NumInShelters, 264.932
	MalePctDivorce, 259.329
	PctWorkMom, -231.423
	pctWInvInc, -169.676
	agePct12t29, -168.183
	PctVacantBoarded, 122.692
	pctUrban, 119.694
	MedOwnCostPctIncNoMtg, -104.571
	MedYrHousBuilt, 91.412
	RentQrange, 86.356
	OwnOccHiQuart, 73.144
	PctEmplManu, -57.530
	PctBornSameState, -49.394
	PctForeignBorn, 23.449
	PctLargHouseFam, 20.144
	PctSameCity85, 5.198

Code book

top five strongest features:

  • PctKidsBornNeverMar, 1488.365: percentage of kids born to people who never married
  • PctKids2Par, -1188.740: percentage of kids in family housing with two parents
  • HousVacant, 459.538: number of vacant households
  • PctPersDenseHous, 339.045: percent of persons in dense housing (more than 1 person/room)
  • NumInShelters, 264.932: number of people in homeless shelters

Lasso regression with regularization parameter: alpha

In [14]:
print('Lasso regression: effect of alpha regularization\n\
parameter on number of features kept in final model\n')

for alpha in [0.5, 1, 2, 3, 5, 10, 20, 50]:
    linlasso = Lasso(alpha, max_iter = 10000).fit(X_train_scaled, y_train)
    r2_train = linlasso.score(X_train_scaled, y_train)
    r2_test = linlasso.score(X_test_scaled, y_test)
    
    print('Alpha = {:.2f}\nFeatures kept: {}, r-squared training: {:.2f}, \
r-squared test: {:.2f}\n'
         .format(alpha, np.sum(linlasso.coef_ != 0), r2_train, r2_test))
Lasso regression: effect of alpha regularization
parameter on number of features kept in final model

Alpha = 0.50
Features kept: 35, r-squared training: 0.65, r-squared test: 0.58

Alpha = 1.00
Features kept: 25, r-squared training: 0.64, r-squared test: 0.60

Alpha = 2.00
Features kept: 20, r-squared training: 0.63, r-squared test: 0.62

Alpha = 3.00
Features kept: 17, r-squared training: 0.62, r-squared test: 0.63

Alpha = 5.00
Features kept: 12, r-squared training: 0.60, r-squared test: 0.61

Alpha = 10.00
Features kept: 6, r-squared training: 0.57, r-squared test: 0.58

Alpha = 20.00
Features kept: 2, r-squared training: 0.51, r-squared test: 0.50

Alpha = 50.00
Features kept: 1, r-squared training: 0.31, r-squared test: 0.30

Polynomial regression

Polynomial Features with Linear Regression

$$x=(x_0, x_1) \Rightarrow x^{'}=(x_0, x_1, x_0^2, x_0x_1, x_1^2)$$

$$\hat{y}= \hat{w_0}x_0+\hat{w_1}x_1+\hat{w_{00}}x_0^2+\hat{w_{01}}x_0x_1+\hat{w_{11}}x_1^2$$

  • Generate new features consisting of all polynomial combinations of the original two features (x_0,x_1)
  • The degree of the polynomial specifies how many variables participate at a time in each new feature (above example: degree 2)
  • This is still a weighted linear combination of features, so it's still a linear model, and we can use the same least-squares estimation method for w and b (a short illustration follows this list)
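A minimal illustration (values are arbitrary) of the degree-2 expansion above for a single sample $x=(x_0, x_1)$; PolynomialFeatures also adds the constant bias term 1 by default:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2.0, 3.0]])              # x_0 = 2, x_1 = 3
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(x))            # [[1. 2. 3. 4. 6. 9.]] = 1, x0, x1, x0^2, x0*x1, x1^2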
In [15]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures


X_train, X_test, y_train, y_test = train_test_split(X_F1, y_F1,
                                                   random_state = 0)
linreg = LinearRegression().fit(X_train, y_train)

print('linear model coeff (w): {}'
     .format(linreg.coef_))
print('linear model intercept (b): {:.3f}'
     .format(linreg.intercept_))
print('R-squared score (training): {:.3f}'
     .format(linreg.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'
     .format(linreg.score(X_test, y_test)))

print('\nNow we transform the original input data to add\n\
polynomial features up to degree 2 (quadratic)\n')
poly = PolynomialFeatures(degree=2)
X_F1_poly = poly.fit_transform(X_F1)

X_train, X_test, y_train, y_test = train_test_split(X_F1_poly, y_F1,
                                                   random_state = 0)
linreg = LinearRegression().fit(X_train, y_train)

print('(poly deg 2) linear model coeff (w):\n{}'
     .format(linreg.coef_))
print('(poly deg 2) linear model intercept (b): {:.3f}'
     .format(linreg.intercept_))
print('(poly deg 2) R-squared score (training): {:.3f}'
     .format(linreg.score(X_train, y_train)))
print('(poly deg 2) R-squared score (test): {:.3f}\n'
     .format(linreg.score(X_test, y_test)))

print('\nAddition of many polynomial features often leads to\n\
overfitting, so we often use polynomial features in combination\n\
with regression that has a regularization penalty, like ridge\n\
regression.\n')

X_train, X_test, y_train, y_test = train_test_split(X_F1_poly, y_F1,
                                                   random_state = 0)
linreg = Ridge().fit(X_train, y_train)

print('(poly deg 2 + ridge) linear model coeff (w):\n{}'
     .format(linreg.coef_))
print('(poly deg 2 + ridge) linear model intercept (b): {:.3f}'
     .format(linreg.intercept_))
print('(poly deg 2 + ridge) R-squared score (training): {:.3f}'
     .format(linreg.score(X_train, y_train)))
print('(poly deg 2 + ridge) R-squared score (test): {:.3f}'
     .format(linreg.score(X_test, y_test)))
linear model coeff (w): [  4.42   6.     0.53  10.24   6.55  -2.02  -0.32]
linear model intercept (b): 1.543
R-squared score (training): 0.722
R-squared score (test): 0.722

Now we transform the original input data to add
polynomial features up to degree 2 (quadratic)

(poly deg 2) linear model coeff (w):
[  3.41e-12   1.66e+01   2.67e+01  -2.21e+01   1.24e+01   6.93e+00
   1.05e+00   3.71e+00  -1.34e+01  -5.73e+00   1.62e+00   3.66e+00
   5.05e+00  -1.46e+00   1.95e+00  -1.51e+01   4.87e+00  -2.97e+00
  -7.78e+00   5.15e+00  -4.65e+00   1.84e+01  -2.22e+00   2.17e+00
  -1.28e+00   1.88e+00   1.53e-01   5.62e-01  -8.92e-01  -2.18e+00
   1.38e+00  -4.90e+00  -2.24e+00   1.38e+00  -5.52e-01  -1.09e+00]
(poly deg 2) linear model intercept (b): -3.206
(poly deg 2) R-squared score (training): 0.969
(poly deg 2) R-squared score (test): 0.805


Addition of many polynomial features often leads to
overfitting, so we often use polynomial features in combination
with regression that has a regularization penalty, like ridge
regression.

(poly deg 2 + ridge) linear model coeff (w):
[ 0.    2.23  4.73 -3.15  3.86  1.61 -0.77 -0.15 -1.75  1.6   1.37  2.52
  2.72  0.49 -1.94 -1.63  1.51  0.89  0.26  2.05 -1.93  3.62 -0.72  0.63
 -3.16  1.29  3.55  1.73  0.94 -0.51  1.7  -1.98  1.81 -0.22  2.88 -0.89]
(poly deg 2 + ridge) linear model intercept (b): 5.418
(poly deg 2 + ridge) R-squared score (training): 0.826
(poly deg 2 + ridge) R-squared score (test): 0.825

  • Why would we want to transform our data this way?
    • To capture interactions between the original features by adding them as features to the linear model
    • To make a classification problem easier
  • More generally, we can apply other non-linear transformations to create new features
    • Technically, these are called non-linear basis functions
  • Beware of polynomial feature expansion with high degree, as this can lead to complex models that overfit
    • Thus, polynomial feature expansion is often combined with a regularized learning method like ridge regression

Linear models for classification

Logistic regression

Logistic regression can be seen as a kind of generalized linear model. Like ordinary least-squares and other regression methods, logistic regression takes a set of input variables, the features, and estimates a target value. However, unlike ordinary linear regression, in its most basic form logistic regression's target value is a binary variable instead of a continuous value. There are flavors of logistic regression that can also be used in cases where the target value to be predicted is a multi-class categorical variable, not just binary.

Linear Regression $$\hat{y}=\hat{b}+\hat{w_1}\cdot{x_1}+\ldots+\hat{w_n}\cdot{x_n}$$

Linear models for classification: Logistic Regression

$$\hat{y}=logistic(\hat{b}+\hat{w_1}\cdot{x_1}+\ldots+\hat{w_n}\cdot{x_n})$$ $$= \frac{1}{1+e^{-(\hat{b}+\hat{w_1}\cdot{x_1}+\ldots+\hat{w_n}\cdot{x_n})}}$$ The logistic function transforms real-valued input to an output number y between 0 and 1, interpreted as the probability the input object belongs to the positive class, given its input features $(x_0, x_1,...,x_n)$
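A minimal sketch (the parameter values b and w are hypothetical, not learned from any dataset) of the logistic function above squashing a linear score into a probability between 0 and 1:

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

b, w = -1.0, np.array([0.8, -0.4])      # hypothetical intercept and weights
x = np.array([2.0, 1.0])                # one input instance (x_1, x_2)
score = b + w @ x                       # the linear part: b + w_1*x_1 + w_2*x_2
print(logistic(score))                  # ~0.55 -> estimated probability of the positive class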

Logistic regression for binary classification on fruits dataset using height, width features (positive class: apple, negative class: others)

In [16]:
from sklearn.linear_model import LogisticRegression
from adspy_shared_utilities import (
plot_class_regions_for_classifier_subplot)

fig, subaxes = plt.subplots(1, 1, figsize=(7, 5))
y_fruits_apple = y_fruits_2d == 1   # make into a binary problem: predicting whether an object is an apple vs everything else
X_train, X_test, y_train, y_test = (
train_test_split(X_fruits_2d.as_matrix(), #using only height and width as the features space
                y_fruits_apple.as_matrix(),
                random_state = 0))

clf = LogisticRegression(C=100).fit(X_train, y_train)
plot_class_regions_for_classifier_subplot(clf, X_train, y_train, None,
                                         None, 'Logistic regression \
for binary classification\nFruit dataset: Apple vs others',
                                         subaxes)

h = 6
w = 8
print('A fruit with height {} and width {} is predicted to be: {}'
     .format(h,w, ['not an apple', 'an apple'][clf.predict([[h,w]])[0]]))

h = 10
w = 7
print('A fruit with height {} and width {} is predicted to be: {}'
     .format(h,w, ['not an apple', 'an apple'][clf.predict([[h,w]])[0]]))
subaxes.set_xlabel('height')
subaxes.set_ylabel('width')

print('Accuracy of Logistic regression classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))
A fruit with height 6 and width 8 is predicted to be: an apple
A fruit with height 10 and width 7 is predicted to be: not an apple
Accuracy of Logistic regression classifier on training set: 0.77
Accuracy of Logistic regression classifier on test set: 0.73

Logistic regression on simple synthetic dataset

In [17]:
from sklearn.linear_model import LogisticRegression
from adspy_shared_utilities import (
plot_class_regions_for_classifier_subplot)


X_train, X_test, y_train, y_test = train_test_split(X_C2, y_C2,
                                                   random_state = 0)

fig, subaxes = plt.subplots(1, 1, figsize=(7, 5))
clf = LogisticRegression().fit(X_train, y_train)
title = 'Logistic regression, simple synthetic dataset C = {:.3f}'.format(1.0)
plot_class_regions_for_classifier_subplot(clf, X_train, y_train,
                                         None, None, title, subaxes)

print('Accuracy of Logistic regression classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))
     
Accuracy of Logistic regression classifier on training set: 0.80
Accuracy of Logistic regression classifier on test set: 0.80

Logistic regression regularization: C parameter

  • L2 regularization is turned on by default (like ridge regression)
  • Parameter C controls amount of regularization (default 1.0)
  • As with regularized linear regression, it can be important to normalize all features so that they are on the same scale
In [18]:
X_train, X_test, y_train, y_test = (
train_test_split(X_fruits_2d.as_matrix(),
                y_fruits_apple.as_matrix(),
                random_state=0))

fig, subaxes = plt.subplots(3, 1, figsize=(4, 10))

for this_C, subplot in zip([0.1, 1, 100], subaxes):
    clf = LogisticRegression(C=this_C).fit(X_train, y_train)
    title ='Logistic regression (apple vs rest), C = {:.3f}'.format(this_C)
    
    plot_class_regions_for_classifier_subplot(clf, X_train, y_train,
                                             X_test, y_test, title,
                                             subplot)
plt.tight_layout()

Application to real dataset

In [19]:
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)

clf = LogisticRegression().fit(X_train, y_train)
print('Breast cancer dataset')
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))
Breast cancer dataset
Accuracy of Logistic regression classifier on training set: 0.96
Accuracy of Logistic regression classifier on test set: 0.96

Support Vector Machines

Linear Support Vector Machine

In [20]:
from sklearn.svm import SVC
from adspy_shared_utilities import plot_class_regions_for_classifier_subplot


X_train, X_test, y_train, y_test = train_test_split(X_C2, y_C2, random_state = 0)

fig, subaxes = plt.subplots(1, 1, figsize=(7, 5))
this_C = 1.0
clf = SVC(kernel = 'linear', C=this_C).fit(X_train, y_train)
title = 'Linear SVC, C = {:.3f}'.format(this_C)
plot_class_regions_for_classifier_subplot(clf, X_train, y_train, None, None, title, subaxes)

Linear Support Vector Machine: C parameter

  • The strength of regularization is determined by C
  • Larger values of C: less regularization
    • Fit the training data as well as possible, i.e. with as few errors as possible, even if it means using a smaller-margin decision boundary
    • Each individual data point is important to classify correctly
  • Smaller values of C: more regularization
    • More tolerant of errors on individual data points; encourages the classifier to find a large-margin decision boundary, even if that decision boundary leads to more points being misclassified
In [21]:
from sklearn.svm import LinearSVC
from adspy_shared_utilities import plot_class_regions_for_classifier

X_train, X_test, y_train, y_test = train_test_split(X_C2, y_C2, random_state = 0)
fig, subaxes = plt.subplots(1, 2, figsize=(8, 4))

for this_C, subplot in zip([0.00001, 100], subaxes):
    clf = LinearSVC(C=this_C).fit(X_train, y_train)
    title = 'Linear SVC, C = {:.5f}'.format(this_C)
    plot_class_regions_for_classifier_subplot(clf, X_train, y_train,
                                             None, None, title, subplot)
plt.tight_layout()

Application to real dataset

In [22]:
from sklearn.svm import LinearSVC
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)

clf = LinearSVC().fit(X_train, y_train)
print('Breast cancer dataset')
print('Accuracy of Linear SVC classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Linear SVC classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))
Breast cancer dataset
Accuracy of Linear SVC classifier on training set: 0.93
Accuracy of Linear SVC classifier on test set: 0.94

Linear Models: Pros and Cons

Pros:

  • Simple and easy to train
  • Fast prediction
  • Scales well to very large datasets
  • Works well with sparse data; especially effective on high-dimensional datasets
  • Reasons for prediction are relatively easy to interpret

Cons:

  • For lower-dimensional data, other models may have superior generalization performance
  • For classification, data may not be linearly separable (more on this in SVMs with non-linear kernels)

Multi-class classification with linear models

Scikit-learn makes it very easy to learn multi-class classification models. Essentially, it does this by converting a multi-class classification problem into a series of binary problems. When you pass in a dataset that has a categorical variable for the target value, scikit-learn detects this automatically and then, for each class to be predicted, creates one binary classifier that predicts that class against all the other classes. So for example, in the fruit dataset there are four categories of fruit, so scikit-learn learns four different binary classifiers. To predict a new data instance, it runs the instance against each of the binary classifiers in turn, and the classifier with the highest score determines the predicted class.

LinearSVC with M classes generates M one vs rest classifiers.

In [23]:
from sklearn.svm import LinearSVC

X_train, X_test, y_train, y_test = train_test_split(X_fruits_2d, y_fruits_2d, random_state = 0)

clf = LinearSVC(C=5, random_state = 67).fit(X_train, y_train)
print('Coefficients:\n', clf.coef_)
print('Intercepts:\n', clf.intercept_)
Coefficients:
 [[-0.26  0.71]
 [-1.63  1.16]
 [ 0.03  0.29]
 [ 1.24 -1.64]]
Intercepts:
 [-3.29  1.2  -2.72  1.16]

$$y_{apple} = -0.26*height + 0.71*width - 3.29$$

height=2, width=6: $y_{apple}$ = +0.549 (>=0 : predict apple)

height=2, width=2: $y_{apple}$ = -2.340 (<0 : predict other)
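A minimal sketch reproducing the calculation above with the fitted multi-class LinearSVC from the previous cell (assumes clf and target_names_fruits are still in scope): compute one linear score per one-vs-rest classifier and take the class with the highest score.

import numpy as np

new_fruit = np.array([2, 6])                        # height = 2, width = 6
scores = clf.coef_ @ new_fruit + clf.intercept_     # one score per class; scores[0] is y_apple
print(scores)
print('Predicted:', target_names_fruits[np.argmax(scores)])
# clf.decision_function([[2, 6]]) returns the same scores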


Multi-class results on the fruit dataset

In [24]:
plt.figure(figsize=(6,6))
colors = ['r', 'g', 'b', 'y']
cmap_fruits = ListedColormap(['#FF0000', '#00FF00', '#0000FF','#FFFF00'])

plt.scatter(X_fruits_2d[['height']], X_fruits_2d[['width']],
           c=y_fruits_2d, cmap=cmap_fruits, edgecolor = 'black', alpha=.7)

x_0_range = np.linspace(-10, 15)

for w, b, color in zip(clf.coef_, clf.intercept_, ['r', 'g', 'b', 'y']):
    # Since class prediction with a linear model uses the formula y = w_0 x_0 + w_1 x_1 + b, 
    # and the decision boundary is defined as being all points with y = 0, to plot x_1 as a 
    # function of x_0 we just solve w_0 x_0 + w_1 x_1 + b = 0 for x_1:
    plt.plot(x_0_range, -(x_0_range * w[0] + b) / w[1], c=color, alpha=.8)
    
plt.legend(target_names_fruits)
plt.xlabel('height')
plt.ylabel('width')
plt.xlim(-2, 12)
plt.ylim(-2, 15)
plt.show()

Kernelized Support Vector Machines

Classification

In [25]:
from sklearn.svm import SVC
from adspy_shared_utilities import plot_class_regions_for_classifier

X_train, X_test, y_train, y_test = train_test_split(X_D2, y_D2, random_state = 0)

# The default SVC kernel is radial basis function (RBF)
plot_class_regions_for_classifier(SVC().fit(X_train, y_train),
                                 X_train, y_train, None, None,
                                 'Support Vector Classifier: RBF kernel')

# Compare decision boundaries with polynomial kernel, degree = 3
plot_class_regions_for_classifier(SVC(kernel = 'poly', degree = 3) # polynomial kernel
                                 .fit(X_train, y_train), X_train,
                                 y_train, None, None,
                                 'Support Vector Classifier: Polynomial kernel, degree = 3')

Support Vector Machine with Radial Basis Function (RBF) kernel: gamma parameter

$$K(x,x^{'}) = e^{-\gamma \cdot ||x - x^{'}||^2}$$ $gamma(\gamma)$ : kernel width parameter

gamma controls how far the influence of a single training example reaches, which in turn affects how tightly the decision boundaries end up surrounding points in the input space. Small gamma means a larger similarity radius, so that points farther apart are considered similar, which results in more points being grouped together and smoother decision boundaries. On the other hand, for larger values of gamma, the kernel value decays more quickly and points have to be very close to be considered similar. This results in more complex, tightly constrained decision boundaries.
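A minimal numeric check (the points and gamma value are chosen arbitrarily) of the RBF kernel formula above, compared against scikit-learn's pairwise rbf_kernel helper:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

a = np.array([[0.0, 0.0]])
b = np.array([[1.0, 2.0]])
gamma = 0.5
print(np.exp(-gamma * np.sum((a - b) ** 2)))   # exp(-gamma * ||x - x'||^2)
print(rbf_kernel(a, b, gamma=gamma))           # same value
# Larger gamma -> kernel value decays faster with distance -> tighter decision boundaries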

In [26]:
from adspy_shared_utilities import plot_class_regions_for_classifier

X_train, X_test, y_train, y_test = train_test_split(X_D2, y_D2, random_state = 0)
fig, subaxes = plt.subplots(3, 1, figsize=(4, 11))

for this_gamma, subplot in zip([0.01, 1.0, 10.0], subaxes):
    clf = SVC(kernel = 'rbf', gamma=this_gamma).fit(X_train, y_train)
    title = 'Support Vector Classifier: \nRBF kernel, gamma = {:.2f}'.format(this_gamma)
    plot_class_regions_for_classifier_subplot(clf, X_train, y_train,
                                             None, None, title, subplot)
    plt.tight_layout()

Support Vector Machine with RBF kernel: using both the C and gamma parameters

SVMs also have a regularization parameter C that controls the tradeoff between satisfying the maximum-margin criterion to find a simple decision boundary, and avoiding misclassification errors on the training set. The C parameter is also an important one for kernelized SVMs, and it interacts with the gamma parameter. If gamma is large, then C will have little to no effect. If gamma is small, the model is much more constrained and the effect of C will be similar to how it would affect a linear classifier. Typically gamma and C are tuned together, with the optimal combination usually lying in an intermediate range of values, for example gamma between 0.0001 and 10, and C between 0.1 and 100. Kernelized SVMs are quite sensitive to the setting of gamma.

The most important thing to remember when applying SVMs is to normalize the input data, so that all the features have comparable units and are on the same scale.

In [27]:
from sklearn.svm import SVC
from adspy_shared_utilities import plot_class_regions_for_classifier_subplot

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X_D2, y_D2, random_state = 0)
fig, subaxes = plt.subplots(3, 4, figsize=(15, 10), dpi=50)

for this_gamma, this_axis in zip([0.01, 1, 5], subaxes):
    
    for this_C, subplot in zip([0.1, 1, 15, 250], this_axis):
        title = 'gamma = {:.2f}, C = {:.2f}'.format(this_gamma, this_C)
        clf = SVC(kernel = 'rbf', gamma = this_gamma,
                 C = this_C).fit(X_train, y_train)
        plot_class_regions_for_classifier_subplot(clf, X_train, y_train,
                                                 X_test, y_test, title,
                                                 subplot)
        plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)

Application of SVMs to a real dataset: unnormalized data

In [28]:
from sklearn.svm import SVC
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer,
                                                   random_state = 0)

clf = SVC(C=10).fit(X_train, y_train)
print('Breast cancer dataset (unnormalized features)')
print('Accuracy of RBF-kernel SVC on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of RBF-kernel SVC on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))
Breast cancer dataset (unnormalized features)
Accuracy of RBF-kernel SVC on training set: 1.00
Accuracy of RBF-kernel SVC on test set: 0.63

Application of SVMs to a real dataset: normalized data with feature preprocessing using minmax scaling

In [29]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

clf = SVC(C=10).fit(X_train_scaled, y_train)
print('Breast cancer dataset (normalized with MinMax scaling)')
print('RBF-kernel SVC (with MinMax scaling) training set accuracy: {:.2f}'
     .format(clf.score(X_train_scaled, y_train)))
print('RBF-kernel SVC (with MinMax scaling) test set accuracy: {:.2f}'
     .format(clf.score(X_test_scaled, y_test)))
Breast cancer dataset (normalized with MinMax scaling)
RBF-kernel SVC (with MinMax scaling) training set accuracy: 0.98
RBF-kernel SVC (with MinMax scaling) test set accuracy: 0.96

Kernelized Support Vector Machines: pros and cons

Pros:

  • Can perform well on a range of datasets
  • Versatile: different kernel functions can be specified, or custom kernels can be defined for specific data types
  • Works well for both low- and high-dimensional data

Cons:

  • Efficiency (runtime speed and memory usage) decreases as training set size increases (e.g. over 50000 samples)
  • Needs careful normalization of input data and parameter tuning.
  • Does not provide direct probability estimates (but these can be estimated using e.g. Platt scaling, which transforms the output of the classifier into a probability distribution over classes by fitting a logistic regression model to the classifier's scores.)
  • Difficult to interpret why a prediction was made, which may limit the applicability of SVMs in scenarios where it is important for people to understand why a particular prediction was made

Kernelized Support Vector Machines (SVC) : Important parameters

Model complexity

  • kernel: Type of kernel function to be used
    • Default = 'rbf' for radial basis function
    • Other types include 'poly' (polynomial)
  • kernel parameters
    • $gamma(\gamma)$ : RBF kernel width
  • C : regularization parameter
  • Typically C and gamma are tuned at the same time

Cross-validation

When applying a number of supervised learning methods, we follow a consistent series of steps.

First, partitioning the dataset into training and test sets using the train_test_split function

Second, calling the fit method on the training set to estimate the model

Finally, applying the model by using the predict method to estimate a target value for the new data instances, or by using the score method to evaluate the trained model's performance on the test set

We divided the original data into training and test sets so we could use the test set to estimate how well the model trained on the training data would generalize to new, previously unseen data. The test set represents data that has not been seen during training but has the same general attributes as the original dataset, or in more technical language, is drawn from the same underlying distribution as the training set.

Cross-validation is a method that goes beyond evaluating a single model using a single train/test split of the data, by using multiple train/test splits, each of which is used to train and evaluate a separate model. Why is this better than our original method of a single train/test split? Cross-validation gives more stable and reliable estimates of how the classifier is likely to perform on average, by running multiple different training/test splits and then averaging the results, instead of relying entirely on a single particular training set.

Example based on k-NN classifier with fruit dataset (2 features)

In [30]:
from sklearn.model_selection import cross_val_score

clf = KNeighborsClassifier(n_neighbors = 5)
X = X_fruits_2d.as_matrix()
y = y_fruits_2d.as_matrix()
#By default, `cross_val_score()` does threefold cross-validation.
#If you want to change the number of folds, you can set the `cv` parameter.
cv_scores = cross_val_score(clf, X, y)

print('Cross-validation scores (3-fold):', cv_scores)
#It's typical to then compute the mean of all the accuracy scores across the folds and
#report the mean cross-validation score as a measure of how accurate we can expect the
#model to be on average.
print('Mean cross-validation score (3-fold): {:.3f}'
     .format(np.mean(cv_scores)))
Cross-validation scores (3-fold): [ 0.77  0.74  0.83]
Mean cross-validation score (3-fold): 0.781

One benefit of computing the accuracy of a model on multiple splits instead of a single split is that it gives us potentially useful information about how sensitive the model is to the nature of the specific training set. We can look at the distribution of these multiple scores across all the cross-validation folds to see how likely it is that, by chance, the model will perform very badly or very well on any new data set, so we can form a sort of worst-case or best-case performance estimate from these multiple scores.

This extra information does come with extra cost. It does take more time and computation to do cross-validation


A note on performing cross-validation for more advanced scenarios.

In some cases (e.g. when feature values have very different ranges), we've seen the need to scale or normalize the training and test sets before use with a classifier. The proper way to do cross-validation when you need to scale the data is not to scale the entire dataset with a single transform, since this will indirectly leak information into the training data about the whole dataset, including the test data (see the lecture on data leakage later in the course). Instead, scaling/normalizing must be computed and applied for each cross-validation fold separately. To do this, the easiest way in scikit-learn is to use pipelines. While these are beyond the scope of this course, further information is available in the scikit-learn documentation here:

http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

or the Pipeline section in the recommended textbook: Introduction to Machine Learning with Python by Andreas C. Müller and Sarah Guido (O'Reilly Media).
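A minimal sketch (beyond the scope of the lecture; assumes the breast cancer data X_cancer, y_cancer from the Datasets cell are still in scope) of the pipeline idea: the scaler is re-fit on the training portion of each fold only, so no information from a fold's held-out portion leaks into the scaling.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

pipe = Pipeline([('scaler', MinMaxScaler()), ('svc', SVC(C=10))])
cv_scores = cross_val_score(pipe, X_cancer, y_cancer, cv=5)   # scaling happens inside each fold
print(cv_scores, cv_scores.mean())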

Validation curve example

Sometimes we want to evaluate the effect that an important parameter of a model has on the cross-validation scores. The useful function validation_curve makes it easy to run this type of experiment.

Like cross_val_score, validation_curve will do threefold cross-validation by default, but you can adjust this with the cv parameter as well. Unlike cross_val_score, you can also specify a classifier parameter name and a set of parameter values you want to sweep across.

In [31]:
from sklearn.svm import SVC
from sklearn.model_selection import validation_curve

param_range = np.logspace(-3, 3, 4)
# First pass in the estimator object, that is, the classifier or regression object to use,
# followed by the dataset samples `X` and target values `y`,
# the name of the parameter to sweep,
# and the array of values that parameter should take on in doing the sweep
train_scores, test_scores = validation_curve(SVC(), X, y,
                                            param_name='gamma',
                                            param_range=param_range, cv=3)

validation_curve will return two-dimensional arrays corresponding to evaluating on the training set and the test set.

One row per parameter sweep value : Each array has one row per parameter value in the sweep

One column per CV fold : the number of columns is the number of cross-validation folds that are used.

In [32]:
print(train_scores)
[[ 0.49  0.42  0.41]
 [ 0.84  0.72  0.76]
 [ 0.92  0.9   0.93]
 [ 1.    1.    0.98]]
In [33]:
print(test_scores)
[[ 0.45  0.32  0.33]
 [ 0.82  0.68  0.61]
 [ 0.41  0.84  0.67]
 [ 0.36  0.21  0.39]]

The validation curve shows the mean cross-validation accuracy (solid lines) for training (orange) and test (blue) set as a function of the SVM parameter (gamma). It also shows the variation around the mean (shaded region) as computed from k-fold cross-validation scores.

Finally, as a reminder, cross-validation is used to evaluate the model, not to learn or tune a new model. To do model tuning, we'll look at how to tune the model parameters using something called grid search.

In [34]:
# This code based on scikit-learn validation_plot example
#  See:  http://scikit-learn.org/stable/auto_examples/model_selection/plot_validation_curve.html
plt.figure()

train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.title('Validation Curve with SVM')
plt.xlabel('$\gamma$ (gamma)')
plt.ylabel('Score')
plt.ylim(0.0, 1.1)
lw = 2

plt.semilogx(param_range, train_scores_mean, label='Training score',
            color='darkorange', lw=lw)

plt.fill_between(param_range, train_scores_mean - train_scores_std,
                train_scores_mean + train_scores_std, alpha=0.2,
                color='darkorange', lw=lw)

plt.semilogx(param_range, test_scores_mean, label='Cross-validation score',
            color='navy', lw=lw)

plt.fill_between(param_range, test_scores_mean - test_scores_std,
                test_scores_mean + test_scores_std, alpha=0.2,
                color='navy', lw=lw)

plt.legend(loc='best')
plt.show()

Decision Trees

Decision trees are a popular supervised learning method and can be used for both regression and classification. They're a good exploratory method if you're interested in getting a better idea of which features are influential in your dataset.

Basically, decision trees learn a series of explicit if-then rules on feature values that result in a decision predicting the target value. (The short sketch after the first fitted tree below prints these learned rules as text.)

In [2]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from adspy_shared_utilities import plot_decision_tree
from sklearn.model_selection import train_test_split


iris = load_iris()

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state = 3)
clf = DecisionTreeClassifier().fit(X_train, y_train)

print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))
Accuracy of Decision Tree classifier on training set: 1.00
Accuracy of Decision Tree classifier on test set: 0.95
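
To make the if-then rules concrete, the following sketch (not in the original notebook, and assuming a newer scikit-learn release than the one this notebook was run with, 0.21 or later) prints the rules learned by clf as plain text:

from sklearn.tree import export_text

# One indented line per if-then split, with the majority class at each leaf.
print(export_text(clf, feature_names=list(iris.feature_names)))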

Setting max decision tree depth to help avoid overfitting

In [3]:
clf2 = DecisionTreeClassifier(max_depth = 3).fit(X_train, y_train)

print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf2.score(X_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
     .format(clf2.score(X_test, y_test)))
Accuracy of Decision Tree classifier on training set: 0.98
Accuracy of Decision Tree classifier on test set: 0.97

Visualizing decision trees

In [4]:
plot_decision_tree(clf, iris.feature_names, iris.target_names)
Out[4]:
[Graphviz rendering of the unpruned decision tree (112 training samples). The root node splits on petal width (cm) <= 0.8, sending all 35 setosa samples to a pure leaf; the remaining branch splits on petal width, petal length, and sepal width to separate versicolor from virginica, continuing down to pure single-sample leaves.]

Pre-pruned version (max_depth = 3)

In [5]:
plot_decision_tree(clf2, iris.feature_names, iris.target_names)
Out[5]:
[Graphviz rendering of the pre-pruned decision tree (max_depth = 3). The root again splits on petal width (cm) <= 0.8 to isolate setosa; two further levels split on petal width (cm) <= 1.65 and petal length (cm), leaving a few slightly impure leaves instead of growing the tree to purity.]

Feature importance : How important is a feature to overall prediction accuracy?

This is one of the most useful and widely used types of summary analysis you can perform on a supervised learning model.

  • A number between 0 and 1 assigned to each feature.
  • Feature importance of 0 => the feature was not used in prediction
  • Feature importance of 1 => the feature predicts the target perfectly
  • All feature importances are normalized to sum to 1

In scikit-learn, feature importance values are stored as an array in an estimator attribute called feature_importances_. Note the underscore at the end of the name, which indicates it's a property of the object that is set as a result of fitting the model, not a user-defined property. The adspy_shared_utilities Python file contains a function called plot_feature_importances that you can import and use to visualize feature importances. It plots a horizontal bar chart with the features listed along the y axis by name and the feature importances along the x axis.

In [39]:
from adspy_shared_utilities import plot_feature_importances

plt.figure(figsize=(10,4), dpi=80)
plot_feature_importances(clf, iris.feature_names)
plt.show()

print('Feature importances: {}'.format(clf.feature_importances_))
Feature importances: [ 0.    0.02  0.08  0.9 ]

Note: if a feature has a low feature importance value, that doesn't necessarily mean the feature is unimportant for prediction. It simply means that the particular feature wasn't chosen at an early level of the tree; this could be because the feature is identical or highly correlated with another informative feature and so doesn't provide any new, additional signal for prediction.

Feature importance values don't tell us which specific classes a feature might be especially predictive for, and they also don't indicate more complex relationships between features that may influence prediction. Even so, feature importance values provide an easy-to-understand summary that can give useful insight about individual features in the learned model.

Because feature importances can vary depending on the specific model learned for a particular train/test split, it's common to compute them as an average over multiple train/test splits, for example when performing cross-validation; a small sketch follows.
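
The following sketch (not part of the original notebook; the fold count and max_depth are illustrative choices) averages decision-tree feature importances over the folds of a stratified cross-validation on the iris data:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
fold_importances = []

# Fit a separate tree on each training fold and record its feature importances.
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(iris.data, iris.target):
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(iris.data[train_idx], iris.target[train_idx])
    fold_importances.append(tree.feature_importances_)

# Averaging across folds gives a more stable estimate than any single split.
for name, imp in zip(iris.feature_names, np.mean(fold_importances, axis=0)):
    print('{:20s} {:.3f}'.format(name, imp))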


In [40]:
from sklearn.tree import DecisionTreeClassifier
from adspy_shared_utilities import plot_class_regions_for_classifier_subplot

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state = 0)
fig, subaxes = plt.subplots(6, 1, figsize=(6, 32))

pair_list = [[0,1], [0,2], [0,3], [1,2], [1,3], [2,3]]
tree_max_depth = 4

for pair, axis in zip(pair_list, subaxes):
    X = X_train[:, pair]
    y = y_train
    
    clf = DecisionTreeClassifier(max_depth=tree_max_depth).fit(X, y)
    title = 'Decision Tree, max_depth = {:d}'.format(tree_max_depth)
    plot_class_regions_for_classifier_subplot(clf, X, y, None,
                                             None, title, axis,
                                             iris.target_names)
    
    axis.set_xlabel(iris.feature_names[pair[0]])
    axis.set_ylabel(iris.feature_names[pair[1]])
    
plt.tight_layout()
plt.show()

Decision Trees on a real-world dataset

In [12]:
from sklearn.tree import DecisionTreeClassifier
from adspy_shared_utilities import plot_decision_tree
from adspy_shared_utilities import plot_feature_importances

X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)

clf = DecisionTreeClassifier(max_depth = 4, min_samples_leaf = 8,
                            random_state = 0).fit(X_train, y_train)

plot_decision_tree(clf, cancer.feature_names, cancer.target_names)
Out[12]:
[Graphviz rendering of the breast cancer decision tree (max_depth = 4, min_samples_leaf = 8, 426 training samples). The root splits on mean concave points <= 0.049; subsequent levels split on worst area, area error, worst texture, and worst smoothness, ending in leaves that are predominantly benign on the left branch and predominantly malignant on the right.]
In [42]:
print('Breast cancer dataset: decision tree')
print('Accuracy of DT classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of DT classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

plt.figure(figsize=(10,6),dpi=80)
plot_feature_importances(clf, cancer.feature_names)
plt.tight_layout()

plt.show()
Breast cancer dataset: decision tree
Accuracy of DT classifier on training set: 0.96
Accuracy of DT classifier on test set: 0.94

Decision Trees : Pros and Cons

Pros:

  • Easily visualized and interpreted
  • No feature normalization or scaling is typically needed
  • Work well with datasets using a mixture of feature types (continuous, categorical, binary)

Cons:

  • Even after tuning, decision trees can often still overfit.
  • Usually need an ensemble of trees for better generalization performance

Decision Trees : DecisionTreeClassifier Key Parameters

  • max_depth: controls maximum depth (number of split points). Most common way to reduce tree complexity and overfitting
  • min_samples_leaf: threshold for the minimum # of data instances a leaf can have to avoid further splitting
  • max_leaf_nodes: limits total number of leaves in the tree
  • In practice, adjusting only one of these (e.g. max_depth) is often enough to reduce overfitting; a short sketch combining all three follows this list
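
As an illustrative sketch (the parameter values are arbitrary choices, not from the original notebook), the three parameters can be combined on the breast cancer training split created earlier:

from sklearn.tree import DecisionTreeClassifier

# Illustrative pruning settings; X_train, X_test, y_train, y_test come from the
# earlier train_test_split on the breast cancer data.
clf_pruned = DecisionTreeClassifier(max_depth = 4,         # limit the number of split levels
                                    min_samples_leaf = 8,  # require at least 8 samples per leaf
                                    max_leaf_nodes = 10,   # cap the total number of leaves
                                    random_state = 0).fit(X_train, y_train)

print('Pruned tree accuracy on test set: {:.2f}'.format(clf_pruned.score(X_test, y_test)))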