Assignment 3 - Evaluation

In this assignment you will train several models and evaluate how effectively they predict instances of fraud, using data based on this dataset from Kaggle. Each row in fraud_data.csv corresponds to a credit card transaction. Features include confidential variables V1 through V28, as well as Amount, the amount of the transaction. The target is stored in the Class column, where a value of 1 corresponds to an instance of fraud and 0 corresponds to an instance of not fraud.

In [1]:
import numpy as np
import pandas as pd

Question 1

Import the data from fraud_data.csv. What percentage of the observations in the dataset are instances of fraud?

This function should return a float between 0 and 1.

In [ ]:
def answer_one():
    
    df = pd.read_csv("./Data_Science_with_Python/Applied_Machine_with_Python/week3/fraud_data.csv")
    
    # Fraction of rows labelled as fraud (Class == 1)
    return sum(df['Class']) / df.shape[0]
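Equivalently, since Class is a 0/1 column, its mean is the fraud rate; a one-liner sketch using the same CSV path as above:

In [ ]:
# The mean of a 0/1 column equals the fraction of 1s
pd.read_csv("./Data_Science_with_Python/Applied_Machine_with_Python/week3/fraud_data.csv")['Class'].mean()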
In [5]:
df = pd.read_csv("./Data_Science_with_Python/Applied_Machine_with_Python/week3/fraud_data.csv")
df.head()
Out[5]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 1.176563 0.323798 0.536927 1.047002 -0.368652 -0.728586 0.084678 -0.069246 -0.266389 0.155315 ... -0.109627 -0.341365 0.057845 0.499180 0.415211 -0.581949 0.015472 0.018065 4.67 0
1 0.681109 -3.934776 -3.801827 -1.147468 -0.735540 -0.501097 1.038865 -0.626979 -2.274423 1.527782 ... 0.652202 0.272684 -0.982151 0.165900 0.360251 0.195321 -0.256273 0.056501 912.00 0
2 1.140729 0.453484 0.247010 2.383132 0.343287 0.432804 0.093380 0.173310 -0.808999 0.775436 ... -0.003802 0.058556 -0.121177 -0.304215 0.645893 0.122600 -0.012115 -0.005945 1.00 0
3 -1.107073 -3.298902 -0.184092 -1.795744 2.137564 -1.684992 -2.015606 -0.007181 -0.165760 0.869659 ... 0.130648 0.329445 0.927656 -0.049560 -1.892866 -0.575431 0.266573 0.414184 62.10 0
4 -0.314818 0.866839 -0.124577 -0.627638 2.651762 3.428128 0.194637 0.670674 -0.442658 0.133499 ... -0.312774 -0.799494 -0.064488 0.953062 -0.429550 0.158225 0.076943 -0.015051 2.67 0

5 rows × 30 columns

In [8]:
sum(df['Class']) / df.shape[0]
Out[8]:
0.016410823768035772
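Only about 1.6% of transactions are fraud, so the classes are heavily imbalanced and plain accuracy will be misleading throughout this assignment. A quick look at the balance, reusing df from above:

In [ ]:
# normalize=True turns class counts into proportions
df['Class'].value_counts(normalize=True)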
In [9]:
# Use X_train, X_test, y_train, y_test for all of the following questions
from sklearn.model_selection import train_test_split

df = pd.read_csv('fraud_data.csv')

X = df.iloc[:,:-1]
y = df.iloc[:,-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
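Note that train_test_split defaults to a 75/25 split and does not stratify. With a positive class this rare, a stratified split keeps the fraud rate comparable across train and test; the assignment uses the unstratified split above, but here is a sketch of the alternative:

In [ ]:
# Alternative (not used below): stratify on y so both splits
# preserve the ~1.6% fraud rate
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
y_tr.mean(), y_te.mean()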

Question 2

Using X_train, X_test, y_train, and y_test (as defined above), train a dummy classifier that classifies everything as the majority class of the training data. What is the accuracy of this classifier? What is the recall?

This function should return a tuple with two floats, i.e. (accuracy score, recall score).

In [14]:
def answer_two():
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import recall_score, accuracy_score
    
    # The negative class (0) is the majority, so this baseline always predicts 0
    dummy_majority = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
    y_dummy_predictions = dummy_majority.predict(X_test)
    
    return (accuracy_score(y_test, y_dummy_predictions), recall_score(y_test, y_dummy_predictions))
In [15]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score, accuracy_score
# Negative class (0) is most frequent
dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(X_train, y_train)
# Therefore the dummy 'most_frequent' classifier always predicts class 0
y_dummy_predictions = dummy_majority.predict(X_test)

# .score() on a classifier returns accuracy; recall is 0 because the
# baseline never predicts the positive class
(accuracy_score(y_test, y_dummy_predictions), recall_score(y_test, y_dummy_predictions))
Out[15]:
(0.98525073746312686, 0.0)
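Accuracy here is exactly the share of non-fraud rows in the test set, and recall is 0 because the baseline never predicts fraud. A quick sanity check:

In [ ]:
# The all-negative baseline's accuracy equals the negative-class share
1 - y_test.mean()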

Question 3

Using X_train, X_test, y_train, y_test (as defined above), train an SVC classifier using the default parameters. What are the accuracy, recall, and precision of this classifier?

This function should return a tuple with three floats, i.e. (accuracy score, recall score, precision score).

In [ ]:
def answer_three():
    from sklearn.metrics import recall_score, precision_score, accuracy_score
    from sklearn.svm import SVC

    # SVC with default parameters (RBF kernel)
    svm = SVC().fit(X_train, y_train)
    svm_predicted = svm.predict(X_test)
    
    return (accuracy_score(y_test, svm_predicted),
            recall_score(y_test, svm_predicted),
            precision_score(y_test, svm_predicted))
In [17]:
from sklearn.metrics import recall_score, precision_score, accuracy_score
from sklearn.svm import SVC
svm = SVC().fit(X_train, y_train)
svm_predicted = svm.predict(X_test)
(accuracy_score(y_test, svm_predicted), recall_score(y_test, svm_predicted), precision_score(y_test, svm_predicted))
Out[17]:
(0.99078171091445433, 0.375, 1.0)
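A precision of 1.0 with a recall of 0.375 means the default SVC produces no false positives but misses most fraud cases. The confusion matrix makes this concrete (reusing svm_predicted from the cell above):

In [ ]:
# Precision 1.0 forces the FP cell to 0; the low recall shows up
# as a large FN count
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, svm_predicted)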

Question 4

Using the SVC classifier with parameters {'C': 1e9, 'gamma': 1e-07}, what is the confusion matrix when using a threshold of -220 on the decision function? Use X_test and y_test.

This function should return a confusion matrix, a 2x2 numpy array with 4 integers.

In [ ]:
def answer_four():
    from sklearn.metrics import confusion_matrix
    from sklearn.svm import SVC

    svm = SVC(C=1e9, gamma=1e-07).fit(X_train, y_train)
    # Scores above the -220 threshold are treated as positive (fraud) predictions
    y_scores_svm = svm.decision_function(X_test) > -220
    confusion = confusion_matrix(y_test, y_scores_svm)
    
    return confusion
In [44]:
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC
svm = SVC(C=1e9, gamma = 1e-07).fit(X_train, y_train)
y_scores_svm = svm.decision_function(X_test) > -220
confusion = confusion_matrix(y_test, y_scores_svm)
confusion
Out[44]:
array([[5320,   24],
       [  14,   66]])
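scikit-learn lays the confusion matrix out as [[TN, FP], [FN, TP]], so lowering the threshold to -220 recovers 66 of the 80 actual frauds at the cost of 24 false positives. A small sketch to unpack the counts by name:

In [ ]:
# Unpack the 2x2 matrix in scikit-learn's [[TN, FP], [FN, TP]] order
tn, fp, fn, tp = confusion.ravel()
tn, fp, fn, tp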
In [22]:
# y_scores_svm already holds the boolean predictions from the -220 threshold
y_scores_svm
Out[22]:
array([False, False, False, ..., False, False, False], dtype=bool)

Question 5

Train a logistic regression classifier with default parameters using X_train and y_train.

For the logistic regression classifier, create a precision-recall curve and an ROC curve using y_test and the probability estimates for X_test (probability it is fraud).

Looking at the precision-recall curve, what is the recall when the precision is 0.75?

Looking at the ROC curve, what is the true positive rate when the false positive rate is 0.16?

This function should return a tuple with two floats, i.e. (recall, true positive rate).

In [ ]:
def answer_five():
    # precision-recall curve
    %matplotlib notebook
    import matplotlib.pyplot as plt
    from sklearn.metrics import precision_recall_curve
    from sklearn.linear_model import LogisticRegression

    lr = LogisticRegression().fit(X_train, y_train)
    y_scores_lr = lr.decision_function(X_test)
    
    precision, recall, thresholds = precision_recall_curve(y_test, y_scores_lr)
    # Mark the point on the curve closest to the zero-score threshold
    closest_zero = np.argmin(np.abs(thresholds))
    closest_zero_p = precision[closest_zero]
    closest_zero_r = recall[closest_zero]

    plt.figure()
    plt.xlim([0.0, 1.01])
    plt.ylim([0.0, 1.01])
    plt.plot(precision, recall, label='Precision-Recall Curve')
    plt.plot(closest_zero_p, closest_zero_r, 'o', markersize=12, fillstyle='none', c='r', mew=3)
    plt.xlabel('Precision', fontsize=16)
    plt.ylabel('Recall', fontsize=16)
    plt.gca().set_aspect('equal')
    plt.show()
    
    # ROC curve
    from sklearn.metrics import roc_curve, auc

    fpr_lr, tpr_lr, _ = roc_curve(y_test, y_scores_lr)
    roc_auc_lr = auc(fpr_lr, tpr_lr)

    plt.figure()
    plt.xlim([-0.01, 1.00])
    plt.ylim([-0.01, 1.01])
    plt.plot(fpr_lr, tpr_lr, lw=3, label='LogRegr ROC curve (area = {:0.2f})'.format(roc_auc_lr))
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)
    plt.title('ROC curve (fraud classifier)', fontsize=16)
    plt.legend(loc='lower right', fontsize=13)
    plt.plot([0, 1], [0, 1], color='navy', lw=3, linestyle='--')
    plt.gca().set_aspect('equal')
    plt.show()
    
    # Values read off the two curves plotted above
    return (0.83, 0.94)
In [29]:
%matplotlib notebook
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression().fit(X_train, y_train)
y_scores_lr = lr.decision_function(X_test)
In [30]:
precision, recall, thresholds = precision_recall_curve(y_test, y_scores_lr)
closest_zero = np.argmin(np.abs(thresholds))
closest_zero_p = precision[closest_zero]
closest_zero_r = recall[closest_zero]

plt.figure()
plt.xlim([0.0, 1.01])
plt.ylim([0.0, 1.01])
plt.plot(precision, recall, label='Precision-Recall Curve')
plt.plot(closest_zero_p, closest_zero_r, 'o', markersize = 12, fillstyle = 'none', c='r', mew=3)
plt.xlabel('Precision', fontsize=16)
plt.ylabel('Recall', fontsize=16)
plt.gca().set_aspect('equal')
plt.show()
In [31]:
from sklearn.metrics import roc_curve, auc

fpr_lr, tpr_lr, _ = roc_curve(y_test, y_scores_lr)
roc_auc_lr = auc(fpr_lr, tpr_lr)

plt.figure()
plt.xlim([-0.01, 1.00])
plt.ylim([-0.01, 1.01])
plt.plot(fpr_lr, tpr_lr, lw=3, label='LogRegr ROC curve (area = {:0.2f})'.format(roc_auc_lr))
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve (fraud classifier)', fontsize=16)
plt.legend(loc='lower right', fontsize=13)
plt.plot([0, 1], [0, 1], color='navy', lw=3, linestyle='--')
plt.gca().set_aspect('equal')
plt.show()
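The two values returned by answer_five are read off the plots by eye. A sketch of reading them numerically instead, assuming the precision, recall, fpr_lr, and tpr_lr arrays from the cells above are still in scope (precision is not monotonic along the PR curve, so the nearest point is used rather than interpolation):

In [ ]:
# Nearest point on the PR curve to precision = 0.75
idx = np.argmin(np.abs(precision - 0.75))
recall_at_precision_075 = recall[idx]
# fpr_lr from roc_curve is non-decreasing, so linear interpolation is safe
tpr_at_fpr_016 = np.interp(0.16, fpr_lr, tpr_lr)
recall_at_precision_075, tpr_at_fpr_016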

Question 6

Perform a grid search over the parameters listed below for a logistic regression classifier, using recall for scoring and the default 3-fold cross-validation.

'penalty': ['l1', 'l2']

'C':[0.01, 0.1, 1, 10, 100]

From .cv_results_, create an array of the mean test scores of each parameter combination, i.e.:

        l1    l2
0.01     ?     ?
0.1      ?     ?
1        ?     ?
10       ?     ?
100      ?     ?


This function should return a 5 by 2 numpy array with 10 floats.

Note: do not return a DataFrame, just the values denoted by '?' above in a numpy array. You might need to reshape your raw result to meet the format we are looking for.

In [45]:
def answer_six():    
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression

    grid_values = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]}
    lr = LogisticRegression()
    grid_lr_rec = GridSearchCV(lr, param_grid=grid_values, scoring='recall', cv=3)
    results = grid_lr_rec.fit(X_train, y_train).cv_results_
    
    # The grid is enumerated with 'penalty' varying fastest, so the ten
    # scores reshape to rows = C values, columns = ['l1', 'l2']
    return results['mean_test_score'].reshape(5, 2)
In [41]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

grid_values = {'penalty': ['l1', 'l2'],'C':[0.01, 0.1, 1, 10, 100]}

lr = LogisticRegression()
grid_lr_rec = GridSearchCV(lr, param_grid = grid_values, scoring = 'recall', cv=3)
ans = grid_lr_rec.fit(X_train, y_train).cv_results_

ans
Out[41]:
{'mean_fit_time': array([ 0.03675071,  0.06236728,  0.03626378,  0.10213105,  0.11926373,
         0.13743035,  0.22438375,  0.16281295,  0.25784826,  0.16818285]),
 'mean_score_time': array([ 0.00163666,  0.00131162,  0.00131639,  0.00130908,  0.00150307,
         0.00131734,  0.00138028,  0.00132823,  0.00134802,  0.00134675]),
 'mean_test_score': array([ 0.66666667,  0.76086957,  0.80072464,  0.80434783,  0.8115942 ,
         0.8115942 ,  0.80797101,  0.8115942 ,  0.80797101,  0.80797101]),
 'mean_train_score': array([ 0.68115942,  0.77355072,  0.80615942,  0.8134058 ,  0.81884058,
         0.81702899,  0.82427536,  0.82427536,  0.82427536,  0.82427536]),
 'param_C': masked_array(data = [0.01 0.01 0.1 0.1 1 1 10 10 100 100],
              mask = [False False False False False False False False False False],
        fill_value = ?),
 'param_penalty': masked_array(data = ['l1' 'l2' 'l1' 'l2' 'l1' 'l2' 'l1' 'l2' 'l1' 'l2'],
              mask = [False False False False False False False False False False],
        fill_value = ?),
 'params': [{'C': 0.01, 'penalty': 'l1'},
  {'C': 0.01, 'penalty': 'l2'},
  {'C': 0.1, 'penalty': 'l1'},
  {'C': 0.1, 'penalty': 'l2'},
  {'C': 1, 'penalty': 'l1'},
  {'C': 1, 'penalty': 'l2'},
  {'C': 10, 'penalty': 'l1'},
  {'C': 10, 'penalty': 'l2'},
  {'C': 100, 'penalty': 'l1'},
  {'C': 100, 'penalty': 'l2'}],
 'rank_test_score': array([10,  9,  8,  7,  2,  2,  4,  1,  4,  4], dtype=int32),
 'split0_test_score': array([ 0.70652174,  0.80434783,  0.80434783,  0.82608696,  0.82608696,
         0.82608696,  0.82608696,  0.83695652,  0.82608696,  0.82608696]),
 'split0_train_score': array([ 0.69021739,  0.77173913,  0.79891304,  0.80978261,  0.82065217,
         0.81521739,  0.82065217,  0.82065217,  0.82065217,  0.82065217]),
 'split1_test_score': array([ 0.69565217,  0.75      ,  0.83695652,  0.83695652,  0.83695652,
         0.83695652,  0.83695652,  0.83695652,  0.83695652,  0.83695652]),
 'split1_train_score': array([ 0.67391304,  0.76630435,  0.79891304,  0.80434783,  0.80978261,
         0.80978261,  0.81521739,  0.81521739,  0.81521739,  0.81521739]),
 'split2_test_score': array([ 0.59782609,  0.72826087,  0.76086957,  0.75      ,  0.77173913,
         0.77173913,  0.76086957,  0.76086957,  0.76086957,  0.76086957]),
 'split2_train_score': array([ 0.67934783,  0.7826087 ,  0.82065217,  0.82608696,  0.82608696,
         0.82608696,  0.83695652,  0.83695652,  0.83695652,  0.83695652]),
 'std_fit_time': array([ 0.00248158,  0.00043832,  0.00593771,  0.00189305,  0.03923745,
         0.01413375,  0.08077784,  0.02312068,  0.13505623,  0.0111456 ]),
 'std_score_time': array([  1.30144276e-04,   3.48212672e-05,   4.74580085e-06,
          8.13500900e-06,   1.30997552e-04,   5.52093329e-06,
          5.07267239e-05,   3.34585459e-05,   7.43570432e-06,
          2.05596764e-05]),
 'std_test_score': array([ 0.04887948,  0.03199913,  0.03116785,  0.03868507,  0.02852901,
         0.02852901,  0.03360007,  0.03586774,  0.03360007,  0.03360007]),
 'std_train_score': array([ 0.00677836,  0.00677836,  0.01024792,  0.00923735,  0.00677836,
         0.00677836,  0.00923735,  0.00923735,  0.00923735,  0.00923735])}
In [42]:
ans['mean_test_score'].reshape(5,2)
Out[42]:
array([[ 0.66666667,  0.76086957],
       [ 0.80072464,  0.80434783],
       [ 0.8115942 ,  0.8115942 ],
       [ 0.80797101,  0.8115942 ],
       [ 0.80797101,  0.80797101]])
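The reshape is valid because GridSearchCV enumerates the grid with parameter names sorted alphabetically ('C' before 'penalty') and the last name varying fastest, so penalty cycles within each value of C. This can be confirmed from the recorded combinations:

In [ ]:
# Each row of the 5x2 array is one C value; columns are ['l1', 'l2']
[(p['C'], p['penalty']) for p in ans['params']]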
In [46]:
# Use the following function to help visualize results from the grid search
def GridSearch_Heatmap(scores):
    %matplotlib notebook
    import seaborn as sns
    import matplotlib.pyplot as plt
    plt.figure()
    sns.heatmap(scores.reshape(5,2), xticklabels=['l1','l2'], yticklabels=[0.01, 0.1, 1, 10, 100])
    plt.yticks(rotation=0);

GridSearch_Heatmap(answer_six())