Assignment 4 - Understanding and Predicting Property Maintenance Fines

This assignment is based on a data challenge from the Michigan Data Science Team (MDST).

The Michigan Data Science Team (MDST) and the Michigan Student Symposium for Interdisciplinary Statistical Sciences (MSSISS) have partnered with the City of Detroit to help solve one of the most pressing problems facing Detroit - blight. Blight violations are issued by the city to individuals who allow their properties to remain in a deteriorated condition. Every year, the city of Detroit issues millions of dollars in fines to residents and every year, many of these fines remain unpaid. Enforcing unpaid blight fines is a costly and tedious process, so the city wants to know: how can we increase blight ticket compliance?

The first step in answering this question is understanding when and why a resident might fail to comply with a blight ticket. This is where predictive modeling comes in. For this assignment, your task is to predict whether a given blight ticket will be paid on time.

All data for this assignment has been provided to us through the Detroit Open Data Portal. Only the data already included in your Coursera directory can be used for training the model for this assignment. Nonetheless, we encourage you to look into data from other Detroit datasets to help inform feature creation and model selection. We recommend taking a look at the following related datasets:


We provide you with two data files for use in training and validating your models: train.csv and test.csv. Each row in these two files corresponds to a single blight ticket, and includes information about when, why, and to whom each ticket was issued. The target variable is compliance, which is True if the ticket was paid early, on time, or within one month of the hearing date, False if the ticket was paid after the hearing date or not at all, and Null if the violator was found not responsible. Compliance, as well as a handful of other variables that will not be available at test-time, are only included in train.csv.

Note: All tickets where the violators were found not responsible are not considered during evaluation. They are included in the training set as an additional source of data for visualization, and to enable unsupervised and semi-supervised approaches. However, they are not included in the test set.


File descriptions (Use only this data for training your model!)

train.csv - the training set (all tickets issued 2004-2011)
test.csv - the test set (all tickets issued 2012-2016)
addresses.csv & latlons.csv - mapping from ticket id to addresses, and from addresses to lat/lon coordinates. 
 Note: misspelled addresses may be incorrectly geolocated.
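As a hedged sketch of how these two mapping files can be chained and joined onto the tickets (the file paths below are assumptions and may need to point at your Coursera data directory):

import pandas as pd

# addresses.csv: ticket_id -> address; latlons.csv: address -> lat/lon
addresses = pd.read_csv('addresses.csv')
latlons = pd.read_csv('latlons.csv')
train = pd.read_csv('train.csv', encoding='ISO-8859-1')

# chain the two mappings, then attach coordinates to each ticket by ticket_id
ticket_coords = pd.merge(addresses, latlons, on='address')
train = pd.merge(train, ticket_coords, on='ticket_id', how='left')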


Data fields

train.csv & test.csv

ticket_id - unique identifier for tickets
agency_name - Agency that issued the ticket
inspector_name - Name of inspector that issued the ticket
violator_name - Name of the person/organization that the ticket was issued to
violation_street_number, violation_street_name, violation_zip_code - Address where the violation occurred
mailing_address_str_number, mailing_address_str_name, city, state, zip_code, non_us_str_code, country - Mailing address of the violator
ticket_issued_date - Date and time the ticket was issued
hearing_date - Date and time the violator's hearing was scheduled
violation_code, violation_description - Type of violation
disposition - Judgment and judgment type
fine_amount - Violation fine amount, excluding fees
admin_fee - $20 fee assigned to responsible judgments
state_fee - $10 fee assigned to responsible judgments
late_fee - 10% fee assigned to responsible judgments
discount_amount - discount applied, if any
clean_up_cost - DPW clean-up or graffiti removal cost
judgment_amount - Sum of all fines and fees
grafitti_status - Flag for graffiti violations

train.csv only

payment_amount - Amount paid, if any
payment_date - Date payment was made, if it was received
payment_status - Current payment status as of Feb 1 2017
balance_due - Fines and fees still owed
collection_status - Flag for payments in collections
compliance [target variable for prediction] 
 Null = Not responsible
 0 = Responsible, non-compliant
 1 = Responsible, compliant
compliance_detail - More information on why each ticket was marked compliant or non-compliant



Evaluation

Your predictions will be given as the probability that the corresponding blight ticket will be paid on time.

The evaluation metric for this assignment is the Area Under the ROC Curve (AUC).

Your grade will be based on the AUC score computed for your classifier. A model with an AUROC of 0.7 passes this assignment; a model with an AUROC over 0.75 will receive full points.
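As a rough, hedged sketch of how you might estimate this score locally before submitting (the numeric feature subset below is an illustrative assumption, not a recommended feature set):

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

train = pd.read_csv('train.csv', encoding='ISO-8859-1')
train = train[train['compliance'].notnull()]          # keep only responsible tickets

features = ['fine_amount', 'late_fee', 'discount_amount', 'judgment_amount']  # illustrative subset
X = train[features].fillna(0)
y = train['compliance']

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
print(roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1]))   # held-out AUROC estimate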


For this assignment, create a function that trains a model to predict blight ticket compliance in Detroit using train.csv. Using this model, return a series of length 61001, with the values being the probability that each corresponding ticket from test.csv will be paid on time, and the index being the ticket_id.

Example:

ticket_id
   284932    0.531842
   285362    0.401958
   285361    0.105928
   285338    0.018572
             ...
   376499    0.208567
   376500    0.818759
   369851    0.018528
   Name: compliance, dtype: float32
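For reference only, a minimal skeleton of the required return shape, assuming test.csv sits in the working directory; the constant 0.5 probabilities are a placeholder rather than a real model and would not pass the AUC threshold:

import pandas as pd

def blight_model():
    # read the test tickets and return one probability per ticket_id
    test = pd.read_csv('test.csv', encoding='ISO-8859-1')

    # placeholder probabilities illustrating the expected return type:
    # a Series of length 61001 indexed by ticket_id
    return pd.Series(0.5, index=test['ticket_id'], name='compliance')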

Hints

  • Make sure your code is working before submitting it to the autograder.

  • Print out your result to see whether there is anything weird (e.g., all probabilities are the same).

  • Generally the total runtime should be less than 10 mins. You should NOT use Neural Network related classifiers (e.g., MLPClassifier) in this question.

  • Try to avoid global variables. If you have other functions besides blight_model, you should move those functions inside the scope of blight_model.

  • Refer to the pinned threads in Week 4's discussion forum when there is something you cannot figure out.

In [187]:
import pandas as pd
import numpy as np

def blight_model():
    
    from sklearn.preprocessing import LabelEncoder
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.metrics import roc_auc_score
        

    # load the provided data files
    train = pd.read_csv('~/data/train2.csv', encoding = "ISO-8859-1")
    test = pd.read_csv('~/data/test.csv')
    addresses = pd.read_csv('~/data/addresses.csv')
    latlons = pd.read_csv('~/data/latlons.csv')

    # drop rows with Null compliance (not-responsible tickets) and non-US addresses
    train = train[np.isfinite(train['compliance'])]
    train = train[train.country == 'USA']
    test = test[test.country == 'USA']

    # attach lat/lon coordinates to each ticket via the address mapping
    train = pd.merge(train, pd.merge(addresses, latlons, on='address'), on='ticket_id')
    test = pd.merge(test, pd.merge(addresses, latlons, on='address'), on='ticket_id')

    # drop leakage columns (payment info), free-text fields, and address details
    train.drop(['agency_name', 'inspector_name', 'violator_name', 'non_us_str_code', 'violation_description','grafitti_status', 
                'state_fee', 'admin_fee', 'ticket_issued_date', 'hearing_date', 'payment_amount', 'balance_due', 'payment_date', 
                'payment_status','collection_status', 'compliance_detail', 'violation_zip_code', 'country', 'address',
                'violation_street_number','violation_street_name', 'mailing_address_str_number', 'mailing_address_str_name', 
                'city', 'state', 'zip_code', 'address'], axis=1, inplace=True)
    
    # label-encode the remaining string (object) columns
    label_encoder = LabelEncoder()
    for col in train.columns[train.dtypes == "object"]:
        train[col] = label_encoder.fit_transform(train[col])

    # forward-fill missing coordinates, then restrict test to the training feature columns
    train['lat'] = train['lat'].fillna(method='pad')
    train['lon'] = train['lon'].fillna(method='pad')
    test['lat'] = test['lat'].fillna(method='pad')
    test['lon'] = test['lon'].fillna(method='pad')
    train_columns = list(train.columns.values)
    train_columns.remove('compliance')
    test = test[train_columns]
    
    # grid-search a random forest regressor; its real-valued predictions are used as compliance probabilities
    X_train, X_test, y_train, y_test = train_test_split(train.ix[:, train.columns != 'compliance'], train['compliance'])
    rf = RandomForestRegressor()
    grid_values = {'n_estimators': [10, 200], 'max_depth': [3, 50]}
    grid_rf_auc = GridSearchCV(rf, param_grid=grid_values, scoring='roc_auc')
    grid_rf_auc.fit(X_train, y_train)
    print('Model best parameter (max. AUC): ', grid_rf_auc.best_params_)
    print('Model score (AUC): ', grid_rf_auc.best_score_)

    # label-encode the test set's object columns (note: the encoder is re-fit on test, independently of train)
    for col in test.columns[test.dtypes == "object"]:
        test[col] = label_encoder.fit_transform(test[col])

    # predicted values for each test ticket, indexed by ticket_id
    ans = pd.DataFrame(grid_rf_auc.predict(test), test.ticket_id)
    
    return ans
In [188]:
blight_model()
/home/sabodhapati/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2808: DtypeWarning: Columns (11,12,31) have mixed types. Specify dtype option on import or set low_memory=False.
  if self.run_code(code, result):
/home/sabodhapati/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:42: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
Model best parameter (max. AUC):  {'max_depth': 50, 'n_estimators': 200}
Model score (AUC):  0.808579957714
Out[188]:
0
ticket_id
284932 0.700
285362 0.130
285361 0.165
285338 0.265
285346 0.280
285345 0.240
285347 0.385
285342 0.815
285530 0.455
284989 0.485
285344 0.380
285343 0.375
285340 0.485
285341 0.410
285349 0.295
285348 0.255
284991 0.490
285532 0.440
285406 0.120
285001 0.420
285006 0.530
285405 0.115
285337 0.125
285496 0.630
285497 0.620
285378 0.210
285589 0.135
285585 0.325
285501 0.645
285581 0.095
... ...
376367 0.165
376366 0.340
376362 0.230
376363 0.240
376365 0.165
376364 0.340
376228 0.355
376265 0.355
376286 0.940
376320 0.630
376314 0.695
376327 0.930
376385 0.930
376435 0.780
376370 0.790
376434 0.190
376459 0.665
376478 0.090
376473 0.650
376484 0.425
376482 0.150
376480 0.150
376479 0.150
376481 0.150
376483 0.225
376496 0.155
376497 0.155
376499 0.645
376500 0.645
369851 0.870

61001 rows × 1 columns

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
In [23]:
train = pd.read_csv("~/data/train.csv",encoding = 'ISO-8859-1')
/home/sabodhapati/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2698: DtypeWarning: Columns (11,12,19,20,21,31,33,34,35) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
Out[23]:
Index(['ticket_id', 'agency_name', 'inspector_name', 'violator_name',
       'violation_street_number', 'violation_street_name',
       'violation_zip_code', 'mailing_address_str_number',
       'mailing_address_str_name', 'city', 'state', 'zip_code',
       'non_us_str_code', 'country', 'ticket_issued_date', 'hearing_date',
       'violation_code', 'violation_description', 'disposition', 'fine_amount',
       'admin_fee', 'state_fee', 'late_fee', 'discount_amount',
       'clean_up_cost', 'judgment_amount', 'payment_amount', 'balance_due',
       'payment_date', 'payment_status', 'collection_status',
       'grafitti_status', 'compliance_detail', 'compliance', 'Unnamed: 34',
       'Unnamed: 35'],
      dtype='object')
In [37]:
train = pd.read_csv("~/data/train.csv",encoding = 'ISO-8859-1')
test = pd.read_csv("~/data/test.csv") 
/home/sabodhapati/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2698: DtypeWarning: Columns (11,12,19,20,21,31,33,34,35) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
In [24]:
train = train[(train["compliance"] == 1) | (train["compliance"] ==0)]
In [38]:
addresses = pd.read_csv('~/data/addresses.csv')
latlons = pd.read_csv('~/data/latlons.csv')
In [26]:
temp = pd.merge(addresses, latlons, on = 'address')
train = pd.merge(train, temp, on = 'ticket_id')
test  = pd.merge(test, temp, on = 'ticket_id')
In [20]:
pd.get_dummies?
In [32]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestRegressor

In [200]:
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score
        
# load data
train = pd.read_csv('~/data/train2.csv', encoding = "ISO-8859-1")
#train = train[np.isfinite(train['compliance'])]
#train.drop(['Unnamed: 34','Unnamed: 35'], axis=1, inplace=True)
test = pd.read_csv('~/data/test.csv')
addresses = pd.read_csv('~/data/addresses.csv')
latlons = pd.read_csv('~/data/latlons.csv')
# drop all rows with Null compliance
train = train[np.isfinite(train['compliance'])]
#train_df = train_df.dropna(subset = ['compliance'])#drop rows in training data set where target is NaN


# drop all rows not in the U.S
train = train[train.country == 'USA']
test = test[test.country == 'USA']
# merge latlons and addresses with data
train = pd.merge(train, pd.merge(addresses, latlons, on='address'), on='ticket_id')
test = pd.merge(test, pd.merge(addresses, latlons, on='address'), on='ticket_id')
# drop all unnecessary columns
train.drop(['agency_name', 'inspector_name', 'violator_name', 'non_us_str_code', 'violation_description','grafitti_status', 
            'state_fee', 'admin_fee', 'ticket_issued_date', 'hearing_date', 'payment_amount', 'balance_due', 'payment_date', 
            'payment_status','collection_status', 'compliance_detail', 'violation_zip_code', 'country', 'address',
            'violation_street_number','violation_street_name', 'mailing_address_str_number', 'mailing_address_str_name', 
            'city', 'state', 'zip_code', 'address'], axis=1, inplace=True)

# discretizing relevant columns
#label_encoder = LabelEncoder()
#label_encoder.fit(train['disposition'].append(test['disposition'], ignore_index=True))
#train['disposition'] = label_encoder.transform(train['disposition'])
#test['disposition'] = label_encoder.transform(test['disposition'])
#label_encoder = LabelEncoder()
#label_encoder.fit(train['violation_code'].append(test['violation_code'], ignore_index=True))
#train['violation_code'] = label_encoder.transform(train['violation_code'])
#test['violation_code'] = label_encoder.transform(test['violation_code'])

label_encoder = LabelEncoder()
for col in train.columns[train.dtypes == "object"]:
    train[col] = label_encoder.fit_transform(train[col])
    
train['lat'] = train['lat'].fillna(method='pad') #train['lat'].mean()
train['lon'] = train['lon'].fillna(method='pad') #train['lon'].mean()
test['lat'] = test['lat'].fillna(method='pad') #test['lat'].mean()
test['lon'] = test['lon'].fillna(method='pad') #test['lon'].mean()
train_columns = list(train.columns.values)
train_columns.remove('compliance')
test = test[train_columns]
    

# train the model
    
X_train, X_test, y_train, y_test = train_test_split(train.ix[:, train.columns != 'compliance'], train['compliance'])
rf = RandomForestRegressor()
grid_values = {'n_estimators': [10, 30], 'max_depth': [3, 10]}
grid_rf_auc = GridSearchCV(rf, param_grid=grid_values, scoring='roc_auc')
grid_rf_auc.fit(X_train, y_train)
print('Grid best parameter (max. AUC): ', grid_rf_auc.best_params_)
print('Grid best score (AUC): ', grid_rf_auc.best_score_)


for col in test.columns[test.dtypes == "object"]:
    test[col] = label_encoder.fit_transform(test[col])

pd.DataFrame(grid_rf_auc.predict(test), test.ticket_id) #return 

#---------------------------------------------------------------------
# Switch to a GradientBoostingClassifier for the classification model;
# the data preprocessing above is shared with the algorithm below.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, precision_score,recall_score, accuracy_score, precision_recall_curve
from sklearn.ensemble import GradientBoostingClassifier

grid_values = {'learning_rate': [0.01, 0.1, 1]}
clf = GradientBoostingClassifier(random_state = 0)
grid = GridSearchCV(clf, param_grid = grid_values, scoring = 'roc_auc')
grid.fit(X_train, y_train)
result = grid.predict_proba(test)[:, 1]
print(grid.best_score_) #0.82247378736744237
result_new = pd.Series(result, index = test.ticket_id)
result_new #return
/home/sabodhapati/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2698: DtypeWarning: Columns (11,12,31) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
/home/sabodhapati/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:53: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
Grid best parameter (max. AUC):  {'max_depth': 10, 'n_estimators': 30}
Grid best score (AUC):  0.820089170536
Out[200]:
0
ticket_id
284932 0.418396
285362 0.124353
285361 0.289257
285338 0.436207
285346 0.439132
285345 0.418815
285347 0.436402
285342 0.973755
285530 0.335714
284989 0.520280
285344 0.554954
285343 0.350420
285340 0.332684
285341 0.423921
285349 0.439132
285348 0.418815
284991 0.520280
285532 0.352453
285406 0.200727
285001 0.483443
285006 0.464627
285405 0.124353
285337 0.164809
285496 0.373567
285497 0.406901
285378 0.155169
285589 0.161880
285585 0.341142
285501 0.463836
285581 0.136862
... ...
376367 0.089328
376366 0.280130
376362 0.199895
376363 0.204412
376365 0.089328
376364 0.280130
376228 0.500940
376265 0.341467
376286 0.990514
376320 0.464122
376314 0.404799
376327 0.993338
376385 0.956655
376435 0.913152
376370 0.926671
376434 0.117135
376459 0.418792
376478 0.171134
376473 0.400290
376484 0.589217
376482 0.123325
376480 0.123325
376479 0.123325
376481 0.123325
376483 0.171159
376496 0.085415
376497 0.085415
376499 0.398323
376500 0.398323
369851 0.993674

61001 rows × 1 columns

In [172]:
test.dtypes
Out[172]:
ticket_id            int64
violation_code       int64
disposition          int64
fine_amount        float64
late_fee           float64
discount_amount    float64
clean_up_cost      float64
judgment_amount    float64
lat                float64
lon                float64
dtype: object

In [ ]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, precision_score,recall_score, accuracy_score, precision_recall_curve
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
def blight_model():
    
    # Your code here
    data_train = pd.read_csv("train.csv", encoding = "ISO-8859-1")
    data_test = pd.read_csv("test.csv", encoding = "ISO-8859-1")
    addresses = pd.read_csv("addresses.csv",encoding="ISO-8859-1")
    data_train = pd.merge(data_train,addresses,on="ticket_id", how="inner")
    data_test = pd.merge(data_test,addresses,on="ticket_id",how="inner")
    
   
    
    data_train = data_train.dropna(subset=["compliance"])
    data_train["compliance"] = data_train["compliance"].astype(int)
    
    convert_columns={'country': 'category',
                     'non_us_str_code': 'category',
                     'compliance': 'category',
                     'state': 'category',
                     'zip_code': 'category'
                    }
    for df in [data_test, data_train]:
        for col, col_type in convert_columns.items():
            if col in df:
                if col_type == 'category':
                    df[col] = df[col].replace(np.nan, "NA", regex=True).astype(col_type)
                elif col_type == 'int':
                    df[col] = df[col].replace(np.nan, 0, regex=True).astype(col_type)
    #print(data_train.head())
    #print(data_train.isnull().any())
    
    #dropping the columns we don't need
    common_cols_to_drop = ['agency_name', 'inspector_name', 'mailing_address_str_number',
                           'violator_name', 'violation_street_number', 'violation_street_name',
                           'mailing_address_str_name', 'address', 'admin_fee', 'violation_zip_code',
                           'state_fee', 'late_fee', 'ticket_issued_date', 'hearing_date', 'violation_description',
                           'fine_amount', 'clean_up_cost', 'disposition', 'grafitti_status',
                           'violation_code', 'city']
    data_train_cols_to_drop = ['payment_status', 'payment_date', 'balance_due', 'payment_amount'] + common_cols_to_drop
    
    data_test = data_test.drop(common_cols_to_drop,axis=1).set_index("ticket_id")
    data_train = data_train.drop(data_train_cols_to_drop,axis=1).set_index("ticket_id")
    
    #print (data_test.head())
    y_train = data_train["compliance"]
    data_train = data_train.drop(["compliance", "compliance_detail","collection_status"],axis=1)
    
    cat = data_train.select_dtypes(['category']).columns
    #print(cat)
    for df in [data_test, data_train]:
        df[cat] = df[cat].apply(lambda x: x.cat.codes)
    X_train = data_train.copy()
    grid_values = {'learning_rate': [0.01, 0.1, 1]}
    clf = GradientBoostingClassifier(random_state = 0)
    grid = GridSearchCV(clf, param_grid = grid_values, scoring = 'roc_auc')
    grid.fit(X_train, y_train)
    result = grid.predict_proba(data_test)[:, 1]
    result_new = pd.Series(result, index = data_test.index)
    
    return result_new
    #return # Your answer here
blight_model()

In [164]:
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score
        

def blight_model():
    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.tree import DecisionTreeClassifier
    from datetime import datetime
    def time_gap(hearing_date_str, ticket_issued_date_str):
        if not hearing_date_str or type(hearing_date_str)!=str: return 73
        hearing_date = datetime.strptime(hearing_date_str, "%Y-%m-%d %H:%M:%S")
        ticket_issued_date = datetime.strptime(ticket_issued_date_str, "%Y-%m-%d %H:%M:%S")
        gap = hearing_date - ticket_issued_date
        return gap.days
    train_data = pd.read_csv('~/data/train2.csv', encoding = 'ISO-8859-1')
    #train_data.drop(['Unnamed: 34','Unnamed: 35'], axis=1, inplace=True)
    test_data = pd.read_csv('~/data/test.csv')
    train_data = train_data[(train_data['compliance'] == 0) | (train_data['compliance'] == 1)]
    address =  pd.read_csv('~/data/addresses.csv')
    latlons = pd.read_csv('~/data/latlons.csv')
    address = address.set_index('address').join(latlons.set_index('address'), how='left')
    train_data = train_data.set_index('ticket_id').join(address.set_index('ticket_id'))
    test_data = test_data.set_index('ticket_id').join(address.set_index('ticket_id'))
    # engineer the gap (in days) between ticket issue and hearing dates
    train_data = train_data[~train_data['hearing_date'].isnull()]
    train_data['time_gap'] = train_data.apply(lambda row: time_gap(row['hearing_date'], row['ticket_issued_date']), axis=1)
    test_data['time_gap'] = test_data.apply(lambda row: time_gap(row['hearing_date'], row['ticket_issued_date']), axis=1)
    feature_to_be_splitted = ['agency_name', 'state', 'disposition']
    train_data.lat.fillna(method='pad', inplace=True)
    train_data.lon.fillna(method='pad', inplace=True)
    train_data.state.fillna(method='pad', inplace=True)

    test_data.lat.fillna(method='pad', inplace=True)
    test_data.lon.fillna(method='pad', inplace=True)
    test_data.state.fillna(method='pad', inplace=True)
    train_data = pd.get_dummies(train_data, columns=feature_to_be_splitted)
    test_data = pd.get_dummies(test_data, columns=feature_to_be_splitted)
    list_to_remove_train = [
        'balance_due',
        'collection_status',
        'compliance_detail',
        'payment_amount',
        'payment_date',
        'payment_status'
    ]
    list_to_remove_all = ['fine_amount', 'violator_name', 'zip_code', 'country', 'city',
                          'inspector_name', 'violation_street_number', 'violation_street_name',
                          'violation_zip_code', 'violation_description',
                          'mailing_address_str_number', 'mailing_address_str_name',
                          'non_us_str_code',
                          'ticket_issued_date', 'hearing_date', 'grafitti_status', 'violation_code']
    train_data.drop(list_to_remove_train, axis=1, inplace=True)
    train_data.drop(list_to_remove_all, axis=1, inplace=True)
    test_data.drop(list_to_remove_all, axis=1, inplace=True)
    train_features = train_data.columns.drop('compliance')
    train_features_set = set(train_features)
    
    for feature in set(train_features):
        if feature not in test_data:
            train_features_set.remove(feature)
    train_features = list(train_features_set)
    
    X_train = train_data[train_features]
    y_train = train_data.compliance
    X_test = test_data[train_features]
    
    scaler = MinMaxScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    clf = MLPClassifier(hidden_layer_sizes = [100, 10], alpha = 5,
                       random_state = 0, solver='lbfgs', verbose=0)
#     clf = DecisionTreeClassifier()
    clf.fit(X_train_scaled, y_train)
    
    test_proba = clf.predict_proba(X_test_scaled)[:,1]

    
    test_df = pd.read_csv('test.csv', encoding = "ISO-8859-1")
    test_df['compliance'] = test_proba
    test_df.set_index('ticket_id', inplace=True)
   
    return test_df.compliance
blight_model()
/home/sabodhapati/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2808: DtypeWarning: Columns (11,12,31) have mixed types. Specify dtype option on import or set low_memory=False.
  if self.run_code(code, result):
Out[164]:
ticket_id
284932    0.057335
285362    0.017222
285361    0.050029
285338    0.067539
285346    0.075920
285345    0.067772
285347    0.080323
285342    0.550117
285530    0.012768
284989    0.050083
285344    0.079892
285343    0.026712
285340    0.026791
285341    0.080181
285349    0.074943
285348    0.070457
284991    0.066749
285532    0.053306
285406    0.035078
285001    0.025056
285006    0.010218
285405    0.028740
285337    0.011582
285496    0.066945
285497    0.056424
285378    0.028420
285589    0.029135
285585    0.045543
285501    0.063581
285581    0.016921
            ...   
376367    0.011935
376366    0.029193
376362    0.213796
376363    0.229523
376365    0.011935
376364    0.029193
376228    0.036970
376265    0.028460
376286    0.242697
376320    0.034552
376314    0.035135
376327    0.312466
376385    0.283308
376435    0.527915
376370    0.423595
376434    0.056980
376459    0.066174
376478    0.000133
376473    0.034752
376484    0.033396
376482    0.032976
376480    0.032976
376479    0.032976
376481    0.032976
376483    0.044413
376496    0.008347
376497    0.008347
376499    0.057490
376500    0.057481
369851    0.079108
Name: compliance, Length: 61001, dtype: float64

In [205]:
train_data = pd.read_csv("~/data/train2.csv",delimiter=",",encoding='ISO-8859-1')
train_data = train_data[np.isfinite(train_data['compliance'])]
/home/sabodhapati/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2698: DtypeWarning: Columns (11,12,31) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
In [201]:
train_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 158605 entries, 0 to 159879
Data columns (total 18 columns):
agency_name              158605 non-null int64
inspector_name           158605 non-null int64
zip_code                 158605 non-null int64
violation_description    158605 non-null int64
disposition              158605 non-null int64
admin_fee                158605 non-null float64
state_fee                158605 non-null float64
late_fee                 158605 non-null float64
discount_amount          158605 non-null float64
clean_up_cost            158605 non-null float64
compliance               158605 non-null float64
lat                      158605 non-null float64
lon                      158605 non-null float64
year                     158605 non-null int64
dow                      158605 non-null int64
woy                      158605 non-null int64
h_woy                    158605 non-null int64
h_dow                    158605 non-null int64
dtypes: float64(8), int64(10)
memory usage: 28.0 MB

Select non-null target values from the train data

In [128]:
train_data.compliance.value_counts()
Out[128]:
0.0    148283
1.0     11597
Name: compliance, dtype: int64
In [202]:
train_data = train_data[~(train_data.compliance.isnull())]
train_data.shape
Out[202]:
(158605, 18)

Combine the train data, addresses data, and latlons data

In [207]:
addresses_data = pd.read_csv("~/data/addresses.csv",delimiter=",")
latlons_data = pd.read_csv("~/data/latlons.csv",delimiter=",")
add_loc_data = pd.merge(addresses_data,latlons_data,on="address")
train_data = pd.merge(train_data,add_loc_data,on="ticket_id")
add_loc_data.head()
train_data.head()
Out[207]:
ticket_id agency_name inspector_name violator_name violation_street_number violation_street_name violation_zip_code mailing_address_str_number mailing_address_str_name city ... collection_status grafitti_status compliance_detail compliance address_x lat_x lon_x address_y lat_y lon_y
0 22056 Buildings, Safety Engineering & Env Department Sims, Martinzie INVESTMENT INC., MIDWEST MORTGAGE 2900.0 TYLER NaN 3.0 S. WICKER CHICAGO ... NaN NaN non-compliant by no payment 0.0 2900 tyler, Detroit MI 42.390729 -83.124268 2900 tyler, Detroit MI 42.390729 -83.124268
1 27586 Buildings, Safety Engineering & Env Department Williams, Darrin Michigan, Covenant House 4311.0 CENTRAL NaN 2959.0 Martin Luther King Detroit ... NaN NaN compliant by late payment within 1 month 1.0 4311 central, Detroit MI 42.326937 -83.135118 4311 central, Detroit MI 42.326937 -83.135118
2 22046 Buildings, Safety Engineering & Env Department Sims, Martinzie KASIMU, UKWELI 6478.0 NORTHFIELD NaN 2755.0 E. 17TH LOG BEACH ... NaN NaN non-compliant by no payment 0.0 6478 northfield, Detroit MI 42.145257 -83.208233 6478 northfield, Detroit MI 42.145257 -83.208233
3 18738 Buildings, Safety Engineering & Env Department Williams, Darrin Deerwood Development Group Inc, Deer 8027.0 BRENTWOOD NaN 476.0 Garfield Clinton ... NaN NaN non-compliant by no payment 0.0 8027 brentwood, Detroit MI 42.433466 -83.023493 8027 brentwood, Detroit MI 42.433466 -83.023493
4 18735 Buildings, Safety Engineering & Env Department Williams, Darrin Rafee Auto Services L.L.C., RAF 8228.0 MT ELLIOTT NaN 8228.0 Mt. Elliott Detroit ... NaN NaN non-compliant by no payment 0.0 8228 mt elliott, Detroit MI 42.388641 -83.037858 8228 mt elliott, Detroit MI 42.388641 -83.037858

5 rows × 40 columns

Delete the violation address and mailing address columns (except for the zip code), and grafitti_status, which contains only NaN values

In [131]:
train_data = train_data.drop(["violation_street_number","violation_street_name","violation_zip_code",
                 "mailing_address_str_number","mailing_address_str_name","city","state","country",
                 "non_us_str_code","grafitti_status","address"],axis = 1)
In [208]:
train_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 159880 entries, 0 to 159879
Data columns (total 40 columns):
ticket_id                     159880 non-null int64
agency_name                   159880 non-null object
inspector_name                159880 non-null object
violator_name                 159854 non-null object
violation_street_number       159880 non-null float64
violation_street_name         159880 non-null object
violation_zip_code            0 non-null float64
mailing_address_str_number    157322 non-null float64
mailing_address_str_name      159877 non-null object
city                          159880 non-null object
state                         159796 non-null object
zip_code                      159879 non-null object
non_us_str_code               3 non-null object
country                       159880 non-null object
ticket_issued_date            159880 non-null object
hearing_date                  159653 non-null object
violation_code                159880 non-null object
violation_description         159880 non-null object
disposition                   159880 non-null object
fine_amount                   159880 non-null float64
admin_fee                     159880 non-null float64
state_fee                     159880 non-null float64
late_fee                      159880 non-null float64
discount_amount               159880 non-null float64
clean_up_cost                 159880 non-null float64
judgment_amount               159880 non-null float64
payment_amount                159880 non-null float64
balance_due                   159880 non-null float64
payment_date                  39611 non-null object
payment_status                159880 non-null object
collection_status             36897 non-null object
grafitti_status               0 non-null object
compliance_detail             159880 non-null object
compliance                    159880 non-null float64
address_x                     159880 non-null object
lat_x                         159878 non-null float64
lon_x                         159878 non-null float64
address_y                     159880 non-null object
lat_y                         159878 non-null float64
lon_y                         159878 non-null float64
dtypes: float64(17), int64(1), object(22)
memory usage: 50.0+ MB

Delete the observations with malformed zip_code values

In [209]:
bad_zipcode = train_data.zip_code[train_data.zip_code.apply(lambda x: len(str(x)) != 5)].index
bad_zipcode
Out[209]:
Int64Index([     2,    336,    492,    584,    680,    997,   1009,   1487,
              1502,   1577,
            ...
            159179, 159190, 159307, 159324, 159410, 159466, 159635, 159751,
            159769, 159801],
           dtype='int64', length=1049)
In [210]:
train_data.drop(bad_zipcode,axis = 0,inplace=True)
In [211]:
train_data.shape
Out[211]:
(158831, 40)
In [212]:
train_data.zip_code = train_data.zip_code.astype("int64")
In [213]:
train_data = train_data[~(train_data.hearing_date.isnull())]
train_data.shape
Out[213]:
(158607, 40)
In [214]:
train_data.payment_status.value_counts()
Out[214]:
NO PAYMENT APPLIED         119450
PAID IN FULL                30049
PARTIAL PAYMENT APPLIED      9108
Name: payment_status, dtype: int64
In [215]:
train_data.drop(["payment_date","collection_status","payment_status"],1,inplace=True)
In [217]:
train_data.drop(["payment_amount","balance_due","compliance_detail"],1,inplace= True)
train_data = train_data[~(train_data.lat.isnull())]
train_data.drop("violator_name",1,inplace=True)
train_data.drop("ticket_id",1,inplace=True)
train_data.info()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-217-7eb775c2671a> in <module>()
----> 1 train_data.drop(["payment_amount","balance_due","compliance_detail"],1,inplace= True)
      2 #train_data = train_data[~(train_data.lat.isnull())]
      3 train_data.drop("violator_name",1,inplace=True)
      4 train_data.drop("ticket_id",1,inplace=True)
      5 train_data.info()

~/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py in drop(self, labels, axis, level, inplace, errors)
   2159                 new_axis = axis.drop(labels, level=level, errors=errors)
   2160             else:
-> 2161                 new_axis = axis.drop(labels, errors=errors)
   2162             dropped = self.reindex(**{axis_name: new_axis})
   2163             try:

~/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py in drop(self, labels, errors)
   3622             if errors != 'ignore':
   3623                 raise ValueError('labels %s not contained in axis' %
-> 3624                                  labels[mask])
   3625             indexer = indexer[~mask]
   3626         return self.delete(indexer)

ValueError: labels ['payment_amount' 'balance_due' 'compliance_detail'] not contained in axis

Examine object variables

In [140]:
for col in train_data.columns[train_data.dtypes == "object"]:
    print("The number of unique values for '{}' is {}.".format(col,train_data[col].nunique()))
The number of unique values for 'agency_name' is 5.
The number of unique values for 'inspector_name' is 159.
The number of unique values for 'ticket_issued_date' is 67747.
The number of unique values for 'hearing_date' is 5966.
The number of unique values for 'violation_code' is 189.
The number of unique values for 'violation_description' is 207.
The number of unique values for 'disposition' is 4.
In [ ]:
train_data.disposition.value_counts()
In [141]:
test_data = pd.read_csv("test.csv",delimiter=",",encoding="ISO-8859-1")
In [142]:
for col in test_data.columns[test_data.dtypes == "object"]:
    print("The number of unique values for '{}' is {}.".format(col,test_data[col].nunique()))
The number of unique values for 'agency_name' is 3.
The number of unique values for 'inspector_name' is 116.
The number of unique values for 'violator_name' is 38515.
The number of unique values for 'violation_street_name' is 1477.
The number of unique values for 'violation_zip_code' is 58.
The number of unique values for 'mailing_address_str_number' is 9703.
The number of unique values for 'mailing_address_str_name' is 16851.
The number of unique values for 'city' is 3266.
The number of unique values for 'state' is 58.
The number of unique values for 'zip_code' is 2900.
The number of unique values for 'country' is 1.
The number of unique values for 'ticket_issued_date' is 33064.
The number of unique values for 'hearing_date' is 3312.
The number of unique values for 'violation_code' is 151.
The number of unique values for 'violation_description' is 163.
The number of unique values for 'disposition' is 8.
The number of unique values for 'grafitti_status' is 1.

Change the date columns ('ticket_issued_date', 'hearing_date') into datetime format and extract year, week-of-year, and day-of-week information

In [143]:
train_data.ticket_issued_date = pd.to_datetime(train_data.ticket_issued_date)
In [144]:
train_data.ticket_issued_date.head()
Out[144]:
0   2004-03-16 11:40:00
1   2004-04-23 12:30:00
3   2004-06-14 14:15:00
4   2004-06-16 12:30:00
5   2004-06-16 12:25:00
Name: ticket_issued_date, dtype: datetime64[ns]
In [145]:
train_data["year"] = [date.isocalendar()[0] for date in train_data.ticket_issued_date]
In [146]:
train_data["dow"] = [date.isocalendar()[2] for date in train_data.ticket_issued_date]
train_data["woy"] = [date.isocalendar()[1] for date in train_data.ticket_issued_date]
train_data.hearing_date = pd.to_datetime(train_data.hearing_date)
train_data["h_year"] = [date.isocalendar()[0] for date in train_data.hearing_date]
train_data["h_woy"] = [date.isocalendar()[1] for date in train_data.hearing_date]
train_data["h_dow"] = [date.isocalendar()[2] for date in train_data.hearing_date]
train_data.drop(["ticket_issued_date","hearing_date"],1,inplace=True)
train_data.drop("violation_code",1,inplace=True)
In [147]:
import matplotlib.pyplot as plt
import seaborn as sns
In [148]:
corr = train_data.corr()

%matplotlib inline
plt.figure(figsize=(12,12))
sns.heatmap(corr,annot=True)
Out[148]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f53ecf730f0>
In [149]:
%matplotlib inline
plt.figure(figsize=(12,12))
sns.heatmap(train_data.drop(["h_year","fine_amount","judgment_amount"],1).corr(),annot=True)
Out[149]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f53e9d01550>
In [150]:
train_data.drop(["h_year","fine_amount","judgment_amount"],1,inplace=True)

Do not forget to fit the LabelEncoder on both the train and test data at the same time at the end

In [151]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for col in train_data.columns[train_data.dtypes == "object"]:
    train_data[col] = le.fit_transform(train_data[col])
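Per the note above, a safer pattern (sketched here under the assumption that train_data and test_data expose the same object columns at this point) is to fit each encoder on the union of both frames so the integer codes agree:

from sklearn.preprocessing import LabelEncoder

for col in train_data.columns[train_data.dtypes == "object"]:
    le = LabelEncoder()
    # fit on the combined values so categories seen only in test still get a code
    combined = pd.concat([train_data[col], test_data[col]], ignore_index=True).astype(str)
    le.fit(combined)
    train_data[col] = le.transform(train_data[col].astype(str))
    test_data[col] = le.transform(test_data[col].astype(str))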
In [152]:
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

from sklearn.cross_validation import train_test_split
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
In [153]:
cv = StratifiedKFold(n_splits=5)
scaler  = MinMaxScaler()
clfs = [LogisticRegression(),GaussianNB(),
       RandomForestClassifier(random_state=0,n_estimators=100),AdaBoostClassifier()]

X = train_data.drop("compliance",1)
y = train_data.compliance

for clf in clfs:
    print(clf)
    for train, test in cv.split(X,y):
        X_train = scaler.fit_transform(X.iloc[train])
        X_test = scaler.transform(X.iloc[test])
        clf.fit(X_train,y.iloc[train])
        if hasattr(clf, "predict_proba"):
            prob_pos = clf.predict_proba(X_test)[:, 1]
        else:  # use decision function
            prob_pos = clf.decision_function(X_test)
            prob_pos = (prob_pos - prob_pos.min()) / (prob_pos.max() - prob_pos.min())
        fpr, tpr, _ = roc_curve(y.iloc[test], prob_pos)
        auc_score = auc(fpr,tpr)
        print(auc_score)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
0.67039842957
0.689655597598
0.703354934695
0.662813873675
0.689035657805
GaussianNB(priors=None)
0.784572923815
0.751614176083
0.810751064504
0.762176584884
0.72070840107
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)
0.640760017352
0.624394542917
0.691316770961
0.617916054982
0.682398600505
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
0.790753712224
0.723985440155
0.78847250068
0.747755534275
0.755070475445

As we can see from the results, GaussianNB and AdaBoostClassifier give us better performance, so we will use these two classifiers to tune our models.

In [154]:
param_grid = {"n_estimators":[10,100],"learning_rate":[0.01,0.1,1]}
ada = AdaBoostClassifier()
gs = GridSearchCV(ada,param_grid=param_grid,cv=5,scoring="roc_auc")
gs.fit(X,y)

cv = StratifiedKFold(n_splits=10)
scaler  = MinMaxScaler()
clfs = [GaussianNB(),AdaBoostClassifier(learning_rate=0.5, n_estimators=100)]

X = train_data.drop("compliance",1)
y = train_data.compliance

for clf in clfs:
    print(clf)
    sc = []
    for train, test in cv.split(X,y):
        X_train = scaler.fit_transform(X.iloc[train])
        X_test = scaler.transform(X.iloc[test])
        clf.fit(X_train,y.iloc[train])
        prob_pos = clf.predict_proba(X_test)[:, 1]
        fpr, tpr, _ = roc_curve(y.iloc[test], prob_pos)
        sc.append(auc(fpr,tpr))
    print(np.mean(sc))
GaussianNB(priors=None)
0.761055931107
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.5, n_estimators=100, random_state=None)
0.76508600177

pandas get_dummies

As I understand it, get_dummies converts a variable that takes several distinct values into 0/1 indicator columns. For example, suppose Xiaoming owns hats in three colors: yellow, red, and blue. Wearing the yellow hat today could be coded as 1, the red hat as 2, and the blue hat as 3. But the magnitudes of 1, 2, and 3 carry no meaning in themselves; they only distinguish the hat colors, so for the actual analysis the 1/2/3 codes need to be converted into 0/1 indicators, as the code below shows:

Author: dechuan; link: https://www.jianshu.com/p/c324f4101785

http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing

In [221]:
import pandas as pd 
xiaoming=pd.DataFrame([1,2,3],index=['yellow','red','blue'],columns=['hat']) 
print(xiaoming) 
hat_ranks=pd.get_dummies(xiaoming['hat'],prefix='hat') 
print(hat_ranks.head())
        hat
yellow    1
red       2
blue      3
        hat_1  hat_2  hat_3
yellow      1      0      0
red         0      1      0
blue        0      0      1