# 0. Introduction

It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly imbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

In the first part, a Logistic Regression model was built for several kinds of analysis. In this part, we will try Random Forest models. Since the data is imbalanced, we will try different methods and compare their results:

1. Model on imbalanced data directly
2. Model on over-sampled data
3. Assign more weight to the rare class
4. Use a custom loss function


## Conclusion:

1. As expected, using the imbalanced data directly is not a good approach. Its performance is the worst compared to over-sampling or class weights.
2. On the imbalanced data, RandomForestClassifier performs better than LogisticRegression.
3. If we can define a good custom loss function, model performance improves: here the custom loss function performs better than the roc_auc scoring function.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, Normalizer, scale
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report
from sklearn.model_selection import cross_val_score, train_test_split, KFold, StratifiedShuffleSplit, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

%matplotlib inline
plt.rcParams['figure.figsize'] = (16, 9)

seed = 999

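# `creditcard` is assumed to be the DataFrame already loaded from the dataset (the loading step is not shown here)
creditcard.columns = [x.lower() for x in creditcard.columns]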
creditcard.rename(columns = {'class': 'fraud'}, inplace = True)

# 1. Split Test Data Out
creditcard.drop(columns = 'time', inplace = True)

# Standardize the 'amount' column (zero mean, unit variance)
scaler = StandardScaler()
creditcard['amount'] = scaler.fit_transform(creditcard['amount'].values.reshape(-1, 1)).ravel()
# creditcard.drop(columns = 'amount', inplace = True)

X = creditcard.iloc[:, :-1]
y = creditcard.iloc[:, -1]
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size = .33, stratify = y, random_state = seed)


Why I checked the average transaction amount per class:

For non-fraud transactions, the average amount is 88. For fraud transactions, the average amount is 122. So, on average, a missed fraud costs 122. Suppose the company earns a 2% fee on each transaction; the average fee on a non-fraud transaction is then 88 * 2% = 1.76.

That means: if we predict a non-fraud as fraud, we might lose 1.76. However, if we fail to detect a fraud transaction, we will lose about 122.

Later I will use this to build a custom loss function.
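
The arithmetic above can be checked in a few lines. This is only a sketch: the 2% fee rate is the assumption stated in the text, and the variable names are illustrative, not part of the original notebook.

```python
# Average amounts quoted in the text
avg_nonfraud_amount = 88
avg_fraud_amount = 122
fee_rate = 0.02  # assumed transaction fee, as stated in the text

# Loss from wrongly flagging a legitimate transaction: the forgone fee
cost_false_positive = avg_nonfraud_amount * fee_rate
# Loss from missing a fraud: the full average fraud amount
cost_false_negative = avg_fraud_amount

# How many false alarms cost as much as one missed fraud
break_even = cost_false_negative / cost_false_positive

print(f"cost per false positive: {cost_false_positive:.2f}")  # 1.76
print(f"cost per false negative: {cost_false_negative:.2f}")  # 122.00
print(f"false positives per missed fraud: {break_even:.1f}")  # 69.3
```

Under these assumptions, one missed fraud costs roughly as much as 69 false alarms.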

# Modeling Part 2: RandomForestClassifier

Usually for imbalanced data, we can try:

1. Collect more data (which does not work here since the data is given)
2. Down-sample or over-sample to get balanced samples
3. Change the decision threshold to adjust the predictions
4. Assign class weights to the rare class


Here we will try 4 different ways and compare their results:

2.1. Do nothing, use the original data to model
2.2. Do over-sampling, use the over-sampled data to model
2.3. Assign class weights in RandomForestClassifier
2.4. Use a custom loss function


Since this is a fraud detection problem, if we fail to flag a fraud, the credit card company will lose a lot. If we mistakenly flag a normal transaction as fraud, we can still have an expert review the transaction, or we can ask the user to verify it. So in this specific case, a False Negative causes more loss than a False Positive.

# 1. Use the Imbalanced Data Directly in RandomForestClassifier

X = creditcard.iloc[:, :-1]
y = creditcard.iloc[:, -1]
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size = .33, stratify = y, random_state = seed)

estimator = RandomForestClassifier(random_state=0, warm_start = True)

rf_tuned_parameters = {"max_depth": [10, 20, 50, 100], 'n_estimators': [50, 100, 200, 500], 'min_samples_leaf': [10, 20, 50]}

cv_grid = GridSearchCV(estimator, param_grid = rf_tuned_parameters, scoring = 'roc_auc', verbose = 5, n_jobs = 70) # 'recall', my_score
cv_grid.fit(Xtrain, ytrain)

# print cv_grid.cv_results_

best_parameters = cv_grid.best_estimator_.get_params()

# for param_name in sorted(rf_tuned_parameters.keys()):
#     print("\t%s: %r" % (param_name, best_parameters[param_name]))

pred_test = cv_grid.predict(Xtest)
print(recall_score(ytest, pred_test))     # 0.65
print(precision_score(ytest, pred_test))  # 0.85
print(roc_auc_score(ytest, pred_test))    # 0.83
print("confusion matrix on validation data: \n" + str(confusion_matrix(ytest, pred_test)))


confusion matrix on validation data:

[[93807    18]
[   57   105]]


If we use the imbalanced data directly in the RandomForestClassifier, the result is not very good: the recall score is 0.65 and the AUC is 0.83, although this is still better than the result from Logistic Regression on the same imbalanced data. To improve model performance, we will try two methods: over-sampling and assigning more weight to the rare class.
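
As a quick check, recall and precision can be recomputed by hand from the confusion matrix reported above. The sketch assumes scikit-learn's layout (rows are true classes, columns are predicted classes); the variable names are illustrative.

```python
# Confusion matrix reported above: [[TN, FP], [FN, TP]]
tn, fp = 93807, 18
fn, tp = 57, 105

recall = tp / (tp + fn)      # share of true frauds that were caught
precision = tp / (tp + fp)   # share of fraud alerts that were real frauds

print(f"recall:    {recall:.2f}")     # 0.65
print(f"precision: {precision:.2f}")  # 0.85
```

So 57 of the 162 frauds in the held-out data are missed, which is the costly kind of error in this problem.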

# 2. Create Over-sampled Data and Fit the Model

# number of times to repeat the y == 1 rows; must be a plain int so that list repetition works
oversample_ratio = int((ytrain == 0).sum() // (ytrain == 1).sum())
# repeat the positive data for X and y
ytrain_pos_oversample = pd.concat([ytrain[ytrain==1]] * oversample_ratio, axis = 0)
Xtrain_pos_oversample = pd.concat([Xtrain.loc[ytrain==1, :]] * oversample_ratio, axis = 0)
# concat the repeated data with the original data together
ytrain_oversample = pd.concat([ytrain, ytrain_pos_oversample], axis = 0).reset_index(drop = True)
Xtrain_oversample = pd.concat([Xtrain, Xtrain_pos_oversample], axis = 0).reset_index(drop = True)

ytrain_oversample.value_counts(dropna = False, normalize = True)   # 50:50

estimator = RandomForestClassifier(random_state=0, warm_start = True)

rf_tuned_parameters = {"max_depth": [10, 20, 50, 100], 'n_estimators': [50, 100, 200, 500], 'min_samples_leaf': [10, 20, 50]}

cv_grid = GridSearchCV(estimator, param_grid = rf_tuned_parameters, scoring = 'roc_auc', verbose = 5, n_jobs = 70) # 'recall', my_score
cv_grid.fit(Xtrain_oversample, ytrain_oversample)

# print cv_grid.best_params_
# print cv_grid.cv_results_

best_parameters = cv_grid.best_estimator_.get_params()

# for param_name in sorted(rf_tuned_parameters.keys()):
#     print("\t%s: %r" % (param_name, best_parameters[param_name]))

pred_test = cv_grid.predict(Xtest)
print(recall_score(ytest, pred_test))     # 0.83
print(precision_score(ytest, pred_test))  # 0.83
print(roc_auc_score(ytest, pred_test))    # 0.92
print("\n confusion matrix on validation data: \n" + str(confusion_matrix(ytest, pred_test)))

[[93798    27]
[   27   135]]


By using over-sampling, model performance improves a lot: the recall score is now 0.83 and the AUC is 0.92. From the confusion matrix, 135 of the 162 true frauds are detected, and 27 non-frauds are mistakenly predicted as frauds. We can fine-tune by changing the decision threshold to get fewer false negatives, at the price of more false positives.
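
The threshold trade-off can be illustrated with a toy example. The labels and probabilities below are hypothetical values standing in for something like `predict_proba(X)[:, 1]`; they are not output from the model above, and `errors_at` is a helper introduced only for this sketch.

```python
# Hypothetical labels and predicted fraud probabilities (not real model output)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.30, 0.55, 0.70, 0.90]

def errors_at(threshold):
    """Return (false_negatives, false_positives) when flagging y_prob >= threshold."""
    y_pred = [int(p >= threshold) for p in y_prob]
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return fn, fp

for threshold in (0.5, 0.3):
    fn, fp = errors_at(threshold)
    print(f"threshold={threshold}: false negatives={fn}, false positives={fp}")
# threshold=0.5: false negatives=1, false positives=1
# threshold=0.3: false negatives=0, false positives=3
```

Lowering the threshold from 0.5 to 0.3 catches the remaining fraud in this toy data, but flags two more legitimate transactions.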

Next we will test using class_weight rather than over-sampling. We know that for Logistic Regression these two are essentially equivalent. Let's see what happens for RandomForest.

# 3. RandomForestClassifier with class_weight

Rather than over-sampling, we can assign more weights to the lower rate class.
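
To see why weighting and over-sampling are closely related, consider a loss that is a plain sum over samples, such as the log-loss used by logistic regression. The sketch below uses hypothetical toy values; `log_loss_sum` is a helper written for this illustration, not a scikit-learn function.

```python
import math

def log_loss_sum(y_true, y_prob, weights=None):
    """Sum of (optionally weighted) binary log-losses."""
    if weights is None:
        weights = [1.0] * len(y_true)
    total = 0.0
    for y, p, w in zip(y_true, y_prob, weights):
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total

# Toy labels and predicted fraud probabilities (hypothetical values)
y_true = [0, 0, 0, 0, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.7]
k = 4  # weight for the positive class = number of copies when over-sampling

# Option A: keep the data as is, weight the positive sample k times as much
weights = [k if y == 1 else 1 for y in y_true]
weighted = log_loss_sum(y_true, y_prob, weights)

# Option B: repeat the positive row so that it appears k times in total
y_over = y_true + [1] * (k - 1)
p_over = y_prob + [0.7] * (k - 1)
oversampled = log_loss_sum(y_over, p_over)

print(math.isclose(weighted, oversampled))  # True
```

For a random forest the two approaches need not coincide exactly. For example, `min_samples_leaf` counts rows rather than weights, and bootstrap sampling treats duplicated rows as separate samples.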

X = creditcard.iloc[:, :-1]
y = creditcard.iloc[:, -1]
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size = .33, stratify = y, random_state = seed)

positive_weight = sum(ytrain == 0) / sum(ytrain == 1)  # weight for the y == 1 class: ratio of negatives to positives

estimator = RandomForestClassifier(random_state=0, class_weight = {0 : 1, 1 : positive_weight}, warm_start = True)

rf_tuned_parameters = {"max_depth": [10, 20, 50, 100], 'n_estimators': [50, 100, 200, 500], 'min_samples_leaf': [10, 20, 50]}

cv_grid = GridSearchCV(estimator, param_grid = rf_tuned_parameters, scoring = 'roc_auc', verbose = 5, n_jobs = 70) # 'recall', my_score
cv_grid.fit(Xtrain, ytrain)

# print cv_grid.cv_results_

best_parameters = cv_grid.best_estimator_.get_params()

# for param_name in sorted(rf_tuned_parameters.keys()):
#     print("\t%s: %r" % (param_name, best_parameters[param_name]))

pred_test = cv_grid.predict(Xtest)
print(recall_score(ytest, pred_test))     #  0.85
print(precision_score(ytest, pred_test))  #  0.81
print(roc_auc_score(ytest, pred_test))    #  0.92
print("\n confusion matrix on validation data: \n" + str(confusion_matrix(ytest, pred_test)))

[[93793    32]
[   25   137]]


Compared with over-sampling, the recall score increases from 0.83 to 0.85, while the precision decreases from 0.83 to 0.81. The AUC is 0.92, which is close to the result of over-sampling.

Overall, I think this model works well: 85% of the frauds are detected, which prevents a lot of loss. At the same time, only 0.03% of the non-frauds are mistakenly predicted as frauds, which results in very little potential loss for the company. The company can also manually review these false fraud detections.
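
One detail worth noting: the AUC values in this notebook come from `roc_auc_score(ytest, pred_test)`, where `pred_test` holds hard 0/1 predictions. In that case the ROC curve has a single interior point and the area reduces to (TPR + TNR) / 2. The sketch below checks this against the confusion matrix above; variable names are illustrative.

```python
# Confusion matrix of the class_weight model above: [[TN, FP], [FN, TP]]
tn, fp = 93793, 32
fn, tp = 25, 137

tpr = tp / (tp + fn)   # recall on the fraud class
fpr = fp / (fp + tn)   # share of non-frauds flagged as fraud

# With hard 0/1 predictions the ROC curve has only three points:
# (0, 0), (fpr, tpr) and (1, 1). Integrate it with the trapezoid rule.
points = [(0.0, 0.0), (fpr, tpr), (1.0, 1.0)]
auc_from_labels = sum(
    (x1 - x0) * (y0 + y1) / 2
    for (x0, y0), (x1, y1) in zip(points, points[1:])
)

balanced_accuracy = (tpr + (1 - fpr)) / 2

print(f"TPR: {tpr:.2f}")                             # 0.85
print(f"FPR: {fpr:.4%}")                             # 0.0341%
print(f"AUC from labels: {auc_from_labels:.2f}")     # 0.92
print(f"(TPR + TNR) / 2: {balanced_accuracy:.2f}")   # 0.92
```

Passing predicted probabilities (for instance from `predict_proba`) instead of hard labels would give a threshold-independent AUC, which may differ from the figures reported here.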

# 4. Self-defined Score and GridSearchCV of hyperparameters

Since the losses from missed frauds and from falsely predicted frauds are different, we define a function that weights each kind of error by its average loss.

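def scoring(ground_truth, predictions):
    '''
    Weighted loss based on the results above about the average loss from
    false negative (122 per missed fraud) and false positive (1.76 per false alarm) predictions.
    '''
    # labels=[0, 1] keeps the matrix 2x2 even if a fold contains a single class
    cmatrix = confusion_matrix(ground_truth, predictions, labels=[0, 1])
    fp = cmatrix[0, 1]
    fn = cmatrix[1, 0]
    return fn * 122 + fp * 1.76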

wt_loss_score = make_scorer(scoring, greater_is_better = False)

# number of times to repeat the y == 1 rows; must be a plain int so that list repetition works
oversample_ratio = int((ytrain == 0).sum() // (ytrain == 1).sum())
# repeat the positive data for X and y
ytrain_pos_oversample = pd.concat([ytrain[ytrain==1]] * oversample_ratio, axis = 0)
Xtrain_pos_oversample = pd.concat([Xtrain.loc[ytrain==1, :]] * oversample_ratio, axis = 0)
# concat the repeated data with the original data together
ytrain_oversample = pd.concat([ytrain, ytrain_pos_oversample], axis = 0).reset_index(drop = True)
Xtrain_oversample = pd.concat([Xtrain, Xtrain_pos_oversample], axis = 0).reset_index(drop = True)

ytrain_oversample.value_counts(dropna = False, normalize = True)   # 50:50

estimator = RandomForestClassifier(random_state=0, warm_start = True)

rf_tuned_parameters = {"max_depth": [10, 20, 50, 100], 'n_estimators': [50, 100, 200, 500],
'min_samples_leaf': [10, 20, 50]}

cv_grid = GridSearchCV(estimator, param_grid = rf_tuned_parameters, scoring = wt_loss_score, verbose = 5, n_jobs = 70)
cv_grid.fit(Xtrain_oversample, ytrain_oversample)

# print cv_grid.best_params_

pred_test = cv_grid.predict(Xtest)
print(recall_score(ytest, pred_test))     # 0.84
print(precision_score(ytest, pred_test))  # 0.84
print(roc_auc_score(ytest, pred_test))    # 0.92
print("\n confusion matrix on validation data: \n" + str(confusion_matrix(ytest, pred_test)))


With the self-defined loss function, the confusion matrix is:

[[93800    25]
[   26   136]]


Compared to the same setup but using 'roc_auc' as the scoring function, which results in:

[[93798    27]
[   27   135]]
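
To put the two confusion matrices on the same scale, the weighting from the scoring function (122 per false negative, 1.76 per false positive) can be applied to each. This is a sketch; `weighted_loss` and the dictionary keys are names introduced for illustration.

```python
def weighted_loss(fn, fp, cost_fn=122, cost_fp=1.76):
    """Same weighting as the scoring function: fn * 122 + fp * 1.76."""
    return fn * cost_fn + fp * cost_fp

# (false negatives, false positives) read off the two confusion matrices above
results = {
    "over-sampling, custom loss": (26, 25),
    "over-sampling, roc_auc": (27, 27),
}

for name, (fn, fp) in results.items():
    print(f"{name}: {weighted_loss(fn, fp):.2f}")
# over-sampling, custom loss: 3216.00
# over-sampling, roc_auc: 3341.52
```

Under the assumed costs, the model selected with the custom loss has the lower weighted loss on the held-out data, mostly because it misses one fewer fraud.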