Credit card fraud happens every day. It even happened to me! How can we stop credit card fraud before a transaction is accepted? In this project we will use machine learning to detect credit card fraud as it happens. Machine learning uses data to teach a computer to predict outcomes.
This dataset is from Kaggle. It contains only numeric input features. Unfortunately, due to confidentiality issues, the original features and background information about the data cannot be provided. The dataset contains the features V1, V2, …, V28, along with 'Time', 'Amount', and 'Class'. 'Class' is the response variable: it takes the value 1 in case of fraud and 0 for a genuine transaction.
First let’s import the packages pandas, collections, and itertools. Pandas is used to manipulate and transform the data. Counter, from collections, holds data in an unordered collection and counts hashable objects, much like a hash table. Itertools provides various functions for building complex iterators.
import itertools
from collections import Counter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
# SHAP for explaining feature contributions
import shap
# Train and test data split
from sklearn.model_selection import train_test_split
# Sklearn's metrics to evaluate our models
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, recall_score, f1_score
# Classifiers
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
df=pd.read_csv('/content/drive/MyDrive/creditcard.csv')
df.head(5)
df[['Time', 'V1', 'V28', 'Amount', 'Class']].head()
Let's see some information about our data.
df["Amount"].describe()
count 284807.000000
mean 88.349619
std 250.120109
min 0.000000
25% 5.600000
50% 22.000000
75% 77.165000
max 25691.160000
Name: Amount, dtype: float64
As we can see, about 75% of transactions were below $77.17. Note that the mean transaction amount is $88.35 while the maximum transaction is $25,691.16. Let’s count the number of fraud and non-fraud cases, then plot the information using matplotlib.
non_fraud = len(df[df.Class == 0])
fraud = len(df[df.Class == 1])
fraud_percent = (fraud / (fraud + non_fraud)) * 100
print("Number of Genuine transactions: ", non_fraud)
print("Number of Fraud transactions: ", fraud)
print("Percentage of Fraud transactions: {:.4f}%".format(fraud_percent))
Number of Genuine transactions: 284315
Number of Fraud transactions: 492
Percentage of Fraud transactions: 0.1727%
import matplotlib.pyplot as plt
labels = ["Genuine", "Fraud"]
count_classes = df['Class'].value_counts(sort=True)
count_classes.plot(kind = "bar", rot = 0)
plt.title("Visualization of Labels")
plt.ylabel("Count")
plt.xticks(range(2), labels)
plt.show()
# Visualizing Class distribution
fig = px.pie(values = df.Class.value_counts(), names=['Genuine', 'Fraud'], title='Fraudulent and Genuine Transactions in the Dataset')
fig.show('png')
We can see the number of genuine transactions is 284,315 and the number of fraud transactions is 492. Genuine transactions make up over 99% of the data, which shows a strong imbalance between the genuine and fraudulent classes.
The main purpose of data normalization and scaling is to reduce the impact of outliers, skewness, and varying ranges of values on the performance of machine learning algorithms and data analysis methods. Let’s apply the scaling techniques on the “Amount” feature to transform the range of values. We drop the original “Amount” column and add a new column with the scaled values “NormalizedAmount”. We also drop the “Time” column as it is irrelevant for our case.
import numpy as np
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df["NormalizedAmount"] = \
scaler.fit_transform(df["Amount"].values.reshape(-1, 1))
df.drop(["Amount", "Time"], inplace= True, axis= 1)
Y = df["Class"]
X = df.drop(["Class"], axis= 1)
df.head(5)
Next, we split the credit card data into 70% training and 30% testing chunks using train_test_split().
from sklearn.model_selection import train_test_split
#split the data
(train_X, test_X, train_Y, test_Y) = train_test_split(X, Y, test_size= 0.3, random_state= 42)
print("Shape of train_X: ", train_X.shape)
print("Shape of test_X: ", test_X.shape)
Shape of train_X: (199364, 29)
Shape of test_X: (85443, 29)
Now we will apply machine learning algorithms to the credit card dataset. For this project we will use a decision tree and a random forest, then see which one performs better.
The Decision Tree algorithm is a machine learning algorithm structured like a tree, with a root node, internal decision nodes, and leaf nodes. The algorithm’s aim is to train a model that predicts the value of a target class variable by learning simple if-then-else decision rules inferred from the training data.
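To make the “if-then-else rules” idea concrete, here is a minimal sketch that prints the rules of a small tree trained on our split. This is illustrative only, not part of the main pipeline; the depth cap of 3 is an arbitrary choice to keep the printout readable.
# Illustrative only: print the if-then-else rules of a shallow tree.
# Assumes train_X and train_Y from the split above; max_depth=3 is an
# arbitrary cap chosen to keep the printed rules short.
from sklearn.tree import DecisionTreeClassifier, export_text
small_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
small_tree.fit(train_X, train_Y)
print(export_text(small_tree, feature_names=list(train_X.columns)))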
Let's build a Random Forest Model. The Random Forest Algorithm builds decision trees on different samples and takes their majority vote for classification, or the average in the case of regression. Random Forest is a supervised learning algorithm that works on the concept of bagging. In bagging, a group of models is trained on different subsets of the dataset, and the final output is generated by collating the outputs of all the different models. In the case of random forest, the base model is a decision tree. The code for this one is RandomForestClassifier().
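Before using the library class, here is a rough sketch of the bagging idea itself: train a few trees on bootstrap samples and take a majority vote. This is illustrative only (a real random forest also subsamples features at each split), and all names here are made up for the example.
# Illustrative bagging sketch, not RandomForestClassifier's internals.
# Assumes train_X, train_Y, and test_X from the split above.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
rng = np.random.RandomState(42)
trees = []
for _ in range(5):  # 5 trees keeps the sketch small; real forests use many more
    idx = rng.randint(0, len(train_X), len(train_X))  # bootstrap sample, drawn with replacement
    trees.append(DecisionTreeClassifier().fit(train_X.iloc[idx], train_Y.iloc[idx]))
# Majority vote: average the trees' 0/1 votes and round
votes = np.mean([t.predict(test_X) for t in trees], axis=0)
bagged_predictions = (votes >= 0.5).astype(int)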
# Imports for the feature-importance analysis
#!pip install shap
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
import shap
from matplotlib import pyplot as plt
plt.rcParams.update({'figure.figsize': (12.0, 8.0)})
plt.rcParams.update({'font.size': 14})
Next, we fit the random forest regressor with 100 decision trees.
# Fit a random forest regressor for the feature-importance analysis
rf = RandomForestRegressor(n_estimators=100)
rf.fit(train_X, train_Y)
Let’s talk about feature importance: it tells us which features drive the model’s predictions, which can help us understand and improve the model. The packages we need were loaded above. To get the impurity-based feature importances, we read the fitted model’s feature_importances_ attribute.
rf.feature_importances_
array([0.01633373, 0.00403563, 0.00856925, 0.01796998, 0.00592385, 0.00535211, 0.01949634, 0.00445168, 0.0069014 , 0.04106758, 0.00966285, 0.06289331, 0.01005891, 0.10879169, 0.01347047, 0.01622807, 0.50989317, 0.00412784, 0.0133063 , 0.01140531, 0.0136416 , 0.01046962, 0.00528906, 0.00842428, 0.0062364 , 0.03252556, 0.01316906, 0.00845199, 0.01185296])
This is difficult to interpret, so let’s graph the data.
column1=['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20','V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'NormalizedAmount']
plt.barh(column1, rf.feature_importances_)
plt.xlabel("Impurity-based Feature Importance")
The bar graph shows the Random Forest Regressor’s impurity-based importances. V17 stands out as by far the most important feature, while features such as V2 and V8 contribute very little. Let’s find the permutation-based feature importance. Permutation importance can be used to overcome drawbacks of the default feature importance computed from mean impurity decrease. We also sort the features from greatest to least so the graph is easier to interpret.
perm_importance = permutation_importance(rf, test_X, test_Y)
sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(np.array(column1)[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel("Permutation Importance")
After looking at the bar graph, we can see that V17 is again the most important feature, while V27 is among the least important. The permutation ranking agrees with the impurity-based ranking on the dominant feature, V17.
Lastly, SHAP can be used to compute feature importances for the Random Forest. It uses Shapley values from game theory to estimate how much each feature contributes to the prediction.
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(test_X)
# Bar plot of mean absolute SHAP values
shap.summary_plot(shap_values, test_X, plot_type="bar")
# Beeswarm plot: one dot per row, colored by feature value
shap.summary_plot(shap_values, test_X)
In the beeswarm plot, each dot represents a row in the data set. For V17, some low values contribute strongly and some high values contribute a midrange amount. V15 and V27 have their high values on the negative side, meaning these variables are not as important here.
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
#Decision Tree
decision_tree = DecisionTreeClassifier()
# Random Forest
random_forest = RandomForestClassifier(n_estimators= 100)
decision_tree.fit(train_X, train_Y)
predictions_dt = decision_tree.predict(test_X)
decision_tree_score = decision_tree.score(test_X, test_Y) * 100
random_forest.fit(train_X, train_Y)
predictions_rf = random_forest.predict(test_X)
random_forest_score = random_forest.score(test_X, test_Y) * 100
print("Random Forest Score: ", random_forest_score)
print("Decision Tree Score: ", decision_tree_score)
Random Forest Score: 99.9602073897218
Decision Tree Score: 99.92626663389628
After running both algorithms, we can see that the Random Forest scores better than the Decision Tree. Let’s create a function that prints the accuracy, precision, recall, and F1-score, then compute and visualize each model’s confusion matrix.
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, recall_score, f1_score
def metrics(actuals, predictions):
    print("Accuracy: {:.5f}".format(accuracy_score(actuals, predictions)))
    print("Precision: {:.5f}".format(precision_score(actuals, predictions)))
    print("Recall: {:.5f}".format(recall_score(actuals, predictions)))
    print("F1-score: {:.5f}".format(f1_score(actuals, predictions)))
confusion_matrix_dt = confusion_matrix(test_Y, predictions_dt.round())
print("Confusion Matrix - Decision Tree")
print(confusion_matrix_dt)
Confusion Matrix - Decision Tree
[[85264    43]
 [   24   112]]
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(test_Y, predictions_dt.round())
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
print("Evaluation of Decision Tree Model")
print()
metrics(test_Y, predictions_dt.round())
Evaluation of Decision Tree Model

Accuracy: 0.99922
Precision: 0.72258
Recall: 0.82353
F1-score: 0.76976
The accuracy shows how often the model is correct overall; in our case, the decision tree model is 99.922% accurate. Since there is imbalance in our data, accuracy is not the best indicator of how good our model is. Precision asks: of the transactions predicted as fraud, what percentage are actually fraud? Recall asks: of the actual fraud cases, what percentage did the model catch? The F1-score is the harmonic mean of precision and recall. It is a better measurement to use when we need a balance between precision and recall and there is an uneven class distribution (a large number of actual negatives).
The confusion matrix of the Decision Tree is shown above. Recall that non-fraud cases are labeled 0 and fraud cases are labeled 1. In scikit-learn’s layout, the matrix reads [[true negatives, false positives], [false negatives, true positives]]. As we can see, the number of true negatives is 85,264, the number of false positives is 43, the number of false negatives is 24, and the number of true positives is 112. The confusion matrix reflects the class imbalance: true negatives dominate.
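As a sanity check, we can recompute these scores by hand from the four counts in the matrix. Here is a minimal sketch using the decision tree’s counts (the variable names are just for illustration):
# Recompute the decision tree's metrics from its confusion-matrix counts.
# scikit-learn lays the matrix out as [[tn, fp], [fn, tp]].
tn, fp, fn, tp = 85264, 43, 24, 112
accuracy = (tp + tn) / (tp + tn + fp + fn)          # 0.99922
precision = tp / (tp + fp)                          # 0.72258
recall = tp / (tp + fn)                             # 0.82353
f1 = 2 * precision * recall / (precision + recall)  # 0.76976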
confusion_matrix_rf = confusion_matrix(test_Y, predictions_rf.round())
print("Confusion Matrix - Random Forest")
print(confusion_matrix_rf)
Confusion Matrix - Random Forest
[[85300 7]
[ 26 110]]
# Plot the random forest confusion matrix
cm = confusion_matrix(test_Y, predictions_rf.round())
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
Figure 1. Confusion matrix for the random forest model on fraudulent and non-fraudulent cases.
print("Evaluation of Random Forest Model")
print()
metrics(test_Y, predictions_rf.round())
Evaluation of Random Forest Model
Accuracy: 0.99961
Precision: 0.94017
Recall: 0.80882
F1-score: 0.86957
The confusion matrix of the Random Forest is shown above. Again, non-fraud cases are labeled 0 and fraud cases are labeled 1. As we can see, the number of true negatives is 85,300, the number of false positives is 7, the number of false negatives is 26, and the number of true positives is 110. The confusion matrix again reflects the class imbalance: true negatives dominate.
Which model performed better? The F1-score of the decision tree model is 0.76976, while the F1-score of the random forest model is 0.86957. Thus the random forest model performed better; however, both models still struggle with the minority class of fraud cases, and this is the class we are most interested in! This issue can be addressed by oversampling. Basic oversampling simply duplicates samples of the minority class; we will instead use the Synthetic Minority Oversampling Technique, or SMOTE for short, a method of data augmentation that synthesizes new minority examples. SMOTE works by selecting examples that are close in the feature space, drawing a line between them, and creating a new sample at a point along that line. Specifically, a random example from the minority class is first chosen. Then k of the nearest neighbors for that example are found (typically k=5). A randomly selected neighbor is chosen, and a synthetic example is created at a randomly selected point between the two examples in feature space.
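To make that interpolation step concrete, here is a rough NumPy sketch that creates a single synthetic fraud example. This is illustrative only, not imblearn’s implementation; it assumes the X and Y defined earlier, and the variable names are made up for the example.
# Illustrative SMOTE step: one synthetic point between a minority example
# and one of its k=5 nearest minority-class neighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors
rng = np.random.RandomState(42)
minority = X[Y == 1].values  # the 492 fraud rows
nn = NearestNeighbors(n_neighbors=6).fit(minority)  # 6, since a point is its own nearest neighbor
x = minority[rng.randint(len(minority))]
_, neighbor_idx = nn.kneighbors([x])
neighbor = minority[rng.choice(neighbor_idx[0][1:])]  # skip the point itself
synthetic = x + rng.rand() * (neighbor - x)  # random point on the line between them
Now let’s do the real thing. First we import SMOTE from the imblearn.over_sampling package; then we resample our data.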
from imblearn.over_sampling import SMOTE
X_resampled, Y_resampled = SMOTE().fit_resample(X, Y)
print("Resampled shape of X: ", X_resampled.shape)
print("Resampled shape of Y: ", Y_resampled.shape)
value_counts = Counter(Y_resampled)
print(value_counts)
(train_X, test_X, train_Y, test_Y) = train_test_split(X_resampled, Y_resampled, test_size= 0.3, random_state= 42)
Resampled shape of X: (568630, 29)
Resampled shape of Y: (568630,)
Counter({0: 284315, 1: 284315})
As we can see, the classes are now balanced: 284,315 non-fraudulent cases and 284,315 fraudulent cases. Now let’s retrain the random forest on the resampled data and look at its confusion matrix.
rf_resampled = RandomForestClassifier(n_estimators = 100)
rf_resampled.fit(train_X, train_Y)
predictions_resampled = rf_resampled.predict(test_X)
random_forest_score_resampled = rf_resampled.score(test_X, test_Y) * 100
cm_resampled = confusion_matrix(test_Y, predictions_resampled.round())
print("Confusion Matrix - Random Forest")
print(cm_resampled)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_resampled)
disp.plot()
plt.show()
Confusion Matrix - Random Forest
[[85128 21]
[ 0 85440]]
print("Evaluation of Random Forest Model")
print()
metrics(test_Y, predictions_resampled.round())
Evaluation of Random Forest Model
Accuracy: 0.99988
Precision: 0.99975
Recall: 1.00000
F1-score: 0.99988
As we can see in the confusion matrix, the number of true negatives is 85,128, the number of false positives is 21, the number of false negatives is 0, and the number of true positives is 85,440.
The oversampled Random Forest model has an F1-score of 0.99988, while the original Random Forest model had an F1-score of 0.86957. There is no class imbalance issue in our oversampled data, as shown in the confusion matrix, and the new model scores far better than the previous one. One caveat: because we oversampled before splitting, synthetic test rows are interpolated from the same minority examples the model trained on, which inflates these scores; in practice, SMOTE should be applied to the training set only.
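For completeness, here is a minimal sketch of the leakage-free variant: split first, then oversample only the training fold. It assumes the original X and Y from before resampling, and the variable names are illustrative.
# Split first, then apply SMOTE to the training fold only, so no synthetic
# points derived from minority rows are shared between train and test.
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
tr_X, te_X, tr_Y, te_Y = train_test_split(X, Y, test_size=0.3, random_state=42)
tr_X_res, tr_Y_res = SMOTE(random_state=42).fit_resample(tr_X, tr_Y)
rf_no_leak = RandomForestClassifier(n_estimators=100).fit(tr_X_res, tr_Y_res)
print("Leakage-free test accuracy:", rf_no_leak.score(te_X, te_Y))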