Credit card fraud happens every day. It even happened to me! How can we stop credit card fraud before a transaction is accepted? In this project we will use machine learning to detect credit card fraud as it happens. Machine learning uses data to teach a computer to predict outcomes.
This dataset is from Kaggle. It contains only numeric input features. Unfortunately, due to confidentiality issues, the original features and background information about the data cannot be provided. The dataset contains the features V1, V2, …, V28, along with 'Time', 'Amount', and 'Class'. 'Class' is the response variable: it takes the value 1 in case of fraud and 0 for a genuine transaction.
First let’s import the packages pandas, collections, and itertools. Pandas is used to manipulate and transform the data. Counter, from collections, holds data in an unordered collection and counts hashable objects, much like a hash table. Itertools provides various functions for building complex iterators.
import itertools
from collections import Counter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
# SHAP for explaining feature contributions
import shap
# Train and test data split
from sklearn.model_selection import train_test_split
# Sklearn's metrics to evaluate our models
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, recall_score, f1_score
# Classifiers
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
df=pd.read_csv('/content/drive/MyDrive/creditcard.csv')
df.head(5)
df[['Time', 'V1', 'V28', 'Amount', 'Class']].head()
Let's see some information about our data.
df["Amount"].describe()
count 284807.000000
mean 88.349619
std 250.120109
min 0.000000
25% 5.600000
50% 22.000000
75% 77.165000
max 25691.160000
Name: Amount, dtype: float64
As we can see, about 75% of transactions were below $77.17. Note that the mean transaction amount is $88.35 while the maximum transaction is $25,691.16. Let’s count the number of fraud and non-fraud cases, then plot the information using matplotlib.
non_fraud = len(df[df.Class == 0])
fraud = len(df[df.Class == 1])
fraud_percent = (fraud / (fraud + non_fraud)) * 100
print("Number of Genuine transactions: ", non_fraud)
print("Number of Fraud transactions: ", fraud)
print("Percentage of Fraud transactions: {:.4f}%".format(fraud_percent))
Number of Genuine transactions: 284315
Number of Fraud transactions: 492
Percentage of Fraud transactions: 0.1727%
import matplotlib.pyplot as plt
labels = ["Genuine", "Fraud"]
count_classes = df['Class'].value_counts(sort=True)
count_classes.plot(kind = "bar", rot = 0)
plt.title("Visualization of Labels")
plt.ylabel("Count")
plt.xticks(range(2), labels)
plt.show()
# Visualizing Class distribution
fig = px.pie(values = df.Class.value_counts(), names=['Genuine', 'Fraud'], title='Fraudulent and Genuine Transactions in the Dataset')
fig.show('png')
We can see the number of genuine transactions is 284,315 and the number of fraud transactions is 492. Genuine transactions make up over 99% of the data, which shows a strong imbalance between the genuine and fraudulent classes.
The main purpose of data normalization and scaling is to reduce the impact of outliers, skewness, and varying ranges of values on the performance of machine learning algorithms and data analysis methods. Let’s apply the scaling techniques on the “Amount” feature to transform the range of values. We drop the original “Amount” column and add a new column with the scaled values “NormalizedAmount”. We also drop the “Time” column as it is irrelevant for our case.
import numpy as np
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df["NormalizedAmount"] = \
scaler.fit_transform(df["Amount"].values.reshape(-1, 1))
df.drop(["Amount", "Time"], inplace= True, axis= 1)
Y = df["Class"]
X = df.drop(["Class"], axis= 1)
df.head(5)
Next, we split the credit card data into 70% training and 30% testing chunks using train_test_split().
from sklearn.model_selection import train_test_split
#split the data
(train_X, test_X, train_Y, test_Y) = train_test_split(X, Y, test_size= 0.3, random_state= 42)
print("Shape of train_X: ", train_X.shape)
print("Shape of test_X: ", test_X.shape)
Shape of train_X: (199364, 29)
Shape of test_X: (85443, 29)
Now we will apply machine learning algorithms to the credit card dataset. For this project we will use a decision tree and a random forest, then see which one performs better.
The Decision Tree algorithm is a machine learning algorithm structured like a tree, with a root node, internal decision nodes, and leaf nodes. The algorithm’s aim is to train a model that predicts the value of a target class variable by learning simple if-then-else decision rules inferred from the training data.
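To make the “if-then-else rules” idea concrete, here is a minimal sketch that prints the rules of a small tree trained on our split. This is illustrative only, not part of the main pipeline; the depth cap of 3 is an arbitrary choice to keep the printout readable.
# Illustrative only: print the if-then-else rules of a shallow tree.
# Assumes train_X and train_Y from the split above; max_depth=3 is an
# arbitrary cap chosen to keep the printed rules short.
from sklearn.tree import DecisionTreeClassifier, export_text
small_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
small_tree.fit(train_X, train_Y)
print(export_text(small_tree, feature_names=list(train_X.columns)))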
Let's build a Random Forest Model. The Random Forest Algorithm builds decision trees on different samples and takes their majority vote for classification, or the average in the case of regression. Random Forest is a supervised learning algorithm that works on the concept of bagging. In bagging, a group of models is trained on different subsets of the dataset, and the final output is generated by collating the outputs of all the different models. In the case of random forest, the base model is a decision tree. The code for this one is RandomForestClassifier().
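Before using the library class, here is a rough sketch of the bagging idea itself: train a few trees on bootstrap samples and take a majority vote. This is illustrative only (a real random forest also subsamples features at each split), and all names here are made up for the example.
# Illustrative bagging sketch, not RandomForestClassifier's internals.
# Assumes train_X, train_Y, and test_X from the split above.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
rng = np.random.RandomState(42)
trees = []
for _ in range(5):  # 5 trees keeps the sketch small; real forests use many more
    idx = rng.randint(0, len(train_X), len(train_X))  # bootstrap sample, drawn with replacement
    trees.append(DecisionTreeClassifier().fit(train_X.iloc[idx], train_Y.iloc[idx]))
# Majority vote: average the trees' 0/1 votes and round
votes = np.mean([t.predict(test_X) for t in trees], axis=0)
bagged_predictions = (votes >= 0.5).astype(int)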
# Imports for the feature-importance analysis
#!pip install shap
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
import shap
from matplotlib import pyplot as plt
plt.rcParams.update({'figure.figsize': (12.0, 8.0)})
plt.rcParams.update({'font.size': 14})
Next, we fit the random forest regressor with 100 decision trees.
# Fit a random forest regressor for the feature-importance analysis
rf = RandomForestRegressor(n_estimators=100)
rf.fit(train_X, train_Y)
Let’s talk about feature importance: it tells us which features drive the model’s predictions, which can help us understand and improve the model. The packages we need were loaded above. To get the impurity-based feature importances, we read the fitted model’s feature_importances_ attribute.
rf.feature_importances_
array([0.01633373, 0.00403563, 0.00856925, 0.01796998, 0.00592385, 0.00535211, 0.01949634, 0.00445168, 0.0069014 , 0.04106758, 0.00966285, 0.06289331, 0.01005891, 0.10879169, 0.01347047, 0.01622807, 0.50989317, 0.00412784, 0.0133063 , 0.01140531, 0.0136416 , 0.01046962, 0.00528906, 0.00842428, 0.0062364 , 0.03252556, 0.01316906, 0.00845199, 0.01185296])
This is difficult to interpret, so let’s graph the data.
column1=['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20','V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'NormalizedAmount']
plt.barh(column1, rf.feature_importances_)
plt.xlabel("Impurity-based Feature Importance")
The bar graph shows the Random Forest Regressor’s impurity-based importances. V17 stands out as by far the most important feature, while features such as V2 and V8 contribute very little. Let’s find the permutation-based feature importance. Permutation importance can be used to overcome drawbacks of the default feature importance computed from mean impurity decrease. We also sort the features from greatest to least so the graph is easier to interpret.
perm_importance = permutation_importance(rf, test_X, test_Y)
sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(np.array(column1)[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel("Permutation Importance")
After looking at the bar graph, we can see that V17 is again the most important feature, while V27 is among the least important. The permutation ranking agrees with the impurity-based ranking on the dominant feature, V17.
Lastly, SHAP can be used to compute feature importances for the Random Forest. It uses Shapley values from game theory to estimate how much each feature contributes to the prediction.
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(test_X)
# Bar plot of mean absolute SHAP values
shap.summary_plot(shap_values, test_X, plot_type="bar")
# Beeswarm plot: one dot per row, colored by feature value
shap.summary_plot(shap_values, test_X)
In the beeswarm plot, each dot represents a row in the data set. For V17, some low values contribute strongly and some high values contribute a midrange amount. V15 and V27 have their high values on the negative side, meaning these variables are not as important here.
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
#Decision Tree
decision_tree = DecisionTreeClassifier()
# Random Forest
random_forest = RandomForestClassifier(n_estimators= 100)
decision_tree.fit(train_X, train_Y)
predictions_dt = decision_tree.predict(test_X)
decision_tree_score = decision_tree.score(test_X, test_Y) * 100
random_forest.fit(train_X, train_Y)
predictions_rf = random_forest.predict(test_X)
random_forest_score = random_forest.score(test_X, test_Y) * 100
print("Random Forest Score: ", random_forest_score)
print("Decision Tree Score: ", decision_tree_score)
Random Forest Score: 99.9602073897218
Decision Tree Score: 99.92626663389628
After running both algorithms, we can see that the Random Forest scores better than the Decision Tree. Let’s create a function that prints the accuracy, precision, recall, and F1-score, then compute and visualize each model’s confusion matrix.
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, recall_score, f1_score
def metrics(actuals, predictions):
    print("Accuracy: {:.5f}".format(accuracy_score(actuals, predictions)))
    print("Precision: {:.5f}".format(precision_score(actuals, predictions)))
    print("Recall: {:.5f}".format(recall_score(actuals, predictions)))
    print("F1-score: {:.5f}".format(f1_score(actuals, predictions)))
confusion_matrix_dt = confusion_matrix(test_Y, predictions_dt.round())
print("Confusion Matrix - Decision Tree")
print(confusion_matrix_dt)
Confusion Matrix - Decision Tree
[[85264    43]
 [   24   112]]
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(test_Y, predictions_dt.round())
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
print("Evaluation of Decision Tree Model")
print()
metrics(test_Y, predictions_dt.round())
Evaluation of Decision Tree Model

Accuracy: 0.99922
Precision: 0.72258
Recall: 0.82353
F1-score: 0.76976
The accuracy shows how often the model is correct overall; in our case, the decision tree model is 99.922% accurate. Since there is imbalance in our data, accuracy is not the best indicator of how good our model is. Precision asks: of the transactions predicted as fraud, what percentage are actually fraud? Recall asks: of the actual fraud cases, what percentage did the model catch? The F1-score is the harmonic mean of precision and recall. It is a better measurement to use when we need a balance between precision and recall and there is an uneven class distribution (a large number of actual negatives).
The confusion matrix of the Decision Tree is shown above. Recall that non-fraud cases are labeled 0 and fraud cases are labeled 1. In scikit-learn’s layout, the matrix reads [[true negatives, false positives], [false negatives, true positives]]. As we can see, the number of true negatives is 85,264, the number of false positives is 43, the number of false negatives is 24, and the number of true positives is 112. The confusion matrix reflects the class imbalance: true negatives dominate.
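As a sanity check, we can recompute these scores by hand from the four counts in the matrix. Here is a minimal sketch using the decision tree’s counts (the variable names are just for illustration):
# Recompute the decision tree's metrics from its confusion-matrix counts.
# scikit-learn lays the matrix out as [[tn, fp], [fn, tp]].
tn, fp, fn, tp = 85264, 43, 24, 112
accuracy = (tp + tn) / (tp + tn + fp + fn)          # 0.99922
precision = tp / (tp + fp)                          # 0.72258
recall = tp / (tp + fn)                             # 0.82353
f1 = 2 * precision * recall / (precision + recall)  # 0.76976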
confusion_matrix_rf = confusion_matrix(test_Y, predictions_rf.round())
print("Confusion Matrix - Random Forest")
print(confusion_matrix_rf)
Confusion Matrix - Random Forest
[[85300 7]
[ 26 110]]
# Plot the random forest confusion matrix
cm = confusion_matrix(test_Y, predictions_rf.round())
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
Figure 1. Confusion matrix for the random forest model on fraudulent and non-fraudulent cases.
print("Evaluation of Random Forest Model")
print()
metrics(test_Y, predictions_rf.round())
Evaluation of Random Forest Model
Accuracy: 0.99961
Precision: 0.94017
Recall: 0.80882
F1-score: 0.86957
The confusion matrix of the Random Forest is shown above. Again, non-fraud cases are labeled 0 and fraud cases are labeled 1. As we can see, the number of true negatives is 85,300, the number of false positives is 7, the number of false negatives is 26, and the number of true positives is 110. The confusion matrix again reflects the class imbalance: true negatives dominate.
Which model performed better? The F1-score of the decision tree model is 0.76976, while the F1-score of the random forest model is 0.86957. Thus the random forest model performed better; however, both models still struggle with the minority class of fraud cases, and this is the class we are most interested in! This issue can be addressed by oversampling. Basic oversampling simply duplicates samples of the minority class; we will instead use the Synthetic Minority Oversampling Technique, or SMOTE for short, a method of data augmentation that synthesizes new minority examples. SMOTE works by selecting examples that are close in the feature space, drawing a line between them, and creating a new sample at a point along that line. Specifically, a random example from the minority class is first chosen. Then k of the nearest neighbors for that example are found (typically k=5). A randomly selected neighbor is chosen, and a synthetic example is created at a randomly selected point between the two examples in feature space.
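To make that interpolation step concrete, here is a rough NumPy sketch that creates a single synthetic fraud example. This is illustrative only, not imblearn’s implementation; it assumes the X and Y defined earlier, and the variable names are made up for the example.
# Illustrative SMOTE step: one synthetic point between a minority example
# and one of its k=5 nearest minority-class neighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors
rng = np.random.RandomState(42)
minority = X[Y == 1].values  # the 492 fraud rows
nn = NearestNeighbors(n_neighbors=6).fit(minority)  # 6, since a point is its own nearest neighbor
x = minority[rng.randint(len(minority))]
_, neighbor_idx = nn.kneighbors([x])
neighbor = minority[rng.choice(neighbor_idx[0][1:])]  # skip the point itself
synthetic = x + rng.rand() * (neighbor - x)  # random point on the line between them
Now let’s do the real thing. First we import SMOTE from the imblearn.over_sampling package; then we resample our data.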
from imblearn.over_sampling import SMOTE
X_resampled, Y_resampled = SMOTE().fit_resample(X, Y)
print("Resampled shape of X: ", X_resampled.shape)
print("Resampled shape of Y: ", Y_resampled.shape)
value_counts = Counter(Y_resampled)
print(value_counts)
(train_X, test_X, train_Y, test_Y) = train_test_split(X_resampled, Y_resampled, test_size= 0.3, random_state= 42)
Resampled shape of X: (568630, 29)
Resampled shape of Y: (568630,)
Counter({0: 284315, 1: 284315})
As we can see, the classes are now balanced: 284,315 non-fraudulent cases and 284,315 fraudulent cases. Now let’s retrain the random forest on the resampled data and look at its confusion matrix.
rf_resampled = RandomForestClassifier(n_estimators = 100)
rf_resampled.fit(train_X, train_Y)
predictions_resampled = rf_resampled.predict(test_X)
random_forest_score_resampled = rf_resampled.score(test_X, test_Y) * 100
cm_resampled = confusion_matrix(test_Y, predictions_resampled.round())
print("Confusion Matrix - Random Forest")
print(cm_resampled)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_resampled)
disp.plot()
plt.show()
Confusion Matrix - Random Forest
[[85128 21]
[ 0 85440]]
print("Evaluation of Random Forest Model")
print()
metrics(test_Y, predictions_resampled.round())
Evaluation of Random Forest Model
Accuracy: 0.99988
Precision: 0.99975
Recall: 1.00000
F1-score: 0.99988
As we can see in the confusion matrix, the number of true negatives is 85,128, the number of false positives is 21, the number of false negatives is 0, and the number of true positives is 85,440.
The oversampled Random Forest model has an F1-score of 0.99988, while the original Random Forest model had an F1-score of 0.86957. There is no class imbalance issue in our oversampled data, as shown in the confusion matrix, and the new model scores far better than the previous one. One caveat: because we oversampled before splitting, synthetic test rows are interpolated from the same minority examples the model trained on, which inflates these scores; in practice, SMOTE should be applied to the training set only.
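For completeness, here is a minimal sketch of the leakage-free variant: split first, then oversample only the training fold. It assumes the original X and Y from before resampling, and the variable names are illustrative.
# Split first, then apply SMOTE to the training fold only, so no synthetic
# points derived from minority rows are shared between train and test.
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
tr_X, te_X, tr_Y, te_Y = train_test_split(X, Y, test_size=0.3, random_state=42)
tr_X_res, tr_Y_res = SMOTE(random_state=42).fit_resample(tr_X, tr_Y)
rf_no_leak = RandomForestClassifier(n_estimators=100).fit(tr_X_res, tr_Y_res)
print("Leakage-free test accuracy:", rf_no_leak.score(te_X, te_Y))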