Fraud Detection, Coming soon!!!

I am working on a synthetic dataset for fraud detection in banking transactions. This is another Kaggle project, with over 6 million rows, each row being a bank transaction. One of the challenges is that the dataset is highly imbalanced, with a very small share of fraudulent transactions (less than 1%). Let’s start working.

Brief description of Fraud Detection.

The dataset contains 6,362,620 transactions generated with a simulator called PaySim. “The simulator uses aggregated data from the private dataset to generate a synthetic dataset that resembles the normal operation of transactions and injects malicious behaviour to later evaluate the performance of fraud detection methods”, as explained on Kaggle.
The baseline is 99.87%, which is the share of non-fraudulent mobile money transactions.
The GitHub link to a more detailed and technical explanation is not yet available.
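
As a rough illustration of where a baseline like this comes from, the sketch below computes the share of the majority class. It runs on a tiny made-up DataFrame rather than the real file, it assumes the label column is called `isFraud` (as in the Kaggle dataset), and the helper name `baseline_accuracy` is hypothetical.

```python
import pandas as pd


def baseline_accuracy(labels: pd.Series) -> float:
    """Share of the most frequent class, i.e. the accuracy of always predicting it."""
    return float(labels.value_counts(normalize=True).max())


# Toy stand-in for the real data (assumption: 1 = fraud, 0 = not fraud)
toy = pd.DataFrame({'isFraud': [0] * 999 + [1]})

print(f"Baseline: {baseline_accuracy(toy['isFraud']):.2%}")
```

A model that always answers "not fraud" would already reach this figure, so accuracy alone says little on data this imbalanced.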

Non-technical description of the Fraud Detection study.

Gathering the data.

The file happens to be pretty clean, with no nulls, but it has some inconsistencies, as I will point out.
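
A check for missing values could look something like the sketch below. It uses a small made-up DataFrame that borrows a few of the dataset's column names, and the helper name `null_report` is hypothetical; on the real data the same call would be applied to the loaded DataFrame.

```python
import pandas as pd


def null_report(frame: pd.DataFrame) -> pd.Series:
    """Number of missing values in each column."""
    return frame.isnull().sum()


# Toy stand-in with a few columns named like those in the dataset (values are made up)
toy = pd.DataFrame({
    'type': ['TRANSFER', 'CASH_OUT', 'PAYMENT'],
    'amount': [181.0, 181.0, 9839.64],
    'isFraud': [1, 1, 0],
})

report = null_report(toy)
print(report)
print("Any nulls:", bool(report.any()))
```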

First one:

The field ‘isFlaggedFraud’ is supposed to be the transactions

python

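import pandas as pd

# PaySim transaction log downloaded from Kaggle
file = 'PS_20174392719_1491204439457_log.csv'
df = pd.read_csv(file)

# Work on a copy so the raw DataFrame stays untouched
data = df.copy()

# Make the column name consistent with 'newbalanceOrig'
data = data.rename(columns={'oldbalanceOrg': 'oldbalanceOrig'})

# (rows, columns)
data.shape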

Written on October 1, 2017