Big Data Architecture In Banking

Hana Izdihar Nafisyahrin
7 min read · Mar 13, 2021


Photo by Eduardo Soares on Unsplash

Disclaimer: This article was written as a practice case on IYKRA Data Fellowship Program Batch 5.

The evolution of technology has changed cultures and practices in every aspect of our lives, including banking. Advanced technology makes it possible for banks to track every ‘move’ made by their customers in ways that were previously unthinkable. Apart from this, banks can also harness the information contained in various external data sources to make their tailored approaches more effective and to minimize risk.

This article focuses on creating an appropriate big data architecture to address fraud problems by utilizing historical and real-time data.

Fraud Detection

Similar to, but distinct from, anomaly and outlier detection, fraud detection simply means finding an entity that does not belong to the group. Commonly, a bank uses a customer’s history, preferences, and profile data to assess whether a new transaction request fits the customer’s profile and past behavior. The bank wants to detect such transactions so that the account’s balance is not drained by an illegal transaction made by, for example, a card thief. The follow-up action would be to verify the validity of the transaction with the relevant customer by sending an email or SMS.

Feature selection and feature engineering are among the most critical steps in building a predictive machine learning model. This process shapes and limits our framework so that we can focus on the important features predefined from business domain knowledge. The features chosen depend mainly on the problem at hand. In the case of fraud detection, here are some important features to explore (a small feature-engineering sketch follows the list):

  1. Preferences
  2. Total amount of money spent on a daily basis. To capture this variable in more detail, categorize the data by favorite brands and the services or products being purchased.
  3. Number of transactions made
  4. Location
  5. Date
  6. Hour
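
As a rough illustration of how such features could be derived, here is a minimal pandas sketch; the transactions table, its column names, and the values are hypothetical, not taken from any real banking dataset.

```python
import pandas as pd

# Hypothetical raw transaction log: one row per transaction.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "amount": [150_000, 2_000_000, 75_000],
    "merchant_category": ["grocery", "electronics", "grocery"],
    "location": ["Jakarta", "Singapore", "Bandung"],
    "timestamp": pd.to_datetime([
        "2021-03-01 09:15", "2021-03-01 23:50", "2021-03-02 13:05",
    ]),
})

# Date and hour features.
transactions["date"] = transactions["timestamp"].dt.date
transactions["hour"] = transactions["timestamp"].dt.hour

# Daily aggregates per customer: total spend and number of transactions.
daily = (
    transactions
    .groupby(["customer_id", "date"])
    .agg(total_spend=("amount", "sum"), n_transactions=("amount", "count"))
    .reset_index()
)
print(daily)
```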

Logically, a fraudster will try to spend as much as possible within a short time span. To predict the probability of a transaction being fraudulent, there are several methods to choose from, ranging from simple statistical methods to complex machine learning models. I will not go deep into each of them; instead, I will only scratch the surface and summarize a few.

To find anomalies, one could take advantage of the z-score, which measures how far a data point is from the mean of a group in standard deviation units. For this assessment to be valid, however, the distribution has to be approximately normal. Usually an instance is regarded as an anomaly if its probability of occurring is below 0.05. This method is not robust to outliers and will give a wrong conclusion when a customer simply wants to spend more than he usually does. Another approach is to use a simple boxplot, with some modifications to make it more robust to outliers and unexpected events. For example, you can widen the range of non-outliers by imposing an interdecile or interpercentile range adjusted to the characteristics of the data. The drawback of these approaches is that they cannot handle several categories at once, so we need to compare each variable in its specific context.
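
As a rough illustration of both ideas, here is a minimal sketch on a hypothetical array of one customer’s daily spending; the 1.96 cutoff corresponds to a two-sided probability of 0.05 under a normal distribution, and the 10th/90th percentile fence is just one possible “interpercentile” widening.

```python
import numpy as np

# Hypothetical daily spending history for one customer (the last value is suspicious).
spend = np.array([120_000, 90_000, 150_000, 110_000, 95_000,
                  130_000, 105_000, 2_500_000])

# --- z-score approach (assumes roughly normal spending) ---
z = (spend - spend.mean()) / spend.std()
z_outliers = spend[np.abs(z) > 1.96]   # two-sided tail probability below 0.05

# --- boxplot-style approach with a widened "interpercentile" fence ---
# Replace the usual quartiles with the 10th/90th percentiles but keep the 1.5x fence.
p10, p90 = np.percentile(spend, [10, 90])
fence = 1.5 * (p90 - p10)
box_outliers = spend[(spend < p10 - fence) | (spend > p90 + fence)]

print("z-score outliers:", z_outliers)
print("interpercentile outliers:", box_outliers)
```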

Machine Learning Modelling

One method that can be used for anomaly detection in fraud problems is the one-class SVM. As is generally known, SVMs make use of a hyperplane to separate data into groups based on margin optimization. In the case of a one-class SVM, the machine looks for and learns the characteristics of the ‘normal’ class. Data points that do not fit this class are automatically regarded as anomalies or outliers.
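
A minimal sketch using scikit-learn’s OneClassSVM, assuming hypothetical [amount, hour] features and a training set containing only normal transactions:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Hypothetical features per transaction: [amount in thousands, hour of day].
normal_history = np.array([[120, 10], [95, 14], [150, 19], [110, 9],
                           [130, 20], [105, 12], [90, 15], [140, 18]])
new_transactions = np.array([[125, 13], [2500, 3]])   # the second one looks odd

# Fit on 'normal' behaviour only; nu bounds the fraction of training
# points allowed to fall outside the learned region.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
ocsvm.fit(normal_history)

# +1 = consistent with the learned normal profile, -1 = anomaly.
print(ocsvm.predict(new_transactions))
```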

Another method to consider is DBSCAN, a clustering algorithm that characterizes anomalies as data lying in less dense regions. To apply this method, we need to tune two essential parameters: eps and minPoints. Eps is the maximum distance between two points for them to be regarded as members of the same cluster, while minPoints is the minimum number of points required to form a dense region/cluster.
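
A minimal sketch with scikit-learn’s DBSCAN (where the two parameters are called eps and min_samples), again on hypothetical [amount, hour] features:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical [amount in thousands, hour of day] features for a batch of transactions.
X = np.array([[120, 10], [125, 11], [118, 12], [130, 10],
              [122, 11], [2500, 3], [119, 13]])

# eps: neighbourhood radius; min_samples: points needed to form a dense core.
db = DBSCAN(eps=20, min_samples=3).fit(X)

# Points labelled -1 belong to no dense region and are treated as anomalies.
print(db.labels_)
print("anomalies:", X[db.labels_ == -1])
```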

The last method has been called the best anomaly detection algorithm for big data: the isolation forest, abbreviated as iForest. Unlike the usual anomaly detection algorithms, this approach directly detects and labels outliers using decision trees with random feature selection and random split values. Outliers stay closer to the root node because these data points are less frequent and need fewer partitions to isolate.
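
A minimal scikit-learn sketch; the contamination value (the expected fraction of anomalies) is an assumption that would normally come from domain knowledge:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical [amount in thousands, hour of day] features for a batch of transactions.
X = np.array([[120, 10], [125, 11], [118, 12], [130, 10],
              [122, 11], [2500, 3], [119, 13]])

iforest = IsolationForest(n_estimators=100, contamination=0.15, random_state=42)
labels = iforest.fit_predict(X)   # -1 = anomaly, +1 = normal

# Lower scores mean shorter average isolation paths, i.e. more anomalous points.
print(labels)
print(iforest.score_samples(X))
```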

Data Architecture

The essential inputs for building a big data architecture or pipeline are the objectives and the characteristics of the data sources. Different data will need different API/crawling/scraping methods and ingestion tools. First, let's dive into the general picture of the data pipeline as illustrated in the figure.

Data are collected from various sources, including the website, the mobile banking application, ATM machines, and the bank's database of user profiles. These various data types are then ingested by a tool of choice before being processed further in batch or stream mode. Basically, the model that has been trained in batch mode will be used to analyze the data on the streaming side. Next, these results will be used to build an interactive real-time dashboard as well as recommendations and notifications.

The actual tools are shown in the diagram above. To predict whether a request is fraudulent, we need to accumulate transaction data as well as operational data and multi-channel contact data (to notify the customer of the transaction).

To ingest and integrate the data, we can use Sqoop, a tool popular for importing data from an RDBMS into Hadoop. It can also be used to export data from Hadoop to a relational database. To accommodate non-relational sources, Gobblin and Kafka Streams are the choices for batch and real-time ingestion, respectively. Gobblin is an ideal and flexible framework for integrating data from various sources such as databases, REST APIs, servers, FTP/SFTP, etc., whereas Kafka Streams is powerful for stream ingestion. Kafka Streams is elastic, highly scalable, and fault-tolerant, offering processing latency at the millisecond level. Kafka is also known as the first stream-processing library in the world to provide an "exactly once" capability, meaning the ability to execute a read-process-write cycle exactly one time, neither missing any input messages nor producing duplicate output messages [1].
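
Kafka Streams itself is a JVM library, but to give a feel for the streaming-ingestion side, here is a minimal consumer sketch using the kafka-python client; the broker address, the "transactions" topic, and the field names are assumptions, not part of the original architecture.

```python
import json
from kafka import KafkaConsumer

# Subscribe to a hypothetical 'transactions' topic fed by the banking channels.
consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",   # assumed broker address
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    auto_offset_reset="earliest",
    group_id="fraud-detection",
)

for message in consumer:
    tx = message.value
    # Each record would then be forwarded to the scoring step (e.g. Spark).
    print(tx["customer_id"], tx["amount"], tx["location"])
```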

The data are then transferred to Kafka for real-time analytics, or to HDFS on Hadoop, before being processed in Spark. Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. The superiority of Hadoop lies in its ability to do big data processing at low cost, with fault tolerance and high throughput, by making use of distributed commodity hardware to store and replicate the data. It is also capable of processing diverse types of data.

For real-time processing, Spark is a much better fit. This is because Spark benefits from its Resilient Distributed Datasets (RDDs). Put simply, Spark uses in-memory computation, which makes it possible for different jobs to share in-memory state objects. This minimizes the time needed to process the data and is 10 to 100 times faster than sharing data through disk and network. For batch processing, data preprocessing (wrangling, cleaning, partitioning) will be done using Hadoop, while Spark does the same for real-time predictive modelling. Spark and Hadoop come with their own libraries for SQL queries, machine learning, visualization, and DataFrames, making them all-in-one frameworks. The resulting analysis data and other data of interest are then stored for a predefined retention period in the cloud or in HDFS for future use.
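
To make the split concrete (model trained in batch, applied to the stream), here is a minimal PySpark Structured Streaming sketch. The topic name, event schema, broker address, and model path are hypothetical, and the model is assumed to be a Spark ML PipelineModel trained offline on historical data.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("fraud-scoring").getOrCreate()

# Schema of the hypothetical transaction events coming from Kafka.
schema = StructType([
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("hour", IntegerType()),
    StructField("location", StringType()),
])

# Model trained in batch mode (e.g. on HDFS data) and saved beforehand.
model = PipelineModel.load("hdfs:///models/fraud_pipeline")   # assumed path

# Read the transaction stream from Kafka and parse the JSON payload.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")   # assumed broker
    .option("subscribe", "transactions")                    # assumed topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("tx"))
    .select("tx.*")
)

# Score each micro-batch with the batch-trained model and write out the alerts.
scored = model.transform(events)
query = (
    scored.filter(col("prediction") == 1.0)   # 1.0 = predicted fraudulent
    .writeStream.format("console")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```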

Once the model has been trained, it is deployed to Spark to predict whether a transaction is fraudulent. The bank can then send notifications or warnings through messages and emails to the relevant customers. For the bank's own insight, the data can also be transferred to Tableau to create an interactive dashboard showing the predicted number of fraudulent transactions and their location distribution.

In the end, the technology environment is dynamic and constantly changing, so it is best not to cling to any particular framework or tool. The important thing to consider here is the confidentiality of bank data, so make sure to use frameworks with guaranteed security.

References

Sathyapriya, M., Thiagarasu, V. 2015. Big Data Analytics Techniques for Credit Card Fraud Detection: A Review. IJSR 6, 5

Rakhman, R.A., Widiastuti, R.Y., Legowo, N., Kaburuan, E.R. 2019. Big Data Analytics Implementation in Banking Industry — Case Study Cross Selling Activity in Indonesia’s Commercial Bank. IJSTR 8, 9

Gour, R. 2019. Big Data Architecture — The Art of Handling Big Data. Towards Data Science

Young, A. 2019. Isolation Forest is the best Anomaly Detection Algorithm for Big Data Right Now. Towards Data Science

Kovachev, D. 2019. A Beginner’s Guide to Apache Spark. Towards Data Science

Hashemi, A. 2021. Anomaly & Fraud detection. Towards Data Science
