CN116821688A

CN116821688A - Method for processing data set in credit card fraud transaction based on clustering downsampling technology

Info

Publication number: CN116821688A
Application number: CN202310892750.3A
Authority: CN
Inventors: 黄华杰; 刘波; 薛潇雨; 曹玖新; 陈欣怡
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2023-07-20
Filing date: 2023-07-20
Publication date: 2023-09-29

Abstract

The invention discloses a method for processing a data set in credit card fraud transaction detection based on a clustering downsampling technique, and relates to the application of algorithms in the field of artificial intelligence. Credit card transactions are divided into normal transactions and fraudulent transactions according to their nature, and all normal transactions are clustered with the K-Means clustering algorithm. For each cluster of normal transactions obtained, a new node is created to represent the cluster; the new node captures the characteristics shared by the normal transactions in that cluster. The new nodes of all clusters constitute a new normal transaction set. Finally, the new normal transaction set and the fraudulent transaction set are combined into a balanced training data set. The invention solves the technical problem in the prior art that, when unbalanced data sets in credit card fraud detection are processed, the characteristics of specific groups cannot be extracted effectively, which results in low operating efficiency and low accuracy.

Description

Method for processing data set in credit card fraud transaction based on clustering downsampling technology

Technical Field

The invention relates to the field of artificial intelligence, in particular to a method for processing a data set in credit card fraud transaction based on a clustering downsampling technology, aiming at the problem of unbalanced data sets commonly existing in credit card fraud detection.

Background

With the progress of global economic integration, credit cards play an increasingly important role in everyday life. Thanks to the convenience of credit cards, people can easily carry out economic activities such as utility payments, online shopping and bill payments. The rapid development of credit card services has also led to a proliferation of fraudulent transactions. According to EU statistics, the value of card transactions in the Single Euro Payments Area in 2018 reached 4.84 trillion euros, while fraudulent card transactions reached 1.8 billion euros; that is, fraud accounted for 0.037% of the total transaction value, or about 3.7 euro cents lost to fraud for every 100 euros spent. In view of this, research on credit card fraud detection is particularly important to both the financial industry and academia. The volume of credit card transaction data is unprecedentedly large, manual screening can no longer meet practical needs, and machine learning algorithms are increasingly being applied to credit card fraud detection.

An unbalanced data set problem arises when, in a binary or multi-class classification task, one class overwhelmingly dominates the others in the amount of data. In credit card fraud detection, transactions are divided into two classes, normal and fraudulent, depending on whether the transaction is fraudulent. Because fraudulent transactions account for only a very small proportion of all credit card transactions, a machine learning algorithm usually learns only the characteristics of normal transactions and by default also judges fraudulent transactions to be normal, which undermines its practical value. Taking a real credit card data set as experimental data, fig. 1 shows the prediction accuracy of a machine learning algorithm on the original data set, where the ordinate shows the number of iterations of the algorithm and the abscissa shows the prediction accuracy. As can be seen, the algorithm treats fraudulent transactions as normal ones: although the prediction accuracy is as high as 99.89%, the fraudulent transactions are misclassified, which can cause significant economic losses to financial enterprises and individuals.
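The effect described above can be reproduced with a few lines of Python. This is a minimal sketch with hypothetical counts, chosen only so that normal transactions make up 99.89% of the data; it is not the patent's data set. A classifier that always answers "normal" reaches that accuracy while detecting no fraud at all.

```python
# Hypothetical counts, chosen only so that normal transactions make up 99.89%.
n_normal, n_fraud = 99_890, 110

labels = [0] * n_normal + [1] * n_fraud   # 1 = fraudulent, 0 = normal
predictions = [0] * len(labels)           # a "model" that always answers normal

# Share of all transactions classified correctly.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
# Share of fraudulent transactions that were actually flagged.
fraud_recall = sum(p == 1 and y == 1 for p, y in zip(predictions, labels)) / n_fraud

print(f"accuracy = {accuracy:.2%}, fraud recall = {fraud_recall:.0%}")
```

High accuracy alone therefore says little on such data, which is why recall, precision and F1-Score matter for this task.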

Early on, scholars mainly used different sampling techniques to adjust the amount of data in each transaction class. These techniques balance the data volume among classes by adding or deleting samples of particular classes, and mainly include random oversampling (ROS), random downsampling (RUS), the synthetic minority oversampling technique (SMOTE) and the adaptive synthetic sampling approach (ADASYN). Random oversampling increases the number of minority-class nodes by simply copying them, thereby balancing the data volume between the majority class and the minority class. Conversely, random downsampling balances the two classes by deleting majority-class nodes. To overcome the randomness of random oversampling and random downsampling, scholars created new minority-class nodes with an improved oversampling technique and proposed the synthetic minority oversampling technique. Taking the distribution of the minority class into account, scholars further proposed the adaptive synthetic sampling approach, which generates different numbers of new nodes according to the number of minority-class nodes. Although these methods generate new samples, they also increase the likelihood of overfitting to the sample features. In addition, scholars have evaluated different sampling methods on the unbalanced data problem through a large number of experiments and found that downsampling the majority class and oversampling the minority class can both improve the performance of algorithms such as AdaBoost, XGBoost and random forests.
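The two simplest techniques in this family, random oversampling and random downsampling, can be sketched as follows. This is an illustrative sketch only: the function names and the toy records are hypothetical, and SMOTE and ADASYN are not covered.

```python
import random

def random_oversample(minority, target_size, rng):
    """ROS: copy randomly chosen minority samples until target_size is reached."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra

def random_downsample(majority, target_size, rng):
    """RUS: keep a random subset of the majority class and discard the rest."""
    return rng.sample(majority, target_size)

rng = random.Random(0)
normal = [("normal", i) for i in range(1000)]   # toy majority class
fraud = [("fraud", i) for i in range(10)]       # toy minority class

ros_fraud = random_oversample(fraud, len(normal), rng)   # 1000 fraud rows, many duplicated
rus_normal = random_downsample(normal, len(fraud), rng)  # 10 distinct normal rows
```

Oversampling only repeats existing fraud records, which is one source of the overfitting risk mentioned above, while downsampling throws away most of the normal records.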

In the past few years, researchers have tried to address the costs incurred by unbalanced classes and have proposed new approaches to the unbalanced data set problem, including cost-sensitive algorithms and loss functions. Traditional classification algorithms do not differentiate between the costs of the majority class and the minority class, which means that the misclassification cost is the same for both. In everyday applications, misclassification costs directly affect the loss function, and researchers have found that a minority-class sample is worth far more than a majority-class sample. By accounting for the different misclassification costs of different classes, scholars proposed the cost-sensitive concept to improve the efficiency of predicting fraudulent transactions. Because manual effort and traditional sampling methods cannot handle unbalanced data sets efficiently, researchers put forward the cost-sensitive idea in 2014, and it has since been applied in industry, medicine and finance. For example, to detect gaps in sewer pipes efficiently, scholars built a cost-sensitive convolutional neural network to analyze images taken inside the pipes. In an unbalanced data set, the majority class contributes more to the cost than the minority class and plays a decisive role in the loss function. By capturing the errors of the majority class and the minority class separately, scholars proposed loss functions based on the mean misclassification error and the mean squared misclassification error. Although these methods can solve the unbalanced data problem to some extent, the cost values are difficult to obtain in advance and can only be set empirically. They may also introduce larger errors and bias the loss function.
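One common way to express the cost-sensitive idea is a class-weighted binary cross-entropy. The sketch below is only an illustration of that general idea, not a loss function taken from the cited works; the function name and the cost values 10.0 and 1.0 are hypothetical and, as noted above, would have to be set empirically.

```python
import math

def weighted_bce(y_true, p_pred, fraud_cost=10.0, normal_cost=1.0):
    """Binary cross-entropy in which errors on fraud (label 1) cost more."""
    eps = 1e-12
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)          # keep log() finite
        if y == 1:
            total += -fraud_cost * math.log(p)
        else:
            total += -normal_cost * math.log(1 - p)
    return total / len(y_true)

y = [0, 0, 0, 1]
# Case 1: the single fraud is missed (predicted fraud probability 0.1).
missed_fraud = weighted_bce(y, [0.1, 0.1, 0.1, 0.1])
# Case 2: the fraud is caught, but one normal transaction raises a false alarm.
false_alarm = weighted_bce(y, [0.9, 0.1, 0.1, 0.9])
```

With equal costs the two cases yield the same loss; with the higher fraud cost, missing the fraud is penalized much more heavily than the false alarm.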

In recent years, scholars have begun to tackle the unbalanced data set problem with hybrid approaches that combine clustering algorithms and oversampling techniques. Prachuabsepakij et al. create new nodes with the SMOTE algorithm based on nodes of different classes. Fu et al. designed a cost-based sampling method to balance the classes: fraudulent transactions are divided into different clusters according to cost, and new synthetic fraudulent samples are then created from the nodes within the same cluster. The unbalanced data problem also exists in software defect detection, for which Gong et al. designed the KMFOS method. KMFOS divides all defective nodes into different clusters and then creates each new node from nodes of two different clusters. However, these approaches may incur significant computational cost because a large number of nodes are created. Moreover, oversampling creates many similar nodes, which can cause overfitting.

Disclosure of Invention

The invention aims to: in order to solve the problem of unbalanced categories of data sets in credit card fraud detection, the invention provides a method for processing the data sets in credit card fraud transactions based on a clustering downsampling technology.

The technical scheme is as follows: in order to achieve the above purpose, the present invention adopts the following technical scheme:

A method for processing a data set in credit card fraudulent transactions based on a clustering downsampling technique, the method comprising the following steps:

S1, dividing credit card transactions into normal transactions and fraudulent transactions according to the nature of the transactions, and clustering all normal transactions based on the K-Means clustering algorithm;

S2, for each cluster of normal credit card transactions obtained by clustering, creating a new node to represent the cluster, wherein the new node represents the characteristics of the normal transactions with similar characteristics;

S3, each new node serving as the representative of one type of normal transaction nodes with similar characteristics, the new nodes of all clusters forming a new normal transaction set;

S4, combining the new normal transaction set and the fraudulent transaction set into a balanced training data set.

Further, the specific process of step S1 includes the following sub-steps:

Sub-step S1-1, initializing the cluster centers: setting the number k of clusters into which the normal credit card transactions are to be divided equal to the number of fraudulent transactions, and randomly selecting k normal credit card transaction nodes as cluster center points, thereby initializing the k cluster center points;

Sub-step S1-2, clustering all normal transactions: calculating the Euclidean distance between each normal transaction node and each of the k cluster center points, and assigning each normal transaction node to the cluster whose center is closest, that is, the cluster with the smallest Euclidean distance to the node. This step is given by formula (1), which assigns the j-th normal credit card transaction node to the cluster with the shortest Euclidean distance:

$$c_j = \arg\min_{1 \le k \le n'} \sqrt{\sum_{l=1}^{m} \left(x_{jl} - C_{kl}\right)^2} \tag{1}$$

where c_j is the index of the cluster to which the j-th node is assigned, x_{jl} denotes the l-th attribute value of the j-th node, C_{kl} denotes the l-th attribute value of the center point of the k-th cluster, n' denotes the number of clusters, and m denotes the number of attributes contained in each node;

Sub-step S1-3, updating the cluster center points according to the assignment: after the normal transactions have been assigned, recalculating the center point of each cluster from the nodes now assigned to it, and updating the cluster center point information accordingly;

Sub-step S1-4, repeating sub-step S1-2 and sub-step S1-3 until the iteration termination condition is met, to obtain the final clustering result.
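Sub-steps S1-1 to S1-4 can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the function name `kmeans`, the `max_iter` limit and the termination condition (stop when no assignment changes) are choices made for illustration, since the text does not specify the termination condition, and an empty cluster simply keeps its previous center.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (tuples of numeric attributes) into k clusters."""
    rng = random.Random(seed)
    centers = list(rng.sample(points, k))      # S1-1: k random nodes as centers
    assignment = None
    for _ in range(max_iter):
        # S1-2: assign every node to the center with the smallest Euclidean distance.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:       # S1-4: assumed termination condition
            break
        assignment = new_assignment
        # S1-3: recompute each center as the mean of the nodes assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                        # an empty cluster keeps its old center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centers

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
assignment, centers = kmeans(points, k=2)
```

In the method itself, k would be set to the number of fraudulent transactions, so that the number of resulting clusters matches the size of the fraud class.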

Further, in step S2 a new node is created for each cluster of normal credit card transactions to represent the cluster, the new node representing the characteristics of normal transactions with similar characteristics, where each node is assumed to consist of continuous variables and discrete variables;

for a continuous variable, the attribute x'_l of the new node is the average value of all nodes in the cluster on that attribute, as given by formula (2):

$$x'_l = \frac{1}{n_k} \sum_{j=1}^{n_k} x_{jl} \tag{2}$$

where n_k denotes the number of nodes in the k-th cluster;

for a discrete variable, the attribute x'_l of the new node is the value that occurs most frequently among all nodes in the cluster on that attribute, as given by formula (3):

$$x'_l = \arg\max_{v} \operatorname{count}\left(x_{jl} = v\right), \quad j = 1, \dots, n_k \tag{3}$$

where the count function counts how often a value of the attribute occurs among the nodes. The pseudocode of the credit card fraud detection data set generation algorithm based on clustering downsampling is as follows.
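A minimal Python sketch of steps S2 to S4 is given here, assuming the clusters from step S1 are already available as lists of attribute tuples. The function names, the `discrete_idx` parameter and the toy records are hypothetical; ties between equally frequent discrete values are broken by first occurrence, which the text does not specify.

```python
from collections import Counter

def representative(cluster, discrete_idx):
    """S2: build one new node for a cluster.

    Continuous attributes take the cluster mean (formula (2)); discrete
    attributes take the most frequent value (formula (3)).
    """
    node = []
    for l, column in enumerate(zip(*cluster)):
        if l in discrete_idx:
            node.append(Counter(column).most_common(1)[0][0])
        else:
            node.append(sum(column) / len(column))
    return tuple(node)

def build_balanced_set(clusters, fraud, discrete_idx):
    """S3 and S4: one new node per cluster, then merge with the fraud set."""
    new_normal = [representative(c, discrete_idx) for c in clusters if c]
    return [(x, 0) for x in new_normal] + [(x, 1) for x in fraud]

# Toy data: attribute 0 is an amount (continuous), attribute 1 a channel (discrete).
clusters = [
    [(10.0, "POS"), (20.0, "POS"), (30.0, "WEB")],
    [(100.0, "ATM"), (300.0, "ATM")],
]
fraud = [(5000.0, "WEB"), (7000.0, "WEB")]
training_set = build_balanced_set(clusters, fraud, discrete_idx={1})
```

Because the number of clusters equals the number of fraudulent transactions, the resulting training set contains as many normal representatives as fraud records.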

Beneficial effects: compared with the prior art, the technical scheme of the invention has the following advantages:

(1) Aiming at the unbalanced data problem in credit card fraud detection, the invention designs a clustering downsampling technique. By regenerating nodes after clustering the normal transactions, the common characteristics of normal transactions are extracted efficiently, so that the data volume is greatly reduced while the group characteristics are preserved, the running time is shortened and the accuracy is improved; combined with the characteristics of fraudulent transactions, this enables more efficient prediction of fraudulent transactions;

(2) Credit card fraud detection has two distinctive requirements: the decision to accept or reject a transaction must be made within a very limited time, and a large amount of transaction detail data must be processed within a given time. The invention makes full use of the self-learning capability of machine learning algorithms and, for the first time in industry and academia, proposes processing schemes for both continuous variables and discrete variables in a data set based on a k-means clustering downsampling algorithm, thereby reducing the influence of subjective human judgment and improving the accuracy of identification and screening.

Drawings

FIG. 1 is a flow chart of a clustering downsampling technique;

FIG. 2 is a diagram of a credit card fraud detection process;

FIG. 3 is a diagram of a hybrid neural network architecture based on a clustering downsampling technique for credit card fraud detection;

FIG. 4 is a diagram of a credit card fraud detection architecture based on a hybrid neural network.

Detailed Description

In order to enable those skilled in the art to better understand the present invention, the following description will make clear and complete descriptions of the technical solutions according to the embodiments of the present invention with reference to the accompanying drawings. It is apparent that the described embodiments are only some of the application scenarios of the present invention, and not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

It should be noted that, related information (including, but not limited to, user equipment information, user personal information, etc.) and data (including, but not limited to, data for presentation, analyzed data, etc.) related to the present disclosure are information and data authorized by a user or sufficiently authorized by each party. For example, an interface is provided between the system and the relevant user or institution, before acquiring the relevant information, the system needs to send an acquisition request to the user or institution through the interface, and acquire the relevant information after receiving the consent information fed back by the user or institution.

Examples

In accordance with an embodiment of the present invention, there is provided a method of predicting credit card transaction behavior. It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical sequence is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in a different order than that illustrated herein.

(1) Data overview

The data of the present invention is derived from the credit card fraud data of a branch of a bank. The branch handles a large volume of transactions, and its data indicators are representative within China. At the end of 2020, the credit card spending amount for this branch was 57.75 million and the reject ratio was 1.62%.

(2) Flow chart for credit card fraud detection

The credit card fraud detection process mainly comprises the following stages: data preparation, training, testing and scoring; see fig. 2 for details.

1. Data preparation: extracting credit card transaction information and relevant customer background information from the credit card database, extracting useful features to form a data set, and dividing the data set into a training set and a test set;

2. Training phase: jointly assessing the credit card transaction information, the customer information and so on, and training the model and tuning its parameters on the training set;

3. Testing phase: evaluating the model with the evaluation metrics to obtain the optimal model parameters;

4. Scoring phase: assessing the transaction details of the credit card user and predicting whether the transaction is normal or fraudulent.

(3) Data set preparation

In credit card fraud detection, the training process of different machine learning models may require different sets of input data, including personal customer information and credit card transaction data. The invention is described here taking a hybrid neural network as an example (a credit card fraud detection method based on a deep hybrid neural network, patent application number 202211582573), and compares how several state-of-the-art credit card fraud detection algorithms handle the problem of unbalanced credit card data.

The data set used in the invention consists of the branch's detailed credit card transaction data from January 2020 to December 2020. Abnormal data were processed and non-consumption transactions were deleted, leaving more than 6.61 million detailed credit card consumption records for the period. From July 2020 to October 2020, the proportion of fraudulent credit card transactions at the branch was approximately 0.008% to 0.0139%. The specific data are shown in the following table. To verify efficiency on an unbalanced data set, we randomly selected 50,000 records from November 2020 and 50,000 records from December 2020 as the validation set and the test set, respectively, and required each set to include all fraudulent transactions of its month.

1. Feature extraction of personal customer information

In order to improve the recognition efficiency of the deep neural network, the 10 customer information fields are converted into 134 one-hot variables using one-hot encoding, and these 134 variables serve as the input of the deep neural network. The input has shape [batch size, 134], where batch size is the number of samples fed in each time. The data dictionary for the individual customer information features is as follows.
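The one-hot conversion can be illustrated with a short sketch. The two fields and their category lists below are hypothetical stand-ins; the actual 10 fields and their 134 resulting columns are defined by the data dictionary, which is not reproduced here.

```python
def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]

# Hypothetical fields and categories, for illustration only.
fields = {
    "gender": ["F", "M"],
    "education": ["primary", "secondary", "bachelor", "postgraduate"],
}

def encode_customer(customer):
    """Concatenate the one-hot vectors of all fields into one input row."""
    row = []
    for name, categories in fields.items():
        row.extend(one_hot(customer[name], categories))
    return row

batch = [
    encode_customer({"gender": "F", "education": "bachelor"}),
    encode_customer({"gender": "M", "education": "primary"}),
]
# `batch` has shape [batch size, number of one-hot columns], here [2, 6].
```

With the full data dictionary, the same procedure would produce rows of width 134.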

2. Extraction of credit card transaction information features

A corresponding 13x8 data matrix is generated for each credit card transaction, with one dimension for features and the other for time. Feature indicators are generated from the attributes of the credit card transaction information as shown in the following table, where the time window T takes the values one day, two days, one week, one month, three months, half a year and one year:

Feature name	Meaning of feature	Attribute category
Average transaction amount	Average transaction amount in past time T	Continuous variable
Total transaction amount	Total transaction amount in past time T	Continuous variable
Transaction error amount	Error between the current transaction and the average transaction amount in past time T	Continuous variable
Number of transactions	Total number of transactions in past time T	Continuous variable
Most frequent transaction mode	Transaction mode used most often in past time T	Discrete variable
Number of transaction terminals	Number of transaction terminals used by the card of this transaction in past time T	Continuous variable
Most frequent transaction channel type	Channel type with the most transactions in past time T	Discrete variable
Entropy of transaction channel types	Information gain of the transaction channel type in past time T	Continuous variable
Time period with the most transactions	Time period with the most transactions in past time T	Discrete variable
Time period with the largest transaction amount	Time period with the largest transaction amount in past time T	Discrete variable
Ratio of average transaction amount to average daily assets	Ratio of the average transaction amount in past time T to average daily assets	Continuous variable
Ratio of total transaction amount to average daily assets	Ratio of the total transaction amount in past time T to average daily assets	Continuous variable
Ratio of maximum transaction amount to average daily assets	Ratio of the maximum transaction amount in past time T to average daily assets	Continuous variable
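A few of the features in the table can be computed per time window as sketched below. This is an illustration under assumptions: the record layout (day, amount, channel), the function name, the sign convention of the error amount and the use of base-2 Shannon entropy for the channel-type entropy are all choices made here, not definitions from the source.

```python
import math
from collections import Counter

def window_features(history, current_amount, now, window_days):
    """Aggregate one card's past transactions over the last `window_days` days.

    history: list of (day, amount, channel) tuples, day being an integer index.
    """
    recent = [(amt, ch) for day, amt, ch in history if now - window_days <= day < now]
    amounts = [amt for amt, _ in recent]
    count = len(amounts)
    total = sum(amounts)
    average = total / count if count else 0.0
    channels = Counter(ch for _, ch in recent)
    # Shannon entropy (base 2) of the channel distribution; an assumed definition.
    entropy = sum(-(c / count) * math.log2(c / count) for c in channels.values()) if count else 0.0
    return {
        "average_amount": average,
        "total_amount": total,
        "amount_error": current_amount - average,
        "transaction_count": count,
        "top_channel": channels.most_common(1)[0][0] if count else None,
        "channel_entropy": entropy,
    }

history = [(1, 100.0, "POS"), (5, 300.0, "WEB"), (6, 200.0, "POS"), (7, 50.0, "ATM")]
week = window_features(history, current_amount=500.0, now=8, window_days=7)
two_days = window_features(history, current_amount=500.0, now=8, window_days=2)
```

Stacking such feature vectors for every time window gives one row or column per window of the transaction's feature matrix.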

(4) Introduction to the model

1. Introduction to the downsampling algorithm based on k-means clustering

The input data set contains both continuous variables (transaction amount, number of transaction terminals, ratio of average transaction amount to average daily assets, and so on) and discrete variables (gender, occupation code, education level, and so on). To cover these practical situations, a downsampling algorithm based on k-means clustering is proposed.

The algorithm clusters the normal transactions with the k-means algorithm, then generates a representative node for each cluster, and combines these nodes with the fraudulent transactions into a new data set that serves as the input of the training model. The specific flow chart is shown in fig. 3.

The algorithm comprises the following steps:

S1, dividing credit card transactions into normal transactions and fraudulent transactions according to the nature of the transactions, and clustering all normal transactions based on the K-Means clustering algorithm, specifically through sub-steps S1-1 to S1-4 described above;

S2, for each cluster of normal credit card transactions obtained by clustering, creating a new node to represent the cluster, wherein the new node represents the characteristics of normal transactions with similar characteristics, and each node is assumed to consist of continuous variables and discrete variables; continuous attributes are computed by formula (2) and discrete attributes by formula (3),

where n_k denotes the number of nodes in the k-th cluster;

2. Credit card fraud detection algorithm introduction

Credit card fraud detection adopts a hybrid neural network (a credit card fraud detection method based on a deep hybrid neural network, patent application number 202211582573). The hybrid neural network consists of two parts: a back-propagation neural network f_BPNN for extracting credit card customer information, and a convolutional neural network f_CNN for extracting credit card transaction features. The specific structure is shown in fig. 4.

Owing to its nonlinear mapping and normalization capabilities, the back-propagation neural network is used to identify the distinctive features of the cardholder's background information and economic condition. The back-propagation neural network has a 4-layer structure: the input layer takes the personal information of the credit card user, the second and third layers are hidden layers, and the last layer is an output layer consisting of a fully connected layer.

Because it can capture continuity and two-dimensional spatial variation in a local neighborhood, the convolutional neural network is used to identify dynamic changes in transaction information, time windows and personal economic condition, which accurately reflect the subtle differences between normal and fraudulent transactions. The convolutional neural network has a 5-layer structure: the input layer takes the user's transaction features, the middle comprises 3 hidden layers including convolutional layers and a max pooling layer, and the fifth layer is a fully connected layer that outputs the final judgment for the user.

After training f_BPNN and f_CNN, we obtain a judgment on the user information and a judgment on the transaction information. The two are then concatenated and fed into a Sigmoid function to obtain the final judgment for the credit card transaction. If the result exceeds a certain threshold, the transaction is considered fraudulent; otherwise it is considered normal.
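The final fusion step can be sketched as follows. This is a simplified illustration, not the network of the referenced application: it assumes the two branch outputs are concatenated, combined by a single linear layer whose weights and bias are hypothetical, and passed through a Sigmoid; the threshold of 0.5 is likewise an assumed value.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fuse(bpnn_out, cnn_out, weights, bias, threshold=0.5):
    """Concatenate the two branch outputs, score them and apply the threshold."""
    features = list(bpnn_out) + list(cnn_out)          # link the two judgments together
    z = sum(w * x for w, x in zip(weights, features)) + bias
    score = sigmoid(z)
    return score, score > threshold                    # True means "fraudulent"

# Hypothetical branch outputs and parameters.
score_a, is_fraud_a = fuse([0.2], [0.9], weights=[1.0, 2.0], bias=-1.0)
score_b, is_fraud_b = fuse([0.0], [0.0], weights=[1.0, 2.0], bias=-1.0)
```

In practice the threshold would be tuned on the validation set, since it trades precision against recall.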

(5) Performance evaluation index

To evaluate how well the clustering-based downsampling technique performs in credit card fraud detection, we introduce the following metrics for evaluating the algorithms concerned.

1. Confusion matrix

To evaluate the accuracy of a classification algorithm, a confusion matrix is used. For a binary classification algorithm, a 2x2 table is constructed to record the predictions of the classifier, which facilitates subsequent performance analysis, as shown in the following table.

	Predicted value = 1	Predicted value = 0
Label value = 1	True positive (TP)	False negative (FN)
Label value = 0	False positive (FP)	True negative (TN)

(1) True positive (TP): the predicted value of the sample matches the label value, and both are positive;

(2) False negative (FN): the label value of the sample is positive, but the predicted value is negative;

(3) False positive (FP): the label value of the sample is negative, but the predicted value is positive;

(4) True negative (TN): the predicted value of the sample matches the label value, and both are negative.

2. Accuracy

Accuracy measures the proportion of all samples that the model classifies correctly. In general, higher accuracy means better performance. The specific formula is as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

3. Precision

Precision is the proportion of samples predicted as positive that are actually positive. In general, higher precision means better performance. The specific formula is as follows:

$$\text{Precision} = \frac{TP}{TP + FP}$$

4. Recall

Recall is the proportion of actual positive samples that the model correctly identifies. In general, higher recall means that the model is better at detecting positive samples.

$$\text{Recall} = \frac{TP}{TP + FN}$$

5. F1-Score

The F1-Score is the harmonic mean of precision and recall and balances the two:

$$F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Its value ranges between 0 and 1, where 0 is the worst and 1 is the best.
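The four metrics can be computed directly from the confusion matrix, as in the following sketch. The function names and the toy label and prediction lists are hypothetical; returning 0.0 when a denominator is zero is a convention chosen here.

```python
def confusion(y_true, y_pred):
    """Count TP, FN, FP and TN, with label 1 as the positive (fraud) class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fn, fp, tn

def metrics(tp, fn, fp, tn):
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
tp, fn, fp, tn = confusion(y_true, y_pred)
accuracy, precision, recall, f1 = metrics(tp, fn, fp, tn)
```

In this toy case two of three frauds are caught and one normal transaction is flagged, so precision, recall and F1-Score are all 2/3 while accuracy is 0.8.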

(6) Experimental results

To evaluate how the clustering downsampling technique proposed in this patent handles unbalanced data on the credit card data set, we compare it with schemes proposed in the literature, such as random downsampling. To test the combination of different sampling methods with clustering, we also designed another clustering-based algorithm that combines random downsampling and random oversampling. This technique first divides the majority class into clusters, the number of clusters being n times the number of minority-class samples (here n is set to 10), and then creates new fraudulent transactions using the formula newfraud = a·x₁ + (1-a)·x₂, where a is a random value between 0 and 1, and x₁ and x₂ are different random samples from the minority class. The credit card fraud detection algorithm used in the tests is the hybrid neural network, and the methods for handling the unbalanced data problem include random oversampling, random downsampling, the synthetic minority oversampling technique and the adaptive synthetic sampling approach.

According to the experimental results (see the following table), the hybrid neural network based on the clustering downsampling technique achieves the best F1-Score, which is mainly reflected in a balanced improvement in precision and recall.
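The interpolation formula used by this comparison algorithm can be sketched as follows. The function name and the toy samples are hypothetical, and the sketch covers only continuous attributes, since the text does not say how discrete attributes are interpolated.

```python
import random

def synth_fraud(fraud_samples, n_new, rng):
    """Create n_new samples as a*x1 + (1-a)*x2 from two different fraud samples."""
    new = []
    for _ in range(n_new):
        x1, x2 = rng.sample(fraud_samples, 2)   # two different minority samples
        a = rng.random()                        # random value in [0, 1)
        new.append(tuple(a * u + (1 - a) * v for u, v in zip(x1, x2)))
    return new

rng = random.Random(0)
fraud_samples = [(100.0, 1.0), (300.0, 5.0), (900.0, 2.0)]   # toy continuous attributes
new_fraud = synth_fraud(fraud_samples, n_new=5, rng=rng)
```

Every synthetic sample lies on the line segment between two existing fraud samples, so the new records stay close to the ones already present.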


Claims

1. A method for processing a data set in a credit card fraudulent transaction based on a clustered downsampling technique, the method comprising the steps of:

2. The method of processing data sets in credit card fraudulent transactions based on a clustered downsampling technique according to claim 1, characterized in that the specific procedure of step S1 comprises the following sub-steps:

3. The method for processing a data set in credit card fraudulent transactions based on the clustering downsampling technique according to claim 1, characterized in that in step S2 a new node is created for each cluster of normal credit card transactions to represent the cluster, the new node representing the characteristics of normal transactions with similar characteristics, where each node is assumed to consist of continuous variables and discrete variables;

where n_k denotes the number of nodes in the k-th cluster;

the count function counts how often a value of the attribute occurs among the nodes.