CN116797346A

CN116797346A - Financial fraud detection method and system based on federated learning

Info

Publication number: CN116797346A
Application number: CN202310669873.0A
Authority: CN
Inventors: 司徒任远; 谷文聪; 罗旭东; 王慧慧
Original assignee: Nanjing University of Science and Technology
Current assignee: Nanjing University of Science and Technology
Priority date: 2023-06-07
Filing date: 2023-06-07
Publication date: 2023-09-22

Abstract

The invention discloses a financial fraud detection method and system based on federated learning. Each client oversamples its data set locally, extracts features with a deep learning method based on a one-dimensional CNN network, then executes the XGBoost algorithm, calculates two parameters of the client data (the first derivative g and the second derivative h), and transmits the parameters of the local model to a central server. The central server performs data aggregation and joint updating, and broadcasts the weighted-average parameters to the clients for iterative updating. Because a globally shared model is maintained from the local model parameters, the problems that data silos cause in the field of financial fraud detection, such as an insufficient volume of usable data and severely imbalanced data samples, can be alleviated to a certain extent, while the security of the data sets and the protection of data privacy are strengthened.

Description

Financial fraud detection method and system based on federated learning

Technical Field

The invention relates to the field of applied machine learning technology, and in particular to a financial fraud detection method and a financial fraud detection system based on federated learning.

Background

Early work used machine learning methods such as the support vector machine (SVM) and the BP neural network to detect fraudulent financial credit transactions and achieved certain results. In subsequent developments, HMM-based random forest methods and methods using the Shapley value (SV), an index from coalitional game theory, were successively proposed. In China, research has mainly constructed fraud detection schemes that combine active learning and supervised learning within an ensemble learning framework, compared them with traditional schemes, and trained and validated them on two sets of credit card data using a high posterior probability algorithm.

With the development of modern technology, many new technologies have been applied to analyze transactions in which customers may commit fraud. On the one hand, because financial transactions and customer information are highly confidential, each financial credit institution regards its data as sensitive and does not share it with others. This creates many data silos: the data cannot be fully integrated, and integrating data across regions is even more difficult and costly.

On the other hand, because fraudulent samples are far fewer than normal samples, it is difficult for a typical detection system to detect fraud patterns, and models built this way are often ineffective for lack of sufficient data or data features. Applying a federated learning framework addresses the first problem well, since model training can be completed without direct data exchange between the parties. The second problem can be addressed to some extent by introducing a suitable oversampling algorithm.

Disclosure of Invention

The invention aims to provide a financial fraud detection method and system based on federated learning.

The technical solution for realizing the purpose of the invention is as follows: a financial fraud detection method based on federated learning comprises the following steps:

establishing a central server on the financial credit network of the region where financial fraud detection is performed, wherein the central server stores a financial fraud detection model and global model parameters;

the financial credit institution client downloads the current financial fraud detection model and the global model parameters;

the financial credit institution client oversamples its local data set using the SMOTE algorithm, so as to increase the proportion of fraud samples in the whole data set;

the financial credit institution client performs feature extraction by deep learning based on a one-dimensional CNN network;

the financial credit institution client calculates model parameters from its local private data set, using the first derivative and the second derivative computed on its own data, updates the detection model, and calculates the model accuracy of the updated detection model;

the central server receives the model parameters calculated by the financial credit institution clients, aggregates and updates them, broadcasts the weighted-average parameters to the clients, and computes the updated detection model;

the financial credit institution client receives the updated parameters, performs an iterative update, and calculates the model accuracy of the updated detection model;

and after the accuracy of the financial fraud detection models of all financial credit institution clients reaches the set requirement, each financial credit institution client detects its local financial credit data using its finally updated financial fraud detection model and outputs suspected financial fraud samples.

Further, all financial institution clients join or connect to the regional financial credit network.

Further, the local private data set includes multiple groups of data, each group including a feature vector and a corresponding label.

Further, regarding the feature vector, the data set contains only numerical input variables; apart from "transaction time", "transaction amount" and the response variable "transaction category", the variables V1-V28 are information obtained by principal component analysis (PCA) transformation.

Further, the main body of the financial fraud detection model is an XGBoost decision tree model, whose learning target is to maximize the information gain on the local private data set under the current global model parameters; the local data set is oversampled using the SMOTE algorithm, and feature extraction is performed by a one-dimensional CNN.

Further, the specific process of updating the detection model comprises the following steps: the financial credit institution client, with an adaptively adjusted learning rate, calculates the first derivative g, the second derivative h and the information gain on its private data set under the current global model parameters and uploads these parameters to the server; the local financial fraud detection model is then updated according to the updated global parameters g and h broadcast by the server and the accuracy of the client's financial fraud detection model.

Further, the central server receives the model parameters calculated by the financial credit institution client for aggregation update, and the specific process of updating the detection model comprises the following steps:

the K financial credit institution clients upload their update parameters for the t-th round, $g_t^k$ and $h_t^k$, to the central server; the central server aggregates the models of all financial credit institution clients, computes the updated parameters by weighted average to generate the new round of global financial fraud detection model parameters $g_{t+1}$ and $h_{t+1}$, and broadcasts the global model parameters to the clients; iterative updating continues until the accuracy of the financial fraud detection models of all financial credit institution clients reaches the set requirement.

A federated learning-based financial fraud detection system, comprising:

the central server is used for storing the financial fraud detection model and the global model parameters, receiving the model parameters calculated by the financial credit institution clients, aggregating and updating them, updating the detection model, computing the weighted global parameters and broadcasting them, and stopping updating when the accuracy of the financial fraud detection models of all the financial credit institution clients meets the set requirement;

the financial credit institution client is used for downloading the current financial fraud detection model and receiving the global model parameters broadcast by the central server, calculating model parameters from its local private data set using its own first derivative and second derivative, uploading the calculated values to the central server, receiving the broadcast global parameters, updating the detection model, calculating the model accuracy of the updated detection model, downloading the latest financial fraud detection model, detecting the local financial credit data with the latest financial fraud detection model, and outputting suspected financial fraud data.

As an alternative embodiment, the central server is arranged on a financial credit network of the region where the financial fraud detection is located.

As an alternative embodiment, the financial credit institution client comprises:

a receiving module for downloading the financial fraud detection model from the central server and receiving global model parameters broadcast by the central server;

the calculation module is used for carrying out model training calculation on the local data to obtain model accuracy and local model parameters;

the uploading module is used for uploading the local model parameters to the central server;

and the local detection module is used for detecting the local financial credit data and outputting suspected financial fraud samples.

Compared with the prior art, the invention has the following notable advantages: (1) The invention uses a federated learning framework in which each financial credit institution client trains the local model on its local private financial credit data set, and the data sets remain local without being uploaded to a data center, so that the lack of data flow between institutions caused by data silos in the field of financial fraud detection can be alleviated to a certain extent. (2) The financial credit institution client executes the XGBoost decision tree algorithm locally and transmits the parameters of the local model to the central server. The central server executes a model-averaging aggregation algorithm and maintains a globally shared model from the local model parameters, which strengthens the security of the data sets and the protection of data privacy. (3) Before training, the financial credit institution client first executes the SMOTE oversampling algorithm, so that the shortage of usable data caused by data silos in the field of financial fraud detection can be alleviated to a certain extent.

Drawings

FIG. 1 is a block diagram of the federated learning-based financial fraud detection system of the present invention.

FIG. 2 is a schematic flow diagram of the federated learning-based financial credit fraud training and detection procedure of the present invention.

Detailed Description

In the financial fraud detection method and system based on federated learning, each client oversamples its data set locally, extracts features with a deep learning method based on a one-dimensional CNN network, then executes the XGBoost algorithm, calculates two parameters of the client data (the first derivative g and the second derivative h), and transmits the parameters of the local model to a central server. The central server performs data aggregation and joint updating, and broadcasts the weighted-average parameters to the clients for iterative updating.

The scheme adopted by the invention combines federated learning with fraud detection in the financial field and offers a new method and approach for breaking down the data silos among financial credit institutions. It reduces the losses of financial credit institutions while protecting customer privacy, and it protects the data security and data privacy of the institutions, so that the financial credit institutions participating in federated learning can share, for risk control, a financial fraud detection model that draws on learning from third-party data.

The invention is further described below with reference to the drawings and examples.

The financial fraud detection system based on federated learning of the invention is implemented as follows:

Referring to FIG. 1, the federated learning-based financial fraud detection system is composed of a central server module 101, a financial credit institution client module 102, a receiving module 103, a calculation module 104, an uploading module 105, an aggregation update module 106, and a local detection module 107.

a central server module 101 for determining the central server;

a financial credit institution client module 102 for determining the financial credit institutions participating in the detection of financial fraud;

a receiving module 103, used at the financial credit institution client for downloading the financial fraud detection model from the central server and receiving the global model parameters broadcast by the central server;

a calculation module 104, used at the financial credit institution client for performing model training calculation on the local data to obtain the model accuracy and the local model parameters;

an uploading module 105, used at the financial credit institution client for uploading the local model parameters to the central server;

an aggregation update module 106, used at the central server for aggregating the model update parameters of all financial credit institution clients, generating new model parameters by weighted average, broadcasting the new model parameters to all financial credit institution clients, and continuing the iterative update;

a local detection module 107, used at the financial credit institution client for detecting the local financial credit data and outputting suspected financial fraud samples.

The financial fraud detection method based on federated learning of the invention is implemented as follows:

Referring to FIG. 2, the specific flow of the federated learning-based financial credit fraud detection method of the invention is as follows:

Step 201, the central server is determined by the central server module 101.

Step 202, the financial credit institutions participating in federated learning financial credit fraud detection are determined by the financial credit institution client module 102.

Step 203, the local data set is oversampled by the calculation module 104 to increase the proportion of fraud samples.

Step 204, feature extraction is performed on the local data set by the calculation module 104.

Step 205, model training calculation is performed on the local data by the calculation module 104 to obtain the model accuracy and the local model parameters.

Step 206, the financial institution client uploads the local model parameters to the central server via the uploading module 105.

Step 207, the central server aggregates the model update parameters of all financial institution clients through the aggregation update module 106, generates new model parameters, broadcasts the new model parameters to all financial institution clients, and continues the iterative update.

Step 208, the financial institution client receives, through the receiving module 103, the global model parameters updated by weighted average and broadcast by the central server, and performs an iterative update.

Step 209, once the accuracy reaches the standard, all financial institution clients suspend federated learning, detect unknown financial credit data through the local detection module 107, and output suspected financial credit fraud samples.

Specifically, in step 201, all financial credit institution clients participating in federated learning cooperatively train a joint financial fraud detection system under the coordination of a central server. A fixed set of K financial institutions take part in federated learning, each holding its own local private data set $D_k = \{(x_i, y_i)\}$, where $x_i$ is a feature vector (the features V1, V2, ..., V28 are principal components obtained with PCA, and only two features, time and amount, are not transformed) and $y_i$ is the corresponding label, which indicates whether the transaction is fraudulent. Let $n$ denote the total size of all data sets participating in the joint fraud detection system and $n_k$ the size of the data set of the k-th financial institution client participating in federated learning, so that $n_k = |D_k|$; from the global view of the central server, the total data volume is $n = \sum_{k=1}^{K} n_k$.

In step 203, the financial credit institution client oversamples its local data set with the SMOTE algorithm, because in practice fraudulent financial credit transactions are far fewer than normal ones. Using the existing fraud samples as templates, the SMOTE algorithm randomly generates new fraud samples, increasing the proportion of fraud samples in the whole data set and making the class distribution of the data set relatively balanced.
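As an illustration of the oversampling in step 203, the following Python sketch generates synthetic fraud samples by interpolating between an existing fraud sample and one of its nearest fraud neighbours, which is the core idea of SMOTE. The function name `smote`, the parameters `n_new`, `k` and `seed`, and the toy data are assumptions made for this example; the description does not specify them.

```python
import math
import random

def smote(minority, n_new, k=3, seed=0):
    """Return n_new synthetic samples interpolated between minority samples.

    minority: list of feature vectors of the minority (fraud) class.
    k: number of nearest minority neighbours considered (assumed value).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted(
            (s for s in minority if s is not base),
            key=lambda s: math.dist(base, s),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy fraud samples with two features each (hypothetical values)
fraud = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.2], [0.9, 1.9]]
new_samples = smote(fraud, n_new=6)
```

Each synthetic sample lies on the line segment between two real fraud samples, so the oversampled class stays inside the region already occupied by observed fraud.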

In step 204, the data to be classified consist of text and numerical values, so one-dimensional convolution is used, and feature extraction is performed by deep learning based on a one-dimensional CNN network. The convolution layers in the CNN learn and represent the original features of the financial samples, and the pooling layers use max pooling to reduce the feature dimension. On this basic network structure, appropriate activation functions and numbers of network layers are selected. Dropout is then used to counter overfitting of the deep network. Finally, the one-dimensional vector output by the fully connected layer is the extracted feature.
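The layer sequence of step 204 (convolution, activation, max pooling, dropout, fully connected layer) can be sketched as a forward pass in plain Python. This is a minimal sketch under stated assumptions: the description gives no layer sizes, so the kernel values, the ReLU activation, the pooling size, the dropout rate and the output width are hypothetical, the weights are untrained, and all function names are chosen for this example.

```python
import random

def conv1d(x, kernel, bias=0.0):
    """Valid 1D convolution (cross-correlation) of sequence x with one kernel."""
    k = len(kernel)
    return [
        sum(x[i + j] * kernel[j] for j in range(k)) + bias
        for i in range(len(x) - k + 1)
    ]

def relu(x):
    return [max(0.0, v) for v in x]

def max_pool1d(x, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate` while training."""
    if not training or rate == 0.0:
        return list(x)
    return [0.0 if rng.random() < rate else v / (1.0 - rate) for v in x]

def dense(x, weights, biases):
    """Fully connected layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]

def extract_features(sample, kernels, fc_weights, fc_biases, rng, training=False):
    channels = [max_pool1d(relu(conv1d(sample, k))) for k in kernels]
    flat = [v for channel in channels for v in channel]
    flat = dropout(flat, rate=0.5, rng=rng, training=training)
    return dense(flat, fc_weights, fc_biases)

rng = random.Random(0)
sample = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 1.0, 0.25]   # toy 8-value record
kernels = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]           # hypothetical, untrained
flat_len = len(kernels) * ((len(sample) - 3 + 1) // 2)  # 2 channels x 3 pooled values
fc_weights = [[rng.uniform(-1, 1) for _ in range(flat_len)] for _ in range(4)]
fc_biases = [0.0] * 4
features = extract_features(sample, kernels, fc_weights, fc_biases, rng)
```

In practice such a network would be built and trained with a deep learning framework; the sketch only shows how a record is reduced to the one-dimensional feature vector that is then passed to the decision tree model.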

In step 205, each financial institution client financial credit fraud detection model uses an XGBoost decision tree, which is maintained by all financial institution clients and central servers participating in federal learning.

In step 205, the learning objective of the XGBoost decision tree model for all financial institution clients is:

$$\mathcal{L} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k)$$

where $\hat{y}_i$ is the output of the whole additive model, $l$ is the loss function, and the regularization term $\sum_k \Omega(f_k)$ is a function representing the complexity of the trees. The first derivative g and the second derivative h of the loss function are calculated to obtain the information gain, and the tree is optimized according to the information gain. The smaller the objective, the better the learning effect. During learning, each model has a loss function from which g and h are defined, and the loss function captures the errors of the model on the data.
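The description does not name the loss function from which g and h are derived. Assuming the logistic loss that is commonly used for binary classification, the per-sample first and second derivatives with respect to the raw (pre-sigmoid) prediction are $g_i = p_i - y_i$ and $h_i = p_i(1 - p_i)$, where $p_i$ is the predicted fraud probability. The following sketch computes them; the function names and toy labels are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_grad_hess(y_true, raw_pred):
    """Per-sample first derivative g and second derivative h of the
    logistic loss with respect to the raw prediction (assumed loss)."""
    g, h = [], []
    for y, z in zip(y_true, raw_pred):
        p = sigmoid(z)
        g.append(p - y)          # gradient: predicted probability minus label
        h.append(p * (1.0 - p))  # hessian: variance of the Bernoulli prediction
    return g, h

labels = [0, 0, 1, 0, 1]              # 1 marks a fraudulent transaction
raw_scores = [0.0] * len(labels)      # initial model predicts probability 0.5
g, h = logistic_grad_hess(labels, raw_scores)
```

With all raw scores at zero, every fraud sample gets g = -0.5, every normal sample gets g = 0.5, and h is 0.25 throughout.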

In each communication round t = 1, 2, ..., each financial institution client participating in federated learning downloads the current global model parameters $g_t$ and $h_t$ from the server. The best split is then searched with a greedy algorithm, with the aim of maximizing the learning gain of each iteration, where the gain of a candidate split is evaluated over $I_L$ and $I_R$, the index sets of the data samples in the left and right branches.
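For reference, the split gain used by standard XGBoost is $\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$, where $G$ and $H$ are the sums of g and h over $I_L$ or $I_R$. Whether the description intends exactly this form is an assumption, and the regularization constants $\lambda$ (`lam`) and $\gamma$ (`gamma`) are not named in the source. The sketch below applies a greedy search over the thresholds of a single feature; all names and data are hypothetical.

```python
def split_gain(g, h, left_idx, right_idx, lam=1.0, gamma=0.0):
    """Standard XGBoost gain of splitting a node into left and right index sets."""
    def score(idx):
        G = sum(g[i] for i in idx)
        H = sum(h[i] for i in idx)
        return G * G / (H + lam)
    parent = list(left_idx) + list(right_idx)
    return 0.5 * (score(left_idx) + score(right_idx) - score(parent)) - gamma

def best_split(feature, g, h, lam=1.0, gamma=0.0):
    """Greedy search: try every threshold between consecutive sorted values."""
    order = sorted(range(len(feature)), key=lambda i: feature[i])
    best = (float("-inf"), None)
    for pos in range(1, len(order)):
        lo, hi = feature[order[pos - 1]], feature[order[pos]]
        if lo == hi:
            continue  # no threshold separates equal values
        gain = split_gain(g, h, order[:pos], order[pos:], lam, gamma)
        if gain > best[0]:
            best = (gain, (lo + hi) / 2.0)
    return best

feature = [1.0, 2.0, 3.0, 10.0, 11.0]   # toy feature values
g = [0.5, 0.5, 0.5, -0.5, -0.5]         # toy first derivatives
h = [0.25] * 5                          # toy second derivatives
gain, threshold = best_split(feature, g, h)
```

On this toy data the search places the threshold between 3.0 and 10.0, which is where the sign of g changes and the two groups are best separated.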

In step 207, the K financial institution clients upload their update parameters for the t-th round, $g_t^k$ and $h_t^k$, to the central server, and the central server aggregates the model update parameters of all financial institution clients to generate the new round of global financial fraud detection model parameters $g_{t+1}$ and $h_{t+1}$, broadcasts them to all clients, and the iteration continues.
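The aggregation in step 207 can be sketched as a weighted average. The description states that a weighted average is used but does not give the weights; the sketch assumes each client k is weighted by its share of the data, $n_k / n$, and treats the uploaded g and h as one scalar per client for simplicity. The function name `aggregate` and the numbers are hypothetical.

```python
def aggregate(client_updates):
    """Weighted average of client parameters.

    client_updates: list of (n_k, g_k, h_k) tuples, one per client, where
    n_k is the client's data set size. Weights n_k / n are an assumption.
    """
    n = sum(n_k for n_k, _, _ in client_updates)
    g_next = sum(n_k * g_k for n_k, g_k, _ in client_updates) / n
    h_next = sum(n_k * h_k for n_k, _, h_k in client_updates) / n
    return g_next, h_next

# Two hypothetical clients with 1000 and 3000 local samples
updates = [(1000, 0.20, 0.10), (3000, 0.40, 0.30)]
g_next, h_next = aggregate(updates)
```

Here the larger client dominates the result: the aggregated values are 0.35 and 0.25 rather than the unweighted means 0.30 and 0.20.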

In step 209, after T iterations, when the accuracy of the financial credit fraud detection models of all financial institution clients exceeds 90%, the updating of model parameters may be suspended to avoid wasting resources, and each client performs financial fraud detection on its local financial credit data.
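The stopping rule of step 209 can be expressed as a small predicate. The 90% threshold comes from the description; the round cap `max_rounds` is an assumption added so that the loop in the example always terminates, and the function name and accuracy values are hypothetical.

```python
def should_stop(client_accuracies, round_index, threshold=0.90, max_rounds=50):
    """Stop when every client exceeds the accuracy threshold,
    or when the assumed cap on communication rounds is reached."""
    all_good = all(acc > threshold for acc in client_accuracies)
    return all_good or round_index >= max_rounds

# Hypothetical per-round accuracies reported by three clients
history = [
    [0.81, 0.85, 0.79],
    [0.88, 0.91, 0.86],
    [0.92, 0.93, 0.91],
]
stopped_at = None
for t, accuracies in enumerate(history, start=1):
    if should_stop(accuracies, t):
        stopped_at = t
        break
```

Requiring every client to pass the threshold, rather than the average, matches the wording that the accuracy of all clients must reach the set requirement.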

While the foregoing description of the embodiments of the present invention has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the invention, but rather, it is intended to cover all modifications or variations within the scope of the invention as defined by the claims of the present invention.

Claims

1. A financial fraud detection method based on federated learning, characterized by comprising the following steps:

firstly, a central server is established on a regional financial credit network, all financial credit institution clients join or connect the regional financial credit network, the central server stores a financial fraud detection model and global model parameters, and a detection model main body is an XGBoost decision tree model;

secondly, the financial credit institution client downloads the current financial fraud detection model and the global model parameters;

thirdly, the financial credit institution client oversamples its local private data set using the SMOTE algorithm;

fourthly, a joint client/server fraud detection model is trained, the training of the XGBoost decision tree detection model being completed under a federated learning framework, with the financial credit institution client training the detection model using the XGBoost decision tree model;

fifthly, the central server receives the model parameters calculated by the financial credit institution clients, calculates the global parameters and performs the aggregation update;

and sixthly, after the accuracy of the financial fraud detection models of all financial credit institution clients reaches the set requirement, each financial credit institution client detects its local financial credit data using the finally updated financial fraud detection model and outputs suspected financial fraud samples.

2. The federated learning-based financial fraud detection method according to claim 1, wherein: the local private data set comprises a plurality of groups of data, each group comprising a feature vector and a corresponding label; specifically, the data set contains only numerical input variables, namely "transaction time", "transaction amount", the response variable "transaction category", and V1-V28, which are information obtained by principal component analysis (PCA) transformation; the corresponding label indicates whether fraud has occurred.

3. The federated learning-based financial fraud detection method according to claim 1, characterized in that: the learning target of the XGBoost decision tree model is to maximize the information gain on the local private data set under the current global model parameters.

4. The federated learning-based financial fraud detection method of claim 1, wherein the method of training the joint client/server fraud detection model specifically comprises:

the financial credit institution client performs feature extraction on the data set by deep learning based on a one-dimensional CNN network;

the financial credit institution client trains on the oversampled and feature-extracted data set using the XGBoost decision tree model, and calculates model parameters using the first derivative and the second derivative of the decision tree;

the central server receives the model parameters uploaded by the financial credit institution clients, aggregates and updates them, broadcasts the weighted-average parameters to the clients, and computes the updated detection model;

and the financial credit institution client receives the updated parameters, performs an iterative update, and calculates the model accuracy of the updated detection model.

5. The federated learning-based financial fraud detection method of claim 1, wherein the specific process by which the financial credit institution client trains the detection model with the XGBoost decision tree model includes:

the financial credit institution client calculates, under the current global model parameters, the first derivative g and the second derivative h of the trained decision tree on its private data set, and uploads these two non-global parameters to the server; the client receives the global parameters g and h, updated by weighted average and broadcast by the server, calculates the accuracy of its financial fraud detection model, and iteratively updates the local financial fraud detection model.

6. The federated learning-based financial fraud detection method according to claim 1, wherein: the central server receives the model parameters calculated by the financial credit institution clients, calculates the global parameters and performs the aggregation update, and the specific process of updating the detection model comprises the following steps:

the K financial credit institution clients upload their update parameters for the t-th round, $g_t^k$ and $h_t^k$, to the central server; the central server aggregates the models of all financial credit institution clients, computes the updated parameters by weighted average to generate the new round of global financial fraud detection model parameters $g_{t+1}$ and $h_{t+1}$, and broadcasts the global model parameters to the clients; iterative updating continues until the accuracy of the financial fraud detection models of all financial credit institution clients reaches the set requirement.

7. A federated learning-based financial fraud detection system, comprising:

the central server module is used for storing the financial fraud detection model and the global model parameters, calling the aggregation update module to perform the global aggregation update, updating the detection model, computing the weighted global parameters and broadcasting them; the central server module is disposed on the financial credit network of the region where financial fraud detection is performed;

the financial credit institution client calls the receiving module to download the current financial fraud detection model; it calls the calculation module to calculate model parameters from the local private data set using the first derivative and the second derivative, and uploads the calculated values to the central server through the uploading module; it calls the receiving module to receive the broadcast global parameters and update the detection model, and the calculation module calculates the model accuracy of the updated detection model;

the receiving module is used for downloading the financial fraud detection model from the central server and receiving global model parameters broadcasted by the central server;