CN109886284B - Fraud detection method and system based on hierarchical clustering

Info

Publication number: CN109886284B
Application number: CN201811522918.7A
Authority: CN
Inventors: 蒋昌俊; 闫春钢; 丁志军; 刘关俊; 张亚英; 张友军
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2018-12-12
Filing date: 2018-12-12
Publication date: 2021-02-12
Anticipated expiration: 2038-12-12
Also published as: CN109886284A

Abstract

A fraud detection method and system based on hierarchical clustering. The method comprises: acquiring and analyzing transaction feature information to obtain feature analysis data, and selecting a clustering model according to the feature analysis data; acquiring a sample data set, hierarchically clustering the sample data set according to the clustering model to construct a tree structure, and dividing the sample data set into leaf nodes of the tree structure; classifying the leaf nodes to obtain node type data; and processing the leaf nodes in the clustering tree model according to the node type data to complete fraudulent transaction detection. The invention thereby addresses the technical problems of the prior art: incomplete consideration of the factors affecting performance, low detection accuracy, and class imbalance.

Description

Fraud detection method and system based on hierarchical clustering

Technical Field

The invention relates to a financial fraud detection system, in particular to a fraud detection method and system based on hierarchical clustering.

Background

With the rapid development of electronic commerce, the online transaction amount is increased rapidly, and transaction fraud events are frequent. Due to the openness of the internet environment, a fraudster can master various fraud means such as phishing websites, phone fraud, and the like; meanwhile, due to the characteristics of diversity, anonymity and the like of payment modes, fraud modes are continuously changed. Faced with these problems, it is difficult for financial companies to detect fraudulent transactions through conventional rule-based expert systems, which causes serious economic losses to companies and individuals. Therefore, it is of great practical significance to research how to establish an effective transaction fraud detection model.

To address the increasingly serious problem of transaction fraud, many machine learning models have been applied to fraudulent transaction detection, including classification models such as the Support Vector Machine (SVM), K-nearest neighbor (KNN) and random forest. However, the number of legitimate transaction samples in a transaction data set is far larger than the number of fraudulent ones; that is, the data exhibit class imbalance, which greatly reduces the classification performance of conventional models. Four main factors give rise to this problem: imbalance ratio, sample size, separability, and intra-class sub-clustering. Existing improvements reduce the negative influence of class imbalance on conventional classification models at two levels, the data level and the algorithm level. Data-level methods mainly resample the data to change the ratio of positive to negative samples in the data set, but this carries a risk of under-fitting or over-fitting. Algorithm-level methods mainly modify the structure of an existing classification model or introduce a cost-sensitive function so that the model leans toward learning the minority class during training, but such methods lack generality and are highly complex. Moreover, both kinds of method in essence consider only one of the four factors, the imbalance ratio, and ignore the other three.

In conclusion, the prior art has the technical problems of incomplete performance consideration, low detection accuracy and unbalanced category.

Disclosure of Invention

In view of the above disadvantages of the prior art, an object of the present invention is to provide a fraud detection method and system based on hierarchical clustering, which solve the technical problems of incomplete performance consideration, low detection accuracy and unbalanced classification in the prior art. A fraud detection method based on hierarchical clustering comprises the following steps: acquiring and analyzing transaction characteristic information to obtain characteristic analysis data, and selecting a clustering model according to the characteristic analysis data; acquiring a sample data set, hierarchically clustering the sample data set according to a clustering model to construct a tree structure, and dividing the sample data set into leaf nodes of the tree structure; classifying the leaf nodes to obtain node type data; and processing leaf nodes in the clustering tree model according to the node type data to finish fraud transaction detection.

In one embodiment of the present invention, obtaining and analyzing the transaction feature information to obtain feature analysis data, and selecting the clustering model according to the feature analysis data includes: acquiring an actual data set, and extracting transaction characteristic information in the actual data set; obtaining feature analysis data based on separability analysis of the transaction feature information; processing the characteristic analysis data into distribution judgment data; and selecting a clustering model according to the distribution judgment data.

In an embodiment of the present invention, acquiring a sample data set, hierarchically clustering the sample data set according to a clustering model to construct a tree structure, and partitioning the sample data set into leaf nodes of the tree structure, includes: creating a tree structure; acquiring and storing a sample data set and node condition data of leaf nodes; selecting applicable processing logic of the current leaf node according to the node condition data; dividing the current node into the tree structure by hierarchical clustering according to the applicable processing logic; and iterating the foregoing steps until the sample data set is completely divided into leaf nodes of the tree structure.

In one embodiment of the present invention, classifying leaf nodes to obtain node type data includes: acquiring all leaf nodes in a tree structure; extracting category information, balance ratio data and sample number information of leaf nodes; classifying the current leaf nodes according to the category information, the balance ratio data and the sample number information; and acquiring node type data of the current leaf node, and circularly executing the steps until all the leaf nodes are classified into single-class leaf nodes, class balance leaf nodes and leaf nodes containing abnormal samples.

In an embodiment of the present invention, processing leaf nodes in a clustering tree model according to node type data to complete fraud transaction detection includes: acquiring node type data, and selecting an applicable processing mode of a node according to the node type data; and traversing and processing the leaf nodes in the tree structure according to an applicable processing mode.

In an embodiment of the present invention, traversing and processing leaf nodes in a tree structure according to an applicable processing manner includes: judging the type of the current leaf node according to the node type data; if the current leaf node is a single-category node, directly returning the type of the leaf node; if the current leaf node is a category balancing node, training samples in the leaf node by using a preset classification method; if the current leaf node is a leaf node containing an abnormal sample, detecting the leaf node by using preset abnormal detection logic; the foregoing operations are performed on leaf nodes in the tree structure.

In an embodiment of the present invention, a fraud detection system based on hierarchical clustering is characterized by comprising: the system comprises a clustering model selection module, a tree structure module, a leaf node classification module and a fraud detection module; the cluster model selection module is used for acquiring and analyzing the transaction characteristic information to obtain characteristic analysis data and selecting a cluster model according to the characteristic analysis data; the tree structure module is used for acquiring the sample data set, hierarchically clustering the sample data set according to the clustering model to construct a tree structure, and dividing the sample data set into leaf nodes of the tree structure, and the tree structure module is connected with the clustering model selection module; the leaf node classification module is used for classifying the leaf nodes to obtain node type data and is connected with the tree structure module; and the fraud detection module is used for processing the leaf nodes in the clustering tree model according to the node type data to complete fraud transaction detection, and is connected with the leaf node classification module.

In an embodiment of the present invention, the cluster model selecting module includes: the system comprises a transaction characteristic extraction module, a characteristic analysis module, an analysis data processing module and a model selection module; the transaction characteristic extraction module is used for acquiring an actual data set and extracting transaction characteristic information in the actual data set; the characteristic analysis module is used for obtaining characteristic analysis data based on separability analysis of the transaction characteristic information, and the transaction characteristic extraction module is connected with the characteristic analysis module; the analysis data processing module is used for processing the characteristic analysis data into distribution judgment data and is connected with the characteristic analysis module; and the model selection module is used for selecting the clustering model according to the distribution judgment data and is connected with the analysis data processing module.

In one embodiment of the present invention, the tree structure module includes: a clustering tree creation module, a node condition acquisition module, a processing logic selection module, a tree division module and a sample data iteration module; the clustering tree creation module is used for creating a tree structure; the node condition acquisition module is used for acquiring and storing the sample data set and the node condition data of the leaf nodes, and is connected with the clustering tree creation module; the processing logic selection module is used for selecting the applicable processing logic of the current leaf node according to the node condition data, and is connected with the node condition acquisition module; the tree division module is used for dividing the current node into the tree structure by hierarchical clustering according to the applicable processing logic, and is connected with the processing logic selection module; and the sample data iteration module is used for iterating the foregoing steps until the sample data set is completely divided into leaf nodes of the tree structure, and is connected with the tree division module.

In an embodiment of the present invention, the leaf node classification module includes: a leaf node acquisition module, a node data extraction module, a current node classification module and a node class traversal module; the leaf node acquisition module is used for acquiring all leaf nodes in the tree structure; the node data extraction module is used for extracting the category information, the balance ratio data and the sample number information of the leaf nodes, and is connected with the leaf node acquisition module; the current node classification module is used for classifying the current leaf node according to the category information, the balance ratio data and the sample number information, and is connected with the node data extraction module; and the node class traversal module is used for acquiring the node type data of the current leaf node and executing the foregoing steps in a loop until all the leaf nodes are classified into single-class leaf nodes, class-balanced leaf nodes and leaf nodes containing abnormal samples, and is connected with the current node classification module.

In one embodiment of the present invention, the fraud detection module includes: an applicable mode selection module and a traversal detection module; the applicable mode selection module is used for acquiring the node type data and selecting the applicable processing mode of the node according to the node type data; and the traversal detection module is used for traversing and processing the leaf nodes in the tree structure according to the applicable processing mode, and is connected with the applicable mode selection module.

In an embodiment of the present invention, the traversal detection module includes: the system comprises a node type judging module, a single-class returning module, a balanced node training module, an abnormal node detecting module and a tree structure traversing detecting module; the node type judging module is used for judging the type of the current leaf node according to the node type data; the single-category returning module is used for directly returning the type of the leaf node when the current leaf node is the single-category node, and the single-category returning module is connected with the node type judging module; the balanced node training module is used for training samples in the leaf nodes by using a preset classification method when the current leaf node is a category balanced node, and is connected with the node type judging module; the abnormal node detection module is used for detecting the leaf nodes by using preset abnormal detection logic when the current leaf nodes are leaf nodes containing abnormal samples, and the abnormal node detection module is connected with the node type judgment module; and the tree structure traversal detection module is used for executing the operation on the leaf nodes in the tree structure, and is connected with the node type judgment module.

As described above, the fraud detection method and system based on hierarchical clustering provided by the present invention have the following beneficial effects. Four essential factors influencing classification performance are comprehensively considered: the imbalance ratio, the sample size, the separability and the intra-class sub-clustering. This makes up for the defect of the prior art, which considers only the single factor of the imbalance ratio. An unsupervised clustering model is used for hierarchical clustering, so that a large class-imbalanced data set is divided into a number of data subsets of three kinds; the problem is thus simplified by a divide-and-conquer approach, and class imbalance is addressed from a new angle.

In conclusion, the invention solves the technical problems of incomplete performance consideration, low detection accuracy and unbalanced category in the prior art.

Drawings

FIG. 1 is a schematic diagram showing steps of a hierarchical clustering-based fraud detection method according to the present invention.

Fig. 2 is a flowchart illustrating step S1 in fig. 1 in an embodiment.

Fig. 3 is a flowchart illustrating step S2 in fig. 1 in an embodiment.

FIG. 4 is a schematic diagram of the clustering tree structure of the present invention.

Fig. 5 is a flowchart illustrating step S3 in fig. 1 in an embodiment.

Fig. 6 is a flowchart illustrating step S4 in fig. 1 in an embodiment.

Fig. 7 is a flowchart illustrating step S42 in fig. 6 in an embodiment.

FIG. 8 is a schematic diagram of a hierarchical clustering-based fraud detection system according to the present invention.

Fig. 9 is a schematic diagram illustrating a specific module of the clustering model selecting module 11 in fig. 8 in an embodiment.

Fig. 10 is a block diagram of the tree structure module 12 of fig. 8 according to an embodiment.

Fig. 11 is a block diagram of the leaf node classification module 13 in fig. 8 according to an embodiment.

FIG. 12 is a block diagram of the fraud detection module 14 of FIG. 8 in one embodiment.

Fig. 13 is a block diagram illustrating the traversal detection module 142 of fig. 12 in an embodiment.

Description of the element reference numerals

1 fraud detection system based on hierarchical clustering

11 clustering model selection module

12 tree structure module

13 leaf node classification module

14 fraud detection module

111 transaction feature extraction module

112 feature analysis module

113 analysis data processing module

114 model selection module

121 clustering tree creation module

122 node condition acquisition module

123 processing logic selection module

124 tree division module

125 sample data iteration module

131 leaf node obtaining module

132 node data extraction module

133 current node classification module

134 node class traversal module

141 applicable mode selecting module

142 traversal detection module

1421 node type judgment module

1422 single-class return module

1423 balanced node training module

1424 abnormal node detection module

1425 tree structure traversal detection module

Description of step designations

FIG. 1: S1 to S4

FIG. 2: S11 to S14

FIG. 3: S21 to S25

FIG. 5: S31 to S34

FIG. 6: S41 to S42

FIG. 7: S421 to S425

Detailed Description

The following description of the embodiments of the present invention is provided for illustrative purposes, and other advantages and effects of the present invention will become apparent to those skilled in the art from the present disclosure.

Referring to fig. 1 to 13, it should be understood that the structures shown in the drawings of this specification are intended only to accompany the disclosed content for the understanding of those skilled in the art. They are not intended to limit the conditions under which the present invention can be implemented and therefore have no essential technical significance. In addition, terms such as "upper", "lower", "left", "right", "middle" and "one" are used in this specification for clarity of description only and are not intended to limit the scope of the present invention; the relative relationships they describe do not limit that scope either.

Referring to fig. 1, a schematic diagram showing steps of the fraud detection method based on hierarchical clustering according to the present invention is shown, as shown in fig. 1, a fraud detection method based on hierarchical clustering includes:

S1, acquiring and analyzing transaction feature information to obtain feature analysis data, and selecting a clustering model according to the feature analysis data; the invention provides a fraud detection model based on hierarchical clustering that targets the class imbalance problem in fraudulent transaction detection;

S2, acquiring a sample data set, hierarchically clustering the sample data set according to the clustering model to construct a tree structure, and dividing the sample data set into leaf nodes of the tree structure; optionally, the tree structure is a clustering tree: the fraud detection model builds the clustering tree by hierarchical clustering, and in this process the original data set is divided, over multiple iterations, into the leaf nodes of the clustering tree;

S3, classifying the leaf nodes to obtain node type data; optionally, each leaf node is a data subset;

and S4, processing the leaf nodes in the clustering tree model according to the node type data to complete fraudulent transaction detection; in the end, only the data subset in each leaf node needs to be processed in the corresponding way to detect the abnormal transaction samples in that subset.

Referring to fig. 2, which is a detailed flowchart of step S1 in fig. 1 in an embodiment, as shown in fig. 2, step S1 is performed to obtain and analyze transaction feature information to obtain feature analysis data, and selecting a cluster model according to the feature analysis data includes:

S11, acquiring an actual data set and extracting the transaction feature information in the actual data set. The model addresses each of the four essential factors influencing classification performance. Regarding sample size, a class-imbalanced data set can be used as the input of the model without any resampling preprocessing, so the sample size equals the size of the whole data set. Regarding the imbalance ratio, the model automatically filters out majority-class samples during hierarchical clustering and finally constructs a number of class-balanced leaf nodes; in other words, the model automatically adjusts the class imbalance ratio in the data set;

S12, performing separability analysis on the transaction feature information to obtain feature analysis data. Regarding separability, a suitable clustering model can be selected according to the characteristics of the data set so that more samples are filtered out during hierarchical clustering;

S13, processing the feature analysis data into distribution judgment data, on the basis of which a suitable clustering model is selected according to separability: if the data set follows a Gaussian distribution, a Gaussian Mixture Model (GMM) is used in the model; if the abnormal samples aggregate in Euclidean space, K-Means is used. Optionally, the corresponding clustering tree model is constructed from the real transaction data of a financial company. First, the features of the real data set are analyzed for separability in order to select the most suitable clustering model. The distribution characteristics of the data set can be observed in Euclidean space; for visualization, the data set is reduced in dimensionality by PCA to obtain a more intuitive scatter plot in two-dimensional space;

S14, selecting a clustering model according to the distribution judgment data. Regarding intra-class sub-clustering, because the model is built on an unsupervised clustering algorithm, the influence of intra-class sub-clustering on classification performance is greatly reduced. Optionally, the scatter plot may show that the data set is distributed in clusters in Euclidean space, in which case K-Means may be selected as the clustering model.
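The selection rule in S13 and S14 can be illustrated with a small sketch. The patent does not state how "aggregation in Euclidean space" is measured, so the function name `choose_clustering_model`, the `looks_gaussian` flag, the `tightness` threshold and the spread heuristic below are all assumptions made for illustration only; in practice the decision is described as being made by inspecting a PCA scatter plot.

```python
import math

def _centroid(points):
    # Points are tuples of floats; the centroid is returned as a tuple too.
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def _spread(points):
    # Mean Euclidean distance of the points to their own centroid.
    c = _centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)

def choose_clustering_model(samples, labels, looks_gaussian=False, tightness=0.5):
    """Hypothetical selection rule: label 1 marks an abnormal (fraud) sample.

    looks_gaussian is assumed to come from a separate distribution check.
    """
    if looks_gaussian:
        return "GMM"
    fraud = [x for x, y in zip(samples, labels) if y == 1]
    # Assumed heuristic: abnormal samples count as "aggregated" when they are
    # much more compact than the data set as a whole.
    if fraud and _spread(fraud) < tightness * _spread(samples):
        return "KMeans"
    return "undecided"
```

For example, three fraud samples packed around (10, 10) among widely scattered normal samples would yield "KMeans" under this heuristic, while fraud samples that are as scattered as the rest of the data leave the choice undecided.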

Referring to fig. 3 and 4, which are a detailed flowchart of step S2 in fig. 1 in an embodiment and a schematic diagram of the clustering tree structure of the present invention, as shown in fig. 3 and 4, step S2, acquiring a sample data set, hierarchically clustering the sample data set according to a clustering model to construct a tree structure, and partitioning the sample data set into leaf nodes of the tree structure, includes:

S21, creating a tree structure. The most important part of the whole model is the algorithm that constructs the clustering tree by hierarchical clustering; the algorithm is recursive, and its construction process is explained in the following steps;

S22, acquiring and storing the sample data set and the node condition data of leaf nodes. The algorithm takes as input a data set Dataset, a leaf node balance ratio BRate and a leaf node minimum sample number MSize; the numbers of positive and negative samples in Dataset are then counted and stored in N1 and N0 respectively;

S23, selecting the applicable processing logic of the current leaf node according to the node condition data. Optionally, the current Dataset is checked against the three leaf node conditions in turn. If N1 or N0 is 0, the single-class leaf node condition is satisfied, and the current leaf node is processed with "singleLable" (directly returning the class of the data subset in the leaf node). If the ratio of N1 to N0 is less than BRate, the class-balanced leaf node condition is satisfied, and an "SVM" (support vector machine classifier) is used to classify the data subset in the current leaf node. If the total of N1 and N0 is less than MSize, the condition for a leaf node containing abnormal samples is satisfied, and "KNN" (K-nearest neighbor model) is used for anomaly detection on the data subset in the current leaf node. When none of the three leaf node conditions is satisfied, the current node is treated as a non-leaf node: the data set in the current node is clustered with KMeans (K-Means clustering model) or GMM (Gaussian mixture model), the current procedure is called recursively on the data subset of each cluster, and each result becomes a subtree of the current node;

S24, dividing the current node into the tree structure by hierarchical clustering according to the applicable processing logic; the tree structure is built by repeated iteration with the selected clustering model. Optionally, in each node of the tree, "cluster number" denotes the ID of the cluster to which the node was assigned by the clustering operation at the previous level, "normal" denotes the number of normal samples, "abnormal" denotes the number of abnormal samples, and "model" denotes the model used to process the data subset in the node;

and S25, iterating the foregoing steps until the sample data set is completely divided into leaf nodes of the tree structure; the data set is progressively divided into leaf nodes during this process.
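The recursive construction in S21 to S25 might be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the tiny `kmeans` helper stands in for a real K-Means or GMM library, the balance test uses the majority-to-minority ratio (one reading of BRate), and the fallback for a cluster step that fails to split the data is an addition of this sketch. The node keys follow the fields named in S24.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Tiny Lloyd's k-means over tuples; returns one cluster index per point."""
    rng = random.Random(seed)
    k = min(k, len(points))
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assign

def build_tree(dataset, brate, msize, k=2, cluster_id=0):
    """dataset: list of (features_tuple, label), label 1 = abnormal, 0 = normal."""
    n1 = sum(1 for _, y in dataset if y == 1)
    n0 = len(dataset) - n1
    node = {"cluster number": cluster_id, "normal": n0, "abnormal": n1}
    if n1 == 0 or n0 == 0:
        node["model"] = "singleLable"      # single-class leaf
    elif max(n1, n0) / min(n1, n0) <= brate:
        node["model"] = "SVM"              # class-balanced leaf (assumed ratio test)
    elif n1 + n0 < msize:
        node["model"] = "KNN"              # leaf containing abnormal samples
    else:
        assign = kmeans([x for x, _ in dataset], k)
        groups = [[s for s, a in zip(dataset, assign) if a == c] for c in range(k)]
        groups = [g for g in groups if g]
        if len(groups) < 2:
            node["model"] = "KNN"          # assumption: stop if no split is possible
        else:
            node["model"] = "KMeans"       # non-leaf node: recurse into each cluster
            node["children"] = [build_tree(g, brate, msize, k, i)
                                for i, g in enumerate(groups)]
    return node

def leaves(node):
    """Collect the leaf nodes of a tree built by build_tree."""
    if "children" not in node:
        return [node]
    return [leaf for child in node["children"] for leaf in leaves(child)]
```

Because every recursive call receives a strictly smaller, non-empty subset, the recursion terminates, and the "normal" and "abnormal" counts over all leaves add up to the counts of the original data set.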

Referring to fig. 5, which is a detailed flowchart of step S3 in fig. 1 according to an embodiment, as shown in fig. 5, the step S3 of classifying leaf nodes to obtain node type data includes:

S31, acquiring all leaf nodes in the tree structure;

S32, extracting the category information, the balance ratio data and the sample number information of the leaf nodes, so that the four essential factors influencing classification performance are comprehensively considered: imbalance ratio, sample size, separability and intra-class sub-clustering;

S33, classifying the current leaf node according to the category information, the balance ratio data and the sample number information; optionally, three types of leaf node are finally formed: single-class leaf nodes, class-balanced leaf nodes, and leaf nodes containing abnormal samples;

and S34, acquiring the node type data of the current leaf node, and circularly executing the steps until all the leaf nodes are classified into single-class leaf nodes, class balance leaf nodes and leaf nodes containing abnormal samples.
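Steps S31 to S34 amount to assigning one of three types to each leaf from its class counts. The sketch below is illustrative only: the names `LeafType` and `classify_leaf` are hypothetical, and the balance test again assumes that the balance ratio compares the majority-class count with the minority-class count.

```python
from enum import Enum

class LeafType(Enum):
    SINGLE_CLASS = "single-class"
    BALANCED = "class-balanced"
    ANOMALOUS = "contains abnormal samples"

def classify_leaf(n_normal, n_abnormal, brate, msize):
    """Return the LeafType for the given counts, or None if no condition holds."""
    if n_normal == 0 or n_abnormal == 0:
        return LeafType.SINGLE_CLASS
    # Assumed reading of the balance ratio: majority count over minority count.
    if max(n_normal, n_abnormal) / min(n_normal, n_abnormal) <= brate:
        return LeafType.BALANCED
    if n_normal + n_abnormal < msize:
        return LeafType.ANOMALOUS
    return None  # such a node would have been split further, so it is not a leaf

def classify_all(leaf_counts, brate, msize):
    """leaf_counts: iterable of (n_normal, n_abnormal) pairs, one per leaf."""
    return [classify_leaf(n0, n1, brate, msize) for n0, n1 in leaf_counts]
```

With a balance ratio of 2 and a minimum size of 10, a leaf with counts (50, 0) is single-class, (6, 4) is class-balanced, and (7, 1) contains abnormal samples.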

Referring to fig. 6, which is a detailed flowchart of step S4 in fig. 1, in an embodiment, as shown in fig. 6, step S4, processing leaf nodes in the clustering tree model according to the node type data to complete fraud transaction detection, includes:

S41, acquiring the node type data and selecting the applicable processing mode of each node according to the node type data; the model combines the ideas of clustering models, anomaly detection methods and decision tree classification models to construct, by hierarchical clustering, a decision-tree-like model, namely the clustering tree;

and S42, traversing and processing the leaf nodes in the tree structure according to the applicable processing mode; three processing modes are adopted for the three types of leaf node, so that the different leaf nodes generated during construction are handled differently and more fraudulent transaction samples are detected.
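The three processing modes selected in S41 and applied in S42 can be pictured as a dispatch on the node type. The sketch below is an assumption-laden illustration: a nearest-centroid rule stands in for the preset classifier (the text names SVM, decision tree and random forest), a nearest-neighbour vote stands in for the distance-based anomaly detection, and the names `make_leaf_predictor`, `nearest_centroid_fit` and `knn_predict` are hypothetical.

```python
import math

def nearest_centroid_fit(samples):
    """Stand-in classifier for a balanced leaf: one centroid per class."""
    centroids = {}
    for label in {y for _, y in samples}:
        pts = [x for x, y in samples if y == label]
        centroids[label] = tuple(sum(v) / len(pts) for v in zip(*pts))
    return lambda x: min(centroids, key=lambda lab: math.dist(x, centroids[lab]))

def knn_predict(samples, x, k=1):
    """Distance-based check: majority label among the k nearest leaf samples."""
    nearest = sorted(samples, key=lambda s: math.dist(s[0], x))[:k]
    votes = sum(y for _, y in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def make_leaf_predictor(node_type, samples):
    """samples: list of (features_tuple, label) stored in the leaf."""
    if node_type == "single":
        label = samples[0][1]          # every sample shares this label
        return lambda x: label
    if node_type == "balanced":
        return nearest_centroid_fit(samples)
    if node_type == "anomalous":
        return lambda x: knn_predict(samples, x)
    raise ValueError(f"unknown node type: {node_type}")
```

A new transaction routed to a leaf is then scored by calling that leaf's predictor on its feature tuple, with 1 meaning abnormal and 0 meaning normal.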

Referring to fig. 7, which is a detailed flowchart of step S42 in fig. 6 in one embodiment, as shown in fig. 7, step S42, traversing and processing the leaf nodes in the tree structure according to the applicable processing mode, includes:

S421, judging the type of the current leaf node according to the node type data;

S422, if the current leaf node is a single-class node, directly returning the type of the leaf node. In a single-class leaf node, all samples of the data subset belong to the same class; optionally, for such a node the class to which its samples belong is returned directly. To evaluate the clustering tree model, a confusion matrix is computed from the fraud detection results, as shown in table 1.

TABLE 1 confusion matrix for two-class tasks

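Then, from table 1, the recall (Recall), the precision (Precision) and the harmonic mean of the two (F1) are calculated. With TP denoting fraudulent transactions correctly detected, FP denoting legitimate transactions wrongly flagged as fraudulent, and FN denoting fraudulent transactions that were missed, the standard definitions are $\mathrm{Recall} = \frac{TP}{TP + FN}$, $\mathrm{Precision} = \frac{TP}{TP + FP}$, and $F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$.

Finally, five common fraud detection models are run on the same data and compared with the clustering tree on these three metrics. The results of the experiment are shown in table 2.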

TABLE 2 results of the experiment

Model	F1	Precision	Recall
Clustering Tree	0.807	0.712	0.932
AdaBoosting	0.752	0.608	0.985
Random Forest	0.747	0.607	0.971
Decision Tree	0.661	0.502	0.965
SVM	0.657	0.494	0.981
Logistic Regression	0.651	0.487	0.979

From table 2 it can be seen that, compared with the second-ranked model, AdaBoosting, the proposed model improves precision by about 10 percentage points (0.712 versus 0.608) while recall falls by only about 5 percentage points (0.932 versus 0.985), and the F1 score improves markedly (0.807 versus 0.752);

S423, if the current leaf node is a class-balanced node, training on the samples in the leaf node with a preset classification method. In a class-balanced leaf node, the sample subset has reached the class balance ratio, that is, the ratio of the number of majority-class samples to the number of minority-class samples reaches the preset balance ratio; optionally, for such a node, a conventional classification method such as a decision tree, an SVM or a random forest is trained on the data set in the leaf node;

S424, if the current leaf node is a leaf node containing abnormal samples, detecting the leaf node with preset anomaly detection logic. Such a leaf node does not satisfy the conditions of the first two types, but its total number of samples is less than the preset minimum number of samples allowed in a single node, a limit that prevents the model from overfitting; optionally, such a node is processed with an anomaly detection method, for example a distance-based one;

and S425, performing the foregoing operations on the leaf nodes in the tree structure.
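The evaluation described under S422 can be reproduced numerically. The sketch below uses the standard definitions of recall, precision and F1 (the harmonic mean of the two) and recomputes F1 from the precision and recall values reported in table 2; the function names are illustrative, and the counts in the usage example are made up for demonstration.

```python
def metrics(tp, fp, fn):
    """Recall, precision and F1 from confusion-matrix counts (fraud = positive)."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

def f1_from(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (F1, Precision, Recall) as reported in table 2.
TABLE_2 = {
    "Clustering Tree": (0.807, 0.712, 0.932),
    "AdaBoosting": (0.752, 0.608, 0.985),
    "Random Forest": (0.747, 0.607, 0.971),
    "Decision Tree": (0.661, 0.502, 0.965),
    "SVM": (0.657, 0.494, 0.981),
    "Logistic Regression": (0.651, 0.487, 0.979),
}

for name, (f1, precision, recall) in TABLE_2.items():
    # Recomputed values agree with the reported F1 to within rounding.
    print(f"{name}: reported F1 {f1:.3f}, recomputed {f1_from(precision, recall):.3f}")
```

For instance, `metrics(tp=8, fp=2, fn=2)` gives a recall, precision and F1 of 0.8 each, up to floating-point rounding.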

Referring to fig. 8, which is a schematic diagram of the modules of the fraud detection system based on hierarchical clustering according to the present invention, as shown in fig. 8, a fraud detection system 1 based on hierarchical clustering comprises: a clustering model selection module 11, a tree structure module 12, a leaf node classification module 13 and a fraud detection module 14; the clustering model selection module 11 is used for acquiring and analyzing transaction feature information to obtain feature analysis data and selecting a clustering model according to the feature analysis data, the system providing a fraud detection model based on hierarchical clustering that targets the class imbalance problem in fraudulent transaction detection; the tree structure module 12 is used for acquiring a sample data set, hierarchically clustering the sample data set according to the clustering model to construct a tree structure, and dividing the sample data set into leaf nodes of the tree structure; optionally, the tree structure is a clustering tree: the fraud detection model builds the clustering tree by hierarchical clustering, and in this process the original data set is divided, over multiple iterations, into the leaf nodes of the clustering tree; the tree structure module 12 is connected with the clustering model selection module 11; the leaf node classification module 13 is used for classifying the leaf nodes to obtain node type data; optionally, each leaf node is a data subset; the leaf node classification module 13 is connected with the tree structure module 12; and the fraud detection module 14 is used for processing the leaf nodes in the clustering tree model according to the node type data to complete fraudulent transaction detection; in the end, only the data subset in each leaf node needs to be processed in the corresponding way to detect the abnormal transaction samples in that subset; the fraud detection module 14 is connected with the leaf node classification module 13.

Referring to fig. 9, which is a schematic diagram of the specific modules of the clustering model selection module 11 in fig. 8 in an embodiment, the clustering model selection module 11 includes: a transaction feature extraction module 111, a feature analysis module 112, an analysis data processing module 113 and a model selection module 114. The transaction feature extraction module 111 is configured to obtain an actual data set and extract the transaction feature information in the actual data set. Of the four essential factors affecting classification performance, regarding sample size, a class-imbalanced data set can be used as the input of the model without any resampling preprocessing, so the sample size equals the size of the entire data set; regarding the imbalance ratio, the model automatically filters out majority-class samples during hierarchical clustering and finally constructs a number of class-balanced leaf nodes, in other words, the model automatically adjusts the class imbalance ratio in the data set. The feature analysis module 112 is configured to obtain feature analysis data by analyzing the separability of the transaction feature information; regarding separability, in order to filter out more samples during hierarchical clustering, a suitable clustering model can be selected according to the characteristics of the data set. The feature analysis module 112 is connected with the transaction feature extraction module 111. The analysis data processing module 113 is configured to process the feature analysis data into distribution judgment data, from which a suitable clustering model is selected based on separability: a Gaussian mixture model (GMM) is used if the data set follows a Gaussian distribution, and K-Means is used if the abnormal samples aggregate in Euclidean space. Optionally, a corresponding clustering tree model is constructed based on real transaction data of a financial company. First, the features of the real data set are analyzed based on separability to select the most appropriate clustering model. The distribution characteristics of the data set can be observed in Euclidean space; for visualization, the dimensionality of the data set is reduced by the PCA method to obtain an intuitive scatter diagram in two-dimensional space. The analysis data processing module 113 is connected with the feature analysis module 112. The model selection module 114 is configured to select the clustering model according to the distribution judgment data. Regarding intra-class sub-clustering, because the model is built on an unsupervised clustering algorithm, the influence of intra-class sub-clusters on classification performance is greatly reduced. Optionally, the scatter diagram shows that the data set has a clustered distribution in Euclidean space; in this case, K-Means may be selected as the clustering model. The model selection module 114 is connected with the analysis data processing module 113.
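
The PCA-based dimensionality reduction mentioned above can be illustrated with a minimal sketch. It is not the implementation used by the invention: the function name `pca_2d`, the power-iteration approach and the parameters `iters` and `seed` are assumptions made here for illustration, and only the Python standard library is used. It projects each sample onto the two directions of largest variance so that the result can be drawn as a two-dimensional scatter diagram.

```python
import math
import random


def _unit_orthogonal(v, basis):
    """Remove the parts of v lying along the basis vectors, then scale to unit length."""
    for u in basis:
        dot = sum(x * y for x, y in zip(v, u))
        v = [x - dot * y for x, y in zip(v, u)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else None


def pca_2d(rows, iters=100, seed=0):
    """Project rows (equal-length lists of floats, at least 2 features) onto 2 principal axes.

    Hypothetical helper: the source only states that PCA is used for visualization.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(2):
        # Power iteration, kept orthogonal to the components already found.
        v = _unit_orthogonal([rng.uniform(-1, 1) for _ in range(d)], components)
        for _ in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            w = _unit_orthogonal(w, components)
            if w is None:  # no variance left in the remaining directions
                break
            v = w
        components.append(v)
    # Coordinates of every sample along the two principal axes.
    return [[sum(c[j] * u[j] for j in range(d)) for u in components]
            for c in centered]
```

The returned pairs can be passed to any plotting tool; visibly separated groups of points would suggest K-Means, while elliptical, overlapping clouds would suggest a GMM, in line with the selection rule described above.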

Referring to fig. 10, which is a schematic diagram of the specific modules of the tree structure module 12 in fig. 8 in an embodiment, the tree structure module 12 includes: a cluster tree creating module 121, a node condition obtaining module 122, a processing logic selecting module 123, a tree dividing module 124 and a sample data iteration module 125. The cluster tree creating module 121 is configured to create the tree structure. The most important part of the entire model is the algorithm that builds the clustering tree through hierarchical clustering; the algorithm is recursive and is described as follows. The node condition obtaining module 122 is configured to obtain and store node condition data of the sample data set and the leaf nodes: the algorithm takes as input a data set Dataset, a balance ratio BRate of a leaf node and a minimum sample number MSize of a leaf node, and then counts the positive and negative samples in Dataset and stores the counts in N1 and N0, respectively. The node condition obtaining module 122 is connected with the cluster tree creating module 121. The processing logic selecting module 123 is configured to select the applicable processing logic of the current leaf node according to the node condition data. Optionally, it is determined in sequence whether the current Dataset meets one of three leaf node conditions: if the value of N1 or N0 is 0, the single-category leaf node condition is satisfied, and the current leaf node is processed with "SingleLable" (directly returning the category of the data subset in the leaf node); if the ratio of N1 to N0 is less than BRate, the class-balanced leaf node condition is satisfied, and an "SVM" (support vector machine classifier) is used to classify the data subset in the current leaf node; if the total of N1 and N0 is less than MSize, the condition of a leaf node containing abnormal samples is satisfied, and "KNN" (K-nearest-neighbor model) is used for abnormality detection on the data subset in the current leaf node. When none of the three leaf node conditions is satisfied, the current node is a non-leaf node: the data set in the current node is clustered using KMeans (K-Means clustering model) or GMM (Gaussian mixture model), the current process is called recursively on the data subset of each cluster, and the result is taken as a sub-tree of the current node. The processing logic selecting module 123 is connected with the node condition obtaining module 122. The tree dividing module 124 is configured to divide the current node into the tree structure by hierarchical clustering according to the applicable processing logic, so that the tree structure is constructed through continuous iteration using the selected clustering model. Optionally, in each leaf node, "cluster number" indicates the ID of the cluster to which the current node belongs after the clustering operation of the previous layer, "normal" indicates the number of normal samples, "abnormal" indicates the number of abnormal samples, and "model" indicates the model used for processing the data subset in the current node. The tree dividing module 124 is connected with the processing logic selecting module 123. The sample data iteration module 125 is configured to iterate the above steps until the sample data set is completely divided into leaf nodes of the tree structure, and is connected with the tree dividing module 124.
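
The recursive construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the tiny `kmeans` helper stands in for the K-Means or GMM clustering model, the balance test is read here as the majority-to-minority ratio compared against `brate` (the text states it as the ratio of N1 to N0), and the fallback to "KNN" when clustering fails to split a node is added only to guarantee termination. The node fields mirror the "cluster number", "normal", "abnormal" and "model" entries named in the text.

```python
import random


def kmeans(points, k=2, iters=20, seed=0):
    """Tiny Lloyd's k-means (illustrative stand-in); returns a cluster index per point."""
    k = min(k, len(points))
    centers = random.Random(seed).sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centre (squared Euclidean distance).
        assign = [min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
                  for p in points]
        # Move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def build_tree(points, labels, brate=3.0, msize=10, cluster_no=0):
    """Recursively build the clustering tree; labels are 0 (normal) or 1 (abnormal)."""
    n1 = sum(labels)             # N1: abnormal (positive) samples
    n0 = len(labels) - n1        # N0: normal (negative) samples
    node = {"cluster": cluster_no, "normal": n0, "abnormal": n1}
    if n1 == 0 or n0 == 0:
        node["model"] = "SingleLable"      # single-category leaf
    elif max(n0, n1) / min(n0, n1) < brate:
        node["model"] = "SVM"              # class-balanced leaf (assumed reading)
    elif n0 + n1 < msize:
        node["model"] = "KNN"              # leaf containing abnormal samples
    else:
        assign = kmeans(points)
        groups = sorted(set(assign))
        if len(groups) < 2:
            node["model"] = "KNN"          # assumption: stop if no split is possible
        else:
            node["model"] = "KMeans"       # non-leaf node: recurse into each cluster
            node["children"] = [
                build_tree([p for p, a in zip(points, assign) if a == g],
                           [l for l, a in zip(labels, assign) if a == g],
                           brate, msize, g)
                for g in groups
            ]
    return node
```

For example, calling `build_tree` on two well-separated groups, one purely normal and one containing a mix of normal and abnormal samples, yields a root node with one "SingleLable" child and one "SVM" child when the mixed group is within the balance ratio.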

Referring to fig. 11, which is a schematic block diagram of the leaf node classification module 13 in fig. 8 in an embodiment, the leaf node classification module 13 includes: a leaf node obtaining module 131, a node data extraction module 132, a current node classification module 133 and a node class traversal module 134. The leaf node obtaining module 131 is configured to obtain all leaf nodes in the tree structure. The node data extraction module 132 is configured to extract the category information, balance ratio data and sample number information of the leaf nodes, so that the four essential factors affecting classification performance are considered together: imbalance ratio, sample size, separability and intra-class sub-clustering. The node data extraction module 132 is connected with the leaf node obtaining module 131. The current node classification module 133 is configured to classify the current leaf node according to the category information, the balance ratio data and the sample number information; optionally, three types of leaf nodes are finally formed: single-class leaf nodes, class-balanced leaf nodes and leaf nodes containing abnormal samples. The current node classification module 133 is connected with the node data extraction module 132. The node class traversal module 134 is configured to obtain the node type data of the current leaf node and to execute the foregoing steps in a loop until all leaf nodes are classified as single-class leaf nodes, class-balanced leaf nodes or leaf nodes containing abnormal samples. The node class traversal module 134 is connected with the current node classification module 133.

Referring to fig. 12, which is a schematic block diagram of the fraud detection module 14 in fig. 8 in an embodiment, the fraud detection module 14 includes: an applicable mode selection module 141 and a traversal detection module 142. The applicable mode selection module 141 is configured to obtain the node type data and select the applicable processing mode of a node according to the node type data; by combining the ideas of a clustering model, an anomaly detection method and a decision tree classification model, a decision-tree-like model, namely the clustering tree, is constructed by hierarchical clustering. The traversal detection module 142 is configured to traverse the leaf nodes in the tree structure according to the applicable processing mode; the three types of leaf nodes generated in the process are handled by three different processing modes, so that more fraudulent transaction samples are detected. The traversal detection module 142 is connected with the applicable mode selection module 141.
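
The selection of a processing mode by node type can be sketched as a simple dispatch on the "model" entry of a leaf. This is an illustration only: the leaf layout (`points`, `labels`, `model`) and the function names are hypothetical, the K-nearest-neighbor vote is one possible reading of the "KNN" handling, and a nearest-centroid rule is swapped in for the SVM purely to keep the sketch dependent on the standard library alone. How a new transaction is routed to its leaf is outside the scope of this sketch.

```python
import math


def knn_predict(train_x, train_y, query, k=3):
    """Majority label among the k training samples nearest to the query."""
    ranked = sorted(zip(train_x, train_y), key=lambda xy: math.dist(xy[0], query))
    votes = [label for _, label in ranked[:k]]
    return 1 if sum(votes) * 2 > len(votes) else 0


def centroid_predict(train_x, train_y, query):
    """Stand-in for the SVM (assumption): pick the class whose mean point is nearest."""
    best = None
    for cls in (0, 1):
        members = [x for x, y in zip(train_x, train_y) if y == cls]
        centre = [sum(col) / len(members) for col in zip(*members)]
        d = math.dist(centre, query)
        if best is None or d < best[0]:
            best = (d, cls)
    return best[1]


def detect(leaf, query):
    """Route a query sample to the handler named by the leaf's model entry."""
    x, y = leaf["points"], leaf["labels"]
    if leaf["model"] == "SingleLable":
        return y[0]                        # every sample in the leaf shares this class
    if leaf["model"] == "SVM":
        return centroid_predict(x, y, query)
    if leaf["model"] == "KNN":
        return knn_predict(x, y, query)
    raise ValueError("not a leaf node: " + str(leaf["model"]))
```

A return value of 1 marks the query as abnormal (fraudulent) and 0 as normal. In practice the class-balanced handler would be replaced by a trained classifier such as an SVM.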

Referring to fig. 13, which is a schematic diagram of the specific modules of the traversal detection module 142 in fig. 8 in an embodiment, the traversal detection module 142 includes: a node type judging module 1421, a single-category returning module 1422, a balanced node training module 1423, an abnormal node detecting module 1424 and a tree structure traversal detection module 1425. The node type judging module 1421 is configured to determine the type of the current leaf node according to the node type data. The single-category returning module 1422 is configured to directly return the type of the leaf node when the current leaf node is a single-category node. In a single-category leaf node, all samples of the data subset belong to the same category; optionally, for single-category leaf nodes, the category of the samples in the leaf node is returned directly. For the evaluation of the clustering tree model, a confusion matrix is first calculated from the fraud detection results, and the recall (Recall), the precision (Precision) and their harmonic mean (F1) are calculated; finally, five common fraud detection models are run on the same data for comparison. Compared with the other models, the model provided by the invention improves precision by 10% over the second-ranked AdaBoosting while recall is reduced by only 5%, giving an obvious improvement on the F1 index. The single-category returning module 1422 is connected with the node type judging module 1421. The balanced node training module 1423 is configured to train on the samples in the leaf node using a preset classification method when the current leaf node is a class-balanced node. In a class-balanced leaf node, the sample subset has reached the class balance ratio, that is, the ratio of the number of majority-class samples to the number of minority-class samples reaches the preset balance ratio; optionally, for class-balanced leaf nodes, a model is trained on the data set in the leaf node using a traditional classification method such as a decision tree, an SVM or a random forest. The balanced node training module 1423 is connected with the node type judging module 1421. The abnormal node detecting module 1424 is configured to detect the leaf node using preset abnormality detection logic when the current leaf node is a leaf node containing abnormal samples. Such a leaf node does not satisfy the conditions of the first two types of leaf node, but its total number of samples is less than the preset minimum number of samples allowed in a single node, which prevents model overfitting; optionally, leaf nodes containing abnormal samples are processed with an abnormality detection method, for example a distance-based abnormality detection method. The abnormal node detecting module 1424 is connected with the node type judging module 1421. The tree structure traversal detection module 1425 is configured to perform the foregoing operations on the leaf nodes in the tree structure, and is connected with the node type judging module 1421.
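
The evaluation indices mentioned above follow directly from the confusion matrix. The sketch below shows the standard definitions; the function names are illustrative, the label 1 is assumed to denote a fraudulent transaction, and the sample labels in the usage note are made up for demonstration rather than taken from the experiment.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN), treating label 1 as the fraud class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = harmonic mean of the two."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

With true labels `[1, 1, 1, 0, 0, 0, 0, 1]` and predictions `[1, 1, 0, 0, 0, 1, 0, 1]` there are three true positives, one false positive and one false negative, so precision, recall and F1 are all 0.75.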

In summary, the fraud detection method and system based on hierarchical clustering provided by the invention have the following beneficial effects. The fraud detection model based on hierarchical clustering comprehensively considers the four factors influencing classification performance and avoids the limitations of the two methods to a certain extent. The model is proposed for the problem of category imbalance in fraud transaction detection. It forms a clustering tree by hierarchical clustering; in this process the original data set is divided into the leaf nodes of the clustering tree through multiple iterations, and each leaf node is a data subset. Finally, only the data subset in each leaf node needs to be processed accordingly to detect the abnormal transaction samples in that subset. The invention thereby solves the technical problems of incomplete performance consideration, low detection accuracy and category imbalance in the prior art. The four essential factors influencing classification performance, namely the imbalance ratio, the sample size, the separability and the intra-class sub-clustering, are considered together, which makes up for the defect that the prior art considers only the single factor of the imbalance ratio. An unsupervised clustering model is used for hierarchical clustering, the large class-imbalanced data set is divided into a plurality of data subsets of three types, and each type is handled separately, so that the category imbalance problem is simplified and solved from a new angle. The invention has good safety and accuracy, and has high commercial value and practicability.

Claims

1. A fraud detection method based on hierarchical clustering is characterized by comprising the following steps:

acquiring and analyzing transaction characteristic information to obtain characteristic analysis data, and selecting a clustering model according to the characteristic analysis data;

acquiring a sample data set, hierarchically clustering the sample data set according to the clustering model to construct a tree structure, and dividing the sample data set into leaf nodes of the tree structure;

classifying the leaf nodes to obtain node type data;

processing the leaf nodes in the clustering tree model according to the node type data to complete fraud transaction detection;

the acquiring of the sample data set, hierarchically clustering the sample data set according to the clustering model to construct a tree structure, and partitioning the sample data set into leaf nodes of the tree structure includes:

creating the tree structure;

acquiring and storing node condition data of the sample data set and the leaf nodes, wherein the node condition data comprise a data set Dataset, a balance ratio BRate of a leaf node and a minimum sample number Msize of the leaf node, and the positive and negative sample numbers of the data set Dataset are calculated and stored in N1 and N0, respectively;

selecting applicable processing logic of the current leaf node according to the node condition data, wherein the applicable processing logic comprises,

judging whether the current Dataset meets three conditions of the leaf node, wherein the three conditions are as follows:

if the value of N1 or N0 of the data set Dataset is 0, a single-category leaf node condition is satisfied, and the current leaf node is processed by directly returning the category of the data subset in the current leaf node,

if the ratio of N1 to N0 of the data set Dataset is less than the balance ratio BRate of the leaf node, a class-balanced leaf node condition is satisfied, and a support vector machine classifier is used to classify the data subset in the leaf node,

if the total number of N1 and N0 of the data set Dataset is less than the minimum sample number Msize of the leaf node, a condition of a leaf node containing abnormal samples is satisfied, and a K-nearest-neighbor model is used to perform abnormality detection on the data subset in the current leaf node,

if the data set Dataset satisfies none of the three conditions, the current node is a non-leaf node, a K-Means clustering model or a Gaussian mixture model is used for clustering the data set in the current node, the current process is recursively called for the data subset of each cluster, and the result is taken as a sub-tree of the current node;

dividing the current node into the tree structure by hierarchical clustering according to the applicable processing logic;

and iterating until the sample data set is completely divided into the leaf nodes in the tree structure.

2. The method of claim 1, wherein the obtaining and analyzing transaction characteristic information to obtain characteristic analysis data, and the selecting a clustering model based on the characteristic analysis data comprises:

acquiring an actual data set, and extracting transaction characteristic information in the actual data set;

obtaining the characteristic analysis data based on separability analysis of the transaction characteristic information;

processing the characteristic analysis data into distribution judgment data;

and selecting the clustering model according to the distribution judgment data.

3. The method of claim 1, wherein classifying the leaf nodes to obtain node type data comprises:

acquiring all the leaf nodes in the tree structure;

extracting the category information, the balance ratio data and the sample number information of the leaf nodes;

classifying the current leaf node according to the category information, the balance ratio data and the sample number information;

and acquiring the node type data of the current leaf node, and executing in a circulating way until all the leaf nodes are classified into single-class leaf nodes, class balance leaf nodes and leaf nodes containing abnormal samples.

4. The method of claim 1, wherein processing the leaf nodes in the clustering tree model according to the node type data to perform fraud transaction detection comprises:

acquiring the node type data, and selecting an applicable processing mode of the node according to the node type data, wherein the applicable processing mode comprises a decision tree model constructed in a hierarchical clustering mode by combining ideas of a clustering model, an anomaly detection method and a decision tree classification model;

and traversing the leaf nodes in the tree structure according to the applicable processing mode.

5. The method according to claim 4, wherein said traversal processing of said leaf nodes in said tree structure according to said applicable processing mode comprises the steps of:

S1', judging the type of the current leaf node according to the node type data;

S2', if the current leaf node is a single-class node, directly returning the type of the leaf node;

S3', if the current leaf node is a class balance node, training the sample in the leaf node by using a preset classification method;

S4', if the current leaf node is a leaf node containing an abnormal sample, detecting the leaf node by using preset abnormal detection logic;

performing the operations of steps S1' to S4' on the leaf nodes in the tree structure.

6. A hierarchical clustering-based fraud detection system, comprising: a clustering model selection module, a tree structure module, a leaf node classification module and a fraud detection module;

the clustering model selecting module is used for acquiring and analyzing transaction characteristic information to obtain characteristic analysis data, and selecting a clustering model according to the characteristic analysis data;

the tree structure module is used for acquiring a sample data set, hierarchically clustering the sample data set according to the clustering model to construct a tree structure, and dividing the sample data set into leaf nodes of the tree structure;

the leaf node classifying module is used for classifying the leaf nodes to obtain node type data;

the fraud detection module is used for processing the leaf nodes in the clustering tree model according to the node type data to complete fraud transaction detection;

wherein the tree structure module comprises: a cluster tree creating module, a node condition obtaining module, a processing logic selecting module, a tree dividing module and a sample data iteration module,

the node condition obtaining module is configured to obtain and store node condition data of the sample data set and the leaf nodes, wherein the node condition data comprise a data set Dataset, a balance ratio of a leaf node and a minimum sample number Msize of the leaf node, and the positive and negative sample numbers of the data set Dataset are calculated and stored in N1 and N0, respectively;

the processing logic selecting module is used for selecting the applicable processing logic of the current leaf node according to the node condition data, wherein the applicable processing logic comprises,

the tree dividing module is used for dividing the current node into the tree structure by hierarchical clustering according to the applicable processing logic;

and the sample data iteration module is used for iterating until the sample data set is completely divided into the leaf nodes in the tree structure.

7. The system of claim 6, wherein the clustering model selection module comprises: a transaction characteristic extraction module, a characteristic analysis module, an analysis data processing module and a model selection module;

the transaction characteristic extraction module is used for acquiring an actual data set and extracting transaction characteristic information in the actual data set;

the characteristic analysis module is used for analyzing separability of the transaction characteristic information to obtain the characteristic analysis data;

the analysis data processing module is used for processing the characteristic analysis data into distribution judgment data;

and the model selection module is used for selecting the clustering model according to the distribution judgment data.

8. The system of claim 6, wherein the leaf node classification module comprises: a leaf node acquisition module, a node data extraction module, a current node classification module and a node class traversal module;

the leaf node acquisition module is used for acquiring all the leaf nodes in the tree structure;

the node data extraction module is used for extracting the category information, the balance ratio data and the sample number information of the leaf nodes;

the current node classification module is used for classifying the current leaf nodes according to the category information, the balance ratio data and the sample number information;

and the node type traversal module is used for acquiring the node type data of the current leaf node and executing in a circulating way until all the leaf nodes are classified into single-type leaf nodes, type balance leaf nodes and leaf nodes containing abnormal samples.

9. The system of claim 6, wherein the fraud detection module comprises: an applicable mode selection module and a traversal detection module;

the applicable mode selection module is used for acquiring the node type data and selecting an applicable processing mode of the node according to the node type data, wherein the applicable processing mode comprises a decision tree model constructed in a hierarchical clustering mode by combining ideas of a clustering model, an anomaly detection method and a decision tree classification model;

and the traversal detection module is used for performing traversal processing on the leaf nodes in the tree structure according to the applicable processing mode.

10. The system of claim 9, wherein the traversal detection module comprises: a node type judging module, a single-category returning module, a balanced node training module, an abnormal node detecting module and a tree structure traversal detection module;

the node type judging module is configured to execute step S1', and judge the type of the current leaf node according to the node type data;

the single-category returning module is configured to execute step S2', and when the current leaf node is a single-category node, directly return the type of the leaf node;

the balanced node training module is configured to execute step S3', and when the current leaf node is a category balanced node, train the sample in the leaf node by using a preset classification method;

the abnormal node detecting module is configured to execute step S4', and when the current leaf node is a leaf node containing an abnormal sample, detect the leaf node using a preset abnormal detection logic;

the tree structure traversal detection module is used for executing the operations of steps S1' to S4' on the leaf nodes in the tree structure.