CN114862404A - Credit card fraud detection method and device based on cluster samples and limit gradients - Google Patents

Credit card fraud detection method and device based on cluster samples and limit gradients

Info

Publication number
CN114862404A
Authority
CN
China
Prior art keywords
sub
samples
minority
cluster
credit card
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210478879.5A
Other languages
Chinese (zh)
Inventor
陈宏伟
艾河
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei University of Technology
Original Assignee
Hubei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei University of Technology
Priority to CN202210478879.5A
Publication of CN114862404A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q20/00 Payment architectures, schemes or protocols
    • G06Q20/38 Payment protocols; Details thereof
    • G06Q20/40 Authorisation, e.g. identification of payer or payee, verification of customer or shop credentials; Review and approval of payers, e.g. check credit lines or negative lists
    • G06Q20/401 Transaction verification
    • G06Q20/4016 Transaction verification involving fraud or risk level assessment in transaction processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Accounting & Taxation (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Finance (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Software Systems (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Probability & Statistics with Applications (AREA)
  • Technology Law (AREA)
  • Operations Research (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Computing Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a credit card fraud detection method and device based on cluster samples and limit gradients. The method comprises steps 1 to 8. The method addresses between-class imbalance while also effectively avoiding within-class imbalance, and improves the data quality of the artificially synthesized samples. Using an extreme gradient boosting tree as the classifier in the credit card fraud detection model yields a better classification effect, and after the clustering algorithm, the adaptive weight calculation, and the oversampling algorithm are combined, the resulting credit card fraud detection model has good detection accuracy.

Description

Credit card fraud detection method and device based on cluster samples and limit gradients
Technical Field
The embodiment of the invention relates to the technical field of data mining, in particular to a credit card fraud detection method and device based on clustering samples and limit gradients.
Background
Before the era of big data and artificial intelligence, credit card fraud detection models were built by traditional means, including manual detection, detection models based on expert rules, and cost analysis models, but these traditional means all suffer from drawbacks such as low accuracy and long detection time. With the rise and development of big data and artificial intelligence technology, researchers at home and abroad have adapted both traditional statistics-based machine learning algorithms and the deep learning methods that have become popular in recent years to the characteristics of credit card fraud, and have applied them in this field. One significant characteristic of credit card fraud data sets is data imbalance, and common machine learning algorithms, such as logistic regression and decision trees, perform significantly worse when trained directly on such imbalanced data sets. Therefore, developing a credit card fraud detection method and device based on cluster samples and limit gradients that can effectively overcome the above drawbacks in the related art is a technical problem to be solved in the industry.
Disclosure of Invention
Aiming at the problems in the prior art, the embodiment of the invention provides a credit card fraud detection method and device based on cluster samples and limit gradients.
In a first aspect, an embodiment of the present invention provides a credit card fraud detection method based on cluster samples and limit gradients, including: step 1: performing data cleaning and feature engineering on raw data taken directly from a bank database to obtain an original data set for machine learning; step 2: partitioning the original data set with a clustering algorithm to obtain a plurality of sub-clusters, and discarding sub-clusters that contain no minority-class samples or only one minority-class sample; step 3: for the remaining sub-clusters, calculating the imbalance rate of each sub-cluster from the ratio of the number of minority-class samples to the number of majority-class samples in the sub-cluster, and screening out the sub-clusters to be oversampled according to a set imbalance rate threshold; step 4: calculating the sparsity factors of all sub-clusters to be oversampled, and determining the oversampling weight of each sub-cluster to be oversampled according to its sparsity factor; step 5: assigning different adaptive weights according to the degree to which each minority-class sample in the sub-cluster to be oversampled contributes to learning the boundary information, and determining the oversampling weight of each minority-class sample; step 6: performing adaptive-weight-based oversampling interpolation on each minority-class sample to generate an artificial data set in which the numbers of majority-class and minority-class samples are balanced; step 7: training on the balanced data set from the previous step with an extreme gradient boosting tree algorithm to obtain the final credit card fraud detection model; and step 8: detecting credit card transaction data using the credit card fraud detection model trained in step 7.
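Steps 2 and 3 can be illustrated with a minimal Python sketch. It assumes the clustering has already produced one cluster id per sample; the function name `select_subclusters`, the default threshold, and the comparison direction (keep a sub-cluster when its minority-to-majority ratio is at least the threshold) are assumptions made for illustration, since the text only says that a set imbalance rate threshold is applied.

```python
from collections import Counter


def select_subclusters(cluster_ids, labels, threshold=1.0, minority_label=1):
    """Return the ids of sub-clusters kept for oversampling (steps 2 and 3)."""
    n_min = Counter()
    n_maj = Counter()
    for cid, y in zip(cluster_ids, labels):
        if y == minority_label:
            n_min[cid] += 1
        else:
            n_maj[cid] += 1

    selected = []
    for cid in sorted(set(cluster_ids)):
        # Step 2: discard sub-clusters with zero or one minority-class sample.
        if n_min[cid] <= 1:
            continue
        # Step 3: imbalance rate as minority count over majority count.
        # A sub-cluster with no majority samples gets an infinite rate.
        rate = n_min[cid] / n_maj[cid] if n_maj[cid] else float("inf")
        # ASSUMPTION: the comparison direction is not given in the source.
        if rate >= threshold:
            selected.append(cid)
    return selected


cluster_ids = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
labels = [1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0]  # 1 = fraud (minority class)
print(select_subclusters(cluster_ids, labels, threshold=1.0))
```

In this toy input, sub-cluster 1 is discarded in step 2 because it holds a single minority-class sample; the other two pass the assumed threshold test.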
On the basis of the content of the above embodiment of the method, in the credit card fraud detection method based on the cluster samples and the extreme gradient provided in the embodiment of the present invention, the determining the corresponding oversampling weight of each sub-cluster to be oversampled according to the sparsity factor in step 4 includes: step 4.1: calculating a Euclidean distance matrix of the sub-cluster to be oversampled; step 4.2: calculating the average distance of the minority classes of the sub-clusters to be oversampled; step 4.3: calculating the minority class density and sparse factor of the sub-cluster to be oversampled; step 4.4: and calculating the oversampling index corresponding to each sub-cluster to be oversampled, namely multiplying the total number of samples to be generated by the sparse factor corresponding to the sub-cluster.
On the basis of the content of the above method embodiment, in the credit card fraud detection method based on cluster samples and the extreme gradient provided in the embodiment of the present invention, the Euclidean distance matrix is:
$$A(i) = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix}$$
wherein each element $a_{xy}$ is the square of the Euclidean distance between the feature vectors of the minority-class samples $x$ and $y$ in the sub-cluster $i$ to be oversampled, n is the number of minority-class samples in the sub-cluster, and A(i) is the Euclidean distance matrix.
Based on the content of the above method embodiment, in the credit card fraud detection method based on cluster samples and extreme gradients provided in the embodiment of the present invention, calculating the average distance of the minority class of the sub-cluster to be oversampled includes:
$$\mathrm{averDist}(i) = \frac{\sum a_{xy}}{c(i)}$$
wherein $\sum a_{xy}$ represents the sum of all off-diagonal elements in the Euclidean distance matrix A(i), c(i) represents the number of off-diagonal elements in the Euclidean distance matrix A(i), and averDist(i) is the average distance of the minority class of the sub-cluster to be oversampled.
Based on the content of the above method embodiment, in the credit card fraud detection method based on cluster samples and extreme gradients provided in the embodiment of the present invention, the minority class density is:
$$\mathrm{dens}(i) = \frac{\mathrm{numOfMin}(FC_i)}{\mathrm{averDist}^{m}(i)}$$
wherein dens(i) is the minority class density, numOfMin(FC_i) is the number of minority-class samples in the sub-cluster FC_i, averDist^m(i) is the m-th power of the average distance of the minority class, and m is the number of features of a sample.
On the basis of the content of the above method embodiment, in the credit card fraud detection method based on cluster samples and the extreme gradient provided by the embodiment of the present invention, the sparsity factor is:
$$\mathrm{spar}(i) = \frac{1}{\mathrm{dens}(i)}$$
$$\mathrm{sparseFac}(i) = \frac{\mathrm{spar}(i)}{\sum \mathrm{spar}}$$
wherein spar(i) is the sparsity of the sub-cluster FC_i to be oversampled, sparseFac(i) is the sparsity factor, and Σspar is the sum of the sparsity of all sub-clusters to be oversampled.
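Steps 4.1 to 4.4 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the patented implementation: the function names are hypothetical, the sparsity is taken as the reciprocal of the minority class density, and the `squared` flag exists because the text describes the matrix entries as the "square" of the Euclidean distance, which may be a translation artifact; the sketch defaults to the plain distance.

```python
import math


def distance_matrix(points, squared=False):
    """Step 4.1: pairwise Euclidean distances between minority-class samples."""
    n = len(points)
    a = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(x + 1, n):
            d2 = sum((p - q) ** 2 for p, q in zip(points[x], points[y]))
            a[x][y] = a[y][x] = d2 if squared else math.sqrt(d2)
    return a


def average_distance(a):
    """Step 4.2: sum of off-diagonal elements divided by their count."""
    n = len(a)
    total = sum(a[x][y] for x in range(n) for y in range(n) if x != y)
    return total / (n * (n - 1))


def sparsity_factors(minority_by_cluster, squared=False):
    """Step 4.3: density, sparsity (ASSUMED to be 1/density), sparsity factor."""
    spar = {}
    for cid, pts in minority_by_cluster.items():
        m = len(pts[0])  # number of features
        aver = average_distance(distance_matrix(pts, squared))
        dens = len(pts) / aver ** m  # fails if all points coincide
        spar[cid] = 1.0 / dens
    total = sum(spar.values())
    return {cid: s / total for cid, s in spar.items()}


def samples_per_cluster(factors, n_to_generate):
    """Step 4.4: total number of samples to generate times the sparsity factor."""
    # Rounding is an assumption; the rounded counts may not sum exactly.
    return {cid: round(n_to_generate * f) for cid, f in factors.items()}


clusters = {"a": [(0.0, 0.0), (1.0, 0.0)], "b": [(0.0, 0.0), (2.0, 0.0)]}
factors = sparsity_factors(clusters)
print(factors, samples_per_cluster(factors, 10))
```

Sub-cluster "b" has the more widely spaced minority samples, so it receives the larger share of the synthetic samples.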
Based on the content of the above method embodiment, in the credit card fraud detection method based on cluster samples and extreme gradients provided in the embodiment of the present invention, determining the oversampling weight of each minority-class sample in step 5 includes: step 5.1: obtaining the K nearest neighbor samples of each minority-class sample in the sub-cluster to be oversampled by using a nearest neighbor algorithm, and calculating the majority neighbor rate of each minority-class sample as the number of majority-class samples among its nearest neighbors divided by K; step 5.2: calculating the adaptive sampling weight of each minority-class sample as the majority neighbor rate of that sample divided by the sum of the majority neighbor rates of all minority-class samples in the sub-cluster; step 5.3: calculating the sampling index of each minority-class sample as the sampling index of the sub-cluster in which the sample is located multiplied by the adaptive sampling weight of the sample.
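Steps 5.1 to 5.3 can be sketched as below. The function names, the brute-force neighbor search, the uniform fallback when no minority sample has a majority neighbor, and the rounding of the per-sample count are all assumptions for illustration; the text does not specify them.

```python
import math


def majority_neighbor_rates(minority_pts, majority_pts, k):
    """Step 5.1: share of majority-class samples among the K nearest neighbors."""
    rates = []
    for i, p in enumerate(minority_pts):
        # (distance, is_majority) for every other sample in the sub-cluster.
        neighbors = [(math.dist(p, q), 0)
                     for j, q in enumerate(minority_pts) if j != i]
        neighbors += [(math.dist(p, q), 1) for q in majority_pts]
        neighbors.sort(key=lambda item: item[0])
        rates.append(sum(flag for _, flag in neighbors[:k]) / k)
    return rates


def adaptive_weights(rates):
    """Step 5.2: each rate divided by the sum of rates in the sub-cluster."""
    total = sum(rates)
    if total == 0:
        # ASSUMPTION: fall back to uniform weights when no sample is near the boundary.
        return [1.0 / len(rates)] * len(rates)
    return [r / total for r in rates]


def samples_per_point(weights, cluster_quota):
    """Step 5.3: sub-cluster sampling index times the adaptive weight."""
    return [round(cluster_quota * w) for w in weights]


minority = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
majority = [(5.0, 1.0), (5.0, -1.0), (6.0, 0.0), (0.5, 3.0)]
rates = majority_neighbor_rates(minority, majority, k=2)
weights = adaptive_weights(rates)
print(rates, weights, samples_per_point(weights, 8))
```

The third minority sample sits among majority samples, so it has the highest majority neighbor rate and receives the largest share of the sub-cluster's quota.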
In a second aspect, an embodiment of the present invention provides a credit card fraud detection device based on cluster samples and limit gradients, including: a first main module, configured to implement step 1: performing data cleaning and feature engineering on raw data taken directly from a bank database to obtain an original data set for machine learning; a second main module, configured to implement step 2: partitioning the original data set with a clustering algorithm to obtain a plurality of sub-clusters, and discarding sub-clusters that contain no minority-class samples or only one minority-class sample; a third main module, configured to implement step 3: for the remaining sub-clusters, calculating the imbalance rate of each sub-cluster from the ratio of the number of minority-class samples to the number of majority-class samples in the sub-cluster, and screening out the sub-clusters to be oversampled according to a set imbalance rate threshold; a fourth main module, configured to implement step 4: calculating the sparsity factors of all sub-clusters to be oversampled, and determining the oversampling weight of each sub-cluster to be oversampled according to its sparsity factor; a fifth main module, configured to implement step 5: assigning different adaptive weights according to the degree to which each minority-class sample in the sub-cluster to be oversampled contributes to learning the boundary information, and determining the oversampling weight of each minority-class sample; a sixth main module, configured to implement step 6: performing adaptive-weight-based oversampling interpolation on each minority-class sample to generate an artificial data set in which the numbers of majority-class and minority-class samples are balanced; a seventh main module, configured to implement step 7: training on the balanced data set from the previous step with an extreme gradient boosting tree algorithm to obtain the final credit card fraud detection model; and an eighth main module, configured to implement step 8: detecting credit card transaction data using the credit card fraud detection model trained in step 7.
In a third aspect, an embodiment of the present invention provides an electronic device, including:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the clustered sample and extreme gradient based credit card fraud detection method provided by any of the various implementations of the first aspect.
In a fourth aspect, embodiments of the present invention provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method for detecting credit card fraud based on clustered samples and extreme gradients provided in any of the various implementations of the first aspect.
The credit card fraud detection method and device based on cluster samples and the extreme gradient provided by the embodiment of the present invention combine the clustering algorithm, the sample adaptive weight calculation, and the oversampling algorithm. This addresses between-class imbalance while effectively avoiding within-class imbalance, and improves the data quality of the artificially synthesized samples. Using an extreme gradient boosting tree as the classifier in the credit card fraud detection model yields a better classification effect, and after it is combined with the oversampling algorithm based on clustering and adaptive weights, the resulting credit card fraud detection model has good detection accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, a brief description will be given below to the drawings required for the description of the embodiments or the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of a method for detecting fraud in a credit card based on clustered samples and extreme gradients according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a credit card fraud detection apparatus based on cluster samples and limit gradients according to an embodiment of the present invention;
FIG. 3 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention. In addition, technical features of various embodiments or individual embodiments provided by the present invention may be arbitrarily combined with each other to form a feasible technical solution, and such combination is not limited by the sequence of steps and/or the structural composition mode, but must be realized by a person skilled in the art, and when the technical solution combination is contradictory or cannot be realized, such a technical solution combination should not be considered to exist and is not within the protection scope of the present invention.
The embodiment of the present invention provides a credit card fraud detection method based on cluster samples and limit gradients. Referring to FIG. 1, the method includes: step 1: performing data cleaning and feature engineering on raw data taken directly from a bank database to obtain an original data set for machine learning; step 2: partitioning the original data set with a clustering algorithm to obtain a plurality of sub-clusters, and discarding sub-clusters that contain no minority-class samples or only one minority-class sample; step 3: for the remaining sub-clusters, calculating the imbalance rate of each sub-cluster from the ratio of the number of minority-class samples to the number of majority-class samples in the sub-cluster, and screening out the sub-clusters to be oversampled according to a set imbalance rate threshold; step 4: calculating the sparsity factors of all sub-clusters to be oversampled, and determining the oversampling weight of each sub-cluster to be oversampled according to its sparsity factor; step 5: assigning different adaptive weights according to the degree to which each minority-class sample in the sub-cluster to be oversampled contributes to learning the boundary information, and determining the oversampling weight of each minority-class sample; step 6: performing adaptive-weight-based oversampling interpolation on each minority-class sample to generate an artificial data set in which the numbers of majority-class and minority-class samples are balanced; step 7: training on the balanced data set from the previous step with an extreme gradient boosting tree algorithm to obtain the final credit card fraud detection model; and step 8: detecting credit card transaction data using the credit card fraud detection model trained in step 7.
It should be noted that, in step 1, dirty data removal and principal component analysis (PCA) are performed on the raw data obtained directly from a bank database to obtain an original data set that can be used for machine learning. The original data set is then divided into a training set and a test set.
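The division into a training set and a test set can be sketched as follows. The text does not say how the split is made; this sketch assumes a stratified split so that the rare fraud class appears in both parts, and the function name, the 20% test fraction, and the fixed seed are illustrative choices.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), split separately within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, y in enumerate(labels):
        by_class[y].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # At least one sample per class goes to the test set; a class with a
        # single sample therefore ends up only in the test set.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [0] * 95 + [1] * 5  # 1 = fraud (minority class)
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Only the training part would then go through the clustering and oversampling steps, so that the test set keeps its original class distribution; that ordering is a common practice rather than something the text states.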
Based on the content of the foregoing method embodiment, as an optional embodiment, in the credit card fraud detection method based on cluster samples and extreme gradients provided in the embodiment of the present invention, the determining, according to the sparseness factor, the corresponding oversampling weight of each sub-cluster to be oversampled in step 4 includes: step 4.1: calculating a Euclidean distance matrix of the sub-cluster to be oversampled; step 4.2: calculating the average distance of the minority classes of the sub-clusters to be oversampled; step 4.3: calculating the minority class density and sparse factor of the sub-cluster to be oversampled; step 4.4: and calculating the oversampling index corresponding to each sub-cluster to be oversampled, namely multiplying the total number of samples to be generated by the sparse factor corresponding to the sub-cluster.
Based on the content of the foregoing method embodiment, as an optional embodiment, in the credit card fraud detection method based on cluster samples and the extreme gradient provided in the embodiment of the present invention, the Euclidean distance matrix is:
$$A(i) = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix}$$
wherein each element $a_{xy}$ is the square of the Euclidean distance between the feature vectors of the minority-class samples $x$ and $y$ in the sub-cluster $i$ to be oversampled, n is the number of minority-class samples in the sub-cluster, and A(i) is the Euclidean distance matrix.
Based on the content of the foregoing method embodiment, as an optional embodiment, in the credit card fraud detection method based on cluster samples and extreme gradients provided in the embodiment of the present invention, calculating the average distance of the minority class of the sub-cluster to be oversampled includes:
$$\mathrm{averDist}(i) = \frac{\sum a_{xy}}{c(i)}$$
wherein $\sum a_{xy}$ represents the sum of all off-diagonal elements in the Euclidean distance matrix A(i), c(i) represents the number of off-diagonal elements in the Euclidean distance matrix A(i), and averDist(i) is the average distance of the minority class of the sub-cluster to be oversampled.
Based on the content of the foregoing method embodiment, as an optional embodiment, in the credit card fraud detection method based on cluster samples and limit gradients provided in the embodiment of the present invention, the minority class density is:
$$\mathrm{dens}(i) = \frac{\mathrm{numOfMin}(FC_i)}{\mathrm{averDist}^{m}(i)}$$
wherein dens(i) is the minority class density, numOfMin(FC_i) is the number of minority-class samples in the sub-cluster FC_i, averDist^m(i) is the m-th power of the average distance of the minority class, and m is the number of features of a sample.
Based on the content of the foregoing method embodiment, as an optional embodiment, in the credit card fraud detection method based on cluster samples and the extreme gradient provided in the embodiment of the present invention, the sparsity factor is:
$$\mathrm{spar}(i) = \frac{1}{\mathrm{dens}(i)}$$
$$\mathrm{sparseFac}(i) = \frac{\mathrm{spar}(i)}{\sum \mathrm{spar}}$$
wherein spar(i) is the sparsity of the sub-cluster FC_i to be oversampled, sparseFac(i) is the sparsity factor, and Σspar is the sum of the sparsity of all sub-clusters to be oversampled.
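As a worked illustration of how the density, the sparsity, and the sparsity factor relate, consider two sub-clusters with m = 2 features, each holding 2 minority-class samples, with average minority-class distances of 1 and 2. The numbers are invented for this example and do not come from the source, and the sparsity is assumed to be the reciprocal of the density:

```latex
\begin{align*}
\mathrm{dens}(1) &= \frac{2}{1^{2}} = 2, & \mathrm{spar}(1) &= \frac{1}{2} = 0.5,\\
\mathrm{dens}(2) &= \frac{2}{2^{2}} = 0.5, & \mathrm{spar}(2) &= \frac{1}{0.5} = 2,\\
\mathrm{sparseFac}(1) &= \frac{0.5}{0.5 + 2} = 0.2, & \mathrm{sparseFac}(2) &= \frac{2}{0.5 + 2} = 0.8.
\end{align*}
```

With 10 samples to generate in total, step 4.4 would assign 10 × 0.2 = 2 samples to the first sub-cluster and 10 × 0.8 = 8 to the second, so the sub-cluster whose minority samples are more sparsely spread receives more synthetic samples.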
Based on the content of the foregoing method embodiment, as an optional embodiment, in the credit card fraud detection method based on cluster samples and limit gradients provided in the embodiment of the present invention, determining the oversampling weight of each minority-class sample in step 5 includes: step 5.1: obtaining the K nearest neighbor samples of each minority-class sample in the sub-cluster to be oversampled by using a nearest neighbor algorithm, and calculating the majority neighbor rate of each minority-class sample as the number of majority-class samples among its nearest neighbors divided by K; step 5.2: calculating the adaptive sampling weight of each minority-class sample as the majority neighbor rate of that sample divided by the sum of the majority neighbor rates of all minority-class samples in the sub-cluster; step 5.3: calculating the sampling index of each minority-class sample as the sampling index of the sub-cluster in which the sample is located multiplied by the adaptive sampling weight of the sample.
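Once each minority-class sample has a sampling index, the interpolation of step 6 can be sketched as below. The text does not give the interpolation formula; this sketch assumes SMOTE-style linear interpolation between a minority sample and another randomly chosen minority sample of the same sub-cluster. The function names and the seeded random generator are illustrative.

```python
import random


def interpolate(seed_point, neighbor, rng):
    """Return a random point on the segment between seed_point and neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(seed_point, neighbor))


def oversample_cluster(minority_pts, counts, rng=None):
    """Generate counts[i] synthetic samples around minority_pts[i].

    ASSUMPTION: the partner for interpolation is drawn uniformly from the
    other minority samples of the same sub-cluster. Step 2 already discards
    sub-clusters with fewer than two minority samples, so a partner exists.
    """
    rng = rng or random.Random(0)
    synthetic = []
    for index, n_new in enumerate(counts):
        others = minority_pts[:index] + minority_pts[index + 1:]
        for _ in range(n_new):
            partner = rng.choice(others)
            synthetic.append(interpolate(minority_pts[index], partner, rng))
    return synthetic


minority = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
synthetic = oversample_cluster(minority, [2, 2, 4], random.Random(42))
print(len(synthetic))
```

Because each synthetic sample lies on a segment between two minority samples of one sub-cluster, the generated points stay inside the region that sub-cluster already occupies.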
The credit card fraud detection method based on cluster samples and extreme gradients provided by the embodiment of the present invention combines the clustering algorithm, the sample adaptive weight calculation, and the oversampling algorithm. This addresses between-class imbalance while effectively avoiding within-class imbalance, and improves the data quality of the artificially synthesized samples. Using an extreme gradient boosting tree as the classifier in the credit card fraud detection model yields a better classification effect, and after it is combined with the oversampling algorithm based on clustering and adaptive weights, the resulting credit card fraud detection model has good detection accuracy.
The various embodiments of the present invention are implemented through programmed processing performed by a device having a processor function. Therefore, in engineering practice, the technical solutions of the embodiments of the present invention and their functions can be packaged into various modules. Based on this practical situation, and on the basis of the above embodiments, an embodiment of the present invention provides a credit card fraud detection device based on cluster samples and limit gradients, which is used for executing the credit card fraud detection method based on cluster samples and limit gradients in the above method embodiments. Referring to FIG. 2, the device includes: a first main module, configured to implement step 1: performing data cleaning and feature engineering on raw data taken directly from a bank database to obtain an original data set for machine learning; a second main module, configured to implement step 2: partitioning the original data set with a clustering algorithm to obtain a plurality of sub-clusters, and discarding sub-clusters that contain no minority-class samples or only one minority-class sample; a third main module, configured to implement step 3: for the remaining sub-clusters, calculating the imbalance rate of each sub-cluster from the ratio of the number of minority-class samples to the number of majority-class samples in the sub-cluster, and screening out the sub-clusters to be oversampled according to a set imbalance rate threshold; a fourth main module, configured to implement step 4: calculating the sparsity factors of all sub-clusters to be oversampled, and determining the oversampling weight of each sub-cluster to be oversampled according to its sparsity factor; a fifth main module, configured to implement step 5: assigning different adaptive weights according to the degree to which each minority-class sample in the sub-cluster to be oversampled contributes to learning the boundary information, and determining the oversampling weight of each minority-class sample; a sixth main module, configured to implement step 6: performing adaptive-weight-based oversampling interpolation on each minority-class sample to generate an artificial data set in which the numbers of majority-class and minority-class samples are balanced; a seventh main module, configured to implement step 7: training on the balanced data set from the previous step with an extreme gradient boosting tree algorithm to obtain the final credit card fraud detection model; and an eighth main module, configured to implement step 8: detecting credit card transaction data using the credit card fraud detection model trained in step 7.
The credit card fraud detection device based on cluster samples and extreme gradients provided by the embodiment of the present invention adopts the modules shown in FIG. 2 and combines the clustering algorithm, the sample adaptive weight calculation, and the oversampling algorithm. This addresses between-class imbalance while effectively avoiding within-class imbalance, and improves the data quality of the artificially synthesized samples. Using an extreme gradient boosting tree as the classifier in the credit card fraud detection model yields a better classification effect, and after it is combined with the oversampling algorithm based on clustering and adaptive weights, the resulting credit card fraud detection model has good detection accuracy.
It should be noted that the device in the device embodiment provided by the present invention may also be used to implement the methods in other method embodiments provided by the present invention. The only difference is that corresponding function modules are provided, and the principle is basically the same as that of the device embodiment described above. As long as a person skilled in the art, on the basis of the device embodiment described above, obtains corresponding technical means by combining technical features, and obtains a technical solution formed by these technical means, the device in the device embodiment described above may be modified, provided that the technical solution remains practicable, to obtain a corresponding device embodiment for implementing the methods in other method embodiments. For example:
Based on the content of the foregoing device embodiment, as an optional embodiment, the credit card fraud detection device based on cluster samples and limit gradients provided in the embodiment of the present invention further includes: a first sub-module, configured to determine, according to the sparsity factor, the corresponding oversampling weight for each sub-cluster to be oversampled in step 4, including: step 4.1: calculating the Euclidean distance matrix of the sub-cluster to be oversampled; step 4.2: calculating the average minority class distance of the sub-cluster to be oversampled; step 4.3: calculating the minority class density and the sparsity factor of the sub-cluster to be oversampled; step 4.4: calculating the oversampling index corresponding to each sub-cluster to be oversampled, namely the total number of samples to be generated multiplied by the sparsity factor of that sub-cluster.
Based on the content of the foregoing device embodiment, as an optional embodiment, the credit card fraud detection device based on cluster samples and limit gradients provided in the embodiment of the present invention further includes: a second sub-module, configured to implement the Euclidean distance matrix, including:
Figure BDA0003626815960000091
wherein
Figure BDA0003626815960000092
is the squared Euclidean distance between the feature vectors of minority class samples x_i and x_j in the sub-cluster i to be oversampled, n is the number of minority class samples in the sub-cluster, and A(i) is the Euclidean distance matrix.
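As a hedged illustration of step 4.1, the sketch below builds the n x n pairwise distance matrix for the minority samples of one sub-cluster. The function name `distance_matrix` and the `squared` flag are hypothetical; the flag exists because the formula image is not reproduced here and the surrounding text mentions both a Euclidean distance matrix and a squared distance, so both readings are supported rather than one being asserted.

```python
import math


def distance_matrix(samples, squared=False):
    """Return the symmetric n x n matrix of pairwise distances.

    samples: list of feature vectors (minority samples of one sub-cluster).
    squared: if True, entries are squared Euclidean distances (assumption).
    """
    n = len(samples)
    matrix = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(j + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(samples[j], samples[k]))
            # The matrix is symmetric and its diagonal stays zero.
            matrix[j][k] = matrix[k][j] = d2 if squared else math.sqrt(d2)
    return matrix
```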
Based on the content of the foregoing device embodiment, as an optional embodiment, the credit card fraud detection device based on cluster samples and limit gradients provided in the embodiment of the present invention further includes: a third sub-module, configured to implement the calculating of the average distance of the minority class of the sub-clusters to be oversampled, including:
Figure BDA0003626815960000093
wherein Σa_xy represents the sum of all off-diagonal elements in the Euclidean distance matrix A(i), c(i) represents the number of off-diagonal elements in the Euclidean distance matrix A(i), and averDist(i) is the average minority class distance of the sub-cluster to be oversampled.
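The definition above amounts to dividing the sum of the off-diagonal entries by their count, which is n(n-1) for an n x n matrix. A minimal sketch, with the hypothetical function name `average_minority_distance`:

```python
def average_minority_distance(matrix):
    """averDist(i): sum of off-diagonal entries divided by their count c(i)."""
    n = len(matrix)
    if n < 2:
        raise ValueError("need at least two minority samples")
    off_diag_sum = sum(
        matrix[x][y] for x in range(n) for y in range(n) if x != y
    )
    off_diag_count = n * (n - 1)  # c(i): every entry except the diagonal
    return off_diag_sum / off_diag_count
```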
Based on the content of the foregoing device embodiment, as an optional embodiment, the credit card fraud detection device based on cluster samples and limit gradients provided in the embodiment of the present invention further includes: a fourth sub-module, configured to implement the minority class density, including:
Figure BDA0003626815960000094
wherein dens(i) is the minority class density, numOfMin(FC_i) is the number of minority class samples in the sub-cluster, averDist^m(i) is the m-th power of the average minority class distance, and m is the number of features of a sample.
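Read together, these definitions suggest that the density is the minority sample count divided by the m-th power of the average minority class distance. The formula image is not reproduced here, so the sketch below is an assumption based on those definitions, and `minority_density` is a hypothetical name.

```python
def minority_density(n_minority, aver_dist, n_features):
    """dens(i) = numOfMin(FC_i) / averDist(i)^m, with m = n_features."""
    if aver_dist <= 0:
        raise ValueError("average distance must be positive")
    return n_minority / (aver_dist ** n_features)
```

For data with many features the m-th power can become extremely large or small, so working with logarithms may be preferable in practice.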
Based on the content of the foregoing device embodiment, as an optional embodiment, the credit card fraud detection device based on the cluster samples and the limit gradients provided in the embodiment of the present invention further includes: a fifth sub-module for implementing the sparsity factor, comprising:
Figure BDA0003626815960000101
Figure BDA0003626815960000102
wherein spar(i) is the sparsity of the sub-cluster FC_i to be oversampled, sparseFac(i) is the sparsity factor, and Σspar is the sum of the sparsity of all sub-clusters to be oversampled.
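A minimal sketch of the sparsity factor and of step 4.4 follows. Two assumptions are made because the formula images are not reproduced here: sparsity is taken as the reciprocal of density, and the per-cluster sample count is rounded to the nearest integer. The names `sparsity_factors` and `samples_per_cluster` are hypothetical.

```python
def sparsity_factors(densities):
    """Normalise per-cluster sparsity so that the factors sum to 1.

    Assumption: spar(i) = 1 / dens(i).
    """
    spars = [1.0 / d for d in densities]
    total = sum(spars)  # sum of sparsity over all sub-clusters
    return [s / total for s in spars]


def samples_per_cluster(total_to_generate, factors):
    """Step 4.4: total number of samples to generate times each factor.

    Assumption: counts are rounded to integers, so they may not add up
    exactly to total_to_generate.
    """
    return [round(total_to_generate * f) for f in factors]
```

Under this reading, sparser sub-clusters (lower minority density) receive a larger share of the synthetic samples.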
Based on the content of the foregoing device embodiment, as an optional embodiment, the credit card fraud detection device based on cluster samples and limit gradients provided in the embodiment of the present invention further includes: a sixth sub-module, configured to determine the oversampling weight for each minority class sample in step 5, including: step 5.1: obtaining the K nearest neighbor samples of each minority class sample in the sub-cluster to be oversampled by using a nearest neighbor algorithm, and calculating the majority neighbor rate of each minority class sample, namely the number of majority class samples among its nearest neighbors divided by K; step 5.2: calculating the adaptive sampling weight of each minority class sample, namely the majority neighbor rate of that sample divided by the sum of the majority neighbor rates of all minority class samples in the sub-cluster; step 5.3: calculating the sampling index of each minority class sample, namely the sampling index of the sub-cluster containing the sample multiplied by the adaptive sampling weight of that sample.
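Steps 5.1 to 5.3 can be sketched as follows, using a brute-force neighbor search over one sub-cluster. The function names, the label convention (1 for the minority class), the uniform fallback when no minority sample has a majority neighbor, and the rounding of the final counts are all assumptions added for illustration.

```python
import math


def majority_neighbor_rates(samples, labels, k, minority_label=1):
    """Step 5.1: share of majority samples among the k nearest neighbors."""
    rates = {}
    for idx, (x, label) in enumerate(zip(samples, labels)):
        if label != minority_label:
            continue
        # Distances from x to every other sample of the sub-cluster.
        others = sorted(
            (math.dist(x, y), j) for j, y in enumerate(samples) if j != idx
        )
        n_major = sum(1 for _, j in others[:k] if labels[j] != minority_label)
        rates[idx] = n_major / k
    return rates


def adaptive_weights(rates):
    """Step 5.2: each rate divided by the sum of all rates."""
    total = sum(rates.values())
    if total == 0:  # assumption: fall back to uniform weights
        return {i: 1.0 / len(rates) for i in rates}
    return {i: r / total for i, r in rates.items()}


def sampling_counts(cluster_count, weights):
    """Step 5.3: sub-cluster sampling index times each adaptive weight."""
    return {i: round(cluster_count * w) for i, w in weights.items()}
```

Minority samples surrounded by more majority samples, which lie closer to the class boundary, receive larger weights and therefore seed more synthetic samples. The sketch expects `k` to be no larger than the number of other samples in the sub-cluster.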
The method of the embodiment of the invention is implemented on electronic equipment, so the relevant electronic equipment is introduced here. To this end, an embodiment of the present invention provides an electronic device, as shown in Fig. 3, including: at least one processor, a communication interface, at least one memory, and a communication bus, wherein the at least one processor, the communication interface, and the at least one memory communicate with each other through the communication bus. The at least one processor may invoke logic instructions in the at least one memory to perform all or some of the steps of the methods provided by the method embodiments described above.
In addition, the logic instructions in the at least one memory may be implemented as software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the method embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. Based on this recognition, each block in the flowchart or block diagrams may represent a module, a program segment, or a portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In this patent, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A credit card fraud detection method based on cluster samples and limit gradients, characterized by comprising the following steps: step 1: performing data cleaning and feature engineering on raw data taken directly from a bank database to obtain an original data set for machine learning; step 2: partitioning the original data set with a clustering algorithm to obtain a plurality of sub-clusters, and discarding sub-clusters that contain no minority class samples or only one minority class sample; step 3: for the remaining sub-clusters, calculating the imbalance rate of each sub-cluster from the ratio of the number of minority class samples to the number of majority class samples in the sub-cluster, and screening out the sub-clusters to be oversampled according to a set imbalance rate threshold; step 4: calculating the sparsity factors of all sub-clusters to be oversampled, and determining the corresponding oversampling weight of each sub-cluster to be oversampled according to the sparsity factor; step 5: assigning different adaptive weights according to the learning degree of each minority class sample in the sub-cluster to be oversampled with respect to the boundary information, and determining the oversampling weight of each minority class sample; step 6: performing adaptive-weight-based oversampling interpolation on each minority class sample to generate an artificial data set in which the numbers of majority class samples and minority class samples are balanced; step 7: training on the balanced data set from the previous step using the extreme gradient boosting tree (XGBoost) algorithm to obtain the final credit card fraud detection model; step 8: detecting credit card transaction data using the credit card fraud detection model trained in step 7.
2. The credit card fraud detection method based on cluster samples and limit gradients of claim 1, wherein determining the corresponding oversampling weight for each sub-cluster to be oversampled according to the sparsity factor in step 4 comprises: step 4.1: calculating the Euclidean distance matrix of the sub-cluster to be oversampled; step 4.2: calculating the average minority class distance of the sub-cluster to be oversampled; step 4.3: calculating the minority class density and the sparsity factor of the sub-cluster to be oversampled; step 4.4: calculating the oversampling index corresponding to each sub-cluster to be oversampled, namely the total number of samples to be generated multiplied by the sparsity factor of that sub-cluster.
3. The credit card fraud detection method based on cluster samples and limit gradients of claim 2, wherein the Euclidean distance matrix comprises:
Figure FDA0003626815950000011
wherein
Figure FDA0003626815950000021
is the squared Euclidean distance between the feature vectors of minority class samples x_i and x_j in the sub-cluster i to be oversampled, n is the number of minority class samples in the sub-cluster, and A(i) is the Euclidean distance matrix.
4. The clustered sample and extreme gradient-based credit card fraud detection method of claim 3, wherein said calculating the minority-class average distance of the sub-clusters to be oversampled comprises:
Figure FDA0003626815950000022
wherein Σa_xy represents the sum of all off-diagonal elements in the Euclidean distance matrix A(i), c(i) represents the number of off-diagonal elements in the Euclidean distance matrix A(i), and averDist(i) is the average minority class distance of the sub-cluster to be oversampled.
5. The credit card fraud detection method based on cluster samples and limit gradients of claim 4, wherein the minority class density comprises:
Figure FDA0003626815950000023
wherein dens(i) is the minority class density, numOfMin(FC_i) is the number of minority class samples in the sub-cluster, averDist^m(i) is the m-th power of the average minority class distance, and m is the number of features of a sample.
6. The credit card fraud detection method based on cluster samples and limit gradients of claim 5, wherein the sparsity factor comprises:
Figure FDA0003626815950000024
Figure FDA0003626815950000031
wherein spar(i) is the sparsity of the sub-cluster FC_i to be oversampled, sparseFac(i) is the sparsity factor, and Σspar is the sum of the sparsity of all sub-clusters to be oversampled.
7. The credit card fraud detection method based on cluster samples and limit gradients of claim 6, wherein determining the oversampling weight for each minority class sample in step 5 comprises: step 5.1: obtaining the K nearest neighbor samples of each minority class sample in the sub-cluster to be oversampled by using a nearest neighbor algorithm, and calculating the majority neighbor rate of each minority class sample, namely the number of majority class samples among its nearest neighbors divided by K; step 5.2: calculating the adaptive sampling weight of each minority class sample, namely the majority neighbor rate of that sample divided by the sum of the majority neighbor rates of all minority class samples in the sub-cluster; step 5.3: calculating the sampling index of each minority class sample, namely the sampling index of the sub-cluster containing the sample multiplied by the adaptive sampling weight of that sample.
8. A credit card fraud detection apparatus based on cluster samples and limit gradients, comprising: a first master module, configured to implement step 1: performing data cleaning and feature engineering on raw data taken directly from a bank database to obtain an original data set for machine learning; a second master module, configured to implement step 2: partitioning the original data set with a clustering algorithm to obtain a plurality of sub-clusters, and discarding sub-clusters that contain no minority class samples or only one minority class sample; a third master module, configured to implement step 3: for the remaining sub-clusters, calculating the imbalance rate of each sub-cluster from the ratio of the number of minority class samples to the number of majority class samples in the sub-cluster, and screening out the sub-clusters to be oversampled according to a set imbalance rate threshold; a fourth master module, configured to implement step 4: calculating the sparsity factors of all sub-clusters to be oversampled, and determining the corresponding oversampling weight of each sub-cluster to be oversampled according to the sparsity factor; a fifth master module, configured to implement step 5: assigning different adaptive weights according to the learning degree of each minority class sample in the sub-cluster to be oversampled with respect to the boundary information, and determining the oversampling weight of each minority class sample; a sixth master module, configured to implement step 6: performing adaptive-weight-based oversampling interpolation on each minority class sample to generate an artificial data set in which the numbers of majority class samples and minority class samples are balanced; a seventh master module, configured to implement step 7: training on the balanced data set from the previous step using the extreme gradient boosting tree (XGBoost) algorithm to obtain the final credit card fraud detection model; an eighth master module, configured to implement step 8: detecting credit card transaction data using the credit card fraud detection model trained in step 7.
9. An electronic device, comprising:
at least one processor, at least one memory, and a communication interface; wherein
the processor, the memory, and the communication interface communicate with each other;
the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform the method of any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 7.
CN202210478879.5A 2022-05-05 2022-05-05 Credit card fraud detection method and device based on cluster samples and limit gradients Pending CN114862404A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210478879.5A CN114862404A (en) 2022-05-05 2022-05-05 Credit card fraud detection method and device based on cluster samples and limit gradients

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210478879.5A CN114862404A (en) 2022-05-05 2022-05-05 Credit card fraud detection method and device based on cluster samples and limit gradients

Publications (1)

Publication Number Publication Date
CN114862404A (en) 2022-08-05

Family

ID=82635088

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210478879.5A Pending CN114862404A (en) 2022-05-05 2022-05-05 Credit card fraud detection method and device based on cluster samples and limit gradients

Country Status (1)

Country Link
CN (1) CN114862404A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116108387A (en) * 2023-04-14 2023-05-12 湖南工商大学 Unbalanced data oversampling method and related equipment
CN116108387B (en) * 2023-04-14 2023-07-04 湖南工商大学 Unbalanced data oversampling method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination