CN111105241A

CN111105241A - Identification method for anti-fraud of credit card transaction

Info

Publication number: CN111105241A
Application number: CN201911323155.8A
Authority: CN
Inventors: 董雪梅; 崔奔雷
Original assignee: Zhejiang Gongshang University
Current assignee: Zhejiang Gongshang University
Priority date: 2019-12-20
Filing date: 2019-12-20
Publication date: 2020-05-05
Anticipated expiration: 2039-12-20
Also published as: CN111105241B

Abstract

The invention discloses an identification method for credit card transaction anti-fraud, which comprises the following steps: 101) aggregating transaction characteristics according to identity characteristics; 102) aggregating transaction time characteristics by region; 103) a prediction model processing step; and 104) a prediction result step. The method is based on the fusion of multiple gradient boosting tree models.

Description

Identification method for anti-fraud of credit card transaction

Technical Field

The present invention relates to the field of credit cards, and more particularly to a fraud identification method for credit card transactions.

Background

As the internet finance industry develops, conducting financial service transactions through internet channels is becoming increasingly common. For both parties to an internet transaction, correctly evaluating transaction risk and preventing financial fraud is a particularly important part of risk control work.

Credit investigation and anti-fraud checks on internet finance users require examining the users' various credit evaluation materials in order to assess transaction risk and protect the interests of the financial platform. At present, such risk review still requires manual intervention to varying degrees, which limits the efficiency and stability of business development.

Disclosure of Invention

The invention overcomes the defects of the prior art and provides an anti-fraud identification method for credit card transactions based on the fusion of multiple gradient boosting tree models.

In order to solve the technical problems, the technical scheme of the invention is as follows:

an identification method applied to credit card transaction anti-fraud specifically comprises the following steps:

101) aggregating transaction characteristics according to the identity characteristics;

102) aggregating transaction time characteristics according to regions;

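103) establishing three models based on XGBoost, CatBoost and LightGBM to predict credit card transactions and obtain a probability judgment value of fraud;

the XGBoost model has the following specific processing formula:

$$\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \tag{1}$$

in the formula, $l(y_i, \hat{y}_i)$ is the residual (loss) term and $\Omega(f_k)$ is the regular term, wherein K is the number of decision trees, T is the number of leaf nodes, $w_j$ is the weight value of leaf node j, γ is a constant penalizing the number of leaf nodes, and λ is a constant for the weight value of each leaf node;

in formula (1), $l(y_i, \hat{y}_i)$ is changed into $l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i))$ as the loss function, and $\sum_{k} \Omega(f_k)$ is changed into $\Omega(f_t) + C$ as the regular term; the formula after conversion is as follows:

$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + C \tag{2}$$

in the formula, $f_t$ is the newly added t-th tree, whose output is recorded as $f_t(x_i)$; $\hat{y}_i^{(t-1)}$ is the value fitted by the previous t-1 trees; and C is a constant collecting the regular terms of the previous t-1 trees;

further decomposing into the residual sum of squares of the previous t-1 trees and the newly fitted t-th tree, formula (2) is converted into the following formula:

$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} \left(y_i - \left(\hat{y}_i^{(t-1)} + f_t(x_i)\right)\right)^2 + \Omega(f_t) + C \tag{3}$$

so that each newly found decision tree $f_t(x_i)$ reduces the residual as much as possible;

taking $\hat{y}_i^{(t-1)}$ in formula (3) as x and $f_t(x_i)$ as Δx, so that $\mathrm{Obj}^{(t)} = F(x + \Delta x)$, a second-order Taylor expansion is performed; the first derivative of $l(y_i, \hat{y}_i^{(t-1)})$ with respect to $\hat{y}_i^{(t-1)}$ is denoted $g_i$ and the second derivative is denoted $h_i$; ignoring the constant components, the following formula is obtained:

$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2\right] + \Omega(f_t) \tag{4}$$

wherein $f_t(x_i)$ is a function of the leaf node weights of the t-th tree, $f_t(x_i) = w_{q(x_i)}$, with $q(x_i)$ the leaf node to which sample $x_i$ is assigned; formula (4) is transformed as follows:

$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^2\right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \tag{5}$$

wherein $I_j = \{\, i \mid q(x_i) = j \,\}$ is the set of samples on leaf node j; the samples are divided among the leaf nodes, and the sequential traversal of samples 1 to n is changed into a traversal from the samples on leaf node 1 to the samples on leaf node T, so that the following formula is obtained:

$$\mathrm{Obj}^{(t)} \approx \sum_{j=1}^{T} \left[\Big(\sum_{i \in I_j} g_i\Big) w_j + \frac{1}{2}\Big(\sum_{i \in I_j} h_i + \lambda\Big) w_j^2\right] + \gamma T \tag{6}$$

denoting $\sum_{i \in I_j} g_i$ as $G_j$ and $\sum_{i \in I_j} h_i$ as $H_j$, formula (6) becomes a multivariate extreme value problem in $w_j$, whose solution is:

$$w_j^{*} = -\frac{G_j}{H_j + \lambda} \tag{7}$$

the new objective function obtained by substituting into formula (6) is:

$$\mathrm{Obj}^{(t)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \tag{8}$$

according to the division of the leaf nodes, the divided samples are split into an L part and an R part, and the gain formula of the split is as follows:

$$\mathrm{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma \tag{9}$$

obtaining the maximum Gain, and thereby the fraud probability judgment value, of the XGBoost model;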

104) according to the output results of the three models established in step 103), computing a Pearson correlation coefficient matrix and performing equally weighted average fusion on the results with low correlation to obtain a final prediction result.

Further, the unique identity is determined from the explicit and/or implicit identity characteristics of the client, and the transaction characteristics counted under each unique identity include the average transaction amount, the transaction frequency and the type of device used.
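The identity-based aggregation described above can be sketched as follows. This is a minimal illustration in plain Python, not the patented implementation: the field names (`card_id`, `email`, `amount`, `device`), the choice of card id plus e-mail as the unique identity, and the ratio feature are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical transactions; every field name here is an illustrative assumption.
transactions = [
    {"card_id": "A", "email": "a@x.com", "amount": 10.0, "device": "mobile"},
    {"card_id": "A", "email": "a@x.com", "amount": 30.0, "device": "mobile"},
    {"card_id": "A", "email": "a@x.com", "amount": 20.0, "device": "desktop"},
    {"card_id": "B", "email": "b@y.com", "amount": 100.0, "device": "desktop"},
    {"card_id": "B", "email": "b@y.com", "amount": 300.0, "device": "desktop"},
]


def identity_key(tx):
    # Assumed stand-in for the unique identity:
    # one explicit attribute (card id) plus one implicit attribute (e-mail).
    return (tx["card_id"], tx["email"])


def aggregate_by_identity(txs):
    """Group transactions by identity and compute per-identity statistics."""
    groups = defaultdict(list)
    for tx in txs:
        groups[identity_key(tx)].append(tx)
    stats = {}
    for key, rows in groups.items():
        stats[key] = {
            "amount_mean": mean(r["amount"] for r in rows),   # average amount
            "tx_count": len(rows),                             # transaction frequency
            "device_types": sorted({r["device"] for r in rows}),  # devices used
        }
    return stats


identity_stats = aggregate_by_identity(transactions)

# Attach the identity-level aggregates to each transaction as model features.
features = []
for tx in transactions:
    s = identity_stats[identity_key(tx)]
    features.append({
        "amount_to_identity_mean": tx["amount"] / s["amount_mean"],
        "identity_tx_count": s["tx_count"],
        "identity_device_type_count": len(s["device_types"]),
    })
```

A transaction whose amount is far from the mean for its identity, or which comes from a device type not previously seen for that identity, then stands out in the feature vector.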

Furthermore, the time characteristic is region-based: the time band with the highest transaction frequency is counted for each region, and the difference between each transaction time and the local high-frequency transaction time is calculated as an important characteristic for judging abnormal transactions.
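A small sketch of the region-based time characteristic follows. The text does not state the granularity of the time band or how the difference is measured, so two assumptions are made here: transactions are binned by hour of day, and the difference is taken on a 24-hour clock so that 23:00 and 01:00 are two hours apart. The region names and data are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical transactions as (region, hour of day 0-23).
transactions = [
    ("east", 9), ("east", 10), ("east", 10), ("east", 3),
    ("west", 20), ("west", 20), ("west", 21), ("west", 1),
]


def peak_hour_by_region(txs):
    """Return the most frequent transaction hour for each region."""
    counts = defaultdict(Counter)
    for region, hour in txs:
        counts[region][hour] += 1
    return {region: c.most_common(1)[0][0] for region, c in counts.items()}


def circular_hour_gap(hour, peak):
    """Distance in hours between two times on a 24-hour clock (assumed metric)."""
    d = abs(hour - peak) % 24
    return min(d, 24 - d)


peaks = peak_hour_by_region(transactions)

# One feature per transaction: how far it is from its region's busiest hour.
gaps = [circular_hour_gap(hour, peaks[region]) for region, hour in transactions]
```

Transactions made far from the local peak (for example the 03:00 transaction in the `east` region) receive a large gap value, which is the signal the text proposes for flagging abnormal activity.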

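Further, the CatBoost model randomly orders the training set and, for the p-th sample, replaces its categorical value with a statistic computed from the preceding p-1 samples; the specific formula is as follows:

$$\hat{x}_{\sigma_p,k} = \frac{\sum_{j=1}^{p-1} \left[x_{\sigma_j,k} = x_{\sigma_p,k}\right] Y_{\sigma_j} + a \cdot P}{\sum_{j=1}^{p-1} \left[x_{\sigma_j,k} = x_{\sigma_p,k}\right] + a}$$

wherein σ is the random ordering, $x_{\sigma_j,k}$ is the k-th categorical characteristic of the j-th sample in that ordering, $Y_{\sigma_j}$ is its label, and $[\cdot]$ equals 1 when the condition holds and 0 otherwise; P (the prior value) and a (the weight of the prior) are hyper-parameters that reduce the noise obtained from low-frequency categories, and the robustness and generalization capability of the model are improved by means of ordered boosting.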
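The ordered statistic described above can be illustrated with a short sketch. It follows the commonly documented CatBoost form, (sum of earlier labels in the same category + a × prior) / (count of earlier samples in the same category + a), where "earlier" refers to a random permutation of the training set. The function name, the `order` argument and the toy data are assumptions for this example; CatBoost itself performs this encoding internally.

```python
import random


def ordered_target_statistic(categories, labels, prior, a, order=None, seed=0):
    """Encode one categorical column using only samples earlier in a permutation.

    categories: category value per sample
    labels:     0/1 target per sample
    prior, a:   prior value and its weight (both hyper-parameters)
    order:      optional explicit permutation; a random one is drawn if omitted
    """
    n = len(categories)
    if order is None:
        order = list(range(n))
        random.Random(seed).shuffle(order)  # random ordering of the training set

    sums = {}    # running label sum per category over preceding samples
    counts = {}  # running sample count per category over preceding samples
    encoded = [0.0] * n
    for idx in order:
        c = categories[idx]
        s, k = sums.get(c, 0.0), counts.get(c, 0)
        # Statistic uses preceding samples only, smoothed toward the prior.
        encoded[idx] = (s + a * prior) / (k + a)
        # Only now is this sample's own label revealed to later samples.
        sums[c] = s + labels[idx]
        counts[c] = k + 1
    return encoded


card_type = ["visa", "visa", "visa", "amex"]
is_fraud = [1, 0, 1, 0]
encoded = ordered_target_statistic(card_type, is_fraud, prior=0.5, a=1.0,
                                   order=[0, 1, 2, 3])
```

Because a sample's own label never enters its encoding, the first occurrence of any category falls back to the prior, which is what damps the noise from low-frequency categories.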

Further, the specific steps of step 104) are as follows:

401) acquiring the Pearson correlation coefficients between the predictions output by the three models;

402) selecting the prediction results whose Pearson correlation coefficient in step 401) is lower than 0.99 and whose prediction accuracies are close and good;

403) fusing the prediction results of the three models with equal weights and outputting the fused result as the final prediction result.

Compared with the prior art, the invention has the advantages that:

The characteristic types used by the invention are complementary, and fusing characteristics from different levels of abstraction better reveals the real properties of the software. Furthermore, because learning algorithms rest on different assumptions, no single learning algorithm is optimal for every type of problem, and selecting a suitable classification algorithm for different characteristics is not an easy task. Different classification algorithms have different inductive biases; fusing them lets each learning algorithm exert its own advantages and compensates for their individual defects, thereby improving classification accuracy, reducing the false alarm rate and improving generalization performance.

Detailed Description

The following specific embodiments are given to further illustrate the present invention.

101) Aggregating transaction characteristics according to the identity characteristics. The unique identity is determined from the explicit and/or implicit identity characteristics of the client, and the transaction characteristics counted under each unique identity include the average transaction amount, the transaction frequency and the type of device used.

102) Aggregating transaction time characteristics according to regions. The time characteristic is region-based: the time band with the highest transaction frequency is counted for each region, and the difference between each transaction time and the local high-frequency transaction time is calculated as an important characteristic for judging abnormal transactions.

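103) Establishing three models based on XGBoost, CatBoost and LightGBM to predict credit card transactions and obtain a probability judgment value of fraud.

The XGBoost-based model has the following specific processing formula:

$$\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \tag{1}$$

In the formula, $l(y_i, \hat{y}_i)$ is the residual (loss) term and $\Omega(f_k)$ is the regular term, wherein K is the number of decision trees, T is the number of leaf nodes, $w_j$ is the weight value of leaf node j, γ is a constant penalizing the number of leaf nodes, and λ is a constant for the weight value of each leaf node.

Formula (1) is converted into formula (2) below. Concretely, $l(y_i, \hat{y}_i)$ is rewritten as $l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i))$ as the loss function, and $\sum_{k} \Omega(f_k)$, as the regularization term, is rewritten as $\Omega(f_t) + C$:

$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + C \tag{2}$$

In the formula, $f_t$ is the newly added t-th tree, whose output is recorded as $f_t(x_i)$; $\hat{y}_i^{(t-1)}$ is the value fitted by the previous t-1 trees; and C is a constant collecting the regular terms of the previous t-1 trees.

Further decomposing into the residual sum of squares of the previous t-1 trees and the newly fitted t-th tree, formula (2) is converted into:

$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} \left(y_i - \left(\hat{y}_i^{(t-1)} + f_t(x_i)\right)\right)^2 + \Omega(f_t) + C \tag{3}$$

Taking $\hat{y}_i^{(t-1)}$ in formula (3) as x and $f_t(x_i)$ as Δx, so that $\mathrm{Obj}^{(t)} = F(x + \Delta x)$, a second-order Taylor expansion is performed. The first derivative of $l(y_i, \hat{y}_i^{(t-1)})$ with respect to $\hat{y}_i^{(t-1)}$ is denoted $g_i$ and the second derivative is denoted $h_i$. Ignoring the constant components gives:

$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2\right] + \Omega(f_t) \tag{4}$$

wherein $f_t(x_i)$ is a function of the leaf node weights of the t-th tree. Since $f_t(x_i)$ is determined by the weight $w_q$ of the leaf node $q(x_i)$ to which the sample is assigned, $f_t(x_i) = w_{q(x_i)}$, and formula (4) is expressed as:

$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^2\right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \tag{5}$$

wherein $I_j = \{\, i \mid q(x_i) = j \,\}$ is the set of samples on leaf node j. The samples are divided among the leaf nodes, and the sequential traversal of samples 1 to n is changed into a traversal from the samples on leaf node 1 to the samples on leaf node T:

$$\mathrm{Obj}^{(t)} \approx \sum_{j=1}^{T} \left[\Big(\sum_{i \in I_j} g_i\Big) w_j + \frac{1}{2}\Big(\sum_{i \in I_j} h_i + \lambda\Big) w_j^2\right] + \gamma T \tag{6}$$

Denoting $\sum_{i \in I_j} g_i$ as $G_j$ and $\sum_{i \in I_j} h_i$ as $H_j$, formula (6) becomes a multivariate extreme value problem in $w_j$, whose solution is:

$$w_j^{*} = -\frac{G_j}{H_j + \lambda} \tag{7}$$

The new objective function obtained by substituting into formula (6) is:

$$\mathrm{Obj}^{(t)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \tag{8}$$

According to the division of the leaf nodes, the divided samples are split into an L part and an R part, and the gain of the split is expressed as:

$$\mathrm{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma \tag{9}$$

All possible cases are traversed for each division, so that the leaf nodes of each layer of each newly built tree have the optimal weight coefficient, and the maximum Gain, and thereby the fraud probability judgment value, of the XGBoost-based model is obtained.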
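To make the split search concrete, the sketch below evaluates the usual XGBoost closed forms, leaf weight w* = -G / (H + λ) and Gain = ½ [G_L²/(H_L + λ) + G_R²/(H_R + λ) - (G_L + G_R)²/(H_L + H_R + λ)] - γ, over every threshold of a single feature. The toy data, the logistic loss with an initial probability of 0.5 (giving g = p - y and h = p(1 - p)), and all function names are assumptions for illustration; the XGBoost library performs this search internally.

```python
def leaf_weight(G, H, lam):
    """Optimal leaf weight for gradient sum G and Hessian sum H."""
    return -G / (H + lam)


def leaf_score(G, H, lam):
    """Structure score G^2 / (H + lambda) of one leaf."""
    return G * G / (H + lam)


def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Gain of splitting a node into a left (L) and right (R) part."""
    return 0.5 * (leaf_score(G_L, H_L, lam) + leaf_score(G_R, H_R, lam)
                  - leaf_score(G_L + G_R, H_L + H_R, lam)) - gamma


def best_split(x, g, h, lam, gamma):
    """Traverse all thresholds on one feature; return (best gain, threshold)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    G, H = sum(g), sum(h)
    G_L = H_L = 0.0
    best = (float("-inf"), None)
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        G_L += g[i]
        H_L += h[i]
        if x[i] == x[nxt]:
            continue  # no threshold fits between equal feature values
        gain = split_gain(G_L, H_L, G - G_L, H - H_L, lam, gamma)
        if gain > best[0]:
            best = (gain, (x[i] + x[nxt]) / 2)
    return best


# Toy node: one feature, binary fraud labels, initial predicted probability 0.5.
x = [1.0, 2.0, 3.0, 4.0]
y = [0, 0, 1, 1]
p = 0.5
g = [p - yi for yi in y]      # first derivative of the logistic loss
h = [p * (1 - p) for _ in y]  # second derivative of the logistic loss

best_gain, best_threshold = best_split(x, g, h, lam=1.0, gamma=0.0)
w_left = leaf_weight(sum(g[:2]), sum(h[:2]), lam=1.0)
w_right = leaf_weight(sum(g[2:]), sum(h[2:]), lam=1.0)
```

On this toy node the search places the threshold between the non-fraud and fraud samples, and the two leaf weights have opposite signs, lowering the score on the left and raising it on the right.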

The CatBoost-based model randomly orders the training set and, for the p-th sample, replaces its categorical value with a statistic computed from the preceding p-1 samples. The specific formula is as follows:

$$\hat{x}_{\sigma_p,k} = \frac{\sum_{j=1}^{p-1} \left[x_{\sigma_j,k} = x_{\sigma_p,k}\right] Y_{\sigma_j} + a \cdot P}{\sum_{j=1}^{p-1} \left[x_{\sigma_j,k} = x_{\sigma_p,k}\right] + a}$$

wherein σ is the random ordering, $x_{\sigma_j,k}$ is the k-th categorical characteristic of the j-th sample in that ordering, $Y_{\sigma_j}$ is its label, and $[\cdot]$ equals 1 when the condition holds and 0 otherwise. P (the prior value) and a (the weight of the prior) are hyper-parameters used to reduce the noise obtained from low-frequency categories, and the robustness and generalization capability of the model are improved by means of ordered boosting.

CatBoost has a considerable advantage in processing categorical data. Categorical data is generally handled by an encoding such as one-hot encoding, but this scheme adopts the more effective strategy of the CatBoost model: the training set is randomly ordered, which alleviates the prediction shift problem of GBDT, and the gradient calculation of GBDT (which computes the gradient on the same data set every time) is replaced by ordered boosting. This reduces the bias of the gradient estimate and improves the robustness and generalization capability of the model.

LightGBM is an improved version of the GBDT algorithm proposed by Microsoft in 2015 and is adopted directly. Its main innovations are the Gradient-based One-Side Sampling (GOSS) technique and the Exclusive Feature Bundling (EFB) technique, which reduce the sample size and the computational overhead while maintaining considerable accuracy.
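The sampling idea behind GOSS can be sketched in a few lines. As generally described for LightGBM, the samples with the largest absolute gradients are all kept, a random fraction of the remaining samples is drawn, and the drawn samples are up-weighted by (1 - a) / b so that the gradient statistics remain approximately unbiased. The function below is an illustrative stand-alone sketch; its name, defaults and return format are assumptions, and LightGBM implements the technique internally.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) selected by Gradient-based One-Side Sampling."""
    n = len(gradients)
    # Rank samples by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    top, rest = order[:n_top], order[n_top:]
    # Keep every large-gradient sample; randomly draw from the small-gradient rest.
    sampled = random.Random(seed).sample(rest, n_other)

    # Up-weight the drawn small-gradient samples to compensate for the sampling.
    amplify = (1.0 - top_rate) / other_rate
    indices = top + sampled
    weights = [1.0] * n_top + [amplify] * n_other
    return indices, weights


# Toy gradients whose magnitude grows with the sample index.
gradients = [(-1) ** i * i / 100 for i in range(100)]
indices, weights = goss_sample(gradients)
```

With the default rates, only 30 of the 100 samples are used to build the next tree, which is where the reduction in sample size and computation comes from.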

104) Obtaining a Pearson correlation coefficient matrix from the output results of the three models established in step 103), and performing equally weighted average fusion on the results with low correlation to obtain a final prediction result. The specific process is as follows:

401) Acquiring the Pearson correlation coefficients between the predictions output by the three models.

402) Selecting the prediction results whose Pearson correlation coefficient in step 401) is lower than 0.99 and whose prediction accuracies are close and good. For example, in this scheme the value obtained for the fraud probability judgment value of the XGBoost model reaches 0.95, and the value obtained for the fraud probability judgment values of the LightGBM and CatBoost models reaches 0.945. The fraud probability judgment values of the three models are close to one another while the correlation coefficients of their outputs are comparatively low, so fusion is needed.

403) Fusing the fraud probability judgment values of the three models from step 402) with equal weights and outputting the fused result as the final prediction result. For example, if the fraud probability judgment values output by the three models are $y_1$, $y_2$ and $y_3$ respectively, the final output is:

$$y = \tfrac{1}{3} y_1 + \tfrac{1}{3} y_2 + \tfrac{1}{3} y_3$$
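Steps 401) to 403) can be sketched as follows. The text does not spell out exactly how the 0.99 threshold gates the fusion, so this sketch simply reports which model pairs fall below the threshold and then averages all supplied models with equal weights; that reading, the function names and the toy predictions are assumptions.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


def low_correlation_pairs(preds, threshold=0.99):
    """Step 401/402: model pairs whose prediction correlation is below threshold."""
    names = list(preds)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if pearson(preds[a], preds[b]) < threshold]


def equal_weight_fusion(preds):
    """Step 403: average the models' fraud probabilities with equal weights."""
    columns = list(preds.values())
    return [sum(vals) / len(columns) for vals in zip(*columns)]


# Hypothetical fraud probabilities from the three models on four transactions.
preds = {
    "xgboost": [0.90, 0.10, 0.40, 0.80],
    "lightgbm": [0.60, 0.20, 0.50, 0.70],
    "catboost": [0.30, 0.30, 0.30, 0.90],
}

pairs = low_correlation_pairs(preds)
fused = equal_weight_fusion(preds)  # first entry is (0.9 + 0.6 + 0.3) / 3
```

Averaging helps most when the models are individually accurate but make different mistakes, which is why the correlation check precedes the fusion.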

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and refinements without departing from the spirit of the present invention, and these modifications and refinements should also be regarded as falling within the scope of the present invention.

Claims

1. An identification method applied to credit card transaction anti-fraud is characterized by comprising the following steps:

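101) aggregating transaction characteristics according to the identity characteristics;

102) aggregating transaction time characteristics according to regions;

103) establishing three models based on XGBoost, CatBoost and LightGBM to predict credit card transactions and obtain a probability judgment value of fraud;

the XGBoost model has the following specific processing formula:

$$\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \tag{1}$$

in the formula, $l(y_i, \hat{y}_i)$ is the residual (loss) term and $\Omega(f_k)$ is the regular term, wherein K is the number of decision trees, T is the number of leaf nodes, $w_j$ is the weight value of leaf node j, γ is a constant penalizing the number of leaf nodes, and λ is a constant for the weight value of each leaf node;

in formula (1), $l(y_i, \hat{y}_i)$ is changed into $l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i))$ as the loss function, and $\sum_{k} \Omega(f_k)$ is changed into $\Omega(f_t) + C$ as the regular term; the formula after conversion is as follows:

$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + C \tag{2}$$

in the formula, $f_t$ is the newly added t-th tree, whose output is recorded as $f_t(x_i)$; $\hat{y}_i^{(t-1)}$ is the value fitted by the previous t-1 trees; and C is a constant collecting the regular terms of the previous t-1 trees;

further decomposing into the residual sum of squares of the previous t-1 trees and the newly fitted t-th tree, formula (2) is converted into the following formula:

$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} \left(y_i - \left(\hat{y}_i^{(t-1)} + f_t(x_i)\right)\right)^2 + \Omega(f_t) + C \tag{3}$$

taking $\hat{y}_i^{(t-1)}$ in formula (3) as x and $f_t(x_i)$ as Δx, so that $\mathrm{Obj}^{(t)} = F(x + \Delta x)$, a second-order Taylor expansion is performed; the first derivative of $l(y_i, \hat{y}_i^{(t-1)})$ with respect to $\hat{y}_i^{(t-1)}$ is denoted $g_i$ and the second derivative is denoted $h_i$; ignoring the constant components, the following formula is obtained:

$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2\right] + \Omega(f_t) \tag{4}$$

wherein $f_t(x_i)$ is a function of the leaf node weights of the t-th tree, $f_t(x_i) = w_{q(x_i)}$, with $q(x_i)$ the leaf node to which sample $x_i$ is assigned; formula (4) is transformed as follows:

$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^2\right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \tag{5}$$

wherein $I_j = \{\, i \mid q(x_i) = j \,\}$ is the set of samples on leaf node j; the samples are divided among the leaf nodes, and the sequential traversal of samples 1 to n is changed into a traversal from the samples on leaf node 1 to the samples on leaf node T, so as to obtain the following formula:

$$\mathrm{Obj}^{(t)} \approx \sum_{j=1}^{T} \left[\Big(\sum_{i \in I_j} g_i\Big) w_j + \frac{1}{2}\Big(\sum_{i \in I_j} h_i + \lambda\Big) w_j^2\right] + \gamma T \tag{6}$$

denoting $\sum_{i \in I_j} g_i$ as $G_j$ and $\sum_{i \in I_j} h_i$ as $H_j$, formula (6) becomes a multivariate extreme value problem in $w_j$, whose solution is:

$$w_j^{*} = -\frac{G_j}{H_j + \lambda} \tag{7}$$

the new objective function obtained by substituting into formula (6) is:

$$\mathrm{Obj}^{(t)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \tag{8}$$

according to the division of the leaf nodes, the divided samples are split into an L part and an R part, and the gain formula of the split is as follows:

$$\mathrm{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma \tag{9}$$

obtaining the maximum Gain, and thereby the fraud probability judgment value, of the XGBoost model;

104) according to the output results of the three models established in step 103), computing a Pearson correlation coefficient matrix and performing equally weighted average fusion on the results with low correlation to obtain a final prediction result.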

2. An identification method applied to credit card transaction anti-fraud according to claim 1, characterized in that: the unique identity is determined from the explicit and/or implicit identity characteristics of the client, and the transaction characteristics counted under each unique identity comprise the average transaction amount, the transaction frequency and the type of device used.

3. An identification method applied to credit card transaction anti-fraud according to claim 1, characterized in that: the time characteristic is region-based; the time band with the highest transaction frequency is counted for each region, and the difference between each transaction time and the local high-frequency transaction time is calculated as an important characteristic for judging abnormal transactions.

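4. An identification method applied to credit card transaction anti-fraud according to claim 1, characterized in that: the CatBoost model randomly orders the training set and, for the p-th sample, replaces its categorical value with a statistic computed from the preceding p-1 samples, the specific formula being as follows:

$$\hat{x}_{\sigma_p,k} = \frac{\sum_{j=1}^{p-1} \left[x_{\sigma_j,k} = x_{\sigma_p,k}\right] Y_{\sigma_j} + a \cdot P}{\sum_{j=1}^{p-1} \left[x_{\sigma_j,k} = x_{\sigma_p,k}\right] + a}$$

wherein σ is the random ordering, $x_{\sigma_j,k}$ is the k-th categorical characteristic of the j-th sample in that ordering, $Y_{\sigma_j}$ is its label, $[\cdot]$ equals 1 when the condition holds and 0 otherwise, and P (the prior value) and a (the weight of the prior) are hyper-parameters.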

5. An identification method applied to credit card transaction anti-fraud according to claim 1, characterized in that: step 104) comprises the following specific steps:

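401) acquiring the Pearson correlation coefficients between the predictions output by the three models;

402) selecting the prediction results whose Pearson correlation coefficient in step 401) is lower than 0.99 and whose prediction accuracies are close and good;

403) fusing the prediction results of the three models with equal weights and outputting the fused result as the final prediction result.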