CN110717601A - Anti-fraud method based on supervised learning and unsupervised learning - Google Patents

Anti-fraud method based on supervised learning and unsupervised learning

Info

Publication number: CN110717601A (application number CN201910977563.9A); granted publication: CN110717601B
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: data, model, formula, gaussian mixture, probability
Inventors: 施铭铮, 刘占辉
Applicant and current assignee: Xiamen Pencil Head Information Technology Co Ltd
Priority and filing date: 2019-10-15
Legal status: Granted; Active

Classifications

    • G06F18/29 Pattern recognition > Analysing > Graphical models, e.g. Bayesian networks
    • G06F18/214 Pattern recognition > Design or setup of recognition systems or techniques > Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N20/00 Computing arrangements based on specific computational models > Machine learning

Abstract

The invention discloses an anti-fraud method based on supervised learning and unsupervised learning. The method establishes an anti-fraud model that combines supervised learning with unsupervised learning: both sub-models are contained in one large model, which achieves the effect of an ensemble, so the prediction results are better than those of supervised or unsupervised learning considered alone. Supervised learning and unsupervised learning are complementary, and combining the two learning modes can detect both known fraud patterns and unknown fraud patterns.

Description

Anti-fraud method based on supervised learning and unsupervised learning
Technical Field
The invention discloses an anti-fraud method based on supervised learning and unsupervised learning, and belongs to the technical field of anti-fraud.
Background
Among the large numbers of orders and transactions in the financial field there are normal data and a small proportion of abnormal, fraudulent data. Detecting fraudulent transactions among a large volume of real-time financial transactions and handling them promptly is important work, and machine learning methods are already widely applied in anti-fraud models. Machine learning algorithms can be roughly divided into supervised learning methods and unsupervised learning methods, and each has its own advantages and disadvantages in anti-fraud applications.
Supervised learning algorithms depend on historical data, and the historical data must be labeled; that is, each historical transaction must be marked as a normal transaction or an abnormal (fraudulent) transaction. A supervised learning algorithm learns from the labeled historical data and stores the learning result in a model; when a new transaction is generated, the trained supervised model can predict it and calculate the probability that it belongs to the abnormal class. The problem with supervised learning is that it learns the rules and patterns present in the good and bad samples of the historical data and classifies newly generated transactions based on those patterns. In other words, supervised learning can only match patterns that already exist in the historical data, so when a new fraud pattern appears, supervised learning cannot detect it, because that pattern was never learned from the historical data.
The weakness of supervised learning can be compensated for by unsupervised learning, which does not need labeled historical data but directly analyzes the distribution of the data. One assumption of unsupervised learning is that when an outlier occurs in the distribution of transaction data, the outlier belongs to a fraudulent transaction. Unsupervised learning is therefore useful for discovering unknown fraud patterns, because it does not rely on annotated historical data.
In anti-fraud applications, then, supervised learning and unsupervised learning are complementary: supervised learning is used to discover existing fraud patterns, and unsupervised learning is used to detect new fraud patterns;
therefore, the invention provides an anti-fraud method based on supervised learning and unsupervised learning.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides an anti-fraud method based on supervised learning and unsupervised learning. An anti-fraud model is established by combining supervised learning with unsupervised learning: both sub-models are contained in one large model, which achieves the effect of an ensemble, so the prediction results are better than those of supervised or unsupervised learning considered alone. Supervised learning and unsupervised learning are complementary, and combining the two learning modes can detect both known fraud patterns and unknown fraud patterns.
In order to achieve the purpose, the invention provides the following technical scheme: an anti-fraud method based on supervised learning and unsupervised learning comprises the following specific steps:
Step one: data preprocessing; the input data must be numerical, so non-numerical data needs to be converted into numerical data first, and then all the input data can be standardized, for example with the StandardScaler of scikit-learn;
Step two: data conversion; in general the features of the input data are correlated with one another, so the input data needs to be converted with the principal component analysis (PCA) method; the columns of the converted data are uncorrelated, and the converted data is still denoted X;
Step three: creating a Gaussian mixture model; the invention aims to establish a corresponding Gaussian mixture model for each column of the converted data, the Gaussian mixture model being written as:

$$p(x_i \mid \theta) = \sum_{j=1}^{k} \omega_j\, \varphi(x_i \mid \mu_j, \sigma_j) \tag{1}$$

wherein $x_i$ is the $i$-th row of the data and

$$\theta = (\omega_1, \dots, \omega_k,\ \mu_1, \dots, \mu_k,\ \sigma_1, \dots, \sigma_k)$$

are the parameters of the model. The Gaussian mixture model contains $k$ components, each component being an independent Gaussian univariate distribution; because each Gaussian mixture model covers only one column of the data set X, its components only need Gaussian univariate distributions, that is:

$$\varphi(x \mid \mu_j, \sigma_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\!\left( -\frac{(x - \mu_j)^2}{2\sigma_j^2} \right)$$

Furthermore,

$$\sum_{j=1}^{k} \omega_j = 1, \qquad \omega_j \geq 0$$

Each $\omega_j$ is the weight of the $j$-th component in the mixture model. Now, to each data point $x_i$ add a latent variable $z_i$ with value range $\{1, 2, \dots, k\}$; $z_i = 2$ means that data point $x_i$ was generated by the 2nd component, and $z_i = j$ means that $x_i$ was generated by the $j$-th component. Then for any $\theta$:

$$p(x_i, z_i = j \mid \theta) = \omega_j\, \varphi(x_i \mid \mu_j, \sigma_j) \tag{2}$$

From formula (2) and the Bayesian formula one obtains:

$$p(z_i = j \mid x_i, \theta) = \frac{\omega_j\, \varphi(x_i \mid \mu_j, \sigma_j)}{\sum_{l=1}^{k} \omega_l\, \varphi(x_i \mid \mu_l, \sigma_l)}$$

This formula is the probability that data point $x_i$ was generated by the $j$-th component; define $\gamma_{ij} = p(z_i = j \mid x_i, \theta)$;
Step four: the expectation-maximization algorithm; after the model is determined, its parameters need to be estimated. This step uses maximum likelihood parameter estimation, customarily with the log-likelihood function:

$$L(\theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta) \tag{3}$$

According to the total probability formula:

$$p(x_i \mid \theta) = \sum_{j=1}^{k} p(x_i, z_i = j \mid \theta)$$

formula (3) can be written as:

$$L(\theta) = \sum_{i=1}^{N} \log \sum_{j=1}^{k} \omega_j\, \varphi(x_i \mid \mu_j, \sigma_j) \tag{4}$$

Formula (4) contains the logarithm of a sum, so its maximum likelihood estimate has no analytic solution and can only be found iteratively. Here the expectation-maximization (EM) algorithm is adopted to maximize the objective function $L(\theta)$. The EM algorithm is an iterative algorithm consisting mainly of an E step (expectation) and an M step (maximization). By derivation from (4), the problem becomes the following maximization:

$$\theta^{(m+1)} = \arg\max_{\theta}\, E\big[\log P(x, z \mid \theta) \mid x, \theta^{(m)}\big] \tag{5}$$

wherein $\theta^{(m)}$ is the parameter estimate obtained after the $m$-th iteration, and formula (5) yields the parameter value $\theta^{(m+1)}$ after the $(m+1)$-th iteration. Define the Q function:

$$Q(\theta \mid \theta^{(m)}) = E\big[\log P(x, z \mid \theta) \mid x, \theta^{(m)}\big]$$

Then the Q function of the $i$-th data point $x_i$ is:

$$Q_i(\theta \mid \theta^{(m)}) = \sum_{j=1}^{k} \gamma_{ij} \log\big(\omega_j\, \varphi(x_i \mid \mu_j, \sigma_j)\big)$$

This derivation uses $\gamma_{ij}$ and formula (2), so the Q function for all data points is:

$$Q(\theta \mid \theta^{(m)}) = \sum_{i=1}^{N} \sum_{j=1}^{k} \gamma_{ij} \big(\log \omega_j + \log \varphi(x_i \mid \mu_j, \sigma_j)\big) \tag{6}$$

After deriving the formula of the Q function, the main steps of the EM algorithm can be listed as follows:
① Initialization: before iteration begins, set the value of the parameter m to 0; the parameter m can be regarded as the number of iterations. Initial values must be set for the model parameters:

$$\theta^{(0)} = \big(\omega_1^{(0)}, \dots, \omega_k^{(0)},\ \mu_1^{(0)}, \dots, \mu_k^{(0)},\ \sigma_1^{(0)}, \dots, \sigma_k^{(0)}\big)$$

wherein $\theta^{(0)}$ denotes the model parameter values after the 0th iteration;

② E step: according to the current model parameter values $\theta^{(m)}$, it is necessary to calculate the responsivity of component $j$ for data point $x_i$:

$$\gamma_{ij} = \frac{\omega_j^{(m)}\, \varphi(x_i \mid \mu_j^{(m)}, \sigma_j^{(m)})}{\sum_{l=1}^{k} \omega_l^{(m)}\, \varphi(x_i \mid \mu_l^{(m)}, \sigma_l^{(m)})}$$

This formula has already been given in step three;
③ M step: according to the definition of the Q function, each iteration must find the value of the parameter $\theta$ that maximizes the value of the Q function, that is:

$$\theta^{(m+1)} = \arg\max_{\theta}\, Q(\theta \mid \theta^{(m)})$$

In other words, in the M step it is necessary to find the parameter $\theta$ that maximizes $Q(\theta \mid \theta^{(m)})$. To obtain the optimal $\mu_j$, the partial derivative with respect to $\mu_j$ is taken and set to zero:

$$\frac{\partial Q(\theta \mid \theta^{(m)})}{\partial \mu_j} = 0$$

Substituting formula (6) into the above equation and solving yields:

$$\mu_j^{(m+1)} = \frac{\sum_{i=1}^{N} \gamma_{ij}\, x_i}{\sum_{i=1}^{N} \gamma_{ij}}$$

Similarly, the optimal $\sigma_j$ requires calculating the following formula:

$$\frac{\partial Q(\theta \mid \theta^{(m)})}{\partial \sigma_j} = 0$$

Substituting formula (6) and solving yields:

$$\big(\sigma_j^{(m+1)}\big)^2 = \frac{\sum_{i=1}^{N} \gamma_{ij}\, \big(x_i - \mu_j^{(m+1)}\big)^2}{\sum_{i=1}^{N} \gamma_{ij}}$$

The formula for the optimal $\omega_j$ is:

$$\omega_j^{(m+1)} = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ij}$$

After the M step is finished, the new post-iteration parameter values are obtained:

$$\theta^{(m+1)} = \big(\omega_1^{(m+1)}, \dots, \omega_k^{(m+1)},\ \mu_1^{(m+1)}, \dots, \mu_k^{(m+1)},\ \sigma_1^{(m+1)}, \dots, \sigma_k^{(m+1)}\big)$$
④ Repeat ② and ③, i.e. alternately calculate the E step and the M step; note that after each M step the parameter m is incremented by 1 (m = m + 1), and the loop ends when the iteration converges to a preset threshold;
when ① to ④ have been executed, all parameters of the Gaussian mixture model have been calculated, and the training of one Gaussian mixture model is finished;
Step five: supervised learning; suppose the historical data needed for supervised learning is a matrix X and that each row of X has been labeled, normal transactions being recorded as 0 and abnormal transactions as 1, with the label information stored in one column of X. According to the label information, X is split into two matrices $X_0$ and $X_1$: $X_0$ contains all normal transactions and $X_1$ contains all abnormal transactions. Assuming matrix X has n columns, $X_0$ and $X_1$ also each have n columns. Now $X_0$ and $X_1$ are each divided into n single-column matrices, denoted $X_{01}, X_{02}, \dots, X_{0n}$ and $X_{11}, X_{12}, \dots, X_{1n}$ respectively, so the matrix X is split into 2n single-column matrices. For each single-column matrix $X_{01}, X_{02}, \dots, X_{0n}, X_{11}, X_{12}, \dots, X_{1n}$, an independent Gaussian mixture model must be created and trained with the EM algorithm, so 2n models need to be created in total. The pseudo code is as follows:
Algorithm 1: supervised learning Gaussian mixture models.
[Pseudo code figure: rows 2-3 form two nested loops over the label (0, 1) and the n columns; row 4 creates a Gaussian mixture model and row 5 trains it on the single-column matrix $X_{ij}$; the trained models are stored in the array models.]
In Algorithm 1, rows 2-3 form two loops that execute 2n times in total; each pass creates a Gaussian mixture model in row 4 (corresponding to step three) and trains it in row 5 (corresponding to step four), the training data being the single-column matrix $X_{ij}$. Algorithm 1 therefore creates and trains 2n Gaussian mixture models, looping through step three and step four 2n times, and the trained models are stored in the array models;
Step six: unsupervised learning; suppose there is some historical data that has not been labeled, denoted X′. Because the data is unlabeled, there is no way to distinguish good samples from bad samples. Assuming matrix X′ has m columns, X′ is now split into m single-column matrices $X'_1, X'_2, \dots, X'_m$. As in step five, a Gaussian mixture model is created for each single-column matrix, so step six has m models in total. The pseudo code is as follows:
Algorithm 2: unsupervised learning Gaussian mixture models.
[Pseudo code figure: a single loop over the m columns; each pass creates a Gaussian mixture model and trains it on the single-column matrix $X'_j$; the trained models are stored in the array models2.]
Algorithm 2 is similar to Algorithm 1: it has a single loop that executes m times, and each pass executes step three and step four, creating and training one Gaussian mixture model. There are thus m models in total, and the trained models are placed in the array models2;
Step seven: prediction; first consider the prediction of supervised learning. When new transaction data $X^{(new)}$ is generated, its class must be predicted; that is, given the data, the posterior probability $P(y \mid X^{(new)})$ must be calculated, where $y \in \{0, 1\}$ is the class, $y = 0$ a good sample and $y = 1$ a bad sample. According to Bayes' theorem:

$$P(y \mid X^{(new)}) = \frac{P(X^{(new)} \mid y)\, P(y)}{P(X^{(new)})} \tag{7}$$

Therefore, to calculate the posterior probability, the prior probability $P(y)$, the conditional probability $P(X^{(new)} \mid y)$, and $P(X^{(new)})$ must be calculated. The prior probability $P(y)$ can be given by expert experience or calculated as the proportion of good and bad samples, and naive Bayes can be used when calculating the conditional probability. In step five the data has n columns; here it is assumed that the new data $X^{(new)}$ is consistent with the training data and also has n columns, written

$$X^{(new)} = \big(X_1^{(new)}, X_2^{(new)}, \dots, X_n^{(new)}\big)$$

Because the data was converted in step two and the columns of the converted data are independent, the naive Bayes formula can be applied:

$$P(X^{(new)} \mid y) = \prod_{i=1}^{n} P(X_i^{(new)} \mid y)$$

Formula (7) then becomes:

$$P(y \mid X^{(new)}) = \frac{P(y)\, \prod_{i=1}^{n} P(X_i^{(new)} \mid y)}{P(X^{(new)})} \tag{8}$$

wherein the conditional probabilities $P(X_i^{(new)} \mid y)$, 2n in total, can be calculated with the 2n models trained in step five, and $P(X^{(new)})$ can be calculated with the total probability formula. The posterior probability calculated by formula (8) is the prediction of supervised learning. Because the data in unsupervised learning is not labeled with good or bad samples, formula (8) cannot be used; only the naive Bayes formula can be applied:

$$P(X^{(new2)}) = \prod_{i=1}^{m} P(X_i^{(new2)})$$

Note that in this formula $i = 1, 2, \dots, m$, since step six assumes that the unsupervised training data has m columns. Because the Gaussian mixture model is a generative model, $P(X^{(new2)})$ can be treated as the probability that the new data $X^{(new2)}$ was generated by the Gaussian mixture model; when the probability calculated for some new data is very low, these data points can be taken as outliers, i.e. abnormal transactions.
In one embodiment: in step seven, after the prediction results of supervised learning and unsupervised learning are obtained, a rule model, a linear model, or an ensemble model can be established to combine the two prediction results into the final output result.
To achieve the above object, the present invention further provides an anti-fraud computer apparatus based on supervised learning and unsupervised learning, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of any of the preceding claims 1 to 2 when executing the computer program.
To achieve the above object, the present invention also provides an anti-fraud computer device storage medium based on supervised and unsupervised learning, having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of any of the preceding claims 1 to 2.
After the above technical scheme is adopted, on the one hand, an anti-fraud model is established by combining supervised learning with unsupervised learning; the supervised model and the unsupervised model are contained in one large model, which achieves the effect of an ensemble, so the prediction results are better than those of supervised or unsupervised learning considered alone. Supervised learning and unsupervised learning are complementary, and combining the two learning modes can detect both known fraud patterns and unknown fraud patterns;
on the other hand, step four derives the concrete calculation formulas of the expectation-maximization (EM) algorithm, and as step four shows, the final formulas are easy to implement in a programming language. Once implemented, the algorithm can be improved for specific application scenarios. For example, each iteration of steps three and four requires the full data set as input, so when the data volume is very large, one iteration takes a long time; the existing EM algorithm can then be improved so that each iteration takes only one batch of data as input, which greatly improves calculation speed while preserving calculation accuracy.
Drawings
FIG. 1 is a flow chart showing the specific steps of an anti-fraud method based on supervised learning and unsupervised learning according to the present invention;
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the present invention provides an anti-fraud method based on supervised learning and unsupervised learning, which includes the following specific steps:
Step one: data preprocessing; the input data must be numerical, so non-numerical data needs to be converted into numerical data first, and then all the input data can be standardized, for example with the StandardScaler of scikit-learn, as sketched below;
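As an illustration of step one, the following minimal sketch (an assumed implementation, not code from the patent; the table and column names are hypothetical) converts a non-numerical column into numerical dummy columns and then standardizes everything with scikit-learn's StandardScaler:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical transaction table; "channel" is a non-numerical feature.
df = pd.DataFrame({
    "amount":  [120.0, 8000.0, 35.5, 560.0],
    "channel": ["web", "app", "web", "pos"],
})

# Convert the non-numerical column into numerical dummy columns.
df = pd.get_dummies(df, columns=["channel"], dtype=float)

# Standardize every column to zero mean and unit variance.
X = StandardScaler().fit_transform(df.values)
print(X.shape)  # (4, 4): amount plus three channel dummies
```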
Step two: data conversion; in general the features of the input data are correlated with one another, so the input data needs to be converted with the principal component analysis (PCA) method; the columns of the converted data are uncorrelated, and the converted data is still denoted X;
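Continuing the sketch (again an assumption rather than the patent's own code), step two decorrelates the standardized columns with PCA; the columns of the transformed matrix are mutually uncorrelated:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(500, 4)))

pca = PCA()                 # keep all components
X_t = pca.fit_transform(X)  # columns of X_t are mutually uncorrelated

# Empirical check: the off-diagonal covariances are ~0.
print(np.round(np.cov(X_t, rowvar=False), 3))
```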
Step three: creating a Gaussian mixture model; the invention aims to establish a corresponding Gaussian mixture model for each column of the converted data, the Gaussian mixture model being written as:

$$p(x_i \mid \theta) = \sum_{j=1}^{k} \omega_j\, \varphi(x_i \mid \mu_j, \sigma_j) \tag{1}$$

wherein $x_i$ is the $i$-th row of the data and

$$\theta = (\omega_1, \dots, \omega_k,\ \mu_1, \dots, \mu_k,\ \sigma_1, \dots, \sigma_k)$$

are the parameters of the model. The Gaussian mixture model contains $k$ components, each component being an independent Gaussian univariate distribution; because each Gaussian mixture model covers only one column of the data set X, its components only need Gaussian univariate distributions, that is:

$$\varphi(x \mid \mu_j, \sigma_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\!\left( -\frac{(x - \mu_j)^2}{2\sigma_j^2} \right)$$

Furthermore,

$$\sum_{j=1}^{k} \omega_j = 1, \qquad \omega_j \geq 0$$

Each $\omega_j$ is the weight of the $j$-th component in the mixture model. Now, to each data point $x_i$ add a latent variable $z_i$ with value range $\{1, 2, \dots, k\}$; $z_i = 2$ means that data point $x_i$ was generated by the 2nd component, and $z_i = j$ means that $x_i$ was generated by the $j$-th component. Then for any $\theta$:

$$p(x_i, z_i = j \mid \theta) = \omega_j\, \varphi(x_i \mid \mu_j, \sigma_j) \tag{2}$$

From formula (2) and the Bayesian formula one obtains:

$$p(z_i = j \mid x_i, \theta) = \frac{\omega_j\, \varphi(x_i \mid \mu_j, \sigma_j)}{\sum_{l=1}^{k} \omega_l\, \varphi(x_i \mid \mu_l, \sigma_l)}$$

This formula is the probability that data point $x_i$ was generated by the $j$-th component; define $\gamma_{ij} = p(z_i = j \mid x_i, \theta)$;
Step four: the expectation-maximization algorithm; after the model is determined, its parameters need to be estimated. This step uses maximum likelihood parameter estimation, customarily with the log-likelihood function:

$$L(\theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta) \tag{3}$$

According to the total probability formula:

$$p(x_i \mid \theta) = \sum_{j=1}^{k} p(x_i, z_i = j \mid \theta)$$

formula (3) can be written as:

$$L(\theta) = \sum_{i=1}^{N} \log \sum_{j=1}^{k} \omega_j\, \varphi(x_i \mid \mu_j, \sigma_j) \tag{4}$$

Formula (4) contains the logarithm of a sum, so its maximum likelihood estimate has no analytic solution and can only be found iteratively. Here the expectation-maximization (EM) algorithm is adopted to maximize the objective function $L(\theta)$. The EM algorithm is an iterative algorithm consisting mainly of an E step (expectation) and an M step (maximization). By derivation from (4), the problem becomes the following maximization:

$$\theta^{(m+1)} = \arg\max_{\theta}\, E\big[\log P(x, z \mid \theta) \mid x, \theta^{(m)}\big] \tag{5}$$

wherein $\theta^{(m)}$ is the parameter estimate obtained after the $m$-th iteration, and formula (5) yields the parameter value $\theta^{(m+1)}$ after the $(m+1)$-th iteration. Define the Q function:

$$Q(\theta \mid \theta^{(m)}) = E\big[\log P(x, z \mid \theta) \mid x, \theta^{(m)}\big]$$

Then the Q function of the $i$-th data point $x_i$ is:

$$Q_i(\theta \mid \theta^{(m)}) = \sum_{j=1}^{k} \gamma_{ij} \log\big(\omega_j\, \varphi(x_i \mid \mu_j, \sigma_j)\big)$$

This derivation uses $\gamma_{ij}$ and formula (2), so the Q function for all data points is:

$$Q(\theta \mid \theta^{(m)}) = \sum_{i=1}^{N} \sum_{j=1}^{k} \gamma_{ij} \big(\log \omega_j + \log \varphi(x_i \mid \mu_j, \sigma_j)\big) \tag{6}$$

After deriving the formula of the Q function, the main steps of the EM algorithm can be listed as follows:
① Initialization: before iteration begins, set the value of the parameter m to 0; the parameter m can be regarded as the number of iterations. Initial values must be set for the model parameters:

$$\theta^{(0)} = \big(\omega_1^{(0)}, \dots, \omega_k^{(0)},\ \mu_1^{(0)}, \dots, \mu_k^{(0)},\ \sigma_1^{(0)}, \dots, \sigma_k^{(0)}\big)$$

wherein $\theta^{(0)}$ denotes the model parameter values after the 0th iteration;

② E step: according to the current model parameter values $\theta^{(m)}$, it is necessary to calculate the responsivity of component $j$ for data point $x_i$:

$$\gamma_{ij} = \frac{\omega_j^{(m)}\, \varphi(x_i \mid \mu_j^{(m)}, \sigma_j^{(m)})}{\sum_{l=1}^{k} \omega_l^{(m)}\, \varphi(x_i \mid \mu_l^{(m)}, \sigma_l^{(m)})}$$

This formula has already been given in step three;
③ M step: according to the definition of the Q function, each iteration must find the value of the parameter $\theta$ that maximizes the value of the Q function, that is:

$$\theta^{(m+1)} = \arg\max_{\theta}\, Q(\theta \mid \theta^{(m)})$$

In other words, in the M step it is necessary to find the parameter $\theta$ that maximizes $Q(\theta \mid \theta^{(m)})$. To obtain the optimal $\mu_j$, the partial derivative with respect to $\mu_j$ is taken and set to zero:

$$\frac{\partial Q(\theta \mid \theta^{(m)})}{\partial \mu_j} = 0$$

Substituting formula (6) into the above equation and solving yields:

$$\mu_j^{(m+1)} = \frac{\sum_{i=1}^{N} \gamma_{ij}\, x_i}{\sum_{i=1}^{N} \gamma_{ij}}$$

Similarly, the optimal $\sigma_j$ requires calculating the following formula:

$$\frac{\partial Q(\theta \mid \theta^{(m)})}{\partial \sigma_j} = 0$$

Substituting formula (6) and solving yields:

$$\big(\sigma_j^{(m+1)}\big)^2 = \frac{\sum_{i=1}^{N} \gamma_{ij}\, \big(x_i - \mu_j^{(m+1)}\big)^2}{\sum_{i=1}^{N} \gamma_{ij}}$$

The formula for the optimal $\omega_j$ is:

$$\omega_j^{(m+1)} = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ij}$$

After the M step is finished, the new post-iteration parameter values are obtained:

$$\theta^{(m+1)} = \big(\omega_1^{(m+1)}, \dots, \omega_k^{(m+1)},\ \mu_1^{(m+1)}, \dots, \mu_k^{(m+1)},\ \sigma_1^{(m+1)}, \dots, \sigma_k^{(m+1)}\big)$$
④ Repeat ② and ③, i.e. alternately calculate the E step and the M step; note that after each M step the parameter m is incremented by 1 (m = m + 1), and the loop ends when the iteration converges to a preset threshold;
when ① to ④ have been executed, all parameters of the Gaussian mixture model have been calculated, and the training of one Gaussian mixture model is finished;
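To make steps ① to ④ concrete, here is a minimal numpy sketch of the EM iteration for a univariate Gaussian mixture, implementing the formulas of step four directly (a sketch under assumed initialization and stopping details, not the patent's own code):

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=200, tol=1e-6, seed=0):
    """Fit a k-component univariate Gaussian mixture to vector x
    with the EM formulas of step four."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # ① Initialization: random means here; k-means initialization is discussed later.
    mu = rng.choice(x, size=k, replace=False).astype(float)
    sigma = np.full(k, x.std() + 1e-9)
    omega = np.full(k, 1.0 / k)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # ② E step: responsibilities gamma[i, j] = p(z_i = j | x_i, theta).
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) \
               / (np.sqrt(2.0 * np.pi) * sigma)
        weighted = omega * dens                              # shape (n, k)
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # ③ M step: closed-form updates for mu_j, sigma_j, omega_j.
        nj = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / nj
        sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nj) + 1e-9
        omega = nj / n
        # ④ Stop when the log-likelihood improvement falls below tol.
        ll = np.log(weighted.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return omega, mu, sigma

# Toy column: 90% of points around 0, 10% around 8.
x = np.concatenate([np.random.normal(0, 1, 900), np.random.normal(8, 2, 100)])
print(em_gmm_1d(x, k=2))
```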
Step five: supervised learning; suppose the historical data needed for supervised learning is a matrix X and that each row of X has been labeled, normal transactions being recorded as 0 and abnormal transactions as 1, with the label information stored in one column of X. According to the label information, X is split into two matrices $X_0$ and $X_1$: $X_0$ contains all normal transactions and $X_1$ contains all abnormal transactions. Assuming matrix X has n columns, $X_0$ and $X_1$ also each have n columns. Now $X_0$ and $X_1$ are each divided into n single-column matrices, denoted $X_{01}, X_{02}, \dots, X_{0n}$ and $X_{11}, X_{12}, \dots, X_{1n}$ respectively, so the matrix X is split into 2n single-column matrices. For each single-column matrix $X_{01}, X_{02}, \dots, X_{0n}, X_{11}, X_{12}, \dots, X_{1n}$, an independent Gaussian mixture model must be created and trained with the EM algorithm, so 2n models need to be created in total. The pseudo code is as follows:
Algorithm 1: supervised learning Gaussian mixture models.
[Pseudo code figure: rows 2-3 form two nested loops over the label (0, 1) and the n columns; row 4 creates a Gaussian mixture model and row 5 trains it on the single-column matrix $X_{ij}$; the trained models are stored in the array models.]
In Algorithm 1, rows 2-3 form two loops that execute 2n times in total; each pass creates a Gaussian mixture model in row 4 (corresponding to step three) and trains it in row 5 (corresponding to step four), the training data being the single-column matrix $X_{ij}$. Algorithm 1 therefore creates and trains 2n Gaussian mixture models, looping through step three and step four 2n times, and the trained models are stored in the array models;
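Since the pseudo code figure for Algorithm 1 does not reproduce here, the following Python sketch is an assumed reconstruction from the description above (it uses scikit-learn's GaussianMixture in place of steps three and four, and a dictionary keyed by (label, column) instead of the array named in the pseudo code):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_supervised_models(X, y, n_components=3):
    """Algorithm 1 (reconstructed): one GMM per (label, column) pair -> 2n models."""
    models = {}
    for label in (0, 1):                      # outer loop: good/bad samples
        X_label = X[y == label]               # rows of X_0 or X_1
        for j in range(X.shape[1]):           # inner loop: the n columns
            gmm = GaussianMixture(n_components=n_components, random_state=0)
            gmm.fit(X_label[:, j].reshape(-1, 1))   # single-column matrix X_ij
            models[(label, j)] = gmm
    return models

# Toy labeled data: 200 rows, n = 3 columns, ~10% bad samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (rng.random(200) < 0.1).astype(int)
print(len(train_supervised_models(X, y)))  # 2n = 6 trained models
```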
Step six: unsupervised learning; suppose there is some historical data that has not been labeled, denoted X′. Because the data is unlabeled, there is no way to distinguish good samples from bad samples. Assuming matrix X′ has m columns, X′ is now split into m single-column matrices $X'_1, X'_2, \dots, X'_m$. As in step five, a Gaussian mixture model is created for each single-column matrix, so step six has m models in total. The pseudo code is as follows:
Algorithm 2: unsupervised learning Gaussian mixture models.
[Pseudo code figure: a single loop over the m columns; each pass creates a Gaussian mixture model and trains it on the single-column matrix $X'_j$; the trained models are stored in the array models2.]
Algorithm 2 is similar to Algorithm 1: it has a single loop that executes m times, and each pass executes step three and step four, creating and training one Gaussian mixture model. There are thus m models in total, and the trained models are placed in the array models2;
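A matching sketch of Algorithm 2 (same assumptions as the Algorithm 1 sketch) trains one model per column of the unlabeled matrix X′:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_unsupervised_models(X_prime, n_components=3):
    """Algorithm 2 (reconstructed): one GMM per column of unlabeled X' -> m models."""
    models2 = []
    for j in range(X_prime.shape[1]):          # single loop over the m columns
        gmm = GaussianMixture(n_components=n_components, random_state=0)
        gmm.fit(X_prime[:, j].reshape(-1, 1))  # single-column matrix X'_j
        models2.append(gmm)
    return models2

X_prime = np.random.default_rng(1).normal(size=(500, 4))  # m = 4 columns
print(len(train_unsupervised_models(X_prime)))            # 4 trained models
```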
Step seven: prediction; first consider the prediction of supervised learning. When new transaction data $X^{(new)}$ is generated, its class must be predicted; that is, given the data, the posterior probability $P(y \mid X^{(new)})$ must be calculated, where $y \in \{0, 1\}$ is the class, $y = 0$ a good sample and $y = 1$ a bad sample. According to Bayes' theorem:

$$P(y \mid X^{(new)}) = \frac{P(X^{(new)} \mid y)\, P(y)}{P(X^{(new)})} \tag{7}$$

Therefore, to calculate the posterior probability, the prior probability $P(y)$, the conditional probability $P(X^{(new)} \mid y)$, and $P(X^{(new)})$ must be calculated. The prior probability $P(y)$ can be given by expert experience or calculated as the proportion of good and bad samples, and naive Bayes can be used when calculating the conditional probability. In step five the data has n columns; here it is assumed that the new data $X^{(new)}$ is consistent with the training data and also has n columns, written

$$X^{(new)} = \big(X_1^{(new)}, X_2^{(new)}, \dots, X_n^{(new)}\big)$$

Because the data was converted in step two and the columns of the converted data are independent, the naive Bayes formula can be applied:

$$P(X^{(new)} \mid y) = \prod_{i=1}^{n} P(X_i^{(new)} \mid y)$$

Formula (7) then becomes:

$$P(y \mid X^{(new)}) = \frac{P(y)\, \prod_{i=1}^{n} P(X_i^{(new)} \mid y)}{P(X^{(new)})} \tag{8}$$

wherein the conditional probabilities $P(X_i^{(new)} \mid y)$, 2n in total, can be calculated with the 2n models trained in step five, and $P(X^{(new)})$ can be calculated with the total probability formula. The posterior probability calculated by formula (8) is the prediction of supervised learning. Because the data in unsupervised learning is not labeled with good or bad samples, formula (8) cannot be used; only the naive Bayes formula can be applied:

$$P(X^{(new2)}) = \prod_{i=1}^{m} P(X_i^{(new2)})$$

Note that in this formula $i = 1, 2, \dots, m$, since step six assumes that the unsupervised training data has m columns. Because the Gaussian mixture model is a generative model, $P(X^{(new2)})$ can be treated as the probability that the new data $X^{(new2)}$ was generated by the Gaussian mixture model; when the probability calculated for some new data is very low, these data points can be taken as outliers, i.e. abnormal transactions.
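The step-seven computations can be sketched as follows (an assumed implementation of formulas (7) and (8) on top of the models from the sketches above; note that GaussianMixture.score_samples returns log densities, which stand in here for the conditional probabilities, and the prior 0.05 is a hypothetical value):

```python
import numpy as np

def posterior_bad(x_new, models, prior_bad=0.05):
    """Supervised prediction via formula (8): returns P(y = 1 | x_new).
    `models` is the (label, column) dictionary from the Algorithm 1 sketch."""
    log_joint = {}
    for label, prior in ((0, 1.0 - prior_bad), (1, prior_bad)):
        # Naive Bayes: log P(y) + sum of per-column log densities log P(x_i | y).
        log_cond = sum(
            models[(label, j)].score_samples(np.array([[x_new[j]]]))[0]
            for j in range(len(x_new))
        )
        log_joint[label] = np.log(prior) + log_cond
    # Normalize by P(x_new) via the total probability formula (log-sum-exp).
    m = max(log_joint.values())
    z = m + np.log(sum(np.exp(v - m) for v in log_joint.values()))
    return float(np.exp(log_joint[1] - z))

def unsupervised_log_density(x_new2, models2):
    """Unsupervised score: log P(x_new2) as a sum of per-column log densities;
    a very low value marks an outlier, i.e. a suspected abnormal transaction."""
    return float(sum(
        models2[i].score_samples(np.array([[x_new2[i]]]))[0]
        for i in range(len(x_new2))
    ))
```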
In this embodiment: in step seven, after the prediction results of supervised learning and unsupervised learning are obtained, a rule model, a linear model, or an ensemble model can be established to combine the two prediction results into the final output result; one possible rule model is sketched below.
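As one possible synthesis of the two outputs (a hypothetical rule model; the thresholds are illustrative assumptions, not values from the patent):

```python
def final_decision(p_bad, log_density, p_threshold=0.9, density_threshold=-25.0):
    """Rule model: flag a transaction when the supervised posterior is high
    or the unsupervised log density is very low (an outlier)."""
    return (p_bad > p_threshold) or (log_density < density_threshold)
```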
To achieve the above object, the present invention further provides an anti-fraud computer apparatus based on supervised learning and unsupervised learning, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of any of the preceding claims 1 to 2 when executing the computer program.
To achieve the above object, the present invention also provides an anti-fraud computer device storage medium based on supervised and unsupervised learning, having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of any of the preceding claims 1 to 2.
Further, the programming language used in the invention is Python. The principal component analysis in step two may use the decomposition.PCA class of the scikit-learn package, and the Gaussian mixture model in step three may use the mixture.GaussianMixture class of the scikit-learn package. An important parameter of this class is n_components, i.e. how many components the mixture model consists of. n_components is a hyperparameter that must be tuned manually or found by hyperparameter search; setting its value too high causes overfitting and longer calculation time.
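A sketch of searching n_components (one common way to carry out the hyperparameter search mentioned above, assumed here rather than prescribed by the patent, scoring each candidate by held-out log-likelihood):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

x = np.random.default_rng(0).normal(size=(1000, 1))
x_train, x_val = train_test_split(x, test_size=0.2, random_state=0)

best_k, best_score = None, -np.inf
for k in range(1, 8):                                  # candidate component counts
    gmm = GaussianMixture(n_components=k, random_state=0).fit(x_train)
    score = gmm.score(x_val)                           # mean held-out log-likelihood
    if score > best_score:
        best_k, best_score = k, score
print(best_k, round(best_score, 3))
```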
Further, step four is the expectation-maximization algorithm. The model parameter initialization in ① may use random values, but the EM algorithm is sensitive to the setting of initial values, so it is preferable to compute initial parameter values with the k-means algorithm, which may use the cluster.KMeans class of the scikit-learn package. ② and ③ may be implemented according to the formulas listed in step four, and the numpy package may be used to speed up the calculation; of course, ② and ③ may also be implemented with the algorithm built into mixture.GaussianMixture.
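A sketch of the k-means initialization (an assumed way of deriving θ(0) from cluster.KMeans; note that mixture.GaussianMixture also uses init_params='kmeans' by default):

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_initial_params(x, k):
    """Initial (omega, mu, sigma) for the EM loop of step four, from cluster.KMeans."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(x.reshape(-1, 1))
    omega = np.array([(labels == j).mean() for j in range(k)])
    mu = np.array([x[labels == j].mean() for j in range(k)])
    sigma = np.array([x[labels == j].std() + 1e-9 for j in range(k)])
    return omega, mu, sigma

x = np.concatenate([np.random.normal(0, 1, 900), np.random.normal(8, 2, 100)])
print(kmeans_initial_params(x, k=2))
```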
Further, the supervised learning of step five and the unsupervised learning of step six are both loops over step three and step four.
Further, in step seven, the conditional probability of the data under each Gaussian mixture model can be calculated with the score_samples method of the mixture.GaussianMixture class (score_samples returns log probability densities, which can be exponentiated into the probabilities used in the formulas above).
After the above technical scheme is adopted, on the one hand, an anti-fraud model is established by combining supervised learning with unsupervised learning; the supervised model and the unsupervised model are contained in one large model, which achieves the effect of an ensemble, so the prediction results are better than those of supervised or unsupervised learning considered alone. Supervised learning and unsupervised learning are complementary, and combining the two learning modes can detect both known fraud patterns and unknown fraud patterns. On the other hand, step four derives the concrete calculation formulas of the expectation-maximization (EM) algorithm, and as step four shows, the final formulas are easy to implement in a programming language. Once implemented, the algorithm can be improved for specific application scenarios. For example, each iteration of steps three and four requires the full data set as input, so when the data volume is very large, one iteration takes a long time; the existing EM algorithm can then be improved so that each iteration takes only one batch of data as input, which greatly improves calculation speed while preserving calculation accuracy.
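A sketch of the mini-batch improvement described above (the batch size and the running-average step size are illustrative assumptions; each iteration processes one batch instead of the full data set):

```python
import numpy as np

def minibatch_em_gmm_1d(x, k, batch_size=256, n_iter=500, lr=0.05, seed=0):
    """Mini-batch EM for a univariate GMM: per-iteration cost is O(batch_size * k)."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False).astype(float)
    sigma = np.full(k, x.std() + 1e-9)
    omega = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        xb = rng.choice(x, size=batch_size, replace=False)   # one batch per iteration
        # E step on the batch only.
        dens = np.exp(-0.5 * ((xb[:, None] - mu) / sigma) ** 2) \
               / (np.sqrt(2.0 * np.pi) * sigma)
        gamma = (omega * dens) / (omega * dens).sum(axis=1, keepdims=True)
        nj = gamma.sum(axis=0)
        # M step on the batch, blended into the running parameters with step size lr.
        mu = (1 - lr) * mu + lr * (gamma * xb[:, None]).sum(axis=0) / nj
        sigma = (1 - lr) * sigma \
                + lr * (np.sqrt((gamma * (xb[:, None] - mu) ** 2).sum(axis=0) / nj) + 1e-9)
        omega = (1 - lr) * omega + lr * nj / batch_size      # stays normalized
    return omega, mu, sigma

x = np.concatenate([np.random.normal(0, 1, 90000), np.random.normal(8, 2, 10000)])
print(minibatch_em_gmm_1d(x, k=2))
```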
Furthermore, it should be understood that although this description refers to embodiments, not every embodiment contains only a single independent technical solution. This manner of description is adopted for clarity only; those skilled in the art should treat the description as a whole, and the embodiments may be combined as appropriate to form other embodiments understandable to those skilled in the art.

Claims (7)

1. An anti-fraud method based on supervised learning and unsupervised learning is characterized by comprising the following specific steps:
Step one: data preprocessing; the input data must be numerical, so non-numerical data needs to be converted into numerical data first, and then all the input data can be standardized, for example with the StandardScaler of scikit-learn;
Step two: data conversion; in general the features of the input data are correlated with one another, so the input data needs to be converted with the principal component analysis (PCA) method; the columns of the converted data are uncorrelated, and the converted data is still denoted X;
Step three: creating a Gaussian mixture model; a corresponding Gaussian mixture model is established for each column of the converted data, the Gaussian mixture model being written as:

$$p(x_i \mid \theta) = \sum_{j=1}^{k} \omega_j\, \varphi(x_i \mid \mu_j, \sigma_j) \tag{1}$$

wherein $x_i$ is the $i$-th row of the data and

$$\theta = (\omega_1, \dots, \omega_k,\ \mu_1, \dots, \mu_k,\ \sigma_1, \dots, \sigma_k)$$

are the parameters of the model. The Gaussian mixture model contains $k$ components, each component being an independent Gaussian univariate distribution; because each Gaussian mixture model covers only one column of the data set X, its components only need Gaussian univariate distributions, that is:

$$\varphi(x \mid \mu_j, \sigma_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\!\left( -\frac{(x - \mu_j)^2}{2\sigma_j^2} \right)$$

Furthermore,

$$\sum_{j=1}^{k} \omega_j = 1, \qquad \omega_j \geq 0$$

Each $\omega_j$ is the weight of the $j$-th component in the mixture model. Now, to each data point $x_i$ add a latent variable $z_i$ with value range $\{1, 2, \dots, k\}$; $z_i = 2$ means that data point $x_i$ was generated by the 2nd component, and $z_i = j$ means that $x_i$ was generated by the $j$-th component. Then for any $\theta$:

$$p(x_i, z_i = j \mid \theta) = \omega_j\, \varphi(x_i \mid \mu_j, \sigma_j) \tag{2}$$

From formula (2) and the Bayesian formula one obtains:

$$p(z_i = j \mid x_i, \theta) = \frac{\omega_j\, \varphi(x_i \mid \mu_j, \sigma_j)}{\sum_{l=1}^{k} \omega_l\, \varphi(x_i \mid \mu_l, \sigma_l)}$$

This formula is the probability that data point $x_i$ was generated by the $j$-th component; define $\gamma_{ij} = p(z_i = j \mid x_i, \theta)$;
Step four: the expectation-maximization algorithm; after the model is determined, its parameters need to be estimated. This step uses maximum likelihood parameter estimation, customarily with the log-likelihood function:

$$L(\theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta) \tag{3}$$

According to the total probability formula:

$$p(x_i \mid \theta) = \sum_{j=1}^{k} p(x_i, z_i = j \mid \theta)$$

formula (3) can be written as:

$$L(\theta) = \sum_{i=1}^{N} \log \sum_{j=1}^{k} \omega_j\, \varphi(x_i \mid \mu_j, \sigma_j) \tag{4}$$

Formula (4) contains the logarithm of a sum, so its maximum likelihood estimate has no analytic solution and can only be found iteratively. Here the expectation-maximization (EM) algorithm is adopted to maximize the objective function $L(\theta)$. The EM algorithm is an iterative algorithm consisting mainly of an E step (expectation) and an M step (maximization). By derivation from (4), the problem becomes the following maximization:

$$\theta^{(m+1)} = \arg\max_{\theta}\, E\big[\log P(x, z \mid \theta) \mid x, \theta^{(m)}\big] \tag{5}$$

wherein $\theta^{(m)}$ is the parameter estimate obtained after the $m$-th iteration, and formula (5) yields the parameter value $\theta^{(m+1)}$ after the $(m+1)$-th iteration. Define the Q function:

$$Q(\theta \mid \theta^{(m)}) = E\big[\log P(x, z \mid \theta) \mid x, \theta^{(m)}\big]$$

Then the Q function of the $i$-th data point $x_i$ is:

$$Q_i(\theta \mid \theta^{(m)}) = \sum_{j=1}^{k} \gamma_{ij} \log\big(\omega_j\, \varphi(x_i \mid \mu_j, \sigma_j)\big)$$

This derivation uses $\gamma_{ij}$ and formula (2), so the Q function for all data points is:

$$Q(\theta \mid \theta^{(m)}) = \sum_{i=1}^{N} \sum_{j=1}^{k} \gamma_{ij} \big(\log \omega_j + \log \varphi(x_i \mid \mu_j, \sigma_j)\big) \tag{6}$$
after deriving the formula of the Q function, the main steps of the EM algorithm can be listed as follows:
① Initialization: before iteration begins, set the value of the parameter m to 0; the parameter m can be regarded as the number of iterations. Initial values must be set for the model parameters:

$$\theta^{(0)} = \big(\omega_1^{(0)}, \dots, \omega_k^{(0)},\ \mu_1^{(0)}, \dots, \mu_k^{(0)},\ \sigma_1^{(0)}, \dots, \sigma_k^{(0)}\big)$$

wherein $\theta^{(0)}$ denotes the model parameter values after the 0th iteration;
② E step: according to the current model parameter values $\theta^{(m)}$, it is necessary to calculate the responsivity of component $j$ for data point $x_i$:

$$\gamma_{ij} = \frac{\omega_j^{(m)}\, \varphi(x_i \mid \mu_j^{(m)}, \sigma_j^{(m)})}{\sum_{l=1}^{k} \omega_l^{(m)}\, \varphi(x_i \mid \mu_l^{(m)}, \sigma_l^{(m)})}$$

This formula has already been given in step three;
③ M step: according to the definition of the Q function, each iteration must find the value of the parameter $\theta$ that maximizes the value of the Q function, that is:

$$\theta^{(m+1)} = \arg\max_{\theta}\, Q(\theta \mid \theta^{(m)})$$

In other words, in the M step it is necessary to find the parameter $\theta$ that maximizes $Q(\theta \mid \theta^{(m)})$. To obtain the optimal $\mu_j$, the partial derivative with respect to $\mu_j$ is taken and set to zero:

$$\frac{\partial Q(\theta \mid \theta^{(m)})}{\partial \mu_j} = 0$$

Substituting formula (6) into the above equation and solving yields:

$$\mu_j^{(m+1)} = \frac{\sum_{i=1}^{N} \gamma_{ij}\, x_i}{\sum_{i=1}^{N} \gamma_{ij}}$$

Similarly, the optimal $\sigma_j$ requires calculating the following formula:

$$\frac{\partial Q(\theta \mid \theta^{(m)})}{\partial \sigma_j} = 0$$

Substituting formula (6) and solving yields:

$$\big(\sigma_j^{(m+1)}\big)^2 = \frac{\sum_{i=1}^{N} \gamma_{ij}\, \big(x_i - \mu_j^{(m+1)}\big)^2}{\sum_{i=1}^{N} \gamma_{ij}}$$

The formula for the optimal $\omega_j$ is:

$$\omega_j^{(m+1)} = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ij}$$

After the M step is finished, the new post-iteration parameter values are obtained:

$$\theta^{(m+1)} = \big(\omega_1^{(m+1)}, \dots, \omega_k^{(m+1)},\ \mu_1^{(m+1)}, \dots, \mu_k^{(m+1)},\ \sigma_1^{(m+1)}, \dots, \sigma_k^{(m+1)}\big)$$
④ Repeat ② and ③, i.e. alternately calculate the E step and the M step; note that after each M step the parameter m is incremented by 1 (m = m + 1), and the loop ends when the iteration converges to a preset threshold;
when ① to ④ have been executed, all parameters of the Gaussian mixture model have been calculated, and the training of one Gaussian mixture model is finished;
Step five: supervised learning; suppose the historical data needed for supervised learning is a matrix X and that each row of X has been labeled, normal transactions being recorded as 0 and abnormal transactions as 1, with the label information stored in one column of X. According to the label information, X is split into two matrices $X_0$ and $X_1$: $X_0$ contains all normal transactions and $X_1$ contains all abnormal transactions. Assuming matrix X has n columns, $X_0$ and $X_1$ also each have n columns. Now $X_0$ and $X_1$ are each divided into n single-column matrices, denoted $X_{01}, X_{02}, \dots, X_{0n}$ and $X_{11}, X_{12}, \dots, X_{1n}$ respectively, so the matrix X is split into 2n single-column matrices. For each single-column matrix $X_{01}, X_{02}, \dots, X_{0n}, X_{11}, X_{12}, \dots, X_{1n}$, an independent Gaussian mixture model must be created and trained with the EM algorithm, so 2n models need to be created in total. The pseudo code is as follows:
Algorithm 1: supervised learning Gaussian mixture models.
[Pseudo code figure: rows 2-3 form two nested loops over the label (0, 1) and the n columns; row 4 creates a Gaussian mixture model and row 5 trains it on the single-column matrix $X_{ij}$; the trained models are stored in the array models.]
In Algorithm 1, rows 2-3 form two loops that execute 2n times in total; each pass creates a Gaussian mixture model in row 4 (corresponding to step three) and trains it in row 5 (corresponding to step four), the training data being the single-column matrix $X_{ij}$. Algorithm 1 therefore creates and trains 2n Gaussian mixture models, looping through step three and step four 2n times, and the trained models are stored in the array models;
Step six: unsupervised learning; suppose there is some historical data that has not been labeled, denoted X′. Because the data is unlabeled, there is no way to distinguish good samples from bad samples. Assuming matrix X′ has m columns, X′ is now split into m single-column matrices $X'_1, X'_2, \dots, X'_m$. As in step five, a Gaussian mixture model is created for each single-column matrix, so step six has m models in total. The pseudo code is as follows:
Algorithm 2: unsupervised learning Gaussian mixture models.
[Pseudo code figure: a single loop over the m columns; each pass creates a Gaussian mixture model and trains it on the single-column matrix $X'_j$; the trained models are stored in the array models2.]
Algorithm 2 is similar to Algorithm 1: it has a single loop that executes m times, and each pass executes step three and step four, creating and training one Gaussian mixture model. There are thus m models in total, and the trained models are placed in the array models2;
Step seven: prediction; first consider the prediction of supervised learning. When new transaction data $X^{(new)}$ is generated, its class must be predicted; that is, given the data, the posterior probability $P(y \mid X^{(new)})$ must be calculated, where $y \in \{0, 1\}$ is the class, $y = 0$ a good sample and $y = 1$ a bad sample. According to Bayes' theorem:

$$P(y \mid X^{(new)}) = \frac{P(X^{(new)} \mid y)\, P(y)}{P(X^{(new)})} \tag{7}$$

Therefore, to calculate the posterior probability, the prior probability $P(y)$, the conditional probability $P(X^{(new)} \mid y)$, and $P(X^{(new)})$ must be calculated. The prior probability $P(y)$ can be given by expert experience or calculated as the proportion of good and bad samples, and naive Bayes can be used when calculating the conditional probability. In step five the data has n columns; here it is assumed that the new data $X^{(new)}$ is consistent with the training data and also has n columns, written

$$X^{(new)} = \big(X_1^{(new)}, X_2^{(new)}, \dots, X_n^{(new)}\big)$$

Because the data was converted in step two and the columns of the converted data are independent, the naive Bayes formula can be applied:

$$P(X^{(new)} \mid y) = \prod_{i=1}^{n} P(X_i^{(new)} \mid y)$$

Formula (7) then becomes:

$$P(y \mid X^{(new)}) = \frac{P(y)\, \prod_{i=1}^{n} P(X_i^{(new)} \mid y)}{P(X^{(new)})} \tag{8}$$

wherein the conditional probabilities $P(X_i^{(new)} \mid y)$, 2n in total, can be calculated with the 2n models trained in step five, and $P(X^{(new)})$ can be calculated with the total probability formula. The posterior probability calculated by formula (8) is the prediction of supervised learning. Because the data in unsupervised learning is not labeled with good or bad samples, formula (8) cannot be used; only the naive Bayes formula can be applied:

$$P(X^{(new2)}) = \prod_{i=1}^{m} P(X_i^{(new2)})$$

Note that in this formula $i = 1, 2, \dots, m$, since step six assumes that the unsupervised training data has m columns. Because the Gaussian mixture model is a generative model, $P(X^{(new2)})$ can be treated as the probability that the new data $X^{(new2)}$ was generated by the Gaussian mixture model; when the probability calculated for some new data is very low, these data points can be taken as outliers, i.e. abnormal transactions.
2. The anti-fraud method based on supervised learning and unsupervised learning according to claim 1, characterized in that: the prior probability $P(y)$ can be given by expert experience or calculated as the proportion of good and bad samples, and naive Bayes can be used when calculating the conditional probability.

3. The anti-fraud method based on supervised learning and unsupervised learning according to claim 1, characterized in that: in step five the data has n columns, and the new data $X^{(new)} = (X_1^{(new)}, X_2^{(new)}, \dots, X_n^{(new)})$ is assumed to be consistent with the training data and also has n columns; because the data was converted in step two and the columns of the converted data are independent, the naive Bayes formula can be applied:

$$P(X^{(new)} \mid y) = \prod_{i=1}^{n} P(X_i^{(new)} \mid y)$$

so that formula (7) becomes:

$$P(y \mid X^{(new)}) = \frac{P(y)\, \prod_{i=1}^{n} P(X_i^{(new)} \mid y)}{P(X^{(new)})} \tag{8}$$

wherein the conditional probabilities $P(X_i^{(new)} \mid y)$, 2n in total, can be calculated with the 2n models trained in step five, and $P(X^{(new)})$ can be calculated with the total probability formula; the posterior probability calculated by formula (8) is the prediction of supervised learning, and because the data in unsupervised learning is not labeled with good or bad samples, formula (8) cannot be used and only the naive Bayes formula can be applied:

$$P(X^{(new2)}) = \prod_{i=1}^{m} P(X_i^{(new2)})$$

wherein $i = 1, 2, \dots, m$, since step six assumes that the unsupervised training data has m columns.

4. The anti-fraud method based on supervised learning and unsupervised learning according to claim 1, characterized in that: because the Gaussian mixture model is a generative model, $P(X^{(new2)})$ can be treated as the probability that the new data $X^{(new2)}$ was generated by the Gaussian mixture model; when the probability calculated for some new data is very low, these data points can be taken as outliers, i.e. abnormal transactions.
5. An anti-fraud method based on supervised and unsupervised learning according to claim 1, characterized in that: in the seventh step, after the prediction results of supervised learning and unsupervised learning are obtained, a rule model, a linear model or an integrated model can be established to synthesize the prediction results to obtain the final output result.
6. An anti-fraud computer device based on supervised and unsupervised learning, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that: the processor, when executing the computer program, realizes the steps of the method of any of the preceding claims 1 to 2.
7. An anti-fraud computer device storage medium based on supervised and unsupervised learning, having a computer program stored thereon, characterized by: the program when executed by a processor implements the steps of the method of any of claims 1 to 2.
CN201910977563.9A 2019-10-15 2019-10-15 Anti-fraud method based on supervised learning and unsupervised learning Active CN110717601B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910977563.9A CN110717601B (en) 2019-10-15 2019-10-15 Anti-fraud method based on supervised learning and unsupervised learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910977563.9A CN110717601B (en) 2019-10-15 2019-10-15 Anti-fraud method based on supervised learning and unsupervised learning

Publications (2)

Publication Number Publication Date
CN110717601A true CN110717601A (en) 2020-01-21
CN110717601B CN110717601B (en) 2022-05-03

Family

ID=69211695

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910977563.9A Active CN110717601B (en) 2019-10-15 2019-10-15 Anti-fraud method based on supervised learning and unsupervised learning

Country Status (1)

Country Link
CN (1) CN110717601B (en)


Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103745233A (en) * 2014-01-23 2014-04-23 西安电子科技大学 Hyper-spectral image classifying method based on spatial information transfer
CN104280771A (en) * 2014-10-27 2015-01-14 中国石油集团川庆钻探工程有限公司地球物理勘探公司 Three-dimensional seismic data waveform semi-supervised clustering method based on EM algorithm
CN104680015A (en) * 2015-03-02 2015-06-03 华南理工大学 Online soft measurement method for sewage treatment based on quick relevance vector machine
CN104881687A (en) * 2015-06-02 2015-09-02 四川理工学院 Magnetic resonance image classification method based on semi-supervised Gaussian mixed model
CN105160290A (en) * 2015-07-03 2015-12-16 东南大学 Mobile boundary sampling behavior identification method based on improved dense locus
CN106778854A (en) * 2016-12-07 2017-05-31 西安电子科技大学 Activity recognition method based on track and convolutional neural networks feature extraction
US20190124045A1 (en) * 2017-10-24 2019-04-25 Nec Laboratories America, Inc. Density estimation network for unsupervised anomaly detection
US20190108447A1 (en) * 2017-11-30 2019-04-11 Intel Corporation Multifunction perceptrons in machine learning environments
CN108734479A (en) * 2018-04-12 2018-11-02 阿里巴巴集团控股有限公司 Data processing method, device, equipment and the server of Insurance Fraud identification
CN108804784A (en) * 2018-05-25 2018-11-13 江南大学 A kind of instant learning soft-measuring modeling method based on Bayes's gauss hybrid models
CN109165550A (en) * 2018-07-13 2019-01-08 首都师范大学 A kind of multi-modal operation track fast partition method based on unsupervised deep learning
CN109784676A (en) * 2018-12-25 2019-05-21 杨鑫 The study and application method, device and computer readable storage medium of data analysis
CN110071913A (en) * 2019-03-26 2019-07-30 同济大学 A kind of time series method for detecting abnormality based on unsupervised learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郭涛 (Guo Tao): "Research and Application of Medical Insurance Fraud Detection" (医疗保险欺诈检测的研究与应用), China Master's Theses Full-text Database *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113435997A (en) * 2021-06-08 2021-09-24 成都熵焓科技有限公司 Gaussian mixture model bank transaction data simulation generation algorithm based on deep learning
CN113538049A (en) * 2021-07-14 2021-10-22 北京明略软件系统有限公司 Abnormal flow identification system

Also Published As

Publication number Publication date
CN110717601B (en) 2022-05-03

Similar Documents

Publication Title
Zeng et al. SMO-based pruning methods for sparse least squares support vector machines
Tamar et al. Scaling up robust MDPs using function approximation
Tamar et al. Learning the variance of the reward-to-go
US9367818B2 (en) Transductive Lasso for high-dimensional data regression problems
Doppa et al. HC-Search: A learning framework for search-based structured prediction
US10769551B2 (en) Training data set determination
US7421380B2 (en) Gradient learning for probabilistic ARMA time-series models
US11004026B2 (en) Method and apparatus for determining risk management decision-making critical values
Ayodeji et al. Causal augmented ConvNet: A temporal memory dilated convolution model for long-sequence time series prediction
CN110717601B (en) Anti-fraud method based on supervised learning and unsupervised learning
Andrews Rank-based estimation for GARCH processes
CN112734106A (en) Method and device for predicting energy load
Tanha et al. Disagreement-based co-training
Schumacher et al. An introduction to delay-coupled reservoir computing
Zang et al. Semi-supervised and long-tailed object detection with cascadematch
Lee et al. Trajectory of mini-batch momentum: Batch size saturation and convergence in high dimensions
Hong et al. Deep active learning with augmentation-based consistency estimation
Galili et al. Splitting matters: how monotone transformation of predictor variables may improve the predictions of decision tree models
US11556760B2 (en) Learning-based data processing system and model update method
Utkin et al. SurvBeX: An explanation method of the machine learning survival models based on the Beran estimator
Dubiel-Teleszynski et al. Sequential learning and economic benefits from dynamic term structure models
Brandimarte et al. Approximate Dynamic Programming and Reinforcement Learning for Continuous States
Verhofen Markov chain monte carlo methods in financial econometrics
Kamm et al. A novel approach to rating transition modelling via Machine Learning and SDEs on Lie groups
Bukowski et al. SuperNet--An efficient method of neural networks ensembling

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant