CN110717601B - Anti-fraud method based on supervised learning and unsupervised learning - Google Patents

Anti-fraud method based on supervised learning and unsupervised learning

Info

Publication number
CN110717601B
CN110717601B (application CN201910977563.9A)
Authority
CN
China
Prior art keywords
data
model
formula
gaussian mixture
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910977563.9A
Other languages
Chinese (zh)
Other versions
CN110717601A (en)
Inventor
施铭铮
刘占辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Qianbitou Information Technology Co ltd
Original Assignee
Xiamen Qianbitou Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Qianbitou Information Technology Co ltd
Priority to CN201910977563.9A
Publication of CN110717601A
Application granted
Publication of CN110717601B
Status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/29 Graphical models, e.g. Bayesian networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

The invention discloses an anti-fraud method based on supervised learning and unsupervised learning. The method builds an anti-fraud model by combining supervised and unsupervised learning: a supervised learning model and an unsupervised learning model are contained within one large model, which thereby acts as an ensemble whose predictions are better than either supervised or unsupervised learning considered alone. The two learning modes are complementary, and their combination can detect both known and unknown fraud patterns.

Description

Anti-fraud method based on supervised learning and unsupervised learning
Technical Field
The invention discloses an anti-fraud method based on supervised learning and unsupervised learning, and belongs to the technical field of anti-fraud.
Background
Among the large numbers of orders and transactions in the financial field there are normal data and a small amount of abnormal, fraudulent data. Detecting fraudulent transactions among a large volume of real-time financial transactions and handling them promptly is important work, and machine learning methods are already widely applied in anti-fraud models. Machine learning algorithms can be roughly divided into supervised and unsupervised learning methods, each of which has its own advantages and disadvantages in anti-fraud applications.
Supervised learning algorithms depend on historical data, and that data must be labeled: each historical transaction must be marked as a normal transaction or an abnormal (fraudulent) transaction. A supervised learning algorithm learns from the labeled historical data and stores the learning result in a model; when a new transaction is generated, the trained supervised model can predict it and compute the probability that it is abnormal. The problem with supervised learning is that it learns the rules and patterns present in the good and bad samples of the historical data and then classifies newly generated transactions based on those patterns; that is, supervised learning can only match patterns that already exist in the historical data. When a new fraud pattern appears, supervised learning cannot detect it, because it never learned that pattern from the historical data.
This weakness of supervised learning can be compensated by unsupervised learning, which does not require labeled historical data but directly analyzes the distribution of that data. One assumption of unsupervised learning is that an outlier in the distribution of transaction data belongs to a fraudulent transaction. Because unsupervised learning does not rely on annotated historical data, it is useful for discovering unknown fraud patterns.
In anti-fraud applications, therefore, the methods of supervised and unsupervised learning are complementary: supervised learning is used to discover existing fraud patterns, while unsupervised learning is used to detect new ones;
therefore, the invention provides an anti-fraud method based on supervised learning and unsupervised learning.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides an anti-fraud method based on supervised learning and unsupervised learning. An anti-fraud model is established by combining the two: the supervised learning model and the unsupervised learning model are contained in one large model, which achieves the effect of an ensemble whose predictions are better than either supervised or unsupervised learning considered alone. The two learning modes are complementary, and their combination can detect both known and unknown fraud patterns.
In order to achieve the purpose, the invention provides the following technical scheme: an anti-fraud method based on supervised learning and unsupervised learning comprises the following specific steps:
step one: data preprocessing; the input data is numerical, so non-numerical data must first be converted into numerical data; all input data can then be standardized, for example with the StandardScaler of scikit-learn;
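As a concrete illustration, the following is a minimal sketch of this preprocessing step in Python; the toy array and its values are assumptions for illustration, not data from the invention:

```python
# Minimal preprocessing sketch (example data is assumed, not from the patent).
import numpy as np
from sklearn.preprocessing import StandardScaler

X_raw = np.array([[120.0, 1.0],
                  [87.5, 0.0],
                  [5000.0, 1.0]])      # toy numeric transaction features

scaler = StandardScaler()
X_std = scaler.fit_transform(X_raw)   # each column now has zero mean, unit variance
```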
step two: data conversion; in general the features of the input data are correlated, so the input data must be transformed with Principal Component Analysis (PCA); the columns of the transformed data are then uncorrelated, and the transformed data is still denoted X;
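A matching sketch of the PCA transformation, continuing from the assumed X_std above:

```python
# PCA decorrelation sketch; keeping all components preserves dimensionality.
from sklearn.decomposition import PCA

pca = PCA()                   # no dimensionality reduction, only decorrelation
X = pca.fit_transform(X_std)  # columns of X are now mutually uncorrelated
```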
step three: creating a Gaussian mixture model; the invention establishes a corresponding Gaussian mixture model for each column of the transformed data, the Gaussian mixture model being written as:

$$p(x_i \mid \theta) = \sum_{j=1}^{k} \omega_j \, \phi(x_i \mid \theta_j) \qquad (1)$$

where $x_i$ is the $i$-th row of the data and

$$\theta = (\omega_1, \dots, \omega_k;\ \mu_1, \dots, \mu_k;\ \sigma_1, \dots, \sigma_k)$$

are the parameters of the model; the Gaussian mixture model contains $k$ components, each of which is an independent univariate Gaussian distribution. Because each Gaussian mixture model covers only one column of the data set X, its components need only be univariate Gaussians, that is:

$$\phi(x_i \mid \theta_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left(-\frac{(x_i - \mu_j)^2}{2\sigma_j^2}\right), \qquad \theta_j = (\mu_j, \sigma_j)$$

Furthermore,

$$\sum_{j=1}^{k} \omega_j = 1$$

Each variable $\omega_j$ is the weight of the $j$-th component in the mixture model. Now add a latent variable $z_i$ to each data point $x_i$; the range of $z_i$ is $\{1, 2, \dots, k\}$, i.e. $z_i = 2$ denotes that data point $x_i$ was generated by the 2nd component, and $z_i = j$ denotes that data point $x_i$ was generated by the $j$-th component. Then, for any $\theta$:

$$p(z_i = j \mid \theta) = \omega_j \qquad (2)$$

From formula (2) and the Bayes formula one obtains:

$$p(z_i = j \mid x_i, \theta) = \frac{\omega_j \, \phi(x_i \mid \theta_j)}{\sum_{l=1}^{k} \omega_l \, \phi(x_i \mid \theta_l)}$$

The above formula is the probability that data point $x_i$ was generated by the $j$-th component; define $\gamma_{ij} = p(z_i = j \mid x_i, \theta)$;
step four: the expectation maximization algorithm; after the model is determined, its parameters must be estimated. This step uses maximum likelihood parameter estimation, typically with the log-likelihood function:

$$L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta) \qquad (3)$$

According to the total probability formula:

$$p(x_i \mid \theta) = \sum_{j=1}^{k} p(z_i = j \mid \theta) \, p(x_i \mid z_i = j, \theta)$$

formula (3) can be written as:

$$L(\theta) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} \omega_j \, \phi(x_i \mid \theta_j) \qquad (4)$$

Because formula (4) contains the logarithm of a sum, its maximum likelihood estimate has no analytic solution and can only be obtained iteratively; here the maximum of the objective function $L(\theta)$ is computed with the Expectation-Maximization (EM) algorithm. The EM algorithm is an iterative algorithm consisting mainly of an E step (Expectation) and an M step (Maximization); by derivation, (4) becomes the following maximization problem:

$$\theta^{(m+1)} = \arg\max_{\theta} E\left[\log P(x, z \mid \theta) \mid x, \theta^{(m)}\right] \qquad (5)$$

where $\theta^{(m)}$ denotes the parameter estimate obtained after the $m$-th iteration; the parameter value $\theta^{(m+1)}$ after the $(m+1)$-th iteration is obtained by evaluating formula (5). Define the Q function:

$$Q(\theta \mid \theta^{(m)}) = E\left[\log P(x, z \mid \theta) \mid x, \theta^{(m)}\right]$$

Then the Q function of the $i$-th data point $x_i$ is:

$$Q_i(\theta \mid \theta^{(m)}) = \sum_{j=1}^{k} \gamma_{ij} \left[\log \omega_j + \log \phi(x_i \mid \theta_j)\right]$$

This derivation uses $\gamma_{ij}$ and formula (2), so the Q function over all data points is:

$$Q(\theta \mid \theta^{(m)}) = \sum_{i=1}^{n} \sum_{j=1}^{k} \gamma_{ij} \left[\log \omega_j + \log \phi(x_i \mid \theta_j)\right] \qquad (6)$$
after deriving the formula of the Q function, the main steps of the EM algorithm can be listed as follows:
①: initialization; before iteration starts, set the value of the parameter $m$ to 0 ($m$ can be regarded as the iteration count), and set initial model parameter values

$$\theta^{(0)} = (\omega_1^{(0)}, \dots, \omega_k^{(0)};\ \mu_1^{(0)}, \dots, \mu_k^{(0)};\ \sigma_1^{(0)}, \dots, \sigma_k^{(0)})$$

where $\theta^{(0)}$ denotes the model parameter values after the 0-th iteration;

②: E step; given the current model parameter values $\theta^{(m)}$, compute the responsivity of component $j$ for data $x_i$:

$$\gamma_{ij} = \frac{\omega_j^{(m)} \, \phi(x_i \mid \theta_j^{(m)})}{\sum_{l=1}^{k} \omega_l^{(m)} \, \phi(x_i \mid \theta_l^{(m)})}, \qquad i = 1, 2, \dots, n;\ j = 1, 2, \dots, k$$

The above formula was already given in step three;
③: M step; by the definition of the Q function, each iteration must find the value of the parameter $\theta$ that maximizes the value of the Q function, that is:

$$\theta^{(m+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(m)})$$

That is, in the M step the parameter $\theta$ maximizing the function $Q(\theta \mid \theta^{(m)})$ must be found. To obtain the optimal $\mu_j$, take the partial derivative of $Q$ with respect to $\mu_j$ and set it to zero, i.e.:

$$\frac{\partial Q(\theta \mid \theta^{(m)})}{\partial \mu_j} = 0$$

Substituting formula (6) into the above equation and solving yields:

$$\mu_j^{(m+1)} = \frac{\sum_{i=1}^{n} \gamma_{ij} \, x_i}{\sum_{i=1}^{n} \gamma_{ij}}, \qquad j = 1, 2, \dots, k$$

Similarly, the optimal $\sigma_j$ requires:

$$\frac{\partial Q(\theta \mid \theta^{(m)})}{\partial \sigma_j} = 0$$

Substituting formula (6) and solving yields:

$$\sigma_j^{(m+1)} = \sqrt{\frac{\sum_{i=1}^{n} \gamma_{ij} \, (x_i - \mu_j^{(m+1)})^2}{\sum_{i=1}^{n} \gamma_{ij}}}, \qquad j = 1, 2, \dots, k$$

The formula for the optimal $\omega_j$ is:

$$\omega_j^{(m+1)} = \frac{1}{n} \sum_{i=1}^{n} \gamma_{ij}, \qquad j = 1, 2, \dots, k$$

After the M step, the new post-iteration parameter values $\theta^{(m+1)}$ are obtained;
④: loop; repeat the E step and the M step, i.e. compute them alternately; note that after each M step the parameter $m$ is incremented by 1, i.e. $m = m + 1$; the loop ends when the iteration converges to a preset threshold;
After steps ① to ④ have been executed, all parameters of the Gaussian mixture model have been computed, and training of the Gaussian mixture model is complete;
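To make the derivation concrete, the following is a minimal numpy sketch of the EM loop above for one univariate mixture; the function name em_univariate_gmm and the random initialization are illustrative assumptions (as the detailed description notes, k-means initialization is preferable in practice):

```python
# EM sketch for a k-component univariate Gaussian mixture (steps 1-4 above).
import numpy as np

def em_univariate_gmm(x, k, n_iter=200, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    mu = rng.choice(x, size=k, replace=False)      # step 1: initialization
    sigma = np.full(k, x.std() + 1e-6)
    omega = np.full(k, 1.0 / k)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E step: responsibilities gamma[i, j] = p(z_i = j | x_i, theta)
        pdf = (np.exp(-(x[:, None] - mu) ** 2 / (2.0 * sigma ** 2))
               / (np.sqrt(2.0 * np.pi) * sigma))
        weighted = omega * pdf                     # omega_j * phi(x_i | theta_j)
        ll = np.log(weighted.sum(axis=1)).sum()    # log-likelihood, formula (4)
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # M step: the closed-form updates derived above
        nj = gamma.sum(axis=0)                     # effective count per component
        mu = (gamma * x[:, None]).sum(axis=0) / nj
        sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nj)
        omega = nj / n
        if abs(ll - prev_ll) < tol:                # step 4: convergence check
            break
        prev_ll = ll
    return mu, sigma, omega
```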
step five: supervised learning; supervised learning requires annotated historical data. Assume the historical data is a matrix X and that every row of X has been labeled: a normal transaction is labeled 0 and an abnormal transaction 1, the label information being stored in one column of X. According to the label information, X is split into two matrices $X_0$ and $X_1$, where $X_0$ contains all normal transactions and $X_1$ all abnormal transactions. Assuming X has n columns, $X_0$ and $X_1$ also each have n columns. Now split $X_0$ and $X_1$ each into n single-column matrices, denoted $X_{01}, X_{02}, \dots, X_{0n}$ and $X_{11}, X_{12}, \dots, X_{1n}$ respectively; the matrix X is thus split into 2n single-column matrices. For each single-column matrix $X_{01}, \dots, X_{0n}, X_{11}, \dots, X_{1n}$ an independent Gaussian mixture model must now be created and trained with the EM algorithm, so 2n models must be created in total; the pseudo code is as follows:
Algorithm 1: supervised learning of Gaussian mixture models
[Pseudo code of Algorithm 1, shown as an image in the original: rows 2-3 loop over the class index i ∈ {0, 1} and the column index j ∈ {1, …, n}; row 4 creates a Gaussian mixture model for the single-column matrix X_ij (step three); row 5 trains it with the EM algorithm (step four); each trained model is stored in the array models.]
In Algorithm 1, rows 2-3 form two nested loops that run 2n times in total; each pass creates a Gaussian mixture model in row 4 (corresponding to step three) and trains it in row 5 (corresponding to step four), the training data being the single-column matrix $X_{ij}$. Algorithm 1 therefore creates and trains 2n Gaussian mixture models, looping over steps three and four 2n times, and the trained models are stored in the array models;
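A hedged Python sketch of Algorithm 1, assuming the em_univariate_gmm trainer sketched in step four and numpy arrays X0 and X1 holding the good and bad sub-matrices:

```python
# Algorithm 1 sketch: one mixture model per (class, column) pair, 2n in total.
def train_supervised(X0, X1, k=3):
    models = {}
    for label, data in ((0, X0), (1, X1)):   # rows 2-3: the two nested loops
        for j in range(data.shape[1]):
            # row 4 (step three) + row 5 (step four): create and train a GMM
            models[(label, j)] = em_univariate_gmm(data[:, j], k)
    return models                            # the "models" array of the patent
```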
step six: unsupervised learning; assume there is some historical data that has not been labeled, denoted X'. Because the data is unlabeled, good and bad samples cannot be distinguished. Assuming the matrix X' has m columns, split X' into m single-column matrices $X'_1, X'_2, \dots, X'_m$; as in step five, a Gaussian mixture model is created for each single-column matrix, so step six produces m models in total; the pseudo code is as follows:
Algorithm 2: unsupervised learning of Gaussian mixture models
[Pseudo code of Algorithm 2, shown as an image in the original: a single loop over the column index i ∈ {1, …, m}; each pass creates a Gaussian mixture model for X'_i (step three), trains it with the EM algorithm (step four), and stores it in the array models2.]
Algorithm 2 is similar to Algorithm 1 but contains a single loop that runs m times; each pass executes steps three and four, creating and training one Gaussian mixture model, so there are m models in total, and the trained models are stored in the array models2;
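A matching sketch of Algorithm 2 under the same assumptions:

```python
# Algorithm 2 sketch: one mixture model per column of the unlabeled matrix.
def train_unsupervised(X_prime, k=3):
    return [em_univariate_gmm(X_prime[:, j], k)  # steps three and four, m times
            for j in range(X_prime.shape[1])]    # the "models2" array
```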
step seven: prediction; first consider the prediction of supervised learning. When new transaction data $X^{(new)}$ is generated, the class of the new data must be predicted; that is, given the data, the posterior probability $P(y \mid X^{(new)})$ must be computed, where $y \in \{0, 1\}$ is the class, $y = 0$ a good sample and $y = 1$ a bad sample. By Bayes' theorem:

$$P(y \mid X^{(new)}) = \frac{P(X^{(new)} \mid y) \, P(y)}{P(X^{(new)})} \qquad (7)$$
Therefore, to compute the posterior probability, the prior probability $P(y)$, the conditional probability $P(X^{(new)} \mid y)$, and $P(X^{(new)})$ must be computed. The prior probability $P(y)$ can be given by expert experience or computed as the proportion of good or bad samples, and naive Bayes can be used when computing the conditional probability. In step five the data has n columns; assume here that the new data $X^{(new)}$ is consistent with the training data and also has n columns, written

$$X^{(new)} = (x_1^{(new)}, x_2^{(new)}, \dots, x_n^{(new)})$$

Because the data was transformed in step two and the columns of the transformed data are independent, the naive Bayes formula can be applied:

$$P(X^{(new)} \mid y) = \prod_{i=1}^{n} P(x_i^{(new)} \mid y)$$

Formula (7) then becomes:

$$P(y \mid X^{(new)}) = \frac{P(y) \, \prod_{i=1}^{n} P(x_i^{(new)} \mid y)}{P(X^{(new)})} \qquad (8)$$
where the factors $P(x_i^{(new)} \mid y)$, $i = 1, \dots, n$, $y \in \{0, 1\}$ — 2n probabilities in total — can be computed with the 2n models trained in step five, and the probability $P(X^{(new)})$ can be computed with the total probability formula. The posterior probability computed with formula (8) is the predicted value of supervised learning. Since the data in unsupervised learning is not labeled with good or bad samples, it cannot be computed with formula (8); only the naive Bayes formula can be used:
$$P(X^{(new2)}) = \prod_{i=1}^{m} P(x_i^{(new2)})$$

Note that in the above formula $i = 1, 2, \dots, m$, since step six assumes the unsupervised training data has m columns. Because the Gaussian mixture model is a generative model, $P(X^{(new2)})$ can be treated as the probability that the new data $X^{(new2)}$ was generated by the Gaussian mixture models. When the computed probability of some new data is very low, those data points can be taken as outliers, i.e. abnormal transactions.
In one embodiment: in step seven, after the prediction results of supervised and unsupervised learning have been obtained, a rule model, a linear model, or an ensemble model can be established to combine the prediction results into the final output result.
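As one possible combination — a simple rule model with assumed thresholds, not a scheme fixed by the invention:

```python
# A rule model combining both outputs; the thresholds are illustrative only.
def final_decision(p_bad, gen_prob, p_cutoff=0.5, density_cutoff=1e-6):
    known_fraud = p_bad > p_cutoff             # supervised: known pattern
    unknown_fraud = gen_prob < density_cutoff  # unsupervised: outlier / new pattern
    return int(known_fraud or unknown_fraud)
```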
To achieve the above object, the present invention further provides an anti-fraud computer apparatus based on supervised learning and unsupervised learning, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the method described above when executing the computer program.
To achieve the above object, the present invention also provides a storage medium for an anti-fraud computer device based on supervised and unsupervised learning, having stored thereon a computer program which, when executed by a processor, carries out the steps of the method described above.
After the above technical scheme is adopted, on the one hand, an anti-fraud model is established by combining supervised and unsupervised learning: the supervised and unsupervised learning models are contained in one large model, which achieves the effect of an ensemble whose predictions are better than either supervised or unsupervised learning considered alone; the two learning modes are complementary, and combining them can detect both known and unknown fraud patterns;
on the other hand, step four derives the concrete calculation formulas of the expectation-maximization (EM) algorithm, and as step four shows, the final formulas are easy to implement in a programming language. Once the algorithm is implemented it can be improved for specific application scenarios: for example, each iteration of steps three and four requires the full data set as input, so when the data volume is very large one iteration takes a long time; the existing EM algorithm can then be modified so that each iteration takes only one batch of data as input, which greatly improves computation speed while preserving computation accuracy.
Drawings
FIG. 1 is a flow chart showing the specific steps of an anti-fraud method based on supervised learning and unsupervised learning according to the present invention;
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the present invention provides an anti-fraud method based on supervised learning and unsupervised learning, which includes the following specific steps:
step one: data preprocessing; the input data of the invention is numerical, so non-numerical data must first be converted into numerical data; all input data can then be standardized, for example with the StandardScaler of scikit-learn;
step two: data conversion; in general the features of the input data are correlated, so the input data must be transformed with Principal Component Analysis (PCA); the columns of the transformed data are then uncorrelated, and the transformed data is still denoted X;
step three: creating a Gaussian mixture model; the invention establishes a corresponding Gaussian mixture model for each column of the transformed data, the Gaussian mixture model being written as:

$$p(x_i \mid \theta) = \sum_{j=1}^{k} \omega_j \, \phi(x_i \mid \theta_j) \qquad (1)$$

where $x_i$ is the $i$-th row of the data and

$$\theta = (\omega_1, \dots, \omega_k;\ \mu_1, \dots, \mu_k;\ \sigma_1, \dots, \sigma_k)$$

are the parameters of the model; the Gaussian mixture model contains $k$ components, each of which is an independent univariate Gaussian distribution. Because each Gaussian mixture model covers only one column of the data set X, its components need only be univariate Gaussians, that is:

$$\phi(x_i \mid \theta_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left(-\frac{(x_i - \mu_j)^2}{2\sigma_j^2}\right), \qquad \theta_j = (\mu_j, \sigma_j)$$

Furthermore,

$$\sum_{j=1}^{k} \omega_j = 1$$

Each variable $\omega_j$ is the weight of the $j$-th component in the mixture model. Now add a latent variable $z_i$ to each data point $x_i$; the range of $z_i$ is $\{1, 2, \dots, k\}$, i.e. $z_i = 2$ denotes that data point $x_i$ was generated by the 2nd component, and $z_i = j$ denotes that data point $x_i$ was generated by the $j$-th component. Then, for any $\theta$:

$$p(z_i = j \mid \theta) = \omega_j \qquad (2)$$

From formula (2) and the Bayes formula one obtains:

$$p(z_i = j \mid x_i, \theta) = \frac{\omega_j \, \phi(x_i \mid \theta_j)}{\sum_{l=1}^{k} \omega_l \, \phi(x_i \mid \theta_l)}$$

The above formula is the probability that data point $x_i$ was generated by the $j$-th component; define $\gamma_{ij} = p(z_i = j \mid x_i, \theta)$;
step four: the expectation maximization algorithm; after the model is determined, its parameters must be estimated. This step uses maximum likelihood parameter estimation, typically with the log-likelihood function:

$$L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta) \qquad (3)$$

According to the total probability formula:

$$p(x_i \mid \theta) = \sum_{j=1}^{k} p(z_i = j \mid \theta) \, p(x_i \mid z_i = j, \theta)$$

formula (3) can be written as:

$$L(\theta) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} \omega_j \, \phi(x_i \mid \theta_j) \qquad (4)$$

Because formula (4) contains the logarithm of a sum, its maximum likelihood estimate has no analytic solution and can only be obtained iteratively; here the maximum of the objective function $L(\theta)$ is computed with the Expectation-Maximization (EM) algorithm. The EM algorithm is an iterative algorithm consisting mainly of an E step (Expectation) and an M step (Maximization); by derivation, (4) becomes the following maximization problem:

$$\theta^{(m+1)} = \arg\max_{\theta} E\left[\log P(x, z \mid \theta) \mid x, \theta^{(m)}\right] \qquad (5)$$

where $\theta^{(m)}$ denotes the parameter estimate obtained after the $m$-th iteration; the parameter value $\theta^{(m+1)}$ after the $(m+1)$-th iteration is obtained by evaluating formula (5). Define the Q function:

$$Q(\theta \mid \theta^{(m)}) = E\left[\log P(x, z \mid \theta) \mid x, \theta^{(m)}\right]$$

Then the Q function of the $i$-th data point $x_i$ is:

$$Q_i(\theta \mid \theta^{(m)}) = \sum_{j=1}^{k} \gamma_{ij} \left[\log \omega_j + \log \phi(x_i \mid \theta_j)\right]$$

This derivation uses $\gamma_{ij}$ and formula (2), so the Q function over all data points is:

$$Q(\theta \mid \theta^{(m)}) = \sum_{i=1}^{n} \sum_{j=1}^{k} \gamma_{ij} \left[\log \omega_j + \log \phi(x_i \mid \theta_j)\right] \qquad (6)$$
after deriving the formula of the Q function, the main steps of the EM algorithm can be listed as follows:
①: initialization; before iteration starts, set the value of the parameter $m$ to 0 ($m$ can be regarded as the iteration count), and set initial model parameter values

$$\theta^{(0)} = (\omega_1^{(0)}, \dots, \omega_k^{(0)};\ \mu_1^{(0)}, \dots, \mu_k^{(0)};\ \sigma_1^{(0)}, \dots, \sigma_k^{(0)})$$

where $\theta^{(0)}$ denotes the model parameter values after the 0-th iteration;

②: E step; given the current model parameter values $\theta^{(m)}$, compute the responsivity of component $j$ for data $x_i$:

$$\gamma_{ij} = \frac{\omega_j^{(m)} \, \phi(x_i \mid \theta_j^{(m)})}{\sum_{l=1}^{k} \omega_l^{(m)} \, \phi(x_i \mid \theta_l^{(m)})}, \qquad i = 1, 2, \dots, n;\ j = 1, 2, \dots, k$$

The above formula was already given in step three;
③: M step; by the definition of the Q function, each iteration must find the value of the parameter $\theta$ that maximizes the value of the Q function, that is:

$$\theta^{(m+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(m)})$$

That is, in the M step the parameter $\theta$ maximizing the function $Q(\theta \mid \theta^{(m)})$ must be found. To obtain the optimal $\mu_j$, take the partial derivative of $Q$ with respect to $\mu_j$ and set it to zero, i.e.:

$$\frac{\partial Q(\theta \mid \theta^{(m)})}{\partial \mu_j} = 0$$

Substituting formula (6) into the above equation and solving yields:

$$\mu_j^{(m+1)} = \frac{\sum_{i=1}^{n} \gamma_{ij} \, x_i}{\sum_{i=1}^{n} \gamma_{ij}}, \qquad j = 1, 2, \dots, k$$

Similarly, the optimal $\sigma_j$ requires:

$$\frac{\partial Q(\theta \mid \theta^{(m)})}{\partial \sigma_j} = 0$$

Substituting formula (6) and solving yields:

$$\sigma_j^{(m+1)} = \sqrt{\frac{\sum_{i=1}^{n} \gamma_{ij} \, (x_i - \mu_j^{(m+1)})^2}{\sum_{i=1}^{n} \gamma_{ij}}}, \qquad j = 1, 2, \dots, k$$

The formula for the optimal $\omega_j$ is:

$$\omega_j^{(m+1)} = \frac{1}{n} \sum_{i=1}^{n} \gamma_{ij}, \qquad j = 1, 2, \dots, k$$

After the M step, the new post-iteration parameter values $\theta^{(m+1)}$ are obtained;
④: loop; repeat the E step and the M step, i.e. compute them alternately; note that after each M step the parameter $m$ is incremented by 1, i.e. $m = m + 1$; the loop ends when the iteration converges to a preset threshold;
After steps ① to ④ have been executed, all parameters of the Gaussian mixture model have been computed, and training of the Gaussian mixture model is complete;
step five: supervised learning; supervised learning requires annotated historical data. Assume the historical data is a matrix X and that every row of X has been labeled: a normal transaction is labeled 0 and an abnormal transaction 1, the label information being stored in one column of X. According to the label information, X is split into two matrices $X_0$ and $X_1$, where $X_0$ contains all normal transactions and $X_1$ all abnormal transactions. Assuming X has n columns, $X_0$ and $X_1$ also each have n columns. Now split $X_0$ and $X_1$ each into n single-column matrices, denoted $X_{01}, X_{02}, \dots, X_{0n}$ and $X_{11}, X_{12}, \dots, X_{1n}$ respectively; the matrix X is thus split into 2n single-column matrices. For each single-column matrix $X_{01}, \dots, X_{0n}, X_{11}, \dots, X_{1n}$ an independent Gaussian mixture model must now be created and trained with the EM algorithm, so 2n models must be created in total; the pseudo code is as follows:
Algorithm 1: supervised learning of Gaussian mixture models
[Pseudo code of Algorithm 1, shown as an image in the original: rows 2-3 loop over the class index i ∈ {0, 1} and the column index j ∈ {1, …, n}; row 4 creates a Gaussian mixture model for the single-column matrix X_ij (step three); row 5 trains it with the EM algorithm (step four); each trained model is stored in the array models.]
In Algorithm 1, rows 2-3 form two nested loops that run 2n times in total; each pass creates a Gaussian mixture model in row 4 (corresponding to step three) and trains it in row 5 (corresponding to step four), the training data being the single-column matrix $X_{ij}$. Algorithm 1 therefore creates and trains 2n Gaussian mixture models, looping over steps three and four 2n times, and the trained models are stored in the array models;
step six: unsupervised learning; assume there is some historical data that has not been labeled, denoted X'. Because the data is unlabeled, good and bad samples cannot be distinguished. Assuming the matrix X' has m columns, split X' into m single-column matrices $X'_1, X'_2, \dots, X'_m$; as in step five, a Gaussian mixture model is created for each single-column matrix, so step six produces m models in total; the pseudo code is as follows:
Algorithm 2: unsupervised learning of Gaussian mixture models
[Pseudo code of Algorithm 2, shown as an image in the original: a single loop over the column index i ∈ {1, …, m}; each pass creates a Gaussian mixture model for X'_i (step three), trains it with the EM algorithm (step four), and stores it in the array models2.]
Algorithm 2 is similar to Algorithm 1 but contains a single loop that runs m times; each pass executes steps three and four, creating and training one Gaussian mixture model, so there are m models in total, and the trained models are stored in the array models2;
step seven: prediction; first consider the prediction of supervised learning. When new transaction data $X^{(new)}$ is generated, the class of the new data must be predicted; that is, given the data, the posterior probability $P(y \mid X^{(new)})$ must be computed, where $y \in \{0, 1\}$ is the class, $y = 0$ a good sample and $y = 1$ a bad sample. By Bayes' theorem:

$$P(y \mid X^{(new)}) = \frac{P(X^{(new)} \mid y) \, P(y)}{P(X^{(new)})} \qquad (7)$$
Therefore, to compute the posterior probability, the prior probability $P(y)$, the conditional probability $P(X^{(new)} \mid y)$, and $P(X^{(new)})$ must be computed. The prior probability $P(y)$ can be given by expert experience or computed as the proportion of good or bad samples, and naive Bayes can be used when computing the conditional probability. In step five the data has n columns; assume here that the new data $X^{(new)}$ is consistent with the training data and also has n columns, written

$$X^{(new)} = (x_1^{(new)}, x_2^{(new)}, \dots, x_n^{(new)})$$

Because the data was transformed in step two and the columns of the transformed data are independent, the naive Bayes formula can be applied:

$$P(X^{(new)} \mid y) = \prod_{i=1}^{n} P(x_i^{(new)} \mid y)$$

Formula (7) then becomes:

$$P(y \mid X^{(new)}) = \frac{P(y) \, \prod_{i=1}^{n} P(x_i^{(new)} \mid y)}{P(X^{(new)})} \qquad (8)$$
where the factors $P(x_i^{(new)} \mid y)$, $i = 1, \dots, n$, $y \in \{0, 1\}$ — 2n probabilities in total — can be computed with the 2n models trained in step five, and the probability $P(X^{(new)})$ can be computed with the total probability formula. The posterior probability computed with formula (8) is the predicted value of supervised learning. Since the data in unsupervised learning is not labeled with good or bad samples, it cannot be computed with formula (8); only the naive Bayes formula can be used:
$$P(X^{(new2)}) = \prod_{i=1}^{m} P(x_i^{(new2)})$$

Note that in the above formula $i = 1, 2, \dots, m$, since step six assumes the unsupervised training data has m columns. Because the Gaussian mixture model is a generative model, $P(X^{(new2)})$ can be treated as the probability that the new data $X^{(new2)}$ was generated by the Gaussian mixture models; when the computed probability of some new data is very low, those data points can be taken as outliers, i.e. abnormal transactions.
In this embodiment: in step seven, after the prediction results of supervised and unsupervised learning have been obtained, a rule model, a linear model, or an ensemble model can be established to combine the prediction results into the final output result.
To achieve the above object, the present invention further provides an anti-fraud computer apparatus based on supervised learning and unsupervised learning, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the method described above when executing the computer program.
To achieve the above object, the present invention also provides a storage medium for an anti-fraud computer device based on supervised and unsupervised learning, having stored thereon a computer program which, when executed by a processor, carries out the steps of the method described above.
Further, the programming language used in the present invention is Python. The principal component analysis in step two may use the decomposition.PCA class of the scikit-learn package, and the Gaussian mixture model in step three may use the mixture.GaussianMixture class of the scikit-learn package. An important parameter of this class is n_components, i.e. the number of components the mixture model consists of. n_components is a hyper-parameter that must be tuned manually or found by hyper-parameter search; setting it too high causes overfitting and lengthens the computation time.
Further, step four is the expectation maximization algorithm. Random values can be used to initialize the model parameters in step ①, but the EM algorithm is sensitive to the setting of initial values, so the initial parameter values are preferably computed with the k-means algorithm; for this, the cluster.KMeans class of the scikit-learn package can be used. Steps ② and ③ can be implemented from the formulas listed in step four, and the numpy package can be used to speed up the computation. Steps three and four can also be implemented with the algorithm built into the mixture.GaussianMixture class, but the EM algorithm is then difficult to modify with what scikit-learn provides.
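For reference, a sketch of this scikit-learn route; n_components=3 is only an example value of the hyper-parameter discussed above:

```python
# scikit-learn route: built-in EM with k-means initialization of parameters.
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3,        # hyper-parameter to tune/search
                      init_params='kmeans',  # k-means initialization, as preferred
                      random_state=0)
gmm.fit(X[:, [0]])                           # one single-column matrix per model
```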
Furthermore, the supervised learning in step five and the unsupervised learning in step six are both loops over steps three and four.
Further, in step seven, the conditional probability of the data under each Gaussian mixture model can be computed with the score_samples method of the mixture.GaussianMixture class.
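A sketch of that call, assuming X_new is a matrix of new transactions; the 1st-percentile cutoff is an assumed outlier threshold, not a value fixed by the invention:

```python
# score_samples returns per-sample log-densities under the fitted mixture.
import numpy as np

log_density = gmm.score_samples(X_new[:, [0]])  # log p(x) for each new row
cutoff = np.percentile(log_density, 1)          # assumed outlier threshold
is_outlier = log_density < cutoff               # flags candidate fraud
```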
After the above technical scheme is adopted, on the one hand, an anti-fraud model is established by combining supervised and unsupervised learning: the supervised and unsupervised learning models are contained in one large model, which achieves the effect of an ensemble whose predictions are better than either supervised or unsupervised learning considered alone; the two learning modes are complementary, and combining them can detect both known and unknown fraud patterns. On the other hand, step four derives the concrete calculation formulas of the expectation-maximization (EM) algorithm, and as step four shows, the final formulas are easy to implement in a programming language. Once the algorithm is implemented it can be improved for specific application scenarios: for example, each iteration of steps three and four requires the full data set as input, so when the data volume is very large one iteration takes a long time; the existing EM algorithm can then be modified so that each iteration takes only one batch of data as input, which greatly improves computation speed while preserving computation accuracy.
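A minimal sketch of that batched variant: one call runs the E and M steps on a single random batch and blends the result into the running parameters; the smoothing factor rho is an assumption, since the text only specifies that each iteration inputs one batch:

```python
# Mini-batch EM step sketch: E and M steps on one batch, blended with rho.
import numpy as np

def minibatch_em_step(x_batch, mu, sigma, omega, rho=0.1):
    pdf = (np.exp(-(x_batch[:, None] - mu) ** 2 / (2 * sigma ** 2))
           / (np.sqrt(2 * np.pi) * sigma))
    gamma = (omega * pdf) / (omega * pdf).sum(axis=1, keepdims=True)  # E step
    nj = gamma.sum(axis=0)
    mu_b = (gamma * x_batch[:, None]).sum(axis=0) / nj                # M step
    sigma_b = np.sqrt((gamma * (x_batch[:, None] - mu_b) ** 2).sum(axis=0) / nj)
    omega_b = nj / x_batch.shape[0]
    return ((1 - rho) * mu + rho * mu_b,          # blend into running params
            (1 - rho) * sigma + rho * sigma_b,
            (1 - rho) * omega + rho * omega_b)
```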
Furthermore, it should be understood that although this description is organized by embodiments, not every embodiment contains only a single independent technical solution; the description is written this way only for clarity. Those skilled in the art should take the description as a whole, and the technical solutions in the embodiments may be combined as appropriate to form other embodiments understood by those skilled in the art.

Claims (3)

1. An anti-fraud method based on supervised learning and unsupervised learning is characterized by comprising the following specific steps:
step one: data preprocessing; the input data is numerical, non-numerical data is converted into numerical data, and all input data is standardized using the StandardScaler of scikit-learn;
step two: data conversion; the features of the input data are correlated, so the input data is transformed using Principal Component Analysis (PCA); the columns of the transformed data are uncorrelated, and the transformed data is still denoted X;
step three: creating a Gaussian mixture model; the method establishes a corresponding Gaussian mixture model for each column of the transformed data, the Gaussian mixture model being written as:

$$p(x_i \mid \theta) = \sum_{j=1}^{k} \omega_j \, \phi(x_i \mid \theta_j) \qquad (1)$$

where $x_i$ is the $i$-th row of the data and

$$\theta = (\omega_1, \dots, \omega_k;\ \mu_1, \dots, \mu_k;\ \sigma_1, \dots, \sigma_k)$$

are the parameters of the model; the Gaussian mixture model contains $k$ components, each of which is an independent univariate Gaussian distribution. Because each Gaussian mixture model covers only one column of the data set X, its components need only be univariate Gaussians, that is:

$$\phi(x_i \mid \theta_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left(-\frac{(x_i - \mu_j)^2}{2\sigma_j^2}\right), \qquad \theta_j = (\mu_j, \sigma_j)$$

Furthermore,

$$\sum_{j=1}^{k} \omega_j = 1$$

Each variable $\omega_j$ is the weight of the $j$-th component in the mixture model. Now add a latent variable $z_i$ to each data point $x_i$; the range of $z_i$ is $\{1, 2, \dots, k\}$, i.e. $z_i = 2$ denotes that data point $x_i$ was generated by the 2nd component, and $z_i = j$ denotes that data point $x_i$ was generated by the $j$-th component. Then, for any $\theta$:

$$p(z_i = j \mid \theta) = \omega_j \qquad (2)$$

From formula (2) and the Bayes formula one obtains:

$$p(z_i = j \mid x_i, \theta) = \frac{\omega_j \, \phi(x_i \mid \theta_j)}{\sum_{l=1}^{k} \omega_l \, \phi(x_i \mid \theta_l)}$$

The above formula is the probability that data point $x_i$ was generated by the $j$-th component; define $\gamma_{ij} = p(z_i = j \mid x_i, \theta)$;
step four: the expectation maximization algorithm; after the model is determined, its parameters need to be estimated. This step uses maximum likelihood parameter estimation with the log-likelihood function:

$$L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta) \qquad (3)$$

According to the total probability formula:

$$p(x_i \mid \theta) = \sum_{j=1}^{k} p(z_i = j \mid \theta) \, p(x_i \mid z_i = j, \theta)$$

formula (3) can be written as:

$$L(\theta) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} \omega_j \, \phi(x_i \mid \theta_j) \qquad (4)$$

Because formula (4) contains the logarithm of a sum, its maximum likelihood estimate has no analytic solution and can only be obtained iteratively; here the maximum of the objective function $L(\theta)$ is computed with the Expectation-Maximization (EM) algorithm. The EM algorithm is an iterative algorithm consisting mainly of an E step (Expectation) and an M step (Maximization); by derivation, (4) becomes the following maximization problem:

$$\theta^{(m+1)} = \arg\max_{\theta} E\left[\log P(x, z \mid \theta) \mid x, \theta^{(m)}\right] \qquad (5)$$

where $\theta^{(m)}$ denotes the parameter estimate obtained after the $m$-th iteration; the parameter value $\theta^{(m+1)}$ after the $(m+1)$-th iteration is obtained by evaluating formula (5). Define the Q function:

$$Q(\theta \mid \theta^{(m)}) = E\left[\log P(x, z \mid \theta) \mid x, \theta^{(m)}\right]$$

Then the Q function of the $i$-th data point $x_i$ is:

$$Q_i(\theta \mid \theta^{(m)}) = \sum_{j=1}^{k} \gamma_{ij} \left[\log \omega_j + \log \phi(x_i \mid \theta_j)\right]$$

This derivation uses $\gamma_{ij}$ and formula (2), so the Q function over all data points is:

$$Q(\theta \mid \theta^{(m)}) = \sum_{i=1}^{n} \sum_{j=1}^{k} \gamma_{ij} \left[\log \omega_j + \log \phi(x_i \mid \theta_j)\right] \qquad (6)$$
after the formula of the Q function is derived, the main steps of the EM algorithm are listed as follows:
①: initialization; before iteration starts, set the value of the parameter $m$ to 0 ($m$ is regarded as the iteration count), and set initial model parameter values

$$\theta^{(0)} = (\omega_1^{(0)}, \dots, \omega_k^{(0)};\ \mu_1^{(0)}, \dots, \mu_k^{(0)};\ \sigma_1^{(0)}, \dots, \sigma_k^{(0)})$$

where $\theta^{(0)}$ denotes the model parameter values after the 0-th iteration;

②: E step; given the current model parameter values $\theta^{(m)}$, compute the responsivity of component $j$ for data $x_i$:

$$\gamma_{ij} = \frac{\omega_j^{(m)} \, \phi(x_i \mid \theta_j^{(m)})}{\sum_{l=1}^{k} \omega_l^{(m)} \, \phi(x_i \mid \theta_l^{(m)})}, \qquad i = 1, 2, \dots, n;\ j = 1, 2, \dots, k$$

The above formula was already given in step three;
③: M step; by the definition of the Q function, each iteration must find the value of the parameter $\theta$ that maximizes the value of the Q function, that is:

$$\theta^{(m+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(m)})$$

That is, in the M step the parameter $\theta$ maximizing the function $Q(\theta \mid \theta^{(m)})$ must be found. To obtain the optimal $\mu_j$, take the partial derivative of $Q$ with respect to $\mu_j$ and set it to zero, i.e.:

$$\frac{\partial Q(\theta \mid \theta^{(m)})}{\partial \mu_j} = 0$$

Substituting formula (6) into the above equation and solving yields:

$$\mu_j^{(m+1)} = \frac{\sum_{i=1}^{n} \gamma_{ij} \, x_i}{\sum_{i=1}^{n} \gamma_{ij}}, \qquad j = 1, 2, \dots, k$$

Similarly, the optimal $\sigma_j$ requires:

$$\frac{\partial Q(\theta \mid \theta^{(m)})}{\partial \sigma_j} = 0$$

Substituting formula (6) and solving yields:

$$\sigma_j^{(m+1)} = \sqrt{\frac{\sum_{i=1}^{n} \gamma_{ij} \, (x_i - \mu_j^{(m+1)})^2}{\sum_{i=1}^{n} \gamma_{ij}}}, \qquad j = 1, 2, \dots, k$$

The formula for the optimal $\omega_j$ is:

$$\omega_j^{(m+1)} = \frac{1}{n} \sum_{i=1}^{n} \gamma_{ij}, \qquad j = 1, 2, \dots, k$$
After the M step, the new post-iteration parameter values $\theta^{(m+1)}$ are obtained;
④: loop; repeat the E step and the M step, i.e. compute them alternately; note that after each M step the parameter $m$ is incremented by 1, i.e. $m = m + 1$; the loop ends when the iteration converges to a preset threshold;
After steps ① to ④ have been executed, all parameters of the Gaussian mixture model have been computed, and training of the Gaussian mixture model is complete;
step five: supervised learning; supervised learning requires annotated historical data. Assume the historical data is a matrix X and that every row of X has been labeled: a normal transaction is labeled 0 and an abnormal transaction 1, the label information being stored in one column of X. According to the label information, X is split into two matrices $X_0$ and $X_1$, where $X_0$ contains all normal transactions and $X_1$ all abnormal transactions. Assuming X has n columns, $X_0$ and $X_1$ also each have n columns. Now split $X_0$ and $X_1$ each into n single-column matrices, denoted $X_{01}, X_{02}, \dots, X_{0n}$ and $X_{11}, X_{12}, \dots, X_{1n}$ respectively; the matrix X is thus split into 2n single-column matrices. For each single-column matrix $X_{01}, \dots, X_{0n}, X_{11}, \dots, X_{1n}$ an independent Gaussian mixture model is created and trained with the EM algorithm, so 2n models are created in total; the pseudo code is as follows:
Algorithm 1: supervised learning of Gaussian mixture models
[Pseudo code of Algorithm 1, shown as an image in the original: rows 2-3 loop over the class index i ∈ {0, 1} and the column index j ∈ {1, …, n}; row 4 creates a Gaussian mixture model for the single-column matrix X_ij (step three); row 5 trains it with the EM algorithm (step four); each trained model is stored in the array models.]
In Algorithm 1, rows 2-3 form two nested loops that run 2n times in total; each pass creates a Gaussian mixture model in row 4 (corresponding to step three) and trains it in row 5 (corresponding to step four), the training data being the single-column matrix $X_{ij}$. Algorithm 1 therefore creates and trains 2n Gaussian mixture models, looping over steps three and four 2n times, and the trained models are stored in the array models;
step six: unsupervised learning; assume there is some historical data that has not been labeled, denoted X'. Because the data is unlabeled, good and bad samples cannot be distinguished. Assuming the matrix X' has m columns, split X' into m single-column matrices $X'_1, X'_2, \dots, X'_m$; as in step five, a Gaussian mixture model is created for each single-column matrix, so step six produces m models in total; the pseudo code is as follows:
Algorithm 2: unsupervised learning of Gaussian mixture models
[Pseudo code of Algorithm 2, shown as an image in the original: a single loop over the column index i ∈ {1, …, m}; each pass creates a Gaussian mixture model for X'_i (step three), trains it with the EM algorithm (step four), and stores it in the array models2.]
Algorithm 2 is similar to Algorithm 1 but contains a single loop that runs m times; each pass executes steps three and four, creating and training one Gaussian mixture model, so there are m models in total, and the trained models are stored in the array models2;
step seven: prediction; first consider the prediction of supervised learning: when new transaction data $X^{(new)}$ is generated, the class of the new data must be predicted; that is, given the data, the posterior probability $P(y \mid X^{(new)})$ must be computed, where $y \in \{0, 1\}$ is the class, $y = 0$ is a good sample and $y = 1$ is a bad sample; by Bayes' theorem:

$$P(y \mid X^{(new)}) = \frac{P(X^{(new)} \mid y) \, P(y)}{P(X^{(new)})} \qquad (7)$$
Therefore, to compute the posterior probability, the prior probability $P(y)$, the conditional probability $P(X^{(new)} \mid y)$, and $P(X^{(new)})$ are computed. The prior probability $P(y)$ can be given by expert experience or computed as the proportion of good or bad samples, and naive Bayes is used when computing the conditional probability. In step five the data has n columns; it is assumed here that the new data $X^{(new)}$ is consistent with the training data and also has n columns, written

$$X^{(new)} = (x_1^{(new)}, x_2^{(new)}, \dots, x_n^{(new)})$$

Because the data was transformed in step two and the columns of the transformed data are independent, the naive Bayes formula is applied:

$$P(X^{(new)} \mid y) = \prod_{i=1}^{n} P(x_i^{(new)} \mid y)$$

Formula (7) then becomes:

$$P(y \mid X^{(new)}) = \frac{P(y) \, \prod_{i=1}^{n} P(x_i^{(new)} \mid y)}{P(X^{(new)})} \qquad (8)$$
where the factors $P(x_i^{(new)} \mid y)$, $i = 1, \dots, n$, $y \in \{0, 1\}$ — 2n probabilities in total — are computed with the 2n models trained in step five, and the probability $P(X^{(new)})$ is computed with the total probability formula. The posterior probability computed with formula (8) is the predicted value of supervised learning. Since the data in unsupervised learning is not labeled with good or bad samples, it cannot be computed with formula (8), and only the naive Bayes formula is used:
$$P(X^{(new2)}) = \prod_{i=1}^{m} P(x_i^{(new2)})$$

Note that in the above formula $i = 1, 2, \dots, m$, since step six assumes the unsupervised training data has m columns. Because the Gaussian mixture model is a generative model, $P(X^{(new2)})$ is treated as the probability that the new data $X^{(new2)}$ was generated by the Gaussian mixture models; when the computed probability of some new data is very low, those data points are taken as outliers, i.e. abnormal transactions;
and in step seven, after the prediction results of supervised and unsupervised learning have been obtained, a rule model, a linear model, or an ensemble model is established to combine the prediction results into the final output result.
2. An anti-fraud computer device based on supervised and unsupervised learning, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that: the processor, when executing the computer program, performs the steps of the method of claim 1.
3. An anti-fraud computer device storage medium based on supervised and unsupervised learning, having a computer program stored thereon, characterized by: which when executed by a processor performs the steps of the method of claim 1 above.
CN201910977563.9A 2019-10-15 2019-10-15 Anti-fraud method based on supervised learning and unsupervised learning Active CN110717601B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910977563.9A CN110717601B (en) 2019-10-15 2019-10-15 Anti-fraud method based on supervised learning and unsupervised learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910977563.9A CN110717601B (en) 2019-10-15 2019-10-15 Anti-fraud method based on supervised learning and unsupervised learning

Publications (2)

Publication Number Publication Date
CN110717601A CN110717601A (en) 2020-01-21
CN110717601B true CN110717601B (en) 2022-05-03

Family

ID=69211695

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910977563.9A Active CN110717601B (en) 2019-10-15 2019-10-15 Anti-fraud method based on supervised learning and unsupervised learning

Country Status (1)

Country Link
CN (1) CN110717601B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113435997A (en) * 2021-06-08 2021-09-24 成都熵焓科技有限公司 Gaussian mixture model bank transaction data simulation generation algorithm based on deep learning
CN113538049A (en) * 2021-07-14 2021-10-22 北京明略软件系统有限公司 Abnormal flow identification system


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10999247B2 (en) * 2017-10-24 2021-05-04 Nec Corporation Density estimation network for unsupervised anomaly detection
US20190108447A1 (en) * 2017-11-30 2019-04-11 Intel Corporation Multifunction perceptrons in machine learning environments

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103745233A (en) * 2014-01-23 2014-04-23 西安电子科技大学 Hyper-spectral image classifying method based on spatial information transfer
CN104280771A (en) * 2014-10-27 2015-01-14 中国石油集团川庆钻探工程有限公司地球物理勘探公司 Three-dimensional seismic data waveform semi-supervised clustering method based on EM algorithm
CN104680015A (en) * 2015-03-02 2015-06-03 华南理工大学 Online soft measurement method for sewage treatment based on quick relevance vector machine
CN104881687A (en) * 2015-06-02 2015-09-02 四川理工学院 Magnetic resonance image classification method based on semi-supervised Gaussian mixed model
CN105160290A (en) * 2015-07-03 2015-12-16 东南大学 Mobile boundary sampling behavior identification method based on improved dense locus
CN106778854A (en) * 2016-12-07 2017-05-31 西安电子科技大学 Activity recognition method based on track and convolutional neural networks feature extraction
CN108734479A (en) * 2018-04-12 2018-11-02 阿里巴巴集团控股有限公司 Data processing method, device, equipment and the server of Insurance Fraud identification
CN108804784A (en) * 2018-05-25 2018-11-13 江南大学 A kind of instant learning soft-measuring modeling method based on Bayes's gauss hybrid models
CN109165550A (en) * 2018-07-13 2019-01-08 首都师范大学 A kind of multi-modal operation track fast partition method based on unsupervised deep learning
CN109784676A (en) * 2018-12-25 2019-05-21 杨鑫 The study and application method, device and computer readable storage medium of data analysis
CN110071913A (en) * 2019-03-26 2019-07-30 同济大学 A kind of time series method for detecting abnormality based on unsupervised learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research and Application of Medical Insurance Fraud Detection; Guo Tao; China Master's Theses Full-text Database; 2017-02-15; pp. 1-66 *

Also Published As

Publication number Publication date
CN110717601A (en) 2020-01-21

Similar Documents

Publication Publication Date Title
Hernández et al. Variational encoding of complex dynamics
Tamar et al. Scaling up robust MDPs using function approximation
Zeng et al. SMO-based pruning methods for sparse least squares support vector machines
US9704105B2 (en) Transductive lasso for high-dimensional data regression problems
CN110766044B (en) Neural network training method based on Gaussian process prior guidance
Mariet et al. Foundations of sequence-to-sequence modeling for time series
US10769551B2 (en) Training data set determination
US11004026B2 (en) Method and apparatus for determining risk management decision-making critical values
US20060129395A1 (en) Gradient learning for probabilistic ARMA time-series models
CN110717601B (en) Anti-fraud method based on supervised learning and unsupervised learning
Andrews Rank-based estimation for GARCH processes
Tanha et al. Disagreement-based co-training
Rad et al. GP-RVM: Genetic programing-based symbolic regression using relevance vector machine
Schumacher et al. An introduction to delay-coupled reservoir computing
Lee et al. Trajectory of mini-batch momentum: Batch size saturation and convergence in high dimensions
Xie et al. Reinforcement learning for soft sensor design through autonomous cross-domain data selection
Wang et al. A novel approach to generate artificial outliers for support vector data description
Kopczyk Proxy modeling in life insurance companies with the use of machine learning algorithms
Galili et al. Splitting matters: how monotone transformation of predictor variables may improve the predictions of decision tree models
JP6233432B2 (en) Method and apparatus for selecting mixed model
CN110135592B (en) Classification effect determining method and device, intelligent terminal and storage medium
US7418430B2 (en) Dynamic standardization for scoring linear regressions in decision trees
Madary et al. A bayesian framework for large-scale identification of nonlinear hybrid systems
Kamm et al. A novel approach to rating transition modelling via Machine Learning and SDEs on Lie groups
Grohmann Reliable Resource Demand Estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant