CN110717601A - Anti-fraud method based on supervised learning and unsupervised learning - Google Patents

Anti-fraud method based on supervised learning and unsupervised learning

Info

Publication number: CN110717601A (application number CN201910977563.9A); granted publication: CN110717601B
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: data, model, formula, gaussian mixture, probability
Inventors: 施铭铮, 刘占辉
Applicant and current assignee: Xiamen Pencil Head Information Technology Co Ltd
Priority and filing date: 2019-10-15
Legal status: Granted; Active

Classifications

    • G06F18/29 Pattern recognition > Analysing > Graphical models, e.g. Bayesian networks
    • G06F18/214 Pattern recognition > Design or setup of recognition systems or techniques > Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N20/00 Computing arrangements based on specific computational models > Machine learning

Abstract

The invention discloses an anti-fraud method based on supervised learning and unsupervised learning. The method establishes an anti-fraud model that combines supervised learning with unsupervised learning: both sub-models are contained in one large model, which achieves the effect of an ensemble, so the prediction results are better than those of supervised or unsupervised learning considered alone. Supervised learning and unsupervised learning are complementary, and combining the two learning modes can detect both known fraud patterns and unknown fraud patterns.

Description

Anti-fraud method based on supervised learning and unsupervised learning
Technical Field
The invention discloses an anti-fraud method based on supervised learning and unsupervised learning, and belongs to the technical field of anti-fraud.
Background
Among the large numbers of orders and transactions in the financial field there are normal data and a small proportion of abnormal, fraudulent data. Detecting fraudulent transactions among a large volume of real-time financial transactions and handling them promptly is important work, and machine learning methods are already widely applied in anti-fraud models. Machine learning algorithms can be roughly divided into supervised learning methods and unsupervised learning methods, and each has its own advantages and disadvantages in anti-fraud applications.
Supervised learning algorithms depend on historical data, and the historical data must be labeled; that is, each historical transaction must be marked as a normal transaction or an abnormal (fraudulent) transaction. A supervised learning algorithm learns from the labeled historical data and stores the learning result in a model; when a new transaction is generated, the trained supervised model can predict it and calculate the probability that it belongs to the abnormal class. The problem with supervised learning is that it learns the rules and patterns present in the good and bad samples of the historical data and classifies newly generated transactions based on those patterns. In other words, supervised learning can only match patterns that already exist in the historical data, so when a new fraud pattern appears, supervised learning cannot detect it, because that pattern was never learned from the historical data.
The weakness of supervised learning can be compensated for by unsupervised learning, which does not need labeled historical data but directly analyzes the distribution of the data. One assumption of unsupervised learning is that when an outlier occurs in the distribution of transaction data, the outlier belongs to a fraudulent transaction. Unsupervised learning is therefore useful for discovering unknown fraud patterns, because it does not rely on annotated historical data.
In anti-fraud applications, then, supervised learning and unsupervised learning are complementary: supervised learning is used to discover existing fraud patterns, and unsupervised learning is used to detect new fraud patterns;
therefore, the invention provides an anti-fraud method based on supervised learning and unsupervised learning.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides an anti-fraud method based on supervised learning and unsupervised learning. An anti-fraud model is established by combining supervised learning with unsupervised learning: both sub-models are contained in one large model, which achieves the effect of an ensemble, so the prediction results are better than those of supervised or unsupervised learning considered alone. Supervised learning and unsupervised learning are complementary, and combining the two learning modes can detect both known fraud patterns and unknown fraud patterns.
In order to achieve the purpose, the invention provides the following technical scheme: an anti-fraud method based on supervised learning and unsupervised learning comprises the following specific steps:
Step one: data preprocessing; the input data must be numerical, so non-numerical data needs to be converted into numerical data first, and then all the input data can be standardized, for example with the StandardScaler of scikit-learn;
Step two: data conversion; in general the features of the input data are correlated with one another, so the input data needs to be converted with the principal component analysis (PCA) method; the columns of the converted data are uncorrelated, and the converted data is still denoted X;
Step three: creating a Gaussian mixture model; the invention aims to establish a corresponding Gaussian mixture model for each column of the converted data, the Gaussian mixture model being written as:

$$p(x_i \mid \theta) = \sum_{j=1}^{k} \omega_j\, \varphi(x_i \mid \mu_j, \sigma_j) \tag{1}$$

wherein $x_i$ is the $i$-th row of the data and

$$\theta = (\omega_1, \dots, \omega_k,\ \mu_1, \dots, \mu_k,\ \sigma_1, \dots, \sigma_k)$$

are the parameters of the model. The Gaussian mixture model contains $k$ components, each component being an independent Gaussian univariate distribution; because each Gaussian mixture model covers only one column of the data set X, its components only need Gaussian univariate distributions, that is:

$$\varphi(x \mid \mu_j, \sigma_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\!\left( -\frac{(x - \mu_j)^2}{2\sigma_j^2} \right)$$

Furthermore,

$$\sum_{j=1}^{k} \omega_j = 1, \qquad \omega_j \geq 0$$

Each $\omega_j$ is the weight of the $j$-th component in the mixture model. Now, to each data point $x_i$ add a latent variable $z_i$ with value range $\{1, 2, \dots, k\}$; $z_i = 2$ means that data point $x_i$ was generated by the 2nd component, and $z_i = j$ means that $x_i$ was generated by the $j$-th component. Then for any $\theta$:

$$p(x_i, z_i = j \mid \theta) = \omega_j\, \varphi(x_i \mid \mu_j, \sigma_j) \tag{2}$$

From formula (2) and the Bayesian formula one obtains:

$$p(z_i = j \mid x_i, \theta) = \frac{\omega_j\, \varphi(x_i \mid \mu_j, \sigma_j)}{\sum_{l=1}^{k} \omega_l\, \varphi(x_i \mid \mu_l, \sigma_l)}$$

This formula is the probability that data point $x_i$ was generated by the $j$-th component; define $\gamma_{ij} = p(z_i = j \mid x_i, \theta)$;
Step four: the expectation-maximization algorithm; after the model is determined, its parameters need to be estimated. This step uses maximum likelihood parameter estimation, customarily with the log-likelihood function:

$$L(\theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta) \tag{3}$$

According to the total probability formula:

$$p(x_i \mid \theta) = \sum_{j=1}^{k} p(x_i, z_i = j \mid \theta)$$

formula (3) can be written as:

$$L(\theta) = \sum_{i=1}^{N} \log \sum_{j=1}^{k} \omega_j\, \varphi(x_i \mid \mu_j, \sigma_j) \tag{4}$$

Formula (4) contains the logarithm of a sum, so its maximum likelihood estimate has no analytic solution and can only be found iteratively. Here the expectation-maximization (EM) algorithm is adopted to maximize the objective function $L(\theta)$. The EM algorithm is an iterative algorithm consisting mainly of an E step (expectation) and an M step (maximization). By derivation from (4), the problem becomes the following maximization:

$$\theta^{(m+1)} = \arg\max_{\theta}\, E\big[\log P(x, z \mid \theta) \mid x, \theta^{(m)}\big] \tag{5}$$

wherein $\theta^{(m)}$ is the parameter estimate obtained after the $m$-th iteration, and formula (5) yields the parameter value $\theta^{(m+1)}$ after the $(m+1)$-th iteration. Define the Q function:

$$Q(\theta \mid \theta^{(m)}) = E\big[\log P(x, z \mid \theta) \mid x, \theta^{(m)}\big]$$

Then the Q function of the $i$-th data point $x_i$ is:

$$Q_i(\theta \mid \theta^{(m)}) = \sum_{j=1}^{k} \gamma_{ij} \log\big(\omega_j\, \varphi(x_i \mid \mu_j, \sigma_j)\big)$$

This derivation uses $\gamma_{ij}$ and formula (2), so the Q function for all data points is:

$$Q(\theta \mid \theta^{(m)}) = \sum_{i=1}^{N} \sum_{j=1}^{k} \gamma_{ij} \big(\log \omega_j + \log \varphi(x_i \mid \mu_j, \sigma_j)\big) \tag{6}$$

After deriving the formula of the Q function, the main steps of the EM algorithm can be listed as follows:
① Initialization: before iteration begins, set the value of the parameter m to 0; the parameter m can be regarded as the number of iterations. Initial values must be set for the model parameters:

$$\theta^{(0)} = \big(\omega_1^{(0)}, \dots, \omega_k^{(0)},\ \mu_1^{(0)}, \dots, \mu_k^{(0)},\ \sigma_1^{(0)}, \dots, \sigma_k^{(0)}\big)$$

wherein $\theta^{(0)}$ denotes the model parameter values after the 0th iteration;

② E step: according to the current model parameter values $\theta^{(m)}$, it is necessary to calculate the responsivity of component $j$ for data point $x_i$:

$$\gamma_{ij} = \frac{\omega_j^{(m)}\, \varphi(x_i \mid \mu_j^{(m)}, \sigma_j^{(m)})}{\sum_{l=1}^{k} \omega_l^{(m)}\, \varphi(x_i \mid \mu_l^{(m)}, \sigma_l^{(m)})}$$

This formula has already been given in step three;
③ M step: according to the definition of the Q function, each iteration must find the value of the parameter $\theta$ that maximizes the value of the Q function, that is:

$$\theta^{(m+1)} = \arg\max_{\theta}\, Q(\theta \mid \theta^{(m)})$$

In other words, in the M step it is necessary to find the parameter $\theta$ that maximizes $Q(\theta \mid \theta^{(m)})$. To obtain the optimal $\mu_j$, the partial derivative with respect to $\mu_j$ is taken and set to zero:

$$\frac{\partial Q(\theta \mid \theta^{(m)})}{\partial \mu_j} = 0$$

Substituting formula (6) into the above equation and solving yields:

$$\mu_j^{(m+1)} = \frac{\sum_{i=1}^{N} \gamma_{ij}\, x_i}{\sum_{i=1}^{N} \gamma_{ij}}$$

Similarly, the optimal $\sigma_j$ requires calculating the following formula:

$$\frac{\partial Q(\theta \mid \theta^{(m)})}{\partial \sigma_j} = 0$$

Substituting formula (6) and solving yields:

$$\big(\sigma_j^{(m+1)}\big)^2 = \frac{\sum_{i=1}^{N} \gamma_{ij}\, \big(x_i - \mu_j^{(m+1)}\big)^2}{\sum_{i=1}^{N} \gamma_{ij}}$$

The formula for the optimal $\omega_j$ is:

$$\omega_j^{(m+1)} = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ij}$$

After the M step is finished, the new post-iteration parameter values are obtained:

$$\theta^{(m+1)} = \big(\omega_1^{(m+1)}, \dots, \omega_k^{(m+1)},\ \mu_1^{(m+1)}, \dots, \mu_k^{(m+1)},\ \sigma_1^{(m+1)}, \dots, \sigma_k^{(m+1)}\big)$$
④ Repeat ② and ③, i.e. alternately calculate the E step and the M step; note that after each M step the parameter m is incremented by 1 (m = m + 1), and the loop ends when the iteration converges to a preset threshold;
when ① to ④ have been executed, all parameters of the Gaussian mixture model have been calculated, and the training of one Gaussian mixture model is finished;
Step five: supervised learning; suppose the historical data needed for supervised learning is a matrix X and that each row of X has been labeled, normal transactions being recorded as 0 and abnormal transactions as 1, with the label information stored in one column of X. According to the label information, X is split into two matrices $X_0$ and $X_1$: $X_0$ contains all normal transactions and $X_1$ contains all abnormal transactions. Assuming matrix X has n columns, $X_0$ and $X_1$ also each have n columns. Now $X_0$ and $X_1$ are each divided into n single-column matrices, denoted $X_{01}, X_{02}, \dots, X_{0n}$ and $X_{11}, X_{12}, \dots, X_{1n}$ respectively, so the matrix X is split into 2n single-column matrices. For each single-column matrix $X_{01}, X_{02}, \dots, X_{0n}, X_{11}, X_{12}, \dots, X_{1n}$, an independent Gaussian mixture model must be created and trained with the EM algorithm, so 2n models need to be created in total. The pseudo code is as follows:
Algorithm 1: supervised learning Gaussian mixture models.
[Pseudo code figure: rows 2-3 form two nested loops over the label (0, 1) and the n columns; row 4 creates a Gaussian mixture model and row 5 trains it on the single-column matrix $X_{ij}$; the trained models are stored in the array models.]
In Algorithm 1, rows 2-3 form two loops that execute 2n times in total; each pass creates a Gaussian mixture model in row 4 (corresponding to step three) and trains it in row 5 (corresponding to step four), the training data being the single-column matrix $X_{ij}$. Algorithm 1 therefore creates and trains 2n Gaussian mixture models, looping through step three and step four 2n times, and the trained models are stored in the array models;
Step six: unsupervised learning; suppose there is some historical data that has not been labeled, denoted X′. Because the data is unlabeled, there is no way to distinguish good samples from bad samples. Assuming matrix X′ has m columns, X′ is now split into m single-column matrices $X'_1, X'_2, \dots, X'_m$. As in step five, a Gaussian mixture model is created for each single-column matrix, so step six has m models in total. The pseudo code is as follows:
Algorithm 2: unsupervised learning Gaussian mixture models.
[Pseudo code figure: a single loop over the m columns; each pass creates a Gaussian mixture model and trains it on the single-column matrix $X'_j$; the trained models are stored in the array models2.]
Algorithm 2 is similar to Algorithm 1: it has a single loop that executes m times, and each pass executes step three and step four, creating and training one Gaussian mixture model. There are thus m models in total, and the trained models are placed in the array models2;
Step seven: prediction; first consider the prediction of supervised learning. When new transaction data $X^{(new)}$ is generated, its class must be predicted; that is, given the data, the posterior probability $P(y \mid X^{(new)})$ must be calculated, where $y \in \{0, 1\}$ is the class, $y = 0$ a good sample and $y = 1$ a bad sample. According to Bayes' theorem:

$$P(y \mid X^{(new)}) = \frac{P(X^{(new)} \mid y)\, P(y)}{P(X^{(new)})} \tag{7}$$

Therefore, to calculate the posterior probability, the prior probability $P(y)$, the conditional probability $P(X^{(new)} \mid y)$, and $P(X^{(new)})$ must be calculated. The prior probability $P(y)$ can be given by expert experience or calculated as the proportion of good and bad samples, and naive Bayes can be used when calculating the conditional probability. In step five the data has n columns; here it is assumed that the new data $X^{(new)}$ is consistent with the training data and also has n columns, written

$$X^{(new)} = \big(X_1^{(new)}, X_2^{(new)}, \dots, X_n^{(new)}\big)$$

Because the data was converted in step two and the columns of the converted data are independent, the naive Bayes formula can be applied:

$$P(X^{(new)} \mid y) = \prod_{i=1}^{n} P(X_i^{(new)} \mid y)$$

Formula (7) then becomes:

$$P(y \mid X^{(new)}) = \frac{P(y)\, \prod_{i=1}^{n} P(X_i^{(new)} \mid y)}{P(X^{(new)})} \tag{8}$$

wherein the conditional probabilities $P(X_i^{(new)} \mid y)$, 2n in total, can be calculated with the 2n models trained in step five, and $P(X^{(new)})$ can be calculated with the total probability formula. The posterior probability calculated by formula (8) is the prediction of supervised learning. Because the data in unsupervised learning is not labeled with good or bad samples, formula (8) cannot be used; only the naive Bayes formula can be applied:

$$P(X^{(new2)}) = \prod_{i=1}^{m} P(X_i^{(new2)})$$

Note that in this formula $i = 1, 2, \dots, m$, since step six assumes that the unsupervised training data has m columns. Because the Gaussian mixture model is a generative model, $P(X^{(new2)})$ can be treated as the probability that the new data $X^{(new2)}$ was generated by the Gaussian mixture model; when the probability calculated for some new data is very low, these data points can be taken as outliers, i.e. abnormal transactions.
In one embodiment: in step seven, after the prediction results of supervised learning and unsupervised learning are obtained, a rule model, a linear model, or an ensemble model can be established to combine the two prediction results into the final output result.
To achieve the above object, the present invention further provides an anti-fraud computer apparatus based on supervised learning and unsupervised learning, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of any of the preceding claims 1 to 2 when executing the computer program.
To achieve the above object, the present invention also provides an anti-fraud computer device storage medium based on supervised and unsupervised learning, having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of any of the preceding claims 1 to 2.
After the above technical scheme is adopted, on the one hand, an anti-fraud model is established by combining supervised learning with unsupervised learning; the supervised model and the unsupervised model are contained in one large model, which achieves the effect of an ensemble, so the prediction results are better than those of supervised or unsupervised learning considered alone. Supervised learning and unsupervised learning are complementary, and combining the two learning modes can detect both known fraud patterns and unknown fraud patterns;
on the other hand, step four derives the concrete calculation formulas of the expectation-maximization (EM) algorithm, and as step four shows, the final formulas are easy to implement in a programming language. Once implemented, the algorithm can be improved for specific application scenarios. For example, each iteration of steps three and four requires the full data set as input, so when the data volume is very large, one iteration takes a long time; the existing EM algorithm can then be improved so that each iteration takes only one batch of data as input, which greatly improves calculation speed while preserving calculation accuracy.
Drawings
FIG. 1 is a flow chart showing the specific steps of an anti-fraud method based on supervised learning and unsupervised learning according to the present invention;
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the present invention provides an anti-fraud method based on supervised learning and unsupervised learning, which includes the following specific steps:
Step one: data preprocessing; the input data must be numerical, so non-numerical data needs to be converted into numerical data first, and then all the input data can be standardized, for example with the StandardScaler of scikit-learn, as sketched below;
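As an illustration of step one, the following minimal sketch (an assumed implementation, not code from the patent; the table and column names are hypothetical) converts a non-numerical column into numerical dummy columns and then standardizes everything with scikit-learn's StandardScaler:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical transaction table; "channel" is a non-numerical feature.
df = pd.DataFrame({
    "amount":  [120.0, 8000.0, 35.5, 560.0],
    "channel": ["web", "app", "web", "pos"],
})

# Convert the non-numerical column into numerical dummy columns.
df = pd.get_dummies(df, columns=["channel"], dtype=float)

# Standardize every column to zero mean and unit variance.
X = StandardScaler().fit_transform(df.values)
print(X.shape)  # (4, 4): amount plus three channel dummies
```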
Step two: data conversion; in general the features of the input data are correlated with one another, so the input data needs to be converted with the principal component analysis (PCA) method; the columns of the converted data are uncorrelated, and the converted data is still denoted X;
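Continuing the sketch (again an assumption rather than the patent's own code), step two decorrelates the standardized columns with PCA; the columns of the transformed matrix are mutually uncorrelated:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(500, 4)))

pca = PCA()                 # keep all components
X_t = pca.fit_transform(X)  # columns of X_t are mutually uncorrelated

# Empirical check: the off-diagonal covariances are ~0.
print(np.round(np.cov(X_t, rowvar=False), 3))
```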
Step three: creating a Gaussian mixture model; the invention aims to establish a corresponding Gaussian mixture model for each column of the converted data, the Gaussian mixture model being written as:

$$p(x_i \mid \theta) = \sum_{j=1}^{k} \omega_j\, \varphi(x_i \mid \mu_j, \sigma_j) \tag{1}$$

wherein $x_i$ is the $i$-th row of the data and

$$\theta = (\omega_1, \dots, \omega_k,\ \mu_1, \dots, \mu_k,\ \sigma_1, \dots, \sigma_k)$$

are the parameters of the model. The Gaussian mixture model contains $k$ components, each component being an independent Gaussian univariate distribution; because each Gaussian mixture model covers only one column of the data set X, its components only need Gaussian univariate distributions, that is:

$$\varphi(x \mid \mu_j, \sigma_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\!\left( -\frac{(x - \mu_j)^2}{2\sigma_j^2} \right)$$

Furthermore,

$$\sum_{j=1}^{k} \omega_j = 1, \qquad \omega_j \geq 0$$

Each $\omega_j$ is the weight of the $j$-th component in the mixture model. Now, to each data point $x_i$ add a latent variable $z_i$ with value range $\{1, 2, \dots, k\}$; $z_i = 2$ means that data point $x_i$ was generated by the 2nd component, and $z_i = j$ means that $x_i$ was generated by the $j$-th component. Then for any $\theta$:

$$p(x_i, z_i = j \mid \theta) = \omega_j\, \varphi(x_i \mid \mu_j, \sigma_j) \tag{2}$$

From formula (2) and the Bayesian formula one obtains:

$$p(z_i = j \mid x_i, \theta) = \frac{\omega_j\, \varphi(x_i \mid \mu_j, \sigma_j)}{\sum_{l=1}^{k} \omega_l\, \varphi(x_i \mid \mu_l, \sigma_l)}$$

This formula is the probability that data point $x_i$ was generated by the $j$-th component; define $\gamma_{ij} = p(z_i = j \mid x_i, \theta)$;
Step four: the expectation-maximization algorithm; after the model is determined, its parameters need to be estimated. This step uses maximum likelihood parameter estimation, customarily with the log-likelihood function:

$$L(\theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta) \tag{3}$$

According to the total probability formula:

$$p(x_i \mid \theta) = \sum_{j=1}^{k} p(x_i, z_i = j \mid \theta)$$

formula (3) can be written as:

$$L(\theta) = \sum_{i=1}^{N} \log \sum_{j=1}^{k} \omega_j\, \varphi(x_i \mid \mu_j, \sigma_j) \tag{4}$$

Formula (4) contains the logarithm of a sum, so its maximum likelihood estimate has no analytic solution and can only be found iteratively. Here the expectation-maximization (EM) algorithm is adopted to maximize the objective function $L(\theta)$. The EM algorithm is an iterative algorithm consisting mainly of an E step (expectation) and an M step (maximization). By derivation from (4), the problem becomes the following maximization:

$$\theta^{(m+1)} = \arg\max_{\theta}\, E\big[\log P(x, z \mid \theta) \mid x, \theta^{(m)}\big] \tag{5}$$

wherein $\theta^{(m)}$ is the parameter estimate obtained after the $m$-th iteration, and formula (5) yields the parameter value $\theta^{(m+1)}$ after the $(m+1)$-th iteration. Define the Q function:

$$Q(\theta \mid \theta^{(m)}) = E\big[\log P(x, z \mid \theta) \mid x, \theta^{(m)}\big]$$

Then the Q function of the $i$-th data point $x_i$ is:

$$Q_i(\theta \mid \theta^{(m)}) = \sum_{j=1}^{k} \gamma_{ij} \log\big(\omega_j\, \varphi(x_i \mid \mu_j, \sigma_j)\big)$$

This derivation uses $\gamma_{ij}$ and formula (2), so the Q function for all data points is:

$$Q(\theta \mid \theta^{(m)}) = \sum_{i=1}^{N} \sum_{j=1}^{k} \gamma_{ij} \big(\log \omega_j + \log \varphi(x_i \mid \mu_j, \sigma_j)\big) \tag{6}$$

After deriving the formula of the Q function, the main steps of the EM algorithm can be listed as follows:
① Initialization: before iteration begins, set the value of the parameter m to 0; the parameter m can be regarded as the number of iterations. Initial values must be set for the model parameters:

$$\theta^{(0)} = \big(\omega_1^{(0)}, \dots, \omega_k^{(0)},\ \mu_1^{(0)}, \dots, \mu_k^{(0)},\ \sigma_1^{(0)}, \dots, \sigma_k^{(0)}\big)$$

wherein $\theta^{(0)}$ denotes the model parameter values after the 0th iteration;

② E step: according to the current model parameter values $\theta^{(m)}$, it is necessary to calculate the responsivity of component $j$ for data point $x_i$:

$$\gamma_{ij} = \frac{\omega_j^{(m)}\, \varphi(x_i \mid \mu_j^{(m)}, \sigma_j^{(m)})}{\sum_{l=1}^{k} \omega_l^{(m)}\, \varphi(x_i \mid \mu_l^{(m)}, \sigma_l^{(m)})}$$

This formula has already been given in step three;
③ M step: according to the definition of the Q function, each iteration must find the value of the parameter $\theta$ that maximizes the value of the Q function, that is:

$$\theta^{(m+1)} = \arg\max_{\theta}\, Q(\theta \mid \theta^{(m)})$$

In other words, in the M step it is necessary to find the parameter $\theta$ that maximizes $Q(\theta \mid \theta^{(m)})$. To obtain the optimal $\mu_j$, the partial derivative with respect to $\mu_j$ is taken and set to zero:

$$\frac{\partial Q(\theta \mid \theta^{(m)})}{\partial \mu_j} = 0$$

Substituting formula (6) into the above equation and solving yields:

$$\mu_j^{(m+1)} = \frac{\sum_{i=1}^{N} \gamma_{ij}\, x_i}{\sum_{i=1}^{N} \gamma_{ij}}$$

Similarly, the optimal $\sigma_j$ requires calculating the following formula:

$$\frac{\partial Q(\theta \mid \theta^{(m)})}{\partial \sigma_j} = 0$$

Substituting formula (6) and solving yields:

$$\big(\sigma_j^{(m+1)}\big)^2 = \frac{\sum_{i=1}^{N} \gamma_{ij}\, \big(x_i - \mu_j^{(m+1)}\big)^2}{\sum_{i=1}^{N} \gamma_{ij}}$$

The formula for the optimal $\omega_j$ is:

$$\omega_j^{(m+1)} = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ij}$$

After the M step is finished, the new post-iteration parameter values are obtained:

$$\theta^{(m+1)} = \big(\omega_1^{(m+1)}, \dots, \omega_k^{(m+1)},\ \mu_1^{(m+1)}, \dots, \mu_k^{(m+1)},\ \sigma_1^{(m+1)}, \dots, \sigma_k^{(m+1)}\big)$$
④ Repeat ② and ③, i.e. alternately calculate the E step and the M step; note that after each M step the parameter m is incremented by 1 (m = m + 1), and the loop ends when the iteration converges to a preset threshold;
when ① to ④ have been executed, all parameters of the Gaussian mixture model have been calculated, and the training of one Gaussian mixture model is finished;
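To make steps ① to ④ concrete, here is a minimal numpy sketch of the EM iteration for a univariate Gaussian mixture, implementing the formulas of step four directly (a sketch under assumed initialization and stopping details, not the patent's own code):

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=200, tol=1e-6, seed=0):
    """Fit a k-component univariate Gaussian mixture to vector x
    with the EM formulas of step four."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # ① Initialization: random means here; k-means initialization is discussed later.
    mu = rng.choice(x, size=k, replace=False).astype(float)
    sigma = np.full(k, x.std() + 1e-9)
    omega = np.full(k, 1.0 / k)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # ② E step: responsibilities gamma[i, j] = p(z_i = j | x_i, theta).
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) \
               / (np.sqrt(2.0 * np.pi) * sigma)
        weighted = omega * dens                              # shape (n, k)
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # ③ M step: closed-form updates for mu_j, sigma_j, omega_j.
        nj = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / nj
        sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nj) + 1e-9
        omega = nj / n
        # ④ Stop when the log-likelihood improvement falls below tol.
        ll = np.log(weighted.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return omega, mu, sigma

# Toy column: 90% of points around 0, 10% around 8.
x = np.concatenate([np.random.normal(0, 1, 900), np.random.normal(8, 2, 100)])
print(em_gmm_1d(x, k=2))
```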
Step five: supervised learning; suppose the historical data needed for supervised learning is a matrix X and that each row of X has been labeled, normal transactions being recorded as 0 and abnormal transactions as 1, with the label information stored in one column of X. According to the label information, X is split into two matrices $X_0$ and $X_1$: $X_0$ contains all normal transactions and $X_1$ contains all abnormal transactions. Assuming matrix X has n columns, $X_0$ and $X_1$ also each have n columns. Now $X_0$ and $X_1$ are each divided into n single-column matrices, denoted $X_{01}, X_{02}, \dots, X_{0n}$ and $X_{11}, X_{12}, \dots, X_{1n}$ respectively, so the matrix X is split into 2n single-column matrices. For each single-column matrix $X_{01}, X_{02}, \dots, X_{0n}, X_{11}, X_{12}, \dots, X_{1n}$, an independent Gaussian mixture model must be created and trained with the EM algorithm, so 2n models need to be created in total. The pseudo code is as follows:
Algorithm 1: supervised learning Gaussian mixture models.
[Pseudo code figure: rows 2-3 form two nested loops over the label (0, 1) and the n columns; row 4 creates a Gaussian mixture model and row 5 trains it on the single-column matrix $X_{ij}$; the trained models are stored in the array models.]
In Algorithm 1, rows 2-3 form two loops that execute 2n times in total; each pass creates a Gaussian mixture model in row 4 (corresponding to step three) and trains it in row 5 (corresponding to step four), the training data being the single-column matrix $X_{ij}$. Algorithm 1 therefore creates and trains 2n Gaussian mixture models, looping through step three and step four 2n times, and the trained models are stored in the array models;
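Since the pseudo code figure for Algorithm 1 does not reproduce here, the following Python sketch is an assumed reconstruction from the description above (it uses scikit-learn's GaussianMixture in place of steps three and four, and a dictionary keyed by (label, column) instead of the array named in the pseudo code):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_supervised_models(X, y, n_components=3):
    """Algorithm 1 (reconstructed): one GMM per (label, column) pair -> 2n models."""
    models = {}
    for label in (0, 1):                      # outer loop: good/bad samples
        X_label = X[y == label]               # rows of X_0 or X_1
        for j in range(X.shape[1]):           # inner loop: the n columns
            gmm = GaussianMixture(n_components=n_components, random_state=0)
            gmm.fit(X_label[:, j].reshape(-1, 1))   # single-column matrix X_ij
            models[(label, j)] = gmm
    return models

# Toy labeled data: 200 rows, n = 3 columns, ~10% bad samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (rng.random(200) < 0.1).astype(int)
print(len(train_supervised_models(X, y)))  # 2n = 6 trained models
```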
Step six: unsupervised learning; suppose there is some historical data that has not been labeled, denoted X′. Because the data is unlabeled, there is no way to distinguish good samples from bad samples. Assuming matrix X′ has m columns, X′ is now split into m single-column matrices $X'_1, X'_2, \dots, X'_m$. As in step five, a Gaussian mixture model is created for each single-column matrix, so step six has m models in total. The pseudo code is as follows:
Algorithm 2: unsupervised learning Gaussian mixture models.
[Pseudo code figure: a single loop over the m columns; each pass creates a Gaussian mixture model and trains it on the single-column matrix $X'_j$; the trained models are stored in the array models2.]
Algorithm 2 is similar to Algorithm 1: it has a single loop that executes m times, and each pass executes step three and step four, creating and training one Gaussian mixture model. There are thus m models in total, and the trained models are placed in the array models2;
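A matching sketch of Algorithm 2 (same assumptions as the Algorithm 1 sketch) trains one model per column of the unlabeled matrix X′:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_unsupervised_models(X_prime, n_components=3):
    """Algorithm 2 (reconstructed): one GMM per column of unlabeled X' -> m models."""
    models2 = []
    for j in range(X_prime.shape[1]):          # single loop over the m columns
        gmm = GaussianMixture(n_components=n_components, random_state=0)
        gmm.fit(X_prime[:, j].reshape(-1, 1))  # single-column matrix X'_j
        models2.append(gmm)
    return models2

X_prime = np.random.default_rng(1).normal(size=(500, 4))  # m = 4 columns
print(len(train_unsupervised_models(X_prime)))            # 4 trained models
```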
Step seven: prediction; first consider the prediction of supervised learning. When new transaction data $X^{(new)}$ is generated, its class must be predicted; that is, given the data, the posterior probability $P(y \mid X^{(new)})$ must be calculated, where $y \in \{0, 1\}$ is the class, $y = 0$ a good sample and $y = 1$ a bad sample. According to Bayes' theorem:

$$P(y \mid X^{(new)}) = \frac{P(X^{(new)} \mid y)\, P(y)}{P(X^{(new)})} \tag{7}$$

Therefore, to calculate the posterior probability, the prior probability $P(y)$, the conditional probability $P(X^{(new)} \mid y)$, and $P(X^{(new)})$ must be calculated. The prior probability $P(y)$ can be given by expert experience or calculated as the proportion of good and bad samples, and naive Bayes can be used when calculating the conditional probability. In step five the data has n columns; here it is assumed that the new data $X^{(new)}$ is consistent with the training data and also has n columns, written

$$X^{(new)} = \big(X_1^{(new)}, X_2^{(new)}, \dots, X_n^{(new)}\big)$$

Because the data was converted in step two and the columns of the converted data are independent, the naive Bayes formula can be applied:

$$P(X^{(new)} \mid y) = \prod_{i=1}^{n} P(X_i^{(new)} \mid y)$$

Formula (7) then becomes:

$$P(y \mid X^{(new)}) = \frac{P(y)\, \prod_{i=1}^{n} P(X_i^{(new)} \mid y)}{P(X^{(new)})} \tag{8}$$

wherein the conditional probabilities $P(X_i^{(new)} \mid y)$, 2n in total, can be calculated with the 2n models trained in step five, and $P(X^{(new)})$ can be calculated with the total probability formula. The posterior probability calculated by formula (8) is the prediction of supervised learning. Because the data in unsupervised learning is not labeled with good or bad samples, formula (8) cannot be used; only the naive Bayes formula can be applied:

$$P(X^{(new2)}) = \prod_{i=1}^{m} P(X_i^{(new2)})$$

Note that in this formula $i = 1, 2, \dots, m$, since step six assumes that the unsupervised training data has m columns. Because the Gaussian mixture model is a generative model, $P(X^{(new2)})$ can be treated as the probability that the new data $X^{(new2)}$ was generated by the Gaussian mixture model; when the probability calculated for some new data is very low, these data points can be taken as outliers, i.e. abnormal transactions.
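The step-seven computations can be sketched as follows (an assumed implementation of formulas (7) and (8) on top of the models from the sketches above; note that GaussianMixture.score_samples returns log densities, which stand in here for the conditional probabilities, and the prior 0.05 is a hypothetical value):

```python
import numpy as np

def posterior_bad(x_new, models, prior_bad=0.05):
    """Supervised prediction via formula (8): returns P(y = 1 | x_new).
    `models` is the (label, column) dictionary from the Algorithm 1 sketch."""
    log_joint = {}
    for label, prior in ((0, 1.0 - prior_bad), (1, prior_bad)):
        # Naive Bayes: log P(y) + sum of per-column log densities log P(x_i | y).
        log_cond = sum(
            models[(label, j)].score_samples(np.array([[x_new[j]]]))[0]
            for j in range(len(x_new))
        )
        log_joint[label] = np.log(prior) + log_cond
    # Normalize by P(x_new) via the total probability formula (log-sum-exp).
    m = max(log_joint.values())
    z = m + np.log(sum(np.exp(v - m) for v in log_joint.values()))
    return float(np.exp(log_joint[1] - z))

def unsupervised_log_density(x_new2, models2):
    """Unsupervised score: log P(x_new2) as a sum of per-column log densities;
    a very low value marks an outlier, i.e. a suspected abnormal transaction."""
    return float(sum(
        models2[i].score_samples(np.array([[x_new2[i]]]))[0]
        for i in range(len(x_new2))
    ))
```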
In this embodiment: in step seven, after the prediction results of supervised learning and unsupervised learning are obtained, a rule model, a linear model, or an ensemble model can be established to combine the two prediction results into the final output result; one possible rule model is sketched below.
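As one possible synthesis of the two outputs (a hypothetical rule model; the thresholds are illustrative assumptions, not values from the patent):

```python
def final_decision(p_bad, log_density, p_threshold=0.9, density_threshold=-25.0):
    """Rule model: flag a transaction when the supervised posterior is high
    or the unsupervised log density is very low (an outlier)."""
    return (p_bad > p_threshold) or (log_density < density_threshold)
```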
To achieve the above object, the present invention further provides an anti-fraud computer apparatus based on supervised learning and unsupervised learning, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of any of the preceding claims 1 to 2 when executing the computer program.
To achieve the above object, the present invention also provides an anti-fraud computer device storage medium based on supervised and unsupervised learning, having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of any of the preceding claims 1 to 2.
Further, the programming language used in the invention is Python. The principal component analysis in step two may use the decomposition.PCA class of the scikit-learn package, and the Gaussian mixture model in step three may use the mixture.GaussianMixture class of the scikit-learn package. An important parameter of this class is n_components, i.e. how many components the mixture model consists of. n_components is a hyperparameter that must be tuned manually or found by hyperparameter search; setting its value too high causes overfitting and longer calculation time.
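A sketch of searching n_components (one common way to carry out the hyperparameter search mentioned above, assumed here rather than prescribed by the patent, scoring each candidate by held-out log-likelihood):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

x = np.random.default_rng(0).normal(size=(1000, 1))
x_train, x_val = train_test_split(x, test_size=0.2, random_state=0)

best_k, best_score = None, -np.inf
for k in range(1, 8):                                  # candidate component counts
    gmm = GaussianMixture(n_components=k, random_state=0).fit(x_train)
    score = gmm.score(x_val)                           # mean held-out log-likelihood
    if score > best_score:
        best_k, best_score = k, score
print(best_k, round(best_score, 3))
```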
Further, step four is the expectation-maximization algorithm. The model parameter initialization in ① may use random values, but the EM algorithm is sensitive to the setting of initial values, so it is preferable to compute initial parameter values with the k-means algorithm, which may use the cluster.KMeans class of the scikit-learn package. ② and ③ may be implemented according to the formulas listed in step four, and the numpy package may be used to speed up the calculation; of course, ② and ③ may also be implemented with the algorithm built into mixture.GaussianMixture.
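A sketch of the k-means initialization (an assumed way of deriving θ(0) from cluster.KMeans; note that mixture.GaussianMixture also uses init_params='kmeans' by default):

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_initial_params(x, k):
    """Initial (omega, mu, sigma) for the EM loop of step four, from cluster.KMeans."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(x.reshape(-1, 1))
    omega = np.array([(labels == j).mean() for j in range(k)])
    mu = np.array([x[labels == j].mean() for j in range(k)])
    sigma = np.array([x[labels == j].std() + 1e-9 for j in range(k)])
    return omega, mu, sigma

x = np.concatenate([np.random.normal(0, 1, 900), np.random.normal(8, 2, 100)])
print(kmeans_initial_params(x, k=2))
```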
Further, the supervised learning of step five and the unsupervised learning of step six are both loops over step three and step four.
Further, in step seven, the conditional probability of the data under each Gaussian mixture model can be calculated with the score_samples method of the mixture.GaussianMixture class (score_samples returns log probability densities, which can be exponentiated into the probabilities used in the formulas above).
After the above technical scheme is adopted, on the one hand, an anti-fraud model is established by combining supervised learning with unsupervised learning; the supervised model and the unsupervised model are contained in one large model, which achieves the effect of an ensemble, so the prediction results are better than those of supervised or unsupervised learning considered alone. Supervised learning and unsupervised learning are complementary, and combining the two learning modes can detect both known fraud patterns and unknown fraud patterns. On the other hand, step four derives the concrete calculation formulas of the expectation-maximization (EM) algorithm, and as step four shows, the final formulas are easy to implement in a programming language. Once implemented, the algorithm can be improved for specific application scenarios. For example, each iteration of steps three and four requires the full data set as input, so when the data volume is very large, one iteration takes a long time; the existing EM algorithm can then be improved so that each iteration takes only one batch of data as input, which greatly improves calculation speed while preserving calculation accuracy.
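A sketch of the mini-batch improvement described above (the batch size and the running-average step size are illustrative assumptions; each iteration processes one batch instead of the full data set):

```python
import numpy as np

def minibatch_em_gmm_1d(x, k, batch_size=256, n_iter=500, lr=0.05, seed=0):
    """Mini-batch EM for a univariate GMM: per-iteration cost is O(batch_size * k)."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False).astype(float)
    sigma = np.full(k, x.std() + 1e-9)
    omega = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        xb = rng.choice(x, size=batch_size, replace=False)   # one batch per iteration
        # E step on the batch only.
        dens = np.exp(-0.5 * ((xb[:, None] - mu) / sigma) ** 2) \
               / (np.sqrt(2.0 * np.pi) * sigma)
        gamma = (omega * dens) / (omega * dens).sum(axis=1, keepdims=True)
        nj = gamma.sum(axis=0)
        # M step on the batch, blended into the running parameters with step size lr.
        mu = (1 - lr) * mu + lr * (gamma * xb[:, None]).sum(axis=0) / nj
        sigma = (1 - lr) * sigma \
                + lr * (np.sqrt((gamma * (xb[:, None] - mu) ** 2).sum(axis=0) / nj) + 1e-9)
        omega = (1 - lr) * omega + lr * nj / batch_size      # stays normalized
    return omega, mu, sigma

x = np.concatenate([np.random.normal(0, 1, 90000), np.random.normal(8, 2, 10000)])
print(minibatch_em_gmm_1d(x, k=2))
```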
Furthermore, it should be understood that although this description refers to embodiments, not every embodiment contains only a single independent technical solution. This manner of description is adopted for clarity only; those skilled in the art should treat the description as a whole, and the embodiments may be combined as appropriate to form other embodiments understandable to those skilled in the art.

Claims (7)

1. An anti-fraud method based on supervised learning and unsupervised learning is characterized by comprising the following specific steps:
Step one: data preprocessing; the input data must be numerical, so non-numerical data needs to be converted into numerical data first, and then all the input data can be standardized, for example with the StandardScaler of scikit-learn;
Step two: data conversion; in general the features of the input data are correlated with one another, so the input data needs to be converted with the principal component analysis (PCA) method; the columns of the converted data are uncorrelated, and the converted data is still denoted X;
Step three: creating a Gaussian mixture model; a corresponding Gaussian mixture model is established for each column of the converted data, the Gaussian mixture model being written as:

$$p(x_i \mid \theta) = \sum_{j=1}^{k} \omega_j\, \varphi(x_i \mid \mu_j, \sigma_j) \tag{1}$$

wherein $x_i$ is the $i$-th row of the data and

$$\theta = (\omega_1, \dots, \omega_k,\ \mu_1, \dots, \mu_k,\ \sigma_1, \dots, \sigma_k)$$

are the parameters of the model. The Gaussian mixture model contains $k$ components, each component being an independent Gaussian univariate distribution; because each Gaussian mixture model covers only one column of the data set X, its components only need Gaussian univariate distributions, that is:

$$\varphi(x \mid \mu_j, \sigma_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\!\left( -\frac{(x - \mu_j)^2}{2\sigma_j^2} \right)$$

Furthermore,

$$\sum_{j=1}^{k} \omega_j = 1, \qquad \omega_j \geq 0$$

Each $\omega_j$ is the weight of the $j$-th component in the mixture model. Now, to each data point $x_i$ add a latent variable $z_i$ with value range $\{1, 2, \dots, k\}$; $z_i = 2$ means that data point $x_i$ was generated by the 2nd component, and $z_i = j$ means that $x_i$ was generated by the $j$-th component. Then for any $\theta$:

$$p(x_i, z_i = j \mid \theta) = \omega_j\, \varphi(x_i \mid \mu_j, \sigma_j) \tag{2}$$

From formula (2) and the Bayesian formula one obtains:

$$p(z_i = j \mid x_i, \theta) = \frac{\omega_j\, \varphi(x_i \mid \mu_j, \sigma_j)}{\sum_{l=1}^{k} \omega_l\, \varphi(x_i \mid \mu_l, \sigma_l)}$$

This formula is the probability that data point $x_i$ was generated by the $j$-th component; define $\gamma_{ij} = p(z_i = j \mid x_i, \theta)$;
Step four: the expectation-maximization algorithm; after the model is determined, its parameters need to be estimated. This step uses maximum likelihood parameter estimation, customarily with the log-likelihood function:

$$L(\theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta) \tag{3}$$

According to the total probability formula:

$$p(x_i \mid \theta) = \sum_{j=1}^{k} p(x_i, z_i = j \mid \theta)$$

formula (3) can be written as:

$$L(\theta) = \sum_{i=1}^{N} \log \sum_{j=1}^{k} \omega_j\, \varphi(x_i \mid \mu_j, \sigma_j) \tag{4}$$

Formula (4) contains the logarithm of a sum, so its maximum likelihood estimate has no analytic solution and can only be found iteratively. Here the expectation-maximization (EM) algorithm is adopted to maximize the objective function $L(\theta)$. The EM algorithm is an iterative algorithm consisting mainly of an E step (expectation) and an M step (maximization). By derivation from (4), the problem becomes the following maximization:

$$\theta^{(m+1)} = \arg\max_{\theta}\, E\big[\log P(x, z \mid \theta) \mid x, \theta^{(m)}\big] \tag{5}$$

wherein $\theta^{(m)}$ is the parameter estimate obtained after the $m$-th iteration, and formula (5) yields the parameter value $\theta^{(m+1)}$ after the $(m+1)$-th iteration. Define the Q function:

$$Q(\theta \mid \theta^{(m)}) = E\big[\log P(x, z \mid \theta) \mid x, \theta^{(m)}\big]$$

Then the Q function of the $i$-th data point $x_i$ is:

$$Q_i(\theta \mid \theta^{(m)}) = \sum_{j=1}^{k} \gamma_{ij} \log\big(\omega_j\, \varphi(x_i \mid \mu_j, \sigma_j)\big)$$

This derivation uses $\gamma_{ij}$ and formula (2), so the Q function for all data points is:

$$Q(\theta \mid \theta^{(m)}) = \sum_{i=1}^{N} \sum_{j=1}^{k} \gamma_{ij} \big(\log \omega_j + \log \varphi(x_i \mid \mu_j, \sigma_j)\big) \tag{6}$$
after deriving the formula of the Q function, the main steps of the EM algorithm can be listed as follows:
① Initialization: before iteration begins, set the value of the parameter m to 0; the parameter m can be regarded as the number of iterations. Initial values must be set for the model parameters:

$$\theta^{(0)} = \big(\omega_1^{(0)}, \dots, \omega_k^{(0)},\ \mu_1^{(0)}, \dots, \mu_k^{(0)},\ \sigma_1^{(0)}, \dots, \sigma_k^{(0)}\big)$$

wherein $\theta^{(0)}$ denotes the model parameter values after the 0th iteration;
② E step: according to the current model parameter values $\theta^{(m)}$, it is necessary to calculate the responsivity of component $j$ for data point $x_i$:

$$\gamma_{ij} = \frac{\omega_j^{(m)}\, \varphi(x_i \mid \mu_j^{(m)}, \sigma_j^{(m)})}{\sum_{l=1}^{k} \omega_l^{(m)}\, \varphi(x_i \mid \mu_l^{(m)}, \sigma_l^{(m)})}$$

This formula has already been given in step three;
③ M step: according to the definition of the Q function, each iteration must find the value of the parameter $\theta$ that maximizes the value of the Q function, that is:

$$\theta^{(m+1)} = \arg\max_{\theta}\, Q(\theta \mid \theta^{(m)})$$

In other words, in the M step it is necessary to find the parameter $\theta$ that maximizes $Q(\theta \mid \theta^{(m)})$. To obtain the optimal $\mu_j$, the partial derivative with respect to $\mu_j$ is taken and set to zero:

$$\frac{\partial Q(\theta \mid \theta^{(m)})}{\partial \mu_j} = 0$$

Substituting formula (6) into the above equation and solving yields:

$$\mu_j^{(m+1)} = \frac{\sum_{i=1}^{N} \gamma_{ij}\, x_i}{\sum_{i=1}^{N} \gamma_{ij}}$$

Similarly, the optimal $\sigma_j$ requires calculating the following formula:

$$\frac{\partial Q(\theta \mid \theta^{(m)})}{\partial \sigma_j} = 0$$

Substituting formula (6) and solving yields:

$$\big(\sigma_j^{(m+1)}\big)^2 = \frac{\sum_{i=1}^{N} \gamma_{ij}\, \big(x_i - \mu_j^{(m+1)}\big)^2}{\sum_{i=1}^{N} \gamma_{ij}}$$

The formula for the optimal $\omega_j$ is:

$$\omega_j^{(m+1)} = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ij}$$

After the M step is finished, the new post-iteration parameter values are obtained:

$$\theta^{(m+1)} = \big(\omega_1^{(m+1)}, \dots, \omega_k^{(m+1)},\ \mu_1^{(m+1)}, \dots, \mu_k^{(m+1)},\ \sigma_1^{(m+1)}, \dots, \sigma_k^{(m+1)}\big)$$
④ Repeat ② and ③, i.e. alternately calculate the E step and the M step; note that after each M step the parameter m is incremented by 1 (m = m + 1), and the loop ends when the iteration converges to a preset threshold;
when ① to ④ have been executed, all parameters of the Gaussian mixture model have been calculated, and the training of one Gaussian mixture model is finished;
Step five: supervised learning; suppose the historical data needed for supervised learning is a matrix X and that each row of X has been labeled, normal transactions being recorded as 0 and abnormal transactions as 1, with the label information stored in one column of X. According to the label information, X is split into two matrices $X_0$ and $X_1$: $X_0$ contains all normal transactions and $X_1$ contains all abnormal transactions. Assuming matrix X has n columns, $X_0$ and $X_1$ also each have n columns. Now $X_0$ and $X_1$ are each divided into n single-column matrices, denoted $X_{01}, X_{02}, \dots, X_{0n}$ and $X_{11}, X_{12}, \dots, X_{1n}$ respectively, so the matrix X is split into 2n single-column matrices. For each single-column matrix $X_{01}, X_{02}, \dots, X_{0n}, X_{11}, X_{12}, \dots, X_{1n}$, an independent Gaussian mixture model must be created and trained with the EM algorithm, so 2n models need to be created in total. The pseudo code is as follows:
Algorithm 1: supervised learning Gaussian mixture models.
[Pseudo code figure: rows 2-3 form two nested loops over the label (0, 1) and the n columns; row 4 creates a Gaussian mixture model and row 5 trains it on the single-column matrix $X_{ij}$; the trained models are stored in the array models.]
In Algorithm 1, rows 2-3 form two loops that execute 2n times in total; each pass creates a Gaussian mixture model in row 4 (corresponding to step three) and trains it in row 5 (corresponding to step four), the training data being the single-column matrix $X_{ij}$. Algorithm 1 therefore creates and trains 2n Gaussian mixture models, looping through step three and step four 2n times, and the trained models are stored in the array models;
Step six: unsupervised learning; suppose there is some historical data that has not been labeled, denoted X′. Because the data is unlabeled, there is no way to distinguish good samples from bad samples. Assuming matrix X′ has m columns, X′ is now split into m single-column matrices $X'_1, X'_2, \dots, X'_m$. As in step five, a Gaussian mixture model is created for each single-column matrix, so step six has m models in total. The pseudo code is as follows:
Algorithm 2: unsupervised learning Gaussian mixture models.
[Pseudo code figure: a single loop over the m columns; each pass creates a Gaussian mixture model and trains it on the single-column matrix $X'_j$; the trained models are stored in the array models2.]
Algorithm 2 is similar to Algorithm 1: it has a single loop that executes m times, and each pass executes step three and step four, creating and training one Gaussian mixture model. There are thus m models in total, and the trained models are placed in the array models2;
Step seven: prediction; first consider the prediction of supervised learning. When new transaction data $X^{(new)}$ is generated, its class must be predicted; that is, given the data, the posterior probability $P(y \mid X^{(new)})$ must be calculated, where $y \in \{0, 1\}$ is the class, $y = 0$ a good sample and $y = 1$ a bad sample. According to Bayes' theorem:

$$P(y \mid X^{(new)}) = \frac{P(X^{(new)} \mid y)\, P(y)}{P(X^{(new)})} \tag{7}$$

Therefore, to calculate the posterior probability, the prior probability $P(y)$, the conditional probability $P(X^{(new)} \mid y)$, and $P(X^{(new)})$ must be calculated. The prior probability $P(y)$ can be given by expert experience or calculated as the proportion of good and bad samples, and naive Bayes can be used when calculating the conditional probability. In step five the data has n columns; here it is assumed that the new data $X^{(new)}$ is consistent with the training data and also has n columns, written

$$X^{(new)} = \big(X_1^{(new)}, X_2^{(new)}, \dots, X_n^{(new)}\big)$$

Because the data was converted in step two and the columns of the converted data are independent, the naive Bayes formula can be applied:

$$P(X^{(new)} \mid y) = \prod_{i=1}^{n} P(X_i^{(new)} \mid y)$$

Formula (7) then becomes:

$$P(y \mid X^{(new)}) = \frac{P(y)\, \prod_{i=1}^{n} P(X_i^{(new)} \mid y)}{P(X^{(new)})} \tag{8}$$

wherein the conditional probabilities $P(X_i^{(new)} \mid y)$, 2n in total, can be calculated with the 2n models trained in step five, and $P(X^{(new)})$ can be calculated with the total probability formula. The posterior probability calculated by formula (8) is the prediction of supervised learning. Because the data in unsupervised learning is not labeled with good or bad samples, formula (8) cannot be used; only the naive Bayes formula can be applied:

$$P(X^{(new2)}) = \prod_{i=1}^{m} P(X_i^{(new2)})$$

Note that in this formula $i = 1, 2, \dots, m$, since step six assumes that the unsupervised training data has m columns. Because the Gaussian mixture model is a generative model, $P(X^{(new2)})$ can be treated as the probability that the new data $X^{(new2)}$ was generated by the Gaussian mixture model; when the probability calculated for some new data is very low, these data points can be taken as outliers, i.e. abnormal transactions.
2. The anti-fraud method based on supervised learning and unsupervised learning according to claim 1, characterized in that: the prior probability $P(y)$ can be given by expert experience or calculated as the proportion of good and bad samples, and naive Bayes can be used when calculating the conditional probability.

3. The anti-fraud method based on supervised learning and unsupervised learning according to claim 1, characterized in that: in step five the data has n columns, and the new data $X^{(new)} = (X_1^{(new)}, X_2^{(new)}, \dots, X_n^{(new)})$ is assumed to be consistent with the training data and also has n columns; because the data was converted in step two and the columns of the converted data are independent, the naive Bayes formula can be applied:

$$P(X^{(new)} \mid y) = \prod_{i=1}^{n} P(X_i^{(new)} \mid y)$$

so that formula (7) becomes:

$$P(y \mid X^{(new)}) = \frac{P(y)\, \prod_{i=1}^{n} P(X_i^{(new)} \mid y)}{P(X^{(new)})} \tag{8}$$

wherein the conditional probabilities $P(X_i^{(new)} \mid y)$, 2n in total, can be calculated with the 2n models trained in step five, and $P(X^{(new)})$ can be calculated with the total probability formula; the posterior probability calculated by formula (8) is the prediction of supervised learning, and because the data in unsupervised learning is not labeled with good or bad samples, formula (8) cannot be used and only the naive Bayes formula can be applied:

$$P(X^{(new2)}) = \prod_{i=1}^{m} P(X_i^{(new2)})$$

wherein $i = 1, 2, \dots, m$, since step six assumes that the unsupervised training data has m columns.

4. The anti-fraud method based on supervised learning and unsupervised learning according to claim 1, characterized in that: because the Gaussian mixture model is a generative model, $P(X^{(new2)})$ can be treated as the probability that the new data $X^{(new2)}$ was generated by the Gaussian mixture model; when the probability calculated for some new data is very low, these data points can be taken as outliers, i.e. abnormal transactions.
5. An anti-fraud method based on supervised and unsupervised learning according to claim 1, characterized in that: in the seventh step, after the prediction results of supervised learning and unsupervised learning are obtained, a rule model, a linear model or an integrated model can be established to synthesize the prediction results to obtain the final output result.
6. An anti-fraud computer device based on supervised and unsupervised learning, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that: the processor, when executing the computer program, realizes the steps of the method of any of the preceding claims 1 to 2.
7. An anti-fraud computer device storage medium based on supervised and unsupervised learning, having a computer program stored thereon, characterized by: the program when executed by a processor implements the steps of the method of any of claims 1 to 2.
CN201910977563.9A 2019-10-15 2019-10-15 Anti-fraud method based on supervised learning and unsupervised learning Active CN110717601B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910977563.9A CN110717601B (en) 2019-10-15 2019-10-15 Anti-fraud method based on supervised learning and unsupervised learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910977563.9A CN110717601B (en) 2019-10-15 2019-10-15 Anti-fraud method based on supervised learning and unsupervised learning

Publications (2)

Publication Number Publication Date
CN110717601A true CN110717601A (en) 2020-01-21
CN110717601B CN110717601B (en) 2022-05-03

Family

ID=69211695

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910977563.9A Active CN110717601B (en) 2019-10-15 2019-10-15 Anti-fraud method based on supervised learning and unsupervised learning

Country Status (1)

Country Link
CN (1) CN110717601B (en)


Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103745233A (en) * 2014-01-23 2014-04-23 西安电子科技大学 Hyper-spectral image classifying method based on spatial information transfer
CN104280771A (en) * 2014-10-27 2015-01-14 中国石油集团川庆钻探工程有限公司地球物理勘探公司 Three-dimensional seismic data waveform semi-supervised clustering method based on EM algorithm
CN104680015A (en) * 2015-03-02 2015-06-03 华南理工大学 Online soft measurement method for sewage treatment based on quick relevance vector machine
CN104881687A (en) * 2015-06-02 2015-09-02 四川理工学院 Magnetic resonance image classification method based on semi-supervised Gaussian mixed model
CN105160290A (en) * 2015-07-03 2015-12-16 东南大学 Mobile boundary sampling behavior identification method based on improved dense locus
CN106778854A (en) * 2016-12-07 2017-05-31 西安电子科技大学 Activity recognition method based on track and convolutional neural networks feature extraction
US20190124045A1 (en) * 2017-10-24 2019-04-25 Nec Laboratories America, Inc. Density estimation network for unsupervised anomaly detection
US20190108447A1 (en) * 2017-11-30 2019-04-11 Intel Corporation Multifunction perceptrons in machine learning environments
CN108734479A (en) * 2018-04-12 2018-11-02 阿里巴巴集团控股有限公司 Data processing method, device, equipment and the server of Insurance Fraud identification
CN108804784A (en) * 2018-05-25 2018-11-13 江南大学 A kind of instant learning soft-measuring modeling method based on Bayes's gauss hybrid models
CN109165550A (en) * 2018-07-13 2019-01-08 首都师范大学 A kind of multi-modal operation track fast partition method based on unsupervised deep learning
CN109784676A (en) * 2018-12-25 2019-05-21 杨鑫 The study and application method, device and computer readable storage medium of data analysis
CN110071913A (en) * 2019-03-26 2019-07-30 同济大学 A kind of time series method for detecting abnormality based on unsupervised learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郭涛 (Guo Tao): "Research and Application of Medical Insurance Fraud Detection" (医疗保险欺诈检测的研究与应用), China Master's Theses Full-text Database *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113435997A (en) * 2021-06-08 2021-09-24 成都熵焓科技有限公司 Gaussian mixture model bank transaction data simulation generation algorithm based on deep learning
CN113538049A (en) * 2021-07-14 2021-10-22 北京明略软件系统有限公司 Abnormal flow identification system

Also Published As

Publication number Publication date
CN110717601B (en) 2022-05-03

Similar Documents

Publication Title
Zeng et al. SMO-based pruning methods for sparse least squares support vector machines
Tamar et al. Scaling up robust MDPs using function approximation
Tamar et al. Learning the variance of the reward-to-go
US9367818B2 (en) Transductive Lasso for high-dimensional data regression problems
Doppa et al. HC-Search: A learning framework for search-based structured prediction
US10769551B2 (en) Training data set determination
US7421380B2 (en) Gradient learning for probabilistic ARMA time-series models
US11004026B2 (en) Method and apparatus for determining risk management decision-making critical values
Ayodeji et al. Causal augmented ConvNet: A temporal memory dilated convolution model for long-sequence time series prediction
CN110717601B (en) Anti-fraud method based on supervised learning and unsupervised learning
Andrews Rank-based estimation for GARCH processes
CN112734106A (en) Method and device for predicting energy load
Tanha et al. Disagreement-based co-training
Schumacher et al. An introduction to delay-coupled reservoir computing
Zang et al. Semi-supervised and long-tailed object detection with cascadematch
Lee et al. Trajectory of mini-batch momentum: Batch size saturation and convergence in high dimensions
Hong et al. Deep active learning with augmentation-based consistency estimation
Galili et al. Splitting matters: how monotone transformation of predictor variables may improve the predictions of decision tree models
US11556760B2 (en) Learning-based data processing system and model update method
Utkin et al. SurvBeX: An explanation method of the machine learning survival models based on the Beran estimator
Dubiel-Teleszynski et al. Sequential learning and economic benefits from dynamic term structure models
Brandimarte et al. Approximate Dynamic Programming and Reinforcement Learning for Continuous States
Verhofen Markov chain monte carlo methods in financial econometrics
Kamm et al. A novel approach to rating transition modelling via Machine Learning and SDEs on Lie groups
Bukowski et al. SuperNet--An efficient method of neural networks ensembling

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant