CN109993538A

CN109993538A - Identity theft detection method based on probability graph model

Info

Publication number: CN109993538A
Application number: CN201910148549.8A
Authority: CN
Inventors: 王成; 胡腾
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2019-02-28
Filing date: 2019-02-28
Publication date: 2019-07-09

Abstract

The present invention provides an identity theft detection method based on a probability graph model, comprising the steps of: S1: collecting and preprocessing network payment transaction data to obtain a network payment transaction feature set; S2: building a probability graph model from the network payment transaction feature set; S3: inputting a training set and training the parameters of the probability graph model, the conditional probability parameters of the probability graph model being obtained using Bayes' theorem; S4: predicting on an input prediction set using the conditional probability parameters and Bayes' theorem to obtain a prediction result. Built on a probability graph model, the method detects network payment fraud by modeling the user's overall behavior and attributes, allows the detection model to be tuned dynamically online, and improves both the accuracy of intercepting fraudulent transactions and the robustness of the model.

Description

Identity theft detection method based on probability graph model

Technical field

The present invention relates to the field of anti-fraud detection for internet-finance network payments, and in particular to an identity theft detection method based on a probability graph model.

Background art

The mobile Internet is a double-edged sword: while it makes people's lives more convenient, it also introduces various hidden risks. For example, online payment platforms let people shop and pay anywhere, at any time, without leaving home, but this convenience also gives illegal attackers an opening. An attacker can steal a user's account information and private personal information, or even impersonate the user to carry out transactions or transfers and thereby commit fraud. To effectively protect the interests of users and companies, an effective network payment fraud detection system therefore needs to be established.

A number of network payment fraud models based on machine learning, and even deep learning, already exist, and the vast majority of these learning models are discriminative models based on expectation maximization. For online anti-fraud models, deep learning and similar models can outperform other methods when used for network payment fraud detection, but a deep learning model is a typical black-box model: its results are not interpretable and are therefore not sufficiently convincing.

Summary of the invention

In view of the above deficiencies of the prior art, the present invention provides an identity theft detection method based on a probability graph model. Based on a probability graph model, the method detects network payment fraud by modeling the user's overall behavior, allows the detection model to be tuned dynamically online, and improves both the accuracy of intercepting fraudulent transactions and the robustness of the model.

To achieve the above object, the present invention provides an identity theft detection method based on a probability graph model, comprising the steps of:

S1: collecting and preprocessing network payment transaction data to obtain a network payment transaction feature set;

S2: building a probability graph model from the network payment transaction feature set;

S3: inputting a training set and training the parameters of the probability graph model, the conditional probability parameters of the probability graph model being obtained using Bayes' theorem;

S4: predicting on an input prediction set using the conditional probability parameters and Bayes' theorem to obtain a prediction result.

Preferably, step S1 further comprises the steps of:

S11: a data cleaning step: the network payment transaction data are cleaned by filling in missing values, smoothing noise, and identifying and resolving inconsistencies, so as to standardize data formats, remove abnormal data, correct errors, and remove duplicate data;

S12: a data integration step: the network payment transaction data from multiple data sources are stored together to form a database;

S13: the network payment transaction data in the database are standardized to form the network payment transaction feature set.

Preferably, step S2 further comprises the steps of:

S21: obtaining the network payment transaction feature set θ, and inputting a candidate feature set θ', a relation set R, a label attribute Y, and a threshold λ, where θ' and R are initially empty (θ' = Φ, R = Φ, with Φ denoting the empty set);

S22: computing, according to formula (1), the mutual information I between a feature X_i of the network payment transaction feature set θ and the label attribute Y:

$$I(X_i; Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \tag{1}$$

where X_i denotes the i-th feature; i is a natural number greater than or equal to 1; Y denotes the label attribute; x denotes a value of X_i; y denotes a value of Y; p(x, y) denotes the joint probability of x and y; p(x) is the marginal probability of x; p(y) is the marginal probability of y; and I(X_i; Y) denotes the mutual information between X_i and Y;

S22: judging whether I(X_i; Y) is greater than or equal to the preset threshold λ; if so, continuing with the subsequent steps;

S23: updating the candidate feature set θ' according to formula (2):

θ' := θ' + X_i (2);

and updating the network payment transaction feature set θ according to formula (3):

θ := θ - X_i (3);

S24: obtaining the dependence relation r, r: X_i → Y;

S25: updating the relation set R according to formula (4);

S26: judging whether the number of features in the current candidate feature set θ' is greater than or equal to 2; if so, continuing with the subsequent steps, otherwise returning to step S23;

S27: computing, according to formula (5), the mutual information between each pair of features in the current candidate feature set θ':

$$I(X_i; X_j) = \sum_{x} \sum_{x'} p(x, x') \log \frac{p(x, x')}{p(x)\, p(x')} \tag{5}$$

where X_i denotes the i-th feature in θ', X_j denotes the j-th feature in θ', and i and j are natural numbers greater than or equal to 1; x denotes a value of X_i; x' denotes a value of X_j; p(x, x') denotes the joint probability of x and x'; p(x) is the marginal probability of x; p(x') is the marginal probability of x'; and I(X_i; X_j) denotes the mutual information between X_i and X_j; the relation set R is updated by formula (4);

S28: assigning the current θ' to θ and emptying the set θ'; computing the mutual information between each pair of features in θ by formula (5); if I(X_i; X_j) ≥ λ, determining the dependence relation r between the two features according to prior knowledge, and then updating the relation set R by formula (4);

S29: repeating step S28 until θ is empty or I(X_i; X_j) ≤ λ for all features, at which point the probability graph model is obtained from the current relation set R.
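
The feature selection and relation building of steps S21 to S29 can be illustrated with a minimal sketch. It estimates mutual information from empirical counts, keeps features whose mutual information with the label reaches λ, and then records pairwise relations among the kept features. The function names, the data layout, and the choice to record a feature pair as `(a, b)` are assumptions made for illustration; the source determines edge direction from prior knowledge, which this sketch does not model.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) p(y))), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def select_relations(features, label, lam):
    """Return (candidate feature names, relation list).

    features: dict mapping a feature name to its list of discrete values
    label:    list of label values (the attribute Y)
    lam:      mutual information threshold
    """
    candidates, relations = [], []
    for name, values in features.items():
        if mutual_information(values, label) >= lam:
            candidates.append(name)          # theta' := theta' + X_i
            relations.append((name, "Y"))    # r: X_i -> Y
    for i, a in enumerate(candidates):
        for b in candidates[i + 1:]:
            if mutual_information(features[a], features[b]) >= lam:
                # Direction is a placeholder; the source fixes it from prior knowledge.
                relations.append((a, b))
    return candidates, relations
```

With a toy data set in which one feature copies the label and another is independent of it, only the informative feature would be kept for any small positive λ.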

Preferably, step S3 further comprises the steps of:

S31: inputting a training set, the training set comprising feature attributes and a label attribute;

S32: computing the conditional probability parameters according to formula (6):

$$p_{train}(A_i \mid B) = \frac{p(A_i)\, p(B \mid A_i)}{\sum_{j} p(A_j)\, p(B \mid A_j)} \tag{6}$$

where A_i denotes the i-th parent node of the probability graph model; B denotes a child node of A_i; p_train(A_i | B) denotes the conditional probability parameter between A_i and B; p(A_i) denotes the marginal probability of A_i; p(B | A_i) denotes the probability that B occurs given A_i; A_j denotes the j-th parent node; p(A_j) denotes the marginal probability of A_j; and p(B | A_j) denotes the probability that B occurs given A_j;

S33: judging whether formula (6) has converged; if so, continuing with the subsequent steps, otherwise returning to step S31.
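
As a worked illustration of the Bayes' theorem computation in step S32, the sketch below normalizes the products p(A_j) p(B | A_j) over all parent states. The function name and the numbers in the usage note are invented for illustration and do not come from the source.

```python
def parent_posterior(prior, likelihood):
    """Apply Bayes' theorem over a set of parent states.

    prior:      dict mapping each parent state A_i to p(A_i)
    likelihood: dict mapping each parent state A_i to p(B | A_i)
    Returns a dict mapping each A_i to p(A_i | B).
    """
    evidence = sum(prior[a] * likelihood[a] for a in prior)
    if evidence == 0:
        raise ValueError("B has zero probability under every parent state")
    return {a: prior[a] * likelihood[a] / evidence for a in prior}
```

For example, with hypothetical priors `{"normal": 0.99, "fraud": 0.01}` and likelihoods `{"normal": 0.05, "fraud": 0.60}` for some observed event B, the call returns a distribution over the two states that sums to 1, with the "fraud" share equal to 0.006 / 0.0555.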

Preferably, step S4 further comprises the steps of:

S41: inputting a test set, the test set comprising the feature attribute Y';

S42: computing a posterior probability according to formula (7), and outputting the prediction result according to the posterior probability:

$$p(Y' \mid X_1, \ldots, X_n) = \frac{P(X_1, \ldots, X_n \mid Y')\, P(Y')}{P(X_1, \ldots, X_n)} \tag{7}$$

where p(Y' | X₁,…,X_n) denotes the probability that Y' occurs given X₁,…,X_n; P(X₁,…,X_n | Y') denotes the joint probability of X₁,…,X_n given Y'; P(Y') denotes the marginal probability of Y'; and P(X₁,…,X_n) denotes the joint probability of X₁,…,X_n.
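
The posterior computation of step S42 can be sketched as follows. To keep the example short, the joint term P(X₁,…,X_n | Y') is factored as a product of per-feature conditionals, which is a simplifying assumption of this sketch (a naive independence assumption) rather than the structure learned in step S2. The table layout and all names are hypothetical.

```python
def posterior(record, class_prior, cond_tables):
    """Posterior over labels for one transaction record.

    record:      dict mapping a feature name to its observed discrete value
    class_prior: dict mapping a label to P(Y')
    cond_tables: dict label -> feature name -> value -> P(X = value | Y')
    """
    joint = {}
    for label, p_y in class_prior.items():
        p = p_y
        for feat, val in record.items():
            # ASSUMPTION: features are conditionally independent given the label.
            p *= cond_tables[label][feat].get(val, 0.0)
        joint[label] = p
    evidence = sum(joint.values())  # P(X_1, ..., X_n)
    if evidence == 0:
        raise ValueError("record has zero probability under every label")
    return {label: p / evidence for label, p in joint.items()}
```

A record would then be flagged when the posterior of the fraud label exceeds a chosen decision threshold; the threshold itself is not specified in the source.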

Preferably, the method further comprises, after step S4, the step of:

S5: verifying the prediction result.

Preferably, step S5 further comprises the steps of:

S51: counting, from the prediction results of the formula (7) model, the number TP of positive samples classified as positive, the number FP of negative samples classified as positive, the number FN of positive samples classified as negative, and the number TN of negative samples classified as negative;

S52: computing a precision according to formula (8):

$$precision = \frac{TP}{TP + FP} \tag{8}$$

computing a recall according to formula (9):

$$recall = \frac{TP}{TP + FN} \tag{9}$$

and computing a disturb rate, disturb, according to formula (10);

S53: evaluating the prediction result according to the precision, the recall, and the disturb rate.
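
The verification step S5 can be sketched in a few lines. Precision and recall follow their standard definitions from TP, FP, and FN. The text does not reproduce formula (10), so the disturb rate is taken here, as a labeled assumption, to be the share of normal transactions that are wrongly flagged, FP / (FP + TN). Function names are hypothetical, and label 1 is assumed to mean fraud.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def evaluate(tp, fp, fn, tn):
    """Return (precision, recall, disturb); empty denominators yield 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # ASSUMPTION: disturb rate is the false positive rate; formula (10) is not shown.
    disturb = fp / (fp + tn) if fp + tn else 0.0
    return precision, recall, disturb
```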

By adopting the above technical solution, the present invention has the following beneficial effects:

A probability graph model based on Bayesian principles tends to be highly interpretable and persuasive when making predictions on data. The probability graph model is trained on a training set to obtain its conditional probability parameters; when predicting on a test set, prior knowledge and the conditions in the test set are used to obtain conditional probabilities and finally derive the posterior probability, so the result is highly convincing. A probability graph model can also handle situations with hidden variables. Basing the method on a probability graph model improves the interpretability of the model and provides a better guarantee for detecting and intercepting fraudulent transactions and for protecting the funds of users and enterprises.

Brief description of the drawings

Fig. 1 is an overall flow chart of the identity theft detection method based on a probability graph model according to an embodiment of the present invention;

Fig. 2 shows the probability graph model obtained by modeling bank data according to an embodiment of the present invention;

Fig. 3 is a detailed partial flow diagram of the identity theft detection method based on a probability graph model according to an embodiment of the present invention.

Detailed description of the embodiments

A preferred embodiment of the present invention is given below with reference to Figs. 1 to 3 and described in detail, so that the functions and features of the present invention can be better understood.

Referring to Figs. 1 to 3, an identity theft detection method based on a probability graph model according to an embodiment of the present invention comprises the steps of:

S1: collecting and preprocessing network payment transaction data to obtain a network payment transaction feature set.

Step S1 further comprises the steps of:

S11: a data cleaning step: the network payment transaction data are cleaned by filling in missing values, smoothing noise, and identifying and resolving inconsistencies, so as to standardize data formats, remove abnormal data, correct errors, and remove duplicate data;

S12: a data integration step: the network payment transaction data from multiple data sources are stored together to form a database;

S13: the network payment transaction data in the database are standardized to form the network payment transaction feature set.

Although internet finance now generates abundant transaction data, real-world data are generally incomplete, inconsistent, dirty data that cannot directly take part in model computation, so the raw data must be preprocessed. (1) Data cleaning: the data are cleaned by filling in missing values, smoothing noisy data, and identifying or resolving inconsistencies. The main goals are standardizing data formats (such as time), removing abnormal data, correcting errors, and removing duplicate data. (2) Data integration: data from multiple data sources are combined and stored together to build a data warehouse. (3) Data transformation: the data are converted into the form required by the learning model through smoothing, aggregation, data generalization, normalization, and similar operations.

For example, the original fields of the data and their types after preprocessing are shown in Table 1.

Table 1: Original fields and types after preprocessing

Field name	Data type	Field description	Type after preprocessing
Transaction_Time	String	Time at which the transaction occurred, accurate to the second	Integer
Check	String	Signature verification method of the transaction	Integer
Transaction_Type	String	Type of the transaction	Integer
Transaction_Amount	Float	Transaction amount, in RMB	Integer
Merchant_Code	String	Merchant number of the transaction	Integer
IP	String	Whether the transaction uses a commonly used IP	Integer
Sign	String	Label of the transaction	Integer

As Table 1 shows, most of the available original fields are strings, whereas the probability graph model itself can only handle discrete variables. Preprocessing therefore includes not only data cleaning and data integration; during data transformation, continuous floating-point values are also converted into discrete variables that the probability graph model can compute with.
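
The conversion of the Table 1 fields to integers can be sketched as below. The field names come from Table 1; everything else is an assumption for illustration: the timestamp format, the amount bin edges, reducing the time to an hour of day, and dropping (rather than filling) rows with missing values.

```python
from bisect import bisect_right
from datetime import datetime

AMOUNT_EDGES = [100.0, 1000.0, 10000.0]  # hypothetical bin edges, in RMB
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"        # assumed timestamp format
CATEGORICAL = ("Check", "Transaction_Type", "Merchant_Code", "IP", "Sign")


def encode_categories(values):
    """Map each distinct string to an integer code in order of first appearance."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]


def preprocess(records):
    """Turn a list of raw record dicts into integer-valued feature columns."""
    seen, clean = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        # Simplification: skip duplicates and rows with missing values.
        if key in seen or any(v is None or v == "" for v in r.values()):
            continue
        seen.add(key)
        clean.append(r)
    out = {
        # Hour of day as a coarse discrete time feature.
        "Transaction_Time": [
            datetime.strptime(r["Transaction_Time"], TIME_FORMAT).hour for r in clean
        ],
        # Index of the amount bin the value falls into.
        "Transaction_Amount": [
            bisect_right(AMOUNT_EDGES, r["Transaction_Amount"]) for r in clean
        ],
    }
    for field in CATEGORICAL:
        out[field] = encode_categories([r[field] for r in clean])
    return out
```

In practice the same step is often written with a data analysis library; the standard-library version is used here only to keep the sketch self-contained.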

S2: building a probability graph model from the network payment transaction feature set.

A complete probability graph is constructed by analyzing the dependence and independence between features. Constructing the probability graph amounts to constructing the joint distribution over the data features, and dependence and independence are the two main properties of that distribution. Independence properties are extremely important when answering queries, because they can be used to fundamentally reduce the computational cost of inference.
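
To make the cost argument concrete, and assuming the probability graph is a directed acyclic graph (a Bayesian network), the joint distribution factorizes over each node and its parents. The notation Pa(X_i) for the parent set of X_i is introduced here for illustration only.

```latex
% Factorization of the joint distribution in a Bayesian network:
% each feature depends only on its parents in the graph.
P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
```

For n binary features, a full joint table has 2^n - 1 free parameters, whereas the factorized form needs at most 2^k parameters per node when every node has at most k parents.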

In this embodiment, the algorithm environment for this step is based on Python and NumPy.

Step S2 further comprises the steps of:

S21: obtaining the network payment transaction feature set θ, and inputting a candidate feature set θ', a relation set R, a label attribute Y, and a threshold λ, where θ' and R are initially empty (θ' = Φ, R = Φ, with Φ denoting the empty set);

S22: computing, according to formula (1), the mutual information I between a feature X_i of the network payment transaction feature set θ and the label attribute Y:

$$I(X_i; Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \tag{1}$$

S23: updating the candidate feature set θ' according to formula (2):

θ' := θ' + X_i (2);

and updating the network payment transaction feature set θ according to formula (3):

θ := θ - X_i (3);

S24: obtaining the dependence relation r, r: X_i → Y;

S25: updating the relation set R according to formula (4);

S26: judging whether the number of features in the current candidate feature set θ' is greater than or equal to 2; if so, continuing with the subsequent steps, otherwise returning to step S23;

S27: computing, according to formula (5), the mutual information between each pair of features in the current candidate feature set θ':

$$I(X_i; X_j) = \sum_{x} \sum_{x'} p(x, x') \log \frac{p(x, x')}{p(x)\, p(x')} \tag{5}$$

where X_i denotes the i-th feature in θ', X_j denotes the j-th feature in θ', and i and j are natural numbers greater than or equal to 1; x denotes a value of X_i; x' denotes a value of X_j; p(x, x') denotes the joint probability of x and x'; p(x) is the marginal probability of x; p(x') is the marginal probability of x'; and I(X_i; X_j) denotes the mutual information between X_i and X_j; the relation set R is updated by formula (4);

S28: assigning the current θ' to θ and emptying the set θ'; computing the mutual information between each pair of features in θ by formula (5); if I(X_i; X_j) ≥ λ, determining the dependence relation r between the two features according to prior knowledge, and then updating the relation set R by formula (4);

S29: repeating step S28 until θ is empty or I(X_i; X_j) ≤ λ for all features, at which point the probability graph model is obtained from the current relation set R.

S3: inputting a training set and training the parameters of the probability graph model, the conditional probability parameters of the probability graph model being obtained using Bayes' theorem.

The main purpose of this step is to train the parameters of the model. In essence, training the probability graph model means counting the marginal probability of each feature in the training set and, taking these marginals together with the computed joint probabilities of the features (that is, the posterior probabilities) as conditions, using Bayes' theorem to infer the conditional probabilities in the probability graph, which are the parameters of the model.
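
The counting that underlies this training step can be sketched as follows. The sketch estimates marginals and a conditional table for one parent-child pair by relative frequency; it is a simplified illustration with hypothetical function names, not the embodiment's library-based implementation, and it applies no smoothing.

```python
from collections import Counter, defaultdict


def marginal(values):
    """Relative frequency of each discrete value, i.e. an estimate of p(x)."""
    n = len(values)
    return {v: c / n for v, c in Counter(values).items()}


def conditional_table(child, parent):
    """Estimate P(child | parent) from paired observations.

    Returns a dict: parent value -> child value -> relative frequency.
    """
    pair_counts = Counter(zip(parent, child))
    parent_counts = Counter(parent)
    table = defaultdict(dict)
    for (p, c), k in pair_counts.items():
        table[p][c] = k / parent_counts[p]
    return dict(table)
```

Each row of the returned table sums to 1 over the child values observed with that parent value; combinations never seen in training are simply absent.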

In this embodiment, the algorithm environment for this step is based on Python, the pgmpy probability graph model library, and the pandas data analysis tool.

Step S3 further comprises the steps of:

S31: inputting a training set, the training set comprising feature attributes and a label attribute;

S32: computing the conditional probability parameters according to formula (6):

$$p_{train}(A_i \mid B) = \frac{p(A_i)\, p(B \mid A_i)}{\sum_{j} p(A_j)\, p(B \mid A_j)} \tag{6}$$

where A_i denotes the i-th parent node of the probability graph model; B denotes a child node of A_i; p_train(A_i | B) denotes the conditional probability parameter between A_i and B; p(A_i) denotes the marginal probability of A_i; p(B | A_i) denotes the probability that B occurs given A_i; A_j denotes the j-th parent node; p(A_j) denotes the marginal probability of A_j; and p(B | A_j) denotes the probability that B occurs given A_j.

S4: predicting on an input prediction set using the conditional probability parameters and Bayes' theorem to obtain a prediction result.

The main purpose of this step is to judge unknown records: for a real-time transaction record, the model gives a prediction result, that is, it judges whether the transaction is a normal transaction or a fraudulent one. The prediction process also relies mainly on Bayes' theorem: taking the features in the transaction record as conditions, together with the conditional probabilities in the model, Bayes' theorem is used to infer the posterior probability of the record.

In this embodiment, the algorithm environment for this step is based on Python, the pgmpy probability graph model library, the pandas data analysis tool, and NumPy.

Step S4 further comprises the steps of:

S41: inputting a test set, the test set comprising the feature attribute Y';

S42: inference with a Bayesian network derives the posterior probability from the conditional probabilities obtained during training and the conditions in the test set; the posterior probability is computed according to formula (7), and the prediction result is output according to the posterior probability:

$$p(Y' \mid X_1, \ldots, X_n) = \frac{P(X_1, \ldots, X_n \mid Y')\, P(Y')}{P(X_1, \ldots, X_n)} \tag{7}$$

where p(Y' | X₁,…,X_n) denotes the probability that Y' occurs given X₁,…,X_n; P(X₁,…,X_n | Y') denotes the joint probability of X₁,…,X_n given Y'; P(Y') denotes the marginal probability of Y'; and P(X₁,…,X_n) denotes the joint probability of X₁,…,X_n.

S5: verifying the prediction result.

Step S5 further comprises the steps of:

S51: counting, from the prediction results of the formula (7) model, the number TP of positive samples classified as positive, the number FP of negative samples classified as positive, the number FN of positive samples classified as negative, and the number TN of negative samples classified as negative;

a recall is computed according to formula (9):

$$recall = \frac{TP}{TP + FN} \tag{9}$$

S53: evaluating the prediction result according to the precision, the recall, and the disturb rate.

For example, validation on a real internet bank transaction data set gives the recall (interception rate, true positive rate) when the disturb rate (disturb) is below 1%, 0.5%, 0.1%, and 0.05%, and the performance of the method is evaluated on that basis. The method of this embodiment outperforms previous research both on these metrics and in computation time, and has good robustness.
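
Reporting recall at a fixed disturb-rate budget can be sketched as a threshold sweep over the model's fraud scores. The sketch assumes that the disturb rate is the fraction of normal transactions that are flagged, that label 1 means fraud, and that a transaction is flagged when its score is at or above the threshold; the function name is hypothetical.

```python
def recall_at_disturb(scores, labels, max_disturb):
    """Best recall reachable by a score threshold whose disturb rate
    (ASSUMED here to be FP / number of negatives) stays within max_disturb."""
    pos = sum(labels)
    neg = len(labels) - pos
    best = 0.0
    # Lowering the threshold can only add flagged records, so FP never decreases.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        if neg and fp / neg > max_disturb:
            break
        if pos:
            best = tp / pos
    return best
```

Calling the function once per budget (for example 0.01, 0.005, 0.001, and 0.0005) would give a small table of recall against disturb rate of the kind described above.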

Referring to the probability graph model in Fig. 2, in actual use the method of this embodiment characterizes the joint probability model between the consumption patterns of different users and the different features. First, when different users open a bank card at a bank, the card may take on a fixed purpose in use (for example, a card used specifically for stock trading, or a salary card), so bank cards with different purposes may show different signature verification methods. If a user's bank card is used for a particular kind of transaction (such as stock trading), the transaction times of the card will show a relatively fixed regularity (for example, correlated with the opening and closing times of the stock market); the transaction amounts of the card will show relatively high correlation (related to stock prices); and the merchants trading with the card will also show relatively high correlation (for example, certain specific companies). Whether the IP is a commonly used one likewise reflects the stable distribution formed when the user trades. The consumption habits of different users thus constitute the behavior distributions of different users, and once a behavior pattern appears that does not match previous transactions, the transaction is very likely to be judged fraudulent. This is the interpretable logic of the method. Compared with the black box of a traditional deep learning model, the method of this embodiment combines knowledge of the banking business with assumptions about similar users to build a probability graph model that characterizes the distribution of user behavior, and the model has very good interpretable logic.

In addition, using a probability graph model as the prediction model makes it possible to better handle situations with hidden variables. This is because a probability graph model can provide a conventional prior assumption from domain knowledge: when the model itself has unobserved variables, Bayesian estimation can provide a reasonable estimate through a state-space model, which makes the method more robust.

In the identity theft detection method based on a probability graph model according to the embodiment of the present invention, the probability graph model, being based on Bayesian principles, tends to be highly interpretable and persuasive when making predictions on data. The probability graph model is trained on a training set to obtain its conditional probability parameters; when predicting on a test set, the prior probabilities and the conditions in the test set are used to obtain conditional probabilities and finally derive the posterior probability, so the result is highly convincing. A probability graph model can also handle situations with hidden variables, which existing methods based on discriminative models cannot do. The method of this embodiment therefore has advantages that existing discriminative models lack. It overcomes the shortcomings of traditional fraud detection methods based on deep learning, improves the interpretability of the model, and provides a better guarantee for detecting and intercepting fraudulent transactions and for protecting the funds of users and enterprises.

The present invention has been described in detail above with reference to the accompanying drawings and embodiments, and those of ordinary skill in the art can make many variations to the present invention based on the above description. Accordingly, certain details in the embodiments should not be construed as limiting the present invention, and the scope defined by the appended claims shall be taken as the scope of protection of the present invention.

Claims

1. An identity theft detection method based on a probability graph model, comprising the steps of:

S3: inputting a training set and training the parameters of the probability graph model, the conditional probability parameters of the probability graph model being obtained using Bayes' theorem;

S4: predicting on an input prediction set using the conditional probability parameters and Bayes' theorem to obtain a prediction result.

2. The identity theft detection method based on a probability graph model according to claim 1, characterized in that step S1 further comprises the steps of:

S11: a data cleaning step: the network payment transaction data are cleaned by filling in missing values, smoothing noise, and identifying and resolving inconsistencies, so as to standardize data formats, remove abnormal data, correct errors, and remove duplicate data;

S13: the network payment transaction data in the database are standardized to form the network payment transaction feature set.

3. The identity theft detection method based on a probability graph model according to claim 2, characterized in that step S2 further comprises the steps of:

S21: obtaining the network payment transaction feature set θ, and inputting a candidate feature set θ', a relation set R, a label attribute Y, and a threshold λ, where θ' and R are initially empty (θ' = Φ, R = Φ, with Φ denoting the empty set);

S22: computing, according to formula (1), the mutual information I between a feature X_i of the network payment transaction feature set θ and the label attribute Y:

$$I(X_i; Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \tag{1}$$

where X_i denotes the i-th feature; i is a natural number greater than or equal to 1; Y denotes the label attribute; x denotes a value of X_i; y denotes a value of Y; p(x, y) denotes the joint probability of x and y; p(x) is the marginal probability of x; p(y) is the marginal probability of y; and I(X_i; Y) denotes the mutual information between X_i and Y;

S23: updating the candidate feature set θ' according to formula (2):

θ' := θ' + X_i (2);

θ := θ - X_i (3);

S24: obtaining the dependence relation r, r: X_i → Y;

S25: updating the relation set R according to formula (4);

where X_i denotes the i-th feature in θ', X_j denotes the j-th feature in θ', and i and j are natural numbers greater than or equal to 1; x denotes a value of X_i; x' denotes a value of X_j; p(x, x') denotes the joint probability of x and x'; p(x) is the marginal probability of x; p(x') is the marginal probability of x'; and I(X_i; X_j) denotes the mutual information between X_i and X_j; the relation set R is updated by formula (4);

S28: assigning the current θ' to θ and emptying the set θ'; computing the mutual information between each pair of features in θ by formula (5); if I(X_i; X_j) ≥ λ, determining the dependence relation r between the two features according to prior knowledge, and then updating the relation set R by formula (4);

S29: repeating step S28 until θ is empty or I(X_i; X_j) ≤ λ for all features, at which point the probability graph model is obtained from the current relation set R.

4. The identity theft detection method based on a probability graph model according to claim 3, characterized in that step S3 further comprises the steps of:

where A_i denotes the i-th parent node of the probability graph model; B denotes a child node of A_i; p_train(A_i | B) denotes the conditional probability parameter between A_i and B; p(A_i) denotes the marginal probability of A_i; p(B | A_i) denotes the probability that B occurs given A_i; A_j denotes the j-th parent node; p(A_j) denotes the marginal probability of A_j; and p(B | A_j) denotes the probability that B occurs given A_j.

5. The identity theft detection method based on a probability graph model according to claim 4, characterized in that step S4 further comprises the steps of:

where p(Y' | X₁,…,X_n) denotes the probability that Y' occurs given X₁,…,X_n; P(X₁,…,X_n | Y') denotes the joint probability of X₁,…,X_n given Y'; P(Y') denotes the marginal probability of Y'; and P(X₁,…,X_n) denotes the joint probability of X₁,…,X_n.

6. The identity theft detection method based on a probability graph model according to claim 5, characterized in that the method further comprises, after step S4, the step of:

S5: verifying the prediction result.

7. The identity theft detection method based on a probability graph model according to claim 6, characterized in that step S5 further comprises the steps of:

S51: counting, from the prediction results of the formula (7) model, the number TP of positive samples classified as positive, the number FP of negative samples classified as positive, the number FN of positive samples classified as negative, and the number TN of negative samples classified as negative;

a recall is computed according to formula (9):

$$recall = \frac{TP}{TP + FN} \tag{9}$$