CN105095238B

CN105095238B - Decision tree generation method for detecting fraudulent transactions

Info

Publication number: CN105095238B
Application number: CN201410182321.8A
Authority: CN
Inventors: 赵金涛; 邱雪涛; 杨鸿超; 王骏
Original assignee: China Unionpay Co Ltd
Current assignee: China Unionpay Co Ltd
Priority date: 2014-05-04
Filing date: 2014-05-04
Publication date: 2019-01-18
Anticipated expiration: 2034-05-04
Also published as: CN105095238A

Abstract

The present invention relates to a decision tree generation method for detecting fraudulent transactions, comprising the following steps: sampling historical fraudulent transaction records and historical normal transaction records to form a sample data set; extracting, for each element in the sample data set, the feature values of multiple attributes, the attributes including at least an attribute representing the relationship between the current transaction and the previous transaction on the same card number; constructing a decision tree from training sample data; partitioning the training sample data based on the feature values, so that the decision tree is split step by step to form a decision tree detection model; and testing the decision tree detection model with test sample data. The method takes into account the correlation between consecutive transactions on the same card number, which makes correlated fraudulent transactions easier to detect, and it effectively avoids both excessive false detections and missed detections of fraudulent transactions.

Description

Decision tree generation method for detecting fraudulent transactions

Technical field

The present invention relates to transaction security technology in the electronics field, and more specifically to a decision tree generation method for detecting fraudulent transactions and to a method for detecting fraudulent transactions based on a decision tree.

Background art

At present, with the spread of bank cards and the prevalence of online transactions, fraudulent transactions are becoming more and more frequent. If they are not properly guarded against, they cause losses and can even affect financial security. Accurately detecting fraudulent transactions has become one of the most important research and development directions in the field of transaction security.

In traditional fraud detection, decision tree classification algorithms often achieve good results under experimental conditions, with good detection precision for fraudulent transactions. Once they are applied under production conditions and tested against real production transaction data, however, they flag too many transactions as fraudulent and precision drops substantially. There are two main reasons:

First, when the model is trained, fraudulent and normal transactions are often taken in a 1:1 ratio, whereas under production conditions the probability of a fraudulent transaction is only a few in ten thousand. The data in the test environment therefore does not faithfully reproduce production data, and a model trained in this test environment is effective only on test data and ineffective on production data. If, instead, fraudulent and normal transactions are taken in the ratio found in the production environment, the ratio is so lopsided that the behavioral features of fraudulent transactions are completely drowned out by normal transactions, and the detection precision of the model is too low.

Second, a decision tree model can only classify each transaction in isolation, yet consecutive transactions on the same card number usually have some correlation. For example, if a card number shows continuous card spending within a short period until the card is maxed out, the event may well be counterfeit card fraud. Because each of these transactions may individually satisfy the conditions of a normal transaction, the decision tree model may classify all of them as normal transactions, so in this case the decision tree classification algorithm fails.

Therefore, those working in this field wish to obtain a decision tree generation method for detecting fraudulent transactions whose resulting decision tree overcomes the above drawbacks and detects fraudulent transactions more accurately and reliably.

Summary of the invention

The object of the present invention is to provide a decision tree generation method for detecting fraudulent transactions.

To achieve the above object, the present invention provides the following technical solution:

A decision tree generation method for detecting fraudulent transactions comprises the following steps: a) sampling historical fraudulent transaction records and historical normal transaction records to form a sample data set; b) extracting, for each element in the sample data set, the feature values of multiple attributes, the attributes including at least an attribute representing the relationship between the current transaction and the previous transaction on the same card number; c) constructing a decision tree with a first part of the sample data set as training sample data; d) partitioning the training sample data based on the feature values, so that the decision tree is split step by step to form a decision tree detection model; e) testing the decision tree detection model with a second part of the sample data set as test sample data.

Preferably, step d) comprises the following steps: d1) setting the training sample data as the current leaf node of the decision tree; d2) for each attribute, partitioning the current leaf node with multiple mutually different feature thresholds and computing the Gini coefficient corresponding to each partition, where a feature threshold is any value within the range of the elements' feature values for that attribute; d3) splitting the current leaf node using the partition with the minimum Gini coefficient, to form the next layer of leaf nodes of the decision tree; d4) setting the next layer of leaf nodes as the current leaf nodes and repeating steps d2) and d3) until a stop-splitting condition of the decision tree is met.

Preferably, in the sample data set, the ratio of the number of elements derived from historical fraudulent transaction records to the number of elements derived from historical normal transaction records is much smaller than 1 and much larger than the ratio of the number of historical fraudulent transaction records to the number of historical normal transaction records.

Preferably, step b) comprises: extracting, for each element in the sample data set, the feature values of the transaction bank card number attribute, the transaction time attribute and the transaction merchant code attribute; sorting the elements in the sample data set by the feature values of the transaction bank card number attribute, the transaction time attribute and the transaction merchant code attribute; adding to each element in the sample data set an attribute corresponding to the relationship between the current transaction and the previous transaction on the same card number, and extracting the feature value of that attribute, where the relationship includes whether the current transaction and the previous transaction on the same card number have the same transaction merchant code and/or the transaction time difference between the current transaction and the previous transaction; discretizing, for each element in the sample data set, the feature values of that element that are continuous variables; and randomly reordering the elements in the sample data set.

The present invention also provides a method for detecting fraudulent transactions based on a decision tree, comprising the following steps: forming a decision tree detection model with the above decision tree generation method; and performing fraud detection on production transactions with the decision tree detection model.

The decision tree generation method provided by the present invention overcomes the drawback in the prior art that a decision tree model examines each transaction in isolation when detecting fraudulent transactions. It takes an attribute describing the relationship between consecutive transactions on the same card number as one of the attributes of the elements in the sample data set, so that the correlation between consecutive transactions on the same card number is considered during model training and correlated fraudulent transactions are easier to detect. In addition, the present invention improves how the ratio of fraudulent transaction samples to normal transaction samples is chosen during model training, so that the recall rate and detection precision of the model fall within a reasonable range, which effectively avoids both excessive false detections and missed detections of fraudulent transactions.

Brief description of the drawings

Fig. 1 shows a flowchart of the decision tree generation method for detecting fraudulent transactions provided by one embodiment of the present invention.

Fig. 2 shows the execution flowchart of step S13 in the decision tree generation method for detecting fraudulent transactions provided by one embodiment of the present invention.

Specific embodiments

As shown in Fig. 1, one embodiment of the present invention provides a decision tree generation method for detecting fraudulent transactions, comprising the following steps:

Step S10, sample historical fraudulent transaction records and historical normal transaction records to form a sample data set.

Specifically, all fraudulent transaction data is extracted from the historical fraudulent transaction table, and normal transaction data is extracted by card number from the historical normal transaction table; fraudulent and normal transactions can be labeled fraud and normal respectively. If only a historical transaction table and a historical fraudulent transaction table exist (with no separate historical normal transaction table), the historical transaction table contains fraudulent transaction data, so the records that appear in the fraudulent transaction table must first be removed from the historical transaction table before normal transaction data is extracted from it. The sample data set is formed once suitable data has been extracted.

Because the historical transaction table may contain a large amount of data, only part of it needs to be sampled when training the decision tree. The sampling may be random.

Further, in the sample data set, the ratio of the number of elements derived from historical fraudulent transaction records to the number of elements derived from historical normal transaction records is much smaller than 1 and much larger than the ratio of the number of historical fraudulent transaction records to the number of historical normal transaction records.

This ratio brings the sample closer to the actual proportion of fraudulent to normal transactions in production data without drowning out the features of fraudulent transactions.

For example, when setting the ratio of the number of elements from historical fraudulent transaction records to the number of elements from historical normal transaction records during sampling, one can refer to the actual proportion of fraudulent to normal transactions in production data (for example 1:10000) and narrow the gap by one to two orders of magnitude, for example taking 1:100.
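The sampling in step S10 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `build_sample_set`, the record layout (dictionaries with a hypothetical `card_no` key) and the choice to sample normal transactions one whole card at a time are all assumptions; only the 1:100 example ratio comes from the text.

```python
import random
from collections import defaultdict


def build_sample_set(fraud_records, normal_records, normal_per_fraud=100, seed=0):
    """Keep all fraud records and sample normal records card by card.

    normal_per_fraud=100 corresponds to the 1:100 example ratio in the text.
    The "card_no" key is a hypothetical field name.
    """
    rng = random.Random(seed)
    target = len(fraud_records) * normal_per_fraud

    # Group normal transactions by card so a card's history stays together.
    by_card = defaultdict(list)
    for rec in normal_records:
        by_card[rec["card_no"]].append(rec)

    # Visit cards in random order until roughly `target` records are collected.
    cards = sorted(by_card)
    rng.shuffle(cards)
    sampled_normal = []
    for card in cards:
        if len(sampled_normal) >= target:
            break
        sampled_normal.extend(by_card[card])

    sample = [(rec, "fraud") for rec in fraud_records]
    sample += [(rec, "normal") for rec in sampled_normal]
    return sample
```

Sampling whole cards keeps each card's consecutive transactions available for the context attributes computed later; the resulting ratio is approximate because the last card added may overshoot the target slightly.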

Step S11, extract, for each element in the sample data set, the feature values of multiple attributes.

The attributes include at least an attribute representing the relationship between the current transaction and the previous transaction on the same card number.

Specifically, each element has multiple original attributes, such as transaction date, transaction time and card type. To account for the correlation between consecutive transactions on the same card number, context attributes are added to each element on the basis of these original attributes. The selected element attributes are, for example, as shown in Table 1 below:

Table 1

The original attributes are taken directly from the fraudulent transaction table and the historical transaction table and require no computation. The context attributes must be derived from the data of the previous transaction on the same card number and require some calculation or judgment. The transaction time difference attribute (time_diff) is the time difference between the previous transaction and the current transaction on the same card number; the same-merchant attribute (is_same_mchnt) indicates whether the previous transaction and the current transaction on the same card number took place at the same merchant.

In the decision tree generation method for detecting fraudulent transactions according to an embodiment of the present invention, the step of extracting the feature values of multiple attributes for each element is carried out as follows:

For each element in the sample data set, extract the feature values of the transaction bank card number attribute, the transaction time attribute and the transaction merchant code attribute;

Sort the elements in the sample data set by the feature values of the transaction bank card number attribute, the transaction time attribute and the transaction merchant code attribute;

For each element in the sample data set, add an attribute corresponding to the relationship between the current transaction and the previous transaction on the same card number and extract the feature value of that attribute, where the relationship includes whether the current transaction and the previous transaction on the same card number have the same transaction merchant code and/or the transaction time difference between the current transaction and the previous transaction;

For each element in the sample data set, discretize the feature values of that element that are continuous variables;

Randomly reorder the elements in the sample data set.
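The sorting and context-attribute steps above can be sketched as follows. The attribute names `time_diff` and `is_same_mchnt` come from the text; the input keys `card_no`, `trans_time` and `mchnt_cd`, the function name, and the use of `None` for a card's first transaction are assumptions made for illustration.

```python
from datetime import datetime


def add_context_attributes(transactions):
    """Sort by card number, time and merchant code, then derive context attributes."""
    ordered = sorted(
        transactions,
        key=lambda t: (t["card_no"], t["trans_time"], t["mchnt_cd"]),
    )
    enriched = []
    previous = None
    for txn in ordered:
        row = dict(txn)
        if previous is not None and previous["card_no"] == txn["card_no"]:
            # Seconds elapsed since the previous transaction on the same card.
            delta = txn["trans_time"] - previous["trans_time"]
            row["time_diff"] = delta.total_seconds()
            row["is_same_mchnt"] = int(txn["mchnt_cd"] == previous["mchnt_cd"])
        else:
            # Assumption: the first transaction of a card has no previous one.
            row["time_diff"] = None
            row["is_same_mchnt"] = None
        enriched.append(row)
        previous = txn
    return enriched


example = [
    {"card_no": "A", "trans_time": datetime(2014, 5, 4, 10, 5), "mchnt_cd": "M1"},
    {"card_no": "A", "trans_time": datetime(2014, 5, 4, 10, 0), "mchnt_cd": "M1"},
    {"card_no": "B", "trans_time": datetime(2014, 5, 4, 9, 0), "mchnt_cd": "M2"},
]
rows = add_context_attributes(example)
```

After this step the continuous values would be discretized and the rows shuffled, for example with `random.shuffle`, so that the sort order does not leak into the training and test split.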

For example, the feature values of the transaction time difference attribute can be discretized according to the correspondence in Table 2 below:

Transaction time difference before discretization	Value after discretization
time_diff ≤ 1 minute	1
1 minute < time_diff ≤ 10 minutes	2
10 minutes < time_diff ≤ 30 minutes	3
30 minutes < time_diff ≤ 1 hour	4
1 hour < time_diff ≤ 6 hours	5
6 hours < time_diff ≤ 1 day	6
1 day < time_diff ≤ 7 days	7
7 days < time_diff ≤ 1 month	8
1 month < time_diff ≤ 6 months	9
time_diff > 6 months	10

Table 2
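The mapping in Table 2 can be expressed as a lookup over the upper bounds of each interval. The sketch below assumes `time_diff` is given in seconds and treats one month as 30 days; both are assumptions, since the text does not fix the unit or the month length, and the function name is hypothetical.

```python
import bisect

MINUTE = 60
HOUR = 60 * MINUTE
DAY = 24 * HOUR
MONTH = 30 * DAY  # assumption: the text does not define the length of a month

# Inclusive upper bounds of bins 1 to 9; anything larger falls into bin 10.
UPPER_BOUNDS = [
    MINUTE, 10 * MINUTE, 30 * MINUTE, HOUR, 6 * HOUR,
    DAY, 7 * DAY, MONTH, 6 * MONTH,
]


def discretize_time_diff(seconds):
    """Map a time difference in seconds to the discrete values 1 to 10 of Table 2."""
    # bisect_left returns the index of the first bound >= seconds,
    # which matches the "lower < time_diff <= upper" intervals.
    return bisect.bisect_left(UPPER_BOUNDS, seconds) + 1
```

Because every interval is closed on the right, `bisect_left` is the appropriate variant: a value exactly equal to a bound stays in the lower bin.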

Step S12, construct a decision tree with a first part of the sample data set as training sample data.

Specifically, for example, two thirds of the data in the sample data set is extracted as training sample data and used as the initial node to construct the decision tree. In the subsequent steps the decision tree is split step by step, dividing the sample data set into multiple subsets, so that each element in the sample data set is finally identified as a normal transaction or a fraudulent transaction.
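A minimal sketch of the split into training and test sample data is shown below. The two-thirds fraction is the example from the text; the function name, the fixed seed and the shuffle before cutting are assumptions.

```python
import random


def split_sample_set(sample_set, train_fraction=2 / 3, seed=0):
    """Shuffle the sample data set and cut it into training and test parts."""
    shuffled = list(sample_set)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = split_sample_set(range(9))
```

The first part plays the role of the initial node of the tree in step S12, and the remaining third is held back for step S14.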

Step S13, partition the training sample data based on the feature values, so that the decision tree is split step by step to form the decision tree detection model.

Fig. 2 shows the specific execution flow of step S13, which comprises the following sub-steps:

Step S130, set the training sample data as the current leaf node of the decision tree.

Specifically, this is the initialization step for forming the decision tree detection model.

Step S131, for each attribute, partition the current leaf node with multiple mutually different feature thresholds, and compute the Gini coefficient corresponding to each partition.

Here a feature threshold is any value within the range of the elements' feature values for that attribute. More specifically, a feature threshold is any value within the range of the feature values of the elements contained in the current leaf node.

For example, suppose the current leaf node contains the element set E = {e₁, e₂, e₃, …, e_n}, and each element has the 18 attributes {X₁, X₂, X₃, …, X₁₈} shown in Table 1 above. For any attribute Xi (1 ≤ i ≤ 18), let the feature values of attribute Xi over the elements {e₁, e₂, e₃, …, e_n} form the set C = {c_i1, c_i2, c_i3, …, c_im}. Any element c_ir (1 ≤ r ≤ m) of C can then be used as the feature threshold (that is, the feature threshold can be chosen arbitrarily from C) to partition the current leaf node: the elements of the current leaf node whose feature value of attribute Xi is less than or equal to c_ir form the subset T(Xi ≤ c_ir), and those whose feature value of attribute Xi is greater than c_ir form the subset T(Xi > c_ir). In other words, set E is partitioned into the subsets T(Xi ≤ c_ir) and T(Xi > c_ir).

The Gini coefficient of this partition is computed as:

$$\mathrm{Gini}(T_{X_i=c}) = \frac{\mathrm{Num}(T(X_i \le c))}{\mathrm{Num}(T(X_i \le c)) + \mathrm{Num}(T(X_i > c))}\,\mathrm{Gini}(T(X_i \le c)) + \frac{\mathrm{Num}(T(X_i > c))}{\mathrm{Num}(T(X_i \le c)) + \mathrm{Num}(T(X_i > c))}\,\mathrm{Gini}(T(X_i > c))$$

where c is the feature threshold for any attribute Xi, Xi is the i-th attribute of the elements in the sample data set, Gini(T_{Xi=c}) is the Gini coefficient obtained after partitioning the current leaf node with c as the feature threshold, T(Xi ≤ c) is the subset of elements whose feature value of attribute Xi is less than or equal to c after the partition, T(Xi > c) is the subset of elements whose feature value of attribute Xi is greater than c after the partition, Num(T(Xi ≤ c)) is the number of elements in subset T(Xi ≤ c), Num(T(Xi > c)) is the number of elements in subset T(Xi > c), Gini(T(Xi ≤ c)) is the Gini coefficient of subset T(Xi ≤ c), and Gini(T(Xi > c)) is the Gini coefficient of subset T(Xi > c).

The Gini coefficient of a subset T of the element set E = {e₁, e₂, e₃, …, e_n} is computed as:

$$\mathrm{Gini}(T) = 1 - p_{normal}(T)^2 - p_{fraud}(T)^2$$

where p_normal(T) is the probability of elements in subset T that derive from historical normal transaction records, and p_fraud(T) is the probability of elements in subset T that derive from historical fraudulent transaction records.
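The two Gini quantities used in step S131 can be illustrated with a short sketch. It assumes the standard two-class form Gini(T) = 1 − p_normal(T)² − p_fraud(T)² for a subset and a size-weighted average of the two subset values for a partition, which is what the definitions of Num and Gini in the text suggest; the function names and the string labels are hypothetical.

```python
def gini(labels):
    """Gini coefficient of a subset whose elements are labeled "fraud" or "normal"."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumption: an empty subset contributes no impurity
    p_fraud = sum(1 for lab in labels if lab == "fraud") / n
    p_normal = 1.0 - p_fraud
    return 1.0 - p_normal ** 2 - p_fraud ** 2


def split_gini(left_labels, right_labels):
    """Size-weighted Gini coefficient of a partition into two subsets."""
    n_left, n_right = len(left_labels), len(right_labels)
    total = n_left + n_right
    return (n_left / total) * gini(left_labels) + (n_right / total) * gini(right_labels)


mixed = gini(["fraud", "normal"])                                   # evenly mixed subset
perfect = split_gini(["fraud", "fraud"], ["normal", "normal"])      # both subsets pure
partial = split_gini(["fraud", "normal"], ["normal", "normal"])     # one mixed subset
```

A partition that separates the two classes completely scores 0, and an evenly mixed subset scores 0.5, so smaller values indicate a better partition.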

Specifically, in sub-step S131, for any attribute Xi, each element of the set C = {c_i1, c_i2, c_i3, …, c_im} is used in turn as the feature threshold to partition the current leaf node, and the Gini coefficient Gini(T_{Xi=c}) corresponding to each partition is computed and recorded. Another attribute Xj (1 ≤ j ≤ 18, j ≠ i) is then selected and the same operation is performed, until it has been completed for all 18 attributes.

Step S132, split the current leaf node using the partition with the minimum Gini coefficient, to form the next layer of leaf nodes of the decision tree.

According to Gini-based decision tree splitting theory, the Gini coefficients of the partitions above are compared and the partition with the smallest Gini coefficient is chosen as the best partition for the decision tree. The current leaf node is split with this best partition, forming the next layer of leaf nodes (that is, two subsets).
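Sub-steps S131 and S132 amount to an exhaustive search over attributes and thresholds. The sketch below assumes each element is a list of numeric (already discretized) feature values, uses the standard two-class Gini form, and skips thresholds that would leave the "greater than" subset empty; the function names and that skipping rule are assumptions.

```python
def gini(labels):
    """Gini coefficient of a non-empty subset labeled "fraud" or "normal"."""
    p_fraud = sum(1 for lab in labels if lab == "fraud") / len(labels)
    p_normal = 1.0 - p_fraud
    return 1.0 - p_normal ** 2 - p_fraud ** 2


def best_split(rows, labels):
    """Return (gini, attribute index, threshold) of the lowest-Gini partition."""
    best = None
    n = len(labels)
    for i in range(len(rows[0])):
        # Candidate thresholds are the feature values observed in this node.
        for c in sorted({row[i] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[i] <= c]
            right = [lab for row, lab in zip(rows, labels) if row[i] > c]
            if not right:
                continue  # assumption: ignore partitions with an empty subset
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, i, c)
    return best


rows = [[1, 5], [2, 5], [9, 5], [10, 5]]
labels = ["normal", "normal", "fraud", "fraud"]
result = best_split(rows, labels)
```

In this toy data the second attribute is constant and offers no usable threshold, so the search settles on the first attribute with threshold 2, which separates the classes completely.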

Step S133, judge whether a stop-splitting condition of the decision tree is met; if so, the flow of step S13 ends; otherwise, proceed to step S134.

Specifically, the stop-splitting conditions of the decision tree include any one of the following:

Condition 1: the subsets T(Xi ≤ c) and T(Xi > c) each consist solely of elements from historical fraudulent transaction records or solely of elements from historical normal transaction records;

Condition 2: the Gini coefficient after the current partition is greater than or equal to the Gini coefficient after the previous partition;

Condition 3: the number of elements in subset T(Xi ≤ c) or in subset T(Xi > c) is less than an element-count threshold.

Step S134, set the next layer of leaf nodes as the current leaf nodes and repeat steps S131 and S132.

Specifically, as long as none of the above stop-splitting conditions is met, step S134 keeps splitting the decision tree.
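The loop formed by steps S130 to S134 can be sketched as a recursive function. This is one possible reading of the stop-splitting conditions, applied per branch rather than to the tree as a whole; the names `grow`, `classify` and `min_size`, the dictionary representation of nodes and the majority-label leaves are all assumptions.

```python
def gini(labels):
    p_fraud = sum(1 for lab in labels if lab == "fraud") / len(labels)
    return 1.0 - (1.0 - p_fraud) ** 2 - p_fraud ** 2


def best_split(rows, labels):
    """Return (gini, attribute index, threshold) of the lowest-Gini partition, or None."""
    best = None
    for i in range(len(rows[0])):
        # Dropping the largest value keeps both subsets non-empty.
        for c in sorted({row[i] for row in rows})[:-1]:
            left = [lab for row, lab in zip(rows, labels) if row[i] <= c]
            right = [lab for row, lab in zip(rows, labels) if row[i] > c]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, i, c)
    return best


def grow(rows, labels, previous_score=None, min_size=1):
    """Split the current leaf node until a stop-splitting condition applies."""
    leaf = {"label": max(sorted(set(labels)), key=labels.count)}
    if len(set(labels)) == 1:
        return leaf  # pure subset; two pure children correspond to condition 1
    split = best_split(rows, labels)
    if split is None:
        return leaf  # every attribute is constant, nothing to split on
    score, i, c = split
    if previous_score is not None and score >= previous_score:
        return leaf  # condition 2: Gini did not decrease
    left = [(r, lab) for r, lab in zip(rows, labels) if r[i] <= c]
    right = [(r, lab) for r, lab in zip(rows, labels) if r[i] > c]
    if len(left) < min_size or len(right) < min_size:
        return leaf  # condition 3: a subset is below the element-count threshold
    return {
        "attr": i,
        "threshold": c,
        "left": grow([r for r, _ in left], [lab for _, lab in left], score, min_size),
        "right": grow([r for r, _ in right], [lab for _, lab in right], score, min_size),
    }


def classify(tree, row):
    """Follow thresholds from the root down to a leaf label."""
    while "label" not in tree:
        tree = tree["left"] if row[tree["attr"]] <= tree["threshold"] else tree["right"]
    return tree["label"]


tree = grow([[1, 0], [2, 1], [9, 0], [10, 1]], ["normal", "normal", "fraud", "fraud"])
```

Here condition 3 is checked before a split is committed, so a branch that would produce an undersized subset simply stays a leaf; the text checks the conditions after the split, which yields a slightly different tree shape in borderline cases.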

Step S14, test the decision tree detection model with a second part of the sample data set as test sample data.

Specifically, the remaining one third of the data in the sample data set can be used as test sample data to test the decision tree detection model formed according to the above embodiment.

The decision tree generation method for detecting fraudulent transactions provided by the above embodiment of the present invention overcomes the drawback in the prior art that a decision tree model examines each transaction in isolation when detecting fraudulent transactions. It takes an attribute describing the relationship between consecutive transactions on the same card number as one of the attributes of the elements in the sample data set, so that the correlation between consecutive transactions on the same card number is considered during model training and correlated fraudulent transactions are easier to detect.

Another embodiment of the present invention provides a method for detecting fraudulent transactions based on a decision tree, comprising the following steps:

A) forming a decision tree detection model with the decision tree generation method provided by the above embodiment;

B) performing fraud detection on production transactions with the decision tree detection model.

Further, the following step is included after step B): the results of the fraud detection are manually confirmed, and the records of production transactions confirmed as fraudulent are added to the historical fraudulent transaction records. The updated historical fraudulent transaction records can be used to regenerate the decision tree detection model. Those skilled in the art will understand that regenerating the decision tree detection model in this way at regular intervals ensures that the detection model can recognize the features and patterns of the latest fraudulent transactions.

In a further improved version of this embodiment, when the decision tree generation method is carried out in step A), the following steps may also be performed after step S14:

If the recall rate is greater than a first threshold, the historical fraudulent transaction records and historical normal transaction records are resampled so as to correspondingly raise the proportion of elements from historical fraudulent transaction records in the sample data set; if the recall rate is less than a second threshold, the historical fraudulent transaction records and historical normal transaction records are resampled so as to correspondingly lower the proportion of elements from historical fraudulent transaction records in the sample data set;

Steps S11, S12, S13 and S14 of the decision tree generation method are then executed again;

Here the recall rate is the ratio of the number of elements in the test sample data detected as fraudulent transactions to the number of elements in the test sample data that derive from historical fraudulent transaction records, and the first threshold is greater than the second threshold. A high recall rate indicates that the detection model produces many false detections; a low recall rate indicates that it misses many fraudulent transactions; a recall rate outside the reasonable range indicates that the detection precision is unsatisfactory.
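The feedback rule above can be sketched as follows. The recall rate follows the definition given in the text (elements detected as fraud divided by elements that actually come from fraud records), which can exceed 1 and therefore differs from the conventional definition of recall. The adjustment direction follows the text as stated; the function names, the multiplicative `step` and the example thresholds are assumptions.

```python
def recall_rate(predicted, actual):
    """Detected-as-fraud count divided by true-fraud count, as defined in the text."""
    detected = sum(1 for p in predicted if p == "fraud")
    true_fraud = sum(1 for a in actual if a == "fraud")
    return detected / true_fraud


def adjust_fraud_fraction(fraud_fraction, rate, first_threshold, second_threshold, step=1.5):
    """Return the fraud proportion to use for the next round of sampling."""
    if first_threshold <= second_threshold:
        raise ValueError("first_threshold must be greater than second_threshold")
    if rate > first_threshold:
        # Direction as stated in the text: raise the fraud proportion.
        return min(1.0, fraud_fraction * step)
    if rate < second_threshold:
        # Direction as stated in the text: lower the fraud proportion.
        return fraud_fraction / step
    return fraud_fraction  # within the reasonable range, keep the current sampling


rate = recall_rate(
    ["fraud", "fraud", "normal", "fraud"],
    ["fraud", "normal", "normal", "fraud"],
)
```

After the proportion is adjusted, the sample data set would be rebuilt and steps S11 to S14 repeated until the rate falls between the two thresholds.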

The invention therefore also improves how the ratio of fraudulent transaction samples to normal transaction samples is chosen during model training, so that the recall rate and detection precision of the model fall within a reasonable range, which effectively avoids both excessive false detections and missed detections of fraudulent transactions.

The above description covers only preferred embodiments of the present invention and is not intended to limit its scope of protection. Those skilled in the art can make various modifications without departing from the idea of the invention and the appended claims.

Claims

1. A decision tree generation method for detecting fraudulent transactions, comprising the following steps:

a) sampling historical fraudulent transaction records and historical normal transaction records to form a sample data set;

b) extracting, for each element in the sample data set, the feature values of multiple attributes, the attributes including at least an attribute representing the relationship between the current transaction and the previous transaction on the same card number, a transaction bank card number attribute, a transaction time attribute and a transaction merchant code attribute;

c) constructing a decision tree with a first part of the sample data set as training sample data;

d) partitioning the training sample data based on the feature values, so that the decision tree is split step by step to form a decision tree detection model; and

e) testing the decision tree detection model with a second part of the sample data set as test sample data;

wherein step b) comprises:

extracting, for each element in the sample data set, the feature values of the transaction bank card number attribute, the transaction time attribute and the transaction merchant code attribute;

sorting the elements in the sample data set by the feature values of the transaction bank card number attribute, the transaction time attribute and the transaction merchant code attribute;

adding to each element in the sample data set an attribute corresponding to the relationship between the current transaction and the previous transaction on the same card number, and extracting the feature value of that attribute, wherein the relationship includes whether the current transaction and the previous transaction on the same card number have the same transaction merchant code and/or the transaction time difference between the current transaction and the previous transaction;

discretizing, for each element in the sample data set, the feature values of that element that are continuous variables;

randomly reordering the elements in the sample data set.

2. The decision tree generation method according to claim 1, characterized in that step d) comprises the following steps:

d1) setting the training sample data as the current leaf node of the decision tree;

d2) for each attribute, partitioning the current leaf node with multiple mutually different feature thresholds, and computing the Gini coefficient corresponding to each partition, wherein a feature threshold is any value within the range of the elements' feature values for that attribute;

d3) splitting the current leaf node using the partition with the minimum Gini coefficient, to form the next layer of leaf nodes of the decision tree;

d4) setting the next layer of leaf nodes as the current leaf nodes and repeating steps d2) and d3) until a stop-splitting condition of the decision tree is met.

3. The decision tree generation method according to claim 2, characterized in that the Gini coefficient is computed as:

$$\mathrm{Gini}(T_{X_i=c}) = \frac{\mathrm{Num}(T(X_i \le c))}{\mathrm{Num}(T(X_i \le c)) + \mathrm{Num}(T(X_i > c))}\,\mathrm{Gini}(T(X_i \le c)) + \frac{\mathrm{Num}(T(X_i > c))}{\mathrm{Num}(T(X_i \le c)) + \mathrm{Num}(T(X_i > c))}\,\mathrm{Gini}(T(X_i > c))$$

wherein c is the feature threshold for any attribute, Xi is the i-th attribute of the elements in the sample data set, Gini(T_{Xi=c}) is the Gini coefficient obtained after partitioning the current leaf node with c as the feature threshold, T(Xi ≤ c) is the subset of elements whose feature value of attribute Xi is less than or equal to c after the partition, T(Xi > c) is the subset of elements whose feature value of attribute Xi is greater than c after the partition, Num(T(Xi ≤ c)) is the number of elements in subset T(Xi ≤ c), Num(T(Xi > c)) is the number of elements in subset T(Xi > c), Gini(T(Xi ≤ c)) is the Gini coefficient of subset T(Xi ≤ c), and Gini(T(Xi > c)) is the Gini coefficient of subset T(Xi > c).

4. The decision tree generation method according to claim 3, characterized in that the stop-splitting conditions of the decision tree include any one of the following:

the subsets T(Xi ≤ c) and T(Xi > c) each consist solely of elements from the historical fraudulent transaction records or solely of elements from the historical normal transaction records;

the Gini coefficient after the current partition is greater than or equal to the Gini coefficient after the previous partition;

the number of elements in subset T(Xi ≤ c) or in subset T(Xi > c) is less than an element-count threshold.

5. The decision tree generation method according to claim 1, characterized in that, in the sample data set, the ratio of the number of elements derived from the historical fraudulent transaction records to the number of elements derived from the historical normal transaction records is much smaller than 1 and much larger than the ratio of the number of historical fraudulent transaction records to the number of historical normal transaction records.

6. A method for detecting fraudulent transactions based on a decision tree, comprising the following steps:

forming the decision tree detection model with the decision tree generation method according to any one of claims 1 to 5;

performing fraud detection on production transactions with the decision tree detection model.

7. The method for detecting fraudulent transactions according to claim 6, characterized in that it further comprises the following step:

manually confirming the results of the fraud detection, and adding the records of production transactions confirmed as fraudulent transactions to the historical fraudulent transaction records.

8. The method for detecting fraudulent transactions according to claim 7, characterized in that the decision tree generation method further comprises the following steps after step e):

if the recall rate is greater than a first threshold, resampling the historical fraudulent transaction records and the historical normal transaction records so as to correspondingly raise the proportion of elements from the historical fraudulent transaction records in the sample data set; if the recall rate is less than a second threshold, resampling the historical fraudulent transaction records and the historical normal transaction records so as to correspondingly lower the proportion of elements from the historical fraudulent transaction records in the sample data set;

executing steps b), c), d) and e) of the decision tree generation method again;

wherein the recall rate is the ratio of the number of elements in the test sample data detected as fraudulent transactions to the number of elements in the test sample data that derive from the historical fraudulent transaction records, and the first threshold is greater than the second threshold.