CN111047428B

CN111047428B - Bank high-risk fraud customer identification method based on small amount of fraud samples

Info

Publication number: CN111047428B
Application number: CN201911235911.1A
Authority: CN
Inventors: 杨颖一
Original assignee: Shenzhen Suoxinda Data Technology Co ltd
Current assignee: Shenzhen Suoxinda Data Technology Co ltd
Priority date: 2019-12-05
Filing date: 2019-12-05
Publication date: 2023-08-08
Anticipated expiration: 2039-12-05
Also published as: CN111047428A

Abstract

A bank high-risk fraud customer identification method based on a small number of fraud samples. The invention relates to the technical field of customer data processing in a bank management system and addresses the technical defect that existing learning models are inefficient at identifying fraud in large-scale data. The method comprises the following steps: S1: extracting bank customer data D; S2: performing data preprocessing and data cleaning on the raw data; S3: randomly extracting s% of the samples from D_p as spy samples and placing them into D_u, generating two new data sets; S4: training a logistic regression model as a binary classifier on the two new data sets; S5: training a random forest model; S6: constructing trees using the training set data; S7: using the validation set data to find the optimal tree F among the 9 trees generated in step S6; S8: repeating steps S5 to S7 until n optimal trees are obtained. The efficiency of identifying high-risk customers is improved.

Description

Bank high-risk fraud customer identification method based on small amount of fraud samples

Technical Field

The invention relates to the technical field of customer data processing in a bank management system, and in particular to an improved method for identifying bank high-risk fraud customers.

Background

Machine learning is an important means of financial technology innovation, and in recent years financial institutions and fintech companies at home and abroad have tried to apply it to risk prevention, anti-fraud, and related fields. Banking institutions often use logistic regression, tree models, and the like to mine deep business-scenario features from large-scale data sets and then build supervised and unsupervised learning models to improve their fraud recognition capability. A supervised model can reduce labor cost and achieve stable results, but it places high demands on the data set (accurate and complete labels), whereas an unsupervised model requires follow-up data analysis and incurs more labor cost. Bank fraud risk is becoming more concealed and more professional, and new criminal methods and forms of expression keep emerging. The labeled fraud customer samples are highly representative, whereas the remaining customers not labeled as fraudulent are not necessarily non-fraudulent; that is, the sample not labeled as fraud is a mix of fraudulent and non-fraudulent customers, and manually labeling every record would be too wasteful. Traditional fraud detection methods, such as those relying on expert rules and blacklist libraries, can no longer cope with the new fraud challenges.

Disclosure of Invention

The invention aims to overcome the technical defects that existing learning models are inefficient at identifying fraud in large-scale data and tend to leave fraudulent customers undiscovered, and provides a bank high-risk fraud customer identification method based on a small number of fraud samples.

In order to solve the technical problems, the invention adopts the following technical scheme:

A bank high-risk fraud customer identification method based on a small number of fraud samples, characterized by comprising the following steps:

S1: extracting bank customer data D = {D_p, D_u}, where D_p denotes the customers labeled "fraud" and D_u denotes the customer group not labeled "fraud"; D_pi = <A_i, y_i> and D_ui = <A_i>, where A_i is the feature variables of customer i and y_i is the corresponding class; y_i = +1 denotes "fraud" and y_i = -1 denotes "non-fraud"; A is the matrix formed by the feature variables of all customers;

S2: performing data preprocessing and data cleaning on the raw data;

S3: randomly extracting s% of the samples from D_p as spy samples and placing them into D_u, generating two new data sets, namely D_p with the spy samples removed and D_u with the spy samples added;

S4: training a logistic regression model as a binary classifier on the two new data sets, and using the logistic regression model to score the data in the new unlabeled set, the score being the probability that a sample is a positive example; the samples in D_u whose score is below a set threshold t form the reliable negative sample set D_n, and the samples in the reliable negative sample set are assigned the label y_i = -1;

S5: using D_p and D_n, whose classes are y_i = +1 and y_i = -1 respectively, to train a random forest model, where Posl is the proportion of positive samples in the sample S = D_p ∪ D_n; bootstrap sampling is used to draw samples from S as the training set, and the remaining samples serve as the validation set;

S6: setting Posl to each value from 0.1 to 0.9 in steps of 0.1 and, for each Posl, constructing a tree T_j, j = 1...9, using the training set data;

S7: using the validation set data to find the optimal tree F among the 9 trees generated in step S6;

S8: repeating steps S5 to S7 until n optimal trees are obtained, and integrating them into a random forest containing n trees; the trained random forest is then used to predict on input bank customer data, and customers whose predicted class is y_i = +1 are identified as high-risk fraud customers.
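
Steps S3 and S4 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions rather than the patented implementation: the function names, the spy ratio `s=0.15`, the gradient-descent settings, and the reading of the threshold t as a probability cutoff are all hypothetical choices, and the logistic regression is a plain batch gradient-descent fit written with the standard library only.

```python
import math
import random

def predict_proba(w, b, x):
    """Probability that x is a positive example under weights w and bias b."""
    z = b + sum(wk * xk for wk, xk in zip(w, x))
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=300):
    """Fit a logistic regression by batch gradient descent (labels 0/1)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for k, v in enumerate(xi):
                gw[k] += err * v
            gb += err
        w = [wk - lr * g / n for wk, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def reliable_negatives(D_p, D_u, s=0.15, t=0.15, seed=0):
    """S3: move a share s of D_p into D_u as spies; S4: score D_u and keep low scores."""
    rng = random.Random(seed)
    n_spy = max(1, round(len(D_p) * s))          # assumed spy ratio
    spy_idx = set(rng.sample(range(len(D_p)), n_spy))
    pos_rest = [x for i, x in enumerate(D_p) if i not in spy_idx]
    unl_plus = D_u + [D_p[i] for i in sorted(spy_idx)]
    X = pos_rest + unl_plus
    y = [1] * len(pos_rest) + [0] * len(unl_plus)  # two-class training labels
    w, b = train_logreg(X, y)
    # Samples of D_u scoring below t form the reliable negative set D_n.
    return [x for x in D_u if predict_proba(w, b, x) < t]
```

Each returned sample would then receive the label y_i = -1. In practice a library implementation of logistic regression would normally replace the hand-written fit.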

Further refinements of the technical solution of the invention include the following:

The data preprocessing and data cleaning in step S2 includes: checking the data quality, removing duplicate data and abnormal data, filling in missing values of the explanatory variables A, normalizing, and converting categorical variables into numerical variables.
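
The cleaning operations listed for step S2 can be sketched as below. This is a hedged illustration only: the record layout (a list of dicts), the `is_valid` rule for abnormal data, mean and mode imputation, min-max scaling, and integer category codes are assumptions chosen for the example, since the text does not specify them.

```python
def preprocess(rows, numeric, categorical, is_valid=lambda r: True):
    """Deduplicate, drop abnormal rows, impute, min-max scale, and encode categories."""
    # 1. Remove exact duplicates and rows rejected by the (hypothetical) validity rule.
    seen, clean = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen and is_valid(r):
            seen.add(key)
            clean.append(dict(r))
    # 2. Fill missing numeric values with the column mean, then scale to [0, 1].
    for col in numeric:
        present = [r[col] for r in clean if r[col] is not None]
        mean = sum(present) / len(present)
        lo, hi = min(present), max(present)
        span = (hi - lo) or 1.0  # avoid division by zero for constant columns
        for r in clean:
            v = mean if r[col] is None else r[col]
            r[col] = (v - lo) / span
    # 3. Fill missing categories with the mode, then map each category to an integer.
    for col in categorical:
        present = [r[col] for r in clean if r[col] is not None]
        mode = max(set(present), key=present.count)
        codes = {c: i for i, c in enumerate(sorted(set(present)))}
        for r in clean:
            r[col] = codes[mode if r[col] is None else r[col]]
    return clean
```

Integer codes are the simplest conversion; one-hot encoding may suit a logistic regression better when the categories have no natural order.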

The threshold t in step S4 is preferably 15%.

In step S6, the steps of constructing a tree T_j are as follows:

S61: randomly sampling with replacement from the attribute set A to form a new attribute space A';

S62: for each attribute a_j in the attribute space A', calculating the information gain, where |P| and |U| denote the numbers of positive samples and unlabeled samples in the training set, and |P_node| and |U_node| denote the numbers of positive samples and unlabeled samples in the node data; the information gain is calculated as follows:

p_{-1} = 1 - p_1

S63: selecting the attribute with the maximum information gain as the splitting attribute and extending child nodes from the split point;

S64: repeating steps S61 to S63 for each child node until the tree can no longer be split and is fully grown.
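
Only the relation p_{-1} = 1 - p_1 of the information gain formula appears in the text above. The sketch below shows one formulation that is consistent with the symbols defined in S62, in the style of positive-unlabeled decision-tree learners such as POSC4.5: p_1 = min(1, (|P_node| / |P|) · Posl · (|U| / |U_node|)), with a binary entropy over (p_1, p_{-1}) and children weighted by their share of unlabeled samples. This form is an assumption, not the formula confirmed by the patent, and all function names are hypothetical.

```python
import math

def pu_p1(p_node, u_node, p_total, u_total, posl):
    """Assumed estimate of the positive share in a node, capped at 1."""
    if u_node == 0:
        return 1.0 if p_node > 0 else 0.0
    return min(1.0, (p_node / p_total) * posl * (u_total / u_node))

def pu_entropy(p1):
    """Binary entropy of (p_1, p_-1), where p_-1 = 1 - p_1."""
    p_neg = 1.0 - p1
    return -sum(p * math.log2(p) for p in (p1, p_neg) if p > 0)

def pu_gain(parent, children, p_total, u_total, posl):
    """Information gain of a split.

    parent and each child are (|P_node|, |U_node|) pairs; each child's entropy
    is weighted by its share of the parent's unlabeled samples (an assumption).
    """
    p_par, u_par = parent
    gain = pu_entropy(pu_p1(p_par, u_par, p_total, u_total, posl))
    for p_c, u_c in children:
        weight = u_c / u_par if u_par else 0.0
        gain -= weight * pu_entropy(pu_p1(p_c, u_c, p_total, u_total, posl))
    return gain
```

Under this reading, a split that sends all positive samples to one child yields a gain equal to the parent entropy, while a split that preserves the parent's proportions in every child yields zero gain.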

The steps of finding the optimal tree in step S7 are as follows:

S71: applying each T_j, j = 1...9, to the test set and computing the number of positive samples |P_v|, the number of unlabeled samples |U_v|, the number of false negatives |FN_v|, and the number of false positives |FP_v| in the test set;

S72: calculating the evaluation index:

S73: the tree with the minimum evaluation index is the optimal tree.
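
The selection in S71 to S73 can be sketched as follows. The evaluation index itself is not reproduced in the text above, so the sketch assumes the form 2 · |FN_v| / |P_v| + |FP_v| / |U_v|, which penalizes missed fraud twice as heavily as false alarms; both the form and the weight of 2 are assumptions. Trees are modeled as plain callables returning +1 or -1, which is also a hypothetical simplification.

```python
def evaluation_index(tree, positives, unlabeled, fn_weight=2.0):
    """Assumed index: weighted false-negative rate plus false-positive rate."""
    fn = sum(1 for x in positives if tree(x) != +1)   # |FN_v|
    fp = sum(1 for x in unlabeled if tree(x) == +1)   # |FP_v|
    return fn_weight * fn / len(positives) + fp / len(unlabeled)

def pick_optimal_tree(trees, positives, unlabeled):
    """S73: the tree with the smallest evaluation index is the optimal tree."""
    return min(trees, key=lambda t: evaluation_index(t, positives, unlabeled))

def make_stump(cut):
    """Toy one-feature classifier used only to exercise the selection."""
    return lambda x: +1 if x >= cut else -1
```

For example, `pick_optimal_tree([make_stump(j / 10) for j in range(1, 10)], positives, unlabeled)` returns the stump whose cut best separates the two sample lists under the assumed index.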

The beneficial effects of the invention are as follows. Starting from a small number of fraud customer samples, the invention introduces spy samples to find a reliable negative sample set for modeling, which purifies the unlabeled data set, gives higher precision than modeling directly, and avoids the large amount of time and manpower that a supervised model requires for labeling fraud customers in the unlabeled data pool. In addition, by traversing the uncertain positive sample proportion Posl and integrating the optimal trees into a random forest, the method gains the speed, efficiency, and precision advantages of the parallel random forest algorithm. Applying a high-risk fraud customer identification technique that combines semi-supervised learning with a random forest reduces the cost of manually labeling samples and improves the efficiency of identifying high-risk customers.

Drawings

FIG. 1 is a flowchart illustrating steps of an identification method according to the present invention.

Detailed Description

The technical solution of the invention is further described below with reference to the accompanying drawing and specific embodiments. Referring to FIG. 1, the bank high-risk fraud customer identification method based on a small number of fraud samples of the present invention includes the following steps:

S1: extracting bank customer data D = {D_p, D_u}, where D_p denotes the customers labeled "fraud" and D_u denotes the customer group not labeled "fraud"; D_pi = <A_i, y_i> and D_ui = <A_i>, where A_i is the feature variables of customer i and y_i is the corresponding class; y_i = +1 denotes "fraud" and y_i = -1 denotes "non-fraud"; A is the matrix formed by the feature variables of all customers.

S2: performing data preprocessing and data cleaning on the raw data. The data preprocessing and data cleaning includes: checking the data quality, removing duplicate data and abnormal data, filling in missing values of the explanatory variables A, normalizing, and converting categorical variables into numerical variables.

S3: randomly extracting s% of the samples from D_p as spy samples and placing them into D_u, generating two new data sets, namely D_p with the spy samples removed and D_u with the spy samples added.

S4: training a logistic regression model as a binary classifier on the two new data sets, and using the logistic regression model to score the data in the new unlabeled set, the score being the probability that a sample is a positive example; the samples in D_u whose score is below a set threshold t form the reliable negative sample set D_n, and the samples in the reliable negative sample set are assigned the label y_i = -1. The threshold t is preferably 15%.

S5: using D_p and D_n, whose classes are y_i = +1 and y_i = -1 respectively, to train a random forest model, where Posl is the proportion of positive samples in the sample S = D_p ∪ D_n; bootstrap sampling is used to draw samples from S as the training set, and the remaining samples serve as the validation set.

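S6: setting Posl to each value from 0.1 to 0.9 in steps of 0.1 and, for each Posl, constructing a tree T_j, j = 1...9, using the training set data. The steps of constructing a tree T_j are as follows:

S61: randomly sampling with replacement from the attribute set A to form a new attribute space A';

S62: for each attribute a_j in the attribute space A', calculating the information gain, where |P| and |U| denote the numbers of positive samples and unlabeled samples in the training set, and |P_node| and |U_node| denote the numbers of positive samples and unlabeled samples in the node data; the information gain is calculated as follows:

p_{-1} = 1 - p_1

S63: selecting the attribute with the maximum information gain as the splitting attribute and extending child nodes from the split point;

S64: repeating steps S61 to S63 for each child node until the tree can no longer be split and is fully grown.

S7: using the validation set data to find the optimal tree F among the 9 trees generated in step S6. The steps of finding the optimal tree are as follows:

S71: applying each T_j, j = 1...9, to the test set and computing the number of positive samples |P_v|, the number of unlabeled samples |U_v|, the number of false negatives |FN_v|, and the number of false positives |FP_v| in the test set;

S72: calculating the evaluation index:

S73: the tree with the minimum evaluation index is the optimal tree.

S8: repeating steps S5 to S7 until n optimal trees are obtained, and integrating them into a random forest containing n trees; the trained random forest is then used to predict on input bank customer data, and customers whose predicted class is y_i = +1 are identified as high-risk fraud customers.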

In practical applications, bank data volumes are large, fraud customers account for a small proportion, labels are acquired inefficiently, and undiscovered fraud customers may exist. The invention therefore combines a semi-supervised learning method with a random forest optimization framework: it finds a reliable negative sample set and forms a random forest integrated from optimal trees. With only a small amount of labeled fraud customer data, it achieves higher precision in identifying high-risk fraud customers, avoids the limit that unlabeled data places on the accuracy of a supervised model, and helps banking staff carry out targeted inspections. The cost of manually labeling samples is reduced, and the efficiency of identifying high-risk customers is improved.
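
The outer loop of steps S5 to S8 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: `build_tree` is a hypothetical callable standing in for S61 to S64, `toy_stump` is a deliberately simple stand-in used only to make the example runnable, the evaluation index uses an assumed form that weights false negatives by 2, and the final prediction is an assumed majority vote. With bootstrap sampling, the samples never drawn (about 36.8% on average for large sets, a general property of the bootstrap and not a figure taken from the patent) serve as the validation set.

```python
import random

def bootstrap_split(samples, rng):
    """S5: draw len(samples) items with replacement; undrawn items form the validation set."""
    n = len(samples)
    drawn = [rng.randrange(n) for _ in range(n)]
    chosen = set(drawn)
    train = [samples[i] for i in drawn]
    valid = [samples[i] for i in range(n) if i not in chosen]
    return train, valid

def eval_index(tree, valid):
    """Assumed index: 2 * false-negative rate + false-positive rate on the validation set."""
    pos = [x for x, y in valid if y == +1]
    neg = [x for x, y in valid if y == -1]
    fn = sum(tree(x) != +1 for x in pos)
    fp = sum(tree(x) == +1 for x in neg)
    return 2.0 * fn / max(len(pos), 1) + fp / max(len(neg), 1)

def train_forest(samples, build_tree, n_trees=5, seed=0):
    """S8: repeat S5 to S7 until n_trees optimal trees are collected."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        train, valid = bootstrap_split(samples, rng)                         # S5
        candidates = [build_tree(train, j / 10, rng) for j in range(1, 10)]  # S6
        forest.append(min(candidates, key=lambda t: eval_index(t, valid)))   # S7
    return forest

def predict(forest, x):
    """Majority vote over trees that each return +1 or -1 (assumed integration rule)."""
    return +1 if sum(tree(x) for tree in forest) > 0 else -1

def toy_stump(train, posl, rng):
    """Hypothetical stand-in for S61-S64: flag the top `posl` share of training values as +1."""
    values = sorted(x for x, _ in train)
    cut = values[min(len(values) - 1, int(len(values) * (1 - posl)))]
    return lambda x: +1 if x >= cut else -1
```

A call such as `train_forest(samples, toy_stump, n_trees=5)` takes `samples` as a list of `(feature, label)` pairs with labels +1 for D_p and -1 for D_n; a real tree builder would replace `toy_stump` and use `rng` for the attribute sampling of S61.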

Claims

1. A bank high-risk fraud customer identification method based on a small number of fraud samples, characterized by comprising the following steps:

S1: extracting bank customer data D = {D_p, D_u}, where D_p denotes the customers labeled "fraud" and D_u denotes the customer group not labeled "fraud"; D_pi = <A_i, y_i> and D_ui = <A_i>, where A_i is the feature variables of customer i and y_i is the corresponding class; y_i = +1 denotes "fraud" and y_i = -1 denotes "non-fraud"; A is the matrix formed by the feature variables of all customers;

S2: performing data preprocessing and data cleaning on the raw data;

S3: randomly extracting s% of the samples from D_p as spy samples and placing them into D_u, generating two new data sets, namely D_p with the spy samples removed and D_u with the spy samples added;

S4: training a logistic regression model as a binary classifier on the two new data sets, and using the logistic regression model to score the data in the new unlabeled set, the score being the probability that a sample is a positive example; the samples in D_u whose score is below a set threshold t form the reliable negative sample set D_n, and the samples in the reliable negative sample set are assigned the label y_i = -1;

S5: using D_p and D_n, whose classes are y_i = +1 and y_i = -1 respectively, to train a random forest model, where Posl is the proportion of positive samples in the sample S = D_p ∪ D_n; bootstrap sampling is used to draw samples from S as the training set, and the remaining samples serve as the validation set;

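S6: setting Posl to each value from 0.1 to 0.9 in steps of 0.1 and, for each Posl, constructing a tree T_j, j = 1, 2, ..., 9, using the training set data; the steps of constructing a tree T_j are as follows:

S61: randomly sampling with replacement from the attribute set A to form a new attribute space A';

S62: for each attribute a_j in the attribute space A', calculating the information gain, where |P| and |U| denote the numbers of positive samples and unlabeled samples in the training set, and |P_node| and |U_node| denote the numbers of positive samples and unlabeled samples in the node data; the information gain is calculated as follows:

p_{-1} = 1 - p_1

S63: selecting the attribute with the maximum information gain as the splitting attribute and extending child nodes from the split point;

S64: repeating steps S61 to S63 for each child node until the tree can no longer be split and is fully grown;

S7: using the validation set data to find the optimal tree F among the 9 trees generated in step S6; the steps of finding the optimal tree are as follows:

S71: applying each T_j, j = 1, 2, ..., 9, to the test set and computing the number of positive samples |P_v|, the number of unlabeled samples |U_v|, the number of false negatives |FN_v|, and the number of false positives |FP_v| in the test set;

S72: calculating the evaluation index:

S73: the tree with the minimum evaluation index is the optimal tree;

S8: repeating steps S5 to S7 until n optimal trees are obtained, and integrating them into a random forest containing n trees; the trained random forest is then used to predict on input bank customer data, and customers whose predicted class is y_i = +1 are identified as high-risk fraud customers.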

2. The bank high-risk fraud customer identification method based on a small number of fraud samples according to claim 1, characterized in that the data preprocessing and data cleaning in step S2 includes: checking the data quality, removing duplicate data and abnormal data, filling in missing values of the explanatory variables A, normalizing, and converting categorical variables into numerical variables.

3. The bank high-risk fraud customer identification method based on a small number of fraud samples according to claim 1, characterized in that the threshold t in step S4 is 15%.