CN111047428B - Bank high-risk fraud customer identification method based on small amount of fraud samples - Google Patents
- Publication number
- CN111047428B (application number CN201911235911.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- samples
- fraud
- tree
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/02—Banking, e.g. interest calculation or account maintenance
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Business, Economics & Management (AREA)
- Accounting & Taxation (AREA)
- Finance (AREA)
- Engineering & Computer Science (AREA)
- Development Economics (AREA)
- Economics (AREA)
- Marketing (AREA)
- Strategic Management (AREA)
- Technology Law (AREA)
- Physics & Mathematics (AREA)
- General Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a method for identifying high-risk fraud customers of a bank based on a small number of fraud samples. It relates to the technical field of customer data processing in bank management systems and addresses the technical defect that existing learning models are inefficient at identification on large-scale data. The method comprises the following steps: S1: extract bank customer data D; S2: preprocess and clean the raw data; S3: randomly extract s% of the samples in D_p as spy samples and put them into D_u, generating two new data sets; S4: train a logistic regression model that treats the two new data sets as two classes; S5: train a random forest model; S6: construct trees using the training set data; S7: use the validation set data to find the optimal tree F among the 9 trees generated in step S6; S8: repeat steps S5 to S7 until n optimal trees are obtained. The method improves the efficiency of identifying high-risk customers.
Description
Technical Field
The invention relates to the technical field of customer data processing in bank management systems, and in particular to an improved method for identifying high-risk fraud customers of a bank.
Background
Machine learning is an important means of financial technology innovation, and in recent years financial institutions and financial technology enterprises in China and abroad have tried to apply it to fields such as risk prevention and anti-fraud. Banking institutions often use logistic regression, tree models and the like to mine deep business-scenario features from large-scale data sets and then build supervised and unsupervised learning models to improve their fraud recognition capability. A supervised model can reduce labor cost and achieve stable results, but it places high demands on the data set (accurate and complete labels); an unsupervised model requires subsequent data analysis and consumes more labor. Bank fraud risk is becoming more concealed and more professional, and new criminal methods and forms keep emerging. The samples of customers already labeled as fraudulent are highly representative, whereas the remaining customers not labeled as fraudulent do not necessarily exhibit non-fraudulent behaviour; that is, the samples not labeled as fraud are a mixture of fraudulent and non-fraudulent customers, and labeling every record manually would be too wasteful. Traditional fraud detection methods, such as those relying on expert rules and blacklist libraries, can no longer cope with these new fraud challenges.
Disclosure of Invention
In summary, the invention aims to overcome the technical defects that existing learning models are inefficient at identification on large-scale data and tend to leave fraudulent customers undiscovered, and provides a method for identifying high-risk fraud customers of a bank based on a small number of fraud samples.
In order to solve the technical problems, the invention adopts the following technical scheme:
A method for identifying high-risk fraud customers of a bank based on a small number of fraud samples, characterized by comprising the following steps:
S1: extract bank customer data D = {D_p, D_u}, where D_p denotes the customers marked "fraudulent" and D_u denotes the customer group not marked "fraudulent"; D_pi = <A_i, y_i> and D_ui = <A_i>, where A_i is the feature variables of a customer and y_i is the corresponding class (y_i = +1 denotes "fraud", y_i = -1 denotes "non-fraud"); A is the matrix formed by the feature variables of all customers;
S2: preprocess and clean the raw data;
S3: randomly extract s% of the samples in D_p as spy samples and put them into D_u, generating two new data sets (the positive set without the spy samples and the unlabeled set with the spy samples added);
S4: train a logistic regression model that treats these two data sets as two classes, and use the model to score the data in the unlabeled set, the score being the probability that a sample is a positive example; the samples of D_u whose score is lower than a set threshold t form the reliable negative sample set D_n, and the samples in D_n are assigned the label y_i = -1;
S5: train a random forest model using D_p and D_n, whose corresponding classes are y_i = +1 and y_i = -1; Posl is the proportion of positive samples in S = (D_p ∪ D_n); samples are drawn from S by bootstrap to form the training set, so that part of S ends up in the training set and the remaining samples form the validation set;
S6: set Posl to each value from 0.1 to 0.9 with a step of 0.1, and for each Posl construct a tree T_j, j = 1, ..., 9, using the training set data;
S7: use the validation set data to find the optimal tree F among the 9 trees generated in step S6;
S8: repeat steps S5 to S7 until n optimal trees are obtained, and combine them into a random forest containing n trees; use the trained random forest to predict on input bank customer data, and identify the customers whose predicted class is y_i = +1 as high-risk fraud customers.
Further refinements of the technical scheme are as follows:
The data preprocessing and data cleaning described in step S2 include: checking the data quality, removing duplicate data and abnormal data, filling in the missing values of the explanatory variables A, normalizing, and converting categorical variables into numerical variables.
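The cleaning operations listed for step S2 can be illustrated with a minimal, standard-library sketch. The function name `preprocess`, the dict-per-customer record layout, mean imputation, and min-max scaling are all assumptions made for illustration; the source does not say which imputation or normalization scheme is used, and removal of abnormal data is left out here because its criteria are not given.

```python
from statistics import mean

def preprocess(rows, numeric_cols, category_cols):
    """Deduplicate, impute, min-max normalize, and integer-encode customer records.

    `rows` is a list of dicts; a missing numeric value is represented by None.
    """
    # Remove exact duplicate records, keeping the first occurrence.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))

    for col in numeric_cols:
        present = [r[col] for r in unique if r[col] is not None]
        fill = mean(present)                # assumed strategy: mean imputation
        lo, hi = min(present), max(present)
        span = (hi - lo) or 1.0             # avoid division by zero on constant columns
        for r in unique:
            value = fill if r[col] is None else r[col]
            r[col] = (value - lo) / span    # min-max scaling into [0, 1]

    for col in category_cols:
        # Map each category to an integer code, in sorted order for determinism.
        codes = {v: i for i, v in enumerate(sorted({r[col] for r in unique}))}
        for r in unique:
            r[col] = codes[r[col]]
    return unique
```

Integer codes keep the sketch short; a one-hot encoding would be an equally plausible reading of "converting categorical variables into numerical variables".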
The threshold t in step S4 is preferably 15%.
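Steps S3 and S4 (spy samples, then a logistic regression that separates the remaining positives from the spy-augmented unlabeled set) can be sketched as follows. This is an illustrative sketch, not the patented implementation: the helper names, the plain gradient-descent trainer, and the reading of t as a cutoff on the predicted positive probability are assumptions.

```python
import math
import random

def train_logistic(X, y, epochs=2000, lr=0.5):
    """Batch gradient descent for logistic regression; y holds 1 or 0."""
    w = [0.0] * (len(X[0]) + 1)              # last entry is the bias term
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + w[-1]
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def score(w, xi):
    """Predicted probability that xi is a positive (fraud) example."""
    z = sum(wj * xj for wj, xj in zip(w, xi)) + w[-1]
    return 1.0 / (1.0 + math.exp(-z))

def reliable_negatives(D_p, D_u, s=0.15, t=0.15, seed=0):
    """Return the samples of D_u scored below t (the reliable negative set D_n)."""
    rng = random.Random(seed)
    n_spies = max(1, round(s * len(D_p)))
    spy_idx = set(rng.sample(range(len(D_p)), n_spies))
    positives = [x for i, x in enumerate(D_p) if i not in spy_idx]
    spies = [x for i, x in enumerate(D_p) if i in spy_idx]
    unlabeled = D_u + spies                  # spies hide among the unlabeled data
    w = train_logistic(positives + unlabeled,
                       [1] * len(positives) + [0] * len(unlabeled))
    return [x for x in D_u if score(w, x) < t]
```

Features are assumed to be already scaled to [0, 1], as after step S2. The spy fraction `s` is left as a parameter because the source gives no value for it.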
In step S6, the steps for constructing a tree T_j are as follows:
S61: randomly sample, with replacement, from the attribute set A to form a new attribute space A';
S62: for each attribute a_j in the attribute space A', calculate the information gain, where |P| and |U| denote the number of positive samples and the number of unlabeled samples in the training set, and |P_node| and |U_node| denote the number of positive samples and the number of unlabeled samples in the node data; the information gain is calculated as follows:
p_{-1} = 1 - p_1
S63: select the attribute with the maximum information gain as the splitting attribute and extend child nodes from the split point;
S64: repeat steps S61 to S63 for each child node until the tree can no longer be split and is fully grown.
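The text gives only the relation p_{-1} = 1 - p_1 for the information gain. The sketch below therefore rests on an assumption: that p_1 at a node is estimated from |P_node|, |P|, |U_node|, |U| and Posl in the way commonly used for decision trees learned from positive and unlabeled data, namely p_1 = min(1, (|P_node|/|P|) * Posl * (|U|/|U_node|)), with child entropies weighted by their share of unlabeled samples. The function names are hypothetical and the source does not confirm this formula.

```python
import math

def positive_prob(p_node, u_node, p_total, u_total, posl):
    """Assumed estimate of p_1 at a node, capped at 1."""
    if u_node == 0:
        return 1.0                      # no unlabeled samples reach this node
    return min(1.0, (p_node / p_total) * posl * (u_total / u_node))

def entropy(p1):
    """Binary entropy of (p_1, p_-1) with p_-1 = 1 - p_1."""
    return -sum(p * math.log2(p) for p in (p1, 1.0 - p1) if p > 0)

def info_gain(parent, children, p_total, u_total, posl):
    """Gain of a split; each node is a (|P_node|, |U_node|) pair."""
    p_par, u_par = parent
    if u_par == 0:
        return 0.0
    gain = entropy(positive_prob(p_par, u_par, p_total, u_total, posl))
    for p_c, u_c in children:
        child_p1 = positive_prob(p_c, u_c, p_total, u_total, posl)
        gain -= (u_c / u_par) * entropy(child_p1)   # weight by unlabeled share
    return gain
```

Under this assumption, a split that sends all positives into a small group of unlabeled samples drives both children to zero entropy, so the gain equals the parent entropy.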
The steps for finding the optimal tree in step S7 are as follows:
S71: apply each tree T_j, j = 1, ..., 9, to the test set, and compute the number of positive samples |P_v|, the number of unlabeled samples |U_v|, the number of false negatives |FU_v|, and the number of false positives |FP_v| in the test set;
S72: calculate the evaluation index;
S73: the tree with the minimum evaluation index is the optimal tree.
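The selection in steps S71 to S73 can be sketched as below. The evaluation index itself is not reproduced in the text, so `evaluation_index` is a hypothetical stand-in (twice the false-negative rate on positives plus the rate of unlabeled samples predicted positive); only the "pick the tree with the smallest index" logic comes from the source. Trees are modelled as plain callables returning +1 or -1.

```python
def evaluation_index(fn, fp, n_pos, n_unl):
    """Hypothetical index (assumption): 2 * FN / |P_v| + FP / |U_v|."""
    return 2.0 * fn / n_pos + fp / n_unl

def pick_optimal_tree(trees, X_val, y_val):
    """Return (best_tree, best_index) over the candidate trees."""
    n_pos = sum(1 for y in y_val if y == +1)
    n_unl = sum(1 for y in y_val if y == -1)
    best_tree, best_index = None, float("inf")
    for tree in trees:
        preds = [tree(x) for x in X_val]
        # Positives predicted as -1, and non-positives predicted as +1.
        fn = sum(1 for p, y in zip(preds, y_val) if y == +1 and p == -1)
        fp = sum(1 for p, y in zip(preds, y_val) if y == -1 and p == +1)
        index = evaluation_index(fn, fp, n_pos, n_unl)
        if index < best_index:
            best_tree, best_index = tree, index
    return best_tree, best_index
```

In the method, `trees` would hold the nine trees built for Posl = 0.1, ..., 0.9, and the validation data would be the samples left out of the bootstrap draw.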
The beneficial effects of the invention are as follows. Starting from a small number of fraud customer samples, the invention introduces spy samples to find a reliable negative sample set for modeling, which purifies the unlabeled data set and gives higher precision than modeling directly; it also avoids the large amount of time and manpower that a supervised model requires to label fraud customers in the unlabeled data pool. In addition, by traversing the uncertain positive sample proportion Posl and integrating the optimal trees into a random forest, the method retains the speed, efficiency and precision of the parallel random forest algorithm. Applying a high-risk fraud customer identification technique that combines semi-supervised learning with random forests reduces the cost of manually labeling samples and improves the efficiency of identifying high-risk customers.
Drawings
FIG. 1 is a flowchart illustrating steps of an identification method according to the present invention.
Detailed Description
The technical scheme of the invention is further described below with reference to the attached drawing and specific embodiments. Referring to FIG. 1, the method of the invention for identifying high-risk fraud customers of a bank based on a small number of fraud samples comprises the following steps:
S1: extract bank customer data D = {D_p, D_u}, where D_p denotes the customers marked "fraudulent" and D_u denotes the customer group not marked "fraudulent"; D_pi = <A_i, y_i> and D_ui = <A_i>, where A_i is the feature variables of a customer and y_i is the corresponding class (y_i = +1 denotes "fraud", y_i = -1 denotes "non-fraud"); A is the matrix formed by the feature variables of all customers.
S2: preprocess and clean the raw data. The data preprocessing and data cleaning include: checking the data quality, removing duplicate data and abnormal data, filling in the missing values of the explanatory variables A, normalizing, and converting categorical variables into numerical variables.
S3: randomly extract s% of the samples in D_p as spy samples and put them into D_u, generating two new data sets (the positive set without the spy samples and the unlabeled set with the spy samples added).
S4: train a logistic regression model that treats these two data sets as two classes, and use the model to score the data in the unlabeled set, the score being the probability that a sample is a positive example; the samples of D_u whose score is lower than a set threshold t form the reliable negative sample set D_n, and the samples in D_n are assigned the label y_i = -1. The threshold t is preferably 15%.
S5: train a random forest model using D_p and D_n, whose corresponding classes are y_i = +1 and y_i = -1. Posl is the proportion of positive samples in S = (D_p ∪ D_n); samples are drawn from S by bootstrap to form the training set, so that part of S ends up in the training set and the remaining samples form the validation set.
S6: set Posl to each value from 0.1 to 0.9 with a step of 0.1, and for each Posl construct a tree T_j, j = 1, ..., 9, using the training set data. The steps for constructing a tree T_j are as follows:
S61: randomly sample, with replacement, from the attribute set A to form a new attribute space A';
S62: for each attribute a_j in the attribute space A', calculate the information gain, where |P| and |U| denote the number of positive samples and the number of unlabeled samples in the training set, and |P_node| and |U_node| denote the number of positive samples and the number of unlabeled samples in the node data; the information gain is calculated as follows:
p_{-1} = 1 - p_1
S63: select the attribute with the maximum information gain as the splitting attribute and extend child nodes from the split point;
S64: repeat steps S61 to S63 for each child node until the tree can no longer be split and is fully grown.
S7: use the validation set data to find the optimal tree F among the 9 trees generated in step S6. The steps for finding the optimal tree are as follows:
S71: apply each tree T_j, j = 1, ..., 9, to the test set, and compute the number of positive samples |P_v|, the number of unlabeled samples |U_v|, the number of false negatives |FU_v|, and the number of false positives |FP_v| in the test set;
S72: calculate the evaluation index;
S73: the tree with the minimum evaluation index is the optimal tree.
S8: repeat steps S5 to S7 until n optimal trees are obtained, and combine them into a random forest containing n trees; use the trained random forest to predict on input bank customer data, and identify the customers whose predicted class is y_i = +1 as high-risk fraud customers.
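The outer loop of steps S5, S6 and S8 can be sketched with three small helpers. The names are hypothetical, and two points are assumptions rather than statements from the source: the validation set is taken to be the out-of-bag samples of the bootstrap draw, and the forest combines its trees by majority vote. As general background, a bootstrap draw of size n leaves each sample out with probability (1 - 1/n)^n, which approaches e^-1 (about 36.8%) for large n.

```python
import random

# Posl values traversed in step S6: 0.1, 0.2, ..., 0.9.
POSL_GRID = [round(0.1 * j, 1) for j in range(1, 10)]

def bootstrap_split(samples, rng):
    """Draw len(samples) items with replacement; out-of-bag items form the validation set."""
    n = len(samples)
    picked = [rng.randrange(n) for _ in range(n)]
    in_bag = set(picked)
    train = [samples[i] for i in picked]
    validation = [samples[i] for i in range(n) if i not in in_bag]
    return train, validation

def forest_predict(trees, x):
    """Majority vote over trees that each return +1 (fraud) or -1 (non-fraud)."""
    votes = sum(tree(x) for tree in trees)
    return +1 if votes > 0 else -1
```

Each of the n rounds would call `bootstrap_split` once, grow one tree per value in `POSL_GRID` on the training part, keep the tree with the smallest evaluation index on the validation part, and finally pass the n kept trees to `forest_predict`. A tied vote is resolved to -1 here, which is a design choice of the sketch.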
In practical applications, bank data volumes are large, fraud customers account for a small proportion, labels are acquired inefficiently, and undiscovered fraud customers may exist. The invention therefore introduces a semi-supervised learning method together with a random forest optimization framework: it finds a reliable negative sample set and builds a random forest that integrates the optimal trees. With only a small amount of labeled fraud customer data, this achieves higher precision in identifying high-risk fraud customers, avoids the limited accuracy that unlabeled data imposes on a supervised model, and helps bank staff carry out targeted inspections. The cost of manually labeling samples is reduced, and the efficiency of identifying high-risk customers is improved.
Claims (3)
1. A method for identifying high-risk fraud customers of a bank based on a small number of fraud samples, characterized by comprising the following steps:
S1: extract bank customer data D = {D_p, D_u}, where D_p denotes the customers marked "fraudulent" and D_u denotes the customer group not marked "fraudulent"; D_pi = <A_i, y_i> and D_ui = <A_i>, where A_i is the feature variables of a customer and y_i is the corresponding class; y_i = +1 denotes "fraud" and y_i = -1 denotes "non-fraud"; A is the matrix formed by the feature variables of all customers;
S2: preprocess and clean the raw data;
S3: randomly extract s% of the samples in D_p as spy samples and put them into D_u, generating two new data sets (the positive set without the spy samples and the unlabeled set with the spy samples added);
S4: train a logistic regression model that treats these two data sets as two classes, and use the model to score the data in the unlabeled set, the score being the probability that a sample is a positive example; the samples of D_u whose score is lower than a set threshold t form the reliable negative sample set D_n, and the samples in D_n are assigned the label y_i = -1;
S5: train a random forest model using D_p and D_n, whose corresponding classes are y_i = +1 and y_i = -1; Posl is the proportion of positive samples in S = (D_p ∪ D_n); samples are drawn from S by bootstrap to form the training set, so that part of S ends up in the training set and the remaining samples form the validation set;
S6: set Posl to each value from 0.1 to 0.9 with a step of 0.1, and for each Posl construct a tree T_j, j = 1, 2, ..., 9, using the training set data; the steps for constructing a tree T_j are as follows:
S61: randomly sample, with replacement, from the attribute set A to form a new attribute space A';
S62: for each attribute a_j in the attribute space A', calculate the information gain, where |P| and |U| denote the number of positive samples and the number of unlabeled samples in the training set, and |P_node| and |U_node| denote the number of positive samples and the number of unlabeled samples in the node data; the information gain is calculated as follows:
p_{-1} = 1 - p_1
S63: select the attribute with the maximum information gain as the splitting attribute and extend child nodes from the split point;
S64: repeat steps S61 to S63 for each child node until the tree can no longer be split and is fully grown;
S7: use the validation set data to find the optimal tree F among the 9 trees generated in step S6; the steps for finding the optimal tree are as follows:
S71: apply each tree T_j, j = 1, 2, ..., 9, to the test set, and compute the number of positive samples |P_v|, the number of unlabeled samples |U_v|, the number of false negatives |FU_v|, and the number of false positives |FP_v| in the test set;
S72: calculate the evaluation index;
S73: the tree with the minimum evaluation index is the optimal tree;
S8: repeat steps S5 to S7 until n optimal trees are obtained, and combine them into a random forest containing n trees; use the trained random forest to predict on input bank customer data, and identify the customers whose predicted class is y_i = +1 as high-risk fraud customers.
2. The method for identifying high-risk fraud customers of a bank based on a small number of fraud samples according to claim 1, characterized in that the data preprocessing and data cleaning described in step S2 include: checking the data quality, removing duplicate data and abnormal data, filling in the missing values of the explanatory variables A, normalizing, and converting categorical variables into numerical variables.
3. The method for identifying high-risk fraud customers of a bank based on a small number of fraud samples according to claim 1, characterized in that the threshold t in step S4 is 15%.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911235911.1A CN111047428B (en) | 2019-12-05 | 2019-12-05 | Bank high-risk fraud customer identification method based on small amount of fraud samples |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111047428A | 2020-04-21 |
CN111047428B | 2023-08-08 |
Family
ID=70234914
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911235911.1A (Active, CN111047428B) | 2019-12-05 | 2019-12-05 | Bank high-risk fraud customer identification method based on small amount of fraud samples |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111047428B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112001788B (en) * | 2020-08-21 | 2024-02-09 | 东北大学 | Credit card illegal fraud identification method based on RF-DBSCAN algorithm |
CN113569919A (en) * | 2021-07-06 | 2021-10-29 | 上海淇玥信息技术有限公司 | User tag processing method and device and electronic equipment |
CN115018656B (en) * | 2022-08-08 | 2023-01-10 | 太平金融科技服务(上海)有限公司深圳分公司 | Risk identification method, and training method, device and equipment of risk identification model |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107785058A (en) * | 2017-07-24 | 2018-03-09 | 平安科技(深圳)有限公司 | Anti- fraud recognition methods, storage medium and the server for carrying safety brain |
CN109472610A (en) * | 2018-11-09 | 2019-03-15 | 福建省农村信用社联合社 | A kind of bank transaction is counter to cheat method and system, equipment and storage medium |
CN109492026A (en) * | 2018-11-02 | 2019-03-19 | 国家计算机网络与信息安全管理中心 | A kind of Telecoms Fraud classification and Detection method based on improved active learning techniques |
CN110334737A (en) * | 2019-06-04 | 2019-10-15 | 阿里巴巴集团控股有限公司 | A kind of method and system of the customer risk index screening based on random forest |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180033006A1 (en) * | 2016-07-27 | 2018-02-01 | Intuit Inc. | Method and system for identifying and addressing potential fictitious business entity-based fraud |
Also Published As
Publication number | Publication date |
---|---|
CN111047428A (en) | 2020-04-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||