CN111047428B - Bank high-risk fraud customer identification method based on small amount of fraud samples - Google Patents

Info

Publication number
CN111047428B
Authority
CN
China
Prior art keywords
data
samples
fraud
tree
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911235911.1A
Other languages
Chinese (zh)
Other versions
CN111047428A (en)
Inventor
杨颖一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Suoxinda Data Technology Co ltd
Original Assignee
Shenzhen Suoxinda Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Suoxinda Data Technology Co ltd
Priority to CN201911235911.1A
Publication of CN111047428A
Application granted
Publication of CN111047428B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00: Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/02: Banking, e.g. interest calculation or account maintenance
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Engineering & Computer Science (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A bank high-risk fraud customer identification method based on a small number of fraud samples relates to the technical field of customer data processing in bank management systems and addresses the low efficiency of existing learning models when identifying fraud in large-scale data. The method comprises the following steps: S1: extract bank customer data D; S2: preprocess and clean the raw data; S3: randomly extract s% of the samples in D_p as spy samples and put them into D_u, generating two new data sets; S4: train a logistic regression model with the two new data sets as the two classes; S5: train a random forest model; S6: construct trees using the training set data; S7: use the validation set data to find the optimal tree F among the 9 trees generated in step S6; S8: repeat steps S5 to S7 until n optimal trees are obtained. The method improves the efficiency of identifying high-risk customers.

Description

Bank high-risk fraud customer identification method based on small amount of fraud samples
Technical Field
The invention relates to the technical field of customer data processing in bank management systems, and in particular to an improved method for identifying high-risk fraud customers of a bank.
Background
Machine learning is an important means of financial technology innovation, and in recent years financial institutions and fintech companies in China and abroad have tried to apply it to risk prevention, anti-fraud and similar fields. Banking institutions often use logistic regression, tree models and the like to mine deep business-scenario features from large-scale data sets and then build supervised and unsupervised learning models to improve fraud recognition. A supervised model reduces labor cost and gives stable results, but places high demands on the data set (accurate and complete labels); an unsupervised model requires follow-up data analysis and therefore costs more labor. Bank fraud is becoming more concealed and more professional, and new criminal techniques and forms keep emerging. The existing samples of fraudulent customers are highly representative, whereas the remaining customers, who are not labeled as fraudulent, are not necessarily non-fraudulent: the unlabeled sample is a mix of fraudulent and non-fraudulent customers, and manually labeling every record would be too wasteful. Traditional fraud detection, such as methods that rely on expert rules or blacklist libraries, can no longer cope with these new fraud challenges.
Disclosure of Invention
In summary, the invention aims to overcome the technical defects that existing learning models are inefficient at identifying fraud in large-scale data and are prone to leaving fraudulent customers undiscovered, and provides a bank high-risk fraud customer identification method based on a small number of fraud samples.
To solve the above technical problems, the invention adopts the following technical solution:
A bank high-risk fraud customer identification method based on a small number of fraud samples, characterized by comprising the following steps:
S1: extract bank customer data D, D = {D_p, D_u}, where D_p represents the customers labeled "fraud" and D_u represents the customers not labeled "fraud"; D_pi = <A_i, y_i> and D_ui = <A_i>, where A_i is the feature variables of customer i and y_i is the corresponding class; y_i = +1 denotes "fraud" and y_i = -1 denotes "non-fraud"; A is the matrix formed by the feature variables of all customers;
S2: preprocess and clean the raw data;
S3: randomly extract s% of the samples in D_p as spy samples and put them into D_u, generating two new data sets: the positive set without the spy samples and the unlabeled set with the spy samples added;
S4: train a logistic regression model with the two new data sets as the two classes, and use the model to score the data, the score being the probability that a sample is a positive example; the samples in D_u whose score is below a set threshold t form the reliable negative sample set D_n, and the samples in the reliable negative sample set are given the label y_i = -1;
S5: train a random forest model using D_p and D_n, whose classes are y_i = +1 and y_i = -1 respectively; Posl is the proportion of positive samples in the sample S = (D_p ∪ D_n); use bootstrap sampling to draw samples from S as the training set, so that part of S ends up in the training set and the remainder serves as the validation set;
S6: set Posl to each value from 0.1 to 0.9 in steps of 0.1 and, for each value of Posl, construct a tree T_j, j = 1...9, using the training set data;
S7: use the validation set data to find the optimal tree F among the 9 trees generated in step S6;
S8: repeat steps S5 to S7 until n optimal trees are obtained, combine them into a random forest containing n trees, and use the trained random forest to predict on input bank customer data; customers whose predicted class is y_i = +1 are identified as high-risk fraud customers.
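Steps S3 and S4 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names (`fit_logistic`, `score`, `reliable_negatives`), the gradient-descent training loop with its learning rate and epoch count, and the reading of the threshold t as a probability cutoff applied to the samples of D_u are all assumptions made for the example.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Batch gradient descent for logistic regression (assumed trainer).

    X is a list of feature lists, y a list of 0/1 labels.
    Returns the weights with the bias stored as the last element.
    """
    d, n = len(X[0]), len(X)
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + w[d]) - yi
            for j in range(d):
                grad[j] += err * xi[j]
            grad[d] += err
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w


def score(w, x):
    # Probability that x is a positive ("fraud") example.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1])


def reliable_negatives(d_p, d_u, s=15, t=0.15, seed=0):
    """S3 + S4: hide s% of D_p as spies in D_u, train, keep low-scoring D_u samples."""
    rng = random.Random(seed)
    k = max(1, round(len(d_p) * s / 100))
    spy_idx = set(rng.sample(range(len(d_p)), k))
    spies = [x for i, x in enumerate(d_p) if i in spy_idx]
    p_rest = [x for i, x in enumerate(d_p) if i not in spy_idx]
    u_plus = d_u + spies  # unlabeled set with spies mixed in
    X = p_rest + u_plus
    y = [1] * len(p_rest) + [0] * len(u_plus)
    w = fit_logistic(X, y)
    # Assumed reading of the threshold: a probability cutoff on D_u samples.
    return [x for x in d_u if score(w, x) < t]
```

Each returned sample would then receive the label y_i = -1 and form D_n. In practice a library logistic regression would normally replace the hand-written trainer; it is written out here only to keep the sketch self-contained.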
Further refinements of the technical solution are as follows:
The data preprocessing and data cleaning in step S2 comprise: checking data quality, removing duplicate and abnormal data, filling missing values of the explanatory variables A, normalizing, and converting category variables into numerical variables.
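The preprocessing in step S2 can be illustrated with a short sketch. The text does not specify how missing values are filled, how normalization is done, or how categories are encoded, so the choices below (column mean, min-max scaling to [0, 1], integer codes in order of first appearance) and the function name `preprocess` are assumptions; removal of abnormal data is omitted because no criterion is given.

```python
def preprocess(rows):
    """Clean a list of customer records given as dicts.

    Numeric fields may contain None (missing); string fields are categorical.
    """
    # Remove exact duplicate records, keeping the first occurrence.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))
    if not unique:
        return unique

    for col in unique[0]:
        values = [row[col] for row in unique]
        if any(isinstance(v, str) for v in values):
            # Category variable -> integer code (assumed encoding).
            codes = {}
            for row in unique:
                row[col] = codes.setdefault(row[col], len(codes))
        else:
            # Fill missing values with the column mean (assumed strategy).
            observed = [v for v in values if v is not None]
            mean = sum(observed) / len(observed) if observed else 0.0
            filled = [mean if v is None else v for v in values]
            # Min-max normalization to [0, 1] (assumed strategy).
            lo, hi = min(filled), max(filled)
            span = hi - lo
            for row, v in zip(unique, filled):
                row[col] = (v - lo) / span if span else 0.0
    return unique
```

The function returns new dicts and leaves the input records untouched, so the raw extract from step S1 stays available for auditing.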
The threshold t in step S4 is preferably 15%.
In step S6, the tree T_j is constructed as follows:
S61: randomly sample attributes with replacement from the attribute set A to form a new attribute space A';
S62: for each attribute a_j in the attribute space A', calculate the information gain, where |P| and |U| denote the numbers of positive samples and unlabeled samples in the training set, and |P_node| and |U_node| denote the numbers of positive samples and unlabeled samples in the node data; the information gain is calculated from the class probabilities p_1 and p_{-1}, where $p_{-1} = 1 - p_{1}$;
S63: select the attribute with the largest information gain as the splitting attribute and extend child nodes from the split point;
S64: repeat steps S61 to S63 for each child node until the tree can no longer be split and is fully grown.
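The gain computation in S62 can be sketched as below. Only the relation p_{-1} = 1 - p_1 and the counts |P|, |U|, |P_node| and |U_node| come from the text; the formula for p_1 itself is not reproduced there. The estimate in `estimate_p1` and the weighting of child nodes by their unlabeled counts are therefore assumptions, borrowed from a common style of positive-unlabeled tree learning, and the function names are hypothetical.

```python
import math


def binary_entropy(p1):
    # Two-class entropy with p_{-1} = 1 - p_1, as stated in S62.
    p_neg = 1.0 - p1
    return -sum(p * math.log2(p) for p in (p1, p_neg) if p > 0)


def estimate_p1(p_node, u_node, p_total, u_total, posl):
    # ASSUMPTION: positive-class probability at a node, estimated from the
    # node's share of positives, the assumed positive proportion Posl and
    # the node's share of unlabeled samples. Not the patent's stated formula.
    if u_node == 0:
        return 1.0
    return min(1.0, (p_node / p_total) * posl * (u_total / u_node))


def information_gain(parent, children, p_total, u_total, posl):
    """parent and each child are (|P_node|, |U_node|) tuples."""
    def entropy(node):
        p_node, u_node = node
        return binary_entropy(estimate_p1(p_node, u_node, p_total, u_total, posl))

    u_parent = parent[1]
    # ASSUMPTION: children are weighted by their share of unlabeled samples.
    weighted = (
        sum((c[1] / u_parent) * entropy(c) for c in children) if u_parent else 0.0
    )
    return entropy(parent) - weighted
```

Step S63 would then evaluate `information_gain` for every candidate attribute in A' and split on the one with the largest value.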
The optimal tree in step S7 is found as follows:
S71: apply each T_j, j = 1...9, to the validation set and calculate the number of positive samples |P_v|, the number of unlabeled samples |U_v|, the number of false negatives |FU_v| and the number of false positives |FP_v| in the validation set;
S72: calculate the evaluation index from these counts;
S73: the tree with the smallest evaluation index is the optimal tree.
The beneficial effects of the invention are as follows. Starting from a small number of fraudulent customer samples, the invention introduces spy samples to find a reliable negative sample set for modeling, which purifies the unlabeled data set and gives higher accuracy than modeling on the unlabeled data directly; it also avoids the large amount of time and manpower that a supervised model would require to label fraudulent customers in the unlabeled data pool. In addition, by traversing the uncertain positive sample proportion Posl and combining the optimal trees into a random forest, the method retains the speed, efficiency and accuracy of the parallel random forest algorithm. Applying this combination of semi-supervised learning and random forests to high-risk fraud customer identification reduces the cost of manually labeling samples and improves the efficiency of identifying high-risk customers.
Drawings
FIG. 1 is a flowchart illustrating steps of an identification method according to the present invention.
Detailed Description
The technical scheme of the invention is further described below with reference to the attached drawings and specific embodiments of the invention. Referring to fig. 1, the bank high risk fraud customer identification method based on a small amount of fraud samples of the present invention includes the steps of:
S1: extract bank customer data D, D = {D_p, D_u}, where D_p represents the customers labeled "fraud" and D_u represents the customers not labeled "fraud"; D_pi = <A_i, y_i> and D_ui = <A_i>, where A_i is the feature variables of customer i and y_i is the corresponding class; y_i = +1 denotes "fraud" and y_i = -1 denotes "non-fraud"; A is the matrix formed by the feature variables of all customers.
S2: preprocess and clean the raw data. The data preprocessing and data cleaning comprise: checking data quality, removing duplicate and abnormal data, filling missing values of the explanatory variables A, normalizing, and converting category variables into numerical variables.
S3: randomly extract s% of the samples in D_p as spy samples and put them into D_u, generating two new data sets: the positive set without the spy samples and the unlabeled set with the spy samples added.
S4: train a logistic regression model with the two new data sets as the two classes, and use the model to score the data, the score being the probability that a sample is a positive example; the samples in D_u whose score is below a set threshold t form the reliable negative sample set D_n, and the samples in the reliable negative sample set are given the label y_i = -1. The threshold t is preferably 15%.
S5: train a random forest model using D_p and D_n, whose classes are y_i = +1 and y_i = -1 respectively. Posl is the proportion of positive samples in the sample S = (D_p ∪ D_n). Use bootstrap sampling to draw samples from S as the training set, so that part of S ends up in the training set and the remainder serves as the validation set.
S6: set Posl to each value from 0.1 to 0.9 in steps of 0.1 and, for each value of Posl, construct a tree T_j, j = 1...9, using the training set data. The tree T_j is constructed as follows:
S61: randomly sample attributes with replacement from the attribute set A to form a new attribute space A';
S62: for each attribute a_j in the attribute space A', calculate the information gain, where |P| and |U| denote the numbers of positive samples and unlabeled samples in the training set, and |P_node| and |U_node| denote the numbers of positive samples and unlabeled samples in the node data; the information gain is calculated from the class probabilities p_1 and p_{-1}, where $p_{-1} = 1 - p_{1}$;
S63: select the attribute with the largest information gain as the splitting attribute and extend child nodes from the split point;
S64: repeat steps S61 to S63 for each child node until the tree can no longer be split and is fully grown.
S7: use the validation set data to find the optimal tree F among the 9 trees generated in step S6. The optimal tree is found as follows:
S71: apply each T_j, j = 1...9, to the validation set and calculate the number of positive samples |P_v|, the number of unlabeled samples |U_v|, the number of false negatives |FU_v| and the number of false positives |FP_v| in the validation set;
S72: calculate the evaluation index from these counts;
S73: the tree with the smallest evaluation index is the optimal tree.
S8: repeat steps S5 to S7 until n optimal trees are obtained, combine them into a random forest containing n trees, and use the trained random forest to predict on input bank customer data; customers whose predicted class is y_i = +1 are identified as high-risk fraud customers.
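The outer loop of steps S5 to S8 can be sketched as follows. The tree builder and the evaluation index are passed in as callables because the text does not fully specify them; `build_tree`, `evaluate`, the representation of a tree as a callable returning +1 or -1, and the majority vote used for prediction are all assumptions of this sketch. The split relies on a general property of sampling with replacement: the samples never drawn (the out-of-bag samples) are available as a validation set.

```python
import random

# Posl values 0.1, 0.2, ..., 0.9 from step S6.
POSL_GRID = [round(0.1 * j, 1) for j in range(1, 10)]


def bootstrap_split(samples, rng):
    """S5: draw len(samples) items with replacement; the rest validate."""
    n = len(samples)
    drawn = [rng.randrange(n) for _ in range(n)]
    in_bag = set(drawn)
    train = [samples[i] for i in drawn]
    valid = [samples[i] for i in range(n) if i not in in_bag]
    return train, valid


def build_forest(samples, n_trees, build_tree, evaluate, seed=0):
    """S5-S8: keep, per bootstrap round, the tree with the smallest index."""
    rng = random.Random(seed)
    forest = []
    while len(forest) < n_trees:
        train, valid = bootstrap_split(samples, rng)
        # S6: one candidate tree per Posl value (9 trees).
        candidates = [build_tree(train, posl) for posl in POSL_GRID]
        # S7: the optimal tree minimizes the evaluation index on the validation set.
        forest.append(min(candidates, key=lambda tree: evaluate(tree, valid)))
    return forest


def predict(forest, x):
    # ASSUMPTION: trees vote +1 / -1 and the majority decides the class.
    votes = sum(tree(x) for tree in forest)
    return 1 if votes > 0 else -1
```

A customer x for which `predict(forest, x)` returns +1 would be flagged as a high-risk fraud customer. Choosing an odd n avoids tied votes under the assumed majority rule.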
In practice, bank data sets are large, fraudulent customers make up a small proportion, labels are expensive to obtain, and undiscovered fraudulent customers may exist. The invention therefore combines a semi-supervised learning method with a random forest optimization framework: it finds a reliable negative sample set and builds a random forest from optimal trees. With only a small amount of labeled fraudulent customer data, it achieves higher accuracy in identifying high-risk fraud customers, avoids the limit on supervised-model accuracy caused by unlabeled data, and helps bank staff carry out targeted inspections. The cost of manually labeling samples is reduced and the efficiency of identifying high-risk customers is improved.

Claims (3)

1. A bank high-risk fraud customer identification method based on a small number of fraud samples, characterized by comprising the following steps:
S1: extracting bank customer data D, D = {D_p, D_u}, where D_p represents the customers labeled "fraud" and D_u represents the customers not labeled "fraud"; D_pi = <A_i, y_i> and D_ui = <A_i>, where A_i is the feature variables of customer i and y_i is the corresponding class; y_i = +1 denotes "fraud" and y_i = -1 denotes "non-fraud"; A is the matrix formed by the feature variables of all customers;
S2: preprocessing and cleaning the raw data;
S3: randomly extracting s% of the samples in D_p as spy samples and putting them into D_u, generating two new data sets: the positive set without the spy samples and the unlabeled set with the spy samples added;
S4: training a logistic regression model with the two new data sets as the two classes, and using the model to score the data, the score being the probability that a sample is a positive example; the samples in D_u whose score is below a set threshold t form the reliable negative sample set D_n, and the samples in the reliable negative sample set are given the label y_i = -1;
S5: training a random forest model using D_p and D_n, whose classes are y_i = +1 and y_i = -1 respectively; Posl is the proportion of positive samples in the sample S = (D_p ∪ D_n); using bootstrap sampling to draw samples from S as the training set, so that part of S ends up in the training set and the remainder serves as the validation set;
S6: setting Posl to each value from 0.1 to 0.9 in steps of 0.1 and, for each value of Posl, constructing a tree T_j, j = 1, 2, ..., 9, using the training set data; the tree T_j is constructed as follows:
S61: randomly sampling attributes with replacement from the attribute set A to form a new attribute space A';
S62: for each attribute a_j in the attribute space A', calculating the information gain, where |P| and |U| denote the numbers of positive samples and unlabeled samples in the training set, and |P_node| and |U_node| denote the numbers of positive samples and unlabeled samples in the node data; the information gain is calculated from the class probabilities p_1 and p_{-1}, where $p_{-1} = 1 - p_{1}$;
S63: selecting the attribute with the largest information gain as the splitting attribute and extending child nodes from the split point;
S64: repeating steps S61 to S63 for each child node until the tree can no longer be split and is fully grown;
S7: using the validation set data to find the optimal tree F among the 9 trees generated in step S6; the optimal tree is found as follows:
S71: applying each T_j, j = 1, 2, ..., 9, to the validation set and calculating the number of positive samples |P_v|, the number of unlabeled samples |U_v|, the number of false negatives |FU_v| and the number of false positives |FP_v| in the validation set;
S72: calculating the evaluation index from these counts;
S73: the tree with the smallest evaluation index is the optimal tree;
S8: repeating steps S5 to S7 until n optimal trees are obtained, combining them into a random forest containing n trees, and using the trained random forest to predict on input bank customer data; customers whose predicted class is y_i = +1 are identified as high-risk fraud customers.
2. The bank high-risk fraud customer identification method based on a small number of fraud samples according to claim 1, characterized in that the data preprocessing and data cleaning in step S2 comprise: checking data quality, removing duplicate and abnormal data, filling missing values of the explanatory variables A, normalizing, and converting category variables into numerical variables.
3. The bank high-risk fraud customer identification method based on a small number of fraud samples according to claim 1, characterized in that the threshold t in step S4 is 15%.
CN201911235911.1A 2019-12-05 2019-12-05 Bank high-risk fraud customer identification method based on small amount of fraud samples Active CN111047428B (en)

Publications (2)

Publication Number Publication Date
CN111047428A CN111047428A (en) 2020-04-21
CN111047428B CN111047428B (en) 2023-08-08

Family

ID=70234914

Country Status (1)

Country Link
CN (1) CN111047428B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant