CN112581265A - Internet financial client application fraud detection method based on AdaBoost - Google Patents


Info

Publication number
CN112581265A
Authority
CN
China
Prior art keywords
data
adaboost
sample
training
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011536761.0A
Other languages
Chinese (zh)
Inventor
江远强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baiweijinke Shanghai Information Technology Co ltd
Original Assignee
Baiweijinke Shanghai Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baiweijinke Shanghai Information Technology Co ltd filed Critical Baiweijinke Shanghai Information Technology Co ltd
Priority to CN202011536761.0A priority Critical patent/CN112581265A/en
Publication of CN112581265A publication Critical patent/CN112581265A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03Credit; Loans; Processing thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]

Abstract

The application discloses an AdaBoost-based application fraud detection method for Internet financial clients, comprising the following steps: selecting normal and fraudulent clients as modeling samples and collecting their credit data; after missing-value imputation, outlier handling and normalization of the collected credit data, dividing it into a training set and a test set by K-fold cross-validation; constructing a plurality of weak classifiers on the training set and weighting them into an AdaBoost application fraud classifier; optimizing the AdaBoost weak-classifier weights with a particle swarm algorithm and retraining on the training set; inputting test-set samples to the trained AdaBoost for detection, comparing the output with the actual labels, and evaluating the model against logistic regression and a support vector machine using model precision evaluation indexes; and deploying the optimized AdaBoost fraud detection model to the application platform to detect client fraud in real time. The method optimizes the AdaBoost weak-classifier weights by particle swarm and generates a corrected strong classifier to detect application fraud by Internet financial clients.

Description

Internet financial client application fraud detection method based on AdaBoost
Technical Field
The invention relates to the technical field of risk control in the Internet finance industry, and in particular to an AdaBoost-based application fraud detection method for Internet financial clients.
Background
With the development of Internet finance, online lending has grown rapidly. In an online environment, strongly relevant credit features such as a borrower's financial information are hard to obtain, so a large number of weakly relevant credit features must be collected from different network platforms. From these features, a fraud-behavior model built with machine learning, such as logistic regression or a support vector machine, detects whether a given client's application is fraudulent: the independent variables are the personal basic information submitted when a sample client registers and applies, together with operation-behavior buried-point data obtained from monitoring software, and the dependent variable is the probability of a fraudulent application. Research shows that the fraud rate of online lending is significantly higher than that of traditional offline lending, and fraudulent users, whose misclassification cost is high, strongly influence the classification results of base classifiers and ensemble models, so the demand for an AdaBoost-based application fraud detection method for Internet financial clients grows daily.
Common logistic regression and support vector machine algorithms are single classifiers with limited precision; compared with a simple classification model, an ensemble model performs better. The AdaBoost (adaptive boosting) algorithm combines several weak classifiers into a strong classifier; it is simple and effective, its sub-classifiers can be constructed by different methods, and it markedly improves the prediction performance of the learning algorithm, so it is widely applied in many fields. How to apply AdaBoost to financial client fraud detection has become an important research direction, and for the above problems an AdaBoost-based application fraud detection method for Internet financial clients is therefore provided.
Disclosure of Invention
The object of the invention is to provide an AdaBoost-based application fraud detection method for Internet financial clients that solves the problems raised in the background art.
In order to achieve the purpose, the invention provides the following technical scheme:
an Internet financial client application fraud detection method based on AdaBoost comprises the following six steps:
S1, collecting data: selecting a certain proportion and quantity of normal and fraudulent applicants from an Internet financial platform as modeling samples, collecting the personal basic information submitted when each sample client registers and applies, obtaining operation-behavior buried-point data from monitoring software as credit data, and using each sample's normal or fraudulent performance as label data;
S2, preprocessing data: after missing-value imputation, outlier handling and normalization of the collected credit data, dividing it into a training set and a test set by K-fold cross-validation;
S3, constructing a plurality of weak classifiers on the training set and weighting them into an AdaBoost application fraud classifier;
S4, optimizing the AdaBoost weak-classifier weights with a particle swarm algorithm and retraining on the training set;
S5, inputting test-set samples to the trained AdaBoost for detection, comparing the output with the actual labels, and evaluating the model against logistic regression and a support vector machine using model precision evaluation indexes;
S6, deploying the optimized AdaBoost fraud detection model to the application platform, acquiring each real-time applicant's data, importing it into the prediction model as a sample to be tested, and outputting whether it is fraudulent so as to approve applicants in real time; performance data is periodically fed back into the model for training to update the model online.
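A minimal end-to-end sketch of steps S1-S6 using scikit-learn's off-the-shelf AdaBoostClassifier (the PSO weight re-optimization of S4 is omitted here); the synthetic data, the 0.3 test fraction and all variable names are illustrative assumptions, not taken from the patent:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))            # S1: stand-in for collected credit data
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # 1 = fraud, 0 = normal (synthetic label)

X = MinMaxScaler().fit_transform(X)      # S2: normalize features into [0, 1]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# S3: AdaBoost combines weighted decision stumps into a strong classifier
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# S5: evaluate on the held-out test set (S4 and S6 are omitted in this sketch)
acc = clf.score(X_te, y_te)
print(round(acc, 2))
```

In a real deployment the synthetic features would be replaced by the credit and buried-point data of S1, and the fitted model would be served behind the application platform as in S6.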
Preferably, in S1, normal-repayment and overdue clients in a certain proportion and quantity are selected from the back end of the Internet financial platform according to post-loan performance as modeling samples, the personal basic information submitted at account registration and application is collected, and operation-behavior buried-point data is obtained from monitoring software. The user's personal application information comprises: mobile phone number, education background, marital status, employer, address and contact information, together with the personal basic information, credit transaction information, public information and special-record data obtained from the credit report. The buried-point data comprises device behavior data and log data collected at the tracking points. The device behavior data includes: number of platform logins, number of clicks, click frequency, total and average input time, mobile phone number data, GPS position, MAC address, IP address data, application frequency per geographic location, application frequency per IP, device battery percentage and average gyroscope acceleration. The log data includes: logins within 7 days, time from first click to credit application, maximum number of sessions in one day, behavior statistics for the week before the credit application, and the like. In addition, subject to compliance requirements, the method is not limited to these sources and may draw on multi-dimensional big data including mobile Internet behavior data, in-app behavior data from the loan APP, credit history and operator data.
Preferably, in S2, AdaBoost is sensitive to abnormal samples, which may receive high weights during iteration and degrade the prediction accuracy of the final strong learner. After removing outliers and reducing noise in the sample data collected in S1, the data is normalized so that all values fall in [0, 1], reducing the differences between features, with the normalization formula:

X_norm = (X − X_min) / (X_max − X_min)

wherein X_norm is the normalized data; X_min and X_max are respectively the minimum and maximum values in the data set; and X is the original data.
The normalized data set is divided into a training set and a test set by K-fold: the data set is first shuffled, then uniformly divided into K disjoint subsets, and the training and test sets are drawn at random for cross-validation.
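A minimal sketch of S2: min-max normalization to [0, 1] followed by a K-fold split (K = 5 is an assumed value here, since the patent leaves K unspecified; the toy matrix is invented for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 600.0],
              [40.0, 800.0], [50.0, 1000.0], [60.0, 1200.0]])

# X_norm = (X - X_min) / (X_max - X_min), computed per feature column
x_min, x_max = X.min(axis=0), X.max(axis=0)
X_norm = (X - x_min) / (x_max - x_min)

# Shuffle, then split into K disjoint folds for cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=0)
splits = [(tr, te) for tr, te in kf.split(X_norm)]
print(X_norm.min(), X_norm.max(), len(splits))  # → 0.0 1.0 5
```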
Preferably, in S3, the AdaBoost (adaptive boosting) algorithm repeatedly searches the sample feature space and obtains sample weights, continuously adjusting the training-sample weights during iteration: samples with low prediction accuracy have their weights increased, and samples with high accuracy have them decreased. The weak predictors are then combined into a strong predictor by weighted majority voting, i.e. a weak predictor with a smaller prediction error rate receives a larger weight so that it plays a larger role in the vote, which markedly improves the prediction performance of the learning algorithm.
Select m training samples T = {(x_1, y_1), (x_2, y_2), …, (x_m, y_m)} from the sample space, where y_i ∈ {−1, +1} marks negative and positive samples: if x_i shows no sign of fraud, (x_i, y_i) = (x_i, −1), and otherwise (x_i, y_i) = (x_i, +1). Let f(x) denote the weak classifier algorithm, K the number of weak-classifier iterations, and H(x) the classifier output by training. The specific steps are as follows:
S31, initializing sample weights:
The weight distribution of the training samples is initialized as

D_1 = (w_11, w_12, …, w_1m); w_1i = 1/m, i = 1, 2, …, m, where m is the number of samples,

wherein D_1 is the weight of each sample in the first iteration, D_t is the weight distribution of the training data before the t-th iteration begins, and w_ti is the weight of the i-th sample in the t-th iteration.
S32, iterative training
For t = 1, 2, …, T, train iteratively on the data samples to obtain a weak classifier h_t(x).
The weak classifier h_t(x) is trained with the sample distribution of the t-th round; under the current distribution D_t, the weak classifier is called to obtain the classification rule of the t-th round:

h_t(x): X → {−1, +1}.
S33, weight normalization
The weight w_ti of the i-th sample in the t-th iteration is normalized:

w_ti ← w_ti / Σ_{j=1}^{m} w_tj

wherein w_ti is the weight of sample (x_i, y_i) in the t-th iteration.
S34, calculating the error rate of each weak classifier
Using the sample distribution of the t-th round, train the weak classifier h_t(x) and compute its classification error rate e_t, the sum of the weights of all misclassified samples, which represents the training error (misjudgment) rate of the t-th weak classifier:

e_t = Σ_{i=1}^{m} w_ti · I(h_t(x_i) ≠ y_i)

wherein w_ti is the weight of sample (x_i, y_i) in the t-th iteration; I(h_t(x_i) ≠ y_i) indicates the misclassified samples counted in e_t; and y_i is the true label of the i-th sample.
S35, calculating the coefficient a_t of h_t(x)
The weight a_t of each weak classifier measures its importance and is computed as:

a_t = (1/2) · ln((1 − e_t) / e_t)

wherein e_t is the training error rate of the t-th weak classifier.
S36, updating the sample weights, increasing the weights of misclassified samples
The weight distribution D_{t+1} of the training data set is updated from the previous weights:

D_{t+1} = (w_{t+1,1}, w_{t+1,2}, …, w_{t+1,m})

w_{t+1,i} = (w_ti / Z_t) · exp(−a_t · y_i · h_t(x_i))

wherein Z_t is a normalization factor that makes the weight distribution D_{t+1} a probability distribution:

Z_t = Σ_{i=1}^{m} w_ti · exp(−a_t · y_i · h_t(x_i))

wherein w_ti is the weight of the i-th sample in the t-th iteration, a_t is the weight of each weak classifier, and h_t(x_i) is the weak classifier's prediction.
S37, constructing a linear combination of the basic classifiers
After the iterations are complete, the weak classifiers are combined into f(x):

f(x) = Σ_{t=1}^{T} a_t · h_t(x)

then the mathematical sign function is applied to obtain the final strong classifier H(x):

H(x) = sign(f(x))

wherein the sign function is: sign(z) = +1 if z ≥ 0, and −1 otherwise.
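The steps S31-S37 above can be sketched as follows, with a decision stump standing in for the weak classifier f(x); the tiny one-feature data set and all helper names are illustrative assumptions:

```python
import numpy as np

def stump_train(X, y, w):
    """Pick the single-feature threshold rule minimizing weighted error."""
    best = (None, None, 1, np.inf)  # (feature, threshold, polarity, error)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, j] - thr) >= 0, 1, -1)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, thr, pol, err)
    return best

def adaboost_fit(X, y, T=10):
    m = len(y)
    w = np.full(m, 1.0 / m)                      # S31: D_1 = (1/m, ..., 1/m)
    ensemble = []
    for _ in range(T):                           # S32: iterate t = 1..T
        j, thr, pol, e = stump_train(X, y, w)    # S34: error rate e_t
        e = max(e, 1e-10)                        # guard against e_t = 0
        a = 0.5 * np.log((1 - e) / e)            # S35: a_t
        pred = np.where(pol * (X[:, j] - thr) >= 0, 1, -1)
        w = w * np.exp(-a * y * pred)            # S36: reweight samples
        w /= w.sum()                             # S33: normalize (Z_t)
        ensemble.append((j, thr, pol, a))
    return ensemble

def adaboost_predict(ensemble, X):
    f = np.zeros(len(X))                         # S37: f(x) = sum a_t h_t(x)
    for j, thr, pol, a in ensemble:
        f += a * np.where(pol * (X[:, j] - thr) >= 0, 1, -1)
    return np.where(f >= 0, 1, -1)               # H(x) = sign(f(x))

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])              # +1 = fraud, -1 = normal
model = adaboost_fit(X, y, T=5)
print(adaboost_predict(model, X))                # → [-1 -1 -1  1  1  1]
```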
experimental results show that AdaBoost has higher detection rate, low generalization error rate, no need of parameter adjustment and difficult occurrence of overfitting phenomenon.
Preferably, in S4, to address the problem that once a weak classifier's coefficient is fixed during an iteration it cannot be changed later, so that redundant or useless weak classifiers may keep large weights, the particle swarm algorithm is used to optimize the AdaBoost weak-classifier weights: accurate weak classifiers obtain larger weights while useless or redundant ones obtain smaller weights, further improving the accuracy and readability of AdaBoost.
S41, initializing the particle swarm parameters and encoding AdaBoost
The AdaBoost weak-classifier weights to be optimized are encoded as each particle's position vector; each particle's initial position parameters are random numbers in [0, 1], and the particle velocity and position matrices are generated. The number of particles is determined by the data scale and the size of the training set, generally between 20 and 40;
S42, setting the fitness function
The error rate of the AdaBoost ensemble defined by a particle's weights is used as that particle's fitness:

fit(i) = e_i = Σ_{j=1}^{m} w_j · I(H_i(x_j) ≠ y_j)

wherein m is the number of samples, e_i is the error rate, i.e. the fitness value, of the i-th particle, H_i is the strong classifier assembled with the i-th particle's candidate weak-classifier weights, w_j is the weight of the j-th sample, and y_j is its true class.
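A minimal sketch of such a fitness evaluation, assuming (as an illustration, not from the patent text) that each weak classifier's predictions on the training set are precomputed and a particle's position is the vector of candidate weights:

```python
import numpy as np

def fitness(position, weak_preds, y):
    """position: (T,) candidate weights; weak_preds: (T, m) values in {-1, +1}."""
    f = position @ weak_preds      # weighted vote f(x) under the candidate weights
    H = np.where(f >= 0, 1, -1)    # H(x) = sign(f(x))
    return np.mean(H != y)         # error rate = fitness value (lower is better)

y = np.array([1, -1, 1, -1])
weak_preds = np.array([[1, -1, 1, -1],   # a perfect weak classifier
                       [1, 1, 1, 1]])    # a useless one
print(fitness(np.array([1.0, 0.1]), weak_preds, y))  # → 0.0
```

Down-weighting the useless classifier (the second row) drives the fitness to zero, which is exactly the behavior the swarm search rewards.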
S43, position and velocity update
In each iteration the particle fitness values are computed with the fitness function fit and compared to determine each particle's individual extremum and the global extremum, which are then used to update particle velocities and positions;
Assume M particles form a swarm in a D-dimensional search space. In each iteration, a particle updates its velocity V_id and position X_id through its individual extremum and the global extremum, with the update formulas:

V_id^{k+1} = w · V_id^k + c_1 · r_1 · (p_id^k − X_id^k) + c_2 · r_2 · (p_gd^k − X_id^k)

X_id^{k+1} = X_id^k + V_id^{k+1}

wherein w is the inertia weight (it balances the local and global search ability of the PSO, generally 0.5); c_1 and c_2 are acceleration factors (usually taken as 2, typically between 0 and 4); r_1 and r_2 are random numbers distributed in [0, 1]; d = 1, 2, …, D, where D is the data dimension; i = 1, 2, …, M, where M is the number of particles; k is the iteration number; p_id^k is the individual extremum (pbest) of the current particle; and p_gd^k is the global extremum (gbest) of the swarm.
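One PSO velocity/position step under the update formulas above; M, D, the seed and the initial pbest/gbest values are illustrative assumptions, while w = 0.5 and c1 = c2 = 2 follow the typical values quoted in the text:

```python
import numpy as np

rng = np.random.default_rng(1)
M, D = 4, 3                                   # particles, search dimensions
w, c1, c2 = 0.5, 2.0, 2.0                     # inertia and acceleration factors

X = rng.random((M, D))                        # positions start in [0, 1]
V = np.zeros((M, D))
pbest = X.copy()                              # individual best positions
gbest = X[0].copy()                           # global best position (assumed)

# V^{k+1} = w V^k + c1 r1 (pbest - X) + c2 r2 (gbest - X);  X^{k+1} = X + V^{k+1}
r1, r2 = rng.random((M, D)), rng.random((M, D))
V = w * V + c1 * r1 * (pbest - X) + c2 * r2 * (gbest - X)
X = X + V
print(X.shape)  # → (4, 3)
```

Note that the particle sitting at gbest gets zero velocity on the first step, so it stays put until another particle finds a better position.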
S44, iterative optimization
After each iteration the result is analyzed: the current fitness value fit(i) is compared with the particle's individual extremum fit(p_id); if fit(i) < fit(p_id), the individual extremum p_id of the current particle is updated with the current position. Likewise, if fit(i) < fit(p_gd), the global extremum p_gd is updated. The velocity V_id and position X_id of the current particle are updated at the same time.
S45, finding the global optimal solution
All individuals of each generation of the swarm are evaluated and iteration continues until the fitness falls below the set value or the maximum iteration number k_max is reached, at which point the minimum individual fitness fitness_min is found, yielding the individual optimal solutions p_best and the global optimal solution g_best of the particle swarm.
S46, decoding the global optimal solution
The global optimal solution g_best of the particle swarm, i.e. the best individual found, is decoded into the AdaBoost weak-classifier weights to obtain the AdaBoost detection model, and the training set is input for learning and training.
Preferably, in S5, the actual and predicted results on the samples are compared to obtain a confusion matrix, from which the following indexes are computed: true positive rate TPR (True Positive Rate), false positive rate FPR (False Positive Rate), AUC (Area Under Curve) and KS (Kolmogorov-Smirnov):

TPR = TP / (TP + FN)

FPR = FP / (FP + TN)

KS = max(TPR − FPR)

wherein a true positive (TP) means the model correctly predicts a positive-class sample as positive; a true negative (TN) means the model correctly predicts a negative-class sample as negative; a false positive (FP) means the model incorrectly predicts a negative-class sample as positive; and a false negative (FN) means the model incorrectly predicts a positive-class sample as negative. In this application, fraud samples are the positive class and normal samples the negative class.
Plotting TPR on the vertical axis against FPR on the horizontal axis gives the ROC (Receiver Operating Characteristic) curve; the AUC value (Area Under the ROC Curve) obtained from it serves as the standard for measuring model accuracy, and the closer the AUC is to 1, the better the model.
The KS value is the maximum difference between TPR and FPR and reflects the model's best discriminating power; the threshold at which it occurs is generally taken as the optimal threshold for separating good and bad users, and a KS above 0.2 generally indicates good prediction accuracy.
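With the definitions above, the indexes can be computed for a toy score vector using scikit-learn's roc_curve and roc_auc_score (the labels and scores here are invented for illustration, with fraud = 1 as the positive class):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])            # 1 = fraud, 0 = normal
score = np.array([0.9, 0.8, 0.7, 0.6, 0.3, 0.2, 0.1, 0.4])

fpr, tpr, _ = roc_curve(y_true, score)  # (FPR, TPR) pairs over all thresholds
auc = roc_auc_score(y_true, score)      # area under the ROC curve
ks = np.max(tpr - fpr)                  # KS = max(TPR - FPR)
print(round(auc, 3), round(ks, 3))      # → 0.938 0.75
```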
Preferably, in S6, the optimized AdaBoost fraud detection model is deployed to the application platform; each real-time applicant's data is acquired and imported into the prediction model as a sample to be tested, which outputs whether it is fraudulent, achieving real-time approval of applicants; performance data is periodically fed back into the model for training, updating the model online.
Compared with the prior art, the invention has the beneficial effects that:
1. In the invention, the AdaBoost algorithm combines several weak classifiers into a strong classifier; it is simple and effective, its sub-classifiers can be constructed by different methods, and it markedly improves the prediction performance of the learning algorithm.
2. In the invention, the particle swarm algorithm requires few parameters to tune, starts from random solutions and reaches the global optimum from local optima through iteration, and suits complex nonlinear, nondifferentiable and multimodal problems.
3. In the invention, the particle swarm algorithm is adopted to optimize the AdaBoost weak-classifier weights, so that accurate weak classifiers obtain larger weights while useless or redundant ones obtain smaller weights, further improving the accuracy and readability of AdaBoost.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
Referring to fig. 1, the present invention provides a technical solution:
an Internet financial client application fraud detection method based on AdaBoost comprises the following six steps:
S1, collecting data: selecting a certain proportion and quantity of normal and fraudulent applicants from an Internet financial platform as modeling samples, collecting the personal basic information submitted when each sample client registers and applies, obtaining operation-behavior buried-point data from monitoring software as credit data, and using each sample's normal or fraudulent performance as label data;
S2, preprocessing data: after missing-value imputation, outlier handling and normalization of the collected credit data, dividing it into a training set and a test set by K-fold cross-validation;
S3, constructing a plurality of weak classifiers on the training set and weighting them into an AdaBoost application fraud classifier;
S4, optimizing the AdaBoost weak-classifier weights with a particle swarm algorithm and retraining on the training set;
S5, inputting test-set samples to the trained AdaBoost for detection, comparing the output with the actual labels, and evaluating the model against logistic regression and a support vector machine using model precision evaluation indexes;
S6, deploying the optimized AdaBoost fraud detection model to the application platform, acquiring each real-time applicant's data, importing it into the prediction model as a sample to be tested, and outputting whether it is fraudulent so as to approve applicants in real time; performance data is periodically fed back into the model for training to update the model online.
In S1, normal-repayment and overdue clients in a certain proportion and quantity are selected from the back end of the Internet financial platform according to post-loan performance as modeling samples, the personal basic information submitted at account registration and application is collected, and operation-behavior buried-point data is obtained from monitoring software. The user's personal application information comprises: mobile phone number, education background, marital status, employer, address and contact information, together with the personal basic information, credit transaction information, public information and special-record data obtained from the credit report. The buried-point data comprises device behavior data and log data collected at the tracking points. The device behavior data includes: number of platform logins, number of clicks, click frequency, total and average input time, mobile phone number data, GPS position, MAC address, IP address data, application frequency per geographic location, application frequency per IP, device battery percentage and average gyroscope acceleration. The log data includes: logins within 7 days, time from first click to credit application, maximum number of sessions in one day, behavior statistics for the week before the credit application, and the like. In addition, subject to compliance requirements, the method is not limited to these sources and may draw on multi-dimensional big data including mobile Internet behavior data, in-app behavior data from the loan APP, credit history and operator data. This arrangement helps compile comprehensive user information for the subsequent prediction of the user's credit risk.
In S2, AdaBoost is sensitive to abnormal samples, which may receive high weights during iteration and degrade the prediction accuracy of the final strong learner. After removing outliers and reducing noise in the sample data collected in S1, the data is normalized so that all values fall in [0, 1], reducing the differences between features, with the normalization formula:

X_norm = (X − X_min) / (X_max − X_min)

wherein X_norm is the normalized data; X_min and X_max are respectively the minimum and maximum values in the data set; and X is the original data.
The normalized data set is divided into a training set and a test set by K-fold: the data set is first shuffled, then uniformly divided into K disjoint subsets, and the training and test sets are drawn at random for cross-validation, which simplifies the computation.
In S3, the AdaBoost (adaptive boosting) algorithm repeatedly searches the sample feature space and obtains sample weights, continuously adjusting the training-sample weights during iteration: samples with low prediction accuracy have their weights increased, and samples with high accuracy have them decreased. The weak predictors are then combined into a strong predictor by weighted majority voting, i.e. a weak predictor with a smaller prediction error rate receives a larger weight so that it plays a larger role in the vote, which markedly improves the prediction performance of the learning algorithm.
Select m training samples T = {(x_1, y_1), (x_2, y_2), …, (x_m, y_m)} from the sample space, where y_i ∈ {−1, +1} marks negative and positive samples: if x_i shows no sign of fraud, (x_i, y_i) = (x_i, −1), and otherwise (x_i, y_i) = (x_i, +1). Let f(x) denote the weak classifier algorithm, K the number of weak-classifier iterations, and H(x) the classifier output by training. The specific steps are as follows:
S31, initializing sample weights:
The weight distribution of the training samples is initialized as

D_1 = (w_11, w_12, …, w_1m); w_1i = 1/m, i = 1, 2, …, m, where m is the number of samples,

wherein D_1 is the weight of each sample in the first iteration, D_t is the weight distribution of the training data before the t-th iteration begins, and w_ti is the weight of the i-th sample in the t-th iteration.
S32, iterative training
For t = 1, 2, …, T, train iteratively on the data samples to obtain a weak classifier h_t(x).
The weak classifier h_t(x) is trained with the sample distribution of the t-th round; under the current distribution D_t, the weak classifier is called to obtain the classification rule of the t-th round:

h_t(x): X → {−1, +1}.
S33, weight normalization
The weight w_ti of the i-th sample in the t-th iteration is normalized:

w_ti ← w_ti / Σ_{j=1}^{m} w_tj

wherein w_ti is the weight of sample (x_i, y_i) in the t-th iteration.
S34, calculating the error rate of each weak classifier
Using the sample distribution of the t-th round, train the weak classifier h_t(x) and compute its classification error rate e_t, the sum of the weights of all misclassified samples, which represents the training error (misjudgment) rate of the t-th weak classifier:

e_t = Σ_{i=1}^{m} w_ti · I(h_t(x_i) ≠ y_i)

wherein w_ti is the weight of sample (x_i, y_i) in the t-th iteration; I(h_t(x_i) ≠ y_i) indicates the misclassified samples counted in e_t; and y_i is the true label of the i-th sample.
S35, calculating the coefficient a_t of h_t(x)
The weight a_t of each weak classifier measures its importance and is computed as:

a_t = (1/2) · ln((1 − e_t) / e_t)

wherein e_t is the training error rate of the t-th weak classifier.
S36, updating the sample weights, increasing the weights of misclassified samples
The weight distribution D_{t+1} of the training data set is updated from the previous weights:

D_{t+1} = (w_{t+1,1}, w_{t+1,2}, …, w_{t+1,m})

w_{t+1,i} = (w_ti / Z_t) · exp(−a_t · y_i · h_t(x_i))

wherein Z_t is a normalization factor that makes the weight distribution D_{t+1} a probability distribution:

Z_t = Σ_{i=1}^{m} w_ti · exp(−a_t · y_i · h_t(x_i))

wherein w_ti is the weight of the i-th sample in the t-th iteration, a_t is the weight of each weak classifier, and h_t(x_i) is the weak classifier's prediction.
S37, constructing a linear combination of the basic classifiers
After the iterations are complete, the weak classifiers are combined into f(x):

f(x) = Σ_{t=1}^{T} a_t · h_t(x)

then the mathematical sign function is applied to obtain the final strong classifier H(x):

H(x) = sign(f(x))

wherein the sign function is: sign(z) = +1 if z ≥ 0, and −1 otherwise.
experimental results show that AdaBoost has higher detection rate, low generalization error rate, no need of parameter adjustment and difficult occurrence of overfitting phenomenon.
In S4, to address the problem that once a weak classifier's coefficient is fixed during an iteration it cannot be changed later, so that redundant or useless weak classifiers may keep large weights, the particle swarm algorithm is used to optimize the AdaBoost weak-classifier weights: accurate weak classifiers obtain larger weights while useless or redundant ones obtain smaller weights, further improving the accuracy and readability of AdaBoost.
S41, initializing particle swarm parameters and encoding AdaBoost
The weak-classifier weights of AdaBoost that need optimization are encoded as the position vector of each particle; the initial position parameters of each particle are random numbers in [0, 1]. The velocity matrix and position matrix of the particles are then generated. The number of particles is determined by the specific data scale and the training-set data scale, and is generally between 20 and 40;
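A minimal initialization sketch under the assumptions above (positions in [0, 1], a swarm of 20-40 particles); the velocity range and all names are illustrative:

```python
import random

def init_swarm(n_particles, dim, seed=0):
    """Particle positions in [0, 1] encode candidate weak-classifier weights;
    velocities start small so early steps stay inside the search range."""
    rng = random.Random(seed)
    positions  = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    velocities = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_particles)]
    return positions, velocities

# 30 particles (within the 20-40 range mentioned above) for 5 weak classifiers.
positions, velocities = init_swarm(n_particles=30, dim=5)
```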
S42, setting the fitness function
The error rate e_t of AdaBoost is taken as the fitness function of each particle in the swarm, with the expression:

fit(i) = e_t = Σ_{i=1}^{m} w_{t,i} · I(h_t(x_i) ≠ y_i)

wherein m represents the number of samples; e_t denotes the error rate, i.e. the fitness value, of the particle; w_{t,i} represents the weight of the i-th sample at the t-th iteration; and y_i represents the true class of the i-th sample.
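A sketch of such a fitness evaluation, assuming the particle's position vector holds the candidate weak-classifier weights and the weak classifiers' predictions are precomputed (all names and the tiny data set are illustrative):

```python
def ensemble_error(position, stump_preds, labels, sample_weights):
    """Weighted error rate of the ensemble whose weak-classifier weights
    are given by the particle's position vector (lower fitness is better)."""
    error = 0.0
    for i, y in enumerate(labels):
        score = sum(a * preds[i] for a, preds in zip(position, stump_preds))
        predicted = 1 if score >= 0 else -1
        if predicted != y:
            error += sample_weights[i]
    return error

# Two weak classifiers' predictions on three samples.
stump_preds = [[1, -1, -1], [1, 1, -1]]
labels = [1, 1, -1]
sample_weights = [1/3, 1/3, 1/3]
fit_good = ensemble_error([0.2, 0.8], stump_preds, labels, sample_weights)
fit_bad  = ensemble_error([0.8, 0.2], stump_preds, labels, sample_weights)
```

Here the particle [0.2, 0.8] classifies every sample correctly, while [0.8, 0.2] misclassifies one, so the first has the better (smaller) fitness.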
S43, position and speed update
Calculating the particle fitness value in each iteration according to the fitness function fit, comparing the particle fitness values, and determining the individual extreme value and the global optimal extreme value of each particle to update the optimal position and speed of each particle;
assuming that M particles form a particle swarm in a D-dimensional search space, in each iteration process, the particles update the speed V of the particles through an individual extremum and a global extremumidAnd position XidThe update formula is as follows:
Figure BDA0002853286610000132
Figure BDA0002853286610000133
wherein w is an inertial weight (w balances the local search capability and the global search capability of the PSO, generally 0.5); c. C1、c2An acceleration factor (usually taken to be 2, typically between 0 and 4); r is1、r2Is distributed in [0,1]]D is 1,2, …, D is the data dimension; i is 1,2, …, M is the number of particles; k is the number of iterations;
Figure BDA0002853286610000134
representing individual extrema of the current particle;
Figure BDA0002853286610000135
representing the global extremum of the current particle.
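The velocity and position formulas can be sketched for a single particle as follows (function name and the w, c_1, c_2 defaults follow the values suggested above; the random generator is an assumption):

```python
import random

def pso_step(x, v, pbest, gbest, w=0.5, c1=2.0, c2=2.0, rng=None):
    """One per-dimension update of a single particle:
    v = w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x);  x = x + v."""
    rng = rng or random.Random(0)
    new_v = [w * vi + c1 * rng.random() * (pi - xi) + c2 * rng.random() * (gi - xi)
             for xi, vi, pi, gi in zip(x, v, pbest, gbest)]
    new_x = [xi + nvi for xi, nvi in zip(x, new_v)]
    return new_x, new_v

# A particle already sitting at both extrema, with zero velocity, does not move.
x1, v1 = pso_step([0.5, 0.5], [0.0, 0.0], [0.5, 0.5], [0.5, 0.5])
# A particle below both extrema is pulled toward them.
x2, v2 = pso_step([0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0])
```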
S44, iterative optimization
Analyzing the result after each iteration, and using the current fitness value fit (i) obtained by the iteration and the individual extreme value of the individual current particle
Figure BDA0002853286610000141
Make a comparison if
Figure BDA0002853286610000142
Update the individual extremum of the current particle with fit (i)
Figure BDA0002853286610000143
If it is not
Figure BDA0002853286610000144
Update the global extremum of the current particle with fit (i)
Figure BDA0002853286610000145
Simultaneous velocity VidAnd position XidUpdates the velocity and position of the current particle.
S45, finding the global optimal solution
All individuals in each generation of the particle swarm are evaluated, and iteration continues until the fitness value is smaller than the set value or the maximum iteration number k_max is reached. The minimum individual fitness fitness_min is then found, yielding the corresponding individual optimal solution g_best and the global optimal solution p_best of the particle swarm.
S46, decoding the global optimal solution
The global optimal solution p_best of the particle swarm, i.e. the optimal individual obtained, is decoded and assigned to the weak-classifier weights of AdaBoost to obtain the AdaBoost detection model, and the training set is input for learning and training. This arrangement is favorable for enhancing prediction precision.
In S5, the actual and predicted results on the test samples are compared to obtain a confusion matrix, from which the following indexes can be calculated: true positive rate TPR (True Positive Rate), false positive rate FPR (False Positive Rate), AUC (Area Under Curve) and KS (Kolmogorov-Smirnov):

TPR = TP / (TP + FN)

FPR = FP / (FP + TN)

KS = max(TPR - FPR)

wherein a true positive (TP) means the model correctly predicts a positive-class sample as the positive class; a true negative (TN) means the model correctly predicts a negative-class sample as the negative class; a false positive (FP) means the model incorrectly predicts a negative-class sample as the positive class; and a false negative (FN) means the model incorrectly predicts a positive-class sample as the negative class. In this application, fraud samples are taken as the positive class and normal samples as the negative class.
Plotting with TPR as the vertical axis and FPR as the horizontal axis yields the ROC curve (Receiver Operating Characteristic curve). The AUC value (Area Under the ROC Curve) obtained from the ROC curve is taken as the evaluation standard for measuring model accuracy: the closer the AUC value is to 1, the better the model.
The KS value is the maximum value of the difference between TPR and FPR and reflects the optimal distinguishing power of the model; the threshold at this point is generally taken as the optimal threshold for separating good and bad users. Generally, a KS greater than 0.2 indicates that the model has good prediction accuracy.
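The confusion-matrix indexes above can be sketched as follows, with fraud (+1) as the positive class; the tiny label arrays are illustrative, not data from the patent:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, FP, TN with the given positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fn, fp, tn

y_true = [1, 1, 1, -1, -1, -1, -1, -1]
y_pred = [1, 1, -1, -1, -1, -1, 1, -1]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)
tpr = tp / (tp + fn)          # TPR = TP / (TP + FN)
fpr = fp / (fp + tn)          # FPR = FP / (FP + TN)
ks_here = tpr - fpr           # KS is the maximum of TPR - FPR over thresholds
```

With scored (rather than hard-labeled) predictions, KS and AUC would be computed by sweeping the decision threshold and tracking TPR - FPR and the area under the ROC curve.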
This arrangement is favorable for intuitively judging model accuracy.
In S6, the optimized AdaBoost fraud detection model is deployed to the application platform. Data of clients applying in real time are acquired and imported into the prediction model as samples to be detected, and the model outputs whether the application is fraudulent, realizing real-time approval of applying clients. Performance data are periodically input into the model for training, realizing online updating of the model.
The Internet financial client application fraud detection system based on AdaBoost comprises the following modules:
the data acquisition module is used for acquiring a modeling sample which comprises personal application information, operation behavior buried point data and fraud performance as evaluation results;
the data processing module is used for performing missing-value completion, abnormal-value processing and normalization processing on the collected credit data;
the model building module is used for training and building a plurality of weak classifiers and weighting and synthesizing an application fraud classifier of Adaboost;
the parameter optimization module is used for optimizing the Adaboost weak classifier weight by adopting a particle swarm algorithm, and training and optimizing again;
the fraud detection module is used for detecting fraud of applying clients in real time with the AdaBoost fraud detection model.
The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the method and its core concepts. The foregoing is only a preferred embodiment of the present invention; since the limited text cannot enumerate the objectively unlimited specific structures, it will be apparent to those skilled in the art that a number of modifications, refinements or changes may be made, and the technical features described above may be combined in a suitable manner, without departing from the principle of the present invention. Such modifications, variations, combinations or adaptations using the spirit and scope of the invention, as defined by the claims, may be directed to other uses and embodiments.

Claims (7)

1. An Internet financial client application fraud detection method based on AdaBoost is characterized by comprising the following six steps:
s1, collecting data, selecting a certain proportion and quantity of normal applications and fraudulent customers as modeling samples from an internet financial platform, collecting personal basic information when a customer account of a sample is registered and applied, obtaining operation behavior buried point data from monitoring software as credit data, and using normal applications or fraudulent manifestations corresponding to the sample as label data;
s2, preprocessing the data, namely after performing missing-value completion, abnormal-value processing and normalization processing on the collected credit data, dividing the data into a training set and a test set by K-fold cross validation;
s3, weighting and synthesizing an application fraud classifier of Adaboost in a training set by constructing a plurality of weak classifiers;
s4, optimizing the Adaboost weak classifier weight by adopting a particle swarm algorithm, and training again on the training set;
s5, inputting test-set samples into the trained AdaBoost for detection, comparing the output with the actual labels, and comparing against logistic regression and support vector machine models according to model precision evaluation indexes;
s6, deploying the fraud detection model of the optimized AdaBoost to an application platform, acquiring data of a real-time application client, importing the data serving as a sample to be detected into a prediction model, outputting whether fraud is caused or not, realizing real-time examination and approval of the application client, inputting the performance data into the model for training regularly, and realizing online updating of the model.
2. The method for detecting the application fraud of the internet financial client based on AdaBoost of claim 1, characterized in that in S1, a certain proportion and quantity of normal-repayment and overdue clients are selected as modeling samples according to the post-loan performance from the back end of the internet financial platform, personal basic information of the sample client accounts at registration and application is collected, and operation behavior buried-point data are obtained from monitoring software. The personal application information of the user comprises: the mobile phone number, education background, marital status, working unit, address and contact information, as well as the personal basic information, credit transaction information, public information and special-record data acquired from the credit investigation report. The buried-point data comprise equipment behavior data and log data collected at the buried points, wherein the equipment behavior data comprise: the number of platform logins, the number of clicks, the click frequency, the total and average input time, mobile phone number data, GPS position, MAC address, IP address data, geographic-information application frequency, IP application frequency, equipment battery-level ratio and average gyroscope acceleration; and the log data comprise: the number of logins within 7 days, the time from the first click to the credit application, the maximum number of sessions within one day, behavior statistics for the week before the credit application, and the like. In addition, under compliance requirements, the method is not limited to obtaining full-domain multi-dimensional big data, including mobile internet behavior data, behavior data in the loan APP, credit history and operator data.
3. The method for detecting Internet financial client application fraud based on AdaBoost of claim 1, wherein in S2, AdaBoost is sensitive to abnormal samples, and abnormal samples may obtain high weights during iteration, affecting the prediction accuracy of the final strong learner. After abnormal points are eliminated and noise is reduced, the sample data collected in step S1 are normalized with the following formula, converting all data into [0, 1], reducing the differences between the data and making the data distribution smoother:

X_norm = (X - X_min) / (X_max - X_min)

wherein X_norm is the normalized data; X_min and X_max respectively represent the minimum and maximum values in the data set; and X is the original data.
The normalized data set is divided into a training set and a test set by K-fold splitting: the data set is first shuffled, then uniformly divided into K disjoint subsets, and the training and test sets are randomly divided for cross validation.
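A minimal sketch of this normalization and K-fold split (function names and sample values are illustrative, not from the patent):

```python
import random

def min_max_normalize(values):
    """X_norm = (X - X_min) / (X_max - X_min), mapping the data into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices, then split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

normalized = min_max_normalize([10.0, 20.0, 30.0, 40.0])
folds = k_fold_indices(n=10, k=5)
```

Each fold can then serve once as the test set while the remaining folds form the training set.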
4. The method for detecting internet financial client application fraud based on AdaBoost of claim 1, wherein in S3, the adaptive boosting algorithm (AdaBoost) obtains sample weights by repeatedly searching the sample feature space and continuously adjusts the weights of the training samples during iteration, increasing (decreasing) the weights of samples with low (high) prediction accuracy, and forms a strong predictor by a weighted majority-voting method, i.e. increasing (decreasing) the weights of weak predictors with smaller (larger) prediction error rates so that they play a larger (smaller) role in the vote, thereby significantly improving the prediction performance of the learning algorithm.
Select m training samples T = {(x_1, y_1), (x_2, y_2), …, (x_m, y_m)}, wherein y_i ∈ {-1, +1} denotes negative and positive samples: if x_i shows no manifestation of fraud, (x_i, y_i) = (x_i, -1), and otherwise (x_i, y_i) = (x_i, +1). Given a weak-classifier algorithm f(x) and the number T of weak-classifier iterations, the training outputs the classifier H(x). The specific steps are as follows:
s31, initializing the sample weights:
The weight distribution of the training samples is initialized as

D_1 = (w_{1,1}, w_{1,2}, …, w_{1,m}), w_{1,i} = 1/m, i = 1, 2, …, m,

wherein m represents the number of samples, D_1 represents the weight of each sample in the first iteration, D_t represents the weight distribution of the training data before the start of the t-th iteration, and w_{t,i} represents the weight of the i-th sample at the t-th iteration.
S32, iterative training
For t = 1, 2, …, T, iterative training is performed on the data samples to obtain a weak classifier h_t(x). Under the current distribution D_t, the weak classifier is called to obtain the classification rule of the t-th round:

h_t(x): X → {-1, +1}.
s33, weight normalization
The weight w_{t,i} of the i-th sample at the t-th iteration is normalized:

w_{t,i} ← w_{t,i} / Σ_{j=1}^{m} w_{t,j}

wherein w_{t,i} is the weight of sample (x_i, y_i) in the t-th iteration.
S34, calculating the classification error rate of each weak classifier
The weak classifier h_t(x) is trained with the sample distribution of the t-th round, and its classification error rate e_t, the sum of the weights of all misclassified samples, represents the training error rate of the t-th weak classifier:

e_t = Σ_{i=1}^{m} w_{t,i} · I(h_t(x_i) ≠ y_i)

wherein w_{t,i} is the weight of sample (x_i, y_i) in the t-th iteration; I(h_t(x_i) ≠ y_i) is the indicator of a misclassified sample used in calculating the error rate e_t; and y_i is the true label of the i-th sample;
s35, calculating the coefficient a_t of h_t(x)
The weight a_t of each weak classifier measures the importance of that classifier, and its calculation expression is as follows:

a_t = (1/2) · ln((1 - e_t) / e_t)

wherein e_t represents the training error rate of the t-th weak classifier.
S36, updating the sample weights and increasing the weights of misclassified samples
The weight distribution D_{t+1} of the training data set is updated from the previous round's weights, with the updating expressions as follows:

D_{t+1} = (w_{t+1,1}, w_{t+1,2}, …, w_{t+1,m})

w_{t+1,i} = (w_{t,i} / Z_t) · exp(-a_t · y_i · h_t(x_i))

wherein Z_t is a normalization factor that makes the weight distribution D_{t+1} a probability distribution:

Z_t = Σ_{i=1}^{m} w_{t,i} · exp(-a_t · y_i · h_t(x_i))

wherein w_{t,i} is the weight of the i-th sample in the t-th iteration; a_t is the weight of the t-th weak classifier; and h_t(x_i) is the prediction of the weak classifier for sample x_i.
S37, constructing a linear combination of the basic classifiers
After the iterations are completed, the weak classifiers are combined into f(x):

f(x) = Σ_{t=1}^{T} a_t · h_t(x)

A mathematical sign function is then applied to obtain the final strong classifier H(x):

H(x) = sign(f(x)) = sign(Σ_{t=1}^{T} a_t · h_t(x))

wherein the sign function is defined as:

sign(x) = +1 if x > 0, and -1 if x < 0.
Experimental results show that AdaBoost achieves a high detection rate and a low generalization error rate, requires little parameter adjustment, and is not prone to overfitting.
5. The method for detecting the fraud application of the Internet financial client based on AdaBoost according to claim 1, wherein in S4, because each weak-classifier coefficient, once determined in an iteration, cannot be changed later, redundant or useless weak classifiers may retain large weights; the weights of the AdaBoost weak classifiers are therefore optimized by the particle swarm algorithm, so that weak classifiers with high accuracy obtain large weights and useless or redundant weak classifiers obtain small weights, thereby further improving the accuracy and interpretability of AdaBoost.
S41, initializing particle swarm parameters and encoding AdaBoost
The weak-classifier weights of AdaBoost that need optimization are encoded as the position vector of each particle; the initial position parameters of each particle are random numbers in [0, 1]. The velocity matrix and position matrix of the particles are then generated. The number of particles is determined by the specific data scale and the training-set data scale, and is generally between 20 and 40;
s42, setting the fitness function
The error rate e_t of AdaBoost is taken as the fitness function of each particle in the swarm, with the expression:

fit(i) = e_t = Σ_{i=1}^{m} w_{t,i} · I(h_t(x_i) ≠ y_i)

wherein m represents the number of samples; e_t denotes the error rate, i.e. the fitness value, of the particle; w_{t,i} represents the weight of the i-th sample at the t-th iteration; and y_i represents the true class of the i-th sample.
S43, position and velocity update
The particle fitness value is calculated in each iteration according to the fitness function fit; the fitness values are compared to determine the individual extremum of each particle and the global optimal extremum, which are used to update each particle's best position and velocity.
Assuming that M particles constitute a particle swarm in a D-dimensional search space, during each iteration a particle updates its own velocity V_id and position X_id through the individual extremum and the global extremum, with the update formulas:

V_id^{k+1} = w · V_id^k + c_1 · r_1 · (p_id^k - X_id^k) + c_2 · r_2 · (p_gd^k - X_id^k)

X_id^{k+1} = X_id^k + V_id^{k+1}

wherein w is the inertia weight (w balances the local and global search capability of the PSO; generally 0.5); c_1, c_2 are acceleration factors (usually taken as 2, typically between 0 and 4); r_1, r_2 are random numbers uniformly distributed in [0, 1]; d = 1, 2, …, D is the data dimension; i = 1, 2, …, M is the particle index; k is the iteration number; p_id^k represents the individual extremum of the current particle, and p_gd^k represents the global extremum of the swarm.
S44, iterative optimization
The result is analyzed after each iteration: the current fitness value fit(i) obtained in the iteration is compared with the individual extremum of the current particle. If fit(i) is smaller than the individual extremum, the individual extremum of the current particle is updated with fit(i); if fit(i) is also smaller than the global extremum, the global extremum of the swarm is updated with fit(i). At the same time, the update formulas for velocity V_id and position X_id update the velocity and position of the current particle.
S45, finding the global optimal solution
All individuals in each generation of the particle swarm are evaluated, and iteration continues until the fitness value is smaller than the set value or the maximum iteration number k_max is reached. The minimum individual fitness fitness_min is then found, yielding the corresponding individual optimal solution g_best and the global optimal solution p_best of the particle swarm.
S46, decoding the global optimal solution
The global optimal solution p_best of the particle swarm, i.e. the optimal individual obtained, is decoded and assigned to the weak-classifier weights of AdaBoost to obtain the AdaBoost detection model, and the training set is input for learning and training.
6. The method for detecting internet financial client application fraud based on AdaBoost of claim 1, wherein in S5, the actual and predicted results on the test samples are compared to obtain a confusion matrix, from which the following indexes can be calculated: true positive rate TPR (True Positive Rate), false positive rate FPR (False Positive Rate), AUC (Area Under Curve) and KS (Kolmogorov-Smirnov):

TPR = TP / (TP + FN)

FPR = FP / (FP + TN)

KS = max(TPR - FPR)

wherein a true positive (TP) means the model correctly predicts a positive-class sample as the positive class; a true negative (TN) means the model correctly predicts a negative-class sample as the negative class; a false positive (FP) means the model incorrectly predicts a negative-class sample as the positive class; and a false negative (FN) means the model incorrectly predicts a positive-class sample as the negative class. In this application, fraud samples are taken as the positive class and normal samples as the negative class.
Plotting with TPR as the vertical axis and FPR as the horizontal axis yields the ROC curve (Receiver Operating Characteristic curve). The AUC value (Area Under the ROC Curve) obtained from the ROC curve is taken as the evaluation standard for measuring model accuracy: the closer the AUC value is to 1, the better the model.
The KS value is the maximum value of the difference between TPR and FPR and reflects the optimal distinguishing power of the model; the threshold at this point is generally taken as the optimal threshold for separating good and bad users. Generally, a KS greater than 0.2 indicates that the model has good prediction accuracy.
7. The method for detecting the application fraud of the Internet financial client based on AdaBoost according to claim 1, wherein in S6, the optimized AdaBoost fraud detection model is deployed to the application platform; data of clients applying in real time are acquired and imported into the prediction model as samples to be detected to output whether the application is fraudulent, realizing real-time approval of applying clients; and performance data are periodically input into model training to realize online updating of the model.
CN202011536761.0A 2020-12-23 2020-12-23 Internet financial client application fraud detection method based on AdaBoost Pending CN112581265A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011536761.0A CN112581265A (en) 2020-12-23 2020-12-23 Internet financial client application fraud detection method based on AdaBoost

Publications (1)

Publication Number Publication Date
CN112581265A true CN112581265A (en) 2021-03-30

Family

ID=75139392

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011536761.0A Pending CN112581265A (en) 2020-12-23 2020-12-23 Internet financial client application fraud detection method based on AdaBoost

Country Status (1)

Country Link
CN (1) CN112581265A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113436010A (en) * 2021-08-10 2021-09-24 四川新网银行股份有限公司 System and method for identifying public deposit payment loan fraud in real time
CN115577287A (en) * 2022-09-30 2023-01-06 湖南工程学院 Data processing method, apparatus and computer-readable storage medium
CN115618238A (en) * 2022-12-14 2023-01-17 湖南工商大学 Credit card fraud detection method based on parameter offset correction integrated learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103309953A (en) * 2013-05-24 2013-09-18 合肥工业大学 Method for labeling and searching for diversified pictures based on integration of multiple RBFNN classifiers
CN108229581A (en) * 2018-01-31 2018-06-29 西安工程大学 Based on the Diagnosis Method of Transformer Faults for improving more classification AdaBoost
CN109447158A (en) * 2018-10-31 2019-03-08 中国石油大学(华东) A kind of Adaboost Favorable Reservoir development area prediction technique based on unbalanced data
CN112053223A (en) * 2020-08-14 2020-12-08 百维金科(上海)信息科技有限公司 Internet financial fraud behavior detection method based on GA-SVM algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KEWEN LI et al.: "Improved PSO_AdaBoost Ensemble Algorithm for Imbalanced Data", Sensors *


Similar Documents

Publication Publication Date Title
US11257041B2 (en) Detecting disability and ensuring fairness in automated scoring of video interviews
CN112581265A (en) Internet financial client application fraud detection method based on AdaBoost
CN112581263A (en) Credit evaluation method for optimizing generalized regression neural network based on wolf algorithm
CN112037012A (en) Internet financial credit evaluation method based on PSO-BP neural network
CN111932269B (en) Equipment information processing method and device
CN109389494B (en) Loan fraud detection model training method, loan fraud detection method and device
CN111444951B (en) Sample recognition model generation method, device, computer equipment and storage medium
CN112215702A (en) Credit risk assessment method, mobile terminal and computer storage medium
CN113240155A (en) Method and device for predicting carbon emission and terminal
CN109840413B (en) Phishing website detection method and device
US20220188644A1 (en) Latent-space misalignment measure of responsible ai for machine learning models
CN112700324A (en) User loan default prediction method based on combination of Catboost and restricted Boltzmann machine
CN111833175A (en) Internet financial platform application fraud behavior detection method based on KNN algorithm
CN115376518B (en) Voiceprint recognition method, system, equipment and medium for real-time noise big data
CN115412301A (en) Network security prediction analysis method and system
CN113239638A (en) Overdue risk prediction method for optimizing multi-core support vector machine based on dragonfly algorithm
CN112581264A (en) Grasshopper algorithm-based credit risk prediction method for optimizing MLP neural network
CN112733995A (en) Method for training neural network, behavior detection method and behavior detection device
CN112464281A (en) Network information analysis method based on privacy grouping and emotion recognition
Smith et al. Making generative classifiers robust to selection bias
CN114912027A (en) Learning scheme recommendation method and system based on learning outcome prediction
CN114049205A (en) Abnormal transaction identification method and device, computer equipment and storage medium
CN111833173A (en) LSTM-based third-party platform payment fraud online detection method
IMBALANCE Ensemble Adaboost In Classification And Regression Trees To Overcome Class Imbalance In Credit Status Of Bank Customers
CN112348656A (en) BA-WNN-based personal loan credit scoring method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210330