CN112364901A - LGB algorithm-based fraud call identification method - Google Patents

LGB algorithm-based fraud call identification method

Info

Publication number
CN112364901A
CN112364901A
Authority
CN
China
Prior art keywords
model
samples
data
lgb
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011185958.4A
Other languages
Chinese (zh)
Inventor
张飞
周红敏
周荣
程钢
卜小冲
肖书华
董伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xinfang Software Co ltd
Shanghai Cintel Intelligent System Co ltd
Original Assignee
Shanghai Xinfang Software Co ltd
Shanghai Cintel Intelligent System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Xinfang Software Co ltd, Shanghai Cintel Intelligent System Co ltd filed Critical Shanghai Xinfang Software Co ltd
Priority to CN202011185958.4A priority Critical patent/CN112364901A/en
Publication of CN112364901A publication Critical patent/CN112364901A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M3/00Automatic or semi-automatic exchanges
    • H04M3/22Arrangements for supervision, monitoring or testing
    • H04M3/2218Call detail recording
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M3/00Automatic or semi-automatic exchanges
    • H04M3/22Arrangements for supervision, monitoring or testing
    • H04M3/2281Call monitoring, e.g. for law enforcement purposes; Call tracing; Detection or prevention of malicious calls

Abstract

The invention discloses an LGB algorithm-based fraud call identification method, which comprises the following steps: acquiring a data set of original calls; sampling the data set with the SMOTE algorithm and dividing it into a training set and a test set; extracting the characteristic behaviors of the call tickets and initializing the model parameters; training the model with ten-fold cross-validation, validating it on the test set, and calculating the precision rate, recall rate and F1 score of the model; obtaining an optimal LGB model through grid search and serializing the model with the pickle module; deserializing the model with the pickle module and constructing an API with the Flask framework; and, when a call record to be detected arrives, calling the API interface, inputting the data into the LGB prediction model, and returning the result after the model makes its prediction. The LGB algorithm-based fraud call identification method provided by the invention realizes automatic classification and prediction of fraud calls, can obviously improve the case-solving efficiency of public security personnel, and reduces the economic losses of enterprises and individuals.

Description

LGB algorithm-based fraud call identification method
Technical Field
The invention relates to the technical field of network communication, in particular to a method for identifying fraudulent calls based on an LGB algorithm.
Background
Harassing calls come in all varieties, and fraud schemes such as impersonating public security, procuratorate and court authorities, financing scams, shopping services and flight rebooking are renewed year by year, showing an evolution from wide, indiscriminate targeting toward precision targeting. Telephone fraud causes huge economic losses, disturbs people's normal work and life, severely harms social integrity, and has become a public hazard that seriously infringes on people's vital interests.
Disclosure of Invention
Aiming at the problems in the related art, the invention provides an LGB algorithm-based fraud call identification method, which addresses the problems that existing telephone fraud causes huge economic losses, disturbs people's normal work and life, and greatly harms social integrity.
In order to achieve the technical purpose, the technical scheme of the invention is as follows:
a method for identifying fraudulent calls based on LGB algorithm is designed, and comprises the following steps:
step S1, acquiring a data set of an original call, and manually studying and judging to determine the distribution proportion of positive and negative data samples;
step S2, sampling the data set of the original call by adopting an SMOTE algorithm to form a final data set, and classifying the data set into a training set and a testing set;
step S3, extracting the characteristic behavior of the call ticket, and initializing the model parameter;
step S4, training the model by adopting a ten-fold cross-validation method, using the test set for validation, and calculating the precision rate, the recall rate and the F1 score of the model;
step S5, obtaining an optimal LGB model by grid search, serializing the model by a pickle module, and storing the model to a server;
step S6, deserializing the model by adopting the pickle module, constructing an API by using the Flask framework, and deploying the model online in an interface mode;
and step S7, when a call record to be detected arrives, calling the API interface, inputting the data into the LGB prediction model, and returning the result after the model makes its prediction.
Further, in step S1, the data set is a two-month call record, the data dimension is 43 dimensions, and the input feature value of the LGB is obtained through data cleaning, variable derivation, and feature screening.
Further, in step S1, the data in the original call is encrypted.
Further, in the step S2, the SMOTE sampling is to analyze the minority-class crank call samples, artificially synthesize new samples from them, and add the new samples to the data set; for each minority-class sample a, a sample b is randomly selected from its nearest neighbors, and a point on the line segment connecting sample a and sample b is then randomly selected as the newly synthesized minority-class sample c, wherein the specific algorithm steps comprise:
step S21, for each sample x in the minority class, calculating the distance from sample x to all the other samples in the minority class, using the Euclidean distance d as the metric, to obtain its k nearest neighbors, wherein the calculation formula of the Euclidean distance d is as follows:
d(x, y) = √( Σ_{i=1…n} (x_i − y_i)² ), where y denotes another minority-class sample and n is the number of features;
step S22, for each sample x of the minority class, randomly selecting a number of samples from its k nearest neighbors, the selected neighbors being denoted xn;
step S23, for each randomly selected neighbor xn, carrying out random linear interpolation between xn and the original sample to construct a new sample;
step S24, putting the new samples into the original data to generate a new training set;
after SMOTE sampling is finished, the final new sample set is formed and is divided into training samples and test samples.
Further, in step S3, the index used for extracting the characteristic behaviors of the call ticket is the information gain; the larger the information gain of a feature, the better the selectivity of that feature, and the calculation formula is as follows:
g(D,A)=H(D)-H(D|A);
where H(D) is the empirical entropy and H(D|A) is the empirical conditional entropy of D given the selected feature A; the calculation formulas are respectively as follows:
H(D) = −Σ_{k=1…K} (|Ck|/|D|)·log₂(|Ck|/|D|);
H(D|A) = Σ_{i=1…n} (|Di|/|D|)·H(Di) = −Σ_{i=1…n} (|Di|/|D|)·Σ_{k=1…K} (|Dik|/|Di|)·log₂(|Dik|/|Di|);
wherein D is the training data set and |D| is the sample capacity, namely the number of samples (the number of elements in D); there are K classes denoted Ck, k = 1, 2, …, K, |Ck| is the number of samples belonging to class Ck, and the sum of all |Ck| equals |D|; according to the values of feature A, D is partitioned into n subsets D1, D2, …, Dn, |Di| is the number of samples in Di, the sum of all |Di| equals |D|, i = 1, 2, …, n; the set of samples in Di that belong to class Ck is denoted Dik (i.e., the intersection of Di and Ck), and |Dik| is the number of samples in Dik.
Further, in step S4, the model parameters include configuration file parameters and core algorithm operating parameters.
Further, in the step S4, the calculation formulas of the precision rate, the recall rate and the F1 score are as follows:
precision rate = TP / (TP + FP);
recall rate = TP / (TP + FN);
F1 score = 2 × precision × recall / (precision + recall);
wherein TP is the number of samples that are actually positive and predicted positive, FP is the number of samples that are actually negative but predicted positive, TN is the number of samples that are actually negative and predicted negative, and FN is the number of samples that are actually positive but predicted negative.
Further, in step S5, the LGB model is constructed by using two parameters, which are the number t of leaf nodes of the decision tree and the number m of input features to be considered when each node of the decision tree is split, and the specific construction steps include:
step S51, pre-sorting the data: first, all the features are pre-sorted according to their values;
step S52, sampling: let N be the number of training samples; tree building then starts, the number of input samples is N, and the N training samples are randomly drawn from the training set;
step S53, discretizing the data: continuous data are discretized, and the buckets required for each feature are determined;
step S54, selecting features and split points: the number of input features of a training sample is M (here M = 23), and m is far smaller than M; the split nodes of the data are calculated according to the information gain, and when splitting at the optimal split point, m input features are selected from the M input features and the one with the largest information gain among them is chosen for splitting, wherein m does not change during the construction of a decision tree;
step S55, splitting the data in this way: each time, the leaf with the maximum splitting gain is found among all the current leaves and is split, and this is repeated in a loop; no pruning is needed;
wherein, an optimal decision tree is generated in the process of model training, and fraud categories are output for newly input test samples through the optimal decision tree.
The invention has the beneficial effects that: the LGB algorithm-based fraud call identification method combines the existing fraud call tickets with a manual label library to build an algorithm model from the call ticket data, realizing automatic classification and prediction; the method can obviously improve the case-solving efficiency of public security personnel and reduce the economic losses of enterprises and individuals.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
Fig. 1 is a flowchart of a method for identifying fraudulent calls based on LGB algorithm according to an embodiment of the present invention;
FIG. 2 is a first optimized tree structure diagram of a LGB algorithm-based fraud phone recognition method in a specific application according to an embodiment of the present invention;
fig. 3 is a second optimized tree structure diagram of a LGB algorithm-based fraud phone recognition method in a specific application according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.
As shown in fig. 1, a method for identifying fraudulent calls based on LGB algorithm according to an embodiment of the present invention includes the following steps:
step S1, acquiring a data set of an original call, and manually studying and judging to determine the distribution proportion of positive and negative data samples;
step S2, sampling the data set of the original call by adopting an SMOTE algorithm to form a final data set, and classifying the data set into a training set and a testing set;
step S3, extracting the characteristic behavior of the call ticket, and initializing the model parameter;
step S4, training the model by adopting a ten-fold cross-validation method, using the test set for validation, and calculating the precision rate, the recall rate and the F1 score of the model;
step S5, obtaining an optimal LGB model by grid search, serializing the model by a pickle module, and storing the model to a server;
step S6, deserializing the model by adopting the pickle module, constructing an API by using the Flask framework, and deploying the model online in an interface mode;
and step S7, when a call record to be detected arrives, calling the API interface, inputting the data into the LGB prediction model, and returning the result after the model makes its prediction.
In this embodiment, in step S1, the data set is a two-month call record, the data dimension is 43 dimensions, and the input feature value of the LGB is obtained through data cleaning, variable derivation, and feature screening.
In this embodiment, in step S1, the data in the original call is encrypted.
In this embodiment, in step S2, the SMOTE sampling is to analyze the minority-class crank call samples, artificially synthesize new samples from them, and add the new samples to the data set; for each minority-class sample a, a sample b is randomly selected from its nearest neighbors, and a point on the line segment connecting sample a and sample b is then randomly selected as the newly synthesized minority-class sample c, wherein the specific algorithm steps comprise:
step S21, for each sample x in the minority class, calculating the distance from sample x to all the other samples in the minority class, using the Euclidean distance d as the metric, to obtain its k nearest neighbors, wherein the calculation formula of the Euclidean distance d is as follows:
d(x, y) = √( Σ_{i=1…n} (x_i − y_i)² ), where y denotes another minority-class sample and n is the number of features;
step S22, for each sample x of the minority class, randomly selecting a number of samples from its k nearest neighbors, the selected neighbors being denoted xn;
step S23, for each randomly selected neighbor xn, carrying out random linear interpolation between xn and the original sample to construct a new sample;
step S24, putting the new samples into the original data to generate a new training set;
after SMOTE sampling is finished, the final new sample set is formed and is divided into training samples and test samples.
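By way of illustration only, the SMOTE interpolation of steps S21 to S24 can be sketched in Python as follows; the function name, the oversampling count and the random seed below are assumptions made for the example and are not part of the claimed method.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote_oversample(X_min, n_new, k=5, seed=0):
        # X_min: minority-class samples, shape (n_minority, n_features)
        rng = np.random.default_rng(seed)
        # Step S21: k nearest neighbours inside the minority class (Euclidean distance)
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
        _, idx = nn.kneighbors(X_min)          # idx[:, 0] is the sample itself
        synthetic = []
        for _ in range(n_new):
            i = rng.integers(len(X_min))       # minority-class sample a
            j = rng.choice(idx[i, 1:])         # Step S22: random neighbour b among the k
            lam = rng.random()                 # Step S23: random point on the segment a-b
            synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
        return np.vstack(synthetic)            # Step S24: appended to the original data

    # Usage sketch: balance the classes before dividing into training and test samples
    # X_new = smote_oversample(X_minority, n_new=len(X_majority) - len(X_minority))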
In this embodiment, in step S3, the index used for extracting the characteristic behaviors of the call ticket is the information gain; the larger the information gain of a feature, the better the selectivity of that feature, and the calculation formula is as follows:
g(D,A)=H(D)-H(D|A);
where H(D) is the empirical entropy and H(D|A) is the empirical conditional entropy of D given the selected feature A; the calculation formulas are respectively as follows:
H(D) = −Σ_{k=1…K} (|Ck|/|D|)·log₂(|Ck|/|D|);
H(D|A) = Σ_{i=1…n} (|Di|/|D|)·H(Di) = −Σ_{i=1…n} (|Di|/|D|)·Σ_{k=1…K} (|Dik|/|Di|)·log₂(|Dik|/|Di|);
wherein D is the training data set and |D| is the sample capacity, namely the number of samples (the number of elements in D); there are K classes denoted Ck, k = 1, 2, …, K, |Ck| is the number of samples belonging to class Ck, and the sum of all |Ck| equals |D|; according to the values of feature A, D is partitioned into n subsets D1, D2, …, Dn, |Di| is the number of samples in Di, the sum of all |Di| equals |D|, i = 1, 2, …, n; the set of samples in Di that belong to class Ck is denoted Dik (i.e., the intersection of Di and Ck), and |Dik| is the number of samples in Dik.
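The empirical entropy and conditional entropy above can be computed directly from class counts. The following short Python sketch mirrors the formulas for H(D), H(D|A) and g(D, A); the function and variable names are chosen for illustration and do not appear in the patent.

    import numpy as np

    def entropy(labels):
        # Empirical entropy H(D) from the class labels of data set D
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()                      # |Ck| / |D|
        return -np.sum(p * np.log2(p))

    def information_gain(feature_values, labels):
        # g(D, A) = H(D) - H(D|A) for a discrete feature A
        h_d = entropy(labels)
        h_d_given_a = 0.0
        for v in np.unique(feature_values):
            mask = feature_values == v                 # subset Di of D
            h_d_given_a += mask.mean() * entropy(labels[mask])   # (|Di|/|D|) * H(Di)
        return h_d - h_d_given_a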
In this embodiment, in step S4, the model parameters include configuration file parameters and core algorithm operating parameters.
In this embodiment, in the step S4, the calculation formulas of the precision rate, the recall rate and the F1 score are as follows:
precision rate = TP / (TP + FP);
recall rate = TP / (TP + FN);
F1 score = 2 × precision × recall / (precision + recall);
wherein TP is the number of samples that are actually positive and predicted positive, FP is the number of samples that are actually negative but predicted positive, TN is the number of samples that are actually negative and predicted negative, and FN is the number of samples that are actually positive but predicted negative.
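As a worked numerical example (the confusion-matrix counts below are invented solely for illustration and are not results reported by the invention), the three metrics are computed as follows:

    # Hypothetical confusion-matrix counts, for illustration only
    TP, FP, TN, FN = 90, 10, 880, 20

    precision = TP / (TP + FP)                            # 90 / 100  = 0.900
    recall = TP / (TP + FN)                               # 90 / 110 ~= 0.818
    f1 = 2 * precision * recall / (precision + recall)    # ~= 0.857
    print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")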
In this embodiment, in step S5, two parameters are required for constructing the LGB model, which are the number t of leaf nodes of the decision tree and the number m of input features that need to be considered when each node of the decision tree is split, and the specific construction steps include:
step S51, pre-sorting the data: first, all the features are pre-sorted according to their values;
step S52, sampling: let N be the number of training samples; tree building then starts, the number of input samples is N, and the N training samples are randomly drawn from the training set;
step S53, discretizing the data: continuous data are discretized, and the buckets required for each feature are determined;
step S54, selecting features and split points: the number of input features of a training sample is M (here M = 23), and m is far smaller than M; the split nodes of the data are calculated according to the information gain, and when splitting at the optimal split point, m input features are selected from the M input features and the one with the largest information gain among them is chosen for splitting, wherein m does not change during the construction of a decision tree;
step S55, splitting the data in this way: each time, the leaf with the maximum splitting gain is found among all the current leaves and is split, and this is repeated in a loop; no pruning is needed;
wherein, an optimal decision tree is generated in the process of model training, and fraud categories are output for newly input test samples through the optimal decision tree.
In order to facilitate further understanding of the technical scheme, the implementation principle is illustrated:
as shown in fig. 1, the LGB algorithm-based fraud phone identification method includes five parts, namely data set, SMOTE sampling, LGB algorithm model, model evaluation and model deployment.
The data set A is a two-month call record; the data are encrypted and the original data dimension is 43. After data cleaning, variable derivation and feature screening, the input features of the LGB model are x1, x2, x3, x4 … x23, and the variables are described in detail as follows: x1 is the calling number; x2 is the total call duration; x3 is the average call duration; x4 is the maximum number of calls per hour of the day; x5 is the minimum number of calls per hour of the day; x6 is the average number of calls per hour of the day; x7 is the standard deviation of the number of calls per hour of the day; x8 is the maximum call duration per hour of the day; x9 is the minimum call duration per hour of the day; x10 is the average call duration per hour of the day; x11 is the standard deviation of the call duration per hour of the day; x12 is the number of different numbers dialed; x13 is the number of different areas dialed; x14 is the average call duration across different numbers; x15 is the minimum daily call duration; x16 is the maximum daily call duration; x17 is the standard deviation of the daily call duration; x18 is the number of failed calls; x19 is the earliest hour at which calls are made; x20 is the latest hour at which calls are made; x21 is the hour span; x22 is the call frequency; x23 is the number of hours per day in which calls are made.
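Purely as an illustration of this feature derivation step, the following pandas sketch aggregates raw call records into a few of the variables above (total duration x2, average duration x3, hourly call-count statistics x4–x7, and distinct called numbers x12); the file name and column names of the raw table are assumptions, and the remaining variables would be derived analogously.

    import pandas as pd

    # Assumed raw CDR columns: caller, callee, start_time (datetime), duration (seconds)
    cdr = pd.read_csv("cdr_two_months.csv", parse_dates=["start_time"])
    cdr["hour"] = cdr["start_time"].dt.floor("h")

    # Calls per hour for each calling number, then max/min/mean/std over hours (x4-x7)
    per_hour = cdr.groupby(["caller", "hour"]).size().rename("calls")
    hourly = per_hour.groupby("caller").agg(["max", "min", "mean", "std"])
    hourly.columns = ["x4", "x5", "x6", "x7"]

    # Total and average call duration (x2, x3) and number of distinct called numbers (x12)
    per_caller = cdr.groupby("caller").agg(
        x2=("duration", "sum"),
        x3=("duration", "mean"),
        x12=("callee", "nunique"),
    )

    features = per_caller.join(hourly)    # one feature row per calling number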
SMOTE sampling analyzes the minority-class crank call samples A and artificially synthesizes new samples from them to add to the data set: for each minority-class sample a, a sample b is randomly selected from its nearest neighbors, and a point on the line segment connecting a and b is then randomly selected as the newly synthesized minority-class sample.
The LGB algorithm comprises four parts of parameter setting, information gain, algorithm LGB tree construction and model training.
Parameter settings: the configuration, core parameter settings and parameter interpretation of the model in the LGB algorithm-based fraud call identification method are as follows:
(1) config: the path of the configuration file; the default is an empty string.
(2) task: the task to be performed; the default value is train, indicating a training task.
(3) application: the type of problem; the default value is regression, indicating a regression task.
(4) boosting: the base learner algorithm; the default value is gbdt, the gradient boosting decision tree algorithm.
(5) data: the filename of the training data file; the default is an empty string. LightGBM uses it to train the model.
(6) valid: the filename of the validation set file; the default is an empty string.
(7) num_iterations: the number of iterations of the algorithm; the default value is 100.
(8) learning_rate: the learning rate; the default value is 0.1.
(9) num_leaves: the number of leaves of a tree; the default value is 31.
(10) tree_learner: the type of parallel learning; the default value is serial, representing a single machine.
(11) max_depth: the maximum depth of the tree model; the default value is -1. A value less than 0 means no limit.
(12) min_data_in_leaf: the minimum number of samples contained in a leaf node; the default value is 20.
(13) min_sum_hessian_in_leaf: the minimum sum of the Hessian on a leaf node (i.e. the minimum sum of leaf-node sample weights); the default is 1e-3.
(14) feature_fraction: value range [0.0, 1.0]; the default value is 1.0. If less than 1.0, LightGBM randomly selects a subset of the features in each iteration; for example, 0.8 means that 80% of the features are selected before each tree is trained.
(15) feature_fraction_seed: the random seed for feature_fraction; the default is 2.
(16) bagging_fraction: value range [0.0, 1.0]; the default value is 1.0. If less than 1.0, LightGBM randomly selects a subset of the samples (without resampling) in each iteration; for example, 0.8 means that 80% of the samples (without resampling) are selected before each tree is trained.
(17) bagging_freq: bagging is performed every bagging_freq iterations; a value of 0 disables bagging.
(18) bagging_seed: the random seed for bagging; the default is 3.
(19) early_stopping_round: the default is 0; training stops if a validation-set metric does not improve within the last early_stopping_round rounds, and a value of 0 disables early stopping.
(20) lambda_l1: the L1 regularization coefficient; the default is 0.
(21) lambda_l2: the L2 regularization coefficient; the default is 0.
(22) min_split_gain: the minimum gain required to perform a split; the default is 0.
(23) drop_rate: value range [0.0, 1.0], the dropout ratio; the default is 0.1. This parameter is used only in dart.
(24) skip_drop: value range [0.0, 1.0], the probability of skipping dropout; the default is 0.5. This parameter is used only in dart.
(25) max_drop: the maximum number of trees dropped in one iteration; the default is 50. A value of 0 or less means no limit. This parameter is used only in dart.
(26) uniform_drop: whether to drop trees uniformly; the default value is False. This parameter is used only in dart.
(27) xgboost_dart_mode: whether to use the XGBoost dart mode; the default value is False. This parameter is used only in dart.
(28) drop_seed: the random seed for dropout; the default value is 4. This parameter is used only in dart.
(29) top_rate: value range [0.0, 1.0], the retention ratio of large-gradient data in goss; the default value is 0.2. This parameter is used only in goss.
(30) top_k: used in voting parallelism; the default is 20. Setting a larger value may yield more accurate results but may reduce the training speed.
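To illustrate how these core parameters might be assembled in practice, the following sketch builds a LightGBM parameter dictionary and trains a booster; the concrete values (objective, learning rate, fractions and so on) are example settings chosen for a binary fraud/normal task, not the tuned values of the patented model, and X_train / y_train are assumed to be the prepared feature matrix and labels.

    import lightgbm as lgb

    params = {                        # illustrative settings, not the tuned model
        "task": "train",
        "objective": "binary",        # fraud / normal classification
        "boosting": "gbdt",
        "num_iterations": 100,
        "learning_rate": 0.1,
        "num_leaves": 31,
        "max_depth": -1,              # < 0 means no depth limit
        "min_data_in_leaf": 20,
        "feature_fraction": 0.8,      # use 80% of the features per tree
        "bagging_fraction": 0.8,      # use 80% of the samples per tree
        "bagging_freq": 5,
        "lambda_l1": 0.0,
        "lambda_l2": 0.0,
    }

    # train_set = lgb.Dataset(X_train, label=y_train)
    # booster = lgb.train(params, train_set)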
The information gain is an index used for selecting the features in the tree model, and the larger the information gain of a certain feature is, the better the selectivity of the feature is.
The LGB algorithm only needs two parameters in the process of constructing the tree, the number t of leaf nodes of the decision tree and the number m of input features which need to be considered when each node of the decision tree is split.
Model training: based on the meaning of each parameter's initial value, different values are set for each parameter; ten-fold cross-validation and grid search are adopted to continuously fit the data and train the model, and a stable trained model is output.
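A minimal sketch of this ten-fold cross-validation and grid search step, using scikit-learn's GridSearchCV with the LGBMClassifier wrapper, is given below; the parameter grid is a small illustrative example rather than the grid actually searched by the inventors, and X_train / y_train denote the SMOTE-balanced training data.

    from lightgbm import LGBMClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {                       # illustrative grid only
        "num_leaves": [15, 31, 63],
        "learning_rate": [0.05, 0.1],
        "n_estimators": [100, 200],
    }

    search = GridSearchCV(
        LGBMClassifier(objective="binary"),
        param_grid,
        scoring="f1",                    # evaluate candidates by F1 score
        cv=10,                           # ten-fold cross-validation
    )
    search.fit(X_train, y_train)
    best_model = search.best_estimator_  # stable model with the best parameters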
Model evaluation: in the present invention the model is evaluated using the precision rate, the recall rate and the F1 score.
Model deployment: after the model is comprehensively evaluated, the pickle module is used to serialize the model and store it on a server. An API is constructed with Flask, the pickle module is used to deserialize the model, and the model meeting the service requirements is deployed online in the form of an API interface, so that real-time prediction and interception of fraud calls are realized.
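The serialization and Flask deployment described above might look like the following sketch; the model file path, endpoint name and JSON field are illustrative assumptions, since the patent does not specify the service interface, and best_model denotes the classifier obtained in the training sketch above.

    import pickle
    from flask import Flask, request, jsonify

    # Serialize the trained model to the server (path is illustrative)
    with open("lgb_model.pkl", "wb") as f:
        pickle.dump(best_model, f)

    app = Flask(__name__)

    # Deserialize the model when the service starts
    with open("lgb_model.pkl", "rb") as f:
        model = pickle.load(f)

    @app.route("/predict", methods=["POST"])
    def predict():
        features = request.get_json()["features"]     # the 23 input feature values
        label = int(model.predict([features])[0])     # 1 = fraud call, 0 = normal call
        return jsonify({"result": label})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)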
Model prediction: when an encrypted call ticket arrives, the API interface is called and the data are input into the LGB model to predict whether the call is a crank call; the interface returns 1 if it is a crank call and 0 if it is not.
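On the calling side, the interface contract above could be exercised as follows; the URL and the all-zero feature vector are placeholders for the example.

    import requests

    payload = {"features": [0.0] * 23}    # one 23-dimensional feature vector (dummy values)
    resp = requests.post("http://127.0.0.1:5000/predict", json=payload)
    print(resp.json())                    # {"result": 1} for a fraud call, {"result": 0} otherwise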
As shown in fig. 2-3, such an LGB algorithm-based fraud call identification method is widely applicable in daily life. The LGB algorithm makes decisions using a tree structure: after the sample data are feature-processed according to the known requirements, an optimal tree is finally built and the leaf nodes of the tree give the final decision, so that new data can be judged by this tree. LGB is a lightweight version of the XGBoost algorithm, with a high training speed and low memory occupancy. Training data are randomly selected to construct classifiers, and the learned models are finally combined to improve the overall classification effect. The application of LGB in the technical solution in various fields is described below through several classical cases of classification prediction.
As shown in fig. 2, fraud call prediction is performed on the call ticket data of unit XX in Gansu. The extracted main features include the calling number, the called number, whether the number belongs to a virtual operator number segment, regional dispersion, number of active days, call-back rate and so on, and whether the numbers are fraud calls is predicted from these data. Each internal node represents a decision on an attribute condition, and the leaf nodes indicate whether the number is a fraud number. When the decision tree selects features, the feature with the largest information gain value is chosen as the node-splitting condition, the information gain values of the other features are calculated accordingly to form an optimized tree, and the output leaf node finally indicates whether the number is a fraud number.
As shown in fig. 3, fraud call prediction is performed on the basic features extracted from the CDR ticket data of unit XX in Jiangsu, for example: called-party dispersion, whether the number is an overseas number, whether the number is a virtual operator number, the home location of the calling party, the roaming location of the calling party, the home location of the called number, and the calling frequency; the model is trained on these features to predict whether a number is a fraud call. Each internal node represents a conditional decision on an attribute, and the leaf nodes indicate whether the call is a fraud call. When the decision tree selects features, the information gain value of each feature is first calculated and the values are sorted in descending order; the feature with the largest information gain is selected as the root node, the information gains of the other nodes are then calculated and the feature with the largest gain is chosen for the second split, and splitting continues in the same way to form an optimized classification prediction tree; finally, the optimized tree of the LGB model gives whether the number is a fraud call.
In summary, the innovation points of the invention are as follows: first, there is currently no patent on fraud call identification based on an LGB-related algorithm; second, the LGB machine learning algorithm can accurately identify fraud calls and effectively alleviate problems such as misjudgment and missed judgment in public security case handling; third, the SMOTE sampling algorithm is used in the data sampling process, so that the positive and negative samples of the model are relatively balanced and the model error is effectively reduced. The LGB algorithm is a framework implementing the GBDT algorithm; it supports efficient parallel training, automatically performs feature selection, trains faster, consumes less memory and gives the model better accuracy.
The invention adopts an LGB algorithm-based fraud call identification method: based on encrypted CDR ticket data, a fraud call identification model is built on the LGB algorithm; for unbalanced data samples, oversampling is carried out with the SMOTE algorithm to balance the data distribution; the input and output variables of the LGB are designed; the LGB parameters are designed through grid search, improving the LGB accuracy and the training efficiency; and whether a calling number is a fraud call or a normal call is identified by calculating identification-effect evaluation values of the model, such as the precision rate, recall rate and F1 score. The fraud call identification method provided by the invention is both accurate and fast.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (8)

1. A method for identifying fraudulent calls based on LGB algorithm is characterized by comprising the following steps:
step S1, acquiring a data set of an original call, and manually studying and judging to determine the distribution proportion of positive and negative data samples;
step S2, sampling the data set of the original call by adopting an SMOTE algorithm to form a final data set, and classifying the data set into a training set and a testing set;
step S3, extracting the characteristic behavior of the call ticket, and initializing the model parameter;
step S4, training the model by adopting a ten-fold cross-validation method, using the test set for validation, and calculating the precision rate, the recall rate and the F1 score of the model;
step S5, obtaining an optimal LGB model by grid search, serializing the model by a pickle module, and storing the model to a server;
step S6, deserializing the model by adopting the pickle module, constructing an API by using the Flask framework, and deploying the model online in an interface mode;
and step S7, when a call record to be detected arrives, calling the API interface, inputting the data into the LGB prediction model, and returning the result after the model makes its prediction.
2. The LGB algorithm-based fraud phone identification method according to claim 1, wherein in the step S1, the data set is a two-month call record, the data dimension is 43 dimensions, and the input feature value of the LGB is obtained after data cleaning, variable derivation and feature screening.
3. The LGB algorithm-based fraud phone identification method according to claim 2, wherein in said step S1, the data in the original call is processed by encryption.
4. The LGB algorithm-based fraud call identification method according to claim 1, wherein in the step S2, SMOTE sampling is to analyze the minority-class crank call samples, artificially synthesize new samples from them, and add the new samples to the data set; for each minority-class sample a, a sample b is randomly selected from its nearest neighbors, and a point on the line segment connecting sample a and sample b is then randomly selected as the newly synthesized minority-class sample c, wherein the specific algorithm steps comprise:
step S21, for each sample x in the minority class, calculating the distance from sample x to all the other samples in the minority class, using the Euclidean distance d as the metric, to obtain its k nearest neighbors, wherein the calculation formula of the Euclidean distance d is as follows:
d(x, y) = √( Σ_{i=1…n} (x_i − y_i)² ), where y denotes another minority-class sample and n is the number of features;
step S22, for each sample x of the minority class, randomly selecting a number of samples from its k nearest neighbors, the selected neighbors being denoted xn;
step S23, for each randomly selected neighbor xn, carrying out random linear interpolation between xn and the original sample to construct a new sample;
step S24, putting the new samples into the original data to generate a new training set;
after SMOTE sampling is finished, the final new sample set is formed and is divided into training samples and test samples.
5. The LGB algorithm-based fraud phone recognition method of claim 1, wherein in the step S3, the index used for extracting the characteristic behaviors of the call ticket is the information gain; the larger the information gain of a feature, the better the selectivity of that feature, and the calculation formula is as follows:
g(D,A)=H(D)-H(D|A);
where H(D) is the empirical entropy and H(D|A) is the empirical conditional entropy of D given the selected feature A; the calculation formulas are respectively as follows:
H(D) = −Σ_{k=1…K} (|Ck|/|D|)·log₂(|Ck|/|D|);
H(D|A) = Σ_{i=1…n} (|Di|/|D|)·H(Di) = −Σ_{i=1…n} (|Di|/|D|)·Σ_{k=1…K} (|Dik|/|Di|)·log₂(|Dik|/|Di|);
wherein D is the training data set and |D| is the sample capacity, namely the number of samples (the number of elements in D); there are K classes denoted Ck, k = 1, 2, …, K, |Ck| is the number of samples belonging to class Ck, and the sum of all |Ck| equals |D|; according to the values of feature A, D is partitioned into n subsets D1, D2, …, Dn, |Di| is the number of samples in Di, the sum of all |Di| equals |D|, i = 1, 2, …, n; the set of samples in Di that belong to class Ck is denoted Dik (i.e., the intersection of Di and Ck), and |Dik| is the number of samples in Dik.
6. The LGB algorithm-based fraud phone identification method according to claim 1, wherein in said step S4, the model parameters include profile parameters, core algorithm operation parameters.
7. The LGB algorithm-based fraud phone identification method according to claim 6, wherein in said step S4, the calculation formulas of precision rate, recall rate and F1 score are respectively as follows:
precision rate = TP / (TP + FP);
recall rate = TP / (TP + FN);
F1 score = 2 × precision × recall / (precision + recall);
wherein TP is the number of samples that are actually positive and predicted positive, FP is the number of samples that are actually negative but predicted positive, TN is the number of samples that are actually negative and predicted negative, and FN is the number of samples that are actually positive but predicted negative.
8. The LGB algorithm-based fraud phone recognition method of claim 1, wherein in said step S5, the construction of LGB model requires two parameters, i.e. the number of leaf nodes t of decision tree, the number of input features m to be considered when each node of decision tree is split, and the specific construction steps include:
step S51, pre-sorting the data: first, all the features are pre-sorted according to their values;
step S52, sampling: let N be the number of training samples; tree building then starts, the number of input samples is N, and the N training samples are randomly drawn from the training set;
step S53, discretizing the data: continuous data are discretized, and the buckets required for each feature are determined;
step S54, selecting features and split points: the number of input features of a training sample is M (here M = 23), and m is far smaller than M; the split nodes of the data are calculated according to the information gain, and when splitting at the optimal split point, m input features are selected from the M input features and the one with the largest information gain among them is chosen for splitting, wherein m does not change during the construction of a decision tree;
step S55, splitting the data in this way: each time, the leaf with the maximum splitting gain is found among all the current leaves and is split, and this is repeated in a loop; no pruning is needed;
wherein, an optimal decision tree is generated in the process of model training, and fraud categories are output for newly input test samples through the optimal decision tree.
CN202011185958.4A 2020-10-30 2020-10-30 LGB algorithm-based fraud call identification method Pending CN112364901A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011185958.4A CN112364901A (en) 2020-10-30 2020-10-30 LGB algorithm-based fraud call identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011185958.4A CN112364901A (en) 2020-10-30 2020-10-30 LGB algorithm-based fraud call identification method

Publications (1)

Publication Number Publication Date
CN112364901A true CN112364901A (en) 2021-02-12

Family

ID=74514203

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011185958.4A Pending CN112364901A (en) 2020-10-30 2020-10-30 LGB algorithm-based fraud call identification method

Country Status (1)

Country Link
CN (1) CN112364901A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108924333A (en) * 2018-06-12 2018-11-30 阿里巴巴集团控股有限公司 Fraudulent call recognition methods, device and system
CN109657977A (en) * 2018-12-19 2019-04-19 重庆誉存大数据科技有限公司 A kind of Risk Identification Method and system
CN110147430A (en) * 2019-04-25 2019-08-20 上海欣方智能系统有限公司 Harassing call recognition methods and system based on random forests algorithm
CN111414717A (en) * 2020-03-02 2020-07-14 浙江大学 XGboost-L ightGBM-based unit power prediction method
CN111311401A (en) * 2020-03-30 2020-06-19 百维金科(上海)信息科技有限公司 Financial default probability prediction model based on LightGBM

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113961712A (en) * 2021-09-08 2022-01-21 武汉众智数字技术有限公司 Knowledge graph-based fraud telephone analysis method
CN113961712B (en) * 2021-09-08 2024-04-26 武汉众智数字技术有限公司 Knowledge-graph-based fraud telephone analysis method
CN114006982A (en) * 2021-11-02 2022-02-01 号百信息服务有限公司 Harassment number identification method based on classification gradient lifting algorithm
CN114006982B (en) * 2021-11-02 2024-04-30 号百信息服务有限公司 Harassment number identification method based on classification gradient lifting algorithm


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination