CN112364901A - LGB algorithm-based fraud call identification method - Google Patents

LGB algorithm-based fraud call identification method

Info

Publication number
CN112364901A
CN112364901A
Authority
CN
China
Prior art keywords
model
samples
data
lgb
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011185958.4A
Other languages
Chinese (zh)
Inventor
张飞
周红敏
周荣
程钢
卜小冲
肖书华
董伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xinfang Software Co ltd
Shanghai Cintel Intelligent System Co ltd
Original Assignee
Shanghai Xinfang Software Co ltd
Shanghai Cintel Intelligent System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Xinfang Software Co ltd, Shanghai Cintel Intelligent System Co ltd filed Critical Shanghai Xinfang Software Co ltd
Priority to CN202011185958.4A priority Critical patent/CN112364901A/en
Publication of CN112364901A publication Critical patent/CN112364901A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M3/00Automatic or semi-automatic exchanges
    • H04M3/22Arrangements for supervision, monitoring or testing
    • H04M3/2218Call detail recording
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M3/00Automatic or semi-automatic exchanges
    • H04M3/22Arrangements for supervision, monitoring or testing
    • H04M3/2281Call monitoring, e.g. for law enforcement purposes; Call tracing; Detection or prevention of malicious calls

Abstract

The invention discloses an LGB algorithm-based fraud call identification method, which comprises the following steps: acquiring a data set of original calls; sampling the data set with the SMOTE algorithm and dividing it into a training set and a test set; extracting the characteristic behaviors of the call tickets and initializing the model parameters; training the model with ten-fold cross-validation, validating it on the test set, and calculating the precision rate, recall rate and F1 score of the model; obtaining an optimal LGB model through grid search and serializing the model with the pickle module; deserializing the model with the pickle module and constructing an API with the Flask framework; and, when a call record to be detected arrives, calling the API interface, inputting the data into the LGB prediction model, and returning the result after the model makes its prediction. The LGB algorithm-based fraud call identification method provided by the invention realizes automatic classification and prediction of fraud calls, can obviously improve the case-solving efficiency of public security personnel, and reduces the economic losses of enterprises and individuals.

Description

LGB algorithm-based fraud call identification method
Technical Field
The invention relates to the technical field of network communication, in particular to a method for identifying fraudulent calls based on an LGB algorithm.
Background
Harassing calls come in all varieties, and fraud schemes such as impersonating public security, procuratorate and court authorities, financing scams, shopping services and flight rebooking are renewed year by year, showing an evolution from wide, indiscriminate targeting toward precision targeting. Telephone fraud causes huge economic losses, disturbs people's normal work and life, severely harms social integrity, and has become a public hazard that seriously infringes on people's vital interests.
Disclosure of Invention
Aiming at the problems in the related art, the invention provides an LGB algorithm-based fraud call identification method, which addresses the problems that existing telephone fraud causes huge economic losses, disturbs people's normal work and life, and greatly harms social integrity.
In order to achieve the technical purpose, the technical scheme of the invention is as follows:
a method for identifying fraudulent calls based on LGB algorithm is designed, and comprises the following steps:
step S1, acquiring a data set of an original call, and manually studying and judging to determine the distribution proportion of positive and negative data samples;
step S2, sampling the data set of the original call by adopting an SMOTE algorithm to form a final data set, and classifying the data set into a training set and a testing set;
step S3, extracting the characteristic behavior of the call ticket, and initializing the model parameter;
step S4, training the model by adopting a ten-fold cross-validation method, using the test set for validation, and calculating the precision rate, the recall rate and the F1 score of the model;
step S5, obtaining an optimal LGB model by grid search, serializing the model by a pickle module, and storing the model to a server;
step S6, deserializing the model by adopting the pickle module, constructing an API by using the Flask framework, and deploying the model online in an interface mode;
and step S7, when a call record to be detected arrives, calling the API interface, inputting the data into the LGB prediction model, and returning the result after the model makes its prediction.
Further, in step S1, the data set is a two-month call record, the data dimension is 43 dimensions, and the input feature value of the LGB is obtained through data cleaning, variable derivation, and feature screening.
Further, in step S1, the data in the original call is encrypted.
Further, in the step S2, the SMOTE sampling is to analyze the minority-class crank call samples, artificially synthesize new samples from them, and add the new samples to the data set; for each minority-class sample a, a sample b is randomly selected from its nearest neighbors, and a point on the line segment connecting sample a and sample b is then randomly selected as the newly synthesized minority-class sample c, wherein the specific algorithm steps comprise:
step S21, for each sample x in the minority class, calculating the distance from sample x to all the other samples in the minority class, using the Euclidean distance d as the metric, to obtain its k nearest neighbors, wherein the calculation formula of the Euclidean distance d is as follows:
d(x, y) = √( Σ_{i=1…n} (x_i − y_i)² ), where y denotes another minority-class sample and n is the number of features;
step S22, for each sample x of the minority class, randomly selecting a number of samples from its k nearest neighbors, the selected neighbors being denoted xn;
step S23, for each randomly selected neighbor xn, carrying out random linear interpolation between xn and the original sample to construct a new sample;
step S24, putting the new samples into the original data to generate a new training set;
after SMOTE sampling is finished, the final new sample set is formed and is divided into training samples and test samples.
Further, in step S3, the index used for extracting the characteristic behaviors of the call ticket is the information gain; the larger the information gain of a feature, the better the selectivity of that feature, and the calculation formula is as follows:
g(D,A)=H(D)-H(D|A);
where H(D) is the empirical entropy and H(D|A) is the empirical conditional entropy of D given the selected feature A; the calculation formulas are respectively as follows:
H(D) = −Σ_{k=1…K} (|Ck|/|D|)·log₂(|Ck|/|D|);
H(D|A) = Σ_{i=1…n} (|Di|/|D|)·H(Di) = −Σ_{i=1…n} (|Di|/|D|)·Σ_{k=1…K} (|Dik|/|Di|)·log₂(|Dik|/|Di|);
wherein D is the training data set and |D| is the sample capacity, namely the number of samples (the number of elements in D); there are K classes denoted Ck, k = 1, 2, …, K, |Ck| is the number of samples belonging to class Ck, and the sum of all |Ck| equals |D|; according to the values of feature A, D is partitioned into n subsets D1, D2, …, Dn, |Di| is the number of samples in Di, the sum of all |Di| equals |D|, i = 1, 2, …, n; the set of samples in Di that belong to class Ck is denoted Dik (i.e., the intersection of Di and Ck), and |Dik| is the number of samples in Dik.
Further, in step S4, the model parameters include configuration file parameters and core algorithm operating parameters.
Further, in the step S4, the calculation formulas of the precision rate, the recall rate and the F1 score are as follows:
precision rate = TP / (TP + FP);
recall rate = TP / (TP + FN);
F1 score = 2 × precision × recall / (precision + recall);
wherein TP is the number of samples that are actually positive and predicted positive, FP is the number of samples that are actually negative but predicted positive, TN is the number of samples that are actually negative and predicted negative, and FN is the number of samples that are actually positive but predicted negative.
Further, in step S5, the LGB model is constructed by using two parameters, which are the number t of leaf nodes of the decision tree and the number m of input features to be considered when each node of the decision tree is split, and the specific construction steps include:
step S51, pre-sorting the data: first, all the features are pre-sorted according to their values;
step S52, sampling: let N be the number of training samples; tree building then starts, the number of input samples is N, and the N training samples are randomly drawn from the training set;
step S53, discretizing the data: continuous data are discretized, and the buckets required for each feature are determined;
step S54, selecting features and split points: the number of input features of a training sample is M (here M = 23), and m is far smaller than M; the split nodes of the data are calculated according to the information gain, and when splitting at the optimal split point, m input features are selected from the M input features and the one with the largest information gain among them is chosen for splitting, wherein m does not change during the construction of a decision tree;
step S55, splitting the data in this way: each time, the leaf with the maximum splitting gain is found among all the current leaves and is split, and this is repeated in a loop; no pruning is needed;
wherein, an optimal decision tree is generated in the process of model training, and fraud categories are output for newly input test samples through the optimal decision tree.
The invention has the beneficial effects that: the LGB algorithm-based fraud call identification method combines the existing fraud call tickets with a manual label library to build an algorithm model from the call ticket data, realizing automatic classification and prediction; the method can obviously improve the case-solving efficiency of public security personnel and reduce the economic losses of enterprises and individuals.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
Fig. 1 is a flowchart of a method for identifying fraudulent calls based on LGB algorithm according to an embodiment of the present invention;
FIG. 2 is a first optimized tree structure diagram of a LGB algorithm-based fraud phone recognition method in a specific application according to an embodiment of the present invention;
fig. 3 is a second optimized tree structure diagram of a LGB algorithm-based fraud phone recognition method in a specific application according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.
As shown in fig. 1, a method for identifying fraudulent calls based on LGB algorithm according to an embodiment of the present invention includes the following steps:
step S1, acquiring a data set of an original call, and manually studying and judging to determine the distribution proportion of positive and negative data samples;
step S2, sampling the data set of the original call by adopting an SMOTE algorithm to form a final data set, and classifying the data set into a training set and a testing set;
step S3, extracting the characteristic behavior of the call ticket, and initializing the model parameter;
step S4, training the model by adopting a ten-fold cross-validation method, using the test set for validation, and calculating the precision rate, the recall rate and the F1 score of the model;
step S5, obtaining an optimal LGB model by grid search, serializing the model by a pickle module, and storing the model to a server;
step S6, deserializing the model by adopting the pickle module, constructing an API by using the Flask framework, and deploying the model online in an interface mode;
and step S7, when a call record to be detected arrives, calling the API interface, inputting the data into the LGB prediction model, and returning the result after the model makes its prediction.
In this embodiment, in step S1, the data set is a two-month call record, the data dimension is 43 dimensions, and the input feature value of the LGB is obtained through data cleaning, variable derivation, and feature screening.
In this embodiment, in step S1, the data in the original call is encrypted.
In this embodiment, in step S2, the SMOTE sampling is to analyze the minority-class crank call samples, artificially synthesize new samples from them, and add the new samples to the data set; for each minority-class sample a, a sample b is randomly selected from its nearest neighbors, and a point on the line segment connecting sample a and sample b is then randomly selected as the newly synthesized minority-class sample c, wherein the specific algorithm steps comprise:
step S21, for each sample x in the minority class, calculating the distance from sample x to all the other samples in the minority class, using the Euclidean distance d as the metric, to obtain its k nearest neighbors, wherein the calculation formula of the Euclidean distance d is as follows:
d(x, y) = √( Σ_{i=1…n} (x_i − y_i)² ), where y denotes another minority-class sample and n is the number of features;
step S22, for each sample x of the minority class, randomly selecting a number of samples from its k nearest neighbors, the selected neighbors being denoted xn;
step S23, for each randomly selected neighbor xn, carrying out random linear interpolation between xn and the original sample to construct a new sample;
step S24, putting the new samples into the original data to generate a new training set;
after SMOTE sampling is finished, the final new sample set is formed and is divided into training samples and test samples.
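By way of illustration only, the SMOTE interpolation of steps S21 to S24 can be sketched in Python as follows; the function name, the oversampling count and the random seed below are assumptions made for the example and are not part of the claimed method.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote_oversample(X_min, n_new, k=5, seed=0):
        # X_min: minority-class samples, shape (n_minority, n_features)
        rng = np.random.default_rng(seed)
        # Step S21: k nearest neighbours inside the minority class (Euclidean distance)
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
        _, idx = nn.kneighbors(X_min)          # idx[:, 0] is the sample itself
        synthetic = []
        for _ in range(n_new):
            i = rng.integers(len(X_min))       # minority-class sample a
            j = rng.choice(idx[i, 1:])         # Step S22: random neighbour b among the k
            lam = rng.random()                 # Step S23: random point on the segment a-b
            synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
        return np.vstack(synthetic)            # Step S24: appended to the original data

    # Usage sketch: balance the classes before dividing into training and test samples
    # X_new = smote_oversample(X_minority, n_new=len(X_majority) - len(X_minority))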
In this embodiment, in step S3, the index used for extracting the characteristic behaviors of the call ticket is the information gain; the larger the information gain of a feature, the better the selectivity of that feature, and the calculation formula is as follows:
g(D,A)=H(D)-H(D|A);
where H(D) is the empirical entropy and H(D|A) is the empirical conditional entropy of D given the selected feature A; the calculation formulas are respectively as follows:
H(D) = −Σ_{k=1…K} (|Ck|/|D|)·log₂(|Ck|/|D|);
H(D|A) = Σ_{i=1…n} (|Di|/|D|)·H(Di) = −Σ_{i=1…n} (|Di|/|D|)·Σ_{k=1…K} (|Dik|/|Di|)·log₂(|Dik|/|Di|);
wherein D is the training data set and |D| is the sample capacity, namely the number of samples (the number of elements in D); there are K classes denoted Ck, k = 1, 2, …, K, |Ck| is the number of samples belonging to class Ck, and the sum of all |Ck| equals |D|; according to the values of feature A, D is partitioned into n subsets D1, D2, …, Dn, |Di| is the number of samples in Di, the sum of all |Di| equals |D|, i = 1, 2, …, n; the set of samples in Di that belong to class Ck is denoted Dik (i.e., the intersection of Di and Ck), and |Dik| is the number of samples in Dik.
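The empirical entropy and conditional entropy above can be computed directly from class counts. The following short Python sketch mirrors the formulas for H(D), H(D|A) and g(D, A); the function and variable names are chosen for illustration and do not appear in the patent.

    import numpy as np

    def entropy(labels):
        # Empirical entropy H(D) from the class labels of data set D
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()                      # |Ck| / |D|
        return -np.sum(p * np.log2(p))

    def information_gain(feature_values, labels):
        # g(D, A) = H(D) - H(D|A) for a discrete feature A
        h_d = entropy(labels)
        h_d_given_a = 0.0
        for v in np.unique(feature_values):
            mask = feature_values == v                 # subset Di of D
            h_d_given_a += mask.mean() * entropy(labels[mask])   # (|Di|/|D|) * H(Di)
        return h_d - h_d_given_a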
In this embodiment, in step S4, the model parameters include configuration file parameters and core algorithm operating parameters.
In this embodiment, in the step S4, the calculation formulas of the precision rate, the recall rate and the F1 score are as follows:
precision rate = TP / (TP + FP);
recall rate = TP / (TP + FN);
F1 score = 2 × precision × recall / (precision + recall);
wherein TP is the number of samples that are actually positive and predicted positive, FP is the number of samples that are actually negative but predicted positive, TN is the number of samples that are actually negative and predicted negative, and FN is the number of samples that are actually positive but predicted negative.
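As a worked numerical example (the confusion-matrix counts below are invented solely for illustration and are not results reported by the invention), the three metrics are computed as follows:

    # Hypothetical confusion-matrix counts, for illustration only
    TP, FP, TN, FN = 90, 10, 880, 20

    precision = TP / (TP + FP)                            # 90 / 100  = 0.900
    recall = TP / (TP + FN)                               # 90 / 110 ~= 0.818
    f1 = 2 * precision * recall / (precision + recall)    # ~= 0.857
    print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")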
In this embodiment, in step S5, two parameters are required for constructing the LGB model, which are the number t of leaf nodes of the decision tree and the number m of input features that need to be considered when each node of the decision tree is split, and the specific construction steps include:
step S51, pre-sorting the data: first, all the features are pre-sorted according to their values;
step S52, sampling: let N be the number of training samples; tree building then starts, the number of input samples is N, and the N training samples are randomly drawn from the training set;
step S53, discretizing the data: continuous data are discretized, and the buckets required for each feature are determined;
step S54, selecting features and split points: the number of input features of a training sample is M (here M = 23), and m is far smaller than M; the split nodes of the data are calculated according to the information gain, and when splitting at the optimal split point, m input features are selected from the M input features and the one with the largest information gain among them is chosen for splitting, wherein m does not change during the construction of a decision tree;
step S55, splitting the data in this way: each time, the leaf with the maximum splitting gain is found among all the current leaves and is split, and this is repeated in a loop; no pruning is needed;
wherein, an optimal decision tree is generated in the process of model training, and fraud categories are output for newly input test samples through the optimal decision tree.
In order to facilitate further understanding of the technical scheme, the implementation principle is illustrated:
as shown in fig. 1, the LGB algorithm-based fraud phone identification method includes five parts, namely data set, SMOTE sampling, LGB algorithm model, model evaluation and model deployment.
The data set A is a two-month call record; the data are encrypted and the original data dimension is 43. After data cleaning, variable derivation and feature screening, the input features of the LGB model are x1, x2, x3, x4 … x23, and the variables are described in detail as follows: x1 is the calling number; x2 is the total call duration; x3 is the average call duration; x4 is the maximum number of calls per hour of the day; x5 is the minimum number of calls per hour of the day; x6 is the average number of calls per hour of the day; x7 is the standard deviation of the number of calls per hour of the day; x8 is the maximum call duration per hour of the day; x9 is the minimum call duration per hour of the day; x10 is the average call duration per hour of the day; x11 is the standard deviation of the call duration per hour of the day; x12 is the number of different numbers dialed; x13 is the number of different areas dialed; x14 is the average call duration across different numbers; x15 is the minimum daily call duration; x16 is the maximum daily call duration; x17 is the standard deviation of the daily call duration; x18 is the number of failed calls; x19 is the earliest hour at which calls are made; x20 is the latest hour at which calls are made; x21 is the hour span; x22 is the call frequency; x23 is the number of hours per day in which calls are made.
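Purely as an illustration of this feature derivation step, the following pandas sketch aggregates raw call records into a few of the variables above (total duration x2, average duration x3, hourly call-count statistics x4–x7, and distinct called numbers x12); the file name and column names of the raw table are assumptions, and the remaining variables would be derived analogously.

    import pandas as pd

    # Assumed raw CDR columns: caller, callee, start_time (datetime), duration (seconds)
    cdr = pd.read_csv("cdr_two_months.csv", parse_dates=["start_time"])
    cdr["hour"] = cdr["start_time"].dt.floor("h")

    # Calls per hour for each calling number, then max/min/mean/std over hours (x4-x7)
    per_hour = cdr.groupby(["caller", "hour"]).size().rename("calls")
    hourly = per_hour.groupby("caller").agg(["max", "min", "mean", "std"])
    hourly.columns = ["x4", "x5", "x6", "x7"]

    # Total and average call duration (x2, x3) and number of distinct called numbers (x12)
    per_caller = cdr.groupby("caller").agg(
        x2=("duration", "sum"),
        x3=("duration", "mean"),
        x12=("callee", "nunique"),
    )

    features = per_caller.join(hourly)    # one feature row per calling number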
SMOTE sampling analyzes the minority-class crank call samples A and artificially synthesizes new samples from them to add to the data set: for each minority-class sample a, a sample b is randomly selected from its nearest neighbors, and a point on the line segment connecting a and b is then randomly selected as the newly synthesized minority-class sample.
The LGB algorithm comprises four parts of parameter setting, information gain, algorithm LGB tree construction and model training.
Parameter settings: the configuration, core parameter settings and parameter interpretation of the model in the LGB algorithm-based fraud call identification method are as follows:
(1) config: the path of the configuration file; the default is an empty string.
(2) task: the task to be performed; the default value is train, indicating a training task.
(3) application: the type of problem; the default value is regression, indicating a regression task.
(4) boosting: the base learner algorithm; the default value is gbdt, the gradient boosting decision tree algorithm.
(5) data: the filename of the training data file; the default is an empty string. LightGBM uses it to train the model.
(6) valid: the filename of the validation set file; the default is an empty string.
(7) num_iterations: the number of iterations of the algorithm; the default value is 100.
(8) learning_rate: the learning rate; the default value is 0.1.
(9) num_leaves: the number of leaves of a tree; the default value is 31.
(10) tree_learner: the type of parallel learning; the default value is serial, representing a single machine.
(11) max_depth: the maximum depth of the tree model; the default value is -1. A value less than 0 means no limit.
(12) min_data_in_leaf: the minimum number of samples contained in a leaf node; the default value is 20.
(13) min_sum_hessian_in_leaf: the minimum sum of the Hessian on a leaf node (i.e. the minimum sum of leaf-node sample weights); the default is 1e-3.
(14) feature_fraction: value range [0.0, 1.0]; the default value is 1.0. If less than 1.0, LightGBM randomly selects a subset of the features in each iteration; for example, 0.8 means that 80% of the features are selected before each tree is trained.
(15) feature_fraction_seed: the random seed for feature_fraction; the default is 2.
(16) bagging_fraction: value range [0.0, 1.0]; the default value is 1.0. If less than 1.0, LightGBM randomly selects a subset of the samples (without resampling) in each iteration; for example, 0.8 means that 80% of the samples (without resampling) are selected before each tree is trained.
(17) bagging_freq: bagging is performed every bagging_freq iterations; a value of 0 disables bagging.
(18) bagging_seed: the random seed for bagging; the default is 3.
(19) early_stopping_round: the default is 0; training stops if a validation-set metric does not improve within the last early_stopping_round rounds, and a value of 0 disables early stopping.
(20) lambda_l1: the L1 regularization coefficient; the default is 0.
(21) lambda_l2: the L2 regularization coefficient; the default is 0.
(22) min_split_gain: the minimum gain required to perform a split; the default is 0.
(23) drop_rate: value range [0.0, 1.0], the dropout ratio; the default is 0.1. This parameter is used only in dart.
(24) skip_drop: value range [0.0, 1.0], the probability of skipping dropout; the default is 0.5. This parameter is used only in dart.
(25) max_drop: the maximum number of trees dropped in one iteration; the default is 50. A value of 0 or less means no limit. This parameter is used only in dart.
(26) uniform_drop: whether to drop trees uniformly; the default value is False. This parameter is used only in dart.
(27) xgboost_dart_mode: whether to use the XGBoost dart mode; the default value is False. This parameter is used only in dart.
(28) drop_seed: the random seed for dropout; the default value is 4. This parameter is used only in dart.
(29) top_rate: value range [0.0, 1.0], the retention ratio of large-gradient data in goss; the default value is 0.2. This parameter is used only in goss.
(30) top_k: used in voting parallelism; the default is 20. Setting a larger value may yield more accurate results but may reduce the training speed.
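To illustrate how these core parameters might be assembled in practice, the following sketch builds a LightGBM parameter dictionary and trains a booster; the concrete values (objective, learning rate, fractions and so on) are example settings chosen for a binary fraud/normal task, not the tuned values of the patented model, and X_train / y_train are assumed to be the prepared feature matrix and labels.

    import lightgbm as lgb

    params = {                        # illustrative settings, not the tuned model
        "task": "train",
        "objective": "binary",        # fraud / normal classification
        "boosting": "gbdt",
        "num_iterations": 100,
        "learning_rate": 0.1,
        "num_leaves": 31,
        "max_depth": -1,              # < 0 means no depth limit
        "min_data_in_leaf": 20,
        "feature_fraction": 0.8,      # use 80% of the features per tree
        "bagging_fraction": 0.8,      # use 80% of the samples per tree
        "bagging_freq": 5,
        "lambda_l1": 0.0,
        "lambda_l2": 0.0,
    }

    # train_set = lgb.Dataset(X_train, label=y_train)
    # booster = lgb.train(params, train_set)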
The information gain is an index used for selecting the features in the tree model, and the larger the information gain of a certain feature is, the better the selectivity of the feature is.
The LGB algorithm only needs two parameters in the process of constructing the tree, the number t of leaf nodes of the decision tree and the number m of input features which need to be considered when each node of the decision tree is split.
Model training: based on the meaning of each parameter's initial value, different values are set for each parameter; ten-fold cross-validation and grid search are adopted to continuously fit the data and train the model, and a stable trained model is output.
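A minimal sketch of this ten-fold cross-validation and grid search step, using scikit-learn's GridSearchCV with the LGBMClassifier wrapper, is given below; the parameter grid is a small illustrative example rather than the grid actually searched by the inventors, and X_train / y_train denote the SMOTE-balanced training data.

    from lightgbm import LGBMClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {                       # illustrative grid only
        "num_leaves": [15, 31, 63],
        "learning_rate": [0.05, 0.1],
        "n_estimators": [100, 200],
    }

    search = GridSearchCV(
        LGBMClassifier(objective="binary"),
        param_grid,
        scoring="f1",                    # evaluate candidates by F1 score
        cv=10,                           # ten-fold cross-validation
    )
    search.fit(X_train, y_train)
    best_model = search.best_estimator_  # stable model with the best parameters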
Model evaluation: in the present invention the model is evaluated using the precision rate, the recall rate and the F1 score.
Model deployment: after the model is comprehensively evaluated, the pickle module is used to serialize the model and store it on a server. An API is constructed with Flask, the pickle module is used to deserialize the model, and the model meeting the service requirements is deployed online in the form of an API interface, so that real-time prediction and interception of fraud calls are realized.
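The serialization and Flask deployment described above might look like the following sketch; the model file path, endpoint name and JSON field are illustrative assumptions, since the patent does not specify the service interface, and best_model denotes the classifier obtained in the training sketch above.

    import pickle
    from flask import Flask, request, jsonify

    # Serialize the trained model to the server (path is illustrative)
    with open("lgb_model.pkl", "wb") as f:
        pickle.dump(best_model, f)

    app = Flask(__name__)

    # Deserialize the model when the service starts
    with open("lgb_model.pkl", "rb") as f:
        model = pickle.load(f)

    @app.route("/predict", methods=["POST"])
    def predict():
        features = request.get_json()["features"]     # the 23 input feature values
        label = int(model.predict([features])[0])     # 1 = fraud call, 0 = normal call
        return jsonify({"result": label})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)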
Model prediction: when an encrypted call ticket arrives, the API interface is called and the data are input into the LGB model to predict whether the call is a crank call; the interface returns 1 if it is a crank call and 0 if it is not.
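On the calling side, the interface contract above could be exercised as follows; the URL and the all-zero feature vector are placeholders for the example.

    import requests

    payload = {"features": [0.0] * 23}    # one 23-dimensional feature vector (dummy values)
    resp = requests.post("http://127.0.0.1:5000/predict", json=payload)
    print(resp.json())                    # {"result": 1} for a fraud call, {"result": 0} otherwise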
As shown in fig. 2-3, such an LGB algorithm-based fraud call identification method is widely applicable in daily life. The LGB algorithm makes decisions using a tree structure: after the sample data are feature-processed according to the known requirements, an optimal tree is finally built and the leaf nodes of the tree give the final decision, so that new data can be judged by this tree. LGB is a lightweight version of the XGBoost algorithm, with a high training speed and low memory occupancy. Training data are randomly selected to construct classifiers, and the learned models are finally combined to improve the overall classification effect. The application of LGB in the technical solution in various fields is described below through several classical cases of classification prediction.
As shown in fig. 2, fraud call prediction is performed on the call ticket data of unit XX in Gansu. The extracted main features include the calling number, the called number, whether the number belongs to a virtual operator number segment, regional dispersion, number of active days, call-back rate and so on, and whether the numbers are fraud calls is predicted from these data. Each internal node represents a decision on an attribute condition, and the leaf nodes indicate whether the number is a fraud number. When the decision tree selects features, the feature with the largest information gain value is chosen as the node-splitting condition, the information gain values of the other features are calculated accordingly to form an optimized tree, and the output leaf node finally indicates whether the number is a fraud number.
As shown in fig. 3, fraud call prediction is performed on the basic features extracted from the CDR ticket data of unit XX in Jiangsu, for example: called-party dispersion, whether the number is an overseas number, whether the number is a virtual operator number, the home location of the calling party, the roaming location of the calling party, the home location of the called number, and the calling frequency; the model is trained on these features to predict whether a number is a fraud call. Each internal node represents a conditional decision on an attribute, and the leaf nodes indicate whether the call is a fraud call. When the decision tree selects features, the information gain value of each feature is first calculated and the values are sorted in descending order; the feature with the largest information gain is selected as the root node, the information gains of the other nodes are then calculated and the feature with the largest gain is chosen for the second split, and splitting continues in the same way to form an optimized classification prediction tree; finally, the optimized tree of the LGB model gives whether the number is a fraud call.
In summary, the innovation points of the invention are as follows: first, there is currently no patent on fraud call identification based on an LGB-related algorithm; second, the LGB machine learning algorithm can accurately identify fraud calls and effectively alleviate problems such as misjudgment and missed judgment in public security case handling; third, the SMOTE sampling algorithm is used in the data sampling process, so that the positive and negative samples of the model are relatively balanced and the model error is effectively reduced. The LGB algorithm is a framework implementing the GBDT algorithm; it supports efficient parallel training, automatically performs feature selection, trains faster, consumes less memory and gives the model better accuracy.
The invention adopts an LGB algorithm-based fraud call identification method: based on encrypted CDR ticket data, a fraud call identification model is built on the LGB algorithm; for unbalanced data samples, oversampling is carried out with the SMOTE algorithm to balance the data distribution; the input and output variables of the LGB are designed; the LGB parameters are designed through grid search, improving the LGB accuracy and the training efficiency; and whether a calling number is a fraud call or a normal call is identified by calculating identification-effect evaluation values of the model, such as the precision rate, recall rate and F1 score. The fraud call identification method provided by the invention is both accurate and fast.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (8)

1. A method for identifying fraudulent calls based on LGB algorithm is characterized by comprising the following steps:
step S1, acquiring a data set of an original call, and manually studying and judging to determine the distribution proportion of positive and negative data samples;
step S2, sampling the data set of the original call by adopting an SMOTE algorithm to form a final data set, and classifying the data set into a training set and a testing set;
step S3, extracting the characteristic behavior of the call ticket, and initializing the model parameter;
step S4, training the model by adopting a ten-fold cross-validation method, using the test set for validation, and calculating the precision rate, the recall rate and the F1 score of the model;
step S5, obtaining an optimal LGB model by grid search, serializing the model by a pickle module, and storing the model to a server;
step S6, deserializing the model by adopting the pickle module, constructing an API by using the Flask framework, and deploying the model online in an interface mode;
and step S7, when a call record to be detected arrives, calling the API interface, inputting the data into the LGB prediction model, and returning the result after the model makes its prediction.
2. The LGB algorithm-based fraud phone identification method according to claim 1, wherein in the step S1, the data set is a two-month call record, the data dimension is 43 dimensions, and the input feature value of the LGB is obtained after data cleaning, variable derivation and feature screening.
3. The LGB algorithm-based fraud phone identification method according to claim 2, wherein in said step S1, the data in the original call is processed by encryption.
4. The LGB algorithm-based fraud call identification method according to claim 1, wherein in the step S2, SMOTE sampling is to analyze the minority-class crank call samples, artificially synthesize new samples from them, and add the new samples to the data set; for each minority-class sample a, a sample b is randomly selected from its nearest neighbors, and a point on the line segment connecting sample a and sample b is then randomly selected as the newly synthesized minority-class sample c, wherein the specific algorithm steps comprise:
step S21, for each sample x in the minority class, calculating the distance from sample x to all the other samples in the minority class, using the Euclidean distance d as the metric, to obtain its k nearest neighbors, wherein the calculation formula of the Euclidean distance d is as follows:
d(x, y) = √( Σ_{i=1…n} (x_i − y_i)² ), where y denotes another minority-class sample and n is the number of features;
step S22, for each sample x of the minority class, randomly selecting a number of samples from its k nearest neighbors, the selected neighbors being denoted xn;
step S23, for each randomly selected neighbor xn, carrying out random linear interpolation between xn and the original sample to construct a new sample;
step S24, putting the new samples into the original data to generate a new training set;
after SMOTE sampling is finished, the final new sample set is formed and is divided into training samples and test samples.
5. The LGB algorithm-based fraud phone recognition method of claim 1, wherein in the step S3, the index used for extracting the characteristic behaviors of the call ticket is the information gain; the larger the information gain of a feature, the better the selectivity of that feature, and the calculation formula is as follows:
g(D,A)=H(D)-H(D|A);
where H(D) is the empirical entropy and H(D|A) is the empirical conditional entropy of D given the selected feature A; the calculation formulas are respectively as follows:
H(D) = −Σ_{k=1…K} (|Ck|/|D|)·log₂(|Ck|/|D|);
H(D|A) = Σ_{i=1…n} (|Di|/|D|)·H(Di) = −Σ_{i=1…n} (|Di|/|D|)·Σ_{k=1…K} (|Dik|/|Di|)·log₂(|Dik|/|Di|);
wherein D is the training data set and |D| is the sample capacity, namely the number of samples (the number of elements in D); there are K classes denoted Ck, k = 1, 2, …, K, |Ck| is the number of samples belonging to class Ck, and the sum of all |Ck| equals |D|; according to the values of feature A, D is partitioned into n subsets D1, D2, …, Dn, |Di| is the number of samples in Di, the sum of all |Di| equals |D|, i = 1, 2, …, n; the set of samples in Di that belong to class Ck is denoted Dik (i.e., the intersection of Di and Ck), and |Dik| is the number of samples in Dik.
6. The LGB algorithm-based fraud phone identification method according to claim 1, wherein in said step S4, the model parameters include profile parameters, core algorithm operation parameters.
7. The LGB algorithm-based fraud phone identification method according to claim 6, wherein in said step S4, the calculation formulas of precision rate, recall rate and F1 score are respectively as follows:
precision rate = TP / (TP + FP);
recall rate = TP / (TP + FN);
F1 score = 2 × precision × recall / (precision + recall);
wherein TP is the number of samples that are actually positive and predicted positive, FP is the number of samples that are actually negative but predicted positive, TN is the number of samples that are actually negative and predicted negative, and FN is the number of samples that are actually positive but predicted negative.
8. The LGB algorithm-based fraud phone recognition method of claim 1, wherein in said step S5, the construction of LGB model requires two parameters, i.e. the number of leaf nodes t of decision tree, the number of input features m to be considered when each node of decision tree is split, and the specific construction steps include:
step S51, pre-sorting the data: first, all the features are pre-sorted according to their values;
step S52, sampling: let N be the number of training samples; tree building then starts, the number of input samples is N, and the N training samples are randomly drawn from the training set;
step S53, discretizing the data: continuous data are discretized, and the buckets required for each feature are determined;
step S54, selecting features and split points: the number of input features of a training sample is M (here M = 23), and m is far smaller than M; the split nodes of the data are calculated according to the information gain, and when splitting at the optimal split point, m input features are selected from the M input features and the one with the largest information gain among them is chosen for splitting, wherein m does not change during the construction of a decision tree;
step S55, splitting the data in this way: each time, the leaf with the maximum splitting gain is found among all the current leaves and is split, and this is repeated in a loop; no pruning is needed;
wherein, an optimal decision tree is generated in the process of model training, and fraud categories are output for newly input test samples through the optimal decision tree.
CN202011185958.4A 2020-10-30 2020-10-30 LGB algorithm-based fraud call identification method Pending CN112364901A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011185958.4A CN112364901A (en) 2020-10-30 2020-10-30 LGB algorithm-based fraud call identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011185958.4A CN112364901A (en) 2020-10-30 2020-10-30 LGB algorithm-based fraud call identification method

Publications (1)

Publication Number Publication Date
CN112364901A true CN112364901A (en) 2021-02-12

Family

ID=74514203

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011185958.4A Pending CN112364901A (en) 2020-10-30 2020-10-30 LGB algorithm-based fraud call identification method

Country Status (1)

Country Link
CN (1) CN112364901A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108924333A (en) * 2018-06-12 2018-11-30 阿里巴巴集团控股有限公司 Fraudulent call recognition methods, device and system
CN109657977A (en) * 2018-12-19 2019-04-19 重庆誉存大数据科技有限公司 A kind of Risk Identification Method and system
CN110147430A (en) * 2019-04-25 2019-08-20 上海欣方智能系统有限公司 Harassing call recognition methods and system based on random forests algorithm
CN111414717A (en) * 2020-03-02 2020-07-14 浙江大学 XGboost-L ightGBM-based unit power prediction method
CN111311401A (en) * 2020-03-30 2020-06-19 百维金科(上海)信息科技有限公司 Financial default probability prediction model based on LightGBM

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113961712A (en) * 2021-09-08 2022-01-21 武汉众智数字技术有限公司 Knowledge graph-based fraud telephone analysis method
CN113961712B (en) * 2021-09-08 2024-04-26 武汉众智数字技术有限公司 Knowledge-graph-based fraud telephone analysis method
CN114006982A (en) * 2021-11-02 2022-02-01 号百信息服务有限公司 Harassment number identification method based on classification gradient lifting algorithm
CN114006982B (en) * 2021-11-02 2024-04-30 号百信息服务有限公司 Harassment number identification method based on classification gradient lifting algorithm


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination