CN113163057B

CN113163057B - Method for constructing dynamic identification interval of fraud telephone

Info

Publication number: CN113163057B
Application number: CN202110073654.7A
Authority: CN
Inventors: 林绍福; 常晴晴; 刘希亮
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2021-01-20
Filing date: 2021-01-20
Publication date: 2022-09-30
Anticipated expiration: 2041-01-20
Also published as: CN113163057A

Abstract

The invention discloses a method for constructing a dynamic identification interval for fraud phone calls. The method combines hyperparameter optimization with a gradient boosting machine to construct the fraud phone identification model: a hyperparameter optimization algorithm tunes the parameters of the gradient boosting machine, which improves the model's identification performance. A random forest algorithm is used to select data features, and the dimensions with a feature importance greater than 0.8 are selected to construct the fraud phone feature vector. The data are resampled with a hybrid sampling method that combines undersampling and oversampling, which alleviates the imbalance of the data distribution and has been verified experimentally to be feasible. The invention further provides a parameterization method based on a probability prediction model, which takes the probability output by the classifier as the confidence of a sample and constructs the fraud phone dynamic identification interval from the sample confidences output by the model.

Description

Method for constructing dynamic identification interval of fraud telephone

Technical Field

The invention relates to the field of internet communication and artificial intelligence, in particular to a method for constructing a dynamic identification interval of a fraud telephone, which can be applied to the field of telecommunication anti-fraud.

Background

Fraudulent calls seriously disrupt the normal order of communication, impair citizens' freedom of communication, and interfere with people's normal work and life; they are a serious problem in today's society. Effectively identifying and intercepting fraud phone calls plays an important role in telecommunication anti-fraud mechanisms and has attracted extensive attention from academia, industry, and government-funded institutions.

In the related art, crowdsourcing is a common way to identify fraud phone calls, but it is costly and inefficient. With the rapid development of artificial intelligence, machine learning methods have also been used to construct fraud phone identification models. However, most researchers evaluate model quality only by the accuracy of the model's output. For a typical imbalanced data set such as fraud phone call detail records, model identification is strongly biased, and accuracy alone cannot faithfully reflect identification performance. The invention therefore provides a fraud phone dynamic identification interval based on a machine learning algorithm evaluated with multiple indexes.

Disclosure of Invention

The invention aims to provide a method for constructing a dynamic identification interval for fraud phone calls, in order to solve the problem of low fraud phone identification accuracy in anti-fraud scenarios in the telecommunication field. A telecommunication operator can use the model to identify fraud phone calls and take corresponding control measures, which reduces user losses and improves user experience. The method takes user call detail record log data as the model input; through model analysis and judgment, it outputs the confidence that each record belongs to a fraud phone, and judges whether a sample is a suspected fraud phone according to that confidence and the upper and lower threshold values of the fraud phone dynamic interval, providing an important reference for operators in analyzing and managing users.

A method for constructing a dynamic identification interval for fraud phone calls, characterized by comprising the following steps:

step 1: extracting features from fraud phone user call detail record data based on a random forest;

step 2: rebalancing the data processed in step 1 with a hybrid sampling method, which reduces the influence of the imbalanced data distribution on the model;

step 3: constructing a fraud phone identification model according to the characteristics of the fraud phone user call detail record data, and measuring the identification performance of the model with multiple evaluation indexes;

step 4: using the fraud phone identification model from step 3 to judge the probability that a data sample is a fraud phone, and constructing the fraud phone dynamic identification interval.

1. Feature extraction from fraud phone user call detail record data based on a random forest: the information gain of each feature dimension in the data set is calculated, the node splits of each tree are constructed according to the information gain, and finally a score is calculated for each feature dimension. The original fraud phone user call detail record data are used as input, VIM denotes the variable importance measure, and GI denotes the Gini index.

The training data set S with n instances is defined as:

$$S = \{s_i\}, \quad i = 1, 2, \ldots, n \tag{1}$$

where $s_i$ denotes any sample point in the sample set and n is the number of sample points in the set; $s_i$ is defined in formula (2).

$$s_i = (x_i, y_i), \quad i = 1, 2, \ldots, n \tag{2}$$

where $x_i = \{v_1, v_2, \ldots, v_w\}$ is an instance, $v_j$ denotes a feature of sample $x_i$, and $y_i$ denotes the label corresponding to $x_i$. The labels divide the data into normal phone user call detail record data and fraud phone user call detail record data, that is, the number of classes is C = 2.

The data dimensions used in the invention are the following fields: desensitized mobile phone number $v_1$, called mobile phone number $v_2$, call frequency $v_3$, ratio of successful connections $v_4$, average call duration $v_5$, average ringing duration $v_6$, call type $v_7$, calling time $v_8$, call duration $v_9$, ratio of hung-up calls $v_{10}$, mobile phone status $v_{11}$, and conversation time $v_{12}$. Therefore, in the present invention w = 12.

The Gini index GI is defined as:

$$GI_m = \sum_{k=1}^{K} \sum_{k' \neq k} p_{mk}\, p_{mk'} = 1 - \sum_{k=1}^{K} p_{mk}^{2} \tag{3}$$

where K denotes the number of classes, $p_{mk}$ denotes the proportion of class k in node m, and $p_{mk'}$ denotes the proportion of a class other than k in node m.

The variable importance VIM is defined as:

$$VIM_m = GI_m - GI_{left} - GI_{right} \tag{4}$$

where $GI_{left}$ and $GI_{right}$ denote the Gini indexes of the new left and right branch nodes produced by splitting node m.

Finally, the importance measures of all variables are normalized. For any fraud phone feature $v_i$ with importance $VIM_i$, the normalized importance is calculated as shown in formula (5).

$$VIM_i^{norm} = \frac{VIM_i}{\sum VIM} \tag{5}$$

where $\sum VIM$ denotes the sum of the feature importances of the 12 features in the present invention. The features are sorted by importance score, and the top 9 features with scores greater than 0.8 are selected to construct the feature vector of the data, yielding a new fraud phone user call detail record data set that can be used for the subsequent experiments.
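As a minimal sketch of the quantities described in this step, the following Python code computes the Gini index of a node, the Gini decrease of a split, and the normalized importance scores. The function names (`gini_index`, `gini_decrease`, `normalize_importance`, `top_features`) are hypothetical, and the unweighted split form GI_m - GI_left - GI_right is an assumption read from the definitions in the text; library implementations typically weight the child nodes by their share of samples.

```python
from collections import Counter


def gini_index(labels):
    """Gini index of a node: 1 - sum over classes of p_mk squared."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Importance contributed by one split (assumed unweighted form):
    GI of the parent node minus the GI of its two new branch nodes."""
    return gini_index(parent) - gini_index(left) - gini_index(right)


def normalize_importance(vim):
    """Divide each feature's importance by the sum over all features."""
    total = sum(vim.values())
    if total == 0:
        return {name: 0.0 for name in vim}
    return {name: value / total for name, value in vim.items()}


def top_features(vim, k):
    """Names of the k features with the highest importance score."""
    ranked = sorted(vim.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]
```

In the setting of the text, `vim` would hold one raw importance value per field v1 to v12, and `top_features(normalize_importance(vim), 9)` would return the nine fields kept for the feature vector.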

2. Because user call detail record data are typical imbalanced data, the invention resamples the data with hybrid sampling, taking the data processed in step 1 as input. A sampling ratio r is set according to the imbalance ratio between normal phone samples and fraud phone samples; let the number of normal phone samples be p and the number of fraud phone samples be q, then

A sample point $s_i$ is selected, and the Euclidean distance is used to calculate the distance from $s_i$ to the nearby minority-class sample points, giving its r nearest neighbors. For each minority-class fraud phone sample $s_c$, several samples are randomly drawn from its r nearest-neighbor samples, where r ∈ {1, 2, 3, ..., a}; the drawn samples are sample points around $s_c$ other than $s_c$ itself. Each selected neighbor sample is combined with the original sample according to

$$s_{new} = s_c + \mathrm{rand}(0,1) \times (s_c' - s_c)$$

to synthesize a new sample $s_{new}$, where rand(0,1) is a function that generates a random number between 0 and 1 and $s_c'$ denotes each randomly selected neighbor sample. The newly synthesized samples $s_{new}$ are added to the original data set to form a new sample set. In the invention, the original data contain 107,935 normal phone records and 8,448 fraud phone records, 116,383 records in total; after processing with this method there are 107,007 normal phone records and 104,059 fraud phone records, 211,066 records in total.
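The interpolation rule above can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, samples are assumed to be lists of floats, and because the text does not detail the undersampling half of the hybrid method, plain random undersampling of the majority class is assumed here.

```python
import math
import random


def nearest_neighbors(point, candidates, r):
    """The r candidates closest to `point` by Euclidean distance,
    excluding `point` itself (compared by identity)."""
    others = [c for c in candidates if c is not point]
    others.sort(key=lambda c: math.dist(point, c))
    return others[:r]


def synthesize(s_c, neighbor, rng):
    """s_new = s_c + rand(0,1) * (neighbor - s_c), applied per feature."""
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(s_c, neighbor)]


def oversample(minority, r, n_new, rng):
    """Append n_new synthetic minority samples to the minority set.
    Requires at least two minority samples so that neighbors exist."""
    synthetic = []
    for _ in range(n_new):
        s_c = rng.choice(minority)
        neighbor = rng.choice(nearest_neighbors(s_c, minority, r))
        synthetic.append(synthesize(s_c, neighbor, rng))
    return minority + synthetic


def undersample(majority, n_keep, rng):
    """Assumed random undersampling: keep n_keep majority samples."""
    return rng.sample(majority, n_keep)
```

Each synthetic point lies on the line segment between a fraud sample and one of its nearest fraud neighbors, so the new samples stay inside the region already occupied by the minority class.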

3. According to the characteristics of fraud phone user call detail record data, the invention constructs the fraud phone identification model with a boosting-tree algorithm that combines gradient-based one-side sampling and exclusive feature bundling, and uses a random-forest-based hyperparameter optimization algorithm to optimize the parameters of the gradient boosting machine. The performance of the model is judged with multiple indexes: precision, recall, F1 value, and AUC value.

Here, true positive (TP) is the number of fraud phones predicted as fraud phones, false positive (FP) is the number of normal phones predicted as fraud phones, false negative (FN) is the number of fraud phones predicted as normal phones, and true negative (TN) is the number of normal phones predicted as normal phones.

Precision is the proportion of samples predicted as fraud phones that are actually fraud phones, as shown in formula (6).

$$Precision = \frac{TP}{TP + FP} \tag{6}$$

Recall is the proportion of samples that are actually fraud phones that are predicted as fraud phones, as shown in formula (7).

$$Recall = \frac{TP}{TP + FN} \tag{7}$$

F1, short for F-measure, is an evaluation index that balances precision and recall as their harmonic mean, as shown in formula (8).

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{8}$$

AUC is the area under the ROC curve. The ROC curve is plotted from the algorithm's predictions, using the proportion of actually normal phones that are predicted as fraud phones and the proportion of actually fraud phones that are predicted as fraud phones, as shown in formula (9).

$$AUC = \frac{\sum_{i \in fraud} rank_i - \frac{S_{min}(S_{min}+1)}{2}}{S_{min} \times S_{maj}} \tag{9}$$

where $S_{min}$ denotes the number of fraud phones, $S_{maj}$ denotes the number of normal phones, $rank_i$ denotes the rank (serial number) of the i-th sample, and the sum is taken over the fraud phone samples.
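The four evaluation indexes can be sketched in Python as below. The function names are hypothetical; labels are assumed to be encoded as 1 for fraud and 0 for normal, ranks are assumed to be assigned in ascending order of the model score starting at 1, and tied scores are assumed to share their average rank.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) with fraud encoded as 1 and normal as 0."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1:
            fp += 1
        elif truth == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and their harmonic mean (0.0 when undefined)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def average_ranks(scores):
    """1-based ascending ranks; tied scores receive their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def auc_by_rank(y_true, scores):
    """Rank-based AUC: (rank sum of fraud samples - S_min(S_min+1)/2)
    divided by S_min * S_maj."""
    s_min = sum(1 for t in y_true if t == 1)
    s_maj = len(y_true) - s_min
    if s_min == 0 or s_maj == 0:
        raise ValueError("both classes must be present")
    ranks = average_ranks(scores)
    rank_sum = sum(r for r, t in zip(ranks, y_true) if t == 1)
    return (rank_sum - s_min * (s_min + 1) / 2) / (s_min * s_maj)
```

The rank-based form avoids drawing the ROC curve explicitly: it counts how often a fraud sample is scored above a normal sample.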

4. The fraud phone identification model from step 3 outputs a confidence for each sample, from which the probability that the sample is a fraud phone can be judged. Fraud phone discrimination thresholds are set according to the confidence that each sample is a fraud phone and the sample's true label, and the fraud phone dynamic identification interval is constructed. The workflow of the fraud phone dynamic identification interval model is as follows:

step 4.1: prepare the data obtained after the processing in steps 1 and 2: 107,007 normal phone records and 104,059 fraud phone records, 211,066 records in total;

step 4.2: randomly divide the data from step 4.1 into 10 parts, using 8 parts to train the model and 2 parts to test it;

step 4.3: continue optimizing the model with the random-forest-based hyperparameter optimization algorithm until the evaluation indexes of the model on the training set and the test set (precision, recall, F1 value, and AUC value) are all greater than 0.9;

step 4.4: use the model trained in steps 4.2 and 4.3 to output the confidence y of each training sample;

step 4.5: draw a sample scatter plot and analyze the agreement and disagreement between the confidence of each sample and its true label, obtaining a fraud phone dynamic identification interval with 0 ≤ α < β ≤ 1, where α = 0.2 and β = 0.8. When the model output satisfies 0 ≤ y ≤ α, the sample is a normal phone; when α < y < β, the sample is a suspicious phone; when β < y ≤ 1, the sample is a fraud phone;

step 4.6: test and verify the performance of the model with the remaining 2 parts (the test set) divided in step 4.2;

step 4.7: end.
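The decision rule of the dynamic identification interval can be sketched as a small Python function. The names `classify`, `ALPHA`, and `BETA` are hypothetical. The text leaves the exact boundary y = β unassigned, so this sketch assumes it belongs to the fraud class.

```python
ALPHA = 0.2  # lower threshold of the dynamic identification interval
BETA = 0.8   # upper threshold of the dynamic identification interval


def classify(y, alpha=ALPHA, beta=BETA):
    """Map a model confidence y in [0, 1] to one of three classes.

    0 <= y <= alpha   -> "normal"
    alpha < y < beta  -> "suspicious"
    y >= beta         -> "fraud" (y == beta is an assumption of this sketch)
    """
    if not 0.0 <= y <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if y <= alpha:
        return "normal"
    if y < beta:
        return "suspicious"
    return "fraud"
```

Keeping the thresholds as parameters lets an operator shift α and β when the confidence distribution of new data differs from the one observed in the experiments.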

Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 shows a system block diagram of the present invention;

FIG. 2 is a graph showing a portion of the test results of the present invention;

FIG. 3 illustrates a partial sample distribution statistical plot of the present invention;

FIG. 4 shows a partial sample distribution density histogram of the invention;

Detailed Description

For better understanding of the technical solutions of the present invention, the following detailed descriptions of the embodiments of the present invention are provided with reference to the accompanying drawings.

It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

To realize the fraud phone identification system, the technical solution adopted by the invention is a method for constructing a fraud phone dynamic identification interval. The overall system diagram of the invention is shown in FIG. 1, and the method is divided into five steps:

(1) Data preprocessing: the invention takes user call detail record log data as input and processes them, including missing value handling, outlier handling, format unification, and removal of duplicate values. Next, to reduce the influence of differing data scales on the subsequent model, the data are standardized. Finally, the preprocessed data set is output.
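A minimal sketch of three of these preprocessing operations is given below. The text does not say which imputation or standardization technique is used, so mean imputation and z-score standardization are assumptions of this sketch, as are the function names; rows are assumed to be tuples of numeric values with `None` marking a missing value.

```python
from statistics import mean, pstdev


def drop_duplicates(rows):
    """Remove repeated rows while keeping the original order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def fill_missing(rows):
    """Replace None with the mean of the observed values in that column
    (assumed strategy; a column with no observed value is filled with 0.0)."""
    filled_columns = []
    for column in zip(*rows):
        observed = [v for v in column if v is not None]
        fill = mean(observed) if observed else 0.0
        filled_columns.append([fill if v is None else v for v in column])
    return [tuple(row) for row in zip(*filled_columns)]


def standardize(rows):
    """Z-score each column: subtract the mean, divide by the population
    standard deviation (a constant column becomes all zeros)."""
    scaled = []
    for column in zip(*rows):
        mu = mean(column)
        sigma = pstdev(column)
        scaled.append([0.0 if sigma == 0 else (v - mu) / sigma for v in column])
    return [tuple(row) for row in zip(*scaled)]
```

Applied in the order `standardize(fill_missing(drop_duplicates(rows)))`, every numeric column ends up with mean 0 and unit variance, so fields measured in seconds and fields measured as ratios contribute on a comparable scale.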

(2) Feature extraction: the random-forest-based feature extraction method for fraud phone user call detail record data calculates the information gain of each feature dimension in the data set, constructs the node splits of each tree according to the information gain, and finally calculates a score for each feature dimension. The features are sorted by score, and the top 9 features with scores greater than 0.8 are selected to construct the feature vector of the data.

(3) Imbalanced data rebalancing: the data processed in steps (1) and (2) are used as input. A sampling ratio r is set according to the imbalance ratio between normal phone and fraud phone samples; let the number of normal phone samples be p and the number of fraud phone samples be q, then

where r ∈ {1, 2, 3, ..., a},

each selected neighbor sample is combined with the original sample according to $s_{new} = s_c + \mathrm{rand}(0,1) \times (s_c' - s_c)$ to synthesize a new sample $s_{new}$, where rand(0,1) is a function that generates a random number between 0 and 1 and $s_c'$ denotes each randomly selected neighbor sample. The newly synthesized samples $s_{new}$ are added to the original data set to form a new sample set. In the invention, the original data contain 107,935 normal phone records and 8,448 fraud phone records, 116,383 records in total; after processing with this method there are 107,007 normal phone records and 104,059 fraud phone records, 211,066 records in total.

(4) Building the fraud phone identification model: the data from the previous step are randomly divided into 10 parts, and 8 parts are randomly taken as the training set and used as input to train the constructed boosting-tree model based on gradient-based one-side sampling and exclusive feature bundling. The fraud phone identification model outputs the precision, recall, F1 value, and AUC value of sample identification. The model is optimized continuously with the random-forest-based hyperparameter optimization algorithm, and its performance is tested with the remaining 2 parts of the data, until the precision, recall, F1 value, and AUC value of the model on both the training set and the test set are all greater than 0.9.
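The 8:2 split and the stopping criterion of this step can be sketched as follows. The names `split_ten_parts`, `meets_target`, and `tune` are hypothetical, and `evaluate` stands for a caller-supplied function that trains the model with the given hyperparameters and returns a pair of metric dictionaries (training set, test set). For brevity this sketch walks through a fixed list of candidate settings; it does not implement the random-forest-based hyperparameter optimizer named in the text.

```python
import random


def split_ten_parts(samples, train_parts=8, seed=0):
    """Shuffle, cut into 10 parts, and use `train_parts` parts for training
    and the remaining parts for testing."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    parts = [shuffled[i::10] for i in range(10)]
    train = [s for part in parts[:train_parts] for s in part]
    test = [s for part in parts[train_parts:] for s in part]
    return train, test


def meets_target(metrics, threshold=0.9):
    """True when every evaluation index exceeds the threshold."""
    return all(value > threshold for value in metrics.values())


def tune(candidates, evaluate, threshold=0.9):
    """Return the first candidate whose metrics on both the training set
    and the test set all exceed the threshold, or None if none does."""
    for params in candidates:
        train_metrics, test_metrics = evaluate(params)
        if meets_target(train_metrics, threshold) and meets_target(test_metrics, threshold):
            return params
    return None
```

Requiring the threshold on both sets guards against accepting a setting that fits the training data well but generalizes poorly to the held-out 2 parts.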

(5) Building the fraud phone dynamic identification interval: using the fraud phone identification model constructed in the previous step, the new data set formed in step (3) is divided into 10 parts, with 8 parts randomly taken as the training set and 2 parts as the test set. The training set is first used as input, and the model outputs the sample confidences, from which a fraud phone dynamic identification interval with 0 < α < β < 1 is constructed, where α = 0.2 and β = 0.8. When the model output satisfies 0 ≤ y ≤ α, the sample is a normal phone; when α < y < β, the sample is a suspicious phone; when β < y ≤ 1, the sample is a fraud phone. The test set is then used as the model input, and the predicted class of each sample is compared with its true label. The experiment tests the method on the fraud phone call detail record data published by Liuming et al., and some of the test results are shown in FIG. 2. A partial sample distribution statistical plot is shown in FIG. 3, and a partial sample distribution density plot is shown in FIG. 4. The sample distribution statistical plot shows that the model's confidence for normal phone call detail records is mostly below 0.2, that its confidence for fraud phone call detail records is mostly above 0.8, and that the remaining samples lie between 0.2 and 0.8. The experiments on this data set verify the reasonableness and feasibility of the proposed dynamic identification interval with α = 0.2 and β = 0.8.
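One way to check a choice of α and β against labeled data, in the spirit of the distribution plots described above, is to tabulate what share of each true class falls below α, between the thresholds, and at or above β. The sketch below is illustrative only: the function names and the interval labels "low", "middle", and "high" are hypothetical, and the confidences in the usage note are made-up values, not results from the patent.

```python
from collections import Counter


def interval_of(y, alpha=0.2, beta=0.8):
    """Name of the confidence interval that y falls into."""
    if y <= alpha:
        return "low"
    if y < beta:
        return "middle"
    return "high"


def distribution_by_label(confidences, labels, alpha=0.2, beta=0.8):
    """For each true label, the fraction of its samples in each interval."""
    counts = {}
    for y, label in zip(confidences, labels):
        counts.setdefault(label, Counter())[interval_of(y, alpha, beta)] += 1
    return {
        label: {
            name: counter[name] / sum(counter.values())
            for name in ("low", "middle", "high")
        }
        for label, counter in counts.items()
    }
```

For example, `distribution_by_label([0.05, 0.9], ["normal", "fraud"])` places the normal sample entirely in "low" and the fraud sample entirely in "high". Thresholds are well chosen when most normal samples concentrate in "low" and most fraud samples in "high".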

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for constructing a dynamic identification interval of a fraud phone is characterized by comprising the following steps,

step 1: performing feature extraction on fraud telephone user bill data based on a random forest;

step 2: according to the data processed in the step 1, the data is rebalanced by using a hybrid sampling method, so that the influence on the model caused by unbalanced distribution of the data is reduced;

step 3: according to the characteristics of the fraud phone user call detail record data, a fraud phone identification model is constructed, and the model identification effect is measured by using various evaluation indexes;

step 4: according to step 3, judging the probability that the data sample is a fraud phone by using the fraud phone identification model, and constructing a fraud phone dynamic identification interval;

wherein the step 1 specifically comprises the following steps: calculating the information gain of each dimension characteristic in the data set, constructing node splitting of each tree according to the information gain, and finally calculating the score of each dimension data; using original fraud telephone user bill data as input, using VIM to represent the importance measurement of variables, and using GI to represent the Gini index;

the training data set S with n instances is defined as:

$$S = \{s_i\}, \quad i = 1, 2, \ldots, n \tag{1}$$

wherein $s_i$ denotes any sample point in the sample set, n is the number of sample points in the set, and $s_i$ is defined in formula (2);

$$s_i = (x_i, y_i), \quad i = 1, 2, \ldots, n \tag{2}$$

wherein $x_i = \{v_1, v_2, \ldots, v_w\}$ is an instance, $v_j$ denotes a feature of sample $x_i$, and $y_i$ denotes the label corresponding to $x_i$; the labels divide the data into normal phone user call detail record data and fraud phone user call detail record data, that is, C = 2;

the data dimensions used are the following fields: desensitized mobile phone number $v_1$, called mobile phone number $v_2$, call frequency $v_3$, ratio of successful connections $v_4$, average call duration $v_5$, average ringing duration $v_6$, call type $v_7$, calling time $v_8$, call duration $v_9$, ratio of hung-up calls $v_{10}$, mobile phone status $v_{11}$, and conversation time $v_{12}$; that is, w = 12;

the Gini index GI is defined as:

$$GI_m = \sum_{k=1}^{K} \sum_{k' \neq k} p_{mk}\, p_{mk'} = 1 - \sum_{k=1}^{K} p_{mk}^{2} \tag{3}$$

wherein K denotes the number of classes, $p_{mk}$ denotes the proportion of class k in node m, and $p_{mk'}$ denotes the proportion of a class other than k in node m;

the feature importance VIM is defined as:

$$VIM_m = GI_m - GI_{left} - GI_{right} \tag{4}$$

wherein $GI_{left}$ and $GI_{right}$ denote the Gini indexes of the left and right branch nodes of node m, respectively;

finally, all feature importance measures are normalized; for any fraud phone feature $v_i$ with feature importance $VIM_i$, the normalized importance is calculated as shown in formula (5);

$$VIM_i^{norm} = \frac{VIM_i}{\sum VIM} \tag{5}$$

wherein $\sum VIM$ denotes the sum of the feature importances of the 12 features; the features are sorted by feature importance, and the top 9 features with scores greater than 0.8 are selected to construct the feature vector of the data, yielding a new fraud phone user call detail record data set that can be used for subsequent experiments.

2. The method according to claim 1, characterized in that the data is rebalanced using a hybrid sampling method, specifically: setting a sampling ratio r according to the unbalanced ratio of the normal telephone samples to the fraud telephone samples, wherein if the number of the normal telephone samples is p and the number of the fraud telephone samples is q, then

A sample point $s_i$ is selected, and the Euclidean distance is used to calculate the distance from $s_i$ to the nearby minority-class sample points, obtaining its r nearest neighbors; for each minority-class fraud phone sample $s_c$, several samples are randomly drawn from its r nearest-neighbor samples, where r ∈ {1, 2, 3, ..., a}; each selected neighbor sample is combined with the original sample according to $s_{new} = s_c + \mathrm{rand}(0,1) \times (s_c' - s_c)$ to synthesize a new sample $s_{new}$, where rand(0,1) is a function that generates a random number between 0 and 1 and $s_c'$ denotes each randomly selected neighbor sample; the newly synthesized samples $s_{new}$ are added to the original data set to form a new sample set.

3. The method according to claim 1, wherein step 4 is specifically:

step 4.1: inputting a new sample set;

step 4.2: randomly dividing the data obtained in step 4.1, wherein one part of the data is used for training the model and the other part is used for testing the model;

step 4.3: continuously optimizing the model by using a random-forest-based hyperparameter optimization algorithm until the precision, recall, F1 value, and AUC value of the model on the training set and the test set are all greater than 0.9, wherein precision refers to the proportion of samples predicted as fraud phones that are actually fraud phones, and recall refers to the proportion of samples that are actually fraud phones that are predicted as fraud phones; F1, short for F-measure, is an evaluation index that balances precision and recall; and AUC is the area under the ROC curve, wherein the ROC curve is plotted from the results predicted by the algorithm, using the proportion of actually normal phones that are predicted as fraud phones and the proportion of actually fraud phones that are predicted as fraud phones;

step 4.5: drawing a sample scatter plot, analyzing the agreement and disagreement between the confidence of each sample and the true label of the sample, and obtaining a fraud phone dynamic identification interval with 0 ≤ α < β ≤ 1, wherein α = 0.2 and β = 0.8; when the model output satisfies 0 ≤ y ≤ α, the sample is a normal phone; when α < y < β, the sample is a suspicious phone; when β < y ≤ 1, the sample is a fraud phone.