CN114006982B - Harassment number identification method based on classification gradient lifting algorithm

Info

Publication number: CN114006982B
Application number: CN202111288535.XA
Authority: CN
Inventors: 周晓辉; 蒋胜波; 史慧; 顾湘芸; 马钰璐; 李华; 金忻; 陈益辉; 郑珍珍; 顾清
Original assignee: Best Tone Information Service Corp Ltd
Current assignee: Best Tone Information Service Corp Ltd
Priority date: 2021-11-02
Filing date: 2021-11-02
Publication date: 2024-04-30
Anticipated expiration: 2041-11-02
Also published as: CN114006982A

Abstract

The invention relates to the fields of network communication technology and machine learning modeling, and in particular to a harassment number identification method based on a classification gradient lifting algorithm. The method comprises the following steps: first, samples are selected; the samples are then cleaned and fused to form an original data set containing multidimensional data; a feature variable set is then extracted from the original data set; a CatBoost model is constructed using the feature variable set; and finally the trained final model is deployed into a production system to identify harassment numbers in specific services and handle them in a targeted manner. The invention addresses data imbalance with the SMOTE-Tomek algorithm, effectively reduces the feature dimension with an XGBoost feature selection method, combines the bat algorithm with CatBoost model training to avoid becoming trapped in a local optimum, and uses the CatBoost model to effectively improve the prediction accuracy for harassment numbers.

Description

Harassment number identification method based on classification gradient lifting algorithm

Technical Field

The invention relates to the field of network communication technology and machine learning algorithm modeling, in particular to a harassment number identification method based on a classification gradient lifting algorithm.

Background

With the continuing development of new-generation 5G communication technology, users enjoy the convenience that communication technology brings to life and work. At the same time, harassment call behavior and methods keep evolving: harassment patterns are changing and targets are chosen ever more precisely. Harassment calls not only disrupt the normal life and work of communication users, but also seriously damage public trust in telecom operators and harm people's personal interests. According to big data analysis, harassment calls were marked more than 263 million times in 2020, an increase of 38.42% over the 190 million marks in 2019. By number source, fixed-line numbers account for 48.27% of harassment calls, mobile numbers for 35.58%, and other types for 16.15%. Here, numbers beginning with 1 are mobile numbers, numbers beginning with 0 are fixed-line numbers, and special numbers such as 400/95/96 belong to other types. Analyzed over a weekly cycle, harassment call volume is relatively highest on working days and slightly lower on rest days, but overall no pronounced peak appears. Geographically, economically developed regions and provinces are the key sources and targets of harassment calls: Guangdong province has the highest share at 9.48%, while Jiangsu and Shandong provinces account for 4.87% and 4.33% respectively.

Harassment calls are frequent and greatly affect people's normal lives, but the problem is complex and has not yet been thoroughly solved. How to identify harassment numbers effectively from signaling call ticket data is therefore a technical problem that urgently needs to be solved. In practice, harassment and fraud numbers are characterized by high calling frequency, large call volume, short call duration and the like. Harassment numbers also fall into several categories, such as sales calls, takeaway calls and fraud calls, yet the prior art offers no solution for multi-category identification of harassment numbers.

Disclosure of Invention

The invention aims to provide a harassment number identification method based on a classification gradient lifting algorithm that addresses the above problems in the prior art. It provides a multi-category harassment number identification method with high accuracy and high stability, thereby helping to build a healthy communication network environment and putting harassment call remediation into practice.

In order to achieve the above purpose, the technical scheme adopted by the invention is to provide a harassment number identification method based on a classification gradient lifting algorithm, which is characterized by comprising the following steps: firstly, selecting a sample, then performing data cleaning and fusion on the sample to form an original data set containing multidimensional data, and then extracting a characteristic variable set from the original data set; constructing an identification model by utilizing the characteristic variable set, and finally deploying the final model after training into a production system for identifying harassment numbers in specific businesses and carrying out targeted processing on the harassment numbers;

The samples comprise service telephone samples obtained from a signaling call ticket database and classified telephone samples obtained from a black-and-white list database;

the recognition model is a CatBoost model; constructing the recognition model comprises initializing the CatBoost model, setting a model precision threshold, training the CatBoost model with the feature variable set, and outputting the current recognition model as the final model when the model precision threshold is met during training.

Further, the CatBoost model is constructed using a homogeneous ensemble (integration) algorithm.

Further, extracting the feature variable set from the original data set uses the SMOTE-Tomek algorithm and specifically comprises: first converting the original data set into a model template data set by comprehensive sampling, and then dividing the model template data set into a data training set and a data test set; extracting the feature variable set from the data training set for model training; and using the data test set together with the model precision threshold to judge the model training termination condition and determine the final model.

Further, the model template data set is divided into a data training set and a data testing set by adopting a five-fold cross validation method.

Further, before the recognition model is constructed using the feature variable set, an XGBoost feature selection method is used to measure the feature importance of each feature variable in the feature variable set, and the feature importance is used to select the best classification features, so that the feature variable set is optimized by eliminating redundant feature variables.

Further, in the XGBoost feature selection method, the feature importance includes weight, gain, and coverage.

Further, when the CatBoost model is trained with the feature variable set, the parameters of the CatBoost model are optimized using a bat algorithm.

Further, the process of identifying the harassment number in the specific business comprises the steps of firstly identifying a suspected harassment number by the final model, and then comparing the suspected harassment number with the black-and-white list database; and if the suspected harassment number is not matched with the white list data in the black and white list database, identifying the suspected harassment number as a harassment number, otherwise, identifying the suspected harassment number as a normal number.

Further, the service telephone sample is the telephone sample of the last N months in the signaling call ticket database; the classified telephone sample is formed from third party annotation data, customer feedback information and complaint data to form a closed loop for dynamically improving the capacity of the identification model.

Further, after the recognition model meets the model precision threshold during training, a review threshold is set and the recognition accuracy of the current recognition model is reviewed using external data; when the recognition accuracy reaches the review threshold, the recognition model is output as the final model; and the review threshold is greater than or equal to the model precision threshold.

Further, the feature variable set comprises calling number features and called number features; the calling number characteristics comprise the calling number calling frequency, the calling number frequency, the called number calling frequency, the called number frequency, the calling number switching-on frequency, the calling number average ringing duration and the calling number average call duration; the called number features include home location distribution of called numbers, called number dispersion, number segment distribution of called numbers, called number dispersion, called number regional distribution and called number dispersion.
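As a rough illustration of how per-number features of this kind could be derived from call records, the following minimal sketch aggregates a few of them by calling number. The record layout, the feature names and the definition of dispersion (distinct called numbers divided by total calls) are assumptions made for this example only; the patent does not define them.

```python
from collections import defaultdict

# Hypothetical call records:
# (calling_number, called_number, connected, ring_seconds, talk_seconds)
records = [
    ("13800000001", "02100000001", True, 5, 12),
    ("13800000001", "02100000002", False, 20, 0),
    ("13800000001", "01000000003", True, 4, 8),
    ("13900000002", "02100000001", True, 10, 300),
    ("13900000002", "02100000001", True, 8, 240),
]


def calling_features(records):
    """Aggregate simple per-calling-number features from raw call records."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec[0]].append(rec)

    features = {}
    for caller, calls in grouped.items():
        total = len(calls)
        connected = [c for c in calls if c[2]]
        distinct_called = {c[1] for c in calls}
        features[caller] = {
            "call_count": total,
            "connect_count": len(connected),
            "avg_ring_seconds": sum(c[3] for c in calls) / total,
            # Average talk time over connected calls only.
            "avg_talk_seconds": (
                sum(c[4] for c in connected) / len(connected) if connected else 0.0
            ),
            # Assumed definition: share of calls that reach a distinct number.
            "called_dispersion": len(distinct_called) / total,
        }
    return features


features = calling_features(records)
```

In this toy data the first number calls three different numbers briefly, while the second repeatedly holds long calls with one number, which is the kind of contrast such features are meant to expose.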

Further, the identified harassment number is applied to an incoming call business card, a security reminding business and an anti-harassment refusing business.

In view of the above technical features, the present invention has the following advantages:

1. The SMOTE-Tomek algorithm is used to comprehensively sample the original data, which effectively addresses the problem of imbalanced data.

2. The CatBoost model is selected for identification. It applies an ordering principle that converts the traditional gradient boosting algorithm into an ordered boosting algorithm, improving the generalization capability of the model. It also constructs combinations of categorical feature values with a greedy strategy and uses them as additional features, which helps the model capture higher-order dependencies and further improves prediction accuracy.

3. For feature variable selection for the CatBoost model, the XGBoost feature selection method is used to measure the importance of each feature, redundant features are deleted and the best classification features are selected, effectively reducing the feature dimension.

4. For parameter optimization of the CatBoost model, a bat algorithm with strong search capability is introduced to optimize the parameters, which improves the CatBoost model's handling of its parameters and enhances the prediction accuracy and robustness of the model.

Drawings

FIG. 1 is a flow chart of establishing the identification model in a preferred embodiment of the harassment number identification method based on a classification gradient lifting algorithm of the present invention;

FIG. 2 is a flow chart of a preferred embodiment of a harassment number identification method based on a classification gradient lifting algorithm of the present invention.

Detailed Description

The application is further described below in conjunction with the detailed description. It is to be understood that these examples are illustrative of the present application and are not intended to limit the scope of the present application. Furthermore, it should be understood that various changes and modifications can be made by one skilled in the art after reading the teachings of the present application, and such equivalents are intended to fall within the scope of the application as defined in the appended claims.

Referring to fig. 1, the invention discloses a harassment number identification method based on a classification gradient lifting algorithm. As shown in the figure, the construction of the identification model in a preferred embodiment comprises the following steps:

Step S101, sample selection.

Sample selection consists of two steps. The first step obtains the service telephone samples from the signaling call ticket database and the classified telephone samples from the black-and-white list database. The second step comprehensively samples the extracted samples using the SMOTE-Tomek algorithm and completes data cleaning and fusion, forming an original data set containing multidimensional data.

Step S102, inputting characteristic variables.

After the feature variable set is extracted from the original data set, the feature importance of each feature variable is measured and evaluated using the XGBoost feature selection method, and the feature variable set is optimized accordingly, finally yielding the feature variable set used to train the identification model. This set comprises call duration, call frequency, call address, called region dispersion and called number dispersion.

Step S103, building the recognition model.

The recognition model is a CatBoost model. After the CatBoost model has been initialized and constructed, a model precision threshold is set and the CatBoost model is trained with the feature variable set. When the model precision threshold is met during training, the current recognition model is output as the test run model. During training, the bat algorithm (BA) is introduced for parameter optimization, which speeds up convergence of the training process.

When the CatBoost model is built, a homogeneous ensemble (integration) algorithm can be introduced according to actual service requirements, to better ensure the stability of the CatBoost model, optimize the recognition effect and improve recognition accuracy.

Step S104, verification and evaluation of the model.

Before the test run model is actually deployed to a service application, a test run is needed to verify and evaluate it, ensuring that its recognition accuracy in the review is no worse than its recognition accuracy in training. For the test run, a review threshold is set in the range of 0.90 to 0.95 inclusive, preferably 0.90. The harassment numbers identified by the model are checked by manual callback verification and third-party data verification. When the accuracy is greater than the review threshold of 0.90, the test run model passes accuracy verification and can be deployed into the service as the final model; otherwise it cannot be deployed into the actual service and must be retrained.

Manual callback verification checks the accuracy of the model by manually sampling the harassment numbers identified by the model and calling them back; the sampling rate is 5% to 15%, preferably 10%. Third-party data verification compares the identified harassment numbers against third-party (Internet) labeling data to check the accuracy of the model.
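The sampling step of manual callback verification might be sketched as follows. This is only an illustration: the function names, the fixed random seed and the rounding of the sample size are assumptions, and the callback outcomes would in practice come from human operators rather than from code.

```python
import random


def sample_for_callback(flagged_numbers, rate=0.10, seed=0):
    """Draw a random subset of model-flagged numbers for manual callback."""
    if not 0.05 <= rate <= 0.15:
        raise ValueError("sampling rate is expected to lie between 5% and 15%")
    if not flagged_numbers:
        return []
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    size = max(1, round(len(flagged_numbers) * rate))
    return rng.sample(list(flagged_numbers), size)


def callback_accuracy(confirmed):
    """Share of sampled numbers that the callback confirmed as harassment."""
    return sum(1 for ok in confirmed if ok) / len(confirmed)


# Hypothetical flagged numbers produced by the model.
flagged = [f"1380000{i:04d}" for i in range(200)]
to_call = sample_for_callback(flagged, rate=0.10)

# Hypothetical operator results: 19 of 20 callbacks confirm harassment.
accuracy = callback_accuracy([True] * 19 + [False])
```

The resulting accuracy estimate is what would be compared against the review threshold.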

Step S105, deployment and application of the model.

After rechecking, the model can be deployed into a production system for identifying harassment numbers in specific businesses. And carrying out targeted processing on the identified harassment numbers by combining specific services.

Referring to fig. 2, the invention discloses a harassment number identification method based on a classification gradient lifting algorithm. As shown, a preferred embodiment thereof comprises the steps of:

Step S201, the already classified telephone samples are obtained from the black-and-white list database and combined with the service telephone samples obtained from the signaling call ticket database, and ETL operations such as data cleaning and conversion are performed. The service telephone samples are call records covering 3 months. The classified telephone samples are formed from customer feedback information and complaint data collected in normal service operation.

Step S202, obtaining an original data set required by the identification model through data association fusion. The data dimension of the original dataset is 50 dimensions.

Step S203, the original data set is comprehensively sampled using the SMOTE-Tomek algorithm to form a model template data set, which is then divided into a data training set and a data test set. Specifically, SMOTE is first used for oversampling; after the samples have been expanded, the Tomek links algorithm is used to delete points that overlap or lie very close to one another; the data are then divided between the data training set and the data test set in a 4:1 ratio according to the five-fold cross validation method.

The feature variable set is extracted from the data training set for model training. The data test set is used together with the model precision threshold to judge the model training termination condition and determine the final model.
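The combined sampling described in step S203 can be illustrated with a small pure-Python sketch. It is a simplified stand-in, not the patent's implementation: the function names, the choice of k, the toy two-dimensional data, and the decision to drop both members of each Tomek link are all assumptions of this example.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by interpolating towards near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def tomek_links(points, labels):
    """Index pairs (i, j) of opposite-class points that are mutual nearest neighbours."""
    def nearest(i):
        return min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )

    links = []
    for i in range(len(points)):
        j = nearest(i)
        if i < j and labels[i] != labels[j] and nearest(j) == i:
            links.append((i, j))
    return links


def smote_tomek(majority, minority, seed=0):
    """Oversample the minority class to parity, then drop both ends of each Tomek link."""
    minority = list(minority) + smote(minority, len(majority) - len(minority), seed=seed)
    points = list(majority) + minority
    labels = [0] * len(majority) + [1] * len(minority)
    drop = {idx for pair in tomek_links(points, labels) for idx in pair}
    kept = [(p, y) for idx, (p, y) in enumerate(zip(points, labels)) if idx not in drop]
    return [p for p, _ in kept], [y for _, y in kept]


# Hypothetical 2-D feature vectors: label 0 = normal number, 1 = harassment number.
normal = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.4, 0.4), (1.9, 2.0)]
harassment = [(2.0, 2.0), (2.2, 2.1), (2.1, 2.3)]
X, y = smote_tomek(normal, harassment)
```

Oversampling first and cleaning afterwards means the Tomek step can also remove synthetic points that ended up next to the opposite class.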

Step S204, the feature variables needed by the model are extracted from the data training set. To address feature redundancy, the XGBoost feature selection method is used to measure feature importance, redundant features are deleted, and the best classification features are selected.

In the feature selection technique of the XGBoost algorithm, feature importance can be used to make the model interpretable. The get_score method of the Booster class in XGBoost outputs feature importance, and its importance_type parameter supports the following three feature importance calculation methods:

importance_type=weight (weight): feature importance is the number of times the feature is used as a split attribute across all trees.

importance_type=gain (gain): feature importance is the average reduction in loss when the feature is used as a split attribute.

importance_type=cover (coverage): feature importance is the feature's coverage of samples when it is used as a split attribute.
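To make the three definitions concrete, the sketch below computes them from a hand-written list of tree splits rather than from a trained XGBoost model. The split records, feature names and the selection threshold are hypothetical, and sample counts are used as a simplified stand-in for coverage.

```python
from collections import defaultdict

# Hypothetical splits from a small tree ensemble:
# (feature_name, loss_reduction_at_split, samples_reaching_split)
splits = [
    ("call_frequency", 12.0, 1000),
    ("avg_call_duration", 6.0, 600),
    ("call_frequency", 4.0, 400),
    ("called_dispersion", 1.0, 50),
]


def feature_importance(splits, importance_type="weight"):
    """Compute weight, average gain, or average cover per feature."""
    counts = defaultdict(int)
    gains = defaultdict(float)
    covers = defaultdict(float)
    for name, gain, cover in splits:
        counts[name] += 1
        gains[name] += gain
        covers[name] += cover
    if importance_type == "weight":
        return dict(counts)
    if importance_type == "gain":
        return {name: gains[name] / counts[name] for name in counts}
    if importance_type == "cover":
        return {name: covers[name] / counts[name] for name in counts}
    raise ValueError(f"unsupported importance_type: {importance_type}")


def select_features(splits, min_gain):
    """Keep features whose average gain reaches min_gain (an assumed rule)."""
    gain = feature_importance(splits, "gain")
    return sorted(name for name, value in gain.items() if value >= min_gain)


weight = feature_importance(splits, "weight")
gain = feature_importance(splits, "gain")
cover = feature_importance(splits, "cover")
selected = select_features(splits, min_gain=5.0)
```

Here the low-gain feature is dropped as redundant, which mirrors the dimension reduction the method describes.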

Step S205, the feature variable set is extracted from the data training set, using the feature importance given by the XGBoost feature selection method as the selection criterion. The feature variables fall mainly into calling number features and called number features; the called number features mainly comprise the home location distribution of called numbers, called number dispersion, the number segment distribution of called numbers, the regional distribution of called numbers, and the like.

Step S206, the training parameters of the CatBoost model are initialized and a model precision threshold is set in the range of 0.85 to 0.90 inclusive, preferably 0.90. The model precision threshold used in training must be less than or equal to the review threshold set for the review. In this embodiment, the model precision threshold is 0.9 and the review threshold is also 0.9.

The identification technique based on the CatBoost model applies an ordering principle that converts the traditional gradient boosting algorithm into an ordered boosting algorithm, which improves the generalization capability of the model. Combinations of categorical feature values are constructed using a greedy strategy and used as additional features, which helps the model capture higher-order dependencies and further improves prediction accuracy.

Step S207, the CatBoost model is trained according to the five-fold cross validation method: the model is trained on a data training set comprising 80% of the model template data set, the training result is then verified on a data test set comprising 20% of the model template data set, and the precision, recall and F1 score of the model are calculated (refer to FIG. 1 for the detailed flow).
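The 80/20 partition implied by five-fold cross validation and the three evaluation metrics can be sketched as below. The helper names and the toy labels are assumptions, and the metrics are shown for the binary case (harassment versus normal) for brevity.

```python
def five_fold_indices(n_samples, n_folds=5):
    """Yield (train_idx, test_idx); each fold holds out about 1/n_folds of the data."""
    indices = list(range(n_samples))
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 score for the positive (harassment) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


folds = list(five_fold_indices(10))

# Hypothetical labels and predictions on one held-out fold.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

With ten samples, each of the five folds trains on eight and tests on two, which is the 4:1 split used in the embodiment.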

Step S208, it is judged whether the CatBoost model reaches the preset model precision threshold of 0.90, that is, the predictions of the CatBoost model on the data test set are compared with the known results in the data test set. If the accuracy is below 0.90, the method proceeds to step S209. Otherwise the model undergoes the further review; a model that passes the review proceeds to step S210, and a model that fails must be retrained.
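One possible reading of the threshold logic in steps S206 and S208 is sketched below. The function names and return codes are hypothetical, and treating an accuracy equal to the review threshold as a pass is an assumption of this sketch.

```python
def validate_thresholds(model_threshold, review_threshold):
    """Check the threshold ranges and ordering stated in the embodiment."""
    if not 0.85 <= model_threshold <= 0.90:
        raise ValueError("model precision threshold must lie in [0.85, 0.90]")
    if not 0.90 <= review_threshold <= 0.95:
        raise ValueError("review threshold must lie in [0.90, 0.95]")
    if review_threshold < model_threshold:
        raise ValueError("review threshold must not be below the model threshold")


def next_step(test_accuracy, review_accuracy=None,
              model_threshold=0.90, review_threshold=0.90):
    """Return a code for the step that follows the accuracy check."""
    validate_thresholds(model_threshold, review_threshold)
    if test_accuracy < model_threshold:
        return "S209"      # optimise parameters and keep training
    if review_accuracy is None:
        return "REVIEW"    # manual callback and third-party verification pending
    if review_accuracy >= review_threshold:
        return "S210"      # deploy to production
    return "RETRAIN"       # review failed


decision_low = next_step(0.87)
decision_pending = next_step(0.92)
decision_deploy = next_step(0.92, review_accuracy=0.93)
decision_retrain = next_step(0.92, review_accuracy=0.88)
```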

Step S209, the CatBoost model requires many parameters to be set, which increases the probability of becoming trapped in a local optimum. A bat algorithm with strong search capability is therefore introduced to optimize the parameters, which improves the CatBoost model's handling of its parameters and enhances the prediction accuracy and robustness of the model.

The bat algorithm is a meta-heuristic algorithm modeled on the foraging behavior of bats: each microbat emits high-frequency pulses and locates a target by analyzing the characteristic echo information returned from it.
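A compact sketch of a bat-algorithm search over two hyperparameters follows. It is illustrative only: the objective is a toy stand-in for a cross-validated error (no CatBoost model is trained here), the parameter names learning_rate and depth, the bounds, the local step scale and the choice to pull each bat towards the current best are assumptions of this sketch, and depth is treated as continuous, so it would need rounding before use.

```python
import math
import random


def bat_algorithm(objective, bounds, n_bats=15, n_iter=60, f_min=0.0, f_max=1.0,
                  alpha=0.9, gamma=0.9, r0=0.5, seed=0):
    """Minimise objective inside the box bounds; returns (best, best_fit, history)."""
    rng = random.Random(seed)

    def clip(x):
        return [min(max(v, lo), hi) for v, (lo, hi) in zip(x, bounds)]

    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_bats)]
    vel = [[0.0] * len(bounds) for _ in range(n_bats)]
    fit = [objective(p) for p in pos]
    loudness = [1.0] * n_bats
    pulse_rate = [0.0] * n_bats
    best_i = min(range(n_bats), key=lambda i: fit[i])
    best, best_fit = list(pos[best_i]), fit[best_i]
    history = [best_fit]

    for t in range(1, n_iter + 1):
        mean_loudness = sum(loudness) / n_bats
        for i in range(n_bats):
            # Global move: random frequency scales the pull towards the best bat.
            freq = f_min + (f_max - f_min) * rng.random()
            vel[i] = [v + (b - x) * freq for v, x, b in zip(vel[i], pos[i], best)]
            cand = clip([x + v for x, v in zip(pos[i], vel[i])])
            # Local move: random walk around the best solution (assumed step scale).
            if rng.random() > pulse_rate[i]:
                cand = clip([b + rng.uniform(-1, 1) * mean_loudness * 0.1 * (hi - lo)
                             for b, (lo, hi) in zip(best, bounds)])
            cand_fit = objective(cand)
            # Accept with probability tied to loudness; then get quieter, pulse faster.
            if cand_fit <= fit[i] and rng.random() < loudness[i]:
                pos[i], fit[i] = cand, cand_fit
                loudness[i] *= alpha
                pulse_rate[i] = r0 * (1 - math.exp(-gamma * t))
            if cand_fit < best_fit:
                best, best_fit = list(cand), cand_fit
        history.append(best_fit)
    return best, best_fit, history


def validation_error(params):
    """Toy stand-in for a validation error as a function of two hyperparameters."""
    learning_rate, depth = params
    return (learning_rate - 0.1) ** 2 + ((depth - 6.0) / 10.0) ** 2


search_bounds = [(0.01, 0.3), (2.0, 10.0)]
best, best_fit, history = bat_algorithm(validation_error, search_bounds)
```

In a real setting the objective would train and validate the model for each candidate, so the number of bats and iterations would be chosen with that cost in mind.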

Step S210, the model that meets the preset model precision threshold is deployed to the production environment servers.

Step S211, in the specific service, the call records to be predicted are input to the CatBoost model, which outputs the prediction result.

Step S212, exception data are removed from the prediction result using the black-and-white list database. If a suspected harassment number does not match the whitelist data in the black-and-white list database, it is identified as a harassment number; otherwise it is considered a normal number.
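The whitelist check in step S212 amounts to a set lookup, as in the minimal sketch below. The function name and the sample numbers are hypothetical.

```python
def filter_with_whitelist(suspected_numbers, whitelist):
    """Split model-flagged numbers into confirmed harassment and normal numbers."""
    allowed = set(whitelist)
    harassment = [n for n in suspected_numbers if n not in allowed]
    normal = [n for n in suspected_numbers if n in allowed]
    return harassment, normal


# Hypothetical model output and whitelist entries.
suspected = ["13800000001", "02100000002", "4000000003"]
whitelist = ["02100000002"]
harassment, normal = filter_with_whitelist(suspected, whitelist)
```

Applying the whitelist after prediction keeps known legitimate high-volume callers from being treated as harassment numbers.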

Step S213, the final harassment numbers are applied in the incoming call business card service, the security reminder service and the anti-harassment refusing service.

Step S214, continuously collecting customer service feedback data and complaint data in service scene application in the service operation process, and updating a black-and-white list database by combining third party (Internet) labeling data for next model training to form a model closed loop.

Third-party (Internet) labeling data contain harassment call information that end users actively mark using the 360 or Tencent mobile phone assistants and that Internet providers collect and consolidate.

After the harassment numbers are obtained, harassment call remediation is put into practice by the following means:

1. Security reminder: a flash message from the telecom operator alerts the user on the handset in real time that the incoming call is a harassment call. The user then decides whether to answer.

2. Communication assistant: relying on China Telecom's network capabilities, a robot secretary automatically answers the identified harassment call, and the content is pushed to a WeChat official account after text recognition.

3. Anti-harassment rejection: relying on China Telecom's network capabilities, identified harassment calls can be refused.

The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the invention, and all equivalent structures or equivalent processes using the descriptions and drawings of the present invention or directly or indirectly applied to other related technical fields are included in the scope of the invention.

Claims

1. A harassment number identification method based on a classification gradient lifting algorithm is characterized by comprising the following steps: firstly, selecting a sample, then performing data cleaning and fusion on the sample to form an original data set containing multidimensional data, and then extracting a characteristic variable set from the original data set; constructing an identification model by utilizing the characteristic variable set, and finally deploying the final model after training into a production system for identifying harassment numbers in specific businesses and carrying out targeted processing on the harassment numbers;

the identification model is a CatBoost model, and constructing the identification model comprises initializing the CatBoost model, setting a model precision threshold, training the CatBoost model with the feature variable set, and outputting the current identification model as the final model when the model precision threshold is met during training;

the CatBoost model is constructed using a homogeneous ensemble (integration) algorithm;

extracting the feature variable set from the original data set uses the SMOTE-Tomek algorithm and specifically comprises: first converting the original data set into a model template data set by comprehensive sampling, and then dividing the model template data set into a data training set and a data test set; extracting the feature variable set from the data training set for model training; the data test set is used together with the model precision threshold to judge the model training termination condition and determine the final model;

dividing the model template data set into a data training set and a data testing set by adopting a five-fold cross validation method;

before the identification model is constructed using the feature variable set, the feature importance of each feature variable in the feature variable set is measured using an XGBoost feature selection method, and the feature importance is used to select the best classification features, so that the feature variable set is optimized by deleting redundant feature variables;

in the XGBoost feature selection method, feature importance includes weight, gain, and coverage.

2. The harassment number identification method based on a classification gradient lifting algorithm according to claim 1, characterized in that, when the CatBoost model is trained with the feature variable set, the parameters of the CatBoost model are optimized using a bat algorithm.

3. The harassment number identification method based on a classification gradient lifting algorithm as claimed in claim 1, wherein the process of identifying harassment numbers in a specific business comprises identifying suspected harassment numbers by the final model first, and then comparing the suspected harassment numbers with the black-and-white list database; and if the suspected harassment number is not matched with the white list data in the black and white list database, identifying the suspected harassment number as a harassment number, otherwise, identifying the suspected harassment number as a normal number.

4. The harassment number identification method based on the classification gradient lifting algorithm as claimed in claim 1, wherein the service call sample is a last N months call sample in the signaling call ticket database; the classified telephone sample is formed from third party annotation data, customer feedback information and complaint data to form a closed loop for dynamically improving the capacity of the identification model.

5. The harassment number identification method based on the classification gradient lifting algorithm as claimed in claim 1, wherein after the identification model meets the model precision threshold during training, a review threshold is set and the identification accuracy of the current identification model is reviewed using external data; when the identification accuracy reaches the review threshold, the identification model is output as the final model; and the review threshold is greater than or equal to the model precision threshold.

6. The harassment number identification method based on the classification gradient lifting algorithm as claimed in claim 1, wherein the feature variable set comprises calling number features and called number features; the calling number characteristics comprise the calling number calling frequency, the calling number frequency, the called number calling frequency, the called number frequency, the calling number switching-on frequency, the calling number average ringing duration and the calling number average call duration; the called number features include home location distribution of called numbers, called number dispersion, number segment distribution of called numbers, called number dispersion, called number regional distribution and called number dispersion.

7. The harassment number identification method based on the classification gradient lifting algorithm as claimed in claim 1, wherein the identified harassment number is applied to an incoming call business card service, a security reminder service and an anti-harassment refusing service.