CN118035740A - Training method and device for fraud phone recognition model, electronic equipment and medium

Info

Publication number: CN118035740A
Application number: CN202410094956.6A
Authority: CN
Inventors: 辜芳琴; 钟钢; 朱银清; 陈绚华; 王俊伟
Original assignee: China Telecom Corp Ltd
Current assignee: China Telecom Corp Ltd
Priority date: 2024-01-23
Filing date: 2024-01-23
Publication date: 2024-05-14

Abstract

The application provides a training method and device for a fraud phone identification model, electronic equipment and a medium. The method comprises the following steps: processing the signaling traffic characteristics associated with GOIP fraud numbers to obtain a time sequence data set; performing relevant feature screening processing on the time sequence data set based on a preset dynamic time warping algorithm to obtain intermediate sample features screened from the time sequence data set; performing feature weighted screening processing on the intermediate sample features based on a Boosting feature screening algorithm to obtain target sample features screened from the intermediate sample features, and constructing a target training sample set; and training the fraud phone recognition model to be trained based on the target training sample set to obtain the fraud phone recognition model. The application can improve the accuracy of the constructed fraud phone identification model, thereby improving the accuracy and efficiency of fraud phone identification and effectively avoiding unnecessary losses caused to users by fraud phone calls.

Description

Training method and device for fraud phone recognition model, electronic equipment and medium

Technical Field

The present application relates to the field of communications technologies, and in particular, to a training method and apparatus for a fraud phone identification model, an electronic device, and a medium.

Background

In telecommunications fraud, GOIP (GSM over Internet Protocol) devices can run unattended, can be remotely controlled and, when combined with a card pool, allow the device and the SIM card to be physically separated. Fraudsters are therefore increasingly turning to GOIP devices to commit crimes, which makes timely and rapid identification of GOIP fraud phones necessary.

In the prior art, a GOIP fraud phone identification model is usually constructed from static calling features or timing features. The technical defect is that identification of GOIP fraud phones is lagging and inaccurate, and GOIP fraud phones cannot be predicted.

Disclosure of Invention

The technical problem to be solved by the embodiments of the application is to provide a training method and device for a fraud phone identification model, electronic equipment and a medium, so that the accuracy of the constructed fraud phone identification model is effectively improved, the accuracy and efficiency of fraud phone identification are improved, and unnecessary losses caused to users by fraud phones are effectively avoided.

In a first aspect, an embodiment of the present application provides a method for training a fraud phone identification model, where the method includes:

processing the signaling traffic characteristics associated with the GOIP fraud number to obtain a time sequence data set;

performing relevant feature screening processing on the time sequence data set based on a preset dynamic time warping algorithm to obtain intermediate sample features screened from the time sequence data set;

performing feature weighted screening processing on the intermediate sample features based on a Boosting feature screening algorithm to obtain target sample features screened from the intermediate sample features, and constructing a target training sample set;

training a fraud phone recognition model to be trained based on the target training sample set to obtain the fraud phone recognition model.

Optionally, the processing the signaling traffic feature associated with the GOIP fraud number to obtain a time-series data set includes:

constructing a GOIP fraud number base according to the fraud numbers on the GOIP equipment;

Extracting signaling traffic characteristics from the GOIP fraud number library to obtain the signaling traffic characteristics associated with the GOIP fraud numbers;

Integrating the signaling traffic characteristics according to a time sequence and a preset duration to generate an initial time sequence data set;

and carrying out secondary data processing on the initial time sequence data set based on a preset weighted moving average algorithm to obtain the time sequence data set.

Optionally, the performing a correlation feature screening process on the time sequence data set based on a preset dynamic time warping algorithm to obtain intermediate sample features screened from the time sequence data set includes:

Carrying out standardization processing on the data in the time sequence data set to obtain standardized data;

and carrying out similarity calculation on the standardized data based on the preset dynamic time warping algorithm, and screening the intermediate sample features from the time sequence data set according to a similarity calculation result.

Optionally, the calculating the similarity of the standardized data based on the preset dynamic time warping algorithm, and screening the intermediate sample feature from the time sequence data set according to the similarity calculation result includes:

Acquiring a candidate time sequence data set corresponding to the collected non-GOIP fraud numbers;

performing similarity calculation on the target data in the candidate time sequence data set and the standardized data based on the preset dynamic time warping algorithm to obtain a similarity calculation result, the target data being the data in the candidate time sequence data set whose time is the same as that of the standardized data;

and screening the intermediate sample features from the time series data set and the candidate time series data set based on the similarity calculation result.

Optionally, the performing feature weighted screening processing on the intermediate sample features based on a Boosting feature screening algorithm to obtain target sample features screened from the intermediate sample features, and constructing a target training sample set includes:

Performing feature screening on the intermediate sample features based on the Boosting feature screening algorithm to obtain reference sample features;

And carrying out feature weighting processing on the reference sample features according to weights corresponding to preset feature types, and screening the target sample features from the reference sample features according to weighting results so as to construct the target training sample set.

Optionally, the performing feature screening on the intermediate sample features based on the Boosting feature screening algorithm to obtain reference sample features includes:

acquiring an initialized sample weight of the intermediate sample feature and a pre-trained weak classifier;

Processing the intermediate sample features based on the weak classifier to obtain classification prediction results corresponding to the intermediate sample features;

calculating, according to the classification prediction result, a classification error corresponding to the intermediate sample features;

updating the initialized sample weight based on the classification error to obtain an updated sample weight corresponding to the intermediate sample features;

taking the updated sample weight as the initialized sample weight, and iteratively executing, for a set number of rounds, the steps from processing the intermediate sample features based on the weak classifier to obtain the classification prediction result through updating the initialized sample weight based on the classification error to obtain the updated sample weight corresponding to the intermediate sample features;

determining a feature importance score corresponding to the intermediate sample features according to the sample weight of the intermediate sample features in each round of processing;

the reference sample feature is screened from the intermediate sample features based on the feature importance score.

Optionally, after the training the fraud phone identification model to be trained based on the target training sample set, obtaining the fraud phone identification model, the method further includes:

Converting the signaling traffic characteristics of the number to be identified within a preset time length from the current time into the traffic characteristics of a time sequence;

Inputting the telephone traffic characteristics of the time sequence to the fraud telephone identification model to obtain a fraud prediction result of the number to be identified;

in response to the fraud prediction result indicating a fraud number, acquiring call initiation base station information corresponding to the number to be identified;

and positioning GOIP equipment corresponding to the number to be identified based on the call initiation base station information.

In a second aspect, an embodiment of the present application provides a training apparatus for a fraud phone identification model, the apparatus comprising:

The time sequence data set acquisition module is used for processing the signaling traffic characteristics associated with the GOIP fraud number to obtain a time sequence data set;

The intermediate sample feature acquisition module is used for carrying out relevant feature screening processing on the time sequence data set based on a preset dynamic time warping algorithm to obtain intermediate sample features screened from the time sequence data set;

The target training sample set construction module is used for carrying out feature weighted screening treatment on the intermediate sample features based on a Boosting feature screening algorithm to obtain target sample features screened from the intermediate sample features, and constructing a target training sample set;

and the fraud telephone identification model acquisition module is used for training the fraud telephone identification model to be trained based on the target training sample set to obtain the fraud telephone identification model.

Optionally, the time series data set acquisition module includes:

the number library construction unit is used for constructing a GOIP fraud number library according to the fraud numbers on the GOIP equipment;

The signaling traffic feature acquisition unit is used for extracting signaling traffic features from the GOIP fraud number library to obtain the signaling traffic features associated with the GOIP fraud numbers;

the initial sequence data set generating unit is used for integrating the signaling traffic characteristics according to the time sequence and the preset duration to generate an initial time sequence data set;

And the time sequence data set acquisition unit is used for carrying out secondary data processing on the initial time sequence data set based on a preset weighted moving average algorithm to obtain the time sequence data set.

Optionally, the intermediate sample feature acquisition module includes:

The standardized data acquisition unit is used for carrying out standardized processing on the data in the time sequence data set to obtain standardized data;

And the intermediate sample feature screening unit is used for carrying out similarity calculation on the standardized data based on the preset dynamic time warping algorithm, and screening the intermediate sample features from the time sequence data set according to a similarity calculation result.

Optionally, the intermediate sample feature screening unit includes:

A candidate sequence data set obtaining subunit, configured to obtain a candidate time sequence data set corresponding to the collected non-GOIP fraud number;

The similarity calculation result obtaining subunit is used for carrying out similarity calculation on the target data in the candidate time sequence data set and the standardized data based on the preset dynamic time warping algorithm to obtain a similarity calculation result; the target data are the data in the candidate time sequence data set, and the time of the data is the same as that of the standardized data;

and the intermediate sample feature screening subunit is used for screening the intermediate sample features from the time sequence data set and the candidate time sequence data set based on the similarity calculation result.

Optionally, the target training sample set construction module includes:

The reference sample feature acquisition unit is used for carrying out feature screening on the intermediate sample features based on the Boosting feature screening algorithm to obtain reference sample features;

The target training sample set construction unit is used for carrying out feature weighting processing on the reference sample features according to weights corresponding to preset feature types, and screening the target sample features from the reference sample features according to weighting results so as to construct the target training sample set.

Optionally, the reference sample feature acquiring unit includes:

An initial weight acquisition subunit, configured to acquire an initialized sample weight of the intermediate sample feature and a pre-trained weak classifier;

the classification prediction result obtaining subunit is used for processing the intermediate sample characteristics based on the weak classifier to obtain classification prediction results corresponding to the intermediate sample characteristics;

the classification error calculation subunit is used for calculating and obtaining the classification error corresponding to the intermediate sample characteristic according to the classification prediction result;

An updated sample weight obtaining subunit, configured to update the initialized sample weight based on the classification error, to obtain an updated sample weight corresponding to the intermediate sample feature;

the iterative execution subunit is used for taking the updated sample weight as the initialized sample weight and iteratively invoking, for a set number of rounds, the classification prediction result obtaining subunit, the classification error calculation subunit and the updated sample weight obtaining subunit;

The characteristic importance score determining subunit is used for determining the characteristic importance score corresponding to the intermediate sample feature according to the sample weight of the intermediate sample feature in each round of processing;

and the reference sample feature screening subunit is used for screening the reference sample features from the intermediate sample features based on the feature importance scores.

Optionally, the apparatus further comprises:

The telephone traffic feature conversion module is used for converting the signaling telephone traffic feature of the number to be identified within the preset time length from the current time into the telephone traffic feature of the time sequence;

a fraud prediction result obtaining module, configured to input the traffic characteristics of the time sequence to the fraud telephone identification model, so as to obtain a fraud prediction result of the number to be identified;

the base station information acquisition module is used for acquiring, in response to the fraud prediction result indicating a fraud number, call initiation base station information corresponding to the number to be identified;

And the GOIP equipment positioning module is used for positioning the GOIP equipment corresponding to the number to be identified based on the call initiation base station information.

In a third aspect, an embodiment of the present application provides an electronic device, including:

a processor, a memory and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the training method of the fraud phone identification model described in any one of the above.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the training method of the fraud phone identification model described in any one of the above.

Compared with the prior art, the embodiment of the application has the following advantages:

In the embodiment of the application, the signaling traffic characteristics associated with the GOIP fraud number are processed to obtain a time sequence data set. Relevant feature screening processing is performed on the time sequence data set based on a preset dynamic time warping algorithm to obtain intermediate sample features screened from the time sequence data set. Feature weighted screening processing is performed on the intermediate sample features based on a Boosting feature screening algorithm to obtain target sample features screened from the intermediate sample features, and a target training sample set is constructed. The fraud phone recognition model to be trained is trained based on the target training sample set to obtain the fraud phone recognition model. By constructing a time sequence data set of signaling traffic characteristics and weighting the important features with the Boosting-based feature screening algorithm, the embodiment of the application achieves higher recognition efficiency and higher recognition accuracy for fraud phones, and can effectively avoid unnecessary losses caused to users by fraud phones.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.

Drawings

FIG. 1 is a flowchart showing steps of a training method for a fraud phone identification model according to an embodiment of the present application;

Fig. 2 is a schematic diagram of a classifier training process according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a GOIP fraud phone identification process based on a time sequence model according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a training device for a fraud phone identification model according to an embodiment of the present application;

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

In order that the above-recited objects, features and advantages of the present application will become more readily apparent, a more particular description of the application will be rendered by reference to the appended drawings and appended detailed description.

The terminology used in the embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

Referring to fig. 1, a flowchart of the steps of a training method for a fraud phone identification model according to an embodiment of the present application is shown. As shown in fig. 1, the training method for the fraud phone identification model may include step 101, step 102, step 103 and step 104.

Step 101: processing the signaling traffic characteristics associated with the GOIP fraud number to obtain a time sequence data set.

The embodiment of the application can be applied to the scene of training the GOIP fraud telephone identification model.

In this embodiment, the signaling traffic characteristics associated with the GOIP fraud number may be obtained while training the GOIP fraud phone identification model. And processing the signaling traffic characteristics associated with the GOIP fraud number to obtain a time series data set. The implementation may be described in detail in connection with the following specific implementations.

In a specific implementation of the present application, the step 101 may include:

Substep A1: constructing a GOIP fraud number library according to the fraud numbers on the GOIP equipment.

In this embodiment, when the fraud phone recognition model is trained, the fraud numbers on the GOIP equipment may be acquired, and a GOIP fraud number library may be constructed according to them. In a specific implementation, signaling call detail record analysis can be used to collect the relevant characteristics of the number cards involved in GOIP fraud cases, including network access duration, network access channel, call frequency, called home distribution ratio, active duration, silent duration, number of active base stations, position track, number IMEI change, terminal information base, base station cell information base, age group and the like, and the GOIP fraud number library is constructed according to these characteristics.

Wherein, network access duration: the time elapsed from the first activation of the number to the time of use.

Network access channel: the channel through which the number was first activated.

Call frequency: the number of outgoing calls made by the number per unit time.

Called home distribution ratio: the number of calls dialed to numbers in different cities per unit time / the number of outgoing calls made by the number per unit time.

Active duration: the total number of hours, within 24 hours, in which the number of outgoing calls per hour is greater than 5.

Silent duration: the total number of hours, within 24 hours, in which the number of outgoing calls per hour is 0.

Position track: the degree of coincidence of the base station track during calling periods.

Number IMEI change: the cumulative number of different mobile terminals used by the number.

Terminal information base: information on the mobile phone in which the number card is installed, including price, whether 5G is supported, whether dual SIM is supported, and the like.
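By way of non-limiting illustration, the active duration, silent duration and called home distribution ratio defined above could be computed as in the following Python sketch. The function names, the 24-element list of hourly call counts and the city labels are assumptions introduced only for this example; in particular, the ratio is read here as the share of outgoing calls placed to numbers homed in a city other than the caller's, which is one possible interpretation of the definition.

```python
from typing import Sequence


def active_hours(hourly_calls: Sequence[int], threshold: int = 5) -> int:
    """Count the hours whose outgoing-call count exceeds the threshold (> 5 in the text)."""
    return sum(1 for calls in hourly_calls if calls > threshold)


def silent_hours(hourly_calls: Sequence[int]) -> int:
    """Count the hours with no outgoing calls at all."""
    return sum(1 for calls in hourly_calls if calls == 0)


def called_home_ratio(called_cities: Sequence[str], caller_city: str) -> float:
    """Share of outgoing calls placed to numbers homed in another city (assumed reading)."""
    if not called_cities:
        return 0.0
    other = sum(1 for city in called_cities if city != caller_city)
    return other / len(called_cities)


# Hypothetical 24-hour profile: silent at night, bursts of calls during the day.
hourly = [0] * 8 + [6, 7, 3, 0, 10, 12, 6, 5, 0, 0, 9, 0, 0, 0, 0, 0]
print(active_hours(hourly), silent_hours(hourly))
print(called_home_ratio(["Guangzhou", "Shenzhen", "Guangzhou", "Beijing"], "Guangzhou"))
```

Note that an hour with exactly 5 outgoing calls is not counted as active, since the definition uses a strict "greater than 5".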

After the GOIP fraud number library is constructed according to the fraud numbers on the GOIP equipment, sub-step A2 is performed.

Substep A2: extracting signaling traffic characteristics from the GOIP fraud number library to obtain the signaling traffic characteristics associated with the GOIP fraud numbers.

After the GOIP fraud number library is constructed according to the fraud numbers on the GOIP equipment, signaling traffic characteristics can be extracted from the GOIP fraud number library to obtain the signaling traffic characteristics associated with the GOIP fraud numbers. That is, signaling traffic characteristics such as call details, position, user information and terminal information are extracted from the GOIP fraud number library.

After the signaling traffic characteristics are extracted from the GOIP fraud number library, sub-step A3 is performed.

Substep A3: integrating the signaling traffic characteristics according to the time sequence and the preset duration to generate an initial time sequence data set.

After the signaling traffic characteristics associated with the GOIP fraud numbers are extracted from the GOIP fraud number library, they can be integrated according to the time sequence and the preset duration to generate an initial time sequence data set. Specifically, the signaling traffic related features obtained for each hour may be arranged in chronological order and integrated into a time sequence data set, so as to obtain the call sequence of each number within a period of time, i.e., the initial time sequence data set.
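The hourly integration described in this sub-step can be sketched as follows. This is a minimal sketch under stated assumptions: the record layout (number, call start time, duration in seconds), the function name build_hourly_series and the choice of call count and total duration as the per-hour features are all hypothetical and are not specified by the application.

```python
from collections import defaultdict
from datetime import datetime


def build_hourly_series(records, start, hours):
    """Bucket call records into consecutive one-hour slots per number.

    records: iterable of (number, call_start: datetime, duration_seconds)
    Returns {number: [(call_count, total_duration), ...]} with `hours` slots each.
    """
    series = defaultdict(lambda: [[0, 0.0] for _ in range(hours)])
    for number, call_start, duration in records:
        slot_index = int((call_start - start).total_seconds() // 3600)
        if 0 <= slot_index < hours:  # ignore records outside the observation window
            slot = series[number][slot_index]
            slot[0] += 1
            slot[1] += duration
    return {number: [tuple(slot) for slot in slots] for number, slots in series.items()}


# Hypothetical records for two numbers over a 4-hour window.
window_start = datetime(2024, 1, 1, 0, 0)
calls = [
    ("A", datetime(2024, 1, 1, 0, 10), 30),
    ("A", datetime(2024, 1, 1, 0, 50), 45),
    ("A", datetime(2024, 1, 1, 3, 5), 60),
    ("B", datetime(2024, 1, 1, 1, 0), 10),
    ("B", datetime(2024, 1, 2, 0, 0), 10),  # outside the window, dropped
]
print(build_hourly_series(calls, window_start, hours=4))
```

Each number ends up with a fixed-length sequence, which is the form needed by the later smoothing and similarity steps.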

After integrating the signaling traffic characteristics according to the time sequence and the preset duration to generate an initial time sequence data set, a sub-step A4 is performed.

Substep A4: carrying out secondary data processing on the initial time sequence data set based on a preset weighted moving average algorithm to obtain the time sequence data set.

After the signaling traffic characteristics are integrated according to the time sequence and the preset duration to generate an initial time sequence data set, secondary data processing can be performed on the initial time sequence data set based on a preset weighted moving average algorithm, so that the time sequence data set is obtained.

Specifically, the formula of the preset weighted moving average algorithm may be shown in the following formula (1):

\hat{y}_{t+1} = w_1 y_t + w_2 y_{t-1} + \cdots + w_N y_{t-N+1} = \sum_{i=1}^{N} w_i \, y_{t-i+1} (1)

In the above formula (1), \hat{y}_{t+1} is the predicted value for period t+1, w_i is the weight of the observation of period t-i+1, y_{t-i+1} is the observed value of period t-i+1, and N is the number of weights. The formula for adjusting the weights may be shown in the following formula (2):

w_i' = w_i + 2k \, e_{i+1} \, y_{t-i+1} (2)

In the above formula (2), i = 1, 2, ..., N; t = N, N+1, ..., n, where n is the number of observations in the series; w_i is the i-th weight before adjustment, w_i' is the i-th weight after adjustment, k is the learning constant, and e_{i+1} is the prediction error of period i+1.

The preset weighted moving average algorithm repeatedly adjusts the weights according to the prediction errors, so that the error is minimized, and the adjusted target time sequence data set is obtained.
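The repeated weight adjustment can be sketched in Python as follows. This is a minimal sketch, not the applicant's implementation: the function name adaptive_wma, the equal initial weights and the number of passes are assumptions, and the error term in formula (2) is taken here to be the error of the forecast just made for the next period, which is how adaptive weighting of this kind is commonly applied.

```python
def adaptive_wma(series, n_weights, k=0.1, epochs=1, weights=None):
    """Weighted moving average whose weights are corrected after every forecast.

    Returns (adjusted_weights, forecast_for_next_period).
    """
    # Assumption: start from equal weights unless the caller supplies others.
    w = list(weights) if weights is not None else [1.0 / n_weights] * n_weights
    for _ in range(epochs):
        for t in range(n_weights, len(series)):
            # Most recent observation first, so w[0] pairs with y_t as in formula (1).
            window = [series[t - i] for i in range(1, n_weights + 1)]
            forecast = sum(wi * yi for wi, yi in zip(w, window))
            error = series[t] - forecast
            # Formula (2): move each weight in proportion to error and observation.
            w = [wi + 2.0 * k * error * yi for wi, yi in zip(w, window)]
    latest = [series[len(series) - i] for i in range(1, n_weights + 1)]
    return w, sum(wi * yi for wi, yi in zip(w, latest))


# Hypothetical hourly call counts, scaled to [0, 1] so that k = 0.1 stays stable.
hourly_calls = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
adjusted, next_value = adaptive_wma(hourly_calls, n_weights=2, k=0.1, epochs=50)
print(adjusted, next_value)
```

The learning constant k has to be small relative to the magnitude of the observations, otherwise the corrections overshoot; scaling the series beforehand is one simple way to keep the update stable.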

After the signaling traffic characteristics associated with the GOIP fraud number are processed to obtain the time sequence data set, step 102 is performed.

Step 102: performing relevant feature screening processing on the time sequence data set based on a preset dynamic time warping algorithm to obtain intermediate sample features screened from the time sequence data set.

The dynamic time warping (DTW) algorithm constructs a correspondence between the elements of two sequences of different lengths according to the nearest-distance principle, and thereby evaluates the similarity of the two sequences.

After the signaling traffic characteristics associated with the GOIP fraud number are processed to obtain a time sequence data set, the time sequence data set can be subjected to relevant characteristic screening processing based on a preset dynamic time warping algorithm to obtain intermediate sample characteristics screened from the time sequence data set. The implementation may be described in detail in connection with the following specific implementations.

In a specific implementation of the present application, the step 102 may include:

Substep B1: carrying out standardization processing on the data in the time sequence data set to obtain standardized data.

In this embodiment, DTW may be used to identify correlation in the time sequence data, so as to find the partial data with the highest correlation with the target time sequence and select it as training samples, ensuring that the selected samples are representative and cover the various patterns of fraud numbers and normal numbers.

When similarity calculation is performed, the data in the time sequence data set can first be standardized to obtain standardized data. Standardization scales the data and removes its units, converting it into dimensionless pure values so that indexes of different units or orders of magnitude can be compared and weighted; after standardization the data has a mean of 0 and a standard deviation of 1.
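A mean of 0 and a standard deviation of 1 correspond to z-score standardization, which can be sketched as follows. The function name z_score and the handling of a constant sequence (mapped to all zeros) are assumptions made for this illustration.

```python
from statistics import mean, pstdev


def z_score(values):
    """Rescale a sequence to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Assumption: a constant sequence carries no shape information, so map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical hourly call counts for one number.
counts = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
standardized = z_score(counts)
print(standardized)
```

After this step, a sequence of call counts and a sequence of call durations are on the same scale and can be compared by the same distance measure.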

After normalizing the data in the time series data set to obtain normalized data, a sub-step B2 is performed.

Substep B2: carrying out similarity calculation on the standardized data based on the preset dynamic time warping algorithm, and screening the intermediate sample features from the time sequence data set according to a similarity calculation result.

After the data in the time sequence data set is standardized to obtain standardized data, similarity calculation can be performed on the standardized data based on the preset dynamic time warping algorithm, and the intermediate sample features are screened out from the time sequence data set according to the similarity calculation result. The preset dynamic time warping algorithm calculates the similarity between two time sequences, and the partial data with the greatest correlation with the target time sequence is rapidly selected as training samples, which reduces the amount of calculation and improves the operation efficiency of the algorithm to a certain extent. In a specific implementation, a candidate time sequence data set corresponding to the collected non-GOIP fraud numbers may be obtained. Similarity calculation is then performed, based on the preset dynamic time warping algorithm, between the target data in the candidate time sequence data set and the standardized data to obtain a similarity calculation result, where the target data is the data in the candidate time sequence data set whose time is the same as that of the standardized data. The intermediate sample features are screened from the time sequence data set and the candidate time sequence data set based on the similarity calculation result. That is, a target time sequence (i.e., the time sequence for which similar sequences are sought among the normal and fraud data) is set according to the time sequences corresponding to the collected normal numbers and fraud numbers. For example, if the calls of the collected normal and fraud numbers are concentrated in the first hour and the fourth hour, the data set of the target time sequence is mainly used to search for other numbers with similar call counts and call durations at the same time points.

In this embodiment, the preset dynamic time warping algorithm sets the search range by means of a global constraint and confines the search path to the interior of the warping window. Specifically, the search path is restricted to three rectangles inside the parallelogram warping window, so that points outside the rectangles no longer need to be calculated. This further narrows the search range of the warping path and reduces the amount of calculation of the original algorithm, thereby improving operation efficiency; the improvement is more obvious when the two time sequences are long.
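The effect of a global constraint on the DTW search can be sketched as follows. For simplicity the sketch uses a fixed-width band around the diagonal rather than the three-rectangle partition of a parallelogram window described above, so it illustrates the general idea of skipping cells outside the window and is not the specific constraint of this embodiment. The names dtw_distance and most_similar, the absolute-difference local cost and the sample sequences are assumptions.

```python
import math


def dtw_distance(a, b, window=None):
    """DTW distance between sequences a and b, restricted to a band of half-width `window`."""
    n, m = len(a), len(b)
    if window is None:
        window = max(n, m)  # no constraint: the full cost matrix is searched
    window = max(window, abs(n - m))  # the band must be wide enough to reach the end cell
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells inside the band are evaluated; everything else stays infinite.
        for j in range(max(1, i - window), min(m, i + window) + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def most_similar(target, candidates, top_k, window=None):
    """Return the names of the top_k candidate sequences closest to the target."""
    ranked = sorted(candidates.items(), key=lambda item: dtw_distance(target, item[1], window))
    return [name for name, _ in ranked[:top_k]]


# Hypothetical standardized call-count sequences keyed by number.
target_sequence = [1, 2, 3, 2, 1]
candidate_sequences = {"x": [1, 2, 3, 2, 1], "y": [5, 5, 5, 5, 5], "z": [1, 1, 2, 3, 2, 1]}
print(most_similar(target_sequence, candidate_sequences, top_k=2, window=2))
```

With a band of half-width r, roughly n(2r + 1) cells are filled instead of n times m, which is where the saving for long sequences comes from.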

After the relevant feature screening processing is performed on the time sequence data set based on the preset dynamic time warping algorithm to obtain the intermediate sample features screened from the time sequence data set, step 103 is performed.

Step 103: performing feature weighted screening processing on the intermediate sample features based on a Boosting feature screening algorithm to obtain target sample features screened from the intermediate sample features, and constructing a target training sample set.

Boosting is a method for improving the accuracy of a weak classification algorithm, and more generally of any given learning algorithm: a series of prediction functions is constructed and then combined in a certain way into a single prediction function.

After the time series data set is subjected to relevant feature screening processing based on a preset dynamic time warping algorithm to obtain intermediate sample features screened from the time series data set, the intermediate sample features can be subjected to feature weighted screening processing based on a Boosting feature screening algorithm to obtain target sample features screened from the intermediate sample features, and a target training sample set is constructed. Specifically, feature screening can be performed based on a Boosting feature screening algorithm, and feature weighting can be performed according to empirically obtained weights, so as to screen out important features. The implementation may be described in detail in connection with the following specific implementations.
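The weighting of screened features by empirically obtained, type-specific weights can be sketched as follows. All feature names, feature types, importance scores and weights in the example are hypothetical, and multiplying the importance score by the type weight before keeping the top-ranked features is only one plausible way to combine the two.

```python
def weighted_screen(importance, feature_type, type_weight, top_k):
    """Scale each feature's importance by its type weight and keep the top_k features."""
    scored = {
        name: score * type_weight.get(feature_type[name], 1.0)  # unknown types keep weight 1
        for name, score in importance.items()
    }
    ranked = sorted(scored, key=lambda name: scored[name], reverse=True)
    return ranked[:top_k]


# Hypothetical importance scores from a Boosting-based screening step.
importance_scores = {"call_frequency": 0.30, "silent_hours": 0.25,
                     "imei_changes": 0.20, "age_group": 0.25}
feature_types = {"call_frequency": "traffic", "silent_hours": "traffic",
                 "imei_changes": "terminal", "age_group": "user"}
# Hypothetical empirical weights per feature type.
type_weights = {"traffic": 1.0, "terminal": 2.0, "user": 0.5}
print(weighted_screen(importance_scores, feature_types, type_weights, top_k=2))
```

In this example the terminal-related feature overtakes the traffic features once the type weight is applied, which shows how domain experience can reorder a purely data-driven ranking.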

In a specific implementation of the present application, the step 103 may include:

Substep C1: performing feature screening on the intermediate sample features based on the Boosting feature screening algorithm to obtain reference sample features.

In this embodiment, feature screening may be performed on the intermediate sample features based on the Boosting feature screening algorithm to obtain the reference sample features. In this example, a final strong classifier may be formed by iterative training on the original data set; then, from the feature combinations that make up the strong classifier, the corresponding feature set, that is, a data set containing the reference sample features, is output. This is described in detail in the following specific implementations.

In another specific implementation of the present application, the above sub-step C1 may include:

Substep D1: acquiring initialized sample weights of the intermediate sample features and a pre-trained weak classifier.

In this embodiment, the initialized sample weights of the intermediate sample features and the pre-trained weak classifier may be preset.

Substep D2: processing the intermediate sample features based on the weak classifier to obtain a classification prediction result corresponding to the intermediate sample features.

The intermediate sample features may then be processed by the weak classifier to obtain the corresponding classification prediction result.

Substep D3: calculating the classification error corresponding to the intermediate sample features according to the classification prediction result.

The classification error corresponding to the intermediate sample features may then be calculated from the classification prediction result.

Substep D4: updating the initialized sample weights based on the classification error to obtain updated sample weights corresponding to the intermediate sample features.

After the classification error is calculated, the initialized sample weights are updated according to the classification error to obtain the updated sample weights corresponding to the intermediate sample features.

Substep D5: taking the updated sample weights as the initialized sample weights, and iteratively executing, for a set number of rounds, the steps from processing the intermediate sample features based on the weak classifier to obtain the classification prediction result through updating the initialized sample weights based on the classification error to obtain the updated sample weights.

After the updated sample weights are obtained, they may be used as the initialized sample weights, and sub-steps D2 to D4 may be executed iteratively for the set number of rounds.

Substep D6: determining feature importance scores corresponding to the intermediate sample features according to the sample weights of the intermediate sample features in each round of processing.

After the set number of training rounds has been executed, the feature importance scores corresponding to the intermediate sample features may be determined according to the sample weights of the intermediate sample features in each round of processing.

Substep D7: screening the reference sample features from the intermediate sample features based on the feature importance scores.

The reference sample features may then be screened from the intermediate sample features according to the feature importance scores.

For the training samples constructed by the DTW algorithm, the process of feature set screening using the Boosting algorithm may be as shown in fig. 2. The specific steps are as follows:

1. Initialize the sample weights w_i = 1/n, where n is the number of samples, so that each sample initially contributes equally to the model;

2. For each round of iteration (e.g., T rounds in total), perform the following steps:

2.1 Train a weak classifier (e.g., a decision tree) using the current sample weights;

2.2 Predict the whole data set using the trained weak classifier;

2.3 Calculate the classification error, namely the sum of the weights of the misclassified samples;

2.4 Calculate the weight of the weak classifier, which is usually related to the classification error;

2.5 Update the sample weights, increasing the weights of the misclassified samples and reducing the weights of the correctly classified samples;

3. Calculate the weight of each feature in each round of training to obtain the feature importance scores;

4. Sort the feature importance scores in descending order and select the first k features as the finally selected features;

5. Return the selected feature set.
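The steps above can be sketched as follows, under several assumptions that the text does not fix: the weak classifier is a one-feature decision stump, the labels are -1/+1, the classifier weight follows the usual AdaBoost formula, and the importance score of a feature is the sum of the weights of the weak classifiers that split on it. All function names are hypothetical.

```python
import math


def best_stump(X, y, w):
    """Return (error, feature, threshold, polarity) of the stump with the
    lowest weighted error. A stump predicts `polarity` when
    x[feature] <= threshold and `-polarity` otherwise."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for pol in (1, -1):
                # Step 2.3: error = sum of weights of misclassified samples.
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (pol if row[f] <= thr else -pol) != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best


def boosting_feature_scores(X, y, rounds=10):
    """Run AdaBoost-style rounds and return one importance score per feature."""
    n = len(X)
    w = [1.0 / n] * n                       # step 1: equal initial weights
    scores = [0.0] * len(X[0])
    for _ in range(rounds):                 # step 2
        err, f, thr, pol = best_stump(X, y, w)        # steps 2.1 to 2.3
        err = min(max(err, 1e-10), 1 - 1e-10)         # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)       # step 2.4
        scores[f] += alpha                  # step 3 (assumed definition)
        # Step 2.5: raise weights of misclassified samples, lower the rest.
        w = [wi * math.exp(-alpha * yi * (pol if row[f] <= thr else -pol))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return scores


def select_top_k(scores, k):
    """Steps 4 and 5: indices of the k features with the highest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]
```

For example, with samples whose first feature separates the two classes and whose second feature does not, select_top_k(boosting_feature_scores(X, y), 1) returns the index of the first feature. Other importance definitions (for instance, impurity-based scores from deeper trees) would fit the description equally well.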

After feature screening has been performed on the intermediate sample features based on the Boosting feature screening algorithm and the reference sample features have been obtained, substep C2 is executed.

Substep C2: performing feature weighting processing on the reference sample features according to weights corresponding to preset feature types, and screening the target sample features from the reference sample features according to the weighting results so as to construct the target training sample set.

After the intermediate sample features are subjected to feature screening by the Boosting-based feature screening algorithm to obtain reference sample features, feature weighting processing can be performed on the reference sample features according to weights corresponding to preset feature types, and target sample features are screened from the reference sample features according to weighting results, so that a target training sample set is constructed.

In a specific implementation, after the reference sample features are obtained, the important features may be weighted (w1, w2, w3, w4, w5) in combination with expert experience, as shown in table 1 below:

Table 1:

As shown in table 1 above, the feature weight w1 may be increased for abnormal call features, the feature weight w2 may be increased for multi-IMEI (International Mobile Equipment Identity) features, and so on.

According to the embodiment of the application, the Boosting-based feature screening algorithm is adopted and the important features are weighted in combination with the experience of GOIP equipment research specialists, so that the built model is more accurate, the recognition efficiency is higher, and the recognition accuracy for fraud calls is higher.
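The weighting by feature type can be sketched as below. The feature names, the type labels, and the numeric weights are hypothetical placeholders; the actual values of w1 to w5 come from expert experience and are not reproduced here.

```python
def weight_and_select(scores, feature_types, type_weights, k):
    """Multiply each feature's importance score by the weight of its
    feature type, then keep the k features with the highest weighted score.

    scores:        dict mapping feature name -> importance score
    feature_types: dict mapping feature name -> feature type label
    type_weights:  dict mapping feature type label -> expert weight
                   (types without an entry keep a neutral weight of 1.0)
    """
    weighted = {
        name: score * type_weights.get(feature_types.get(name), 1.0)
        for name, score in scores.items()
    }
    ranked = sorted(weighted, key=lambda name: weighted[name], reverse=True)
    return ranked[:k], weighted


# Hypothetical example values, for illustration only.
example_scores = {"call_count": 0.5, "imei_count": 0.4, "avg_duration": 0.45}
example_types = {"call_count": "abnormal_call",
                 "imei_count": "multi_imei",
                 "avg_duration": "other"}
example_weights = {"abnormal_call": 1.5, "multi_imei": 2.0}
selected, weighted_scores = weight_and_select(
    example_scores, example_types, example_weights, k=2)
```

In this example the multi-IMEI feature overtakes the others after weighting, which is the intended effect of raising the weight of a feature type that experts consider indicative of GOIP devices.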

After the Boosting-based feature screening algorithm performs feature weighted screening processing on the intermediate sample features to obtain the target sample features screened from the intermediate sample features and construct a target training sample set, step 104 is performed.

Step 104: training a fraud phone recognition model to be trained based on the target training sample set to obtain the fraud phone recognition model.

In this example, the fraud phone identification model to be trained may be an improved LSTM (Long Short-Term Memory) model.

After the target training sample set is obtained, the fraud phone recognition model to be trained can be trained based on the target training sample set to obtain the fraud phone recognition model. Specifically, the target training sample set can be divided into a training set, a test set and a validation set, and the training set is fed into the improved LSTM model for training to obtain the target model used for prediction, namely the fraud phone identification model.
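The division of the target training sample set can be sketched as follows. The 70/20/10 ratios, the shuffling, and the function name split_samples are assumptions; the text does not specify how the division is made, and a chronological split may be preferable for time-series samples.

```python
import random


def split_samples(samples, train=0.7, test=0.2, seed=0):
    """Shuffle the samples and split them into training, test and
    validation sets. Whatever remains after the training and test
    portions becomes the validation set."""
    items = list(samples)
    random.Random(seed).shuffle(items)   # fixed seed for reproducibility
    n_train = round(len(items) * train)
    n_test = round(len(items) * test)
    train_set = items[:n_train]
    test_set = items[n_train:n_train + n_test]
    validation_set = items[n_train + n_test:]
    return train_set, test_set, validation_set
```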

The improved LSTM model adds a self-attention mechanism at the output gate and adds a BN (batch normalization) layer and a Dropout layer at each hidden layer, which mitigates overfitting of the model and accelerates convergence.

By constructing a time-series signaling traffic feature data set and weighting the important features with the Boosting-based feature screening algorithm, the embodiment of the application achieves higher recognition efficiency and higher recognition accuracy for fraud calls, and can effectively avoid unnecessary losses that fraud calls cause to users.

After training to obtain the fraud phone identification model, the fraud phone identification model can be used for fraud phone identification. The implementation may be described in detail in connection with the following specific implementations.

In a specific implementation of the present application, after the step 104, the method may further include:

Step E1: converting the signaling traffic characteristics of the number to be identified within a preset time length from the current time into time-series traffic characteristics.

In this embodiment, after the fraud phone identification model is obtained through training, the signaling traffic characteristics of the number to be identified within a preset time length from the current time can be converted into time-series traffic characteristics.

After the signaling traffic characteristics of the number to be identified within the preset time length from the current time are converted into time-series traffic characteristics, step E2 is performed.

Step E2: inputting the time-series traffic characteristics into the fraud phone identification model to obtain a fraud prediction result for the number to be identified.

After the conversion, the time-series traffic characteristics can be input into the fraud phone identification model to obtain the fraud prediction result for the number to be identified. That is, the hourly feature data of the number to be identified over a period of time is fed into the target model to predict whether there is a fraud risk at the next time point.
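As a minimal sketch of the conversion in step E1, the following turns call timestamps into an hourly sequence covering the preset duration. It uses the call count as the only traffic feature; the record format (a list of datetime objects), the 24-hour default, and the function name are assumptions for illustration.

```python
from datetime import datetime, timedelta


def hourly_call_counts(call_times, now, hours=24):
    """Count the calls in each of the `hours` complete hours before `now`,
    returned oldest hour first."""
    end = now.replace(minute=0, second=0, microsecond=0)
    start = end - timedelta(hours=hours)
    counts = [0] * hours
    for t in call_times:
        if start <= t < end:             # ignore calls outside the window
            index = int((t - start).total_seconds() // 3600)
            counts[index] += 1
    return counts
```

In practice each hour would carry a vector of signaling traffic features rather than a single count, and the resulting sequence would be preprocessed in the same way as the training data before being passed to the model.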

Step E3: in response to the fraud prediction result indicating that the number to be identified is a fraud number, acquiring call initiation base station information corresponding to the number to be identified.

When the fraud prediction result indicates a fraud number, the call initiation base station information corresponding to the number to be identified is acquired.

Step E4: locating the GOIP device corresponding to the number to be identified based on the call initiation base station information.

The GOIP device corresponding to the number to be identified can then be located based on the call initiation base station information. Specifically, the base station information of the calls initiated by the abnormal number can be matched with broadband address information to obtain a fixed-network broadband account; the number-related information is synchronously reported to the local public security authority to obtain the abnormal IP and port; and, by matching against broadband DPI (Deep Packet Inspection) data, the GOIP device is automatically traced to a residential-community room address and accurately identified.

The process of identifying fraudulent telephone numbers may be described in detail below in connection with FIG. 3.

As shown in fig. 3, the fraud telephone number identification procedure may include:

1. Constructing a GOIP fraud number library and a fraud-related GOIP library.

2. Extracting features such as call detail records, location, user information and terminal.

3. Constructing a time series data set. The collected signaling traffic related features of each hour are arranged in chronological order and integrated into a time series data set, a call sequence of each number over a period of time is obtained, and a weighted moving average method is applied to perform secondary processing on the data set.

4. Standardizing the data and extracting training samples with high similarity by using the improved DTW algorithm. First, the data is standardized, that is, scaled so that unit limitations are removed and the data is converted into dimensionless pure numerical values falling into a distribution with a mean of 0 and a standard deviation of 1, which makes it convenient to compare and weight indicators of different units or orders of magnitude. Then the similarity between two time series is calculated with the optimized DTW algorithm, and the part of the data most correlated with the target time series is rapidly selected as training samples.

5. Performing feature screening based on Boosting and weighting the important features in combination with expert experience. The Boosting-based feature screening algorithm is adopted, and the important features are weighted in combination with the experience of GOIP equipment research specialists, so as to screen out the target training samples.

6. Dividing the samples into a training set, a test set and a validation set and feeding them into the LSTM model for training. The target training samples are divided into a training set, a test set and a validation set, which are used respectively for training, testing and validating the LSTM model until the model converges.

7. Predicting suspected GOIP fraud calls by using the model.

8. Reporting the abnormal-number related information to the local public security authority, acquiring the abnormal IP and port, and automatically tracing the GOIP device. The base station information of the calls initiated by the abnormal number is matched with broadband address information to obtain a fixed-network broadband account; the number-related information is synchronously reported to the local public security authority to obtain the abnormal IP and port; and, by matching against broadband DPI (Deep Packet Inspection) data, the GOIP device is automatically traced to a residential-community room address and accurately identified.
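The weighted moving average of step 3 and the standardization of step 4 can be sketched as below. The window weights are chosen by the caller, and the text does not state which weights are used; the function names are hypothetical.

```python
import statistics


def weighted_moving_average(series, weights):
    """Smooth a series with a trailing weighted window.

    weights[-1] applies to the newest point in each window, so giving
    larger weights to later entries emphasizes recent hours. The output
    has len(series) - len(weights) + 1 points.
    """
    k = len(weights)
    total = sum(weights)
    return [
        sum(w * x for w, x in zip(weights, series[i - k + 1:i + 1])) / total
        for i in range(k - 1, len(series))
    ]


def standardize(series):
    """Z-score standardization: the result has mean 0 and standard
    deviation 1 (population standard deviation)."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    if std == 0:
        return [0.0] * len(series)       # constant series carries no scale
    return [(x - mean) / std for x in series]
```

Applying standardize to each hourly feature sequence before the DTW comparison keeps features with large numeric ranges from dominating the distance.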

According to the training method of the fraud phone identification model provided by the embodiment of the application, a time sequence data set is obtained by processing the signaling traffic characteristics associated with GOIP fraud numbers. Relevant feature screening processing is performed on the time sequence data set based on a preset dynamic time warping algorithm to obtain intermediate sample features screened from the time sequence data set. Feature weighted screening processing is performed on the intermediate sample features based on a Boosting feature screening algorithm to obtain target sample features screened from the intermediate sample features, and a target training sample set is constructed. The fraud phone recognition model to be trained is trained based on the target training sample set to obtain the fraud phone recognition model. By constructing a time-series signaling traffic feature data set and weighting the important features with the Boosting-based feature screening algorithm, the embodiment of the application achieves higher recognition efficiency and higher recognition accuracy for fraud calls, and can effectively avoid unnecessary losses that fraud calls cause to users.

Referring to fig. 4, a schematic structural diagram of a training device for a fraud phone identification model according to an embodiment of the present application is shown, and as shown in fig. 4, a training device 400 for a fraud phone identification model may include the following modules:

A time sequence data set acquisition module 410, configured to process signaling traffic characteristics associated with the GOIP fraud number to obtain a time sequence data set;

The intermediate sample feature obtaining module 420 is configured to perform a relevant feature screening process on the time sequence data set based on a preset dynamic time warping algorithm, so as to obtain intermediate sample features screened from the time sequence data set;

The target training sample set construction module 430 is configured to perform feature weighted screening processing on the intermediate sample features based on a Boosting feature screening algorithm, obtain target sample features screened from the intermediate sample features, and construct a target training sample set;

And a fraud phone identification model obtaining module 440, configured to train a fraud phone identification model to be trained based on the target training sample set, so as to obtain the fraud phone identification model.

Optionally, the time series data set acquisition module includes:

Optionally, the intermediate sample feature acquisition module includes:

Optionally, the intermediate sample feature screening unit includes:

Optionally, the target training sample set construction module includes:

Optionally, the reference sample feature acquiring unit includes:

Optionally, the apparatus further comprises:

The training device of the fraud phone identification model provided by the embodiment of the application obtains a time sequence data set by processing the signaling traffic characteristics associated with GOIP fraud numbers. Relevant feature screening processing is performed on the time sequence data set based on a preset dynamic time warping algorithm to obtain intermediate sample features screened from the time sequence data set. Feature weighted screening processing is performed on the intermediate sample features based on a Boosting feature screening algorithm to obtain target sample features screened from the intermediate sample features, and a target training sample set is constructed. The fraud phone recognition model to be trained is trained based on the target training sample set to obtain the fraud phone recognition model. By constructing a time-series signaling traffic feature data set and weighting the important features with the Boosting-based feature screening algorithm, the embodiment of the application achieves higher recognition efficiency and higher recognition accuracy for fraud calls, and can effectively avoid unnecessary losses that fraud calls cause to users.

The embodiment of the application also provides electronic equipment, which comprises: the system comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the computer program realizes the training method of the fraud telephone identification model when being executed by the processor.

Fig. 5 shows a schematic structural diagram of an electronic device 500 according to an embodiment of the invention. As shown in fig. 5, the electronic device 500 includes a Central Processing Unit (CPU) 501 that can perform various suitable actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM) 502 or loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the electronic device 500 may also be stored. The CPU 501, ROM 502, and RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.

A number of components in electronic device 500 are connected to I/O interface 505, including: an input unit 506 such as a keyboard, mouse, microphone, etc.; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508 such as a magnetic disk, an optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the electronic device 500 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

The various processes and operations described above may be performed by the CPU 501. For example, the methods of any of the embodiments described above may be implemented as a computer software program tangibly embodied on a computer-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the CPU 501, one or more actions of the methods described above may be performed.

Additionally, the embodiment of the application also provides a computer readable storage medium, on which a computer program is stored, which when being executed by a processor, implements the training method of the fraud telephone identification model.

In this specification, the embodiments are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.

It will be apparent to those skilled in the art that embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the application may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminals (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the application.

Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.

The training method of the fraud phone recognition model, the training device of the fraud phone recognition model, the electronic device and the computer readable storage medium provided by the present application have been described in detail above. Specific examples are used herein to illustrate the principles and embodiments of the present application, and the description of the above examples is only intended to help understand the method and core ideas of the present application. Meanwhile, those skilled in the art may make changes to the specific embodiments and the scope of application in accordance with the ideas of the present application. In view of the above, the content of this description should not be construed as limiting the present application.

Claims

1. A method of training a fraud telephone identification model, the method comprising:

2. The method of claim 1, wherein processing signaling traffic characteristics associated with the GOIP fraud number to obtain a time series data set comprises:

3. The method according to claim 1, wherein the performing a correlation feature screening process on the time series data set based on a preset dynamic time warping algorithm to obtain intermediate sample features screened from the time series data set includes:

4. A method according to claim 3, wherein said performing similarity calculation on said normalized data based on said predetermined dynamic time warping algorithm and screening said intermediate sample features from said time series data set according to the similarity calculation result comprises:

5. The method of claim 1, wherein the Boosting-based feature screening algorithm performs feature weighted screening on the intermediate sample features to obtain target sample features screened from the intermediate sample features, and constructs a target training sample set, including:

6. The method of claim 5, wherein the Boosting-based feature screening algorithm performs feature screening on the intermediate sample features to obtain reference sample features, including:

7. The method of claim 1, further comprising, after said training a fraud telephone identification model to be trained based on said target training sample set, obtaining said fraud telephone identification model:

8. A training device for fraud telephone identification models, the device comprising:

9. An electronic device, comprising:

A processor, a memory and a computer program stored on the memory and executable on the processor, the processor implementing the training method of the fraud telephone identification model of any of claims 1 to 7 when the program is executed.

10. A computer readable storage medium, characterized in that instructions in said storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of training the fraud telephone identification model of any of claims 1 to 7.