CN114205462A

CN114205462A - Fraud telephone identification method, device, system and computer storage medium

Info

Publication number: CN114205462A
Application number: CN202111526088.7A
Authority: CN
Inventors: 王晨; 包森成; 余娜; 徐强; 王健; 葛胜利
Original assignee: Individual
Current assignee: Individual
Priority date: 2021-12-14
Filing date: 2021-12-14
Publication date: 2022-03-18

Abstract

The invention discloses a fraud telephone identification method, device, system and computer storage medium. The method comprises: acquiring a training sample data set and a test sample data set in a current scene; performing multi-dimensional feature extraction on the training sample data set to obtain a plurality of first features, and performing model training on the first features through a random forest algorithm to obtain a detection model; inputting the test sample data set into the detection model and performing parameter optimization on the detection model to obtain an updated detection model and a model prediction result; evaluating the model prediction result according to a plurality of evaluation indexes, and judging whether the detection model is feasible; when the detection model is feasible, performing multi-dimensional feature extraction on the number to be predicted and inputting the resulting plurality of second features into the updated detection model for prediction to obtain the probability P that the number to be predicted is abnormal; and comparing the probability P with a preset threshold value, and judging whether the number to be predicted is abnormal according to the comparison result. The method remains effective over a long period and has high accuracy.

Description

Fraud telephone identification method, device, system and computer storage medium

Technical Field

The invention relates to the technical field of network security, in particular to a fraud telephone identification method, device, system and computer storage medium.

Background

In the prior art, number card management against telecommunication fraud relies mainly on research and judgment based on list libraries and business rules. The first approach filters number cards through a blacklist and whitelist mechanism. Its effectiveness depends mainly on the validity of the list library, and because numbers usually enter the library only after the event, both the timeliness of the judgment and the comprehensiveness of capturing fraud-related cards are clearly limited. The other approach analyzes service data based on a historical blacklist and extracts strong business rules such as region attributes and frequency attributes. This rule-based research and judgment depends entirely on expert experience, so it is difficult to maintain and its interception accuracy is unpredictable.

No effective solution has yet been proposed for the prior-art problems of short timeliness and incompleteness in filtering number cards through a blacklist mechanism, and of low accuracy and difficult maintenance in research and judgment that depends on expert experience.

Disclosure of Invention

The embodiments of the invention provide a fraud telephone identification method, device, system and computer storage medium, which are used for solving the prior-art problems of short timeliness and incompleteness in filtering number cards through a blacklist mechanism, and of low accuracy and difficult maintenance in research and judgment that depends on expert experience.

To achieve the above object, in one aspect, the present invention provides a fraud phone identification method, including: acquiring a training sample data set and a test sample data set in a current scene; performing multi-dimensional feature extraction on the training sample data set to obtain a plurality of first features; performing model training on the plurality of first characteristics through a random forest algorithm to obtain a detection model; inputting the test sample data set into the detection model and performing parameter optimization on the detection model to obtain an updated detection model and a model prediction result; evaluating the model prediction result according to a plurality of evaluation indexes, and judging whether the detection model is feasible or not according to the evaluation result; when the detection model is feasible, extracting the multidimensional characteristics of the number to be predicted, and inputting the extracted second characteristics into the updated detection model for prediction to obtain the probability P that the number to be predicted is abnormal; and comparing the probability P with a preset threshold value, and judging whether the telephone number to be predicted is abnormal or not according to a comparison result.

Optionally, evaluating the model prediction result according to the plurality of evaluation indexes and judging whether the detection model is feasible according to the evaluation result includes: when the evaluation value of each evaluation index for the model prediction result is greater than 90 points, determining that the detection model is feasible.

Optionally, the multidimensional features at least include: a call feature, a short message feature, and a traffic feature.

Optionally, the performing multidimensional feature extraction on the training sample data set to obtain a plurality of first features includes: screening the training sample data set, and screening out a training sample data subset with a higher negative sample proportion in the training sample data set; and carrying out multi-dimensional feature extraction on the training sample data subset to obtain the plurality of first features.

Optionally, the scenario at least includes: a silent card revival scene, an abnormal roaming fraud scene, and a new card opening fraud scene.

In another aspect, the present invention provides a fraud telephone identification apparatus, comprising:

an acquisition unit, used for acquiring a training sample data set and a test sample data set in a current scene; a training unit, used for performing multi-dimensional feature extraction on the training sample data set to obtain a plurality of first features, and performing model training on the plurality of first features through a random forest algorithm to obtain a detection model; an updating unit, used for inputting the test sample data set into the detection model and performing parameter optimization on the detection model to obtain an updated detection model and a model prediction result; an evaluation unit, used for evaluating the model prediction result according to a plurality of evaluation indexes and judging whether the detection model is feasible according to the evaluation result; a prediction unit, used for performing multi-dimensional feature extraction on the telephone number to be predicted when the detection model is feasible, and inputting a plurality of extracted second features into the updated detection model for prediction to obtain the probability P that the telephone number to be predicted is abnormal; and a judging unit, used for comparing the probability P with a preset threshold value and judging whether the telephone number to be predicted is abnormal according to a comparison result.

Optionally, the evaluation unit includes: an evaluation subunit, used for judging that the detection model is feasible when the evaluation value of each evaluation index for the model prediction result is greater than 90 points.

Optionally, the training unit includes: the screening subunit is used for screening the training sample data set and screening out a training sample data subset with a higher negative sample percentage in the training sample data set; and the extraction subunit is used for performing multi-dimensional feature extraction on the training sample data subset to obtain the plurality of first features.

In another aspect, the invention further provides a fraud telephone identification system, which comprises the fraud telephone identification device.

In another aspect, the present invention also provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the above-mentioned fraud telephone identification method.

The invention has the beneficial effects that:

the invention provides a fraud telephone identification method, which comprises the following steps: acquiring a training sample data set and a test sample data set in a current scene; performing multi-dimensional feature extraction on the training sample data set to obtain a plurality of first features; performing model training on the plurality of first characteristics through a random forest algorithm to obtain a detection model; inputting the test sample data set into the detection model and performing parameter optimization on the detection model to obtain an updated detection model and a model prediction result; evaluating the model prediction result according to a plurality of evaluation indexes, and judging whether the detection model is feasible or not according to the evaluation result; when the detection model is feasible, extracting the multidimensional characteristics of the number to be predicted, and inputting a plurality of extracted second characteristics into the updated detection model for prediction to obtain the probability P that the number to be predicted is abnormal; and comparing the probability P with a preset threshold value, and judging whether the telephone number to be predicted is abnormal or not according to a comparison result.

In the method, multi-dimensional feature extraction improves detection accuracy. A detection model is obtained by training on the training sample data set through a random forest algorithm, and numbers to be predicted are input into the detection model in real time for prediction, which ensures comprehensive detection, long-lasting effectiveness, and convenient subsequent maintenance.

Drawings

FIG. 1 is a flow chart of a fraudulent call identification method provided by an embodiment of the present invention;

FIG. 2 is a flow chart of obtaining a plurality of first features according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a fraudulent telephone identification device provided by the embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a training unit according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.

Accordingly, the present invention provides a fraud telephone identification method. FIG. 1 is a flow chart of the fraud telephone identification method provided by an embodiment of the present invention. As shown in FIG. 1, the method includes:

S101, acquiring a training sample data set and a test sample data set in a current scene;

In an optional embodiment, the scene includes at least: a silent card revival scene, an abnormal roaming fraud scene, and a new card opening fraud scene.

The following takes the abnormal roaming fraud scene as an example:

A training sample data set and a test sample data set in this scene are acquired.

S102, performing multi-dimensional feature extraction on the training sample data set to obtain a plurality of first features; performing model training on the plurality of first characteristics through a random forest algorithm to obtain a detection model;

Performing multi-dimensional feature extraction on the training sample data set to obtain a plurality of first features comprises:

S1021, screening the training sample data set, and screening out a training sample data subset with a higher negative sample proportion in the training sample data set;

For example, the distributions of the silent periods and active periods of call and traffic activity differ greatly between normal users and fraud-related users, and positive samples far outnumber negative samples. For a call silent period of between one month and two months, the proportion of fraud-related users is 1.98 times that of normal users. Similarly, for a traffic silent period of between 14 days and 1 month, the proportion of fraud-related users is 1.2 times that of normal users.

The active period is defined in terms of consecutive active days, i.e. the active period is calculated in terms of the number of days between the fraud telephone number entering the active state and the temporary cessation of consecutive activity. The active period for a fraud-related user is significantly less than for a normal user. 94.0% of the fraud-related users have the longest continuous active days not exceeding 30 days; while only 7.25% of normal users have no more than 7 days of continuous activity, and 62.88% of normal users have more than 30 days of continuous activity.

Therefore, when the training sample data set is screened, the training sample data subset is selected using the criteria of 30 days of call silence, or 14 days of traffic silence, or no more than 7 days of continuous activity. The samples removed by the screening are positive samples, so the proportion of negative samples in the subset is higher and the sample imbalance is markedly reduced.
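The screening rule above can be sketched as a simple filter. This is a minimal illustration, not the patented implementation: the record fields, the helper names, and the choice of inclusive comparisons (>= 30, >= 14, <= 7) are assumptions, and the sample records are invented for demonstration.

```python
from dataclasses import dataclass


@dataclass
class NumberRecord:
    # All field names are hypothetical; the source only names the three criteria.
    number: str
    call_silence_days: int       # longest period without any call
    traffic_silence_days: int    # longest period without any data traffic
    max_active_streak_days: int  # longest run of consecutive active days
    is_fraud: bool               # True = fraud-related (negative sample)


def passes_screen(r: NumberRecord) -> bool:
    """Keep a record if it meets any one of the three screening criteria."""
    return (
        r.call_silence_days >= 30
        or r.traffic_silence_days >= 14
        or r.max_active_streak_days <= 7
    )


def screen(records):
    """Return the training sample data subset."""
    return [r for r in records if passes_screen(r)]


def fraud_share(records):
    """Proportion of fraud-related (negative) samples in a set."""
    return sum(r.is_fraud for r in records) / len(records) if records else 0.0


# Invented demonstration data.
records = [
    NumberRecord("A", 45, 20, 5, True),
    NumberRecord("B", 2, 1, 60, False),
    NumberRecord("C", 0, 0, 90, False),
    NumberRecord("D", 35, 3, 40, False),
    NumberRecord("E", 1, 16, 6, True),
]
subset = screen(records)
```

In this toy data the screen keeps records A, D and E, so the share of fraud-related samples rises from 2 of 5 to 2 of 3.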

And S1022, performing multi-dimensional feature extraction on the training sample data subset to obtain the plurality of first features.

Specifically, the multidimensional characteristics at least include: a call feature, a short message feature, and a traffic feature.

The call features include at least: calling/called frequency ratio; called only with no calling, or called more often than calling; call duration; calling/called base station dispersion; incoming/outgoing number dispersion; peak call frequency/fluctuation rate; call activity; local/roaming calling/called frequency; roaming city dispersion; call time period/call duration preference.

The short message features include: frequency of short message operations other than sending and receiving; ratio of short message sending times; local short message sending frequency; dispersion of peer numbers for sent short messages; dispersion of peer numbers for all short message operations; dispersion of peer numbers for short message operations other than sending and receiving.

The traffic features include: number of active hours of cross-provincial roaming traffic; number of active hours of intra-provincial roaming traffic; variance of traffic provinces; traffic base station dispersion; uplink traffic ratio; downlink traffic ratio; traffic behavior activity; uplink traffic fluctuation dispersion; downlink traffic fluctuation dispersion.
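As an illustration of how a few of the call features listed above might be computed from raw call records, consider the following sketch. The source does not define "dispersion" or the record layout, so the dictionary keys, the function name, and the definition of dispersion as distinct values divided by total values are all assumptions.

```python
def call_features(calls):
    """Compute a few call features for one number.

    `calls` is a list of dicts with hypothetical keys:
    'direction' ('out' for calling, 'in' for called), 'peer', 'station'.
    """
    out_calls = [c for c in calls if c["direction"] == "out"]
    in_calls = [c for c in calls if c["direction"] == "in"]

    def dispersion(values):
        # Assumed definition: share of distinct values among all values.
        return len(set(values)) / len(values) if values else 0.0

    return {
        # Denominator is floored at 1 to avoid division by zero (an assumption).
        "calling_called_ratio": len(out_calls) / max(len(in_calls), 1),
        "called_only": bool(in_calls) and not out_calls,
        "outgoing_number_dispersion": dispersion([c["peer"] for c in out_calls]),
        "base_station_dispersion": dispersion([c["station"] for c in calls]),
    }


# Invented demonstration data: three outgoing calls and one incoming call.
calls = [
    {"direction": "out", "peer": "p1", "station": "s1"},
    {"direction": "out", "peer": "p2", "station": "s1"},
    {"direction": "out", "peer": "p3", "station": "s2"},
    {"direction": "in", "peer": "p1", "station": "s1"},
]
features = call_features(calls)
```

Short message and traffic features could be derived in the same way from their respective records.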

Model training is performed on the obtained first features through a random forest algorithm to obtain a detection model. In the invention, because the screened training sample data subset is smaller than the training sample data set, fewer first features are obtained when multi-dimensional feature extraction is subsequently performed on the subset, which reduces the data processing difficulty and accelerates the subsequent model training process.

A random forest integrates the results of multiple decision trees: each tree makes its decision using a randomly selected part of the first features and a randomly selected part of the first feature attributes, and the final result is generated by voting among the decision trees.
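The voting step can be illustrated with a small sketch. Tree training (random sampling and split selection) is omitted; the three "trees" below are hand-written one-rule stand-ins with invented thresholds and feature names, and taking the probability P as the fraction of trees that vote "abnormal" is an assumption about how the vote is turned into a probability.

```python
def forest_probability(tree_votes):
    """Fraction of trees voting 'abnormal' (1), taken here as P."""
    if not tree_votes:
        raise ValueError("at least one tree vote is required")
    return sum(tree_votes) / len(tree_votes)


def majority_vote(tree_votes):
    """Final class by simple majority: 1 = abnormal, 0 = normal."""
    return 1 if forest_probability(tree_votes) > 0.5 else 0


# Hypothetical stand-ins for trained trees; each looks at a different feature.
trees = [
    lambda x: int(x["calling_called_ratio"] > 5),
    lambda x: int(x["roaming_city_dispersion"] > 0.6),
    lambda x: int(x["max_active_streak_days"] <= 7),
]

# Invented feature vector for one number.
sample = {
    "calling_called_ratio": 9.0,
    "roaming_city_dispersion": 0.2,
    "max_active_streak_days": 4,
}
votes = [tree(sample) for tree in trees]
p_abnormal = forest_probability(votes)
```

Here two of the three stand-in trees vote "abnormal", so the majority decision is abnormal.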

S103, inputting the test sample data set into the detection model and performing parameter optimization on the detection model to obtain an updated detection model and a model prediction result;

In an optional implementation, the test sample data set is input into the detection model for prediction to obtain a model prediction result; meanwhile, parameter optimization is performed on the detection model using methods such as grid search and random search to obtain an updated detection model.
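A grid search of the kind mentioned above can be sketched as an exhaustive loop over parameter combinations. The parameter names (n_trees, max_depth), the grid values, and the scoring function are hypothetical; in practice the scoring function would train a detection model with the given parameters and score its predictions, whereas the toy function here only stands in for that step.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and keep the best-scoring one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Stand-in for "train the model and evaluate it"; peaks at 200 trees, depth 8.
    return 1.0 - abs(params["n_trees"] - 200) / 1000 - abs(params["max_depth"] - 8) / 100


grid = {"n_trees": [50, 100, 200], "max_depth": [4, 8, 16]}
best_params, best_score = grid_search(grid, toy_score)
```

Random search follows the same pattern but samples a fixed number of combinations instead of enumerating all of them.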

S104, evaluating the model prediction result according to a plurality of evaluation indexes, and judging whether the detection model is feasible or not according to the evaluation result;

Specifically, the evaluation indexes include at least: a precision evaluation index, a recall evaluation index, and an F1-score (harmonic mean of precision and recall) evaluation index. When the evaluation value of each evaluation index for the model prediction result is greater than 90 points, the detection model is judged to be feasible.
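The three evaluation indexes and the feasibility check can be sketched as follows. The threshold of 90 is assumed to be a score on a 0 to 100 scale (that is, each index must exceed 0.90), and the label convention (1 = abnormal) and sample labels are invented for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def is_feasible(scores, threshold_points=90.0):
    """Feasible only if every index, rescaled to 0-100, exceeds the threshold."""
    return all(score * 100 > threshold_points for score in scores)


# Invented labels: 10 abnormal and 10 normal numbers, with one false alarm.
y_true = [1] * 10 + [0] * 10
y_pred = [1] * 10 + [0] * 9 + [1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
feasible = is_feasible((precision, recall, f1))
```

With one false alarm among 11 predicted abnormal numbers, precision is 10/11, recall is 1.0 and F1 is 20/21, so all three indexes exceed 90 and the model is judged feasible.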

S105, when the detection model is feasible, extracting multidimensional characteristics of the telephone number to be predicted, inputting a plurality of extracted second characteristics into the updated detection model for prediction, and obtaining the probability P that the telephone number to be predicted is abnormal;

S106, comparing the probability P with a preset threshold value, and judging whether the telephone number to be predicted is abnormal or not according to a comparison result.
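Step S106 amounts to a single comparison. In the sketch below the default threshold of 0.5, the threshold of 0.8 used in the example, the strict "greater than" comparison, and the placeholder number labels are all assumptions, since the source does not state the preset value or the direction of the comparison.

```python
def is_abnormal(p, threshold=0.5):
    """Judge a number abnormal when its probability P exceeds the threshold."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("P must be a probability in [0, 1]")
    return p > threshold


# Invented probabilities for three placeholder numbers.
probabilities = {"number_a": 0.91, "number_b": 0.12, "number_c": 0.50}
flagged = [n for n, p in probabilities.items() if is_abnormal(p, threshold=0.8)]
```

Raising the threshold trades fewer false alarms for more missed fraud numbers, so the preset value would normally be chosen per scene.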

In an alternative embodiment, the invention uses the ELI5 algorithm to explain the prediction result for the telephone number to be predicted. Because the second features span multiple dimensions and different second features have different contribution degrees (that is, different degrees of abnormal behavior), the contribution degrees of the second features are sorted in descending order, and the second features with the top-ranked contribution degrees are the main features influencing the prediction result for the telephone number to be predicted.
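The ranking step can be sketched independently of any explanation library. The code below does not call ELI5; it only assumes that per-feature contribution values have already been obtained, and the feature names and contribution values are invented.

```python
def rank_contributions(contributions, top_k=3):
    """Sort features by contribution, largest first, and keep the top_k."""
    ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical contribution degrees for one predicted number.
contributions = {
    "calling_called_ratio": 0.22,
    "roaming_city_dispersion": 0.31,
    "uplink_traffic_ratio": 0.05,
    "call_duration": -0.02,
}
main_features = rank_contributions(contributions, top_k=2)
```

In this example the two main features are roaming city dispersion and the calling/called frequency ratio.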

FIG. 3 is a schematic structural diagram of a fraud telephone identification apparatus provided in an embodiment of the present invention. As shown in FIG. 3, the apparatus includes:

an obtaining unit 201, configured to obtain a training sample data set and a test sample data set in a current scene;

The following takes the abnormal roaming fraud scene as an example:

A training unit 202, configured to perform multidimensional feature extraction on the training sample data set to obtain multiple first features; performing model training on the plurality of first characteristics through a random forest algorithm to obtain a detection model;

In an alternative implementation, FIG. 4 is a schematic structural diagram of a training unit provided in an embodiment of the present invention. As shown in FIG. 4, the training unit 202 includes:

a screening subunit 2021, configured to screen the training sample data set, and screen out a training sample data subset with a higher negative sample percentage in the training sample data set;

The extracting subunit 2022 is configured to perform multidimensional feature extraction on the training sample data subset to obtain the plurality of first features.

An updating unit 203, configured to input the test sample data set into the detection model and perform parameter optimization on the detection model to obtain an updated detection model and a model prediction result;

The evaluation unit 204 is configured to evaluate the model prediction result according to a plurality of evaluation indexes, and determine whether the detection model is feasible according to the evaluation result;

The prediction unit 205 is configured to, when the detection model is feasible, perform multidimensional feature extraction on the phone number to be predicted, and input a plurality of extracted second features into the updated detection model for prediction, so as to obtain a probability P that the phone number to be predicted is abnormal;

The determining unit 206 is configured to compare the probability P with a preset threshold, and determine whether the phone number to be predicted is abnormal according to a comparison result.

The invention also provides a fraud telephone identification system which comprises the fraud telephone identification device.

The present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described fraud phone recognition method.

The storage medium stores the software, and the storage medium includes but is not limited to: optical disks, floppy disks, hard disks, erasable memory, etc.


Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may be modified or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims

1. A fraud telephone identification method, comprising:

acquiring a training sample data set and a test sample data set in a current scene;

performing multi-dimensional feature extraction on the training sample data set to obtain a plurality of first features; performing model training on the plurality of first characteristics through a random forest algorithm to obtain a detection model;

inputting the test sample data set into the detection model and performing parameter optimization on the detection model to obtain an updated detection model and a model prediction result;

evaluating the model prediction result according to a plurality of evaluation indexes, and judging whether the detection model is feasible or not according to the evaluation result;

when the detection model is feasible, carrying out multi-dimensional feature extraction on the telephone number to be predicted, and inputting a plurality of extracted second features into the updated detection model for prediction to obtain the probability P that the telephone number to be predicted is abnormal;

and comparing the probability P with a preset threshold value, and judging whether the telephone number to be predicted is abnormal or not according to a comparison result.

2. The method of claim 1, wherein the evaluating the model prediction result according to a plurality of evaluation indexes, and the determining whether the detection model is feasible according to the evaluation result comprises:

and when the evaluation value of each evaluation index for the model prediction result is greater than 90 points, judging that the detection model is feasible.

3. The method of claim 1, wherein:

the multi-dimensional features include at least: a call feature, a short message feature, and a traffic feature.

4. The method of claim 1, wherein the performing multi-dimensional feature extraction on the training sample data set to obtain a plurality of first features comprises:

screening the training sample data set, and screening out a training sample data subset with a higher negative sample proportion in the training sample data set;

and performing multi-dimensional feature extraction on the training sample data subset to obtain the plurality of first features.

5. The method of claim 1, wherein:

the scene at least comprises: a silent card revival scene, an abnormal roaming fraud scene, and a new card opening fraud scene.

6. A fraud telephone identification apparatus, comprising:

the acquisition unit is used for acquiring a training sample data set and a test sample data set in a current scene;

the training unit is used for carrying out multi-dimensional feature extraction on the training sample data set to obtain a plurality of first features; performing model training on the plurality of first characteristics through a random forest algorithm to obtain a detection model;

the updating unit is used for inputting the test sample data set into the detection model and carrying out parameter optimization on the detection model so as to obtain an updated detection model and a model prediction result;

the evaluation unit is used for evaluating the model prediction result according to a plurality of evaluation indexes and judging whether the detection model is feasible according to the evaluation result;

the prediction unit is used for extracting the multidimensional characteristics of the telephone number to be predicted when the detection model is feasible, inputting a plurality of extracted second characteristics into the updated detection model for prediction, and obtaining the probability P that the telephone number to be predicted is abnormal;

and the judging unit is used for comparing the probability P with a preset threshold value and judging whether the telephone number to be predicted is abnormal or not according to a comparison result.

7. The apparatus of claim 6, wherein the evaluation unit comprises:

and the evaluation subunit is used for judging that the detection model is feasible when the evaluation value of each evaluation index for the model prediction result is greater than 90 points.

8. The apparatus of claim 6, wherein the training unit comprises:

a screening subunit, configured to screen the training sample data set, and screen out a training sample data subset with a higher negative sample percentage in the training sample data set;

and the extraction subunit is used for performing multi-dimensional feature extraction on the training sample data subset to obtain the plurality of first features.

9. A fraud telephone identification system, comprising: the fraud telephone identification apparatus of any of claims 6-8.

10. A computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the fraudulent call identification method as recited in any one of claims 1 to 5.