CN113591924A - Phishing number detection method, system, storage medium and terminal equipment - Google Patents

Phishing number detection method, system, storage medium and terminal equipment

Info

Publication number
CN113591924A
Authority
CN
China
Prior art keywords
model
matrix
training
feature
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110748349.3A
Other languages
Chinese (zh)
Inventor
杨伟志
衣杨
赵小蕾
张海
曾青青
刘少江
黎丹雨
王玉娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinhua College of Sun Yat Sen University
Original Assignee
Xinhua College of Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinhua College of Sun Yat Sen University
Priority to CN202110748349.3A
Publication of CN113591924A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/22 Arrangements for supervision, monitoring or testing
    • H04M3/2281 Call monitoring, e.g. for law enforcement purposes; Call tracing; Detection or prevention of malicious calls

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • Technology Law (AREA)
  • Signal Processing (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application relates to a phishing number detection method comprising the following steps. First, features are designed for information fraud behavior and extracted from user behavior logs to construct an original feature matrix, which then undergoes data preprocessing. Second, according to the degree of imbalance between normal users and risk users in the preprocessed original feature matrix, a self-adjusting oversampling algorithm oversamples the minority-class samples and the training set is reconstructed. Next, pre-training, feature importance evaluation and feature screening are carried out with an XGBoost model. XGBoost and LightGBM models are then trained on the reconstructed feature matrix. Finally, model performance is improved through Stacking multi-model fusion, with a Logistic model as the second layer, completing the mobile network risk user identification model. The method and the device can improve the accuracy and robustness of fraud number identification in network communication and meet the requirements of practical application.

Description

Phishing number detection method, system, storage medium and terminal equipment
Technical Field
The present application relates to the field of machine learning and network security, and in particular, to a method, a system, a storage medium, and a terminal device for detecting phishing numbers.
Background
With the continuous development of information and communication technology, network risk behaviors such as telecommunication fraud are becoming more frequent, and the techniques used are becoming more advanced and diverse, seriously affecting people's life and work. Identifying fraud numbers with big data and artificial intelligence is an important direction for improving the technical capability to combat communication fraud.
Currently, fraud numbers are usually detected either with rules based on specific constraint conditions or with a random forest algorithm. However, rule-based detection relies on manually designed rules, so it adapts poorly to changing fraud behavior, is not flexible enough and has limited effect; the random forest algorithm has low prediction accuracy, so the detection rate of fraud risk users is relatively low.
Disclosure of Invention
In view of the above, it is desirable to provide a phishing number detection method, system, storage medium and terminal device capable of improving accuracy and robustness of fraud number identification in network communication.
The embodiment of the invention provides a phishing number detection method, which comprises the following steps:
oversampling the minority-class samples through a self-adjusting oversampling algorithm, and fusing the oversampling matrix obtained by the oversampling with a feature engineering matrix to obtain a training feature matrix;
respectively training an XGBoost model and a LightGBM model through the training feature matrix;
performing model fusion on the prediction results of the XGBoost model and the LightGBM model in a Stacking mode to obtain a Logistic model;
and identifying network risk users in the communication process through the Logistic model.
Further, the method for acquiring the feature engineering matrix comprises the following steps:
extracting the characteristics of original data of communication network users in the actual data set, and constructing an original characteristic matrix according to the extracted characteristic data;
and performing characteristic engineering processing on the original characteristic matrix to obtain a characteristic engineering matrix.
Further, the method for extracting the characteristics of the original data of the communication network users in the actual data set and constructing the original characteristic matrix according to the extracted characteristic data comprises the following steps:
extracting the characteristics of a call log, a short message log and a network original log of a communication network user;
and combining the extracted feature information of the call log, the short message log and the network original log with the user basic data to obtain an original feature matrix.
Further, the method for performing feature engineering processing on the original feature matrix to obtain a feature engineering matrix includes:
counting feature dimension information of a normal user group and a fraud user group in the original feature matrix;
acquiring a plurality of characteristics of which the difference degrees of normal user groups and fraudulent user groups are greater than a target value through a visualization tool;
calculating the variance of each feature dimension in the original feature matrix, extracting the features of which the variance is greater than a threshold value according to a variance selection method, and deleting the features of which the group difference degree between normal users and fraudulent users is less than a target value, thereby obtaining a feature engineering matrix.
Further, the method for oversampling the minority-class samples through a self-adjusting oversampling algorithm and fusing the oversampling matrix obtained by the oversampling with the feature engineering matrix to obtain the training feature matrix comprises the following steps:
calculating the K neighbor samples of each minority-class sample according to a distance metric, where K is the number of neighbor samples calculated;
calculating the same-class coefficient C of the K neighbor samples of each minority-class sample according to the same-class coefficient model, and screening out the minority-class samples with C > C_method as the sampled minority-class samples, where C_method is the same-class coefficient threshold;
and generating an oversampled data set from the sampled minority-class samples and their corresponding neighbor minority-class samples through a random sample generation model, screening the samples of the oversampled data set with an embedding method in combination with a classifier model to obtain a qualified minority-class oversampling matrix, and fusing the oversampling matrix with the feature engineering matrix to obtain the training feature matrix.
Further, the method for training the XGBoost model and the LightGBM model respectively through the training feature matrix includes:
dividing a data set corresponding to the training feature matrix into a training set and a test set; the training set is training data used for model training, and the test set is test data used for model testing;
and respectively inputting the training set and the test set into the XGBoost model and the LightGBM model for 5-fold cross-validation training, so that the held-out test folds together cover the whole training set.
Further, during cross-validation training,
setting the hyper-parameters: the loss function is AUC, the evaluation function is fs_score, the maximum depth of node splitting of the model decision trees is 6, the learning rate is 0.08, the regularization parameter is 2, the maximum number of iterations is 10000 rounds, and early stopping is set to 100 rounds;
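$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
[Formula image for fs_score]
wherein Precision is the precision, Recall is the recall, TP is the number of true positives, FP is the number of false positives, FN is the number of false negatives, and thkK is a constant value.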
Another embodiment of the invention provides a phishing number detection system, which addresses the following problem: existing detection of phishing numbers is usually based on specific constraint conditions or on a random forest algorithm; rule-based detection relies on manually designed rules, so it adapts poorly to changing fraud behavior, is not flexible enough and has limited effect, while the random forest algorithm has low prediction accuracy, so the detection rate of fraud risk users is relatively low.
The phishing number detection system according to the embodiment of the invention comprises:
the sampling module is used for oversampling the minority-class samples through a self-adjusting oversampling algorithm, and fusing the oversampling matrix obtained by the oversampling with a feature engineering matrix to obtain a training feature matrix;
the training module is used for respectively training an XGBoost model and a LightGBM model through the training feature matrix;
the fusion module is used for performing model fusion on the prediction results of the XGBoost model and the LightGBM model in a Stacking mode to obtain a Logistic model;
and the identification module is used for identifying network risk users in the communication process through the Logistic model.
Another embodiment of the present invention is also directed to a computer readable storage medium including a stored computer program; wherein the computer program, when executed, controls an apparatus on which the computer-readable storage medium is located to perform the phishing number detection method as described above.
Another embodiment of the present invention also proposes a terminal device, comprising a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the phishing number detection method as described above when executing the computer program.
According to the phishing number detection method, the minority-class samples are oversampled through a self-adjusting oversampling algorithm, and the oversampling matrix obtained by the oversampling is fused with a feature engineering matrix to obtain a training feature matrix; an XGBoost model and a LightGBM model are respectively trained through the training feature matrix; the prediction results of the XGBoost model and the LightGBM model are fused in a Stacking mode to obtain a Logistic model; and network risk users in the communication process are identified through the Logistic model. Compared with the prior art, the method and the device can improve the accuracy and robustness of fraud number identification in network communication and meet practical application requirements.
Drawings
FIG. 1 is a flow chart illustrating a phishing number detection method provided by an embodiment of the present invention;
FIG. 2 is a data flow of the phishing number detection method provided by the embodiment of the present invention;
FIG. 3 is a detailed flowchart of step S11 in FIG. 1;
FIG. 4 is a detailed flowchart of step S12 in FIG. 1;
FIG. 5 is a schematic diagram of model fusion in step S13 in FIG. 1;
FIG. 6 is a block diagram illustrating the structure of a phishing number detection system provided by the embodiment of the present invention;
FIG. 7 is a structural diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without any inventive step, are within the scope of the present invention.
It should be noted that, the step numbers in the text are only for convenience of explanation of the specific embodiments, and do not serve to limit the execution sequence of the steps. The method provided by the embodiment can be executed by the relevant server, and the server is taken as an example for explanation below.
As shown in fig. 1 to 5, the phishing number detection method provided by the embodiment of the invention comprises steps S11 to S14:
Step S11: the minority-class samples are oversampled through a self-adjusting oversampling algorithm, and the oversampling matrix obtained by the oversampling is fused with a feature engineering matrix to obtain a training feature matrix.
The method for acquiring the characteristic engineering matrix comprises the following steps: extracting the characteristics of original data of communication network users in the actual data set, and constructing an original characteristic matrix according to the extracted characteristic data; and performing characteristic engineering processing on the original characteristic matrix to obtain a characteristic engineering matrix.
Specifically, feature design before extraction mainly considers the basic attributes, calls, short messages and network behavior of the communication network users. The call and short message features include, for each user: the monthly number of calls, the average call duration, the call frequency, the ratio of outgoing to answered calls, the distribution of communication time periods, the length of the peer number, the prefix of the peer number, the number of call partners, and the like. The network features include the user's uplink and downlink traffic, the number of websites accessed, the access frequency of special websites, the access type, and the like.
When the original features of the communication information of the communication network users in the actual data set are extracted, code is written against the users' call logs, short message logs and raw network logs, and the original features are extracted with functions such as statistical functions, aggregation functions and pivot tables. The feature information extracted from the three log tables is merged by user ID, converting the log data into structured data, namely the numerical "sample-feature" matrix required for model training. In addition, the numerical feature matrix undergoes data cleaning such as outlier handling (deletion) and missing-value filling (mean filling), yielding the original feature matrix.
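The aggregation and merge-by-user-ID step can be sketched as follows. This is a minimal illustration in plain Python, not the patent's implementation: the log field names (`duration_s`, `direction`, `peer`) and the chosen features are assumptions made for the example, and only two of the three log tables are shown.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical log records; field names are illustrative, not from the patent.
call_log = [
    {"user_id": "u1", "duration_s": 30, "direction": "out"},
    {"user_id": "u1", "duration_s": 90, "direction": "in"},
    {"user_id": "u2", "duration_s": 10, "direction": "out"},
]
sms_log = [
    {"user_id": "u1", "peer": "13800000000"},
    {"user_id": "u2", "peer": "13900000000"},
    {"user_id": "u2", "peer": "13900000001"},
    {"user_id": "u3", "peer": "13700000000"},
]

def call_features(rows):
    """Aggregate per-user call statistics from raw call records."""
    by_user = defaultdict(list)
    for r in rows:
        by_user[r["user_id"]].append(r)
    feats = {}
    for uid, rs in by_user.items():
        n_out = sum(1 for r in rs if r["direction"] == "out")
        feats[uid] = {
            "call_count": len(rs),
            "avg_call_duration_s": mean(r["duration_s"] for r in rs),
            "dial_out_ratio": n_out / len(rs),
        }
    return feats

def sms_features(rows):
    """Aggregate per-user short message statistics."""
    peers, counts = defaultdict(set), defaultdict(int)
    for r in rows:
        peers[r["user_id"]].add(r["peer"])
        counts[r["user_id"]] += 1
    return {u: {"sms_count": counts[u], "sms_peer_count": len(peers[u])} for u in counts}

def merge_by_user(*tables):
    """Join the per-log feature tables on user ID."""
    merged = defaultdict(dict)
    for table in tables:
        for uid, feats in table.items():
            merged[uid].update(feats)
    return dict(merged)

def fill_missing_with_mean(matrix):
    """Mean-fill features that are absent for some users."""
    columns = sorted({c for feats in matrix.values() for c in feats})
    for c in columns:
        col_mean = mean(feats[c] for feats in matrix.values() if c in feats)
        for feats in matrix.values():
            feats.setdefault(c, col_mean)
    return matrix

feature_matrix = fill_missing_with_mean(
    merge_by_user(call_features(call_log), sms_features(sms_log))
)
```

Here user `u3` appears only in the short message log, so the call features for that user are filled with the column means, mirroring the mean-filling step described above.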
Further, when a feature engineering matrix is obtained, counting feature dimension information of a normal user group and a fraud user group in the original feature matrix; acquiring a plurality of characteristics of which the difference degrees of normal user groups and fraudulent user groups are greater than a target value through a visualization tool; calculating the variance of each feature dimension in the original feature matrix, extracting the features of which the variance is greater than a threshold value according to a variance selection method, and deleting the features of which the group difference degree between normal users and fraudulent users is less than a target value, thereby obtaining a feature engineering matrix.
It can be understood that the data of each feature dimension is counted for the normal user group and the fraudulent user group and compared using visualization tools such as kernel density plots and bar charts, in order to find the features that differ most between the two user groups. On this basis, the variance of each feature dimension in the sample set is calculated and the features whose variance is greater than a threshold value are extracted according to the variance selection method; the features with a small group difference between normal users and fraudulent users are then deleted, which yields the feature engineering matrix. Screening features by a variance threshold and by comparing normal-user features with fraudulent-user features reduces the feature dimensionality and improves the generalization capability of the model.
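The two filters can be sketched as below. The variance filter follows the text directly. The group-difference measure is an assumption: the patent judges the difference with visualization tools, so the standardized gap between group means used here is only a hypothetical numeric stand-in, and the function and parameter names are illustrative.

```python
from statistics import mean, pvariance

def select_features(rows, labels, names, var_threshold, diff_target):
    """Keep features with enough variance and enough normal/fraud group difference.

    rows: list of feature vectors; labels: 0 = normal user, 1 = fraudulent user.
    var_threshold must be >= 0 so that kept features have a non-zero spread.
    """
    kept = []
    for j, name in enumerate(names):
        col = [r[j] for r in rows]
        variance = pvariance(col)
        if variance <= var_threshold:
            continue  # variance selection: drop near-constant features
        normal = [r[j] for r, y in zip(rows, labels) if y == 0]
        fraud = [r[j] for r, y in zip(rows, labels) if y == 1]
        # Hypothetical difference measure: gap between group means in units
        # of the overall standard deviation.
        diff = abs(mean(fraud) - mean(normal)) / variance ** 0.5
        if diff < diff_target:
            continue  # groups look alike on this feature
        kept.append(name)
    return kept

names = ["calls", "flag", "avg_dur"]
rows = [[10, 1, 50], [12, 1, 52], [11, 1, 48], [90, 1, 51], [95, 1, 49]]
labels = [0, 0, 0, 1, 1]
selected = select_features(rows, labels, names, var_threshold=0.01, diff_target=0.5)
```

In this toy data `flag` is constant and `avg_dur` has the same mean in both groups, so only `calls` survives both filters.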
Referring to fig. 3, the method for oversampling the minority-class samples through a self-adjusting oversampling algorithm and fusing the oversampling matrix obtained by the oversampling with a feature engineering matrix to obtain a training feature matrix includes:
Step S111: calculating the K neighbor samples of each minority-class sample according to a distance metric, where K is the number of neighbor samples calculated.
Step S112: calculating the same-class coefficient C of the K neighbor samples of each minority-class sample according to the same-class coefficient model, and screening out the minority-class samples with C > C_method as the sampled minority-class samples, where C_method is the same-class coefficient threshold and the neighbor minority-class samples are the minority-class samples among the K neighbor samples.
Step S113: generating an oversampled data set from the sampled minority-class samples and their corresponding neighbor minority-class samples through a random sample generation model, screening the samples of the oversampled data set with an embedding method in combination with a classifier model to obtain a qualified minority-class oversampling matrix, and fusing it with the feature engineering matrix to obtain the training feature matrix.
Specifically, in practice the proportion of fraudulent-user samples is far smaller than that of normal users. This sample imbalance biases what the model learns toward the majority class, so the classification effect of the model is poor. Therefore, a self-adjusting oversampling algorithm (SA-SMOTE, Self-Adjusting Synthetic Minority Oversampling Technique) is designed to balance the data, that is, to generate more minority-class samples without changing the data distribution and so avoid an excessive difference in sample proportion between the two classes. The SA-SMOTE algorithm first obtains the K neighbor samples of each minority-class sample from the existing minority-class samples using the Euclidean distance. It then screens out the minority-class samples whose same-class coefficient exceeds the threshold C_method, and randomly generates new samples according to the feature distribution of the original minority-class sample and the same-class samples among its K neighbors.
Further, the same-class coefficient model is:
[Formula image BDA0003143468570000081]
where target_class is the minority class, class(i) denotes the class of sample i, and K is the number of neighbor samples.
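The formula image for the same-class coefficient is not reproduced in this text. One reading that is consistent with the symbols defined above, offered as an assumption rather than as the patent's exact formula, is the fraction of the K neighbors that belong to the minority class:

```latex
C = \frac{1}{K} \sum_{i=1}^{K} \mathbb{1}\left[\mathrm{class}(i) = \mathrm{target\_class}\right]
```

Under this reading C lies between 0 and 1, and a minority-class sample whose neighbors are mostly majority-class samples receives a low C and is excluded from sampling when C does not exceed C_method.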
Further, the random sample generation model is:
$$x_{new} = x + \mathrm{rand}(0,1) \cdot (x_n - x)$$
where x is the feature vector of the original minority-class sample, x_n is the feature vector of the n-th neighbor sample among the K neighbor samples of that sample, and rand(0,1) generates a random decimal between 0 and 1.
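Steps S111 to S113 up to sample generation can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions: the same-class coefficient is taken to be the fraction of minority-class samples among the K neighbors (the formula image is not available), neighbors are searched over all samples, and the names `sa_smote_candidates` and `c_threshold` (standing in for C_method) are hypothetical.

```python
import math
import random

def k_nearest(idx, X, k):
    """Indices of the k samples closest to X[idx] by Euclidean distance."""
    dists = [(math.dist(X[idx], X[j]), j) for j in range(len(X)) if j != idx]
    return [j for _, j in sorted(dists)[:k]]

def same_class_coefficient(neigh, y, target_class):
    # Assumed form: share of the K neighbors that belong to the minority class.
    return sum(1 for j in neigh if y[j] == target_class) / len(neigh)

def sa_smote_candidates(X, y, target_class=1, k=2, c_threshold=0.5, rng=None):
    """Generate one synthetic sample per qualifying minority-class sample."""
    rng = rng or random.Random(0)
    synthetic = []
    for i, label in enumerate(y):
        if label != target_class:
            continue
        neigh = k_nearest(i, X, k)
        if same_class_coefficient(neigh, y, target_class) <= c_threshold:
            continue  # mostly surrounded by majority samples: do not sample here
        minority_neigh = [j for j in neigh if y[j] == target_class]
        if not minority_neigh:
            continue
        xn = X[rng.choice(minority_neigh)]
        r = rng.random()  # rand(0, 1)
        # x_new = x + rand(0,1) * (x_n - x)
        synthetic.append([a + r * (b - a) for a, b in zip(X[i], xn)])
    return synthetic

X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5], [6, 6], [5.5, 5.5], [5.4, 5.6]]
y = [1, 1, 1, 0, 0, 0, 0, 0, 1]
synthetic = sa_smote_candidates(X, y)
```

In this toy data the minority-class sample at (5.4, 5.6) sits inside the majority cluster, so its coefficient is 0 and it is skipped; the three clustered minority-class samples each yield one synthetic point on a segment between two of them.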
Further, the generated samples are screened by the "embedding method": a classification algorithm is trained to obtain the evaluation score of its predictions before the oversampled samples are added and the score after the current oversampled samples are added. If adding the oversampled samples improves the evaluation score on the validation set, the currently generated minority-class samples are kept; otherwise they are discarded. This continues until the ratio of positive to negative samples reaches the preset ratio. When the SA-SMOTE algorithm randomly generates minority-class samples, the number of neighbor samples K and the same-class coefficient threshold C_method serve as hyper-parameters and can be tuned experimentally. The resulting oversampled data set adds minority-class samples while preserving the original data distribution, which alleviates, to a certain extent, the effect of data imbalance on the accuracy of the trained model.
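The keep-or-discard loop can be sketched as below. The classifier and the evaluation score are stand-ins chosen to keep the example self-contained: a one-feature midpoint-threshold classifier and validation accuracy, neither of which is specified by the patent. All function names are hypothetical.

```python
from statistics import mean

def fit_midpoint(X, y):
    """Stand-in classifier: threshold halfway between the two class means."""
    pos = mean(x[0] for x, label in zip(X, y) if label == 1)
    neg = mean(x[0] for x, label in zip(X, y) if label == 0)
    return (pos + neg) / 2

def make_scorer(val_X, val_y):
    """Stand-in evaluation index: accuracy on a fixed validation set."""
    def score(threshold):
        hits = sum(int(x[0] > threshold) == label for x, label in zip(val_X, val_y))
        return hits / len(val_y)
    return score

def screen_synthetic(candidates, train_X, train_y, fit, score, target_ratio):
    """Keep a synthetic minority sample only if it raises the validation score."""
    X, y = list(train_X), list(train_y)
    best = score(fit(X, y))
    kept = []
    for cand in candidates:
        n_pos = sum(y)
        if n_pos / (len(y) - n_pos) >= target_ratio:
            break  # preset positive/negative ratio reached
        trial = score(fit(X + [cand], y + [1]))
        if trial > best:
            X.append(cand)
            y.append(1)
            kept.append(cand)
            best = trial
    return kept

train_X = [[1.0], [2.0], [3.0], [2.0], [10.0]]
train_y = [0, 0, 0, 0, 1]
score = make_scorer([[1.0], [2.0], [6.0], [7.0], [9.0]], [0, 0, 1, 1, 1])
kept = screen_synthetic([[8.0], [9.5]], train_X, train_y, fit_midpoint, score, 1.0)
```

The first candidate moves the decision threshold enough to fix a validation error and is kept; the second brings no further improvement and is discarded.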
It can be understood that the self-adjusting oversampling balance algorithm augments the fraud-user data, so that an excessive difference between the proportions of fraud-user and normal-user data is avoided without changing the data distribution, and the learning capability of the model is improved. The algorithm can improve the detection performance of the model when training data is limited.
Step S12: the XGBoost model and the LightGBM model are respectively trained through the training feature matrix.
Specifically, the oversampled data set adds minority-class samples while preserving the original data distribution, and the added samples together with the original samples form a new training data set, which alleviates, to a certain extent, the effect of data imbalance on the accuracy of the trained model. Feature screening is performed on the full data set using an embedding method: the training feature matrix formed from the full data set is fed into the XGBoost model and the LightGBM model for training.
Referring to fig. 4, the method for training the XGBoost model and the LightGBM model respectively through the training feature matrix includes:
Step S121: dividing the data set corresponding to the training feature matrix into a training set and a test set; the training set is the training data used for model training, and the test set is the test data used for model testing.
Step S122: respectively inputting the training set and the test set into the XGBoost model and the LightGBM model for 5-fold cross-validation training, so that the validation set covers the whole training set.
When 5-fold cross-validation training is carried out on the classification task, each round uses 4/5 of the training data as the training set and the remaining 1/5 as the test set; the test fold is rotated and the process is repeated 5 times, so that the test folds together cover the whole training set.
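The fold rotation can be sketched as a small index generator. This is an illustrative helper with a hypothetical name, without shuffling or class stratification, both of which a real pipeline would likely add.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_indices, held_out_indices) for each of n_folds rounds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, held_out
        start += size

folds = list(k_fold_indices(10))
```

With 10 samples each round trains on 8 and holds out 2, and the five held-out folds together contain every sample exactly once.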
Unlike evaluation in conventional machine learning, the information fraud identification task cares not only about precision but also about identifying as many fraud users as possible while precision remains high. Precision is the proportion, for a given test data set, of the samples the classifier correctly predicts as the positive class (fraudulent users) among all samples it predicts as the positive class, and is calculated as:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
where TP is the number of true positives (the true label is the positive class and the prediction is the positive class) and FP is the number of false positives (the true label is the negative class and the prediction is the positive class). Precision directly reflects how reliably the classifier labels positive cases.
Recall is the proportion, for a given test data set, of the samples whose true label is the positive class that are predicted as the positive class, and is calculated as:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
where FN is the number of false negatives (the true label is the positive class and the prediction is the negative class).
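As a worked illustration of these two definitions, the short sketch below counts TP, FP and FN from label lists; the function name and the sample labels are made up for the example.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 4 fraudulent users (1) and 4 normal users (0); the model flags 3 users.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
```

Here TP = 2, FP = 1 and FN = 2, so precision is 2/3 and recall is 1/2: two of the three flagged users are truly fraudulent, but only half of the fraudulent users are found.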
Therefore, the XGBoost model is modified accordingly and its training logic is changed: its evaluation function fs_score is the score of the model when the precision exceeds a threshold value, calculated as:
[Formula image BDA0003143468570000101]
where Precision is the precision, Recall is the recall, and thkK is a constant value.
Further, during cross-validation training, the hyper-parameters are set as follows: the loss function is AUC, the evaluation function is fs_score, the maximum splitting depth of decision tree nodes in XGBoost (max_depth) is 6, the learning rate (eta) is 0.08, and the L2 regularization parameter is 2.
Further, during training the maximum number of iterations is set to 10000 rounds and early stopping is set to 100 rounds: if 100 consecutive rounds of training fail to exceed the current best fs_score on the validation set, training stops, which prevents overfitting.
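The early-stopping rule can be sketched independently of any boosting library. In the sketch below the per-round validation scores are supplied as a plain sequence, which stands in for evaluating fs_score after each boosting round; the function name is hypothetical.

```python
def best_round_with_early_stopping(scores_per_round, patience=100, max_rounds=10000):
    """Return (best_round, best_score), stopping after `patience` rounds
    without improvement or after `max_rounds` rounds, whichever comes first."""
    best_score, best_round = float("-inf"), -1
    for rnd, score in enumerate(scores_per_round):
        if rnd >= max_rounds:
            break
        if score > best_score:
            best_score, best_round = score, rnd
        elif rnd - best_round >= patience:
            break  # no improvement for `patience` consecutive rounds
    return best_round, best_score

# Toy validation scores; a patience of 3 keeps the example short.
scores = [0.5, 0.6, 0.7, 0.69, 0.68, 0.67, 0.9]
result = best_round_with_early_stopping(scores, patience=3)
```

Training stops at round 5, three rounds after the best score at round 2, so the later value 0.9 is never reached. That trade-off is inherent to early stopping: a larger patience, such as the 100 rounds stated above, makes premature stops less likely at the cost of extra rounds.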
Further, XGBoost is a tree-based ensemble model; after training, a feature importance ranking is obtained from the number of times each feature is used to split tree nodes. Features are screened according to this ranking, which reduces the feature dimensionality and improves the generalization capability of the model.
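The ranking-and-screening step can be sketched with a plain dictionary of split counts. The counts and feature names below are invented for illustration, and the number of features to keep is an assumed parameter that the patent does not specify.

```python
def top_features_by_split_count(split_counts, keep):
    """Rank features by how often they split a tree node and keep the top ones."""
    ranked = sorted(split_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:keep]]

# Hypothetical split counts collected from a trained tree ensemble.
split_counts = {
    "call_count": 120,
    "avg_call_duration_s": 45,
    "sms_peer_count": 80,
    "peer_prefix_len": 3,
}
selected = top_features_by_split_count(split_counts, keep=2)
```

In the XGBoost Python package, split counts of this kind correspond to the "weight" importance type, so a trained booster's importance scores could serve as the input dictionary.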
It should be further noted that, to improve the generalization ability and stability of the detection model, pre-training, feature importance evaluation and feature screening may be performed before the XGBoost model and the LightGBM model are trained together. The XGBoost model and the LightGBM model are then trained again on the new training feature matrix obtained from pre-training the XGBoost model, evaluating feature importance and screening features.
Step S13: model fusion is performed on the prediction results of the XGBoost model and the LightGBM model in a Stacking mode to obtain a Logistic model.
It can be understood that model training is performed based on the XGBoost and LightGBM models and the models are fused with the Stacking method, which improves the accuracy and generalization capability of the mobile phishing user identification method.
Step S14: network risk users in the communication process are identified through the Logistic model.
Referring to fig. 5, model fusion is performed on the prediction results of the two models in a Stacking mode to obtain the final prediction model for mobile phishing users. Stacking is a two-layer model structure: XGBoost and LightGBM form the first layer and the Logistic model forms the second layer. XGBoost and LightGBM are each trained with 5-fold cross-validation, their predicted values for the training set and test set samples are used as the features on which the Logistic model is trained, and the Logistic model makes the final class prediction based on these features.
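The second layer of the stack can be sketched as follows. To stay self-contained, the first-layer outputs are hard-coded hypothetical out-of-fold probabilities standing in for the XGBoost and LightGBM predictions, and the Logistic model is assumed to be a logistic regression fitted by batch gradient descent; the learning rate and epoch count are illustrative choices.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def train_logistic(features, labels, lr=1.0, epochs=3000):
    """Fit logistic regression weights by batch gradient descent on log loss."""
    n_feat = len(features[0])
    w, b = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_feat, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(n_feat):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / len(labels) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(labels)
    return w, b

def predict_logistic(model, x):
    """Probability that a sample is a risk user, given first-layer predictions."""
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Hypothetical out-of-fold predictions from the two first-layer models.
oof_model_a = [0.10, 0.20, 0.30, 0.80, 0.90, 0.70, 0.40, 0.15]
oof_model_b = [0.20, 0.10, 0.40, 0.70, 0.95, 0.60, 0.50, 0.10]
labels = [0, 0, 0, 1, 1, 1, 0, 0]

# Second-layer features: one column per first-layer model.
meta_features = [list(pair) for pair in zip(oof_model_a, oof_model_b)]
meta_model = train_logistic(meta_features, labels)
```

Using out-of-fold predictions as the second-layer features means each prediction comes from a model that did not see that sample during training, which limits label leakage into the Logistic model.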
It can be understood that the XGBoost and LightGBM algorithms from machine learning are used to model and train on the user behavior data of the telecommunication network. Both models are forward stagewise additive ensemble algorithms based on decision trees: each consists of multiple decision trees, and each later sub-model is fitted according to the prediction performance of the earlier ones, which makes it easier to improve the overall performance of the model. XGBoost describes the target loss function with a second-order Taylor expansion, which helps approach the model's optimum and makes its predictions more accurate. In XGBoost, the predicted value of a leaf node of the base model is not simply the average of the target values of the samples in that leaf, but is derived by optimizing the target loss function. The resulting weight is the negative gradient multiplied by a coefficient, namely the reciprocal of the second-order gradient plus a constant. That is, the weight is optimized along the negative gradient direction, and the step size is adjusted dynamically according to how fast the gradient changes (the second-order gradient describes the change of the gradient), so the tree model in XGBoost reaches the optimal solution more easily, "oscillation" near the optimum is avoided to a certain extent, and the model's predictions are more accurate.
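The leaf-weight expression described above can be written out using the standard XGBoost derivation. The symbols below are the conventional ones and do not appear in the patent text: g_i and h_i are the first- and second-order gradients of the loss for sample i, I_j is the set of samples in leaf j, T is the number of leaves, and λ and γ are regularization constants.

```latex
% Second-order Taylor approximation of the objective at boosting round t
\mathrm{Obj}^{(t)} \approx \sum_{i} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \right]
  + \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2

% Minimizing over the weight of leaf j gives
w_j^{*} = -\frac{G_j}{H_j + \lambda},
\qquad G_j = \sum_{i \in I_j} g_i, \quad H_j = \sum_{i \in I_j} h_i
```

The optimal weight is the negative summed gradient scaled by the reciprocal of the summed second-order gradient plus the constant λ, which matches the description: a large H_j (a rapidly changing gradient) shrinks the step, and a small one enlarges it.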
For the telecommunication fraud identification task, where the goal is to raise the recall of fraud users while precision stays high, the method and the device provided by the invention optimize XGBoost and LightGBM and customize the evaluation function fs_score, improving the ability to identify fraud users at high precision.
According to the method, the Stacking model fusion algorithm is adopted to fuse the XGBoost model and the LightGBM model, yielding a fraud number prediction model with high accuracy and good robustness. Different models learn different rules and information during training, and with appropriate model fusion the final model can combine the strengths of its component models, which improves prediction accuracy and robustness and makes the final model's performance on new data more stable.
In a classification task, sample imbalance biases what the model learns toward the majority class, so the classification effect of the model is poor. The invention provides a self-adjusting oversampling balance algorithm for the case of limited data, which generates suitable minority-class samples without changing the original data distribution and thereby improves the model training effect.
According to the phishing number detection method, the minority-class samples are oversampled through a self-adjusting oversampling algorithm, and the oversampling matrix obtained by the oversampling is fused with a feature engineering matrix to obtain a training feature matrix; an XGBoost model and a LightGBM model are respectively trained through the training feature matrix; the prediction results of the XGBoost model and the LightGBM model are fused in a Stacking mode to obtain a Logistic model; and network risk users in the communication process are identified through the Logistic model. Compared with the prior art, the method and the device can improve the accuracy and robustness of fraud number identification in network communication and meet practical application requirements.
It should be understood that, although the steps in the above-described flowcharts are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and described and may be performed in other orders. Moreover, at least some of the steps in the above-described flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
FIG. 6 is a structural block diagram of the phishing number detection system provided by the present invention. The system comprises:
a sampling module 21, configured to oversample minority-class samples by a self-adjusting oversampling algorithm and to fuse the oversampling matrix obtained by oversampling with a feature engineering matrix to obtain a training feature matrix.
The feature engineering matrix is obtained as follows:
extracting features from the original data of communication network users in the actual data set, and constructing an original feature matrix from the extracted feature data;
and performing feature engineering on the original feature matrix to obtain the feature engineering matrix.
Specifically, extracting features from the original data of communication network users in the actual data set and constructing the original feature matrix from the extracted feature data comprises:
extracting features from the call logs, short message logs and raw network logs of communication network users;
and combining the feature information extracted from the call logs, short message logs and raw network logs with the basic user data to obtain the original feature matrix.
Further, performing feature engineering on the original feature matrix to obtain the feature engineering matrix comprises:
computing statistics of each feature dimension for the normal user group and the fraudulent user group in the original feature matrix;
using a visualization tool to identify the features for which the degree of difference between the normal user group and the fraudulent user group is greater than a target value;
calculating the variance of each feature dimension in the original feature matrix, keeping the features whose variance is greater than a threshold according to the variance selection method, and deleting the features whose degree of difference between the normal and fraudulent user groups is less than the target value, thereby obtaining the feature engineering matrix.
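A minimal sketch of the two filters above, under stated assumptions: the text does not define how the degree of difference between the groups is measured (it is read off a visualization tool), so the measure used here, the gap between group means in units of the overall standard deviation, is a hypothetical stand-in. The names `select_features`, `var_threshold` and `diff_target` are likewise illustrative.

```python
from statistics import mean, pvariance


def select_features(rows, labels, var_threshold, diff_target):
    """Return the column indices that pass both filters.

    rows:   list of feature vectors (the original feature matrix)
    labels: 0 for normal users, 1 for fraudulent users
    """
    keep = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        var = pvariance(column)
        if var <= var_threshold:
            continue  # variance selection: drop near-constant features
        normal = [r[j] for r, y in zip(rows, labels) if y == 0]
        fraud = [r[j] for r, y in zip(rows, labels) if y == 1]
        # Hypothetical difference degree: mean gap scaled by the overall std.
        diff = abs(mean(fraud) - mean(normal)) / var ** 0.5
        if diff >= diff_target:
            keep.append(j)
    return keep


rows = [
    [1.0, 10.0, 5.0],
    [1.0, 12.0, 6.0],
    [1.0, 11.0, 5.5],
    [1.0, 50.0, 5.4],
    [1.0, 55.0, 5.6],
]
labels = [0, 0, 0, 1, 1]
selected = select_features(rows, labels, var_threshold=0.01, diff_target=1.0)
print(selected)
```

In this toy matrix, column 0 is constant and fails the variance filter, column 2 varies but has the same mean in both groups, and only column 1 separates the groups.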
Further, the sampling module 21 is specifically configured to: calculate the K neighbor samples of each minority-class sample according to a distance metric, where K is the number of neighbor samples calculated;
calculate a same-class coefficient C over the K neighbor samples of each minority-class sample according to the same-class coefficient model, and select the minority-class samples with C > C_method as sampling minority samples, where C_method is the same-class coefficient threshold;
and form a sampling matrix from the sampling minority samples, their corresponding neighbor minority samples and the random samples generated by the random sample generation model, and fuse the sampling matrix with the feature engineering matrix to obtain the training feature matrix.
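The sampling steps above can be sketched as follows. Two parts are assumptions, because this section does not spell them out: the same-class coefficient C is taken to be the fraction of the K neighbors that belong to the minority class, and the random sample generation model is taken to be linear interpolation between a sampling minority sample and one of its minority neighbors. The function and parameter names are illustrative.

```python
import math
import random


def oversample(X, y, k=3, c_threshold=0.5, n_new_per_seed=1, seed=0):
    """Generate synthetic minority samples (label 1) from screened seeds."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == 1]
    synthetic = []
    for i in minority:
        # K nearest neighbours of sample i among all other samples.
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )[:k]
        same_class = [j for j in neighbours if y[j] == 1]
        c = len(same_class) / k  # assumed same-class coefficient
        if c <= c_threshold or not same_class:
            continue  # seed sits among majority samples: do not sample from it
        for _ in range(n_new_per_seed):
            j = rng.choice(same_class)
            t = rng.random()
            # Assumed generation model: a random point between seed and neighbour.
            synthetic.append([a + t * (b - a) for a, b in zip(X[i], X[j])])
    return synthetic


X = [
    [0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.3, 0.3], [-0.1, 0.1],  # majority
    [0.1, 0.1],                                                    # isolated minority
    [5.0, 5.0], [5.2, 5.1], [4.9, 5.3], [5.1, 4.8],                # minority cluster
]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
new_rows = oversample(X, y, k=3, c_threshold=0.5)
training_rows = X + new_rows
training_labels = y + [1] * len(new_rows)
```

The isolated minority sample is surrounded by majority samples, so it is screened out and no synthetic points are placed in majority territory; the four clustered samples each contribute one synthetic row, which is then appended to the feature rows.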
a training module 22, configured to train the XGBoost model and the LightGBM model respectively on the training feature matrix.
Specifically, the data set corresponding to the training feature matrix is divided into a training set and a test set; the training set is the data used for model training, and the test set is the data used for model testing;
and the training set and the test set are input into the XGBoost model and the LightGBM model respectively for 5-fold cross-validation training, so that the test folds together cover the whole training set.
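As a hedged illustration of the 5-fold scheme, the sketch below only produces the index splits; model fitting is left out. The helper name `five_fold_splits` and the shuffling seed are assumptions made for the example.

```python
import random


def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, held_out_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        yield train, held_out


splits = list(five_fold_splits(23))
covered = sorted(i for _, held_out in splits for i in held_out)
print(len(splits), covered == list(range(23)))
```

Each sample is held out exactly once, so the five held-out folds together cover the whole data set, which is the property the paragraph above relies on.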
Further, during cross-validation training,
setting the hyper-parameters: the loss function is AUC, the evaluation function is fs_score, the maximum depth of node splitting in the model decision trees is 6, the learning rate is 0.08, the regularization parameter is 2, the maximum number of iterations is 10000 rounds, and early stopping is set to 100 rounds;
[Formulas shown as images in the original: Figures BDA0003143468570000141, BDA0003143468570000142 and BDA0003143468570000143]
where Precision is the precision, Recall is the recall, TP is the number of true positives, FP is the number of false positives, and thkK is a constant.
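The formulas themselves are not reproduced in this text, so the custom fs_score cannot be restated here. The sketch below covers only the two standard quantities it is built on, using their usual definitions; the false-negative count FN, the 0.5 decision threshold and the function name are assumptions of this example, not taken from the patent.

```python
def precision_recall(y_true, y_score, threshold=0.5):
    """Standard precision and recall for binary labels at a score threshold."""
    tp = fp = fn = 0
    for truth, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1          # fraud user correctly flagged
        elif predicted == 1 and truth == 0:
            fp += 1          # normal user wrongly flagged
        elif predicted == 0 and truth == 1:
            fn += 1          # fraud user missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
precision, recall = precision_recall(y_true, y_score)
```

With these scores, two of the three flagged users are fraudulent and two of the three fraudulent users are flagged, so both values are 2/3.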
a fusion module 23, configured to perform model fusion on the prediction results of the XGBoost model and the LightGBM model by Stacking to obtain a Logistic model.
an identification module 24, configured to identify network risk users in the communication process by the Logistic model.
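The fusion step can be sketched in plain Python to show the data flow of Stacking. This is not the patented implementation: the two base learners are deliberately trivial stand-ins for XGBoost and LightGBM (each scores a row by how far one feature lies from the midpoint of the class means), the meta-learner is a small logistic regression trained by gradient descent, and all names are hypothetical.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def fit_mean_gap(feature):
    """Stand-in base learner: offset of one feature from the class-mean midpoint."""
    def fit(X, y):
        pos = [x[feature] for x, t in zip(X, y) if t == 1]
        neg = [x[feature] for x, t in zip(X, y) if t == 0]
        mid = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0
        return lambda x: sigmoid(x[feature] - mid)
    return fit


def out_of_fold(fit, X, y, k=5, seed=0):
    """Score every row with a model fitted without that row's fold."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    scores = [0.0] * len(X)
    for f in range(k):
        held = set(order[f::k])
        train = [i for i in order if i not in held]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        for i in held:
            scores[i] = predict(X[i])
    return scores


def fit_logistic(Z, y, lr=0.5, epochs=2000):
    """Meta-learner: logistic regression trained by batch gradient descent."""
    w, b, n = [0.0] * len(Z[0]), 0.0, len(Z)
    for _ in range(epochs):
        errs = [sigmoid(sum(wj * zj for wj, zj in zip(w, z)) + b) - t
                for z, t in zip(Z, y)]
        w = [wj - lr * sum(e * z[j] for e, z in zip(errs, Z)) / n
             for j, wj in enumerate(w)]
        b -= lr * sum(errs) / n
    return lambda z: sigmoid(sum(wj * zj for wj, zj in zip(w, z)) + b)


# Toy data: label 1 (risk user) has larger values in both features.
X = [[0.1 * i, 0.2] for i in range(10)] + [[2.0 + 0.1 * i, 1.5] for i in range(10)]
y = [0] * 10 + [1] * 10

base_fits = [fit_mean_gap(0), fit_mean_gap(1)]
oof_columns = [out_of_fold(fit, X, y) for fit in base_fits]
Z = [list(row) for row in zip(*oof_columns)]     # one score column per base model
meta = fit_logistic(Z, y)                        # the Logistic model
base_models = [fit(X, y) for fit in base_fits]   # refit on all data for inference


def predict_risk(x):
    return meta([model(x) for model in base_models])
```

The meta-learner is trained only on out-of-fold scores, so it never sees a base-model prediction made on data that the base model was fitted on; that separation is the reason Stacking is usually paired with cross-validation.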
The phishing number detection system provided by the embodiment of the invention oversamples minority-class samples by a self-adjusting oversampling algorithm and fuses the oversampling matrix obtained by oversampling with a feature engineering matrix to obtain a training feature matrix; trains the XGBoost model and the LightGBM model respectively on the training feature matrix; fuses the prediction results of the XGBoost model and the LightGBM model by Stacking to obtain a Logistic model; and identifies network risk users in the communication process by the Logistic model. Compared with the prior art, the invention improves the accuracy and robustness of fraud number identification in network communication and meets practical application requirements.
An embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program; wherein the computer program, when executed, controls an apparatus on which the computer-readable storage medium is located to perform the phishing number detection method as described above.
The present invention further provides a terminal device. FIG. 7 is a structural block diagram of a preferred embodiment of the terminal device provided by the present invention. The terminal device includes a processor 10, a memory 20 and a computer program stored in the memory 20 and configured to be executed by the processor 10; when executing the computer program, the processor 10 implements the phishing number detection method described above.
Preferably, the computer program can be divided into one or more modules/units (e.g. computer program 1, computer program 2,) which are stored in the memory 20 and executed by the processor 10 to accomplish the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used for describing the execution process of the computer program in the terminal device.
The processor 10 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor. The processor 10 is the control center of the terminal device and connects the various parts of the terminal device through various interfaces and lines.
The memory 20 mainly includes a program storage area that may store an operating system, an application program required for at least one function, and the like, and a data storage area that may store related data and the like. In addition, the memory 20 may be a high speed random access memory, may also be a non-volatile memory, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card), and the like, or the memory 20 may also be other volatile solid state memory devices.
It should be noted that the terminal device may include, but is not limited to, a processor and a memory. Those skilled in the art will understand that the structural block diagram of FIG. 7 is only an example of the terminal device and does not limit it; the terminal device may include more or fewer components than shown, combine some components, or use different components.
In summary, the phishing number detection method, system, storage medium and terminal device provided by the embodiment of the invention oversample minority-class samples by a self-adjusting oversampling algorithm and fuse the oversampling matrix obtained by oversampling with a feature engineering matrix to obtain a training feature matrix; train the XGBoost model and the LightGBM model respectively on the training feature matrix; fuse the prediction results of the XGBoost model and the LightGBM model by Stacking to obtain a Logistic model; and identify network risk users in the communication process by the Logistic model. Compared with the prior art, the invention improves the accuracy and robustness of fraud number identification in network communication and meets practical application requirements.
The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and variations without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A phishing number detection method, characterized in that said method comprises the steps of:
oversampling minority-class samples by a self-adjusting oversampling algorithm, and fusing the oversampling matrix obtained by the oversampling with a feature engineering matrix to obtain a training feature matrix;
training an XGBoost model and a LightGBM model respectively on the training feature matrix;
performing model fusion on the prediction results of the XGBoost model and the LightGBM model by Stacking to obtain a Logistic model;
and identifying network risk users in the communication process by the Logistic model.
2. The phishing number detection method as recited in claim 1, wherein the obtaining method of the feature engineering matrix comprises:
extracting features from the original data of communication network users in the actual data set, and constructing an original feature matrix from the extracted feature data;
and performing feature engineering on the original feature matrix to obtain the feature engineering matrix.
3. The phishing number detection method as recited in claim 2, wherein the method of performing feature extraction on the original data of the communication network users in the actual data set and constructing an original feature matrix according to the extracted feature data comprises:
extracting features from the call logs, short message logs and raw network logs of communication network users;
and combining the feature information extracted from the call logs, short message logs and raw network logs with the basic user data to obtain the original feature matrix.
4. The phishing number detection method as recited in claim 2, wherein the method of feature engineering processing the original feature matrix to obtain a feature engineering matrix comprises:
computing statistics of each feature dimension for the normal user group and the fraudulent user group in the original feature matrix;
using a visualization tool to identify the features for which the degree of difference between the normal user group and the fraudulent user group is greater than a target value;
calculating the variance of each feature dimension in the original feature matrix, keeping the features whose variance is greater than a threshold according to the variance selection method, and deleting the features whose degree of difference between the normal and fraudulent user groups is less than the target value, thereby obtaining the feature engineering matrix.
5. The phishing number detection method as recited in claim 4, wherein the method of oversampling minority-class samples by a self-adjusting oversampling algorithm and fusing the oversampling matrix obtained by the oversampling with the feature engineering matrix to obtain the training feature matrix comprises:
calculating the K neighbor samples of each minority-class sample according to a distance metric, where K is the number of neighbor samples calculated;
calculating a same-class coefficient C over the K neighbor samples of each minority-class sample according to the same-class coefficient model, and selecting the minority-class samples with C > C_method as sampling minority samples, where C_method is the same-class coefficient threshold;
and generating an oversampled data set from the sampling minority samples and their corresponding neighbor minority samples through a random sample generation model, screening the samples of the oversampled data set with a classifier model using an embedding method to obtain a qualified minority-class oversampling matrix, and fusing the oversampling matrix with the feature engineering matrix to obtain the training feature matrix.
6. The phishing number detection method of claim 1, wherein the method of training the XGBoost model and the LightGBM model respectively on the training feature matrix comprises:
dividing the data set corresponding to the training feature matrix into a training set and a test set; the training set is the data used for model training, and the test set is the data used for model testing;
and inputting the training set and the test set into the XGBoost model and the LightGBM model respectively for 5-fold cross-validation training, so that the test folds together cover the whole training set.
7. The phishing number detection method as recited in claim 6, wherein, during cross-validation training,
setting the hyper-parameters: the loss function is AUC, the evaluation function is fs_score, the maximum depth of node splitting in the model decision trees is 6, the learning rate is 0.08, the regularization parameter is 2, the maximum number of iterations is 10000 rounds, and early stopping is set to 100 rounds;
[Formulas shown as images in the original: Figures FDA0003143468560000031, FDA0003143468560000032 and FDA0003143468560000033]
where Precision is the precision, Recall is the recall, TP is the number of true positives, FP is the number of false positives, and thkK is a constant.
8. A phishing number detection system, characterized in that said system comprises:
a sampling module, configured to oversample minority-class samples by a self-adjusting oversampling algorithm and to fuse the oversampling matrix obtained by oversampling with a feature engineering matrix to obtain a training feature matrix;
a training module, configured to train the XGBoost model and the LightGBM model respectively on the training feature matrix;
a fusion module, configured to perform model fusion on the prediction results of the XGBoost model and the LightGBM model by Stacking to obtain a Logistic model;
and an identification module, configured to identify network risk users in the communication process by the Logistic model.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored computer program; wherein the computer program, when executed, controls an apparatus on which the computer-readable storage medium is located to perform the phishing number detection method as recited in any one of claims 1-7.
10. A terminal device, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the phishing number detection method as recited in any one of claims 1-7.
CN202110748349.3A 2021-07-01 2021-07-01 Phishing number detection method, system, storage medium and terminal equipment Pending CN113591924A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110748349.3A CN113591924A (en) 2021-07-01 2021-07-01 Phishing number detection method, system, storage medium and terminal equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110748349.3A CN113591924A (en) 2021-07-01 2021-07-01 Phishing number detection method, system, storage medium and terminal equipment

Publications (1)

Publication Number Publication Date
CN113591924A true CN113591924A (en) 2021-11-02

Family

ID=78245983

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110748349.3A Pending CN113591924A (en) 2021-07-01 2021-07-01 Phishing number detection method, system, storage medium and terminal equipment

Country Status (1)

Country Link
CN (1) CN113591924A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114035468A (en) * 2021-11-08 2022-02-11 山东理工大学 Predictive monitoring method and system for fan overhaul process based on XGboost algorithm
CN114035468B (en) * 2021-11-08 2024-05-28 山东理工大学 Method and system for predictively monitoring overhaul flow of fan based on XGBoost algorithm
CN114511330A (en) * 2022-04-18 2022-05-17 山东省计算中心(国家超级计算济南中心) Improved CNN-RF-based Ethernet workshop Pompe deception office detection method and system
CN115174745A (en) * 2022-07-04 2022-10-11 联通(山东)产业互联网有限公司 Telephone number fraud pattern recognition method based on graph network and machine learning
CN115174745B (en) * 2022-07-04 2023-08-15 联通(山东)产业互联网有限公司 Telephone number fraud pattern recognition method based on graph network and machine learning
CN115550506A (en) * 2022-09-27 2022-12-30 中国电信股份有限公司 Training of user recognition model, user recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination