CN111198947B

CN111198947B - Convolutional neural network fraud short message classification method and system based on naive Bayes optimization

Info

Publication number: CN111198947B
Application number: CN202010008497.7A
Authority: CN
Inventors: 石嘉; 王秀丽; 李盛超
Original assignee: NANJING SINOVATIO TECHNOLOGY CO LTD
Current assignee: NANJING SINOVATIO TECHNOLOGY CO LTD
Priority date: 2020-01-06
Filing date: 2020-01-06
Publication date: 2024-02-13
Anticipated expiration: 2040-01-06
Also published as: CN111198947A

Abstract

The invention discloses a convolutional neural network fraud short message classification method and system based on naive Bayes optimization. A template library is established, and each short message to be judged is matched and classified against the short message templates in the template library. Short messages that fail template matching undergo a second discrimination by a textCNN model, and those that the textCNN model successfully discriminates are classified as fraud short messages. Short messages that the textCNN model fails to discriminate are reclassified by calculating a Bayesian probability, and those that the Bayesian probability fails to classify are determined to be non-fraud short messages. The invention realizes a scheme in which naive Bayes and textCNN are used in parallel for fraud short message classification, optimizes the whole model with templates generated from keyword statistics, and achieves self-learning through the self-growth of the template library, so that the accuracy and recall rate of fraud short message classification are greatly improved.

Description

Convolutional neural network fraud short message classification method and system based on naive Bayes optimization

Technical Field

The invention relates to methods and systems for text classification in natural language processing, and in particular to a convolutional neural network fraud short message classification method and system based on naive Bayesian optimization.

Background

With the development of the internet, unstructured text data has increased sharply, and it has become more difficult for people to find the data they want among massive data. How to organize and manage this information effectively, and how to find the information users need quickly, accurately, and comprehensively, is a major challenge at present. Because the amount of information is so large, collecting and mining text data purely by hand consumes a great deal of manpower and time and is hard to achieve. Automatic text classification is therefore particularly important; it is a basic function of text information processing and has become a core technology for processing and organizing text data.

For text classification problems, a common approach is to extract features of the text, for example by using word2vec or LDA models to convert the text into a feature vector of fixed dimension, and then to train a classifier on the extracted features. TextCNN, proposed in Yoon Kim's paper Convolutional Neural Networks for Sentence Classification, opened the door to classifying text with convolutional neural networks (CNN). Experimental research shows that CNN is strong at extracting shallow features, performs well on short text classification, is widely applicable and fast, and is generally a preferred choice. Meanwhile, naive Bayes is one of the top ten data mining algorithms and is widely used for classification because it is easy to construct and interpret and performs well.

However, in practical fraud short message applications, textCNN classifies short messages written in normal text very well, but many fraud short messages contain non-standard text. When word2vec is used to convert such text into feature vectors, the features of the non-standard text cannot be represented accurately, so the classification effect of textCNN is not ideal. The classification of short messages containing non-standard text therefore needs to be optimized.

Disclosure of Invention

Purpose of the invention: the first purpose of the invention is to provide a convolutional neural network fraud short message classification method based on naive Bayesian optimization, which can improve the accuracy and recall rate of fraud short message classification; the second purpose of the invention is to provide a convolutional neural network fraud short message classification system based on naive Bayes optimization.

The technical scheme is as follows: the invention discloses a convolutional neural network fraud short message classification method based on naive Bayes optimization, which comprises the following steps:

establishing a template library, matching and classifying the short message to be judged against the short message templates in the template library, and judging a short message whose template matching succeeds to be a fraud short message; constructing the template library comprises clustering all fraud short messages obtained from manual review, finding a representative short message sample in each cluster to form a short message template, and building the template library from these templates;

performing a second discrimination, through a textCNN model, on short messages whose template matching fails, and passing short messages that the textCNN model fails to discriminate on to Bayesian probability classification;

and reclassifying the short messages that the textCNN model fails to discriminate by calculating a Bayesian probability, determining short messages successfully classified by the Bayesian probability to be fraud short messages and finishing classification, and determining short messages that the Bayesian probability fails to classify to be non-fraud short messages.
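
The three-stage cascade described in these steps can be sketched as follows. This is a minimal illustration in Python, not the patented implementation: `classify_cascade` and the three toy stage functions are hypothetical names, and each stage is reduced to a keyword check that merely stands in for template matching, the textCNN model, and the Bayesian step.

```python
from typing import Callable, List

# A stage is any callable that returns True when it flags the message as fraud.
Stage = Callable[[str], bool]


def classify_cascade(message: str, stages: List[Stage]) -> str:
    """Run the stages in order; the first stage that succeeds decides 'fraud'."""
    for stage in stages:
        if stage(message):
            return "fraud"
    # No stage succeeded, so the message is treated as non-fraud.
    return "non-fraud"


# Toy stand-ins (assumptions for illustration only).
def template_stage(msg: str) -> bool:
    return "wire transfer" in msg


def textcnn_stage(msg: str) -> bool:
    return "verification code" in msg


def bayes_stage(msg: str) -> bool:
    return "lottery" in msg


STAGES = [template_stage, textcnn_stage, bayes_stage]
```

Calling `classify_cascade("you won the lottery", STAGES)` falls through the first two stand-in stages and is flagged by the third; a message that no stage flags is returned as non-fraud.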

Preferably, short messages successfully discriminated by the textCNN model are formed into new short message templates, and these augmented templates are added to the template library. The template library thus increases its number of templates through textCNN, achieving adaptive matching.
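
The self-growth of the template library might look like the sketch below. The function name `grow_template_library` and the duplicate check are assumptions made for illustration; the text does not say how duplicates are handled or in what form templates are stored.

```python
from typing import List


def grow_template_library(templates: List[str], message: str,
                          textcnn_says_fraud: bool) -> bool:
    """Append the message as a new template when textCNN flagged it.

    Returns True if the library grew. Skipping exact duplicates is an
    assumption, not something the source specifies.
    """
    if textcnn_says_fraud and message not in templates:
        templates.append(message)
        return True
    return False
```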

Preferably, the specific process of matching and classifying the short message to be judged against the short message templates in the template library is as follows: the cosine similarity between the short message to be judged and the short message templates in the template library is calculated; when the similarity is greater than a set threshold, the template matching succeeds and the short message is judged to be a fraud short message.

Preferably, the specific process of the second discrimination by the textCNN model is as follows: the characters in the short message are converted into a feature matrix; the convolution kernels perform one-dimensional convolution from top to bottom, where the size of a convolution kernel can be regarded as a form of n-gram; max pooling is then performed, recording the maximum value from each feature map; a univariate feature is generated from each of the feature maps, and these features are concatenated to form the feature vector of the penultimate layer; the final softmax layer receives this feature vector as input and uses it to classify the short message; after the model is trained, when the probability of a certain scene is greater than a set threshold, the short message is judged to be a fraud short message.

When convolution is carried out with convolution kernels of sizes 2, 3, and 4, with two kernels of each size and six kernels in total, six matrices are obtained; during convolution, the kernels perform one-dimensional convolution from top to bottom; the size of a convolution kernel can be regarded as a form of n-gram, representing 3-gram, 4-gram, and 5-gram respectively; max pooling is then performed, i.e., the maximum value from each feature map is recorded; a univariate feature is generated from each of the six feature maps, and the six features are concatenated to form the feature vector of the penultimate layer; the final softmax layer receives this feature vector as input and uses it to classify the short message; after the model is trained, when the probability of a certain scene is greater than a set threshold, the short message is judged to be a fraud short message.

Preferably, the specific process of reclassifying by calculating the Bayesian probability is as follows: for all fraud short messages obtained from manual review, the number of occurrences of each keyword in each scene is counted and taken as the prior probability of that keyword; based on naive Bayes theory, the posterior probability of each scene is calculated from the prior probabilities of the hit keywords; when the probability of a certain scene is greater than a set threshold, the short message is judged to be a fraud short message.

The invention also provides a convolutional neural network fraud short message classification system based on naive Bayes optimization, which comprises a construction module, a template matching module, a textCNN model discrimination module and a Bayes probability classification module.

The construction module is used for constructing a template library, and after clustering all fraud short messages from manual research and judgment, each class finds out a representative short message sample to form a short message template, so that the template library is constructed;

the template matching module is used for carrying out matching classification on the short message to be judged and the short message templates in the template library, judging the short message with successful template matching as a fraud short message, and sending the short message with failed template matching to the textCNN model judging module for secondary judgment;

the textCNN model judging module judges the short message with failed template matching for the second time, judges the successfully judged short message as a fraud short message, and sends the short message with failed judgment to the Bayesian probability classifying module for reclassifying; forming a new short message template by using the short message successfully judged by the textCNN model, forming an amplification template and putting the amplification template into a template library;

and the Bayesian probability classification module is used for reclassifying the short messages which are failed to be judged by the textCNN model by calculating the Bayesian probability, determining the short messages which are successfully classified by the Bayesian probability as fraud short messages and completing classification, and determining the short messages which are unsuccessfully classified by the Bayesian probability as non-fraud short messages.

Preferably, the process of matching and classifying the short message to be judged against the short message templates in the template library comprises: calculating the cosine similarity between the short message to be judged and the short message templates in the template library; when the similarity is greater than a set threshold, the template matching succeeds and the short message is judged to be a fraud short message.

The discrimination process of the textCNN model discrimination module comprises: converting the characters in the short message into a feature matrix; performing one-dimensional convolution with the convolution kernels from top to bottom, where the size of a convolution kernel can be regarded as a form of n-gram; performing max pooling, recording the maximum value from each feature map; generating a univariate feature from each of the feature maps and concatenating these features to form the feature vector of the penultimate layer; the final softmax layer receives this feature vector as input and uses it to classify the short message; after the model is trained, when the probability of a certain scene is greater than a set threshold, the short message is judged to be a fraud short message.

When convolution is carried out with convolution kernels of sizes 2, 3, and 4, with two kernels of each size and six kernels in total, six matrices are obtained; during convolution, the kernels perform one-dimensional convolution from top to bottom; the size of a convolution kernel can be regarded as a form of n-gram, representing 3-gram, 4-gram, and 5-gram respectively; max pooling is then performed, i.e., the maximum value from each feature map is recorded; a univariate feature is generated from each of the six feature maps, and the six features are concatenated to form the feature vector of the penultimate layer; the final softmax layer receives this feature vector as input and uses it to classify the short message.

The classification process of the Bayesian probability classification module comprises: for all fraud short messages obtained from manual review, counting the number of occurrences of each keyword in each scene and taking it as the prior probability of that keyword; calculating, based on naive Bayes theory, the posterior probability of each scene from the prior probabilities of the hit keywords; and judging the short message to be a fraud short message when the probability of a certain scene is greater than a set threshold.

The invention designs and implements an algorithm that uses naive Bayes and a convolutional neural network in parallel for fraud short message classification, and generates a multi-class classifier according to the intent of the fraud short messages. Experiments show that a single multi-class classifier built with CNN has particularly high accuracy but relatively low recall, because feature extraction for non-standard characters is inaccurate. Therefore, common characters (both standard and non-standard) in each type of short message are counted and a naive Bayes model is built as a multi-class classifier, so that short messages not identified by the CNN model can be classified. Meanwhile, to further improve the accuracy and recall of the whole model, a template similarity matching model is added at the front. Finally, the results of the three models are combined to judge the short message, which greatly improves accuracy and recall.

In template similarity matching, template matching can match about 20% of short messages with an accuracy above 90%, and the recall rate can increase dynamically as the template library grows; textCNN can classify about 60% of short messages with an accuracy above 92%. In the Bayesian probability calculation, the size of certain large categories is limited, because otherwise classification would lean heavily toward those categories; at present, the Bayesian probability can classify about 20% of short messages with an accuracy above 90%.

Beneficial effects: the invention designs and implements a scheme in which naive Bayes and textCNN are used in parallel for fraud short message classification, optimizes the whole model with templates generated from keyword statistics, and achieves self-learning through the self-growth of the template library, so that the accuracy and recall rate of fraud short message classification are greatly improved.

Drawings

FIG. 1 is a flow chart of the fraud short message classification method of the present invention;

FIG. 2 is a schematic structural diagram of the textCNN model.

Detailed Description

The present invention will be described in further detail with reference to examples.

Example 1:

FIG. 1 is a flow chart of the fraud short message classification method. The convolutional neural network fraud short message classification method based on naive Bayes optimization in this embodiment comprises the following steps:

(1) Label the data categories, perform keyword statistics on the short messages of each category, and use the high-frequency keywords of each category to generate the template library and the training data. The aim is to construct the template library and the training data for textCNN.
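
The per-category keyword statistics in step (1) could be computed along these lines. This is a sketch under stated assumptions: messages are taken to be already tokenized, and `top_keywords` and the cut-off `k` are hypothetical, since the text does not say how many high-frequency keywords are kept.

```python
from collections import Counter
from typing import Dict, List


def top_keywords(labelled: Dict[str, List[List[str]]], k: int) -> Dict[str, List[str]]:
    """For each category, return the k most frequent tokens across its messages."""
    result = {}
    for category, messages in labelled.items():
        # Count every token occurrence across all messages of this category.
        counts = Counter(token for tokens in messages for token in tokens)
        result[category] = [word for word, _ in counts.most_common(k)]
    return result
```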

(2) Segment the text into words with jieba, so that the feature words in each sentence can be matched against the template library.

(3) Establish a stop-word dictionary, in which the stop words are mainly adverbs, adjectives, and conjunctions. Maintaining a stop-word list is essentially part of feature selection within the feature extraction process; its purpose is to remove redundant information from the short message.
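
Stop-word removal as described in step (3) can be illustrated with a short filter. The stop-word entries below are made up for the example, and the input is assumed to be a token list produced by the segmentation in step (2).

```python
from typing import Iterable, List, Set

# Hypothetical stop-word list; the text only says it holds adverbs,
# adjectives, and conjunctions.
STOP_WORDS: Set[str] = {"very", "and", "but", "really"}


def remove_stop_words(tokens: Iterable[str],
                      stop_words: Set[str] = STOP_WORDS) -> List[str]:
    """Keep only the tokens that are not in the stop-word set, preserving order."""
    return [t for t in tokens if t not in stop_words]
```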

(4) Match and classify the short message to be judged against the short message templates in the template library, and judge a short message whose template matching succeeds to be a fraud short message. The cosine similarity between the short message to be judged and the short message templates in the template library is calculated; when the similarity is greater than a set threshold, the template matching succeeds and the short message is judged to be a fraud short message. That is, using the cosine formula

$$\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert},$$

where $A$ and $B$ are the two word vectors being compared, the average cosine similarity between the word vectors of the short message and the word vectors of the template library is calculated. When the average cosine similarity is greater than the set threshold of 0.5, the short message is judged to be a fraud short message; otherwise, the textCNN model continues to judge whether the short message is a fraud short message.
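
The template matching in step (4) can be sketched in plain Python as below. The text does not spell out how the per-word similarities are averaged, so this example assumes one plausible reading: each word vector of the message is scored by its best cosine match among the template's word vectors, and those scores are averaged. The function names are hypothetical; the 0.5 threshold is the one stated in the embodiment.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def average_similarity(message: List[Vector], template: List[Vector]) -> float:
    """Mean, over message word vectors, of the best cosine match in the template.

    This averaging scheme is an assumption, not taken from the source.
    """
    if not message or not template:
        return 0.0
    return sum(max(cosine(w, t) for t in template) for w in message) / len(message)


def matches_template(message: List[Vector], template: List[Vector],
                     threshold: float = 0.5) -> bool:
    """Template matching succeeds when the average similarity exceeds the threshold."""
    return average_similarity(message, template) > threshold
```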

(5) Short messages whose template matching fails undergo a second discrimination by the textCNN model, and short messages that the textCNN model successfully discriminates are judged to be fraud short messages;

Supervised training is performed on the textCNN model with the prepared training data, and the model and its parameters are saved so that the textCNN model can judge short messages whose template matching fails. FIG. 2 shows a schematic diagram of the textCNN model structure.

The characters of the short message are converted into a feature matrix, which is formed by word embedding. Convolution is then carried out with convolution kernels of sizes 2, 3, and 4, with two kernels of each size and six kernels in total, yielding six matrices. During convolution, a kernel can only perform one-dimensional convolution from top to bottom and cannot convolve from left to right, because a word cannot be split apart for convolution and data obtained by left-right convolution would have no meaning; this differs from the use of CNN on images. The size of a convolution kernel can be regarded as a form of n-gram, representing 3-gram, 4-gram, and 5-gram respectively. Max pooling is then performed, i.e., the maximum value from each feature map is recorded. A univariate feature is thus generated from each of the six feature maps, and these six features are concatenated to form the feature vector of the penultimate layer. The final softmax layer receives this feature vector as input and uses it to classify the sentence. After the model is trained, when the probability of a certain scene is greater than the set threshold of 0.85, the short message is judged to be a fraud short message.
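
To make the data flow of this paragraph concrete, the following pure-Python sketch runs one forward pass: six full-width kernels (two each of heights 2, 3, and 4) slide top to bottom over the sentence matrix, each feature map is max-pooled to a single number, and a softmax layer turns the six pooled values into class probabilities. The weights are random and untrained, the embedding size, sentence length, two-class output, and all function names are assumptions for illustration, and no bias or activation is included.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def conv_feature_map(sentence: Matrix, kernel: Matrix) -> List[float]:
    """Slide a full-width kernel down the sentence matrix (top to bottom only)."""
    h, width = len(kernel), len(kernel[0])
    return [
        sum(sentence[i + r][c] * kernel[r][c] for r in range(h) for c in range(width))
        for i in range(len(sentence) - h + 1)
    ]


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def textcnn_forward(sentence: Matrix, kernels: List[Matrix], dense: Matrix) -> List[float]:
    """Convolve, max-pool each feature map to one value, then apply softmax."""
    pooled = [max(conv_feature_map(sentence, k)) for k in kernels]
    scores = [sum(w * x for w, x in zip(row, pooled)) for row in dense]
    return softmax(scores)


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
EMBED_DIM = 5                                  # toy embedding size (assumption)
sentence = random_matrix(7, EMBED_DIM, rng)    # 7 tokens; must be at least 4 rows
kernels = [random_matrix(h, EMBED_DIM, rng) for h in (2, 3, 4) for _ in range(2)]
dense = random_matrix(2, len(kernels), rng)    # 2 output classes (assumption)
probs = textcnn_forward(sentence, kernels, dense)
is_fraud = probs[1] > 0.85                     # treating index 1 as "fraud" is an assumption
```

In practice such a model would be built and trained with a deep learning framework; the sketch only shows why each kernel yields exactly one feature and why the penultimate layer has six entries.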

(6) Short messages that the textCNN model fails to discriminate are reclassified by calculating a Bayesian probability; short messages successfully classified by the Bayesian probability are determined to be fraud short messages and classification finishes, and short messages that the Bayesian probability fails to classify are determined to be non-fraud short messages. The naive Bayes probability is calculated from keyword statistics, and the specific process of reclassification is as follows: for all fraud short messages obtained from manual review, the number of occurrences of each keyword in each scene is counted and taken as the prior probability of that keyword; based on naive Bayes theory, the probability of the short message that failed discrimination is calculated from the prior probabilities of the keywords it hits; when the probability is greater than a set threshold, the short message is judged to be a fraud short message.

That is, using the Bayesian formula

$$P(c \mid w) = \frac{P(w \mid c)\,P(c)}{P(w)},$$

where $c$ is a fraud category and $w$ is a hit keyword, the keywords of each fraud category are counted, the conditional probability of each keyword in each category is calculated, and the posterior probability is then calculated for the keywords hit in the short message to be judged. When the posterior probability of a certain scene is greater than the set threshold of 0.9, the short message is judged to be a fraud short message of that category.
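
A minimal naive Bayes sketch of step (6) follows. Several details are assumptions not given in the text: Laplace smoothing with `alpha`, a scene prior proportional to the number of messages per scene, and normalizing the posterior over the fraud scenes only. The function names and the toy data are hypothetical; the 0.9 threshold is the one stated in the embodiment.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def train(by_scene: Dict[str, List[List[str]]]) -> Tuple[Dict[str, float], Dict[str, Counter]]:
    """Return per-scene priors and per-scene keyword counts from labelled messages."""
    total = sum(len(msgs) for msgs in by_scene.values())
    priors = {s: len(msgs) / total for s, msgs in by_scene.items()}
    counts = {s: Counter(w for m in msgs for w in m) for s, msgs in by_scene.items()}
    return priors, counts


def posteriors(hits: List[str], priors: Dict[str, float],
               counts: Dict[str, Counter], alpha: float = 1.0) -> Dict[str, float]:
    """Posterior of each scene given the hit keywords (smoothing is an assumption)."""
    vocab = set().union(*counts.values())
    joint = {}
    for scene, prior in priors.items():
        denom = sum(counts[scene].values()) + alpha * len(vocab)
        p = prior
        for w in hits:
            p *= (counts[scene][w] + alpha) / denom  # P(w | scene), smoothed
        joint[scene] = p
    z = sum(joint.values())
    return {s: p / z for s, p in joint.items()}


def classify_bayes(hits: List[str], priors: Dict[str, float],
                   counts: Dict[str, Counter], threshold: float = 0.9) -> Optional[str]:
    """Return the fraud scene if its posterior exceeds the threshold, else None."""
    post = posteriors(hits, priors, counts)
    scene = max(post, key=post.get)
    return scene if post[scene] > threshold else None


# Toy labelled data (made up for illustration).
TOY = {
    "loan": [["loan", "interest"], ["loan", "apply"]],
    "lottery": [["prize", "lottery"], ["prize", "claim"]],
}
PRIORS, COUNTS = train(TOY)
```

With this toy data, a message hitting keywords from a single scene clears the threshold, while a message whose hits are split evenly between scenes returns `None`, which corresponds to the non-fraud outcome of this step.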

(7) Store the test data in the database and obtain the results.

In the template similarity matching of this embodiment, template matching can match about 20% of short messages with an accuracy above 90%, and the recall rate can increase dynamically as the template library grows; textCNN can classify about 60% of short messages with an accuracy above 92%. In the Bayesian probability calculation, the size of certain large categories is limited, because otherwise classification would lean heavily toward those categories; at present, the Bayesian probability can classify about 20% of short messages with an accuracy above 90%.

Example 2:

The convolutional neural network fraud short message classification system based on naive Bayesian optimization comprises a construction module, a template matching module, a textCNN model discrimination module, and a Bayesian probability classification module.

The construction module is used for constructing the template library: after all fraud short messages obtained from manual review are clustered, a representative short message sample is found in each cluster to form a short message template, and the template library is built from these templates;

The template matching module is used for matching and classifying the short message to be judged against the short message templates in the template library, judging a short message whose template matching succeeds to be a fraud short message, and sending a short message whose template matching fails to the textCNN model discrimination module for a second discrimination. The process of matching and classifying the short message to be judged against the short message templates in the template library comprises: calculating the cosine similarity between the short message to be judged and the short message templates in the template library; when the similarity is greater than a set threshold, the template matching succeeds and the short message is judged to be a fraud short message.

The textCNN model discrimination module performs a second discrimination on short messages whose template matching fails, judges successfully discriminated short messages to be fraud short messages, and sends short messages that fail discrimination to the Bayesian probability classification module for reclassification; short messages successfully discriminated by the textCNN model are formed into new short message templates, and these augmented templates are added to the template library. The discrimination process of the textCNN model discrimination module comprises: converting the characters in the short message into a feature matrix, and carrying out convolution with convolution kernels of sizes 2, 3, and 4, with two kernels of each size and six kernels in total, to obtain six matrices; during convolution, a kernel can only perform one-dimensional convolution from top to bottom; the size of a convolution kernel can be regarded as a form of n-gram, representing 3-gram, 4-gram, and 5-gram respectively; max pooling is then performed, i.e., the maximum value from each feature map is recorded; a univariate feature is generated from each of the six feature maps, and the six features are concatenated to form the feature vector of the penultimate layer; the final softmax layer receives this feature vector as input and uses it to classify the short message.

The Bayesian probability classification module is used for reclassifying the short messages that the textCNN model fails to discriminate by calculating a Bayesian probability, determining short messages successfully classified by the Bayesian probability to be fraud short messages and finishing classification, and determining short messages that the Bayesian probability fails to classify to be non-fraud short messages. The classification process of the Bayesian probability classification module comprises: for all fraud short messages obtained from manual review, counting the number of occurrences of each keyword in each scene and taking it as the prior probability of that keyword; calculating, based on naive Bayes theory, the probability of the short message that failed discrimination from the prior probabilities of the keywords it hits; and judging the short message to be a fraud short message when the probability is greater than a set threshold.

Claims

1. A convolutional neural network fraud short message classification method based on naive Bayesian optimization is characterized by comprising the following steps:

constructing a template library, carrying out matching classification on the short message to be judged and the short message templates in the template library, and judging that the template matching is successful as a fraud short message; the construction of the template library comprises the following steps: after clustering all fraud short messages from manual research and judgment, finding out representative short message samples from each class to form a short message template, and constructing a template library; the specific process of matching and classifying the short message to be distinguished and the short message templates in the template library is as follows: calculating cosine similarity of the short message to be judged and the short message templates in the template library, and judging that the short message is a fraud short message when the similarity is larger than a set threshold value, wherein the template matching is successful;

performing a second discrimination, through a textCNN model, on short messages whose template matching fails, and judging short messages successfully discriminated by the textCNN model to be fraud short messages; forming the short messages successfully discriminated by the textCNN model into new short message templates, and adding the resulting augmented templates to the template library;

the secondary discrimination of the textCNN model comprises the following specific processes: performing feature matrix processing on characters in the short message; performing one-dimensional convolution on a convolution kernel from top to bottom, wherein the size of the convolution kernel is regarded as the expression form of n-gram, then performing maximum merging, and recording the maximum number from each feature map; generating univariate feature vectors from all the graphs, and the features are connected to form feature vectors of the penultimate layer; the final softmax layer receives the feature vector as input and uses it to classify the short message;

reclassifying the short messages which fail to be judged by the textCNN model by calculating Bayesian probability, determining the short messages which are successfully classified by the Bayesian probability as fraud short messages and completing classification, and determining the short messages which fail to be classified by the Bayesian probability as non-fraud short messages; the specific process of calculating the Bayesian probability for reclassifying is as follows: counting the number of all fraud short messages based on manual research and judgment in each scene, taking the number of the occurrence of each keyword in each scene as the prior probability of each keyword, calculating the posterior probability of each keyword belonging to the text short message based on the naive Bayesian theory according to the prior probability of the hit keyword, and judging the text short message as the fraud short message when the posterior probability is larger than a set threshold value; wherein the set threshold is 0.85.

2. A convolutional neural network fraud short message classification system based on naive Bayesian optimization, comprising:

the template matching module is used for carrying out matching classification on the short message to be judged and the short message templates in the template library, judging that the template matching is successful as a fraud short message, and sending the short message with failed template matching to the textCNN model judging module for secondary judgment; the process for matching and classifying the short message to be distinguished and the short message templates in the template library comprises the following steps: calculating cosine similarity of the short message to be judged and the short message templates in the template library, and judging that the short message is a fraud short message when the similarity is larger than a set threshold value, wherein the template matching is successful;

the textCNN model judging module judges the short message with failed template matching for the second time, judges the successfully judged short message as a fraud short message, and sends the short message with failed judgment to the Bayesian probability classifying module for reclassifying; the text CNN model is judged to be successful, a new short message template is formed by the short message, and an amplification template is formed and added into a template library; the discriminating process of the textCNN model discriminating module comprises the following steps: performing feature matrix processing on characters in the short message; performing one-dimensional convolution on a convolution kernel from top to bottom, wherein the size of the convolution kernel is regarded as the expression form of n-gram, then performing maximum merging, and recording the maximum number from each feature map; generating univariate feature vectors from all the graphs, and the features are connected to form feature vectors of the penultimate layer; the final softmax layer receives the feature vector as input and uses it to classify the short message;

the Bayesian probability classification module is used for reclassifying the short messages which are failed to be judged by the textCNN model by calculating the Bayesian probability, determining the short messages which are successfully classified by the Bayesian probability as fraud short messages and completing classification, and determining the short messages which are unsuccessfully classified by the Bayesian probability as non-fraud short messages; the classification process of the Bayesian probability classification module comprises the following steps: counting all fraud messages based on manual research and judgment, counting the number of the keywords in each scene, taking the number as the prior probability of each keyword, calculating the probability of the short message which fails to judge the text CNN model based on the naive Bayesian theory according to the prior probability of the hit keyword, and judging the short message as the fraud message when the posterior probability is larger than a set threshold value.