CN111198947B - Convolutional neural network fraud short message classification method and system based on naive Bayes optimization - Google Patents
Convolutional neural network fraud short message classification method and system based on naive Bayes optimization
- Publication number
- CN111198947B (application CN202010008497.7A)
- Authority
- CN
- China
- Prior art keywords
- short message
- fraud
- template
- short
- probability
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention discloses a convolutional neural network fraud short message classification method and system based on naive Bayes optimization. A template library is established, and the short messages to be discriminated are matched against the short message templates in the library; short messages that match a template are classified as fraud short messages. Short messages for which template matching fails undergo secondary discrimination by a textCNN model, and those successfully discriminated by the textCNN model are determined to be fraud short messages. Short messages for which textCNN discrimination fails are reclassified by calculating Bayesian probabilities, and those not identified by the Bayesian classification are determined to be non-fraud short messages. The invention realizes a scheme in which naive Bayes and textCNN classify fraud short messages in parallel, optimizes the overall model through templates generated from keyword statistics, and achieves self-learning through self-growth of the template library, so that the accuracy and recall of fraud short message classification are greatly improved.
Description
Technical Field
The invention relates to a natural language processing text classification method and system, and in particular to a convolutional neural network fraud short message classification method and system based on naive Bayes optimization.
Background
With the development of the internet, unstructured text data has grown sharply, and it has become increasingly difficult for people to locate the data they need within this mass of information. How to organize and manage such information effectively, and to find the information users need quickly, accurately and comprehensively, is a major current challenge. Because the volume of information is so large, collecting and mining text data purely by hand would consume enormous manpower and time and is practically infeasible. Automatic text classification is therefore particularly important: it is a basic function of text information processing and has become a core technology for processing and organizing text data.
For text classification problems, a common approach is to extract features from the text, for example by using word2vec or LDA models to convert the text into a feature vector of fixed dimension, and then to train a classifier on the extracted features. Since the TextCNN model proposed in Yoon Kim's paper Convolutional Neural Networks for Sentence Classification, convolutional neural networks (CNN) have opened the door to text classification; experimental research shows that CNN has a strong ability to extract shallow features, performs well on short text classification, is widely applicable and fast, and is therefore generally preferred. Meanwhile, naive Bayes, one of the classic data mining algorithms, is widely used for classification because it is easy to construct and interpret and performs well.
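As a rough illustration of the feature-extraction approach described above, the following is a minimal sketch of converting short texts into fixed-dimension vectors by averaging word2vec vectors. It assumes the gensim library; the toy corpus, parameters and function names are illustrative, not taken from the patent.

```python
# Minimal sketch: averaging word2vec vectors to obtain a fixed-dimension text feature.
# The corpus and hyperparameters are illustrative placeholders only.
import numpy as np
from gensim.models import Word2Vec

corpus = [["account", "frozen", "verify", "now"],
          ["meeting", "moved", "to", "friday"]]

w2v = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=20)

def text_vector(tokens, model):
    """Average the word vectors of in-vocabulary tokens; zeros if none are known."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

features = np.vstack([text_vector(t, w2v) for t in corpus])  # one row per message
```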
However, in real fraud short message applications, although textCNN classifies normal text messages very well, many fraud messages contain non-standard (deliberately altered or irregular) characters. When word2vec converts such text into feature vectors, it cannot represent these non-standard characters accurately, so the classification performance of textCNN is not ideal, and the classification of short messages containing non-standard text therefore needs to be optimized.
Disclosure of Invention
The invention aims to: the first object of the invention is to provide a convolutional neural network fraud short message classification method based on naive Bayes optimization that can improve the accuracy and recall of fraud short message classification; the second object of the invention is to provide a corresponding convolutional neural network fraud short message classification system based on naive Bayes optimization.
The technical scheme is as follows: the convolutional neural network fraud short message classification method based on naive Bayes optimization of the invention comprises the following steps:
establishing a template library, matching the short messages to be discriminated against the short message templates in the template library, and determining the short messages for which template matching succeeds to be fraud short messages; the construction of the template library comprises clustering all fraud short messages obtained from manual research and judgment, selecting a representative short message sample from each cluster to form a short message template, and thereby building the template library.
Performing secondary discrimination on the short messages for which template matching failed using a textCNN model, and sending the short messages for which textCNN discrimination failed to Bayesian probability classification;
and reclassifying the short messages for which textCNN discrimination failed by calculating Bayesian probabilities, determining the short messages successfully classified by the Bayesian probability as fraud short messages to complete the classification, and determining the short messages not classified by the Bayesian probability as non-fraud short messages.
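As a rough illustration only, the following sketch shows how the three stages described above chain together. The function interfaces and names are assumptions rather than parts of the patent; only the cascade order comes from the method itself, and the default thresholds mirror the values given in the embodiment below.

```python
# Minimal sketch of the three-stage cascade: template matching -> textCNN -> naive Bayes.
# template_match, textcnn_predict and bayes_predict are assumed interfaces,
# each returning a confidence score in [0, 1]; thresholds follow the embodiment.
def classify_sms(sms: str,
                 template_match,   # max cosine similarity to the template library
                 textcnn_predict,  # max per-scene probability from the textCNN model
                 bayes_predict,    # max per-scene naive-Bayes posterior
                 t_template=0.5, t_cnn=0.85, t_bayes=0.9) -> str:
    if template_match(sms) > t_template:
        return "fraud (template match)"
    if textcnn_predict(sms) > t_cnn:
        return "fraud (textCNN)"
    if bayes_predict(sms) > t_bayes:
        return "fraud (naive Bayes)"
    return "non-fraud"
```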
Preferably, the short messages successfully discriminated by the textCNN model are formed into new short message templates, which are added to the template library as augmentation templates. The template library thereby grows its templates through textCNN, achieving adaptive matching.
Preferably, the specific process of matching the short message to be discriminated against the short message templates in the template library is as follows: calculate the cosine similarity between the short message to be discriminated and the short message templates in the template library; when the similarity is greater than a set threshold, template matching succeeds and the short message is determined to be a fraud short message.
Preferably, the specific process of the secondary discrimination by the textCNN model is as follows: the characters of the short message are processed into a feature matrix; the convolution kernels perform one-dimensional convolution from top to bottom, where the size of a convolution kernel can be regarded as a representation of an n-gram; max pooling is then performed, recording the maximum value from each feature map; univariate feature vectors are generated from all the feature maps and concatenated to form the feature vector of the penultimate layer; the final softmax layer takes this feature vector as input and uses it to classify the short message; after the model is trained, the short message is determined to be a fraud short message when the probability of a certain scene exceeds a set threshold.
Convolution is performed with convolution kernels of sizes 2, 3 and 4, with two kernels of each size and 6 kernels in total, yielding six feature matrices; during convolution, each kernel performs one-dimensional convolution from top to bottom; the kernel sizes can be regarded as representations of n-grams (3-gram, 4-gram and 5-gram respectively), and max pooling is then performed, i.e., the maximum value from each feature map is recorded; univariate feature vectors are generated from all six maps and concatenated to form the feature vector of the penultimate layer; the final softmax layer takes this feature vector as input and uses it to classify the short message; after the model is trained, the short message is determined to be a fraud short message when the probability of a certain scene exceeds a set threshold.
Preferably, the specific process of reclassifying by calculating the Bayesian probability is as follows: all fraud short messages obtained from manual research and judgment are counted per scene, and the number of occurrences of each keyword in each scene is taken as the prior probability of that keyword; based on naive Bayes theory, the posterior probability of each scene is calculated from the prior probabilities of the keywords hit by the short message, and the short message is determined to be a fraud short message when the probability of a certain scene is greater than a set threshold.
The invention also provides a convolutional neural network fraud short message classification system based on naive Bayes optimization, which comprises a construction module, a template matching module, a textCNN model discrimination module and a Bayes probability classification module.
The construction module is used for constructing the template library: after clustering all fraud short messages obtained from manual research and judgment, a representative short message sample is selected from each cluster to form a short message template, thereby building the template library;
the template matching module is used for matching the short messages to be discriminated against the short message templates in the template library, determining the short messages for which template matching succeeds to be fraud short messages, and sending the short messages for which template matching fails to the textCNN model discrimination module for secondary discrimination;
the textCNN model discrimination module performs secondary discrimination on the short messages for which template matching failed, determines the successfully discriminated short messages to be fraud short messages, and sends the short messages for which discrimination failed to the Bayesian probability classification module for reclassification; the short messages successfully discriminated by the textCNN model are formed into new short message templates, which are added to the template library as augmentation templates;
and the Bayesian probability classification module is used for reclassifying the short messages for which textCNN discrimination failed by calculating Bayesian probabilities, determining the short messages successfully classified by the Bayesian probability as fraud short messages to complete the classification, and determining the short messages not classified by the Bayesian probability as non-fraud short messages.
Preferably, the process of matching the short message to be discriminated against the short message templates in the template library comprises: calculating the cosine similarity between the short message to be discriminated and the short message templates in the template library; when the similarity is greater than a set threshold, template matching succeeds and the short message is determined to be a fraud short message.
The discrimination process of the textCNN model discrimination module comprises: processing the characters of the short message into a feature matrix; performing one-dimensional convolution with the convolution kernels from top to bottom, where the size of a convolution kernel can be regarded as a representation of an n-gram; performing max pooling and recording the maximum value from each feature map; generating univariate feature vectors from all the feature maps and concatenating them to form the feature vector of the penultimate layer; the final softmax layer takes this feature vector as input and uses it to classify the short message; after the model is trained, the short message is determined to be a fraud short message when the probability of a certain scene exceeds a set threshold.
Convolution is performed with convolution kernels of sizes 2, 3 and 4, two kernels of each size and 6 kernels in total, yielding six matrices; during convolution, the kernels perform one-dimensional convolution from top to bottom; the kernel sizes can be regarded as representations of n-grams (3-gram, 4-gram and 5-gram respectively), and max pooling is then performed, i.e., the maximum value from each feature map is recorded; univariate feature vectors are generated from all six maps and concatenated to form the feature vector of the penultimate layer; the final softmax layer takes this feature vector as input and uses it to classify the short message.
The classification process of the Bayesian probability classification module comprises: counting all fraud short messages obtained from manual research and judgment per scene, taking the number of occurrences of each keyword in each scene as the prior probability of that keyword; calculating, based on naive Bayes theory, the posterior probability of each scene from the prior probabilities of the keywords hit by the short message; and determining the short message to be a fraud short message when the probability of a certain scene is greater than a set threshold.
The invention designs and implements a parallel algorithm combining naive Bayes and a convolutional neural network for fraud short message classification, and generates a multi-classifier according to the intent of the fraud short messages. Experiments show that a single multi-classifier built with CNN alone achieves particularly high accuracy but relatively low recall, because feature extraction for non-standard characters is inaccurate. Therefore, by counting the common characters (both standard and non-standard) in each category of short messages and building a naive Bayes model, a multi-classifier is formed that can classify the short messages the CNN model fails to identify. Meanwhile, to further improve the accuracy and recall of the overall model, a template similarity matching model is added at the front, and finally the results of the three models are combined to judge the short messages, so that accuracy and recall are greatly improved.
In template similarity matching, roughly 20% of short messages can be matched with an accuracy above 90%, and the recall can grow dynamically through the self-growth of the template library; textCNN can classify about 60% of short messages with an accuracy above 92%. In the Bayesian probability calculation, the counts for certain very large categories must be limited, otherwise classification is strongly biased toward those categories; at present the Bayesian probability step can classify about 20% of short messages with an accuracy above 90%.
Beneficial effects: the invention designs and implements a scheme in which naive Bayes and textCNN classify fraud short messages in parallel, optimizes the overall model through templates generated from keyword statistics, and achieves self-learning through self-growth of the template library, so that the accuracy and recall of fraud short message classification are greatly improved.
Drawings
FIG. 1 is a flow chart of the fraud short message classification method of the present invention;
FIG. 2 is a schematic structural diagram of the textCNN model.
Detailed Description
The present invention will be described in further detail with reference to examples.
Example 1:
FIG. 1 is a flow chart of the fraud short message classification method. The convolutional neural network fraud short message classification method based on naive Bayes optimization of this embodiment comprises the following steps:
(1) Label the data by category, perform keyword statistics on the short messages of each category, and use the high-frequency keywords of each category to generate the template library and the training data. The purpose is to construct the template library and the training data for textCNN.
(2) Segment the text with jieba so that the feature words in the sentences can be matched against the template library.
(3) Establish a stop-word dictionary; the stop words mainly include adverbs, adjectives and some conjunctions. Maintaining a stop-word list is essentially part of feature selection: it removes redundant information from the short message.
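The following minimal sketch illustrates the segmentation and stop-word filtering of steps (2) and (3); the stop-word list and the example message are placeholders rather than content from the patent.

```python
# Minimal sketch of jieba segmentation plus stop-word filtering.
# The stop-word set here is a tiny illustrative placeholder.
import jieba

STOP_WORDS = {"的", "了", "很", "和", "也"}  # adverbs, adjectives, conjunctions, etc.

def tokenize(sms: str) -> list[str]:
    """Segment a short message with jieba and drop stop words."""
    return [w for w in jieba.lcut(sms) if w.strip() and w not in STOP_WORDS]

tokens = tokenize("您的银行账户已被冻结，请立即点击链接验证")
```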
(4) Match the short message to be discriminated against the short message templates in the template library; when template matching succeeds, the short message is determined to be a fraud short message. Specifically, the cosine similarity is computed with the cosine formula cos θ = (A · B) / (‖A‖ ‖B‖): the average cosine similarity between each word vector of the short message and the word vectors of the template library is calculated, and when the average cosine similarity is greater than the set threshold of 0.5, template matching succeeds and the short message is determined to be a fraud short message; otherwise the textCNN model is used to continue judging whether the short message is a fraud short message.
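A minimal sketch of this template-matching step, under one reasonable reading of the "average cosine similarity" above (the message is represented by the average of its word vectors and compared with each template vector); the word-vector and template containers are assumed placeholders, while the 0.5 threshold comes from the description.

```python
# Minimal sketch of cosine-similarity template matching with a 0.5 threshold.
# `word_vectors` maps tokens to numpy arrays (e.g. from a trained word2vec model);
# `template_vectors` holds one averaged vector per template in the library.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def matches_template(tokens, word_vectors, template_vectors, threshold=0.5) -> bool:
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return False
    sms_vec = np.mean(vecs, axis=0)                       # average word vector of the message
    best = max(cosine(sms_vec, tv) for tv in template_vectors)
    return best > threshold                               # fraud if some template is similar enough
```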
(5) Perform secondary discrimination on the short messages for which template matching failed using the textCNN model; the short messages successfully discriminated by the textCNN model are determined to be fraud short messages.
Supervised training of the textCNN model is performed with the prepared training data, and the model and its parameters are saved so that the textCNN model can discriminate the short messages for which template matching failed. FIG. 2 shows a schematic diagram of the textCNN model structure.
The characters of the short message are processed into a feature matrix: after word embedding, a feature matrix is formed, and convolution is then performed with convolution kernels of sizes 2, 3 and 4, two kernels of each size and 6 kernels in total, yielding six matrices. During convolution, a kernel can only slide one-dimensionally from top to bottom and cannot slide left and right, because words must not be split for convolution; data obtained by left-right convolution would be meaningless, which is where this application differs from CNN on images. The kernel sizes can be regarded as representations of n-grams (3-gram, 4-gram and 5-gram respectively). Max pooling is then performed, i.e., the maximum value from each feature map is recorded. In this way univariate feature vectors are generated from all six maps, and these six features are concatenated to form the feature vector of the penultimate layer. The final softmax layer takes this feature vector as input and uses it to classify the sentence. After the model is trained, the short message is determined to be a fraud short message when the probability of a certain scene is greater than the set threshold of 0.85.
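A minimal PyTorch sketch of this textCNN structure; the vocabulary size, embedding dimension and number of scene classes are illustrative assumptions, while the kernel sizes (2, 3, 4), two kernels per size, max pooling, concatenation and softmax follow the description above.

```python
# Minimal PyTorch sketch of the textCNN described above:
# embedding -> one-dimensional convolutions of sizes 2/3/4 (two kernels each) ->
# max pooling -> concatenation -> softmax over scene classes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=128, num_classes=5,
                 kernel_sizes=(2, 3, 4), kernels_per_size=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, kernels_per_size, (k, embed_dim)) for k in kernel_sizes]
        )
        self.fc = nn.Linear(kernels_per_size * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                        # (batch, seq_len)
        x = self.embedding(token_ids).unsqueeze(1)        # (batch, 1, seq_len, embed_dim)
        # Each kernel spans the full embedding width, so it slides top-to-bottom only.
        feats = [F.relu(conv(x)).squeeze(3) for conv in self.convs]       # (batch, 2, L_k)
        pooled = [F.max_pool1d(f, f.size(2)).squeeze(2) for f in feats]   # max pooling
        cat = torch.cat(pooled, dim=1)                    # penultimate feature vector (6 values)
        return F.softmax(self.fc(cat), dim=1)             # per-scene probabilities

# Usage at inference time (illustrative): fraud if the top probability exceeds 0.85.
# probs = TextCNN()(torch.randint(0, 5000, (1, 40))); is_fraud = probs.max().item() > 0.85
```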
(6) Reclassify the short messages for which textCNN discrimination failed by calculating Bayesian probabilities; the short messages successfully classified by the Bayesian probability are determined to be fraud short messages and the classification is complete, while the short messages not classified by the Bayesian probability are determined to be non-fraud short messages. The specific process, which uses the keyword statistics to compute naive Bayes probabilities, is as follows: all fraud short messages obtained from manual research and judgment are counted per scene, the number of occurrences of each keyword in each scene is taken as the prior probability of that keyword, the probability for the short message that failed discrimination is calculated from the prior probabilities of the hit keywords based on naive Bayes theory, and the short message is determined to be a fraud short message when the probability is greater than the set threshold.
That is, using the Bayes formula P(C|x) = P(x|C)·P(C)/P(x), the keywords of each fraud category are counted and the conditional probability of each keyword in each category is calculated; the posterior probability for the keywords hit in the short message under discrimination is then computed with the Bayes formula, and when the posterior probability of a certain scene is greater than the set threshold of 0.9, the short message is determined to be a fraud short message of that category.
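A minimal sketch of this keyword-based naive Bayes scoring; the per-scene keyword counts are placeholders, and the Laplace smoothing is an added assumption for numerical robustness, while the use of keyword frequencies and the 0.9 threshold follow the description above.

```python
# Minimal sketch of keyword-count naive Bayes over fraud "scenes".
# scene_keyword_counts: per-scene keyword occurrence counts from manually
# researched fraud messages (placeholder numbers here).
import math

scene_keyword_counts = {
    "impersonation": {"account": 40, "frozen": 35, "verify": 30},
    "lottery":       {"prize": 50, "winner": 25, "claim": 20},
}
scene_totals = {s: sum(c.values()) for s, c in scene_keyword_counts.items()}
total = sum(scene_totals.values())

def scene_posteriors(tokens, alpha=1.0):
    """Log-space naive Bayes with Laplace smoothing, normalized to posteriors."""
    scores = {}
    for scene, counts in scene_keyword_counts.items():
        log_p = math.log(scene_totals[scene] / total)          # scene prior
        vocab = len(counts)
        for t in tokens:
            log_p += math.log((counts.get(t, 0) + alpha) /
                              (scene_totals[scene] + alpha * vocab))
        scores[scene] = log_p
    m = max(scores.values())
    exp = {s: math.exp(v - m) for s, v in scores.items()}
    z = sum(exp.values())
    return {s: v / z for s, v in exp.items()}

post = scene_posteriors(["account", "frozen", "verify"])
is_fraud = max(post.values()) > 0.9   # threshold from the embodiment
```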
(7) Store the test data in the database and obtain the classification results.
In the template similarity matching of this embodiment, roughly 20% of short messages can be matched with an accuracy above 90%, and the recall can grow dynamically through the self-growth of the template library; textCNN can classify about 60% of short messages with an accuracy above 92%. In the Bayesian probability calculation, the counts for certain very large categories must be limited, otherwise classification is strongly biased toward those categories; at present the Bayesian probability step can classify about 20% of short messages with an accuracy above 90%.
Example 2:
The convolutional neural network fraud short message classification system based on naive Bayes optimization of this embodiment comprises a construction module, a template matching module, a textCNN model discrimination module and a Bayesian probability classification module.
The construction module is used for constructing the template library: after clustering all fraud short messages obtained from manual research and judgment, a representative short message sample is selected from each cluster to form a short message template, thereby building the template library.
The template matching module is used for matching the short messages to be discriminated against the short message templates in the template library, determining the short messages for which template matching succeeds to be fraud short messages, and sending the short messages for which template matching fails to the textCNN model discrimination module for secondary discrimination. The matching process is as follows: the cosine similarity between the short message to be discriminated and the short message templates in the template library is calculated, and when the similarity is greater than the set threshold, template matching succeeds and the short message is determined to be a fraud short message.
The textCNN model discrimination module performs secondary discrimination on the short messages for which template matching failed, determines the successfully discriminated short messages to be fraud short messages, and sends the short messages for which discrimination failed to the Bayesian probability classification module for reclassification; the short messages successfully discriminated by the textCNN model are formed into new short message templates, which are added to the template library as augmentation templates. The discrimination process of the textCNN model discrimination module is as follows: the characters of the short message are processed into a feature matrix, and convolution is performed with convolution kernels of sizes 2, 3 and 4, two kernels of each size and 6 kernels in total, yielding six matrices; during convolution, the kernels can only perform one-dimensional convolution from top to bottom; the kernel sizes can be regarded as representations of n-grams (3-gram, 4-gram and 5-gram respectively), and max pooling is then performed, i.e., the maximum value from each feature map is recorded; univariate feature vectors are generated from all six maps and concatenated to form the feature vector of the penultimate layer; the final softmax layer takes this feature vector as input and uses it to classify the short message.
The Bayesian probability classification module reclassifies the short messages for which textCNN discrimination failed by calculating Bayesian probabilities, determines the short messages successfully classified by the Bayesian probability as fraud short messages to complete the classification, and determines the short messages not classified by the Bayesian probability as non-fraud short messages. Its classification process is as follows: all fraud short messages obtained from manual research and judgment are counted per scene, the number of occurrences of each keyword in each scene is taken as the prior probability of that keyword, the probability for the short message that failed discrimination is calculated from the prior probabilities of the hit keywords based on naive Bayes theory, and the short message is determined to be a fraud short message when the probability is greater than the set threshold.
Claims (2)
1. A convolutional neural network fraud short message classification method based on naive Bayes optimization, characterized by comprising the following steps:
constructing a template library, matching the short messages to be discriminated against the short message templates in the template library, and determining the short messages for which template matching succeeds to be fraud short messages; the construction of the template library comprises: after clustering all fraud short messages obtained from manual research and judgment, selecting a representative short message sample from each cluster to form a short message template, thereby building the template library; the specific process of matching the short message to be discriminated against the short message templates in the template library is as follows: calculating the cosine similarity between the short message to be discriminated and the short message templates in the template library, and when the similarity is greater than a set threshold, the template matching succeeds and the short message is determined to be a fraud short message;
performing secondary discrimination on the short messages for which template matching failed using a textCNN model, and determining the short messages successfully discriminated by the textCNN model to be fraud short messages; the short messages successfully discriminated by the textCNN model are formed into new short message templates, which are added to the template library as augmentation templates;
the specific process of the secondary discrimination by the textCNN model is as follows: processing the characters of the short message into a feature matrix; performing one-dimensional convolution with the convolution kernels from top to bottom, the size of a convolution kernel being regarded as a representation of an n-gram; then performing max pooling and recording the maximum value from each feature map; generating univariate feature vectors from all the feature maps, which are concatenated to form the feature vector of the penultimate layer; the final softmax layer receiving this feature vector as input and using it to classify the short message;
reclassifying the short messages for which textCNN discrimination failed by calculating Bayesian probabilities, determining the short messages successfully classified by the Bayesian probability as fraud short messages to complete the classification, and determining the short messages not classified by the Bayesian probability as non-fraud short messages; the specific process of reclassifying by calculating the Bayesian probability is as follows: counting all fraud short messages obtained from manual research and judgment in each scene, taking the number of occurrences of each keyword in each scene as the prior probability of that keyword, calculating, based on naive Bayes theory, the posterior probability of each scene for the short message from the prior probabilities of the hit keywords, and determining the short message to be a fraud short message when the posterior probability is greater than a set threshold; wherein the set threshold is 0.85.
2. A system for the convolutional neural network fraud short message classification method based on naive Bayes optimization, comprising:
a construction module, used for constructing the template library: after clustering all fraud short messages obtained from manual research and judgment, selecting a representative short message sample from each cluster to form a short message template, thereby building the template library;
a template matching module, used for matching the short messages to be discriminated against the short message templates in the template library, determining the short messages for which template matching succeeds to be fraud short messages, and sending the short messages for which template matching fails to the textCNN model discrimination module for secondary discrimination; the matching process comprises: calculating the cosine similarity between the short message to be discriminated and the short message templates in the template library, and when the similarity is greater than a set threshold, the template matching succeeds and the short message is determined to be a fraud short message;
a textCNN model discrimination module, which performs secondary discrimination on the short messages for which template matching failed, determines the successfully discriminated short messages to be fraud short messages, and sends the short messages for which discrimination failed to the Bayesian probability classification module for reclassification; the short messages successfully discriminated by the textCNN model are formed into new short message templates, which are added to the template library as augmentation templates; the discrimination process of the textCNN model discrimination module comprises: processing the characters of the short message into a feature matrix; performing one-dimensional convolution with the convolution kernels from top to bottom, the size of a convolution kernel being regarded as a representation of an n-gram; then performing max pooling and recording the maximum value from each feature map; generating univariate feature vectors from all the feature maps, which are concatenated to form the feature vector of the penultimate layer; the final softmax layer receiving this feature vector as input and using it to classify the short message;
a Bayesian probability classification module, used for reclassifying the short messages for which textCNN discrimination failed by calculating Bayesian probabilities, determining the short messages successfully classified by the Bayesian probability as fraud short messages to complete the classification, and determining the short messages not classified by the Bayesian probability as non-fraud short messages; the classification process of the Bayesian probability classification module comprises: counting all fraud short messages obtained from manual research and judgment, taking the number of occurrences of each keyword in each scene as the prior probability of that keyword, calculating, based on naive Bayes theory, the probability for the short message for which textCNN discrimination failed from the prior probabilities of the hit keywords, and determining the short message to be a fraud short message when the posterior probability is greater than a set threshold.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010008497.7A CN111198947B (en) | 2020-01-06 | 2020-01-06 | Convolutional neural network fraud short message classification method and system based on naive Bayes optimization |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111198947A CN111198947A (en) | 2020-05-26 |
CN111198947B true CN111198947B (en) | 2024-02-13 |
Family
ID=70744553
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010008497.7A Active CN111198947B (en) | 2020-01-06 | 2020-01-06 | Convolutional neural network fraud short message classification method and system based on naive Bayes optimization |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111198947B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112231537A (en) * | 2020-11-09 | 2021-01-15 | 张印祺 | Intelligent reading system based on deep learning and web crawler |
CN115022464A (en) * | 2022-05-06 | 2022-09-06 | 中国联合网络通信集团有限公司 | Number processing method, system, computing device and storage medium |
CN114928498A (en) * | 2022-06-15 | 2022-08-19 | 中国联合网络通信集团有限公司 | Fraud information identification method and device and computer readable storage medium |
CN114786184B (en) * | 2022-06-21 | 2022-09-16 | 中国信息通信研究院 | Method and device for generating fraud-related short message interception template |
CN116150379B (en) * | 2023-04-04 | 2023-06-30 | 中国信息通信研究院 | Short message text classification method and device, electronic equipment and storage medium |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101257671A (en) * | 2007-07-06 | 2008-09-03 | 浙江大学 | Real-time filtering method of large-scale spam text messages based on content |
CN103634473A (en) * | 2013-12-05 | 2014-03-12 | 南京理工大学连云港研究院 | Naive Bayesian classification based mobile phone spam short message filtering method and system |
CN108268461A (en) * | 2016-12-30 | 2018-07-10 | 广东精点数据科技股份有限公司 | A kind of document sorting apparatus based on hybrid classifer |
CN107835496A (en) * | 2017-11-24 | 2018-03-23 | 北京奇虎科技有限公司 | A kind of recognition methods of refuse messages, device and server |
CN110059189A (en) * | 2019-04-11 | 2019-07-26 | 厦门点触科技股份有限公司 | A kind of categorizing system and method for gaming platform message |
Also Published As
Publication number | Publication date |
---|---|
CN111198947A (en) | 2020-05-26 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||