CN111198947B - Convolutional neural network fraud short message classification method and system based on naive Bayes optimization - Google Patents

Info

Publication number
CN111198947B
Authority
CN
China
Legal status
Active
Application number
CN202010008497.7A
Other languages
Chinese (zh)
Other versions
CN111198947A (en)
Inventor
石嘉
王秀丽
李盛超
Current Assignee
NANJING SINOVATIO TECHNOLOGY CO LTD
Original Assignee
NANJING SINOVATIO TECHNOLOGY CO LTD
Application filed by NANJING SINOVATIO TECHNOLOGY CO LTD
Priority to CN202010008497.7A
Publication of CN111198947A
Application granted
Publication of CN111198947B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a convolutional neural network fraud short message classification method and system based on naive Bayes optimization. A template library is established, and each short message to be judged is matched against the short message templates in the library; a short message that matches a template is classified as a fraud short message. A short message that fails template matching undergoes secondary discrimination by a textCNN model, and a short message that the textCNN model identifies is classified as a fraud short message. A short message that the textCNN model fails to identify is reclassified by calculating its Bayesian probability, and a short message that the Bayesian probability also fails to classify is determined to be a non-fraud short message. The invention thus realizes a scheme in which naive Bayes and textCNN classify fraud short messages in parallel, optimizes the whole model with templates generated from keyword statistics, and achieves self-learning through self-growth of the template library, so that the accuracy and recall rate of fraud short message classification are greatly improved.

Description

Convolutional neural network fraud short message classification method and system based on naive Bayes optimization
Technical Field
The invention relates to a text classification method and system in natural language processing, in particular to a convolutional neural network fraud short message classification method and system based on naive Bayes optimization.
Background
With the development of the internet, the volume of unstructured text data has increased sharply, and it has become harder for people to find the data they want in this mass of data. How to organize and manage such information effectively, and how to find the information users need quickly, accurately and comprehensively, is a major challenge today. Because the amount of information is so large, collecting and mining text data purely by hand consumes a great deal of manpower and time and is difficult to achieve. Automatic text classification is therefore particularly important; it is a basic function of text information processing and has become a core technology for processing and organizing text data.
For text classification problems, the common approach is to extract features from the text, for example by using word2vec or LDA models to convert the text into a feature vector of fixed dimension, and then to train a classifier on the extracted features. The TextCNN proposed in Yoon Kim's paper "Convolutional Neural Networks for Sentence Classification" opened the door to text classification with convolutional neural networks (CNN). Experimental research shows that CNN is good at extracting shallow features, works well on short texts, is widely applicable and fast, and is generally preferred. Meanwhile, naive Bayes is one of the top ten data mining algorithms and is widely used for classification because it is easy to construct and interpret and performs well.
However, in practical fraud short message applications, textCNN classifies normal text short messages very well, but many fraud short messages contain non-standard text. When word2vec converts such text into feature vectors, the features of the non-standard text cannot be represented accurately, so the classification effect of textCNN is not ideal. The classification of short messages containing non-standard text therefore needs to be optimized.
Disclosure of Invention
Purpose of the invention: the first purpose of the invention is to provide a convolutional neural network fraud short message classification method based on naive Bayes optimization, which can improve the accuracy and recall rate of fraud short message classification; the second purpose of the invention is to provide a convolutional neural network fraud short message classification system based on naive Bayes optimization.
The technical scheme is as follows: the invention discloses a convolutional neural network fraud short message classification method based on naive Bayes optimization, which comprises the following steps:
establishing a template library, matching the short message to be judged against the short message templates in the template library, and judging a short message that matches a template to be a fraud short message; the template library is constructed by clustering all fraud short messages identified through manual research and judgment and selecting a representative short message sample from each cluster as a short message template;
performing secondary discrimination with a textCNN model on each short message that fails template matching, and judging a short message that the textCNN model identifies to be a fraud short message;
reclassifying each short message that the textCNN model fails to identify by calculating its Bayesian probability, determining a short message that the Bayesian probability classifies successfully to be a fraud short message and finishing the classification, and determining a short message that the Bayesian probability fails to classify to be a non-fraud short message.
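The three steps above form a cascade in which each stage only sees the messages the previous stage could not decide. A minimal Python sketch of that control flow follows; the function name `classify_message` and the three callables are hypothetical placeholders for the template matcher, the textCNN model and the Bayesian classifier, none of which are specified at this level of detail in the text.

```python
from typing import Callable, Tuple

Stage = Callable[[str], bool]  # hypothetical: returns True when the stage flags fraud


def classify_message(
    message: str,
    template_match: Stage,
    cnn_is_fraud: Stage,
    bayes_is_fraud: Stage,
) -> Tuple[bool, str]:
    """Return (is_fraud, deciding_stage) for one short message."""
    # Stage 1: template matching against the template library.
    if template_match(message):
        return True, "template"
    # Stage 2: secondary discrimination by the textCNN model.
    if cnn_is_fraud(message):
        return True, "textcnn"
    # Stage 3: reclassification by Bayesian probability.
    if bayes_is_fraud(message):
        return True, "bayes"
    # No stage flagged the message: it is treated as non-fraud.
    return False, "none"
```

Returning the deciding stage alongside the verdict is a design choice of this sketch; it makes it possible to measure how many messages each stage handles.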
Preferably, a short message that the textCNN model identifies as fraud is made into a new short message template, and this augmented template is put into the template library. The template library thus grows through textCNN, which achieves adaptive matching.
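The self-growth of the template library can be illustrated with a small Python sketch. The class `TemplateLibrary` and its method names are hypothetical; the text does not say how templates are stored or whether duplicates are filtered, so the duplicate check here is an assumption.

```python
class TemplateLibrary:
    """Hypothetical in-memory template library that grows over time."""

    def __init__(self, seed_templates):
        # Seed templates come from clustering manually judged fraud messages.
        self._templates = list(seed_templates)

    def add_if_new(self, message):
        """Add a message flagged by textCNN; return True if it was added."""
        if message in self._templates:  # assumption: exact duplicates are skipped
            return False
        self._templates.append(message)
        return True

    def __len__(self):
        return len(self._templates)

    def __iter__(self):
        return iter(self._templates)
```

In a cascade, `add_if_new` would be called whenever the textCNN stage flags a message, so that similar messages are caught by template matching on later runs.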
Preferably, the short message to be judged is matched against the short message templates in the template library as follows: the cosine similarity between the short message and the templates in the template library is calculated, and when the similarity is greater than a set threshold the template matching succeeds and the short message is judged to be a fraud short message.
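A minimal Python sketch of threshold-based cosine matching is given below. It compares term-count vectors of already segmented messages, which is an assumption made to keep the example self-contained; the embodiment instead averages cosine similarities over word vectors. The function names are hypothetical, and the default threshold of 0.5 is the value the embodiment uses.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty message matches nothing
    return dot / (norm_a * norm_b)


def matches_template(tokens, templates, threshold=0.5):
    """True if the message is similar enough to any template."""
    message_vec = Counter(tokens)
    return any(
        cosine_similarity(message_vec, Counter(template)) > threshold
        for template in templates
    )
```

For example, a message sharing three of four tokens with a four-token template scores 0.75 and is matched, while a message with no shared tokens scores 0.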
Preferably, the secondary discrimination by the textCNN model proceeds as follows: the characters of the short message are converted into a feature matrix; each convolution kernel performs a one-dimensional convolution from top to bottom, where the size of the convolution kernel can be regarded as an n-gram; max pooling is then performed, recording the maximum value of each feature map; a univariate feature vector is generated from each feature map, and these features are concatenated to form the feature vector of the penultimate layer; the final softmax layer receives this feature vector as input and uses it to classify the short message. After the model is trained, when the probability of a certain scene exceeds a set threshold, the short message is judged to be a fraud short message.
Convolution is performed with kernels of sizes 2, 3 and 4, with two kernels of each size and six kernels in total, which yields six feature maps. During convolution each kernel performs a one-dimensional convolution from top to bottom. The size of the convolution kernel can be regarded as an n-gram, so the three sizes correspond to 2-gram, 3-gram and 4-gram respectively. Max pooling is then performed, that is, the maximum value of each feature map is recorded. A univariate feature vector is generated from each of the six feature maps, and the six features are concatenated to form the feature vector of the penultimate layer. The final softmax layer receives this feature vector as input and uses it to classify the short message. After the model is trained, when the probability of a certain scene exceeds a set threshold, the short message is judged to be a fraud short message.
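The forward pass described above can be sketched in plain Python without a deep learning library. This is an illustration under assumptions, not the patented implementation: bias terms and activation functions are omitted, the message must have at least as many rows as the tallest kernel, and all function names and weights are hypothetical.

```python
import math


def conv1d_rows(matrix, kernel):
    """Slide an (h x d) kernel down an (n x d) matrix, top to bottom.

    The kernel spans the full embedding width, so it only moves vertically.
    Returns a feature map with n - h + 1 values.
    """
    height = len(kernel)
    feature_map = []
    for top in range(len(matrix) - height + 1):
        total = 0.0
        for i in range(height):
            for x, w in zip(matrix[top + i], kernel[i]):
                total += x * w
        feature_map.append(total)
    return feature_map


def pooled_features(matrix, kernels):
    """Max-pool each feature map to one value and concatenate the results."""
    return [max(conv1d_rows(matrix, kernel)) for kernel in kernels]


def softmax(logits):
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(matrix, kernels, weights, biases):
    """Class probabilities from a dense layer over the pooled features."""
    features = pooled_features(matrix, kernels)
    logits = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)
```

With two kernels each of heights 2, 3 and 4, `pooled_features` returns a six-element vector, which is the penultimate-layer feature vector the text describes; a message would then be flagged when the probability of a fraud scene exceeds the chosen threshold.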
Preferably, reclassification by Bayesian probability proceeds as follows: all fraud short messages identified through manual research and judgment are counted for each scene, and the number of occurrences of each keyword in each scene is taken as the prior probability of that keyword; based on naive Bayes theory, the posterior probability of each scene is calculated from the prior probabilities of the keywords that the short message hits, and when the probability of a certain scene is greater than a set threshold the short message is judged to be a fraud short message.
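A minimal Python sketch of the keyword-count naive Bayes step follows. It counts keywords per scene, then computes the posterior of each scene from the keywords a message hits. The function names are hypothetical, and the Laplace smoothing parameter `alpha` is an assumption added so that unseen keyword and scene pairs do not produce zero probabilities; the text does not mention smoothing.

```python
import math
from collections import Counter, defaultdict


def train_keyword_counts(samples):
    """samples: iterable of (scene, keywords) from manually judged messages."""
    scene_counts = Counter()
    keyword_counts = defaultdict(Counter)
    for scene, keywords in samples:
        scene_counts[scene] += 1
        keyword_counts[scene].update(keywords)
    return scene_counts, keyword_counts


def scene_posteriors(keywords, scene_counts, keyword_counts, alpha=1.0):
    """Posterior probability of each scene given the hit keywords."""
    vocabulary = set()
    for counts in keyword_counts.values():
        vocabulary.update(counts)
    hits = [k for k in keywords if k in vocabulary]  # ignore unknown words
    total_messages = sum(scene_counts.values())
    log_scores = {}
    for scene, n in scene_counts.items():
        denom = sum(keyword_counts[scene].values()) + alpha * len(vocabulary)
        score = math.log(n / total_messages)  # scene prior
        for k in hits:
            # Smoothed conditional probability of the keyword in this scene.
            score += math.log((keyword_counts[scene][k] + alpha) / denom)
        log_scores[scene] = score
    # Normalize in log space for numerical stability.
    top = max(log_scores.values())
    norm = sum(math.exp(s - top) for s in log_scores.values())
    return {scene: math.exp(s - top) / norm for scene, s in log_scores.items()}
```

A message would be judged fraud when the largest posterior exceeds the set threshold; otherwise it falls through as non-fraud.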
The invention also provides a convolutional neural network fraud short message classification system based on naive Bayes optimization, which comprises a construction module, a template matching module, a textCNN model discrimination module and a Bayes probability classification module.
The construction module constructs the template library: all fraud short messages identified through manual research and judgment are clustered, and a representative short message sample is selected from each cluster as a short message template;
the template matching module matches the short message to be judged against the short message templates in the template library, judges a short message that matches a template to be a fraud short message, and sends a short message that fails template matching to the textCNN model discrimination module for secondary discrimination;
the textCNN model discrimination module performs secondary discrimination on each short message that fails template matching, judges a short message it identifies to be a fraud short message, and sends a short message it fails to identify to the Bayesian probability classification module for reclassification; a short message that the textCNN model identifies is made into a new short message template, and this augmented template is put into the template library;
the Bayesian probability classification module reclassifies each short message that the textCNN model fails to identify by calculating its Bayesian probability, determines a short message that the Bayesian probability classifies successfully to be a fraud short message and finishes the classification, and determines a short message that the Bayesian probability fails to classify to be a non-fraud short message.
Preferably, the template matching module matches the short message to be judged against the short message templates in the template library as follows: the cosine similarity between the short message and the templates in the template library is calculated, and when the similarity is greater than a set threshold the template matching succeeds and the short message is judged to be a fraud short message.
The discrimination process of the textCNN model discrimination module is as follows: the characters of the short message are converted into a feature matrix; each convolution kernel performs a one-dimensional convolution from top to bottom, where the size of the convolution kernel can be regarded as an n-gram; max pooling is then performed, recording the maximum value of each feature map; a univariate feature vector is generated from each feature map, and these features are concatenated to form the feature vector of the penultimate layer; the final softmax layer receives this feature vector as input and uses it to classify the short message. After the model is trained, when the probability of a certain scene exceeds a set threshold, the short message is judged to be a fraud short message.
Convolution is performed with kernels of sizes 2, 3 and 4, with two kernels of each size and six kernels in total, which yields six feature maps. During convolution each kernel performs a one-dimensional convolution from top to bottom. The size of the convolution kernel can be regarded as an n-gram, so the three sizes correspond to 2-gram, 3-gram and 4-gram respectively. Max pooling is then performed, that is, the maximum value of each feature map is recorded. A univariate feature vector is generated from each of the six feature maps, and the six features are concatenated to form the feature vector of the penultimate layer. The final softmax layer receives this feature vector as input and uses it to classify the short message.
The classification process of the Bayesian probability classification module is as follows: all fraud short messages identified through manual research and judgment are counted for each scene, and the number of occurrences of each keyword in each scene is taken as the prior probability of that keyword; based on naive Bayes theory, the posterior probability of each scene is calculated from the prior probabilities of the keywords that the short message hits, and when the probability of a certain scene is greater than a set threshold the short message is judged to be a fraud short message.
The invention designs and implements an algorithm that combines naive Bayes and a convolutional neural network in parallel for fraud short message classification, and builds a multi-class classifier according to the intent of the fraud short messages. Experiments show that a single multi-class classifier built with CNN has particularly high accuracy but relatively low recall, because feature extraction for non-standard characters is inaccurate. Therefore the characters common to each type of short message (both standard and non-standard) are counted and a naive Bayes model is built on them as a multi-class classifier, so that short messages not identified by the CNN model can still be classified. To further improve the accuracy and recall of the whole model, a template similarity matching model is added at the front. Finally, the results of the three models are combined to judge each short message, which greatly improves accuracy and recall.
In template similarity matching, about 20% of short messages can be matched with an accuracy above 90%, and the recall rate can grow dynamically as the template library grows. textCNN can classify about 60% of short messages with an accuracy above 92%. In the Bayesian probability calculation, the size of certain large categories must be limited, otherwise classification is strongly biased toward those categories; at present the Bayesian probability can classify about 20% of short messages with an accuracy above 90%.
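As a rough consistency check on these figures, and assuming (this is not stated in the text) that the three shares of 20%, 60% and 20% partition the classified short messages and that each stated accuracy is a lower bound for its stage, the combined accuracy $P$ would be bounded by a share-weighted sum:

$$
P \;\ge\; 0.20 \times 0.90 + 0.60 \times 0.92 + 0.20 \times 0.90 = 0.18 + 0.552 + 0.18 = 0.912
$$

Under those assumptions the cascade as a whole would be at least about 91% accurate; the figure is only an illustration of how the per-stage numbers combine, not a reported result.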
Beneficial effects: the invention designs and implements a scheme in which naive Bayes and textCNN classify fraud short messages in parallel, optimizes the whole model with templates generated from keyword statistics, and achieves self-learning through self-growth of the template library, so that the accuracy and recall rate of fraud short message classification are greatly improved.
Drawings
FIG. 1 is a flow chart of the fraud short message classification method of the invention;
FIG. 2 is a schematic structural diagram of the textCNN model.
Detailed Description
The present invention will be described in further detail with reference to examples.
Example 1:
FIG. 1 is a flow chart of the fraud short message classification method. The convolutional neural network fraud short message classification method based on naive Bayes optimization in this embodiment comprises the following steps:
(1) Label the data categories, compute keyword statistics for the short messages of each category, and use the high-frequency keywords of each category to generate the template library and the training data. The aim is to build the template library and the training data for textCNN.
(2) Segment the text with jieba so that the feature words in each sentence can be matched against the template library.
(3) Establish a stop word dictionary; the stop words are mainly adverbs, adjectives and conjunctions. Maintaining a stop word list is essentially part of feature selection, and its purpose is to remove redundant information from the short message.
(4) Match the short message to be judged against the short message templates in the template library, and judge a short message that matches a template to be a fraud short message: the cosine similarity between the short message and the templates in the template library is calculated, and when the similarity is greater than a set threshold the template matching succeeds and the short message is judged to be a fraud short message. Namely, using the cosine formula $$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$ (where $A$ and $B$ are the two word vectors being compared), the average cosine similarity between each word vector in the short message and the word vectors of the template library is calculated; when the average cosine similarity is greater than the set threshold of 0.5 the short message is judged to be a fraud short message, and otherwise the textCNN model is used to continue judging whether the short message is a fraud short message.
(5) Perform secondary discrimination with the textCNN model on each short message that fails template matching, and judge a short message that the textCNN model identifies to be a fraud short message.
The textCNN model is trained in a supervised manner on the labeled data, and the model and its parameters are saved so that the textCNN model can judge the short messages that fail template matching. FIG. 2 shows a schematic diagram of the textCNN model structure.
The characters of the short message are turned into a feature matrix by word embedding. Convolution is then performed with kernels of sizes 2, 3 and 4, with two kernels of each size and six kernels in total, which yields six feature maps. The kernels can only perform one-dimensional convolution from top to bottom and cannot convolve from left to right, because a word cannot be split for convolution and data obtained by left-to-right convolution would be meaningless; this differs from the use of CNN on images. The size of the convolution kernel can be regarded as an n-gram, so the three sizes correspond to 2-gram, 3-gram and 4-gram respectively. Max pooling is then performed, that is, the maximum value of each feature map is recorded. A univariate feature vector is thus generated from each of the six feature maps, and these six features are concatenated to form the feature vector of the penultimate layer. The final softmax layer receives this feature vector as input and uses it to classify the sentence. After the model is trained, when the probability of a certain scene is greater than the set threshold of 0.85, the short message is judged to be a fraud short message.
(6) Reclassify each short message that the textCNN model fails to identify by calculating its Bayesian probability, determine a short message that the Bayesian probability classifies successfully to be a fraud short message and finish the classification, and determine a short message that the Bayesian probability fails to classify to be a non-fraud short message. The naive Bayes probability is calculated from the keyword statistics, and the reclassification proceeds as follows: all fraud short messages identified through manual research and judgment are counted for each scene, the number of occurrences of each keyword in each scene is taken as the prior probability of that keyword, the probability of the unidentified short message is calculated from the prior probabilities of the keywords it hits based on naive Bayes theory, and when the probability is greater than a set threshold the short message is judged to be a fraud short message.
Namely, using the Bayesian formula $$P(c \mid w) = \frac{P(w \mid c)\,P(c)}{P(w)}$$ (where $c$ is a fraud category and $w$ denotes the keywords hit), the keywords of each fraud category are counted, the conditional probability of each keyword in each category is calculated, and the posterior probability is then calculated from the keywords hit in the short message being judged; when the posterior probability of a certain scene is greater than the set threshold of 0.9, the short message is judged to be a fraud short message of that category.
(7) Store the test data in the database and obtain the results.
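Steps (2) and (3) above, segmentation followed by stop word removal, can be sketched in Python as follows. `jieba.lcut` is the segmentation call of the jieba library named in step (2); the whitespace fallback, the function names and the tiny English stop word set are hypothetical stand-ins so that the sketch runs without jieba installed.

```python
def segment(text):
    """Segment text with jieba when available, else split on whitespace."""
    try:
        import jieba  # third-party Chinese word segmentation library
    except ImportError:
        return text.split()  # assumption: crude fallback for illustration only
    return jieba.lcut(text)


# Hypothetical stop word dictionary; a real one would list adverbs,
# adjectives and conjunctions of the target language.
STOP_WORDS = {"a", "the", "and", "very"}


def preprocess(text, stop_words=STOP_WORDS, tokenizer=segment):
    """Return the tokens of a short message with stop words removed."""
    return [
        token
        for token in tokenizer(text)
        if token.strip() and token not in stop_words
    ]
```

The resulting token list is what the later stages would consume, for example as input to template matching or keyword counting.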
In the template similarity matching of this embodiment, about 20% of short messages can be matched with an accuracy above 90%, and the recall rate can grow dynamically as the template library grows. textCNN can classify about 60% of short messages with an accuracy above 92%. In the Bayesian probability calculation, the size of certain large categories must be limited, otherwise classification is strongly biased toward those categories; at present the Bayesian probability can classify about 20% of short messages with an accuracy above 90%.
Example 2:
The convolutional neural network fraud short message classification system based on naive Bayes optimization comprises a construction module, a template matching module, a textCNN model discrimination module and a Bayesian probability classification module.
The construction module constructs the template library: all fraud short messages identified through manual research and judgment are clustered, and a representative short message sample is selected from each cluster as a short message template.
The template matching module matches the short message to be judged against the short message templates in the template library, judges a short message that matches a template to be a fraud short message, and sends a short message that fails template matching to the textCNN model discrimination module for secondary discrimination. The matching proceeds as follows: the cosine similarity between the short message and the templates in the template library is calculated, and when the similarity is greater than a set threshold the template matching succeeds and the short message is judged to be a fraud short message.
The textCNN model discrimination module performs secondary discrimination on each short message that fails template matching, judges a short message it identifies to be a fraud short message, and sends a short message it fails to identify to the Bayesian probability classification module for reclassification; a short message that the textCNN model identifies is made into a new short message template, and this augmented template is added to the template library. The discrimination proceeds as follows: the characters of the short message are converted into a feature matrix, and convolution is performed with kernels of sizes 2, 3 and 4, with two kernels of each size and six kernels in total, which yields six feature maps; the kernels can only perform one-dimensional convolution from top to bottom. The size of the convolution kernel can be regarded as an n-gram, so the three sizes correspond to 2-gram, 3-gram and 4-gram respectively. Max pooling is then performed, that is, the maximum value of each feature map is recorded. A univariate feature vector is generated from each of the six feature maps, and the six features are concatenated to form the feature vector of the penultimate layer. The final softmax layer receives this feature vector as input and uses it to classify the short message.
The Bayesian probability classification module reclassifies each short message that the textCNN model fails to identify by calculating its Bayesian probability, determines a short message that the Bayesian probability classifies successfully to be a fraud short message and finishes the classification, and determines a short message that the Bayesian probability fails to classify to be a non-fraud short message. The classification proceeds as follows: all fraud short messages identified through manual research and judgment are counted for each scene, the number of occurrences of each keyword in each scene is taken as the prior probability of that keyword, the probability of the unidentified short message is calculated from the prior probabilities of the keywords it hits based on naive Bayes theory, and when the probability is greater than a set threshold the short message is judged to be a fraud short message.

Claims (2)

1. A convolutional neural network fraud short message classification method based on naive Bayesian optimization is characterized by comprising the following steps:
constructing a template library, matching the short message to be judged against the short message templates in the template library, and judging a short message that matches a template to be a fraud short message; the template library is constructed as follows: all fraud short messages identified through manual research and judgment are clustered, and a representative short message sample is selected from each cluster as a short message template; the matching proceeds as follows: the cosine similarity between the short message to be judged and the short message templates in the template library is calculated, and when the similarity is greater than a set threshold the template matching succeeds and the short message is judged to be a fraud short message;
performing secondary discrimination with a textCNN model on each short message that fails template matching, and judging a short message that the textCNN model identifies to be a fraud short message; a short message that the textCNN model identifies is made into a new short message template, and this augmented template is added to the template library;
the secondary discrimination by the textCNN model proceeds as follows: the characters of the short message are converted into a feature matrix; each convolution kernel performs a one-dimensional convolution from top to bottom, the size of the convolution kernel being regarded as an n-gram; max pooling is then performed, recording the maximum value of each feature map; a univariate feature vector is generated from each feature map, and these features are concatenated to form the feature vector of the penultimate layer; the final softmax layer receives this feature vector as input and uses it to classify the short message;
reclassifying each short message that the textCNN model fails to identify by calculating its Bayesian probability, determining a short message that the Bayesian probability classifies successfully to be a fraud short message and finishing the classification, and determining a short message that the Bayesian probability fails to classify to be a non-fraud short message; the reclassification by Bayesian probability proceeds as follows: all fraud short messages identified through manual research and judgment are counted for each scene, the number of occurrences of each keyword in each scene is taken as the prior probability of that keyword, the posterior probability of the short message is calculated from the prior probabilities of the keywords it hits based on naive Bayes theory, and when the posterior probability is greater than a set threshold the short message is judged to be a fraud short message; wherein the set threshold is 0.85.
2. A system for implementing the convolutional neural network fraud short message classification method based on naive Bayes optimization, comprising:
a construction module, configured to construct a template library, wherein all fraud short messages obtained from manual research and judgment are clustered, and a representative short message sample is selected from each cluster to form a short message template, thereby constructing the template library;
a template matching module, configured to match and classify a short message to be discriminated against the short message templates in the template library, judge a short message whose template matching succeeds to be a fraud short message, and send a short message whose template matching fails to the textCNN model discrimination module for secondary discrimination; the matching and classification comprises: calculating the cosine similarity between the short message to be discriminated and the short message templates in the template library, and, when the similarity is greater than a set threshold, determining that template matching succeeds and judging the short message to be a fraud short message;
a textCNN model discrimination module, configured to perform secondary discrimination on a short message whose template matching fails, judge a successfully discriminated short message to be a fraud short message, and send a short message that fails discrimination to the Bayesian probability classification module for reclassification; a short message successfully discriminated by the textCNN model forms a new short message template, which is added to the template library as an augmented template; the discrimination process of the textCNN model discrimination module comprises: converting the characters of the short message into a feature matrix; performing one-dimensional convolution with convolution kernels from top to bottom, the size of each convolution kernel being regarded as corresponding to an n-gram; then applying max pooling and recording the maximum value from each feature map; generating a univariate feature vector from all the feature maps, the features being concatenated to form the feature vector of the penultimate layer; and receiving this feature vector as input at the final softmax layer, which uses it to classify the short message;
a Bayesian probability classification module, configured to reclassify, by calculating a Bayesian probability, a short message that the textCNN model fails to discriminate, determine a short message successfully classified by the Bayesian probability to be a fraud short message, completing the classification, and determine a short message that fails the Bayesian probability classification to be a non-fraud short message; the classification process of the Bayesian probability classification module comprises: counting all fraud short messages obtained from manual research and judgment and the number of occurrences of each keyword in each scene, and taking that number as the prior probability of each keyword; calculating, based on naive Bayes theory and from the prior probabilities of the hit keywords, the posterior probability for the short message that the textCNN model failed to discriminate; and judging the short message to be a fraud short message when the posterior probability is greater than a set threshold.
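As a rough sketch of how the modules of claim 2 might be chained, the following Python cascades cosine-similarity template matching, a textCNN stage and a naive Bayes keyword stage. Everything beyond the claim wording is an assumption: the bag-of-tokens cosine similarity, the 0.8 template threshold (the claims give no value), the non-fraud keyword counts and Laplace smoothing used to normalise the posterior (the claims give no formula), the rule that a message with no hit keyword is non-fraud, and all names and sample data. The textCNN is represented by a caller-supplied function. The 0.85 threshold is taken from claim 1.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-tokens count vectors."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def fraud_posterior(hits, stats):
    """Naive Bayes P(fraud | hit keywords), with Laplace smoothing (assumed)."""
    n_f, n_n = stats["n_fraud"], stats["n_normal"]
    log_f = math.log(n_f / (n_f + n_n))  # class prior for fraud
    log_n = math.log(n_n / (n_f + n_n))  # class prior for non-fraud
    for kw in hits:
        # Counts are read as "number of messages containing the keyword".
        log_f += math.log((stats["fraud_counts"].get(kw, 0) + 1) / (n_f + 2))
        log_n += math.log((stats["normal_counts"].get(kw, 0) + 1) / (n_n + 2))
    m = max(log_f, log_n)
    p_f, p_n = math.exp(log_f - m), math.exp(log_n - m)
    return p_f / (p_f + p_n)


def classify(tokens, templates, cnn_is_fraud, stats,
             template_threshold=0.8, bayes_threshold=0.85):
    """Three-stage cascade: template match, then textCNN, then naive Bayes."""
    # Stage 1: template matching by cosine similarity.
    if any(cosine_similarity(tokens, t) > template_threshold for t in templates):
        return "fraud"
    # Stage 2: textCNN; a positive message becomes an augmented template.
    if cnn_is_fraud(tokens):
        templates.append(list(tokens))
        return "fraud"
    # Stage 3: naive Bayes over the keywords that the message hits.
    hits = sorted(t for t in set(tokens) if t in stats["fraud_counts"])
    if hits and fraud_posterior(hits, stats) > bayes_threshold:
        return "fraud"
    return "non-fraud"


# Hypothetical data, for illustration only.
templates = [["you", "won", "a", "prize"]]
stats = {
    "n_fraud": 100, "n_normal": 100,
    "fraud_counts": {"transfer": 90, "account": 60},
    "normal_counts": {"transfer": 5, "account": 20},
}


def never_fraud(tokens):
    """Stand-in for a trained textCNN that never fires."""
    return False


r_template = classify(["you", "won", "a", "prize"], templates, never_fraud, stats)
r_bayes = classify(["please", "transfer", "now"], templates, never_fraud, stats)
r_clean = classify(["see", "you", "tomorrow"], templates, never_fraud, stats)
print(r_template, r_bayes, r_clean)
```

Ordering the stages from cheapest to most permissive mirrors the claims: only messages that pass through an earlier stage undetected reach the next one, and messages caught by the textCNN enlarge the template library so that similar messages are later caught at the first stage.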
CN202010008497.7A 2020-01-06 2020-01-06 Convolutional neural network fraud short message classification method and system based on naive Bayes optimization Active CN111198947B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010008497.7A CN111198947B (en) 2020-01-06 2020-01-06 Convolutional neural network fraud short message classification method and system based on naive Bayes optimization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010008497.7A CN111198947B (en) 2020-01-06 2020-01-06 Convolutional neural network fraud short message classification method and system based on naive Bayes optimization

Publications (2)

Publication Number Publication Date
CN111198947A CN111198947A (en) 2020-05-26
CN111198947B true CN111198947B (en) 2024-02-13

Family

ID=70744553

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010008497.7A Active CN111198947B (en) 2020-01-06 2020-01-06 Convolutional neural network fraud short message classification method and system based on naive Bayes optimization

Country Status (1)

Country Link
CN (1) CN111198947B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112231537A (en) * 2020-11-09 2021-01-15 张印祺 Intelligent reading system based on deep learning and web crawler
CN115022464A (en) * 2022-05-06 2022-09-06 中国联合网络通信集团有限公司 Number processing method, system, computing device and storage medium
CN114928498A (en) * 2022-06-15 2022-08-19 中国联合网络通信集团有限公司 Fraud information identification method and device and computer readable storage medium
CN114786184B (en) * 2022-06-21 2022-09-16 中国信息通信研究院 Method and device for generating fraud-related short message interception template
CN116150379B (en) * 2023-04-04 2023-06-30 中国信息通信研究院 Short message text classification method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101257671A (en) * 2007-07-06 2008-09-03 浙江大学 Real-time filtering method of large-scale spam text messages based on content
CN103634473A (en) * 2013-12-05 2014-03-12 南京理工大学连云港研究院 Naive Bayesian classification based mobile phone spam short message filtering method and system
CN107835496A (en) * 2017-11-24 2018-03-23 北京奇虎科技有限公司 Spam short message recognition method, device and server
CN108268461A (en) * 2016-12-30 2018-07-10 广东精点数据科技股份有限公司 Document classification device based on a hybrid classifier
CN110059189A (en) * 2019-04-11 2019-07-26 厦门点触科技股份有限公司 Classification system and method for gaming platform messages

Also Published As

Publication number Publication date
CN111198947A (en) 2020-05-26

Similar Documents

Publication Publication Date Title
CN111198947B (en) Convolutional neural network fraud short message classification method and system based on naive Bayes optimization
CN110162593B (en) Search result processing and similarity model training method and device
CN108595632B (en) A Hybrid Neural Network Text Classification Method Fusing Abstract and Main Features
CN107193801B (en) Short text feature optimization and emotion analysis method based on deep belief network
US7689531B1 (en) Automatic charset detection using support vector machines with charset grouping
CN105389379A (en) Rubbish article classification method based on distributed feature representation of text
CN110362819B (en) Text emotion analysis method based on convolutional neural network
US20110213736A1 (en) Method and arrangement for automatic charset detection
CN105183715B (en) Automatic spam comment classification method based on word distribution and document features
CN109446333A (en) Method for realizing Chinese text classification and related device
CN110516098A (en) An Image Annotation Method Based on Convolutional Neural Network and Binary Coded Features
Bouguila A model-based approach for discrete data clustering and feature weighting using MAP and stochastic complexity
CN115098690B (en) Multi-data document classification method and system based on cluster analysis
CN109063185A (en) Social networks short text data filter method towards event detection
CN107357895B (en) Text representation processing method based on bag-of-words model
CN114579746A (en) An optimized high-precision text classification method and device
CN112434686B (en) End-to-end misplaced text classification identifier for OCR (optical character) pictures
Huang A CNN model for SMS spam detection
CN114372145B (en) Scheduling method for dynamic allocation of operation and maintenance resources based on knowledge graph platform
CN114817548A (en) Text classification method, device, equipment and storage medium
CN114117050B (en) Full-automatic accounting flow popup window processing method, device and system
CN117271701A (en) Method and system for extracting system operation abnormal event relation based on TGGAT and CNN
CN115240179A (en) Bill text classification method and system
CN111159410A (en) Text emotion classification method, system and device and storage medium
Liang et al. Research on Text Categorization Based on Natural Language Processing and Machine Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant