Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for identifying spam, in which key information with a set tag in a target email is obtained, and the key information is added to text data, so that the proportion of the key information in the text data is increased, and then a trained email identification model is combined to identify spam, so that the accuracy of identifying spam is increased, the maintenance cost is reduced, and the method and the apparatus are not affected by human factors.
To achieve the above object, according to an aspect of an embodiment of the present invention, a spam email recognition method is provided.
The junk mail identification method of the embodiment of the invention comprises the following steps: analyzing a target mail, acquiring mail information of the target mail and key information with a set label, and splicing the mail information and the key information to obtain text data; performing word segmentation on the text data to obtain word segmentation results, and performing vectorization representation on the word segmentation results by using a pre-trained word vector model to obtain corresponding feature matrices; inputting the characteristic matrix into a pre-trained mail recognition model for recognition, and outputting a recognition result of the target mail; the mail identification model is used for identifying whether the target mail is a junk mail.
Optionally, the mail information includes a mail title, and the tag is any one or a combination of more than one of the following: title label, bold label and hyperlink label; splicing the mail information and the key information to obtain text data, wherein the text data comprises: respectively copying the mail title and the key information according to a set first repetition frequency and a set second repetition frequency; and splicing the mail information, the copied mail title and the copied key information to obtain text data, so that the mail title and/or the key information are repeatedly added in the text data.
Optionally, performing word segmentation on the text data to obtain a word segmentation result, including: acquiring a candidate word set from the sentence of the text data, and searching the occurrence probability of candidate words in the candidate word set by using a pre-established dictionary; combining the candidate words according to the sentences to obtain a plurality of segmentation combinations, and calculating the total probability corresponding to the segmentation combinations according to the occurrence probability of the candidate words in the segmentation combinations; judging whether the maximum total probability is greater than a set probability threshold, and if the maximum total probability is greater than the probability threshold, taking the segmentation combination corresponding to the maximum total probability as a word segmentation result; and if the maximum total probability is less than or equal to the probability threshold, performing word segmentation on the sentence again by using a pre-trained word segmentation model to obtain a word segmentation result.
Optionally, the method further comprises: respectively labeling the states of the characters in the training corpus by using a word segmentation lexicon to obtain the labeling characteristics of the characters; according to the labeled training corpus, counting the probability of the same character in different states as the labeling probability of the character; inputting the training corpus into a hidden Markov model to obtain the probability of the same character in different states as the training probability of the character; and adjusting parameters of the hidden Markov model according to the errors of the marking probability and the corresponding training probability of the characters until the errors reach the minimum, and obtaining the word segmentation model.
Optionally, the method further comprises: acquiring a newly added junk mail sample, respectively labeling the character states in the newly added junk mail sample, and adding the newly added junk mail sample after labeling into a first training sample set; and when the number of the training samples in the first training sample set is determined to be larger than or equal to a set first number threshold, retraining the word segmentation model based on the training samples to obtain a new word segmentation model.
Optionally, the word vector model is obtained based on word2vec model training, and the method further includes: acquiring a newly added junk mail sample, adding a word segmentation label to the newly added junk mail sample, and adding the newly added junk mail sample with the label added to a second training sample set; and when the number of the training samples in the second training sample set is determined to be larger than or equal to a set second number threshold, retraining the word vector model based on the training samples to obtain a new word vector model.
Optionally, the method further comprises: acquiring a mail sample set, and respectively marking the mail samples of the mail sample set with category labels; extracting the characteristics of the mail samples to obtain corresponding characteristic matrixes, and dividing the characteristic matrixes into a training set and a testing set; inputting the training set into a machine learning model for training to obtain an initial mail recognition model, and inputting the test set into the initial mail recognition model to obtain a prediction result; comparing the prediction result with the corresponding category label of the test set to obtain model evaluation data; and adjusting the initial mail identification model according to the model evaluation data to obtain a final mail identification model.
To achieve the above object, according to another aspect of the embodiments of the present invention, a spam recognition apparatus is provided.
The junk mail recognition device of the embodiment of the invention comprises: the mail analysis module is used for analyzing a target mail, acquiring mail information of the target mail and key information with a set label, and splicing the mail information and the key information to obtain text data; the feature extraction module is used for segmenting words of the text data to obtain word segmentation results, and vectorizing the word segmentation results by using a pre-trained word vector model to obtain a corresponding feature matrix; the mail identification module is used for inputting the characteristic matrix into a pre-trained mail identification model for identification and outputting an identification result of the target mail; the mail identification model is used for identifying whether the target mail is a junk mail.
Optionally, the mail information includes a mail title, and the tag is any one or a combination of more than one of the following: title label, bold label and hyperlink label; the mail analysis module is further configured to copy the mail title and the key information according to a set first repetition number and a set second repetition number, respectively; and splicing the mail information, the copied mail title and the copied key information to obtain text data, so that the mail title and/or the key information are repeatedly added in the text data.
Optionally, the feature extraction module is further configured to obtain a candidate word set from a sentence of the text data, and search for an occurrence probability of a candidate word in the candidate word set by using a pre-established dictionary; combining the candidate words according to the sentences to obtain a plurality of segmentation combinations, and calculating the total probability corresponding to the segmentation combinations according to the occurrence probability of the candidate words in the segmentation combinations; judging whether the maximum total probability is greater than a set probability threshold, and if the maximum total probability is greater than the probability threshold, taking the segmentation combination corresponding to the maximum total probability as a word segmentation result; and if the maximum total probability is less than or equal to the probability threshold, performing word segmentation on the sentence again by using a pre-trained word segmentation model to obtain a word segmentation result.
Optionally, the apparatus further comprises: the word segmentation model training module is used for labeling the states of the characters in the training corpus respectively by utilizing a word segmentation word bank to obtain the labeling characteristics of the characters; according to the labeled training corpus, counting the probability of the same character in different states as the labeling probability of the character; inputting the training corpus into a hidden Markov model to obtain the probability of the same character in different states as the training probability of the character; and adjusting parameters of the hidden Markov model according to the errors of the marking probability and the corresponding training probability of the characters until the errors reach the minimum, and obtaining the word segmentation model.
Optionally, the apparatus further comprises: the word segmentation model optimization module is used for acquiring a newly added junk mail sample, labeling the character states in the newly added junk mail sample respectively, and adding the newly added junk mail sample after labeling into a first training sample set; and when the number of the training samples in the first training sample set is determined to be larger than or equal to a set first number threshold, retraining the word segmentation model based on the training samples to obtain a new word segmentation model.
Optionally, the word vector model is obtained based on word2vec model training, and the apparatus further includes: the word vector model optimization module is used for acquiring a newly added junk mail sample, adding a word segmentation label to the newly added junk mail sample, and adding the newly added junk mail sample with the label added to a second training sample set; and when the number of the training samples in the second training sample set is determined to be larger than or equal to a set second number threshold, retraining the word vector model based on the training samples to obtain a new word vector model.
Optionally, the apparatus further comprises: the mail recognition model training module is used for acquiring a mail sample set and respectively marking the mail samples of the mail sample set with class labels; extracting the characteristics of the mail samples to obtain corresponding characteristic matrixes, and dividing the characteristic matrixes into a training set and a testing set; inputting the training set into a machine learning model for training to obtain an initial mail recognition model, and inputting the test set into the initial mail recognition model to obtain a prediction result; comparing the prediction result with the corresponding category label of the test set to obtain model evaluation data; and adjusting the initial mail identification model according to the model evaluation data to obtain a final mail identification model.
To achieve the above object, according to still another aspect of embodiments of the present invention, there is provided an electronic apparatus.
An electronic device of an embodiment of the present invention includes: one or more processors; a storage device, configured to store one or more programs that, when executed by the one or more processors, cause the one or more processors to implement a spam recognition method according to an embodiment of the present invention.
To achieve the above object, according to still another aspect of embodiments of the present invention, there is provided a computer-readable medium.
A computer-readable medium of an embodiment of the present invention stores thereon a computer program that, when executed by a processor, implements a spam recognition method of an embodiment of the present invention.
One embodiment of the above invention has the following advantages or benefits: by acquiring the key information with the set label in the target mail and adding the key information into the text data, the proportion of the key information in the text data is improved, and the spam mail is identified by combining a trained mail identification model, so that the accuracy of spam mail identification is improved, the maintenance cost is reduced, and the influence of human factors is avoided;
according to the set repetition times, the mail title and the key information in the target mail are respectively and repeatedly added into the text data, so that the proportion of the mail title and the key information in the text data is reasonably improved, and the prediction effect of the mail recognition model is ensured; the two word segmentation modes of word segmentation based on the dictionary and word segmentation based on the word segmentation model are combined, so that the word segmentation effect on homomorphic characters and homophones in text data is improved on the premise of ensuring the word segmentation efficiency;
the hidden Markov model is used for word segmentation, so that new words can be automatically identified according to the context of the words, and the accuracy of word segmentation is improved; by automatically acquiring newly added spam samples and using them to continuously train and update the word segmentation model and the word vector model, the manual maintenance cost is reduced, the training sample set is enriched, and the accuracy of spam recognition is further improved; the machine learning method is used for training and optimizing the mail recognition model, so that the process of manually analyzing and formulating the filtering rule is avoided, and the recognition accuracy of the junk mails is improved.
Further effects of the above optional implementations are described below in connection with the specific embodiments.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram of main steps of a spam identification method according to an embodiment of the present invention. As shown in fig. 1, the spam email identification method according to the embodiment of the present invention mainly includes the following steps:
Step S101: analyzing a target mail, acquiring mail information of the target mail and key information with a set label, and splicing the mail information and the key information to obtain text data. Analyzing the target mail mainly comprises obtaining the mail information and the key information with set labels from the mail header and the mail body, decoding the information according to mail decoding rules, and splicing the decoded information into a complete text to obtain the text data.
The mail information may include a mail title and a body content. The key information is some information with specified labels in the target mail, and the labels are set by a sender, such as a title label, a bold label, a hyperlink label and the like. In an embodiment, the content in the corresponding tag may be obtained using a regular expression.
Step S102: performing word segmentation on the text data to obtain word segmentation results, and performing vectorization representation on the word segmentation results by using a pre-trained word vector model to obtain a corresponding feature matrix. In this step, the sentences in the text data are segmented. Specifically, dictionary-based word segmentation may be used first; if the resulting segmentation is poor, a pre-trained word segmentation model may be used to segment the sentence again, thereby improving the word segmentation effect.
After word segmentation is finished, the words in the word segmentation results are converted into corresponding word vectors according to the mapping relationship between words and word vectors; that is, each word is replaced by its word vector through word embedding, so that the corresponding feature matrix is obtained. The mapping relationship between words and word vectors can be obtained by training a word vector model. The word vector model may be a word2vec (word to vector) model, a Latent Semantic Analysis (LSA) model, a Probabilistic Latent Semantic Analysis (PLSA) model, and the like; compared with the other models, word2vec better reflects the contextual relationships between words.
Step S103: and inputting the characteristic matrix into a pre-trained mail recognition model for recognition, and outputting a recognition result of the target mail. The mail identification model is used for identifying whether the target mail is a junk mail. After the characteristic extraction, the obtained characteristic matrix is input into a trained mail recognition model, and after the mail recognition model is used for processing, the recognition result of whether the target mail is a junk mail can be obtained.
According to the embodiment, the key information with the set label in the target mail is obtained, the key information is added into the text data, the proportion of the key information in the text data is improved, the spam mail is identified by combining the trained mail identification model, the accuracy of spam mail identification is improved, the maintenance cost is reduced, and the influence of human factors is avoided.
Fig. 2 is a sequence diagram of a spam recognition method according to an embodiment of the present invention. As shown in fig. 2, the spam recognition method of this embodiment is implemented by a mail identification module, a mail parsing module, a text word segmentation module, and a text word embedding module, where the text word segmentation module and the text word embedding module form a feature extraction module. The method mainly includes the following steps:
step S201: after the mail identification module acquires the target mail, a first calling request is sent to the mail analysis module. The first call request is used for transmitting parameters required for analyzing the target mail to the mail analysis module, wherein the parameters are mail content of the target mail.
Step S202: the mail analysis module analyzes the target mail according to the first calling request to obtain text data, and returns the text data to the mail identification module. The parsing process here includes: obtaining the mail title and the body content from the mail header and the mail body of the target mail respectively; after determining that the mail body is in HTML format, obtaining key information with set labels from the mail body; decoding the information; and splicing it into a complete text in a way that increases the proportion of the mail title and the key information in the text.
HTML stands for HyperText Markup Language. Its tags give documents on the network a unified format, so that scattered Internet resources are connected into a logical whole. HTML content is generally marked up as <h1>marked content</h1>, so regular expressions can be used to extract the content inside the corresponding tags. The specific implementation of the parsing process is described with reference to fig. 3.
Step S203: and after receiving the text data, the mail identification module sends a second calling request to the text word segmentation module. The second call request is used for transmitting parameters required by word segmentation processing to the text word segmentation module, wherein the parameters are text data.
Step S204: the text word segmentation module performs word segmentation on the text data according to the second calling request to obtain word segmentation results, and returns the word segmentation results to the mail identification module. The text word segmentation module first performs language identification on the text data. If the language identification result is English, spaces are used as separators for word segmentation. If the language identification result is Chinese, dictionary-based word segmentation is used first, and if the resulting segmentation is poor, the sentence is segmented again using a pre-trained word segmentation model.
Dictionary-based word segmentation works well on the text data of normal mails, but its effect drops sharply on junk mails that use homomorphic characters and homophones to evade detection. The word segmentation model can be retrained periodically on newly added junk mails, so model-based word segmentation works well on junk mails that use homomorphic characters and homophones. Therefore, this embodiment combines dictionary-based word segmentation with word segmentation based on the word segmentation model.
Dictionary-based word segmentation splits a sentence by dictionary matching. Each dictionary entry has the format "word: word frequency", where the word frequency is the probability that the word occurs in text. The matching method may be forward maximum matching, reverse maximum matching, bidirectional maximum matching, and the like, all of which are prior art and are not described here. Word segmentation based on the word segmentation model means using the word segmentation model to predict the word segmentation result. In an embodiment, the word segmentation model is a hidden Markov model. The specific implementation of this step is described later with reference to fig. 4.
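The language branch in step S204 can be sketched as follows. This is a minimal illustration, not the embodiment's language identifier: the function names are hypothetical, the check for CJK Unified Ideographs is an assumed stand-in for real language identification, and `chinese_segmenter` is a placeholder for the dictionary-plus-model segmentation described in this step.

```python
def contains_cjk(text):
    """Heuristic (assumption): text with any CJK Unified Ideograph is treated as Chinese."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def segment_by_language(text, chinese_segmenter):
    """Split English text on whitespace; hand Chinese text to the given segmenter."""
    if contains_cjk(text):
        # Chinese: delegate to the caller-supplied segmentation routine.
        return chinese_segmenter(text)
    # English: whitespace is the word separator.
    return text.split()
```

A production system would more likely use a dedicated language identification component; the character-range test only illustrates where the branch sits.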
Step S205: and after receiving the word segmentation result, the mail identification module sends a third calling request to the text word embedding module. And the third calling request is used for transmitting parameters required for generating the feature matrix to the text word embedding module. The parameters here are the word segmentation results.
Step S206: and the text word embedding module performs vectorization representation on the word segmentation result by using the pre-trained word vector model according to the third calling request to obtain a corresponding feature matrix, and returns the feature matrix to the mail recognition module. In this embodiment, the word vector model is a word2vec model, and a mapping relationship between words and word vectors can be obtained by training the word2vec model. In the step, words in the word segmentation result can be converted into corresponding word vectors according to the mapping relation, and the words are replaced by the word vectors through word embedding, so that the feature matrix representing the text data is obtained.
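As a rough sketch of the word embedding in step S206, the mapping between words and word vectors can be treated as a lookup table. The function name, the plain `dict` mapping, and the choice to map unknown words to a zero vector are all assumptions made for illustration; the embodiment does not specify how out-of-vocabulary words are handled.

```python
def build_feature_matrix(words, word_vectors, dimension):
    """Replace each word with its vector; one row of the matrix per word.

    word_vectors: dict mapping word -> list of floats of length `dimension`
                  (hypothetical stand-in for a trained word2vec mapping).
    Unknown words are mapped to a zero vector (an assumption of this sketch).
    """
    zero_vector = [0.0] * dimension
    return [list(word_vectors.get(word, zero_vector)) for word in words]
```

For a segmentation result of n words, the result is an n x `dimension` matrix that can be passed on to the mail recognition model.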
Wherein, the word2vec model can be understood as a series of parameters + calculation method + parameter adjustment algorithm. Training the model refers to adjusting parameters by using a parameter adjusting algorithm so that the result obtained by calculating the input data by a calculation method is consistent with the real result. The word vector refers to the parameter after adjustment.
Assume that a corpus contains three words: two synonyms that both mean "computer" (written here as "computer (a)" and "computer (b)") and "calculator", and that the contexts in the corpus indicate that the two "computer" synonyms are more similar to each other than either is to "calculator". When the corpus is input into the word2vec model, the three words are first assigned random vectors (randomly initialized parameters). Using single numbers in place of word vectors for illustration:
"computer (a)": 1,
"computer (b)": 10,
"calculator": 2
If the distance between the two "computer" synonyms, as computed by the calculation method (for example, cosine similarity), does not meet the requirement, the parameters are adjusted by the parameter adjustment algorithm, for example:
"computer (a)": 1,
"computer (b)": 2,
"calculator": 10
At this time, if the calculation result meets the requirement, the adjusted parameter is the word vector to be used subsequently.
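The similarity check in this example can be illustrated with cosine similarity. Because cosine similarity carries no information for single numbers, the sketch below assumes 2-dimensional vectors; the vector values and the labels `computer_a` and `computer_b` (standing for the two "computer" synonyms) are invented for illustration and are not trained parameters.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical "adjusted" vectors: the two synonyms point in nearly the same direction.
adjusted_vectors = {
    "computer_a": [1.0, 0.1],
    "computer_b": [0.9, 0.2],
    "calculator": [0.1, 1.0],
}

synonym_similarity = cosine_similarity(
    adjusted_vectors["computer_a"], adjusted_vectors["computer_b"]
)
other_similarity = cosine_similarity(
    adjusted_vectors["computer_a"], adjusted_vectors["calculator"]
)
```

With these values, `synonym_similarity` is larger than `other_similarity`, which is the condition the parameter adjustment is meant to reach.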
Step S207: and the mail identification module calls a pre-trained mail identification model, inputs the characteristic matrix into the mail identification model for identification, and obtains the identification result of the target mail. The mail recognition model needs to be trained in advance, and the specific training process is described later with reference to fig. 7. And after the mail identification module obtains the characteristic matrix corresponding to the target mail, calling a mail identification model to identify the target mail so as to determine whether the target mail belongs to junk mail or normal mail.
The word segmentation method combines two word segmentation modes, namely word segmentation based on the dictionary and word segmentation based on the word segmentation model, and improves the word segmentation effect on homomorphic characters and homophones in text data on the premise of ensuring the word segmentation efficiency.
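The call sequence of steps S201 to S207 can be summarized in a short sketch. The function and parameter names are hypothetical; each callable stands in for one of the modules in fig. 2, and the sketch only shows the order in which data passes between them.

```python
def identify_mail(raw_mail, parse_mail, segment_text, embed_words, recognition_model):
    """Run the module call sequence of fig. 2 with caller-supplied modules."""
    text_data = parse_mail(raw_mail)           # S201-S202: mail parsing module
    words = segment_text(text_data)            # S203-S204: text word segmentation module
    feature_matrix = embed_words(words)        # S205-S206: text word embedding module
    return recognition_model(feature_matrix)   # S207: mail recognition model
```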
Fig. 3 is a schematic main flow diagram of a mail parsing module parsing a target mail according to an embodiment of the present invention. As shown in fig. 3, the processing flow of the mail parsing module according to the embodiment of the present invention includes the following steps:
Step S301: acquiring the mail title and the body content from the mail header and the mail body of the target mail, decoding the mail title and the body content, and splicing the decoding results to obtain initial text data. The original text of the mail specifies the encoding (generally base64) of the mail title and the body content; after this value is read, the corresponding method is used for decoding. The splicing in this step joins the decoded mail title and the decoded body content together.
Step S302: repeatedly adding the mail title to the initial text data according to the set first repetition number. Since the mail title is particularly important content in the whole mail, this step appropriately increases the weight of the mail title in the initial text data by repeated addition. The first repetition number can be set as needed, for example 1 or 2, and can be tuned by trial according to the subsequent training and prediction results.
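A minimal sketch of the decoding in step S301, using Python's standard `email` package. The function name is hypothetical, and the sketch assumes a single-part message whose title is in the `Subject` header; multipart mails would need each part handled separately.

```python
from email import message_from_string
from email.header import decode_header, make_header


def decode_title_and_body(raw_mail):
    """Return (title, body) decoded from a single-part raw mail string."""
    message = message_from_string(raw_mail)
    # The Subject header may be RFC 2047 encoded, e.g. "=?utf-8?b?...?=".
    title = str(make_header(decode_header(message.get("Subject", ""))))
    # decode=True undoes the Content-Transfer-Encoding (base64, quoted-printable).
    payload = message.get_payload(decode=True)  # bytes, or None for multipart
    charset = message.get_content_charset() or "utf-8"
    body = payload.decode(charset, errors="replace") if payload else ""
    return title, body
```

The initial text data of this step would then be the decoded title joined with the decoded body.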
Step S303: judging whether the format of the mail text is an HTML format, if not, executing the step S304; if it is in the HTML format, step S305 is performed. The format of the mail body may be an HTML format or a plain text format. Since the HTML format can highlight the key content of the body of the mail by using a tag, such as a title tag, a bold tag, and the like, the format determination is performed here. In an embodiment, the tag of the mail type may be obtained from the original mail text, and then it is determined whether the mail text is in the HTML format according to the tag.
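The format check in step S303 can be sketched with the standard `email` package by reading the content type declared in the mail. The function name is hypothetical, and treating any `text/html` part as "HTML format" is an assumption of this sketch.

```python
from email import message_from_string


def body_is_html(raw_mail):
    """Return True if any part of the mail declares the text/html content type."""
    message = message_from_string(raw_mail)
    # walk() visits the message itself and, for multipart mails, every sub-part.
    for part in message.walk():
        if part.get_content_type() == "text/html":
            return True
    return False
```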
Step S304: and taking the initial text data as final text data, returning to the mail identification module, and ending the process.
Step S305: acquiring key information with a set label from the mail body. The labels are set by the sender: when the mail is sent, effects chosen by the sender, such as bold text, are implemented through labels. The labels mainly collected here are: the title tag (h1-h6), used to mark title information in the text; the bold tag (b, strong), used to mark information that the sender wants the recipient to notice; and the hyperlink tag (a), used to mark and describe the page that the link jumps to.
HTML content is generally marked up as <h1>marked content</h1>, so when key information is acquired, a regular expression can be used to obtain the content inside the corresponding tag, which is the key information.
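The regular-expression extraction described here might look like the following sketch. The function name and the exact pattern are assumptions; the tag list follows the title, bold and hyperlink tags named in step S305. A regular expression is sufficient for illustration, although malformed HTML is generally handled more robustly by an HTML parser.

```python
import re

# Tags treated as carrying key information: headings, bold text, hyperlinks.
KEY_TAGS = ("h1", "h2", "h3", "h4", "h5", "h6", "b", "strong", "a")

# \b stops "b" from matching "<body>"; the backreference \1 pairs each
# opening tag with the closing tag of the same name.
KEY_TAG_PATTERN = re.compile(
    r"<(%s)\b[^>]*>(.*?)</\1\s*>" % "|".join(KEY_TAGS),
    re.IGNORECASE | re.DOTALL,
)


def extract_key_information(html_body):
    """Return the text inside each key tag, in document order."""
    results = []
    for match in KEY_TAG_PATTERN.finditer(html_body):
        # Drop any markup nested inside the tag and collapse whitespace.
        inner = re.sub(r"<[^>]+>", " ", match.group(2))
        inner = " ".join(inner.split())
        if inner:
            results.append(inner)
    return results
```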
Step S306: and repeatedly adding the key information to the initial text data according to the set second repetition times to obtain final text data, returning to the mail identification module, and ending the process. The location of the mail title, body content, and key information in the final text data is not limited in the embodiments. For example, the key information may be uniformly added to the end of the text, and the final text data may be composed of: mail title + body content + key information.
The step increases the weight of the key information in the initial text data in a repeated adding mode so as to improve the proportion of the key information in the final text data. The second number of repetitions may also be set by itself, for example, 1 time, 2 times, etc., and the number of repetitions may also be adjusted and tried repeatedly according to subsequent training and predicted effects.
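Steps S301, S302 and S306 together can be sketched as one assembly function. The function name, the space separator, and the ordering (title, body, repeated title, repeated key information) are assumptions of this sketch; as noted above, the embodiment does not fix the position of each element in the final text data.

```python
def build_text_data(title, body, key_information, title_repeats=1, key_repeats=1):
    """Assemble final text data with the title and key information repeated.

    key_information: list of strings taken from the set labels.
    title_repeats:   first repetition number (extra copies of the title).
    key_repeats:     second repetition number (copies of the key information).
    """
    parts = [title, body]                    # S301: initial text data
    parts.extend([title] * title_repeats)    # S302: repeat the mail title
    for _ in range(key_repeats):             # S306: repeat the key information
        parts.extend(key_information)
    return " ".join(part for part in parts if part)
```

Both repetition numbers are tuning parameters; larger values raise the share of the title and key information in the text seen by the later steps.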
The embodiment respectively and repeatedly adds the mail title and the key information in the target mail into the text data, reasonably improves the proportion of the mail title and the key information in the text data, and ensures the prediction effect of the mail recognition model.
Fig. 4 is a schematic main flow chart of the text segmentation module performing segmentation on text data according to the embodiment of the present invention. As shown in fig. 4, the processing flow of the text word segmentation module according to the embodiment of the present invention mainly includes the following steps:
Step S401: acquiring a candidate word set from a sentence of the text data, and looking up the occurrence probability of each candidate word in the candidate word set using a pre-established dictionary. In an embodiment, the dictionary includes a common word list and a stop word list. The common word list contains words that are frequently used and carry actual meaning, such as "computer" and "invoice"; each entry is recorded as "word: occurrence probability of the word in text", where the occurrence probability is a statistical result based on a large corpus. The stop word list contains words that occur frequently but carry no actual meaning, such as "should" and "then".
For a sentence in the text data, all possible words (that is, candidate words) are first found from left to right to form a candidate word set; the dictionary is then queried to obtain the occurrence probability of each candidate word in the candidate word set. For example, for the sentence 买水果然后来世博园 ("buy fruit and then come to the World Expo Park"), the candidate words include: 买 (buy) / 水果 (fruit) / 然后 (then) / 后来 (later) / 来世 (next life) / 世博园 (World Expo Park) / 博园 (Boyuan).
Step S402: combining the candidate words according to the sentence to obtain a plurality of segmentation combinations, and calculating the total probability of each segmentation combination from the occurrence probabilities of its candidate words. Continuing the above example, the segmentation combination may be 买 / 水果 / 然后 / 来 / 世博园 (buy / fruit / then / come / World Expo Park), or 买 / 水果 / 然后 / 来世 / 博园 (buy / fruit / then / next life / Boyuan), or 买 / 水 / 果然 / 后来 / 世博园 (buy / water / as expected / later / World Expo Park), and so on. The occurrence probabilities of the candidate words in a segmentation combination are multiplied to obtain the total probability of that combination.
This step calculates the combination with the highest total probability among all the segmentation combinations based on this sentence by the occurrence probability of each word. The embodiment can find the segmentation combination with the maximum total probability through a dynamic planning algorithm.
Step S403: Judge whether the maximum total probability is greater than a set probability threshold. If so, execute step S404; if the maximum total probability is less than or equal to the probability threshold, execute step S405. The probability threshold is tuned by repeated trials against the word segmentation results.
Step S404: Take the segmentation combination corresponding to the maximum total probability as the word segmentation result, and end the process. A maximum total probability greater than the probability threshold indicates that dictionary-based word segmentation worked well, so the segmentation combination with the maximum total probability can be used directly as the final word segmentation result.
Step S405: Re-segment the sentence using the pre-trained word segmentation model to obtain the word segmentation result, and end the process. A maximum total probability less than or equal to the probability threshold indicates that dictionary-based word segmentation worked poorly, so the sentence is segmented again with the word segmentation model.
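Steps S401 to S405 can be illustrated with the following minimal Python sketch. It is not the implementation of the embodiment: the function names, the maximum word length, and the floor probability assigned to single characters missing from the dictionary are assumptions made for illustration, and the word segmentation model is passed in as an arbitrary callable. Log probabilities are summed instead of multiplying raw probabilities, which selects the same combination while avoiding numerical underflow.

```python
import math
from typing import Callable, Dict, List, Tuple


def best_segmentation(
    sentence: str,
    word_prob: Dict[str, float],
    max_word_len: int = 4,        # assumed upper bound on candidate word length
    unknown_prob: float = 1e-8,   # assumed floor for single characters not in the dictionary
) -> Tuple[List[str], float]:
    """Return the segmentation combination with the maximum total probability."""
    n = len(sentence)
    best_score = [-math.inf] * (n + 1)  # best log-probability of sentence[:i]
    last_start = [0] * (n + 1)          # start index of the last word in that best split
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = sentence[start:end]          # candidate word (step S401)
            prob = word_prob.get(word, 0.0)
            if prob <= 0.0:
                if len(word) > 1:
                    continue                    # not a dictionary word
                prob = unknown_prob
            score = best_score[start] + math.log(prob)  # product of probabilities (step S402)
            if score > best_score[end]:
                best_score[end] = score
                last_start[end] = start
    # Walk back through the recorded start indices to recover the words.
    words: List[str] = []
    end = n
    while end > 0:
        start = last_start[end]
        words.append(sentence[start:end])
        end = start
    words.reverse()
    return words, math.exp(best_score[n])


def segment(
    sentence: str,
    word_prob: Dict[str, float],
    threshold: float,
    fallback_model: Callable[[str], List[str]],
) -> List[str]:
    """Dictionary-based segmentation with a model fallback (steps S403 to S405)."""
    words, total_prob = best_segmentation(sentence, word_prob)
    if total_prob > threshold:
        return words                  # step S404: dictionary result is good enough
    return fallback_model(sentence)   # step S405: re-segment with the model
```

For instance, with a toy dictionary {"ab": 0.1, "a": 0.05, "b": 0.04, "c": 0.05}, the sentence "abc" is split as ab/c, because 0.1 × 0.05 exceeds 0.05 × 0.04 × 0.05.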
In an optional embodiment, the word segmentation model is a hidden Markov model, which can automatically recognize new words from the context of the characters, thereby improving word segmentation accuracy. The training process of the word segmentation model includes the following steps:
(1) Label the state of each character in the training corpus using a word segmentation lexicon to obtain the labeling feature of the character. First, the training corpus is segmented by matching it against the words in the word segmentation lexicon; then the state of each character is labeled according to its position within its word in the segmented corpus. The states are: beginning of a word, middle of a word, end of a word, and single-character word. In an embodiment, B denotes the beginning of a word, M the middle of a word, E the end of a word, and S a single-character word. The training corpus may be news data, encyclopedia data, and the like.
Suppose the word segmentation result of a training corpus sentence is: 中文/分词/是/文本处理/不可或缺/的/一步 ("Chinese word segmentation is an indispensable step of text processing"). The labeling result is then: 中/B 文/E 分/B 词/E 是/S 文/B 本/M 处/M 理/E 不/B 可/M 或/M 缺/E 的/S 一/B 步/E. In an embodiment, the labeling feature of each character may be represented by a four-dimensional vector (the dimensions representing S, B, M, E, respectively). For example, for the character "中" ("middle") in the above example, the labeling feature is (0,1,0,0).
(2) From the labeled training corpus, count the probability of the same character being in each state as the labeling probability of the character; and input the training corpus into the hidden Markov model to obtain the probability of the same character being in each state as the training probability of the character. The parameters of the hidden Markov model mainly comprise the initial state distribution, the state transition probability matrix, and the emission probability matrix (the probability of generating a visible state from a hidden state). The hidden Markov model is trained with the training corpus as the visible state sequence and the character states (S, B, M, E) as the hidden states, which yields the probability of the same character being in each state as the training probability of the character.
(3) Adjust the parameters of the hidden Markov model according to the error between the labeling probability and the corresponding training probability of each character until the error is minimized, obtaining the word segmentation model. Specifically, a loss function is constructed from the errors between the labeling probability and the training probability of each character, and is minimized by continually adjusting the parameters of the hidden Markov model, yielding the trained hidden Markov model.
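The labeling in step (1) and the labeling probability in step (2) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the corpus is assumed to be already segmented into words, and the sketch stops at the counted labeling probabilities; it does not train or adjust the hidden Markov model itself.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Same order as the four-dimensional labeling feature: (S, B, M, E).
STATES = ("S", "B", "M", "E")


def label_word(word: str) -> List[Tuple[str, str]]:
    """Label each character of one word with its position state."""
    if len(word) == 1:
        return [(word, "S")]
    middle = [(ch, "M") for ch in word[1:-1]]
    return [(word[0], "B")] + middle + [(word[-1], "E")]


def label_sentence(words: List[str]) -> List[Tuple[str, str]]:
    """Label every character of a segmented sentence; empty words are skipped."""
    return [pair for word in words if word for pair in label_word(word)]


def one_hot(state: str) -> Tuple[int, ...]:
    """Four-dimensional labeling feature of a state, e.g. B -> (0, 1, 0, 0)."""
    return tuple(1 if s == state else 0 for s in STATES)


def labeling_probabilities(corpus: List[List[str]]) -> Dict[str, Dict[str, float]]:
    """For each character, the fraction of its occurrences labeled with each state."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for words in corpus:
        for ch, state in label_sentence(words):
            counts[ch][state] += 1
    return {
        ch: {s: c[s] / sum(c.values()) for s in STATES}
        for ch, c in counts.items()
    }
```

Using Latin letters as stand-in characters, the segmented sentence ab/c/def is labeled a/B b/E c/S d/B e/M f/E.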
In a preferred embodiment, after the word segmentation model has been trained, it can be retrained periodically on newly added spam so as to further optimize the model and improve its segmentation of homoglyphs (visually similar characters) and homophones in the text data.
FIG. 5 is a diagram illustrating an optimization process of a word segmentation model according to an embodiment of the present invention. As shown in fig. 5, the optimization process of the word segmentation model according to the embodiment of the present invention mainly includes the following steps:
Step S501: Acquire newly added spam samples, label the state of each character in the newly added spam samples, and add the labeled samples to a first training sample set. In an embodiment, the newly added spam samples can be obtained through honeypot technology or manual feedback; the states of the characters in the samples can then be labeled manually, and the labeled samples are added to the first training sample set.
Step S502: Judge whether the number of training samples in the first training sample set is greater than or equal to a set first number threshold. If so, execute step S503; otherwise, perform no processing. This step determines whether the number of training samples in the first training sample set meets the requirement of model training. The specific value of the first number threshold may be set as required.
Step S503: Retrain the word segmentation model based on the training samples to obtain a new word segmentation model. That is, once the number of training samples in the first training sample set meets the requirement of model training, training of the word segmentation model is triggered to obtain a new word segmentation model.
In this embodiment, newly added spam samples are obtained and used to continually retrain and update the word segmentation model, which enriches the training sample set, improves the word segmentation effect, and further improves the accuracy of spam identification.
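The threshold-triggered retraining of steps S501 to S503 can be sketched as below. The class name, the training callback, and in particular the choice to empty the sample set after each retraining are assumptions for illustration; the embodiment only specifies that training is triggered when the number of samples reaches the set number threshold.

```python
from typing import Callable, Generic, List, Optional, TypeVar

Sample = TypeVar("Sample")
Model = TypeVar("Model")


class RetrainTrigger(Generic[Sample, Model]):
    """Collects labeled samples and retrains once enough have accumulated."""

    def __init__(
        self,
        train_fn: Callable[[List[Sample]], Model],  # hypothetical training routine
        count_threshold: int,                       # the set number threshold
        clear_after_training: bool = True,          # assumption, not stated in the text
    ) -> None:
        if count_threshold < 1:
            raise ValueError("count_threshold must be at least 1")
        self.train_fn = train_fn
        self.count_threshold = count_threshold
        self.clear_after_training = clear_after_training
        self.samples: List[Sample] = []

    def add_sample(self, labeled_sample: Sample) -> Optional[Model]:
        """Add one labeled sample; return a new model if training was triggered."""
        self.samples.append(labeled_sample)            # step S501
        if len(self.samples) < self.count_threshold:   # step S502
            return None                                # no processing
        model = self.train_fn(list(self.samples))      # step S503
        if self.clear_after_training:
            self.samples.clear()
        return model
```

Because the trigger is generic over the sample and model types, the same pattern could be reused for any model that is retrained from an accumulating sample set.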
The text word embedding module obtains word vectors based on word2vec. Because the word2vec model is trained on a normal corpus (such as news data or encyclopedia entry data), the resulting word vectors do not cover words in spam that use homoglyphs and homophones to evade detection, which can cause a severe loss of text features at final prediction. To improve the conversion of these words into word vectors, in a preferred embodiment the word vector model may be retrained periodically on newly added spam to further optimize the word vector model.
Fig. 6 is a schematic diagram of an optimization process of the word vector model according to the embodiment of the present invention. As shown in fig. 6, the optimization process of the word vector model according to the embodiment of the present invention mainly includes the following steps:
Step S601: Acquire newly added spam samples, add word segmentation labels to the newly added spam samples, and add the labeled samples to a second training sample set. In an embodiment, the newly added spam samples can be obtained through honeypot technology or manual feedback; the mail content of each sample can then be segmented into words, word segmentation labels (such as spaces) are added, and the labeled samples are added to the second training sample set.
Step S602: Judge whether the number of training samples in the second training sample set is greater than or equal to a set second number threshold. If so, execute step S603; otherwise, perform no processing. This step determines whether the number of training samples in the second training sample set meets the requirement of model training. The specific value of the second number threshold may be set as required.
Step S603: Retrain the word vector model based on the training samples to obtain a new word vector model. That is, once the number of training samples in the second training sample set meets the requirement of model training, training of the word vector model is triggered to obtain a new word vector model.
In this embodiment, newly added spam samples are obtained and used to continually retrain and update the word vector model, which enriches the training sample set, improves the word vector conversion effect, and further improves the accuracy of spam identification.
FIG. 7 is a diagram illustrating a training process of a mail recognition model according to an embodiment of the present invention. As shown in fig. 7, the training process of the mail recognition model according to the embodiment of the present invention mainly includes the following steps:
Step S701: Acquire a mail sample set, and label each mail sample in the set with a category label. The mail sample set may include historical mail samples and newly added mail samples. Labeling with a category label means assigning each mail sample the label of either spam sample or normal mail sample. In an embodiment, the mail samples may be screened to filter out mails with repetitive features and high content similarity.
Step S702: Perform feature extraction on the screened mail samples to obtain the corresponding feature matrices, and divide the feature matrices into a training set and a test set. The specific implementation of the feature extraction is described in steps S204 and S206 and is not repeated here.
It should be noted that, in the vectorized representation, a word that does not exist in the mapping relationship is replaced by a zero vector; a sample whose number of words is below a specific threshold is padded with zero vectors; and a sample whose number of words is above the threshold is truncated so that only a portion equal in length to the threshold is kept for subsequent training.
After feature extraction is finished, the obtained feature matrices are randomly shuffled and then divided into a training set and a test set. The division ratio of the training set to the test set can be set to 8:2 or 7:3, and the specific value can be adjusted according to experimental results to obtain an optimized ratio.
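The zero-vector substitution, padding, truncation, and shuffled split described above can be sketched as follows. The function names and the fixed random seed are illustrative assumptions, and the sketch keeps the first words of an over-long sample, which is one possible reading of the truncation rule.

```python
import random
from typing import Dict, List, Sequence, Tuple, TypeVar

Vector = List[float]
T = TypeVar("T")


def to_feature_matrix(
    words: Sequence[str],
    word_vectors: Dict[str, Vector],  # mapping relationship between words and word vectors
    dim: int,                         # dimension of each word vector
    max_words: int,                   # the word-count threshold
) -> List[Vector]:
    """Build a fixed-size (max_words x dim) feature matrix for one sample."""
    zero = [0.0] * dim
    # Truncate to the threshold; unknown words become zero vectors.
    rows = [list(word_vectors.get(w, zero)) for w in words[:max_words]]
    # Pad short samples with zero vectors.
    rows += [list(zero) for _ in range(max_words - len(rows))]
    return rows


def shuffle_and_split(
    samples: Sequence[T],
    train_ratio: float = 0.8,  # 8:2 split; 0.7 would give 7:3
    seed: int = 0,             # assumed fixed seed, for reproducibility only
) -> Tuple[List[T], List[T]]:
    """Randomly shuffle the samples, then divide them into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]
```

Padding and truncating every sample to the same number of rows lets the feature matrices be stacked into one batch for training.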
Step S703: Input the training set into a machine learning model for training to obtain an initial mail recognition model, and then input the test set into the initial mail recognition model to obtain prediction results. In an embodiment, the machine learning model may be a Gated Recurrent Unit (GRU) network.
Step S704: Compare the prediction results with the corresponding category labels of the test set to obtain model evaluation data. The model evaluation data may be the prediction accuracy.
Step S705: Adjust the initial mail recognition model according to the model evaluation data to obtain the final mail recognition model. If the model evaluation data does not meet the set standard, the hyperparameters of the initial mail recognition model can be adjusted and the model trained again. The training, model evaluation, and hyperparameter adjustment processes are repeated until a satisfactory mail recognition model is obtained, which is used as the final mail recognition model.
In this step, the hyperparameters may be the deep learning model type (LSTM or GRU), the number of hidden neurons (128 or 256), and the optimizer type (RMSProp or Adam). In an embodiment, different hyperparameter combinations can be cross-trained on a training set with a small data volume, and the optimal combination is selected by comparing training time and mail recognition effect. Here, LSTM stands for Long Short-Term Memory network, RMSProp stands for Root Mean Square Propagation, and Adam stands for Adaptive Moment Estimation.
In the above steps, the mail recognition model is trained and optimized automatically by machine learning, which avoids manually analyzing and formulating filtering rules and reduces the manual maintenance cost; at the same time, the recognition accuracy for spam can be effectively improved as the training samples are continually enriched.
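The evaluation in step S704 and the hyperparameter comparison in step S705 can be sketched as follows. The search space mirrors the hyperparameter values named above; the callback train_and_evaluate is hypothetical and stands for whatever routine trains one model and returns its accuracy and training time, and ranking by accuracy with training time as a tie-breaker is an assumed selection rule, since the text only says that both are compared.

```python
import itertools
from typing import Any, Callable, Dict, List, Sequence, Tuple

Params = Dict[str, Any]

SEARCH_SPACE: Dict[str, List[Any]] = {
    "model_type": ["LSTM", "GRU"],
    "hidden_units": [128, 256],
    "optimizer": ["RMSProp", "Adam"],
}


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Prediction accuracy: fraction of test samples whose prediction matches the label."""
    if len(predictions) != len(labels) or len(labels) == 0:
        raise ValueError("predictions and labels must be non-empty and of equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def grid(space: Dict[str, List[Any]]) -> List[Params]:
    """Enumerate every hyperparameter combination in the search space."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in itertools.product(*(space[k] for k in keys))]


def select_best(
    space: Dict[str, List[Any]],
    train_and_evaluate: Callable[[Params], Tuple[float, float]],  # -> (accuracy, seconds)
) -> Tuple[Params, List[Tuple[Params, float, float]]]:
    """Train each combination and pick the one with the best accuracy."""
    results = [(params, *train_and_evaluate(params)) for params in grid(space)]
    # Assumed rule: highest accuracy wins; shorter training time breaks ties.
    best = max(results, key=lambda r: (r[1], -r[2]))
    return best[0], results
```

With the search space above, eight combinations are trained, so running the comparison on a small training set, as the embodiment suggests, keeps the cost of the search manageable.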
Fig. 8 is a schematic diagram of the main modules of a spam recognition apparatus according to an embodiment of the present invention. As shown in fig. 8, a spam recognition apparatus 800 according to an embodiment of the present invention mainly includes:
The mail parsing module 801 is configured to parse a target mail, acquire mail information of the target mail and key information with a set tag, and splice the mail information and the key information to obtain text data. Parsing the target mail mainly comprises obtaining the mail information and the key information with the set tag from the mail header and the mail body, decoding the information according to mail decoding rules, and splicing it into a complete text to obtain the text data.
The mail information may include the mail title and the body content. The key information is information in the target mail that carries specified tags set by the sender, such as a title tag, a bold tag, or a hyperlink tag. In an embodiment, the content inside the corresponding tag may be obtained using a regular expression.
The feature extraction module 802 is configured to perform word segmentation on the text data to obtain word segmentation results, and to vectorize the word segmentation results using a pre-trained word vector model to obtain the corresponding feature matrix. Specifically, each sentence in the text data can first be segmented in a dictionary-based manner; if the resulting word segmentation effect is poor, a pre-trained word segmentation model can further be used to re-segment the sentence, thereby improving the word segmentation effect.
After word segmentation is finished, the words in the word segmentation results are converted into the corresponding word vectors according to the mapping relationship between words and word vectors; that is, each word is replaced by its word vector through word embedding, yielding the corresponding feature matrix. The mapping relationship between words and word vectors can be obtained by training a word vector model. Word vector models include word2vec, the LSA model, the PLSA model, and the like; compared with the other models, word2vec can better reflect the contextual relationship between words.
The mail identification module 803 is configured to input the feature matrix into a pre-trained mail recognition model for recognition, and to output a recognition result of the target mail. The mail recognition model is used for identifying whether the target mail is spam. After feature extraction, the obtained feature matrix is input into the trained mail recognition model, which produces the recognition result indicating whether the target mail is spam.
In addition, the spam email recognition device 800 according to the embodiment of the present invention may further include: a word segmentation model training module, a word segmentation model optimization module, a word vector model optimization module, and a mail recognition model training module (not shown in fig. 8). The functions realized by the modules are as described above, and are not described in detail here.
From the above description, it can be seen that by acquiring the key information with the set tag in the target email and adding the key information into the text data, the proportion of the key information in the text data is increased, and then the spam email is identified by combining the trained email identification model, so that the accuracy of spam email identification is improved, the maintenance cost is reduced, and the influence of human factors is avoided.
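The extraction of tagged key information with a regular expression, and the splicing that raises its proportion in the text data, can be sketched as follows. The set of HTML tags treated as title, bold, and hyperlink tags, the function names, and the default repetition counts are assumptions for illustration; a regular expression is used because the embodiment mentions one, although it handles only simple, well-formed markup.

```python
import re
from html import unescape
from typing import List

# Assumed HTML tags for the title, bold, and hyperlink labels.
KEY_TAG_PATTERN = re.compile(
    r"<(h[1-6]|title|b|strong|a)\b[^>]*>(.*?)</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)
ANY_TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(html: str) -> str:
    """Remove markup, decode HTML entities, and collapse whitespace."""
    return " ".join(unescape(ANY_TAG_PATTERN.sub(" ", html)).split())


def extract_key_info(html_body: str) -> List[str]:
    """Return the text inside each tagged span, in document order."""
    found = []
    for match in KEY_TAG_PATTERN.finditer(html_body):
        text = strip_tags(match.group(2))
        if text:
            found.append(text)
    return found


def build_text_data(
    subject: str,
    html_body: str,
    title_repeats: int = 2,  # assumed first repetition count
    key_repeats: int = 2,    # assumed second repetition count
) -> str:
    """Splice mail information with repeated copies of the title and key information."""
    parts = [subject, strip_tags(html_body)]            # mail information
    parts += [subject] * title_repeats                  # copied mail title
    parts += extract_key_info(html_body) * key_repeats  # copied key information
    return " ".join(p for p in parts if p)
```

Repeating the title and the tagged spans means they contribute more words to the feature matrix than ordinary body text does, which is how the proportion of key information in the text data is increased.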
Fig. 9 shows an exemplary system architecture 900 to which the spam recognition method or device of an embodiment of the invention can be applied.
As shown in fig. 9, the system architecture 900 may include terminal devices 901, 902, 903, a network 904, and a server 905. The network 904 is the medium used to provide communication links between the terminal devices 901, 902, 903 and the server 905. The network 904 may include various connection types, such as wired communication links, wireless communication links, or fiber optic cables.
A user may use the terminal devices 901, 902, 903 to interact with a server 905 over a network 904 to receive or send messages and the like. The terminal devices 901, 902, 903 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 905 may be a server that provides various services, such as a background management server that processes target mails sent by users through the terminal devices 901, 902, 903. The background management server can parse the target mail, extract its features, input them into the mail recognition model, and so on, and feed back a processing result (such as the generated recognition result) to the terminal device.
It should be noted that the spam email identification method provided by the embodiment of the present application is generally executed by the server 905, and accordingly, the spam email identification apparatus is generally disposed in the server 905.
It should be understood that the number of terminal devices, networks, and servers in fig. 9 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The invention also provides an electronic device and a computer readable medium according to the embodiment of the invention.
The electronic device of the present invention includes: one or more processors; a storage device, configured to store one or more programs that, when executed by the one or more processors, cause the one or more processors to implement a spam recognition method according to an embodiment of the present invention.
The computer readable medium of the present invention has stored thereon a computer program which, when executed by a processor, implements a spam recognition method of an embodiment of the present invention.
Referring now to FIG. 10, shown is a block diagram of a computer system 1000 suitable for use in implementing an electronic device of an embodiment of the present invention. The electronic device shown in fig. 10 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 10, the computer system 1000 includes a Central Processing Unit (CPU) 1001 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data necessary for the operation of the computer system 1000 are also stored. The CPU 1001, ROM 1002, and RAM 1003 are connected to each other via a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 1008 including a hard disk and the like; and a communication section 1009 including a network interface card such as a LAN card or a modem. The communication section 1009 performs communication processing via a network such as the Internet. A drive 1010 is also connected to the I/O interface 1005 as necessary. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1010 as necessary, so that a computer program read from it can be installed into the storage section 1008 as necessary.
In particular, according to the disclosed embodiments of the present invention, the processes described above with reference to the main step diagrams may be implemented as computer software programs. For example, the disclosed embodiments of the present invention include a computer program product comprising a computer program carried on a computer readable medium, the computer program containing program code for performing the method illustrated in the main step diagrams. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1009 and/or installed from the removable medium 1011. When the computer program is executed by the Central Processing Unit (CPU) 1001, the above-described functions defined in the system of the present invention are performed.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a mail parsing module, a feature extraction module, and a mail identification module. For example, the mail parsing module may be further described as a module that parses a target mail, acquires mail information of the target mail and key information with a set tag, and concatenates the mail information and the key information to obtain text data.
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to perform: parsing a target mail, acquiring mail information of the target mail and key information with a set tag, and splicing the mail information and the key information to obtain text data; performing word segmentation on the text data to obtain word segmentation results, and vectorizing the word segmentation results using a pre-trained word vector model to obtain the corresponding feature matrix; and inputting the feature matrix into a pre-trained mail recognition model for recognition, and outputting a recognition result of the target mail, wherein the mail recognition model is used for identifying whether the target mail is spam.
According to the technical scheme of the embodiment of the invention, the key information with the set label in the target mail is obtained, and the key information is added into the text data, so that the proportion of the key information in the text data is improved, and the spam mail is identified by combining a trained mail identification model, so that the accuracy of spam mail identification is improved, the maintenance cost is reduced, and the influence of human factors is avoided.
The product can execute the method provided by the embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided by the embodiment of the present invention.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.