CN106506327B - Junk mail identification method and device - Google Patents

Junk mail identification method and device

Info

Publication number
CN106506327B
Authority
CN
China
Prior art keywords
vector
words
feature vectors
classifier
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610888007.0A
Other languages
Chinese (zh)
Other versions
CN106506327A (en)
Inventor
杜强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neusoft Corp
Original Assignee
Neusoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neusoft Corp filed Critical Neusoft Corp
Priority to CN201610888007.0A priority Critical patent/CN106506327B/en
Publication of CN106506327A publication Critical patent/CN106506327A/en
Application granted granted Critical
Publication of CN106506327B publication Critical patent/CN106506327B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21Monitoring or handling of messages
    • H04L51/212Monitoring or handling of messages using filtering or selective blocking
    • H04L51/42Mailbox-related aspects, e.g. synchronisation of mailboxes

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and a device for identifying junk mails, wherein the method comprises the following steps: extracting a text in the mail to be identified, and dividing the text by taking words as units to obtain a word sequence; and converting words in the word sequence into feature vectors with corresponding relations to the words according to the corresponding relations between the words and the feature vectors acquired in advance to obtain a vector sequence, wherein the vector sequence comprises the feature vectors with corresponding relations to the words in the word sequence. And grouping the characteristic vectors in the vector sequence according to a preset standard to obtain a plurality of vector groups. And taking the vector group as an input parameter of a classifier, so that the classifier classifies the mails to be identified by combining context correlation to obtain a classification result, wherein the classification result is used for determining whether the mails to be identified belong to junk mails. The invention combines the influence of the context correlation on the mail identification, and improves the accuracy of the spam identification.

Description

Junk mail identification method and device
Technical Field
The invention relates to the field of data processing, in particular to a method and a device for identifying junk mails.
Background
With the continuous development of the internet, e-mail has become increasingly popular, and commercial promotion carried by e-mail is widely used, which has also led to a flood of spam. Spam typically consumes a large amount of resources and suffers from problems such as poorly targeted recipients, forced delivery, and a large amount of false information. Spam has therefore long been a serious nuisance for internet users.
To deter spam, various spam identification techniques, such as whitelisting, blacklisting, and content-based filtering, are embedded in current e-mail systems. However, existing spam identification methods essentially identify spam based on keywords or word frequency. They take a single angle and ignore other factors that affect identification accuracy, so their accuracy is insufficient.
Disclosure of Invention
The invention provides a junk mail identification method and device, which can improve the accuracy of junk mail identification.
The invention provides a junk mail identification method, which comprises the following steps:
extracting a text in the mail to be identified, and dividing the text by taking words as units to obtain a word sequence;
converting words in the word sequence into feature vectors having a corresponding relationship with the words according to a corresponding relationship between the words and the feature vectors acquired in advance to obtain a vector sequence, wherein the vector sequence comprises the feature vectors having a corresponding relationship with each word in the word sequence;
grouping the feature vectors in the vector sequence according to a preset standard to obtain a plurality of vector groups;
and taking the vector group as an input parameter of a classifier, so that the classifier classifies the mails to be identified by combining context correlation to obtain a classification result, wherein the classification result is used for determining whether the mails to be identified belong to junk mails.
Preferably, grouping the feature vectors in the vector sequence according to a preset standard to obtain a plurality of vector groups includes:
grouping the feature vectors in the vector sequence by taking sentences or paragraphs as the standard to obtain a plurality of vector groups.
Preferably, the classifier is formed by a convolutional neural network;
the step of using the vector group as an input parameter of a classifier so that the classifier classifies the mails to be identified by combining context correlation to obtain a classification result, wherein the classification result is used for determining whether the mails to be identified belong to junk mails, and comprises the steps of:
taking the feature vectors in the vector group as input parameters of a first layer convolutional neural network of the classifier to obtain feature vectors corresponding to the vector group, wherein the feature vectors corresponding to the vector group are used for representing the semantics of sentences or paragraphs;
taking the feature vector corresponding to the vector group as an input parameter of a second-layer convolutional neural network of the classifier to obtain a feature vector of a text in the mail to be recognized, wherein the feature vector of the text in the mail to be recognized is used for representing the semantic of the text combined with context correlation;
and taking the feature vector of the text in the mail to be identified as an input parameter of a full connection layer of the classifier, and obtaining a classification result after classification processing of the full connection layer, wherein the classification result is used for determining whether the mail to be identified belongs to a junk mail.
Preferably, the first layer of convolutional neural network of the classifier includes N convolutional kernels, where N is a natural number;
taking the feature vectors in the vector group as input parameters of a first layer convolutional neural network of the classifier to obtain feature vectors corresponding to the vector group, wherein the feature vectors corresponding to the vector group are used for representing semantics of sentences or paragraphs, and the method comprises the following steps:
obtaining a convolution layer output result of the vector group at each convolution kernel by utilizing one-dimensional convolution operation, wherein the convolution layer output result comprises output results of convolution operation performed on the vector group and the convolution kernels respectively by taking each feature vector as a convolution operation initial value in sequence;
respectively obtaining the maximum value of the vector group in the convolution layer output result of each convolution kernel;
and combining the maximum values of the vector group in the convolution layer output result of each convolution kernel to obtain the characteristic vector corresponding to the vector group.
Preferably, before the converting the words in the word sequence into the feature vectors having a correspondence with the words according to the correspondence between the words and the feature vectors obtained in advance, the method further includes:
replacing words of a preset type in the word sequence with a preset label;
constructing a feature vector for the label in advance, and acquiring a corresponding relation between the label and the feature vector;
correspondingly, the converting the words in the word sequence into the feature vectors having a corresponding relationship with the words according to the corresponding relationship between the words and the feature vectors obtained in advance to obtain a vector sequence includes:
converting words in the word sequence into feature vectors with corresponding relations with the words according to the corresponding relations between the words and the feature vectors acquired in advance; and converting the labels in the word sequence into the feature vectors with corresponding relations to the labels according to the corresponding relations between the labels and the feature vectors to obtain a vector sequence.
Preferably, the constructing a feature vector for the tag in advance includes:
randomly generating a feature vector, and judging whether the Euclidean distance between the feature vector and each feature vector in the corresponding relation between the word and the feature vector is smaller than a preset constant or not;
and when the Euclidean distance between the feature vector and each feature vector is smaller than a preset constant, the feature vector is allocated to a label.
The invention also provides a spam recognition device, which comprises:
the segmentation module is used for extracting a text in the mail to be identified and segmenting the text by taking words as units to obtain a word sequence;
the conversion module is used for converting words in the word sequence into feature vectors with corresponding relations to the words according to the corresponding relations between the words and the feature vectors acquired in advance to obtain a vector sequence, and the vector sequence comprises the feature vectors with corresponding relations to all the words in the word sequence;
the grouping module is used for grouping the feature vectors in the vector sequence according to a preset standard to obtain a plurality of vector groups;
and the classification module is used for taking the vector group as an input parameter of a classifier so that the classifier classifies the mails to be identified by combining context correlation to obtain a classification result, and the classification result is used for determining whether the mails to be identified belong to junk mails.
Preferably, the grouping module is specifically configured to:
grouping the feature vectors in the vector sequence by taking sentences or paragraphs as the standard to obtain a plurality of vector groups.
Preferably, the classifier is formed by a convolutional neural network; the classification module comprises:
the first classification submodule is used for taking the feature vectors in the vector group as input parameters of a first layer of convolutional neural network of the classifier to obtain the feature vectors corresponding to the vector group, wherein the feature vectors corresponding to the vector group are used for representing the semantics of sentences or paragraphs;
the second classification submodule is used for taking the feature vectors corresponding to the vector group as input parameters of a second layer convolutional neural network of the classifier to obtain feature vectors of texts in the mails to be recognized, wherein the feature vectors of the texts in the mails to be recognized are used for expressing semantics of the texts after context correlation is combined;
and the third classification submodule is used for taking the feature vector of the text in the mail to be recognized as the input parameter of the full connection layer of the classifier, and obtaining a classification result after classification processing of the full connection layer, wherein the classification result is used for determining whether the mail to be recognized belongs to the junk mail.
Preferably, the first layer of convolutional neural network of the classifier includes N convolutional kernels, where N is a natural number;
the first classification submodule includes:
the convolution operation submodule is used for obtaining a convolution layer output result of the vector group in each convolution kernel by utilizing one-dimensional convolution operation, and the convolution layer output result comprises output results of convolution operation performed on the convolution layer and the convolution kernels by sequentially taking each feature vector in the vector group as a convolution operation initial value;
the obtaining submodule is used for respectively obtaining the maximum value of the vector group in the convolution layer output result of each convolution kernel;
and the combination submodule is used for combining the maximum values of the vector group in the convolution layer output result of each convolution kernel to obtain the characteristic vector corresponding to the vector group.
The invention provides a junk mail identification method, which comprises the steps of firstly extracting a text in a mail to be identified, and segmenting the text by taking words as units to obtain a word sequence; and converting words in the word sequence into feature vectors with corresponding relations to the words according to the corresponding relations between the words and the feature vectors acquired in advance to obtain a vector sequence, wherein the vector sequence comprises the feature vectors with corresponding relations to the words in the word sequence. And secondly, grouping the characteristic vectors in the vector sequence by a preset standard to obtain a plurality of vector groups. And finally, taking the vector group as an input parameter of a classifier, so that the classifier classifies the mails to be identified by combining context correlation to obtain a classification result, wherein the classification result is used for determining whether the mails to be identified belong to junk mails. Compared with the junk mail identification method in the prior art, the method combines the influence of the context correlation on the mail identification, and improves the accuracy of the junk mail identification.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a flowchart of a spam identification method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a vector sequence after grouping according to an embodiment of the present invention;
Fig. 3 is a flowchart of a processing method of a classifier according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a classifier according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a spam identification device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The contextual relevance of the text content of an e-mail has a crucial influence on spam identification. For example, the word "viagra" is often given a high spam weight through rules or through training on samples. However, a joke from a friend that mentions "viagra", or a serious e-mail in which medical professionals discuss it, would then also be identified as spam. This is clearly a consequence of identifying spam without considering contextual relevance. Identification methods that ignore contextual relevance and semantics inevitably suffer in accuracy, with an especially high error rate when distinguishing normal mail in a professional field from spam in the same field.
Therefore, the junk mail identification method provided by the invention combines the influence of context correlation and can identify the junk mail more accurately.
The following description will be made of specific contents of examples.
An embodiment of the present invention provides a spam identification method. Fig. 1 is a flowchart of the spam identification method provided in the embodiment; the method specifically includes:
S101: Extracting the text of the mail to be identified, and segmenting the text into words to obtain a word sequence.
The spam identification method provided by the embodiment of the invention can be applied to a mail gateway, a mail server, a mail client, and similar systems. In practical application, the mail data on these different systems is encapsulated with specific encodings or protocols. By converting the mail data to text in advance, the embodiment of the invention hides these differences from the subsequent processing, so that the system has good adaptability.
In addition, the embodiment of the invention realizes the identification of the junk mails based on the text content in the emails, and does not relate to the identification of the contents such as pictures, attachments and the like in the emails.
In practical application, firstly, a text in an email to be recognized is extracted, and as the spam email is recognized based on the semantic meaning of the text, the text is segmented by taking words as units to obtain a word sequence after the text is extracted, wherein the word sequence is the text segmented by taking the words as units.
In the embodiment of the present invention, the text may be segmented into words by a method based on character string matching, such as the bidirectional maximum matching method, by a method based on a hidden Markov model (HMM), by a method based on deep learning, and so on. The embodiment of the present invention does not limit which method is used for segmenting the text. Preferably, the HMM-based method or the deep-learning-based method is used, as these perform better than the other methods.
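As an illustration of the string-matching family mentioned above, the following is a minimal sketch of bidirectional maximum matching. The dictionary `vocab`, the `max_len` limit, and the tie-breaking heuristic (fewer words, then fewer single-character words) are assumptions made for the example and are not specified in this document. Unspaced English text stands in for text that has no word delimiters.

```python
def forward_max_match(text, vocab, max_len):
    """Scan left to right, taking the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, vocab, max_len):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def bidirectional_max_match(text, vocab, max_len=5):
    """Run both scans and keep the segmentation that looks more plausible."""
    fwd = forward_max_match(text, vocab, max_len)
    bwd = backward_max_match(text, vocab, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd

    def singles(ws):
        return sum(1 for w in ws if len(w) == 1)

    return fwd if singles(fwd) <= singles(bwd) else bwd


# Hypothetical dictionary and input, for illustration only.
vocab = {"free", "money", "now"}
word_sequence = bidirectional_max_match("freemoneynow", vocab)
```

The resulting `word_sequence` is the kind of word sequence that step S102 then converts into feature vectors.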
S102: Converting the words in the word sequence into the feature vectors corresponding to those words, according to the correspondence between words and feature vectors acquired in advance, to obtain a vector sequence, wherein the vector sequence comprises the feature vectors corresponding to the words in the word sequence.
In the embodiment of the invention, the correspondence between words and feature vectors is obtained in advance and stored in the system for later lookup. Specifically, in one implementation, the GloVe (Global Vectors for Word Representation) method may be used to train on pre-obtained samples to obtain the correspondence between words and feature vectors. The samples used by the GloVe method may be natural corpora obtained from news, web pages, and the like. In addition, the way of obtaining the correspondence between words and feature vectors in the embodiment of the present invention is not limited to the GloVe method; other existing techniques can also be used, and they are not described here.
It is worth emphasizing that the feature vectors in the correspondence obtained by the GloVe method satisfy the following conditions. First, the nearest neighbors of the feature vector corresponding to a word should be synonyms or closely related words of that word; for example, the nearest neighbors of the feature vector corresponding to the word "frog" should be, in order, frogs, toad, litoria, leptodactylidae, rana, lizard, eleutherodactylus, and so on. Second, the feature vectors of related words have a linear relationship, for example v(queen) ≈ v(king) - v(man) + v(woman), where v() is the mapping from a word to its feature vector, and queen, king, man, and woman are related words.
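The linear relationship can be checked numerically. The sketch below uses hand-made two-dimensional toy vectors, which are an assumption chosen so that the arithmetic is easy to follow; real GloVe vectors have far more dimensions and satisfy the relation only approximately.

```python
import math

# Toy vectors (hypothetical): first axis ~ "royalty", second axis ~ "female".
toy = {
    "man": (1.0, 0.0),
    "woman": (1.0, 1.0),
    "king": (3.0, 0.0),
    "queen": (3.0, 1.0),
}


def nearest(target, table, exclude=()):
    """Return the word whose vector has the smallest Euclidean distance to target."""
    candidates = (w for w in table if w not in exclude)
    return min(candidates, key=lambda w: math.dist(table[w], target))


# v(king) - v(man) + v(woman), computed component by component.
target = tuple(
    k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])
)
result = nearest(target, toy, exclude=("king", "man", "woman"))
```

With these toy values `target` coincides with the vector of "queen", so `result` is "queen".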
In actual operation, each word in the obtained word sequence is converted into the feature vector corresponding to that word, according to the correspondence between words and feature vectors stored in the system in advance, so as to obtain a vector sequence. The vector sequence comprises the feature vectors corresponding to each of the words in the word sequence.
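A minimal sketch of this lookup step follows. The table contents are made up for illustration, and the handling of words missing from the table (mapping them to a shared `unknown` vector) is an assumption; this document does not say how out-of-vocabulary words are treated.

```python
def to_vector_sequence(words, table, unknown):
    """Map each word to its feature vector, keeping the original word order."""
    return [table.get(word, unknown) for word in words]


# Hypothetical 3-dimensional table standing in for the stored correspondence.
table = {
    "free": [0.2, 0.7, 0.1],
    "money": [0.9, 0.1, 0.4],
}
unknown = [0.0, 0.0, 0.0]  # assumption: fallback for words not in the table

vector_sequence = to_vector_sequence(["free", "money", "zzz"], table, unknown)
```

The output has exactly one vector per input word, which is what the later grouping step relies on.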
In a preferred embodiment, after obtaining a word sequence, the embodiment of the present invention finds a preset type of word, such as a number, a symbol, and the like, in the word sequence, and replaces the preset type of word with a preset tag. For example, the date "2016-6-1" is replaced with the tag "< date >".
Because words of the preset types are generally irrelevant to identifying spam, the embodiment of the invention uniformly replaces them with preset tags. On the one hand, this simplifies the spam identification process; on the other hand, it improves the generalization capability of the classifier, so that the classifier treats e-mails that differ only in certain numbers, dates, and the like as e-mails of the same type.
In practical application, the embodiment of the invention can realize the matching of words of preset types and the replacement of preset labels by utilizing the regular expression, and the embodiment of the invention can replace the words matched with the regular expression table entries into the corresponding labels by maintaining a regular expression library.
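The regular expression library described above can be sketched as an ordered list of pattern and tag pairs. Only the `<date>` tag comes from this document; the `<number>` tag and both patterns are hypothetical entries added for illustration.

```python
import re

# Ordered rule library: the first matching pattern wins.
TAG_RULES = [
    (re.compile(r"^\d{4}-\d{1,2}-\d{1,2}$"), "<date>"),
    (re.compile(r"^\d+(\.\d+)?$"), "<number>"),  # hypothetical extra rule
]


def replace_with_tags(words):
    """Replace every word that matches a rule with the rule's tag."""
    replaced = []
    for word in words:
        for pattern, tag in TAG_RULES:
            if pattern.match(word):
                replaced.append(tag)
                break
        else:
            # No rule matched, so the word is kept unchanged.
            replaced.append(word)
    return replaced


tagged = replace_with_tags(["meeting", "on", "2016-6-1", "room", "42"])
```

Because the rules are tried in order, more specific patterns such as dates should be listed before more general ones such as plain numbers.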
In addition, because the corresponding relation between the words and the feature vectors obtained by the GloVe method does not include the feature vector corresponding to the label, after the label is preset, the GloVe method can be used for constructing the feature vector for the label. Specifically, a GloVe method is used for randomly generating a feature vector, and whether the Euclidean distance between the feature vector and the feature vector corresponding to each word in the pre-acquired corresponding relation between the words and the feature vector is smaller than a preset constant or not is judged. And if the Euclidean distance between the feature vector and the feature vector corresponding to each word is smaller than a preset constant, the feature vector is allocated to a label. In the above manner, corresponding feature vectors are constructed for each label.
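The accept-or-retry procedure in the paragraph above can be sketched as follows. The distance condition is implemented exactly as stated in the text (the candidate is accepted when its Euclidean distance to every word vector is smaller than the preset constant). Uniform sampling in [-1, 1], the `max_tries` limit, and all numeric values are assumptions made for the example.

```python
import math
import random


def build_tag_vector(word_vectors, dim, max_distance, rng, max_tries=10000):
    """Draw random candidates until one satisfies the distance condition."""
    for _ in range(max_tries):
        candidate = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        # Condition as stated in the text: distance to each word vector
        # must be smaller than the preset constant.
        if all(math.dist(candidate, v) < max_distance for v in word_vectors):
            return candidate
    raise RuntimeError("no candidate satisfied the distance condition")


# Hypothetical word vectors and threshold.
word_vectors = [[0.0, 0.0], [0.5, 0.5]]
rng = random.Random(0)
tag_vector = build_tag_vector(word_vectors, dim=2, max_distance=3.0, rng=rng)
```

Calling the function once per tag yields one vector for each tag, which can then be stored alongside the word vectors.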
In practical application, each tag in the word sequence is also converted into a corresponding feature vector according to the corresponding relationship between each tag and the feature vector.
S103: Grouping the feature vectors in the vector sequence according to a preset standard to obtain a plurality of vector groups.
In the embodiment of the present invention, the preset standard may be a standard using a sentence or a paragraph, or may be a standard using a fixed length or a fixed number of words.
In practical application, the vector sequences are grouped according to a preset standard to obtain a plurality of vector groups, wherein each vector group comprises grouped feature vectors.
In practical application, when the vector sequence is grouped by sentence, the sentences can be identified according to the punctuation in the sequence, and the feature vectors are then grouped sentence by sentence. Fig. 2 is a schematic diagram of a vector sequence grouped by sentence. In order to balance the contribution of each word in a sentence to spam identification, several placeholder (padding) vectors are added both before and after the vector group corresponding to each sentence. The number of placeholder vectors added on each side is equal to the maximum window length of the convolution kernels in the classifier minus 1.
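The sentence grouping and padding can be sketched as below. Several details are assumptions not fixed by this document: the set of sentence-ending punctuation marks, the choice to drop the punctuation token from the group, and the use of all-zero vectors as placeholder vectors.

```python
SENTENCE_END = {".", "!", "?", "。", "！", "？"}  # assumed punctuation set


def pad_group(vectors, dim, max_window):
    """Add (max_window - 1) placeholder vectors before and after the group."""
    pad_count = max_window - 1
    before = [[0.0] * dim for _ in range(pad_count)]
    after = [[0.0] * dim for _ in range(pad_count)]
    return before + list(vectors) + after


def group_by_sentence(words, vectors, dim, max_window):
    """Split the vector sequence at sentence-ending punctuation and pad each group."""
    groups, current = [], []
    for word, vec in zip(words, vectors):
        if word in SENTENCE_END:
            if current:
                groups.append(pad_group(current, dim, max_window))
            current = []
        else:
            current.append(vec)
    if current:  # trailing sentence without closing punctuation
        groups.append(pad_group(current, dim, max_window))
    return groups


# One-dimensional toy vectors keep the example readable.
words = ["buy", "now", ".", "hello"]
vectors = [[1.0], [2.0], [9.0], [3.0]]
groups = group_by_sentence(words, vectors, dim=1, max_window=3)
```

With a maximum window length of 3, each group gains two placeholder vectors on each side, so even the first and last word of a sentence appear in as many convolution windows as a word in the middle.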
S104: Taking the vector groups as input parameters of a classifier, so that the classifier classifies the mail to be identified by combining context correlation to obtain a classification result, wherein the classification result is used for determining whether the mail to be identified is spam.
The classifier in the embodiment of the invention can be formed by a deep neural network such as a convolutional neural network (CNN) or a recurrent neural network (RNN). By using the ability of such networks to capture context correlation when classifying the mail to be identified, the accuracy of spam identification can be improved.
In a preferred embodiment, the convolutional neural network CNN is used to form the classifier in the embodiment of the present invention. For a vector group obtained by grouping a vector sequence with a sentence or a paragraph as a standard, a processing procedure of the classifier is as follows, referring to fig. 3, where fig. 3 is a flowchart of a processing method of the classifier according to an embodiment of the present invention:
S301: Taking the feature vectors in the vector group as input parameters of the first-layer convolutional neural network of the classifier to obtain the feature vector corresponding to the vector group, wherein the feature vector corresponding to the vector group is used for representing the semantics of a sentence or paragraph.
In practical application, since sentences or paragraphs are used as the grouping standard in this embodiment, the classifier that is trained and performs classification with convolutional neural networks may be composed of two layers of convolutional neural networks. Depending on the grouping standard, the classifier of the embodiment of the present invention may also be composed of three or more layers of convolutional neural networks. Fig. 4 is a schematic structural diagram of a classifier formed by two layers of convolutional neural networks according to an embodiment of the present invention. The first-layer convolutional neural network is composed of N convolution kernels and pooling layer 1, where N is a natural number.
Specifically, a vector group obtained by grouping with sentences or paragraphs as the standard is denoted $S_{1:n} = [X_1, X_2, \ldots, X_n]$, where $X_n$ is the feature vector corresponding to the n-th word. That is, the vector group $S_{1:n}$ consists of the feature vectors of n words.
In practical application, firstly, a convolution layer output result of the vector group at each convolution kernel is obtained by utilizing one-dimensional convolution operation, and the convolution layer output result comprises output results of convolution operation performed on the convolution layer and the convolution kernels by sequentially taking each feature vector in the vector group as a convolution operation initial value.
Specifically, each feature vector $X_1, X_2, \ldots, X_n$ in the vector group $S_{1:n}$ is used in turn as the starting point of the convolution operation and is convolved with each convolution kernel, giving the convolution layer output of the vector group $S_{1:n}$ at each convolution kernel. Taking the i-th feature vector in the vector group as the starting point, the output obtained by convolving the i-th through $(i + h_j - 1)$-th feature vectors of the vector group with the j-th convolution kernel $W_j$ is recorded as:
$$c^{m,j}_{i} = f\left(W_j \cdot X_{i:i+h_j-1} + b_j\right)$$
where the vector group is the m-th vector group obtained after grouping, $h_j$ is the window length of the j-th convolution kernel, $b_j$ is the bias, and f() is a nonlinear function such as tanh().
In practical application, using each feature vector $X_1, X_2, \ldots, X_n$ in the vector group $S_{1:n}$ in turn as the starting point and convolving with the j-th convolution kernel yields
$$c^{m,j}_{1},\ c^{m,j}_{2},\ \ldots$$
These outputs are then combined to finally obtain
$$C_{m,j} = \left[c^{m,j}_{1},\ c^{m,j}_{2},\ \ldots\right]$$
$C_{m,j}$ is the convolution layer output of the vector group $S_{1:n}$ at the j-th convolution kernel.
That is, the convolution layer output of a vector group at a convolution kernel comprises the outputs of the convolution operations performed with that kernel using each feature vector in the vector group in turn as the starting point.
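The one-dimensional convolution described above can be sketched in plain Python. The kernel is represented as a list of h weight vectors (one per window position), which is one possible layout and an assumption of this sketch; tanh is used as the nonlinear function f, as suggested in the text. Window positions that would run past the end of the group are skipped, which is why the groups are padded beforehand.

```python
import math


def conv1d_outputs(group, kernel, bias):
    """Slide one kernel over a vector group and return tanh-activated outputs.

    group:  list of n feature vectors, each of dimension d
    kernel: list of h weight vectors, each of dimension d (window length h)
    bias:   scalar offset added before the nonlinearity
    """
    h = len(kernel)
    outputs = []
    for i in range(len(group) - h + 1):
        total = bias
        for offset in range(h):
            # Dot product between one kernel row and one word vector.
            total += sum(
                w * x for w, x in zip(kernel[offset], group[i + offset])
            )
        outputs.append(math.tanh(total))
    return outputs


# Toy input: three 1-dimensional word vectors and a kernel of window length 2.
group = [[1.0], [2.0], [3.0]]
kernel = [[1.0], [1.0]]
conv_output = conv1d_outputs(group, kernel, bias=0.0)
```

Here the two windows sum to 3 and 5, so `conv_output` holds tanh(3) and tanh(5). Repeating the call for each of the N kernels gives the N convolution layer outputs for the group.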
Then, the maximum value of the vector group in the convolution layer output of each convolution kernel is obtained. Specifically, pooling layer 1 in Fig. 4 uses the max-out pooling method to obtain the maximum value of the vector group in the convolution layer output of each convolution kernel. The maximum value of the m-th vector group obtained after grouping in the convolution layer output of the j-th convolution kernel is recorded as:
$$P_{m,j} = \max\left(C_{m,j}\right)$$
and finally, combining the maximum values of the vector group in the convolution layer output result of each convolution kernel to obtain the characteristic vector corresponding to the vector group. And recording the feature vectors corresponding to the mth vector group obtained after grouping as:
$$Y_m = [P_{m,1}, P_{m,2}, \ldots, P_{m,N}]$$
The first-layer convolutional neural network comprises N convolution kernels, and the maximum values of the m-th vector group in the convolution layer outputs of the N convolution kernels together form the feature vector $Y_m$ corresponding to that vector group.
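The pooling and combination steps reduce to taking one maximum per kernel. In the sketch below the convolution layer outputs are supplied directly as made-up numbers, so the function name and the values are illustrative only.

```python
def group_feature_vector(conv_outputs_per_kernel):
    """Take the maximum of each kernel's outputs and collect them into one vector.

    conv_outputs_per_kernel: list of N lists, one list of outputs per kernel.
    Returns a vector of length N, one pooled value per kernel.
    """
    return [max(outputs) for outputs in conv_outputs_per_kernel]


# Hypothetical outputs of N = 2 kernels for one vector group.
conv_outputs = [
    [0.1, 0.9, 0.3],  # kernel 1
    [-0.5, -0.2],     # kernel 2 (a longer window gives fewer outputs)
]
y_m = group_feature_vector(conv_outputs)
```

Because each kernel contributes exactly one value, the resulting vector has length N regardless of how many words the sentence or paragraph contains.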
S302: The feature vectors corresponding to the vector groups are taken as input parameters of the second-layer convolutional neural network of the classifier to obtain a feature vector of the text in the mail to be recognized, where this feature vector represents the semantics of the text with context correlation taken into account.
As shown in fig. 4, the second-layer convolutional neural network in the classifier may be composed of M convolution kernels and pooling layer 2, where M is a natural number, and it uses the same algorithm logic as the first-layer convolutional neural network. Specifically, the feature vectors corresponding to the vector groups, as output by the first-layer convolutional neural network, are used as input parameters of the second-layer convolutional neural network. After processing by the M convolution kernels and pooling layer 2, the second-layer convolutional neural network outputs the feature vector of the text in the mail to be recognized.
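As a non-limiting illustration, the two-layer arrangement can be sketched by applying the same convolution-plus-pooling stage twice: first to the word vectors of each group, then to the resulting group vectors. This is a minimal sketch under stated assumptions; the helper name `conv_pool`, the weights, and the window lengths are hypothetical, and every input is assumed to contain at least `window` vectors.

```python
import math

def conv_pool(vectors, kernels, window):
    """One convolution + max-pooling stage.

    vectors : list of feature vectors (must contain at least `window` items)
    kernels : list of (weights, bias) pairs, weights being a flat list
    Returns one pooled value per kernel.
    """
    features = []
    for weights, bias in kernels:
        activations = []
        for i in range(len(vectors) - window + 1):
            flat = [x for vec in vectors[i:i + window] for x in vec]
            pre = sum(w * x for w, x in zip(weights, flat)) + bias
            activations.append(math.tanh(pre))
        features.append(max(activations))
    return features

# Layer 1: N = 2 kernels over word vectors of dimension 2, window 2.
layer1 = [([0.5, 0.5, 0.5, 0.5], 0.0), ([1.0, -1.0, 1.0, -1.0], 0.0)]
# Layer 2: M = 3 kernels over group vectors of dimension N = 2, window 2.
layer2 = [
    ([0.1, 0.2, 0.3, 0.4], 0.0),
    ([1.0, 0.0, 0.0, 1.0], 0.1),
    ([-1.0, 1.0, 1.0, -1.0], 0.0),
]

groups = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # vector group for sentence 1
    [[0.0, 0.0], [1.0, 0.0]],              # vector group for sentence 2
]
group_vectors = [conv_pool(g, layer1, window=2) for g in groups]  # one vector per group
text_vector = conv_pool(group_vectors, layer2, window=2)          # length M
```

The second stage sees neighbouring sentences or paragraphs together in one window, which is one way to read the statement that the text vector reflects context correlation.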
S303: The feature vector of the text in the mail to be identified is taken as the input parameter of the fully connected layer of the classifier, and a classification result is obtained after classification processing by the fully connected layer, where the classification result is used to determine whether the mail to be identified is a junk mail.
As shown in fig. 4, the classifier in the embodiment of the present invention further includes a fully connected layer. The feature vector of the text in the mail to be identified, output by the second-layer convolutional neural network, is used as the input parameter of the fully connected layer; the fully connected layer outputs probabilities over a plurality of classifications through a softmax function, and these probabilities are used to determine whether the mail to be identified is spam. The algorithm logic of the fully connected layer is the same as that of a traditional neural network and is not described again here.
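As a non-limiting illustration, the fully connected layer with a softmax output can be sketched as follows. This is a minimal sketch under stated assumptions: the two-class layout (index 0 for normal mail, index 1 for spam), the weights, and the 0.5 decision threshold are hypothetical and are not specified in the text.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fully_connected(features, weights, biases):
    """One score per class: the weighted sum of the features plus an offset."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, biases)]

# Hypothetical text feature vector (length M = 3) from the second layer.
text_vector = [0.2, -0.4, 0.9]
weights = [[1.0, 0.5, -1.0],    # class 0: normal mail (assumed ordering)
           [-1.0, -0.5, 1.0]]   # class 1: spam
biases = [0.0, 0.0]

probs = softmax(fully_connected(text_vector, weights, biases))
is_spam = probs[1] > 0.5  # assumed decision rule
```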
In the embodiment of the invention, the classifier is trained with mail samples before it is used to identify junk mails. The process of training the classifier with mail samples is basically the same as the process of identifying junk mails with the classifier, with two differences. First, in the training stage the classifier performs not only the forward propagation that processes the mail samples, i.e. S301 to S303 above, but also backward propagation, which adjusts the network parameters of each layer of the classifier (such as the weights and offsets of the fully connected layer) so that the finally obtained training result is more accurate. Second, the dropout algorithm is applied to the fully connected layer of the classifier to address overfitting to the mail samples in the training stage. Specifically, during forward propagation in the training stage, the outputs of some hidden-layer neurons are randomly set to 0, and those neurons do not take part in the parameter adjustment of backward propagation. This reduces the dependency among neurons and addresses the overfitting of the deep neural network to the samples.
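As a non-limiting illustration, the dropout step applied during forward propagation in training can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `dropout` and the drop probability are hypothetical, and the rescaling of surviving outputs by 1 / (1 - drop_prob) is the common "inverted dropout" variant, which the text does not mention.

```python
import random

def dropout(activations, drop_prob, rng, training=True):
    """Randomly zero hidden-layer outputs during training.

    Outside training, or when drop_prob is 0, the outputs pass through
    unchanged. drop_prob must be smaller than 1.
    """
    if not training or drop_prob == 0.0:
        return list(activations)
    keep = 1.0 - drop_prob
    # Surviving outputs are scaled so the expected sum stays the same
    # (an assumption of this sketch, not stated in the text).
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # seeded only to make the example repeatable
hidden = [1.0] * 8
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
unchanged = dropout(hidden, drop_prob=0.5, rng=rng, training=False)
```

A zeroed output contributes nothing to the loss, so its gradient is zero and the corresponding neuron is left out of that backward-propagation update, matching the behaviour described above.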
In the junk mail identification method provided by the embodiment of the invention, firstly, a text in a mail to be identified is extracted, and the text is divided by taking words as units to obtain a word sequence; and converting words in the word sequence into feature vectors with corresponding relations to the words according to the corresponding relations between the words and the feature vectors acquired in advance to obtain a vector sequence, wherein the vector sequence comprises the feature vectors with corresponding relations to the words in the word sequence. And secondly, grouping the characteristic vectors in the vector sequence by a preset standard to obtain a plurality of vector groups. And finally, taking the vector group as an input parameter of a classifier, so that the classifier classifies the mails to be identified by combining context correlation to obtain a classification result, wherein the classification result is used for determining whether the mails to be identified belong to junk mails. Compared with the junk mail identification method in the prior art, the method and the device have the advantages that the influence of the context correlation on the mail identification is combined, and the accuracy of the junk mail identification is improved.
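As a non-limiting illustration, the preprocessing summarized above (word segmentation, conversion to feature vectors, and grouping by sentence) can be sketched as follows. This is a minimal sketch under stated assumptions: the regular-expression tokenizer stands in for a real word segmenter (which Chinese text would require), the tiny `embeddings` table stands in for the correspondence between words and feature vectors acquired in advance, and mapping unknown words to a fixed `unknown` vector is an assumption of the example.

```python
import re

def segment(text):
    """Split text into sentences, then each sentence into lowercase words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences]

def to_vector_groups(text, embeddings, unknown):
    """Return one vector group per sentence, one feature vector per word."""
    groups = []
    for words in segment(text):
        group = [embeddings.get(w, unknown) for w in words]
        if group:  # skip sentences that produced no words
            groups.append(group)
    return groups

# Hypothetical word-to-vector correspondence with 2-dimensional vectors.
embeddings = {
    "win": [0.9, 0.1],
    "cash": [0.8, 0.2],
    "now": [0.5, 0.5],
    "meeting": [0.1, 0.9],
}
unknown = [0.0, 0.0]
groups = to_vector_groups("Win cash now! Meeting at noon.", embeddings, unknown)
```

Each entry of `groups` is a vector group that would be passed to the classifier as an input parameter.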
An embodiment of the present invention further provides a spam recognition apparatus. Referring to fig. 5, which is a schematic structural diagram of the spam recognition apparatus according to the embodiment of the present invention, the apparatus includes:
the segmentation module 501 is configured to extract a text in the email to be identified, and segment the text by taking a word as a unit to obtain a word sequence;
a conversion module 502, configured to convert words in the word sequence into feature vectors having a correspondence with the words according to a correspondence between words and feature vectors obtained in advance, so as to obtain a vector sequence, where the vector sequence includes feature vectors having a correspondence with each word in the word sequence;
a grouping module 503, configured to group the feature vectors in the vector sequence according to a preset standard to obtain a plurality of vector groups;
a classifying module 504, configured to use the vector group as an input parameter of a classifier, so that the classifier classifies the to-be-identified email according to context correlation to obtain a classification result, where the classification result is used to determine whether the to-be-identified email belongs to a spam email.
Specifically, the grouping module 503 is configured to:
group the feature vectors in the vector sequence by sentence or paragraph to obtain a plurality of vector groups.
In a preferred embodiment, the classifier is formed by a convolutional neural network; the classification module 504 includes:
the first classification submodule is used for taking the feature vectors in the vector group as input parameters of a first layer of convolutional neural network of the classifier to obtain the feature vectors corresponding to the vector group, wherein the feature vectors corresponding to the vector group are used for representing the semantics of sentences or paragraphs;
the second classification submodule is used for taking the feature vectors corresponding to the vector group as input parameters of a second layer convolutional neural network of the classifier to obtain feature vectors of texts in the mails to be recognized, wherein the feature vectors of the texts in the mails to be recognized are used for expressing semantics of the texts after context correlation is combined;
and the third classification submodule is used for taking the feature vector of the text in the mail to be recognized as the input parameter of the full connection layer of the classifier, and obtaining a classification result after classification processing of the full connection layer, wherein the classification result is used for determining whether the mail to be recognized belongs to the junk mail.
In a preferred embodiment, the first layer convolutional neural network of the classifier includes N convolutional kernels, where N is a natural number;
the first classification submodule includes:
the convolution operation submodule is used for obtaining, by a one-dimensional convolution operation, the convolution-layer output of the vector group for each convolution kernel, where the convolution-layer output comprises the results of the convolution operations performed with each convolution kernel, taking each feature vector in the vector group in turn as the starting point of the convolution operation;
the obtaining submodule is used for respectively obtaining the maximum value of the vector group in the convolution layer output result of each convolution kernel;
and the combination submodule is used for combining the maximum values of the vector group in the convolution layer output result of each convolution kernel to obtain the characteristic vector corresponding to the vector group.
The junk mail recognition device provided by the embodiment of the invention can realize the following functions: extracting a text in the mail to be identified, and dividing the text by taking words as units to obtain a word sequence; and converting words in the word sequence into feature vectors with corresponding relations to the words according to the corresponding relations between the words and the feature vectors acquired in advance to obtain a vector sequence, wherein the vector sequence comprises the feature vectors with corresponding relations to the words in the word sequence. And grouping the characteristic vectors in the vector sequence according to a preset standard to obtain a plurality of vector groups. And taking the vector group as an input parameter of a classifier, so that the classifier classifies the mails to be identified by combining context correlation to obtain a classification result, wherein the classification result is used for determining whether the mails to be identified belong to junk mails. Compared with the junk mail identification method in the prior art, the method and the device have the advantages that the influence of the context correlation on the mail identification is combined, and the accuracy of the junk mail identification is improved.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The method and the device for identifying spam mails provided by the embodiment of the invention are described in detail, and the principle and the implementation mode of the invention are explained by applying a specific embodiment in the text, and the description of the embodiment is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (10)

1. A method for spam identification, the method comprising:
extracting a text in the mail to be identified, and segmenting the text by taking words as units to obtain a word sequence, wherein the word sequence is the text segmented by taking the words as units;
converting words in the word sequence into feature vectors having a corresponding relationship with the words according to a corresponding relationship between the words and the feature vectors acquired in advance to obtain a vector sequence, wherein the vector sequence comprises the feature vectors having a corresponding relationship with each word in the word sequence;
grouping the characteristic vectors in the vector sequence by a preset standard to obtain a plurality of vector groups, wherein the preset standard is a standard of sentences, paragraphs, fixed lengths or fixed word numbers;
and taking the vector group as an input parameter of a classifier, so that the classifier classifies the mails to be identified by combining context correlation to obtain a classification result, wherein the classification result is used for determining whether the mails to be identified belong to junk mails, and the classifier is formed by adopting a deep neural network.
2. The spam identification method according to claim 1, wherein the grouping the feature vectors in the vector sequence according to a preset criterion to obtain a plurality of vector groups comprises:
and grouping the characteristic vectors in the vector sequence by taking sentences or paragraphs as a standard to obtain a plurality of vector groups.
3. A spam recognition method according to claim 2 wherein said classifier is constructed using a convolutional neural network;
the step of using the vector group as an input parameter of a classifier so that the classifier classifies the mails to be identified by combining context correlation to obtain a classification result, wherein the classification result is used for determining whether the mails to be identified belong to junk mails, and comprises the steps of:
taking the feature vectors in the vector group as input parameters of a first layer convolutional neural network of the classifier to obtain feature vectors corresponding to the vector group, wherein the feature vectors corresponding to the vector group are used for representing the semantics of sentences or paragraphs;
taking the feature vector corresponding to the vector group as an input parameter of a second-layer convolutional neural network of the classifier to obtain a feature vector of a text in the mail to be recognized, wherein the feature vector of the text in the mail to be recognized is used for representing the semantic of the text combined with context correlation;
and taking the feature vector of the text in the mail to be identified as an input parameter of a full connection layer of the classifier, and obtaining a classification result after classification processing of the full connection layer, wherein the classification result is used for determining whether the mail to be identified belongs to a junk mail.
4. The spam identification method of claim 3 wherein the first layer convolutional neural network of the classifier comprises N convolutional kernels, N being a natural number;
taking the feature vectors in the vector group as input parameters of a first layer convolutional neural network of the classifier to obtain feature vectors corresponding to the vector group, wherein the feature vectors corresponding to the vector group are used for representing semantics of sentences or paragraphs, and the method comprises the following steps:
obtaining a convolution layer output result of the vector group at each convolution kernel by utilizing one-dimensional convolution operation, wherein the convolution layer output result comprises output results of convolution operation performed on the vector group and the convolution kernels respectively by taking each feature vector as a convolution operation initial value in sequence;
respectively obtaining the maximum value of the vector group in the convolution layer output result of each convolution kernel;
and combining the maximum values of the vector group in the convolution layer output result of each convolution kernel to obtain the characteristic vector corresponding to the vector group.
5. The method according to any one of claims 1 to 4, wherein before the converting the words in the word sequence into the feature vectors having correspondence with the words according to the correspondence between the words and the feature vectors obtained in advance, the method further comprises:
replacing words of a preset type in the word sequence with a preset label;
constructing a feature vector for the label in advance, and acquiring a corresponding relation between the label and the feature vector;
correspondingly, the converting the words in the word sequence into the feature vectors having a corresponding relationship with the words according to the corresponding relationship between the words and the feature vectors obtained in advance to obtain a vector sequence includes:
converting words in the word sequence into feature vectors with corresponding relations with the words according to the corresponding relations between the words and the feature vectors acquired in advance; and converting the labels in the word sequence into the feature vectors with corresponding relations to the labels according to the corresponding relations between the labels and the feature vectors to obtain a vector sequence.
6. The spam identification method of claim 5, wherein said pre-constructing a feature vector for said tag comprises:
randomly generating a feature vector, and judging whether the Euclidean distance between the feature vector and each feature vector in the corresponding relation between the word and the feature vector is smaller than a preset constant or not;
and when the Euclidean distance between the feature vector and each feature vector is smaller than a preset constant, the feature vector is allocated to a label.
7. A spam recognition device, said device comprising:
the segmentation module is used for extracting a text in the mail to be identified and segmenting the text by taking words as units to obtain a word sequence;
the conversion module is used for converting words in the word sequence into feature vectors with corresponding relations to the words according to the corresponding relations between the words and the feature vectors acquired in advance to obtain a vector sequence, and the vector sequence comprises the feature vectors with corresponding relations to all the words in the word sequence;
the grouping module is used for grouping the characteristic vectors in the vector sequence according to a preset standard to obtain a plurality of vector groups;
and the classification module is used for taking the vector group as an input parameter of a classifier so that the classifier classifies the mails to be identified by combining context correlation to obtain a classification result, the classification result is used for determining whether the mails to be identified belong to junk mails, and the classifier is formed by adopting a deep neural network.
8. The spam recognition device of claim 7, wherein the grouping module is specifically configured to:
and grouping the characteristic vectors in the vector sequence by taking sentences or paragraphs as a standard to obtain a plurality of vector groups.
9. The spam recognition device of claim 8 wherein the classifier is constructed using a convolutional neural network; the classification module comprises:
the first classification submodule is used for taking the feature vectors in the vector group as input parameters of a first layer of convolutional neural network of the classifier to obtain the feature vectors corresponding to the vector group, wherein the feature vectors corresponding to the vector group are used for representing the semantics of sentences or paragraphs;
the second classification submodule is used for taking the feature vectors corresponding to the vector group as input parameters of a second layer convolutional neural network of the classifier to obtain feature vectors of texts in the mails to be recognized, wherein the feature vectors of the texts in the mails to be recognized are used for expressing semantics of the texts after context correlation is combined;
and the third classification submodule is used for taking the feature vector of the text in the mail to be recognized as the input parameter of the full connection layer of the classifier, and obtaining a classification result after classification processing of the full connection layer, wherein the classification result is used for determining whether the mail to be recognized belongs to the junk mail.
10. The spam recognition device of claim 9 wherein the first layer convolutional neural network of the classifier comprises N convolutional kernels, N being a natural number;
the first classification submodule includes:
the convolution operation submodule is used for obtaining a convolution layer output result of the vector group in each convolution kernel by utilizing one-dimensional convolution operation, and the convolution layer output result comprises output results of convolution operation performed on the convolution layer and the convolution kernels by sequentially taking each feature vector in the vector group as a convolution operation initial value;
the obtaining submodule is used for respectively obtaining the maximum value of the vector group in the convolution layer output result of each convolution kernel;
and the combination submodule is used for combining the maximum values of the vector group in the convolution layer output result of each convolution kernel to obtain the characteristic vector corresponding to the vector group.
CN201610888007.0A 2016-10-11 2016-10-11 Junk mail identification method and device Active CN106506327B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610888007.0A CN106506327B (en) 2016-10-11 2016-10-11 Junk mail identification method and device

Publications (2)

Publication Number Publication Date
CN106506327A CN106506327A (en) 2017-03-15
CN106506327B true CN106506327B (en) 2021-02-19

Family

ID=58295096

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610888007.0A Active CN106506327B (en) 2016-10-11 2016-10-11 Junk mail identification method and device

Country Status (1)

Country Link
CN (1) CN106506327B (en)

Also Published As

Publication number Publication date
CN106506327A (en) 2017-03-15

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant