CN106506327B - Junk mail identification method and device

Info

Publication number: CN106506327B
Application number: CN201610888007.0A
Authority: CN
Inventors: 杜强
Original assignee: Neusoft Corp
Current assignee: Neusoft Corp
Priority date: 2016-10-11
Filing date: 2016-10-11
Publication date: 2021-02-19
Anticipated expiration: 2036-10-11
Also published as: CN106506327A

Abstract

The invention discloses a method and a device for identifying junk mail. The method comprises the following steps: extracting the text of the mail to be identified and segmenting the text into words to obtain a word sequence; converting each word in the word sequence into its corresponding feature vector, according to a correspondence between words and feature vectors acquired in advance, to obtain a vector sequence that comprises the feature vector corresponding to each word in the word sequence; grouping the feature vectors in the vector sequence according to a preset standard to obtain a plurality of vector groups; and taking the vector groups as input parameters of a classifier, so that the classifier classifies the mail to be identified by taking context correlation into account and obtains a classification result, which is used to determine whether the mail to be identified is junk mail. By taking into account the influence of context correlation on mail identification, the invention improves the accuracy of spam identification.

Description

Junk mail identification method and device

Technical Field

The invention relates to the field of data processing, in particular to a method and a device for identifying junk mails.

Background

With the continuous development of the internet, e-mail has become more and more popular, and commercial promotion carried by e-mail is widely used. As a result, spam has become rampant. Spam usually occupies a large amount of resources, is poorly targeted, is delivered without the recipient's consent, and contains a large amount of false information. Spam has therefore long been a serious nuisance for internet users.

To deter spam, current email systems embed various spam identification techniques, such as whitelisting, blacklisting, and content-based filtering. However, existing spam identification methods essentially rely on keywords or word frequency alone. This single perspective ignores other factors that affect identification accuracy, so the accuracy of spam identification remains insufficient.

Disclosure of Invention

The invention provides a junk mail identification method and device, which can improve the accuracy of junk mail identification.

The invention provides a junk mail identification method, which comprises the following steps:

extracting a text in the mail to be identified, and dividing the text by taking words as units to obtain a word sequence;

converting words in the word sequence into feature vectors having a corresponding relationship with the words according to a corresponding relationship between the words and the feature vectors acquired in advance to obtain a vector sequence, wherein the vector sequence comprises the feature vectors having a corresponding relationship with each word in the word sequence;

grouping the characteristic vectors in the vector sequence according to a preset standard to obtain a plurality of vector groups;

and taking the vector group as an input parameter of a classifier, so that the classifier classifies the mails to be identified by combining context correlation to obtain a classification result, wherein the classification result is used for determining whether the mails to be identified belong to junk mails.

Preferably, grouping the feature vectors in the vector sequence according to a preset standard to obtain a plurality of vector groups includes:

and grouping the characteristic vectors in the vector sequence by taking sentences or paragraphs as a standard to obtain a plurality of vector groups.

Preferably, the classifier is formed by a convolutional neural network;

the step of using the vector group as an input parameter of a classifier so that the classifier classifies the mails to be identified by combining context correlation to obtain a classification result, wherein the classification result is used for determining whether the mails to be identified belong to junk mails, and comprises the steps of:

taking the feature vectors in the vector group as input parameters of a first layer convolutional neural network of the classifier to obtain feature vectors corresponding to the vector group, wherein the feature vectors corresponding to the vector group are used for representing the semantics of sentences or paragraphs;

taking the feature vector corresponding to the vector group as an input parameter of a second-layer convolutional neural network of the classifier to obtain a feature vector of a text in the mail to be recognized, wherein the feature vector of the text in the mail to be recognized is used for representing the semantic of the text combined with context correlation;

and taking the feature vector of the text in the mail to be identified as an input parameter of a full connection layer of the classifier, and obtaining a classification result after classification processing of the full connection layer, wherein the classification result is used for determining whether the mail to be identified belongs to a junk mail.

Preferably, the first layer of convolutional neural network of the classifier includes N convolutional kernels, where N is a natural number;

taking the feature vectors in the vector group as input parameters of a first layer convolutional neural network of the classifier to obtain feature vectors corresponding to the vector group, wherein the feature vectors corresponding to the vector group are used for representing semantics of sentences or paragraphs, and the method comprises the following steps:

obtaining, by a one-dimensional convolution operation, a convolution layer output result of the vector group at each convolution kernel, wherein the convolution layer output result comprises the output results of the convolution operations performed between the vector group and the convolution kernel, taking each feature vector in the vector group in turn as the initial value of the convolution operation;

respectively obtaining the maximum value of the vector group in the convolution layer output result of each convolution kernel;

and combining the maximum values of the vector group in the convolution layer output result of each convolution kernel to obtain the characteristic vector corresponding to the vector group.

Preferably, before the converting the words in the word sequence into the feature vectors having a correspondence with the words according to the correspondence between the words and the feature vectors obtained in advance, the method further includes:

replacing words of a preset type in the word sequence with a preset label;

constructing a feature vector for the label in advance, and acquiring a corresponding relation between the label and the feature vector;

correspondingly, the converting the words in the word sequence into the feature vectors having a corresponding relationship with the words according to the corresponding relationship between the words and the feature vectors obtained in advance to obtain a vector sequence includes:

converting words in the word sequence into feature vectors with corresponding relations with the words according to the corresponding relations between the words and the feature vectors acquired in advance; and converting the labels in the word sequence into the feature vectors with corresponding relations to the labels according to the corresponding relations between the labels and the feature vectors to obtain a vector sequence.

Preferably, the constructing a feature vector for the tag in advance includes:

randomly generating a feature vector, and judging whether the Euclidean distance between the feature vector and each feature vector in the corresponding relation between the word and the feature vector is smaller than a preset constant or not;

and when the Euclidean distance between the feature vector and each feature vector is smaller than a preset constant, the feature vector is allocated to a label.

The invention also provides a spam recognition device, which comprises:

the segmentation module is used for extracting a text in the mail to be identified and segmenting the text by taking words as units to obtain a word sequence;

the conversion module is used for converting words in the word sequence into feature vectors with corresponding relations to the words according to the corresponding relations between the words and the feature vectors acquired in advance to obtain a vector sequence, and the vector sequence comprises the feature vectors with corresponding relations to all the words in the word sequence;

the grouping module is used for grouping the characteristic vectors in the vector sequence according to a preset standard to obtain a plurality of vector groups;

and the classification module is used for taking the vector group as an input parameter of a classifier so that the classifier classifies the mails to be identified by combining context correlation to obtain a classification result, and the classification result is used for determining whether the mails to be identified belong to junk mails.

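Preferably, the grouping module is specifically configured to:

group the feature vectors in the vector sequence by taking sentences or paragraphs as a standard to obtain a plurality of vector groups.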

Preferably, the classifier is formed by a convolutional neural network; the classification module comprises:

the first classification submodule is used for taking the feature vectors in the vector group as input parameters of a first layer of convolutional neural network of the classifier to obtain the feature vectors corresponding to the vector group, wherein the feature vectors corresponding to the vector group are used for representing the semantics of sentences or paragraphs;

the second classification submodule is used for taking the feature vectors corresponding to the vector group as input parameters of a second layer convolutional neural network of the classifier to obtain feature vectors of texts in the mails to be recognized, wherein the feature vectors of the texts in the mails to be recognized are used for expressing semantics of the texts after context correlation is combined;

and the third classification submodule is used for taking the feature vector of the text in the mail to be recognized as the input parameter of the full connection layer of the classifier, and obtaining a classification result after classification processing of the full connection layer, wherein the classification result is used for determining whether the mail to be recognized belongs to the junk mail.

Preferably, the first classification submodule includes:

the convolution operation submodule is used for obtaining, by a one-dimensional convolution operation, a convolution layer output result of the vector group at each convolution kernel, wherein the convolution layer output result comprises the output results of the convolution operations performed between the vector group and the convolution kernel, taking each feature vector in the vector group in turn as the initial value of the convolution operation;

the obtaining submodule is used for respectively obtaining the maximum value of the vector group in the convolution layer output result of each convolution kernel;

and the combination submodule is used for combining the maximum values of the vector group in the convolution layer output result of each convolution kernel to obtain the characteristic vector corresponding to the vector group.

In the junk mail identification method provided by the invention, the text of the mail to be identified is first extracted and segmented into words to obtain a word sequence, and each word in the word sequence is converted into its corresponding feature vector, according to the correspondence between words and feature vectors acquired in advance, to obtain a vector sequence that comprises the feature vector corresponding to each word in the word sequence. Next, the feature vectors in the vector sequence are grouped according to a preset standard to obtain a plurality of vector groups. Finally, the vector groups are taken as input parameters of a classifier, so that the classifier classifies the mail to be identified by taking context correlation into account and obtains a classification result, which is used to determine whether the mail to be identified is junk mail. Compared with prior-art junk mail identification methods, the method takes into account the influence of context correlation on mail identification and thereby improves the accuracy of junk mail identification.

Drawings

To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a flowchart of a spam identification method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of a vector sequence after grouping according to an embodiment of the present invention;

Fig. 3 is a flowchart of a processing method of a classifier according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a classifier according to an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of a spam identification device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The contextual relevance of the text content in an email has a crucial influence on spam recognition. For example, the word "Viagra" is often given a high spam weight through rules or sample training. However, a joke from a friend that mentions "Viagra", or a serious email containing a discussion among medical professionals, would then be recognized as spam. This is clearly a consequence of identifying spam without taking contextual relevance into account. Methods that ignore contextual relevance and semantics are inevitably limited in identification accuracy, and their error rate is especially high when distinguishing normal mail in a professional field from spam related to the same field.

Therefore, the junk mail identification method provided by the invention combines the influence of context correlation and can identify the junk mail more accurately.

Specific embodiments are described below.

An embodiment of the present invention provides a spam identification method. Referring to Fig. 1, which is a flowchart of the spam identification method provided in an embodiment of the present invention, the method specifically includes:

S101: extracting a text in the mail to be identified, and dividing the text by taking words as units to obtain a word sequence.

The junk mail identification method provided by the embodiment of the invention can be applied to terminals such as a mail gateway, a mail server or a client and the like. In practical application, the mail data in different terminals are all encapsulated by specific codes or protocols, and the embodiment of the invention can shield the processing difference of the mail data from different terminals in the subsequent processing process by converting the texts of the mail data in different terminals in advance, so that the system has good adaptability.

In addition, the embodiment of the invention realizes the identification of the junk mails based on the text content in the emails, and does not relate to the identification of the contents such as pictures, attachments and the like in the emails.

In practical application, the text in the email to be recognized is extracted first. Because spam is recognized based on the semantics of the text, the extracted text is then segmented into words to obtain a word sequence, that is, the text split into word units.

In the embodiment of the present invention, the text may be segmented into words by a method based on character-string matching, such as bidirectional maximum matching, by a method based on a hidden Markov model (HMM), by a method based on deep learning, and so on. The embodiment of the present invention does not limit which segmentation method is used; preferably, an HMM-based or deep-learning-based method is used, since these perform better than the other methods.
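As an illustration of the string-matching option mentioned above, the following is a minimal sketch of bidirectional maximum matching. The function names, the toy dictionary, and the tie-breaking heuristic (fewer words, then fewer single-character words, then the forward result) are assumptions made for this example and are not specified in the patent.

```python
def forward_max_match(text, vocab, max_len):
    """Scan left to right, taking the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, vocab, max_len):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def bidirectional_max_match(text, vocab):
    """Run both scans and keep the segmentation that looks more plausible."""
    max_len = max(map(len, vocab))
    fwd = forward_max_match(text, vocab, max_len)
    bwd = backward_max_match(text, vocab, max_len)

    def cost(ws):
        # Assumed heuristic: fewer words first, then fewer single-character words.
        return (len(ws), sum(len(w) == 1 for w in ws))

    return min(fwd, bwd, key=cost)  # on a full tie, the forward result is kept


vocab = {"研究", "研究生", "生命", "起源"}
print(bidirectional_max_match("研究生命起源", vocab))  # ['研究', '生命', '起源']
```

In this toy case the forward scan greedily takes "研究生" and leaves a stray single character, while the backward scan yields three dictionary words, so the backward result is chosen.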

S102: converting words in the word sequence into feature vectors with corresponding relations to the words according to the corresponding relations between the words and the feature vectors acquired in advance to obtain a vector sequence, wherein the vector sequence comprises the feature vectors with corresponding relations to the words in the word sequence.

In the embodiment of the invention, the correspondence between words and feature vectors is obtained in advance and stored in the system for later lookup. Specifically, in one implementation, the GloVe (Global Vectors for Word Representation) method may be used to train on pre-obtained samples to obtain the correspondence between words and feature vectors. The samples used by the GloVe method may be natural-language corpora obtained from news, web pages, and the like. The way the correspondence between words and feature vectors is obtained is not limited to the GloVe method; other existing techniques can also be used and are not described here.

It is worth emphasizing that the feature vectors in the correspondence between words and feature vectors obtained by the GloVe method satisfy the following conditions. First, the nearest neighbors of the feature vector corresponding to each word should be words close in meaning to that word, such as its synonyms; for example, the nearest neighbors of the feature vector corresponding to the word frog should be, in order, frogs, toad, litoria, leptodactylidae, rana, lizard, eleutherodactylus, and so on. Second, the feature vectors of related words are linearly related; for example, v(queen) ≈ v(king) - v(man) + v(woman), where v() is the function that maps a word to its feature vector, and queen, king, man and woman are related words.
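The two properties can be illustrated with a toy example. The three-dimensional vectors below are hand-made for illustration only (they are not GloVe output), and the helper name `nearest` is hypothetical; the sketch only shows how a nearest-neighbor lookup and the linear relation would be checked.

```python
import numpy as np

# Hand-made toy vectors; the dimensions loosely mean (royal, male, female).
vectors = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "frog":  np.array([0.0, 0.0, 0.0]),
}


def nearest(target, exclude=()):
    """Return the word whose vector has the smallest Euclidean distance to target."""
    candidates = (w for w in vectors if w not in exclude)
    return min(candidates, key=lambda w: np.linalg.norm(vectors[w] - target))


# v(king) - v(man) + v(woman) should land near v(queen).
guess = vectors["king"] - vectors["man"] + vectors["woman"]
print(nearest(guess, exclude={"king", "man", "woman"}))  # queen
```

With real trained vectors the relation holds only approximately, which is why the lookup returns the nearest neighbor rather than testing for equality.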

In actual operation, each word in the obtained word sequence is converted into the feature vector corresponding to that word, according to the correspondence between words and feature vectors stored in the system in advance, so as to obtain a vector sequence. The vector sequence comprises the feature vectors corresponding to all the words in the word sequence.

In a preferred embodiment, after obtaining a word sequence, the embodiment of the present invention finds a preset type of word, such as a number, a symbol, and the like, in the word sequence, and replaces the preset type of word with a preset tag. For example, the date "2016-6-1" is replaced with the tag "< date >".

Words of the preset types are generally irrelevant to identifying spam, so the embodiment of the invention uniformly replaces them with preset tags. On the one hand, this simplifies the spam identification process; on the other hand, it improves the generalization capability of the classifier, so that the classifier treats emails that differ only in some numbers, dates, and the like as emails of the same type.

In practical application, regular expressions can be used to match words of the preset types and replace them with the preset tags: by maintaining a regular expression library, the embodiment of the invention replaces each word that matches an entry in the library with the corresponding tag.
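A minimal sketch of such a regular expression library is shown below. Only the "<date>" tag and the "2016-6-1" example come from the text; the "<number>" tag, the concrete patterns, and the names `TAG_RULES` and `replace_preset_words` are assumptions for illustration.

```python
import re

# Hypothetical regular expression library: each entry maps a pattern to a tag.
# Order matters: the date rule is tried before the more general number rule.
TAG_RULES = [
    (re.compile(r"^\d{4}-\d{1,2}-\d{1,2}$"), "<date>"),
    (re.compile(r"^\d+(\.\d+)?$"), "<number>"),
]


def replace_preset_words(words):
    """Replace every word that matches a library entry with that entry's tag."""
    result = []
    for word in words:
        for pattern, tag in TAG_RULES:
            if pattern.match(word):
                result.append(tag)
                break
        else:
            # No rule matched, so the word is kept unchanged.
            result.append(word)
    return result


print(replace_preset_words(["meeting", "on", "2016-6-1", "costs", "42"]))
# ['meeting', 'on', '<date>', 'costs', '<number>']
```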

In addition, because the corresponding relation between the words and the feature vectors obtained by the GloVe method does not include the feature vector corresponding to the label, after the label is preset, the GloVe method can be used for constructing the feature vector for the label. Specifically, a GloVe method is used for randomly generating a feature vector, and whether the Euclidean distance between the feature vector and the feature vector corresponding to each word in the pre-acquired corresponding relation between the words and the feature vector is smaller than a preset constant or not is judged. And if the Euclidean distance between the feature vector and the feature vector corresponding to each word is smaller than a preset constant, the feature vector is allocated to a label. In the above manner, corresponding feature vectors are constructed for each label.
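The acceptance test described above can be sketched as a simple rejection-sampling loop. This follows the condition as worded in the text (the distance to every word vector must be smaller than the preset constant); the sampling range, the retry limit, and the name `make_tag_vector` are assumptions for illustration.

```python
import numpy as np


def make_tag_vector(word_vectors, max_dist, rng, max_tries=10_000):
    """Draw random vectors until one satisfies the distance condition in the text.

    word_vectors: array of shape (vocabulary_size, dim).
    max_dist: the preset constant for the Euclidean distance check.
    """
    dim = word_vectors.shape[1]
    # Assumption: candidates are drawn uniformly from the value range of the word vectors.
    low, high = word_vectors.min(), word_vectors.max()
    for _ in range(max_tries):
        candidate = rng.uniform(low, high, size=dim)
        distances = np.linalg.norm(word_vectors - candidate, axis=1)
        if np.all(distances < max_dist):
            return candidate
    raise RuntimeError("no candidate satisfied the distance condition")


rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(50, 8))  # stand-in for pre-trained word vectors
tag_vector = make_tag_vector(word_vectors, max_dist=100.0, rng=rng)
```

The same routine would be called once per tag, and each accepted vector would be stored in the tag-to-vector correspondence.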

In practical application, each tag in the word sequence is also converted into a corresponding feature vector according to the corresponding relationship between each tag and the feature vector.
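The conversion of a word sequence that contains both words and tags can be sketched as a table lookup. The function name `to_vector_sequence` and the zero-vector fallback for tokens missing from the table are assumptions; the text does not say how unknown words are handled.

```python
import numpy as np


def to_vector_sequence(tokens, table, dim):
    """Look up each token (word or tag) in one combined token-to-vector table."""
    unknown = np.zeros(dim)  # assumption: fallback for tokens absent from the table
    return np.stack([table.get(token, unknown) for token in tokens])


# Toy table holding one word vector and one tag vector.
table = {
    "win": np.array([1.0, 0.0]),
    "<number>": np.array([0.0, 1.0]),
}
sequence = to_vector_sequence(["win", "<number>", "xyz"], table, dim=2)
print(sequence.shape)  # (3, 2)
```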

S103: grouping the characteristic vectors in the vector sequence according to a preset standard to obtain a plurality of vector groups.

In the embodiment of the present invention, the preset standard may be a standard using a sentence or a paragraph, or may be a standard using a fixed length or a fixed number of words.

In practical application, the vector sequences are grouped according to a preset standard to obtain a plurality of vector groups, wherein each vector group comprises grouped feature vectors.

In practical application, when the vector sequence is grouped by taking sentences as the standard, the sentences can be identified by their punctuation marks, and the feature vectors are then grouped sentence by sentence. Fig. 2 is a schematic diagram of a vector sequence grouped by taking sentences as the standard. To balance the contribution of each word in a sentence to spam identification, several padding (placeholder) vectors are added both before and after the vector group corresponding to each sentence after grouping. The number of padding vectors added on each side is equal to the maximum window length of the convolution kernels in the classifier minus 1.
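The sentence grouping and padding step can be sketched as follows. The set of sentence-ending marks, the use of zero vectors as padding, keeping the punctuation token inside its group, and the name `group_by_sentence` are all assumptions for illustration.

```python
import numpy as np

SENTENCE_END = {".", "!", "?", "。", "！", "？"}  # assumed sentence delimiters


def group_by_sentence(words, vectors, max_window, dim):
    """Split the vector sequence at sentence ends and pad each group on both sides.

    Each side receives (max_window - 1) padding vectors, as described in the text.
    """
    pad = [np.zeros(dim)] * (max_window - 1)  # assumption: padding vectors are zeros
    groups, current = [], []
    for word, vec in zip(words, vectors):
        current.append(vec)
        if word in SENTENCE_END:
            groups.append(np.stack(pad + current + pad))
            current = []
    if current:  # trailing text without a closing punctuation mark
        groups.append(np.stack(pad + current + pad))
    return groups


words = ["buy", "now", "!", "hello", "friend"]
vectors = [np.ones(3) for _ in words]
groups = group_by_sentence(words, vectors, max_window=3, dim=3)
print([g.shape for g in groups])  # [(7, 3), (6, 3)]
```

With a maximum window length of 3, two padding vectors are added on each side, so a 3-token sentence becomes a group of 7 vectors.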

S104: taking the vector group as an input parameter of a classifier, so that the classifier classifies the mails to be identified by combining context correlation to obtain a classification result, wherein the classification result is used for determining whether the mails to be identified belong to junk mails.

The classifier in the embodiment of the invention may be formed by a deep neural network such as a convolutional neural network (CNN) or a recurrent neural network (RNN). Using the ability of deep neural networks to capture context correlation when classifying the mail to be recognized improves the recognition accuracy for junk mail.

In a preferred embodiment, the classifier is formed by a convolutional neural network (CNN). For vector groups obtained by grouping the vector sequence with sentences or paragraphs as the standard, the processing procedure of the classifier is as follows. Referring to Fig. 3, which is a flowchart of a processing method of the classifier according to an embodiment of the present invention:

S301: taking the feature vectors in the vector group as input parameters of a first-layer convolutional neural network of the classifier to obtain the feature vectors corresponding to the vector group, wherein the feature vectors corresponding to the vector group are used for representing the semantics of sentences or paragraphs.

In practical application, since sentences or paragraphs are used as the grouping standard in this embodiment, the classifier that uses convolutional neural networks for training and classification may be composed of two layers of convolutional neural networks. Depending on the grouping standard, the classifier of the embodiment of the present invention may also be composed of three or more layers of convolutional neural networks. Fig. 4 is a schematic structural diagram of a classifier formed by two layers of convolutional neural networks according to an embodiment of the present invention. The first layer of convolutional neural network is composed of N convolution kernels and pooling layer 1, where N is a natural number.

Specifically, a vector group obtained by grouping with sentences or paragraphs as the standard is denoted $S_{1:n} = [X_1, X_2, \ldots, X_n]$, where $X_n$ is the feature vector corresponding to the $n$-th word. That is, the vector group $S_{1:n}$ is composed of the feature vectors of $n$ words.

In practical application, a convolution layer output result of the vector group at each convolution kernel is first obtained by a one-dimensional convolution operation. The convolution layer output result comprises the output results of the convolution operations performed between the vector group and the convolution kernel, taking each feature vector in the vector group in turn as the initial value of the convolution operation.

In particular, each feature vector $X_1, X_2, \ldots, X_n$ in the vector group $S_{1:n}$ is taken in turn as the initial value of the convolution operation and convolved with each convolution kernel to obtain the convolution layer output result of the vector group $S_{1:n}$ at each convolution kernel. Taking the $i$-th feature vector in the vector group as the initial value of the convolution operation, the output result obtained after convolving the $i$-th to the $(i + h_j - 1)$-th feature vectors in the vector group with the $j$-th convolution kernel $W_j$ is recorded as:

$$c_i^{m,j} = f\left(W_j \cdot X_{i:i+h_j-1} + b_j\right)$$

wherein the vector group is the $m$-th vector group obtained after grouping, $h_j$ is the window length of the $j$-th convolution kernel, $b_j$ is the offset, and $f(\cdot)$ is a non-linear function, such as $\tanh(\cdot)$.

In practical application, taking each feature vector $X_1, X_2, \ldots, X_n$ in the vector group $S_{1:n}$ in turn as the initial value of the convolution operation and convolving it with the convolution kernel yields $c_1^{m,j}, c_2^{m,j}, \ldots, c_n^{m,j}$.

Then, $c_1^{m,j}, c_2^{m,j}, \ldots, c_n^{m,j}$ are combined to finally obtain

$$C^{m,j} = \left[c_1^{m,j}, c_2^{m,j}, \ldots, c_n^{m,j}\right]$$

$C^{m,j}$ is the convolution layer output result of the vector group $S_{1:n}$ at the $j$-th convolution kernel.

That is, the convolution layer output result of the vector group at a convolution kernel comprises the output results of the convolution operations performed with that convolution kernel, taking each feature vector in the vector group in turn as the initial value of the convolution operation.

Then, the maximum value of the vector group in the convolution layer output result of each convolution kernel is obtained. Specifically, pooling layer 1 in Fig. 4 uses a max pooling method to obtain the maximum value of the vector group in the convolution layer output result of each convolution kernel. The maximum value of the $m$-th vector group obtained after grouping in the convolution layer output result of the $j$-th convolution kernel is recorded as:

$$P^{m,j} = \max\left(C^{m,j}\right)$$

Finally, the maximum values of the vector group in the convolution layer output results of the convolution kernels are combined to obtain the feature vector corresponding to the vector group. The feature vector corresponding to the $m$-th vector group obtained after grouping is recorded as:

$$Y^m = \left[P^{m,1}, P^{m,2}, \ldots, P^{m,N}\right]$$

The first layer of convolutional neural network comprises N convolution kernels, and the maximum values of the $m$-th vector group in the convolution layer output results of the N convolution kernels together form the feature vector $Y^m$ corresponding to the vector group.
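The first-layer computation described above (one-dimensional convolution per kernel, maximum per kernel, then concatenation) can be sketched with NumPy. The sketch assumes that the vector group has already been padded, so only start positions where a full window fits are evaluated, and that f is tanh; the function names are hypothetical.

```python
import numpy as np


def conv_outputs(group, kernel, bias):
    """Convolution layer output of one vector group at one kernel.

    group:  array of shape (n, dim), one row per (padded) feature vector.
    kernel: array of shape (h, dim), where h is the window length.
    Each output is tanh(sum(kernel * window) + bias) for a window of h rows.
    """
    h = kernel.shape[0]
    return np.array([
        np.tanh(np.sum(kernel * group[i:i + h]) + bias)
        for i in range(group.shape[0] - h + 1)
    ])


def group_feature_vector(group, kernels, biases):
    """Take the maximum output per kernel and combine the maxima into one vector."""
    return np.array([
        conv_outputs(group, kernel, bias).max()
        for kernel, bias in zip(kernels, biases)
    ])


rng = np.random.default_rng(0)
group = rng.normal(size=(7, 4))                               # 7 vectors of dimension 4
kernels = [rng.normal(size=(2, 4)), rng.normal(size=(3, 4))]  # N = 2, windows 2 and 3
biases = [0.0, 0.1]
y = group_feature_vector(group, kernels, biases)
print(y.shape)  # (2,)
```

Because each kernel contributes exactly one maximum, the resulting vector has a fixed length N regardless of how many words the sentence or paragraph contains.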

S302: taking the feature vector corresponding to the vector group as an input parameter of a second-layer convolutional neural network of the classifier to obtain a feature vector of the text in the mail to be recognized, wherein the feature vector of the text in the mail to be recognized is used for representing the semantic meaning of the text combined with context correlation.

As shown in Fig. 4, the second layer of convolutional neural network in the classifier may be composed of M convolution kernels and pooling layer 2, where M is a natural number, and it follows the same algorithm logic as the first layer of convolutional neural network. Specifically, the feature vectors corresponding to the vector groups, as output by the first layer of convolutional neural network, are used as input parameters of the second layer of convolutional neural network. After processing by the M convolution kernels and pooling layer 2, the second layer of convolutional neural network outputs the feature vector of the text in the mail to be recognized.
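How the two layers compose can be sketched by applying the same convolution-and-max routine twice: once per vector group, then once over the resulting sequence of group vectors. In this sketch each kernel pads its own input with zero rows (window length minus 1 on each side), which is a simplification of the padding rule in the text; the names `conv_max_layer` and `text_feature_vector` are hypothetical.

```python
import numpy as np


def conv_max_layer(rows, kernels, biases):
    """One convolution + max pooling layer: returns one value per kernel."""
    features = []
    for kernel, bias in zip(kernels, biases):
        h = kernel.shape[0]
        pad = np.zeros((h - 1, rows.shape[1]))  # assumption: zero padding per kernel
        padded = np.vstack([pad, rows, pad])
        outputs = [
            np.tanh(np.sum(kernel * padded[i:i + h]) + bias)
            for i in range(padded.shape[0] - h + 1)
        ]
        features.append(max(outputs))
    return np.array(features)


def text_feature_vector(groups, layer1, layer2):
    """Layer 1 maps each vector group to a vector; layer 2 maps those to one text vector."""
    group_vectors = np.stack([conv_max_layer(g, *layer1) for g in groups])  # (groups, N)
    return conv_max_layer(group_vectors, *layer2)                           # (M,)


rng = np.random.default_rng(0)
groups = [rng.normal(size=(5, 4)), rng.normal(size=(3, 4))]       # two sentences, dim 4
layer1 = ([rng.normal(size=(2, 4)) for _ in range(3)], [0.0] * 3)  # N = 3 kernels
layer2 = ([rng.normal(size=(2, 3)) for _ in range(2)], [0.0] * 2)  # M = 2 kernels
text_vector = text_feature_vector(groups, layer1, layer2)
print(text_vector.shape)  # (2,)
```

Note that the second-layer kernels span N columns, because their input rows are the N-dimensional group vectors produced by the first layer.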

S303: taking the feature vector of the text in the mail to be identified as an input parameter of a full connection layer of the classifier, and obtaining a classification result after classification processing of the full connection layer, wherein the classification result is used for determining whether the mail to be identified belongs to a junk mail.

As shown in fig. 4, the classifier in the embodiment of the present invention further includes a fully-connected layer, the feature vector of the text in the mail to be identified output by the second layer convolutional neural network is used as an input parameter of the fully-connected layer, the fully-connected layer outputs probabilities on a plurality of classifications through a softmax function, and it is possible to determine whether the mail to be identified belongs to spam using the probabilities. The algorithm logic of the full connection layer is the same as that of the traditional neural network, and is not described herein again.
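The fully connected layer with a softmax output can be sketched as below. The two class labels, the toy weights, and the rule of picking the most probable class are assumptions for illustration; the text only states that the probabilities are used to decide whether the mail is spam.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array of scores."""
    e = np.exp(z - z.max())
    return e / e.sum()


def classify(text_vector, weights, bias, labels=("normal", "spam")):
    """Fully connected layer followed by softmax; returns the label and probabilities."""
    probabilities = softmax(weights @ text_vector + bias)
    return labels[int(np.argmax(probabilities))], probabilities


# Toy parameters: 2 classes, text feature vector of length 2.
weights = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
bias = np.zeros(2)
label, probabilities = classify(np.array([0.2, 0.9]), weights, bias)
print(label)  # spam
```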

In the embodiment of the invention, the classifier is trained with mail samples before it is used to identify junk mail. The training process is basically the same as the identification process, with two differences. First, in the training stage the classifier performs not only the forward propagation that processes the mail samples, that is, S301 to S303 above, but also backward propagation, which adjusts the network parameters of each layer of the classifier (such as the weights and offsets of the fully connected layer) so that the final training result is more accurate. Second, the dropout algorithm is applied to the fully connected layer of the classifier to address overfitting to the mail samples in the training stage. Specifically, during forward propagation in the training stage, the outputs of some hidden-layer neurons are randomly set to 0, and those neurons do not take part in the parameter adjustment during backward propagation. This reduces the dependency among neurons and mitigates the overfitting of the deep neural network to the samples.
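The dropout behaviour described above can be sketched as a random mask that is applied in the forward pass and reused in the backward pass. The drop rate, the function name, and returning the mask are assumptions; the sketch follows the wording of the text and does not rescale the surviving outputs, which many practical implementations do.

```python
import numpy as np


def dropout(hidden, rate, rng, training=True):
    """Randomly zero hidden outputs during training and return the mask that was used."""
    if not training:
        # At identification time every neuron is kept.
        return hidden, np.ones_like(hidden)
    mask = (rng.random(hidden.shape) >= rate).astype(hidden.dtype)
    return hidden * mask, mask


rng = np.random.default_rng(0)
hidden = np.ones(1000)
dropped, mask = dropout(hidden, rate=0.5, rng=rng)

# In the backward pass, gradients are multiplied by the same mask, so neurons
# that were dropped receive no parameter update for this sample.
upstream_gradient = np.full(1000, 0.1)
hidden_gradient = upstream_gradient * mask
```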

In the spam identification method provided by the embodiment of the invention, first, the text in the mail to be identified is extracted and divided by taking words as units to obtain a word sequence, and each word in the word sequence is converted into its corresponding feature vector according to the correspondence between words and feature vectors acquired in advance, to obtain a vector sequence that includes the feature vector corresponding to each word in the word sequence. Second, the feature vectors in the vector sequence are grouped according to a preset standard to obtain a plurality of vector groups. Finally, the vector groups are used as the input parameters of a classifier, so that the classifier classifies the mail to be identified in combination with context correlation to obtain a classification result, where the classification result is used to determine whether the mail to be identified is spam. Compared with spam identification methods in the prior art, the embodiment of the invention takes into account the influence of context correlation on mail identification and thereby improves the accuracy of spam identification.
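The preprocessing steps summarized above (segmentation, conversion to feature vectors, and grouping) can be sketched as below. This is a minimal illustration under stated assumptions: whitespace splitting stands in for a real word segmenter, the lookup table is a toy dictionary rather than trained word vectors, unknown words map to a zero vector, and the preset standard is taken to be a fixed number of words per group.

```python
from typing import Dict, List

Vector = List[float]


def segment(text: str) -> List[str]:
    """Divide the text into a word sequence (assumed: whitespace-delimited words)."""
    return text.lower().split()


def to_vector_sequence(words: List[str],
                       table: Dict[str, Vector],
                       dim: int) -> List[Vector]:
    """Look up the pre-acquired feature vector of each word.

    Words missing from the table get a zero vector (an assumption of this sketch).
    """
    unknown = [0.0] * dim
    return [table.get(word, unknown) for word in words]


def group_by_word_count(vectors: List[Vector], size: int) -> List[List[Vector]]:
    """Group consecutive feature vectors, `size` words per group."""
    return [vectors[i:i + size] for i in range(0, len(vectors), size)]


# Usage with a toy two-dimensional table (values are made up).
table = {"win": [0.9, 0.1], "free": [0.8, 0.2], "prize": [0.7, 0.3]}
words = segment("Win a free prize now")
vector_groups = group_by_word_count(to_vector_sequence(words, table, dim=2), size=2)
```

Each element of `vector_groups` would then be passed to the classifier as one input, which lets the classifier see neighboring words together rather than in isolation.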

An embodiment of the present invention further provides a spam recognition device. Fig. 5 is a schematic structural diagram of the spam recognition device according to the embodiment of the present invention, where the device includes:

a segmentation module 501, configured to extract a text in the mail to be identified, and segment the text by taking a word as a unit to obtain a word sequence;

a conversion module 502, configured to convert words in the word sequence into feature vectors having a correspondence with the words according to a correspondence between words and feature vectors obtained in advance, so as to obtain a vector sequence, where the vector sequence includes feature vectors having a correspondence with each word in the word sequence;

a grouping module 503, configured to group the feature vectors in the vector sequence according to a preset standard to obtain a plurality of vector groups;

a classification module 504, configured to use the vector group as an input parameter of a classifier, so that the classifier classifies the mail to be identified in combination with context correlation to obtain a classification result, where the classification result is used to determine whether the mail to be identified is spam.
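The way the four modules feed one another can be sketched as a simple composition. The class name, the method name, and the stub callables in the usage example are all hypothetical placeholders; in particular, the stub passed as the classification module is not the convolutional classifier of the embodiment.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class SpamRecognitionDevice:
    """Wires modules 501 to 504 together in the order described above."""
    segmentation: Callable[[str], List[str]]                 # module 501
    conversion: Callable[[List[str]], List[Vector]]          # module 502
    grouping: Callable[[List[Vector]], List[List[Vector]]]   # module 503
    classification: Callable[[List[List[Vector]]], bool]     # module 504

    def is_spam(self, mail_text: str) -> bool:
        words = self.segmentation(mail_text)
        vectors = self.conversion(words)
        vector_groups = self.grouping(vectors)
        return self.classification(vector_groups)


# Usage with placeholder modules, only to show the data flow.
device = SpamRecognitionDevice(
    segmentation=str.split,
    conversion=lambda words: [[float(len(w))] for w in words],
    grouping=lambda vecs: [vecs[i:i + 2] for i in range(0, len(vecs), 2)],
    classification=lambda groups: len(groups) > 2,  # toy rule, not a real classifier
)
```

Keeping each module behind a plain callable mirrors the statement that the modules may or may not be physically separate units.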

Specifically, the grouping module 503 is configured to:

In a preferred embodiment, the classifier is formed by a convolutional neural network; the classification module 504 includes:

In a preferred embodiment, the first layer convolutional neural network of the classifier includes N convolutional kernels, where N is a natural number;

the first classification submodule includes:

The spam recognition device provided by the embodiment of the invention can realize the following functions: extracting the text in the mail to be identified and dividing the text by taking words as units to obtain a word sequence; converting each word in the word sequence into its corresponding feature vector according to the correspondence between words and feature vectors acquired in advance, to obtain a vector sequence that includes the feature vector corresponding to each word in the word sequence; grouping the feature vectors in the vector sequence according to a preset standard to obtain a plurality of vector groups; and using the vector groups as the input parameters of a classifier, so that the classifier classifies the mail to be identified in combination with context correlation to obtain a classification result, where the classification result is used to determine whether the mail to be identified is spam. Compared with spam identification methods in the prior art, the embodiment of the invention takes into account the influence of context correlation on mail identification and thereby improves the accuracy of spam identification.

Since the device embodiments substantially correspond to the method embodiments, reference may be made to the corresponding parts of the description of the method embodiments for relevant details. The device embodiments described above are merely illustrative. The units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. A person of ordinary skill in the art can understand and implement it without inventive effort.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The method and the device for identifying spam provided by the embodiment of the invention have been described in detail above. Specific examples are used herein to explain the principle and the implementation of the invention, and the description of the embodiments is intended only to help understand the method and the core idea of the invention. Meanwhile, a person skilled in the art may, following the idea of the invention, vary the specific implementation and the scope of application. In summary, the content of this specification should not be construed as a limitation of the invention.

Claims

1. A method for spam identification, the method comprising:

extracting a text in the mail to be identified, and segmenting the text by taking words as units to obtain a word sequence, wherein the word sequence is the text segmented by taking the words as units;

grouping the feature vectors in the vector sequence according to a preset standard to obtain a plurality of vector groups, wherein the preset standard is a sentence, a paragraph, a fixed length or a fixed number of words;

and taking the vector group as an input parameter of a classifier, so that the classifier classifies the mail to be identified in combination with context correlation to obtain a classification result, wherein the classification result is used for determining whether the mail to be identified is spam, and the classifier is constructed using a deep neural network.

2. The spam identification method according to claim 1, wherein the grouping the feature vectors in the vector sequence according to a preset criterion to obtain a plurality of vector groups comprises:

3. The spam identification method according to claim 2, wherein the classifier is constructed using a convolutional neural network;

4. The spam identification method of claim 3 wherein the first layer convolutional neural network of the classifier comprises N convolutional kernels, N being a natural number;

5. The method according to any one of claims 1 to 4, wherein before the converting the words in the word sequence into the feature vectors having correspondence with the words according to the correspondence between the words and the feature vectors obtained in advance, the method further comprises:

replacing words of a preset type in the word sequence with a preset label;

6. The spam identification method of claim 5, wherein said pre-constructing a feature vector for said tag comprises:

7. A spam recognition device, said device comprising:

and a classification module, configured to take the vector group as an input parameter of a classifier, so that the classifier classifies the mail to be identified in combination with context correlation to obtain a classification result, wherein the classification result is used for determining whether the mail to be identified is spam, and the classifier is constructed using a deep neural network.

8. The spam recognition device of claim 7, wherein the grouping module is specifically configured to:

9. The spam recognition device of claim 8 wherein the classifier is constructed using a convolutional neural network; the classification module comprises:

10. The spam recognition device of claim 9 wherein the first layer convolutional neural network of the classifier comprises N convolutional kernels, N being a natural number;

the first classification submodule includes: