CN106506327A - Spam filtering method and device - Google Patents

Info

Publication number
CN106506327A
Authority
CN
China
Prior art keywords
vector
characteristic vector
word
sequence
mail
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610888007.0A
Other languages
Chinese (zh)
Other versions
CN106506327B (en)
Inventor
杜强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neusoft Corp
Original Assignee
Neusoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neusoft Corp
Priority to CN201610888007.0A
Publication of CN106506327A
Application granted
Publication of CN106506327B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21 Monitoring or handling of messages
    • H04L51/212 Monitoring or handling of messages using filtering or selective blocking
    • H04L51/42 Mailbox-related aspects, e.g. synchronisation of mailboxes

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention discloses a spam filtering method and device. The method includes: extracting the text of a mail to be identified and segmenting the text into words to obtain a word sequence; converting each word in the word sequence into its corresponding feature vector, according to a previously obtained correspondence between words and feature vectors, to obtain a vector sequence that contains the feature vector corresponding to each word in the word sequence; grouping the feature vectors in the vector sequence according to a preset standard to obtain several vector groups; and using the vector groups as the input parameters of a classifier, so that the classifier classifies the mail to be identified taking context dependence into account and produces a classification result, which is used to determine whether the mail to be identified is spam. By taking into account the influence of context dependence on mail identification, the invention improves the accuracy of spam filtering.

Description

Spam filtering method and device
Technical field
The present invention relates to the field of data processing, and in particular to a spam filtering method and device.
Background
With the continuous development of the Internet, e-mail has become increasingly widespread, and commercial promotion carried by e-mail is widely used as well, which has led to a flood of spam. Spam typically consumes a large amount of resources, is often delivered to the wrong recipients or forced on them, and frequently contains false information. Spam has therefore long been a nuisance that Internet users resent.
To guard against spam, current e-mail systems embed various spam identification techniques, such as whitelists, blacklists, and content-based filtering. However, existing methods identify spam essentially on the basis of keywords or word frequency alone. This single perspective ignores other factors that affect filtering accuracy, so the identification accuracy is insufficient.
Summary of the invention
The present invention provides a spam filtering method and device that can improve the accuracy of spam filtering.
The present invention provides a spam filtering method, the method including:
extracting the text of a mail to be identified, and segmenting the text into words to obtain a word sequence;
converting each word in the word sequence into the feature vector corresponding to that word, according to a previously obtained correspondence between words and feature vectors, to obtain a vector sequence, the vector sequence including the feature vector corresponding to each word in the word sequence;
grouping the feature vectors in the vector sequence according to a preset standard to obtain several vector groups;
using the vector groups as the input parameters of a classifier, so that the classifier classifies the mail to be identified taking context dependence into account and obtains a classification result, the classification result being used to determine whether the mail to be identified is spam.
Preferably, grouping the feature vectors in the vector sequence according to a preset standard to obtain several vector groups includes:
grouping the feature vectors in the vector sequence by sentence or by paragraph to obtain several vector groups.
Preferably, the classifier is built from convolutional neural networks;
using the vector groups as the input parameters of the classifier, so that the classifier classifies the mail to be identified taking context dependence into account and obtains a classification result used to determine whether the mail to be identified is spam, includes:
using the feature vectors in each vector group as the input parameters of the first-layer convolutional neural network of the classifier to obtain the feature vector corresponding to the vector group, where the feature vector corresponding to the vector group represents the semantics of a sentence or paragraph;
using the feature vectors corresponding to the vector groups as the input parameters of the second-layer convolutional neural network of the classifier to obtain the feature vector of the text in the mail to be identified, where the feature vector of the text represents the semantics of the text after context dependence is taken into account;
using the feature vector of the text in the mail to be identified as the input parameter of the fully connected layer of the classifier and, after classification by the fully connected layer, obtaining the classification result, which is used to determine whether the mail to be identified is spam.
Preferably, the first-layer convolutional neural network of the classifier includes N convolution kernels, N being a natural number;
using the feature vectors in each vector group as the input parameters of the first-layer convolutional neural network of the classifier to obtain the feature vector corresponding to the vector group, where that feature vector represents the semantics of a sentence or paragraph, includes:
using a one-dimensional convolution operation to obtain the convolutional-layer output of the vector group for each convolution kernel, the convolutional-layer output including the results of convolving the kernel with the vector group, taking each feature vector in the vector group in turn as the starting position of the convolution;
obtaining, for each convolution kernel, the maximum value in the convolutional-layer output of the vector group;
combining the maximum values of the vector group over the convolutional-layer outputs of all convolution kernels to obtain the feature vector corresponding to the vector group.
Preferably, before converting each word in the word sequence into its corresponding feature vector according to the previously obtained correspondence between words and feature vectors to obtain the vector sequence, the method further includes:
replacing words of a preset type in the word sequence with preset labels;
constructing feature vectors for the labels in advance, and obtaining the correspondence between the labels and the feature vectors;
correspondingly, converting each word in the word sequence into its corresponding feature vector according to the previously obtained correspondence between words and feature vectors to obtain the vector sequence includes:
converting each word in the word sequence into its corresponding feature vector according to the previously obtained correspondence between words and feature vectors; and converting each label in the word sequence into its corresponding feature vector according to the correspondence between the labels and the feature vectors, to obtain the vector sequence.
Preferably, constructing feature vectors for the labels in advance includes:
randomly generating a feature vector, and determining whether the Euclidean distance between the generated feature vector and each feature vector in the correspondence between words and feature vectors is less than a preset constant;
when the Euclidean distance between the generated feature vector and each of those feature vectors is less than the preset constant, assigning the generated feature vector to the label.
The present invention also provides a spam filtering device, the device including:
a segmentation module, configured to extract the text of a mail to be identified and segment the text into words to obtain a word sequence;
a conversion module, configured to convert each word in the word sequence into its corresponding feature vector, according to a previously obtained correspondence between words and feature vectors, to obtain a vector sequence, the vector sequence including the feature vector corresponding to each word in the word sequence;
a grouping module, configured to group the feature vectors in the vector sequence according to a preset standard to obtain several vector groups;
a classification module, configured to use the vector groups as the input parameters of a classifier, so that the classifier classifies the mail to be identified taking context dependence into account and obtains a classification result, the classification result being used to determine whether the mail to be identified is spam.
Preferably, the grouping module is specifically configured to:
group the feature vectors in the vector sequence by sentence or by paragraph to obtain several vector groups.
Preferably, the classifier is built from convolutional neural networks, and the classification module includes:
a first classification submodule, configured to use the feature vectors in each vector group as the input parameters of the first-layer convolutional neural network of the classifier to obtain the feature vector corresponding to the vector group, where the feature vector corresponding to the vector group represents the semantics of a sentence or paragraph;
a second classification submodule, configured to use the feature vectors corresponding to the vector groups as the input parameters of the second-layer convolutional neural network of the classifier to obtain the feature vector of the text in the mail to be identified, where the feature vector of the text represents the semantics of the text after context dependence is taken into account;
a third classification submodule, configured to use the feature vector of the text in the mail to be identified as the input parameter of the fully connected layer of the classifier and, after classification by the fully connected layer, obtain the classification result, which is used to determine whether the mail to be identified is spam.
Preferably, the first-layer convolutional neural network of the classifier includes N convolution kernels, N being a natural number;
the first classification submodule includes:
a convolution submodule, configured to use a one-dimensional convolution operation to obtain the convolutional-layer output of the vector group for each convolution kernel, the convolutional-layer output including the results of convolving the kernel with the vector group, taking each feature vector in the vector group in turn as the starting position of the convolution;
an acquisition submodule, configured to obtain, for each convolution kernel, the maximum value in the convolutional-layer output of the vector group;
a combination submodule, configured to combine the maximum values of the vector group over the convolutional-layer outputs of all convolution kernels to obtain the feature vector corresponding to the vector group.
In the spam filtering method provided by the present invention, the text of the mail to be identified is first extracted and segmented into words to obtain a word sequence, and each word in the word sequence is converted into its corresponding feature vector, according to a previously obtained correspondence between words and feature vectors, to obtain a vector sequence that includes the feature vector corresponding to each word. Next, the feature vectors in the vector sequence are grouped according to a preset standard to obtain several vector groups. Finally, the vector groups are used as the input parameters of a classifier, so that the classifier classifies the mail to be identified taking context dependence into account and obtains a classification result, which is used to determine whether the mail is spam. Compared with prior-art spam filtering methods, the present invention takes into account the influence of context dependence on mail identification and thereby improves the accuracy of spam filtering.
Description of the drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a spam filtering method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of a vector sequence after grouping, provided by an embodiment of the present invention;
Fig. 3 is a flow chart of the processing method of a classifier provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a classifier provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a spam filtering device provided by an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present application.
Context dependence in the text content of an e-mail has a crucial influence on spam identification. For example, the word "viagra" is often assigned a high spam weight, either by rules or through training on samples. As a result, a joke from a friend that mentions "viagra", or an e-mail containing a serious medical discussion, may be identified as spam. This is clearly the consequence of identifying spam without considering context dependence. Spam identification methods that ignore context dependence and semantics inevitably have serious shortcomings in accuracy; in particular, they have a high error rate when distinguishing normal mail in a professional field from spam in the same field.
The spam filtering method provided by the present invention therefore takes the influence of context dependence into account and can identify spam more accurately.
The embodiments are described in detail below.
An embodiment of the present invention provides a spam filtering method. Referring to Fig. 1, a flow chart of the method, the method specifically includes:
S101: Extract the text of a mail to be identified, and segment the text into words to obtain a word sequence.
The spam filtering method provided by the embodiment of the present invention can be applied to mail gateways, mail servers, clients, and other terminals. In practice, the mail data on different terminals is encoded or encapsulated with particular encodings or protocols. By converting the mail data on the different terminals to text in advance, the embodiment shields subsequent processing from the differences among mail data from different terminals, which gives the system good adaptability.
In addition, the embodiment identifies spam on the basis of the text content of the e-mail; it does not involve identifying content such as pictures or attachments.
In practice, the text of the mail to be identified is extracted first. Because the invention identifies spam on the basis of text semantics, the embodiment segments the extracted text into words to obtain a word sequence, the word sequence being the text after word segmentation.
In the embodiment, methods for segmenting text into words include string-matching methods such as bidirectional maximum matching, methods based on hidden Markov models (HMM), and methods based on deep learning. The embodiment does not limit which method is used. Preferably, the HMM-based and deep-learning-based methods are used, since they perform better than the other methods.
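As an illustration of the string-matching option mentioned above, the following is a minimal sketch of bidirectional maximum matching. The toy dictionary, the function names, and the tie-breaking rule (fewer words, then fewer single-character words, then the backward result) are assumptions made for the example; the patent does not specify them.

```python
def forward_max_match(text, vocab, max_len):
    """Greedy left-to-right pass: take the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:  # a single character is always accepted
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, vocab, max_len):
    """Greedy right-to-left pass: take the longest dictionary word ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def bidirectional_max_match(text, vocab):
    """Run both passes and keep the segmentation with the lower cost."""
    max_len = max(map(len, vocab))
    fwd = forward_max_match(text, vocab, max_len)
    bwd = backward_max_match(text, vocab, max_len)

    def cost(seg):
        # Assumed heuristic: fewer words first, then fewer single-character words.
        return (len(seg), sum(len(w) == 1 for w in seg))

    return fwd if cost(fwd) < cost(bwd) else bwd


# Toy dictionary: "study", "graduate student", "life", "of", "origin".
vocab = {"研究", "研究生", "生命", "的", "起源"}
segments = bidirectional_max_match("研究生命的起源", vocab)
print(segments)  # ['研究', '生命', '的', '起源']
```

Here the forward pass greedily picks "研究生" and strands the single character "命", while the backward pass yields four well-formed words, so the backward result is kept.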
S102: According to a previously obtained correspondence between words and feature vectors, convert each word in the word sequence into the feature vector corresponding to that word to obtain a vector sequence, the vector sequence including the feature vector corresponding to each word in the word sequence.
In the embodiment, the correspondence between words and feature vectors is obtained in advance and stored in the system for later use. Specifically, in one implementation, the GloVe (Global Vectors for Word Representation) method can be used to train on previously obtained samples to obtain the correspondence between words and feature vectors. The samples used by the GloVe method can be natural-language text obtained from news, web pages, and similar sources. The embodiment is not limited to the GloVe method; other existing techniques can also be used to obtain the correspondence between words and feature vectors and are not described here.
It is worth emphasizing that the feature vectors in the correspondence obtained with the GloVe method satisfy the following conditions. First, the nearest neighbors of the feature vector of each word should be near-synonyms of that word; for example, the nearest neighbors of the feature vector of the word "frog" should be frogs, toad, litoria, leptodactylidae, rana, lizard, eleutherodactylus, and so on. Second, the feature vectors of related words are linearly related; for example, v(queen) ≈ v(king) - v(man) + v(woman), where v(·) is the function that maps a word to its feature vector, and queen, king, man, and woman are related words.
In practice, according to the correspondence between words and feature vectors stored in the system, each word in the obtained word sequence is converted into its corresponding feature vector to obtain the vector sequence. The vector sequence includes the feature vector corresponding to each word in the word sequence.
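The conversion in S102 amounts to a table lookup. The sketch below uses a hand-written 3-dimensional table in place of trained GloVe vectors; the table contents, the function name, and the choice of a zero vector for out-of-vocabulary words are assumptions for illustration, since the patent does not say how unknown words are handled.

```python
import numpy as np

# Toy table standing in for the stored word-to-vector correspondence.
# Real GloVe vectors are trained on a corpus and have many more dimensions.
word_to_vec = {
    "free":  np.array([0.9, 0.1, 0.0]),
    "offer": np.array([0.8, 0.2, 0.1]),
    "hello": np.array([0.0, 0.7, 0.3]),
}
UNKNOWN = np.zeros(3)  # assumption: unknown words map to a zero vector


def to_vector_sequence(words, table, unknown=UNKNOWN):
    """Convert a word sequence into an array of shape (len(words), dim)."""
    return np.stack([table.get(w, unknown) for w in words])


seq = to_vector_sequence(["hello", "free", "offer", "xyzzy"], word_to_vec)
print(seq.shape)  # (4, 3)
```

Row i of `seq` is the feature vector of the i-th word, so the order of the word sequence is preserved in the vector sequence.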
In a preferred embodiment, after the word sequence is obtained, words of preset types in the word sequence, such as numbers and symbols, are found and replaced with preset labels. For example, the date "2016-6-1" is replaced with the label "<date>".
Because words of the preset types are generally irrelevant to spam identification, the embodiment replaces them uniformly with preset labels. This simplifies the identification process and also improves the generalization ability of the classifier: e-mails that differ only in certain numbers, dates, and the like can be treated by the classifier as the same kind of e-mail, which simplifies processing.
In practice, regular expressions can be used to match words of the preset types and replace them with the preset labels. The embodiment can maintain a regular expression library and replace each word that matches an entry in the library with the corresponding label.
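A minimal sketch of such a regular expression library follows. The two patterns, the "<num>" label, and the function name are hypothetical; only the "<date>" label and the date example come from the text above.

```python
import re

# Hypothetical rule library: (compiled pattern, label). Order matters, because
# the more specific date pattern must be tried before the general number pattern.
RULES = [
    (re.compile(r"\d{4}-\d{1,2}-\d{1,2}"), "<date>"),
    (re.compile(r"\d+(?:\.\d+)?"), "<num>"),
]


def replace_preset_words(words, rules=RULES):
    """Replace every word that fully matches a rule with that rule's label."""
    out = []
    for w in words:
        for pattern, label in rules:
            if pattern.fullmatch(w):
                w = label
                break
        out.append(w)
    return out


labelled = replace_preset_words(["Meet", "on", "2016-6-1", "room", "42"])
print(labelled)  # ['Meet', 'on', '<date>', 'room', '<num>']
```

Using `fullmatch` rather than `search` keeps ordinary words that merely contain digits from being replaced.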
In addition, because the correspondence between words and feature vectors obtained with the GloVe method does not include feature vectors for the labels, the GloVe method can be used to construct feature vectors for the labels after the labels are preset. Specifically, a feature vector is randomly generated, and it is determined whether the Euclidean distance between this feature vector and the feature vector of each word in the previously obtained correspondence is less than a preset constant. If the Euclidean distance between the generated feature vector and the feature vector of every word is less than the preset constant, the generated feature vector is assigned to the label. A corresponding feature vector is constructed for each label in this way.
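The generate-and-test procedure described above can be sketched as follows. The sampling range (the bounding interval of the existing vectors), the retry limit, and all names are assumptions; the text specifies only the random generation and the Euclidean distance test against a preset constant.

```python
import numpy as np


def build_label_vector(word_vectors, max_dist, rng, max_tries=10_000):
    """Draw random vectors until one lies within max_dist of every word vector."""
    dim = word_vectors.shape[1]
    lo, hi = word_vectors.min(), word_vectors.max()  # assumed sampling range
    for _ in range(max_tries):
        candidate = rng.uniform(lo, hi, size=dim)
        dists = np.linalg.norm(word_vectors - candidate, axis=1)
        if np.all(dists < max_dist):
            return candidate
    raise RuntimeError("no candidate satisfied the distance constraint")


# Toy word vectors at the corners of the unit square.
word_vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
rng = np.random.default_rng(0)
label_vec = build_label_vector(word_vectors, max_dist=1.0, rng=rng)
```

With these toy vectors, only candidates near the centre of the square pass the test, so several draws may be rejected before one is accepted.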
In practice, each label in the word sequence is likewise converted into its corresponding feature vector according to the correspondence between labels and feature vectors.
S103: Group the feature vectors in the vector sequence according to a preset standard to obtain several vector groups.
In the embodiment, the preset standard can be the sentence or the paragraph, or it can be a fixed length or a fixed number of words.
In practice, after the vector sequence is grouped according to the preset standard, several vector groups are obtained, each vector group containing the feature vectors assigned to it.
In practice, when the vector sequence is grouped by sentence, sentences can be recognized from the punctuation marks in the sequence, and the feature vectors are then grouped sentence by sentence. Fig. 2 is a schematic diagram of a vector sequence after grouping by sentence. To balance the contribution of each word in a sentence to spam identification, several placeholder vectors are added before and after the vector group of each sentence. The number of placeholder vectors added on each side equals the maximum window length of the convolution kernels in the classifier minus 1.
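The grouping and padding in S103 can be sketched as below. The punctuation set, the use of all-zero placeholder vectors, and the decision to keep the punctuation token inside its sentence group are assumptions for the example.

```python
import numpy as np

SENTENCE_END = {".", "!", "?", "。", "！", "？"}  # assumed sentence delimiters


def group_by_sentence(words, vectors, max_window):
    """Split the vector sequence at sentence ends and pad each group on both sides."""
    dim = vectors.shape[1]
    # (max window length - 1) placeholder vectors per side; zeros are an assumption.
    pad = np.zeros((max_window - 1, dim))
    groups, start = [], 0
    for i, w in enumerate(words):
        if w in SENTENCE_END or i == len(words) - 1:
            groups.append(np.vstack([pad, vectors[start:i + 1], pad]))
            start = i + 1
    return groups


words = ["buy", "now", "!", "see", "you", "."]
vectors = np.arange(12, dtype=float).reshape(6, 2)  # one 2-d vector per word
groups = group_by_sentence(words, vectors, max_window=3)
print(len(groups), groups[0].shape)  # 2 (7, 2)
```

Each three-token sentence becomes a group of 2 + 3 + 2 = 7 vectors, so that a kernel of window length 3 covers the first and last words as often as the words in the middle.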
S104: Use the vector groups as the input parameters of a classifier, so that the classifier classifies the mail to be identified taking context dependence into account and obtains a classification result, the classification result being used to determine whether the mail to be identified is spam.
The classifier in the embodiment can be built from deep neural networks such as convolutional neural networks (CNN) or recurrent neural networks (RNN). Using the ability of deep neural networks to associate context when classifying the mail to be identified improves the accuracy of spam identification.
In a preferred embodiment, the classifier is built from convolutional neural networks. For vector groups obtained by grouping the vector sequence by sentence or paragraph, the classifier proceeds as follows. Fig. 3 is a flow chart of the processing method of a classifier provided by an embodiment of the present invention:
S301: Use the feature vectors in each vector group as the input parameters of the first-layer convolutional neural network of the classifier to obtain the feature vector corresponding to the vector group, where the feature vector corresponding to the vector group represents the semantics of a sentence or paragraph.
In practice, because this embodiment groups by sentence or paragraph, the classifier that is trained and used for classification can consist of two layers of convolutional neural networks. Depending on the grouping standard, the classifier can also consist of three or more layers of convolutional neural networks. Fig. 4 is a schematic structural diagram of a classifier consisting of two layers of convolutional neural networks. The first-layer convolutional neural network consists of N convolution kernels and pooling layer 1, N being a natural number.
Specifically, a vector group obtained by grouping by sentence or paragraph is denoted $S_{1:n} = [X_1, X_2, \ldots, X_n]$, where $X_n$ is the feature vector corresponding to the n-th word. That is, the vector group $S_{1:n}$ consists of the feature vectors of n words.
In practice, a one-dimensional convolution operation is first used to obtain the convolutional-layer output of the vector group for each convolution kernel. The convolutional-layer output includes the results of convolving the kernel with the vector group, taking each feature vector in the vector group in turn as the starting position of the convolution.
Specifically, each feature vector $X_1, X_2, \ldots, X_n$ in the vector group $S_{1:n}$ is taken in turn as the starting position and convolved with each convolution kernel, giving the convolutional-layer output of $S_{1:n}$ for each kernel. Taking the i-th feature vector as the starting position, the result of convolving the i-th through $(i+h_j-1)$-th feature vectors of the vector group with the j-th convolution kernel $W_j$ is denoted:

$$c^{m,j}_i = f\left(W_j \cdot X_{i:i+h_j-1} + b_j\right)$$

where the vector group is the m-th vector group obtained after grouping, $h_j$ is the window length of the j-th convolution kernel, $b_j$ is the offset, and $f(\cdot)$ is a nonlinear function such as $\tanh(\cdot)$.
After $c^{m,j}_1, c^{m,j}_2, \ldots, c^{m,j}_n$ have been obtained in this way, they are combined to give

$$C_{m,j} = \left[c^{m,j}_1, c^{m,j}_2, \ldots, c^{m,j}_n\right]$$

where $C_{m,j}$ is the convolutional-layer output of the vector group $S_{1:n}$ for the j-th convolution kernel.
Next, the maximum value in the convolutional-layer output of the vector group is obtained for each convolution kernel. Specifically, pooling layer 1 in Fig. 4 uses the max-out pooling method to obtain these maxima. The maximum value in the convolutional-layer output of the m-th vector group for the j-th convolution kernel is denoted:

$$P_{m,j} = \max\left(C_{m,j}\right)$$

Finally, the maximum values of the vector group over the convolutional-layer outputs of all convolution kernels are combined to obtain the feature vector corresponding to the vector group. The feature vector corresponding to the m-th vector group is denoted:

$$Y_m = \left[P_{m,1}, P_{m,2}, \ldots, P_{m,N}\right]$$

where the first-layer convolutional neural network includes N convolution kernels, and the maxima of the convolutional-layer outputs of the m-th vector group for the N kernels make up the feature vector $Y_m$ corresponding to the vector group.
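The first-layer computation can be sketched in NumPy as follows: for every kernel, slide a window over the vector group, apply the nonlinearity, and keep the maximum. The function name, the random toy weights, and the trailing zero padding (so that every vector can serve as a starting position) are assumptions; a group already padded as in S103 would simply receive some extra zeros.

```python
import numpy as np


def conv_layer(group, kernels, biases, f=np.tanh):
    """Map one vector group of shape (n, d) to a length-N feature vector.

    kernels[j] has shape (h_j, d); windows that run past the end see zeros.
    """
    n, d = group.shape
    features = []
    for W, b in zip(kernels, biases):
        h = W.shape[0]
        padded = np.vstack([group, np.zeros((h - 1, d))])
        # One output per starting position i: f(W . X[i:i+h] + b)
        c = np.array([f(np.sum(W * padded[i:i + h]) + b) for i in range(n)])
        features.append(c.max())  # max over positions for this kernel
    return np.array(features)     # one entry per kernel


rng = np.random.default_rng(0)
group = rng.normal(size=(5, 4))                                  # 5 words, 4-d vectors
kernels = [rng.normal(size=(2, 4)), rng.normal(size=(3, 4))]     # N = 2 kernels
y = conv_layer(group, kernels, biases=[0.0, 0.1])
print(y.shape)  # (2,)
```

The output length depends only on the number of kernels, not on the sentence length, which is what lets sentences of different lengths feed the same second layer.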
S302: The feature vector corresponding to each vector group is used as an input parameter of the second-layer convolutional neural network of the classifier to obtain the feature vector of the text in the mail to be identified, where the feature vector of the text in the mail to be identified represents the semantics of the text after context dependence has been taken into account.
As shown in Fig. 4, the second-layer convolutional neural network in the classifier may consist of M convolution kernels and pooling layer 2, where M is a natural number, and its algorithm logic is the same as that of the first-layer convolutional neural network. Specifically, the feature vectors corresponding to the vector groups output by the first-layer convolutional neural network serve as the input parameters of the second-layer convolutional neural network. After processing by the M convolution kernels and pooling layer 2 of the second-layer convolutional neural network, the final output of the second-layer network is the feature vector of the text in the mail to be identified.
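Because the second layer reuses the logic of the first, the two stages can be expressed as one function applied twice. The sketch below is illustrative only: `conv_pool` and `mail_feature_vector` are hypothetical names, and the tanh activation and the truncation of windows at the end of a sequence are assumptions not stated in the text.

```python
import math


def conv_pool(sequence, kernels):
    """One convolution + max-pooling stage; returns one maximum per kernel."""
    pooled = []
    for kernel in kernels:
        outputs = []
        for start in range(len(sequence)):
            window = sequence[start:start + len(kernel)]  # shorter at the end
            total = sum(
                w * x
                for weights, vector in zip(kernel, window)
                for w, x in zip(weights, vector)
            )
            outputs.append(math.tanh(total))
        pooled.append(max(outputs))
    return pooled


def mail_feature_vector(groups, first_kernels, second_kernels):
    """Stack the two stages: word vectors -> group vectors -> mail vector."""
    # First layer: each vector group becomes a vector of length N.
    group_vectors = [conv_pool(group, first_kernels) for group in groups]
    # Second layer: the sequence of group vectors becomes a vector of length M.
    return conv_pool(group_vectors, second_kernels)


if __name__ == "__main__":
    groups = [[[1.0], [2.0]], [[0.5]]]                    # two sentences
    first_kernels = [[[1.0]], [[-1.0]]]                   # N = 2
    second_kernels = [[[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 1.0]]]  # M = 3
    print(mail_feature_vector(groups, first_kernels, second_kernels))
```

Note that the second-layer kernels must have weight vectors of length N, since they operate on the group vectors produced by the first layer.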
S303: The feature vector of the text in the mail to be identified is used as an input parameter of the fully connected layer of the classifier; after classification by the fully connected layer, a classification result is obtained, which is used to determine whether the mail to be identified is spam.
As shown in Fig. 4, the classifier in the embodiment of the present invention further includes a fully connected layer. The feature vector of the text in the mail to be identified, output by the second-layer convolutional neural network, serves as the input parameter of the fully connected layer, which uses the softmax function to output probabilities over multiple classes; these probabilities can be used to determine whether the mail to be identified is spam. The algorithm logic of the fully connected layer is the same as that of a traditional neural network and is not described again here.
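The fully connected layer with softmax output can be illustrated with a short sketch. The two-class layout, the index of the spam class, and the 0.5 decision threshold are assumptions made for this example; the text only states that softmax probabilities are used for the decision.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fully_connected(mail_vector, weights, biases):
    """One score per class: weights[k] . mail_vector + biases[k]."""
    scores = [
        sum(w * x for w, x in zip(row, mail_vector)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


def is_spam(probabilities, spam_index=1, threshold=0.5):
    """Hypothetical decision rule: spam if its probability reaches the threshold."""
    return probabilities[spam_index] >= threshold


if __name__ == "__main__":
    probs = fully_connected([1.0, 2.0], [[0.0, 0.0], [1.0, 1.0]], [0.0, 0.0])
    print(probs, is_spam(probs))
```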
In the embodiment of the present invention, before the classifier is used to identify spam, it is first trained with mail samples. Specifically, training the classifier with mail samples is essentially the same process as identifying spam with the classifier, with the following two differences. First, in the training stage the classifier performs not only the forward propagation that processes the mail samples, i.e. S301-S303 above, but also back propagation, whose purpose is to adjust the network parameters of each layer of the classifier (such as the weights and biases of the fully connected layer) so that the final training result is more accurate. Second, the dropout algorithm is applied to the fully connected layer of the classifier to address overfitting to the mail samples in the training stage. Specifically, during forward propagation in the training stage, the outputs of some hidden-layer neurons are randomly set to 0, and these neurons do not take part in the parameter adjustment of back propagation. This reduces the dependence between neurons and addresses the overfitting of the deep neural network to the samples.
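The dropout step described above can be sketched as a forward mask that is reused during back propagation. This is a plain illustration under assumptions: the drop rate, the function names, and the absence of any rescaling of the kept activations are choices made here, not details given in the text.

```python
import random


def dropout(activations, rate, rng):
    """Randomly zero hidden-layer outputs during training.

    Returns the masked activations and the 0/1 mask so that the same
    neurons can be excluded from the backward pass.
    """
    mask = [0 if rng.random() < rate else 1 for _ in activations]
    return [a * m for a, m in zip(activations, mask)], mask


def mask_gradients(gradients, mask):
    """Dropped neurons receive zero gradient, so their parameters are not adjusted."""
    return [g * m for g, m in zip(gradients, mask)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    hidden = [0.3, -1.2, 0.8, 0.5]
    dropped, mask = dropout(hidden, rate=0.5, rng=rng)
    print(dropped, mask_gradients([0.1, 0.1, 0.1, 0.1], mask))
```

At identification time no mask is applied; dropout is used only in the training stage.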
In the spam filtering method provided by the embodiment of the present invention, the text in the mail to be identified is first extracted and split into words to obtain a word sequence. According to a previously obtained correspondence between words and feature vectors, each word in the word sequence is converted into the feature vector corresponding to that word, giving a vector sequence that contains the feature vector corresponding to each word in the word sequence. Next, the feature vectors in the vector sequence are grouped according to a preset standard to obtain several vector groups. Finally, the vector groups are used as input parameters of a classifier, so that the classifier classifies the mail to be identified taking context dependence into account and produces a classification result, which is used to determine whether the mail to be identified is spam. Compared with prior-art spam identification methods, the embodiment of the present invention takes into account the influence of context dependence on mail identification and thereby improves the accuracy of spam filtering.
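The preprocessing steps summarised above (word splitting, vector lookup, grouping by sentence) can be sketched as follows. The regular-expression tokenizer is only a stand-in: real mail, especially Chinese text, would need a proper word segmenter. The lookup table, the vector used for unknown words, and the function names are likewise hypothetical.

```python
import re

SENTENCE_END = {".", "!", "?"}


def split_words(text):
    """Stand-in tokenizer: lower-cased words plus sentence-ending punctuation."""
    return re.findall(r"\w+|[.!?]", text.lower())


def group_by_sentence(words, word_vectors, unknown):
    """Convert words to feature vectors and start a new group at each sentence end."""
    groups, current = [], []
    for word in words:
        if word in SENTENCE_END:
            if current:
                groups.append(current)
                current = []
        else:
            current.append(word_vectors.get(word, unknown))
    if current:  # text that does not end with punctuation
        groups.append(current)
    return groups


if __name__ == "__main__":
    table = {"buy": [1.0, 0.0], "now": [0.0, 1.0], "hello": [0.5, 0.5]}
    words = split_words("Buy now! Hello.")
    print(group_by_sentence(words, table, unknown=[0.0, 0.0]))
```

Grouping by paragraph instead of by sentence would only change the boundary test, for example splitting on blank lines before tokenizing.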
The embodiment of the present invention further provides a spam filtering device. Fig. 5 is a schematic structural diagram of a spam filtering device provided by the embodiment of the present invention. The device includes:
a segmentation module 501, configured to extract the text in the mail to be identified and split the text into words to obtain a word sequence;
a conversion module 502, configured to convert, according to a previously obtained correspondence between words and feature vectors, each word in the word sequence into the feature vector corresponding to that word, obtaining a vector sequence that contains the feature vector corresponding to each word in the word sequence;
a grouping module 503, configured to group the feature vectors in the vector sequence according to a preset standard to obtain several vector groups;
a classification module 504, configured to use the vector groups as input parameters of a classifier, so that the classifier classifies the mail to be identified taking context dependence into account and produces a classification result, which is used to determine whether the mail to be identified is spam.
Specifically, the grouping module 503 is configured to:
group the feature vectors in the vector sequence by sentence or by paragraph to obtain several vector groups.
In a preferred embodiment, the classifier is built from convolutional neural networks, and the classification module 504 includes:
a first classification submodule, configured to use the feature vectors in each vector group as input parameters of the first-layer convolutional neural network of the classifier to obtain the feature vector corresponding to the vector group, where the feature vector corresponding to the vector group represents the semantics of a sentence or paragraph;
a second classification submodule, configured to use the feature vectors corresponding to the vector groups as input parameters of the second-layer convolutional neural network of the classifier to obtain the feature vector of the text in the mail to be identified, where the feature vector of the text in the mail to be identified represents the semantics of the text after context dependence has been taken into account;
a third classification submodule, configured to use the feature vector of the text in the mail to be identified as an input parameter of the fully connected layer of the classifier and, after classification by the fully connected layer, obtain a classification result, which is used to determine whether the mail to be identified is spam.
In a preferred embodiment, the first-layer convolutional neural network of the classifier includes N convolution kernels, where N is a natural number;
the first classification submodule includes:
a convolution operation submodule, configured to obtain, using a one-dimensional convolution operation, the convolutional-layer output of the vector group for each convolution kernel, where the convolutional-layer output comprises the results of convolving with the convolution kernel while taking each feature vector in the vector group in turn as the starting value of the convolution operation;
an acquisition submodule, configured to obtain, for each convolution kernel, the maximum of the vector group's convolutional-layer output;
a combination submodule, configured to combine the maxima of the vector group's convolutional-layer outputs over all convolution kernels to obtain the feature vector corresponding to the vector group.
The spam filtering device provided by the embodiment of the present invention can implement the following functions: extracting the text in the mail to be identified and splitting the text into words to obtain a word sequence; converting, according to a previously obtained correspondence between words and feature vectors, each word in the word sequence into the feature vector corresponding to that word, obtaining a vector sequence that contains the feature vector corresponding to each word in the word sequence; grouping the feature vectors in the vector sequence according to a preset standard to obtain several vector groups; and using the vector groups as input parameters of a classifier, so that the classifier classifies the mail to be identified taking context dependence into account and produces a classification result, which is used to determine whether the mail to be identified is spam. Compared with prior-art spam identification methods, the embodiment of the present invention takes into account the influence of context dependence on mail identification and thereby improves the accuracy of spam filtering.
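One way to picture the modular device is as four interchangeable callables wired in sequence. The sketch below is purely illustrative: the class name `SpamFilterDevice`, the field names, and the callable signatures are hypothetical and are not taken from the patent, which does not prescribe a software structure.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class SpamFilterDevice:
    """Four modules chained in the order 501 -> 502 -> 503 -> 504."""

    split: Callable[[str], List[str]]                        # segmentation module
    convert: Callable[[List[str]], List[Vector]]             # conversion module
    group: Callable[[List[Vector]], List[List[Vector]]]      # grouping module
    classify: Callable[[List[List[Vector]]], bool]           # classification module

    def is_spam(self, text: str) -> bool:
        words = self.split(text)
        vectors = self.convert(words)
        groups = self.group(vectors)
        return self.classify(groups)


if __name__ == "__main__":
    # Toy stand-ins for the four modules, only to show the wiring.
    device = SpamFilterDevice(
        split=lambda text: text.lower().split(),
        convert=lambda words: [[1.0] if w == "free" else [0.0] for w in words],
        group=lambda vectors: [vectors],
        classify=lambda groups: sum(v[0] for g in groups for v in g) >= 2,
    )
    print(device.is_spam("FREE free offer"))
```

Keeping each module behind a plain callable reflects the point made in the text that the modules need not be physically separate units and can be selected or replaced as needed.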
As the device embodiment essentially corresponds to the method embodiment, the relevant parts may be understood by reference to the description of the method embodiment. The device embodiment described above is merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.
The spam filtering method and device provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. At the same time, those of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.

Claims (10)

1. A spam filtering method, characterized in that the method comprises:
extracting the text in a mail to be identified, and splitting the text into words to obtain a word sequence;
converting, according to a previously obtained correspondence between words and feature vectors, each word in the word sequence into the feature vector corresponding to that word to obtain a vector sequence, the vector sequence comprising the feature vector corresponding to each word in the word sequence;
grouping the feature vectors in the vector sequence according to a preset standard to obtain several vector groups;
using the vector groups as input parameters of a classifier, so that the classifier classifies the mail to be identified taking context dependence into account and obtains a classification result, the classification result being used to determine whether the mail to be identified is spam.
2. The spam filtering method according to claim 1, characterized in that grouping the feature vectors in the vector sequence according to a preset standard to obtain several vector groups comprises:
grouping the feature vectors in the vector sequence by sentence or by paragraph to obtain several vector groups.
3. The spam filtering method according to claim 2, characterized in that the classifier is built from convolutional neural networks;
using the vector groups as input parameters of the classifier, so that the classifier classifies the mail to be identified taking context dependence into account and obtains a classification result, the classification result being used to determine whether the mail to be identified is spam, comprises:
using the feature vectors in each vector group as input parameters of the first-layer convolutional neural network of the classifier to obtain the feature vector corresponding to the vector group, wherein the feature vector corresponding to the vector group represents the semantics of a sentence or paragraph;
using the feature vectors corresponding to the vector groups as input parameters of the second-layer convolutional neural network of the classifier to obtain the feature vector of the text in the mail to be identified, wherein the feature vector of the text in the mail to be identified represents the semantics of the text after context dependence has been taken into account;
using the feature vector of the text in the mail to be identified as an input parameter of the fully connected layer of the classifier and, after classification by the fully connected layer, obtaining a classification result, the classification result being used to determine whether the mail to be identified is spam.
4. The spam filtering method according to claim 3, characterized in that the first-layer convolutional neural network of the classifier comprises N convolution kernels, N being a natural number;
using the feature vectors in each vector group as input parameters of the first-layer convolutional neural network of the classifier to obtain the feature vector corresponding to the vector group, wherein the feature vector corresponding to the vector group represents the semantics of a sentence or paragraph, comprises:
obtaining, using a one-dimensional convolution operation, the convolutional-layer output of the vector group for each convolution kernel, the convolutional-layer output comprising the results of convolving with the convolution kernel while taking each feature vector in the vector group in turn as the starting value of the convolution operation;
obtaining, for each convolution kernel, the maximum of the vector group's convolutional-layer output;
combining the maxima of the vector group's convolutional-layer outputs over all convolution kernels to obtain the feature vector corresponding to the vector group.
5. The spam filtering method according to any one of claims 1-4, characterized in that, before converting, according to the previously obtained correspondence between words and feature vectors, each word in the word sequence into the feature vector corresponding to that word to obtain the vector sequence, the method further comprises:
replacing words of a preset type in the word sequence with preset labels;
constructing a feature vector for each label in advance, and obtaining the correspondence between the labels and the feature vectors;
correspondingly, converting, according to the previously obtained correspondence between words and feature vectors, each word in the word sequence into the feature vector corresponding to that word to obtain the vector sequence comprises:
converting, according to the previously obtained correspondence between words and feature vectors, each word in the word sequence into the feature vector corresponding to that word; and converting, according to the correspondence between the labels and the feature vectors, each label in the word sequence into the feature vector corresponding to that label, to obtain the vector sequence.
6. The spam filtering method according to claim 5, characterized in that constructing a feature vector for the label in advance comprises:
randomly generating a feature vector, and judging whether the Euclidean distance between the feature vector and each feature vector in the correspondence between words and feature vectors is less than a preset constant;
when the Euclidean distance between the feature vector and each of said feature vectors is less than the preset constant, assigning the feature vector to the label.
7. A spam filtering device, characterized in that the device comprises:
a segmentation module, configured to extract the text in a mail to be identified and split the text into words to obtain a word sequence;
a conversion module, configured to convert, according to a previously obtained correspondence between words and feature vectors, each word in the word sequence into the feature vector corresponding to that word to obtain a vector sequence, the vector sequence comprising the feature vector corresponding to each word in the word sequence;
a grouping module, configured to group the feature vectors in the vector sequence according to a preset standard to obtain several vector groups;
a classification module, configured to use the vector groups as input parameters of a classifier, so that the classifier classifies the mail to be identified taking context dependence into account and obtains a classification result, the classification result being used to determine whether the mail to be identified is spam.
8. The spam filtering device according to claim 7, characterized in that the grouping module is configured to:
group the feature vectors in the vector sequence by sentence or by paragraph to obtain several vector groups.
9. The spam filtering device according to claim 8, characterized in that the classifier is built from convolutional neural networks, and the classification module comprises:
a first classification submodule, configured to use the feature vectors in each vector group as input parameters of the first-layer convolutional neural network of the classifier to obtain the feature vector corresponding to the vector group, wherein the feature vector corresponding to the vector group represents the semantics of a sentence or paragraph;
a second classification submodule, configured to use the feature vectors corresponding to the vector groups as input parameters of the second-layer convolutional neural network of the classifier to obtain the feature vector of the text in the mail to be identified, wherein the feature vector of the text in the mail to be identified represents the semantics of the text after context dependence has been taken into account;
a third classification submodule, configured to use the feature vector of the text in the mail to be identified as an input parameter of the fully connected layer of the classifier and, after classification by the fully connected layer, obtain a classification result, the classification result being used to determine whether the mail to be identified is spam.
10. The spam filtering device according to claim 9, characterized in that the first-layer convolutional neural network of the classifier comprises N convolution kernels, N being a natural number;
the first classification submodule comprises:
a convolution operation submodule, configured to obtain, using a one-dimensional convolution operation, the convolutional-layer output of the vector group for each convolution kernel, the convolutional-layer output comprising the results of convolving with the convolution kernel while taking each feature vector in the vector group in turn as the starting value of the convolution operation;
an acquisition submodule, configured to obtain, for each convolution kernel, the maximum of the vector group's convolutional-layer output;
a combination submodule, configured to combine the maxima of the vector group's convolutional-layer outputs over all convolution kernels to obtain the feature vector corresponding to the vector group.
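As an illustration of the label-vector construction recited in claims 5 and 6, the sketch below draws random candidate vectors and accepts the first one whose Euclidean distance to every existing word vector is below a preset constant, following the wording of claim 6. The sampling range, the retry limit, and all function and parameter names are assumptions introduced for this example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_label_vector(word_vectors, dim, max_distance, rng, attempts=1000):
    """Return a random vector lying within `max_distance` of every word vector.

    Raises ValueError if no candidate satisfies the condition within the
    (assumed) retry limit.
    """
    for _ in range(attempts):
        candidate = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        if all(euclidean(candidate, v) < max_distance for v in word_vectors):
            return candidate
    raise ValueError("no suitable label vector found")


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed only to make the demo repeatable
    existing = [[0.0, 0.0], [0.1, 0.1]]
    print(build_label_vector(existing, dim=2, max_distance=5.0, rng=rng))
```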
CN201610888007.0A 2016-10-11 2016-10-11 Junk mail identification method and device Active CN106506327B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610888007.0A CN106506327B (en) 2016-10-11 2016-10-11 Junk mail identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610888007.0A CN106506327B (en) 2016-10-11 2016-10-11 Junk mail identification method and device

Publications (2)

Publication Number Publication Date
CN106506327A true CN106506327A (en) 2017-03-15
CN106506327B CN106506327B (en) 2021-02-19

Family

ID=58295096

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610888007.0A Active CN106506327B (en) 2016-10-11 2016-10-11 Junk mail identification method and device

Country Status (1)

Country Link
CN (1) CN106506327B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110091105A1 (en) * 2006-09-19 2011-04-21 Xerox Corporation Bags of visual context-dependent words for generic visual categorization
JP2010191693A (en) * 2009-02-18 2010-09-02 Nippon Telegr & Teleph Corp <Ntt> Electronic mail transmission host classification system, electronic mail transmission host classification method, and program therefor
US20110078152A1 (en) * 2009-09-30 2011-03-31 George Forman Method and system for processing text
CN102169493A (en) * 2011-04-02 2011-08-31 北京奥米时代生物技术有限公司 Method for automatically identifying experimental scheme from literatures
US20150278194A1 (en) * 2012-11-07 2015-10-01 Nec Corporation Information processing device, information processing method and medium
CN103488689A (en) * 2013-09-02 2014-01-01 新浪网技术(中国)有限公司 Mail classification method and mail classification system based on clustering
CN103744905A (en) * 2013-12-25 2014-04-23 新浪网技术(中国)有限公司 Junk mail judgment method and device
CN104834747A (en) * 2015-05-25 2015-08-12 中国科学院自动化研究所 Short text classification method based on convolution neutral network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
周尔强: "基于语义集合模型及有限状态机的垃圾邮件分类研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106934068A (en) * 2017-04-10 2017-07-07 江苏东方金钰智能机器人有限公司 The method that robot is based on the semantic understanding of environmental context
CN108694202A (en) * 2017-04-10 2018-10-23 上海交通大学 Configurable Spam Filtering System based on sorting algorithm and filter method
CN107491434A (en) * 2017-08-10 2017-12-19 北京邮电大学 Text snippet automatic generation method and device based on semantic dependency
CN107302547A (en) * 2017-08-21 2017-10-27 深信服科技股份有限公司 A kind of web service exceptions detection method and device
CN107577668A (en) * 2017-09-15 2018-01-12 电子科技大学 Social media non-standard word correcting method based on semanteme
CN107835496A (en) * 2017-11-24 2018-03-23 北京奇虎科技有限公司 A kind of recognition methods of refuse messages, device and server
CN107835496B (en) * 2017-11-24 2021-09-07 北京奇虎科技有限公司 Spam short message identification method and device and server
CN108038230A (en) * 2017-12-26 2018-05-15 北京百度网讯科技有限公司 Information generating method and device based on artificial intelligence
CN108038230B (en) * 2017-12-26 2022-05-20 北京百度网讯科技有限公司 Information generation method and device based on artificial intelligence
CN110048936A (en) * 2019-04-18 2019-07-23 合肥天毅网络传媒有限公司 A kind of method that semantic association word judges spam
CN110048936B (en) * 2019-04-18 2021-09-10 宁波青年优品信息科技有限公司 Method for judging junk mail by semantic associated words

Also Published As

Publication number Publication date
CN106506327B (en) 2021-02-19

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant