CN106506327A - A kind of spam filtering method and device - Google Patents
A spam filtering method and device
- Publication number
- CN106506327A (application CN201610888007.0A)
- Authority
- CN
- China
- Prior art keywords
- vector
- characteristic vector
- word
- sequence
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/21—Monitoring or handling of messages
- H04L51/212—Monitoring or handling of messages using filtering or selective blocking
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/42—Mailbox-related aspects, e.g. synchronisation of mailboxes
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The present invention discloses a spam filtering method and device. The method includes: extracting the text from an email to be identified and segmenting the text into words to obtain a word sequence; converting each word in the word sequence into its corresponding feature vector according to a pre-obtained correspondence between words and feature vectors, to obtain a vector sequence that contains the feature vector corresponding to each word in the word sequence; grouping the feature vectors in the vector sequence according to a preset standard to obtain several vector groups; and using the vector groups as input parameters of a classifier, so that the classifier classifies the email to be identified in combination with context dependence and obtains a classification result, which is used to determine whether the email to be identified is spam. By taking into account the influence of context dependence on email identification, the invention improves the accuracy of spam filtering.
Description
Technical field
The present invention relates to the field of data processing, and in particular to a spam filtering method and device.
Background
With the continuous development of the Internet, email has become increasingly common, and commercial promotion carried by email is also widely used, which has led to a flood of spam. Spam typically consumes a large amount of resources and suffers from problems such as poorly targeted delivery, forced delivery, and large amounts of false information. Spam has therefore always been an Internet product that users detest.
To guard against spam, current email systems embed various spam identification techniques, such as whitelists, blacklists, and content-based filtering. However, existing identification methods are essentially based only on keywords or word frequency. This single angle ignores other factors that affect filtering accuracy, so the accuracy of spam identification is insufficient.
Summary of the invention
The invention provides a spam filtering method and device that can improve the accuracy of spam filtering.
The invention provides a spam filtering method, the method including:
extracting the text from an email to be identified, and segmenting the text into words to obtain a word sequence;
converting, according to a pre-obtained correspondence between words and feature vectors, each word in the word sequence into the feature vector corresponding to that word, to obtain a vector sequence, the vector sequence containing the feature vector corresponding to each word in the word sequence;
grouping the feature vectors in the vector sequence according to a preset standard to obtain several vector groups;
using the vector groups as input parameters of a classifier, so that the classifier classifies the email to be identified in combination with context dependence and obtains a classification result, the classification result being used to determine whether the email to be identified is spam.
Preferably, grouping the feature vectors in the vector sequence according to a preset standard to obtain several vector groups includes:
grouping the feature vectors in the vector sequence with the sentence or the paragraph as the standard, to obtain several vector groups.
Preferably, the classifier is built using a convolutional neural network;
using the vector groups as input parameters of the classifier, so that the classifier classifies the email to be identified in combination with context dependence and obtains a classification result used to determine whether the email to be identified is spam, includes:
using the feature vectors in a vector group as input parameters of the first-layer convolutional neural network of the classifier to obtain the feature vector corresponding to the vector group, where the feature vector corresponding to the vector group represents the semantics of the sentence or paragraph;
using the feature vectors corresponding to the vector groups as input parameters of the second-layer convolutional neural network of the classifier to obtain the feature vector of the text in the email to be identified, where the feature vector of the text represents the semantics of the text after context dependence has been taken into account;
using the feature vector of the text in the email to be identified as the input parameter of the fully connected layer of the classifier, and obtaining a classification result after classification by the fully connected layer, the classification result being used to determine whether the email to be identified is spam.
Preferably, the first-layer convolutional neural network of the classifier includes N convolution kernels, where N is a natural number;
using the feature vectors in a vector group as input parameters of the first-layer convolutional neural network of the classifier to obtain the feature vector corresponding to the vector group, where the feature vector corresponding to the vector group represents the semantics of the sentence or paragraph, includes:
obtaining, by a one-dimensional convolution operation, the convolutional layer output of the vector group for each convolution kernel, the convolutional layer output including the results of convolving with the convolution kernel while taking each feature vector in the vector group in turn as the starting point of the convolution operation;
obtaining the maximum value in the convolutional layer output of the vector group for each convolution kernel;
combining the maximum values of the vector group's convolutional layer outputs over all convolution kernels to obtain the feature vector corresponding to the vector group.
Preferably, before converting each word in the word sequence into its corresponding feature vector according to the pre-obtained correspondence between words and feature vectors to obtain the vector sequence, the method further includes:
replacing words of a preset type in the word sequence with preset labels;
constructing a feature vector for each label in advance, and obtaining the correspondence between the label and the feature vector;
correspondingly, converting each word in the word sequence into its corresponding feature vector according to the pre-obtained correspondence between words and feature vectors to obtain the vector sequence includes:
converting, according to the pre-obtained correspondence between words and feature vectors, each word in the word sequence into the feature vector corresponding to that word; and converting, according to the correspondence between labels and feature vectors, each label in the word sequence into the feature vector corresponding to that label, to obtain the vector sequence.
Preferably, constructing a feature vector for the label in advance includes:
randomly generating a feature vector, and judging whether the Euclidean distance between this feature vector and each feature vector in the correspondence between words and feature vectors is less than a preset constant;
when the Euclidean distance between the feature vector and each of those feature vectors is less than the preset constant, assigning the feature vector to the label.
The invention also provides a spam filtering device, the device including:
a segmentation module, configured to extract the text from an email to be identified and segment the text into words to obtain a word sequence;
a conversion module, configured to convert, according to a pre-obtained correspondence between words and feature vectors, each word in the word sequence into the feature vector corresponding to that word, to obtain a vector sequence, the vector sequence containing the feature vector corresponding to each word in the word sequence;
a grouping module, configured to group the feature vectors in the vector sequence according to a preset standard to obtain several vector groups;
a classification module, configured to use the vector groups as input parameters of a classifier, so that the classifier classifies the email to be identified in combination with context dependence and obtains a classification result, the classification result being used to determine whether the email to be identified is spam.
Preferably, the grouping module is specifically configured to:
group the feature vectors in the vector sequence with the sentence or the paragraph as the standard, to obtain several vector groups.
Preferably, the classifier is built using a convolutional neural network; the classification module includes:
a first classification submodule, configured to use the feature vectors in a vector group as input parameters of the first-layer convolutional neural network of the classifier to obtain the feature vector corresponding to the vector group, where the feature vector corresponding to the vector group represents the semantics of the sentence or paragraph;
a second classification submodule, configured to use the feature vectors corresponding to the vector groups as input parameters of the second-layer convolutional neural network of the classifier to obtain the feature vector of the text in the email to be identified, where the feature vector of the text represents the semantics of the text after context dependence has been taken into account;
a third classification submodule, configured to use the feature vector of the text in the email to be identified as the input parameter of the fully connected layer of the classifier, and to obtain a classification result after classification by the fully connected layer, the classification result being used to determine whether the email to be identified is spam.
Preferably, the first-layer convolutional neural network of the classifier includes N convolution kernels, where N is a natural number;
the first classification submodule includes:
a convolution submodule, configured to obtain, by a one-dimensional convolution operation, the convolutional layer output of the vector group for each convolution kernel, the convolutional layer output including the results of convolving with the convolution kernel while taking each feature vector in the vector group in turn as the starting point of the convolution operation;
an acquisition submodule, configured to obtain the maximum value in the convolutional layer output of the vector group for each convolution kernel;
a combination submodule, configured to combine the maximum values of the vector group's convolutional layer outputs over all convolution kernels to obtain the feature vector corresponding to the vector group.
The invention provides a spam filtering method. First, the text in an email to be identified is extracted and segmented into words to obtain a word sequence, and each word in the word sequence is converted into its corresponding feature vector according to a pre-obtained correspondence between words and feature vectors, giving a vector sequence that contains the feature vector corresponding to each word in the word sequence. Second, the feature vectors in the vector sequence are grouped according to a preset standard to obtain several vector groups. Finally, the vector groups are used as input parameters of a classifier, so that the classifier classifies the email to be identified in combination with context dependence and obtains a classification result, which is used to determine whether the email to be identified is spam. Compared with prior-art spam filtering methods, the invention takes into account the influence of context dependence on email identification and improves the accuracy of spam filtering.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a spam filtering method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of a grouped vector sequence provided by an embodiment of the present invention;
Fig. 3 is a flowchart of the processing method of a classifier provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a classifier provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a spam filtering device provided by an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the scope of protection of the present application.
The context dependence of the text content of an email has a vital influence on spam identification. For example, the word "viagra" is often given a high spam weight, either by rules or through training on samples. As a result, a joke from a friend that mentions "viagra", or an email containing a serious medical discussion, may be identified as spam. This is clearly the consequence of identifying spam without considering context dependence. Spam identification methods that ignore context dependence and semantics generally have serious defects in accuracy, and their error rate is very high particularly when distinguishing normal emails in a professional field from spam in that field.
Therefore, the spam filtering method provided by the invention takes the influence of context dependence into account and can identify spam more accurately.
The specific content of the embodiments is introduced below.
An embodiment of the present invention provides a spam filtering method. Referring to Fig. 1, which is a flowchart of a spam filtering method provided by an embodiment of the present invention, the method specifically includes:
S101: Extract the text from an email to be identified, and segment the text into words to obtain a word sequence.
The spam filtering method provided by the embodiment of the present invention can be applied to terminals such as a mail gateway, a mail server, or a client. In practice, mail data on different terminals is encapsulated with specific encodings or protocols. By converting the mail data on each terminal to text in advance, the embodiment shields subsequent processing from differences between mail data from different terminals, so the system adapts well.
In addition, the embodiment identifies spam based on the text content of the email, and does not involve identification of content such as pictures and attachments in the email.
In practice, the text in the email to be identified is extracted first. Because the invention identifies spam based on the semantics of the text, after extracting the text the embodiment segments it into words to obtain a word sequence, where the word sequence is the text after word-level segmentation.
In the embodiment of the present invention, methods for segmenting text into words may include string-matching methods such as bidirectional maximum matching, methods based on hidden Markov models (HMM), and methods based on deep learning. The embodiment does not limit which method is used to segment the text; preferably, the HMM-based and deep-learning-based methods are used, as they perform better than the other methods.
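As an illustration of the string-matching family mentioned above, the following is a minimal sketch of bidirectional maximum matching. The vocabulary, the function names, and the tie-breaking rule (prefer fewer tokens, then fewer single-character tokens) are assumptions made for this example and are not specified by the patent.

```python
def forward_max_match(text, vocab, max_len):
    """Greedy left-to-right segmentation; unknown characters become 1-char tokens."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, vocab, max_len):
    """Greedy right-to-left segmentation, returned in reading order."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def bidirectional_max_match(text, vocab):
    """Run both directions and keep the result with the lower cost (assumed rule)."""
    max_len = max(map(len, vocab))
    fwd = forward_max_match(text, vocab, max_len)
    bwd = backward_max_match(text, vocab, max_len)

    def cost(tokens):
        # Fewer tokens first, then fewer single-character tokens.
        return (len(tokens), sum(1 for t in tokens if len(t) == 1))

    return fwd if cost(fwd) < cost(bwd) else bwd


# Hypothetical vocabulary: "research", "graduate student", "life", "origin".
vocab = {"研究", "研究生", "生命", "起源"}
print(bidirectional_max_match("研究生命起源", vocab))
```

In this toy case the forward pass greedily takes the longer word first and leaves a stray single character, so the backward result is kept.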
S102: According to a pre-obtained correspondence between words and feature vectors, convert each word in the word sequence into the feature vector corresponding to that word to obtain a vector sequence, the vector sequence containing the feature vector corresponding to each word in the word sequence.
In the embodiment of the present invention, the correspondence between words and feature vectors is obtained in advance and stored in the system for later lookup.
Specifically, in one implementation, the GloVe (Global Vectors for Word Representation) method can be used to train on samples obtained in advance, yielding the correspondence between words and feature vectors. The samples used by the GloVe method can be natural-language text obtained from news, web pages, and the like. The embodiment is not limited to the GloVe method for obtaining the correspondence between words and feature vectors; other existing techniques can also be used and are not described here.
It is worth emphasizing that the feature vectors in the correspondence obtained using the GloVe method satisfy the following conditions. First, the nearest neighbors of the feature vector of each word should be near-synonyms of that word; for example, the nearest neighbors of the feature vector of the word frog should be frogs, toad, litoria, leptodactylidae, rana, lizard, eleutherodactylus, and so on. Second, the feature vectors of related words have linear relationships; for example, v(queen) ≈ v(king) - v(man) + v(woman), where v() is the function that converts a word into its feature vector, and queen, king, man, and woman are related words.
In practice, according to the correspondence between words and feature vectors stored in the system, each word in the obtained word sequence is converted into its corresponding feature vector to obtain the vector sequence. The vector sequence contains the feature vector corresponding to each word in the word sequence.
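The lookup in S102 can be sketched as a plain dictionary access. The three-dimensional toy vectors, the function name, and the fallback vector for words missing from the correspondence are assumptions for illustration; the patent does not say how out-of-vocabulary words are handled.

```python
def to_vector_sequence(words, embedding, unknown):
    """Map each word to its feature vector, preserving word order.

    `embedding` is the pre-obtained word -> feature vector correspondence.
    `unknown` is a hypothetical fallback for words not in the correspondence.
    """
    return [embedding.get(word, unknown) for word in words]


# Toy correspondence with made-up 3-dimensional vectors.
embedding = {
    "cheap": [0.9, 0.1, 0.0],
    "pills": [0.8, 0.3, 0.1],
    "hello": [0.0, 0.7, 0.2],
}
unknown = [0.0, 0.0, 0.0]

vector_sequence = to_vector_sequence(["hello", "cheap", "pills", "xyz"], embedding, unknown)
print(vector_sequence)
```

The output has one vector per input word, in the same order as the word sequence.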
In a preferred embodiment, after obtaining the word sequence, the embodiment finds words of preset types in the word sequence, such as numbers and symbols, and replaces them with preset labels. For example, the date "2016-6-1" is replaced with the label "<date>".
Because words of preset types are generally irrelevant to identifying spam, uniformly replacing them with preset labels simplifies the identification process and also increases the generalization ability of the classifier: emails that differ only in certain numbers, dates, and the like are treated as the same kind of email, which simplifies processing.
In practice, the embodiment can use regular expressions to match words of preset types and replace them with preset labels. By maintaining a regular expression library, words that match an entry in the library are replaced with the corresponding label.
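A regular expression library of this kind might look like the sketch below. Only the "<date>" label and the "2016-6-1" example come from the text; the "<number>" label, the patterns themselves, and the function name are assumptions for illustration.

```python
import re

# Hypothetical regular expression library: (pattern, label) entries, checked in order.
LABEL_RULES = [
    (re.compile(r"^\d{4}-\d{1,2}-\d{1,2}$"), "<date>"),
    (re.compile(r"^\d+(\.\d+)?$"), "<number>"),
]


def replace_preset_words(words):
    """Replace every word that matches a library entry with that entry's label."""
    replaced = []
    for word in words:
        for pattern, label in LABEL_RULES:
            if pattern.match(word):
                replaced.append(label)
                break
        else:
            # No entry matched, so the word is kept unchanged.
            replaced.append(word)
    return replaced


print(replace_preset_words(["meeting", "on", "2016-6-1", "costs", "42"]))
```

Ordering the entries from most to least specific keeps a date from being partially treated as a plain number.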
In addition, because the correspondence between words and feature vectors obtained using the GloVe method does not include feature vectors for labels, after the labels are set, the GloVe method can be used to construct feature vectors for them. Specifically, a feature vector is generated at random using the GloVe method, and it is judged whether the Euclidean distance between this feature vector and the feature vector of each word in the pre-obtained correspondence between words and feature vectors is less than a preset constant. If the Euclidean distance between the feature vector and the feature vector of each word is less than the preset constant, the feature vector is assigned to the label. A feature vector is constructed for each label in this way.
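The acceptance test described above can be sketched as rejection sampling: draw a random candidate, and keep it only if its Euclidean distance to every existing word vector is below the preset constant. The sampling range, the retry limit, the seed, and all names are assumptions for this example; the patent states only the random generation and the distance condition.

```python
import math
import random


def build_label_vector(word_vectors, dim, max_distance, rng, max_tries=10000):
    """Return a random vector whose distance to every word vector is < max_distance."""
    for _ in range(max_tries):
        # Sampling each component uniformly from [-1, 1] is an assumption.
        candidate = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        if all(math.dist(candidate, vec) < max_distance for vec in word_vectors):
            return candidate
    raise RuntimeError("no candidate satisfied the distance condition")


word_vectors = [[0.0, 0.0], [0.5, 0.5]]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
date_vector = build_label_vector(word_vectors, dim=2, max_distance=3.0, rng=rng)
print(date_vector)
```

`math.dist` requires Python 3.8 or later.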
In practice, each label in the word sequence is likewise converted into its corresponding feature vector according to the correspondence between labels and feature vectors.
S103: Group the feature vectors in the vector sequence according to a preset standard to obtain several vector groups.
In the embodiment of the present invention, the preset standard may be the sentence or the paragraph, or a fixed length or a fixed number of words.
In practice, grouping the vector sequence according to the preset standard yields several vector groups, each of which contains the feature vectors assigned to it.
In practice, when the vector sequence is grouped by sentence, sentences can be recognized from the punctuation marks in the vector sequence, and the feature vectors are then grouped sentence by sentence. Fig. 2 is a schematic diagram of a vector sequence grouped by sentence. To balance the contribution of each word in a sentence to spam identification, several padding vectors are added before and after the vector group of each sentence. The number of padding vectors added on each side equals the maximum window length of the convolution kernels in the classifier minus 1.
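Sentence-based grouping with padding can be sketched as follows. The set of sentence-ending punctuation marks, the use of all-zero padding vectors, keeping the punctuation vector inside its sentence, and the function name are assumptions for illustration; the text specifies only the grouping by punctuation and the padding count (maximum window length minus 1 on each side).

```python
SENTENCE_END = {".", "!", "?", "。", "！", "？"}  # assumed punctuation set


def group_by_sentence(words, vectors, max_window):
    """Split a vector sequence into per-sentence groups and pad both ends."""
    dim = len(vectors[0]) if vectors else 0

    def padding():
        # max_window - 1 padding vectors; zero vectors are an assumption.
        return [[0.0] * dim for _ in range(max_window - 1)]

    groups, current = [], []
    for word, vec in zip(words, vectors):
        current.append(vec)
        if word in SENTENCE_END:
            groups.append(padding() + current + padding())
            current = []
    if current:  # trailing text without closing punctuation
        groups.append(padding() + current + padding())
    return groups


words = ["buy", "now", ".", "hi"]
vectors = [[1.0], [2.0], [3.0], [4.0]]
for group in group_by_sentence(words, vectors, max_window=3):
    print(group)
```

With a maximum window length of 3, each sentence gains two padding vectors on each side, so every real word appears in the same number of convolution windows.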
S104: Use the vector groups as input parameters of a classifier, so that the classifier classifies the email to be identified in combination with context dependence and obtains a classification result, the classification result being used to determine whether the email to be identified is spam.
The classifier in the embodiment of the present invention can be built from a deep neural network such as a convolutional neural network (CNN) or a recurrent neural network (RNN). Using the ability of deep neural networks to relate context when classifying the email to be identified improves the accuracy of spam identification.
In a preferred embodiment, the classifier is built from a convolutional neural network (CNN). For vector groups obtained by grouping the vector sequence by sentence or paragraph, the classifier processes them as follows. Referring to Fig. 3, which is a flowchart of the processing method of a classifier provided by an embodiment of the present invention:
S301: Use the feature vectors in a vector group as input parameters of the first-layer convolutional neural network of the classifier to obtain the feature vector corresponding to the vector group, where the feature vector corresponding to the vector group represents the semantics of the sentence or paragraph.
In practice, because this embodiment groups by sentence or paragraph, the classifier that is trained and performs classification using convolutional neural networks can consist of two convolutional neural network layers. Depending on the grouping standard, the classifier of the embodiment can also consist of three or more convolutional neural network layers. Fig. 4 is a schematic structural diagram of a classifier composed of two convolutional neural network layers, provided by an embodiment of the present invention. The first-layer convolutional neural network consists of N convolution kernels and pooling layer 1, where N is a natural number.
Specifically, a vector group obtained by grouping by sentence or paragraph is denoted $S_{1:n} = [X_1, X_2, \dots, X_n]$, where $X_n$ is the feature vector of the n-th word. That is, the vector group $S_{1:n}$ consists of the feature vectors of n words.
In practice, a one-dimensional convolution operation is first used to obtain the convolutional layer output of the vector group for each convolution kernel. The convolutional layer output includes the results of convolving with the convolution kernel while taking each feature vector in the vector group in turn as the starting point of the convolution operation.
Specifically, each feature vector $X_1, X_2, \dots, X_n$ in the vector group $S_{1:n}$ is taken in turn as the starting point of the convolution operation and convolved with each convolution kernel, giving the convolutional layer output of $S_{1:n}$ for each convolution kernel. Taking the i-th feature vector in the vector group as the starting point, the output obtained by convolving the i-th through $(i+h_j-1)$-th feature vectors with the j-th convolution kernel $W_j$ is denoted:
$$c^{i}_{m,j} = f\left(W_j \cdot X_{i:i+h_j-1} + b_j\right)$$
where the vector group is the m-th vector group obtained after grouping, $h_j$ is the window length of the j-th convolution kernel, $b_j$ is the offset, and $f(\cdot)$ is a nonlinear function such as tanh().
In practice, after each feature vector $X_1, X_2, \dots, X_n$ in the vector group $S_{1:n}$ has been taken as the starting point and convolved with the convolution kernel to obtain $c^{1}_{m,j}, c^{2}_{m,j}, \dots, c^{n}_{m,j}$, these outputs are combined to give $C_{m,j} = [c^{1}_{m,j}, c^{2}_{m,j}, \dots, c^{n}_{m,j}]$. $C_{m,j}$ is the convolutional layer output of the vector group $S_{1:n}$ for the j-th convolution kernel.
The convolutional layer output of the vector group for a convolution kernel thus includes the results of convolving with that kernel while taking each feature vector in the vector group in turn as the starting point of the convolution operation.
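The one-dimensional convolution over a vector group can be sketched in plain Python. Each kernel is represented here as a list of h weight vectors (one per window position) plus a scalar offset, and only windows that fit entirely inside the group are evaluated, on the assumption that the group has already been padded as described for S103. These representation choices and the names are assumptions for illustration.

```python
import math


def conv1d(group, kernel, bias, activation=math.tanh):
    """Slide one kernel over a vector group and return one output per window.

    group:  list of feature vectors (already padded, by assumption)
    kernel: list of h weight vectors, one per position in the window
    bias:   scalar offset added before the nonlinearity
    """
    h = len(kernel)
    outputs = []
    for i in range(len(group) - h + 1):
        total = bias
        for k in range(h):
            # Dot product of the k-th kernel row with the (i+k)-th feature vector.
            total += sum(w * x for w, x in zip(kernel[k], group[i + k]))
        outputs.append(activation(total))
    return outputs


group = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kernel = [[1.0, 0.0], [0.0, 1.0]]  # window length h = 2
print(conv1d(group, kernel, bias=0.0))
```

Each output value summarizes h consecutive words, which is how the window length controls how much local context a kernel sees.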
Then, the maximum value in the convolutional layer output of the vector group for each convolution kernel is obtained. Specifically, pooling layer 1 in Fig. 4 uses the max-out pooling method to obtain the maximum value in the vector group's convolutional layer output for each convolution kernel. The maximum value in the convolutional layer output of the m-th vector group for the j-th convolution kernel is denoted:
$$P_{m,j} = \max\left(C_{m,j}\right)$$
Finally, the maximum values of the vector group's convolutional layer outputs over all convolution kernels are combined to obtain the feature vector corresponding to the vector group. The feature vector corresponding to the m-th vector group obtained after grouping is denoted:
$$Y_m = [P_{m,1}, P_{m,2}, \dots, P_{m,N}]$$
where the first-layer convolutional neural network includes N convolution kernels, and the maximum values of the m-th vector group's convolutional layer outputs over the N convolution kernels form the feature vector $Y_m$ corresponding to the vector group.
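The pooling and combination step reduces to taking one maximum per kernel. In this minimal sketch the per-kernel convolution outputs are given directly as made-up numbers; the function name is an assumption.

```python
def pool_and_combine(conv_outputs):
    """Take the maximum of each kernel's outputs and concatenate the maxima.

    conv_outputs[j] holds the convolutional layer output of one vector group
    for kernel j; the result has exactly one component per kernel.
    """
    return [max(outputs) for outputs in conv_outputs]


# Made-up outputs of N = 3 kernels for a single vector group.
conv_outputs = [
    [0.1, 0.9, -0.3],
    [0.5, 0.2],
    [-0.7, -0.1],
]
print(pool_and_combine(conv_outputs))
```

Because each kernel contributes a single value, the resulting vector has a fixed length N regardless of how many words the sentence contains, even when kernels have different window lengths.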
S302: Use the feature vectors corresponding to the vector groups as input parameters of the second-layer convolutional neural network of the classifier to obtain the feature vector of the text in the email to be identified, where the feature vector of the text represents the semantics of the text after context dependence has been taken into account.
As shown in Fig. 4, the second-layer convolutional neural network in the classifier can consist of M convolution kernels and pooling layer 2, where M is a natural number, and its algorithm logic is the same as that of the first-layer convolutional neural network. Specifically, the feature vectors of the vector groups output by the first-layer convolutional neural network are used as the input parameters of the second-layer convolutional neural network. After processing by the M convolution kernels and pooling layer 2, the second-layer convolutional neural network finally outputs the feature vector of the text in the email to be identified.
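Since both layers share the same logic, the two-layer structure can be sketched by applying one convolution-plus-pooling routine twice: once per vector group, then once over the resulting sequence of group vectors. The kernel representation as (weights, offset) pairs, the zero padding applied per kernel, the tanh nonlinearity, and the toy numbers are assumptions for illustration.

```python
import math


def conv_max(sequence, kernels):
    """One convolution + max-pooling layer: returns one value per kernel."""
    dim = len(sequence[0])
    features = []
    for weights, bias in kernels:
        h = len(weights)
        # Simplification: pad with h - 1 zero vectors for this kernel.
        pad = [[0.0] * dim for _ in range(h - 1)]
        padded = pad + sequence + pad
        responses = []
        for i in range(len(padded) - h + 1):
            total = bias + sum(
                w * x
                for k in range(h)
                for w, x in zip(weights[k], padded[i + k])
            )
            responses.append(math.tanh(total))
        features.append(max(responses))
    return features


def encode_email(vector_groups, layer1, layer2):
    """Layer 1 turns each group into a sentence vector; layer 2 encodes the text."""
    sentence_vectors = [conv_max(group, layer1) for group in vector_groups]
    return conv_max(sentence_vectors, layer2)


# Toy input: two sentences of 2-dimensional word vectors.
groups = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]]]
layer1 = [([[1.0, 0.0]], 0.0), ([[0.0, 1.0], [1.0, 0.0]], 0.1)]  # N = 2 kernels
layer2 = [([[1.0, 1.0]], 0.0), ([[1.0, -1.0], [0.5, 0.5]], 0.0), ([[0.2, 0.2]], -0.1)]  # M = 3
print(encode_email(groups, layer1, layer2))
```

The output has M components, one per second-layer kernel, and the dimension of the sentence vectors fed to the second layer equals the number N of first-layer kernels.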
S303: Use the feature vector of the text in the mail to be identified as the input parameter of the fully connected layer of the classifier; after classification by the fully connected layer, a classification result is obtained, which is used to determine whether the mail to be identified is spam.
As shown in Fig. 4, the classifier in this embodiment of the present invention also includes a fully connected layer. The feature vector of the text in the mail to be identified, output by the second-layer convolutional neural network, serves as the input parameter of the fully connected layer, which outputs the probabilities over multiple classes through a softmax function; these probabilities can be used to determine whether the mail to be identified is spam. The algorithm logic of the fully connected layer is the same as that of a conventional neural network and is not repeated here.
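A minimal sketch of the fully connected layer with a softmax output is given below. The two-class layout, the `spam_index` position, and the 0.5 decision threshold are assumptions for illustration; the text only states that class probabilities are produced and used for the decision.

```python
import numpy as np

def softmax(z):
    # Subtract the maximum for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def classify(text_vector, weights, bias, spam_index=1, threshold=0.5):
    """Fully connected layer followed by softmax.

    weights: (C, M) array, bias: (C,) array, for C classes (assumed C = 2 here).
    Returns the class probabilities and a spam / not-spam decision.
    """
    probs = softmax(weights @ text_vector + bias)
    return probs, bool(probs[spam_index] > threshold)

weights = np.array([[1.0, 0.0], [0.0, 1.0]])
bias = np.zeros(2)
probs, is_spam = classify(np.array([0.0, 2.0]), weights, bias)
```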
In this embodiment of the present invention, before the classifier is used to identify spam, it is first trained with mail samples. The process of training the classifier with mail samples is essentially the same as the process of identifying spam with the classifier, with the following two differences. First, in the training stage the classifier performs not only the forward propagation over the mail samples, i.e. S301-S303 above, but also back propagation, whose purpose is to adjust the network parameters of each layer of the classifier (such as the weights and biases of the fully connected layer) so that the final training result is more accurate. Second, the dropout algorithm is applied to the fully connected layer of the classifier to address overfitting to the mail samples in the training stage. Specifically, during forward propagation in the training stage, the outputs of some randomly chosen hidden-layer neurons are set to 0, and these neurons do not take part in the parameter adjustment of back propagation. This reduces the dependence between neurons and mitigates overfitting of the deep neural network to the samples.
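The dropout behaviour described above can be sketched as a random mask that is applied in the forward pass and reused in the backward pass. The function names and the drop probability are hypothetical, and the rescaling of kept outputs used by some dropout variants is omitted because the text does not mention it.

```python
import numpy as np

def dropout_forward(hidden, drop_prob, rng):
    """Zero the outputs of randomly chosen hidden neurons.

    Returns the masked outputs and the boolean mask (True = neuron kept).
    """
    mask = rng.random(hidden.shape) >= drop_prob
    return hidden * mask, mask

def dropout_backward(grad_output, mask):
    # Dropped neurons receive zero gradient, so their parameters
    # are not adjusted in this back-propagation step.
    return grad_output * mask

rng = np.random.default_rng(0)
hidden = np.ones(1000)
out, mask = dropout_forward(hidden, drop_prob=0.5, rng=rng)
grad = dropout_backward(np.ones(1000), mask)
```

Dropout is applied only during training; at identification time all neurons of the fully connected layer take part.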
In the spam filtering method provided in this embodiment of the present invention, the text in the mail to be identified is first extracted and segmented into words to obtain a word sequence; according to a correspondence between words and feature vectors obtained in advance, each word in the word sequence is converted into its corresponding feature vector to obtain a vector sequence, which includes the feature vector corresponding to each word in the word sequence. Next, the feature vectors in the vector sequence are grouped according to a preset standard to obtain several vector groups. Finally, the vector groups are used as the input parameters of a classifier, so that the classifier classifies the mail to be identified with context dependence taken into account and produces a classification result, which is used to determine whether the mail to be identified is spam. Compared with prior-art spam identification methods, this embodiment takes into account the effect of context dependence on mail identification and improves the accuracy of spam filtering.
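The preprocessing steps summarized above (word segmentation, lookup of feature vectors, grouping) can be sketched as follows. This is only an illustration under simplifying assumptions: sentences are split on `.`, `!` and `?`, words are split on whitespace (Chinese text would need a word segmenter instead), and words missing from the lookup table fall back to a hypothetical `unknown` vector.

```python
import re
import numpy as np

def text_to_vector_groups(text, word_vectors, unknown):
    """Turn mail text into one array of word feature vectors per sentence.

    word_vectors: dict mapping a word to its feature vector (obtained in advance).
    unknown:      fallback vector for words not in the dict (an assumption).
    """
    groups = []
    for sentence in re.split(r"[.!?]+", text):
        words = sentence.lower().split()
        if words:
            # Vector sequence of this sentence, kept together as one group.
            groups.append(np.stack([word_vectors.get(w, unknown) for w in words]))
    return groups

word_vectors = {
    "buy": np.array([1.0, 0.0]),
    "now": np.array([0.0, 1.0]),
    "hello": np.array([1.0, 1.0]),
}
groups = text_to_vector_groups("Buy now! Hello friend.", word_vectors, np.zeros(2))
```

Each resulting array is one vector group and can be passed to the first-layer convolutional neural network of the classifier.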
An embodiment of the present invention further provides a spam filtering device. Referring to Fig. 5, which is a schematic structural diagram of a spam filtering device provided in an embodiment of the present invention, the device includes:
a segmentation module 501, configured to extract the text in the mail to be identified and segment the text into words to obtain a word sequence;
a conversion module 502, configured to convert, according to a correspondence between words and feature vectors obtained in advance, each word in the word sequence into its corresponding feature vector to obtain a vector sequence, where the vector sequence includes the feature vector corresponding to each word in the word sequence;
a grouping module 503, configured to group the feature vectors in the vector sequence according to a preset standard to obtain several vector groups;
a classification module 504, configured to use the vector groups as the input parameters of a classifier, so that the classifier classifies the mail to be identified with context dependence taken into account and produces a classification result, where the classification result is used to determine whether the mail to be identified is spam.
Specifically, the grouping module 503 is configured to:
group the feature vectors in the vector sequence by sentence or by paragraph to obtain several vector groups.
In a preferred implementation, the classifier is built from convolutional neural networks, and the classification module 504 includes:
a first classification submodule, configured to use the feature vectors in a vector group as the input parameters of the first-layer convolutional neural network of the classifier to obtain the feature vector corresponding to the vector group, where the feature vector corresponding to the vector group represents the semantics of a sentence or paragraph;
a second classification submodule, configured to use the feature vectors corresponding to the vector groups as the input parameters of the second-layer convolutional neural network of the classifier to obtain the feature vector of the text in the mail to be identified, where the feature vector of the text in the mail to be identified represents the semantics of the text after context dependence is taken into account;
a third classification submodule, configured to use the feature vector of the text in the mail to be identified as the input parameter of the fully connected layer of the classifier and, after classification by the fully connected layer, obtain a classification result, where the classification result is used to determine whether the mail to be identified is spam.
In a preferred implementation, the first-layer convolutional neural network of the classifier includes N convolution kernels, where N is a natural number;
the first classification submodule includes:
a convolution operation submodule, configured to obtain, using one-dimensional convolution, the convolutional layer output of the vector group for each convolution kernel, where the convolutional layer output includes the results of convolving the convolution kernel with the vector group, taking each feature vector in the vector group in turn as the starting position of the convolution;
an acquisition submodule, configured to obtain the maximum value in the vector group's convolutional layer output for each convolution kernel;
a combination submodule, configured to combine the maxima of the vector group's convolutional layer outputs over all convolution kernels to obtain the feature vector corresponding to the vector group.
The spam filtering device provided in this embodiment of the present invention can implement the following functions: extracting the text in the mail to be identified and segmenting it into words to obtain a word sequence; converting, according to a correspondence between words and feature vectors obtained in advance, each word in the word sequence into its corresponding feature vector to obtain a vector sequence, which includes the feature vector corresponding to each word in the word sequence; grouping the feature vectors in the vector sequence according to a preset standard to obtain several vector groups; and using the vector groups as the input parameters of a classifier, so that the classifier classifies the mail to be identified with context dependence taken into account and produces a classification result, which is used to determine whether the mail to be identified is spam. Compared with prior-art spam identification methods, this embodiment takes into account the effect of context dependence on mail identification and improves the accuracy of spam filtering.
Since the device embodiment essentially corresponds to the method embodiment, refer to the description of the method embodiment for the relevant parts. The device embodiment described above is merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement it without creative effort.
It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
The spam filtering method and device provided in the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is only intended to help in understanding the method of the present invention and its core idea. Meanwhile, a person of ordinary skill in the art may, based on the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (10)
1. A spam filtering method, characterized in that the method includes:
extracting the text in a mail to be identified, and segmenting the text into words to obtain a word sequence;
converting, according to a correspondence between words and feature vectors obtained in advance, each word in the word sequence into its corresponding feature vector to obtain a vector sequence, where the vector sequence includes the feature vector corresponding to each word in the word sequence;
grouping the feature vectors in the vector sequence according to a preset standard to obtain several vector groups;
using the vector groups as the input parameters of a classifier, so that the classifier classifies the mail to be identified with context dependence taken into account and produces a classification result, where the classification result is used to determine whether the mail to be identified is spam.
2. The spam filtering method according to claim 1, characterized in that grouping the feature vectors in the vector sequence according to a preset standard to obtain several vector groups includes:
grouping the feature vectors in the vector sequence by sentence or by paragraph to obtain several vector groups.
3. The spam filtering method according to claim 2, characterized in that the classifier is built from convolutional neural networks;
using the vector groups as the input parameters of the classifier, so that the classifier classifies the mail to be identified with context dependence taken into account and produces a classification result used to determine whether the mail to be identified is spam, includes:
using the feature vectors in a vector group as the input parameters of the first-layer convolutional neural network of the classifier to obtain the feature vector corresponding to the vector group, where the feature vector corresponding to the vector group represents the semantics of a sentence or paragraph;
using the feature vectors corresponding to the vector groups as the input parameters of the second-layer convolutional neural network of the classifier to obtain the feature vector of the text in the mail to be identified, where the feature vector of the text in the mail to be identified represents the semantics of the text after context dependence is taken into account;
using the feature vector of the text in the mail to be identified as the input parameter of the fully connected layer of the classifier and, after classification by the fully connected layer, obtaining a classification result, where the classification result is used to determine whether the mail to be identified is spam.
4. The spam filtering method according to claim 3, characterized in that the first-layer convolutional neural network of the classifier includes N convolution kernels, where N is a natural number;
using the feature vectors in a vector group as the input parameters of the first-layer convolutional neural network of the classifier to obtain the feature vector corresponding to the vector group, where the feature vector corresponding to the vector group represents the semantics of a sentence or paragraph, includes:
obtaining, using one-dimensional convolution, the convolutional layer output of the vector group for each convolution kernel, where the convolutional layer output includes the results of convolving the convolution kernel with the vector group, taking each feature vector in the vector group in turn as the starting position of the convolution;
obtaining the maximum value in the vector group's convolutional layer output for each convolution kernel;
combining the maxima of the vector group's convolutional layer outputs over all convolution kernels to obtain the feature vector corresponding to the vector group.
5. The spam filtering method according to any one of claims 1-4, characterized in that, before converting each word in the word sequence into its corresponding feature vector according to the correspondence between words and feature vectors obtained in advance to obtain the vector sequence, the method further includes:
replacing words of a preset type in the word sequence with preset labels;
constructing a feature vector for each label in advance, and obtaining a correspondence between the labels and these feature vectors;
correspondingly, converting each word in the word sequence into its corresponding feature vector according to the correspondence between words and feature vectors obtained in advance to obtain the vector sequence includes:
converting, according to the correspondence between words and feature vectors obtained in advance, each word in the word sequence into its corresponding feature vector; and converting, according to the correspondence between the labels and the feature vectors, each label in the word sequence into its corresponding feature vector, to obtain the vector sequence.
6. The spam filtering method according to claim 5, characterized in that constructing a feature vector for the label in advance includes:
randomly generating a feature vector, and determining whether the Euclidean distance between that feature vector and each feature vector in the correspondence between words and feature vectors is less than a preset constant;
when the Euclidean distance between that feature vector and each of said feature vectors is less than the preset constant, assigning the feature vector to the label.
7. A spam filtering device, characterized in that the device includes:
a segmentation module, configured to extract the text in a mail to be identified and segment the text into words to obtain a word sequence;
a conversion module, configured to convert, according to a correspondence between words and feature vectors obtained in advance, each word in the word sequence into its corresponding feature vector to obtain a vector sequence, where the vector sequence includes the feature vector corresponding to each word in the word sequence;
a grouping module, configured to group the feature vectors in the vector sequence according to a preset standard to obtain several vector groups;
a classification module, configured to use the vector groups as the input parameters of a classifier, so that the classifier classifies the mail to be identified with context dependence taken into account and produces a classification result, where the classification result is used to determine whether the mail to be identified is spam.
8. The spam filtering device according to claim 7, characterized in that the grouping module is configured to:
group the feature vectors in the vector sequence by sentence or by paragraph to obtain several vector groups.
9. The spam filtering device according to claim 8, characterized in that the classifier is built from convolutional neural networks, and the classification module includes:
a first classification submodule, configured to use the feature vectors in a vector group as the input parameters of the first-layer convolutional neural network of the classifier to obtain the feature vector corresponding to the vector group, where the feature vector corresponding to the vector group represents the semantics of a sentence or paragraph;
a second classification submodule, configured to use the feature vectors corresponding to the vector groups as the input parameters of the second-layer convolutional neural network of the classifier to obtain the feature vector of the text in the mail to be identified, where the feature vector of the text in the mail to be identified represents the semantics of the text after context dependence is taken into account;
a third classification submodule, configured to use the feature vector of the text in the mail to be identified as the input parameter of the fully connected layer of the classifier and, after classification by the fully connected layer, obtain a classification result, where the classification result is used to determine whether the mail to be identified is spam.
10. The spam filtering device according to claim 9, characterized in that the first-layer convolutional neural network of the classifier includes N convolution kernels, where N is a natural number;
the first classification submodule includes:
a convolution operation submodule, configured to obtain, using one-dimensional convolution, the convolutional layer output of the vector group for each convolution kernel, where the convolutional layer output includes the results of convolving the convolution kernel with the vector group, taking each feature vector in the vector group in turn as the starting position of the convolution;
an acquisition submodule, configured to obtain the maximum value in the vector group's convolutional layer output for each convolution kernel;
a combination submodule, configured to combine the maxima of the vector group's convolutional layer outputs over all convolution kernels to obtain the feature vector corresponding to the vector group.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610888007.0A CN106506327B (en) | 2016-10-11 | 2016-10-11 | Junk mail identification method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106506327A true CN106506327A (en) | 2017-03-15 |
CN106506327B CN106506327B (en) | 2021-02-19 |
Family
ID=58295096
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610888007.0A Active CN106506327B (en) | 2016-10-11 | 2016-10-11 | Junk mail identification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106506327B (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106934068A (en) * | 2017-04-10 | 2017-07-07 | 江苏东方金钰智能机器人有限公司 | The method that robot is based on the semantic understanding of environmental context |
CN107302547A (en) * | 2017-08-21 | 2017-10-27 | 深信服科技股份有限公司 | A kind of web service exceptions detection method and device |
CN107491434A (en) * | 2017-08-10 | 2017-12-19 | 北京邮电大学 | Text snippet automatic generation method and device based on semantic dependency |
CN107577668A (en) * | 2017-09-15 | 2018-01-12 | 电子科技大学 | Social media non-standard word correcting method based on semanteme |
CN107835496A (en) * | 2017-11-24 | 2018-03-23 | 北京奇虎科技有限公司 | A kind of recognition methods of refuse messages, device and server |
CN108038230A (en) * | 2017-12-26 | 2018-05-15 | 北京百度网讯科技有限公司 | Information generating method and device based on artificial intelligence |
CN108694202A (en) * | 2017-04-10 | 2018-10-23 | 上海交通大学 | Configurable Spam Filtering System based on sorting algorithm and filter method |
CN110048936A (en) * | 2019-04-18 | 2019-07-23 | 合肥天毅网络传媒有限公司 | A kind of method that semantic association word judges spam |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2010191693A (en) * | 2009-02-18 | 2010-09-02 | Nippon Telegr & Teleph Corp <Ntt> | Electronic mail transmission host classification system, electronic mail transmission host classification method, and program therefor |
US20110078152A1 (en) * | 2009-09-30 | 2011-03-31 | George Forman | Method and system for processing text |
US20110091105A1 (en) * | 2006-09-19 | 2011-04-21 | Xerox Corporation | Bags of visual context-dependent words for generic visual categorization |
CN102169493A (en) * | 2011-04-02 | 2011-08-31 | 北京奥米时代生物技术有限公司 | Method for automatically identifying experimental scheme from literatures |
CN103488689A (en) * | 2013-09-02 | 2014-01-01 | 新浪网技术(中国)有限公司 | Mail classification method and mail classification system based on clustering |
CN103744905A (en) * | 2013-12-25 | 2014-04-23 | 新浪网技术(中国)有限公司 | Junk mail judgment method and device |
CN104834747A (en) * | 2015-05-25 | 2015-08-12 | 中国科学院自动化研究所 | Short text classification method based on convolution neutral network |
US20150278194A1 (en) * | 2012-11-07 | 2015-10-01 | Nec Corporation | Information processing device, information processing method and medium |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110091105A1 (en) * | 2006-09-19 | 2011-04-21 | Xerox Corporation | Bags of visual context-dependent words for generic visual categorization |
JP2010191693A (en) * | 2009-02-18 | 2010-09-02 | Nippon Telegr & Teleph Corp <Ntt> | Electronic mail transmission host classification system, electronic mail transmission host classification method, and program therefor |
US20110078152A1 (en) * | 2009-09-30 | 2011-03-31 | George Forman | Method and system for processing text |
CN102169493A (en) * | 2011-04-02 | 2011-08-31 | 北京奥米时代生物技术有限公司 | Method for automatically identifying experimental scheme from literatures |
US20150278194A1 (en) * | 2012-11-07 | 2015-10-01 | Nec Corporation | Information processing device, information processing method and medium |
CN103488689A (en) * | 2013-09-02 | 2014-01-01 | 新浪网技术(中国)有限公司 | Mail classification method and mail classification system based on clustering |
CN103744905A (en) * | 2013-12-25 | 2014-04-23 | 新浪网技术(中国)有限公司 | Junk mail judgment method and device |
CN104834747A (en) * | 2015-05-25 | 2015-08-12 | 中国科学院自动化研究所 | Short text classification method based on convolution neutral network |
Non-Patent Citations (1)
Title |
---|
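周尔强 (Zhou Erqiang): "基于语义集合模型及有限状态机的垃圾邮件分类研究" (Research on spam classification based on a semantic set model and finite state machines), 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology) * |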
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106934068A (en) * | 2017-04-10 | 2017-07-07 | 江苏东方金钰智能机器人有限公司 | The method that robot is based on the semantic understanding of environmental context |
CN108694202A (en) * | 2017-04-10 | 2018-10-23 | 上海交通大学 | Configurable Spam Filtering System based on sorting algorithm and filter method |
CN107491434A (en) * | 2017-08-10 | 2017-12-19 | 北京邮电大学 | Text snippet automatic generation method and device based on semantic dependency |
CN107302547A (en) * | 2017-08-21 | 2017-10-27 | 深信服科技股份有限公司 | A kind of web service exceptions detection method and device |
CN107577668A (en) * | 2017-09-15 | 2018-01-12 | 电子科技大学 | Social media non-standard word correcting method based on semanteme |
CN107835496A (en) * | 2017-11-24 | 2018-03-23 | 北京奇虎科技有限公司 | A kind of recognition methods of refuse messages, device and server |
CN107835496B (en) * | 2017-11-24 | 2021-09-07 | 北京奇虎科技有限公司 | Spam short message identification method and device and server |
CN108038230A (en) * | 2017-12-26 | 2018-05-15 | 北京百度网讯科技有限公司 | Information generating method and device based on artificial intelligence |
CN108038230B (en) * | 2017-12-26 | 2022-05-20 | 北京百度网讯科技有限公司 | Information generation method and device based on artificial intelligence |
CN110048936A (en) * | 2019-04-18 | 2019-07-23 | 合肥天毅网络传媒有限公司 | A kind of method that semantic association word judges spam |
CN110048936B (en) * | 2019-04-18 | 2021-09-10 | 宁波青年优品信息科技有限公司 | Method for judging junk mail by semantic associated words |
Also Published As
Publication number | Publication date |
---|---|
CN106506327B (en) | 2021-02-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106506327A (en) | A kind of spam filtering method and device | |
US20230013306A1 (en) | Sensitive Data Classification | |
CN106383815B (en) | In conjunction with the neural network sentiment analysis method of user and product information | |
CN105589948B (en) | A kind of reference citation network visualization and literature recommendation method and system | |
Li et al. | Weakly supervised user profile extraction from twitter | |
CN107038480A (en) | A kind of text sentiment classification method based on convolutional neural networks | |
CN107025284A (en) | The recognition methods of network comment text emotion tendency and convolutional neural networks model | |
CN110008338A (en) | A kind of electric business evaluation sentiment analysis method of fusion GAN and transfer learning | |
CN104598611B (en) | The method and system being ranked up to search entry | |
CN105975984B (en) | Network quality evaluation method based on evidence theory | |
CN103218444B (en) | Based on semantic method of Tibetan language webpage text classification | |
CN108804689A (en) | The label recommendation method of the fusion hidden connection relation of user towards answer platform | |
CN104142995B (en) | The social event recognition methods of view-based access control model attribute | |
CN110134765A (en) | A kind of dining room user comment analysis system and method based on sentiment analysis | |
CN109918560A (en) | A kind of answering method and device based on search engine | |
CN107908715A (en) | Microblog emotional polarity discriminating method based on Adaboost and grader Weighted Fusion | |
CN109670542A (en) | A kind of false comment detection method based on comment external information | |
CN108665064A (en) | Neural network model training, object recommendation method and device | |
CN110096575B (en) | Psychological portrait method facing microblog user | |
CN106599054A (en) | Method and system for title classification and push | |
CN103257957A (en) | Chinese word segmentation based text similarity identifying method and device | |
CN103559199B (en) | Method for abstracting web page information and device | |
CN109871485A (en) | A kind of personalized recommendation method and device | |
CN107025239A (en) | The method and apparatus of filtering sensitive words | |
CN108492290A (en) | Image evaluation method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||