CN109947936A - Method for dynamically detecting spam based on machine learning - Google Patents

Method for dynamically detecting spam based on machine learning

Info

Publication number
CN109947936A
CN109947936A (application number CN201810952482.9A; granted as CN109947936B)
Authority
CN
China
Prior art keywords
topic
model
spam
text
mail
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810952482.9A
Other languages
Chinese (zh)
Other versions
CN109947936B (en)
Inventor
Wen Weiping (文伟平)
Feng Chao (冯超)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN201810952482.9A
Publication of CN109947936A
Application granted
Publication of CN109947936B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Character Discrimination (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method for dynamically identifying spam based on machine learning, relating to spam-filtering technology. By using an LDA topic model, an Autoencoder, and a custom linear model, the method realizes dynamic spam detection and achieves the goal of efficient spam identification. It includes: preprocessing the template mails and sample mails; training on the text with the LDA topic model; compressing the one-hot bag-of-words vectors with the autoencoder and converting them into word vectors; and creating a linear model through which the mails to be identified are predicted. The technical solution of the invention can dynamically detect and efficiently identify spam.

Description

Method for dynamically detecting spam based on machine learning
Technical field
The present invention relates to spam-filtering technology, and more particularly to a method for dynamically detecting spam based on machine learning.
Background art
With the development of the big-data era, user-data leaks occur more and more often, and large-scale leaks of e-mail addresses emerge one after another. Profit-seeking groups often use the harvested mailboxes to batch-send spam such as commercial advertisements, which seriously reduces the working efficiency of e-mail, occupies mailbox storage space, and directly degrades the mailbox user experience.
Existing spam-detection methods mainly classify mail with conventional statistical models over manually extracted text features. Spam-sending techniques, however, keep upgrading, and such detection methods based on manually extracted text features cannot efficiently intercept new types of spam.
Summary of the invention
To overcome the above deficiencies of the prior art, the present invention provides a method for dynamically detecting spam based on machine learning, namely AAS (Autoencoder Anti-Spam System). By using an LDA topic model, an Autoencoder, and a custom linear model, it can identify spam more accurately, ensuring that the mailbox operates normally and efficiently and that users are not harassed by spam.
The core of the invention is: by using a Latent Dirichlet Allocation (LDA) topic model, an Autoencoder, and a custom linear model, spam is detected dynamically and identified more accurately. The LDA model is an unsupervised machine-learning method involving the gamma function, the beta distribution, the Dirichlet distribution, conjugate priors, and Gibbs sampling; by learning from a massive document collection, it infers the topic distribution of a specific document. The Autoencoder is an unsupervised neural-network model that can compress information effectively; the present invention uses it to generate word vectors. The linear part of AAS comprises two stages, voting and prediction. In the voting stage, weights over the topic distribution are selected through a Softmax function; in the prediction stage, a sample mail is matched against the template mails (a set containing multiple representative spam and normal mails), and a Sigmoid function predicts whether the result is spam.
The technical solution provided by the invention is as follows:
A method for dynamically identifying spam based on machine learning, which uses an LDA topic model, an Autoencoder, and a custom linear model to detect spam dynamically and achieve the goal of efficient spam identification; it includes the following steps:
A. Preprocess the template mails and sample mails, performing the following operations:
A1. Segment Chinese template and sample mails with the jieba toolkit; segment English template and sample mails with the spaCy toolkit;
A2. Remove Chinese and English stop words, and strip the punctuation from the text with Python's re module;
A3. Convert the segmented sentences into one-hot encoding vectors with the sklearn toolkit.
B. Train on the text with the open-source GibbsLDA++:
B1. Set the parameters of the Dirichlet distributions and the number of topics;
B2. Set the number of iterations;
B3. Save the training results and observe at which topic count the text clusters best by topic; use that topic count as the number of units of the Autoencoder's topic layer.
C. Compress the one-hot encoding vectors obtained in step A3 with the autoencoder (AutoEncoder) model, converting them into word vectors (embeddings):
C1. Build the model with the TensorFlow deep-learning framework, setting the numbers of units of the input layer, hidden layer, topic layer, and output layer. The training process is expressed as:
p(z_l | v) = softmax(-F(v, z_l)) (formula 1)
where v is the one-hot encoding generated in step A3; p(z_l | v) is the probability of the l-th topic of the topic set z given v; z is the set of topics; l indexes a topic; z_l denotes the l-th topic of z;
d_l is a scalar randomly initialized by the computer for topic l; v_k is the generated one-hot encoding vector over a vocabulary of size k; W_l^{jk} is the j-th scalar parameter randomly initialized by the computer for topic l under vocabulary size k;
p(h_j | v, z_l) is the probability of the j-th coordinate of the word vector generated after dimensionality reduction, h_j being the j-th coordinate of the hidden layer h; σ is the sigmoid function (given explicitly in the specific embodiment).
C2. Use the squared loss function as the model's loss function and optimize the model with the AdamOptimizer method;
C3. Use cross-validation to select the best model (i.e., the model trained by the AutoEncoder method); with this model, take the one-hot encoding vectors as input and obtain the reduced-dimension word vectors;
D. Create the linear model and use it to predict mails:
The linear model comprises two stages, voting and prediction. In the voting stage, weights over the topic distribution are selected through the Softmax function; in the prediction stage, the sample mail is matched against the template mails and the result is predicted through the Sigmoid function. Specifically, perform the following operations:
D1. Apply the softmax function to the topic distribution returned by the autoencoder; this serves as the voting model.
The topic distribution returned by the autoencoder is one part of the input of formula 3, which, in the form implied by the definitions of this section, is
ŷ = σ( Σ_{l∈L} p(z_l | q) · s_l(q, r) · p_l(yes | q, r) ) (formula 3)
where:
σ: the sigmoid function;
p_l(yes | q, r): the template-match probability of step D3;
s_l(q, r) = cos(z_q, z_r), where z_q and z_r denote the probability distributions of sample mail q and template mail r over the topic set z, and s_l(q, r) is the cosine similarity of sample mail q and template mail r under the l-th topic of topic set z.
D2. Compute the cosine similarity between each sample mail and the labeled template mails as the feature s_l(q, r);
Labeled template mails: spam templates are labeled 0 and normal-mail templates are labeled 1 (the label is the value y of step D4);
D3. Take the dot product of the sample word vectors and template word vectors generated in step C3, and pass the result through the sigmoid function to produce a group of probability values between 0 and 1 whose dimension equals the number of template mails (each template mail being represented by its word vector);
D4. Take the weighted average of the probability from the voting model of step D1 and the sigmoid probabilities of step D3 to obtain the predicted value ŷ (a decimal between 0 and 1); the true label of a template mail is y (0 for spam, 1 for normal mail);
D5. Finally, use the cross-entropy formula as the model's loss function (a function of y and ŷ), optimize the model with the AdamOptimizer method, train for 200-500 rounds, and save the model.
D6. Using the model and template mails saved in step D5, whenever a new mail enters the trained AAS system, a value of 0 or 1 is obtained (0 means the mail is not spam, 1 means it is spam).
Compared with the prior art, the beneficial effects of the present invention are:
The present invention provides a method for dynamically detecting spam based on machine learning; by using an LDA topic model, an Autoencoder, and a custom linear model, it detects spam dynamically and identifies it efficiently.
The technical advantages of the invention include:
First, the invention uses deep-learning methods, which improves the accuracy of spam screening;
Second, the linear model has low computational complexity, which speeds up mail screening;
Finally, the technical solution is an extensible method/system: if a new expression form of spam appears, one only needs to add the new spam to the template-mail set and train a new model again following the steps above.
Description of the drawings
Fig. 1 is a flow diagram of the method of the present invention.
Specific embodiment
With reference to the accompanying drawing, the present invention is further described below through an embodiment, without limiting the scope of the invention in any way.
A specific embodiment of the invention is as follows:
A. When preprocessing the sample and template mails (the latter including multiple spam and normal mails), perform the following operations (a short code sketch follows the list):
A1. Find the jieba segmentation package on GitHub and install it;
A2. Use the jieba.cut method in precise mode to split the strings to be segmented in the text, and remove the punctuation from the text with the re module of Python;
A3. Import the Chinese stop-word vocabulary in the preprocessing code, remove the stop words remaining after segmentation, and generate a string S;
A4. Install the sklearn toolkit and process string S with the CountVectorizer class of sklearn.feature_extraction.text to generate the one-hot encoding of the string, which serves as the input data of step C.
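By way of illustration only, steps A1-A4 might be sketched as follows; the mail texts and the stop-word list are assumed placeholders, not part of the patent:

```python
# Minimal sketch of preprocessing steps A1-A4.
# The sample mails and stop-word list below are assumed for the example.
import re
import jieba
from sklearn.feature_extraction.text import CountVectorizer

stop_words = {"的", "了", "和"}                       # assumed stop words (A3)
mails = ["这是一封测试邮件的正文!", "再来一封广告邮件的正文。"]  # assumed inputs

def preprocess(text):
    text = re.sub(r"[^\w\s]", "", text)               # A2: strip punctuation via re
    tokens = jieba.cut(text, cut_all=False)           # A1: jieba precise-mode cut
    kept = [t for t in tokens if t.strip() and t not in stop_words]  # A3
    return " ".join(kept)                             # string S

corpus = [preprocess(m) for m in mails]
# A4: CountVectorizer over S; binary=True yields presence (one-hot style) vectors,
# and the relaxed token_pattern keeps single-character Chinese tokens.
vectorizer = CountVectorizer(binary=True, token_pattern=r"(?u)\b\w+\b")
one_hot = vectorizer.fit_transform(corpus)            # input data for step C
print(vectorizer.get_feature_names_out())
print(one_hot.toarray())
```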
B. When training on the text with GibbsLDA++, perform the following operations (an invocation sketch follows the list):
B1. Download and compile the open-source GibbsLDA++ source code;
B2. Prepare the training corpus as a dat file whose content is: the first line is the total number of sample mails, and every following line through the last is the text of one sample mail, fully segmented and with stop words removed;
B3. Set the hyperparameter alpha (the document-topic distribution hyperparameter, defaulting to 50/number of topics) and beta (the topic-word distribution hyperparameter, defaulting to 0.1);
B4. Set the number of topics (default: 100) and the number of iterations (default: 1000);
B5. Set the number of keywords to retain under each topic (default: 20) and the storage path of the generated files; the parameters of steps B4 and B5 are obtained from the data distribution of the existing template samples and from experience, and several attempts are usually needed before training yields a good model;
B6. Train on the text with GibbsLDA++ using the above parameters. After the configured number of training iterations, the file model-final.twords is generated, recording the distribution of the keywords clustered under each topic (the number of keywords per topic being the parameter of step B5). Finally, the GibbsLDA++ result is evaluated manually; the passing standard depends on the concrete requirements, and in this patent the standard is that the keyword clusters under the topic distribution match 70% of the sample distribution. If the standard is met, the topic count (a scalar) set in step B4 is used as the dimension of the topic layer in step C2; otherwise step B is repeated until the training result matches the data distribution of the template-mail topics.
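As a sketch, the estimation run of steps B3-B6 could be driven from Python as below; the binary location and corpus path are assumptions, while the command-line flags (-est, -alpha, -beta, -ntopics, -niters, -twords, -dfile) are those of the GibbsLDA++ lda tool:

```python
# Sketch of a GibbsLDA++ estimation run (steps B3-B6).
# The binary path and corpus file are assumed placeholders.
import subprocess

ntopics = 100                                    # B4 default
subprocess.run([
    "./GibbsLDA++/src/lda", "-est",
    "-alpha", str(50.0 / ntopics),               # B3: doc-topic hyperparameter
    "-beta", "0.1",                              # B3: topic-word hyperparameter
    "-ntopics", str(ntopics),                    # B4: number of topics
    "-niters", "1000",                           # B4: Gibbs sampling iterations
    "-twords", "20",                             # B5: keywords kept per topic
    "-dfile", "mails/trndocs.dat",               # B2: first line = mail count
], check=True)

# B6: model-final.twords (written next to the corpus) lists the top
# keywords per topic for the manual evaluation described above.
with open("mails/model-final.twords", encoding="utf-8") as f:
    print(f.read()[:400])
```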
C. Use the autoencoder to compress the one-hot (one_hot) encodings into word vectors. Perform the following operations:
C1. Install the TensorFlow deep-learning framework;
C2. Set the numbers of neurons of the four layers: input, hidden, topic, and output, where input layer = output layer = one_hot dimensionality (v: the input-layer vector, K = |v|: the input dimensionality, k: the index of the k-th input component), topic layer = the GibbsLDA topic count (z: the topic-layer representation, l: the index of the l-th topic), and the hidden layer is set between 50 and 200 units (h: the hidden-layer representation, J = |h|: the hidden dimensionality, j: the index of the j-th hidden component); W is a randomly initialized three-dimensional tensor indexed by vocabulary entry k, hidden unit j, and topic l. The result of formula 2 is the word vector generated by the autoencoder, which serves as the input of step D;
The specific formulas are:
p(z_l | v) = softmax(-F(v, z_l)) (formula 1)
where:
p(z_l | v): the probability of the l-th topic of topic set z given v (the one_hot encoding generated in step A3; v_k is equivalent to v, likewise below);
z: the set of topics;
l: the index of a topic;
z_l: the l-th topic of topic set z;
softmax function: f(x) = e^x / Σ e^x;
d_l: a scalar randomly initialized by the computer for topic l;
v_k: the generated one_hot encoding vector over a vocabulary of size k;
W_l^{jk}: the j-th scalar parameter randomly initialized by the computer for topic l under vocabulary size k;
σ(x) = 1/[1 + exp(-x)], i.e. the sigmoid function;
h_j: the j-th coordinate of the hidden layer h.
In the form implied by these definitions, the energy is F(v, z_l) = -d_l - Σ_j log(1 + exp(Σ_k W_l^{jk} v_k)), and formula 2, giving the hidden representation, is
p(h_j | v, z_l) = σ( Σ_k W_l^{jk} v_k ) (formula 2)
C3. Use the squared-error loss function as the model's loss function and optimize the model with the AdamOptimizer method;
C4. Select the optimal model by cross-validation and generate the word vectors (embeddings) of the mails with it; a minimal TensorFlow sketch of steps C1-C4 follows.
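A minimal sketch of such an autoencoder in the TensorFlow 1.x style API (the style of tf.train.AdamOptimizer cited in step D5) is given below; the layer sizes and toy batch are assumptions, and a plain squared-error autoencoder is shown rather than the exact energy-based formulation of formulas 1-2:

```python
# Sketch of the input / hidden / topic / hidden / output autoencoder (step C).
# Sizes and data are assumed; this is not the patent's exact model.
import numpy as np
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

K, J, L = 1000, 100, 100   # vocab, hidden, topic sizes (assumed; L from step B)

v = tf.placeholder(tf.float32, [None, K])               # one_hot input (step A4)
h = tf.layers.dense(v, J, tf.nn.sigmoid)                # hidden layer: word vector
z = tf.layers.dense(h, L, tf.nn.softmax)                # topic layer: p(z_l | v)
h_dec = tf.layers.dense(z, J, tf.nn.sigmoid)
v_hat = tf.layers.dense(h_dec, K)                       # reconstruction

loss = tf.reduce_mean(tf.square(v_hat - v))             # C3: squared-error loss
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)  # C3: AdamOptimizer

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch = np.random.randint(0, 2, (32, K)).astype(np.float32)  # toy one-hots
    for _ in range(200):
        sess.run(train_op, {v: batch})
    emb, topics = sess.run([h, z], {v: batch})   # word vectors + topic distribution
    print(emb.shape, topics.shape)               # inputs for the linear model (D)
```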
D. Create the linear model and make predictions with it; the specific method is as follows.
D1. Pass a sample mail through formula 1 to obtain its topic distribution, which serves as the first part of the input of formula 3, namely:
Σ_{l∈L} p(z_l | q);
where q is a sample mail; r is a template mail; R is the template-mail set; z is the topic set; l is a topic in the set; L is equivalent to z and also denotes the topic set.
M matrix: a randomly initialized matrix of size h×h.
In the form implied by these definitions, formula 3 is
ŷ = σ( Σ_{l∈L} p(z_l | q) · s_l(q, r) · p_l(yes | q, r) ) (formula 3)
where:
σ: the sigmoid function, see formula 2;
s_l(q, r) = cos(z_q, z_r), where z_q and z_r denote the probability distributions of sample mail q and template mail r over the topic set z, and s_l(q, r) is the cosine similarity of sample mail q and template mail r under the l-th topic of topic set z.
D2. Use the sklearn toolkit to compute the cosine similarity between the topic distribution of a template mail and the topic distribution of a sample mail, normalized; the specific formula is s_l(q, r);
D3. Take the sample word vector e_q obtained in step C and the template word vector e_r and multiply them through the matrix M; the specific formula is p_l(yes | q, r), i.e., in the form implied, p_l(yes | q, r) = σ(e_q^T M e_r);
D4. Combine the results of D1-D3 through the product of formula 3;
D5. Use the cross-entropy loss function as the model's loss function and optimize the model with the AdamOptimizer method;
where the cross-entropy function is H(y | ŷ) = -Σ y · log ŷ, and the AdamOptimizer method can be realized with the AdamOptimizer class provided by the TensorFlow software, i.e. tf.train.AdamOptimizer.
D6. Finally, use the cross-entropy formula as the model's loss function (a function of y and ŷ), optimize the model with the AdamOptimizer method, train for 200-500 rounds, and save the trained model parameters and the existing template mails.
When a new mail enters the AAS system, the computation of formulas 1-3 yields the score of the newly input mail, i.e. the value 0 or 1; 0 means it is not spam, and 1 means it is spam. A toy sketch of this scoring computation follows.
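Putting steps D1-D4 together, the scoring computation can be sketched as follows; all inputs, the 0.5/0.5 averaging weights, and the reduction of the voting model to a single probability are assumptions made for the example:

```python
# Sketch of the voting + prediction stages (steps D1-D4) on toy inputs.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
J, L = 100, 50                              # hidden and topic sizes (assumed)
M = rng.normal(size=(J, J))                 # D1: random J x J matrix M

z_q = rng.dirichlet(np.ones(L))             # topic distribution of sample q
z_r = rng.dirichlet(np.ones(L))             # topic distribution of template r
e_q = rng.normal(size=J)                    # word vector of q (from step C)
e_r = rng.normal(size=J)                    # word vector of template r

vote = float(z_q.max())                     # D1: voting probability (assumed:
                                            # the top softmax weight)
s_qr = cosine_similarity(z_q[None, :], z_r[None, :])[0, 0]  # D2: s_l(q, r)
p_match = sigmoid(e_q @ M @ e_r)            # D3: sigmoid of the bilinear form

y_hat = 0.5 * vote + 0.5 * (s_qr * p_match) # D4: weighted average -> ŷ
label = int(round(y_hat))                   # thresholded to the final 0/1 output
print(y_hat, label)
```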
It should be noted that the purpose of publishing the embodiment is to help further understand the present invention, but those skilled in the art will understand that various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to what the embodiment discloses, and the scope of protection of the invention is defined by the claims.

Claims (4)

1. A method for dynamically identifying spam based on machine learning, which uses an LDA topic model and an Autoencoder and creates a linear model, realizing dynamic spam detection and achieving the goal of efficient spam identification; comprising the following steps:
A. Preprocess the template mails and sample mails, performing the following operations:
A1. Segment the template mails and sample mails;
A2. Remove the stop words in the text and strip the punctuation from the Chinese text;
A3. Convert the segmented sentences into one-hot encoding vectors;
B. Train on the text with the LDA topic model, performing the following operations:
B1. Set the parameters of the LDA topic model and the number of topics;
B2. Set the number of iterations and train on the text;
B3. Save the training results and obtain the topic count at which the text clusters best by topic; use that topic count as the number of units of the Autoencoder's topic layer;
C. Compress the one_hot encoding vectors obtained in step A3 with the AutoEncoder model, converting them into word vectors (embeddings); perform the following operations:
C1. Build the model with the TensorFlow deep-learning framework, setting the numbers of units of the input layer, hidden layer, topic layer, and output layer; the training process is expressed as formula 1:
p(z_l | v) = softmax(-F(v, z_l)) (formula 1)
where v is the one_hot encoding generated in step A3; p(z_l | v) is the probability of the l-th topic of topic set z given v; z is the set of topics; l indexes a topic; z_l denotes the l-th topic of topic set z;
where d_l is a scalar randomly initialized by the computer for topic l; v_k is the generated one_hot encoding vector over a vocabulary of size k; W_l^{jk} is the j-th scalar parameter randomly initialized by the computer for topic l under vocabulary size k;
where p(h_j | v, z_l) is the probability of the j-th coordinate of the word vector generated after dimensionality reduction; h_j is the j-th coordinate of the hidden layer h; σ is the sigmoid function;
C2. Use the squared loss function as the model's loss function and optimize the model through the AdamOptimizer method;
C3. Use cross-validation to select the model trained by the AutoEncoder method, take the one_hot encoding vectors as input, and obtain the reduced-dimension word vectors;
D. Create the linear model and predict the mails to be identified through it:
the linear model comprises a voting stage and a prediction stage; in the voting stage, the weights over the topic distribution are selected through the Softmax function; in the prediction stage, the sample mail is matched against the template mails and the result is predicted through the Sigmoid function;
specifically perform the following operations:
D1. Apply the softmax function to the topic distribution returned by the AutoEncoder as the voting model;
the topic distribution returned by the autoencoder serves as one part of the input of formula 3, in which σ is the sigmoid function and s_l(q, r) = cos(z_q, z_r), where z_q and z_r denote the probability distributions of sample mail q and template mail r over the topic set z, and s_l(q, r) is the cosine similarity of sample mail q and template mail r under the l-th topic of topic set z;
D2. Compute the cosine similarity between each sample mail and each labeled template mail as the feature s_l(q, r); wherein spam templates are labeled 0 and normal-mail templates are labeled 1;
D3. Perform the dot-product operation on the sample word vectors generated in step C3 and the template word vectors, and pass the result through the sigmoid function to produce a group of probability values between 0 and 1 whose dimension equals the number of template mails;
D4. Take the weighted average of the probability of the voting model in step D1 and the sigmoid probabilities of step D3 to obtain the predicted value ŷ;
D5. Use the cross-entropy formula as the model's loss function, expressed as H(y | ŷ) = -Σ y · log ŷ; optimize the model with the AdamOptimizer method over multiple rounds of training to obtain the trained model and save the model parameters;
D6. Using the trained model parameters and the template mails, predict each mail to be identified and obtain its predicted value, thereby identifying whether it is spam;
Through the above steps, dynamic spam identification based on machine learning is realized.
2. the method for the Dynamic Recognition spam based on machine learning as described in claim 1, characterized in that in step A1, Chinese template and sample post are segmented with jieba software, English template and sample post are divided using spacy software Word;The re module of the specifically used python of step A2 rejects the punctuate in text;The specifically used sklearn software of step A3 will divide The sentence of good word is converted into one-hot coding vector.
3. the method for the Dynamic Recognition spam based on machine learning as described in claim 1, characterized in that step B is specific Text is trained using open source GibbsLDA++;Include the following steps:
B11. download and compile the GibbsLDA++ source code of open source
B12. training corpus file format is bat;The content of training corpus file are as follows: the first row is that sample post is total, second A line is all participles and the sample post text for removing stop words to row to the end;
B13. setting determines that text-theme distribution hyper parameter alpha, preferably 50/ number of topics determine theme-word distribution Hyper parameter beta, preferably 0.1;
B14. setting theme number, preferably 100;Setting iteration number of run, preferably 1000;
B15. be arranged wish under each theme topic retain keyword number, preferably 20;The storage road for generating file is set Diameter;
B16. using the parameter set, text is trained using GibbsLDA++;After the wheel number of training iteration, File model_final.towords is generated, wherein recording the distribution situation for the keyword being polymerize under each theme;
B17. whether meet the substantially distribution of sample by artificially evaluating and testing the cluster situation of keyword under each theme;Such as meet, The number of topics that step B14 is arranged is as the dimension of subject layer in step C1;Otherwise, step B is repeated, until the result of training is full The data distribution of sufficient template mail matter topics.
4. the method for the Dynamic Recognition spam based on machine learning as described in claim 1, characterized in that step D training Wheel number be 200-500.
CN201810952482.9A 2018-08-21 2018-08-21 Method for dynamically detecting junk mails based on machine learning Active CN109947936B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810952482.9A CN109947936B (en) 2018-08-21 2018-08-21 Method for dynamically detecting junk mails based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810952482.9A CN109947936B (en) 2018-08-21 2018-08-21 Method for dynamically detecting junk mails based on machine learning

Publications (2)

Publication Number Publication Date
CN109947936A true CN109947936A (en) 2019-06-28
CN109947936B CN109947936B (en) 2021-03-02

Family

ID=67005803

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810952482.9A Active CN109947936B (en) 2018-08-21 2018-08-21 Method for dynamically detecting junk mails based on machine learning

Country Status (1)

Country Link
CN (1) CN109947936B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649715A (en) * 2016-12-21 2017-05-10 中国人民解放军国防科学技术大学 Cross-media retrieval method based on local sensitive hash algorithm and neural network
US10002129B1 (en) * 2017-02-15 2018-06-19 Wipro Limited System and method for extracting information from unstructured text
CN107423282A (en) * 2017-05-24 2017-12-01 南京大学 Semantic Coherence Sexual Themes and the concurrent extracting method of term vector in text based on composite character
CN107171944A (en) * 2017-06-27 2017-09-15 北京二六三企业通信有限公司 The recognition methods of spam and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG Fan, "Application of Weighted LDA Model and SVM in Spam Filtering", Modern Computer *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113630302A (en) * 2020-05-09 2021-11-09 阿里巴巴集团控股有限公司 Junk mail identification method and device and computer readable storage medium
CN112182226A (en) * 2020-10-16 2021-01-05 温州职业技术学院 Junk mail detection method based on principal component analysis and density peak clustering
CN112182226B (en) * 2020-10-16 2022-09-30 温州职业技术学院 Junk mail detection method based on principal component analysis and density peak clustering
CN113609295A (en) * 2021-08-11 2021-11-05 平安科技(深圳)有限公司 Text classification method and device and related equipment
CN115730237A (en) * 2022-11-28 2023-03-03 智慧眼科技股份有限公司 Junk mail detection method and device, computer equipment and storage medium
CN115730237B (en) * 2022-11-28 2024-04-23 智慧眼科技股份有限公司 Junk mail detection method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN109947936B (en) 2021-03-02

Similar Documents

Publication Publication Date Title
CN107992597B (en) Text structuring method for power grid fault case
CN107861951A (en) Session subject identifying method in intelligent customer service
CN109947936A (en) A method of based on machine learning dynamic detection spam
KR20180125905A (en) Method and apparatus for classifying a class to which a sentence belongs by using deep neural network
CN106709754A (en) Power user grouping method based on text mining
CN109977199B (en) Reading understanding method based on attention pooling mechanism
CN107944014A (en) A kind of Chinese text sentiment analysis method based on deep learning
CN107480688B (en) Fine-grained image identification method based on zero sample learning
CN109992668A (en) A kind of enterprise's the analysis of public opinion method and apparatus based on from attention
CN110750640A (en) Text data classification method and device based on neural network model and storage medium
CN113268974B (en) Method, device and equipment for marking pronunciations of polyphones and storage medium
CN109872162A (en) A kind of air control classifying identification method and system handling customer complaint information
CN110717330A (en) Word-sentence level short text classification method based on deep learning
CN111597328B (en) New event theme extraction method
CN111274817A (en) Intelligent software cost measurement method based on natural language processing technology
CN106682089A (en) RNNs-based method for automatic safety checking of short message
CN112306494A (en) Code classification and clustering method based on convolution and cyclic neural network
CN110472245B (en) Multi-label emotion intensity prediction method based on hierarchical convolutional neural network
CN111462752B (en) Attention mechanism, feature embedding and BI-LSTM (business-to-business) based customer intention recognition method
CN115408525B (en) Letters and interviews text classification method, device, equipment and medium based on multi-level label
CN110659367A (en) Text classification number determination method and device and electronic equipment
CN111859983A (en) Natural language labeling method based on artificial intelligence and related equipment
CN113886562A (en) AI resume screening method, system, equipment and storage medium
CN111177010B (en) Software defect severity identification method
CN115952292A (en) Multi-label classification method, device and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant