CN107086952A

CN107086952A - A Bayesian spam filtering method based on TF-IDF Chinese word segmentation

Info

Publication number: CN107086952A
Application number: CN201710257123.7A
Authority: CN
Inventors: 崔玉文; 石乐义; 刘晓彤; 陈鸿龙; 郭宏斌; 孙慧; 薛智宇; 李剑蓝
Original assignee: China University of Petroleum East China
Current assignee: China University of Petroleum East China
Priority date: 2017-04-19
Filing date: 2017-04-19
Publication date: 2017-08-22

Abstract

The invention discloses a Bayesian spam filtering method based on TF-IDF Chinese word segmentation. The method comprises: building a Chinese email training text set; performing TF-IDF Chinese word segmentation on the training text set using a stop-word lexicon, and updating the stop-word lexicon; extracting feature words from the training text set with the TF-IDF Chinese word segmentation algorithm, and updating the feature-word lexicon with the extracted feature words and their weights; inputting the feature words and weights produced by TF-IDF Chinese word segmentation into a Bayesian filter to classify the mail; and feeding the classification results back to a log database. The invention achieves a low false-positive rate and high execution efficiency in Chinese spam filtering.

Description

A Bayesian spam filtering method based on TF-IDF Chinese word segmentation

Technical field

The present invention relates to a Bayesian spam filtering method based on TF-IDF Chinese word segmentation. More particularly, during spam filtering of Chinese email, the mail content is segmented with a TF-IDF Chinese word segmentation algorithm, feature words are extracted and their weights calculated, and the feature words and weights are then input to a Bayesian classifier that judges whether the mail is spam, thereby filtering spam.

Background art

Networks have become an indispensable part of modern life. The rapid development of network technology has greatly changed how people live and work, and has greatly improved both quality of life and working efficiency. In recent years email, an emerging network communication technology, has replaced communication methods such as traditional letters that waste manpower, material, and money, making communication, study, and work simple and efficient. But while email brings convenience, it also lets individuals or businesses seeking some gain harass email users by sending large volumes of illegitimate mail. The spread of spam has a large negative impact on users' lives and work. A mailbox flooded with spam does not make the user's study and work more efficient; instead, the user wastes a great deal of time and effort dealing with it. In the face of ever-growing spam, a reliable and effective spam filtering technique has become necessary.

Because the Bayesian algorithm is efficient, easy to implement, and scales well, it is widely used in spam filtering. It can also learn from the content of mail samples automatically through training and then filter spam. In existing spam filters the Bayesian algorithm performs very well; for English email in particular, even a fairly simple Bayesian spam filter reaches an accuracy above 99%. For Chinese email, however, the false-positive rate has remained high because of the particularities of the Chinese language. If the mail content can be segmented into words accurately before classification, the false-positive rate of Chinese email classification will drop substantially.

The TF-IDF (Term Frequency-Inverse Document Frequency) segmentation algorithm consists of two parts: TF (term frequency) and IDF (inverse document frequency). Term frequency (TF) is the number of times a feature word occurs in a given document; computing it therefore requires first dividing the text into words and then counting them. Inverse document frequency (IDF) measures the general importance of a feature word; it is estimated from statistics on how widely the feature word occurs across an established corpus. IDF lowers the weight of high-frequency feature words that contribute little, weakening their influence on text classification, and gives a higher weight to feature words that occur less often but contribute more, which improves classification accuracy.
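The weighting idea described above can be illustrated with a minimal Python sketch. It uses the common textbook definitions (relative term frequency and an unsmoothed logarithmic IDF), which are an assumption here and are not necessarily the formulas of this patent. The function name `tf_idf_weights` and the sample token lists are hypothetical, and the input is assumed to be already segmented into words.

```python
import math
from collections import Counter

def tf_idf_weights(documents):
    """Return one {word: weight} dict per document.

    documents: list of token lists, already segmented into words.
    Uses textbook TF-IDF (an assumption, not the patent's own formulas).
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        doc_weights = {}
        for word, count in counts.items():
            tf = count / total                        # relative term frequency
            idf = math.log(n_docs / doc_freq[word])   # unsmoothed IDF
            doc_weights[word] = tf * idf
        weights.append(doc_weights)
    return weights

# Hypothetical, pre-segmented sample messages.
docs = [
    ["免费", "发票", "优惠"],
    ["会议", "通知", "发票"],
    ["会议", "安排", "通知"],
]
weights = tf_idf_weights(docs)
```

In this sample, a word that occurs in a single message ("免费") receives a higher weight than one shared by two messages ("发票"), and a word present in every document would receive a weight of zero.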

Summary of the invention

To reduce the false-positive rate of spam filtering for Chinese email and improve accuracy, the invention builds on the naive Bayes spam filtering method and introduces a TF-IDF Chinese word segmentation algorithm to extract feature words from mail content accurately and to estimate their weights, yielding an efficient spam filtering method for Chinese content.

To this end, a Bayesian spam filtering method based on TF-IDF Chinese word segmentation is proposed, which mainly comprises the following steps:

(1) Collect a Chinese email training sample set, including spam and legitimate mail, and build a Chinese email training text set;

(2) Perform TF-IDF Chinese word segmentation on the Chinese email training text set using a stop-word lexicon, and update the stop-word lexicon;

(3) Extract feature words from the spam and legitimate mail in the training text set with the TF-IDF Chinese word segmentation algorithm, and update the feature-word lexicon with the extracted feature words and their weights;

(4) Input the feature words and feature-word weights produced by TF-IDF Chinese word segmentation into the Bayesian filter;

(5) The Bayesian classifier judges whether the mail is spam from the feature words and feature-word weights of the input Chinese email content, and feeds the result back to the log database.

In step (2), while the Chinese email content is segmented, the ICTCLAS Chinese word segmentation plug-in of the Chinese Academy of Sciences and the stop-word lexicon are called to filter the stop words out of the mail content, so that feature words can be extracted precisely. New stop words that appear in the mail content are added to the stop-word lexicon.
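Stop-word filtering and lexicon updating of the kind described in step (2) might look like the following Python sketch. The text does not say how a new stop word is recognized, so the rule used here (a word that appears in at least 90% of the messages) is purely an illustrative assumption. The functions `filter_stop_words` and `update_stop_words` and the parameter `min_doc_ratio` are hypothetical names, and the messages are assumed to have been segmented already (the patent uses ICTCLAS for that, which is not called here).

```python
from collections import Counter

def filter_stop_words(tokens, stop_words):
    """Drop every token that is in the stop-word lexicon."""
    return [t for t in tokens if t not in stop_words]

def update_stop_words(segmented_mails, stop_words, min_doc_ratio=0.9):
    """Return the lexicon extended with words found in nearly every message.

    The document-ratio criterion is an illustrative assumption;
    the patent text does not specify how new stop words are detected.
    """
    doc_freq = Counter()
    for tokens in segmented_mails:
        doc_freq.update(set(tokens))
    n = len(segmented_mails)
    new_words = {w for w, df in doc_freq.items() if n and df / n >= min_doc_ratio}
    return stop_words | new_words

# Hypothetical lexicon and pre-segmented messages.
stop_words = {"的", "了"}
mails = [
    ["您", "的", "发票", "已", "开具"],
    ["您", "的", "会议", "已", "取消"],
]
stop_words = update_stop_words(mails, stop_words)
features = [filter_stop_words(m, stop_words) for m in mails]
```

Keeping the lexicon as a set makes each membership test constant time, which matters when every token of every message is checked.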

In step (3), during feature word extraction from Chinese email, the feature words extracted from the mail content by TF-IDF Chinese word segmentation, together with their computed weights, are compared with the feature words in the feature-word lexicon. If an identical feature word exists, its weight in the lexicon is updated; if not, the new feature word and its weight are added to the lexicon.
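The compare-then-update rule of step (3) can be sketched in Python as a dictionary merge. The text does not state how an existing weight is updated, so this sketch simply overwrites it with the newly computed value; that choice, the function name `update_feature_lexicon`, and the sample weights are assumptions made for illustration.

```python
def update_feature_lexicon(lexicon, extracted):
    """Merge extracted {word: weight} pairs into the lexicon in place.

    A word already in the lexicon has its weight overwritten (an assumed
    update rule); an unseen word is added together with its weight.
    Returns the lists of updated and newly added words.
    """
    updated, added = [], []
    for word, weight in extracted.items():
        if word in lexicon:
            updated.append(word)
        else:
            added.append(word)
        lexicon[word] = weight
    return updated, added

# Hypothetical lexicon and freshly extracted feature words.
lexicon = {"发票": 0.20, "优惠": 0.35}
updated, added = update_feature_lexicon(lexicon, {"发票": 0.25, "中奖": 0.40})
```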

In step (4), the feature words and feature-word weights produced by TF-IDF Chinese word segmentation from the Chinese email training set or from a new message are input to the Bayesian classifier, which uses the input feature words and the established feature-word lexicon to calculate the probability that the email is spam. If that probability exceeds a set threshold, the email is judged to be spam; otherwise it is legitimate mail.

In step (5), to reduce the effect of noisy feature words on classification accuracy, conditional feedback is set up after the Bayesian classifier classifies an email: the content and classification result of the email are fed back to the log database, which is later used as a sample training set for training.

As the above technical scheme shows, compared with existing Bayesian spam filtering methods for Chinese email, the invention combines the TF-IDF Chinese word segmentation algorithm with the Bayesian classification algorithm. Feature words are extracted from Chinese mail content directly and automatically, with no need to collect spam feature words by hand to build the feature lexicon. This avoids the inaccuracy caused by subjectivity in manual processing and improves the accuracy of spam filtering.

In addition, emails classified by the Bayesian classifier are fed back to the log database. New rule training sets are built automatically and periodically from the email types and contents recorded there, and are used to rebuild the key feature words and weights in the feature-word lexicon. The spam classification rules are thus updated automatically, which improves the reliability and accuracy of spam filtering.

Brief description of the drawings

To explain the technical scheme of the embodiments of the invention more clearly, the invention is further described below with reference to the drawings and a specific embodiment:

Fig. 1 is a flowchart of the disclosed Bayesian spam filtering method based on TF-IDF Chinese word segmentation;

Fig. 2 is a flowchart of the disclosed TF-IDF word segmentation of Chinese email;

Fig. 3 is a diagram of the feedback process of the disclosed Bayesian spam filtering method based on TF-IDF Chinese word segmentation.

Embodiment

Fig. 1 is a flowchart of the Bayesian spam filtering method of the invention based on TF-IDF Chinese word segmentation.

Step (1): Collect a Chinese email training sample set, including spam and legitimate mail, and build a Chinese email training text set.

The Chinese email training sample set in step (1) is a collection of a certain number of spam and legitimate messages. Spam filtering judges whether a message is spam from how particular text is expressed in its content, and filters accordingly. Spam classification with a Bayesian classifier first requires collecting a certain number of messages to build the training sample set. A feature database is built from the training sample set, and the probability that a message belongs to a given class is then computed from the statistics of its features in the feature database, which yields the classification. For example, let a mail training sample set be M = {m₁, m₂, ..., m_n}, and let the set of texts in it that indicate their own class be W = {w₁, w₂, ..., w_n}. Further, let the content classes of the mail text set be C = {c₁, c₂, ..., c_n}. Then M = {m₁, m₂, ..., m_n} is the feature vector of the text M_q to be classified. Following the procedure by which a Bayesian classifier classifies text content, let P = {p₁, p₂, ..., p_n} denote the probabilities that W = {w₁, w₂, ..., w_n} belongs to the classes C = {c₁, c₂, ..., c_n}.

Step (2): Perform TF-IDF Chinese word segmentation on the Chinese email training text set using the stop-word lexicon, and update the stop-word lexicon.

During TF-IDF Chinese word segmentation of the Chinese email text set, the stop-word lexicon data and the training text set data are first loaded into the TF-IDF Chinese word segmentation module; see the TF-IDF Chinese email segmentation flowchart in Fig. 2. Within the module, the ICTCLAS Chinese word segmentation plug-in of the Chinese Academy of Sciences and the established stop-word lexicon are called to filter out useless words, such as function words and prepositions, obtained from segmenting the mail content. The keywords that remain after filtering are the feature words for spam judgment, and the TF-IDF algorithm then computes the weight each feature word carries in that judgment. The main steps of the feature-word weight calculation by the TF-IDF Chinese word segmentation algorithm are as follows:

Assume that the text set built from the received mail is D and that a feature word in the mail content is t. After Chinese word segmentation, the first quantity calculated is the frequency of the feature word in a given mail text:

where TF(t, D) is the frequency of feature word t in text set D, and AvgTF(D) is the average frequency of all feature words in the text set.

In formula (2), the more documents in the text set a given feature word appears in, the larger the denominator and hence the smaller the resulting IDF (inverse document frequency).

Weight_TF-IDF = TF(t, D) × INFT(t, D) (formula 4)

Assume N is the total number of documents in the document set, t is a feature word occurring in a document, and n is the total number of feature words of class i. The normalized TF-IDF formula is then:

Step (3): Extract feature words from the spam and legitimate mail in the Chinese email training text set with the TF-IDF Chinese word segmentation algorithm, and update the feature-word lexicon and dictionary with the extracted feature words and their weights.

Step (4): Input the feature words and feature-word weights produced by TF-IDF Chinese word segmentation into the Bayesian filter.

Assume the input mail is represented as the set M = {f₁, f₂, ..., f_n} and the mail class is C = {good, spam}. The feature words are assumed to be completely independent of one another, and the probability calculation for the feature words can be described as:

P(F) is the probability of a feature word occurring in any message, and the joint probability of the n features is the product of their individual probabilities. The probability that a message with these features is spam equals the probability that any message is spam, multiplied by the probability of the features occurring together in spam, divided by the probability of observing the features in any message. This value alone is not enough to decide whether an email received in the mailbox is spam; a decision criterion, the threshold t, is also needed. Comparing the value calculated by formula (6) with the threshold t determines whether the received mail is spam. If P(C = spam | F) > t, the calculated value exceeds the threshold and the received email is judged to be spam. Otherwise the email does not exceed the threshold, is provisionally judged to be legitimate mail, and is not filtered.
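The threshold decision described above can be sketched in Python as a small naive Bayes classifier. Several choices here are assumptions rather than details from the patent: the sketch works on plain word counts (how the feature-word weights enter the probability is not given in this text), it uses add-one smoothing to avoid zero probabilities, it computes in log space, and it assumes the training data contains at least one message of each class. The names `train`, `spam_probability`, `is_spam`, and the threshold value 0.8 are hypothetical.

```python
import math
from collections import Counter

def train(mails):
    """mails: list of (tokens, label) pairs, label being 'spam' or 'good'."""
    word_counts = {"spam": Counter(), "good": Counter()}
    mail_counts = Counter()
    for tokens, label in mails:
        word_counts[label].update(tokens)
        mail_counts[label] += 1
    vocab = set(word_counts["spam"]) | set(word_counts["good"])
    return word_counts, mail_counts, vocab

def spam_probability(tokens, model):
    """Posterior P(C = spam | F) under the feature-independence assumption."""
    word_counts, mail_counts, vocab = model
    total_mails = sum(mail_counts.values())
    log_scores = {}
    for label in ("spam", "good"):
        # log P(C) + sum of log P(f_i | C), with add-one smoothing (assumed).
        score = math.log(mail_counts[label] / total_mails)
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:
            if t in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][t] + 1) / denom)
        log_scores[label] = score
    # Normalise the two joint scores into a posterior probability.
    top = max(log_scores.values())
    exp = {k: math.exp(v - top) for k, v in log_scores.items()}
    return exp["spam"] / (exp["spam"] + exp["good"])

def is_spam(tokens, model, threshold=0.8):
    """Judge spam when the posterior exceeds the threshold t (illustrative value)."""
    return spam_probability(tokens, model) > threshold

# Hypothetical, pre-segmented training messages.
training = [
    (["免费", "发票"], "spam"),
    (["免费", "优惠"], "spam"),
    (["会议", "通知"], "good"),
    (["会议", "安排"], "good"),
]
model = train(training)
p = spam_probability(["免费", "发票"], model)
```

Raising the threshold trades missed spam for fewer legitimate messages wrongly filtered, which is the error the method aims to reduce.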

Step (5): The Bayesian classifier judges whether the mail is spam from the feature words and feature-word weights of the input Chinese email content, and feeds the result back to the log database.

When classifying mail samples, the Bayesian classifier calculates the posterior probability from the prior probabilities obtained by statistics and then decides the mail class. Fig. 3 shows the feedback process of the Bayesian spam filtering method based on TF-IDF Chinese word segmentation. After the Bayesian classifier has classified the mail, the feedback mechanism can effectively counter the misclassification that noisy feature words cause. New mail training text sets can be built periodically from the mail contents and classification results fed back to the log database. The newly built text sets effectively address a defect of the traditional naive Bayes spam filtering method, namely the high classification error rate caused by constantly changing feature combinations.
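The feedback loop can be sketched in Python as a log that records each classified message and signals when enough new records have accumulated to rebuild the training set. The patent does not specify the feedback condition or the retraining schedule, so the fixed record count used here is an assumption, and the class `FeedbackLog` with its parameter `retrain_every` is a hypothetical name.

```python
class FeedbackLog:
    """In-memory stand-in for the log database (illustrative only)."""

    def __init__(self, retrain_every=100):
        self.records = []
        self.retrain_every = retrain_every  # assumed retraining schedule

    def record(self, tokens, label):
        """Store one classified message; return True when retraining is due."""
        self.records.append((tokens, label))
        return len(self.records) % self.retrain_every == 0

    def as_training_set(self):
        """Return a copy of the logged messages for use as a new training set."""
        return list(self.records)

# Hypothetical usage: retraining becomes due after every second record.
log = FeedbackLog(retrain_every=2)
first_due = log.record(["免费", "发票"], "spam")
second_due = log.record(["会议", "通知"], "good")
new_training_set = log.as_training_set()
```

When `record` returns True, the caller would rebuild the feature-word lexicon and classifier statistics from `as_training_set()`.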

The basic steps of the Bayesian spam filtering method based on TF-IDF Chinese word segmentation have been described in detail above. By applying the TF-IDF Chinese word segmentation algorithm to Bayesian spam filtering, the scheme aims to segment Chinese email content accurately and thereby solve the high error rate that Chinese word segmentation causes in Bayesian spam filtering. In addition, the feedback mechanism added to the Bayesian classification of spam effectively addresses the failure or loss of accuracy in spam filtering caused by constantly changing features.

Claims

1. A Bayesian spam filtering method based on TF-IDF Chinese word segmentation, characterized in that it mainly comprises the following steps:

(1) Collect a Chinese email training sample set, including spam and legitimate mail, and build a Chinese email training text set;

(2) Perform TF-IDF Chinese word segmentation on the Chinese email training text set using a stop-word lexicon, and update the stop-word lexicon;

(3) Extract feature words from the spam and legitimate mail in the training text set with the TF-IDF Chinese word segmentation algorithm, and update the feature-word lexicon with the extracted feature words and their weights;

(4) Input the feature words and feature-word weights produced by TF-IDF Chinese word segmentation into the Bayesian filter;

(5) The Bayesian classifier judges whether the mail is spam from the feature words and feature-word weights of the input Chinese email content, and feeds the result back to the log database.

2. The Bayesian spam filtering method based on TF-IDF Chinese word segmentation according to claim 1, characterized in that: in step (2), while the Chinese email content is segmented, the ICTCLAS Chinese word segmentation plug-in of the Chinese Academy of Sciences and the stop-word lexicon are called to filter the stop words out of the mail content, so that feature words of the Chinese email content are extracted precisely, and new stop words appearing in the mail content are added to the stop-word lexicon.

3. The Bayesian spam filtering method based on TF-IDF Chinese word segmentation according to claim 1 and claim 2, characterized in that: in step (3), during feature word extraction from Chinese email, the feature words extracted from the mail content by TF-IDF Chinese word segmentation, together with their computed weights, are compared with the feature words in the feature-word lexicon; if an identical feature word exists, its weight in the lexicon is updated; if not, the new feature word and its weight are added to the feature-word lexicon.

4. The Bayesian spam filtering method based on TF-IDF Chinese word segmentation according to claim 1, characterized in that: in step (4), the feature words and feature-word weights produced by TF-IDF Chinese word segmentation from the Chinese email training set or from a new message are input to the Bayesian classifier, which uses the input feature words and the established feature-word lexicon to calculate the probability that the email is spam; if that probability exceeds a set threshold, the email is judged to be spam; otherwise it is legitimate mail.

5. The Bayesian spam filtering method based on TF-IDF Chinese word segmentation according to claim 1, characterized in that: in step (5), to reduce the effect of noisy feature words on classification accuracy, conditional feedback is set up after the Bayesian classifier classifies an email; the content and classification result of the email are fed back to the log database, which is later used as a sample training set for training.